charcircuit | Hacker News

Comment by charcircuit | Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

charcircuit · 2026-07-02 Thu 07:28 UTC

>LLMs grading the answers is relying on the LLM knowing the answer and not just hallucinating it. You also have issues if/when the model refuses to answer, or if it gets stuck in a loop (e.g. if running locally with a heavily quantized model).

And LLMs have gotten good at handling these issues. Generating a solution and verifying that it is correct are asymmetric in difficulty: checking an answer is usually easier than producing one. And over time LLMs keep getting better, which allows training on synthetic data to improve them further.
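To make the asymmetry concrete, here is a toy sketch using subset sum rather than an actual LLM grader. The function names `verify` and `generate` are made up for illustration, and the brute-force search is only a stand-in for "producing a solution is the expensive part":

```python
from itertools import combinations

def verify(numbers, target, candidate):
    """Check a proposed answer: cheap, just a few passes over the input."""
    pool = list(numbers)
    for x in candidate:
        if x not in pool:
            return False
        pool.remove(x)  # each input number may be used only once
    return sum(candidate) == target

def generate(numbers, target):
    """Find an answer by brute force: may try up to 2**len(numbers) subsets."""
    for size in range(len(numbers) + 1):
        for subset in combinations(numbers, size):
            if sum(subset) == target:
                return list(subset)
    return None

numbers = [3, 34, 4, 12, 5, 2]
answer = generate(numbers, 9)        # expensive search
print(answer, verify(numbers, 9, answer))  # cheap check
```

The grader only has to play the role of `verify`, which is why it can be reliable on problems it could not have solved as easily itself.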