purple-leafy | Hacker News

Comment by purple-leafy | original | Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

[−]purple-leafy · 2026-07-02 Thu 03:34 UTC · link

Benchmarks are great, but I feel like there's a better way; this seems quite subjective.

What you really need is an objective benchmark

[−]echelon · 2026-07-02 Thu 03:46 UTC · link

> What you really need is an objective benchmark

"When are all the software engineers unemployed?"

[−]purple-leafy · 2026-07-02 Thu 04:07 UTC · link

Not sure I follow haha

[−]eli · 2026-07-02 Thu 03:53 UTC · link

I actually really like subjective benchmarks, so long as it's a human (ideally me) grading the results. LLM as judge never made much sense.

[−]charcircuit · 2026-07-02 Thu 04:50 UTC · link

The issue is that you can't do unsupervised learning if you require humans.

[−]rhdunn · 2026-07-02 Thu 07:08 UTC · link

LLMs grading the answers is relying on the LLM knowing the answer and not just hallucinating it. You also have issues if/when the model refuses to answer, or if it gets stuck in a loop (e.g. if running locally with a heavily quantized model).

I'm investigating/experimenting with using traditional NLP (stanza, spaCy, etc.) to try to grade the responses according to different metrics (is the response in first, second, or third person? Is it written as poetry, prose, or drama? And so on). I'm also thinking about using information extraction and synonym detection to handle data queries and the like.
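
To make the first/second/third person check concrete, here is a rough sketch. It uses plain pronoun counting from the standard library as a stand-in for a real stanza or spaCy pipeline, so the `PRONOUNS` table and the `grammatical_person` function are hypothetical names for illustration, not part of either library.

```python
import re
from collections import Counter

# Hypothetical pronoun lists (an assumption for this sketch). A real pipeline
# would more likely read grammatical person from a stanza/spaCy analysis.
PRONOUNS = {
    "first": {"i", "me", "my", "mine", "myself",
              "we", "us", "our", "ours", "ourselves"},
    "second": {"you", "your", "yours", "yourself", "yourselves"},
    "third": {"he", "him", "his", "himself", "she", "her", "hers", "herself",
              "it", "its", "itself", "they", "them", "their", "theirs",
              "themselves"},
}


def grammatical_person(text):
    """Return 'first', 'second', 'third', or None if no pronouns are found."""
    # Lowercase and keep alphabetic runs only, so "I'm" yields "i" and "m".
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for token in tokens:
        for person, words in PRONOUNS.items():
            if token in words:
                counts[person] += 1
    if not counts:
        return None
    # Ties are broken in favour of first, then second, then third person.
    return max(("first", "second", "third"), key=lambda p: counts[p])


print(grammatical_person("You should check your work."))
```

Counting pronouns is crude (dialogue or quoted speech will skew it), which is part of why a proper parser is attractive here.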

[−]charcircuit · 2026-07-02 Thu 07:28 UTC · link

> LLMs grading the answers is relying on the LLM knowing the answer and not just hallucinating it. You also have issues if/when the model refuses to answer, or if it gets stuck in a loop (e.g. if running locally with a heavily quantized model).

And LLMs have gotten good at handling these issues. There is an asymmetry in difficulty between generating a solution and verifying that it is correct. And over time LLMs keep getting better, which allows training on synthetic data to improve them further.
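
A toy illustration of that asymmetry, using subset sum rather than anything LLM-specific: checking a proposed answer takes one pass, while finding one by brute force means searching an exponential number of subsets. The function names `verify` and `generate` are made up for this sketch.

```python
from collections import Counter
from itertools import combinations


def verify(numbers, target, candidate):
    """Cheap check: candidate uses only available numbers and sums to target."""
    available = Counter(numbers)
    used = Counter(candidate)
    within_supply = all(used[n] <= available[n] for n in used)
    return within_supply and sum(candidate) == target


def generate(numbers, target):
    """Expensive search: brute force over subsets, exponential in len(numbers)."""
    for size in range(1, len(numbers) + 1):
        for subset in combinations(numbers, size):
            if sum(subset) == target:
                return list(subset)
    return None


numbers = [3, 34, 4, 12, 5, 2]
solution = generate(numbers, 9)
print(solution, verify(numbers, 9, solution))
```

The same shape shows up in graded benchmarks: a checker that only has to confirm an answer can be far simpler than the thing that produced it.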