Hacker News

Comment by LiamPowell | original | Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers
LiamPowell · 2026-07-02 Thu 08:07 UTC
This is not actually what the reviewer prompt says, or perhaps it is; I don't know, since they don't make it public. I'm just pointing out that it seems like a bad idea to ask an LLM to make a subjective judgement on things like "taste". If the SOTA LLM writing the code could not produce tasteful code, then why would a different LLM be able to judge the "taste" of that code?

Which LLM should we even use to judge taste? Is it giving an unfair advantage to Model X if we use Model X as the judge? Maybe we should use multiple models as judges, but now the model that's best at recognising and praising its own code has an advantage. The whole thing is just an unsolvable problem when an LLM is the judge.