Hacker News

Comment by re-thc | original | Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers
re-thc · 2026-07-02 Thu 05:32 UTC
> It's so good at taking underspecified requirements and filling in the gaps with sensible approaches for your project.

At a high level, yes. It misses low-level or other non-functional requirements differently, so I wouldn't say Opus is just strictly better.

It's also possible that it's more of a harness problem than a model problem.

e9 · 2026-07-02 Thu 05:39 UTC
I agree with you on the harness. I find that Claude can be good in any harness but GPT is only superior inside Codex.