Comment by jonathanleane on: Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

jonathanleane · 2026-07-02 Thu 03:24 UTC

Top solve rate is currently 24% with Opus 4.8... What's a competent human supposed to score?

lacunary · 2026-07-02 Thu 03:33 UTC

Presumably whatever the top model scores and then some, since the human can use the model.

I wonder if a model could score higher if it had a human at its disposal?

pishpash · 2026-07-02 Thu 05:09 UTC

Maybe models should ask for human-in-the-loop input, as a matter of convention.

sinuhe69 · 2026-07-02 Thu 06:24 UTC

A model that can ask questions or ask for help when in doubt is indeed a major feat. None of the current frontier models can do that.

olmo23 · 2026-07-02 Thu 06:37 UTC

With a human at its disposal, it could probably count the number of R's in strawberry!

In all seriousness though, adding capabilities should not normally reduce the effectiveness of a model (within reason: don't pollute the context window with millions of useless tools).

jascha_eng · 2026-07-02 Thu 06:42 UTC

I mean, these were all solved before, I assume, so 100%. Not by the same human, ofc, but models are expected to be good at a variety of codebases, while a human can specialize in one and learn. I think it's fair to compare to an individual who is used to working on a product.

I'm more interested in how fable would do