Hacker News

Comment by allan_s | original | Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers
allan_s · 2026-07-02 Thu 04:22 UTC
I think the source of your issue is in your statement itself: why do you want a task that evaluates something this broad to be only a coding task? Shouldn't it also be a planning task, a documentation task, a knowledge-retrieval task, etc.? And almost certainly not one that starts from just an initial prompt, but from an existing codebase, existing docs, and tickets?