Hacker News

Comment by Madmallard | original | Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers
Madmallard · 2026-07-02 Thu 04:11 UTC
next round of trust me bro benchmarks
dozerly · 2026-07-02 Thu 04:54 UTC
Just wait for the next 100 rounds. People love seeing the 65% -> 85% seemingly over and over again for every new model.