Comment by magnio | Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

magnio · 2026-07-02 Thu 05:05 UTC

I saw on Twitter that in an ML course at Tsinghua University, one of the tests asks students to write quizzes that as many LLMs as possible fail.

What if we create a benchmark that works like this and assigns Elo ratings? Models fight head-to-head by writing a question, a bug, or an incomplete implementation, which the opponent has to answer, fix, or finish.
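A rough sketch of how the rating side could work, assuming the standard Elo update and treating each question as one game: the asker scores 1 if the opponent fails and 0 if the opponent solves it. The K-factor, the starting rating, and the names `record_round`, `model_a`, and `model_b` are all placeholders I made up, not part of any existing benchmark.

```python
from collections import defaultdict

K_FACTOR = 32.0        # assumed step size; a real benchmark would tune this
START_RATING = 1500.0  # assumed starting rating for every model


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def record_round(ratings, asker: str, solver: str, solved: bool) -> None:
    """Update both ratings in place after one question.

    The asker "wins" (score 1) when the solver fails the question,
    and "loses" (score 0) when the solver gets it right.
    """
    asker_score = 0.0 if solved else 1.0
    delta = K_FACTOR * (asker_score - expected_score(ratings[asker], ratings[solver]))
    ratings[asker] += delta
    ratings[solver] -= delta  # zero-sum: the solver loses what the asker gains


# Hypothetical match: each model asks one question of the other.
ratings = defaultdict(lambda: START_RATING)
record_round(ratings, "model_a", "model_b", solved=False)  # model_b failed
record_round(ratings, "model_b", "model_a", solved=True)   # model_a solved it
print(dict(ratings))
```

Having each model play both roles in a match keeps a model from climbing purely by asking or purely by solving.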

vincnetas · 2026-07-02 Thu 05:39 UTC

We could call this "generative adversarial network" (GAN) :)

https://en.wikipedia.org/wiki/Generative_adversarial_network

wwind123 · 2026-07-02 Thu 06:37 UTC

This kind of approach would generally still need human guidance. Otherwise these models might get stuck in weird niche corners of the problem space that are not relevant to any real-world project.

ben_w · 2026-07-02 Thu 07:19 UTC

We could call this "reinforcement learning from human feedback" (RLHF) :)

https://en.wikipedia.org/wiki/Reinforcement_learning_from_hu...

olmo23 · 2026-07-02 Thu 06:33 UTC

How do you prevent degenerate strategies? I could trivially give a model a SHA256 hash and ask it to provide the source input.

In class you'd probably want a rule saying at least one LLM should be able to figure out the answer, but in a head-to-head I'm not sure how to solve it.
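For what it's worth, the classroom rule is easy to sketch as a filter: a question only counts if at least one solver from some reference panel produces an answer that passes the asker's own checker. Everything here is hypothetical (the `is_admissible` name, the stub solvers standing in for real model calls, the example questions), and it does not settle who picks the panel in a head-to-head setting.

```python
import hashlib
from typing import Callable, Iterable

Solver = Callable[[str], str]    # takes a question, returns an answer
Checker = Callable[[str], bool]  # asker-supplied check of an answer


def is_admissible(question: str, check: Checker, panel: Iterable[Solver]) -> bool:
    """Admit a question only if at least one panel solver passes the check."""
    return any(check(solver(question)) for solver in panel)


# Stub solvers standing in for real LLM calls (assumption for the demo).
def stub_arithmetic_solver(question: str) -> str:
    return "4" if question == "What is 2 + 2?" else "I don't know"


def stub_clueless_solver(question: str) -> str:
    return "I don't know"


panel = [stub_arithmetic_solver, stub_clueless_solver]

# The degenerate question: invert a SHA-256 hash of a secret input.
target = hashlib.sha256(b"some secret only the asker knows").hexdigest()
hash_question = f"Give an input whose SHA-256 hex digest is {target}"


def hash_check(answer: str) -> bool:
    return hashlib.sha256(answer.encode()).hexdigest() == target


print(is_admissible("What is 2 + 2?", lambda a: a.strip() == "4", panel))
print(is_admissible(hash_question, hash_check, panel))
```

With these stubs the arithmetic question is admitted and the hash question is thrown out, since nobody on the panel can produce a preimage.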

wwind123 · 2026-07-02 Thu 07:34 UTC

Who knows. Maybe Mythos 5 already found a hole in SHA256, so this won't be too hard. :)