But seriously, as an industry we're terrible at assessing engineering levels. I've worked with "senior engineers" who can't code and I've worked with "junior engineers" who could run rings around them.
Benchmarks like this should be much more precise about what they're actually testing, and what axes they're hard on. We also need to rise above prompts like "you are a senior engineer"; it's woo, and it's far better to ask for precise outcomes.
What you really need is an objective benchmark.
"When are all the software engineers unemployed?"
I don't know what a better approach would look like while still remaining feasible, but this approach of telling an LLM to make a subjective judgement seems fundamentally flawed.
Of course, no-one seems to be (publicly) doing the comparative measurements that might allow us to reach rational conclusions here.
This is common in system prompts and frames the responses.
For example, you'd get different responses saying:
1. you are a pirate writing sea shanties about programming;
2. you are a news reporter writing an article on physics;
3. you are a senior software engineer with complete knowledge of PostgreSQL.
For 1 you could get responses along the lines of the Wellerman sea shanty -- "There once was a program that was set to C ...".
The "make no mistakes" bit does look dubious. It would be interesting to compare the results with and without that bit, and to try alternative ways of getting the same desired behavior.
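A minimal sketch of that with/without comparison. It assumes a hypothetical `run_model(system_prompt, task)` callable and a hypothetical `passes(task, output)` checker; neither is a real API, and the stand-ins at the bottom exist only so the sketch runs.

```python
from typing import Callable, Sequence


def pass_rate(
    run_model: Callable[[str, str], str],
    passes: Callable[[str, str], bool],
    system_prompt: str,
    tasks: Sequence[str],
) -> float:
    """Fraction of tasks whose output passes the checker under one system prompt."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(passes(task, run_model(system_prompt, task)) for task in tasks)
    return passed / len(tasks)


def ablate_phrase(run_model, passes, base_prompt: str, phrase: str, tasks):
    """Compare pass rates with and without `phrase` appended to the base prompt."""
    without = pass_rate(run_model, passes, base_prompt, tasks)
    with_phrase = pass_rate(run_model, passes, f"{base_prompt} {phrase}", tasks)
    return {"without": without, "with": with_phrase, "delta": with_phrase - without}


# Hypothetical stand-ins: a real harness would call an actual model
# and run actual tests here.
def fake_model(system_prompt: str, task: str) -> str:
    return task.upper()  # ignores the system prompt entirely


def fake_checker(task: str, output: str) -> bool:
    return output == task.upper()


result = ablate_phrase(
    fake_model,
    fake_checker,
    "You are a senior engineer.",
    "Make no mistakes.",
    ["fix bug", "add test"],
)
print(result)
```

With a real model the outputs vary from run to run, so each variant would need repeated samples per task before the delta means anything.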
But it's a lie. Nobody's paying you to make paintings. They're paying you to build machines. The comparison between "making working software" and "taste" always devolves into bikeshedding and subjective opinionism, uses subjective human feelings to describe what should be objective and functional, isn't rooted in scientific rigor, and detracts from the real purpose of the thing. The work doesn't actually get better by trying to apply artistic principles to engineering. It just feels better for the people making it.
Once you make the machine work, then you can go about gilding the lily. But this is unromantic, unsatisfying, boring. Since the inmates run this particular asylum, we end up with a benchmark that tries to accurately mimic the human ego as applied to software design. Thus the new Gods create their digital Adams and Eves in their image.
"Does it work" glosses over a bunch of things: is it fast, cheap, secure, reliable, easy to understand, easy to modify? And that's just for server software where you've nailed down all the functional requirements. Determining what the functional requirements are is its own question.
And all these other non-happy path requirements are somewhat in tension with each other, so what is ideal in one environment is not necessarily ideal in another.
And in particular, "easy to understand/modify" is truly subjective. Different people have different ideas of what easy to understand means. Even if we get to a world where AI is writing all our code, "easy to understand/modify for the AI" is still an important question. We've probably all seen prototypes that collapse under their own weight of slop by now.
That’s the reason why I buy Apple products in private, because I value the design over the exorbitant prices they charge; and it’s the reason why I mull over code that’s already functional until it’s pleasing my ideas of elegance.
I can come up with all kinds of justifications and explanations why the code I’ve written a certain way is objectively better too - understandability matters to the next guy after all - but I won’t be ashamed for taking a certain pride in my work, even if nobody other than me ever values it. That’s fine.
When the LLMs finally take over coding altogether, you’ll have your raw, functional code. Won’t be long anymore. But for now, I’m a human, and I will do human things.
What if we create a benchmark that works like this and assigns Elo scores? Models fight head-to-head by writing a question, a bug, or an incomplete implementation, which the opponent has to answer, fix, or finish.
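For reference, a sketch of the standard Elo update such a benchmark could use. The scoring convention is an assumption, not something specified here: the solving model scores 1 if it answers, fixes, or finishes the challenge and 0 otherwise, with the challenge author taking the complement. The K-factor of 32 is likewise just a common default.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Return new ratings for A and B after one match.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for an A loss.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Two equally rated models; the solver (A) succeeds on the author's (B) challenge.
solver, author = update_elo(1500.0, 1500.0, score_a=1.0)
print(solver, author)  # 1516.0 1484.0
```

The update is zero-sum, so the total rating across the pool stays constant no matter how many rounds are played.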
https://en.wikipedia.org/wiki/Generative_adversarial_network
In class you'd probably want a rule saying at least one LLM should be able to figure out the answer, but in a head-to-head I'm not sure how to solve it.
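The "at least one LLM should be able to figure out the answer" rule could be sketched as a filter over a reference pool. The names here are hypothetical, and each solver is modeled as a plain callable that reports whether it solved the challenge. This only covers the pooled case, not the open head-to-head question.

```python
from typing import Callable, Iterable, List

# Hypothetical type: a solver takes a challenge and reports success or failure.
Solver = Callable[[str], bool]


def is_fair_challenge(challenge: str, reference_pool: Iterable[Solver]) -> bool:
    """A challenge counts only if at least one model in the pool solves it."""
    return any(solves(challenge) for solves in reference_pool)


def filter_challenges(challenges: Iterable[str], reference_pool: List[Solver]) -> List[str]:
    """Keep only the challenges that pass the fairness rule."""
    return [c for c in challenges if is_fair_challenge(c, reference_pool)]


# Stand-in solvers so the sketch runs; real ones would call models and run tests.
def short_only(challenge: str) -> bool:
    return len(challenge) < 10


def never(challenge: str) -> bool:
    return False


kept = filter_challenges(["easy bug", "an unsolvable riddle"], [never, short_only])
print(kept)  # ['easy bug']
```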
At a high level. It misses low level or other non-functional requirements differently so I wouldn't say Opus is just strictly better.
It's also possible that it's a harness problem more than a model problem.
I don't think either harness does enough to encourage the model to challenge all assumptions and ask questions, maybe because users might find it annoying. That step is basically a requirement IMO.
I've found all of the GPT-5 models to be very nit-picky, which is useful for code review and mathematics (important for my work) but seemingly gets in the way of "aesthetic" code, e.g. they write overly defensive code to cover all edge cases, even unlikely ones.
There also seems to be a tradeoff between flexibility and instruction following. In my experience Opus will sometimes ignore instructions but can "fill in the blanks" more, whereas GPT-5.5 follows instructions better but perhaps at the cost of rigidity.
Minimizes effort, is the obvious answer.
I think the reason Claude has so much mindshare is exactly because it’s more useful to non-developers who wouldn’t know how to describe what an API call executes to their grandmother.
For those who can, I can’t find much of a difference between them. Codex has the slight edge, but that’s all just “feels” to me.
Because you'd not want to forever loop outside your home when asked to "while you're out, grab some eggs" :)
4.8 also requires more than one prompt, but its output is significantly higher quality and offers more insight.
Fable 5 is a different beast however.
There are specific tasks that Opus does better on, like frontend dev and design, but for anything else 5.5 just laps it.
Tool expectations
Anyone can run something and make a web page. These people just do it instead of questioning. Main difference. If everyone asks "how could you" "are you qualified" then we have nothing but gatekeeping.
I wonder if a model could score higher if it had a human at its disposal?
In all seriousness though, adding capabilities should not normally reduce the effectiveness of a model (within reason: don't pollute the context window with millions of useless tools).
I'm more interested in how Fable would do.