Comment by _345 | original | Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

[−]_345 · 2026-07-02 Thu 05:15 UTC · link

This makes so much sense as to why I've always felt that Opus 4.8 was leagues ahead of GPT 5.5. It's so good at taking underspecified requirements and filling in the gaps with sensible approaches for your project.

[−]re-thc · 2026-07-02 Thu 05:32 UTC · link

> It's so good at taking underspecified requirements and filling in the gaps with sensible approaches for your project.

At a high level. It misses low-level or other non-functional requirements differently, so I wouldn't say Opus is just strictly better.

It's also possible that it's more of a harness problem than a model problem.

[−]e9 · 2026-07-02 Thu 05:39 UTC · link

I agree with you on the harness. I find that Claude can be good in any harness but GPT is only superior inside Codex.

[−]nsingh2 · 2026-07-02 Thu 05:48 UTC · link

Why supply underspecified requirements in the first place? Both models are good at challenging assumptions/edge cases and asking questions to clarify, but seemingly only when explicitly asked (e.g. via something like a "brainstorm" skill).

I don't think either harness does enough to encourage the model to challenge all assumptions and ask questions, maybe because users might find it annoying. That step is basically a requirement IMO.

I've found all of the GPT-5 models to be very nit-picky, which is useful for code review and mathematics (important for my work), but seemingly gets in the way of "aesthetic" code, e.g. they write overly defensive code to cover all edge cases, even unlikely ones.

There is seemingly also a tradeoff between flexibility and instruction following. In my experience Opus will sometimes ignore instructions but can "fill in the blanks" more, whereas GPT-5.5 follows instructions better but perhaps at the cost of rigidity.

[−]antonvs · 2026-07-02 Thu 05:55 UTC · link

> Why supply underspecified requirements in the first place?

Minimizes effort, is the obvious answer.

[−]cyberpunk · 2026-07-02 Thu 06:41 UTC · link

Poor trade-off: the model is then designing a massive chunk of your solution instead of you. With a good spec, bits of typo’d pseudocode, and slightly more effort than a couple of sentences, they can actually produce passable software.

I think the reason Claude has so much mindshare is exactly because it’s more useful to non-developers who wouldn’t know how to describe what an API call executes to their grandmother.

For those who can, I can’t find much of a difference between them. Codex has the slight edge, but that’s all just “feels” to me.

[−]ben_w · 2026-07-02 Thu 07:28 UTC · link

You call it a poor trade-off, but:

> I think the reason Claude has so much mindshare is exactly because it’s more useful to non-developers who wouldn’t know how to describe what an API call executes to their grandmother.

This is exactly the benefit for most people.

Most people don't want to code the app, they just want the app.

Even people like us who do like coding, we can only think of all of these things within a domain that we already know; somebody who writes shaders for games isn't likely to know or care much about the ins and outs of database development or how healthcare privacy law and KYC interact with zero-knowledge proofs.

(Of course, if the AI knows about these things and then completely fails to make use of that knowledge, that's still a fail.)

[−]fooker · 2026-07-02 Thu 06:04 UTC · link

> Why supply underspecified requirements in the first place?

Because you'd not want to forever loop outside your home when asked to "while you're out, grab some eggs" :)

[−]zuzululu · 2026-07-02 Thu 06:04 UTC · link

Same observation here: Opus 4.8 (and I don't understand the people constantly defending GPT 5.5) was significantly more mature. It would even push back against anything off-putting, whereas GPT 5.5 will happily agree and do what is asked, though I would note that it takes several tries.

4.8 also requires more than one prompt, but its output is significantly higher quality and offers more insight.

Fable 5 is a different beast however.

[−]CSMastermind · 2026-07-02 Thu 06:26 UTC · link

Man, I don't know if I'm living in a crazy bubble or something, but GPT 5.5 is light-years better than Opus 4.8 for me, to the point where I'm honestly wondering how you're evaluating them or what kind of work you're doing.

There are specific tasks that Opus does better on, like frontend dev and design, but for anything else 5.5 just laps it.

[−]dools · 2026-07-02 Thu 06:37 UTC · link

Yeah, I’ve been consistently underwhelmed by Anthropic models, but then I don’t use their harness, so maybe that’s it.

[−]wwind123 · 2026-07-02 Thu 07:40 UTC · link

In my experience, for more mechanical refactoring work (like splitting a big source code file into multiple smaller ones), GPT 5.5 runs way faster than any of the Claude models. But for other tasks that require deeper reasoning, it's not that clear who is the winner.

[−]hypfer · 2026-07-02 Thu 06:44 UTC · link

Similarly, it explains to me why people found Claude so amazing, while I just thought "eh."

Tool expectations