saberience | Hacker News

Comment by saberience | Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

saberience · 2026-07-03 Fri 10:50 UTC

For me it's the exact opposite: Anthropic's models seem great for "vibe coding" by non-engineers. My girlfriend uses Claude and loves it because she doesn't know any of the terminology and Claude happily fills in the gaps.

For me, with 20 years of engineering experience across the stack, from venture-backed companies to FAANG, I cannot handle Claude at all; it writes way too much garbage that I never asked for. Codex is like a surgical instrument: it does exactly what I want it to and never bloats the codebase.

Anyone spending days with Claude will almost inevitably end up with a bloated, buggy mess. Note: Codex also finds bugs and correctness issues that Claude misses; I've seen this probably 90% of the time. That is, Claude will happily tell you the feature is complete, but then get Codex to review the code and it will find 2-5 actual correctness bugs. Take those bugs and give them back to Claude and it will admit it fucked up.

I've seen this behavior again and again and again. If you're not a strong/experienced engineer, Claude can seem perfect, but it's writing buggy code and you're just not aware of it unless you're double-checking with Codex or another LLM.