GPT-5 Reality Check

OpenAI dropped GPT-5 yesterday and everyone's asking the same question: Is this the model that changes everything?

After a day of extensive testing, my answer is... it depends on what you're trying to change.

I've been putting GPT-5 through its paces in three different environments: ChatGPT, Cursor's new CLI tool, and my own QualRank comparison platform. Here's what I actually found when the hype settled down.

(Note: I'm not among the influencer technorati who got early access at OpenAI HQ... hint, hint @Kevin Weil, maybe next time? But hey, public release means real-world testing conditions.)

ChatGPT: Rocky Start, Then... Nothing

Right out of the gate, I hit a wall. My very first question, asking how to keep birds from nesting under my grill cover and munching on seeds, triggered what felt like an AI existential crisis. GPT-5 gave me a "let me think a while to concoct a plan" message (paraphrasing) and then... hung. For a loooong time. I finally clicked "give me a shorter answer" and it snapped back to normal.

But here's the kicker: I'm pretty sure I lost access to GPT-5 immediately after. I'm on ChatGPT's free tier (can't pay for everything), and when I ask the model directly what version it is, it tells me I'm back on GPT-4o. My guess is this is either a phased-rollout hiccup or OpenAI being conservative with free-tier access during the initial rush.

Cursor CLI: Where GPT-5 Actually Showed Up

The Cursor setup was straightforward, and I was quickly running GPT-5 on a real coding challenge, the kind of multi-part ticket I regularly throw at Claude Code. The task: in a single pass, debug an issue in my QualRank codebase, fix it, and update both the backend handling and the front-end UI.

The Good News: GPT-5 was methodical and effective on the backend work. It grepped the repo, mapped out the architecture, identified the bug's root cause, and implemented the fix on the first pass. Functionally, everything worked. For pure logical problem-solving, I was genuinely impressed.

The Not-So-Good News: The front-end work was... functional but ugly. The styling was completely inconsistent with the rest of the application. Where I asked for a toggle, it gave me a bog-standard checkbox with some sort of toggle-ish indicator slapped next to it. Font sizes were all over the place.

I ended up taking the front-end mess back to Claude Code, telling it that "a junior developer had made a styling disaster" (technically true), and Claude cleaned it up better than I expected.

One reality check: Cursor gives you GPT-5 free initially, but I burned through that allocation with just one substantial ticket. So it's cool for testing, but you're not getting serious work done without paying.

QualRank Testing: The Numbers Game

This is where things got interesting from a business perspective. Reasoning models can be tricky when you're managing token limits (which I do on QualRank to keep costs reasonable), and GPT-5 seemed to push against those constraints more than other models. I had to adjust my testing parameters to get clean comparisons.
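
To show what I mean by managing token limits, here's a minimal sketch of the kind of cap I apply when benchmarking a reasoning model. It assumes the OpenAI Python SDK's Chat Completions API and a hypothetical "gpt-5" model string; the parameter choices are illustrative, and QualRank's actual harness is not shown here.

```python
# Minimal sketch: bounding generated tokens when testing a reasoning model.
# Assumptions (not from the post): OpenAI's Python SDK, Chat Completions API,
# and a placeholder "gpt-5" model identifier.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_capped(prompt: str, max_completion_tokens: int = 2000) -> str:
    """Ask the model for an answer while capping total generated tokens.

    For reasoning models, hidden reasoning tokens count against this cap,
    so a tight limit can truncate the visible answer; leave headroom.
    """
    response = client.chat.completions.create(
        model="gpt-5",  # placeholder model identifier
        messages=[{"role": "user", "content": prompt}],
        max_completion_tokens=max_completion_tokens,
    )
    return response.choices[0].message.content or ""

print(run_capped("Summarize the trade-offs of token caps for reasoning models."))
```

The catch is that a cap tuned for a non-reasoning model can swallow a reasoning model's entire budget before any visible answer appears, which is exactly the adjustment I had to make.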

Performance Results: GPT-5 matched or beat Claude Opus on analytical reasoning, code generation, and logical problem-solving tasks. It solidly outpaced Gemini 2.5 Pro, Kimi K2, and Grok 4 across most benchmarks.

The Business Kicker: It delivered this Opus-level performance at roughly half the cost. If you've been using Claude Opus for its reasoning power but wincing at the API bills, GPT-5 might be your budget's new best friend.
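
To make the "similar quality at half the cost" framing concrete, here's a tiny cost-per-request sketch. The per-million-token rates are placeholders, not actual OpenAI or Anthropic pricing; the point is only how a halved rate card flows through to each call.

```python
# Illustrative arithmetic only: the rates below are placeholders, not real pricing.
def request_cost(input_tokens: int, output_tokens: int,
                 in_rate: float, out_rate: float) -> float:
    """Cost of one request given per-million-token rates in dollars."""
    return (input_tokens / 1_000_000) * in_rate + (output_tokens / 1_000_000) * out_rate

# Hypothetical rates: model B charges half of model A's rates.
a = request_cost(5_000, 1_500, in_rate=10.0, out_rate=30.0)
b = request_cost(5_000, 1_500, in_rate=5.0, out_rate=15.0)
print(f"model A: ${a:.4f}  model B: ${b:.4f}  savings: {1 - b / a:.0%}")
```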

The Creative Reality Check: On generative creative tasks, GPT-5 fell behind. My benchmark test—writing a 16-bar rap about sourdough baking—produced some genuinely comedic results. Consider these gems:

"Bar 3: I feed the jar at dawn, watch bubbles blooming soon, monsoon
Bar 4: This dough conducts my day, like an orchestra before monsoon"

I'm still trying to figure out what orchestras and monsoons have to do with sourdough. (For context: Sonnet 4, Gemini 2.5 Pro, and Kimi K2 all did much better on this task, with Sonnet 4 the strongest of the bunch.)

This pattern explains a lot about the coding results—GPT-5 nailed the analytical backend work but stumbled on the more artistic front-end decisions.

What This Actually Means for Real Work

I haven't tested GPT-5-mini or -nano yet, but here's where I land after this initial deep dive:

For ChatGPT users: I'll need to wait for the rollout to stabilize before I can give a fair assessment. When it works, it'll definitely be an improvement over GPT-4o, but probably not enough to pull me away from my Anthropic workflow.

For developers using Cursor CLI: This is genuinely compelling, especially if you're cost-conscious. The backend reasoning is solid, and you could potentially pair GPT-5 for logic with Claude for polish. The multi-model access in Cursor CLI might actually create a best-of-both-worlds scenario that rivals Claude Code.

For API users optimizing cost vs. performance: If you're currently using Claude Opus for analytical heavy lifting, GPT-5 deserves serious consideration. You'll get comparable reasoning power at significantly lower cost. But if your output needs to be user-facing and polished... maybe keep Claude in the mix.
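
If you want to picture what "GPT-5 for logic, Claude for polish" might look like in practice, here's a hypothetical routing sketch. It is not my actual QualRank code, and the model identifiers are placeholders; it just shows the shape of sending analytical work one way and user-facing output the other.

```python
# Hypothetical two-model routing sketch, not production code: analytical tasks
# go to an OpenAI reasoning model, user-facing polish goes to Claude.
from anthropic import Anthropic
from openai import OpenAI

openai_client = OpenAI()        # needs OPENAI_API_KEY
anthropic_client = Anthropic()  # needs ANTHROPIC_API_KEY

def run_task(prompt: str, kind: str) -> str:
    if kind == "analysis":
        # Heavy reasoning: backend logic, debugging, data crunching.
        resp = openai_client.chat.completions.create(
            model="gpt-5",  # placeholder model identifier
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content or ""
    # Anything a human will read: copy, UI text, final polish.
    msg = anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model identifier
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text
```

The routing rule is deliberately crude; in practice you'd decide per ticket, the same way I bounced the front-end cleanup back to Claude Code.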

The Bottom Line

GPT-5 isn't the everything-model some were hoping for, but it's a very good something-model. It's particularly strong at the analytical reasoning tasks that often bottleneck product development workflows, and the cost efficiency could reshape how teams allocate their AI budgets.

The creative limitations suggest we're still in a multi-model world, but that's not necessarily bad news. Different tools for different jobs might actually be more strategic than trying to find one model to rule them all.

Have thoughts on this?

I'd love to hear your perspective. Feel free to reach out.