Beyond the Honeymoon: What Happens When AI Coding Gets Complicated

Everyone's sharing their AI coding wins. The perfect app built in an afternoon. The flawless feature shipped in minutes. The "I can't believe this actually worked" screenshots flooding LinkedIn.

I've been there. A few weeks ago, I gushed about Claude Sonnet 4 building my first real app—Sir Promptington—with impressive ease. But here's what nobody talks about: what happens when you move past the proof-of-concept phase and things get messy?

Spoiler alert: they get pretty messy.

The Ambitious Pivot

Instead of giving Sonnet 4 another small challenge, I decided to go big. I completely pivoted from Sir Promptington (a multi-model prompt rating tool) to Judge Promptington—a system that helps users choose the best AI model for specific tasks.

The twist? I didn't want to start from scratch. I asked Sonnet 4 to preserve the core functionality while adapting everything for the new concept. This is where things get interesting in real-world development—you're rarely building greenfield projects. You're adapting, refactoring, and building on existing work.

Sonnet 4 handled this beautifully. It understood the architecture, preserved what mattered, and elegantly transformed the rest. I was impressed.

Scaling Up the Complexity

Feeling confident, I pushed further. I moved from accessing a handful of models through frontier providers to integrating OpenRouter—giving me access to a much wider range of models for testing. Again, Sonnet 4 adapted the codebase smoothly.

But then came the first real speedbump. DeepSeek R1 was returning null outputs, and I couldn't figure out why. Sonnet tried various fixes but couldn't crack it either. Turns out R1 was burning all its tokens on reasoning, leaving nothing for the actual response. I wish Sonnet could have diagnosed this, but honestly, it wasn't obvious. This was my first hint that we were approaching the limits of what AI coding could handle autonomously.
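
For anyone who hits the same wall: the symptom shows up clearly if you inspect the raw API response. Here's a minimal sketch, assuming OpenRouter's OpenAI-compatible endpoint and the deepseek/deepseek-r1 model id (placeholders, not my actual setup), of how to spot a reasoning model that exhausted its token budget before producing an answer.

```python
# Minimal sketch: call DeepSeek R1 through OpenRouter and check whether the
# reasoning phase ate the whole completion budget. Model id, key handling,
# and prompt are placeholders/assumptions, not production code.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # placeholder
)

resp = client.chat.completions.create(
    model="deepseek/deepseek-r1",  # assumed OpenRouter model id
    messages=[{"role": "user", "content": "Evaluate this prompt: ..."}],
    max_tokens=4096,  # leave room for the model to reason AND answer
)

choice = resp.choices[0]
# A truncated run reports finish_reason "length" and empty content, even
# though completion tokens were consumed by the hidden reasoning.
print("finish_reason:", choice.finish_reason)
print("completion tokens:", resp.usage.completion_tokens)
print("answer:", choice.message.content or "<empty: raise max_tokens>")
```

The fix, once diagnosed, was just budget: leave enough completion tokens for the answer after the reasoning trace.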

When the Wheels Come Off

Up to this point, my workflow was simple: Claude.ai connected to GitHub, generate code there, then copy it into Replit or VS Code. I know it's not ideal. But it worked.

Then I tried to add two further features: PDF exports of model evaluations and a consistency testing system for the judging logic. This is where Sonnet 4 started to struggle.

I'd request changes and get responses like: "I've updated the PDF generation function to handle multi-page outputs and integrated it with the existing evaluation system." Sounds great, right?

Except there was no code.

Just a description of what it claimed to have done.

Other times, I'd get tiny code snippets with zero context about where they belonged in the codebase. It was like getting puzzle pieces without the box cover.

The Reality Check

This wasn't Sonnet being lazy or broken. This was the reality of AI-assisted development hitting complexity limits. As codebases grow and requirements get more nuanced, the models start to struggle with context, integration points, and the messy realities of real software.

So I adapted. I got much more specific with my prompts, sometimes pasting in exact code sections I wanted modified. I started bouncing between models—Claude Opus 4 was excellent when I could afford the token costs, and Gemini 2.5 Pro through SimTheory.ai struck a nice balance between coding and explanation (Sonnet's eagerness to jump straight to code can be both a blessing and a curse).

What Actually Works

Here's what I learned about making AI coding work beyond the demo phase:

Shift your approach as complexity grows. In the beginning, sweeping guidance works great. "Build me a prompt evaluation tool" gets you a working prototype. But as things get complex, you need to flip the script: use the AI as a thought partner first, then guide targeted changes to specific files.

Use AI to understand before you build. When I wasn't satisfied with how the consistency scoring was working, I didn't just ask for fixes. I used Gemini 2.5 Pro to walk through how Sonnet had originally set up the scoring logic and why it was working that way. Together, we identified that the scoring approach didn't align with what I was actually trying to measure.

Gemini then recommended using Spearman's rank correlation (which I wasn't familiar with) as a better approach. Only after we'd worked through the conceptual problem did I go back and make targeted updates to specific files.
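
For the curious, here's a minimal sketch of what that check looks like in practice (using SciPy; the scores are invented for illustration). Spearman's rank correlation compares how two judging runs rank the same prompts, rather than how close the raw scores are.

```python
# Minimal sketch of a consistency check using Spearman's rank correlation.
# The scores below are made up; in my case the inputs would be something
# like two independent judging passes over the same set of prompts.
from scipy.stats import spearmanr

run_a = [7.5, 6.0, 9.0, 4.5, 8.0]  # judge scores, first pass
run_b = [7.0, 6.5, 8.5, 5.0, 8.0]  # judge scores, repeat pass

rho, p_value = spearmanr(run_a, run_b)
print(f"Spearman rho = {rho:.2f}  (1.0 means identical rankings)")
```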

Get specific as complexity grows. Vague requests that worked early start to fail later. The more complex your codebase, the more precise your instructions need to be.

Test everything. AI models are confident even when they're wrong. Every change needs verification, especially as projects scale.

Use multiple models strategically. Different models have different strengths. Opus for complex logic, Gemini for explanation and context, Sonnet for rapid iteration.

Here's My Take

And take it for what it's worth: I'm a product person and builder first, with some minor dev chops, not the other way around. But…

AI coding tools are genuinely transformative for POCs and early-stage development. They can turn ideas into working prototypes faster than ever before. But they're not magic, and they're not (yet!) ready to handle complex, evolving codebases without significant human guidance.

The real skill isn't getting AI to write perfect code—it's knowing how to work with AI as your codebase grows, when to push it harder, when to switch models, and when to take back the wheel.

The honeymoon phase of AI coding is real and exciting. But the real value comes from learning to work together when things get complicated.

The payoff for me, for now: Judge Promptington is powering my ongoing "Field Test" series, where I put different AI models head-to-head on real-world tasks.

Have thoughts on this?

I'd love to hear your perspective. Feel free to reach out.