I Asked Claude Sonnet 4 to Build an App. Then I Watched Football.

Last week, I had an idea for a web app I'm calling "Sir Promptington" - a tool that uses multiple LLMs to score, evaluate, and improve user prompts. Instead of writing out a bunch of planning docs, I decided to test Claude Sonnet 4's rumored instruction-following prowess.

I fed the concept to Sonnet 4, then did what any reasonable person would do: turned on the TV to catch the end of Chelsea's Conference League Championship match (on delay, naturally… kids).

When I came back, I had a complete codebase waiting.

The Results: Surprisingly Good

Pulling the code into Replit this morning, I plugged in my API keys and... it basically worked. Out of the box. For a multi-LLM integration with a functional frontend.

What I liked:

  • Instruction following was really solid - Sonnet 4 translated my conceptual description into working code with minimal gaps
  • The frontend was decent … despite zero design input from me
  • Claude and Gemini integrations worked immediately - (mostly) clean API calls, (some) error handling
  • Overall architecture was sound - not the typical AI-generated spaghetti code, which is especially notable given that I gave no guidance on architecture or framework (a rough sketch of the core concept follows this list)
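
To make "multi-LLM integration" concrete, here's a rough sketch of the core idea - my own illustration, not the code Sonnet 4 generated: each model scores the prompt independently and the app aggregates the results.

    # Rough sketch of the Sir Promptington concept (illustrative, not the
    # generated code): fan a prompt out to one scorer per model, then aggregate.
    from statistics import mean
    from typing import Callable, Dict

    Scorer = Callable[[str], float]  # takes a prompt, returns a 0-10 score

    def evaluate_prompt(prompt: str, scorers: Dict[str, Scorer]) -> dict:
        """Collect a score from each model and report per-model plus overall scores."""
        scores = {name: score(prompt) for name, score in scorers.items()}
        return {"scores": scores, "overall": mean(scores.values())}

    # Stand-in scorers; the real ones would call the Claude, Gemini, and OpenAI APIs.
    result = evaluate_prompt(
        "Summarize this article in three bullet points.",
        {"claude": lambda p: 7.5, "gemini": lambda p: 8.0, "openai": lambda p: 7.0},
    )
    print(result)  # {'scores': {...}, 'overall': 7.5}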

This is a significant leap from Sonnet 3.5, which often produced code that looked right but fell apart under scrutiny.

The Predictable Pain Points

Of course, it wasn't perfect. Some of the usual suspects appeared:

API integration struggles: While Claude and Gemini worked flawlessly, getting OpenAI to cooperate required substantial back-and-forth. I probably spent more time debugging the OpenAI integration than it would have taken to write it myself from scratch.

This pattern is consistent across LLMs - they seem to struggle with the nuances of different API specifications, even when the documentation is clear. You'd think that search capabilities would have cleared this up, but not in this case.
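
To give a sense of why those nuances bite, here's roughly what the same "send a prompt, get text back" call looks like across the three providers' Python SDKs. This is my own sketch, not code from the Sir Promptington repo, and the model names are just examples:

    import os
    import anthropic
    import google.generativeai as genai
    from openai import OpenAI

    prompt = "Rate this prompt for clarity on a 1-10 scale."

    # OpenAI: chat.completions.create with a messages list;
    # the text comes back in choices[0].message.content
    openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
    openai_text = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    # Anthropic: messages.create, which requires max_tokens;
    # the text comes back in content[0].text
    anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
    anthropic_text = anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text

    # Gemini: configure the SDK, build a GenerativeModel, call generate_content;
    # the text comes back on the .text attribute
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    gemini_text = genai.GenerativeModel("gemini-1.5-flash").generate_content(prompt).text

Three similar-looking clients, three different required parameters and response shapes - exactly the sort of detail that's easy to get subtly wrong.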

Frontend formatting issues: Displaying outputs properly took several iterations - some outputs were getting cut off. If we stick with the "AI as talented intern" analogy, this is the talented but slightly lazy, detail-averse intern: anyone taking even a quick glance at the output would have noticed these issues.

The Real Test Ahead

Here's what I'm really curious about: How will Sonnet 4 handle iteration and modification?

Previous Claude versions excelled at generating initial code but often turned elegant solutions into unmaintainable messes when asked to add features or refactor. The next phase of this project will test whether Sonnet 4 can maintain code quality through successive modifications.

I have several directions I want to take Sir Promptington:

  • Advanced scoring algorithms
  • User authentication and prompt history
  • Integration with additional models
  • Export and sharing capabilities

Each addition will test Sonnet 4's ability to work with an existing codebase (albeit a small one) rather than generating from scratch.

What This Means for AI Product Development

This experience reinforces a few key principles for AI-assisted development:

  1. LLMs excel at initial implementation when given clear, comprehensive requirements
  2. API integrations remain a consistent weakness - plan extra time for debugging third-party connections
  3. The real value test comes in iteration - can the AI maintain code quality as complexity grows?

For product managers evaluating AI development tools, focus less on the initial demo - most tools handle that quite well. Instead, watch how they perform in the later stages, as complexity grows.

Have thoughts on this?

I'd love to hear your perspective. Feel free to reach out.