"For some tasks, we are saturating the amount of intelligence needed for that task."
This insight from Anthropic co-founder Benjamin Mann, shared on Lenny Rachitsky's podcast, is one of those things that sounds obvious once you hear it—but changes everything about how you think about AI deployment.
It's basically saying what we all know but don't want to admit: sometimes you're using a sledgehammer to hang a picture frame.
The Intelligence Saturation Point
Mann's "intelligence saturation" idea is pretty straightforward. Some tasks just don't need that much cognitive horsepower. It's like using an F1 car to drive through a parking garage—all that capability doesn't matter if you're going 5 mph.
But here's the thing: lots of teams are still defaulting to the biggest, baddest model they can get their hands on for everything. Which makes sense—who wants to be the person who chose the "worse" model? But that thinking costs real money.
Where I've Seen This Play Out
I've been testing this stuff obsessively, and the examples are everywhere:
Writing routine emails: That message to your landlord about paying rent late? Claude Opus is total overkill. The structure is predictable, the ask is clear, and honestly, overthinking it probably makes it worse.
Tagging support tickets: Is this a billing question or a tech issue? Pattern matching at its finest. I've watched Claude Haiku nail this with high accuracy while Sonnet sits there contemplating the deeper philosophical implications of customer frustration.
Pulling data from documents: Extracting line items from invoices is basically fancy copy-paste. You want reliability and speed, not creativity.
The weird part? Sometimes the smaller models actually do better. My theory is that the big models overthink simple stuff. It's like asking a PhD in literature to proofread your grocery list—technically they can do it, but they might suggest restructuring the whole thing when you just wanted to know if you spelled "bananas" right.
How to Actually Think About This
Instead of "what's the best model," the question becomes "what's the right model for this specific thing I'm trying to do?"
When you probably need the expensive stuff:
- Content that goes out with your company's name on it
- Complex analysis where you need actual reasoning
- Research where you're connecting dots between different concepts
- Anything where a mistake would be really, really bad
When you can probably save some cash:
- Routine processing that follows clear patterns
- High-volume stuff where costs add up fast
- Simple categorization and tagging
- Basic text generation with clear templates
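These decision rules are easy to encode. Here's a minimal sketch of a task-tier router; the task categories, tier assignments, and model names are illustrative placeholders, not a definitive mapping—swap in whatever taxonomy fits your workload.

```python
# Minimal task-tier router. Model names and task categories are
# illustrative assumptions, not a prescribed mapping.

PREMIUM = "claude-opus"   # assumption: your top-tier model
BUDGET = "claude-haiku"   # assumption: your cheap, fast model

# Tasks where mistakes are costly or real reasoning is needed
PREMIUM_TASKS = {"public_content", "complex_analysis", "research", "high_stakes"}

# Routine, high-volume, pattern-driven work
BUDGET_TASKS = {"routine_processing", "categorization", "tagging", "templated_text"}

def pick_model(task_type: str) -> str:
    """Route a task to the cheapest tier that can plausibly handle it."""
    if task_type in PREMIUM_TASKS:
        return PREMIUM
    if task_type in BUDGET_TASKS:
        return BUDGET
    # Unknown tasks default to the cheap tier; escalate only if quality fails
    return BUDGET
```

The interesting design choice is the default: unknown tasks start on the cheap tier and escalate on failure, rather than starting expensive "to be safe."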
My Testing Approach
I've started doing what I call the "ladder method":
Start with the cheapest model that could theoretically handle your task. Then only move up when you hit a wall that actually matters.
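In code, the ladder method is a loop: try models cheapest-first and stop at the first one that clears your quality bar. The `evaluate` function here is a hypothetical scoring hook (e.g., an LLM-as-judge along the lines of QualRank), and the model list is illustrative.

```python
# Sketch of the "ladder method": cheapest model first, climb only when
# quality falls short. evaluate() is a hypothetical scoring function
# (e.g., an LLM-as-judge); the model names are placeholders.

MODELS_CHEAPEST_FIRST = ["haiku", "sonnet", "opus"]

def ladder_pick(task, evaluate, quality_bar=0.85):
    for model in MODELS_CHEAPEST_FIRST:
        score = evaluate(task, model)
        if score >= quality_bar:
            return model, score  # cheapest model that clears the bar
    # Nothing cleared the bar: fall back to the top model
    return MODELS_CHEAPEST_FIRST[-1], score
```

The `quality_bar` is the whole game: set it from what "good enough" actually means for the task, not from the best score any model can reach.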
Example: I ran a test through my QualRank tool—the task was to summarize the Federal Reserve's most recent Beige Book report, pulling out notable insights for business leaders. I tested Claude Haiku 3.5, Sonnet 3.7, and Sonnet 4. QualRank judged the Sonnet 3.7 output the best (92.3%), and it was marginally cheaper than Sonnet 4's. But here's what's really interesting: Haiku scored 89% and cost 0.18 cents per run, versus 0.78 cents for Sonnet 4 and 0.76 cents for Sonnet 3.7.
As a one-off output, no biggie. But if this is something running at scale, unless you need that ever-so-slight edge in quality, you'd end up paying over 4x the cost to use the top model.
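The arithmetic is worth making explicit. Projecting the per-run costs from that single test (cents per run; real costs vary with prompt and output length) to a hypothetical 100,000 runs per month:

```python
# Per-run costs in cents, from the single QualRank test above.
# Real costs depend on prompt/output length; this is a projection.
COST_CENTS = {"haiku_3_5": 0.18, "sonnet_3_7": 0.76, "sonnet_4": 0.78}

runs_per_month = 100_000  # hypothetical volume

for model, cents in COST_CENTS.items():
    dollars = cents * runs_per_month / 100
    print(f"{model}: ${dollars:,.0f}/month")
# haiku_3_5: $180/month, sonnet_3_7: $760/month, sonnet_4: $780/month

# Sonnet 4 vs Haiku per run: 0.78 / 0.18 ≈ 4.3x
```

At that volume the gap is $600/month for a 3-point quality difference—trivially worth it for some tasks, pure waste for others.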
Why This Actually Matters (And It's Not Just About Saving Money)
Here's where it gets interesting. This isn't really about being cheap—it's about being smart with your AI budget so you can spend big where it actually counts.
Better resource allocation: When you're not burning money on overkill solutions for simple tasks, you can afford to use frontier models where they'll actually make a difference. Your customer-facing content gets the premium treatment while your internal processing runs efficiently in the background.
Competitive advantage: While your competitors are using Opus to categorize emails, you're using Haiku for that and putting Opus on the customer experience that actually differentiates you.
Sustainable scaling: Understanding which parts of your system need premium models means you can predict costs as you grow, instead of getting surprised by a massive inference bill.
Team focus: Less time arguing about which model to use, more time building stuff that matters.
It's like having a good wine budget—you don't buy the $200 bottle for cooking, but you also don't serve Two Buck Chuck at your anniversary dinner.
The Automation Angle
Doing this analysis manually is a pain, and frankly, most teams either skip it or do it once and forget about it. That's exactly why I'm building QualRank AI—to systematically figure out the most cost-effective model that still hits your quality bar.
Because let's be honest, the model landscape changes every few months. What made sense in January might be completely wrong by March (or February. Or even later in January!).
Where This Is All Heading
We're moving from the "bigger is always better" phase to the "right tool for the job" phase. The companies that figure this out early are going to have way better unit economics and can actually afford to use the good stuff where it matters.
The real question isn't whether you're using the latest model. It's whether you're being strategic about where you spend your AI budget—and whether you have a system for making those decisions consistently.