The Model Eval Tool I Wish I Had (And Am Now Building)

We had a problem.

I was at The Motley Fool, and we'd just finished building content pipelines that generate thousands of earnings reports every quarter for our members.

We'd used GPT-4-Turbo because that was the obvious choice at the time (early 2024). It produced better-quality content than anything else available, and it was a massive step down in price from GPT-4.

But then it's March, we're just getting ready for the onslaught of earnings season, and Anthropic releases the suite of Claude 3 models. Facepalm.

So we run a big set of evals against our current GPT outputs. The damage isn't that bad: the Claude models aren't that good (yet…).

We (mostly) get through that earnings season before… GPT-4o is released. A notable quality improvement, another big drop in price. More hours of evals.

We switch to GPT-4o.

Not long after, Claude 3.5 Sonnet hits, major game changer. Then GPT-4o-mini, massive cost savings. The Perplexity Sonar models. And of course o1-preview and o1-mini.

We were a small team trying to do a lot. We had very solid content outputs, already rated by our members as basically on par with their human-written counterparts.

But could they be better? Could they be cheaper? Probably. But we had limited time to do the work of running new evals on systems that were already running. We were busy running evals on new systems to support further content.

Which brings me to the tool I'm building now.

The Solution: LLM-Powered Model Evaluation

Research has shown that LLM judges agree with human preferences 80% of the time or more. And that was with GPT-4.

That means much of the work of evaluating a model upgrade or switch can be offloaded to LLM judges.

And in fact, in my experience working with LLMs as judges, they do this quite well. Yes, given the right reference information, they can pick out differences in factual correctness. But they also do very well comparing the quality of one piece of content against another.

And you don't need the most powerful model to do this. In fact, the smaller (cheaper!) models do this quite effectively.

I tend to think about it this way: I can't write like Haruki Murakami, but if you put his writing and an 8th grader's in front of me, I can tell you which is better. The same idea applies to the models. Gemini-1.5-Flash may not be able to produce an output as good as o3's, but it can tell when an o3 output is better than one from Grok 3.
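To make that concrete, here's a minimal sketch of a pairwise judge. This is my own illustration, not the tool itself: it assumes the openai Python SDK, and the judge model, prompt wording, and judge_pair helper are all hypothetical choices.

```python
# Minimal pairwise-judge sketch. The judge model, prompt wording, and helper
# name are illustrative assumptions, not a specific product implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
JUDGE_MODEL = "gpt-4o-mini"  # a small, cheap judge is often good enough

def judge_pair(task: str, output_a: str, output_b: str) -> str:
    """Ask the judge which of two outputs better fulfills the task: 'A', 'B', or 'TIE'."""
    prompt = (
        "You are comparing two responses to the same task.\n\n"
        f"Task:\n{task}\n\n"
        f"Response A:\n{output_a}\n\n"
        f"Response B:\n{output_b}\n\n"
        "Which response is higher quality overall? "
        "Answer with exactly one word: A, B, or TIE."
    )
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"
```

In practice you'd run each comparison in both orders, since LLM judges have a known position bias toward one slot or the other.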

What I'm Building

So, putting this all together: here's what I'm building.

If models can effectively compare the quality of outputs, and Product Managers, devs, and builders are currently either (1) spending lots of time running evals to decide whether to upgrade or change models, (2) not considering cost and the highly cost-effective alternative models out there, or (3) not really thinking about changing models at all… then a tool that speeds this up and makes it far easier to:

  • Find the model that gives you the highest quality output for your specific prompt (not some benchmark!)
  • Find models that are nearly as performant on your prompt at a fraction of the cost
  • Help you keep tabs on new models as they come out

will make a lot of lives easier and cycles faster. A rough sketch of the core comparison loop follows below.
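This sketch reuses the hypothetical judge_pair() from the earlier snippet. The candidate list and helper names are illustrative assumptions; a real version would also pull per-token pricing to weigh quality against cost.

```python
# Rough sketch of the comparison loop: run your real prompt on candidate
# models, then judge each candidate against the current production output.
# Reuses the hypothetical judge_pair() defined in the earlier snippet.
from openai import OpenAI

client = OpenAI()
CANDIDATES = ["gpt-4o", "gpt-4o-mini"]  # models to test against your baseline

def generate(model: str, prompt: str) -> str:
    """Run the production prompt against a candidate model."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def compare_to_baseline(prompt: str, baseline_output: str) -> dict:
    """Tally how often each candidate beats the current production output (0-2 wins)."""
    wins = {}
    for model in CANDIDATES:
        candidate_output = generate(model, prompt)
        # Judge both orderings to reduce position bias.
        first = judge_pair(prompt, baseline_output, candidate_output)   # "B" means candidate wins
        second = judge_pair(prompt, candidate_output, baseline_output)  # "A" means candidate wins
        wins[model] = int(first == "B") + int(second == "A")
    return wins
```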

Beta Testing

I'm working toward a widening beta test. Let me know if you're interested in testing!

Have thoughts on this?

I'd love to hear your perspective. Feel free to reach out.