GPT-5 Reality Check
GPT-5 is here. But what does that really mean? And what does it mean for you? Here's what I think.
Sharing insights from my experiences building AI products and leading teams.
You don't always get to choose the weather. Last week: threshold run in 108-degree heat. This week: same workout in driving rain. Both times, I could have chosen the treadmill. But on race day, I won't get to choose conditions. Same deal in business.
I have a Kendrick Lamar problem. Or rather, AI has a Kendrick Lamar problem. As part of testing my cross-model evaluation system, I gave four models a simple task: tell me who the best rapper alive is. All four unanimously chose Kendrick. Here's what that reveals about AI consensus and creativity.
Context Engineering sits in that critical zone between prompt engineering and RAG—and it's where most AI implementations actually succeed or fail. It was the biggest breakthrough in my work building AI-driven content systems.
"For some tasks, we are saturating the amount of intelligence needed for that task." This insight from Anthropic co-founder Benjamin Mann changes everything about how you think about AI deployment. Sometimes you're using a sledgehammer to hang a picture frame.
Six months ago, I wasn't impressed with Anthropic's Claude models. Today, after running 40+ head-to-head evaluations through my multi-model testing tool, I'm convinced Claude Sonnet 4 is the best LLM for most use cases. Here's the data that changed my mind.
I thought I had AI hallucinations figured out. At The Motley Fool, I was building fact-checking systems for financial content—the kind of stuff where getting numbers wrong can cost people real money. Turns out that even with perfect data, LLMs still find creative ways to screw things up.
Building in public feels a bit like showering in public. Both require a certain comfort with vulnerability, with being seen in an unfinished state, with accepting that not everyone will appreciate the view.
I was drafting beta invites when I realized my latest evaluation results had vanished. Not a great look when you're about to ask people to test your AI model comparison tool. That's when it hit me: I was about to invite beta testers to use an app that couldn't remember what they'd done.
Most founders waste months perfecting features users don't want. I took a different approach. Three weeks ago, I had nothing: no POC, no MVP, not even a sketch. Today, I'm sending beta invites for a tool that lets users test prompts across multiple AI models...
We had a problem. I was at The Motley Fool, and we'd just finished building content pipelines that produced thousands of earnings reports every quarter for our members. We'd used GPT-4-Turbo because it was the obvious choice at the time, but then the model landscape kept shifting...
Yes, of course I had to take a look at the paper from Apple on AI reasoning models. "The Illusion of Thinking" reveals that even the most sophisticated AI reasoning models experience complete accuracy collapse beyond certain complexity thresholds...
Some AI practitioners are drowning their models in data, mistaking context window size for context quality. As context windows expand to accommodate millions of tokens, there's a dangerous assumption at work: if the model can handle more information, it should...
Everyone's sharing their AI coding wins. The perfect app built in an afternoon. The flawless feature shipped in minutes. But here's what nobody talks about: what happens when you move past the proof-of-concept phase and things get messy?
Last week, I had an idea for a web app I'm calling "Sir Promptington": a tool that uses multiple LLMs to score, evaluate, and improve user prompts. Instead of writing out a bunch of planning docs, I decided to test Claude Sonnet 4's rumored instruction-following prowess...
DeepSeek just released an update to its R1 reasoning model. The response? Crickets. Remember January? This same Chinese AI startup triggered a near $1 trillion market selloff. Three weeks later, DeepSeek had faded into background noise...
Can an LLM judge its own creative limitations? In this inaugural Field Test, I'm pitting four leading LLMs against each other in a (cheesy) creative challenge and testing Gemini's self-awareness about its own creative shortcomings.
On my run the other day, I spotted a small turtle in the middle of the road. I picked it up and carried it to the nearby creek. It got me thinking about forces greater than ourselves and how we can build defensible AI products...
I've read quite a bit that prompt engineering is no longer relevant. Or that it soon won't be. Or that it shouldn't be necessary. I don't buy it. Or, at least, not entirely...
Want to know when new content is available? Feel free to reach out.