Can an LLM judge its own creative limitations? In this inaugural Field Test, I'm pitting four leading LLMs against each other in a (cheesy) creative challenge and testing Gemini's self-awareness about its own creative shortcomings.
Key Findings
My testing revealed several insights worth noting:
OpenAI's o3 is a surprising creative powerhouse. Given all the hype about o3's scientific and mathematical capabilities, I underestimated its creative potential, especially on deliberately silly tasks. If you're seeking creative outputs, o3 is a solid choice.
Gemini is competitively creative. While it didn't top the results, Gemini performed reasonably well. It has become the go-to model for many users (a surprise, considering Google's previous lag in this space), so it's reassuring that choosing it doesn't mean a dramatic sacrifice on the creative front.
Meta's Llama significantly underperforms. I would not recommend this model for creative tasks based on these results. Its outputs consistently lacked the imagination and humor of the others.
Claude 3.7 Sonnet remains a strong contender. Sonnet narrowly lost to OpenAI's o3 (26.5/30 points versus 27.5/30), but both significantly outperformed Gemini (24/30) and especially Llama (21/30). It's still a reliable choice for creative generation.
The Question
I began this Field Test with a hypothesis: Gemini 2.5 Pro isn't particularly creative compared to models like Claude 3.7 Sonnet.
But I wondered: Does Gemini recognize its own creative limitations? If I have it generate creative content, compare that against outputs from other LLMs, and then have Gemini judge them all blindly, would it acknowledge that other models produced superior results?
Methodology
For this test, I used an intentionally silly movie pitch prompt (cheddar and gouda cheese falling in love, getting separated, and reuniting after 10 years). I generated three outputs each from:
- Gemini 2.5 Pro
- Claude 3.7 Sonnet
- Meta Llama 4 Maverick
- OpenAI o3
I also included single outputs from Grok 3 and Nvidia Nemotron 70B for additional comparison.
All testing was conducted through Simtheory.ai, a multi-model service that allows quick switching between models. While this means results aren't perfectly comparable to using provider interfaces directly, the consistent testing environment ensures fair comparison across models.
After collection, I assigned random numbered labels to each output, stripped identifying information, and had Gemini 2.5 Pro evaluate all submissions using consistent criteria.
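If you want to replicate the blinding step, here's a minimal sketch of how it could be scripted. The labels, placeholder texts, and output file are my own illustrative assumptions, not the exact workflow I ran through Simtheory:

```python
import json
import random

# Hypothetical setup: each output stored as (model_name, pitch_text).
# Substitute the actual generated pitches.
outputs = [
    ("gemini-2.5-pro", "Pitch text from Gemini..."),
    ("claude-3.7-sonnet", "Pitch text from Claude..."),
    ("llama-4-maverick", "Pitch text from Llama..."),
    ("openai-o3", "Pitch text from o3..."),
]

random.shuffle(outputs)  # randomize order so label numbers reveal nothing

# Anonymous numbered labels go to the judge; the answer key stays out of the prompt.
blinded = {f"Submission {i + 1}": text for i, (_, text) in enumerate(outputs)}
answer_key = {f"Submission {i + 1}": model for i, (model, _) in enumerate(outputs)}

print(json.dumps(blinded, indent=2))        # paste this into the judging prompt
with open("answer_key.json", "w") as f:     # keep this for un-blinding afterward
    json.dump(answer_key, f, indent=2)
```

Keeping the answer key in a separate file means the judging prompt never sees model names, which is the whole point of the blind evaluation.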
Results
- OpenAI o3: 27.5/30 points (across three outputs)
- Claude 3.7 Sonnet: 26.5/30 points
- Gemini 2.5 Pro: 24/30 points
- Llama 4 Maverick: 21/30 points
(Note: Nvidia Nemotron 70B scored 8/10 on its single entry and Grok 3 scored 8.5/10, so their totals aren't directly comparable to the 30-point scores above.)
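To put the single-entry models on the same scale as the three-output totals, a quick per-output average helps (a back-of-the-envelope sketch using the totals reported above):

```python
# Per-output averages (out of 10), computed from the totals reported above.
scores = {
    "OpenAI o3": (27.5, 3),
    "Claude 3.7 Sonnet": (26.5, 3),
    "Gemini 2.5 Pro": (24.0, 3),
    "Llama 4 Maverick": (21.0, 3),
    "Grok 3": (8.5, 1),
    "Nvidia Nemotron 70B": (8.0, 1),
}

for model, (total, n_outputs) in scores.items():
    print(f"{model}: {total / n_outputs:.2f} per output")
```

On that basis, o3 averages about 9.2 per output, Claude 8.8, Gemini 8.0, and Llama 7.0, with Grok 3 (8.5) and Nemotron (8.0) landing near Gemini.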
Implications for AI Users
Perhaps the most interesting finding is Gemini's ability to recognize creative quality in outputs—even when its own submissions were in the mix. While creativity evaluation is inherently subjective, Gemini's rankings aligned surprisingly well with my own assessments across multiple evaluation runs.
This suggests that models like Gemini could potentially serve as unbiased arbiters when comparing outputs from different LLMs, which could streamline the process of identifying which model works best for specific tasks.
The rankings themselves challenge some common assumptions. I expected Claude Sonnet to dominate creative tasks with a significant lead over Gemini. Instead, o3 claimed the top spot, and while Gemini did lag behind the leaders, the gap wasn't as dramatic as anticipated.
For practical applications, this means:
- If you can easily switch between models (using services like Simtheory), it's worth selecting the right tool for creative tasks.
- If you're primarily using Gemini, don't stress too much about its creative capabilities unless you need absolutely top-tier creative output.
- While Llama may have its strengths, creative generation isn't one of them. As I reviewed its outputs, Llama felt like the student in the back, sticking gum under the desk, reading comics behind their Shakespeare, and completely mailing in the assignment.
Testing Limitations
While this test provides interesting directional insights, it has limitations: the creative task was deliberately silly, Gemini itself served as judge (though I reviewed its rankings), and I only tested one creative scenario. Future Field Tests will explore different task types and judging methodologies.
For Fun
Here's the prompt I used to get the outputs (note, not an optimized prompt!):
Can you please write a 200 word pitch for a hilarious movie about a slice of cheddar cheese meeting, and falling in love, with a slice of gouda. The two get separated based on cheese type, but through a hilarious set of circumstances, find each other again 10 years later.
The success of your pitch will be judged on: 1) its creativity, 2) its laugh-out-loud moments, and 3) sticking to the ask and creating a cohesive story from it.
And here's the "winning" output from OpenAI o3:
Title: Grate Expectations
Cheddar Chad is a sharp-tongued, neon-orange slice stacked in Midwest Mellow Market, endlessly cracking jokes about crackers. Gouda Gabriella is a smoky, sophisticated wedge imported from the Netherlands dairy case next door, serenading customers with her velvety accent. One steamy summer night the deli clerk misfiles Gabriella into the plastic clamshell with Chad. Sparks melt. They vow to stay together forever, but morning inventory cruelly ships them to different grocery chains, sorted by UPC prejudice.
Ten years pass. Chad is now grated onto nachos at a minor-league ballpark, dreaming of the one that got a whey. Gabriella resides in a hipster charcuterie truck, forced to pose beside fig jam influencers. During a freak promotional stunt, the ballpark caters the truck for "Fromage Friday." A runaway Roomba carrying a mountain of crackers collides with the buffet, launching both cheeses onto the Kiss-Cam, reunifying them under 40-foot Jumbotron glory. Viral video fame propels the duo on a cross-country food-comedy quest to prove true love isn't limited by milk-fat percentage. Picture "When Harry Met Sally" served on a toothpick with "Toy Story" charm and "Planes, Trains & Automobiles" chaos. Audiences will laugh, melt, and shout, "That's nacho average love story!"