Another day, another model—and this one didn’t come out swinging.
OpenAI’s latest lightweight model, o3, showed up in the LMSYS Chatbot Arena, ready to impress.
But instead of flexing, it flopped. Quietly.
Let’s unpack why this is making noise (even if o3 itself isn’t).
What’s Going On?
OpenAI launched o3 as a lightweight model, the kind meant to be faster, cheaper, and “good enough” for most tasks.
It’s not GPT-4, and it’s not trying to be.
But that didn’t stop independent testers from throwing it into the ring.
And here’s the twist:
🔹 It underperformed.
🔹 It ranked lower than Claude 3 Opus, GPT-4, and Gemini 1.5 Pro.
🔹 In some tasks, it even lagged behind older models.
Basically, it got benched on its own debut.
What Are They Testing in These Benchmarks?
This isn’t one of those mysterious “trust us, it’s great” reviews.
This is real testing—run by the LMSYS team through their Chatbot Arena.
Here’s what o3 got tested on:
- 🧠 Math & Logic: Can it think through problems, puzzles, and calculations?
- 👨‍💻 Code Writing: Can it write working code or fix broken scripts?
- ✍️ Creative Writing: Can it tell a story, crack a joke, or write a love letter?
- 🌍 General Knowledge: Is it smarter than a fifth grader? Or Wikipedia?
- 💬 Multi-turn Chat: Can it hold a conversation like it remembers what you said?
- 🎭 Instruction Following & Roleplay: Can it be helpful, weird, or both?
- 🌐 Translation: Does it understand languages beyond Silicon Valley English?
Each test was a head-to-head blind comparison.
Real users picked their favourite answers without knowing which model wrote them.
That’s how o3 ended up with a lower Elo score—the AI version of a report card.
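For the curious, here's a minimal Python sketch of how a pairwise Elo update works, just to show why repeated losses in blind matchups drag a model's score down. This is the textbook Elo formula with an assumed K-factor of 32 and made-up ratings, not LMSYS's actual scoring code.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Update both ratings after one blind head-to-head vote (K=32 assumed)."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Illustrative numbers only: a lower-rated model loses a blind matchup,
# so its rating drifts further down while the winner's rises.
print(update_elo(1100.0, 1250.0, a_won=False))  # roughly (1090.5, 1259.5)
```

Lose enough of those blind votes and the leaderboard position follows, which is exactly what the Arena results suggest happened here.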
What OpenAI Is Saying
To be fair to OpenAI, they never called o3 their golden child.
It’s a lightweight model, not meant to compete with GPT-4.
The goal? Save on compute, run fast, and still sound pretty smart.
But here’s the thing:
Even lightweights need to hold their own in the ring—and this one stumbled early.
What That Means (In Human Words)
Here’s the human-friendly summary:
- o3 is not bad—it’s just not great either.
- It’s made to be cheaper and faster, but in trying to be light, it also went… light on performance.
- People expect anything from OpenAI to feel premium, and this felt more like a free sample that didn’t convince us to buy.
If you’re using AI in products or workflows, this is your heads-up:
You might want to test o3 yourself before swapping it in.
It’s good for lightweight tasks—but if you need brains, memory, or subtlety? Might wanna call Claude or Gemini instead.
🔧 How the Models Stack Up:
| Model | Elo Ranking (Chatbot Arena) | Strengths | Weak Spots | Best For |
|---|---|---|---|---|
| GPT-4 | 🔵 Top 3 | Strong general reasoning, good at code, memory across turns | Slower, expensive | Premium apps, heavy logic tasks |
| Claude 3 Opus | 🟣 #1 right now | Best overall Elo score, smooth responses, great memory | Slightly verbose | Assistants, research, long chats |
| Gemini 1.5 Pro | 🟢 Top 5 | Fast, good in multilingual, solid reasoning | Can go off-track | Mixed-use, team integrations |
| OpenAI o3 | 🟡 Lower third | Cheap, fast, okay at basics | Struggles with nuance, code, multi-step | Lightweight apps, summaries, drafts |
Frozen Light Team Perspective
o3 feels like one of those free samples at the supermarket.
Nice idea, but you’re not putting it in your cart.
We know OpenAI is building different models for different jobs.
But when your name is OpenAI, people expect every model to be a top student.
Here’s how we see it:
- This isn’t a big disaster.
- But it is a good reminder that not all AI models are the same.
- “Light” models can be great—but only if they still get the job done.
Right now, o3 is like a smart intern that’s still learning.
You might use it for quick tasks, but you’re not asking it to write your next book.