Another day, another model—and this one didn’t come out swinging.
OpenAI’s latest lightweight model, o3, showed up in the LMSYS Chatbot Arena, ready to impress.
But instead of flexing, it flopped. Quietly.
Let’s unpack why this is making noise (even if o3 itself isn’t).
What’s Going On?
OpenAI launched o3 as a lightweight model, the kind meant to be faster, cheaper, and “good enough” for most tasks.
It’s not GPT-4, and it’s not trying to be.
But that didn’t stop independent testers from throwing it into the ring.
And here’s the twist:
🔹 It underperformed.
🔹 It ranked lower than Claude 3 Opus, GPT-4, and Gemini 1.5 Pro.
🔹 In some tasks, it even lagged behind older models.
Basically, it got benched on its own debut.
What Are They Testing in These Benchmarks?
This isn’t one of those mysterious “trust us, it’s great” reviews.
This is real testing—run by the LMSYS team through their Chatbot Arena.
Here’s what o3 got tested on:
- 🧠 Math & Logic: Can it think through problems, puzzles, and calculations?
- 👨‍💻 Code Writing: Can it write working code or fix broken scripts?
- ✍️ Creative Writing: Can it tell a story, crack a joke, or write a love letter?
- 🌍 General Knowledge: Is it smarter than a fifth grader? Or Wikipedia?
- 💬 Multi-turn Chat: Can it hold a conversation like it remembers what you said?
- 🎭 Instruction Following & Roleplay: Can it be helpful, weird, or both?
- 🌐 Translation: Does it understand languages beyond Silicon Valley English?
Each test was a head-to-head blind comparison.
Real users picked their favourite answers without knowing which model wrote them.
That’s how o3 ended up with a lower Elo score—the AI version of a report card.
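For the curious, here's a minimal Python sketch of how a pairwise Elo update works, just to show why repeated losses in blind matchups drag a model's score down. This is the textbook Elo formula with an assumed K-factor of 32 and made-up ratings, not LMSYS's actual scoring code.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Update both ratings after one blind head-to-head vote (K=32 assumed)."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Illustrative numbers only: a lower-rated model loses a blind matchup,
# so its rating drifts further down while the winner's rises.
print(update_elo(1100.0, 1250.0, a_won=False))  # roughly (1090.5, 1259.5)
```

Lose enough of those blind votes and the leaderboard position follows, which is exactly what the Arena results suggest happened here.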
What OpenAI Is Saying
To be fair to OpenAI, they never called o3 their golden child.
It’s a lightweight model, not meant to compete with GPT-4.
The goal? Save on compute, run fast, and still sound pretty smart.
But here’s the thing:
Even lightweights need to hold their own in the ring—and this one stumbled early.
What That Means (In Human Words)
Here’s the human-friendly summary:
- o3 is not bad—it’s just not great either.
- It’s made to be cheaper and faster, but in trying to be light, it also went… light on performance.
- People expect anything from OpenAI to feel premium, and this felt more like a free sample that didn’t convince us to buy.
If you’re using AI in products or workflows, this is your heads-up:
You might want to test o3 yourself before swapping it in.
It’s good for lightweight tasks—but if you need brains, memory, or subtlety? Might wanna call Claude or Gemini instead.
🔧 How the Models Stack Up:
| Model | Elo Ranking (Chatbot Arena) | Strengths | Weak Spots | Best For |
|---|---|---|---|---|
| GPT-4 | 🔵 Top 3 | Strong general reasoning, good at code, memory across turns | Slower, expensive | Premium apps, heavy logic tasks |
| Claude 3 Opus | 🟣 #1 right now | Best overall Elo score, smooth responses, great memory | Slightly verbose | Assistants, research, long chats |
| Gemini 1.5 Pro | 🟢 Top 5 | Fast, good in multilingual, solid reasoning | Can go off-track | Mixed-use, team integrations |
| OpenAI o3 | 🟡 Lower third | Cheap, fast, okay at basics | Struggles with nuance, code, multi-step | Lightweight apps, summaries, drafts |
Frozen Light Team Perspective
o3 feels like one of those free samples at the supermarket.
Nice idea, but you’re not putting it in your cart.
We know OpenAI is building different models for different jobs.
But when your name is OpenAI, people expect every model to be a top student.
Here’s how we see it:
- This isn’t a big disaster.
- But it is a good reminder that not all AI models are the same.
- “Light” models can be great—but only if they still get the job done.
Right now, o3 is like a smart intern that’s still learning.
You might use it for quick tasks, but you’re not asking it to write your next book.