Another day, another model—and this one didn’t come out swinging.

OpenAI’s latest lightweight model, o3, showed up in the LMSYS Chatbot Arena, ready to impress.
But instead of flexing, it flopped. Quietly.
Let’s unpack why this is making noise (even if o3 itself isn’t).

What’s Going On?

OpenAI launched o3 as a lightweight model, the kind meant to be faster, cheaper, and “good enough” for most tasks.

It’s not GPT-4, and it’s not trying to be.
But that didn’t stop independent testers from throwing it into the ring.

And here’s the twist:
🔹 It underperformed.
🔹 It ranked lower than Claude 3 Opus, GPT-4, and Gemini 1.5 Pro.
🔹 In some tasks, it even lagged behind older models.

Basically, it got benched at its own debut.

 

What Are They Testing in These Benchmarks?

This isn’t one of those mysterious “trust us, it’s great” reviews.
This is real testing—run by the LMSYS team through their Chatbot Arena.

Here’s what o3 got tested on:

  • 🧠 Math & Logic: Can it think through problems, puzzles, and calculations?

  • 👨‍💻 Code Writing: Can it write working code or fix broken scripts?

  • ✍️ Creative Writing: Can it tell a story, crack a joke, or write a love letter?

  • 🌍 General Knowledge: Is it smarter than a fifth grader? Or Wikipedia?

  • 💬 Multi-turn Chat: Can it hold a conversation like it remembers what you said?

  • 🎭 Instruction Following & Roleplay: Can it be helpful, weird, or both?

  • 🌐 Translation: Does it understand languages beyond Silicon Valley English?

Each test was a head-to-head blind comparison.
Real users picked their favourite answers without knowing which model wrote them.

That’s how o3 ended up with a lower Elo score—the AI version of a report card.
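If you're curious how those Elo numbers actually move, here's a rough sketch in Python. It's a simplified illustration of the classic Elo update from pairwise votes, not LMSYS's exact pipeline (they fit a Bradley-Terry-style model over all the votes at once), and the starting ratings and vote sequence below are made up for the example.

```python
# Simplified Elo update from blind pairwise votes.
# Illustrative sketch only -- not the exact method LMSYS uses.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B, given current ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return new (rating_a, rating_b) after one head-to-head vote."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Hypothetical votes: the first model wins only 1 of 5 blind matchups.
o3_rating, rival_rating = 1000.0, 1000.0
for o3_won in [False, False, True, False, False]:
    o3_rating, rival_rating = update_elo(o3_rating, rival_rating, o3_won)

print(round(o3_rating), round(rival_rating))  # the loser drifts below its starting rating
```

The point: every blind matchup nudges the loser's score down and the winner's up, so a model that keeps losing head-to-heads drifts toward the lower third of the board.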

 

What OpenAI Is Saying

To be fair to OpenAI, they never called o3 their golden child.
It’s a lightweight model, not meant to compete with GPT-4.
The goal? Save on compute, run fast, and still sound pretty smart.

But here’s the thing:
Even lightweights need to hold their own in the ring—and this one stumbled early.

 

What That Means (In Human Words)

Here’s the human-friendly summary:

  • o3 is not bad—it’s just not great either.

  • It’s made to be cheaper and faster, but in trying to be light, it also went… light on performance.

  • People expect anything from OpenAI to feel premium, and this felt more like a free sample that didn’t convince us to buy.

If you’re using AI in products or workflows, this is your heads-up:
You might want to test o3 yourself before swapping it in.

It’s good for lightweight tasks—but if you need brains, memory, or subtlety? Might wanna call Claude or Gemini instead.
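If you do want to kick the tyres yourself, a quick side-by-side harness is all it takes. The sketch below assumes the official openai Python package and an OPENAI_API_KEY in your environment; the model identifiers and prompts are placeholders, so swap in whatever your account actually exposes and the tasks your workflow really runs.

```python
# A minimal side-by-side check before swapping models in a workflow.
# Assumes the official `openai` package and OPENAI_API_KEY is set.
# Model names and prompts below are placeholders, not recommendations.
from openai import OpenAI

client = OpenAI()

PROMPTS = [
    "Summarise this in two sentences: ...",
    "Fix the bug in this Python snippet: for i in range(10) print(i)",
]

def ask(model: str, prompt: str) -> str:
    """Send one prompt to one model and return the text of its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

for prompt in PROMPTS:
    print("PROMPT:", prompt)
    for model in ("o3", "gpt-4o"):  # assumed model identifiers
        print(f"--- {model} ---")
        print(ask(model, prompt))
    print()
```

Run your real prompts through both models, eyeball the answers (or have your team vote blind, Arena-style), and only then decide whether the cheaper model is good enough for the job.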

 

🔧 How the Models Stack Up

| Model | Elo Ranking (Chatbot Arena) | Strengths | Weak Spots | Best For |
| --- | --- | --- | --- | --- |
| GPT-4 | 🔵 Top 3 | Strong general reasoning, good at code, memory across turns | Slower, expensive | Premium apps, heavy logic tasks |
| Claude 3 Opus | 🟣 #1 right now | Best overall Elo score, smooth responses, great memory | Slightly verbose | Assistants, research, long chats |
| Gemini 1.5 Pro | 🟢 Top 5 | Fast, good at multilingual tasks, solid reasoning | Can go off-track | Mixed-use, team integrations |
| OpenAI o3 | 🟡 Lower third | Cheap, fast, okay at basics | Struggles with nuance, code, multi-step tasks | Lightweight apps, summaries, drafts |
 

Frozen Light Team Perspective

o3 feels like one of those free samples at the supermarket.
Nice idea, but you’re not putting it in your cart.

We know OpenAI is building different models for different jobs.
But when your name is OpenAI, people expect every model to be a top student.

Here’s how we see it:

  • This isn’t a big disaster.

  • But it is a good reminder that not all AI models are the same.

  • “Light” models can be great—but only if they still get the job done.

Right now, o3 is like a smart intern that’s still learning.
You might use it for quick tasks, but you’re not asking it to write your next book.
