New AGI Test Challenges Existing AI Models - Top Models Scored Between 1% and 1.3%. Humans? Around 60%.
Turns out, AI might still need to show its work—like a nervous student on exam day.
A new test called ARC-AGI-2 was launched by the Arc Prize Foundation (led by François Chollet, the same brain behind Keras).
This isn’t your average benchmark. It’s a mind-bending puzzle set designed to check if AI can actually think, not just predict.
So how did the AI models do on the test?
Well, top AI models from OpenAI, Anthropic, and others... flunked!
Most scored between 1% and 1.3%. In case you are wondering, humans scored around 60%. That’s not a small gap; it’s a reality check.
This test is different from anything these models have been measured against. Instead of asking the AI to complete a sentence or summarise a paragraph, it throws them into unknown territory: problems they’ve never seen before, with no training data to lean on.
It’s trying to ask: Can AI think like a human when no one is guiding it?
🧠 What kind of questions are in the test?
We know you’re curious about what kind of questions can tell the difference between AI and a human, and how well you’d do if you were asked the same thing. Would you be closer to 60%... or???
The test presents logic puzzles as small grids (like pictures made of coloured blocks or shapes). The AI has to look at a few examples and figure out the hidden rule.
For example:
You’re shown 3 images:
- Image 1: a red square in the top left, and a blue circle in the bottom right
- Image 2: a red square in the top right, and a blue circle in the bottom left
- Image 3: a red square in the bottom right, and a blue circle in the top left
Now, you’re asked: what comes next?
💡 Human thinking: “Oh, the red square moves clockwise, and the blue circle mirrors it.”
The AI has to spot that pattern and generate the next correct image.
Sounds simple, right? But for AI, this is mental gymnastics—especially if it hasn’t seen anything like this before.
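To make that concrete, here’s a minimal sketch in Python of the kind of hidden rule a solver has to discover. The corner names, functions, and grid representation are our own simplified illustration, not the actual ARC-AGI-2 task format.

```python
# A simplified illustration of the puzzle above, NOT the real ARC-AGI-2 format.
# Hidden rule: the red square steps clockwise around the corners,
# and the blue circle always sits in the diagonally opposite corner.

CORNERS_CLOCKWISE = ["top left", "top right", "bottom right", "bottom left"]

def opposite(corner: str) -> str:
    """The diagonally opposite corner, where the blue circle mirrors the square."""
    i = CORNERS_CLOCKWISE.index(corner)
    return CORNERS_CLOCKWISE[(i + 2) % 4]

def next_image(square_corner: str) -> dict:
    """Apply the hidden rule once to produce the next image."""
    i = CORNERS_CLOCKWISE.index(square_corner)
    next_corner = CORNERS_CLOCKWISE[(i + 1) % 4]
    return {"red square": next_corner, "blue circle": opposite(next_corner)}

# Image 3 had the red square in the bottom right, so the next image is:
print(next_image("bottom right"))
# -> {'red square': 'bottom left', 'blue circle': 'top right'}
```

The hard part, of course, isn’t applying the rule once you have it. It’s discovering the rule from only a few examples, which is exactly what the benchmark measures.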
Why is this important?
Because, as a society and as individuals, we still don’t fully understand AI.
AI is a technology built to mimic human cognitive abilities, but there’s a big difference between what we call an LLM and what we expect from AGI.
This test shows that even the smartest AI models today are still focused on pattern matching, not real reasoning.
It’s one thing to predict the next word. It’s another to solve a problem from scratch.
That brings us back to the purpose of AGI: it won’t stay inside a screen. It will live with us, surrounded by people it could easily harm without meaning to.
That’s why it needs the ability to handle real-world complexity, beyond its training.
Living in a complex, ever-changing environment is something AGI can’t be trained for—it has to reason through it.
This is where true reasoning matters.
Frozen Light perspective
This isn’t about failure.
It’s about the new standard for what’s coming next: starting to build the benchmark for safe AGI.
We’re entering a world where AI won’t just be on your phone—it will walk, clean, talk, and make decisions in your living space.
Yes, we get it—it's glamorous to dream of reading a book while your robot does the dishes.
But simple tasks for us can be huge puzzles for AGI.
Take mopping the floor as an example. Sounds easy, right?
But wait... Is it wood?
You can’t use water on that.
What detergent do you use?
Do you sweep first? What cloth do you use?
Oh—and your kid is running around barefoot.
Yep, every single part matters. And AGI needs to figure it all out—before it even starts cleaning.
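To see how quickly those questions pile up, here’s a toy sketch in Python. The floor types, checks, and plan steps are all hypothetical illustrations of the branching an AGI would have to reason through, not real robotics logic.

```python
# A toy, purely hypothetical illustration of the precondition checks
# a household AGI would need to reason through before mopping.

def plan_mopping(floor: str, kid_barefoot: bool) -> list[str]:
    """Build an ordered cleaning plan, refusing steps that would be unsafe."""
    plan = ["sweep the floor first"]
    if floor == "wood":
        # Standing water damages wood, so change the method, not just the tool.
        plan.append("use a barely damp cloth and a wood-safe cleaner")
    else:
        plan.append("mop with water and a general-purpose detergent")
    if kid_barefoot:
        plan.append("block off the wet area: a kid is running around barefoot")
    return plan

print(plan_mopping(floor="wood", kid_barefoot=True))
```

Every one of those branches is trivial for a person and a genuine reasoning problem for a machine that was never trained on your floor.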
This is a major shift we’re all watching carefully.
Because our world will change dramatically when AGI becomes available.
Not just because it’s exciting to hear AI company leaders promise it’s coming soon.
But because the impact will be real.
From our perspective, we want to remind everyone: we’re still arguing about copyright and the AI Act—and that’s with models that don’t even have a body.
So what happens when they do?
Do they get their own police department? (Kidding. Kind of.)
But you get our drift.
Having this kind of regulatory standard and test will be the beginning of defining the bare minimum these models should be qualified against.
Until now, only the vendors were making those calls, internally.
A personal note from us
We’re actually glad we have time.
Time to figure things out.
Time to experiment with LLMs.
Time before AGI knocks on our door—ready to mop the floor.
—The Frozen Light Team
You can read more about it here.