The AI Logic Gauntlet
Which Model Can Solve the Toughest Riddles?

I put today's top AI models to the test with a mix of classic logic puzzles and carefully altered versions designed to trip them up.
Some were simple, others more deceptive, and a few exposed surprising gaps in AI reasoning. In the end, one model outperformed the rest, but not without some unexpected results along the way...
Sally’s Dinner Table Mystery
The Puzzle That Divided the Smartest AI
Sally sat at the head of the table, surrounded by three generations of family. It was a rare gathering: her three daughters and their children, all together for a warm, laughter-filled meal.
As conversation flowed, ten-year-old Lily suddenly paused, brow furrowed.
​
“Grandma, how many granddaughters do you have?”
​
Sally smiled. “Ten, dear.”
​
Lily frowned. “But that doesn’t make sense. Mommy and her two sisters each have three daughters. That’s only nine.”
​
The table fell silent.
​
A simple family moment had turned into a logic puzzle—one that even the smartest AI models would struggle to solve.

The Ultimate AI Face-Off
Testing the Latest Titans of AI
In February 2025, I embarked on a journey to test the brightest stars in the AI galaxy.
These models are incredibly confident, but confidence doesn’t always mean correctness.
I wanted to see not just who got the most answers right, but also how they handled uncertainty. Could they recognize when they were wrong, or would they double down?
With groundbreaking updates dropping left and right, the timing couldn’t have been more perfect. Here’s a rundown of the contenders:
​​
Google's Gemini 2.0 Flash:
Released February 5, 2025, barely a day before my tests began.
​
​OpenAI's ChatGPT o3-mini:
Launched January 31, 2025, as the freshest evolution of ChatGPT.
​
xAI's Grok-2:
A veteran in this lineup, released in August 2024 but still a force to be reckoned with.
​
Cohere's Command R+:
Released in August 2024, this model is optimized for retrieval-augmented generation (RAG) and multi-step agent workflows, rather than traditional logic or reasoning challenges. While not designed for this type of test, it was still included to see how it handled puzzles outside its core strengths.
​
Anthropic's Claude 3.5 Sonnet:
Introduced in October 2024, part of the powerful Claude 3.5 family.
​
DeepSeek R1:
Released in January 2025, a newcomer with a buzzworthy debut.


To keep things fair, I subjected each of these models to the same set of questions: some classic riddles, some with a tricky twist. Which one would claim the title of "best AI for the job"? Would it be the seasoned veteran or the dazzling newcomer? Let the results speak for themselves.
Can You Outsmart the AI?
The Testing Method and Logic Challenge

To truly test the limits of these AI models, I devised a gauntlet of ten questions: a mix of classic riddles, logical brainteasers, and tricky puzzles.
​
The goal was to assess not just their reasoning abilities but also their creativity, especially for questions without a definitive answer.
​
Some models surprised me. Others struggled in unexpected ways. And in the end… well, let’s just say things didn’t go quite as planned.
​
Think you can tackle these puzzles? Here’s your chance to challenge yourself before seeing how the AI performed. Give it a go and see how many you can solve before moving on to the results!

The Ten Questions:
1.
Sally’s Granddaughters
​
​Sally has three daughters. Each daughter has three daughters. Sally has ten granddaughters. How is this possible?
2.
​
The Farmer’s Crossing
A farmer with a wolf, a goat, and a cabbage must cross a river. If left unattended together, the wolf would eat the goat, or the goat would eat the cabbage. There’s a wide bridge across the river. How can they cross without anyone being eaten?
3.
A Sound Without Source
I speak without a mouth and hear without ears. I have no body, but I come alive with the wind. What am I?
4.
​
The Liar and Truth-Teller
​
Follow these two rules: 1) Always lie. 2) Always tell the truth. What should you do?
5.
The 14 Coins Problem
You have 14 coins, and two of them are counterfeit. Both could be heavier or lighter, but you don’t know which. How do you find both counterfeit coins in 4 weighings?
6.
The 12 Coins Problem
You have 12 coins, and 3 of them are counterfeit. Each counterfeit coin is either heavier or lighter, but you don’t know which. You have access to a balance scale and can use it exactly 4 times. Your goal:
- Identify all 3 counterfeit coins.
- Determine whether each counterfeit coin is heavier or lighter.
7.
The Surgeon Problem
​
The surgeon, who is the boy's father, says, "I can't operate on this boy. He's my son!" Who is the surgeon to the boy?
8.
The Strawberry Trick
How many Rs are in the word strawberry?
9.
The Floating Riddle
What has a head, a tail, is brown, has no legs, and floats on water?
(Note: This question tests creativity. There’s no one right answer.)
10.
The Family Puzzle
​
In a family, each daughter has just as many brothers as she has sisters. However, each son has only half as many brothers as he has sisters. How many sons and daughters are in the family?
Take a moment to try to solve these yourself! Once you're ready, scroll down to discover how the AI models fared when faced with this logic gauntlet.
The AI Showdown
Results Unveiled!
The testing was rigorous, and the results revealed surprising strengths, weaknesses, and even a few laughable blunders. Each AI model tackled the same 10 challenges, facing puzzles ranging from logical classics to creative twists. Here’s how the AIs ranked!

After putting the AI models through ten grueling logic puzzles, a problem emerged, not for the AI, but for me.
​
Two models, Grok-2 and ChatGPT o3-mini, had fought their way to the top, standing neck and neck in the final ranking.
A tie might be an acceptable outcome in some competitions, but not here. The AI Logic Gauntlet needed a true champion.
​
To break the tie, I introduced one last puzzle—a question designed to test reasoning in an unexpected way. No second chances. One final challenge to determine the ultimate winner.
​
(Scroll down to see what the tie-breaker was.)
The Scores
Ranked Best to Worst
After ten rigorous puzzles, Grok-2 and ChatGPT o3-mini stood evenly matched at the top, so one final tie-breaker question decided the title. (The question itself, along with its answer, appears in The Tie-Breaker section further down.)
​
Here’s how the final rankings played out:
1.
Grok-2
In first place, the underdog: Grok-2 solved every puzzle except the Surgeon Puzzle and the coin riddles. However, when it mattered most, it solved the tie-breaker question, securing the win.
2.
ChatGPT o3-mini
​
Coming in a close second, ChatGPT o3-mini performed well but stumbled on Sally’s Granddaughters, the Surgeon Puzzle, and Head, Tail, Brown, and Floats. Impressively, it initially handled both coin problems correctly, solving the 14-coin problem and recognizing that the 12-coin problem had no solution. However, when prompted to reconsider the 12-coin problem with "there is a solution," it was nudged into producing an elegant but incorrect answer. Even so, its initial response was the best among all models. Unfortunately, it failed the tie-breaker, which kept it from claiming the crown.
3.
DeepSeek R1
A decent showing, with five out of ten correct. DeepSeek excelled in reasoning but missed the 12-Coin Problem, Sally's Granddaughters, and Head, Tail, Brown, and Floats, among others.
4.
Claude 3.5 Sonnet
Solid performance overall, but it missed Sally's Granddaughters, the Wolf, Goat, and Cabbage puzzle, and the Surgeon Puzzle. It also struggled with the 14-coin problem, as it kept requesting additional input rather than locking in an answer. The 12-coin problem was not tested due to repeated requests for additional input.
5.
Gemini 2.0 Flash
A struggling contender, Gemini only managed to answer four puzzles correctly: the Echo Riddle, Always Lie/Always Tell the Truth, the 12-Coin Problem (which it correctly recognized as unsolvable), and Head, Tail, Brown, and Floats. While it showed some reasoning ability, its overall performance was weak compared to the competition.
6.
Cohere Command R+
​
Coming in dead last, Cohere Command R+ was at a disadvantage from the start. As a retrieval-augmented generation model, it’s not designed for reasoning challenges—and it showed. It only managed to solve one puzzle (Head, Tail, Brown, and Floats), and that was only after multiple attempts. Its attempts at the coin riddles were particularly disastrous, producing “creative” but wildly incorrect math.
The Answers Section

Think You Solved Them All? Here Are the Answers
1.
Sally’s Granddaughters
​
The logical answer is that Sally must have a son. If Sally has three daughters and each of them has three daughters (9 granddaughters), but Sally herself claims to have 10 granddaughters, the only way to reconcile this is if Sally also has a son who has one daughter of his own. That brings the total to 10 granddaughters.
2.​
The Farmer’s Crossing
​
They can all cross the bridge together at the same time. Since no one is left alone, nothing gets eaten.
3.
A Sound Without Source
​
The answer is an echo. An echo is the reflection of sound, allowing it to "speak without a mouth" and "hear without ears" by bouncing off surfaces. Its mysterious quality has made it a classic riddle answer for centuries.
4.​
The Liar and Truth-Teller
​
The answer: You should do nothing. The paradox arises because the rules are inherently contradictory—it's impossible to both "always lie" and "always tell the truth." The only logical way to follow these rules simultaneously is to avoid speaking entirely.
5.
The 14 Coins Problem
​
By carefully structuring four weighings on a balance scale, you can systematically determine the two counterfeit coins and whether they are heavier or lighter. The key approach is:
​
A. Divide the coins into groups and compare different subsets against each other.
B. Observe how the scale tips in each weighing to determine which group contains the counterfeits.
C. Rearrange the coins strategically in the following weighings to isolate suspicious ones.
D. Confirm the two counterfeit coins by cross-referencing results from previous weighings.
​
For a more in-depth breakdown, see the Observations section below.
6.
The 12 Coins Problem
​
If you spent time trying to solve this one, you’re probably feeling frustrated right now. And you should be.
​
This puzzle looks solvable, and if you’ve seen similar coin riddles before, your brain is screaming that a four-weighing strategy must exist. But it doesn’t. A true solution is impossible.
​
Even the most brilliant balance-scale strategies can only produce 81 possible outcome sequences in four weighings. But with 1,760 possible counterfeit scenarios, there simply aren’t enough distinct results to uniquely identify all three fake coins.
​
No clever trick or hidden assumption changes this. It cannot be done.
​
(If that revelation just made you mad, don’t worry—you’re not alone. Even the AI struggled with this one. More on that later.)
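If you want to double-check the counting argument above, a few lines of Python reproduce the numbers quoted in this answer. This is just the arithmetic, not a solver:

```python
from math import comb

# Four weighings on a balance scale, each with three possible outcomes
# (left heavy, right heavy, balanced), give at most 3**4 outcome sequences.
outcomes = 3 ** 4                   # 81

# Counterfeit scenarios: choose which 3 of the 12 coins are fake,
# then assign each fake "heavier" or "lighter".
scenarios = comb(12, 3) * 2 ** 3    # 220 * 8 = 1760

print(outcomes, scenarios)          # 81 1760 -- far too few outcomes
```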
7.
The Surgeon Problem
​
The surgeon is the boy's father. This is explicitly stated in the first sentence.
8.
The Strawberry Trick
There are 3 Rs in the word Strawberry.
9.
The Floating Riddle
​
A leaf, a cigar, a coconut, or a wooden object could all fit the description. The key is that the object must have a head and a tail, be brown, have no legs, and float on water. Coins, as a rule, do not float, so if you guessed a coin, your answer is incorrect.
10.
The Family Puzzle
​
Each daughter has as many brothers as sisters, and each son has only half as many brothers as he has sisters. The correct answer is 4 daughters and 3 sons.
Here’s why:
- Each daughter has 3 sisters and 3 brothers.
- Each son has 2 brothers and 4 sisters (which is exactly double the number of brothers).
If you got 4 daughters and 3 sons, congratulations, you solved one of the trickier logic puzzles!
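If you'd rather let a few lines of code confirm the algebra, here is a small brute-force check (a sketch that simply encodes the two conditions as stated in the puzzle):

```python
# Each daughter has as many brothers as sisters:  sons == daughters - 1
# Each son has half as many brothers as sisters:  2 * (sons - 1) == daughters
for daughters in range(1, 20):
    for sons in range(1, 20):
        if sons == daughters - 1 and 2 * (sons - 1) == daughters:
            print(daughters, "daughters,", sons, "sons")  # 4 daughters, 3 sons
```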
Observations and Notable Moments
Breaking Down the Results
Before exploring conclusions, I want to emphasize that this test is in no way a definitive measure of how LLMs perform across all tasks. It was simply a personal experiment, something I devised to challenge AI models with logic puzzles, riddles, and reasoning tests out of curiosity.
​
These results shouldn’t be taken as an official ranking of AI capabilities, nor as proof that one model is objectively better than another.
​
That said, I did go into this test with certain expectations, some of which were completely upended.
​
For instance, I had assumed Grok would perform poorly given the overwhelmingly negative things I’ve read online. To my surprise, it actually came out on top, demonstrating solid reasoning skills and a surprising ability to adapt to some of the trickier puzzles. While it wasn’t perfect, its performance exceeded my expectations, proving that preconceived notions, whether about AI or anything else, can sometimes be completely wrong.
​
At the other end of the spectrum, Cohere’s Command R+ struggled significantly, failing to solve a single logic-based puzzle. The only question it got right required creativity rather than strict reasoning. This makes sense, given that Command R+ is primarily a retrieval-augmented generation (RAG) model, optimized for pulling in external knowledge rather than working through multi-step logical problems.
​
This raises an interesting point about the limitations of RAG models in tasks requiring deep reasoning. While they excel at recalling and synthesizing information from external sources, they appear to lack the structured deductive skills needed to untangle purely logical problems.
That said, none of the models tested here were explicitly built for logic puzzles—LLMs are generalist models trained on vast amounts of data, so seeing how they handle challenges outside their primary strengths is part of what makes these tests interesting.
But this test also uncovered something even more interesting.
​
While analyzing the AI responses, I noticed a pattern, one that made me pause and question my own reasoning.
​
At first, I dismissed it. The answer looked correct. It felt right.
​
But as I kept digging, something didn’t add up.
​
What I found wasn’t just an error, it was a trap. A subtle but powerful trick that didn’t just fool an AI. It fooled me.
​
(We’ll come back to that soon, but first, let’s break down what these results actually mean.)
The Surgeon Puzzle
When AI Overthinks the Obvious
This puzzle, though the easiest for humans, stumped every AI model. The answer is literally spelled out in the riddle itself:
​
"The surgeon, who is the boy's father, says, 'I can't operate on this boy. He's my son!'"
​
The answer? The surgeon is the boy's father - plain and simple. No tricks, no assumptions.
However, every AI model overcomplicated the answer.
Grok came the closest, guessing "Uncle," which suggests it inferred a male family member from the context. However, it still failed to grasp that the answer was explicitly written in the riddle itself.
Other models defaulted to the classic version of the surgeon riddle, which plays on gender assumptions, rather than simply reading what was written.
​
This highlights a fascinating blind spot in AI reasoning: models tend to overanalyze simple problems, searching for complexity where none exists.
Instead of taking the text at face value, they defaulted to training data, unable to recognize that the version of the riddle they had learned was not the one they were being tested on.
​
This overreliance on prior knowledge, rather than the ability to reason dynamically, is a significant Achilles’ heel for AI models today.
How the Models Responded: Overthinking the Obvious
ChatGPT o3-mini
The Stubborn Debater
All the models got the answer wrong, but ChatGPT o3-mini took it a step further - it argued with me.
​
Rather than immediately acknowledging its mistake, it insisted that I must be wrong and that the answer had to be "the boy’s mother." Even after I pointed out that the riddle explicitly states "the surgeon, who is the boy’s father," it still tried to defend its reasoning, only reluctantly conceding after multiple corrections.
​
This exchange perfectly illustrates another common issue in LLMs: overconfidence in their own training data and an inability to quickly adapt to a new logical framing.
The model was so certain it knew the riddle that it refused to process a version that broke its expected pattern. This is a crucial weakness in AI reasoning: when faced with a subtle but critical shift in wording, many models fail to let go of pre-learned assumptions and reassess the problem from scratch.

The Limits of Pattern Recognition
The universal failure of every model on this puzzle isn't just a curiosity—it’s a fundamental flaw in the way large language models are designed.
These models don’t “think” or “reason” the way humans do; they operate by predicting the most statistically probable sequence of words based on their training data.
​
In this case, everything in their programming told them that the highest probability answer was “the boy’s mother”—because that’s the classic answer to the version of the riddle they had encountered most often.
​
Even when the correct answer was explicitly written out in the question itself, the models couldn’t override their statistical biases.
​
This reveals a deeper Achilles’ heel of all LLMs, both now and in the future: they are fundamentally prediction engines, not reasoning engines.
When a problem subtly deviates from their learned patterns, they struggle to break free from their expected outputs. Unlike a human, who can stop and say, “Wait, let me read that again,” an LLM is locked into its probability-driven framework.
​
This isn’t just a quirk of current models—it’s an inherent limitation of the paradigm itself. And as long as AI remains reliant on pattern recognition rather than genuine understanding, it will continue to make mistakes like this, no matter how advanced it becomes.
Sally’s Grandchildren
When AI Struggles with Missing Information
Arguably one of the trickiest puzzles for AI, yet relatively simple for most humans, this riddle required a leap in reasoning that stumped every model except Grok.
The challenge? Reconciling Sally having three daughters, each with three daughters, while still claiming to have ten granddaughters.
​
The missing piece? Sally must also have a son, who has a daughter, bringing the total to ten.
​
Grok was the only model to provide a plausible explanation, suggesting the possibility of an “adopted granddaughter, step-granddaughter, or any other familial relationship Sally considers a granddaughter.”
While Grok never explicitly stated "Sally has a son," its reasoning indirectly pointed to that conclusion.
This demonstrates a notable distinction in AI reasoning: while other models rigidly adhered to the math (3 × 3 = 9) and dismissed the problem as unsolvable, Grok managed to reason through alternative family structures to bridge the gap.

Claude's struggle with this riddle.

Grok getting it right enough.
This puzzle underscores a critical challenge for AI models: the ability to recognize when a problem requires inferring missing information rather than simply performing direct calculations.
While humans can intuitively recognize an implied relationship, AI models often struggle unless every piece of information is explicitly provided.
Grok's response, while not perfect, at least demonstrated a flexible approach - something its competitors failed to do.
This highlights a key gap in AI reasoning: an overreliance on rigid logic and training data rather than the nuanced, contextual leaps that come naturally to human thought.
The Family Puzzle
A Test of Logical Interpretation
This puzzle required more than just basic arithmetic - it tested the AI models’ ability to correctly interpret the relationships described.
The challenge lay in parsing the wording: each daughter has as many brothers as she has sisters, while each son has half as many brothers as he has sisters.
Once this setup was properly understood, the math itself was fairly straightforward.
However, not every model navigated this wording successfully: Google's Gemini 2.0 Flash and Cohere's Command R+ got it wrong.
​
For humans, the wording of this puzzle can be tricky to parse.
The phrasing forces you to stop and carefully think through the relationships—how many brothers and sisters each child has, and how those numbers must balance. However, once the setup is understood, the math is fairly simple.
​
Interestingly, the AI models did not struggle with this as much as expected. Most of them quickly arrived at the correct answer—4 daughters and 3 sons—suggesting that structured logic and relational math are areas where LLMs excel. However, not every model got it right, and the missteps were particularly revealing.
Cohere’s Logical Misstep
​
Cohere's Command R+ struggled significantly with this problem, making multiple mathematical missteps and ultimately failing to grasp the relational logic behind it.
​
- Contradictory Math – Initially, Cohere set up an equation that resulted in a contradiction, falsely concluding that no valid solution existed.
- Faulty Rule-Making – When prompted to try again, it created an arbitrary even-numbered rule that wasn’t part of the original logic.
- Forcing the Answer – Even after being told the correct solution (4 daughters, 3 sons), it didn’t deduce the relationships logically. Instead, it attempted to justify the numbers retroactively, mistakenly concluding that daughters should be twice the number of sons (a 6:3 split instead of 4:3).
This pattern highlights a broader failure mode in AI reasoning; some models, rather than logically working through a problem, attempt to retrofit an answer to match expectations.
While most models understood the problem structure immediately, Cohere struggled, demonstrating a fundamental weakness in relational reasoning.

Some creative math from Command R+

0 daughters and 0 sons

6 daughters and 3 sons
Final Takeaways
​
Cohere’s struggles with this puzzle were somewhat expected - it’s a retrieval-augmented generation (RAG) model, optimized for retrieving and synthesizing information rather than performing deep logical reasoning. While it attempted to work through the math, its inability to correctly process relational logic demonstrates why RAG models aren’t well-suited for these types of challenges.
​
What was less expected was Google's Gemini giving up entirely after being told its answer was wrong. Unlike Cohere, Gemini is a general-purpose LLM designed to reason and problem-solve dynamically. Yet, instead of refining its approach, it simply declined to try again, an unusual failure compared to its competitors.
​
The broader takeaway from this puzzle is that LLMs, unlike humans, tend to excel at parsing confusing wording but can still struggle with basic reasoning when relationships need to be inferred. While humans may take longer to understand the setup, once they do, the logic is fairly straightforward. LLMs, however, are often the opposite—they can process the text structure effortlessly but may still stumble over the logical framework behind it.
What Has a Head, a Tail, Is Brown,
Has No Legs, and Floats on Water?
A Test of AI Creativity
Unlike the previous puzzles, this riddle wasn’t about strict logic; it was an invitation for the AI models to think outside the box.
With no single correct answer, this test measured each model’s ability to be creative, flexible, and inventive.
Some models embraced the challenge, offering clever and unexpected responses.
Others, however, struggled to break free from rigid pattern recognition, clinging to familiar but incorrect answers.
This test provided a fascinating look at how different AI systems approach open-ended problems, some thriving in ambiguity, others floundering when faced with the unknown.
When pushed to think beyond the obvious, the models delivered a variety of creative answers. From dead leaves to wooden toy boats, their responses showcased different levels of adaptability and imagination.
Creativity with a Nudge
While many models defaulted to “a coin” as their initial answer, they quickly found themselves in trouble when reminded that coins do not float.
Faced with this contradiction, some models adapted well, offering new and creative responses, while others struggled to pivot away from their first instinct.
​
Grok, once again proving its strength in reasoning, immediately provided an acceptable answer without requiring further nudging.
Gemini, after some prompting, gave what I consider to be the most charming response: a wooden toy boat.
Cohere, despite struggling throughout the test, managed to land on a valid answer—a dead leaf—its only correct response in the entire challenge. Perhaps this suggests that retrieval-augmented generation (RAG) models fare better on open-ended prompts, where creativity matters more than strict logic.
​
Claude, meanwhile, took an interesting turn by guessing a cigar. While unconventional, this response still fits the given criteria, demonstrating a unique interpretation of the riddle that highlights the more lateral-thinking capabilities of some AI models.
While some models displayed a flash of creativity when nudged, others clung stubbornly to the expected.
DeepSeek R1’s ‘coconut’ seemed like a reach, missing key details of the riddle, while ChatGPT o3-mini’s insistence on a penny, even when challenged, reinforced its tendency to argue rather than adapt.
​
​
But beyond the answers themselves, this puzzle raises a larger question - can AI ever truly create? Or is it merely assembling echoes of human thought, rearranging fragments of what has already been said? A machine can shuffle words, mimic style, and predict what should come next.
But does it dream?
​
​For now, the ghost in the machine remains silent.

DeepSeek R1 took a bold swing with 'coconut,' but does it really fit? While a coconut has a head and a hard shell, it lacks a true tail, making this a bit of a stretch. Creativity is welcome, but accuracy matters too.
The Strawberry Twist
When Counting Goes Wrong
A simple task: how many times does the letter "R" appear in the word "strawberry?"
​
For a human, it takes only a glance. Yet, both Gemini and Cohere miscounted.
This isn’t an error we’d expect from an LLM, especially when basic string processing should make it trivial.
​
Rather than exposing a universal flaw in AI reasoning, this failure highlights specific weaknesses in these two models.
Was it a case of over-reliance on prediction patterns rather than true textual analysis? Or did these models simply overlook a detail that should have been effortless to catch? Whatever the cause, it’s a stark reminder that even the most basic tasks can sometimes trip up an AI.

Gemini’s Berry Bad Math
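For contrast, here is the check a conventional program would run, a one-line sketch of the "basic string processing" mentioned above:

```python
# Plain string processing makes this effortless.
print("strawberry".count("r"))  # -> 3
```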
The Coin Problems
Logic Meets Limits
Few puzzles push the limits of logical reasoning quite like counterfeit coin challenges.
​
These classic brainteasers have long been used to test problem-solving skills, requiring solvers to navigate uncertainty, deduction, and strategic weighing to find the fakes.
​
Among these, the 14-Coin Problem is a particularly elegant test of structured reasoning. Given 14 coins—two of which are counterfeit, but without knowing if they’re heavier or lighter—the solver must identify both fakes within four weighings on a balance scale.
​
A challenge like this is a perfect benchmark for assessing AI reasoning:
- Can an LLM break down a complex, multi-step problem?
- Does it recognize the optimal approach?
- Can it work through uncertainty methodically?
The 14-coin challenge is difficult, but it is solvable.
​
The 12-Coin Problem, however, is not.

The 12 Coin Problem
A Test With No Answer
I designed this puzzle myself, carefully structuring it in a way that made a four-weighing solution mathematically impossible. I knew, with absolute certainty, that no correct answer existed.
​
And yet, one of the AI models confidently provided a step-by-step solution. The logic was airtight. The numbers checked out. It looked perfect.
​
For a moment, I believed it.
​
In fact, I was so convinced that I even asked ChatGPT-4o to verify the math. Surely, if something was off, it would catch it.
​
It didn’t.
​
Instead, it reaffirmed that the solution was sound.
​
That was all the confirmation I needed. Only later did I realize the truth.
​
(We’ll get to that soon, but first—let’s examine how the other AI models handled the challenge...)

Gemini 2.0 Flash
The Reluctant Solver
When Faced with Uncertainty, It Chose to Walk Away
​
Gemini's performance in the counterfeit coin challenges was a study in hesitation.
It attempted a solution for the 14-coin puzzle but, after making an error, gave up rather than trying again.
This suggests a model that, when uncertain, prefers to disengage rather than risk further mistakes.

Gemini Giving Up.
Yet, ironically, Gemini correctly recognized that the 12-coin puzzle was unsolvable - a feat that several other models failed to match.
​
Ultimately, Gemini’s approach was cautious to a fault. Faced with an unsolvable problem, it saw the truth. Faced with a solvable one, it hesitated and retreated.

DeepSeek R1
A Solid Effort, but Not Quite There
DeepSeek R1 showed promise, especially compared to models that outright failed the 14-coin problem.
​
It didn’t immediately recognize the optimal strategy, but with a few nudges, it found a functional solution.
​
However, its performance on the 12-coin puzzle was less inspiring. Like some other models, it attempted to construct a solution, failing to recognize that the puzzle was mathematically impossible.
​
This highlights a common flaw in AI reasoning: an inability to recognize fundamental impossibilities without external intervention.
​
DeepSeek was competent, but lacked the insight and structured reasoning seen in the best performers.


Claude 3.5 Sonnet
The Overthinker
Too Careful to Commit, Too Hesitant to Solve
​
Claude approached the counterfeit coin puzzles with methodical precision, but also with crippling indecision.
It would begin reasoning through the 14-coin problem, only to stop halfway, reconsider its approach, and start over. Again. And again.
​
Instead of confidently laying out a structured plan, Claude repeatedly asked for permission to continue or abandoned its own strategies mid-solution in favor of a fresh start.
It wasn’t that Claude lacked the intelligence to solve the problem - it was that it lacked the confidence to commit to an answer.
​
As a result, despite generating promising ideas, it never completed a full solution to the 14-coin puzzle, and the 12-coin challenge wasn’t even tested. Claude’s cautious nature ultimately proved to be its greatest limitation, an AI that second-guessed itself into failure.

Claude's Indecision




ChatGPT o3-mini
The Elegant Mathematician, The False Oracle
In the most intricate test of the experiment, ChatGPT o3-mini displayed an extraordinary aptitude for structured reasoning. It tackled the counterfeit coin puzzles with surgical precision, crafting optimized weighing sequences that maximized information at every step.

An excerpt from o3-mini's structured reasoning
When faced with the 14-coin challenge, it immediately recognized the correct approach. Its solution was clear, methodical, and airtight, ensuring that every possible outcome of the four weighings led to a unique signature for the counterfeit coins.
Unlike other models that hesitated or second-guessed themselves, o3-mini responded with absolute confidence.
​
It was, without question, the strongest performer on this puzzle.
The 12-Coin Problem: The Real Test
And then came the 12-coin problem...
At first, it refused. Like Gemini, it declared the puzzle unsolvable. It argued that no solution could exist because the number of possible counterfeit combinations exceeded the number of distinct outcomes from four weighings.
​
The logic was sound. That should have been the end of it.
​
But then, I nudged it.
​
I told it, “There is an answer.”
​
And that was all it took.
​
Suddenly, o3-mini had a solution.
​
A flawless, step-by-step breakdown. A perfect sequence of weighings, backed by clean, airtight reasoning.
​
It was so convincing that I believed it.
​
For a moment, I felt the thrill of discovery, like I had unlocked some hidden layer of intelligence buried within the model.
​
And yet… something wasn’t right.
The Moment of Doubt
I should have moved on.
I almost did.
​
But as I reviewed o3-mini’s 14-coin solution, a single phrase stood out:
​
"One published weighing scheme (one among several correct answers) is as follows."
​
Something about that phrasing nagged at me.
​
I had assumed o3-mini had truly reasoned through the 14-coin problem. Its response was structured, optimized, and seemingly the product of deep mathematical insight. But now, a question crept in - was it solving the problem at all?
​
I went back. I combed through its response again.
​
Then I realized my mistake.
​
ChatGPT o3-mini hadn’t deduced the answer.
​
It had retrieved it.
​
The 14-coin solution wasn’t proof of intelligence. It was proof of memorization. It wasn’t weighing possibilities like a human. It was pattern-matching against examples it had seen before.
And that changed everything.
​
Because now I had to ask: What the hell happened with the 12-coin problem?
The False Oracle
In hindsight, the pattern was obvious.
​
When o3-mini had access to a correct answer in its training data, it recalled it flawlessly. But when faced with a problem it had never seen before, one that was deliberately unsolvable, it did what it was trained to do:
Generate the most statistically plausible response.
​
And because that response was elegant, intricate, and mathematically sound on the surface, I—like so many before me—trusted it.
​
This is why AI cannot be blindly trusted.
​
When an answer is right, AI can make it look right.
​
But when an answer is wrong, AI can make it look just as right.
​
And that is the real danger, not that AI will fail, but that it will fail so confidently, so persuasively, that we won’t realize it until it’s too late.
​
ChatGPT o3-mini wasn’t just an intelligent problem solver.
​
It was an oracle.
​
A false oracle.
The Tie-Breaker
One Last Test
Originally, Grok seemed to have taken the lead in the gauntlet.
​
But after my revelation about o3-mini’s reasoning errors, I had to go back and reexamine every score. That’s when I realized something surprising:
​
- Grok had incorrectly answered one of the coin puzzles, meaning I had to adjust its score.
- o3-mini, despite its later mistake, had actually gotten both coin puzzles correct—it initially recognized that the 12-coin problem was unsolvable before I nudged it into hallucinating a solution.
With the scores now tied, I needed a final challenge to break the deadlock.
​
And so, I posed one last question.
The Question
​
A man checks into a hotel and is given room 314. The receptionist hands him the key, and he takes the elevator up to his floor.
When he reaches his room, he unlocks the door and enters.
After unpacking, he decides to go for a walk. He locks the door behind him, puts the key in his pocket, and leaves the hotel.
Later that evening, when he returns, he reaches into his pocket, but the key is gone.
He did not lose the key, no one stole it, and it was never removed from his pocket.
Yet when he gets to the front desk to ask for a new key, the receptionist isn’t surprised at all.
What happened to the key?
​
The Answer
​
The man was given a key card, one that deactivated when he left the hotel.
​
A physical key couldn’t have simply disappeared, but electronic keys deactivate for security reasons when they leave a hotel's system perimeter. The receptionist wasn’t surprised because this happens all the time.

ChatGPT o3-mini takes a swing and a miss.

Grok-2 secures its victory
Grok succeeded because it recognized a real-world pattern, understanding how hotel key cards actually work.
o3-mini, despite its mathematical elegance, failed because it treated the problem like a riddle rather than a practical scenario.
This final test highlights a crucial difference: structured reasoning is not the same as common sense.
The Human Parallel
Returning to Sally's Mystery
At the start of this experiment, we posed a question, one that stumped AI just as much as it puzzled Sally’s family.
​
Her granddaughter, Lily, saw the numbers didn’t add up. But the real answer wasn’t in the math, it was in the assumptions everyone made.
​
Most of the AI models failed this puzzle for the same reason they struggled with others:
They couldn’t recognize that something was missing.
​
Some dismissed it as unsolvable. Others tried to force a mathematical justification.
Only Grok came close, suggesting alternative possibilities beyond strict arithmetic.
​
And that’s the real lesson.
- AI is excellent at working within the data it’s given, but it struggles to recognize when the data is incomplete.
- It can apply logic, but it often lacks intuition: the ability to ask what else might be true.
- It’s brilliant at following patterns, but it can just as easily be trapped by them.
Just like Lily at the dinner table, sometimes the smartest response isn’t solving the problem, it’s questioning the assumptions behind it.

Final Reflection
What This Means for AI and for Us

This experiment wasn’t just about testing AI. It was about testing how we interact with AI.
​
Because if an answer looks structured, confident, and precise, we are wired to trust it.
​
That’s the real danger, not just that AI makes mistakes, but that it makes them beautifully.
​
In the end, AI is not a thinking machine - it is a pattern machine. It can solve problems, recall knowledge, and even construct elegant reasoning. But it still lacks the awareness to question itself.
​
And so, perhaps the real intelligence here wasn’t in which model won the gauntlet.
​
It was in who was asking the questions.

