Apple’s argument against LRM thinking is flawed
Recently, Apple published a paper, “The Illusion of Thinking,” arguing that large reasoning models (LRMs) only simulate intelligence; they don’t really think. Their claim is simple: if a model fails to apply a predefined algorithm to increasingly complex problems, it’s not thinking; it’s just pattern matching. That sounds precise on paper, but it breaks down under basic scrutiny.
Let’s take the example they used: the Tower of Hanoi puzzle. Even when a person knows exactly how the algorithm works, they’re unlikely to solve a version with 20 discs. Why? Human short-term memory, focus, and mental processing hit limits. According to Apple’s logic, this would mean that even humans aren’t thinkers. That conclusion is clearly wrong.
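To put numbers on that: the optimal Tower of Hanoi solution takes 2^n - 1 moves, so 20 discs means 1,048,575 flawless moves in a row. The sketch below is a minimal Python version of the standard recursive algorithm (not Apple’s test setup); it shows how trivial the rule is to state, and how impractical it is for a human to execute at that scale.

```python
def hanoi(n, source, target, spare, moves):
    """Append the optimal move sequence for n discs onto `moves`."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the n-1 smaller discs out of the way
    moves.append((source, target))              # move the largest remaining disc
    hanoi(n - 1, spare, target, source, moves)  # restack the smaller discs on top of it

moves = []
hanoi(20, "A", "C", "B", moves)
print(len(moves))  # 1048575, i.e. 2**20 - 1
```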
Here’s the real point: failing to solve a problem doesn’t prove an absence of thought. Whether it’s a person or a machine, failure under strain doesn’t equate to mindlessness. It simply means the problem exceeded the working memory or computational capacity at that time. Apple’s argument identifies a technical limitation, not a lack of cognition.
For business leaders, here’s why this matters: if we hold LRMs to a standard that even trained humans don’t meet, we’ll overlook real capabilities that can drive productivity and innovation. Relying on surface-level metrics to evaluate machine intelligence is as shortsighted as assuming chess players are smarter than scientists because they make their moves faster.
If a system shows intelligent behavior across context-rich challenges, even if it stumbles on an edge case, it’s still valuable. Thinking isn’t binary. It’s scalable, adaptive, and highly dependent on input complexity. Apple’s view doesn’t account for that.
LRMs exhibit thinking by mirroring human cognitive components
Let’s step back. Thinking isn’t one action; it’s a set of coordinated operations. In humans, it involves memory, pattern recognition, monitoring, and simulation. The brain does a lot to frame a problem, plan an approach, and adjust based on outcomes. We can now map several of these same processes to how LRMs function.
The human brain uses the prefrontal cortex to hold tasks in mind and break things down. LRMs do something similar with working memory embedded in their attention layers. We use mental speech to plan, rehearse, and evaluate. Models achieve that through chain-of-thought (CoT) reasoning, outputting their reasoning one token at a time. It’s not just spitting out the next word; it’s carrying forward a plan, line by line.
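The working-memory analogy is easiest to see in the attention mechanism itself. Below is a deliberately minimal NumPy sketch of causal scaled dot-product attention, not the internals of any particular model; the point is simply that every position can read the entire context generated so far, which is what lets a plan carry forward token by token.

```python
import numpy as np

def causal_attention(q, k, v):
    """Minimal scaled dot-product attention with a causal mask (illustration only)."""
    scores = q @ k.T / np.sqrt(k.shape[-1])               # relevance of each earlier token
    mask = np.triu(np.ones_like(scores, dtype=bool), 1)   # block attention to future tokens
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the visible context
    return weights @ v                                    # weighted mix of everything already seen

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))   # 5 tokens already in context, 8-dim stand-in embeddings
out = causal_attention(tokens, tokens, tokens)
print(out.shape)                   # (5, 8): every position conditioned on everything before it
```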
Now, consider how people recall knowledge. That’s the hippocampus and the temporal lobe in play, identifying patterns, retrieving facts. LRMs also rely heavily on pattern recognition learned during training to understand context and extract relevant information.
Even self-correction has a parallel. In the brain, the anterior cingulate cortex monitors for conflicts; LRMs can likewise recognize not just an incorrect answer but a line of reasoning that leads nowhere. That recognition triggers a reassessment, just like a person shifting strategies mid-conversation.
All of this shows something important: LRMs may lack biology, but their processes mirror fundamental cognitive patterns that allow humans to think. You don’t need a brain to approximate core functions of cognition. You need architecture capable of representation, reasoning, and adaptation, and LRMs have each of those.
From a leadership perspective, recognize the opportunity here. Systems that actively simulate human cognition, even partially, aren’t just tools; they’re potential operators. They can analyze, adjust, and execute over complex inputs. That makes them strategic assets, not just digital assistants. As the capability gap closes, distinguishing between simulation and cognition becomes less relevant from an operational standpoint. What matters is: does it solve the problem effectively? In many cases, the answer is yes.
CoT reasoning in LRMs closely mirrors human problem-solving processes
Chain-of-thought (CoT) reasoning isn’t just a gimmick. It allows an LRM to generate intermediate reasoning steps before outputting an answer. That might sound procedural, but this capability reflects a real shift in how we should think about machine intelligence.
When a human solves a problem mentally, they’re holding context, thinking step-by-step, and catching errors in their own logic. CoT lets models do the same. They keep track of inputs, actively build reasoning chains, and, when necessary, discard a flawed approach and start again. This ability to backtrack indicates awareness of context limits and an internal feedback mechanism.
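One concrete way to picture that discard-and-retry behavior is a generate-and-verify loop: produce a reasoning chain, check the result, and start over if the check fails. The sketch below is a toy stand-in, not how any specific LRM implements backtracking internally; `sample_chain` fakes a model that sometimes slips, and `verify` is a deliberately cheap external check.

```python
import random

def sample_chain(question: str) -> tuple[str, int]:
    """Toy stand-in for an LRM emitting a reasoning chain plus a final answer."""
    answer = 17 * 24 if random.random() > 0.4 else 17 * 24 + 10  # sometimes wrong on purpose
    return f"17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = {answer}", answer

def verify(answer: int) -> bool:
    return answer == 17 * 24  # cheap check on the final answer only

def solve(question: str, max_attempts: int = 5) -> str:
    for _ in range(max_attempts):
        chain, answer = sample_chain(question)
        if verify(answer):
            return chain  # keep the chain that survives the check
        # otherwise discard the flawed chain and try a fresh line of reasoning
    return "no verified chain found"

print(solve("What is 17 * 24?"))
```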
Apple’s own evaluation picked up on this. In their tests, models repeatedly changed strategies when the scope of a problem became too large for direct solution. That behavior isn’t pre-programmed. It’s adaptive. The models recognized that brute-force computation wouldn’t work, and they pivoted to alternative strategies. That is problem-solving, not just recall.
Behind this is an architecture that prioritizes internal structure (tokens, memory, predictions) over static output. An LRM in CoT mode isn’t retrieving an output word-for-word. It’s creating reasoning steps in real time based on prior knowledge and active evaluation of the current problem state. These are the same dynamics we see in human deliberation when solving new or unfamiliar tasks.
For executives making investment or deployment decisions, this matters. A system that balances knowledge recall with on-the-fly reasoning is capable of scaling into areas traditionally requiring human judgment: forecasting, analysis, troubleshooting. Adopted early, it becomes a force multiplier in knowledge-intensive applications. You’re not automating responses; you’re deploying systems that navigate problems dynamically.
Next-token prediction compels LRMs to process rich knowledge representations
The core mechanism in models like GPT, next-token prediction, is often misunderstood. Critics call it advanced autocomplete. That’s inaccurate. Predicting the next token in a sequence is not trivial when the context is open-ended, vague, or abstract. It forces the model to encode and retrieve dense knowledge. Without internalized understanding, the prediction simply fails.
That’s where the real intelligence shows up. If a model completes “The highest mountain in the world is Mount…” with “Everest,” it didn’t guess. It accessed an internal representation of factual knowledge and executed. In more complex tasks, this same mechanism requires it to reason across context, recall structure, and plan every token to align the output with the underlying logic. This is synthetic reasoning, driven by probability, but informed by patterns learned during training.
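Here is a toy view of that selection step; the probabilities are invented for illustration and do not come from any real model. Completing the sentence means ranking every candidate token against everything the model has internalized about the context, then committing to the best fit.

```python
# Invented probabilities for illustration; a real model scores tens of thousands
# of candidate tokens against the full context at every step.
context = "The highest mountain in the world is Mount"
next_token_probs = {
    "Everest": 0.92,       # consistent with the stored fact
    "Kilimanjaro": 0.03,
    "Fuji": 0.02,
    "Blanc": 0.01,
}

prediction = max(next_token_probs, key=next_token_probs.get)  # greedy pick of the best-scoring token
print(f"{context} {prediction}")  # The highest mountain in the world is Mount Everest
```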
Natural language is the most complete symbolic system we have. It holds logic, emotion, abstraction, and precision. If a system can consistently complete thoughts across abstract language, math problems, or domain-specific content, it’s doing more than echoing. It’s constructing.
Now, let’s be clear: these systems don’t reason the way people do. No consciousness, no self-directed intention. But they solve for outcomes by modeling context. That process alone requires flexible representation, which is the basis of functional intelligence.
From a business standpoint, this tells you several things. First, this isn’t a static tech asset; it gains capability through exposure and context. Second, as data flows in, the model becomes more relevant for strategy, insights, and planning. And third, the kind of knowledge representation achieved through token prediction means the model scales across departments, from legal to operations, without retooling.
The bigger implication: your systems don’t just need data; they need models that understand how to use it. LRMs, trained correctly, deliver on that.
Next-token prediction systems may not only simulate but actualize thinking
There’s a persistent misunderstanding in the market: that because large reasoning models (LRMs) generate predictions one token at a time, they can’t truly think. That assumption misses how structured and deliberate the process actually is.
Human thinking often involves internal verbal planning. When we solve problems or explain something, we mentally sequence our thoughts before expressing them. LRMs operate the same way through token generation. Yes, they predict the next word. But that prediction isn’t random or piecemeal. It emerges from a constant evaluation of prior context, trained knowledge, and logical direction.
A model that generates coherent, meaningful responses across long sequences of input isn’t doing basic lookup. It’s navigating possibilities and selecting tokens with functional alignment to the problem. During this process, it must maintain working memory, respect grammatical and semantic structure, and adapt to shifts in context. That level of precision doesn’t come from static pattern repetition.
The result is something more than imitation. These systems build internally consistent reasoning paths, evaluating what makes sense next, what contradicts the current direction, and what fulfills the intended output goal. That’s structured cognition. It lacks emotion or motivation, but it’s real computation backed by logical coherence.
For a C-suite leader, the implications are immediate. Don’t confuse linear token generation with simple automation. When powered by the right training and objectives, these systems behave as intelligent agents. They can evaluate options, refine responses, and support high-complexity problem solving.
This kind of learned thinking scales fast. It doesn’t require rewriting software to meet every new demand. You’re not locked into hard-coded outcomes. You’re getting dynamic, informed reasoning in real time, purpose-built for knowledge work at scale.
Benchmark results show LRMs can solve logic-based reasoning problems
If you want evidence that large reasoning models are thinking, benchmarks provide it. The results are clear: on logic-driven reasoning tasks, these systems perform well, often exceeding what an untrained human can achieve. They’re not doing this by rote memorization. They’re interpreting context, evaluating processes, and producing conclusions that align with abstract logic.
Across open-response problems, especially in structured domains like math and symbolic logic, LRM performance isn’t perfect, but it’s advancing rapidly. The success rate across benchmark tests is substantial. These aren’t just trivia questions. They require reasoning chains, conditional evaluations, and adaptive planning.
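For readers who want to see what a success rate means mechanically, the sketch below shows the usual exact-match scoring pattern. The three items and the `model_answer` stub are invented for illustration; they are not drawn from any published benchmark.

```python
# Invented mini-benchmark; real suites use hundreds or thousands of items
# and a real model call in place of `model_answer`.
benchmark = [
    {"question": "If all A are B and all B are C, are all A C?", "answer": "yes"},
    {"question": "What is 2**10?", "answer": "1024"},
    {"question": "Is 91 prime?", "answer": "no"},
]

def model_answer(question: str) -> str:
    """Hypothetical stand-in for querying an LRM and extracting its final answer."""
    canned = {
        "If all A are B and all B are C, are all A C?": "yes",
        "What is 2**10?": "1024",
        "Is 91 prime?": "yes",   # one deliberate miss: 91 = 7 * 13
    }
    return canned[question]

correct = sum(model_answer(item["question"]) == item["answer"] for item in benchmark)
print(f"exact-match accuracy: {correct}/{len(benchmark)}")  # 2/3
```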
It’s important to call out that the models evaluated here are open-source. That matters. Open models haven’t been tuned behind closed doors or fine-tuned on test answers. Their performance shows native reasoning ability, unassisted, transparent, and reproducible.
Also worth noting: humans who take these tests often do so with specific training. Models don’t get that. They’re generalists. And yet, they remain competitive. In some categories, models outperform the average untrained human. That suggests their performance isn’t pre-calibrated on expected outputs but built on internalized knowledge and coherence over time.
For executives, this signals an operational lesson: you don’t need perfection for deployment. You need consistency, progress, and relevance to the problem space. LRMs deliver on all three. In environments where insight comes from cross-domain reasoning or logic-based evaluation, these systems can augment or even outperform task-specific tools.
We’re past the point of asking if these models can do real work. Now the conversation shifts to when and where to apply them. The benchmarks are no longer debates. They’re signals. And the signal is that these models think: differently from us, but functionally.
LRMs meet the theoretical criteria for systems capable of thought
There’s a practical threshold where intelligence isn’t defined by origin, biological or synthetic, but by function. Large reasoning models (LRMs) meet that threshold. When measured against the core criteria for general computability and problem-solving ability, these systems satisfy the requirements.
What matters here is the combination of attributes. LRMs have representational capacity: they store and structure complex knowledge across billions of parameters. They also possess the ability to generalize that knowledge, allowing them to complete tasks they weren’t explicitly trained for. Finally, they can apply logical operations across inputs to reach conclusions. These are not superficial features. They’re foundational markers of intelligent behavior.
The theoretical foundation supporting this is solid. Any system that has sufficient memory architecture, exposure to training data, processing power, and the ability to simulate learning from context can, in principle, perform any computable reasoning task. LRMs, particularly at current scales, fit these parameters. They aren’t stuck on narrow domains. They adapt across disciplines (technical, linguistic, procedural), demonstrating the flexibility expected of autonomous problem solvers.
This shifts how businesses should assess value. These models are not predefined solution engines. They are generalized problem-processing systems whose capability stretches across verticals. Whether it’s legal contract review, strategy modeling, or complex code refactoring, if the problem can be framed in language, an LRM can process and resolve it with an increasing degree of competence.
As performance continues to improve, the trade-off between human decision latency and machine throughput becomes clearer. That’s not just about cost. It’s about accelerating organizational responsiveness. Operating with real-time computational reasoning across departments increases efficiency, reduces error, and enhances strategic depth.
In practice, thinking systems are not futuristic; they’re here, enterprise-ready, and developing fast. C-suite leaders should be thinking not about whether these systems resemble human thought, but about whether they deliver outcomes intelligently and reliably. If they meet that bar, they qualify. By current evidence and by theoretical design, LRMs do.
The bottom line
Thinking isn’t defined by how something looks on the surface; it’s defined by what it does. Large reasoning models don’t just follow patterns. They generalize, adapt, and solve problems they’ve never seen before. They simulate cognitive processes in ways that are measurable, functional, and scalable. That’s not speculation; it’s documented through behavior, benchmarks, and architecture.
For business leaders, the conclusion is straightforward. These systems aren’t toys or trend-driven experiments. They’re becoming infrastructure. When applied correctly, they reduce time-to-decision, improve output quality, and operate across domains without constant reconfiguration.
Ignore traditional assumptions about what intelligence should look like. Focus instead on what capabilities are being delivered. These models aren’t perfect, but neither are people. What they offer is consistent, evolving intelligence that performs under pressure and scales without fatigue.
That’s not a risk. That’s an opportunity. The companies that align early with this shift will gain efficiency, flexibility, and leverage at every level of operations. There’s no standing still here; adapting ahead of the curve is the only option that ensures you stay relevant.


