How Do Reasoning Models Work?

Laurence Liang

Published as part of LaunchYard.xyz


Photo credits: Unsplash/Alexandre Boucey

Over the past few months, you may have noticed an increasing number of language models that are bundled with "reasoning capabilities". A critical reader might wonder: why should we be interested in "reasoning" models?

It turns out that "reasoning" - if implemented correctly - can have many tangible benefits:

- Cheaper models: smaller models trained to reason can perform comparably to much larger baseline models.
- Leaps in performance: on harder problems - math and logic benchmarks in particular - longer reasoning tends to yield higher accuracy.

Arguably, the promise of cheaper models and leaps in model performance is quite desirable. Given that there have been encouraging empirical results over the past several months, the following sections aim to explain how reasoning models typically work, and what next steps may be on the horizon.

Context

The most widespread choice of language model today is the transformer, or a transformer derivative. Since the "Attention Is All You Need" paper was published in 2017, transformer models have generally performed the strongest on natural language tasks relative to other architectures (e.g. LSTMs or GANs), making them a default choice for many applications.

However, transformer models come with fairly predictable scaling trade-offs. As described in OpenAI's 2020 paper "Scaling Laws for Neural Language Models", the rough analogy is that to achieve a 2x improvement in model performance, you need something like a 10x increase in dataset size, compute time or model size.

Scaling laws for language models. Figure credits: “Scaling Laws for Neural Language Models” by Kaplan et al.

These numbers are an approximation (the actual ratios are a bit different), though the premise remains: in practical terms, a company that wants a linear increase in model performance would need to spend exponentially more in dataset size, compute time or model size.

Let's imagine that we want to train a language model. Using some fictional numbers, each doubling of performance (say, halving the word error rate) would cost roughly ten times more in dataset size, compute time or model size - the sketch below walks through this arithmetic.

It’s important to restate that the fictional ratios above are different from the actual ratios (as described in the scaling laws paper), though the linear performance vs exponential cost tradeoff analogy remains.
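
To make the linear-performance-versus-exponential-cost intuition concrete, here is a minimal sketch in Python. It assumes the rough power-law form from the Kaplan et al. paper, loss ∝ (1/N)^α, together with the approximate model-size exponent α ≈ 0.076 reported there; the resulting multipliers are illustrative rather than exact figures from the paper.

```python
# Minimal sketch: how much larger a model must get for a given loss improvement,
# assuming the rough power law loss ~ (N_c / N) ** alpha from Kaplan et al. (2020).
# alpha ~ 0.076 is the approximate model-size exponent reported in that paper.

def params_multiplier(loss_improvement: float, alpha: float = 0.076) -> float:
    """Factor by which parameter count N must grow to reduce the loss by
    `loss_improvement`x, under loss ~ (N_c / N) ** alpha."""
    return loss_improvement ** (1.0 / alpha)

for improvement in (1.1, 1.5, 2.0):
    print(f"{improvement:.1f}x lower loss -> ~{params_multiplier(improvement):,.0f}x more parameters")
```

Under these assumptions, even a 1.5x reduction in loss implies a model a couple of hundred times larger - which is the "exponential cost" that the analogy refers to.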

This is where reasoning models come in: you can take an existing transformer model and, instead of increasing the traditional "scaling laws" variables, extend that model's output length - and performance suddenly leaps. As we will show later on, the longer the reasoning, the higher the accuracy. Spending additional compute at inference on this longer reasoning text is known as inference-time scaling.

In short: reasoning is almost like a supplement an existing model can take, after which it suddenly becomes measurably smarter at a similar operating cost.

Assumptions Towards a Case for Reasoning Models

Reasoning models hold a competitive edge if the following assumptions hold:

- Transformers (or close derivatives), with their current scaling laws, remain the dominant language model architecture.
- Natural-language text remains an effective modality for compressing information and expressing logical processes.

Given that ML research is still very active, it would not be a surprise if (1) someone discovers a non-transformer architecture with more generous scaling laws, or (2) some non-text modality proves more effective at compressing information and logical processes. My take is that either case would diminish the competitive advantage that reasoning models hold on top of transformers. My point isn't to minimize the impact that reasoning models have, but rather to exercise caution: the premise of reasoning as a cost-effective approach may only hold as long as these assumptions remain valid at scale.

How Do Reasoning Models Work?

Reasoning models are trained to write longer outputs that describe their logical thinking steps in natural language.

Let's imagine a hypothetical example. If a user asks:

user: "solve x^3 = 8"

A non-reasoning model may output:

model: "x=2"

A reasoning model, however, would describe its reasoning at length in natural language. Its response could look like:

model:

"""
<think>
The goal is to solve x^3=8.
There is a single variable to solve: x.
We need to isolate x by taking the cube root of 8.
We know that 8^(1/3)=2
So x=2.
</think>
x=2
"""

where the reasoning steps are enclosed in <think> tags, and the final answer appears after the closing </think> tag.
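
As a concrete illustration of this output format, here is a minimal sketch that splits a response into its reasoning and its final answer. The <think> tags follow the hypothetical example above (some open reasoning models use a similar convention, though the exact delimiters vary by model), and the helper function is purely illustrative.

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Separate the chain-of-thought (inside <think> tags) from the final answer."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    # Everything outside the <think> block is treated as the user-facing answer.
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return reasoning, answer

response = "<think>The goal is to solve x^3=8.\nWe know that 8^(1/3)=2, so x=2.</think>\nx=2"
reasoning, answer = split_reasoning(response)
print("Reasoning:", reasoning)
print("Answer:", answer)
```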

While the use of detailed reasoning steps may seem redundant for such a trivial problem, training a model to describe its reasoning steps - a process known as "chain of thought" (CoT) - has demonstrated stronger performance on more challenging problems, with longer reasoning generally yielding higher accuracy.

What Evidence Do We Have, and What Techniques Are Being Used?

As there is a vast body of literature on reasoning, the following is by no means an exhaustive list, but rather a few selected highlights demonstrating encouraging signs that reasoning models perform strongly.

Chain-of-Thought (CoT) Paper (2022).

This paper introduced chain-of-thought prompting: the prompt contains a worked example that breaks a problem down into multiple steps, which encourages the language model to break a subsequent problem down into similar intermediate steps. The authors found that CoT improved problem-solving accuracy at larger model scales on math-related benchmarks, as exemplified by the following figure from the paper:

Model performance on math benchmarks with respect to model size. Figure credit: “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” by Wei et al. (2022)
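
For concreteness, here is a minimal sketch of what a chain-of-thought few-shot prompt might look like; the worked example and the new question below are made up for illustration rather than taken from the paper.

```python
# Minimal sketch of chain-of-thought prompting in the style of Wei et al. (2022):
# the prompt includes a worked example whose answer spells out intermediate steps,
# encouraging the model to do the same for the new question.
# Both the worked example and the new question are invented for illustration.

cot_example = (
    "Q: A library has 3 shelves with 8 books each, then buys 5 more books. "
    "How many books does it have?\n"
    "A: 3 shelves with 8 books each is 3 * 8 = 24 books. Adding 5 more gives "
    "24 + 5 = 29. The answer is 29.\n"
)

new_question = (
    "Q: A baker makes 4 trays of 12 muffins and sells 30. How many are left?\n"
    "A:"
)

prompt = cot_example + "\n" + new_question
print(prompt)  # this full string would then be sent to the language model
```

Compared with a standard few-shot prompt that only shows the final answer ("A: 29."), spelling out the intermediate steps is what encourages the model to produce intermediate steps of its own.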

InstructGPT Paper (2022).

Formally titled “Training language models to follow instructions with human feedback”, this OpenAI paper benchmarked language models fine-tuned on human feedback. A notable contribution is showing that a model fine-tuned to follow instructions is preferred by users over a baseline model, and that a model fine-tuned with reinforcement learning generally outranks a supervised fine-tuned one. While this paper doesn't directly address reasoning, its methodology bears some resemblance to the recipes employed in subsequent years to train reasoning models, showing the usefulness of post-training methods such as supervised fine-tuning and reinforcement learning fine-tuning. The paper also shows how these methods allow a small 1.3B model to produce outputs that users prefer roughly as much as those of a 175B model - the same kind of comparison (small models performing comparably to much larger baselines) that reasoning model papers would later employ.

How reinforcement learning fine-tuned models (PPO and PPO-ptx) outrank a baseline model (GPT) and a supervised fine-tuned model (SFT). Figure credit: “Training language models to follow instructions with human feedback” by Ouyang et al.
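
One concrete ingredient of this pipeline is the reward model, which is trained on human preference comparisons between candidate outputs. Below is a minimal sketch of the pairwise preference loss described in the InstructGPT paper; the scalar reward values are made up and stand in for a real reward model's outputs.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss that pushes the reward model to score the human-preferred
    output above the rejected one: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Made-up reward scores for two candidate responses to the same prompt.
print(preference_loss(reward_chosen=1.2, reward_rejected=0.3))  # small loss: ranking is correct
print(preference_loss(reward_chosen=0.1, reward_rejected=0.9))  # larger loss: ranking is wrong
```

The resulting reward model then serves as the objective that the reinforcement learning step (PPO) optimizes against.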

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters paper (2024).

This paper compared pre-trained models with models that use inference-time scaling (where the model spends more compute generating longer outputs describing its reasoning process) under several inference-time strategies. When benchmarked on math-related tasks, the authors found that inference-time-scaled models can, in some cases, outperform baseline pre-trained models, and that this extends to smaller inference-time-scaled models outperforming models an order of magnitude larger.

How two inference-time scaling strategies compare to baseline pre-trained models (stars). Figure credit: “Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters” by Snell et al.
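
As a rough illustration of spending more compute at inference, here is a minimal sketch of majority voting over several sampled answers (often called "self-consistency"); this is one of the simpler strategies in this line of work, and `sample_answer` is a hypothetical stand-in for an actual model call rather than an API from the paper.

```python
import random
from collections import Counter

def sample_answer(question: str) -> str:
    """Hypothetical stand-in for sampling one reasoning chain from a model
    and extracting its final answer; here we simply simulate noisy answers."""
    return random.choice(["2", "2", "2", "-2", "8"])

def majority_vote(question: str, n_samples: int = 16) -> str:
    """Spend extra inference-time compute by sampling several answers
    and returning the most common one."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(majority_vote("solve x^3 = 8"))
```

The Snell et al. paper studies more sophisticated strategies as well (for example, search guided by a verifier), but the underlying idea is the same: more samples or longer generations buy better answers without retraining the model.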

DeepSeek R1 paper (2025).

While the DeepSeek R1 paper's most striking claim was training a model that performs similarly to - or even surpasses - OpenAI's o1 model on tasks requiring some degree of reasoning capability, the paper makes multiple notable contributions:

- Showing that reasoning behaviour can emerge from reinforcement learning alone, without an initial supervised fine-tuning stage (the "R1-Zero" experiments).
- Using a relatively simple, largely rule-based reward (checking answer correctness and output format) together with the GRPO reinforcement learning algorithm.
- Distilling the resulting reasoning ability into much smaller models that remain competitive with far larger baselines.

The result of these contributions is condensed in the DeepSeek R1 paper's results table:

Figure credit: “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning”
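
For a sense of the reinforcement learning recipe, here is a minimal sketch of the group-relative advantage computation used by GRPO, the RL algorithm employed in the DeepSeek R1 work, as I understand it: several outputs are sampled per prompt, and each output's advantage is its reward relative to the group, removing the need for a separate value (critic) model. The reward values below are made up, and the full method has additional components (a clipped policy objective and a KL penalty) that are omitted here.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each sampled output's reward against its group:
    advantage_i = (r_i - mean(rewards)) / std(rewards)."""
    mu, sigma = mean(rewards), stdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Suppose 4 sampled answers to the same math problem received these rule-based
# rewards (e.g. 1.0 if the final answer is correct, 0.0 otherwise).
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))  # correct answers get positive advantages
```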

Current Limitations?

Next Steps?

Photo credit: Unsplash/Hannah Reding

A Short Summary

Photo credit: Unsplash/Liis Glükmann

In summary, reasoning supercharges model performance at surprisingly limited additional cost, making the cost-benefit tradeoff quite desirable. This makes reasoning an encouraging field to explore further, both on the research side and for its practical product potential.