Thinking Without Tokens: How Latent Reasoning Makes a 3.5B Model Perform Like 50B

A new architecture lets AI models reason in latent space instead of generating tokens. The recurrent depth approach scales test-time compute without chain-of-thought.

Chain-of-thought reasoning has dominated the test-time scaling paradigm: models generate visible reasoning tokens, "thinking out loud" before producing answers.1 A new paper challenges the assumption that reasoning has to be verbalized at all. By iterating a recurrent block in latent space, researchers demonstrate that a 3.5-billion-parameter model achieves performance equivalent to 50 billion parameters on reasoning benchmarks.2 The thinking happens invisibly, encoded in hidden states rather than in generated text.

TL;DR

The paper "Scaling up Test-Time Compute with Latent Reasoning" introduces an architecture that unrolls to arbitrary depth at inference time without producing additional tokens.3 A recurrent block processes hidden states iteratively, enabling the model to "think harder" on difficult problems by computing longer.4 Trained on 800 billion tokens, the 3.5B parameter model demonstrates dramatic improvements on reasoning benchmarks when allowed more iterations.5 The approach requires no specialized training data, no extended context windows, and no chain-of-thought supervision.6 All code, model weights, and training recipes have been released publicly on HuggingFace and GitHub.7

The Token Generation Problem

Current reasoning models like OpenAI's o1 and DeepSeek-R1 achieve their capabilities by generating extended chains of reasoning tokens.8 This approach works but introduces several constraints:

Context Window Consumption: Every reasoning token occupies context space that could otherwise hold problem information or instructions.9

Latency Scaling: Generating tokens remains fundamentally sequential. Longer reasoning chains translate directly to longer response times.10

Verbalization Bottleneck: Some reasoning may prove difficult or inefficient to express in natural language. Mathematical intuition, spatial reasoning, and pattern recognition resist easy verbalization.11

The latent reasoning approach sidesteps these issues by performing computation in the model's hidden representation space rather than in token space.

How Latent Reasoning Works

The architecture centers on a recurrent block that can be applied iteratively during inference:12

The Recurrent Block

Traditional transformers apply each layer exactly once per forward pass. The latent reasoning architecture allows a designated block to iterate multiple times:13

Input → Layers 1-N → Recurrent Block ↺ → Layers N+1-M → Output
                          ↑___________|
                          (iterate K times)

The recurrent block transforms hidden states without producing any visible output. Each iteration refines the internal representation, potentially discovering patterns or relationships that a single pass would miss.14
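A minimal PyTorch sketch of this pattern is shown below. The module and argument names (`prelude`, `recurrent_block`, `coda`, `num_iterations`) are illustrative placeholders rather than the paper's actual implementation, and the real architecture includes details (such as how the input is re-injected at each step) that this simplified sketch omits:

```python
import torch
import torch.nn as nn

class LatentRecurrentModel(nn.Module):
    """Sketch: prelude layers run once, a shared block iterates K times
    in latent space, coda layers run once. Hypothetical and simplified."""

    def __init__(self, d_model=512, n_heads=8, n_prelude=2, n_coda=2):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True, norm_first=True)
        self.prelude = nn.ModuleList([make_layer() for _ in range(n_prelude)])
        self.recurrent_block = make_layer()   # one set of weights, reused every iteration
        self.coda = nn.ModuleList([make_layer() for _ in range(n_coda)])
        self.norm = nn.LayerNorm(d_model)     # keeps the iterated hidden state well-scaled

    def forward(self, x, num_iterations: int = 4):
        for layer in self.prelude:            # "Layers 1-N": applied once
            x = layer(x)
        h = x
        for _ in range(num_iterations):       # recurrent block: iterate K times, no tokens emitted
            h = self.recurrent_block(self.norm(h))
        for layer in self.coda:               # "Layers N+1-M": applied once
            h = layer(h)
        return h
```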

Variable Depth at Inference

The key innovation lies in decoupling training depth from inference depth. During training, the model learns to iterate the recurrent block a fixed number of times.15 At inference, the system can unroll to arbitrary depth based on problem difficulty or compute budget.16

| Inference Depth | Compute Cost | Performance |
|---|---|---|
| Baseline (1x) | Standard | Baseline |
| 5x iterations | ~5x | Moderate gains |
| 10x iterations | ~10x | Significant gains |
| 50x+ iterations | ~50x | Dramatic gains17 |

This property enables flexible compute-performance tradeoffs at deployment time.
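Continuing the sketch above, the iteration count is just a forward-pass argument, so the same weights can be run at whatever depth the compute budget allows. A hypothetical usage example:

```python
model = LatentRecurrentModel()
x = torch.randn(1, 128, 512)            # (batch, sequence, d_model) dummy activations

baseline = model(x, num_iterations=1)   # roughly standard-depth compute
moderate = model(x, num_iterations=10)  # ~10x recurrent compute for harder inputs
deep     = model(x, num_iterations=50)  # unroll far beyond the training depth
```

The output shape is identical in all three cases; only the amount of latent computation changes.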

No Token Generation Required

Unlike chain-of-thought approaches, latent reasoning produces no intermediate tokens. The model's "thinking" manifests as repeated transformations of hidden states.18 From an external perspective, the model simply takes longer to respond while consuming no additional context.

The researchers describe this as "reasoning that cannot easily be represented in words."19 The model may discover geometric relationships, statistical patterns, or logical structures that resist natural language description.

Training the Recurrent Architecture

Scaling the approach required solving several technical challenges:20

Gradient Flow

Deep recurrent unrolling risks the gradient issues familiar from RNN training. The researchers employed techniques including the following (a short checkpointing sketch appears after the list):21

  • Careful initialization of recurrent weights
  • Gradient checkpointing for memory efficiency
  • Layer normalization within the recurrent block
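The checkpointing item maps onto a standard PyTorch pattern. The sketch below is a guess at the shape of that code, not the paper's training implementation: each recurrent step is recomputed during the backward pass instead of having its activations stored for all K iterations.

```python
import torch
from torch.utils.checkpoint import checkpoint

def unroll_with_checkpointing(recurrent_block, h, num_iterations):
    """Iterate the shared block under gradient checkpointing: activations
    are recomputed in the backward pass, trading compute for memory so
    that deep unrolls fit on the device."""
    for _ in range(num_iterations):
        h = checkpoint(recurrent_block, h, use_reentrant=False)
    return h
```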

Training Stability

Training on 800 billion tokens at 3.5B parameters required stable optimization over extended periods.22 The paper reports successful training without the loss spikes that often plague large-scale runs.

Data Requirements

Notably, the approach requires no specialized training data:23

  • No chain-of-thought annotations
  • No reasoning traces
  • No human labels of solution steps
  • Standard language modeling objective

The model learns to use additional iterations through the implicit signal that more computation improves prediction accuracy.
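In other words, the loss is plain next-token cross-entropy; only the forward pass differs because the recurrent block is iterated. A hedged sketch, where `embed` and `lm_head` are assumed embedding and output-projection modules wrapped around the model sketched earlier:

```python
import torch.nn.functional as F

def lm_loss(model, embed, lm_head, tokens, num_iterations=4):
    """Standard language modeling objective: predict the next token.
    No chain-of-thought annotations or reasoning traces are involved."""
    x = embed(tokens[:, :-1])                    # embed the input positions
    h = model(x, num_iterations=num_iterations)  # latent iterations happen here
    logits = lm_head(h)                          # (batch, seq_len - 1, vocab)
    targets = tokens[:, 1:]                      # shifted next-token targets
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```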

Benchmark Performance

The 3.5B model demonstrates consistent improvements when given more inference-time iterations:24

Scaling Curves

Performance improves smoothly with additional compute, exhibiting the predictable scaling that makes test-time approaches attractive:25

  • Improvements appear on multiple reasoning benchmarks
  • Gains continue at high iteration counts (50x+)
  • The relationship between compute and performance remains roughly log-linear

Effective Parameter Comparison

At maximum unrolling, the 3.5B model matches the performance of a roughly 50-billion-parameter model running standard inference.26 This ~14x expansion in effective capacity is paid for with additional inference compute rather than additional parameters.

Benchmark Categories

The paper evaluates across diverse reasoning types:27

| Category | Improvement Pattern |
|---|---|
| Mathematical reasoning | Strong gains |
| Logical deduction | Moderate gains |
| Common sense | Variable |
| Factual recall | Limited benefit |

The pattern suggests latent reasoning excels on problems requiring multi-step inference rather than simple retrieval.

Comparison to Token-Based Reasoning

How does latent reasoning compare to chain-of-thought approaches?28

| Dimension | Chain-of-Thought | Latent Reasoning |
|---|---|---|
| Compute scaling | Token generation | Recurrent iterations |
| Context usage | Consumes context | Context-neutral |
| Interpretability | Visible reasoning | Opaque |
| Training data | Needs traces | Standard LM data |
| Latency profile | Sequential tokens | Parallel-friendly |

Strengths of Latent Reasoning

Context Efficiency: All context remains available for the actual problem, not consumed by reasoning tokens.29

Potential Parallelism: Recurrent iterations may prove more amenable to parallel acceleration than sequential token generation.30

Training Simplicity: No need to collect or generate reasoning traces for supervision.31

Strengths of Chain-of-Thought

Interpretability: Users can examine reasoning steps to verify correctness or identify errors.32

Steerability: Visible reasoning enables intervention and correction mid-process.33

Proven Scale: Token-based approaches have demonstrated success at frontier scale (o1, DeepSeek-R1).34

Open Questions and Future Directions

The paper opens several research directions:35

Hybrid Approaches

Combining latent reasoning with selective token generation might capture benefits of both approaches. The model could think in latent space for most computation while surfacing key intermediate conclusions as tokens.36

Scaling Limits

How far does the recurrent depth approach scale? The current work demonstrates 3.5B parameters on 800B tokens. Would larger models trained on more data exhibit similar scaling properties?37

Training for Depth

The current approach trains at fixed depth, then unrolls further at inference. Training directly for variable depth might improve efficiency and capability.38

Interpretability Research

Understanding what happens inside latent reasoning iterations could reveal new insights about neural network computation. Do the iterations correspond to identifiable reasoning steps? Can they be probed or steered?39

Practical Implications

For organizations deploying reasoning models, latent reasoning offers distinct tradeoffs:40

When Latent Reasoning Fits

  • Context-limited applications where reasoning tokens compete with problem content
  • Batch processing where interpretability matters less than throughput
  • Problems requiring "intuitive" reasoning that resists verbalization

When Chain-of-Thought Fits

  • User-facing applications where reasoning transparency builds trust
  • Debugging and development where visible reasoning aids iteration
  • Tasks where intermediate steps require verification

Open Source Availability

The researchers released complete artifacts for replication:41

  • Model Weights: 3.5B parameter model on HuggingFace
  • Training Code: Full pipeline on GitHub
  • Training Recipe: Hyperparameters and configuration for 800B token training

This openness enables the research community to build on the work directly.
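For readers who want to try the released checkpoint, loading it should follow the usual transformers pattern. The repository ID below is a placeholder, not the actual model name; follow the link on the paper's HuggingFace page for the real identifier:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository ID -- substitute the one linked from
# https://huggingface.co/papers/2502.05171
repo_id = "example-org/latent-reasoning-3.5b"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
# trust_remote_code is typically required for custom architectures like this one
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
```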

Key Takeaways

The latent reasoning paper establishes a new paradigm for test-time scaling:

  1. Tokens Not Required: Reasoning can occur entirely in hidden representation space
  2. Massive Effective Expansion: 3.5B → 50B equivalent through recurrent depth
  3. Training Simplicity: Standard language modeling, no specialized data
  4. Flexible Inference: Compute-performance tradeoff controllable at deployment
  5. Complementary Approach: May combine with token-based reasoning for hybrid systems

The work suggests that the visible chain-of-thought paradigm, while effective, may not capture the full potential of test-time compute scaling.


References


  1. Wei, J., Wang, X., Schuurmans, D., et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022. https://arxiv.org/abs/2201.11903 

  2. "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach." arXiv:2502.05171. February 2026. https://arxiv.org/abs/2502.05171 

  3. Ibid., Abstract. 

  4. Ibid., Section 2: Method. 

  5. Ibid., Section 4: Experiments. 

  6. Ibid., Section 1: Introduction. 

  7. Ibid., Section 6: Open Release. 

  8. OpenAI. "Learning to Reason with LLMs." September 2024. https://openai.com/index/learning-to-reason-with-llms/ 

  9. Muennighoff, N. et al. "s1: Simple test-time scaling." arXiv:2501.19393. January 2025. https://arxiv.org/abs/2501.19393 

  10. Ibid. 

  11. Latent Reasoning paper, op. cit., Section 1. 

  12. Ibid., Section 2.1: Architecture. 

  13. Ibid., Figure 1. 

  14. Ibid., Section 2.2: Recurrent Processing. 

  15. Ibid., Section 3: Training. 

  16. Ibid., Section 2.3: Variable Depth Inference. 

  17. Ibid., Table 1: Scaling Results. 

  18. Ibid., Section 2. 

  19. Ibid., Abstract. 

  20. Ibid., Section 3: Training. 

  21. Ibid., Section 3.1: Gradient Management. 

  22. Ibid., Section 3.2: Training Stability. 

  23. Ibid., Section 3.3: Data. 

  24. Ibid., Section 4: Experiments. 

  25. Ibid., Figure 2: Scaling Curves. 

  26. Ibid., Section 4.2: Effective Capacity. 

  27. Ibid., Table 2: Benchmark Breakdown. 

  28. Ibid., Section 5: Comparison. 

  29. Ibid., Section 5.1. 

  30. Ibid. 

  31. Ibid. 

  32. Wei et al., op. cit. 

  33. Ibid. 

  34. OpenAI, op. cit. 

  35. Latent Reasoning paper, op. cit., Section 6: Future Work. 

  36. Ibid., Section 6.1. 

  37. Ibid., Section 6.2. 

  38. Ibid., Section 6.3. 

  39. Ibid., Section 6.4. 

  40. Ibid., Section 7: Discussion. 

  41. Ibid., Section 8: Open Release. HuggingFace: https://huggingface.co/papers/2502.05171 
