Thinking Without Tokens: How Latent Reasoning Makes a 3.5B Model Perform Like 50B
Chain-of-thought reasoning has dominated the test-time scaling paradigm: models generate visible reasoning tokens, "thinking out loud" before producing answers.1 A new paper challenges the assumption that reasoning must be verbalized at all. By iterating a recurrent block in latent space, the researchers show that a 3.5 billion parameter model can match the performance of a roughly 50 billion parameter model on reasoning benchmarks.2 The thinking happens invisibly, encoded in hidden states rather than generated text.
TL;DR
The paper "Scaling up Test-Time Compute with Latent Reasoning" introduces an architecture that unrolls to arbitrary depth at inference time without producing additional tokens.3 A recurrent block processes hidden states iteratively, enabling the model to "think harder" on difficult problems by computing longer.4 Trained on 800 billion tokens, the 3.5B parameter model demonstrates dramatic improvements on reasoning benchmarks when allowed more iterations.5 The approach requires no specialized training data, no extended context windows, and no chain-of-thought supervision.6 All code, model weights, and training recipes have been released publicly on HuggingFace and GitHub.7
The Token Generation Problem
Current reasoning models like OpenAI's o1 and DeepSeek-R1 achieve their capabilities by generating extended chains of reasoning tokens.8 This approach works but introduces several constraints:
Context Window Consumption: Every reasoning token occupies context space that could otherwise hold problem information or instructions.9
Latency Scaling: Generating tokens remains fundamentally sequential. Longer reasoning chains translate directly to longer response times.10
Verbalization Bottleneck: Some reasoning may prove difficult or inefficient to express in natural language. Mathematical intuition, spatial reasoning, and pattern recognition resist easy verbalization.11
The latent reasoning approach sidesteps these issues by performing computation in the model's hidden representation space rather than in token space.
How Latent Reasoning Works
The architecture centers on a recurrent block that can be applied iteratively during inference:12
The Recurrent Block
Traditional transformers apply each layer exactly once per forward pass. The latent reasoning architecture allows a designated block to iterate multiple times:13
```
Input → Layers 1-N → Recurrent Block → Layers N+1-M → Output
                         ↺ (iterate K times)
```
The recurrent block transforms hidden states without producing any visible output. Each iteration refines the internal representation, potentially discovering patterns or relationships that a single pass would miss.14
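To make the data flow concrete, here is a minimal PyTorch-style sketch of a depth-recurrent forward pass. The class name, layer choices, and dimensions are illustrative assumptions for this article, not the released implementation:

```python
import torch
import torch.nn as nn

class DepthRecurrentModel(nn.Module):
    """Toy depth-recurrent transformer: a prelude stack, one shared
    recurrent block applied K times, and a coda stack. Illustrative only."""

    def __init__(self, d_model=512, n_heads=8, n_prelude=2, n_coda=2):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True, norm_first=True)
        self.prelude = nn.ModuleList([make_layer() for _ in range(n_prelude)])
        self.recurrent_block = make_layer()   # one block, reused every iteration
        self.coda = nn.ModuleList([make_layer() for _ in range(n_coda)])

    def forward(self, x, num_iterations: int = 4):
        for block in self.prelude:            # "Layers 1-N": encode the input once
            x = block(x)
        for _ in range(num_iterations):       # latent "thinking": no tokens are emitted
            x = self.recurrent_block(x)
        for block in self.coda:               # "Layers N+1-M": read out the refined state
            x = block(x)
        return x
```

The loop in the middle is the entire mechanism: more iterations mean more computation on the same hidden states, with no change to the sequence length.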
Variable Depth at Inference
The key innovation lies in decoupling training depth from inference depth. During training, the model learns to iterate the recurrent block a fixed number of times.15 At inference, the system can unroll to arbitrary depth based on problem difficulty or compute budget.16
| Inference Depth | Compute Cost | Performance |
|---|---|---|
| Baseline (1x) | Standard | Baseline |
| 5x iterations | ~5x compute | Moderate gains |
| 10x iterations | ~10x compute | Significant gains |
| 50x+ iterations | ~50x compute | Dramatic gains17 |
This property enables flexible compute-performance tradeoffs at deployment time.
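In deployment terms, the tradeoff in the table above reduces to choosing a single `num_iterations` value per request. Continuing the sketch from the previous section, a hypothetical routing policy (the difficulty tiers and iteration counts are invented for illustration) might look like this:

```python
import torch

def pick_iterations(difficulty: str) -> int:
    # Hypothetical policy: spend more latent-reasoning iterations on harder requests.
    return {"easy": 1, "medium": 5, "hard": 10, "very_hard": 50}[difficulty]

model = DepthRecurrentModel()                 # from the earlier sketch
hidden = torch.randn(1, 16, 512)              # stand-in for already-embedded input
fast = model(hidden, num_iterations=pick_iterations("easy"))
slow = model(hidden, num_iterations=pick_iterations("very_hard"))
# Same weights, same context window; only inference-time compute differs.
```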
No Token Generation Required
Unlike chain-of-thought approaches, latent reasoning produces no intermediate tokens. The model's "thinking" manifests as repeated transformations of hidden states.18 From an external perspective, the model simply takes longer to respond while consuming no additional context.
The researchers describe this as "reasoning that cannot easily be represented in words."19 The model may discover geometric relationships, statistical patterns, or logical structures that resist natural language description.
Training the Recurrent Architecture
Scaling the approach required solving several technical challenges:20
Gradient Flow
Deep recurrent unrolling risks the gradient issues familiar from RNN training. The researchers employed mitigation techniques including the following (a rough sketch follows the list):21
- Careful initialization of recurrent weights
- Gradient checkpointing for memory efficiency
- Layer normalization within the recurrent block
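Here is one way those three pieces could fit together, using a toy pre-norm residual block, a deliberately small-scale output initialization, and PyTorch's activation checkpointing. This is an assumption-laden illustration, not the paper's recipe:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class RecurrentBlock(nn.Module):
    """Toy recurrent block with layer normalization applied before a
    residual MLP update, so repeated application stays well scaled."""

    def __init__(self, d_model=512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        # Small-scale init of the output projection keeps early iterations
        # close to the identity map (one plausible "careful initialization").
        nn.init.normal_(self.mlp[-1].weight, std=0.02 / d_model ** 0.5)
        nn.init.zeros_(self.mlp[-1].bias)

    def forward(self, x):
        return x + self.mlp(self.norm(x))

def unroll(block, x, num_iterations):
    # Gradient checkpointing: recompute activations during the backward pass
    # instead of storing one copy per iteration, trading compute for memory.
    for _ in range(num_iterations):
        x = checkpoint(block, x, use_reentrant=False)
    return x
```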
Training Stability
Training on 800 billion tokens at 3.5B parameters required stable optimization over extended periods.22 The paper reports successful training without the loss spikes that often plague large-scale runs.
Data Requirements
Notably, the approach requires no specialized training data:23
- No chain-of-thought annotations
- No reasoning traces
- No human labels of solution steps
- Standard language modeling objective
The model learns to use additional iterations through the implicit signal that more computation improves prediction accuracy.
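Put differently, the objective is ordinary next-token cross-entropy, simply evaluated after the recurrent block has run. A minimal sketch of one training step follows, where `embed`, `lm_head`, and the fixed iteration count are placeholder assumptions rather than the paper's setup:

```python
import torch
import torch.nn.functional as F

def training_step(model, embed, lm_head, optimizer, input_ids, num_iterations=4):
    """One step of plain language modeling through a depth-recurrent trunk.
    No reasoning traces, no step labels -- just next-token prediction."""
    hidden = model(embed(input_ids), num_iterations=num_iterations)
    logits = lm_head(hidden[:, :-1])              # predict token t+1 from tokens <= t
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           input_ids[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```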
Benchmark Performance
The 3.5B model demonstrates consistent improvements when given more inference-time iterations:24
Scaling Curves
Performance improves smoothly with additional compute, exhibiting the predictable scaling that makes test-time approaches attractive:25
- Improvements appear on multiple reasoning benchmarks
- Gains continue at high iteration counts (50x+)
- The relationship between compute and performance remains roughly log-linear
Effective Parameter Comparison
At maximum unrolling, the 3.5B model reaches performance comparable to that of a roughly 50 billion parameter model run with standard single-pass inference.26 This roughly 14x effective capacity expansion comes at the cost of increased inference compute rather than increased model size.
Benchmark Categories
The paper evaluates across diverse reasoning types:27
| Category | Improvement Pattern |
|---|---|
| Mathematical reasoning | Strong gains |
| Logical deduction | Moderate gains |
| Common sense | Variable |
| Factual recall | Limited benefit |
The pattern suggests latent reasoning excels on problems requiring multi-step inference rather than simple retrieval.
Comparison to Token-Based Reasoning
How does latent reasoning compare to chain-of-thought approaches?28
| Dimension | Chain-of-Thought | Latent Reasoning |
|---|---|---|
| Compute scaling | Token generation | Recurrent iterations |
| Context usage | Consumes context | Context-neutral |
| Interpretability | Visible reasoning | Opaque |
| Training data | Needs traces | Standard LM data |
| Latency profile | Sequential tokens | Parallel-friendly |
Strengths of Latent Reasoning
Context Efficiency: All context remains available for the actual problem, not consumed by reasoning tokens.29
Potential Parallelism: Recurrent iterations may prove more amenable to parallel acceleration than sequential token generation.30
Training Simplicity: No need to collect or generate reasoning traces for supervision.31
Strengths of Chain-of-Thought
Interpretability: Users can examine reasoning steps to verify correctness or identify errors.32
Steerability: Visible reasoning enables intervention and correction mid-process.33
Proven Scale: Token-based approaches have demonstrated success at frontier scale (o1, DeepSeek-R1).34
Open Questions and Future Directions
The paper opens several research directions:35
Hybrid Approaches
Combining latent reasoning with selective token generation might capture benefits of both approaches. The model could think in latent space for most computation while surfacing key intermediate conclusions as tokens.36
Scaling Limits
How far does the recurrent depth approach scale? The current work demonstrates 3.5B parameters on 800B tokens. Would larger models trained on more data exhibit similar scaling properties?37
Training for Depth
The current approach trains at fixed depth, then unrolls further at inference. Training directly for variable depth might improve efficiency and capability.38
Interpretability Research
Understanding what happens inside latent reasoning iterations could reveal new insights about neural network computation. Do the iterations correspond to identifiable reasoning steps? Can they be probed or steered?39
Practical Implications
For organizations deploying reasoning models, latent reasoning offers distinct tradeoffs:40
When Latent Reasoning Fits
- Context-limited applications where reasoning tokens compete with problem content
- Batch processing where interpretability matters less than throughput
- Problems requiring "intuitive" reasoning that resists verbalization
When Chain-of-Thought Fits
- User-facing applications where reasoning transparency builds trust
- Debugging and development where visible reasoning aids iteration
- Tasks where intermediate steps require verification
Open Source Availability
The researchers released complete artifacts for replication:41
- Model Weights: 3.5B parameter model on HuggingFace
- Training Code: Full pipeline on GitHub
- Training Recipe: Hyperparameters and configuration for 800B token training
This openness enables the research community to build on the work directly.
Key Takeaways
The latent reasoning paper establishes a new paradigm for test-time scaling:
- Tokens Not Required: Reasoning can occur entirely in hidden representation space
- Massive Effective Expansion: 3.5B → 50B equivalent through recurrent depth
- Training Simplicity: Standard language modeling, no specialized data
- Flexible Inference: Compute-performance tradeoff controllable at deployment
- Complementary Approach: May combine with token-based reasoning for hybrid systems
The work suggests that the visible chain-of-thought paradigm, while effective, may not capture the full potential of test-time compute scaling.
References
1. Wei, J., Wang, X., Schuurmans, D., et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022. https://arxiv.org/abs/2201.11903
2. Geiping, J., et al. "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach." arXiv:2502.05171. February 2025. https://arxiv.org/abs/2502.05171
3. Ibid., Abstract.
4. Ibid., Section 2: Method.
5. Ibid., Section 4: Experiments.
6. Ibid., Section 1: Introduction.
7. Ibid., Section 6: Open Release.
8. OpenAI. "Learning to Reason with LLMs." September 2024. https://openai.com/index/learning-to-reason-with-llms/
9. Muennighoff, N., et al. "s1: Simple test-time scaling." arXiv:2501.19393. January 2025.
10. Ibid.
11. Latent Reasoning paper, op. cit., Section 1.
12. Ibid., Section 2.1: Architecture.
13. Ibid., Figure 1.
14. Ibid., Section 2.2: Recurrent Processing.
15. Ibid., Section 3: Training.
16. Ibid., Section 2.3: Variable Depth Inference.
17. Ibid., Table 1: Scaling Results.
18. Ibid., Section 2.
19. Ibid., Abstract.
20. Ibid., Section 3: Training.
21. Ibid., Section 3.1: Gradient Management.
22. Ibid., Section 3.2: Training Stability.
23. Ibid., Section 3.3: Data.
24. Ibid., Section 4: Experiments.
25. Ibid., Figure 2: Scaling Curves.
26. Ibid., Section 4.2: Effective Capacity.
27. Ibid., Table 2: Benchmark Breakdown.
28. Ibid., Section 5: Comparison.
29. Ibid., Section 5.1.
30. Ibid.
31. Ibid.
32. Wei et al., op. cit.
33. Ibid.
34. OpenAI, op. cit.
35. Latent Reasoning paper, op. cit., Section 6: Future Work.
36. Ibid., Section 6.1.
37. Ibid., Section 6.2.
38. Ibid., Section 6.3.
39. Ibid., Section 6.4.
40. Ibid., Section 7: Discussion.
41. Ibid., Section 8: Open Release. HuggingFace: https://huggingface.co/papers/2502.05171