Thinking Without Tokens: How Latent Reasoning Makes a 3.5B Model Perform Like 50B

A new architecture lets AI models reason in latent space instead of generating tokens. The recurrent depth approach scales test-time compute without chain-of-thought.

Chain-of-thought reasoning has dominated the test-time scaling paradigm: models generate visible reasoning tokens, "thinking out loud" before producing answers.1 A new paper challenges the assumption that reasoning has to be verbalized at all. By iterating a recurrent block in latent space, researchers demonstrate that a 3.5-billion-parameter model achieves performance equivalent to 50 billion parameters on reasoning benchmarks.2 The thinking happens invisibly, encoded in hidden states rather than in generated text.

TL;DR

The paper "Scaling up Test-Time Compute with Latent Reasoning" introduces an architecture that unrolls to arbitrary depth at inference time without producing additional tokens.3 A recurrent block processes hidden states iteratively, enabling the model to "think harder" on difficult problems by computing longer.4 Trained on 800 billion tokens, the 3.5B parameter model demonstrates dramatic improvements on reasoning benchmarks when allowed more iterations.5 The approach requires no specialized training data, no extended context windows, and no chain-of-thought supervision.6 All code, model weights, and training recipes have been released publicly on HuggingFace and GitHub.7

The Token Generation Problem

Current reasoning models like OpenAI's o1 and DeepSeek-R1 achieve their capabilities by generating extended chains of reasoning tokens.8 This approach works but introduces several constraints:

Context Window Consumption: Every reasoning token occupies context space that could otherwise hold problem information or instructions.9

Latency Scaling: Generating tokens remains fundamentally sequential. Longer reasoning chains translate directly to longer response times.10

Verbalization Bottleneck: Some reasoning may prove difficult or inefficient to express in natural language. Mathematical intuition, spatial reasoning, and pattern recognition resist easy verbalization.11

The latent reasoning approach sidesteps these issues by performing computation in the model's hidden representation space rather than in token space.

How Latent Reasoning Works

The architecture centers on a recurrent block that can be applied iteratively during inference:12

The Recurrent Block

Traditional transformers apply each layer exactly once per forward pass. The latent reasoning architecture allows a designated block to iterate multiple times:13

Input → Layers 1-N → Recurrent Block ↺ → Layers N+1-M → Output
                          ↑___________|
                          (iterate K times)

The recurrent block transforms hidden states without producing any visible output. Each iteration refines the internal representation, potentially discovering patterns or relationships that a single pass would miss.14
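A minimal PyTorch sketch of this pattern is shown below. The module and argument names (`prelude`, `recurrent_block`, `coda`, `num_iterations`) are illustrative placeholders rather than the paper's actual implementation, and the real architecture includes details (such as how the input is re-injected at each step) that this simplified sketch omits:

```python
import torch
import torch.nn as nn

class LatentRecurrentModel(nn.Module):
    """Sketch: prelude layers run once, a shared block iterates K times
    in latent space, coda layers run once. Hypothetical and simplified."""

    def __init__(self, d_model=512, n_heads=8, n_prelude=2, n_coda=2):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True, norm_first=True)
        self.prelude = nn.ModuleList([make_layer() for _ in range(n_prelude)])
        self.recurrent_block = make_layer()   # one set of weights, reused every iteration
        self.coda = nn.ModuleList([make_layer() for _ in range(n_coda)])
        self.norm = nn.LayerNorm(d_model)     # keeps the iterated hidden state well-scaled

    def forward(self, x, num_iterations: int = 4):
        for layer in self.prelude:            # "Layers 1-N": applied once
            x = layer(x)
        h = x
        for _ in range(num_iterations):       # recurrent block: iterate K times, no tokens emitted
            h = self.recurrent_block(self.norm(h))
        for layer in self.coda:               # "Layers N+1-M": applied once
            h = layer(h)
        return h
```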

Variable Depth at Inference

The key innovation lies in decoupling training depth from inference depth. During training, the model learns to iterate the recurrent block a fixed number of times.15 At inference, the system can unroll to arbitrary depth based on problem difficulty or compute budget.16

| Inference Depth | Compute Cost | Performance |
|---|---|---|
| Baseline (1x) | Standard | Baseline |
| 5x iterations | ~5x | Moderate gains |
| 10x iterations | ~10x | Significant gains |
| 50x+ iterations | ~50x | Dramatic gains17 |

This property enables flexible compute-performance tradeoffs at deployment time.
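Continuing the sketch above, the iteration count is just a forward-pass argument, so the same weights can be run at whatever depth the compute budget allows. A hypothetical usage example:

```python
model = LatentRecurrentModel()
x = torch.randn(1, 128, 512)            # (batch, sequence, d_model) dummy activations

baseline = model(x, num_iterations=1)   # roughly standard-depth compute
moderate = model(x, num_iterations=10)  # ~10x recurrent compute for harder inputs
deep     = model(x, num_iterations=50)  # unroll far beyond the training depth
```

The output shape is identical in all three cases; only the amount of latent computation changes.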

No Token Generation Required

Unlike chain-of-thought approaches, latent reasoning produces no intermediate tokens. The model's "thinking" manifests as repeated transformations of hidden states.18 From an external perspective, the model simply takes longer to respond while consuming no additional context.

The researchers describe this as "reasoning that cannot easily be represented in words."19 The model may discover geometric relationships, statistical patterns, or logical structures that resist natural language description.

Training the Recurrent Architecture

Scaling the approach required solving several technical challenges:20

Gradient Flow

Deep recurrent unrolling risks the gradient issues familiar from RNN training. The researchers employed techniques including the following (a short checkpointing sketch appears after the list):21

  • Careful initialization of recurrent weights
  • Gradient checkpointing for memory efficiency
  • Layer normalization within the recurrent block
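The checkpointing item maps onto a standard PyTorch pattern. The sketch below is a guess at the shape of that code, not the paper's training implementation: each recurrent step is recomputed during the backward pass instead of having its activations stored for all K iterations.

```python
import torch
from torch.utils.checkpoint import checkpoint

def unroll_with_checkpointing(recurrent_block, h, num_iterations):
    """Iterate the shared block under gradient checkpointing: activations
    are recomputed in the backward pass, trading compute for memory so
    that deep unrolls fit on the device."""
    for _ in range(num_iterations):
        h = checkpoint(recurrent_block, h, use_reentrant=False)
    return h
```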

Training Stability

Training on 800 billion tokens at 3.5B parameters required stable optimization over extended periods.22 The paper reports successful training without the loss spikes that often plague large-scale runs.

Data Requirements

Notably, the approach requires no specialized training data:23

  • No chain-of-thought annotations
  • No reasoning traces
  • No human labels of solution steps
  • Standard language modeling objective

The model learns to use additional iterations through the implicit signal that more computation improves prediction accuracy.
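In other words, the loss is plain next-token cross-entropy; only the forward pass differs because the recurrent block is iterated. A hedged sketch, where `embed` and `lm_head` are assumed embedding and output-projection modules wrapped around the model sketched earlier:

```python
import torch.nn.functional as F

def lm_loss(model, embed, lm_head, tokens, num_iterations=4):
    """Standard language modeling objective: predict the next token.
    No chain-of-thought annotations or reasoning traces are involved."""
    x = embed(tokens[:, :-1])                    # embed the input positions
    h = model(x, num_iterations=num_iterations)  # latent iterations happen here
    logits = lm_head(h)                          # (batch, seq_len - 1, vocab)
    targets = tokens[:, 1:]                      # shifted next-token targets
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```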

Benchmark Performance

The 3.5B model demonstrates consistent improvements when given more inference-time iterations:24

Scaling Curves

Performance improves smoothly with additional compute, exhibiting the predictable scaling that makes test-time approaches attractive:25

  • Improvements appear on multiple reasoning benchmarks
  • Gains continue at high iteration counts (50x+)
  • The relationship between compute and performance remains roughly log-linear

Effective Parameter Comparison

At maximum unrolling, the 3.5B model matches the performance of a roughly 50-billion-parameter model running standard inference.26 This ~14x expansion in effective capacity is paid for with additional inference compute rather than additional parameters.

Benchmark Categories

The paper evaluates across diverse reasoning types:27

| Category | Improvement Pattern |
|---|---|
| Mathematical reasoning | Strong gains |
| Logical deduction | Moderate gains |
| Common sense | Variable |
| Factual recall | Limited benefit |

The pattern suggests latent reasoning excels on problems requiring multi-step inference rather than simple retrieval.

Comparison to Token-Based Reasoning

How does latent reasoning compare to chain-of-thought approaches?28

| Dimension | Chain-of-Thought | Latent Reasoning |
|---|---|---|
| Compute scaling | Token generation | Recurrent iterations |
| Context usage | Consumes context | Context-neutral |
| Interpretability | Visible reasoning | Opaque |
| Training data | Needs traces | Standard LM data |
| Latency profile | Sequential tokens | Parallel-friendly |

Strengths of Latent Reasoning

Context Efficiency: All context remains available for the actual problem, not consumed by reasoning tokens.29

Potential Parallelism: Recurrent iterations may prove more amenable to parallel acceleration than sequential token generation.30

Training Simplicity: No need to collect or generate reasoning traces for supervision.31

Strengths of Chain-of-Thought

Interpretability: Users can examine reasoning steps to verify correctness or identify errors.32

Steerability: Visible reasoning enables intervention and correction mid-process.33

Proven Scale: Token-based approaches have demonstrated success at frontier scale (o1, DeepSeek-R1).34

Open Questions and Future Directions

The paper opens several research directions:35

Hybrid Approaches

Combining latent reasoning with selective token generation might capture benefits of both approaches. The model could think in latent space for most computation while surfacing key intermediate conclusions as tokens.36

Scaling Limits

How far does the recurrent depth approach scale? The current work demonstrates 3.5B parameters on 800B tokens. Would larger models trained on more data exhibit similar scaling properties?37

Training for Depth

The current approach trains at fixed depth, then unrolls further at inference. Training directly for variable depth might improve efficiency and capability.38

Interpretability Research

Understanding what happens inside latent reasoning iterations could reveal new insights about neural network computation. Do the iterations correspond to identifiable reasoning steps? Can they be probed or steered?39

Practical Implications

For organizations deploying reasoning models, latent reasoning offers distinct tradeoffs:40

When Latent Reasoning Fits

  • Context-limited applications where reasoning tokens compete with problem content
  • Batch processing where interpretability matters less than throughput
  • Problems requiring "intuitive" reasoning that resists verbalization

When Chain-of-Thought Fits

  • User-facing applications where reasoning transparency builds trust
  • Debugging and development where visible reasoning aids iteration
  • Tasks where intermediate steps require verification

Open Source Availability

The researchers released complete artifacts for replication:41

  • Model Weights: 3.5B parameter model on HuggingFace
  • Training Code: Full pipeline on GitHub
  • Training Recipe: Hyperparameters and configuration for 800B token training

This openness enables the research community to build on the work directly.
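For readers who want to try the released checkpoint, loading it should follow the usual transformers pattern. The repository ID below is a placeholder, not the actual model name; follow the link on the paper's HuggingFace page for the real identifier:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository ID -- substitute the one linked from
# https://huggingface.co/papers/2502.05171
repo_id = "example-org/latent-reasoning-3.5b"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
# trust_remote_code is typically required for custom architectures like this one
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
```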

Key Takeaways

The latent reasoning paper establishes a new paradigm for test-time scaling:

  1. Tokens Not Required: Reasoning can occur entirely in hidden representation space
  2. Massive Effective Expansion: 3.5B → 50B equivalent through recurrent depth
  3. Training Simplicity: Standard language modeling, no specialized data
  4. Flexible Inference: Compute-performance tradeoff controllable at deployment
  5. Complementary Approach: May combine with token-based reasoning for hybrid systems

The work suggests that the visible chain-of-thought paradigm, while effective, may not capture the full potential of test-time compute scaling.


References


  1. Wei, J., Wang, X., Schuurmans, D., et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022. https://arxiv.org/abs/2201.11903 

  2. "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach." arXiv:2502.05171. February 2026. https://arxiv.org/abs/2502.05171 

  3. Ibid., Abstract. 

  4. Ibid., Section 2: Method. 

  5. Ibid., Section 4: Experiments. 

  6. Ibid., Section 1: Introduction. 

  7. Ibid., Section 6: Open Release. 

  8. OpenAI. "Learning to Reason with LLMs." September 2024. https://openai.com/index/learning-to-reason-with-llms/ 

  9. Muennighoff, N. et al. "s1: Simple test-time scaling." arXiv:2501.19393. January 2025. https://arxiv.org/abs/2501.19393 

  10. Ibid. 

  11. Latent Reasoning paper, op. cit., Section 1. 

  12. Ibid., Section 2.1: Architecture. 

  13. Ibid., Figure 1. 

  14. Ibid., Section 2.2: Recurrent Processing. 

  15. Ibid., Section 3: Training. 

  16. Ibid., Section 2.3: Variable Depth Inference. 

  17. Ibid., Table 1: Scaling Results. 

  18. Ibid., Section 2. 

  19. Ibid., Abstract. 

  20. Ibid., Section 3: Training. 

  21. Ibid., Section 3.1: Gradient Management. 

  22. Ibid., Section 3.2: Training Stability. 

  23. Ibid., Section 3.3: Data. 

  24. Ibid., Section 4: Experiments. 

  25. Ibid., Figure 2: Scaling Curves. 

  26. Ibid., Section 4.2: Effective Capacity. 

  27. Ibid., Table 2: Benchmark Breakdown. 

  28. Ibid., Section 5: Comparison. 

  29. Ibid., Section 5.1. 

  30. Ibid. 

  31. Ibid. 

  32. Wei et al., op. cit. 

  33. Ibid. 

  34. OpenAI, op. cit. 

  35. Latent Reasoning paper, op. cit., Section 6: Future Work. 

  36. Ibid., Section 6.1. 

  37. Ibid., Section 6.2. 

  38. Ibid., Section 6.3. 

  39. Ibid., Section 6.4. 

  40. Ibid., Section 7: Discussion. 

  41. Ibid., Section 8: Open Release. HuggingFace: https://huggingface.co/papers/2502.05171 
