s1: How 1,000 Training Examples Beat OpenAI's o1-preview by 27%

Stanford's s1 model uses 'budget forcing' to exceed o1-preview on math benchmarks with just 1K training examples. The test-time scaling breakthrough explained.

A team of researchers from Stanford, University of Washington, and other institutions asked a simple question: what constitutes the minimum viable approach to test-time scaling?1 Their answer upends assumptions about the computational requirements for building reasoning models. The s1 model, fine-tuned on just 1,000 carefully selected examples, exceeds OpenAI's o1-preview by up to 27% on competition mathematics benchmarks.2

TL;DR

The s1 paper introduces "budget forcing," a technique that controls how long a model thinks by either cutting reasoning short or appending "Wait" tokens to extend deliberation.3 Researchers curated s1K, a dataset of 1,000 questions selected for difficulty, diversity, and quality from a pool of 59,000 candidates.4 Fine-tuning Qwen2.5-32B-Instruct on s1K produced a model that scales predictably with inference compute.5 On AIME 2024, s1-32B achieves 57% accuracy with extended thinking, versus roughly 44% for o1-preview.6 The entire approach requires no reinforcement learning, no process reward models, and no specialized infrastructure beyond standard fine-tuning.7

The Test-Time Scaling Paradigm

Traditional AI scaling invested compute during training: more parameters, more data, more GPU hours. Test-time scaling inverts the equation by investing compute during inference.8 Rather than building larger models, researchers enable smaller models to "think longer" on difficult problems.

OpenAI's o1 family demonstrated this paradigm at scale, generating extended chains of reasoning before producing final answers.9 The models achieve state-of-the-art results on mathematical reasoning and coding tasks by spending orders of magnitude more inference compute than standard chat models.10

The approach raises an obvious question: how much complexity does test-time scaling actually require?

The s1 Approach: Radical Simplicity

The s1 team pursued the simplest possible implementation that still achieves competitive performance.11 Their method involves three components:

1. Dataset Curation (s1K)

Starting from approximately 59,000 questions spanning mathematics, science, and puzzle domains, researchers applied three filtering criteria:12

  • Difficulty: select problems requiring extended reasoning. Implemented by choosing questions where Claude 3.5 Sonnet needed >4,000 thinking tokens.13
  • Diversity: prevent overfitting to narrow problem types. Implemented by clustering questions and sampling across clusters.14
  • Quality: ensure correct reasoning traces. Implemented via human verification of solution accuracy.15

The resulting s1K dataset contains just 1,000 question-answer pairs with detailed reasoning traces.16 For context, typical instruction-tuning datasets contain tens of thousands to millions of examples.
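As a rough illustration, the filtering logic can be expressed in a few lines of Python. The sketch below is not the paper's actual pipeline: the record fields (thinking_tokens, verified, domain), the threshold, and the round-robin sampler are illustrative assumptions that mirror the three criteria above.

    import random
    from collections import defaultdict

    def curate(candidates, target_size=1000, min_thinking_tokens=4000, seed=0):
        """Three-stage filter: difficulty, then quality, then diversity-aware sampling."""
        rng = random.Random(seed)
        # Difficulty: keep questions whose reference trace required long reasoning.
        hard = [ex for ex in candidates if ex["thinking_tokens"] > min_thinking_tokens]
        # Quality: keep only examples whose solutions passed verification.
        verified = [ex for ex in hard if ex.get("verified", False)]
        # Diversity: group by domain and sample round-robin across groups.
        by_domain = defaultdict(list)
        for ex in verified:
            by_domain[ex["domain"]].append(ex)
        for group in by_domain.values():
            rng.shuffle(group)
        selected = []
        groups = list(by_domain.values())
        while len(selected) < target_size and any(groups):
            for group in groups:
                if group and len(selected) < target_size:
                    selected.append(group.pop())
        return selected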

2. Standard Fine-Tuning

The team fine-tuned Qwen2.5-32B-Instruct using standard supervised learning on s1K.17 No reinforcement learning from human feedback. No process reward models to score intermediate reasoning steps. No specialized training infrastructure.18

Training completed in under 26 minutes on 16 H100 GPUs.19
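A plain supervised fine-tuning run in this spirit might look like the following sketch using Hugging Face Transformers. The data file, field names, and hyperparameters are illustrative assumptions, not the released training configuration.

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    model_name = "Qwen/Qwen2.5-32B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # In practice a 32B model needs multi-GPU sharding (e.g. FSDP or DeepSpeed).
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

    raw = load_dataset("json", data_files="s1k.jsonl")["train"]  # 1,000 curated examples

    def tokenize(example):
        # Train on the full sequence: question followed by reasoning trace and answer.
        text = example["question"] + "\n" + example["solution"]
        return tokenizer(text, truncation=True, max_length=8192)

    train = raw.map(tokenize, remove_columns=raw.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="s1-sft",
            num_train_epochs=5,
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            learning_rate=1e-5,
            bf16=True,
            logging_steps=10,
        ),
        train_dataset=train,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()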

3. Budget Forcing at Inference

Budget forcing constitutes the paper's core technical contribution. The technique controls inference-time computation through two mechanisms:20

Forced Termination: To cap the compute budget, the system appends the end-of-thinking delimiter (optionally followed by a cue such as "Final Answer:") once the target reasoning length is reached, forcing the model to stop deliberating and produce its answer.21

Forced Continuation: To extend the budget, the system suppresses the end-of-thinking delimiter when the model tries to stop early and appends "Wait" instead.22 Repeating this insertion lengthens reasoning chains arbitrarily; the model interprets the token as a signal to reconsider its approach, often catching and correcting errors in previous reasoning steps.23

The name "budget forcing" reflects controlling the compute "budget" spent on each query.
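In practice, budget forcing is a small wrapper around ordinary decoding. The sketch below assumes a hypothetical generate(prompt, stop, max_new_tokens) helper wrapping whatever serving stack is in use; the delimiter string and the whitespace token count are placeholders, not the authors' implementation.

    END_OF_THINKING = "<|end_of_thinking|>"  # placeholder for the model's real delimiter
    WAIT = "Wait"

    def count_tokens(text: str) -> int:
        # Crude whitespace proxy; swap in the model's tokenizer for real budgets.
        return len(text.split())

    def budget_forced_reasoning(question, generate,
                                min_thinking_tokens=2000,
                                max_thinking_tokens=8000,
                                max_wait_insertions=4):
        """Suppress or force the end-of-thinking delimiter to control the thinking budget."""
        trace = ""
        for _ in range(max_wait_insertions + 1):
            remaining = max(1, max_thinking_tokens - count_tokens(trace))
            # Generate until the model tries to stop thinking or the hard cap is hit.
            trace += generate(question + trace, stop=END_OF_THINKING,
                              max_new_tokens=remaining)
            if count_tokens(trace) >= min_thinking_tokens:
                break                # minimum budget met: let the model stop
            trace += WAIT            # too short: suppress the stop and keep thinking
        # Cap the budget: append the delimiter ourselves and ask for the final answer.
        trace += END_OF_THINKING + "\nFinal Answer:"
        return generate(question + trace, stop=None, max_new_tokens=512)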

Benchmark Results

s1-32B demonstrates consistent improvements over o1-preview across mathematical reasoning benchmarks:24

  • Competition math overall (MATH and AIME24): s1-32B exceeds o1-preview by up to 27%25
  • AIME 2024 with budget forcing: 57%26 for s1-32B versus roughly 44% for o1-preview, a gap of about 13 points
  • AIME 2024 without budget forcing: 50%27 versus roughly 44%, a gap of about 6 points

Note that the headline 27% figure is a relative improvement over o1-preview's score, while the AIME gaps above are absolute percentage points.

The AIME (American Invitational Mathematics Examination) represents a particularly challenging benchmark, featuring competition-level problems that test deep mathematical reasoning.28

Scaling Behavior

Performance improves predictably with inference compute. On AIME 2024, accuracy increased from 50% to 57% as budget forcing extended reasoning chains.29 Within the range tested, additional test-time compute produced meaningful gains rather than immediately diminishing returns.30

This scaling property suggests s1's approach captures something fundamental about how extended reasoning improves performance, not merely an artifact of the training data.
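One way to reproduce such a curve is to sweep the minimum thinking budget and measure accuracy at each setting. The sketch below reuses the budget_forced_reasoning helper from the earlier snippet; generate, aime_2024, and grade are placeholders for a real serving backend, benchmark loader, and answer checker.

    def scaling_curve(generate, aime_2024, grade, budgets=(1000, 2000, 4000, 8000)):
        """Return (budget, accuracy) pairs by sweeping the minimum thinking budget."""
        curve = []
        for budget in budgets:
            correct = 0
            for question, gold in aime_2024:
                answer = budget_forced_reasoning(
                    question, generate,
                    min_thinking_tokens=budget,
                    max_thinking_tokens=budget * 2,
                )
                correct += int(grade(answer, gold))
            curve.append((budget, correct / len(aime_2024)))
        return curve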

Why Budget Forcing Works

The researchers hypothesize that budget forcing succeeds because it triggers specific beneficial behaviors:31

Self-Verification: Extended reasoning allows the model to check its own work, identifying logical errors or calculation mistakes before committing to an answer.32

Alternative Exploration: When forced to continue thinking, models often explore alternative solution paths they would otherwise skip.33

Error Correction: The "Wait" token appears to function as a metacognitive prompt, signaling the model to pause and reconsider rather than rush to a conclusion.34

Qualitative analysis of reasoning traces shows models frequently change their answers after budget forcing, with corrections more often moving from wrong to right than vice versa.35

Implications for the Field

Democratizing Reasoning Models

The s1 approach suggests reasoning capabilities may prove more accessible than previously assumed. The key requirements:36

  • Data: 1,000 high-quality examples (curated from public sources)
  • Compute: 26 minutes on 16 H100s for training
  • Infrastructure: Standard fine-tuning pipeline
  • Expertise: Dataset curation judgment

Organizations without OpenAI's resources can potentially build competitive reasoning models using s1's methodology.

Open Questions

The paper acknowledges limitations and open problems:37

Generalization: Does the approach transfer to non-mathematical domains? Early results on coding and general reasoning appear promising but less dramatic.38

Ceiling Effects: How far can budget forcing scale? Extremely long reasoning chains may eventually produce diminishing or negative returns.39

Training Integration: Would reinforcement learning on top of s1's approach yield further gains? The simplicity advantage might disappear if RL proves necessary for additional capability.40

The Research Team

The paper represents a collaboration across multiple institutions:41

  • Stanford University: Niklas Muennighoff, Zitong Yang, Xiang Lisa Li, Fei-Fei Li, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto
  • University of Washington: Weijia Shi, Hannaneh Hajishirzi, Luke Zettlemoyer
  • Additional affiliations: Allen Institute for AI and Contextual AI

The combination of statistical learning expertise (Candès), NLP research (Hajishirzi, Zettlemoyer), and work on reliable machine learning (Hashimoto) is reflected in the paper's emphasis on simple, interpretable methods.

Open Source Release

The researchers released all components publicly:42

  • Model Weights: s1-32B available on Hugging Face
  • Dataset: s1K published with full reasoning traces
  • Code: Training and inference pipelines on GitHub

This openness enables direct replication and extension by the research community.
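A quick start for loading the released artifacts might look like the sketch below. The Hugging Face repository ids shown (simplescaling/s1K, simplescaling/s1-32B) are assumptions based on the project's naming and should be confirmed against the GitHub README.

    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer

    s1k = load_dataset("simplescaling/s1K")  # 1,000 curated questions with reasoning traces
    tokenizer = AutoTokenizer.from_pretrained("simplescaling/s1-32B")
    model = AutoModelForCausalLM.from_pretrained("simplescaling/s1-32B", torch_dtype="auto")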

Key Takeaways

The s1 paper challenges prevailing assumptions about test-time scaling complexity:

  1. Quality over Quantity: 1,000 carefully curated examples can rival far larger, less selective datasets for reasoning fine-tuning
  2. Simplicity Wins: Budget forcing achieves competitive results without RL, reward models, or exotic architectures
  3. Scaling Predictability: Performance improves reliably with inference compute, enabling cost-performance tradeoffs
  4. Accessibility: The approach requires modest resources compared to training frontier reasoning models from scratch

For organizations evaluating reasoning model strategies, s1 demonstrates that sophisticated capabilities may emerge from surprisingly simple interventions on strong base models.


References


  1. Muennighoff, N., Yang, Z., Shi, W., Li, X.L., Li, F.-F., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., Hashimoto, T. "s1: Simple test-time scaling." arXiv:2501.19393. January 2025. https://arxiv.org/abs/2501.19393 

  2. Ibid., Abstract. 

  3. Ibid., Section 3: Budget Forcing. 

  4. Ibid., Section 2: The s1K Dataset. 

  5. Ibid., Section 4: Experiments. 

  6. Ibid., Table 2: AIME 2024 Results. 

  7. Ibid., Section 1: Introduction. 

  8. Snell, C., Lee, J., Xu, K., Kumar, A. "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters." arXiv:2408.03314. August 2024. https://arxiv.org/abs/2408.03314 

  9. OpenAI. "Learning to Reason with LLMs." OpenAI Blog. September 2024. https://openai.com/index/learning-to-reason-with-llms/ 

  10. Ibid. 

  11. Muennighoff et al., op. cit., Section 1. 

  12. Ibid., Section 2.1: Curation Pipeline. 

  13. Ibid., Section 2.1.1: Difficulty Filtering. 

  14. Ibid., Section 2.1.2: Diversity Sampling. 

  15. Ibid., Section 2.1.3: Quality Verification. 

  16. Ibid., Table 1: Dataset Statistics. 

  17. Ibid., Section 3.1: Training Setup. 

  18. Ibid. 

  19. Ibid., Section 3.1: Training Efficiency. 

  20. Ibid., Section 3.2: Budget Forcing. 

  21. Ibid. 

  22. Ibid. 

  23. Ibid., Section 5: Analysis. 

  24. Ibid., Section 4: Results. 

  25. Ibid., Abstract. 

  26. Ibid., Table 2. 

  27. Ibid. 

  28. Mathematical Association of America. "American Invitational Mathematics Examination." https://www.maa.org/math-competitions/aime 

  29. Muennighoff et al., op. cit., Figure 3: Scaling Curves. 

  30. Ibid. 

  31. Ibid., Section 5: Why Does Budget Forcing Work? 

  32. Ibid., Section 5.1. 

  33. Ibid., Section 5.2. 

  34. Ibid. 

  35. Ibid., Section 5.3: Qualitative Analysis. 

  36. Ibid., Section 6: Discussion. 

  37. Ibid., Section 7: Limitations. 

  38. Ibid. 

  39. Ibid. 

  40. Ibid. 

  41. Ibid., Author Affiliations. 

  42. Ibid., Section 8: Open Release. GitHub: https://github.com/simplescaling/s1 
