s1: How 1,000 Training Examples Beat OpenAI's o1-preview by 27%
A team of researchers from Stanford, the University of Washington, and other institutions asked a simple question: what is the minimum viable approach to test-time scaling?[1] Their answer upends assumptions about the computational requirements for building reasoning models. The s1 model, fine-tuned on just 1,000 carefully selected examples, exceeds OpenAI's o1-preview by up to 27% on competition mathematics benchmarks.[2]
TL;DR
The s1 paper introduces "budget forcing," a technique that controls how long a model thinks by either terminating reasoning early or appending "Wait" tokens to extend deliberation.[3] The researchers curated s1K, a dataset of 1,000 questions selected for difficulty, diversity, and quality from a pool of 59,000 candidates.[4] Fine-tuning Qwen2.5-32B-Instruct on s1K produced a model that scales predictably with inference compute.[5] On AIME 2024, s1-32B achieves 57% accuracy with extended thinking versus roughly 44% for o1-preview.[6] The entire approach requires no reinforcement learning, no process reward models, and no specialized infrastructure beyond standard fine-tuning.[7]
The Test-Time Scaling Paradigm
Traditional AI scaling invests compute during training: more parameters, more data, more GPU hours. Test-time scaling inverts the equation by investing compute during inference.[8] Rather than building larger models, researchers enable smaller models to "think longer" on difficult problems.
OpenAI's o1 family demonstrated this paradigm at scale, generating extended chains of reasoning before producing final answers.[9] The models achieve state-of-the-art results on mathematical reasoning and coding tasks by spending orders of magnitude more inference compute than standard chat models.[10]
The approach raises an obvious question: how much complexity does test-time scaling actually require?
The s1 Approach: Radical Simplicity
The s1 team pursued the simplest possible implementation that still achieves competitive performance.[11] Their method involves three components:
1. Dataset Curation (s1K)
Starting from approximately 59,000 questions spanning mathematics, science, and puzzle domains, the researchers applied three filtering criteria:[12]
| Criterion | Purpose | Implementation |
|---|---|---|
| Difficulty | Select problems requiring extended reasoning | Removed questions that smaller Qwen2.5 models could already solve, treating longer reasoning traces as a signal of harder problems[13] |
| Diversity | Prevent overfitting to narrow problem types | Classified questions by domain and sampled across domains[14] |
| Quality | Ensure clean reasoning traces | Filtered out examples with API errors or formatting problems[15] |
The resulting s1K dataset contains just 1,000 question-answer pairs with detailed reasoning traces.[16] For context, typical instruction-tuning datasets contain tens of thousands to millions of examples.
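To make the pipeline concrete, here is a minimal Python sketch of the three-stage filter. The dictionary fields (`well_formatted`, `solved_by_small_model`, `domain`) are hypothetical stand-ins for the paper's actual signals, not its real schema:

```python
# Illustrative sketch of an s1K-style curation pipeline; field names are
# hypothetical stand-ins, not the authors' actual code or schema.
import random
from collections import defaultdict

def curate(candidates, target_size=1000, seed=0):
    rng = random.Random(seed)

    # Quality: drop malformed or badly formatted examples.
    pool = [ex for ex in candidates if ex["well_formatted"]]

    # Difficulty: drop questions a weaker model already answers correctly.
    pool = [ex for ex in pool if not ex["solved_by_small_model"]]

    # Diversity: bucket by domain, then take examples round-robin across
    # domains so no single topic dominates the final 1,000.
    buckets = defaultdict(list)
    for ex in pool:
        buckets[ex["domain"]].append(ex)
    for bucket in buckets.values():
        rng.shuffle(bucket)

    selected = []
    while len(selected) < target_size and any(buckets.values()):
        for domain in list(buckets):
            if buckets[domain] and len(selected) < target_size:
                selected.append(buckets[domain].pop())
    return selected
```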
2. Standard Fine-Tuning
The team fine-tuned Qwen2.5-32B-Instruct using standard supervised learning on s1K.[17] No reinforcement learning from human feedback. No process reward models to score intermediate reasoning steps. No specialized training infrastructure.[18]
Training completed in about 26 minutes on 16 H100 GPUs.[19]
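Because the recipe is plain supervised fine-tuning, it maps directly onto off-the-shelf tooling. Below is a minimal sketch using Hugging Face TRL; the dataset ID and hyperparameters are illustrative assumptions, and the authors' actual training code lives in their GitHub repository:

```python
# Minimal SFT sketch with Hugging Face TRL. The dataset ID and all
# hyperparameters are illustrative assumptions, not the paper's exact setup.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("simplescaling/s1K", split="train")  # assumed HF ID
# Depending on the dataset schema, you may need to map each example to a
# single text field containing question + reasoning trace + answer.

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-32B-Instruct",  # base model named in the paper
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="s1-sft",
        num_train_epochs=5,             # illustrative
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```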
3. Budget Forcing at Inference
Budget forcing is the paper's core technical contribution. The technique controls inference-time computation through two mechanisms:[20]
Forced Termination: To cap thinking at a maximum budget, the system appends the end-of-thinking token delimiter once the budget is spent, forcing the model to stop deliberating and produce its final answer.[21]
Forced Continuation: To enforce a minimum budget, the system suppresses the end-of-thinking token when the model tries to stop early and appends "Wait" instead.[22] By repeatedly inserting "Wait", researchers can extend reasoning chains arbitrarily; the model interprets the token as a signal to reconsider its approach, often catching and correcting errors in earlier steps.[23]
The name "budget forcing" reflects controlling the compute "budget" spent on each query. A minimal decode-loop sketch follows.
Benchmark Results
s1-32B demonstrates consistent improvements over o1-preview across mathematical reasoning benchmarks:[24]
| Benchmark | s1-32B | o1-preview | Improvement |
|---|---|---|---|
| MATH | Exceeds o1-preview by up to 27% (relative)[25] | Baseline | Up to +27% |
| AIME 2024 (with budget forcing) | 57%[26] | ~44% | +13 points |
| AIME 2024 (without budget forcing) | 50%[27] | ~44% | +6 points |
The AIME (American Invitational Mathematics Examination) is a particularly challenging benchmark, featuring competition-level problems that test deep mathematical reasoning.[28]
Scaling Behavior
Performance improves predictably with inference compute. On AIME 2024, accuracy rose from 50% to 57% as budget forcing extended reasoning chains.[29] Additional test-time compute yields meaningful gains rather than immediate diminishing returns.[30]
This scaling property suggests s1's approach captures something fundamental about how extended reasoning improves performance, not merely an artifact of the training data.
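One way to reproduce this kind of curve is to sweep the number of forced extensions and score each setting. A hypothetical harness, assuming a `solve` function like the budget-forcing sketch above and a list of problems with gold answers:

```python
# Hypothetical accuracy-vs-budget sweep; `solve` and `problems` are
# stand-ins, not artifacts from the paper's codebase.
def sweep_budgets(problems, solve, extensions=(0, 1, 2, 4)):
    for n in extensions:
        correct = sum(
            solve(p["question"], min_extensions=n) == p["answer"]
            for p in problems
        )
        print(f"{n} forced extensions: {correct / len(problems):.1%}")
```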
Why Budget Forcing Works
The researchers hypothesize that budget forcing succeeds because it triggers specific beneficial behaviors:[31]
Self-Verification: Extended reasoning allows the model to check its own work, identifying logical errors or calculation mistakes before committing to an answer.[32]
Alternative Exploration: When forced to continue thinking, models often explore alternative solution paths they would otherwise skip.[33]
Error Correction: The "Wait" token appears to function as a metacognitive prompt, signaling the model to pause and reconsider rather than rushing to a conclusion.[34]
Qualitative analysis of reasoning traces shows models frequently change their answers after budget forcing, with corrections more often moving from wrong to right than vice versa.[35]
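That observation is straightforward to quantify if you log the model's answer before and after each forced "Wait". A sketch with hypothetical field names:

```python
# Tally answer transitions around budget forcing. The record schema
# (before / after / gold) is hypothetical, not the paper's format.
from collections import Counter

def tally_answer_changes(records):
    counts = Counter()
    for r in records:
        if r["before"] == r["after"]:
            counts["unchanged"] += 1
        elif r["after"] == r["gold"]:
            counts["wrong_to_right"] += 1
        elif r["before"] == r["gold"]:
            counts["right_to_wrong"] += 1
        else:
            counts["wrong_to_wrong"] += 1
    return counts
```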
Implications for the Field
Democratizing Reasoning Models
The s1 approach suggests reasoning capabilities may prove more accessible than previously assumed. The key requirements:[36]
- Data: 1,000 high-quality examples (curated from public sources)
- Compute: 26 minutes on 16 H100s for training
- Infrastructure: Standard fine-tuning pipeline
- Expertise: Dataset curation judgment
Organizations without OpenAI's resources can potentially build competitive reasoning models using s1's methodology.
Open Questions
The paper acknowledges limitations and open problems:[37]
Generalization: Does the approach transfer to non-mathematical domains? Early results on coding and general reasoning appear promising but less dramatic.[38]
Ceiling Effects: How far can budget forcing scale? Extremely long reasoning chains may eventually produce diminishing or negative returns.[39]
Training Integration: Would reinforcement learning on top of s1's approach yield further gains? The simplicity advantage might disappear if RL proves necessary for additional capability.[40]
The Research Team
The paper represents a collaboration across multiple institutions:[41]
- Stanford University: Niklas Muennighoff, Zitong Yang, Xiang Lisa Li, Fei-Fei Li, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto
- University of Washington: Weijia Shi, Hannaneh Hajishirzi, Luke Zettlemoyer
The combination of statistical learning expertise (Candès), NLP research (Hajishirzi, Zettlemoyer), and AI safety/alignment focus (Hashimoto) shaped the paper's emphasis on simple, interpretable methods.
Open Source Release
The researchers released all components publicly:[42]
- Model Weights: s1-32B available on Hugging Face
- Dataset: s1K published with full reasoning traces
- Code: Training and inference pipelines on GitHub
This openness enables direct replication and extension by the research community.
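Getting started takes a few lines. The Hub IDs below are assumptions inferred from the project's GitHub organization (github.com/simplescaling), so verify them against the repository:

```python
# Load the released artifacts; IDs are assumed, verify against the s1 repo.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

s1k = load_dataset("simplescaling/s1K", split="train")       # assumed dataset ID
tok = AutoTokenizer.from_pretrained("simplescaling/s1-32B")  # assumed model ID
model = AutoModelForCausalLM.from_pretrained("simplescaling/s1-32B")
print(s1k[0])  # inspect the question / reasoning-trace / answer fields
```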
Key Takeaways
The s1 paper challenges prevailing assumptions about test-time scaling complexity:
- Quantity vs. Quality: 1,000 carefully selected examples can rival far larger, noisier collections for reasoning fine-tuning
- Simplicity Wins: Budget forcing achieves competitive results without RL, reward models, or exotic architectures
- Scaling Predictability: Performance improves reliably with inference compute, enabling cost-performance tradeoffs
- Accessibility: The approach requires modest resources compared to training frontier reasoning models from scratch
For organizations evaluating reasoning model strategies, s1 demonstrates that sophisticated capabilities may emerge from surprisingly simple interventions on strong base models.
References
1. Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Li, F.-F., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., Hashimoto, T. "s1: Simple test-time scaling." arXiv:2501.19393. January 2025. https://arxiv.org/abs/2501.19393
2. Ibid., Abstract.
3. Ibid., Section 3: Budget Forcing.
4. Ibid., Section 2: The s1K Dataset.
5. Ibid., Section 4: Experiments.
6. Ibid., Table 2: AIME 2024 Results.
7. Ibid., Section 1: Introduction.
8. Snell, C., Lee, J., Xu, K., Kumar, A. "Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters." arXiv:2408.03314. August 2024. https://arxiv.org/abs/2408.03314
9. OpenAI. "Learning to Reason with LLMs." OpenAI Blog. September 2024. https://openai.com/index/learning-to-reason-with-llms/
10. Ibid.
11. Muennighoff et al., op. cit., Section 1.
12. Ibid., Section 2.1: Curation Pipeline.
13. Ibid., Section 2.1.1: Difficulty Filtering.
14. Ibid., Section 2.1.2: Diversity Sampling.
15. Ibid., Section 2.1.3: Quality Verification.
16. Ibid., Table 1: Dataset Statistics.
17. Ibid., Section 3.1: Training Setup.
18. Ibid.
19. Ibid., Section 3.1: Training Efficiency.
20. Ibid., Section 3.2: Budget Forcing.
21. Ibid.
22. Ibid.
23. Ibid., Section 5: Analysis.
24. Ibid., Section 4: Results.
25. Ibid., Abstract.
26. Ibid., Table 2.
27. Ibid.
28. Mathematical Association of America. "American Invitational Mathematics Examination." https://www.maa.org/math-competitions/aime
29. Muennighoff et al., op. cit., Figure 3: Scaling Curves.
30. Ibid.
31. Ibid., Section 5: Why Does Budget Forcing Work?
32. Ibid., Section 5.1.
33. Ibid., Section 5.2.
34. Ibid.
35. Ibid., Section 5.3: Qualitative Analysis.
36. Ibid., Section 6: Discussion.
37. Ibid., Section 7: Limitations.
38. Ibid.
39. Ibid.
40. Ibid.
41. Ibid., Author Affiliations.
42. Ibid., Section 8: Open Release. GitHub: https://github.com/simplescaling/s1