MiroThinker: The Third Scaling Dimension for AI Agents
AI scaling has focused on two dimensions: model size and context length.[1] MiroThinker introduces a third: interaction depth. Released in 8B, 30B, and 72B parameter variants, the research agent is trained via reinforcement learning to handle up to 600 tool calls per task.[2] On the GAIA benchmark, the 72B variant achieves 81.9% accuracy, approaching commercial systems like GPT-5-high while remaining fully open-source.[3]
TL;DR
MiroThinker explores "interaction scaling" at the model level, systematically training models to handle deeper and more frequent agent-environment interactions.[4] Unlike test-time scaling in isolation, interaction scaling uses environment feedback to correct errors and refine trajectories.[5] With a 256K context window, the agent performs up to 600 tool calls per task, enabling sustained multi-turn reasoning for complex research workflows.[6] Training proceeds in three phases: supervised fine-tuning, preference learning, and reinforcement learning with Group Relative Policy Optimization (GRPO).[7] Across the GAIA, HLE, BrowseComp, and BrowseComp-ZH benchmarks, MiroThinker demonstrates consistent improvements as interaction depth increases.[8]
The Agent Scaling Problem
Current AI agents face a fundamental limitation. As reasoning chains extend, errors compound.[9] A single mistake early in a trajectory can derail the entire task. Traditional approaches address this through:
- Larger Models: More parameters for better single-step accuracy[10]
- Longer Context: More space to hold reasoning history[11]
- Better Prompting: Improved instructions to reduce errors[12]
Yet these interventions don't address the core issue: during extended reasoning, the agent operates in isolation from its environment.
Reasoning Drift
Long reasoning chains without environmental feedback exhibit "reasoning drift"—gradual divergence from correct trajectories.[13] The agent continues reasoning based on increasingly stale or incorrect assumptions.
| Chain Length | Error Rate | Cause |
|---|---|---|
| Short (1-5 steps) | Low | Limited compound error |
| Medium (5-20 steps) | Moderate | Accumulating mistakes |
| Long (20+ steps) | High | Reasoning drift dominates |
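The qualitative trend in the table falls out of simple probability: if each step is correct with probability p and nothing corrects a mistake, an n-step chain succeeds with probability p^n. The sketch below treats errors as independent, a simplifying assumption rather than a claim from the paper:

```python
# Success probability of an n-step chain when each step is correct with
# probability p and no feedback corrects mistakes (independence is a
# simplifying assumption).
def chain_success(p: float, n: int) -> float:
    return p ** n

for n in (5, 20, 50):
    print(f"{n:>2} steps at 95% per-step accuracy: {chain_success(0.95, n):.1%}")
```

Even with 95% per-step accuracy, success drops from roughly three quarters at 5 steps to under a tenth at 50 steps, which is why long chains need an external correction mechanism.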
The Feedback Solution
MiroThinker's insight: let the environment correct the agent continuously.[14] Rather than reasoning in isolation, the agent checks its work by interacting with external tools, catching errors before they compound.
Interaction Scaling Defined
Interaction scaling treats the depth of agent-environment interaction as a scalable dimension analogous to model size or context length.[15]
The Three Dimensions
| Dimension | What Scales | How It Helps |
|---|---|---|
| Model Size | Parameters | Better single-step quality |
| Context Length | Token window | More information available |
| Interaction Depth | Tool calls | Error correction, grounding |
Why Interaction Differs
Unlike model size (fixed at training) or context (passive storage), interaction depth enables active verification and course correction.[16]
- Passive Scaling: Larger models and contexts provide more capacity
- Active Scaling: More interactions provide more opportunities to check, correct, and refine
The agent can discover that a web search returned outdated information, that a file path doesn't exist, or that a calculation produced unexpected results—and adapt accordingly.
MiroThinker Architecture
The agent follows the ReAct framework with specific enhancements for deep interaction:[17]
Core Loop
Thought → Action (Tool Call) → Observation → Thought → ...
Each observation feeds back into the agent's context, informing subsequent reasoning.[18]
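A minimal version of this loop can be sketched as follows; `llm` and `tools` are hypothetical stand-ins for a model client and a tool registry, not MiroThinker's actual interfaces:

```python
# Minimal ReAct-style loop. `llm` and `tools` are hypothetical stand-ins
# for a model client and a tool registry, not MiroThinker's actual API.
def run_agent(llm, tools, task: str, max_calls: int = 600) -> str:
    context = [f"Task: {task}"]
    for _ in range(max_calls):
        step = llm.generate("\n".join(context))   # Thought + proposed action
        context.append(f"Thought: {step.thought}")
        if step.final_answer is not None:         # model chose to stop
            return step.final_answer
        result = tools[step.tool](step.args)      # Action: execute tool call
        context.append(f"Observation: {result}")  # feedback enters context
    return "Stopped: interaction budget exhausted"
```

The key property is that every tool result is appended to the context before the next thought, so later reasoning is grounded in what the environment actually returned.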
Tool Suite
MiroThinker includes a comprehensive toolkit:[19]
| Tool Category | Examples |
|---|---|
| Web Search | Query formulation, result parsing |
| Web Browsing | Page navigation, content extraction |
| Code Execution | Python runtime, result analysis |
| File Operations | Read, write, analyze documents |
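A toolkit like this is typically exposed to the agent through a name-to-function registry. The sketch below is illustrative, with stubbed tools rather than MiroThinker's real implementations:

```python
from typing import Callable, Dict

# Minimal name-to-function tool registry of the kind the table implies.
# Tool names and signatures are illustrative, not MiroThinker's interface.
TOOLS: Dict[str, Callable[[str], str]] = {}

def register(name: str):
    """Decorator that records a function under a tool name."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = fn
        return fn
    return wrap

@register("web_search")
def web_search(query: str) -> str:
    return f"[stub] results for: {query}"

@register("read_file")
def read_file(path: str) -> str:
    return f"[stub] contents of: {path}"

def dispatch(name: str, arg: str) -> str:
    """Route a tool call; unknown tools become observations, not crashes."""
    if name not in TOOLS:
        return f"Error: unknown tool '{name}'"
    return TOOLS[name](arg)
```

Returning an error string for unknown or failed tools keeps failures inside the feedback loop, giving the agent a chance to self-correct instead of terminating.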
600 Tool Calls
The 256K context window supports up to 600 tool calls per task.[20] For context, most agent benchmarks involve fewer than 20 tool calls. MiroThinker operates at 30x the typical interaction depth.
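Some back-of-the-envelope arithmetic (round numbers, not figures from the paper) shows why sustaining 600 calls in one window is demanding:

```python
# Back-of-the-envelope budget (round numbers, not figures from the paper).
context_window = 256_000   # tokens
max_tool_calls = 600

budget_per_call = context_window / max_tool_calls
print(f"~{budget_per_call:.0f} tokens per thought/action/observation cycle")
# → ~427 tokens per thought/action/observation cycle
```

At roughly 427 tokens per cycle on average, raw web pages and tool outputs cannot be kept verbatim; aggressive truncation or summarization of observations is implied.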
Training Methodology
MiroThinker training proceeds in three phases:[21]
Phase 1: Supervised Fine-Tuning
Initial training on successful agent trajectories teaches basic tool use patterns:[22]
- When to search vs. browse
- How to formulate effective queries
- Interpreting tool outputs
- Synthesizing multi-source information
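Trajectory SFT commonly masks the loss to tokens the model itself produced (thoughts and actions), excluding tool observations it merely received. The segment layout and token ids below are illustrative, not the paper's exact format:

```python
# Sketch of SFT loss masking over an agent trajectory: train only on
# tokens the model produced (thoughts, actions), not on tool observations.
# The segment layout and token ids are illustrative.
def build_loss_mask(segments):
    """segments: list of (role, token_ids); returns flat ids and a 0/1 mask."""
    ids, mask = [], []
    for role, toks in segments:
        trainable = 1 if role in ("thought", "action") else 0
        ids.extend(toks)
        mask.extend([trainable] * len(toks))
    return ids, mask

ids, mask = build_loss_mask([
    ("task",        [1, 2, 3]),
    ("thought",     [4, 5]),
    ("action",      [6]),
    ("observation", [7, 8, 9]),
    ("thought",     [10]),
])
# Loss is computed only where mask == 1 (tokens 4, 5, 6, 10).
```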
Phase 2: Preference Learning
The model learns to prefer successful trajectories over failed ones:[23]
- Binary feedback on trajectory outcomes
- Implicit learning of error recovery
- Preference for efficient tool sequences
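The paper's exact objective isn't reproduced here, but a common way to operationalize "prefer successful trajectories" is a DPO-style pairwise loss over (successful, failed) trajectory pairs:

```python
import math

# DPO-style pairwise loss: one plausible instantiation of "prefer
# successful over failed trajectories" (illustrative; not necessarily
# the paper's exact objective).
def preference_loss(logp_chosen, logp_rejected,
                    ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(margin)) on the implicit reward margin of a pair."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss falls as the policy assigns relatively more probability to the successful trajectory than a frozen reference model does, which implicitly teaches error recovery without per-step labels.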
Phase 3: Reinforcement Learning
Group Relative Policy Optimization (GRPO) trains for extended interaction:[24]
- Rewards for correct final answers
- Implicit credit assignment across long trajectories
- Learning when to persist vs. pivot strategies
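The group-relative idea at GRPO's core can be sketched in a few lines: sample several rollouts of the same task, score each, and normalize rewards within the group, so no learned value model is needed. This is a sketch of the general technique, not MiroThinker's training code:

```python
# Core of group-relative advantage estimation (a sketch of the GRPO
# idea): sample several rollouts of one task, score them, and normalize
# rewards within the group so no learned value model is needed.
def group_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of the same task: two succeeded, two failed.
advantages = group_advantages([1.0, 1.0, 0.0, 0.0])
# Successful rollouts get positive advantage, failed ones negative.
```

Because the advantage applies to the whole trajectory, credit assignment across hundreds of tool calls is implicit: every action in a winning rollout is reinforced relative to the group.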
Base Models
MiroThinker builds on open-weight foundations:[25]
| Size | Base Model |
|---|---|
| 8B | Qwen3-8B |
| 30B | Qwen3-30B |
| 72B | Qwen2.5-72B |
Benchmark Performance
GAIA (General AI Assistants)
GAIA tests realistic assistant tasks requiring web search, reasoning, and multi-step problem solving:[26]
| Model | Accuracy |
|---|---|
| MiroThinker-72B | 81.9% |
| GPT-5-high | ~85% (estimated) |
| Previous open-source SOTA | ~65% |
MiroThinker approaches commercial performance while remaining fully open.
HLE (Humanity's Last Exam)
Extremely challenging questions across diverse domains:[27]
| Model | Accuracy |
|---|---|
| MiroThinker-72B | 37.7% |
| Human expert | Variable |
BrowseComp
Complex web browsing and information synthesis:[28]
| Model | Accuracy |
|---|---|
| MiroThinker-72B (English) | 47.1% |
| MiroThinker-72B (Chinese) | 55.6% |
The Chinese performance suggests strong multilingual transfer.
Scaling Behavior
Critical finding: performance improves predictably with interaction depth.[29]
As MiroThinker engages in more tool calls:
- Accuracy increases (up to hardware/context limits)
- Error recovery becomes more effective
- Complex tasks become tractable
This suggests that interaction depth exhibits genuine scaling behavior rather than rapidly diminishing returns.
Comparison to Other Approaches
vs. Chain-of-Thought
| Dimension | Chain-of-Thought | MiroThinker |
|---|---|---|
| Feedback | None (isolated reasoning) | Continuous (tool results) |
| Error handling | Hope for the best | Detect and correct |
| Grounding | Text patterns only | External verification |
vs. ReAct Agents
| Dimension | Standard ReAct | MiroThinker |
|---|---|---|
| Interaction depth | 10-20 calls typical | Up to 600 calls |
| Training | Prompt engineering | RL for deep interaction |
| Persistence | Short tasks | Extended workflows |
vs. Test-Time Scaling
| Dimension | Test-Time Scaling | Interaction Scaling |
|---|---|---|
| Compute use | Extended reasoning | Extended interaction |
| Verification | Self-consistency | Environmental feedback |
| Training focus | Reasoning quality | Interaction handling |
Why Interaction Scaling Works
The paper identifies several mechanisms behind interaction scaling's effectiveness:[30]
Error Detection
More tool calls create more opportunities to discover mistakes:[31]
- Contradictory search results reveal incorrect assumptions
- Failed operations expose invalid states
- Unexpected outputs trigger reconsideration
Information Acquisition
Extended interaction gathers more relevant information:[32]
- Follow-up searches refine understanding
- Multiple sources enable cross-validation
- Deep browsing uncovers hidden details
Strategy Refinement
Long trajectories allow strategy evolution:[33]
- Initial approaches can be abandoned
- New angles can be explored
- Synthesis can incorporate late-arriving information
Open Source Release
The MiroMind team released comprehensive resources:[34]
Models
| Variant | HuggingFace |
|---|---|
| MiroThinker-v1.0-8B | Available |
| MiroThinker-v1.0-30B | Available |
| MiroThinker-v1.0-72B | Available |
| MiroThinker-v1.5-30B | Available (updated) |
Code
- Full training pipeline
- Inference implementation
- Tool integration examples
- Evaluation scripts
Documentation
- Architecture details
- Training recipes
- Deployment guides
- Benchmark reproduction
Implications for Agent Development
Training Paradigm Shift
Effective agents may require specific training for deep interaction, not just better base models.[35]
| Old Approach | New Approach |
|---|---|
| Train LLM, add tools | Train for tool use at depth |
| Prompt engineering | Reinforcement learning |
| Single-digit calls | Hundreds of calls |
Benchmark Evolution
Current agent benchmarks may prove too easy for interaction-scaled agents.[36] MiroThinker's 81.9% on GAIA suggests the need for more challenging evaluations.
Resource Requirements
600 tool calls per task implies significant compute and latency:[37]
- Extended inference time
- Multiple external API calls
- Large context window maintenance
Production deployment must account for these costs.
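A rough cost model makes the point concrete; every number below is a hypothetical placeholder, not a measurement from the paper:

```python
# Rough per-task cost model for deep-interaction agents. Every number
# here is a hypothetical placeholder, not a measurement from the paper.
def task_cost(n_calls=600, llm_latency_s=2.0, tool_latency_s=1.0,
              tokens_per_step=400, usd_per_1k_tokens=0.002):
    """Return (minutes of wall-clock latency, dollars of token cost)."""
    latency_s = n_calls * (llm_latency_s + tool_latency_s)
    total_tokens = n_calls * tokens_per_step
    return latency_s / 60, total_tokens * usd_per_1k_tokens / 1000

minutes, dollars = task_cost()
print(f"~{minutes:.0f} min and ~${dollars:.2f} per task at the assumed rates")
```

Even at these modest assumed rates, a single 600-call task takes on the order of half an hour of wall-clock time, so batching, caching, and early stopping matter in production.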
Key Takeaways
MiroThinker establishes interaction scaling as a viable third dimension for AI capability:
- New Dimension: Interaction depth scales like model size and context length
- 600 Tool Calls: Trained for 30x typical agent interaction depth
- 81.9% GAIA: Approaches commercial performance while fully open
- Three-Phase Training: SFT → Preference Learning → RL pipeline
- Error Correction: Environmental feedback prevents reasoning drift
- Open Release: Models, code, and training recipes all available
The next generation of AI agents may prove capable not through larger models alone, but through deeper engagement with their environments.
References
1. Kaplan, J. et al. "Scaling Laws for Neural Language Models." arXiv:2001.08361, January 2020. https://arxiv.org/abs/2001.08361
2. MiroMind Team. "MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling." arXiv:2511.11793, November 2025. https://arxiv.org/abs/2511.11793
3. Ibid., Table 1: Benchmark Results.
4. Ibid., Abstract.
5. Ibid., Section 1: Introduction.
6. Ibid.
7. Ibid., Section 3: Training.
8. Ibid., Section 4: Experiments.
9. Yao, S. et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023. https://arxiv.org/abs/2210.03629
10. Kaplan et al., op. cit.
11. Anthropic. "Claude's 200K Context Window." November 2024.
12. Wei, J. et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022.
13. MiroThinker paper, op. cit., Section 2.1.
14. Ibid., Section 2.2.
15. Ibid., Section 2: Interaction Scaling.
16. Ibid.
17. Ibid., Section 3.1: Architecture.
18. Yao et al., op. cit.
19. MiroThinker paper, op. cit., Section 3.2: Tool Suite.
20. Ibid., Abstract.
21. Ibid., Section 3.3: Training Pipeline.
22. Ibid., Section 3.3.1: SFT.
23. Ibid., Section 3.3.2: Preference Learning.
24. Ibid., Section 3.3.3: GRPO.
25. Ibid., Section 3.4: Base Models.
26. Ibid., Table 1.
27. Ibid.
28. Ibid.
29. Ibid., Section 5: Scaling Analysis.
30. Ibid., Section 6: Why Interaction Scaling Works.
31. Ibid., Section 6.1.
32. Ibid., Section 6.2.
33. Ibid., Section 6.3.
34. MiroMind. "MiroThinker." GitHub. https://github.com/MiroMindAI/MiroThinker
35. MiroThinker paper, op. cit., Section 7: Implications.
36. Ibid.
37. Ibid., Section 8: Limitations.
38. Neurohive. "MiroThinker v1.0: Open-Source AI Research Agent Learns to Make Up to 600 Tool Calls Per Task." November 2025. https://neurohive.io/en/state-of-the-art/mirothinker-v1-0-open-source-ai-research-agent-learns-to-make-up-to-600-tool-calls-per-task/
39. Hugging Face. "MiroThinker-v1.5-30B." https://huggingface.co/miromind-ai/MiroThinker-v1.5-30B