MiroThinker: The Third Scaling Dimension for AI Agents

MiroThinker introduces interaction scaling: training agents to handle up to 600 tool calls per task. 81.9% on the GAIA benchmark. A new dimension beyond model size and context length.


AI scaling has focused on two dimensions: model size and context length.[1] MiroThinker introduces a third: interaction depth. Released in 8B, 30B, and 72B parameter variants, the research agent is trained through reinforcement learning to handle up to 600 tool calls per task.[2] On the GAIA benchmark, the 72B variant achieves 81.9% accuracy, approaching commercial systems such as GPT-5-high while remaining fully open-source.[3]

TL;DR

MiroThinker explores "interaction scaling" at the model level, systematically training models to handle deeper and more frequent agent-environment interactions.[4] Unlike test-time scaling in isolation, interaction scaling uses environment feedback to correct errors and refine trajectories.[5] With a 256K context window, the agent performs up to 600 tool calls per task, enabling sustained multi-turn reasoning for complex research workflows.[6] Training uses three phases: supervised fine-tuning, preference learning, and reinforcement learning with Group Relative Policy Optimization.[7] Across the GAIA, HLE, BrowseComp, and BrowseComp-ZH benchmarks, MiroThinker demonstrates consistent improvements as interaction depth increases.[8]

The Agent Scaling Problem

Current AI agents face a fundamental limitation. As reasoning chains extend, errors compound.[9] A single mistake early in a trajectory can derail the entire task. Traditional approaches address this through:

  • Larger Models: More parameters for better single-step accuracy[10]
  • Longer Context: More space to hold reasoning history[11]
  • Better Prompting: Improved instructions to reduce errors[12]

Yet these interventions don't address the core issue: agents operating in isolation from their environment during extended reasoning.

Reasoning Drift

Long reasoning chains without environmental feedback exhibit "reasoning drift"—gradual divergence from correct trajectories.[13] The agent continues reasoning based on increasingly stale or incorrect assumptions.

| Chain Length | Error Rate | Cause |
| --- | --- | --- |
| Short (1-5 steps) | Low | Limited compound error |
| Medium (5-20 steps) | Moderate | Accumulating mistakes |
| Long (20+ steps) | High | Reasoning drift dominates |
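The compounding effect behind this table can be illustrated with a back-of-envelope model. The 95% per-step accuracy below is a hypothetical number, and the calculation assumes steps fail independently:

```python
# If each reasoning step succeeds independently with probability p,
# the chance an n-step chain stays entirely on track is p**n.
def chain_success(p: float, n: int) -> float:
    return p ** n

for n in (5, 20, 50):
    print(f"{n} steps at 95% per-step accuracy: {chain_success(0.95, n):.1%}")
# prints roughly 77.4%, 35.8%, and 7.7%
```

Even a strong per-step model degrades sharply at depth, which is why environmental feedback (rather than only better single-step accuracy) matters for long trajectories.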

The Feedback Solution

MiroThinker's insight: let the environment correct the agent continuously.[14] Rather than reasoning in isolation, the agent checks its work by interacting with external tools, catching errors before they compound.

Interaction Scaling Defined

Interaction scaling treats the depth of agent-environment interaction as a scalable dimension analogous to model size or context length.[15]

The Three Dimensions

| Dimension | What Scales | How It Helps |
| --- | --- | --- |
| Model Size | Parameters | Better single-step quality |
| Context Length | Token window | More information available |
| Interaction Depth | Tool calls | Error correction, grounding |

Why Interaction Differs

Unlike model size (fixed at training) or context (passive storage), interaction depth enables active verification and course correction.[16]

  • Passive Scaling: Larger models and contexts provide more capacity
  • Active Scaling: More interactions provide more opportunities to check, correct, and refine

The agent can discover that a web search returned outdated information, that a file path doesn't exist, or that a calculation produced unexpected results—and adapt accordingly.

MiroThinker Architecture

The agent follows the ReAct framework with specific enhancements for deep interaction:[17]

Core Loop

Thought → Action (Tool Call) → Observation → Thought → ...

Each observation feeds back into the agent's context, informing subsequent reasoning.[18]
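The core loop can be sketched as a minimal Python driver. The `llm` and `tools` arguments here are hypothetical stand-ins for the model and tool suite, not MiroThinker's actual interfaces:

```python
def react_loop(llm, tools: dict, task: str, max_calls: int = 600):
    """Minimal ReAct-style loop: think, act via a tool, observe, repeat."""
    context = [f"Task: {task}"]
    for _ in range(max_calls):
        step = llm("\n".join(context))           # one thought + proposed action
        if step.get("final_answer") is not None:
            return step["final_answer"]          # agent decided it is done
        observation = tools[step["tool"]](**step["args"])  # act and observe
        context.append(f"Thought: {step['thought']}")
        context.append(f"Observation: {observation}")      # feedback re-enters context
    return None  # interaction budget exhausted
```

The `max_calls=600` cap mirrors the interaction budget described above; each observation is appended to the context so the next reasoning step is grounded in real tool output.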

Tool Suite

MiroThinker includes a comprehensive toolkit:[19]

| Tool Category | Examples |
| --- | --- |
| Web Search | Query formulation, result parsing |
| Web Browsing | Page navigation, content extraction |
| Code Execution | Python runtime, result analysis |
| File Operations | Read, write, analyze documents |
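A common way to wire such a suite together is a name-to-function registry that the agent's action step dispatches into. The decorator and tool names below are illustrative, not MiroThinker's actual implementation:

```python
from typing import Callable

TOOLS: dict[str, Callable] = {}

def tool(name: str):
    """Register a function under the name the agent uses in its actions."""
    def register(fn: Callable) -> Callable:
        TOOLS[name] = fn
        return fn
    return register

@tool("search")
def web_search(query: str) -> str:
    # A real agent would call a search backend; stubbed for illustration.
    return f"results for: {query}"

@tool("python")
def run_python(code: str) -> str:
    # A real agent would use a sandboxed runtime, never bare exec().
    return "<execution result>"

# The agent's action step then becomes a dictionary lookup:
observation = TOOLS["search"](query="GAIA benchmark")
```

Keeping the registry flat makes it cheap to add or mock tools, which matters when a single task may route through hundreds of calls.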

600 Tool Calls

The 256K context window supports up to 600 tool calls per task.[20] For context, most agent benchmarks involve fewer than 20 tool calls. MiroThinker operates at 30x the typical interaction depth.

Training Methodology

MiroThinker training proceeds in three phases:[21]

Phase 1: Supervised Fine-Tuning

Initial training on successful agent trajectories teaches basic tool use patterns:[22]

  • When to search vs. browse
  • How to formulate effective queries
  • Interpreting tool outputs
  • Synthesizing multi-source information
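One plausible way to turn a successful trajectory into an SFT example is to flatten it into chat-format messages. The serialization below is a sketch of one reasonable format, not the paper's actual data schema:

```python
def trajectory_to_messages(task: str, steps: list, answer: str) -> list[dict]:
    """Flatten one successful trajectory into a chat-format SFT example.

    Each step is a (thought, tool_name, args, observation) tuple; the model
    is trained to produce the assistant turns given everything before them.
    """
    messages = [{"role": "user", "content": task}]
    for thought, tool_name, args, observation in steps:
        messages.append({"role": "assistant",
                         "content": f"Thought: {thought}\nAction: {tool_name}({args})"})
        messages.append({"role": "tool", "content": str(observation)})
    messages.append({"role": "assistant", "content": f"Final answer: {answer}"})
    return messages
```

Interleaving tool observations as their own turns is what lets the model learn to condition its next thought on real environment feedback rather than on its own prior text alone.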

Phase 2: Preference Learning

The model learns to prefer successful trajectories over failed ones:[23]

  • Binary feedback on trajectory outcomes
  • Implicit learning of error recovery
  • Preference for efficient tool sequences
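A standard instantiation of binary trajectory preferences is a DPO-style objective over trajectory log-probabilities. The paper's exact preference loss may differ, so treat this as an illustrative sketch:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO-style loss on whole-trajectory log-probs: push the policy to
    rank the successful trajectory above the failed one, measured
    relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

The loss shrinks as the policy assigns relatively more probability to the successful trajectory, which is how "binary feedback on trajectory outcomes" becomes a differentiable training signal.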

Phase 3: Reinforcement Learning

Group Relative Policy Optimization (GRPO) trains for extended interaction:[24]

  • Rewards for correct final answers
  • Implicit credit assignment across long trajectories
  • Learning when to persist vs. pivot strategies
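GRPO's defining trick is to replace a learned value baseline with the reward standardized within a group of rollouts for the same task. A minimal sketch of the advantage computation:

```python
def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: sample several trajectories for one task,
    then standardize each trajectory's reward against the group's mean and
    std, avoiding a separate value network entirely."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

Trajectories that beat their group average get positive advantage and are reinforced; implicit credit assignment follows because every token in a winning 600-call trajectory shares that advantage.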

Base Models

MiroThinker builds on open-weight foundations:[25]

| Size | Base Model |
| --- | --- |
| 8B | Qwen2.5-8B |
| 30B | Qwen3-30B |
| 72B | Qwen2.5-72B |

Benchmark Performance

GAIA (General AI Assistants)

GAIA tests realistic assistant tasks requiring web search, reasoning, and multi-step problem solving:[26]

| Model | Accuracy |
| --- | --- |
| MiroThinker-72B | 81.9% |
| GPT-5-high | ~85% (estimated) |
| Previous open-source SOTA | ~65% |

MiroThinker approaches commercial performance while remaining fully open.

HLE (Humanity's Last Exam)

Extremely challenging questions across diverse domains:[27]

| Model | Accuracy |
| --- | --- |
| MiroThinker-72B | 37.7% |
| Human expert | Variable |

BrowseComp

Complex web browsing and information synthesis:[28]

| Model | Accuracy |
| --- | --- |
| MiroThinker-72B (English) | 47.1% |
| MiroThinker-72B (Chinese) | 55.6% |

The Chinese performance suggests strong multilingual transfer.

Scaling Behavior

Critical finding: performance improves predictably with interaction depth.[29]

As MiroThinker engages in more tool calls:

  • Accuracy increases (up to hardware/context limits)
  • Error recovery becomes more effective
  • Complex tasks become tractable

This suggests that interaction depth exhibits genuine scaling behavior rather than rapidly diminishing returns.

Comparison to Other Approaches

vs. Chain-of-Thought

| Dimension | Chain-of-Thought | MiroThinker |
| --- | --- | --- |
| Feedback | None (isolated reasoning) | Continuous (tool results) |
| Error handling | Hope for the best | Detect and correct |
| Grounding | Text patterns only | External verification |

vs. ReAct Agents

| Dimension | Standard ReAct | MiroThinker |
| --- | --- | --- |
| Interaction depth | 10-20 calls typical | Up to 600 calls |
| Training | Prompt engineering | RL for deep interaction |
| Persistence | Short tasks | Extended workflows |

vs. Test-Time Scaling

| Dimension | Test-Time Scaling | Interaction Scaling |
| --- | --- | --- |
| Compute use | Extended reasoning | Extended interaction |
| Verification | Self-consistency | Environmental feedback |
| Training focus | Reasoning quality | Interaction handling |

Why Interaction Scaling Works

The paper identifies several mechanisms behind interaction scaling's effectiveness:[30]

Error Detection

More tool calls create more opportunities to discover mistakes:[31]

  • Contradictory search results reveal incorrect assumptions
  • Failed operations expose invalid states
  • Unexpected outputs trigger reconsideration

Information Acquisition

Extended interaction gathers more relevant information:[32]

  • Follow-up searches refine understanding
  • Multiple sources enable cross-validation
  • Deep browsing uncovers hidden details

Strategy Refinement

Long trajectories allow strategy evolution:[33]

  • Initial approaches can be abandoned
  • New angles can be explored
  • Synthesis can incorporate late-arriving information

Open Source Release

The MiroMind team released comprehensive resources:[34]

Models

| Variant | HuggingFace |
| --- | --- |
| MiroThinker-v1.0-8B | Available |
| MiroThinker-v1.0-30B | Available |
| MiroThinker-v1.0-72B | Available |
| MiroThinker-v1.5-30B | Available (updated) |

Code

  • Full training pipeline
  • Inference implementation
  • Tool integration examples
  • Evaluation scripts

Documentation

  • Architecture details
  • Training recipes
  • Deployment guides
  • Benchmark reproduction

Implications for Agent Development

Training Paradigm Shift

Effective agents may require specific training for deep interaction, not just better base models.[35]

| Old Approach | New Approach |
| --- | --- |
| Train LLM, add tools | Train for tool use at depth |
| Prompt engineering | Reinforcement learning |
| Single-digit calls | Hundreds of calls |

Benchmark Evolution

Current agent benchmarks may prove too easy for interaction-scaled agents.[36] MiroThinker's 81.9% on GAIA suggests the need for more challenging evaluations.

Resource Requirements

Running 600 tool calls per task implies significant compute and latency:[37]

  • Extended inference time
  • Multiple external API calls
  • Large context window maintenance

Production deployment must account for these costs.
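A back-of-envelope latency estimate makes the point concrete. All per-step numbers below are hypothetical assumptions for illustration, not measurements from the paper:

```python
# Rough wall-clock time for one deep-interaction task.
# Both per-step latencies are hypothetical assumptions.
TOOL_CALLS = 600
LLM_STEP_SECONDS = 2.0      # one reasoning step between tool calls
TOOL_LATENCY_SECONDS = 1.5  # average external API round-trip

total_seconds = TOOL_CALLS * (LLM_STEP_SECONDS + TOOL_LATENCY_SECONDS)
print(f"~{total_seconds / 60:.0f} minutes per task")  # ~35 minutes
```

Even with optimistic per-step latencies, deep-interaction tasks run for tens of minutes, so production systems need budgeting, caching, and parallel tool execution to stay practical.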

Key Takeaways

MiroThinker establishes interaction scaling as a viable third dimension for AI capability:

  1. New Dimension: Interaction depth scales like model size and context length
  2. 600 Tool Calls: Trained for 30x typical agent interaction depth
  3. 81.9% GAIA: Approaches commercial performance while fully open
  4. Three-Phase Training: SFT → Preference Learning → RL pipeline
  5. Error Correction: Environmental feedback prevents reasoning drift
  6. Open Release: Models, code, and training recipes all available

The next generation of AI agents may prove capable not through larger models alone, but through deeper engagement with their environments.


References


  1. Kaplan, J. et al. "Scaling Laws for Neural Language Models." arXiv:2001.08361. January 2020. https://arxiv.org/abs/2001.08361 

  2. MiroMind Team. "MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling." arXiv:2511.11793. November 2025. https://arxiv.org/abs/2511.11793 

  3. Ibid., Table 1: Benchmark Results. 

  4. Ibid., Abstract. 

  5. Ibid., Section 1: Introduction. 

  6. Ibid. 

  7. Ibid., Section 3: Training. 

  8. Ibid., Section 4: Experiments. 

  9. Yao, S. et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023. https://arxiv.org/abs/2210.03629 

  10. Kaplan et al., op. cit. 

  11. Anthropic. "Claude's 200K Context Window." November 2024. 

  12. Wei, J. et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022. 

  13. MiroThinker paper, op. cit., Section 2.1. 

  14. Ibid., Section 2.2. 

  15. Ibid., Section 2: Interaction Scaling. 

  16. Ibid. 

  17. Ibid., Section 3.1: Architecture. 

  18. Yao et al., op. cit. 

  19. MiroThinker paper, op. cit., Section 3.2: Tool Suite. 

  20. Ibid., Abstract. 

  21. Ibid., Section 3.3: Training Pipeline. 

  22. Ibid., Section 3.3.1: SFT. 

  23. Ibid., Section 3.3.2: Preference Learning. 

  24. Ibid., Section 3.3.3: GRPO. 

  25. Ibid., Section 3.4: Base Models. 

  26. Ibid., Table 1. 

  27. Ibid. 

  28. Ibid. 

  29. Ibid., Section 5: Scaling Analysis. 

  30. Ibid., Section 6: Why Interaction Scaling Works. 

  31. Ibid., Section 6.1. 

  32. Ibid., Section 6.2. 

  33. Ibid., Section 6.3. 

  34. MiroMind. "MiroThinker." GitHub. https://github.com/MiroMindAI/MiroThinker 

  35. MiroThinker paper, op. cit., Section 7: Implications. 

  36. Ibid. 

  37. Ibid., Section 8: Limitations. 

  38. Neurohive. "MiroThinker v1.0: Open-Source AI Research Agent Learns to Make Up to 600 Tool Calls Per Task." November 2025. https://neurohive.io/en/state-of-the-art/mirothinker-v1-0-open-source-ai-research-agent-learns-to-make-up-to-600-tool-calls-per-task/ 

  39. Hugging Face. "MiroThinker-v1.5-30B." https://huggingface.co/miromind-ai/MiroThinker-v1.5-30B 
