MiroThinker: The Third Scaling Dimension for AI Agents
AI scaling has focused on two dimensions: model size and context length.[1] MiroThinker introduces a third: interaction depth. Released in 8B, 30B, and 72B parameter variants, the research agent is trained via reinforcement learning to handle up to 600 tool calls per task.[2] On the GAIA benchmark, the 72B variant achieves 81.9% accuracy, approaching commercial systems like GPT-5-high while remaining fully open-source.[3]
TL;DR
MiroThinker explores "interaction scaling" at the model level, systematically training models to handle deeper and more frequent agent-environment interactions.[4] Unlike test-time scaling in isolation, interaction scaling uses environment feedback to correct errors and refine trajectories.[5] With a 256K context window, the agent performs up to 600 tool calls per task, enabling sustained multi-turn reasoning for complex research workflows.[6] Training proceeds in three phases: supervised fine-tuning, preference learning, and reinforcement learning with Group Relative Policy Optimization (GRPO).[7] Across the GAIA, HLE, BrowseComp, and BrowseComp-ZH benchmarks, MiroThinker demonstrates consistent improvements as interaction depth increases.[8]
The Agent Scaling Problem
Current AI agents face a fundamental limitation. As reasoning chains extend, errors compound.[9] A single mistake early in a trajectory can derail the entire task. Traditional approaches address this through:
- Larger Models: More parameters for better single-step accuracy[10]
- Longer Context: More space to hold reasoning history[11]
- Better Prompting: Improved instructions to reduce errors[12]
Yet these interventions don't address the core issue: during extended reasoning, the agent operates in isolation from its environment.
Reasoning Drift
Long reasoning chains without environmental feedback exhibit "reasoning drift"—gradual divergence from correct trajectories.[13] The agent continues reasoning based on increasingly stale or incorrect assumptions.
| Chain Length | Error Rate | Cause |
|---|---|---|
| Short (1-5 steps) | Low | Limited compound error |
| Medium (5-20 steps) | Moderate | Accumulating mistakes |
| Long (20+ steps) | High | Reasoning drift dominates |
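The qualitative trend in the table falls out of simple probability: if each step is correct with probability p and nothing corrects a mistake, an n-step chain succeeds with probability p^n. The sketch below treats errors as independent, a simplifying assumption rather than a claim from the paper:

```python
# Success probability of an n-step chain when each step is correct with
# probability p and no feedback corrects mistakes (independence is a
# simplifying assumption).
def chain_success(p: float, n: int) -> float:
    return p ** n

for n in (5, 20, 50):
    print(f"{n:>2} steps at 95% per-step accuracy: {chain_success(0.95, n):.1%}")
```

Even with 95% per-step accuracy, success drops from roughly three quarters at 5 steps to under a tenth at 50 steps, which is why long chains need an external correction mechanism.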
The Feedback Solution
MiroThinker's insight: let the environment correct the agent continuously.[14] Rather than reasoning in isolation, the agent checks its work by interacting with external tools, catching errors before they compound.
Interaction Scaling Defined
Interaction scaling treats the depth of agent-environment interaction as a scalable dimension analogous to model size or context length.[15]
The Three Dimensions
| Dimension | What Scales | How It Helps |
|---|---|---|
| Model Size | Parameters | Better single-step quality |
| Context Length | Token window | More information available |
| Interaction Depth | Tool calls | Error correction, grounding |
Why Interaction Differs
Unlike model size (fixed at training) or context (passive storage), interaction depth enables active verification and course correction.[16]
- Passive Scaling: Larger models and contexts provide more capacity
- Active Scaling: More interactions provide more opportunities to check, correct, and refine
The agent can discover that a web search returned outdated information, that a file path doesn't exist, or that a calculation produced unexpected results—and adapt accordingly.
MiroThinker Architecture
The agent follows the ReAct framework with specific enhancements for deep interaction:[17]
Core Loop
Thought → Action (Tool Call) → Observation → Thought → ...
Each observation feeds back into the agent's context, informing subsequent reasoning.[18]
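A minimal version of this loop can be sketched as follows; `llm` and `tools` are hypothetical stand-ins for a model client and a tool registry, not MiroThinker's actual interfaces:

```python
# Minimal ReAct-style loop. `llm` and `tools` are hypothetical stand-ins
# for a model client and a tool registry, not MiroThinker's actual API.
def run_agent(llm, tools, task: str, max_calls: int = 600) -> str:
    context = [f"Task: {task}"]
    for _ in range(max_calls):
        step = llm.generate("\n".join(context))   # Thought + proposed action
        context.append(f"Thought: {step.thought}")
        if step.final_answer is not None:         # model chose to stop
            return step.final_answer
        result = tools[step.tool](step.args)      # Action: execute tool call
        context.append(f"Observation: {result}")  # feedback enters context
    return "Stopped: interaction budget exhausted"
```

The key property is that every tool result is appended to the context before the next thought, so later reasoning is grounded in what the environment actually returned.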
Tool Suite
MiroThinker includes a comprehensive toolkit:[19]
| Tool Category | Examples |
|---|---|
| Web Search | Query formulation, result parsing |
| Web Browsing | Page navigation, content extraction |
| Code Execution | Python runtime, result analysis |
| File Operations | Read, write, analyze documents |
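A toolkit like this is typically exposed to the agent through a name-to-function registry. The sketch below is illustrative, with stubbed tools rather than MiroThinker's real implementations:

```python
from typing import Callable, Dict

# Minimal name-to-function tool registry of the kind the table implies.
# Tool names and signatures are illustrative, not MiroThinker's interface.
TOOLS: Dict[str, Callable[[str], str]] = {}

def register(name: str):
    """Decorator that records a function under a tool name."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = fn
        return fn
    return wrap

@register("web_search")
def web_search(query: str) -> str:
    return f"[stub] results for: {query}"

@register("read_file")
def read_file(path: str) -> str:
    return f"[stub] contents of: {path}"

def dispatch(name: str, arg: str) -> str:
    """Route a tool call; unknown tools become observations, not crashes."""
    if name not in TOOLS:
        return f"Error: unknown tool '{name}'"
    return TOOLS[name](arg)
```

Returning an error string for unknown or failed tools keeps failures inside the feedback loop, giving the agent a chance to self-correct instead of terminating.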
600 Tool Calls
The 256K context window supports up to 600 tool calls per task.[20] For context, most agent benchmarks involve fewer than 20 tool calls. MiroThinker operates at 30x the typical interaction depth.
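Some back-of-the-envelope arithmetic (round numbers, not figures from the paper) shows why sustaining 600 calls in one window is demanding:

```python
# Back-of-the-envelope budget (round numbers, not figures from the paper).
context_window = 256_000   # tokens
max_tool_calls = 600

budget_per_call = context_window / max_tool_calls
print(f"~{budget_per_call:.0f} tokens per thought/action/observation cycle")
# → ~427 tokens per thought/action/observation cycle
```

At roughly 427 tokens per cycle on average, raw web pages and tool outputs cannot be kept verbatim; aggressive truncation or summarization of observations is implied.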
Training Methodology
MiroThinker training proceeds in three phases:[21]
Phase 1: Supervised Fine-Tuning
Initial training on successful agent trajectories teaches basic tool use patterns:[22]
- When to search vs. browse
- How to formulate effective queries
- Interpreting tool outputs
- Synthesizing multi-source information
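Trajectory SFT commonly masks the loss to tokens the model itself produced (thoughts and actions), excluding tool observations it merely received. The segment layout and token ids below are illustrative, not the paper's exact format:

```python
# Sketch of SFT loss masking over an agent trajectory: train only on
# tokens the model produced (thoughts, actions), not on tool observations.
# The segment layout and token ids are illustrative.
def build_loss_mask(segments):
    """segments: list of (role, token_ids); returns flat ids and a 0/1 mask."""
    ids, mask = [], []
    for role, toks in segments:
        trainable = 1 if role in ("thought", "action") else 0
        ids.extend(toks)
        mask.extend([trainable] * len(toks))
    return ids, mask

ids, mask = build_loss_mask([
    ("task",        [1, 2, 3]),
    ("thought",     [4, 5]),
    ("action",      [6]),
    ("observation", [7, 8, 9]),
    ("thought",     [10]),
])
# Loss is computed only where mask == 1 (tokens 4, 5, 6, 10).
```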
Phase 2: Preference Learning
The model learns to prefer successful trajectories over failed ones:[23]
- Binary feedback on trajectory outcomes
- Implicit learning of error recovery
- Preference for efficient tool sequences
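The paper's exact objective isn't reproduced here, but a common way to operationalize "prefer successful trajectories" is a DPO-style pairwise loss over (successful, failed) trajectory pairs:

```python
import math

# DPO-style pairwise loss: one plausible instantiation of "prefer
# successful over failed trajectories" (illustrative; not necessarily
# the paper's exact objective).
def preference_loss(logp_chosen, logp_rejected,
                    ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(margin)) on the implicit reward margin of a pair."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss falls as the policy assigns relatively more probability to the successful trajectory than a frozen reference model does, which implicitly teaches error recovery without per-step labels.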
Phase 3: Reinforcement Learning
Group Relative Policy Optimization (GRPO) trains for extended interaction:[24]
- Rewards for correct final answers
- Implicit credit assignment across long trajectories
- Learning when to persist vs. pivot strategies
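The group-relative idea at GRPO's core can be sketched in a few lines: sample several rollouts of the same task, score each, and normalize rewards within the group, so no learned value model is needed. This is a sketch of the general technique, not MiroThinker's training code:

```python
# Core of group-relative advantage estimation (a sketch of the GRPO
# idea): sample several rollouts of one task, score them, and normalize
# rewards within the group so no learned value model is needed.
def group_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of the same task: two succeeded, two failed.
advantages = group_advantages([1.0, 1.0, 0.0, 0.0])
# Successful rollouts get positive advantage, failed ones negative.
```

Because the advantage applies to the whole trajectory, credit assignment across hundreds of tool calls is implicit: every action in a winning rollout is reinforced relative to the group.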
Base Models
MiroThinker builds on open-weight foundations:[25]
| Size | Base Model |
|---|---|
| 8B | Qwen3-8B |
| 30B | Qwen3-30B |
| 72B | Qwen2.5-72B |
Benchmark Performance
GAIA (General AI Assistants)
GAIA tests realistic assistant tasks requiring web search, reasoning, and multi-step problem solving:[26]
| Model | Accuracy |
|---|---|
| MiroThinker-72B | 81.9% |
| GPT-5-high | ~85% (estimated) |
| Previous open-source SOTA | ~65% |
MiroThinker approaches commercial performance while remaining fully open.
HLE (Humanity's Last Exam)
Extremely challenging questions across diverse domains:[27]
| Model | Accuracy |
|---|---|
| MiroThinker-72B | 37.7% |
| Human expert | Variable |
BrowseComp
Complex web browsing and information synthesis:[28]
| Model | Accuracy |
|---|---|
| MiroThinker-72B (English) | 47.1% |
| MiroThinker-72B (Chinese) | 55.6% |
The Chinese performance suggests strong multilingual transfer.
Scaling Behavior
Critical finding: performance improves predictably with interaction depth.[29]
As MiroThinker engages in more tool calls:
- Accuracy increases (up to hardware/context limits)
- Error recovery becomes more effective
- Complex tasks become tractable
This suggests that interaction depth exhibits genuine scaling behavior rather than rapidly diminishing returns.
Comparison to Other Approaches
vs. Chain-of-Thought
| Dimension | Chain-of-Thought | MiroThinker |
|---|---|---|
| Feedback | None (isolated reasoning) | Continuous (tool results) |
| Error handling | Hope for the best | Detect and correct |
| Grounding | Text patterns only | External verification |
vs. ReAct Agents
| Dimension | Standard ReAct | MiroThinker |
|---|---|---|
| Interaction depth | 10-20 calls typical | Up to 600 calls |
| Training | Prompt engineering | RL for deep interaction |
| Persistence | Short tasks | Extended workflows |
vs. Test-Time Scaling
| Dimension | Test-Time Scaling | Interaction Scaling |
|---|---|---|
| Compute use | Extended reasoning | Extended interaction |
| Verification | Self-consistency | Environmental feedback |
| Training focus | Reasoning quality | Interaction handling |
Why Interaction Scaling Works
The paper identifies several mechanisms behind interaction scaling's effectiveness:[30]
Error Detection
More tool calls create more opportunities to discover mistakes:[31]
- Contradictory search results reveal incorrect assumptions
- Failed operations expose invalid states
- Unexpected outputs trigger reconsideration
Information Acquisition
Extended interaction gathers more relevant information:[32]
- Follow-up searches refine understanding
- Multiple sources enable cross-validation
- Deep browsing uncovers hidden details
Strategy Refinement
Long trajectories allow strategy evolution:[33]
- Initial approaches can be abandoned
- New angles can be explored
- Synthesis can incorporate late-arriving information
Open Source Release
The MiroMind team released comprehensive resources:[34]
Models
| Variant | HuggingFace |
|---|---|
| MiroThinker-v1.0-8B | Available |
| MiroThinker-v1.0-30B | Available |
| MiroThinker-v1.0-72B | Available |
| MiroThinker-v1.5-30B | Available (updated) |
Code
- Full training pipeline
- Inference implementation
- Tool integration examples
- Evaluation scripts
Documentation
- Architecture details
- Training recipes
- Deployment guides
- Benchmark reproduction
Implications for Agent Development
Training Paradigm Shift
Effective agents may require specific training for deep interaction, not just better base models.[35]
| Old Approach | New Approach |
|---|---|
| Train LLM, add tools | Train for tool use at depth |
| Prompt engineering | Reinforcement learning |
| Single-digit calls | Hundreds of calls |
Benchmark Evolution
Current agent benchmarks may prove too easy for interaction-scaled agents.[36] MiroThinker's 81.9% on GAIA suggests the need for more challenging evaluations.
Resource Requirements
600 tool calls per task implies significant compute and latency:[37]
- Extended inference time
- Multiple external API calls
- Large context window maintenance
Production deployment must account for these costs.
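A rough cost model makes the point concrete; every number below is a hypothetical placeholder, not a measurement from the paper:

```python
# Rough per-task cost model for deep-interaction agents. Every number
# here is a hypothetical placeholder, not a measurement from the paper.
def task_cost(n_calls=600, llm_latency_s=2.0, tool_latency_s=1.0,
              tokens_per_step=400, usd_per_1k_tokens=0.002):
    """Return (minutes of wall-clock latency, dollars of token cost)."""
    latency_s = n_calls * (llm_latency_s + tool_latency_s)
    total_tokens = n_calls * tokens_per_step
    return latency_s / 60, total_tokens * usd_per_1k_tokens / 1000

minutes, dollars = task_cost()
print(f"~{minutes:.0f} min and ~${dollars:.2f} per task at the assumed rates")
```

Even at these modest assumed rates, a single 600-call task takes on the order of half an hour of wall-clock time, so batching, caching, and early stopping matter in production.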
Key Takeaways
MiroThinker establishes interaction scaling as a viable third dimension for AI capability:
- New Dimension: Interaction depth scales like model size and context length
- 600 Tool Calls: Trained for 30x typical agent interaction depth
- 81.9% GAIA: Approaches commercial performance while fully open
- Three-Phase Training: SFT → Preference Learning → RL pipeline
- Error Correction: Environmental feedback prevents reasoning drift
- Open Release: Models, code, and training recipes all available
The next generation of AI agents may prove capable not through larger models alone, but through deeper engagement with their environments.
References
1. Kaplan, J. et al. "Scaling Laws for Neural Language Models." arXiv:2001.08361, January 2020. https://arxiv.org/abs/2001.08361
2. MiroMind Team. "MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling." arXiv:2511.11793, November 2025. https://arxiv.org/abs/2511.11793
3. Ibid., Table 1: Benchmark Results.
4. Ibid., Abstract.
5. Ibid., Section 1: Introduction.
6. Ibid.
7. Ibid., Section 3: Training.
8. Ibid., Section 4: Experiments.
9. Yao, S. et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023. https://arxiv.org/abs/2210.03629
10. Kaplan et al., op. cit.
11. Anthropic. "Claude's 200K Context Window." November 2024.
12. Wei, J. et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022.
13. MiroThinker paper, op. cit., Section 2.1.
14. Ibid., Section 2.2.
15. Ibid., Section 2: Interaction Scaling.
16. Ibid.
17. Ibid., Section 3.1: Architecture.
18. Yao et al., op. cit.
19. MiroThinker paper, op. cit., Section 3.2: Tool Suite.
20. Ibid., Abstract.
21. Ibid., Section 3.3: Training Pipeline.
22. Ibid., Section 3.3.1: SFT.
23. Ibid., Section 3.3.2: Preference Learning.
24. Ibid., Section 3.3.3: GRPO.
25. Ibid., Section 3.4: Base Models.
26. Ibid., Table 1.
27. Ibid.
28. Ibid.
29. Ibid., Section 5: Scaling Analysis.
30. Ibid., Section 6: Why Interaction Scaling Works.
31. Ibid., Section 6.1.
32. Ibid., Section 6.2.
33. Ibid., Section 6.3.
34. MiroMind. "MiroThinker." GitHub. https://github.com/MiroMindAI/MiroThinker
35. MiroThinker paper, op. cit., Section 7: Implications.
36. Ibid.
37. Ibid., Section 8: Limitations.
38. Neurohive. "MiroThinker v1.0: Open-Source AI Research Agent Learns to Make Up to 600 Tool Calls Per Task." November 2025. https://neurohive.io/en/state-of-the-art/mirothinker-v1-0-open-source-ai-research-agent-learns-to-make-up-to-600-tool-calls-per-task/
39. Hugging Face. "MiroThinker-v1.5-30B." https://huggingface.co/miromind-ai/MiroThinker-v1.5-30B