# Recursive Language Models: Teaching AI to Manage Its Own Context

Context windows have expanded dramatically: 100K, 200K, even 1 million tokens.[^1] Yet fundamental limits remain. Linear memory costs, attention degradation at extreme lengths, and the inability to revisit or reorganize information once consumed all constrain what long-context models can achieve.[^2] Recursive Language Models (RLMs) take a different approach entirely. Rather than stuffing everything into context, RLMs teach models to actively manage their own context using Python scripts and sub-LLM calls.[^3]
## TL;DR

The RLM paper from MIT introduces an architecture in which the main language model delegates work to a persistent Python REPL and spawnable sub-LLM instances.[^4] Instead of loading massive inputs directly, the model inspects and transforms data programmatically.[^5] Testing shows RLMs handle inputs up to 100x beyond model context windows while dramatically outperforming base models and common long-context scaffolds.[^6] On CodeQA, GPT-5 achieves 24% accuracy at baseline while the RLM reaches 62%.[^7] Prime Intellect has implemented RLM training infrastructure and predicts this approach will define the next major breakthrough in AI agents.[^8]
## The Long Context Problem

Transformer attention scales quadratically with sequence length.[^9] While efficient attention variants reduce this cost, fundamental challenges persist:

### Context Degradation

Studies demonstrate that model performance degrades as context grows, even when the model technically supports the length.[^10] The famous "needle in a haystack" tests reveal that information in the middle of long contexts often gets ignored or forgotten.[^11]

### Static Context

Traditional context windows operate as write-once buffers. Once tokens enter context, the model cannot reorganize, summarize, or selectively retrieve them.[^12] Irrelevant information persists alongside crucial details.

### Memory Costs

Each additional token in context requires proportional memory for key-value caches during inference.[^13] Million-token contexts demand substantial GPU memory even for single queries.
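To make "proportional memory" concrete, here is a back-of-the-envelope estimate. The model shape is an illustrative assumption (roughly a Llama-70B-like configuration with grouped-query attention), not a figure from the paper:

```python
# Rough KV-cache cost per sequence; the model shape below is an assumed,
# roughly Llama-70B-like configuration with grouped-query attention.
n_layers = 80
n_kv_heads = 8
head_dim = 128
bytes_per_value = 2  # fp16

# K and V caches, per token, across all layers
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def kv_cache_gib(context_tokens: int) -> float:
    """GiB of KV cache needed to hold a context of this many tokens."""
    return context_tokens * bytes_per_token / 2**30

print(bytes_per_token)                 # 327680 bytes (~320 KiB) per token
print(round(kv_cache_gib(1_000_000)))  # 305 GiB for a 1M-token context
```

Even before any compute is spent, simply holding a million-token context in the cache costs hundreds of gigabytes at this scale.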
## The RLM Solution

RLMs flip the paradigm from "model receives context" to "model manages context."[^14]

### Core Architecture

The RLM provides three key capabilities to the main model:[^15]

| Capability | Implementation | Purpose |
|---|---|---|
| Python REPL | Persistent environment | Store, transform, retrieve data |
| Sub-LLMs | Spawnable instances via `llm_batch()` | Delegate analysis tasks |
| Answer variable | `answer["content"]` + `answer["ready"]` | Iterative response refinement |
The main model never directly processes massive inputs. Instead, it writes Python code to manage the information flow.
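This control flow can be sketched as a small host loop: the model emits Python, the host executes it in a persistent namespace, and the loop ends when the answer variable signals readiness. `model_turn` below is a canned stand-in for the real LLM call so the sketch runs standalone; it is an illustration, not the paper's implementation:

```python
import io
from contextlib import redirect_stdout

# Canned stand-in for the main model: a real RLM would call the LLM with
# the previous REPL output and get back the next code block to execute.
def model_turn(turn: int, repl_output: str) -> str:
    scripted = [
        "chunks = data.split('\\n')\nprint(len(chunks))",
        "answer['content'] = f'{len(chunks)} lines'\nanswer['ready'] = True",
    ]
    return scripted[turn]

def run_rlm(data: str, max_turns: int = 8) -> str:
    # Persistent REPL namespace: variables survive across model turns.
    env = {"data": data, "answer": {"content": "", "ready": False}}
    repl_output = ""
    for turn in range(max_turns):
        code = model_turn(turn, repl_output)
        buf = io.StringIO()
        with redirect_stdout(buf):
            exec(code, env)           # run the model's code in the REPL
        repl_output = buf.getvalue()  # fed back to the model next turn
        if env["answer"]["ready"]:    # completion signal
            break
    return env["answer"]["content"]

print(run_rlm("a\nb\nc"))  # 3 lines
```

Note that the raw `data` lives only in the REPL namespace; the model's context would contain just the code it wrote and the printed output.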
### The Control Plane Design

The Python REPL serves as what Prime Intellect calls "a control plane for long context."[^16] The environment provides:

- **Persistent State:** Variables survive across model turns, enabling complex multi-step workflows.[^17]
- **Parallel Processing:** The `llm_batch()` function spawns multiple sub-LLM calls simultaneously, dramatically accelerating tasks that can be parallelized.[^18]
- **Tool Isolation:** Only sub-LLMs receive tool access (web search, file reading, etc.), preventing token bloat in the main model's context.[^19]
- **Pre-installed Packages:** Math-heavy tasks get `numpy`, `scipy`, and `sympy` automatically available.[^20]
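One plausible way to implement such a batch helper is a thread pool over a single-call backend. `call_model` here is a stub standing in for a real LLM API, so the sketch runs standalone:

```python
from concurrent.futures import ThreadPoolExecutor

# Stub backend: stands in for a real LLM API call so the sketch runs.
def call_model(prompt: str) -> str:
    return f"analysis of: {prompt}"

def llm_batch(requests: list[dict], max_workers: int = 8) -> list[str]:
    """Run many sub-LLM calls in parallel, preserving request order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: call_model(r["prompt"]), requests))

print(llm_batch([{"prompt": "section A"}, {"prompt": "section B"}]))
```

`pool.map` keeps results in request order, which matters when the main model zips outputs back against the sections it dispatched.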
### The Answer Variable Pattern

Final responses emerge iteratively rather than in a single generation:[^21]

```python
# Model writes to answer variable
answer["content"] = "Partial solution..."
answer["ready"] = False  # Continue processing

# Later iterations refine
answer["content"] = "Complete solution..."
answer["ready"] = True  # Signal completion
```

This pattern enables the model to revise and improve answers through multiple REPL interactions.[^22]
## How Context Folding Works

The researchers describe RLM's approach as "learned context folding" rather than summarization:[^23]

### Not Summarization

Traditional approaches compress context through summarization, inevitably losing information.[^24] The RLM approach preserves all information by:

- Storing raw data in Python variables
- Querying selectively through sub-LLM calls
- Transforming programmatically when needed
- Never summarizing the original content
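The steps above can be sketched as: keep the raw document in variables, filter chunks programmatically, and send only matching chunks to sub-LLMs. The `llm_batch` stub below is an assumption so the example runs standalone:

```python
# Stub for the sub-LLM batch call, so the example runs standalone.
def llm_batch(requests: list[dict]) -> list[str]:
    return [f"answer drawn from {len(r['prompt'])} chars" for r in requests]

def fold_and_query(document: str, keyword: str,
                   chunk_chars: int = 1000) -> list[str]:
    # 1. Store raw data in variables: nothing is summarized away.
    chunks = [document[i:i + chunk_chars]
              for i in range(0, len(document), chunk_chars)]
    # 2. Select programmatically: a cheap filter picks candidate chunks.
    relevant = [c for c in chunks if keyword.lower() in c.lower()]
    # 3. Query selectively: only matching chunks reach sub-LLMs.
    return llm_batch([{"prompt": f"Describe {keyword} in: {c}"}
                      for c in relevant])

doc = "filler " * 300 + "the treasurer approved the budget " + "filler " * 300
print(fold_and_query(doc, "treasurer"))  # one chunk matched
```

Because the full text stays in `chunks`, a later turn can re-filter with a different keyword; nothing was lost to an early summary.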
### Delegation to Sub-LLMs

When the main model needs analysis of a large document section, it spawns a sub-LLM with just that section:[^25]

```python
# Main model code
results = llm_batch([
    {"prompt": f"Analyze section 1: {section_1}"},
    {"prompt": f"Analyze section 2: {section_2}"},
    {"prompt": f"Analyze section 3: {section_3}"},
])
# Process results without ever loading the full document
```

The main model's context stays lean while sub-LLMs handle token-heavy operations.[^26]
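This delegation amounts to a map-reduce pattern: sub-LLMs "map" over raw sections, and the main model "reduces" from their short analyses rather than from the raw text. `call_model` and `llm_batch` below are stubs for illustration:

```python
# Stubs standing in for real LLM calls, so the sketch runs standalone.
def call_model(prompt: str) -> str:
    return f"[synthesis over {len(prompt)} chars]"

def llm_batch(requests: list[dict]) -> list[str]:
    return [f"short analysis of section {i}" for i, _ in enumerate(requests)]

def map_reduce_qa(sections: list[str], question: str) -> str:
    # Map: each raw section goes to its own sub-LLM.
    analyses = llm_batch([{"prompt": f"{question}\n\n{s}"} for s in sections])
    # Reduce: the main model synthesizes from the short analyses only;
    # the raw sections never enter its context.
    merged = "\n".join(analyses)
    return call_model(f"Synthesize an answer to '{question}' from:\n{merged}")

sections = ["x" * 5000, "y" * 5000, "z" * 5000]
print(map_reduce_qa(sections, "What changed?"))
```

The reduce prompt contains only the joined analyses, so its size is governed by the number of sections, not by their length.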
## Benchmark Performance

Testing across four diverse environments reveals significant capabilities:[^27]

### DeepDive (Web Research)

Sub-LLMs handle web scraping and content extraction. The main model orchestrates search strategy and synthesizes findings.[^28]
| Approach | Reward |
|---|---|
| Standard LLM | ~45% |
| RLM (no tips) | ~55% |
| RLM (with strategy tips) | ~70% |
### Oolong (Long Context Aggregation)

Real-world complex documents (D&D session transcripts) tested context handling at ~1.5M characters:[^29]
| Approach | Accuracy |
|---|---|
| Standard LLM | ~35% |
| RLM | ~75% |
Notably, RLM underperformed on synthetic data, suggesting the approach excels on realistic complexity rather than artificial benchmarks.
### CodeQA (Document QA)

Perhaps the most striking result:[^30]
| Model | Approach | Accuracy |
|---|---|---|
| GPT-5 | Baseline | 24.0% |
| GPT-5 | Summarization agent | 41.3% |
| GPT-5 | RLM | 62.0% |
RLM more than doubled baseline performance while substantially outperforming summarization.
### Verbatim-Copy

This task tests JSON structure preservation through iterative refinement:[^31]
| Approach | Accuracy |
|---|---|
| Standard LLM | ~65% |
| RLM | ~77% |
### Math-Python (Caveat)

RLM currently underperforms on math tasks by 15-25%:[^32]
| Approach | Accuracy |
|---|---|
| Standard LLM | ~70% |
| RLM | ~50% |
Researchers attribute this to models not yet trained to effectively use RLM scaffolding for mathematical reasoning.
## Token Efficiency

Beyond accuracy, RLM dramatically improves token efficiency:[^33]

- **Main Model Tokens:** 2-3x reduction in tokens processed by the primary model for equivalent or better results.[^34]
- **Total Tokens:** May increase due to sub-LLM calls, but main model context stays bounded regardless of input size.[^35]
- **Latency Tradeoff:** Sequential REPL operations add 40-80% latency compared to single-pass inference.[^36]
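A rough worked example makes the tradeoff concrete. All numbers below are illustrative assumptions, not measurements from the paper:

```python
# Illustrative token accounting; every number here is an assumption
# chosen for round arithmetic, not a measurement from the paper.
input_tokens = 1_000_000      # raw input handled by sub-LLMs
chunk_tokens = 20_000         # input seen by each sub-LLM
reply_tokens = 200            # short analysis returned per sub-LLM
orchestration_tokens = 3_000  # code and instructions in the main context

n_chunks = input_tokens // chunk_tokens          # 50 sub-LLM calls
main_model_tokens = orchestration_tokens + n_chunks * reply_tokens
total_tokens = main_model_tokens + input_tokens  # sub-LLMs read everything

print(main_model_tokens)  # 13000: far below the 1M-token input
print(total_tokens)       # 1013000: total work can exceed a single pass
```

The main model's context holds only code and short replies, while the total token bill shifts onto parallelizable sub-LLM calls.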
## Prime Intellect's 2026 Predictions

Prime Intellect has built RLM training infrastructure and makes bold predictions:[^37]

### The Paradigm of 2026

They position RLMs as the next major breakthrough based on three premises:[^38]

1. **Training Advantage:** Unlike fixed scaffolds, RLMs can be trained end-to-end with reinforcement learning to improve context management.[^39]
2. **Complementary to Attention:** "Both efficient attention and context folding are needed for true long agents. Better attention delays context degradation. Context folding enables active management."[^40]
3. **Long-Horizon Agents:** RLMs enable agents that operate over weeks or months, managing context across extended task timelines.[^41]
### RLMEnv Infrastructure

Prime Intellect released RLM-compatible environments and training infrastructure:[^42]
- Multiple environments on their Environments Hub
- Integration with prime-rl training framework
- Open for community experimentation
### Untapped Potential

Current models show "significant performance being left untapped due to poor usage of the scaffolding."[^43] Models not specifically trained for RLM underutilize its capabilities. This suggests major gains from RLM-native training.
## Open Source Release

The MIT team released complete resources:[^44]
- Paper: arXiv:2512.24601
- Code: https://github.com/alexzhang13/rlm
- Environments: Various long-context benchmarks
## Implications for AI Development

### Agent Architecture

RLMs suggest a new pattern for building capable agents:[^45]
- Orchestrator model with bounded context
- Worker sub-LLMs handling specific tasks
- Python environment for state management
- Iterative refinement rather than single-shot
### Training Requirements

To fully leverage RLMs, models need training that includes:[^46]
- Code generation for REPL interaction
- Sub-LLM delegation strategies
- Multi-turn answer refinement
- Long-horizon reward signals
### Cost Structure

RLMs shift costs from context length to orchestration complexity:[^47]
| Dimension | Traditional | RLM |
|---|---|---|
| Main model context | Scales with input | Bounded |
| Sub-LLM calls | N/A | Scales with complexity |
| Latency | Single pass | Multi-turn |
| Memory | Scales with context | Bounded |
## Key Takeaways
Recursive Language Models introduce a paradigm shift in context handling:
- Active Context Management: Models control their own context rather than passively receiving it
- 100x Extension: Handle inputs far beyond native context windows
- Preserved Information: No summarization-based information loss
- Token Efficiency: 2-3x reduction in main model token consumption
- Training Potential: Major gains expected from RLM-native training
- Long-Horizon Agents: Architecture suited for extended task timelines
Prime Intellect's conviction that RLMs represent "the paradigm of 2026" reflects growing recognition that context management may matter more than context length.
## References

[^1]: Anthropic. "Claude 3 Model Card." March 2024. https://www.anthropic.com/claude-3-model-card
[^2]: Liu, N. et al. "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172. July 2023. https://arxiv.org/abs/2307.03172
[^3]: Zhang, A. et al. "Recursive Language Models." arXiv:2512.24601. December 2025. https://arxiv.org/abs/2512.24601
[^4]: Ibid., Abstract.
[^5]: Ibid., Section 2: Method.
[^6]: Ibid., Section 4: Experiments.
[^7]: Ibid., Table 1: CodeQA Results.
[^8]: Prime Intellect. "Recursive Language Models: the paradigm of 2026." January 2026. https://www.primeintellect.ai/blog/rlm
[^9]: Vaswani, A. et al. "Attention Is All You Need." NeurIPS 2017.
[^10]: Liu et al., op. cit.
[^11]: Ibid.
[^12]: Zhang et al., op. cit., Section 1.
[^13]: Kwon, W. et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023.
[^14]: Zhang et al., op. cit., Section 2.
[^15]: Ibid., Section 2.1: Architecture Components.
[^16]: Prime Intellect, op. cit.
[^17]: Zhang et al., op. cit., Section 2.2.
[^18]: Ibid., Section 2.3: Parallel Sub-LLMs.
[^19]: Ibid.
[^20]: Prime Intellect, op. cit.
[^21]: Zhang et al., op. cit., Section 2.4: Answer Variable.
[^22]: Ibid.
[^23]: Prime Intellect, op. cit.
[^24]: Zhang et al., op. cit., Section 5: Comparison.
[^25]: Ibid., Section 2.3.
[^26]: Prime Intellect, op. cit.
[^27]: Zhang et al., op. cit., Section 4.
[^28]: Ibid., Section 4.1: DeepDive.
[^29]: Ibid., Section 4.2: Oolong.
[^30]: Ibid., Table 1.
[^31]: Ibid., Section 4.3: Verbatim-Copy.
[^32]: Ibid., Section 4.4: Math-Python.
[^33]: Prime Intellect, op. cit.
[^34]: Ibid.
[^35]: Ibid.
[^36]: Ibid.
[^37]: Ibid.
[^38]: Ibid.
[^39]: Ibid.
[^40]: Ibid.
[^41]: Ibid.
[^42]: Prime Intellect. "RLMEnv." https://github.com/PrimeIntellect-ai/rlmenv
[^43]: Prime Intellect blog, op. cit.
[^44]: Zhang, A. "Recursive Language Models." GitHub. https://github.com/alexzhang13/rlm
[^45]: MarkTechPost. "Recursive Language Models (RLMs): From MIT's Blueprint to Prime Intellect's RLMEnv." January 2026. https://www.marktechpost.com/2026/01/02/recursive-language-models-rlms-from-mits-blueprint-to-prime-intellects-rlmenv-for-long-horizon-llm-agents/
[^46]: Prime Intellect, op. cit.
[^47]: Zhang et al., op. cit., Section 6: Discussion.