Recursive Language Models: Teaching AI to Manage Its Own Context

MIT's RLM architecture lets models delegate context to sub-LLMs and Python scripts, extending usable context up to 100x while cutting main-model token use 2-3x. Prime Intellect calls it the paradigm of 2026.

Context windows have expanded dramatically: 100K, 200K, even 1 million tokens.1 Yet fundamental limits remain. Linear memory costs, attention degradation at extreme lengths, and the inability to revisit or reorganize information once consumed all constrain what long-context models can achieve.2 Recursive Language Models (RLMs) take a different approach entirely. Rather than stuffing everything into context, RLMs teach models to actively manage their own context using Python scripts and sub-LLM calls.3

TL;DR

The RLM paper from MIT introduces an architecture where the main language model delegates work to a persistent Python REPL and spawnable sub-LLM instances.4 Instead of loading massive inputs directly, the model inspects and transforms data programmatically.5 Testing shows RLMs handle inputs up to 100x beyond model context windows while substantially outperforming base models and common long-context scaffolds.6 On CodeQA, GPT-5 scores 24% accuracy at baseline while the RLM configuration reaches 62%.7 Prime Intellect has implemented RLM training infrastructure and predicts this approach will define the next major breakthrough in AI agents.8

The Long Context Problem

Transformer attention scales quadratically with sequence length.9 While efficient attention variants reduce this cost, fundamental challenges persist:

Context Degradation

Studies demonstrate that model performance degrades as context grows, even when the model technically supports the length.10 The famous "needle in a haystack" tests reveal that information in the middle of long contexts often gets ignored or forgotten.11

Static Context

Traditional context windows operate as write-once buffers. Once tokens enter context, the model cannot reorganize, summarize, or selectively retrieve them.12 Irrelevant information persists alongside crucial details.

Memory Costs

Each additional token in context requires proportional memory for key-value caches during inference.13 Million-token contexts demand substantial GPU memory even for single queries.
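A back-of-envelope calculation makes the linear growth concrete. The model shape below (32 layers, 8 KV heads, head dimension 128, fp16) is an illustrative assumption, not any specific model's configuration:

```python
# KV-cache memory for one request, assuming a hypothetical model shape.
# Memory grows linearly with context length.

def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per=2):
    # Factor of 2 accounts for storing both keys and values per token.
    return tokens * layers * kv_heads * head_dim * 2 * bytes_per

gb = kv_cache_bytes(1_000_000) / 2**30
print(f"{gb:.1f} GiB")  # ≈122 GiB for a 1M-token context under these assumptions
```

Even with this modest assumed shape, a single million-token request needs on the order of 100 GiB of cache, which is why bounded-context orchestration is attractive.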

The RLM Solution

RLMs flip the paradigm from "model receives context" to "model manages context."14

Core Architecture

The RLM provides three key capabilities to the main model:15

Capability      | Implementation                      | Purpose
Python REPL     | Persistent environment              | Store, transform, retrieve data
Sub-LLMs        | Spawnable instances via llm_batch() | Delegate analysis tasks
Answer variable | answer["content"] + answer["ready"] | Iterative response refinement

The main model never directly processes massive inputs. Instead, it writes Python code to manage the information flow.
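The overall control flow can be sketched as a loop in which the model emits code and a persistent REPL executes it. This is a minimal illustration, not the paper's actual harness; `run_model` is a mocked stand-in for the root LLM:

```python
# Minimal sketch of an RLM turn loop. `run_model` mocks the root LLM:
# a real system would call the model and get back a Python snippet.

def run_model(prompt: str) -> str:
    # Mock: immediately writes a final answer and signals completion.
    return 'answer["content"] = "done"; answer["ready"] = True'

def rlm_loop(task: str, max_turns: int = 8) -> str:
    # Persistent REPL state: variables in `env` survive across turns.
    env = {"answer": {"content": "", "ready": False}}
    for _ in range(max_turns):
        code = run_model(f"Task: {task}\nVariables: {list(env)}")
        exec(code, env)               # run the model's code in the shared REPL
        if env["answer"]["ready"]:    # model signals it is finished
            break
    return env["answer"]["content"]

print(rlm_loop("summarize the report"))  # -> done
```

The key property is that `env` persists between turns, so earlier turns can stash large data in variables that later turns inspect or transform.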

The Control Plane Design

The Python REPL serves as what Prime Intellect calls "a control plane for long context."16 The environment provides:

Persistent State: Variables survive across model turns, enabling complex multi-step workflows.17

Parallel Processing: The llm_batch() function spawns multiple sub-LLM calls simultaneously, dramatically accelerating tasks that can be parallelized.18

Tool Isolation: Only sub-LLMs receive tool access (web search, file reading, etc.), preventing token bloat in the main model's context.19

Pre-installed Packages: Math-heavy tasks get numpy, scipy, sympy automatically available.20
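The parallel-processing capability can be sketched with a thread pool. This is a hypothetical reimplementation of what an `llm_batch()`-style function might look like, with `call_sub_llm` mocked in place of a real API call:

```python
from concurrent.futures import ThreadPoolExecutor

def call_sub_llm(prompt: str) -> str:
    # Mock sub-LLM: a real implementation would make a network call here.
    return f"analysis of: {prompt[:20]}"

def llm_batch(requests: list[dict]) -> list[str]:
    # Fan prompts out concurrently; results come back in input order.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda r: call_sub_llm(r["prompt"]), requests))

results = llm_batch([{"prompt": f"Analyze section {i}"} for i in range(3)])
print(results[0])  # -> analysis of: Analyze section 0
```

Because sub-LLM calls are I/O-bound, a thread pool is enough to overlap them; the main model only ever sees the returned strings, not the prompts' payloads.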

The Answer Variable Pattern

Final responses emerge iteratively rather than in a single generation:21

# Model writes to answer variable
answer["content"] = "Partial solution..."
answer["ready"] = False  # Continue processing

# Later iterations refine
answer["content"] = "Complete solution..."
answer["ready"] = True  # Signal completion

This pattern enables the model to revise and improve answers through multiple REPL interactions.22
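Seen from the harness side, the protocol amounts to re-executing model-written snippets until the model flags completion. The scripted `turns` below stand in for successive model generations in this illustrative sketch:

```python
# Harness-side view of the answer-variable protocol: execute model
# snippets until answer["ready"] flips to True. The `turns` list mocks
# two successive model generations.

turns = [
    'answer["content"] = "Partial solution..."; answer["ready"] = False',
    'answer["content"] = "Complete solution..."; answer["ready"] = True',
]

env = {"answer": {"content": "", "ready": False}}
for code in turns:
    exec(code, env)               # run the next model-written snippet
    if env["answer"]["ready"]:    # stop once the model marks it final
        break

print(env["answer"]["content"])  # -> Complete solution...
```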

How Context Folding Works

The researchers describe RLM's approach as "learned context folding" rather than summarization:23

Not Summarization

Traditional approaches compress context through summarization, inevitably losing information.24 The RLM approach preserves all information by:

  1. Storing raw data in Python variables
  2. Querying selectively through sub-LLM calls
  3. Transforming programmatically when needed
  4. Never summarizing the original content

Delegation to Sub-LLMs

When the main model needs analysis of a large document section, it spawns a sub-LLM with just that section:25

# Main model code
results = llm_batch([
    {"prompt": f"Analyze section 1: {section_1}"},
    {"prompt": f"Analyze section 2: {section_2}"},
    {"prompt": f"Analyze section 3: {section_3}"}
])
# Process results without ever loading full document

The main model's context stays lean while sub-LLMs handle token-heavy operations.26

Benchmark Performance

Testing across four diverse environments reveals significant capabilities:27

DeepDive (Web Research)

Sub-LLMs handle web scraping and content extraction. The main model orchestrates search strategy and synthesizes findings.28

Approach                 | Reward
Standard LLM             | ~45%
RLM (no tips)            | ~55%
RLM (with strategy tips) | ~70%

Oolong (Long Context Aggregation)

Real-world complex documents (D&D session transcripts) tested context handling at ~1.5M characters:29

Approach     | Accuracy
Standard LLM | ~35%
RLM          | ~75%

Notably, RLM underperformed on synthetic data, suggesting the approach excels on realistic complexity rather than artificial benchmarks.

CodeQA (Document QA)

Perhaps the most striking result:30

Model | Approach            | Accuracy
GPT-5 | Baseline            | 24.0%
GPT-5 | Summarization agent | 41.3%
GPT-5 | RLM                 | 62.0%

RLM more than doubled baseline performance while substantially outperforming summarization.

Verbatim-Copy

JSON structure preservation through iterative refinement:31

Approach     | Accuracy
Standard LLM | ~65%
RLM          | ~77%

Math-Python (Caveat)

RLM currently underperforms on math tasks by 15-25%:32

Approach     | Accuracy
Standard LLM | ~70%
RLM          | ~50%

Researchers attribute this to models not yet trained to effectively use RLM scaffolding for mathematical reasoning.

Token Efficiency

Beyond accuracy, RLM dramatically improves token efficiency:33

Main Model Tokens: 2-3x reduction in tokens processed by the primary model for equivalent or better results.34

Total Tokens: May increase due to sub-LLM calls, but main model context stays bounded regardless of input size.35

Latency Tradeoff: Sequential REPL operations add 40-80% latency compared to single-pass inference.36

Prime Intellect's 2026 Predictions

Prime Intellect has built RLM training infrastructure and makes bold predictions:37

The Paradigm of 2026

They position RLMs as the next major breakthrough based on three premises:38

1. Training Advantage: Unlike fixed scaffolds, RLMs can be trained end-to-end with reinforcement learning to improve context management.39

2. Complementary to Attention: "Both efficient attention and context folding are needed for true long agents. Better attention delays context degradation. Context folding enables active management."40

3. Long-Horizon Agents: RLMs enable agents that operate over weeks or months, managing context across extended task timelines.41

RLMEnv Infrastructure

Prime Intellect released RLM-compatible environments and training infrastructure:42

  • Multiple environments on their Environments Hub
  • Integration with prime-rl training framework
  • Open for community experimentation

Untapped Potential

Current models show "significant performance being left untapped due to poor usage of the scaffolding."43 Models not specifically trained for RLM underutilize its capabilities. This suggests major gains from RLM-native training.

Open Source Release

The MIT team released complete resources:44

  • Paper: arXiv:2512.24601
  • Code: https://github.com/alexzhang13/rlm
  • Environments: Various long-context benchmarks

Implications for AI Development

Agent Architecture

RLMs suggest a new pattern for building capable agents:45

  • Orchestrator model with bounded context
  • Worker sub-LLMs handling specific tasks
  • Python environment for state management
  • Iterative refinement rather than single-shot
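This orchestrator/worker split can be illustrated with a toy accounting sketch. Nothing here is from the paper; `worker` mocks a sub-LLM, and the point is simply that the orchestrator accumulates short digests while workers absorb the bulk:

```python
# Toy illustration of the orchestrator/worker split: the orchestrator's
# context holds only short digests, while workers see the full chunks.

def worker(chunk: str) -> str:
    # Mock sub-LLM: returns a short digest of a large chunk.
    return chunk[:10]

chunks = ["A" * 5000, "B" * 5000, "C" * 5000]

orchestrator_context = []
for c in chunks:
    orchestrator_context.append(worker(c))  # only digests accumulate

digest_tokens = sum(len(d) for d in orchestrator_context)
input_tokens = sum(len(c) for c in chunks)
print(digest_tokens, input_tokens)  # orchestrator load stays tiny vs. input
```

However large the input grows, the orchestrator's share scales with the number of chunks times the digest size, not with the raw input length.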

Training Requirements

To fully leverage RLMs, models need training that includes:46

  • Code generation for REPL interaction
  • Sub-LLM delegation strategies
  • Multi-turn answer refinement
  • Long-horizon reward signals

Cost Structure

RLMs shift costs from context length to orchestration complexity:47

Dimension          | Traditional         | RLM
Main model context | Scales with input   | Bounded
Sub-LLM calls      | N/A                 | Scales with complexity
Latency            | Single pass         | Multi-turn
Memory             | Scales with context | Bounded

Key Takeaways

Recursive Language Models introduce a paradigm shift in context handling:

  1. Active Context Management: Models control their own context rather than passively receiving it
  2. 100x Extension: Handle inputs far beyond native context windows
  3. Preserved Information: No summarization-based information loss
  4. Token Efficiency: 2-3x reduction in main model token consumption
  5. Training Potential: Major gains expected from RLM-native training
  6. Long-Horizon Agents: Architecture suited for extended task timelines

Prime Intellect's conviction that RLMs represent "the paradigm of 2026" reflects growing recognition that context management may matter more than context length.


References


  1. Anthropic. "Claude 3 Model Card." March 2024. https://www.anthropic.com/claude-3-model-card 

  2. Liu, N. et al. "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172. July 2023. https://arxiv.org/abs/2307.03172 

  3. Zhang, A. et al. "Recursive Language Models." arXiv:2512.24601. December 2025. https://arxiv.org/abs/2512.24601 

  4. Ibid., Abstract. 

  5. Ibid., Section 2: Method. 

  6. Ibid., Section 4: Experiments. 

  7. Ibid., Table 1: CodeQA Results. 

  8. Prime Intellect. "Recursive Language Models: the paradigm of 2026." January 2026. https://www.primeintellect.ai/blog/rlm 

  9. Vaswani, A. et al. "Attention Is All You Need." NeurIPS 2017. 

  10. Liu et al., op. cit. 

  11. Ibid. 

  12. Zhang et al., op. cit., Section 1. 

  13. Kwon, W. et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. 

  14. Zhang et al., op. cit., Section 2. 

  15. Ibid., Section 2.1: Architecture Components. 

  16. Prime Intellect, op. cit. 

  17. Zhang et al., op. cit., Section 2.2. 

  18. Ibid., Section 2.3: Parallel Sub-LLMs. 

  19. Ibid. 

  20. Prime Intellect, op. cit. 

  21. Zhang et al., op. cit., Section 2.4: Answer Variable. 

  22. Ibid. 

  23. Prime Intellect, op. cit. 

  24. Zhang et al., op. cit., Section 5: Comparison. 

  25. Ibid., Section 2.3. 

  26. Prime Intellect, op. cit. 

  27. Zhang et al., op. cit., Section 4. 

  28. Ibid., Section 4.1: DeepDive. 

  29. Ibid., Section 4.2: Oolong. 

  30. Ibid., Table 1. 

  31. Ibid., Section 4.3: Verbatim-Copy. 

  32. Ibid., Section 4.4: Math-Python. 

  33. Prime Intellect, op. cit. 

  34. Ibid. 

  35. Ibid. 

  36. Ibid. 

  37. Ibid. 

  38. Ibid. 

  39. Ibid. 

  40. Ibid. 

  41. Ibid. 

  42. Prime Intellect. "RLMEnv." https://github.com/PrimeIntellect-ai/rlmenv 

  43. Prime Intellect blog, op. cit. 

  44. Zhang, A. "Recursive Language Models." GitHub. https://github.com/alexzhang13/rlm 

  45. MarkTechPost. "Recursive Language Models (RLMs): From MIT's Blueprint to Prime Intellect's RLMEnv." January 2026. https://www.marktechpost.com/2026/01/02/recursive-language-models-rlms-from-mits-blueprint-to-prime-intellects-rlmenv-for-long-horizon-llm-agents/ 

  46. Prime Intellect, op. cit. 

  47. Zhang et al., op. cit., Section 6: Discussion. 
