# Recursive Language Models: Teaching AI to Manage Its Own Context

Context windows have expanded dramatically: 100K, 200K, even 1 million tokens.[^1] Yet fundamental limits remain. Linear memory costs, attention degradation at extreme lengths, and the inability to revisit or reorganize information once consumed all constrain what long-context models can achieve.[^2] Recursive Language Models (RLMs) take a different approach entirely. Rather than stuffing everything into context, RLMs teach models to actively manage their own context using Python scripts and sub-LLM calls.[^3]
## TL;DR

The RLM paper from MIT introduces an architecture in which the main language model delegates work to a persistent Python REPL and spawnable sub-LLM instances.[^4] Instead of loading massive inputs directly, the model inspects and transforms data programmatically.[^5] Testing shows RLMs handle inputs up to 100x beyond model context windows while dramatically outperforming base models and common long-context scaffolds.[^6] On CodeQA, GPT-5 achieves 24% accuracy at baseline while the RLM reaches 62%.[^7] Prime Intellect has implemented RLM training infrastructure and predicts this approach will define the next major breakthrough in AI agents.[^8]
## The Long Context Problem

Transformer attention scales quadratically with sequence length.[^9] While efficient attention variants reduce this cost, fundamental challenges persist:

### Context Degradation

Studies demonstrate that model performance degrades as context grows, even when the model technically supports the length.[^10] The famous "needle in a haystack" tests reveal that information in the middle of long contexts often gets ignored or forgotten.[^11]

### Static Context

Traditional context windows operate as write-once buffers. Once tokens enter context, the model cannot reorganize, summarize, or selectively retrieve them.[^12] Irrelevant information persists alongside crucial details.

### Memory Costs

Each additional token in context requires proportional memory for key-value caches during inference.[^13] Million-token contexts demand substantial GPU memory even for single queries.
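To make "proportional memory" concrete, here is a back-of-the-envelope estimate. The model shape is an illustrative assumption (roughly a Llama-70B-like configuration with grouped-query attention), not a figure from the paper:

```python
# Rough KV-cache cost per sequence; the model shape below is an assumed,
# roughly Llama-70B-like configuration with grouped-query attention.
n_layers = 80
n_kv_heads = 8
head_dim = 128
bytes_per_value = 2  # fp16

# K and V caches, per token, across all layers
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def kv_cache_gib(context_tokens: int) -> float:
    """GiB of KV cache needed to hold a context of this many tokens."""
    return context_tokens * bytes_per_token / 2**30

print(bytes_per_token)                 # 327680 bytes (~320 KiB) per token
print(round(kv_cache_gib(1_000_000)))  # 305 GiB for a 1M-token context
```

Even before any compute is spent, simply holding a million-token context in the cache costs hundreds of gigabytes at this scale.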
## The RLM Solution

RLMs flip the paradigm from "model receives context" to "model manages context."[^14]

### Core Architecture

The RLM provides three key capabilities to the main model:[^15]

| Capability | Implementation | Purpose |
|---|---|---|
| Python REPL | Persistent environment | Store, transform, retrieve data |
| Sub-LLMs | Spawnable instances via `llm_batch()` | Delegate analysis tasks |
| Answer variable | `answer["content"]` + `answer["ready"]` | Iterative response refinement |
The main model never directly processes massive inputs. Instead, it writes Python code to manage the information flow.
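This control flow can be sketched as a small host loop: the model emits Python, the host executes it in a persistent namespace, and the loop ends when the answer variable signals readiness. `model_turn` below is a canned stand-in for the real LLM call so the sketch runs standalone; it is an illustration, not the paper's implementation:

```python
import io
from contextlib import redirect_stdout

# Canned stand-in for the main model: a real RLM would call the LLM with
# the previous REPL output and get back the next code block to execute.
def model_turn(turn: int, repl_output: str) -> str:
    scripted = [
        "chunks = data.split('\\n')\nprint(len(chunks))",
        "answer['content'] = f'{len(chunks)} lines'\nanswer['ready'] = True",
    ]
    return scripted[turn]

def run_rlm(data: str, max_turns: int = 8) -> str:
    # Persistent REPL namespace: variables survive across model turns.
    env = {"data": data, "answer": {"content": "", "ready": False}}
    repl_output = ""
    for turn in range(max_turns):
        code = model_turn(turn, repl_output)
        buf = io.StringIO()
        with redirect_stdout(buf):
            exec(code, env)           # run the model's code in the REPL
        repl_output = buf.getvalue()  # fed back to the model next turn
        if env["answer"]["ready"]:    # completion signal
            break
    return env["answer"]["content"]

print(run_rlm("a\nb\nc"))  # 3 lines
```

Note that the raw `data` lives only in the REPL namespace; the model's context would contain just the code it wrote and the printed output.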
### The Control Plane Design

The Python REPL serves as what Prime Intellect calls "a control plane for long context."[^16] The environment provides:

- **Persistent State:** Variables survive across model turns, enabling complex multi-step workflows.[^17]
- **Parallel Processing:** The `llm_batch()` function spawns multiple sub-LLM calls simultaneously, dramatically accelerating tasks that can be parallelized.[^18]
- **Tool Isolation:** Only sub-LLMs receive tool access (web search, file reading, etc.), preventing token bloat in the main model's context.[^19]
- **Pre-installed Packages:** Math-heavy tasks get `numpy`, `scipy`, and `sympy` automatically available.[^20]
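One plausible way to implement such a batch helper is a thread pool over a single-call backend. `call_model` here is a stub standing in for a real LLM API, so the sketch runs standalone:

```python
from concurrent.futures import ThreadPoolExecutor

# Stub backend: stands in for a real LLM API call so the sketch runs.
def call_model(prompt: str) -> str:
    return f"analysis of: {prompt}"

def llm_batch(requests: list[dict], max_workers: int = 8) -> list[str]:
    """Run many sub-LLM calls in parallel, preserving request order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: call_model(r["prompt"]), requests))

print(llm_batch([{"prompt": "section A"}, {"prompt": "section B"}]))
```

`pool.map` keeps results in request order, which matters when the main model zips outputs back against the sections it dispatched.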
### The Answer Variable Pattern

Final responses emerge iteratively rather than in a single generation:[^21]

```python
# Model writes to answer variable
answer["content"] = "Partial solution..."
answer["ready"] = False  # Continue processing

# Later iterations refine
answer["content"] = "Complete solution..."
answer["ready"] = True  # Signal completion
```

This pattern enables the model to revise and improve answers through multiple REPL interactions.[^22]
## How Context Folding Works

The researchers describe RLM's approach as "learned context folding" rather than summarization:[^23]

### Not Summarization

Traditional approaches compress context through summarization, inevitably losing information.[^24] The RLM approach preserves all information by:

- Storing raw data in Python variables
- Querying selectively through sub-LLM calls
- Transforming programmatically when needed
- Never summarizing the original content
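The steps above can be sketched as: keep the raw document in variables, filter chunks programmatically, and send only matching chunks to sub-LLMs. The `llm_batch` stub below is an assumption so the example runs standalone:

```python
# Stub for the sub-LLM batch call, so the example runs standalone.
def llm_batch(requests: list[dict]) -> list[str]:
    return [f"answer drawn from {len(r['prompt'])} chars" for r in requests]

def fold_and_query(document: str, keyword: str,
                   chunk_chars: int = 1000) -> list[str]:
    # 1. Store raw data in variables: nothing is summarized away.
    chunks = [document[i:i + chunk_chars]
              for i in range(0, len(document), chunk_chars)]
    # 2. Select programmatically: a cheap filter picks candidate chunks.
    relevant = [c for c in chunks if keyword.lower() in c.lower()]
    # 3. Query selectively: only matching chunks reach sub-LLMs.
    return llm_batch([{"prompt": f"Describe {keyword} in: {c}"}
                      for c in relevant])

doc = "filler " * 300 + "the treasurer approved the budget " + "filler " * 300
print(fold_and_query(doc, "treasurer"))  # one chunk matched
```

Because the full text stays in `chunks`, a later turn can re-filter with a different keyword; nothing was lost to an early summary.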
### Delegation to Sub-LLMs

When the main model needs analysis of a large document section, it spawns a sub-LLM with just that section:[^25]

```python
# Main model code
results = llm_batch([
    {"prompt": f"Analyze section 1: {section_1}"},
    {"prompt": f"Analyze section 2: {section_2}"},
    {"prompt": f"Analyze section 3: {section_3}"},
])
# Process results without ever loading the full document
```

The main model's context stays lean while sub-LLMs handle token-heavy operations.[^26]
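This delegation amounts to a map-reduce pattern: sub-LLMs "map" over raw sections, and the main model "reduces" from their short analyses rather than from the raw text. `call_model` and `llm_batch` below are stubs for illustration:

```python
# Stubs standing in for real LLM calls, so the sketch runs standalone.
def call_model(prompt: str) -> str:
    return f"[synthesis over {len(prompt)} chars]"

def llm_batch(requests: list[dict]) -> list[str]:
    return [f"short analysis of section {i}" for i, _ in enumerate(requests)]

def map_reduce_qa(sections: list[str], question: str) -> str:
    # Map: each raw section goes to its own sub-LLM.
    analyses = llm_batch([{"prompt": f"{question}\n\n{s}"} for s in sections])
    # Reduce: the main model synthesizes from the short analyses only;
    # the raw sections never enter its context.
    merged = "\n".join(analyses)
    return call_model(f"Synthesize an answer to '{question}' from:\n{merged}")

sections = ["x" * 5000, "y" * 5000, "z" * 5000]
print(map_reduce_qa(sections, "What changed?"))
```

The reduce prompt contains only the joined analyses, so its size is governed by the number of sections, not by their length.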
## Benchmark Performance

Testing across four diverse environments reveals significant capabilities:[^27]

### DeepDive (Web Research)

Sub-LLMs handle web scraping and content extraction. The main model orchestrates search strategy and synthesizes findings.[^28]
| Approach | Reward |
|---|---|
| Standard LLM | ~45% |
| RLM (no tips) | ~55% |
| RLM (with strategy tips) | ~70% |
### Oolong (Long Context Aggregation)

Real-world complex documents (D&D session transcripts) tested context handling at ~1.5M characters:[^29]
| Approach | Accuracy |
|---|---|
| Standard LLM | ~35% |
| RLM | ~75% |
Notably, RLM underperformed on synthetic data, suggesting the approach excels on realistic complexity rather than artificial benchmarks.
### CodeQA (Document QA)

Perhaps the most striking result:[^30]
| Model | Approach | Accuracy |
|---|---|---|
| GPT-5 | Baseline | 24.0% |
| GPT-5 | Summarization agent | 41.3% |
| GPT-5 | RLM | 62.0% |
RLM more than doubled baseline performance while substantially outperforming summarization.
### Verbatim-Copy

This task tests JSON structure preservation through iterative refinement:[^31]
| Approach | Accuracy |
|---|---|
| Standard LLM | ~65% |
| RLM | ~77% |
### Math-Python (Caveat)

RLM currently underperforms on math tasks by 15-25%:[^32]
| Approach | Accuracy |
|---|---|
| Standard LLM | ~70% |
| RLM | ~50% |
Researchers attribute this to models not yet trained to effectively use RLM scaffolding for mathematical reasoning.
## Token Efficiency

Beyond accuracy, RLM dramatically improves token efficiency:[^33]

- **Main Model Tokens:** 2-3x reduction in tokens processed by the primary model for equivalent or better results.[^34]
- **Total Tokens:** May increase due to sub-LLM calls, but main model context stays bounded regardless of input size.[^35]
- **Latency Tradeoff:** Sequential REPL operations add 40-80% latency compared to single-pass inference.[^36]
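A rough worked example makes the tradeoff concrete. All numbers below are illustrative assumptions, not measurements from the paper:

```python
# Illustrative token accounting; every number here is an assumption
# chosen for round arithmetic, not a measurement from the paper.
input_tokens = 1_000_000      # raw input handled by sub-LLMs
chunk_tokens = 20_000         # input seen by each sub-LLM
reply_tokens = 200            # short analysis returned per sub-LLM
orchestration_tokens = 3_000  # code and instructions in the main context

n_chunks = input_tokens // chunk_tokens          # 50 sub-LLM calls
main_model_tokens = orchestration_tokens + n_chunks * reply_tokens
total_tokens = main_model_tokens + input_tokens  # sub-LLMs read everything

print(main_model_tokens)  # 13000: far below the 1M-token input
print(total_tokens)       # 1013000: total work can exceed a single pass
```

The main model's context holds only code and short replies, while the total token bill shifts onto parallelizable sub-LLM calls.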
## Prime Intellect's 2026 Predictions

Prime Intellect has built RLM training infrastructure and makes bold predictions:[^37]

### The Paradigm of 2026

They position RLMs as the next major breakthrough based on three premises:[^38]

1. **Training Advantage:** Unlike fixed scaffolds, RLMs can be trained end-to-end with reinforcement learning to improve context management.[^39]
2. **Complementary to Attention:** "Both efficient attention and context folding are needed for true long agents. Better attention delays context degradation. Context folding enables active management."[^40]
3. **Long-Horizon Agents:** RLMs enable agents that operate over weeks or months, managing context across extended task timelines.[^41]
### RLMEnv Infrastructure

Prime Intellect released RLM-compatible environments and training infrastructure:[^42]
- Multiple environments on their Environments Hub
- Integration with prime-rl training framework
- Open for community experimentation
### Untapped Potential

Current models show "significant performance being left untapped due to poor usage of the scaffolding."[^43] Models not specifically trained for RLM underutilize its capabilities. This suggests major gains from RLM-native training.
## Open Source Release

The MIT team released complete resources:[^44]
- Paper: arXiv:2512.24601
- Code: https://github.com/alexzhang13/rlm
- Environments: Various long-context benchmarks
## Implications for AI Development

### Agent Architecture

RLMs suggest a new pattern for building capable agents:[^45]
- Orchestrator model with bounded context
- Worker sub-LLMs handling specific tasks
- Python environment for state management
- Iterative refinement rather than single-shot
### Training Requirements

To fully leverage RLMs, models need training that includes:[^46]
- Code generation for REPL interaction
- Sub-LLM delegation strategies
- Multi-turn answer refinement
- Long-horizon reward signals
### Cost Structure

RLMs shift costs from context length to orchestration complexity:[^47]
| Dimension | Traditional | RLM |
|---|---|---|
| Main model context | Scales with input | Bounded |
| Sub-LLM calls | N/A | Scales with complexity |
| Latency | Single pass | Multi-turn |
| Memory | Scales with context | Bounded |
## Key Takeaways
Recursive Language Models introduce a paradigm shift in context handling:
- Active Context Management: Models control their own context rather than passively receiving it
- 100x Extension: Handle inputs far beyond native context windows
- Preserved Information: No summarization-based information loss
- Token Efficiency: 2-3x reduction in main model token consumption
- Training Potential: Major gains expected from RLM-native training
- Long-Horizon Agents: Architecture suited for extended task timelines
Prime Intellect's conviction that RLMs represent "the paradigm of 2026" reflects growing recognition that context management may matter more than context length.
## References

[^1]: Anthropic. "Claude 3 Model Card." March 2024. https://www.anthropic.com/claude-3-model-card
[^2]: Liu, N. et al. "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172. July 2023. https://arxiv.org/abs/2307.03172
[^3]: Zhang, A. et al. "Recursive Language Models." arXiv:2512.24601. December 2025. https://arxiv.org/abs/2512.24601
[^4]: Ibid., Abstract.
[^5]: Ibid., Section 2: Method.
[^6]: Ibid., Section 4: Experiments.
[^7]: Ibid., Table 1: CodeQA Results.
[^8]: Prime Intellect. "Recursive Language Models: the paradigm of 2026." January 2026. https://www.primeintellect.ai/blog/rlm
[^9]: Vaswani, A. et al. "Attention Is All You Need." NeurIPS 2017.
[^10]: Liu et al., op. cit.
[^11]: Ibid.
[^12]: Zhang et al., op. cit., Section 1.
[^13]: Kwon, W. et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023.
[^14]: Zhang et al., op. cit., Section 2.
[^15]: Ibid., Section 2.1: Architecture Components.
[^16]: Prime Intellect, op. cit.
[^17]: Zhang et al., op. cit., Section 2.2.
[^18]: Ibid., Section 2.3: Parallel Sub-LLMs.
[^19]: Ibid.
[^20]: Prime Intellect, op. cit.
[^21]: Zhang et al., op. cit., Section 2.4: Answer Variable.
[^22]: Ibid.
[^23]: Prime Intellect, op. cit.
[^24]: Zhang et al., op. cit., Section 5: Comparison.
[^25]: Ibid., Section 2.3.
[^26]: Prime Intellect, op. cit.
[^27]: Zhang et al., op. cit., Section 4.
[^28]: Ibid., Section 4.1: DeepDive.
[^29]: Ibid., Section 4.2: Oolong.
[^30]: Ibid., Table 1.
[^31]: Ibid., Section 4.3: Verbatim-Copy.
[^32]: Ibid., Section 4.4: Math-Python.
[^33]: Prime Intellect, op. cit.
[^34]: Ibid.
[^35]: Ibid.
[^36]: Ibid.
[^37]: Ibid.
[^38]: Ibid.
[^39]: Ibid.
[^40]: Ibid.
[^41]: Ibid.
[^42]: Prime Intellect. "RLMEnv." https://github.com/PrimeIntellect-ai/rlmenv
[^43]: Prime Intellect blog, op. cit.
[^44]: Zhang, A. "Recursive Language Models." GitHub. https://github.com/alexzhang13/rlm
[^45]: MarkTechPost. "Recursive Language Models (RLMs): From MIT's Blueprint to Prime Intellect's RLMEnv." January 2026. https://www.marktechpost.com/2026/01/02/recursive-language-models-rlms-from-mits-blueprint-to-prime-intellects-rlmenv-for-long-horizon-llm-agents/
[^46]: Prime Intellect, op. cit.
[^47]: Zhang et al., op. cit., Section 6: Discussion.