
pip install rlmflow
TL;DR
rlmflow turns Recursive Language Models into inspectable execution graphs. It’s a Python library for writing RLM agents where every query, action, observation, delegation, wait, resume, and result is a typed, immutable Pydantic node, and a run is just the tree of those snapshots.
The whole engine is one transition: step(node) → node'. The trace and the execution are the same data structure — there is no separate “tracing mode” to enable — so the same run renders as a Rich live tree, a Mermaid diagram, a Gantt swimlane, or a Gradio step-through viewer, all from one-line projections of the graph.
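A minimal sketch of that idea: a run as a tree of immutable snapshots, with a view produced as a one-line projection of the graph. A frozen dataclass stands in for rlmflow's actual Pydantic node types here; the field names and the `to_mermaid` helper are illustrative assumptions, not rlmflow's API.

```python
from dataclasses import dataclass

# Frozen dataclass as a stand-in for rlmflow's immutable Pydantic nodes.
@dataclass(frozen=True)
class Node:
    kind: str                         # e.g. "query", "delegation", "result"
    payload: str
    children: tuple["Node", ...] = ()

def to_mermaid(root: Node, node_id: str = "n") -> list[str]:
    """Project the tree into Mermaid edge lines, one line per edge."""
    lines = []
    for i, child in enumerate(root.children):
        child_id = f"{node_id}_{i}"
        lines.append(f"{node_id}[{root.kind}] --> {child_id}[{child.kind}]")
        lines.extend(to_mermaid(child, child_id))
    return lines

run = Node("query", "find the code", (
    Node("delegation", "scan chunk 0"),
    Node("result", "84721"),
))
print("\n".join(to_mermaid(run)))
```

Because the node tree is the run, a renderer is just a pure function over it; a Rich tree or Gantt view would be another projection of the same structure.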
That graph allows you to inspect each subagent, replay from a checkpoint, fork from any node, and edit a branch before continuing. We’ll walk through those moves on a real coding-agent run shipped with the repo.
Introduction
Context rot is the failure mode every practitioner has hit — a Claude Code session that “gets dumber”, a Cursor chat that forgets the file you opened thirty messages ago, a research agent that can quote your prompt back but can’t use it. Anthropic defines it as recall degrading as the window grows: frontier models advertise 200k–1M tokens and degrade long before that — the tokens fit, the model just can’t reason over them all at once. Easy benchmarks miss this (RULER is constant-complexity and frontier models score 90%+), but Chroma, OOLONG, and lost-in-the-middle all show real degradation well below the nominal limit.
Existing fixes — bigger windows, retrieval, summarization, context-folding — each pick a decomposition strategy for the model, ahead of time. Even though they work in practice, they are also exactly the pattern Sutton’s Bitter Lesson warns about: hard-coded human structure that wins in the short run but loses in the long run to general methods that scale with compute. As capability improves, the fixed strategy becomes the ceiling.
Recursive Language Models
Recursive Language Models (RLMs) flip that. An LLM sits in a Python REPL with the long context bound as a variable, and a single extra primitive — delegate — lets it spawn a fresh sub-agent with its own window. From there the model peeks, slices, greps, or recursively delegates only when it decides to. RAG retrieves; RLMs investigate. Empirically the case is strong: RLM(GPT-5-mini) beats raw GPT-5 on a tough long-context benchmark at roughly the same API cost, and holds up on 10M+ token corpora that no direct baseline can fit (post, paper, rlm-minimal, verifiers).
But as the number of sub-agents grows, the tree gets hard to observe and control: parents spawn children, children spawn more children, results bubble back up, and a flat transcript hides almost everything you’d want to ask of the run. That’s where rlmflow comes in — representing sprawling trees of recursive agents as inspectable, controllable graphs.
RLMs are graphs
To better understand what this means, start with the canonical RLM demo: needle-in-a-haystack. The context is a huge synthetic document, and the question is simple: what secret code is hidden inside it?
The root agent looks at the document and decides not to read the whole thing itself. It splits the haystack across a few sub-agents:
- one child scans the first third for the needle phrase,
- another scans the middle third,
- a third child scans the final third, finds several near-matches, and spawns two smaller children to inspect the candidate windows,
- a verifier child checks the candidate code against the original question,
- the root agent returns the final code.
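The fan-out in that first step is just index arithmetic over the haystack. A sketch under the assumption of contiguous character spans; `chunk_spans` is a hypothetical helper for illustration, not part of rlmflow.

```python
def chunk_spans(n_chars: int, n_chunks: int) -> list[tuple[int, int]]:
    """Split [0, n_chars) into n_chunks contiguous spans, like the root
    handing thirds of the haystack to chunk_0, chunk_1, chunk_2."""
    size, rem = divmod(n_chars, n_chunks)
    spans, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < rem else 0)  # spread the remainder
        spans.append((start, end))
        start = end
    return spans

print(chunk_spans(10, 3))  # → [(0, 4), (4, 7), (7, 10)]
```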
That is still a small run, but it is already recursive: the root has children, and one of those children has children of its own. The important detail is that those children are not single black-box API calls. Each child is an agent with its own little loop: inspect the context, run a search, read a passage, maybe delegate again, then return.
In a minimal RLM-style implementation, every delegate(name, query, ctx) call is the LLM call: it spins up a fresh sub-LLM with its own REPL — bound to ctx as CONTEXT — runs that sub-LLM’s agent loop until it calls done(value), and hands the value back as a str. A child’s REPL can call delegate again, and so on.
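The delegate/done contract above can be sketched in a few lines. Everything here is a toy: `fake_agent_loop` stands in for a real LLM plus REPL, the recursion just halves the context, and a real sub-agent would handle needles that straddle a chunk boundary. This shows the shape of the contract, not rlm-minimal's actual code.

```python
def delegate(name: str, query: str, ctx: str) -> str:
    """Run a fresh sub-agent loop over its own slice of the context;
    the value it passes to done() is what the parent gets back."""
    result = {}

    def done(value: str) -> None:
        result["value"] = str(value)

    fake_agent_loop(query, ctx, done, delegate)  # a child may delegate again
    return result["value"]

def fake_agent_loop(query, ctx, done, delegate):
    # Stand-in for an LLM driving a REPL over CONTEXT: if the context is
    # "large", split it and recurse; otherwise answer directly.
    if len(ctx) > 8:
        mid = len(ctx) // 2
        left = delegate("left", query, ctx[:mid])
        right = delegate("right", query, ctx[mid:])
        done(left or right)
    else:
        done(query if query in ctx else "")

print(delegate("root", "84721", "." * 16 + "84721"))  # prints 84721
```

Note what the parent sees: a single string. The entire subtree of recursive calls that produced it is gone by the time `delegate` returns, which is exactly the problem discussed next.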
The parent never sees any of it.
That’s the core observability problem with vanilla RLMs: a single delegate() call can hide an entire recursive subtree of LLM work, and nothing about that subtree survives the return. Children can delegate to children can delegate to children — and all the parent ever gets is a list[str]. When the answer is wrong, you can’t tell which level of the recursion went off the rails; when the answer is right, you can’t tell whether it was right for the right reason. The abstraction is too clean: the act of delegating throws away exactly the structure you’d want to debug, evaluate, or steer.
rlmflow keeps that structure — every recursive call is a node in an execution graph that you can step through, inspect, and replay:
This is the same run, but now the children are not opaque recursive calls. The root reaches a supervising node and stops; at that moment the runnable frontier is root.chunk_0, root.chunk_1, and root.chunk_2. Those children can advance independently, so the graph shows parallel work without pretending it is one conversation.
Then root.chunk_2 reaches its own supervising node. The frontier changes again: now root.chunk_2.a and root.chunk_2.b are runnable while both root and root.chunk_2 are parked. When those candidate readers finish, root.chunk_2 resumes, returns 84721, and only then can root resume and verify the final code.
That is the step-by-step execution state. You can pause after any node, inspect exactly what one child saw, fork from the candidate reader, or replace a bad child result before the parent resumes. The flat recursive-call view tells you what returned. The graph tells you how the answer moved through the run.
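One way to picture that frontier bookkeeping, using a hypothetical status field rather than rlmflow's actual schema: supervising nodes are parked while their children run, and the runnable frontier is exactly the set of nodes still free to advance.

```python
# Snapshot of the run at the moment root and root.chunk_2 are both parked.
# The node names come from the walkthrough above; "status" is illustrative.
tree = {
    "root":            {"status": "supervising"},
    "root.chunk_0":    {"status": "running"},
    "root.chunk_1":    {"status": "running"},
    "root.chunk_2":    {"status": "supervising"},
    "root.chunk_2.a":  {"status": "running"},
    "root.chunk_2.b":  {"status": "running"},
}

def frontier(tree: dict) -> list[str]:
    """Nodes that can advance right now; supervisors wait on children."""
    return sorted(name for name, d in tree.items() if d["status"] == "running")

print(frontier(tree))
```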
rlmflow stores the run in that shape from the start. The graph isn’t a visualization recovered from logs after the fact — it’s the data model. Every node is a complete checkpoint: enough state to resume the run, inspect what led there, or compare against another branch. That’s why the whole engine fits in one transition:
node = agent.start(query)
while not node.terminal:
    node = agent.step(node)
And it’s what gives rlmflow its four primitives:
- Inspect one agent without rereading every sibling’s messages.
- Replay from a saved node instead of starting the whole run over.
- Fork from a node and try a different model, prompt, or workspace.
- Edit a branch by replacing a bad child result and continuing from the parent.
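All four primitives fall out of immutability. A hedged sketch where `Agent`, `Node`, and the doubling step are stand-ins (only the step(node) → node' shape comes from the text above): because every node in the trace is a complete checkpoint, replay is list indexing and fork is `dataclasses.replace`.

```python
from dataclasses import dataclass, replace

# Toy stand-ins for rlmflow's agent and node types.
@dataclass(frozen=True)
class Node:
    value: int
    terminal: bool = False

class Agent:
    def start(self, query: int) -> Node:
        return Node(value=query)

    def step(self, node: Node) -> Node:
        nxt = node.value * 2              # stand-in for one agent step
        return Node(value=nxt, terminal=nxt >= 16)

agent = Agent()
trace = [agent.start(1)]
while not trace[-1].terminal:
    trace.append(agent.step(trace[-1]))   # every node is a checkpoint

checkpoint = trace[2]                     # replay: resume from any saved node
forked = replace(checkpoint, value=5)     # fork/edit: change state, then continue
resumed = agent.step(forked)
print([n.value for n in trace], resumed.value)
```

Editing a branch is the same move one level up: replace a bad child result in the parent's checkpoint, then keep stepping the parent.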
Acknowledgements
Alex Zhang and Omar Khattab for coming up with RLMs. The rlm-minimal and ypi codebases for being readable and hackable; most of the prompt structure was learned from them.
Citation
@misc{sudhakaran2026rlmflow,
  author       = {Sudhakaran, Shyam},
  title        = {Recursive Language Models are Graphs},
  year         = {2026},
  howpublished = {\url{https://github.com/shyamsn97/rlmflow}}
}
