GitHub · PyPI · Examples · Changelog

Hero animation: an rlmflow run unfolding from a single root agent into a tree of typed nodes

pip install rlmflow

tldr

rlmflow turns Recursive Language Models into inspectable execution graphs. It’s a Python library for writing RLM agents where every query, action, observation, delegation, wait, resume, and result is a typed, immutable Pydantic node, and a run is just the tree of those snapshots.

The whole engine is one transition: step(node) → node'. The trace and the execution are the same data structure — there is no separate “tracing mode” to enable — so the same run renders as a Rich live tree, a Mermaid diagram, a Gantt swimlane, or a Gradio step-through viewer, all from one-line projections of the graph.
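To make “typed, immutable node” concrete, here is a minimal sketch of what such a snapshot could look like. The field names and kind values are illustrative assumptions, not rlmflow’s actual schema:

# Illustrative sketch: field names and kinds are assumptions,
# not rlmflow's actual schema.
from typing import Literal, Optional
from pydantic import BaseModel, ConfigDict

class Node(BaseModel):
    model_config = ConfigDict(frozen=True)   # immutable once created

    kind: Literal["query", "action", "observation",
                  "delegation", "wait", "resume", "result"]
    agent: str                        # e.g. "root.chunk_2.b"
    payload: str                      # what this step saw or produced
    parent_id: Optional[str] = None   # edge back up the tree
    terminal: bool = False            # True once the agent has returned

A run is then just these snapshots linked by parent_id, which is why the Rich, Mermaid, Gantt, and Gradio views can all be traversals over the same objects.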

That graph allows you to inspect each subagent, replay from a checkpoint, fork from any node, and edit a branch before continuing. We’ll walk through those moves on a real coding-agent run shipped with the repo.

Introduction

Context rot is the failure mode every practitioner has hit — a Claude Code session that “gets dumber”, a Cursor chat that forgets the file you opened thirty messages ago, a research agent that can quote your prompt back but can’t use it. Anthropic defines it as recall degrading as the window grows: frontier models advertise 200k–1M tokens and degrade long before that — the tokens fit, the model just can’t reason over them all at once. Easy benchmarks miss this (RULER is constant-complexity and frontier models score 90%+), but Chroma, OOLONG, and lost-in-the-middle all show real degradation well below the nominal limit.

Existing fixes — bigger windows, retrieval, summarization, context-folding — each pick a decomposition strategy for the model, ahead of time. Even though they work in practice, they are also exactly the pattern Sutton’s Bitter Lesson warns about: hard-coded human structure that wins in the short run but loses in the long run to general methods that scale with compute. As capability improves, the fixed strategy becomes the ceiling.

Recursive Language Models flip that. An LLM sits in a Python REPL with the long context bound as a variable, and a single extra primitive — delegate — lets it spawn a fresh sub-agent with its own window. From there the model peeks, slices, greps, or recursively delegates only when it decides to. RAG retrieves; RLMs investigate. Empirically the case is strong: RLM(GPT-5-mini) beats raw GPT-5 on a tough long-context benchmark at roughly the same API cost, and holds up at 10M+ token corpora no direct baseline can fit (post, paper, rlm-minimal, verifiers).

But as the number of sub-agents grows, the tree gets hard to observe and control: parents spawn children, children spawn more children, results bubble back up, and a flat transcript hides almost everything you’d want to ask of the run. That’s where rlmflow comes in — representing sprawling trees of recursive agents as inspectable, controllable graphs.

RLMs are graphs

To better understand what this means, start with the canonical RLM demo: needle-in-a-haystack. The context is a huge synthetic document, and the question is simple: what secret code is hidden inside it?

The root agent looks at the document and decides not to read the whole thing itself. It splits the haystack across a few sub-agents:

  • one child scans the first third for the needle phrase,
  • another scans the middle third,
  • a third child scans the final third, finds several near-matches, and spawns two smaller children to inspect the candidate windows,
  • a verifier child checks the candidate code against the original question,
  • the root agent returns the final code.

That is still a small run, but it is already recursive: the root has children, and one of those children has children of its own. The important detail is that those children are not single black-box API calls. Each child is an agent with its own little loop: inspect the context, run a search, read a passage, maybe delegate again, then return.

In a minimal RLM-style implementation, every delegate(name, query, ctx) call is the LLM call: it spins up a fresh sub-LLM with its own REPL — bound to ctx as CONTEXT — runs that sub-LLM’s agent loop until it calls done(value), and hands the value back as a str. A child’s REPL can call delegate again, and so on. The parent never sees any of it. The sketch below pins down that contract; the four numbered views then trace it through the haystack run:
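A minimal, self-contained sketch of that contract; Done and llm_emit_repl_block are stand-ins, not rlm-minimal’s actual code:

# Illustrative sketch of the delegate contract, not rlm-minimal's code.
class Done(Exception):
    """Raised by done(value) to end a sub-agent's loop."""
    def __init__(self, value: str):
        self.value = value

def delegate(name: str, query: str, ctx: str) -> str:
    """Run a fresh sub-agent loop; only its final str escapes the frame."""
    def done(value: str) -> None:
        raise Done(value)

    # Fresh namespace: the sub-LLM's REPL sees only these three names.
    ns = {"CONTEXT": ctx, "delegate": delegate, "done": done}
    while True:
        code = llm_emit_repl_block(name, query, ns)  # stand-in model call
        try:
            exec(code, ns)       # run the REPL block the sub-LLM wrote
        except Done as d:
            return d.value       # nothing else survives the return

def llm_emit_repl_block(name: str, query: str, ns: dict) -> str:
    # A real loop would send the query plus REPL history to a model and
    # get Python source back; this stub just answers immediately.
    return f'done("stub answer from {name}")'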

1. What the root LLM emits in its REPL block

# In the root delegate — CONTEXT is the haystack, bound as a variable.
n = CONTEXT.line_count()
chunk_0 = delegate("chunk_0", "scan first third",  CONTEXT.lines(0, n // 3))
chunk_1 = delegate("chunk_1", "scan middle third", CONTEXT.lines(n // 3, 2 * n // 3))
chunk_2 = delegate("chunk_2", "scan final third",  CONTEXT.lines(2 * n // 3, n))
done(extract_code([chunk_0, chunk_1, chunk_2]))   # all three are plain str

2. A child's REPL can recursively delegate(...) too

# In the delegate("chunk_2", ...) — its sub-LLM is now the one writing REPL.
hits   = CONTEXT.grep(r"secret|code|passcode|needle").splitlines()
cand_a = delegate("candidate_a", "Inspect candidate window A.", hits[0])
cand_b = delegate("candidate_b", "Inspect candidate window B.", hits[1])
done("candidate code 84721")   # the root never sees this code ran

3. ...so the call stack nests delegate frames, with no fixed depth

# Live Python stack while candidate_b's sub-LLM is reasoning:
delegate("root",         "What secret code is hidden in the haystack?", haystack)
└── delegate("chunk_2",      "Scan final third...",         final_third)
    └── delegate("candidate_b", "Inspect candidate window B.", line_77)
# 3 LLM agent loops live at once, each with its own messages and CONTEXT.
# nothing on an inner frame is visible to any frame above it.

4. All the root's REPL sees back is three str

# Back in the root delegate — every delegate(...) above returned a str.
chunk_0 == "not found"
chunk_1 == "decoy, no code"
chunk_2 == "candidate code 84721"
# 5 hidden delegate frames and dozens of LLM iterations
# have collapsed into 3 strings. if chunk_2 is wrong, the root has no way
# to ask which inner sub-LLM screwed up, or what its CONTEXT even was.

That’s the core observability problem with vanilla RLMs: a single delegate() call can hide an entire recursive subtree of LLM work, and nothing about that subtree survives the return. Children can delegate to children can delegate to children — and all the parent ever gets is a list[str]. When the answer is wrong, you can’t tell which level of the recursion went off the rails; when the answer is right, you can’t tell whether it was right for the right reason. The abstraction is too clean: the act of delegating throws away exactly the structure you’d want to debug, evaluate, or steer.

rlmflow keeps that structure — every recursive call is a node in an execution graph that you can step through, inspect, and replay:

  • Step 0 / 8 — root receives the query
  • Step 1 / 8 — root delegates 3 chunks and parks in supervising
  • Step 2 / 8 — chunks run; chunk_2 sub-delegates two candidates
  • Step 3 / 8 — both candidate readers finish
  • Step 4 / 8 — chunk_2 returns "candidate code 84721"
  • Step 5 / 8 — root resumes with all three chunk results
  • Step 6 / 8 — root delegates to verify and parks again
  • Step 7 / 8 — verify confirms 84721 matches the question
  • Step 8 / 8 — root returns 84721

This is the same run, but now the children are not opaque recursive calls. The root reaches a supervising node and stops; at that moment the runnable frontier is root.chunk_0, root.chunk_1, and root.chunk_2. Those children can advance independently, so the graph shows parallel work without pretending it is one conversation.

Then root.chunk_2 reaches its own supervising node. The frontier changes again: now root.chunk_2.a and root.chunk_2.b are runnable while both root and root.chunk_2 are parked. When those candidate readers finish, root.chunk_2 resumes, returns 84721, and only then can root resume and verify the final code.

That is the step-by-step execution state. You can pause after any node, inspect exactly what one child saw, fork from the candidate reader, or replace a bad child result before the parent resumes. The flat recursive-call view tells you what returned. The graph tells you how the answer moved through the run.
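The frontier itself is just a projection over the node tree. A hedged sketch, assuming hypothetical status and children attributes that rlmflow may spell differently:

# Hypothetical sketch: "status" and "children" are assumed attributes,
# not rlmflow's documented node fields.
def frontier(node) -> list:
    """Collect the runnable agents beneath a (possibly parked) node."""
    if node.status == "supervising":      # parked parent: look below it
        return [leaf for child in node.children for leaf in frontier(child)]
    if node.status == "runnable":
        return [node]
    return []                             # finished branches contribute nothing

Against the run above, frontier(root) yields the three chunk agents at step 1; once chunk_2 parks, its two candidate readers replace it on the frontier.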

rlmflow stores the run in that shape from the start. The graph isn’t a visualization recovered from logs after the fact — it’s the data model. Every node is a complete checkpoint: enough state to resume the run, inspect what led there, or compare against another branch. That’s why the whole engine fits in one transition:

node = agent.start(query)
while not node.terminal:
    node = agent.step(node)
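Because each node is a complete Pydantic snapshot, checkpointing needs no extra machinery. A sketch reusing the hypothetical Node model from earlier (model_dump_json and model_validate_json are standard Pydantic v2 methods; everything else is assumed):

# Checkpointing as plain serialization; a sketch, not rlmflow API.
from pathlib import Path

Path("checkpoint.json").write_text(node.model_dump_json())   # save mid-run

# Later, even in a fresh process: rebuild the node and keep stepping.
node = Node.model_validate_json(Path("checkpoint.json").read_text())
while not node.terminal:
    node = agent.step(node)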

And it’s what gives rlmflow its four primitives (sketched in code below):

  • Inspect one agent without rereading every sibling’s messages.
  • Replay from a saved node instead of starting the whole run over.
  • Fork from a node and try a different model, prompt, or workspace.
  • Edit a branch by replacing a bad child result and continuing from the parent.
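In code, those moves could look something like this. Every name here (find, replay_from, fork, replace_child) is a hypothetical illustration of the shape such an API might take, not rlmflow’s actual surface:

# Hypothetical shapes only; none of these names are rlmflow's real API.
child = run.find("root.chunk_2.b")        # Inspect: one agent's own nodes
node  = replay_from(child)                # Replay: restart at a checkpoint
alt   = fork(child, model="gpt-5-mini")   # Fork: same node, different model
node  = replace_child(run, at="root.chunk_2",
                      result="candidate code 84721")  # Edit: swap the result
while not node.terminal:                  # ...then continue from the parent
    node = agent.step(node)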

Acknowledgements

Alex Zhang and Omar Khattab for coming up with RLMs. The rlm-minimal and ypi codebases for being readable and hackable; most of the prompt structure was learned from them.


Citation

@misc{sudhakaran2026rlmflow,
  author       = {Sudhakaran, Shyam},
  title        = {Recursive Language Models are Graphs},
  year         = {2026},
  howpublished = {\url{https://github.com/shyamsn97/rlmflow}}
}