
Building a Context-Folding LLM Agent for Long-Horizon Reasoning with Memory Compression and Tool Use


Understanding the Target Audience

The target audience for building a Context-Folding LLM Agent consists primarily of AI researchers, data scientists, and business analysts who want to strengthen long-horizon reasoning in AI systems. They typically work in technology-driven environments such as startups, research institutions, and large enterprises focused on AI innovation.

Pain Points

  • Difficulty managing context in complex tasks, leading to inefficiencies.
  • Challenges in scaling AI models for long-term reasoning without overwhelming memory.
  • Need for effective tools to break down intricate tasks into manageable subtasks.

Goals

  • To develop AI systems that can handle complex, multi-step reasoning tasks.
  • To improve the efficiency and accuracy of AI-generated outputs.
  • To create robust frameworks that allow for memory compression while retaining essential information.

Interests

  • Advancements in machine learning and natural language processing.
  • Practical applications of AI in business management and decision-making.
  • Innovative methodologies for task decomposition and reasoning.

Communication Preferences

This audience prefers technical documentation that is concise and well-structured, with clear examples and code snippets. They appreciate peer-reviewed research and case studies that demonstrate practical applications of AI technologies.

Building a Context-Folding LLM Agent for Long-Horizon Reasoning

In this tutorial, we explore how to build a Context-Folding LLM Agent that efficiently solves long, complex tasks by intelligently managing limited context. The agent is designed to break down large tasks into smaller subtasks, perform reasoning or calculations when needed, and then fold each completed sub-trajectory into concise summaries. This approach preserves essential knowledge while keeping the active memory small.

Setting Up the Environment

We begin by setting up our environment and loading a lightweight Hugging Face model. This model is used to generate and process text locally, ensuring the agent runs smoothly on platforms like Google Colab without any API dependencies.

import os, sys, subprocess, time
from typing import Dict, List

# Install transformers (plus accelerate and sentencepiece) on first run.
try:
    import transformers
except ImportError:
    subprocess.run([sys.executable, "-m", "pip", "install", "-q",
                    "transformers", "accelerate", "sentencepiece"], check=True)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# A lightweight seq2seq model keeps everything local, with no API keys needed.
MODEL_NAME = os.environ.get("CF_MODEL", "google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
llm = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device_map="auto")
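
Before wiring up the agent, it is worth confirming that the pipeline generates text. This short smoke test is our own addition rather than part of the original excerpt, and the prompt is arbitrary:

# Quick sanity check: one short generation to confirm the model loaded.
out = llm("Answer briefly: What is 2 + 2?", max_new_tokens=8)
print(out[0]["generated_text"])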

Implementing a Folding Memory System

We define a memory system that dynamically folds past context into concise summaries. This helps maintain manageable active memory while retaining essential information.

class FoldingMemory:
    """Folds older context into compact stubs so active memory stays bounded."""
    def __init__(self, max_chars: int = 800):
        self.active: List[str] = []  # recent context, kept verbatim
        self.folds: List[str] = []   # compressed summaries of older context
        self.max_chars = max_chars

    def add(self, text: str):
        self.active.append(text.strip())
        # When the active window overflows, fold the oldest entry into a stub.
        while len(self.active_text()) > self.max_chars and len(self.active) > 1:
            popped = self.active.pop(0)
            self.folds.append(f"- Folded: {popped[:120]}...")

    def fold_in(self, summary: str):
        self.folds.append(summary.strip())

    def active_text(self) -> str:
        return "\n".join(self.active)

    def folded_text(self) -> str:
        return "\n".join(self.folds)

    def snapshot(self) -> Dict:
        return {"active_chars": len(self.active_text()), "n_folds": len(self.folds)}
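
To see the folding behavior in isolation, here is a small usage sketch of our own (with an artificially tiny window so folds trigger immediately):

# Demonstration: a 60-character window forces early entries to fold quickly.
mem = FoldingMemory(max_chars=60)
for i in range(4):
    mem.add(f"Step {i}: some intermediate reasoning text.")
print(mem.snapshot())     # e.g. {'active_chars': ..., 'n_folds': ...}
print(mem.folded_text())  # "- Folded: Step 0: ..." style stubs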

Designing Prompt Templates

Structured prompt templates guide the agent in decomposing tasks, solving subtasks, and summarizing outcomes. These templates enable clear communication between reasoning steps and the model’s responses.

SUBTASK_DECOMP_PROMPT="""You are an expert planner. Decompose the task below into 2-4 crisp subtasks.
Return each subtask as a bullet starting with '- ' in priority order.
Task: "{task}" """
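
The agent code in the next section also calls llm_gen, parse_bullets, run_subtask, and FINAL_SYNTH_PROMPT, which this excerpt does not define. The sketch below provides minimal stand-ins so the agent is runnable end to end; these are plausible assumptions on our part, not the article's exact implementations:

FINAL_SYNTH_PROMPT = """You are a precise writer. Using the folded summaries below,
write the final answer to the task.
Task: "{task}"
Folded summaries:
{folds}
Final answer:"""

def llm_gen(prompt: str, max_new_tokens: int = 128) -> str:
    # Thin wrapper around the Hugging Face pipeline defined earlier.
    return llm(prompt, max_new_tokens=max_new_tokens)[0]["generated_text"]

def parse_bullets(text: str) -> List[str]:
    # Collect lines that look like '- item' bullets.
    return [ln.lstrip("- ").strip() for ln in text.splitlines()
            if ln.strip().startswith("- ")]

def run_subtask(task: str, subtask: str, memory: FoldingMemory):
    # Solve one subtask against the active memory, then compress the trace.
    trace = llm_gen(f"Task: {task}\nContext:\n{memory.active_text()}\n"
                    f"Solve this subtask step by step: {subtask}",
                    max_new_tokens=160)
    summary = llm_gen(f"Summarize in one line what was done and found:\n{trace}",
                      max_new_tokens=48)
    final = trace.strip().splitlines()[-1] if trace.strip() else ""
    return final, f"- {subtask}: {summary.strip()}", trace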

Running the Agent

The agent's core loop executes each subtask, summarizes its outcome, and folds the summary back into memory. This shows how context folding lets the agent reason iteratively without losing track of earlier steps.

class ContextFoldingAgent:
    def __init__(self, max_active_chars: int = 800):
        self.memory = FoldingMemory(max_chars=max_active_chars)
        self.metrics = {"subtasks": 0, "tool_calls": 0, "chars_saved_est": 0}

    def decompose(self, task: str) -> List[str]:
        # Ask the LLM for a short prioritized plan, then parse its bullets.
        plan = llm_gen(SUBTASK_DECOMP_PROMPT.format(task=task), max_new_tokens=96)
        subs = parse_bullets(plan)
        return subs[:4] if subs else ["Main solution"]

    def run(self, task: str) -> Dict:
        t0 = time.time()
        self.memory.add(f"TASK: {task}")
        subtasks = self.decompose(task)
        self.metrics["subtasks"] = len(subtasks)
        folded = []
        for st in subtasks:
            self.memory.add(f"SUBTASK: {st}")
            # Execute the subtask, then fold its trace into a compact summary.
            final, fold_summary, trace = run_subtask(task, st, self.memory)
            self.memory.fold_in(fold_summary)
            folded.append(f"- {st}: {final}")
            self.memory.add(f"SUBTASK_DONE: {st}")
        # Synthesize the final answer from the folded summaries alone.
        final = llm_gen(FINAL_SYNTH_PROMPT.format(task=task, folds=self.memory.folded_text()),
                        max_new_tokens=200)
        t1 = time.time()
        return {"task": task,
                "final": final.strip(),
                "folded_summaries": self.memory.folded_text(),
                "active_context_chars": len(self.memory.active_text()),
                "subtask_finals": folded,
                "runtime_sec": round(t1 - t0, 2)}

Conclusion

We demonstrate how context folding enables long-horizon reasoning while avoiding memory overload. Each subtask is planned, executed, summarized, and distilled into compact knowledge, mimicking how an intelligent agent would handle complex workflows over time. By combining decomposition, tool use, and context compression, we create a lightweight yet powerful agentic system that scales reasoning efficiently.