
LifelongAgentBench: A Benchmark for Evaluating Continuous Learning in LLM-Based Agents


Lifelong learning is essential for intelligent agents operating in dynamic environments, yet current LLM-based agents are largely stateless: they retain no memory across tasks and treat each new task as if it were their first. While LLMs have significantly advanced language tasks and inspired agent-based systems, achieving general intelligence requires agents that can retain, adapt, and reuse knowledge over time. Existing benchmarks, however, focus primarily on isolated tasks and neglect skill reuse and knowledge retention.

Lifelong learning, also referred to as continual learning, aims to enable AI systems to accumulate and retain knowledge across various tasks while preventing catastrophic forgetting. Most prior research has concentrated on non-interactive tasks, such as image classification or sequential fine-tuning, where models handle static inputs and outputs without adapting to changing environments. The application of lifelong learning to LLM-based agents in interactive settings remains underexplored. Current benchmarks like WebArena, AgentBench, and VisualWebArena assess one-time task performance but do not facilitate learning over time. Even interactive studies involving games or tools lack standardized frameworks for evaluating lifelong learning in agents.

Researchers from the South China University of Technology, MBZUAI, the Chinese Academy of Sciences, and East China Normal University have introduced LifelongAgentBench, the first comprehensive benchmark for assessing lifelong learning in LLM-based agents. This benchmark features interdependent, skill-driven tasks across three environments: Database, Operating System, and Knowledge Graph. It includes built-in label verification, reproducibility, and a modular design. The study indicates that traditional experience replay methods often fall short due to irrelevant information and context length limitations. To address these challenges, the team proposes a group self-consistency mechanism that clusters past experiences and employs voting strategies, significantly improving lifelong learning performance across various LLM architectures.
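The paper describes group self-consistency only at a high level here, but the core idea, splitting stored experiences into groups so each prompt stays short, querying the model once per group, and aggregating by majority vote, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `generate` callable, the round-robin grouping (the paper clusters by similarity), and the group count are all placeholders.

```python
from collections import Counter
from typing import Callable, List

def group_self_consistency(
    task_prompt: str,
    experiences: List[str],          # stored past trajectories (assumed format)
    generate: Callable[[str], str],  # placeholder LLM call: prompt -> answer
    num_groups: int = 3,             # assumed group count
) -> str:
    """Sketch: split past experiences into groups so each prompt stays short,
    query the model once per group, then majority-vote over the answers."""
    # Partition experiences into roughly equal groups. The paper clusters by
    # similarity; simple round-robin chunking stands in for that here.
    groups = [experiences[i::num_groups] for i in range(num_groups)]

    answers = []
    for group in groups:
        # Each group contributes only its own experiences, keeping the
        # context well under the model's length limit.
        context = "\n\n".join(group)
        prompt = f"Past experiences:\n{context}\n\nTask:\n{task_prompt}"
        answers.append(generate(prompt))

    # Majority vote across group-level answers.
    return Counter(answers).most_common(1)[0][0]
```

The grouping step is what distinguishes this from plain self-consistency: instead of sampling the same full-context prompt repeatedly, each vote comes from a different, smaller slice of experience, sidestepping the context-length limits noted above.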

LifelongAgentBench is structured to evaluate how effectively language-model-based agents learn and adapt over time. The framework casts learning as a sequential decision-making problem, formalized as a goal-conditioned POMDP, in each of the three environments: Database, Operating System, and Knowledge Graph. Tasks are organized around core skills and designed to reflect real-world complexity, accounting for task difficulty, overlapping skills, and environmental noise. Task generation combines automated and manual validation to ensure quality and diversity, and the benchmark tests whether agents can build on past knowledge and continuously improve in dynamic, skill-driven settings.
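To make the goal-conditioned POMDP framing concrete, here is a minimal sketch of the interaction loop it implies: the agent sees only partial observations, conditions its actions on the goal and its history, and acts until the episode ends. All class and method names (`env.reset`, `env.step`, `agent.act`) are illustrative assumptions, not the benchmark's API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Episode:
    """One goal-conditioned rollout: the agent never sees the full state,
    only observations, and success is judged against the goal."""
    goal: str
    steps: List[Tuple[str, str]] = field(default_factory=list)  # (observation, action)

def run_episode(env, agent, goal: str, max_steps: int = 20) -> Episode:
    # `env` and `agent` are placeholders for an environment (e.g. a database
    # or OS shell) and an LLM-backed policy; neither is the benchmark's API.
    episode = Episode(goal=goal)
    observation = env.reset(goal)  # assumed: returns the initial observation
    for _ in range(max_steps):
        # Policy is conditioned on the goal, the current partial observation,
        # and the history of steps so far.
        action = agent.act(goal, observation, episode.steps)
        episode.steps.append((observation, action))
        observation, done = env.step(action)  # assumed: next observation + done flag
        if done:
            break
    return episode
```

Recording full trajectories this way is also what makes the replay and self-consistency mechanisms possible: each completed episode becomes a reusable experience.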

The system is modular: agent, environment, and controller components operate independently and communicate via RPC, which supports reproducibility and flexibility across diverse environments and models. Experimental results show that experience replay, in which agents are shown their own successful past trajectories, can significantly improve performance, particularly on complex tasks. However, longer replay histories inflate the prompt and can exceed the model's context limits, underscoring the need for more efficient replay and memory-management strategies.
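As a rough sketch of the replay mechanism described above, the snippet below prepends successful past trajectories to the task prompt and drops the oldest ones once a context budget is exhausted. The character-based budget is a crude stand-in for token counting, and all names here are illustrative assumptions rather than the framework's actual interface.

```python
from typing import List

def build_replay_prompt(
    task: str,
    successful_trajectories: List[str],  # most recent last
    max_context_chars: int = 8000,       # crude stand-in for a token budget
) -> str:
    """Sketch of experience replay: include as many recent successful
    trajectories as fit in the context budget, newest kept first."""
    budget = max_context_chars - len(task)
    selected: List[str] = []
    # Walk backwards so the most recent successes survive when space runs out.
    for traj in reversed(successful_trajectories):
        if len(traj) > budget:
            break  # older trajectories are dropped entirely
        selected.append(traj)
        budget -= len(traj)
    replay = "\n\n".join(reversed(selected))  # restore chronological order
    return f"Successful past trajectories:\n{replay}\n\nCurrent task:\n{task}"
```

Even this naive policy makes the failure mode visible: as trajectories accumulate, ever more of them must be discarded, which is exactly the limitation that motivates group self-consistency.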

In conclusion, LifelongAgentBench is a pioneering benchmark designed to evaluate the ability of LLM-based agents to learn continuously over time. Unlike previous benchmarks that treat agents as static, this framework tests their capability to build, retain, and apply knowledge across interconnected tasks in dynamic environments, such as databases, operating systems, and knowledge graphs. It offers a modular design, reproducibility, and automated evaluation. While experience replay and group self-consistency show promise in enhancing learning, challenges such as memory overload and inconsistent gains across models persist. This work lays the groundwork for developing more adaptable, memory-efficient agents, with future directions focusing on smarter memory utilization and real-world multimodal tasks.

Check out the Paper. All credit for this research goes to the researchers of this project.