
Meta AI’s ‘Early Experience’ Trains Language Agents without Rewards—and Outperforms Imitation Learning

Understanding the Target Audience

Meta AI’s ‘Early Experience’ research speaks primarily to AI researchers, technology business leaders, and product managers building AI applications. Their main pain points are the limitations of traditional training methods: imitation learning (IL) and reinforcement learning (RL) can be resource-intensive and difficult to scale. They are looking for efficient, scalable ways to improve language-agent performance without relying heavily on expert demonstrations or hand-built reward systems, and their interests center on innovative AI methodologies and machine-learning advances with practical business impact. They favor clear, concise, data-driven coverage that highlights practical implications and technical specifics.

Overview of Early Experience

Meta Superintelligence Labs proposes ‘Early Experience’, a reward-free training approach that improves policy learning in language agents without large human demonstration sets or reinforcement learning in the main loop. The core idea is simple: let the agent branch from expert states, take its own actions, collect the resulting future states, and convert those outcomes into supervision. The method is instantiated through two strategies, Implicit World Modeling (IWM) and Self-Reflection (SR), which show consistent improvements across eight environments and multiple base models.

Key Changes with Early Experience

Traditional training pipelines typically rely on imitation learning, which is cost-effective but struggles with scalability and robustness in diverse scenarios. In contrast, Early Experience offers a middle ground: it is reward-free like IL, but the supervision is based on the consequences of the agent’s actions rather than solely on expert actions. This means the agent can propose, act, and learn from actual outcomes without needing a reward function.

Strategies Implemented

  • Implicit World Modeling (IWM): This strategy trains the model to predict the next observation based on the current state and chosen action, thereby refining the agent’s internal model of environmental dynamics and minimizing off-policy drift.
  • Self-Reflection (SR): This approach presents both the expert action and alternative actions at the same state, prompting the model to explain why the expert action is superior based on the observed outcomes. The rationale is then used to fine-tune the policy. A minimal sketch of both objectives follows this list.
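
To make the two objectives concrete, the sketch below casts them as ordinary (prompt, target) pairs for supervised fine-tuning of a causal language model. The prompt templates, field names, and helper functions are illustrative assumptions for this post, not the exact formats used in the paper.

```python
# Illustrative only: casting the two Early Experience objectives as
# (prompt, target) pairs for standard supervised fine-tuning of a causal LM.
# Templates and field names are assumptions for this sketch, not the paper's formats.

def iwm_example(state: str, action: str, next_state: str) -> dict:
    """Implicit World Modeling: predict the next observation from (state, action)."""
    prompt = (
        "Current observation:\n" + state +
        "\nAgent action:\n" + action +
        "\nPredict the next observation:"
    )
    return {"prompt": prompt, "target": next_state}

def sr_example(state: str, expert_action: str,
               alternatives: list[tuple[str, str]],
               rationale: str) -> dict:
    """Self-Reflection: explain why the expert action is preferable, grounded in
    the observed outcomes of alternative actions the agent actually executed."""
    alt_text = "\n".join(
        f"- Alternative action: {a}\n  Observed outcome: {o}" for a, o in alternatives
    )
    prompt = (
        "Current observation:\n" + state +
        "\nExpert action:\n" + expert_action +
        "\nAlternatives the agent tried:\n" + alt_text +
        "\nExplain why the expert action is preferable, then restate it:"
    )
    return {"prompt": prompt, "target": rationale + "\nChosen action: " + expert_action}
```

Because neither example needs a reward signal, both kinds can be mixed into a single next-token-prediction fine-tuning run.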

Benchmark Evaluation

The research team evaluated Early Experience across eight language-agent environments, including WebShop (transactional browsing), TravelPlanner (constraint-rich planning), and ScienceWorld. The results indicated average absolute gains of +9.6 in success rates and +9.4 in out-of-domain performance compared to IL across all tasks and models. These improvements remain significant even when the same checkpoints are used to initialize RL, enhancing post-RL performance by up to +6.4 compared to RL initialized from IL.

Efficiency and Data Generation

A notable advantage of Early Experience is its demo efficiency. With a fixed optimization budget, it matches or surpasses IL while utilizing a fraction of expert data. For instance, on WebShop, using only 1/8 of the demonstrations with Early Experience already outperforms IL trained on the full demo set. In ALFWorld, parity is achieved with just 1/2 of the demonstrations. This indicates that agent-generated future states provide valuable supervision signals that traditional demonstrations do not capture.

Data Construction

The pipeline begins with a limited set of expert rollouts to identify representative states. At these states, the agent proposes alternative actions, executes them, and records the subsequent observations. For IWM, the training data consists of triplets ⟨state, action, next-state⟩, with the objective of predicting the next state. For SR, the prompts include the expert action and several alternatives, along with their observed outcomes, allowing the model to produce a grounded rationale for the expert action’s superiority, which is then used to enhance the policy.
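
The paragraph above maps naturally onto a short collection loop. In the sketch below, `env.reset_to`, `env.step`, and `propose_actions` are hypothetical stand-ins for whatever environment and agent interfaces a given stack exposes; the paper describes the pipeline only at the level of the prose above.

```python
# Illustrative branching-and-collection loop for Early Experience data.
# The environment interface, proposal helper, and record formats are
# hypothetical stand-ins, not APIs from the paper.

def collect_early_experience(expert_rollouts, env, propose_actions, k_alternatives=3):
    iwm_data, sr_data = [], []
    for rollout in expert_rollouts:                    # limited set of expert trajectories
        for state, expert_action in rollout:           # representative expert states
            alternatives = propose_actions(state, k=k_alternatives)
            outcomes = []
            for action in alternatives:
                env.reset_to(state)                    # branch from the expert state
                next_state = env.step(action)          # execute the agent's own action
                outcomes.append((action, next_state))
                iwm_data.append({"state": state, "action": action,
                                 "next_state": next_state})
            sr_data.append({"state": state, "expert_action": expert_action,
                            "alternatives": outcomes})  # rationale generated later
    return iwm_data, sr_data
```

The SR records are later turned into reflection prompts (as in the earlier sketch) once the model has produced a rationale comparing the observed outcomes against the expert action.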

Integration with Reinforcement Learning

It is important to note that Early Experience is not simply “RL without rewards.” Instead, it is a supervised approach that utilizes outcomes experienced by the agent as labels. In environments where verifiable rewards exist, RL can be added after Early Experience. The improved initialization from Early Experience allows the RL schedule to achieve higher and faster performance, with up to +6.4 final success over IL-initialized RL across tested domains. This positions Early Experience as a bridge between imitation learning and reinforcement learning.
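
Operationally, this amounts to a two-stage schedule: a reward-free supervised stage on Early Experience data, followed by conventional RL wherever a verifiable reward exists. The outline below is a hypothetical sketch of that ordering; the callables passed in (data collection, supervised fine-tuning, and an RL routine such as GRPO) are placeholders, not APIs from the paper or any particular library.

```python
# Hypothetical two-stage schedule: Early Experience first, RL (e.g., GRPO) second.
# collect_fn, sft_fn, and rl_fn are placeholders for a stack's own data-collection,
# supervised fine-tuning, and RL routines.

def train_agent(base_model, expert_rollouts, env,
                collect_fn, sft_fn, rl_fn=None, reward_fn=None):
    # Stage 1: reward-free Early Experience supervision (IWM + SR examples).
    early_experience_data = collect_fn(expert_rollouts, env)
    policy = sft_fn(base_model, early_experience_data)

    # Stage 2 (optional): where a verifiable reward exists, continue with RL,
    # initializing from the Early Experience checkpoint rather than from IL.
    if rl_fn is not None and reward_fn is not None:
        policy = rl_fn(policy, env, reward_fn)
    return policy
```

The only structural change relative to an IL-then-RL pipeline is the checkpoint used to initialize the RL stage, which is where the reported gain of up to +6.4 over IL-initialized RL comes from.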

Key Takeaways

  • Reward-free training via agent-generated future states outperforms imitation learning across eight environments.
  • Reported absolute gains over IL include +18.4 (WebShop), +15.0 (TravelPlanner), and +13.3 (ScienceWorld) under matched budgets and settings.
  • Demo efficiency: exceeds IL on WebShop with 1/8 of demonstrations; reaches ALFWorld parity with 1/2 at fixed optimization costs.
  • As an initializer, Early Experience improves final performance of subsequent RL (GRPO) by up to +6.4 compared to RL starting from IL.
  • Validated across multiple backbone families (3B–8B) with consistent improvements in both in-domain and out-of-domain performance.

Conclusion

Early Experience represents a significant advancement in training language agents, providing a scalable and efficient alternative to traditional methods. By leveraging outcome-grounded supervision, it addresses the challenges of off-policy drift and long-horizon error accumulation, making it a practical solution for production agent stacks in environments where verifiable rewards are scarce.

For further details, refer to the original research paper.