
RA3: Mid-Training with Temporal Action Abstractions for Faster Reinforcement Learning (RL) Post-Training in Code LLMs

Recent research from Apple formalizes the concept of “mid-training” and its impact on reinforcement learning (RL) post-training. The study introduces RA3 (Reasoning as Action Abstractions), an iterative procedure that improves model training through two strategies:

1. Pruning to a compact near-optimal action subspace
2. Shortening the effective planning horizon

These strategies significantly improve RL convergence. Empirical results indicate that RA3 improves code generation performance by approximately 8 points on HumanEval and 4 points on MBPP over the base and NTP (next-token prediction) baselines. It also accelerates RLVR (reinforcement learning with verifiable rewards) convergence on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.

Research Highlights

This research presents the first formal analysis of how mid-training shapes the effectiveness of post-training reinforcement learning. The analysis identifies two key determinants:

1. Pruning Efficiency: how well mid-training selects a compact near-optimal action subset, which shapes the initial policy prior.
2. RL Convergence: how quickly post-training performance improves within the restricted action set.

The team argues that mid-training is most efficient when the decision space is compact and the effective planning horizon is short, favoring temporal abstractions over primitive next-token actions.
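As a rough, generic illustration of why this matters (not a bound from the paper): standard RL sample-complexity results grow polynomially with the size of the action space and the effective horizon, so shrinking either quantity directly reduces the work post-training RL has to do.

```latex
% Generic illustration, not a result from the paper: RL typically needs on the
% order of poly(|A|, H, 1/epsilon) samples to find an epsilon-optimal policy,
% so a smaller action space A and a shorter effective horizon H both speed up
% post-training convergence.
\[
  N_{\text{samples}} = \mathrm{poly}\!\left(|\mathcal{A}|,\; H,\; \tfrac{1}{\epsilon}\right)
\]
```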

RA3 Algorithm Overview

RA3 employs a sequential variational lower bound (a temporal ELBO) and optimizes it through an EM-like iterative loop:

1. E-step (Latent Discovery): Utilizes RL to identify temporally consistent latent structures aligned with expert sequences.
2. M-step (Model Update): Executes next-token prediction on bootstrapped, latent-annotated traces to integrate these abstractions into the model’s policy.
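A minimal sketch of this alternating loop, assuming a Python training pipeline; the `Trace` container and the `discover_latents` / `ntp_finetune` callables are hypothetical placeholders, not the paper's actual interfaces:

```python
# A minimal, hypothetical sketch of the RA3-style EM loop. The Trace container
# and the discover_latents / ntp_finetune callables are illustrative
# assumptions, not the paper's actual code.

from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Trace:
    prompt: str
    expert_solution: str
    latent_plan: Optional[List[str]] = None  # temporally abstract "actions"


def ra3_mid_training(
    model,                # an autoregressive code LLM (placeholder object)
    expert_traces: List[Trace],
    discover_latents: Callable[[object, Trace], List[str]],  # E-step, RL-driven
    ntp_finetune: Callable[[object, List[Trace]], object],   # M-step, next-token prediction
    num_iterations: int = 3,
):
    """Alternate latent discovery (E-step) and next-token fine-tuning (M-step);
    together the two steps tighten a sequential variational lower bound
    (the temporal ELBO) on the likelihood of the expert sequences."""
    for _ in range(num_iterations):
        # E-step: use RL to infer temporally consistent latent plans that
        # explain (stay aligned with) the expert sequences.
        annotated = [
            Trace(t.prompt, t.expert_solution, discover_latents(model, t))
            for t in expert_traces
        ]

        # M-step: next-token prediction on the bootstrapped, latent-annotated
        # traces, folding the discovered abstractions into the model's policy.
        model = ntp_finetune(model, annotated)

    return model
```

Under this reading, the E-step tightens the temporal ELBO with respect to the latent plans and the M-step tightens it with respect to the model parameters, which is what makes the loop EM-like.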

Performance Results

Across various base models for Python code tasks, the research demonstrates that RA3 achieves:

1. ~8-point improvement on HumanEval
2. ~4-point improvement on MBPP

Furthermore, initializing post-training with RA3 yields faster convergence and higher final performance on HumanEval+, MBPP+, LiveCodeBench, and Codeforces. The gains therefore show up at both the mid-training and post-training stages of code generation.

Key Takeaways

The research provides a formal framework for mid-training based on two primary determinants: pruning efficiency and RL convergence. The effectiveness of these approaches increases with the compactness of the decision space and the brevity of the effective horizon.

RA3 optimizes a sequential variational lower bound by iteratively discovering temporally consistent latent structures through RL and then fine-tuning on bootstrapped traces. The results indicate notable improvements in code generation tasks, which could have significant implications for business applications in software development and automation.

For further details, please refer to the Technical Paper.
