
A New Agency-Focused Supervision Approach Scales Software AI Agents With Only 78 Examples

Understanding the Target Audience

The target audience for this content primarily includes AI researchers, data scientists, software engineers, and business managers interested in implementing AI solutions. Their pain points include:

  • Difficulty in effectively training AI models with limited data.
  • Challenges in achieving high performance and generalization in AI applications.
  • Finding efficient methods to balance data quality with quantity.

Their goals are to:

  • Implement AI models that require less data while maintaining or improving performance.
  • Stay updated on innovative AI research and methodologies.
  • Enhance productivity through efficient AI tools in business processes.

Interests include:

  • Latest advancements in AI and machine learning.
  • Practical applications of AI in business and research.
  • Strategies for improving AI model training and deployment.

They prefer concise, data-driven content with a focus on practical applications and clear metrics.

Research Overview

A team of researchers from Shanghai Jiao Tong University and SII Generative AI Research Lab (GAIR) proposes the LIMI method (“Less Is More for Agency”), a supervised fine-tuning approach that transforms a base model into a capable software/research agent using only 78 samples. LIMI achieves an average score of 73.5% on AgencyBench, outperforming several strong baselines.

Key Features of LIMI

The LIMI method introduces the Agency Efficiency Principle, stating that agentic competence scales more effectively with data quality and structure than with the sheer volume of samples. The researchers fine-tune the GLM-4.5 and GLM-4.5-Air models on 78 long-horizon, tool-use trajectories, achieving significant improvements across various evaluation metrics.

Methodology

Training utilized the slime SFT framework with consistent configurations to isolate data effects. The data was constructed from:

  • 60 real queries from practitioners
  • 18 queries synthesized from high-star GitHub PRs, with PhD annotators curating the trajectories to ensure high-quality supervision

Each query is paired with a complete agent trajectory capturing successful task completion within the SII-CLI environment, covering tasks such as interactive software development and research workflows.
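To make the data construction concrete, here is a minimal sketch of what one query-plus-trajectory training record could look like, flattened into a supervised fine-tuning string. The JSON schema and field names below are hypothetical illustrations, not taken from the paper or the slime framework:

```python
# Hypothetical schema for one LIMI-style sample: a query paired with the
# full multi-turn, tool-use trajectory that solved it.
sample = {
    "query": "Fix the failing unit test in the payment module",
    "source": "practitioner",  # or "github_pr" for the synthesized queries
    "trajectory": [
        {"role": "assistant", "content": "Running the test suite to locate the failure."},
        {"role": "tool", "content": "1 failed: test_refund_rounding"},
        {"role": "assistant", "content": "Patching the rounding logic in payments.py."},
        {"role": "tool", "content": "File updated; all tests pass."},
        {"role": "assistant", "content": "Task complete."},
    ],
}

def to_sft_text(record):
    """Flatten a query and its trajectory into a single training string."""
    lines = [f"USER: {record['query']}"]
    for step in record["trajectory"]:
        lines.append(f"{step['role'].upper()}: {step['content']}")
    return "\n".join(lines)

print(to_sft_text(sample))
```

The point of logging the whole trajectory, rather than just the final answer, is that the model learns the intermediate tool calls and self-corrections that long-horizon agentic work requires.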

Evaluation Metrics

The effectiveness of LIMI was evaluated using the AgencyBench framework and generalization suites:

  • AgencyBench average score: 73.5%
  • FTFC (First Time Correct): 71.7%
  • SR@3 (Success Rate at 3): 74.6%
  • RC@3 (Response Correctness at 3): 74.2%
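One plausible reading of the first-attempt and within-k-attempts metrics can be sketched as follows; the exact definitions used by AgencyBench may differ, so treat this as an illustrative assumption:

```python
def ftfc(runs):
    """Fraction of tasks solved on the very first attempt (First Time Correct)."""
    return sum(r[0] for r in runs) / len(runs)

def success_at_k(runs, k=3):
    """Fraction of tasks solved within the first k attempts (SR@k)."""
    return sum(any(r[:k]) for r in runs) / len(runs)

# Each inner list holds pass/fail outcomes for successive attempts at one task.
runs = [
    [True],                 # solved on the first try
    [False, True],          # solved on attempt 2
    [False, False, False],  # not solved within 3 attempts
    [True],                 # solved on the first try
]

print(ftfc(runs))             # 0.5
print(success_at_k(runs, 3))  # 0.75
```

Note that SR@k is always at least FTFC, since any first-attempt success also counts as a success within k attempts, which matches the reported 74.6% SR@3 sitting above the 71.7% FTFC.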

Results and Implications

LIMI demonstrates substantial data efficiency, reaching 73.5% with only 78 samples versus 47.8% for a baseline trained on 10,000 samples: an absolute gain of 25.7 points, a 53.7% relative improvement, with 128× less data. Furthermore, LIMI shows robust generalization, averaging ~57% across tasks involving tool use, coding, and scientific computing, indicating intrinsic gains beyond just the availability of tools.
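The headline efficiency figures follow directly from the reported numbers (small rounding differences aside):

```python
limi_score, baseline_score = 73.5, 47.8   # AgencyBench averages (%)
limi_samples, baseline_samples = 78, 10_000

abs_gain = limi_score - baseline_score            # absolute gain in points
rel_gain = abs_gain / baseline_score * 100        # relative improvement (%)
data_reduction = baseline_samples / limi_samples  # data-efficiency factor

print(f"+{abs_gain:.1f} pts, {rel_gain:.1f}% relative, {data_reduction:.0f}x less data")
```

In other words, the 53.7% figure is a relative improvement over the baseline, while the absolute gap is 25.7 points.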

Conclusion

The research highlights that trajectory quality is more important than quantity, emphasizing the significance of curated, long-horizon workflows in collaborative software development and scientific research. This approach presents a promising avenue for firms seeking to optimize AI training processes and enhance agent capabilities with minimal data.

For further reading, you can access the original research paper here.