UC Berkeley Introduces CyberGym: A Real-World Cybersecurity Benchmark for Evaluating AI Agents on Large-Scale Vulnerabilities Across Massive Codebases
Understanding the Target Audience
The primary audience for this framework includes cybersecurity professionals, AI researchers, and software developers. Their pain points include:
- Inadequate evaluation methods for AI systems in cybersecurity.
- Difficulty in identifying effective tools for vulnerability analysis.
- Challenges in understanding AI’s capabilities and limitations in real-world scenarios.
Their goals are to:
- Enhance cybersecurity across various software systems.
- Evaluate the effectiveness of AI in identifying and mitigating vulnerabilities.
- Stay abreast of advancements and methodologies to improve security protocols.
Interests include:
- Innovative tools for vulnerability detection.
- Research findings in AI applications for cybersecurity.
- Best practices for secure software development.
They prefer to receive information through technical reports, peer-reviewed papers, webinars, and industry conferences.
The Challenge in Cybersecurity Evaluation
As reliance on large software systems grows, the complexity of cybersecurity threats necessitates a shift beyond traditional protection methods. Evaluating the capability of AI agents to handle real-world vulnerabilities involves understanding intricate code paths and nuanced flaws. Current benchmarks often fail to represent the multifaceted nature of vulnerabilities found within actively maintained software systems.
Shortcomings of Existing Benchmarks
Many existing evaluation benchmarks, like Cybench and NYU CTF Bench, focus on simplified tasks that do not accurately reflect the complexities of large codebases. Issues include:
- Limited complexity and scale.
- Use of synthetic test cases or narrowly defined problems.
- Failure to capture the diversity of execution paths and bug types.
This gap highlights the need for more effective frameworks like CyberGym that can truly assess AI systems’ competency in cybersecurity tasks.
Introducing CyberGym
CyberGym, developed at UC Berkeley, is a comprehensive benchmark designed to assess AI agents in real-world cybersecurity contexts. It comprises 1,507 distinct benchmark tasks based on actual vulnerabilities, originally identified by OSS-Fuzz, in 188 major open-source software projects. Each task provides the full pre-patch codebase, an executable, and a description of the vulnerability.
Within CyberGym, agents must generate a proof-of-concept (PoC) input that reproduces the vulnerability in the pre-patch version while confirming that the same input no longer triggers it in the patched version. Producing such PoCs requires traversing complex code paths and synthesizing inputs that satisfy intricate security conditions. The framework is modular and containerized, promoting easy expansion and reproducibility.
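To make this success criterion concrete, here is a minimal sketch of how such a check could be implemented, assuming the target follows the common OSS-Fuzz convention of aborting (via a sanitizer report or signal) on a triggering input. The function names and crash heuristic are illustrative assumptions, not CyberGym's actual evaluation harness.

```python
import subprocess

def crashes(binary_path: str, poc_path: str, timeout: int = 30) -> bool:
    """Run a fuzz target on a PoC input and report whether it crashed.

    Any non-zero exit status (including termination by a signal, which
    shows up as a negative return code) is treated as a crash here.
    """
    try:
        result = subprocess.run(
            [binary_path, poc_path],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # hangs are not counted as reproductions in this sketch
    return result.returncode != 0

def reproduces_vulnerability(poc_path: str,
                             prepatch_binary: str,
                             postpatch_binary: str) -> bool:
    """A PoC succeeds only if it crashes the vulnerable (pre-patch) build
    while leaving the patched build intact."""
    return crashes(prepatch_binary, poc_path) and not crashes(postpatch_binary, poc_path)
```

In practice, a real harness would also need to distinguish the target crash from unrelated failures; the sketch above ignores that distinction for brevity.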
CyberGym Evaluation Levels
The evaluation pipeline comprises four levels of difficulty, progressively increasing the input information available:
- Level 0: Codebase only.
- Level 1: Natural language description added.
- Level 2: Ground-truth PoC and crash stack trace included.
- Level 3: Patch details and post-patch codebase provided.
This structured approach helps assess how effectively agents can infer vulnerability locations and contexts as progressively richer information is provided, as sketched in the example below.
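As a rough illustration of how the levels differ only in the information exposed to the agent, the following sketch models a task's inputs and filters them by level. The field and function names are hypothetical, chosen for illustration rather than taken from CyberGym's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskInputs:
    """Hypothetical per-task inputs, grouped by the level at which
    they become visible to the agent (field names are illustrative)."""
    prepatch_codebase: str                      # Level 0 and above
    description: Optional[str] = None           # added at Level 1
    ground_truth_poc: Optional[bytes] = None    # added at Level 2
    crash_stack_trace: Optional[str] = None     # added at Level 2
    patch_diff: Optional[str] = None            # added at Level 3
    postpatch_codebase: Optional[str] = None    # added at Level 3

def visible_inputs(task: TaskInputs, level: int) -> dict:
    """Return only the information an agent may see at a given level."""
    visible = {"codebase": task.prepatch_codebase}
    if level >= 1:
        visible["description"] = task.description
    if level >= 2:
        visible["poc"] = task.ground_truth_poc
        visible["stack_trace"] = task.crash_stack_trace
    if level >= 3:
        visible["patch"] = task.patch_diff
        visible["postpatch_codebase"] = task.postpatch_codebase
    return visible
```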
Experimental Results
When evaluated on CyberGym, existing AI agents demonstrated limited success. The top-performing combination, the OpenHands agent framework with Claude-3.7-Sonnet, reproduced only 11.9% of target vulnerabilities. Success rates dropped sharply for longer PoC inputs: for those exceeding 100 bytes, the reproduction rate fell below 8%.
Despite these low rates, the agents identified 15 previously unknown zero-day vulnerabilities and two disclosed but unpatched vulnerabilities across real-world projects, indicating that AI agents can already contribute meaningfully to cybersecurity analysis.
Key Takeaways
- Volume and Realism: CyberGym features 1,507 tasks derived from real vulnerabilities, making it the largest benchmark of its kind.
- Agent Limitations: Even the best agents showed an overall reproduction rate of only 11.9%.
- Difficulty Scaling: Additional inputs significantly improved performance, with Level 3 tasks yielding a 17.1% success rate.
- Length Sensitivity: Tasks involving long PoCs were particularly challenging, highlighting design considerations for future benchmarks.
- Discovery Potential: Agents discovered new vulnerabilities, emphasizing their practical applications in real-world scenarios.
Conclusion
This study underscores the need for more robust evaluation methods for AI in cybersecurity. CyberGym presents a substantial advancement by offering a large-scale, real-world framework that challenges agents to engage deeply with complex codebases and demonstrate adaptive reasoning in generating valid exploits. While the results reveal that AI agents show promise in identifying vulnerabilities, significant work remains to reliably scale these capabilities for broader cybersecurity applications.
For additional details, please check the Paper, GitHub Page, and Leaderboard.