VERINA: Evaluating LLMs on End-to-End Verifiable Code Generation with Formal Proofs
LLM-Based Code Generation Faces a Verification Gap
Large Language Models (LLMs) have demonstrated strong performance in programming, evidenced by their integration into tools such as Cursor and GitHub Copilot, which aim to enhance developer productivity. However, the probabilistic nature of LLMs means they cannot provide formal guarantees about the code they generate. As a result, generated code often contains bugs, and the manual effort of reviewing and debugging it can become a productivity bottleneck for LLM-based code generation.
Developing suitable benchmarks to track progress in verifiable code generation is critical yet challenging. Verifiable code generation involves three interconnected tasks: code generation, specification generation, and proof generation. Current benchmarks fall short because they do not cover all three tasks and lack quality control, robust metrics, and modular design.
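To make the three tasks concrete, here is a minimal, hypothetical sketch in Lean (assuming Lean 4 syntax) of the artifact each task is expected to produce for a toy problem; the names myMax and myMaxSpec are invented for illustration and are not taken from VERINA.
```lean
-- Toy problem: return the larger of two natural numbers.

-- Task 1, code generation: produce an implementation from the description.
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Task 2, specification generation: formalize what a correct output must satisfy.
def myMaxSpec (a b r : Nat) : Prop :=
  (r = a ∨ r = b) ∧ a ≤ r ∧ b ≤ r

-- Task 3, proof generation: machine-checked evidence that the code meets the spec.
theorem myMax_satisfies_spec (a b : Nat) : myMaxSpec a b (myMax a b) := by
  unfold myMax myMaxSpec
  by_cases h : a ≤ b <;> simp [h] <;> omega
```
In a modular setup, any of these artifacts can be held fixed as ground truth while a model is asked to generate the others, which is what makes the three tasks interconnected.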
Existing Benchmarks Lack Comprehensive Support for Verifiability
Benchmarks such as HumanEval and MBPP have achieved progress in LLM-based code generation but do not support formal specifications or proofs. Many verification-focused efforts concentrate on only one or two tasks, assuming other components will be provided by humans. For example, DafnyBench and miniCodeProps are aimed at proof generation, while AutoSpec and SpecGen infer specifications and proofs from human-written code. Interactive theorem-proving systems like Lean present a viable target for verifiable code generation with LLMs, as they facilitate the construction of proofs with intermediate steps. Nonetheless, existing verification benchmarks in Lean, such as miniCodeProps and FVAPPS, have limitations regarding task coverage and quality control.
Introducing VERINA: A Holistic Benchmark for Code, Spec, and Proof Generation
Researchers from the University of California and Meta FAIR have proposed VERINA (Verifiable Code Generation Arena), a high-quality benchmark designed to evaluate verifiable code generation. It comprises 189 programming challenges with detailed problem descriptions, code, specifications, proofs, and test suites, all formatted in Lean. VERINA is constructed with quality control in mind, drawing problems from sources such as MBPP, LiveCodeBench, and LeetCode to cover a range of difficulty levels. Each sample undergoes manual review and refinement to ensure clear natural language descriptions, precise formal specifications, and accurate code implementations. Additionally, every sample includes a test suite covering both positive and negative scenarios, achieving 100% line coverage of the code implementation, with all tests passing against the ground truth specifications.
Structure and Composition of the VERINA Dataset
VERINA is divided into two subsets with varying difficulty levels:
- VERINA-BASIC: Contains 108 problems translated from human-written Dafny code, including 49 problems from MBPP-DFY50 and 59 additional instances from CloverBench. These were translated using OpenAI o3-mini with few-shot prompting and subsequently inspected.
- VERINA-ADV: Comprises 81 more advanced coding problems sourced from student submissions in a theorem-proving course, where students formalized solutions in Lean.
Quality assurance measures ensure that every problem has a detailed description, that positive tests achieve full line coverage of the code, and that all tests pass against the ground truth specifications.
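As a rough sketch of this kind of test-based check (hypothetical tooling, continuing the toy myMaxSpec example from above rather than an actual VERINA sample), positive and negative test cases can be validated directly against a specification in Lean:
```lean
-- Positive test: 5 is a correct result for inputs 3 and 5,
-- so the ground-truth specification must accept it.
example : myMaxSpec 3 5 5 := by
  unfold myMaxSpec
  decide

-- Negative test: 3 is not a correct result for inputs 3 and 5,
-- so the ground-truth specification must reject it.
example : ¬ myMaxSpec 3 5 3 := by
  unfold myMaxSpec
  decide
```
Checks along these lines help confirm that a ground-truth specification is neither vacuously permissive nor overly restrictive before it is used to judge model outputs.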
Performance Insights: LLM Evaluation on VERINA Highlights Key Challenges
The evaluation of nine state-of-the-art LLMs on VERINA reveals a distinct hierarchy of performance. Code generation achieves the highest success rates, followed by specification generation, while proof generation remains the most challenging, with pass@1 rates below 3.6% for all models. VERINA-ADV presents greater difficulty compared to VERINA-BASIC across all three tasks, indicating that increased problem complexity significantly influences the performance of verifiable code generation.
Iterative proof refinement with o4-mini improves the proof success rate from 7.41% to 22.22% on the simpler VERINA-BASIC problems after 64 iterations, although gains on VERINA-ADV remain limited. Additionally, providing ground truth specifications enhances code generation, demonstrating that formal specifications can effectively constrain and direct the synthesis process.
Conclusion: VERINA Sets a New Standard in Verifiable Code Evaluation
In summary, VERINA represents a significant advancement in benchmarking verifiable code generation. It provides 189 carefully curated examples with detailed task descriptions, high-quality code, specifications in Lean, and extensive test suites with full line coverage. Nonetheless, the dataset remains relatively small for fine-tuning, so scaling it up will likely require automated annotation with LLM assistance. And because VERINA emphasizes simple, standalone tasks suitable for benchmarking, it may not fully capture the complexities of real-world verification projects. Future improvements to the specification generation metric could incorporate more capable provers, potentially integrating LLMs or SMT solvers, to better handle complex soundness and completeness relationships between specifications.
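To sketch what such soundness and completeness relationships look like (continuing the hypothetical myMax example; genSpec is an invented, deliberately weak generated specification), comparing a generated specification against the ground truth amounts to proving implications between two predicates:
```lean
-- A hypothetical model-generated specification that is weaker than the
-- ground truth: it forgets to require that the result is one of the inputs.
def genSpec (a b r : Nat) : Prop :=
  a ≤ r ∧ b ≤ r

-- One implication direction holds: every output accepted by the ground-truth
-- specification is also accepted by the generated one.
theorem myMaxSpec_implies_genSpec (a b r : Nat) :
    myMaxSpec a b r → genSpec a b r := by
  intro h
  unfold myMaxSpec at h
  unfold genSpec
  exact ⟨h.2.1, h.2.2⟩

-- The converse fails, witnessed by a concrete counterexample: 100 is accepted
-- by the generated specification but rejected by the ground truth.
example : genSpec 3 5 100 ∧ ¬ myMaxSpec 3 5 100 := by
  unfold genSpec myMaxSpec
  decide
```
When implications like these cannot be discharged automatically, stronger provers, LLM assistance, or SMT solvers are exactly the kind of help the authors suggest for the specification generation metric.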
For further exploration, check out the Paper, Dataset Card, and GitHub Page. All credit for this research goes to the researchers involved in this project.