
Optimizing Assembly Code with LLMs: Reinforcement Learning Outperforms Traditional Compilers

Large Language Models (LLMs) have demonstrated significant potential across a range of programming tasks, yet their application to program optimization, particularly in low-level contexts, remains underexplored. Recent work has used LLMs to improve the performance of code in high-level languages such as C++ and Python, but their use for optimizing assembly code is still limited. Current LLM benchmarks, such as HumanEval, MBPP, APPS, SWE-bench, and SWE-agent, focus mainly on code generation and issue resolution rather than runtime performance.

Models such as Codex, AlphaCode, and Code Llama primarily aim to improve code generation quality rather than performance. However, emerging research is beginning to address optimization challenges, with an emphasis on parallelization and code efficiency. Many of these approaches are constrained by formal verification requirements, which limits their scalability.

In contrast, newer methods that rely on test-based validation can optimize more complex programs. Learning-based strategies in compiler optimization, such as AutoPhase, which uses reinforcement learning for pass sequencing, and Coreset, which uses graph neural networks, have shown promising results. Superoptimization techniques aim to find the most efficient version of a program, but they are typically confined to small-scale problems. Frameworks like AutoTVM and Ansor optimize GPU kernel code through statistical modeling and search. More recently, LLM-driven optimization has gained traction, with reinforcement learning approaches guiding LLMs using feedback from test cases. Techniques such as CodeRL and PPOCoder apply policy optimization methods to improve model performance, even for low-resource languages like Verilog.

Researchers from Stanford, UIUC, CMU, and Visa Research are exploring the use of LLMs to optimize assembly code performance, an area traditionally dominated by compilers like GCC. They introduce a reinforcement learning framework based on Proximal Policy Optimization (PPO) whose reward balances correctness against speedup relative to the gcc -O3 baseline. On a dataset of 8,072 real-world programs, their model, Qwen2.5-Coder-7B-PPO, achieves a 96.0% test pass rate and a 1.47× average speedup, outperforming 20 other models, including Claude-3.7-sonnet. These findings suggest that, with RL training, LLMs can surpass conventional compiler optimizations.

The methodology frames the performance optimization of compiled C programs as a reinforcement learning problem. A C program C is compiled to an assembly program P with gcc -O3, and the model's task is to generate a new assembly program P' that is functionally equivalent but faster. Correctness is verified against a test set, while speedup is assessed by measuring execution-time improvements. Using CodeNet as the dataset, the authors apply PPO to train a language model that generates improved code. Two reward functions, Correctness-Guided Speedup and Speedup-Only, guide training based on program validity, correctness, and performance gains.
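
To make the training signal concrete, here is a minimal C sketch of how correctness and speedup could be folded into a single scalar reward. The function names and weighting below are illustrative assumptions rather than the paper's exact formulas; the key idea is that incorrect candidates earn at most partial credit, while fully correct candidates are additionally rewarded in proportion to their measured speedup over the gcc -O3 binary.

```c
#include <stdio.h>

/*
 * Illustrative sketch (not the paper's exact formulas): one plausible way to
 * combine correctness and speedup into a scalar PPO reward. pass_rate is the
 * fraction of test cases the candidate assembly P' passes; t_baseline and
 * t_candidate are measured execution times of the gcc -O3 binary and the
 * model-generated binary on the same inputs.
 */
double correctness_guided_speedup(double pass_rate, double t_baseline, double t_candidate)
{
    if (pass_rate < 1.0)          /* incorrect programs get partial credit only */
        return pass_rate;         /* ...and no speedup bonus */
    double speedup = t_baseline / t_candidate;
    return pass_rate + speedup;   /* fully correct programs are rewarded for being faster */
}

/* A "speedup-only" variant would grant the speedup term alone, gated on full correctness. */
double speedup_only(double pass_rate, double t_baseline, double t_candidate)
{
    return (pass_rate < 1.0) ? 0.0 : t_baseline / t_candidate;
}

int main(void)
{
    /* Example: candidate passes all tests and runs in 68 ms vs. 100 ms for gcc -O3. */
    printf("CGS reward:          %.2f\n", correctness_guided_speedup(1.0, 0.100, 0.068));
    printf("Speedup-only reward: %.2f\n", speedup_only(1.0, 0.100, 0.068));
    return 0;
}
```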

The study evaluates a range of language models on the assembly-optimization task and finds that most perform poorly, with low test pass rates and minimal speedups. Qwen2.5-Coder-7B-PPO, trained with reinforcement learning, significantly outperforms its counterparts, achieving a 96% test pass rate and a 1.47× average speedup. Ablation studies show that using the gcc -O3 output as a reference aids performance; removing it leads to sharp declines. Notably, models such as Claude-3.7-sonnet can identify hardware-specific optimizations, for example replacing a bit-counting loop with a single popcnt instruction, demonstrating semantic-level code transformations beyond what traditional compilers typically perform.
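
The paper's programs are not reproduced in this summary, but the small C example below illustrates the kind of rewrite being described (a sketch, not the study's actual test case): a bit-counting loop versus GCC's __builtin_popcount, which the compiler lowers to a single popcnt instruction when the target supports it (for example, with -mpopcnt or -march=native).

```c
#include <stdio.h>

/* Naive bit count: depending on the gcc version and target flags, this is
 * typically compiled as an explicit shift-and-test loop, i.e. many
 * instructions executed per call. */
unsigned count_bits_loop(unsigned x)
{
    unsigned n = 0;
    while (x) {
        n += x & 1u;
        x >>= 1;
    }
    return n;
}

/* Hardware-aware version: with -mpopcnt (or -march=native on a CPU that has
 * the instruction), gcc lowers __builtin_popcount to a single popcnt, the
 * kind of semantic-level rewrite the study observed an LLM making directly
 * on the assembly. */
unsigned count_bits_popcnt(unsigned x)
{
    return (unsigned)__builtin_popcount(x);
}

int main(void)
{
    unsigned v = 0xF0F0F0F0u;  /* 16 set bits */
    printf("loop:   %u\n", count_bits_loop(v));
    printf("popcnt: %u\n", count_bits_popcnt(v));
    return 0;
}
```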

In conclusion, this research demonstrates that LLMs can optimize assembly code, an area where traditional compilers often struggle due to the complexity of low-level performance tuning. The authors fine-tune Qwen2.5-Coder-7B with PPO, rewarding both correctness (via test cases) and speedup over gcc -O3, and introduce a benchmark of 8,072 real-world C programs for evaluation. The resulting model achieves a 96.0% test pass rate and a 1.47× average speedup, outperforming 20 other models, including Claude-3.7-sonnet. The approach is effective but has limitations, including the lack of formal correctness guarantees and variability in performance across hardware.

Check out the Paper. All credit for this research goes to the researchers of this project.