
Can a Small Language Model Predict Kernel Latency, Memory, and Model Accuracy from Code? A New Regression Language Model (RLM) Says Yes

Understanding the Target Audience

The target audience for this research primarily includes software engineers, data scientists, and AI researchers who are interested in performance prediction within programming environments. These professionals often face challenges related to optimizing code performance, managing resource utilization, and improving model accuracy. Their goals include enhancing the efficiency of machine learning models and streamlining the development process. They prefer clear, concise communication that is rich in technical details and supported by empirical data.

Overview of the Research

Researchers from Cornell and Google have introduced a unified Regression Language Model (RLM) that predicts numeric outcomes directly from code strings. This model addresses key metrics such as GPU kernel latency, program memory usage, and neural network accuracy without the need for hand-engineered features. The RLM, which is a 300M-parameter encoder-decoder initialized from T5-Gemma, demonstrates strong rank correlations across various tasks and programming languages, utilizing a single text-to-number decoder that emits digits through constrained decoding.

Key Innovations

  • Unified Code-to-Metric Regression: The RLM predicts peak memory from high-level code (Python/C/C++), latency for Triton GPU kernels, and accuracy and hardware-specific latency from ONNX graphs by reading raw text representations.
  • Concrete Results: The model achieves Spearman ρ ≈ 0.93 on APPS LeetCode memory, ρ ≈ 0.52 for Triton kernel latency, and ρ > 0.5 on average across 17 CodeNet languages, demonstrating competitive performance compared to graph-based predictors.
  • Multi-Objective Decoding: The autoregressive nature of the decoder allows it to condition later metrics on earlier ones, effectively capturing realistic trade-offs along Pareto fronts.
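The conditioning behavior described above can be sketched in a few lines. The snippet below is illustrative only: `model_step` is a hypothetical stand-in for the RLM's autoregressive decoder, and the point is simply that tokens for a later metric are generated with the earlier metric's tokens already in the prefix.

```python
def decode_metrics(model_step, metric_lengths=(6, 6)):
    """Decode several metrics as one autoregressive token sequence.

    `model_step(prefix)` returns the next token given all tokens
    decoded so far (a stand-in for the real decoder). Because every
    metric shares one growing prefix, the second metric's tokens are
    conditioned on the first metric's decoded value.
    """
    prefix = []
    metrics = []
    for length in metric_lengths:
        tokens = []
        for _ in range(length):
            token = model_step(prefix)
            prefix.append(token)  # later metrics see earlier ones
            tokens.append(token)
        metrics.append(tokens)
    return metrics
```

Because a sample of the second metric is drawn given a concrete sample of the first, repeated sampling traces out joint (metric 1, metric 2) outcomes rather than two independent marginals, which is what makes Pareto-front reasoning possible.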

Importance of the Research

This research is significant because traditional performance prediction pipelines in compilers and GPU kernel selection often rely on bespoke features, which can be brittle when new operations or languages are introduced. By framing regression as next-token prediction over numbers, the RLM standardizes the process: it tokenizes inputs as plain text (source code, Triton IR, ONNX) and decodes calibrated numeric strings digit-by-digit. This approach reduces maintenance costs and enhances the model’s adaptability to new tasks through fine-tuning.
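To make "decoding numeric strings digit-by-digit" concrete, here is a minimal sign/exponent/mantissa encoding in pure Python. This is an illustrative scheme, not the exact tokenization used in the paper; the token formats (`"+"`, `"E-2"`, digit tokens) are assumptions for the sketch.

```python
import math

def encode_number(value, mantissa_digits=4):
    """Encode a float as [sign, exponent, digit, digit, ...] tokens.

    Illustrative scheme only; the paper's tokenizer may differ.
    """
    if value == 0:
        return ["+", "E+0"] + ["0"] * mantissa_digits
    sign = "+" if value > 0 else "-"
    value = abs(value)
    exponent = math.floor(math.log10(value))
    mantissa = value / (10 ** exponent)  # normalized to [1, 10)
    digits = f"{mantissa:.{mantissa_digits - 1}f}".replace(".", "")
    exp_token = f"E{'+' if exponent >= 0 else '-'}{abs(exponent)}"
    return [sign, exp_token] + list(digits[:mantissa_digits])

def decode_number(tokens):
    """Invert encode_number: tokens back to a float."""
    sign = 1.0 if tokens[0] == "+" else -1.0
    exponent = int(tokens[1][1:])
    digits = tokens[2:]
    mantissa = int("".join(digits)) / (10 ** (len(digits) - 1))
    return sign * mantissa * (10 ** exponent)
```

For example, `encode_number(0.0123)` yields `["+", "E-2", "1", "2", "3", "0"]`. Representing numbers this way lets one small vocabulary cover metrics spanning many orders of magnitude, from kernel microseconds to gigabytes of memory.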

Data and Benchmarks

The Code-Regression dataset has been curated to support code-to-metric tasks, including APPS/LeetCode runs, Triton kernel latencies, and CodeNet memory footprints. The NAS/ONNX suite features architectures from NASBench-101/201, FBNet, Once-for-All, and others, exported to ONNX text to predict accuracy and device-specific latency.

Technical Specifications

The backbone of the RLM is an encoder-decoder architecture initialized from T5-Gemma with approximately 300M parameters. Inputs consist of raw strings (code or ONNX), and outputs are numeric values emitted as sign/exponent/mantissa digit tokens. Constrained decoding ensures valid numerals and supports uncertainty through sampling.
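Constrained decoding can be sketched as a per-step mask over the vocabulary: at each position, only tokens the numeric grammar allows receive probability mass. The grammar below (position 0 is a sign, position 1 an exponent, the rest mantissa digits) and the `logits_fn` stand-in are assumptions for illustration, not the paper's implementation.

```python
import math
import random

# Illustrative token vocabulary for the numeric grammar.
SIGN = ["+", "-"]
EXP = [f"E{s}{d}" for s in "+-" for d in range(10)]
DIGITS = list("0123456789")

def allowed_tokens(step):
    """Grammar: position 0 is the sign, position 1 the exponent,
    remaining positions are mantissa digits."""
    if step == 0:
        return SIGN
    if step == 1:
        return EXP
    return DIGITS

def constrained_sample(logits_fn, num_steps=6, temperature=1.0):
    """Sample one valid numeric token sequence.

    `logits_fn(prefix, token)` is a stand-in for the decoder's score
    of `token` given the prefix. Tokens outside the grammar are never
    considered, so every sample is a well-formed numeral.
    """
    prefix = []
    for step in range(num_steps):
        candidates = allowed_tokens(step)
        weights = [math.exp(logits_fn(prefix, t) / temperature)
                   for t in candidates]
        total = sum(weights)
        token = random.choices(candidates,
                               weights=[w / total for w in weights])[0]
        prefix.append(token)
    return prefix
```

Drawing many such samples and decoding each one yields an empirical predictive distribution, which is how sampling-based uncertainty estimates fall out of the same mechanism.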

Performance Statistics

  • APPS (Python) Memory: Spearman ρ > 0.9.
  • CodeNet (17 Languages) Memory: Average ρ > 0.5; strongest languages include C/C++ with approximately 0.74–0.75.
  • Triton Kernels (A6000) Latency: ρ ≈ 0.52.
  • NAS Ranking: Average Kendall τ ≈ 0.46 across NASNet, Amoeba, PNAS, ENAS, DARTS, competitive with FLAN and GNN baselines.
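The Spearman ρ figures above measure how well predicted values rank programs, not how close they are in absolute terms. A minimal pure-Python computation (Pearson correlation of the ranks, with average ranks for ties) looks like this; production code would typically use `scipy.stats.spearmanr` instead.

```python
def rank(values):
    """Assign ranks (1-based), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Because only ranks matter, any monotone transformation of the predictions leaves ρ unchanged; this is why rank correlation is the natural metric for tasks like kernel selection, where the goal is picking the fastest candidate rather than predicting its exact latency.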

Conclusion

The unified code-to-metric regression approach demonstrated by the RLM effectively predicts memory, latency, and accuracy directly from code without the need for hand-engineered features. The strong correlation statistics indicate its potential utility in compiler heuristics, kernel pruning, and multi-objective NAS triage. The open dataset and library facilitate replication and lower the barrier for fine-tuning on new hardware or programming languages.

Additional Resources

For further details, please refer to the original research paper and explore the GitHub page for tutorials, code, and notebooks.