
QeRL: NVFP4-Quantized Reinforcement Learning (RL) Brings 32B LLM Training to a Single H100—While Improving Exploration

Understanding the Target Audience

QeRL is aimed at AI researchers, machine learning engineers, and technical leaders at companies building with large language models (LLMs). Their recurring pain points are the compute cost and wall-clock time of training and post-training LLMs, so they look for methods that cut resource consumption without sacrificing accuracy. Their goals include faster training runs, accuracy on par with higher-precision baselines, and practical ways to apply new reinforcement learning techniques to real problems. They tend to prefer technical documentation, research papers, and detailed reports that offer empirical evidence and concrete applications.

Overview of QeRL

QeRL (Quantization-enhanced Reinforcement Learning) is a training framework developed by NVIDIA researchers in collaboration with MIT, HKU, and Tsinghua University. It enables reinforcement learning (RL) post-training on a 32B LLM using 4-bit NVFP4 quantization on a single H100 GPU while maintaining BF16-level accuracy and achieving 1.2–1.5× speedups. The framework has been open-sourced, allowing broader access to its capabilities.

Key Innovations in QeRL

QeRL modifies the RL loop by shifting the policy’s weight path to NVFP4 (FP4) with dual-level scaling, while keeping logits and gradients in higher precision through LoRA. This approach stabilizes backpropagation and enhances sampling efficiency, resulting in faster prefill and decoding during rollouts without the need for a separate full-precision policy.
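To make this weight-path split concrete, here is a minimal PyTorch-style sketch, not the official QeRL implementation: the frozen base weights are put through a simulated NVFP4 quantizer with dual-level scaling, while the trainable LoRA adapters stay in BF16. The names `fake_quant_nvfp4` and `QeRLLinear` are illustrative; a real deployment would call fused Marlin/NVFP4 kernels rather than dequantizing in Python.

```python
import torch
import torch.nn as nn

FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def fake_quant_nvfp4(w: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Simulated NVFP4 weight-only quantization with dual-level scaling (illustration only).

    Assumes w.numel() is a multiple of `block`, which holds for typical hidden sizes.
    """
    orig_shape = w.shape
    wb = w.reshape(-1, block).float()
    tensor_scale = wb.abs().max().clamp(min=1e-8)                                    # level 1: per-tensor scale
    block_scale = (wb.abs().amax(dim=1, keepdim=True) / tensor_scale).clamp(min=1e-8)  # level 2: per-block scale
    scaled = wb / (tensor_scale * block_scale) * 6.0                                 # map into the FP4 range [-6, 6]
    grid = FP4_GRID.to(w.device)
    idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)                   # nearest representable magnitude
    q = torch.sign(scaled) * grid[idx]
    return (q / 6.0 * tensor_scale * block_scale).reshape(orig_shape)

class QeRLLinear(nn.Module):
    """Frozen FP4 base weight plus trainable BF16 LoRA adapters (illustrative sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.register_buffer("w_q", fake_quant_nvfp4(base.weight.data))  # frozen, weight-only FP4
        self.lora_a = nn.Parameter(0.01 * torch.randn(rank, base.in_features, dtype=torch.bfloat16))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank, dtype=torch.bfloat16))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base path: in QeRL this matmul runs on FP4 Marlin kernels during prefill and rollout.
        y = x @ self.w_q.to(x.dtype).t()
        # LoRA path: stays in higher precision, so gradients flow only through A and B.
        return y + (x @ self.lora_a.to(x.dtype).t()) @ self.lora_b.to(x.dtype).t()
```

Because the same quantized weights serve both sampling and the policy being optimized, there is no second full-precision copy of the model to keep in memory.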

Mechanics of QeRL

QeRL integrates Marlin-based FP4 kernels into both the rollout and prefill stages and restricts trainable parameters to low-rank LoRA adapters. Together, these choices target exactly the stages that dominate RL cost and latency, particularly for long reasoning traces.
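A rough back-of-envelope calculation (an illustrative estimate, not a figure from the paper) shows why weight-only FP4 matters at this scale: the frozen policy weights of a 32B model shrink from roughly 64 GB in BF16 to under 20 GB in NVFP4, leaving headroom on an 80 GB H100 for LoRA state, activations, and the KV cache during long rollouts.

```python
# Back-of-envelope memory for the frozen 32B policy weights (illustrative estimate only).
params = 32e9

bf16_gb   = params * 2 / 1e9        # 16 bits per weight -> ~64 GB
nvfp4_gb  = params * 0.5 / 1e9      # 4 bits per weight  -> ~16 GB
scales_gb = params / 16 * 1 / 1e9   # assuming one 8-bit scale per 16-weight block -> ~2 GB

print(f"BF16 weights:  ~{bf16_gb:.0f} GB")               # ~64 GB: little headroom on an 80 GB H100
print(f"NVFP4 weights: ~{nvfp4_gb + scales_gb:.0f} GB")  # ~18 GB: room left for KV cache and LoRA
```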

Quantization as Exploration

A significant finding of QeRL is that deterministic FP4 quantization increases policy entropy, which flattens token distributions early in training and enhances exploration compared to 16-bit LoRA and NF4-based QLoRA baselines. To manage this effect over time, QeRL introduces Adaptive Quantization Noise (AQN), which applies channel-wise Gaussian perturbations to LayerNorm scale parameters, allowing for a controlled transition from exploration to exploitation.
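The sketch below shows one way such a noise schedule could be wired up in PyTorch; it is an assumption-laden illustration rather than the paper's exact implementation (the class name `AQNController`, the decay constants, and the reliance on `torch.nn.RMSNorm` from recent PyTorch releases are all choices made for the example). The ideas it reflects from the paper are that the noise is channel-wise, is injected through the norm-layer scale parameters so the frozen FP4 weights stay untouched, and shrinks as training progresses.

```python
import torch

class AQNController:
    """Sketch of Adaptive Quantization Noise (AQN); names and constants are illustrative."""

    def __init__(self, model: torch.nn.Module, sigma_start: float = 5e-2,
                 sigma_end: float = 5e-4, total_steps: int = 1000):
        # Collect LayerNorm/RMSNorm modules whose scale ("weight") will be perturbed.
        norm_types = (torch.nn.LayerNorm, torch.nn.RMSNorm)
        self.norms = [m for m in model.modules()
                      if isinstance(m, norm_types) and getattr(m, "weight", None) is not None]
        self.clean = [m.weight.detach().clone() for m in self.norms]  # noise-free copies
        self.sigma_start, self.sigma_end, self.total_steps = sigma_start, sigma_end, total_steps

    def step(self, t: int) -> float:
        # Exponential decay: strong noise early (exploration), weak noise late (exploitation).
        ratio = t / max(self.total_steps - 1, 1)
        sigma = self.sigma_start * (self.sigma_end / self.sigma_start) ** ratio
        with torch.no_grad():
            for m, w0 in zip(self.norms, self.clean):
                noise = torch.randn_like(w0) * sigma  # one Gaussian sample per channel
                m.weight.copy_(w0 * (1.0 + noise))    # multiplicative perturbation of the scale
        return sigma
```

Re-sampling from the stored clean copies each step keeps the perturbation from accumulating, so driving the noise scale to zero recovers the unperturbed policy at the end of training.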

Reported Results

Using the Qwen2.5 backbone model, the research team demonstrated that NVFP4 combined with LoRA outperforms vanilla LoRA and QLoRA in terms of rollout throughput and overall training time. Specifically, they reported:

  • Over 2× rollout throughput on 14B/32B models compared to QLoRA.
  • Approximately 1.8× end-to-end speedup versus QLoRA in a representative setup.
  • For a 7B model, accuracy metrics included GSM8K = 90.8% and MATH500 = 77.4%, surpassing both 16-bit LoRA and QLoRA.

QeRL maintains competitive accuracy with higher-precision baselines and converges faster due to improved exploration.

Clarifications on QeRL

QeRL utilizes weight-only FP4 with LoRA updates and does not claim FP4 precision for logits or gradients. The primary benefits are seen in rollout and prefill throughput and memory efficiency, with empirical evidence supporting that quantization-induced entropy can enhance RL exploration when modulated by AQN. Generalization to other modalities or safety/tool-use RL applications will depend on reward design and sequence lengths.

Key Takeaways

  • QeRL combines NVFP4 4-bit weight quantization with LoRA to accelerate the rollout phase and reduce memory usage, enabling RL for a 32B LLM on a single H100-80GB GPU.
  • Quantization serves as a means of exploration: FP4 increases policy entropy, while AQN schedules channel-wise noise via LayerNorm scales.
  • Efficiency metrics include >1.5× rollout speedups compared to 16-bit LoRA and ~1.8× end-to-end speedup versus QLoRA.
  • Accuracy remains competitive, with Qwen2.5-7B achieving 90.8% on GSM8K and 77.4% on MATH500, matching full-parameter fine-tuning.
  • NVFP4 is a hardware-optimized 4-bit floating format that enables efficient Marlin-based kernels.

Further Resources

For more detailed information, read the full paper and visit the GitHub page for tutorials, code, and notebooks.