
Omni-R1: Advancing Audio Question Answering with Text-Driven Reinforcement Learning and Auto-Generated Data


Recent developments indicate that reinforcement learning (RL) can significantly enhance the reasoning abilities of large language models (LLMs). This study focuses on improving Audio LLMs—models that process audio and text to perform tasks such as question answering. The MMAU benchmark is a widely used dataset that evaluates these models through multiple-choice questions on sounds, speech, and music, including some requiring external knowledge.

A previous approach, R1-AQA, utilized Group Relative Policy Optimization (GRPO) to fine-tune the Qwen2-Audio model on the AVQA dataset, achieving state-of-the-art (SOTA) results on the MMAU benchmark. Building on this, the authors applied GRPO to fine-tune Qwen2.5-Omni-7B, a newer multimodal model, leading to further performance improvements. In addition, they introduced a method to automatically generate audio question-answering (QA) data, resulting in even better outcomes.

In contrast to more complex methods such as SARI, which combines supervised fine-tuning with RL and structured reasoning, the authors’ approach relies solely on RL without explicit reasoning steps. To probe where GRPO’s gains come from, they also ran experiments with text-only inputs. Surprisingly, fine-tuning on text alone yielded improvements nearly identical to training with both audio and text, indicating that GRPO strengthens the model’s reasoning ability primarily through text.
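To make this ablation concrete, the sketch below builds the same multiple-choice prompt with and without the audio attached. The chat-message schema, field names, and example clip are assumptions for illustration, not the exact Omni-R1 training code.

```python
# A minimal sketch of the text-only ablation: assemble one training example
# with or without the audio input, keeping the question and choices fixed.
def build_messages(question: str, choices: list[str], audio_path: str | None = None) -> list[dict]:
    """Build a chat-style message; drop the audio for text-only fine-tuning."""
    content = []
    if audio_path is not None:
        # Hypothetical multimodal message entry; the real schema may differ.
        content.append({"type": "audio", "audio": audio_path})
    prompt = (
        question
        + "\n"
        + "\n".join(f"{label}. {choice}" for label, choice in zip("ABCD", choices))
        + "\nAnswer with the letter of the best choice."
    )
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]

# Audio + text vs. text-only versions of the same hypothetical question.
with_audio = build_messages("What instrument is playing?", ["Piano", "Violin", "Drums", "Flute"], "clip_001.wav")
text_only = build_messages("What instrument is playing?", ["Piano", "Violin", "Drums", "Flute"])
```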

Researchers from MIT CSAIL, Goethe University, IBM Research, and others introduced Omni-R1, a version of the multimodal LLM Qwen2.5-Omni fine-tuned with the GRPO reinforcement learning method. Trained on the AVQA dataset, Omni-R1 achieved new state-of-the-art results on the MMAU benchmark across all audio categories. Much of this improvement came from stronger text-based reasoning rather than from the audio inputs themselves; notably, fine-tuning on text-only data produced significant gains.

The team also generated large-scale audio QA datasets with ChatGPT, which further improved accuracy. Their work underscores the importance of text reasoning in audio LLM performance, and the authors state that all resources will be released publicly.
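To illustrate the data-generation idea, here is a minimal sketch that turns an audio caption into an auto-generated multiple-choice QA item with an LLM API. The prompt wording, model name, and JSON schema are assumptions for illustration, not the exact pipeline behind AVQA-GPT and VGGS-GPT.

```python
# Sketch: convert one audio caption into a multiple-choice QA item via an LLM.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are given a caption describing an audio clip.\n"
    "Caption: {caption}\n\n"
    "Write one multiple-choice question about the clip with four options "
    "labeled A-D and exactly one correct option. Respond as JSON with keys "
    "'question', 'choices', and 'answer'."
)

def caption_to_qa(caption: str, model: str = "gpt-4o-mini") -> dict:
    """Ask the LLM for one auto-generated QA item grounded in the caption."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(caption=caption)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    item = caption_to_qa("A dog barks twice while rain falls in the background.")
    print(item["question"], item["choices"], item["answer"])
```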

Technical Specifications

Omni-R1 fine-tunes Qwen2.5-Omni with the GRPO reinforcement learning method using a simple prompt format that asks the model to select an answer choice directly; the short prompts keep training memory-efficient enough to run on 48 GB GPUs. GRPO avoids a learned value function by comparing groups of sampled outputs and scoring them solely on answer correctness. To expand the training data, the researchers took audio captions from Qwen2-Audio and prompted ChatGPT to generate new question-answer pairs. This produced two datasets, AVQA-GPT and VGGS-GPT, covering 40,000 and 182,000 audio clips, respectively. Training on these automatically generated datasets improved performance, with VGGS-GPT helping Omni-R1 reach state-of-the-art accuracy on the MMAU benchmark.
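The sketch below illustrates the group-relative signal described above: sample several answers per question, score each with a binary correctness reward, and normalize within the group instead of using a learned value function. The answer-matching heuristic and function names are hypothetical, not taken from the Omni-R1 code.

```python
# Sketch of GRPO-style group-relative advantages with a binary correctness reward.
from statistics import mean, pstdev

def correctness_reward(model_answer: str, gold_choice: str) -> float:
    """Return 1.0 if the sampled answer mentions the gold choice, else 0.0.

    A simple substring check for illustration; a real reward would parse the
    chosen letter more robustly.
    """
    return 1.0 if gold_choice.strip().lower() in model_answer.strip().lower() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one group of sampled answers.

    GRPO replaces a learned value function with this group baseline:
    A_i = (r_i - mean(r)) / (std(r) + eps).
    """
    mu, sigma, eps = mean(rewards), pstdev(rewards), 1e-6
    return [(r - mu) / (sigma + eps) for r in rewards]

if __name__ == "__main__":
    # One question, a group of G = 4 sampled answers (hypothetical strings).
    gold = "B"
    sampled = ["The answer is B", "A", "B", "I think the answer is C"]
    rewards = [correctness_reward(s, gold) for s in sampled]
    print(rewards)                              # [1.0, 0.0, 1.0, 0.0]
    print(group_relative_advantages(rewards))   # positive for correct, negative for incorrect
```

These advantages then weight the policy-gradient update on the sampled answers, so correct answers within a group are reinforced relative to incorrect ones without training a separate critic.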

Performance Results

The researchers fine-tuned Qwen2.5-Omni with GRPO on the AVQA, AVQA-GPT, and VGGS-GPT datasets. Results showed notable gains, with the best average score of 71.3% on the MMAU Test-mini coming from VGGS-GPT. Qwen2.5-Omni outperformed established baselines, including SARI, and showed strong reasoning even without audio input. GRPO fine-tuning yielded larger improvements for Qwen2-Audio, whose text reasoning was initially weaker. Interestingly, fine-tuning without any audio still boosted performance, and even text-only datasets such as ARC-Easy produced similar gains. Most of the improvement came from stronger text reasoning, although audio-based fine-tuning remained slightly better for peak performance.

Conclusion

In summary, Omni-R1 is an Audio LLM developed by fine-tuning Qwen2.5-Omni using the GRPO reinforcement learning method for enhanced audio question answering. Omni-R1 achieves new state-of-the-art results on the MMAU benchmark across sounds, speech, music, and overall performance. The creation of two new large-scale datasets, AVQA-GPT and VGGS-GPT, using automatically generated questions further enhanced model accuracy. Experiments reveal that GRPO primarily bolsters text-based reasoning, significantly aiding performance. Surprisingly, fine-tuning with text alone (without audio) improved audio-based performance, emphasizing the importance of strong base language understanding. These findings suggest cost-effective strategies for developing audio-capable language models.

Check out the Paper. All credit for this research goes to the researchers of this project.