New from Chinese Academy of Sciences: Stream-Omni, an LLM for Cross-Modal Real-Time AI
Understanding the Target Audience
The primary audience for Stream-Omni includes AI researchers, business leaders in technology, and decision-makers in industries leveraging AI for multimodal applications. Their pain points often revolve around:
- Challenges in integrating diverse data modalities (text, vision, speech)
- Need for efficient training methods with limited datasets
- Desire for improved performance in real-time applications
Their goals include enhancing AI capabilities, streamlining workflows, and improving user experiences through advanced multimodal technology. They follow the latest research findings, practical applications, and advances in AI methodology, and they prefer technically detailed, evidence-backed communication, favoring peer-reviewed work and data-driven insights.
Understanding the Limitations of Current Omni-Modal Architectures
Large multimodal models (LMMs) have demonstrated strong capabilities across text, vision, and speech, opening up a wide range of applications. However, challenges persist, particularly for omni-modal LMMs that support speech interaction grounded in visual information, because of intrinsic representational discrepancies across modalities. Current omni-modal LMMs typically unify text, vision, and speech by concatenating representations from individual modality encoders along the sequence dimension. This forces them to rely on large-scale data to learn modality alignments, which clashes with the limited availability of public tri-modal datasets, and it also lacks the flexibility to produce intermediate text results during speech interactions.
Categorizing Existing LMMs by Modal Focus
Current LMMs can be categorized into three groups:
- Vision-oriented: Models like LLaVA extract visual features using vision encoders, which are then integrated with textual inputs to generate text.
- Speech-oriented: Models such as Mini-Omni and LLaMA-Omni project continuous speech features into the LLM embedding space, while others like SpeechGPT and Moshi convert speech into discrete units for direct LLM processing.
- Omni-modal: Models including VITA-1.5, MiniCPM-o 2.6, and Qwen2.5-Omni extract representations from separate modality encoders, concatenate them for multimodal understanding, and employ speech decoders for synthesis (a minimal sketch of this concatenation pattern follows the list).
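To make the concatenation-based recipe concrete, here is a minimal PyTorch sketch of sequence-dimension concatenation. The module names, feature dimensions, and projection layers are illustrative assumptions, not code from any of the models above.

```python
# Illustrative sketch (assumed shapes and names): each modality encoder's output is
# projected into the LLM embedding space and concatenated along the sequence dimension,
# leaving the LLM to learn cross-modal alignment purely from data.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    def __init__(self, vision_dim=1024, speech_dim=768, llm_dim=4096):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, llm_dim)  # vision encoder -> LLM space
        self.speech_proj = nn.Linear(speech_dim, llm_dim)  # speech encoder -> LLM space

    def forward(self, vision_feats, speech_feats, text_embeds):
        # vision_feats: (batch, n_vision_tokens, vision_dim)
        # speech_feats: (batch, n_speech_frames, speech_dim)
        # text_embeds:  (batch, n_text_tokens, llm_dim)
        v = self.vision_proj(vision_feats)
        s = self.speech_proj(speech_feats)
        # Sequence-dimension concatenation: [vision | speech | text]
        return torch.cat([v, s, text_embeds], dim=1)

fusion = ConcatFusion()
fused = fusion(torch.randn(1, 576, 1024),   # e.g. 576 vision patch tokens
               torch.randn(1, 300, 768),    # e.g. 300 speech frames
               torch.randn(1, 32, 4096))    # 32 text tokens
print(fused.shape)  # torch.Size([1, 908, 4096])
```

Because the three streams are simply stacked into one long sequence, nothing in this setup tells the model which speech frames correspond to which text tokens; that correspondence must be learned from large amounts of paired data.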
Introducing Stream-Omni: A Text-Centric Alignment Approach
Researchers from the University of Chinese Academy of Sciences have developed Stream-Omni, a large language-vision-speech model aimed at the modality-alignment challenges of omni-modal systems. Stream-Omni builds on an LLM backbone and aligns the vision and speech modalities to text according to their semantic relationships rather than through simple concatenation. For vision, it applies sequence-dimension concatenation to align vision and text; for speech, it introduces a CTC-based layer-dimension mapping for speech-text alignment. These targeted alignment mechanisms overcome the limitations of purely concatenation-based approaches.
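The CTC-based alignment idea can be illustrated with a short PyTorch sketch: frame-level speech states are projected onto the text vocabulary and trained with a CTC objective against the text token sequence. The dimensions, blank id, and module names below are assumptions for illustration, not Stream-Omni's actual implementation.

```python
# Minimal sketch of CTC-based speech-text alignment (assumed sizes, not Stream-Omni's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim = 1024    # width of the speech-layer hidden states (assumed)
vocab_size = 32000   # text tokenizer vocabulary (assumed)
blank_id = 0         # CTC blank symbol (assumed)

speech_to_vocab = nn.Linear(hidden_dim, vocab_size)  # project speech states onto the text vocabulary
ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)

def ctc_alignment_loss(speech_hidden, text_ids, speech_lens, text_lens):
    # speech_hidden: (batch, n_frames, hidden_dim) from the speech-side layers
    # text_ids:      (batch, max_text_len) target text token ids
    logits = speech_to_vocab(speech_hidden)                    # (batch, frames, vocab)
    log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)  # CTCLoss expects (frames, batch, vocab)
    return ctc(log_probs, text_ids, speech_lens, text_lens)

# Dummy batch: 2 utterances, 120 speech frames, up to 20 text tokens each
loss = ctc_alignment_loss(
    torch.randn(2, 120, hidden_dim),
    torch.randint(1, vocab_size, (2, 20)),   # keep the blank id out of the targets
    torch.tensor([120, 100]),
    torch.tensor([20, 15]),
)
print(loss.item())
```

In this scheme each speech frame is softly assigned to a text token (or blank), which is what makes it feasible to surface intermediate text results during speech interaction rather than treating speech as an opaque extra sequence.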
Architecture Overview: Dual-Layer Speech Integration and Visual Encoding
Stream-Omni’s architecture features an LLM backbone with progressive modality alignment strategies. For vision-text alignment, it uses a vision encoder and a projection layer to extract visual representations. For speech-text alignment, special speech layers are integrated at both the bottom and top of the LLM backbone, enabling bidirectional mapping between the speech and text modalities. The training corpus is constructed through automated pipelines: LLaVA datasets supply vision-text pairs, LibriSpeech and WenetSpeech supply speech-text data, and the InstructOmni dataset is created by converting existing instruction datasets with text-to-speech synthesis.
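The sketch below lays out this dual-layer arrangement schematically, using small generic Transformer layers as stand-ins; the module names, layer counts, and dimensions are placeholders rather than the released architecture.

```python
# Schematic sketch of the described layout (all names and sizes are illustrative assumptions):
# projected vision tokens are concatenated with text along the sequence dimension, bottom
# speech layers map speech toward text-aligned states, the shared LLM backbone runs over the
# fused sequence, and top speech layers map backbone states back toward speech for synthesis.
import torch
import torch.nn as nn

def block(dim):
    # A small generic Transformer layer as a stand-in for the real layers.
    return nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)

class OmniSketch(nn.Module):
    def __init__(self, llm_dim=512, vision_dim=1024, n_llm=4, n_speech=2):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, llm_dim)                              # vision -> LLM space
        self.bottom_speech = nn.ModuleList([block(llm_dim) for _ in range(n_speech)])  # speech -> text side
        self.backbone = nn.ModuleList([block(llm_dim) for _ in range(n_llm)])          # shared LLM layers
        self.top_speech = nn.ModuleList([block(llm_dim) for _ in range(n_speech)])     # text -> speech side

    def forward(self, vision_feats, speech_states, text_embeds):
        # Vision-text alignment via sequence-dimension concatenation.
        fused = torch.cat([self.vision_proj(vision_feats), text_embeds], dim=1)
        # Bottom speech layers map speech frames toward text-like representations.
        for layer in self.bottom_speech:
            speech_states = layer(speech_states)
        fused = torch.cat([fused, speech_states], dim=1)
        for layer in self.backbone:
            fused = layer(fused)
        # Top speech layers turn backbone states into speech-side representations.
        speech_out = fused
        for layer in self.top_speech:
            speech_out = layer(speech_out)
        return fused, speech_out

sketch = OmniSketch()
text_states, speech_side = sketch(torch.randn(1, 64, 1024),   # vision tokens
                                  torch.randn(1, 120, 512),   # speech frames
                                  torch.randn(1, 32, 512))    # text tokens
print(text_states.shape, speech_side.shape)
```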
Benchmarking Multimodal Capabilities Across Domains
In visual understanding tasks, Stream-Omni achieves performance comparable to leading vision-oriented LMMs, surpassing VITA-1.5 while reducing modality interference and maintaining strong visual capabilities. For speech interaction, Stream-Omni delivers strong knowledge-based performance while using only 23K hours of speech data, far less than discrete-speech-unit models such as SpeechGPT and Moshi require. In evaluations on the SpokenVisIT benchmark for vision-grounded speech interaction, Stream-Omni outperforms VITA-1.5 in real-world visual understanding. Its speech-text mapping also yields superior ASR performance on the LibriSpeech benchmark, in both accuracy and inference time.
Conclusion: A Paradigm Shift in Multimodal Alignment
In summary, Stream-Omni presents a solution to the modality alignment challenges in omni-modal systems. This approach demonstrates that effective modality alignment can be achieved through sequence-dimension concatenation for vision-text pairs and layer-dimension mapping for speech-text integration, reducing the dependency on extensive tri-modal training data. This research establishes a new paradigm for omni-modal LMMs, illustrating that targeted alignment strategies based on semantic relationships can surpass the limitations of traditional concatenation-based methods in multimodal AI systems.
Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project.