Liquid AI Released LFM2-Audio-1.5B: An End-to-End Audio Foundation Model with Sub-100 ms Response Latency

Understanding the Target Audience for LFM2-Audio-1.5B

The primary audience for Liquid AI’s LFM2-Audio-1.5B includes AI developers, data scientists, business managers at technology firms, and audio engineers. These professionals want to add advanced voice capabilities to their applications while meeting strict performance requirements such as low latency and resource efficiency.

**Pain Points:**
These users frequently face challenges with model integration, latency, and the overhead of maintaining separate models for different tasks (ASR, TTS, etc.). Fast response times are critical for real-time applications.

**Goals:**
Their objectives generally revolve around implementing effective voice interactions, enhancing user experiences, and utilizing a unified model to streamline development workflows.

**Interests:**
This audience is particularly interested in novel AI approaches to audio processing, advancements in natural language processing technologies, and practical applications of AI in business contexts.

**Communication Preferences:**
They prefer technical content that is concise, data-driven, and actionable, usually with clear diagrams, code examples, and practical case studies. Engagement on platforms like GitHub and technical forums is also common.

Key Features of LFM2-Audio-1.5B

Liquid AI’s latest model, LFM2-Audio-1.5B, offers a compact design that integrates speech and text processing in an end-to-end stack, tailored for low-latency responses on resource-constrained devices. Here are the essential features:

  • Unified Backbone: LFM2-Audio extends the 1.2B-parameter LFM2 language model to treat audio and text as first-class sequence tokens.
  • Disentangled Audio I/O: Inputs are continuous embeddings computed from raw waveform chunks (~80 ms each); outputs are discrete audio codes. This mitigates discretization artifacts on the input side while keeping training autoregressive (see the chunking sketch after this list).
  • Implementation Specifications:
    Backbone: LFM2 (hybrid convolution + attention), 1.2B parameters (LM only)
    Audio encoder: FastConformer (~115M parameters)
    Audio decoder: RQ-Transformer predicting discrete Mimi codec tokens (8 codebooks)
    Context length: 32,768 tokens
    Vocabulary: 65,536 (text); 2049 × 8 (audio)
    Precision: bfloat16
    License: LFM Open License v1.0
    Language: English
  • Generation Modes:
    Interleaved generation for speech-to-speech chat, minimizing perceived latency
    Sequential generation for ASR and TTS, switching modality turn by turn
  • Latency: under 100 ms from the end of a 4-second audio query to the first audio response.
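
To make the input framing concrete, here is a minimal, runnable sketch of splitting a raw waveform into the ~80 ms chunks described above. The 16 kHz sample rate and zero-padding of the final chunk are illustrative assumptions, not LFM2-Audio's documented preprocessing.

```python
import numpy as np

SAMPLE_RATE = 16_000                             # assumed sample rate
CHUNK_MS = 80                                    # ~80 ms per chunk, as in the post
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000   # 1280 samples per chunk

def chunk_waveform(wave: np.ndarray) -> np.ndarray:
    """Split a mono waveform into fixed ~80 ms frames, zero-padding the tail."""
    pad = (-len(wave)) % CHUNK_SAMPLES
    wave = np.pad(wave, (0, pad))
    return wave.reshape(-1, CHUNK_SAMPLES)

# Example: a 4-second query yields 50 chunks of 1280 samples each.
chunks = chunk_waveform(np.zeros(4 * SAMPLE_RATE, dtype=np.float32))
print(chunks.shape)  # (50, 1280)
```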

Performance Benchmarks

On VoiceBench, LFM2-Audio-1.5B posts an overall score of 56.78 across the suite’s voice assistant tasks, a result competitive with considerably larger models.

In classical ASR, LFM2-Audio matches or improves on dedicated models such as Whisper-large-v3-turbo on several datasets, posting lower word error rates (WER) on AMI and LibriSpeech-clean, for example.
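
Since the comparison above hinges on word error rate, here is a self-contained sketch of how WER is typically computed: word-level Levenshtein distance normalized by reference length. It is generic and not tied to any particular benchmark harness.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 edits / 6 words ≈ 0.33
```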

The Importance of LFM2-Audio in Voice AI Trends

Unlike typical audio processing stacks that combine ASR, LLM, and TTS—leading to increased latency and complexity—LFM2-Audio’s single-backbone design simplifies the workflow. By using continuous input embeddings and discrete output codes, it reduces the glue logic required in integration, allowing for interleaved decoding that results in quicker audio output. For developers, this means less complexity while still supporting multiple functionalities, including ASR, TTS, classification, and conversational agents.
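
The latency benefit of interleaving can be seen in a toy, runnable simulation. The decoder below is a stand-in that merely alternates modalities; it is not LFM2-Audio's actual decoding loop, and every name in it is illustrative.

```python
from typing import Iterator, NamedTuple

class Token(NamedTuple):
    modality: str   # "text" or "audio"
    value: int

def fake_interleaved_decoder(n_steps: int) -> Iterator[Token]:
    """Stand-in for a single backbone emitting a mixed text/audio stream."""
    for step in range(n_steps):
        yield Token("text" if step % 2 == 0 else "audio", step)

first_audio_step, text_out, audio_out = None, [], []
for tok in fake_interleaved_decoder(10):
    (text_out if tok.modality == "text" else audio_out).append(tok.value)
    if tok.modality == "audio" and first_audio_step is None:
        first_audio_step = tok.value + 1  # playback could start here

print(f"first audio token after {first_audio_step} decode steps")    # 2
print(len(text_out), "text tokens,", len(audio_out), "audio tokens")  # 5, 5
```

A cascaded ASR → LLM → TTS stack, by contrast, cannot emit any audio until the full text reply has been generated and synthesized.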

Liquid AI provides extensive resources, including a Python package and a Gradio demo, for exploring and deploying LFM2-Audio. Additional technical details are available on [Hugging Face](https://huggingface.co/LiquidAI/LFM2-Audio-1.5B).
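
As a concrete starting point, the released checkpoint can be fetched with huggingface_hub (a real, stable API). The commented-out inference lines below are a hypothetical sketch: the `liquid_audio` import and `LFM2AudioModel.chat` name are assumptions, and the actual classes and methods should be taken from the model card.

```python
from huggingface_hub import snapshot_download

# Download the LFM2-Audio-1.5B checkpoint from Hugging Face.
local_dir = snapshot_download("LiquidAI/LFM2-Audio-1.5B")
print("checkpoint downloaded to", local_dir)

# Hypothetical inference sketch (names are illustrative, not confirmed API):
# from liquid_audio import LFM2AudioModel
# model = LFM2AudioModel.from_pretrained(local_dir)
# audio_reply = model.chat(audio_query)  # interleaved speech-to-speech turn
```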

Conclusion

Liquid AI’s LFM2-Audio-1.5B sets a precedent in audio processing models, addressing critical industry needs for speed and efficiency. By simplifying audio and text processing into a unified framework, it enables developers and businesses alike to create sophisticated voice AI applications tailored for real-time interaction.