BAAI Launches OmniGen2: A Unified Diffusion and Transformer Model for Multimodal AI
The Beijing Academy of Artificial Intelligence (BAAI) has introduced OmniGen2, an open-source multimodal generative model that builds upon its predecessor, OmniGen. This next-generation architecture integrates text-to-image generation, image editing, and subject-driven generation within a single transformer framework. Key innovations include decoupled modeling of text and image generation, a reflection mechanism for iterative refinement, and OmniContext, a purpose-built benchmark for evaluating contextual consistency.
A Decoupled Multimodal Architecture
OmniGen2 distinguishes itself from previous models by employing two separate pathways: an autoregressive transformer for text generation and a diffusion-based transformer for image synthesis. A novel position embedding scheme, Omni-RoPE, handles sequence order, 2D spatial coordinates, and modality distinctions within a single formulation, supporting high-fidelity image generation and editing.
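The sketch below illustrates one way such a multi-component position could be represented, assuming the decomposition into a sequence/modality identifier plus 2D spatial coordinates that the Omni-RoPE description implies; the class and function names are illustrative, not OmniGen2's actual API.

```python
# Illustrative Omni-RoPE-style position bookkeeping (not the official code).
from dataclasses import dataclass

@dataclass
class OmniPosition:
    seq_id: int  # identifies the text segment or image this token belongs to
    h: int       # height coordinate inside an image (0 for text tokens)
    w: int       # width coordinate inside an image (0 for text tokens)

def text_positions(start_id: int, num_tokens: int) -> list[OmniPosition]:
    """Text tokens advance the 1D identifier and carry no spatial coordinates."""
    return [OmniPosition(start_id + i, 0, 0) for i in range(num_tokens)]

def image_positions(seq_id: int, height: int, width: int) -> list[OmniPosition]:
    """All patches of one image share a seq_id but keep their own (h, w),
    so corresponding regions across input and output images line up spatially."""
    return [OmniPosition(seq_id, h, w) for h in range(height) for w in range(width)]
```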
To preserve the pretrained text generation capabilities of the underlying MLLM (based on Qwen2.5-VL-3B), OmniGen2 feeds VAE-derived features exclusively to the diffusion pathway. This keeps the model's text understanding and generation intact while still providing rich visual representations to the image synthesis module.
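As a rough sketch of this routing, with placeholder module names (mllm, diffusion_dit, vae) rather than the released implementation, the frozen MLLM produces semantic conditioning while VAE latents of reference images are passed only to the diffusion transformer:

```python
# Hedged sketch of the decoupled forward pass; all module names are placeholders.
import torch

def generate_image(mllm, diffusion_dit, vae, prompt_ids, ref_images):
    with torch.no_grad():                                  # MLLM weights stay frozen
        hidden = mllm(prompt_ids).hidden_states            # semantic conditioning
    ref_latents = [vae.encode(img) for img in ref_images]  # fine-grained visual detail
    # Only the diffusion pathway sees VAE features, so the MLLM's pretrained
    # language behaviour is left untouched.
    return diffusion_dit.sample(condition=hidden, reference_latents=ref_latents)
```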
Reflection Mechanism for Iterative Generation
A standout feature of OmniGen2 is its reflection mechanism. By integrating feedback loops during training, the model can analyze its generated outputs, identify inconsistencies, and propose refinements. This iterative process enhances instruction-following accuracy and visual coherence, particularly for nuanced tasks such as modifying color, object count, or positioning.
The reflection dataset was constructed using multi-turn feedback, allowing the model to learn how to revise and terminate generation based on content evaluation. This mechanism is crucial for bridging the quality gap between open-source and commercial models.
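A minimal sketch of such a loop, under the article's description and with assumed helper methods (generate, reflect) and an assumed cap on rounds, might look like this:

```python
# Illustrative reflection loop: generate, critique, then revise or terminate.
def generate_with_reflection(model, instruction, max_rounds=3):
    image = model.generate(instruction)
    for _ in range(max_rounds):
        critique = model.reflect(instruction, image)   # textual analysis of flaws
        if critique.is_satisfactory:                   # learned termination decision
            break
        image = model.generate(instruction, feedback=critique.text)
    return image
```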
OmniContext Benchmark: Evaluating Contextual Consistency
To rigorously assess in-context generation, the team introduced OmniContext, a benchmark comprising three primary task types (SINGLE, MULTIPLE, and SCENE) across Character, Object, and Scene categories. OmniGen2 achieves state-of-the-art performance among open-source models on this benchmark, scoring 7.18 overall and outperforming leading models such as BAGEL and UniWorld-V1.
The evaluation utilizes three core metrics: Prompt Following (PF), Subject Consistency (SC), and Overall Score (geometric mean), each validated through GPT-4.1-based reasoning. This benchmarking framework emphasizes not only visual realism but also semantic alignment with prompts and cross-image consistency.
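Since the Overall Score is a geometric mean, it can be computed directly from the two judge scores; the snippet below assumes the 0-10 scale implied by the reported numbers:

```python
# Overall Score as the geometric mean of Prompt Following (PF) and
# Subject Consistency (SC), both assumed to be on a 0-10 scale.
from math import sqrt

def overall_score(pf: float, sc: float) -> float:
    return sqrt(pf * sc)

# overall_score(8.1, 6.4) == 7.2  (illustrative numbers, not paper results)
```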
Data Pipeline and Training Corpus
OmniGen2 was trained on 140M T2I samples and 10M proprietary images, supplemented by meticulously curated datasets for in-context generation and editing. These datasets were constructed using a video-based pipeline that extracts semantically consistent frame pairs and automatically generates instructions using Qwen2.5-VL models. The resulting annotations cover fine-grained image manipulations, motion variations, and compositional changes.
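A simplified version of such a pairing step, with assumed helper names and thresholds (the released pipeline may differ), could look like this:

```python
# Sketch of the video-based pairing pipeline: keep frame pairs that show the
# same subject with a meaningful change, then ask a VLM to describe the edit.
def build_edit_pairs(frames, vlm, similarity, lo=0.5, hi=0.9):
    pairs = []
    for a, b in zip(frames, frames[1:]):
        s = similarity(a, b)                          # e.g. embedding cosine similarity
        if lo < s < hi:                               # same scene, non-trivial change
            instruction = vlm.describe_change(a, b)   # assumed VLM helper
            pairs.append({"source": a, "target": b, "instruction": instruction})
    return pairs
```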
During training, the MLLM parameters remain largely frozen to retain general understanding, while the diffusion module is trained from scratch and optimized for joint visual-textual attention. A special token “<|img|>” triggers image generation within output sequences, streamlining the multimodal synthesis process.
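A hedged sketch of how such a trigger token can route between the two pathways, with placeholder model and tokenizer objects, is shown below:

```python
# When the MLLM emits "<|img|>", its hidden states condition the diffusion decoder.
IMG_TOKEN = "<|img|>"

def run_interleaved(mllm, diffusion_dit, tokenizer, prompt):
    out = mllm.generate(tokenizer(prompt))       # autoregressive text pass
    text = tokenizer.decode(out.token_ids)
    if IMG_TOKEN in text:                        # image generation requested
        image = diffusion_dit.sample(condition=out.hidden_states)
        return text.replace(IMG_TOKEN, "[image]"), image
    return text, None
```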
Performance Across Tasks
OmniGen2 delivers strong results across multiple domains:
- Text-to-Image (T2I): Achieves a score of 0.86 on GenEval and 83.57 on DPG-Bench.
- Image Editing: Outperforms open-source baselines with high semantic consistency (SC=7.16).
- In-Context Generation: Leads open-source models on OmniContext with scores of 7.81 (SINGLE), 7.23 (MULTIPLE), and 6.71 (SCENE).
- Reflection: Demonstrates effective revision of failed generations, with promising correction accuracy and termination behavior.
Conclusion
OmniGen2 is a robust and efficient multimodal generative system that advances unified modeling through architectural separation, high-quality data pipelines, and an integrated reflection mechanism. By open-sourcing the models, datasets, and code, the project lays a solid foundation for future research in controllable, consistent image-text generation. Future work may focus on reinforcement learning to refine the reflection mechanism and on improving multilingual support and robustness to low-quality inputs.