
Yandex Releases Alchemist: A Compact Supervised Fine-Tuning Dataset for Enhancing Text-to-Image (T2I) Model Quality


Despite the substantial progress in text-to-image (T2I) generation brought about by models such as DALL-E 3, Imagen 3, and Stable Diffusion 3, achieving consistently high output quality, in both aesthetics and prompt alignment, remains challenging. While large-scale pretraining provides general knowledge, it is rarely sufficient on its own to yield aesthetically strong, well-aligned outputs. Supervised fine-tuning (SFT) serves as a critical post-training step, but its effectiveness depends strongly on the quality of the fine-tuning dataset.

Target Audience Analysis

The target audience for Yandex’s Alchemist dataset includes:

  • Researchers and Developers — Interested in advancing T2I models through high-quality datasets.
  • AI Practitioners — Seeking scalable solutions for supervised fine-tuning without significant human intervention.
  • Businesses — Looking to improve their generative models’ output quality for commercial applications.

Common pain points for this audience include:

  • Inconsistent output quality from existing T2I models.
  • Lack of transparency in datasets leading to challenges in reproducibility.
  • High costs and inefficiency associated with human-led data curation.

The goals of this audience typically involve:

  • Enhancing model performance through improved fine-tuning datasets.
  • Achieving higher aesthetic and alignment quality in generated outputs.
  • Accessing reliable and transparent datasets for reproducible research.

Approach: Model-Guided Dataset Curation

To address these challenges, Yandex has released Alchemist, a publicly available general-purpose SFT dataset of 3,350 carefully selected image-text pairs. Unlike conventional datasets, Alchemist was built with a novel methodology that leverages a pre-trained diffusion model as a sample quality estimator. This approach selects training data with high impact on generative model performance without relying on subjective human labeling or simplistic aesthetic scoring.
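At its core, this curation step can be read as ranking candidate pairs by a learned score and keeping only the top-k. The minimal Python skeleton below illustrates that idea; `curate` and `score_fn` are hypothetical names introduced here for illustration, and the diffusion-based estimator itself is sketched in the pipeline section below.

```python
# Conceptual curation skeleton: rank candidate image-text pairs by a
# quality estimator and keep the top-k. `score_fn` is a placeholder
# for the diffusion-based estimator described in the next section.
def curate(pairs, score_fn, k=3350):
    """pairs: iterable of (image, caption) tuples; score_fn returns a
    scalar where higher means a more valuable fine-tuning sample."""
    ranked = sorted(pairs, key=lambda pair: score_fn(*pair), reverse=True)
    return ranked[:k]
```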

Technical Design: Filtering Pipeline and Dataset Characteristics

The construction of Alchemist involves a multi-stage filtering pipeline starting from ~10 billion web-sourced images. The pipeline is structured as follows:

  • Initial Filtering: Removal of NSFW content and low-resolution images (only images of at least 1024×1024 pixels are retained).
  • Coarse Quality Filtering: Application of classifiers to exclude images with compression artifacts, motion blur, watermarks, and other defects. These classifiers were trained on standard image quality assessment datasets.
  • Deduplication and IQA-Based Pruning: SIFT-like features are used to cluster visually similar images, with only the highest-quality representatives retained. Further scoring with the TOPIQ image quality assessment (IQA) model keeps only clean samples.
  • Diffusion-Based Selection: A pre-trained diffusion model’s cross-attention activations are used to rank images by visual complexity, aesthetic appeal, and stylistic richness (a hedged sketch of this stage follows the list).
  • Caption Rewriting: The final selected images are re-captioned using a vision-language model fine-tuned for prompt-style descriptions.
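The published scoring function is not reproduced in this post, but the following sketch shows one plausible shape for the diffusion-based selection stage using the `diffusers` library: noise an image’s latent, run a single UNet pass conditioned on the caption, and pool cross-attention activity. The model id, timestep choice, and mean-magnitude pooling are all illustrative assumptions, not the paper’s exact procedure.

```python
# Hedged sketch of diffusion-based scoring: noise an image's latent,
# run one UNet pass, and pool cross-attention activity as a quality
# proxy. Model id, timestep, and pooling are illustrative assumptions.
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

activations = []

def record(module, inputs, output):
    # Store the mean absolute activation of this cross-attention block.
    activations.append(output.detach().float().abs().mean().item())

# In diffusers UNets, cross-attention blocks are named "...attn2".
for name, module in pipe.unet.named_modules():
    if name.endswith("attn2"):
        module.register_forward_hook(record)

@torch.no_grad()
def diffusion_score(image: Image.Image, caption: str, t: int = 500) -> float:
    device = pipe.device
    # Encode the image into the VAE latent space.
    pixels = pipe.image_processor.preprocess(image).to(device, pipe.vae.dtype)
    latents = pipe.vae.encode(pixels).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor
    # Apply forward diffusion up to timestep t.
    noise = torch.randn_like(latents)
    timestep = torch.tensor([t], device=device)
    noisy = pipe.scheduler.add_noise(latents, noise, timestep)
    # One conditioned UNet pass; the hooks collect attn2 activity.
    prompt_embeds, _ = pipe.encode_prompt(caption, device, 1, False)
    activations.clear()
    pipe.unet(noisy, timestep, encoder_hidden_states=prompt_embeds)
    return sum(activations) / len(activations)
```

In practice such a score would be computed at scale and likely aggregated across timesteps; this sketch scores a single pair at a single timestep for clarity.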

Results Across Multiple T2I Models

The effectiveness of Alchemist was evaluated across five Stable Diffusion variants: SD1.5, SD2.1, SDXL, SD3.5 Medium, and SD3.5 Large. Each model was compared in three configurations: (i) fine-tuned on the Alchemist dataset, (ii) fine-tuned on a size-matched subset of LAION-Aesthetics v2, and (iii) left as its respective baseline.
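For orientation, the sketch below shows a minimal epsilon-prediction SFT step for SD1.5 with `diffusers`. The learning rate, batching, and preprocessing are illustrative assumptions rather than the paper’s training recipe, and newer models such as SD3.5 use a different (flow-matching) objective.

```python
# Minimal epsilon-prediction SFT step for SD1.5 (diffusers).
# Hyperparameters and preprocessing are illustrative assumptions.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")
unet, vae, text_encoder = pipe.unet, pipe.vae, pipe.text_encoder
vae.requires_grad_(False)          # only the UNet is trained here
text_encoder.requires_grad_(False)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

def sft_step(pixel_values: torch.Tensor, input_ids: torch.Tensor) -> float:
    """pixel_values: (B, 3, H, W) in [-1, 1]; input_ids: tokenized captions."""
    with torch.no_grad():
        latents = vae.encode(pixel_values).latent_dist.sample()
        latents = latents * vae.config.scaling_factor
        encoder_states = text_encoder(input_ids)[0]
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, pipe.scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    # Forward diffusion, then predict the added noise.
    noisy_latents = pipe.scheduler.add_noise(latents, noise, timesteps)
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=encoder_states).sample
    loss = F.mse_loss(noise_pred, noise)  # standard epsilon-prediction objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```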

Human Evaluation: Expert annotators performed assessments across four criteria — text-image relevance, aesthetic quality, image complexity, and fidelity. Alchemist-tuned models showed statistically significant improvements in aesthetic and complexity scores, often outperforming both baselines and LAION-Aesthetics-tuned versions by margins of 12–20%. Importantly, text-image relevance remained stable, suggesting that prompt alignment was not negatively affected.
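The exact statistical procedure is not spelled out here; as a rough illustration, significance in side-by-side preference studies is commonly assessed with a binomial test on win/loss counts. The counts in the sketch are hypothetical placeholders.

```python
# Illustrative significance check for pairwise human preferences.
# The win/loss counts below are hypothetical placeholders.
from scipy.stats import binomtest

wins, losses = 230, 130  # ties excluded from the test
result = binomtest(wins, wins + losses, p=0.5, alternative="greater")
print(f"win rate = {wins / (wins + losses):.1%}, p-value = {result.pvalue:.4g}")
```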

Automated Metrics: Across metrics such as FD-DINOv2, CLIP Score, ImageReward, and HPS-v2, Alchemist-tuned models generally scored higher than their counterparts. Improvements were more consistent when compared to size-matched LAION-based models than to baseline models.
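As a concrete example of one of these metrics, CLIP Score measures the cosine similarity between image and prompt embeddings. A minimal version with the `transformers` CLIP model is sketched below; the evaluation itself presumably relies on established implementations of each metric.

```python
# Minimal CLIP Score sketch: cosine similarity between CLIP image and
# text embeddings. Common implementations scale this value by 100.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum().item()  # cosine similarity in [-1, 1]
```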

Dataset Size Ablation: Fine-tuning with larger variants of Alchemist (7k and 19k samples) led to lower performance than the 3,350-sample version, underscoring the value of targeted, high-quality data over raw volume.

Conclusion

Alchemist provides a well-defined and empirically validated pathway to improve the quality of text-to-image generation via supervised fine-tuning. The approach emphasizes sample quality over scale and introduces a replicable methodology for dataset construction without reliance on proprietary tools.

While improvements are most notable in perceptual attributes like aesthetics and image complexity, the framework also highlights the trade-offs that emerge in fidelity, particularly for newer base models already optimized through internal SFT. Nevertheless, Alchemist establishes a new standard for general-purpose SFT datasets and offers a valuable resource for researchers and developers aiming to enhance the output quality of generative vision models.

Further Resources

Check out the paper and the Alchemist dataset on Hugging Face.
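Loading the dataset with the `datasets` library is a one-liner; note that the repo id below is an assumption, so substitute the exact id from the dataset card on Hugging Face.

```python
# Hypothetical loading snippet; the repo id is an assumption, so
# substitute the exact id from the Hugging Face dataset card.
from datasets import load_dataset

ds = load_dataset("yandex/alchemist", split="train")  # repo id assumed
print(len(ds))        # expected: 3,350 image-text pairs
print(ds[0].keys())   # inspect the available fields
```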
