
Yandex Releases Yambda: The World’s Largest Event Dataset to Accelerate Recommender Systems

Yandex has made a significant contribution to the recommender systems community by releasing Yambda, the world’s largest publicly available dataset for recommender system research and development. This dataset bridges the gap between academic research and industry-scale applications, offering nearly 5 billion anonymized user interaction events from Yandex Music, which boasts over 28 million monthly users.

Why Yambda Matters: Addressing a Critical Data Gap in Recommender Systems

Recommender systems underpin personalized experiences across digital services, from e-commerce to streaming platforms. These systems rely on extensive behavioral data, such as clicks and listens, to infer user preferences. However, the field of recommender systems has lagged behind other AI domains due to a scarcity of large, openly accessible datasets. Unlike large language models, which learn from publicly available text, recommender systems require sensitive behavioral data that is commercially valuable and challenging to anonymize. Consequently, companies have traditionally restricted access to this data, limiting researchers’ ability to work with real-world-scale datasets.

Existing datasets like Spotify’s Million Playlist Dataset and Netflix Prize data are often too small or lack the necessary temporal detail for developing production-grade recommender models. Yandex’s release of Yambda addresses these issues by providing a high-quality, extensive dataset with robust features and anonymization safeguards.

What Yambda Contains: Scale, Richness, and Privacy

The Yambda dataset comprises 4.79 billion anonymized user interactions collected over a 10-month period from approximately 1 million users interacting with nearly 9.4 million tracks on Yandex Music. The dataset includes:

User Interactions: Implicit feedback (listens) and explicit feedback (likes, dislikes, and their removals).
Anonymized Audio Embeddings: Vector representations of tracks derived from convolutional neural networks, enabling models to leverage audio content similarity.
Organic Interaction Flags: An “is_organic” flag indicates whether users discovered a track independently or via recommendations.
Precise Timestamps: Each event is timestamped to preserve temporal ordering, crucial for modeling sequential user behavior.

All user and track identifiers are anonymized using numeric IDs to comply with privacy standards, ensuring no personally identifiable information is exposed. The dataset is provided in Apache Parquet format, which is optimized for big data processing frameworks like Apache Spark and Hadoop, making Yambda accessible for researchers and developers.
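To make the record layout concrete, here is a minimal Python sketch of a single interaction event, of measuring the share of organic events, and of comparing two track embeddings. Only `is_organic` is named in the description above. The other field names (`uid`, `item_id`, `timestamp`, `event_type`), the helper functions, and the sample values are assumptions made for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Event:
    """One interaction record. Field names other than is_organic are hypothetical."""
    uid: int          # anonymized numeric user ID
    item_id: int      # anonymized numeric track ID
    timestamp: int    # event time, in arbitrary units
    event_type: str   # e.g. "listen", "like", "dislike"
    is_organic: bool  # True if the user found the track without a recommendation


def organic_share(events):
    """Fraction of events that users reached on their own."""
    if not events:
        return 0.0
    return sum(e.is_organic for e in events) / len(events)


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data, invented for this example.
events = [
    Event(uid=1, item_id=10, timestamp=100, event_type="listen", is_organic=True),
    Event(uid=1, item_id=11, timestamp=105, event_type="listen", is_organic=False),
    Event(uid=2, item_id=10, timestamp=110, event_type="like", is_organic=True),
    Event(uid=2, item_id=12, timestamp=120, event_type="listen", is_organic=False),
]
share = organic_share(events)
```

Separating organic from recommendation-driven events in this way is one possible use of the flag, for example to check how much of a model's training signal was produced by an earlier recommender.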

Evaluation Method: Global Temporal Split

A key feature of Yandex’s dataset is the adoption of a Global Temporal Split (GTS) evaluation strategy. The widely used Leave-One-Out method holds out each user’s last interaction, so test events from different users fall at different points in time, and the training set can contain interactions that happened after some of the test events. GTS instead splits all data at a single point in time: interactions before it are used for training and interactions after it for testing, which preserves the full sequence of events. This approach mimics real-world recommendation scenarios, prevents future data from leaking into training, and tests models on truly unseen, chronologically later interactions.
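The difference between the two strategies can be sketched on toy data. The snippet below is an illustrative sketch, not Yandex's evaluation code: the tuple layout `(uid, item_id, timestamp)`, the function names, and the `split_time` value are all assumptions chosen for the example.

```python
def global_temporal_split(events, split_time):
    """Train on everything before split_time, test on everything at or after it."""
    train = [e for e in events if e[2] < split_time]
    test = [e for e in events if e[2] >= split_time]
    return train, test


def leave_one_out_split(events):
    """Hold out each user's latest event, regardless of when it happened."""
    last_by_user = {}
    for e in events:
        uid, _, ts = e
        if uid not in last_by_user or ts > last_by_user[uid][2]:
            last_by_user[uid] = e
    held_out = set(last_by_user.values())
    train = [e for e in events if e not in held_out]
    test = [e for e in events if e in held_out]
    return train, test


def has_leak(train, test):
    """True if any training event is later than the earliest test event."""
    return max(e[2] for e in train) > min(e[2] for e in test)


# Hypothetical (uid, item_id, timestamp) tuples: user 1 is active early, user 2 late.
events = [
    (1, "a", 1), (1, "b", 2), (1, "c", 3),
    (2, "a", 7), (2, "d", 8), (2, "e", 9),
]

gts_train, gts_test = global_temporal_split(events, split_time=7)
loo_train, loo_test = leave_one_out_split(events)

print(has_leak(gts_train, gts_test))  # False
print(has_leak(loo_train, loo_test))  # True
```

With Leave-One-Out, user 2's events at times 7 and 8 end up in the training set even though user 1's test event happened at time 3, so the model is trained on data from the "future" of that test event. The global split avoids this by construction.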

Baseline Models and Metrics Included

To support benchmarking and accelerate innovation, Yandex provides baseline recommender models implemented on the dataset, including:

MostPop: A popularity-based model recommending the most popular items.
DecayPop: A time-decayed popularity model.
ItemKNN: A neighborhood-based collaborative filtering method.
iALS: Implicit Alternating Least Squares matrix factorization.
BPR: Bayesian Personalized Ranking, a pairwise ranking method.
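SANSA: A sparse autoencoder-based collaborative filtering model designed to scale to large catalogs.
SASRec: A sequence-aware model that applies self-attention to a user’s interaction history.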
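As an illustration of the two simplest baselines in the list, the sketch below contrasts plain popularity with time-decayed popularity. It is not the Yambda reference implementation: the exponential half-life weighting, the function signatures, and the toy events are assumptions, since the exact decay function is not specified here.

```python
from collections import defaultdict


def most_pop(events, k):
    """Rank items by raw interaction count (ties broken by item name)."""
    counts = defaultdict(float)
    for _, item_id, _ in events:
        counts[item_id] += 1.0
    return sorted(counts, key=lambda item: (-counts[item], str(item)))[:k]


def decay_pop(events, k, now, half_life):
    """Rank items by weighted count; an event's weight halves every half_life time units."""
    scores = defaultdict(float)
    for _, item_id, ts in events:
        age = now - ts
        scores[item_id] += 0.5 ** (age / half_life)
    return sorted(scores, key=lambda item: (-scores[item], str(item)))[:k]


# Hypothetical (uid, item_id, timestamp) tuples: an older hit and a newer track.
events = [
    (1, "old_hit", 0), (2, "old_hit", 1), (3, "old_hit", 2),
    (4, "new_song", 98), (5, "new_song", 99),
]

top_by_count = most_pop(events, k=1)
top_by_decay = decay_pop(events, k=1, now=100, half_life=10)
```

On this toy data the raw count favors `old_hit` (three events against two), while the decayed score favors `new_song`, because the older events have lost almost all of their weight by time 100.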

These baselines are evaluated using standard recommender metrics such as:

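NDCG@k (Normalized Discounted Cumulative Gain): Measures ranking quality, rewarding relevant items more the higher they appear in the top-k list.
Recall@k: Measures the fraction of a user’s relevant items that appear in the top-k recommendations.
Coverage@k: Measures the share of the catalog that appears in at least one user’s top-k recommendations, a proxy for recommendation diversity.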
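The three metrics can be written down in a few lines of Python. This is a sketch using common textbook definitions with binary relevance; the benchmark's own implementation may differ in details such as the Recall@k denominator. Function names and toy data are assumptions, and recommendation lists are assumed to contain no duplicates.

```python
from math import log2


def recall_at_k(recommended, relevant, k):
    """Share of a user's relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: a hit at 1-based position p contributes 1 / log2(p + 1)."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / log2(rank + 2)  # rank is 0-based, so position p = rank + 1
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def coverage_at_k(recommendations_by_user, catalog_size, k):
    """Share of the catalog that shows up in at least one user's top-k list."""
    shown = set()
    for recs in recommendations_by_user.values():
        shown.update(recs[:k])
    return len(shown) / catalog_size


# Toy data, invented for this example.
recs = {1: ["a", "b", "c"], 2: ["a", "d", "e"]}
relevant = {1: ["b", "x"], 2: ["a"]}

recall_user1 = recall_at_k(recs[1], relevant[1], k=3)
ndcg_user2 = ndcg_at_k(recs[2], relevant[2], k=3)
coverage = coverage_at_k(recs, catalog_size=10, k=3)
```

Recall and NDCG are computed per user and then typically averaged, whereas coverage is computed once over all users' lists.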

Broad Applicability Beyond Music Streaming

While the dataset originates from a music streaming service, its value extends across sectors like e-commerce, video platforms, and social networks. Algorithms validated on this dataset can be generalized or adapted for various recommendation tasks.

Benefits for Different Stakeholders

Yambda offers several benefits:

Academia: Enables rigorous testing of theories and new algorithms at an industry-relevant scale.
Startups and SMBs: Provides a resource comparable to what tech giants possess, leveling the playing field and accelerating the development of advanced recommendation engines.
End Users: Benefit indirectly from smarter recommendation algorithms that improve content discovery, reduce search time, and increase engagement.

My Wave: Yandex’s Personalized Recommender System

Yandex Music uses a proprietary recommender system called My Wave, which relies on deep neural networks to personalize music suggestions. My Wave takes into account user interaction sequences, listening history, customizable preferences, and real-time music analysis. The system adapts dynamically to individual tastes by identifying audio similarities and predicting preferences, illustrating the kind of complex recommendation pipeline that benefits from large-scale datasets like Yambda.

Ensuring Privacy and Ethical Use

The release of Yambda highlights the importance of privacy in recommender system research. Yandex replaces all user and track identifiers with numeric IDs and omits personally identifiable information, so the dataset contains only interaction signals and reveals neither user identities nor sensitive attributes. This balance between openness and privacy allows for robust research while protecting individual user data, a critical consideration for the ethical advancement of AI technologies.

Access and Versions

Yandex offers the Yambda dataset in three sizes to accommodate different research and computational capacities:

Full version: ~5 billion events.
Medium version: ~500 million events.
Small version: ~50 million events.

All versions are accessible via Hugging Face, a popular platform for hosting datasets and machine learning models, enabling easy integration into research workflows.

Conclusion

Yandex’s release of the Yambda dataset marks a pivotal moment in recommender system research. By pairing anonymized interaction data at an unprecedented public scale with time-aware evaluation and ready-made baselines, it sets a new standard for benchmarking and helps accelerate innovation. Researchers, startups, and enterprises alike can now explore and develop recommender systems that better reflect real-world usage and deliver enhanced personalization.

As recommender systems continue to influence countless online experiences, datasets like Yambda play a foundational role in pushing the boundaries of what AI-powered personalization can achieve.

Check out the Yambda Dataset on Hugging Face.
