
Enigmata’s Multi-Stage and Mix-Training Reinforcement Learning Recipe Drives Breakthrough Performance in LLM Puzzle Reasoning

Large Reasoning Models (LRMs), trained from LLMs with reinforcement learning (RL), have shown strong performance on complex reasoning tasks, including mathematics, STEM, and coding. However, existing LRMs struggle with puzzle tasks that require logical reasoning skills yet are straightforward for humans. Prior work has focused mainly on designing benchmarks for evaluation, without providing the training methods and resources modern LLMs need to close this gap. Moreover, current puzzle datasets are limited in diversity and scalability, covering only a few puzzle types and offering little control over generation or difficulty.

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a crucial method for enhancing models’ reasoning capabilities: it assigns rewards directly based on objectively verifiable answers. Puzzles are particularly well-suited for RLVR because their solutions can be checked mechanically, yet previous research has largely overlooked their potential as a source of reward signals. Existing benchmarks cover various reasoning types, including abstract, deductive, and compositional reasoning, but few support scalable generation and difficulty control, and puzzle diversity remains limited. Approaches to improving LLMs’ puzzle-solving abilities fall primarily into two categories: tool integration and RLVR.
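
To make the idea concrete, here is a minimal sketch of what a rule-based verifiable reward might look like for a single puzzle type. The Sudoku example and all function names are illustrative assumptions, not Enigmata’s actual API:

```python
# Minimal sketch of a rule-based verifiable reward for RLVR.
# The Sudoku task and all names are illustrative, not Enigmata's API.

def sudoku_verifier(puzzle: list[list[int]], answer: list[list[int]]) -> bool:
    """Check that `answer` is a valid completion of `puzzle` (0 = empty cell)."""
    def valid_group(cells: list[int]) -> bool:
        return sorted(cells) == list(range(1, 10))

    rows_ok = all(valid_group(row) for row in answer)
    cols_ok = all(valid_group([answer[r][c] for r in range(9)]) for c in range(9))
    boxes_ok = all(
        valid_group([answer[r + dr][c + dc] for dr in range(3) for dc in range(3)])
        for r in (0, 3, 6) for c in (0, 3, 6)
    )
    # Every given clue must be preserved in the answer.
    clues_ok = all(
        puzzle[r][c] in (0, answer[r][c]) for r in range(9) for c in range(9)
    )
    return rows_ok and cols_ok and boxes_ok and clues_ok

def reward(puzzle, model_answer) -> float:
    """Binary verifiable reward: 1.0 if the answer checks out, else 0.0."""
    return 1.0 if sudoku_verifier(puzzle, model_answer) else 0.0
```

Because the check is fully deterministic, the reward signal is exact rather than approximated by a learned judge, which is what makes puzzles such a natural fit for RLVR.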

Researchers from ByteDance Seed, Fudan University, Tsinghua University, Nanjing University, and Shanghai Jiao Tong University have introduced Enigmata, the first comprehensive toolkit designed to equip LLMs with puzzle reasoning skills. Enigmata features 36 tasks across seven categories; each task comes with a generator that produces unlimited instances at controllable difficulty and a rule-based verifier for automatic evaluation. The researchers also developed Enigmata-Eval as a rigorous benchmark and designed optimized multi-task RLVR strategies. Puzzle data from Enigmata also improves state-of-the-art performance on advanced math and STEM reasoning tasks, demonstrating the generalization benefits of the toolkit.
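
One plausible shape for such a task, pairing a difficulty-controlled generator with a rule-based verifier, is sketched below; the class and method names are assumptions for illustration rather than the toolkit’s real interface:

```python
# Hypothetical generator/verifier pair in the spirit of an Enigmata task.
# All names here are illustrative assumptions, not the released API.
import random
from dataclasses import dataclass

@dataclass
class PuzzleInstance:
    prompt: str       # text shown to the model
    solution: str     # ground-truth answer used by the verifier

class ArithmeticTask:
    """A toy arithmetic puzzle with a controllable difficulty knob."""

    def generate(self, difficulty: str, seed: int | None = None) -> PuzzleInstance:
        rng = random.Random(seed)
        digits = {"easy": 2, "medium": 4, "hard": 6}[difficulty]
        a = rng.randrange(10 ** (digits - 1), 10 ** digits)
        b = rng.randrange(10 ** (digits - 1), 10 ** digits)
        return PuzzleInstance(prompt=f"Compute {a} + {b}.", solution=str(a + b))

    def verify(self, instance: PuzzleInstance, answer: str) -> bool:
        # Rule-based check: exact string match against the known solution.
        return answer.strip() == instance.solution
```

Because generation is programmatic, training data can be produced in unlimited quantity at any difficulty setting, and the same verifier doubles as the RLVR reward function.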

Enigmata-Data consists of 36 puzzle tasks organized into seven primary categories: Crypto, Arithmetic, Logic, Grid, Graph, Search, and Sequential Puzzle, making it the only dataset with multiple task categories that offers scalability, automatic verification, and public availability. Data construction follows a three-phase pipeline: Tasks Collection and Design, Auto-Generator and Verifier Development, and Sliding Difficulty Control. Enigmata-Eval was systematically sampled from the broader dataset, targeting 50 instances per difficulty level for each task. The final evaluation set contains 4,758 puzzle instances, slightly below the theoretical maximum of 5,400, because some tasks cannot supply the full quota at every difficulty level.
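
The sampling arithmetic is easy to reconstruct: 36 tasks × 3 difficulty levels × 50 instances gives the 5,400-instance ceiling, and cells that cannot supply 50 distinct instances account for the shortfall to 4,758. A small sketch of that sampling step, with hypothetical helper names, might look like this:

```python
# Sketch of the Enigmata-Eval sampling described above: up to 50 instances
# per (task, difficulty) cell, for a ceiling of 36 x 3 x 50 = 5,400.
# The function and parameter names are hypothetical.
import random

DIFFICULTIES = ("easy", "medium", "hard")

def sample_eval_set(pool_by_task_and_level, per_cell=50, seed=0):
    """pool_by_task_and_level maps (task, level) -> list of candidate instances."""
    rng = random.Random(seed)
    eval_set = []
    for (task, level), instances in sorted(pool_by_task_and_level.items()):
        # Some cells hold fewer than 50 distinct instances, which is why the
        # released set has 4,758 items rather than the full 5,400.
        k = min(per_cell, len(instances))
        eval_set.extend(rng.sample(instances, k))
    return eval_set
```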

The proposed model, Qwen2.5-32B-Enigmata, outperforms most public models on Enigmata-Eval with only 32B parameters, demonstrating the effectiveness of the dataset and training recipe. It also excels on the challenging ARC-AGI benchmark, surpassing strong reasoning models such as Gemini 2.5 Pro, o3-mini, and o1. The model performs especially well in structured reasoning categories, particularly Crypto, Arithmetic, and Logic tasks, indicating that rule-based reasoning capabilities were effectively developed. It is also competitive on search tasks that require strategic exploration and planning. Notably, Crypto and Arithmetic tasks yield the highest accuracy, while spatial and sequential tasks remain more challenging.

In summary, the researchers introduced Enigmata as a comprehensive suite for equipping LLMs with advanced puzzle reasoning that integrates seamlessly with RL via verifiable, rule-based rewards. The trained Enigmata-Model exhibits superior performance and robust generalization through RLVR training. Experiments show that when the synthetic puzzle data is applied to larger models such as Seed1.5-Thinking (20B/200B parameters), it also improves performance beyond state-of-the-art models in other domains, including mathematics and STEM reasoning. Enigmata provides a solid foundation for the research community to advance reasoning model development, offering a unified framework that bridges logical puzzle-solving with broader reasoning capabilities in LLMs.

Check out the Paper, GitHub Page, and Project Page. All credit for this research goes to the researchers of this project.