Nous Research Team Releases Hermes 4: A Family of Open-Weight AI Models with Hybrid Reasoning
Nous Research has introduced Hermes 4, a family of open-weight models (14B, 70B, and 405B parameter sizes based on Llama 3.1 checkpoints) that achieves frontier-level performance through post-training techniques. Hermes 4 features hybrid reasoning, allowing models to switch between standard responses and explicit step-by-step reasoning wrapped in `<think>` ... `</think>` tags.
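As a rough illustration of how this hybrid mode is typically invoked, the sketch below loads the open weights with Hugging Face Transformers and prompts for a `<think>`-tagged response. The repository name and system-prompt wording are assumptions for illustration; consult the model card on Hugging Face for the exact reasoning-mode prompt.

```python
# Minimal sketch: invoking Hermes 4's reasoning mode via the chat template.
# The model ID and system prompt here are illustrative, not official.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NousResearch/Hermes-4-70B"  # assumed repo name; check Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    # A reasoning-style system prompt; in reasoning mode the model emits its
    # chain of thought inside <think> ... </think> before the final answer.
    {"role": "system", "content": "You are a deep-thinking assistant. Reason inside <think></think> tags, then give your final answer."},
    {"role": "user", "content": "What is the sum of the first 50 positive integers?"},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```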
Significance of Hermes 4
Hermes 4 stands out for achieving state-of-the-art performance among open-weight models while maintaining transparency and a neutral alignment philosophy. This demonstrates that sophisticated reasoning capabilities can be developed entirely through open-source methodologies.
DataForge: Graph-Based Synthetic Data Generation
DataForge serves as the core system behind Hermes 4. This graph-based synthetic data generation system revolutionizes how training data is created. Unlike traditional curation methods, DataForge utilizes a directed acyclic graph (DAG) where each node implements a Planning Domain Definition Language (PDDL) action interface.
Each node defines preconditions, postconditions, and transformations, allowing complex data pipelines to be assembled automatically. Using pre-training seed data from DCLM and FineWeb, the system can transform content across diverse formats, for example converting a Wikipedia article into a rap song and then generating instruction-answer pairs from the transformed text.
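To make the node interface concrete, here is a minimal sketch of what a PDDL-style DAG node could look like. The class names, attribute flags, and the `call_llm` stub are hypothetical and are not the DataForge implementation; they only illustrate the precondition/postcondition pattern described above.

```python
# Illustrative sketch (not DataForge source): a DAG node in the spirit of a
# PDDL action, with preconditions/postconditions over a sample's attributes
# and a transformation that rewrites the sample.
from dataclasses import dataclass, field
from typing import Callable, Set

def call_llm(prompt: str) -> str:
    """Stand-in for the LLM call a real pipeline would make; replace with an API."""
    return f"[generated text for: {prompt[:40]}...]"

@dataclass
class Sample:
    text: str
    attributes: Set[str] = field(default_factory=set)

@dataclass
class DataForgeNode:
    name: str
    preconditions: Set[str]    # attributes a sample must already have
    postconditions: Set[str]   # attributes this node guarantees afterwards
    transform: Callable[[Sample], Sample]

    def applicable(self, sample: Sample) -> bool:
        return self.preconditions.issubset(sample.attributes)

    def apply(self, sample: Sample) -> Sample:
        out = self.transform(sample)
        out.attributes |= self.postconditions
        return out

# Example node: turn seed web text into a rap-song rewrite, from which a later
# node could derive an instruction-answer pair.
def to_rap_song(sample: Sample) -> Sample:
    prompt = f"Rewrite the following article as a rap song:\n\n{sample.text}"
    return Sample(text=call_llm(prompt), attributes=set(sample.attributes))

rap_node = DataForgeNode(
    name="article_to_rap",
    preconditions={"seed_text"},
    postconditions={"stylized_text"},
    transform=to_rap_song,
)
```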
DataForge generates approximately 5 million samples totaling 19 billion tokens, where reasoning samples are intentionally token-heavy, averaging five times more tokens than non-reasoning counterparts to capture thought processes up to 16,000 tokens long.
Rejection Sampling at Unprecedented Scale
Hermes 4 employs Atropos, Nous Research’s open-source reinforcement learning environment, to implement rejection sampling across about 1,000 distinct task-specific verifiers. This infrastructure filters for high-quality reasoning trajectories across a wide range of domains.
Key verification environments include:
- Answer Format Training (rewarding proper formatting across 150+ output formats)
- Instruction Following (using RLVR-IFEval tasks with complex constraints)
- Schema Adherence (for JSON generation using Pydantic models)
- Tool Use training for agentic behavior
The rejection sampling process creates a substantial corpus of verified reasoning trajectories, ensuring the model learns robust reasoning patterns rather than mere memorization of specific templates.
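Conceptually, rejection sampling against a verifier can be sketched as follows. This is an illustrative loop, not Atropos itself; the `Invoice` schema and helper names are made up to show a Pydantic-based schema-adherence check like the one listed above.

```python
# Illustrative rejection-sampling loop: sample several candidates per prompt
# and keep only those that a task-specific verifier accepts. Here the verifier
# checks JSON schema adherence with Pydantic.
import json
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):          # hypothetical target schema for a schema-adherence task
    customer: str
    total: float

def verify_schema(completion: str) -> bool:
    """Return True if the completion parses as JSON matching the schema."""
    try:
        Invoice.model_validate(json.loads(completion))
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

def rejection_sample(prompt, generate, verify, n_candidates=8):
    """Keep only verified completions; `generate` is any sampling function."""
    kept = []
    for _ in range(n_candidates):
        candidate = generate(prompt)
        if verify(candidate):
            kept.append({"prompt": prompt, "completion": candidate})
    return kept
```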
Length Control: Solving Overlong Generation
One innovative contribution of Hermes 4 addresses the issue of overlong reasoning, where models produce excessively long chains of thought. The research team found that their 14B model hit the maximum context length 60% of the time on LiveCodeBench while in reasoning mode.
The solution involves a second supervised fine-tuning stage that teaches models to stop reasoning at exactly 30,000 tokens:
- Generate reasoning traces from the current policy
- Insert a `</think>` token at exactly the 30,000-token mark
- Train solely on the termination decision, not on the reasoning chain
- Apply gradient updates only to the `</think>` and end-of-turn tokens
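A minimal sketch of this masking idea, under the assumption that the termination tokens are `</think>` plus an end-of-turn token (the token IDs and helper name below are hypothetical):

```python
# Only the inserted </think> and end-of-turn tokens carry loss, so the model
# learns *when* to stop, not the reasoning content itself.
import torch

IGNORE_INDEX = -100          # ignored by PyTorch cross-entropy
THINK_BUDGET = 30_000        # token budget for the reasoning segment

def build_length_control_example(trace_ids, close_think_id, eot_id):
    """trace_ids: reasoning tokens sampled from the current policy."""
    kept = trace_ids[:THINK_BUDGET]
    input_ids = kept + [close_think_id, eot_id]
    labels = [IGNORE_INDEX] * len(kept) + [close_think_id, eot_id]
    return torch.tensor(input_ids), torch.tensor(labels)
```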
This method achieves a substantial reduction in generated token counts:
- 78.4% on AIME’24
- 65.3% on AIME’25
- 79.8% on LiveCodeBench
The relative accuracy cost remains between 4.7% and 12.7%, while effectively teaching models a "counting" behavior without risking collapse.
Benchmark Performance and Neutral Alignment
Hermes 4 demonstrates exceptional performance among open-weight models. Notably, the 405B model achieves:
- 96.3% on MATH-500 (reasoning mode)
- 81.9% on AIME’24
- 78.1% on AIME’25
- 70.5% on GPQA Diamond
- 61.3% on LiveCodeBench
- 57.1% on RefusalBench, outperforming GPT-4o (17.67%) and Claude Sonnet 4 (17%)
This performance indicates the model’s capability to engage with controversial topics while maintaining appropriate boundaries, reflecting Nous Research’s commitment to a neutral alignment philosophy.
Technical Architecture and Training
The training of Hermes 4 uses a modified version of TorchTitan across 192 NVIDIA B200 GPUs and efficiently handles a highly heterogeneous sample-length distribution. Key features include:
- Efficient packing to achieve >99.9% batch efficiency
- FlexAttention and sophisticated loss masking, with only assistant-role tokens contributing to the cross-entropy loss
- A cosine learning rate schedule with 300 warmup steps and a total of 9,000 steps at a 16,384 token context length, with a global batch size of 384 samples
The training combines Data Parallelism, Tensor Parallelism, and Fully Sharded Data Parallelism, facilitating optimal resource utilization.
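For reference, the reported schedule hyperparameters (300 warmup steps, 9,000 total steps) can be approximated with the Hugging Face helper below. The actual run uses a modified TorchTitan stack, so this is only an illustrative stand-in, not the training code.

```python
# Sketch of a cosine learning-rate schedule with linear warmup matching the
# figures above; the model and batch loop are placeholders.
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(16, 16)                     # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=300,
    num_training_steps=9_000,
)

for step in range(9_000):
    # ... forward/backward on a packed 16,384-token batch would go here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```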
Summary
Hermes 4 signifies a major advancement in open-source AI development, illustrating that leading-edge reasoning capabilities can be attained through transparent and reproducible methodologies without depending on proprietary data or closed development frameworks. By integrating innovative graph-based synthetic data generation, extensive rejection sampling, and effective length control mechanisms, Nous Research has developed models that not only rival the performance of leading proprietary systems but also uphold the neutrality and steerability necessary for practical application.
For more details, see the research paper and the model weights on Hugging Face.