Nous Research Team Releases Hermes 4: A Family of Open-Weight AI Models with Hybrid Reasoning
Nous Research has introduced Hermes 4, a family of open-weight models (14B, 70B, and 405B parameter sizes based on Llama 3.1 checkpoints) that achieves frontier-level performance through post-training techniques. Hermes 4 features hybrid reasoning, allowing models to switch between standard responses and explicit step-by-step reasoning wrapped in `<think>` ... `</think>` tags.
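As a rough illustration of how this hybrid mode is typically invoked, the sketch below loads the open weights with Hugging Face Transformers and prompts for a `<think>`-tagged response. The repository name and system-prompt wording are assumptions for illustration; consult the model card on Hugging Face for the exact reasoning-mode prompt.

```python
# Minimal sketch: invoking Hermes 4's reasoning mode via the chat template.
# The model ID and system prompt here are illustrative, not official.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NousResearch/Hermes-4-70B"  # assumed repo name; check Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    # A reasoning-style system prompt; in reasoning mode the model emits its
    # chain of thought inside <think> ... </think> before the final answer.
    {"role": "system", "content": "You are a deep-thinking assistant. Reason inside <think></think> tags, then give your final answer."},
    {"role": "user", "content": "What is the sum of the first 50 positive integers?"},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```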
Significance of Hermes 4
Hermes 4 stands out for achieving state-of-the-art performance among open-weight models while maintaining transparency and a neutral alignment philosophy. This demonstrates that sophisticated reasoning capabilities can be developed entirely through open-source methodologies.
DataForge: Graph-Based Synthetic Data Generation
DataForge serves as the core system behind Hermes 4. This graph-based synthetic data generation system revolutionizes how training data is created. Unlike traditional curation methods, DataForge utilizes a directed acyclic graph (DAG) where each node implements a Planning Domain Definition Language (PDDL) action interface.
Each node defines preconditions, postconditions, and transformations, allowing complex data pipelines to be assembled automatically. Using pre-training seed data from DCLM and FineWeb, the system can transform content across diverse formats, for example converting a Wikipedia article into a rap song and then generating instruction-answer pairs from the transformed text.
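To make the node interface concrete, here is a minimal sketch of what a PDDL-style DAG node could look like. The class names, attribute flags, and the `call_llm` stub are hypothetical and are not the DataForge implementation; they only illustrate the precondition/postcondition pattern described above.

```python
# Illustrative sketch (not DataForge source): a DAG node in the spirit of a
# PDDL action, with preconditions/postconditions over a sample's attributes
# and a transformation that rewrites the sample.
from dataclasses import dataclass, field
from typing import Callable, Set

def call_llm(prompt: str) -> str:
    """Stand-in for the LLM call a real pipeline would make; replace with an API."""
    return f"[generated text for: {prompt[:40]}...]"

@dataclass
class Sample:
    text: str
    attributes: Set[str] = field(default_factory=set)

@dataclass
class DataForgeNode:
    name: str
    preconditions: Set[str]    # attributes a sample must already have
    postconditions: Set[str]   # attributes this node guarantees afterwards
    transform: Callable[[Sample], Sample]

    def applicable(self, sample: Sample) -> bool:
        return self.preconditions.issubset(sample.attributes)

    def apply(self, sample: Sample) -> Sample:
        out = self.transform(sample)
        out.attributes |= self.postconditions
        return out

# Example node: turn seed web text into a rap-song rewrite, from which a later
# node could derive an instruction-answer pair.
def to_rap_song(sample: Sample) -> Sample:
    prompt = f"Rewrite the following article as a rap song:\n\n{sample.text}"
    return Sample(text=call_llm(prompt), attributes=set(sample.attributes))

rap_node = DataForgeNode(
    name="article_to_rap",
    preconditions={"seed_text"},
    postconditions={"stylized_text"},
    transform=to_rap_song,
)
```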
DataForge generates approximately 5 million samples totaling 19 billion tokens, where reasoning samples are intentionally token-heavy, averaging five times more tokens than non-reasoning counterparts to capture thought processes up to 16,000 tokens long.
Rejection Sampling at Unprecedented Scale
Hermes 4 employs Atropos, Nous Research’s open-source reinforcement learning environment, to implement rejection sampling across about 1,000 distinct task-specific verifiers. This infrastructure filters for high-quality reasoning trajectories across a wide range of domains.
Key verification environments include:
- Answer Format Training (rewarding proper formatting across 150+ output formats)
- Instruction Following (using RLVR-IFEval tasks with complex constraints)
- Schema Adherence (for JSON generation using Pydantic models)
- Tool Use training for agentic behavior
The rejection sampling process creates a substantial corpus of verified reasoning trajectories, ensuring the model learns robust reasoning patterns rather than mere memorization of specific templates.
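Conceptually, rejection sampling against a verifier can be sketched as follows. This is an illustrative loop, not Atropos itself; the `Invoice` schema and helper names are made up to show a Pydantic-based schema-adherence check like the one listed above.

```python
# Illustrative rejection-sampling loop: sample several candidates per prompt
# and keep only those that a task-specific verifier accepts. Here the verifier
# checks JSON schema adherence with Pydantic.
import json
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):          # hypothetical target schema for a schema-adherence task
    customer: str
    total: float

def verify_schema(completion: str) -> bool:
    """Return True if the completion parses as JSON matching the schema."""
    try:
        Invoice.model_validate(json.loads(completion))
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

def rejection_sample(prompt, generate, verify, n_candidates=8):
    """Keep only verified completions; `generate` is any sampling function."""
    kept = []
    for _ in range(n_candidates):
        candidate = generate(prompt)
        if verify(candidate):
            kept.append({"prompt": prompt, "completion": candidate})
    return kept
```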
Length Control: Solving Overlong Generation
One innovative contribution of Hermes 4 addresses the issue of overlong reasoning, where models produce excessively long chains of thought. The research team found that their 14B model hit the maximum context length 60% of the time on LiveCodeBench while in reasoning mode.
The solution involves a second supervised fine-tuning stage that teaches models to stop reasoning at exactly 30,000 tokens:
- Generate reasoning traces from the current policy
- Insert a `</think>` token at exactly the 30,000-token mark
- Train solely on the termination decision, not on the reasoning chain
- Apply gradient updates only to the `</think>` and end-of-turn tokens
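A minimal sketch of this masking idea, under the assumption that the termination tokens are `</think>` plus an end-of-turn token (the token IDs and helper name below are hypothetical):

```python
# Only the inserted </think> and end-of-turn tokens carry loss, so the model
# learns *when* to stop, not the reasoning content itself.
import torch

IGNORE_INDEX = -100          # ignored by PyTorch cross-entropy
THINK_BUDGET = 30_000        # token budget for the reasoning segment

def build_length_control_example(trace_ids, close_think_id, eot_id):
    """trace_ids: reasoning tokens sampled from the current policy."""
    kept = trace_ids[:THINK_BUDGET]
    input_ids = kept + [close_think_id, eot_id]
    labels = [IGNORE_INDEX] * len(kept) + [close_think_id, eot_id]
    return torch.tensor(input_ids), torch.tensor(labels)
```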
This method achieves a substantial reduction in generated token counts:
- 78.4% on AIME’24
- 65.3% on AIME’25
- 79.8% on LiveCodeBench
The relative accuracy cost remains between 4.7% and 12.7%, while effectively teaching models a "counting" behavior without risking collapse.
Benchmark Performance and Neutral Alignment
Hermes 4 demonstrates exceptional performance among open-weight models. Notably, the 405B model achieves:
- 96.3% on MATH-500 (reasoning mode)
- 81.9% on AIME’24
- 78.1% on AIME’25
- 70.5% on GPQA Diamond
- 61.3% on LiveCodeBench
- 57.1% on RefusalBench, outperforming GPT-4o (17.67%) and Claude Sonnet 4 (17%)
This performance indicates the model’s capability to engage with controversial topics while maintaining appropriate boundaries, reflecting Nous Research’s commitment to a neutral alignment philosophy.
Technical Architecture and Training
The training of Hermes 4 uses a modified version of TorchTitan across 192 NVIDIA B200 GPUs and efficiently handles a highly heterogeneous sample-length distribution. Key features include:
- Efficient packing to achieve >99.9% batch efficiency
- FlexAttention and sophisticated loss masking, with only assistant-role tokens contributing to the cross-entropy loss
- A cosine learning rate schedule with 300 warmup steps and a total of 9,000 steps at a 16,384 token context length, with a global batch size of 384 samples
The training combines Data Parallelism, Tensor Parallelism, and Fully Sharded Data Parallelism, facilitating optimal resource utilization.
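For reference, the reported schedule hyperparameters (300 warmup steps, 9,000 total steps) can be approximated with the Hugging Face helper below. The actual run uses a modified TorchTitan stack, so this is only an illustrative stand-in, not the training code.

```python
# Sketch of a cosine learning-rate schedule with linear warmup matching the
# figures above; the model and batch loop are placeholders.
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(16, 16)                     # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=300,
    num_training_steps=9_000,
)

for step in range(9_000):
    # ... forward/backward on a packed 16,384-token batch would go here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```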
Summary
Hermes 4 signifies a major advancement in open-source AI development, illustrating that leading-edge reasoning capabilities can be attained through transparent and reproducible methodologies without depending on proprietary data or closed development frameworks. By integrating innovative graph-based synthetic data generation, extensive rejection sampling, and effective length control mechanisms, Nous Research has developed models that not only rival the performance of leading proprietary systems but also uphold the neutrality and steerability necessary for practical application.
For more details, see the research paper and the model weights on Hugging Face.