Andrej Karpathy Releases ‘nanochat’: A Minimal, End-to-End ChatGPT-Style Pipeline You Can Train in ~4 Hours for ~$100

Understanding the Target Audience

The target audience for Andrej Karpathy’s release of nanochat primarily includes:

  • AI Researchers and Developers: Professionals interested in natural language processing and machine learning. They often seek ways to train and fine-tune language models efficiently.
  • Business Analysts: Professionals who want to apply AI to data analysis and decision-making.
  • Startups and Entrepreneurs: These users aim to implement AI solutions with limited resources, looking for cost-effective tools to enhance their offerings.
  • Technical Enthusiasts and Hobbyists: Individuals eager to learn and experiment with cutting-edge technology in the AI space.

Their pain points include:

  • High costs and complexity associated with training large language models.
  • Need for reproducible and scalable AI solutions that can be adapted to specific business needs.
  • Desire for streamlined workflows that reduce setup time and resource requirements.

Goals and interests encompass:

  • Building and fine-tuning AI models with minimal investment.
  • Accessing straightforward documentation and resources to aid in implementation.
  • Exploring innovative applications of AI in various business sectors.

Communication preferences lean toward:

  • Clear, technical documentation that provides actionable insights.
  • Access to community discussions and peer support.
  • Updates on new features or improvements in a concise format.

Overview of nanochat

Andrej Karpathy has open-sourced nanochat, a compact, dependency-light codebase implementing a full ChatGPT-style stack—from tokenizer training to web UI inference—focused on reproducible and hackable large language model (LLM) training on a single multi-GPU node.

The repository offers a single-script “speedrun” that executes the full loop: tokenization, base pretraining, mid-training on chat/multiple-choice/tool-use data, supervised fine-tuning (SFT), optional reinforcement learning (RL) on GSM8K, evaluation, and serving (CLI + ChatGPT-like web UI). The recommended setup uses an 8×H100 node; at approximately $24/hour, the 4-hour speedrun totals around $100. A post-run report.md summarizes metrics (CORE, ARC-E/C, MMLU, GSM8K, HumanEval, ChatCORE).
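The single-script loop described above can be sketched as an ordered pipeline. This is an illustrative orchestration outline only; the stage names and the `run_speedrun` helper are hypothetical, not nanochat's actual entry points:

```python
# Hypothetical sketch of the speedrun stage ordering described above;
# names are illustrative, not nanochat's actual scripts or functions.
STAGES = [
    "train_tokenizer",  # Rust BPE, 65,536-token vocab
    "pretrain_base",    # FineWeb-EDU shards
    "midtrain",         # chat / multiple-choice / tool-use data
    "sft",              # supervised fine-tuning
    "rl_gsm8k",         # optional GRPO-style RL
    "evaluate",         # CORE, ARC, MMLU, GSM8K, HumanEval, ChatCORE
    "serve",            # CLI + web UI
]

def run_speedrun(run_stage):
    """Run each stage in order; run_stage is any callable taking a stage name."""
    for stage in STAGES:
        run_stage(stage)

completed = []
run_speedrun(completed.append)
```

The point is simply that every stage, from tokenizer to serving, runs sequentially from one entry point, which is what makes the end-to-end run reproducible.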

Technical Highlights

The tokenizer is a custom Rust byte pair encoding (BPE) implementation with a 65,536-token vocabulary. Training uses FineWeb-EDU shards, re-packaged for easy access. The evaluation bundle includes CORE (22 autocompletion datasets such as HellaSwag, ARC, and BoolQ), stored in ~/.cache/nanochat/eval_bundle.
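nanochat's actual tokenizer is written in Rust, but the byte-level BPE training idea is small enough to sketch in Python: start from the 256 raw byte values, then repeatedly merge the most frequent adjacent pair into a new token until the vocabulary reaches the target size (65,536 in nanochat's case; a tiny target here):

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent token-id pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    """Learn merges on raw UTF-8 bytes until the vocab reaches vocab_size."""
    ids = list(text.encode("utf-8"))  # start from the 256 byte values
    merges = {}
    for new_id in range(256, vocab_size):
        pair = most_frequent_pair(ids)
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges, ids

# Three merges compress this 11-byte string to 5 tokens.
merges, ids = train_bpe("aaabdaaabac", 256 + 3)
```

Byte-level BPE never needs an "unknown token": any input decomposes into bytes, and the learned merges only add compression on top.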

The speedrun configuration trains a depth-20 Transformer with approximately 560 million parameters, sized per Chinchilla-style scaling (roughly 20 training tokens per parameter), for an estimated training budget of about 4e19 FLOPs. Training uses Muon for the matrix (linear-layer) parameters and AdamW for the embedding/unembedding parameters, and reports loss in bits-per-byte (bpb) so the number is invariant to tokenizer choice.
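Both numbers above can be checked with back-of-envelope arithmetic: the Chinchilla rule of thumb pairs each parameter with ~20 training tokens, and the standard ~6·N·D estimate then gives the training FLOPs. The bpb conversion below is a generic sketch of the idea, not nanochat's exact bookkeeping:

```python
import math

# Chinchilla-style sizing: ~20 training tokens per parameter,
# and the common ~6 * N * D estimate for training FLOPs.
params = 560e6
tokens = 20 * params         # ~1.12e10 training tokens
flops = 6 * params * tokens  # ~3.8e19, i.e. the quoted "~4e19 FLOPs"

def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    """Convert mean per-token cross-entropy (in nats) to bits per byte.

    Dividing by raw byte count instead of token count makes the metric
    comparable across tokenizers with different compression ratios.
    """
    total_bits = mean_loss_nats * n_tokens / math.log(2)
    return total_bits / n_bytes
```

A tokenizer that compresses more aggressively sees fewer, harder tokens; per-token loss rises while bpb stays comparable, which is exactly why bpb is the fairer number to report.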

Mid-training, SFT, and Tool Use

After pretraining, mid-training adapts the base model to conversations (SmolTalk) and teaches multiple-choice behavior using 100,000 MMLU auxiliary-train questions; tool use is introduced via Python code blocks embedded in the training conversations. The default mixture includes:

  • SmolTalk: 460,000 rows
  • MMLU auxiliary-train: 100,000 rows
  • GSM8K main: 8,000 rows
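The row counts above imply the sampling proportions of the mixture. A minimal sketch of weighted sampling over the three sources, assuming simple proportional mixing (nanochat's actual data loader may differ):

```python
import random

# Row counts from the default mid-training mixture described above.
mixture = {"smoltalk": 460_000, "mmlu_aux_train": 100_000, "gsm8k_main": 8_000}
total = sum(mixture.values())
weights = [n / total for n in mixture.values()]  # SmolTalk dominates at ~81%

# Draw training examples proportionally to each source's size.
rng = random.Random(0)
draws = rng.choices(list(mixture), weights=weights, k=10_000)
smoltalk_share = draws.count("smoltalk") / len(draws)
```

With these counts, roughly four of every five mid-training examples are conversational, with multiple-choice and math data mixed in at lower rates.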

The SFT process fine-tunes the model on higher-quality conversations while ensuring test-time formatting aligns with training to minimize mismatch. Example metrics from the speedrun tier following SFT include:

  • ARC-Easy: 0.3876
  • ARC-Challenge: 0.2807
  • MMLU: 0.3151
  • GSM8K: 0.0455
  • HumanEval: 0.0854
  • ChatCORE: 0.0884
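The train/test formatting alignment mentioned above comes down to rendering conversations with one template in both phases. A toy sketch; the special-token names here are illustrative, not nanochat's actual tokens:

```python
# Toy chat template. The key property: the SFT training text and the
# inference-time prompt are rendered by the same function, so the model
# never sees an unfamiliar format at test time.
def render_chat(messages, add_generation_prompt=False):
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>{m['content']}<|end|>")
    if add_generation_prompt:
        parts.append("<|assistant|>")  # cue the model to start replying
    return "".join(parts)

# What the model trains on during SFT:
train_text = render_chat([
    {"role": "user", "content": "2+2?"},
    {"role": "assistant", "content": "4"},
])

# What the model is prompted with at inference:
prompt_text = render_chat([{"role": "user", "content": "2+2?"}],
                          add_generation_prompt=True)
```

Because the inference prompt is an exact prefix of the training rendering, there is no train/test mismatch for the model to overcome.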

Tool use is integrated end-to-end: a custom inference engine implements KV caching, prefill/decode inference, and a sandboxed Python interpreter, enabling tool-augmented training and evaluation.
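The prefill/decode split works because causal attention at each position only depends on the keys and values of earlier positions, which can be cached instead of recomputed. A deliberately tiny 1-D toy of the mechanism (real engines cache per-layer, per-head tensors):

```python
import math

class ToyKVCache:
    """Minimal single-head, scalar-dimension toy of prefill/decode caching."""

    def __init__(self):
        self.keys, self.values = [], []

    def attend(self, q):
        # Softmax over all cached keys, then weighted sum of cached values.
        scores = [math.exp(q * k) for k in self.keys]
        z = sum(scores)
        return sum(s / z * v for s, v in zip(scores, self.values))

    def step(self, q, k, v):
        # Append this position's key/value, then attend over the full cache.
        self.keys.append(k)
        self.values.append(v)
        return self.attend(q)

# Prefill: process the whole prompt once, filling the cache.
cache = ToyKVCache()
prompt = [(0.1, 0.2, 1.0), (0.3, 0.1, 2.0)]
for q, k, v in prompt:
    out = cache.step(q, k, v)

# Decode: each new token attends over all cached positions plus itself,
# so per-token cost stays O(sequence length) instead of re-running prefill.
out = cache.step(0.5, 0.4, 3.0)
```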

Optional Reinforcement Learning on GSM8K

The final stage adds optional RL on GSM8K using a simplified Group Relative Policy Optimization (GRPO) loop. The walkthrough spells out which components of standard Proximal Policy Optimization (PPO) are omitted, keeping the RL stage deliberately simple and practical.
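The core trick that lets GRPO drop PPO's learned value function is computing advantages relative to a group of rollouts sampled for the same prompt. A minimal sketch of that step (the surrounding policy-gradient loop is omitted):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within a group of rollouts
    sampled for the same prompt, replacing a learned value-function baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# e.g. 4 sampled answers to one GSM8K question, reward 1.0 if correct:
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers get positive advantage and incorrect ones negative, purely from within-group comparison, which is a natural fit for verifiable-reward tasks like GSM8K.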

Cost and Quality Scaling

The README outlines additional scaling options beyond the ~$100 speedrun:

  • ~$300 tier: d=26 (~12 hours), slightly surpassing GPT-2 CORE; requires additional pretraining shards and batch-size adjustments.
  • ~$1,000 tier: ~41.6 hours, yielding significantly improved coherence and basic reasoning/coding ability.
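The tier prices above follow directly from the quoted ~$24/hour node rate; a quick arithmetic check (durations are those listed above, and real cloud pricing will vary):

```python
# Rough cost arithmetic at the article's quoted ~$24/hour for an 8xH100 node.
rate_per_hour = 24
tier_hours = {"speedrun": 4, "tier_300": 12, "tier_1000": 41.6}
costs = {name: hours * rate_per_hour for name, hours in tier_hours.items()}
# speedrun -> $96 (~$100), tier_300 -> $288 (~$300), tier_1000 -> ~$998 (~$1,000)
```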

Prior experimental runs suggest that a d=30 model trained for ~24 hours can reach scores in the 40s on MMLU, the 70s on ARC-Easy, and the 20s on GSM8K.

Evaluation Snapshot

An example report.md for the ~$100/≈4-hour run shows:

  • CORE: 0.2219 (base)
  • ARC-Easy: 0.3876
  • ARC-Challenge: 0.2807
  • MMLU: 0.3151
  • GSM8K: 0.0455
  • HumanEval: 0.0854
  • ChatCORE: 0.0884
  • Wall-clock time: 3h51m
Conclusion

Karpathy’s nanochat strikes a balance between accessibility and functionality. It offers a single, clean, dependency-light repository integrating tokenizer training, pretraining, mid-training, SFT, and an optional simplified RL stage, resulting in a reproducible speedrun that produces detailed evaluation metrics.

For further details, check out the official repository and community discussions on GitHub.
