Neural Magic has recently announced a significant breakthrough in AI model compression, introducing a fully quantized FP8 version of Meta’s Llama 3.1 405B model. This achievement marks a milestone in AI, allowing the massive 405 billion parameter model to fit seamlessly on any 8xH100 or 8xA100 system without the common out-of-memory (OOM) errors typically encountered with the original FP8 and FP16 versions. The new model solves memory constraints and enhances inference speeds by over 2X, leveraging faster memory and computing capabilities and eliminating the need for CPU offloading or distribution across multiple nodes.
Neural Magic provides two key versions of the model:
The fully quantized FP8 version, Meta-Llama-3.1-405B-Instruct-FP8-dynamic, maintains the architecture of Meta-Llama-3.1, designed for an assistant-like chat in multiple languages. However, it is restricted to usage in English and for lawful applications only. Released under version 1.0, this model was developed by Neural Magic and operates under the llama3.1 license.
Quantization and Optimization
The model achieves remarkable efficiency through weight and activation quantization to the FP8 data type. This process reduces the number of bits per parameter from 16 to 8, halving the disk size and GPU memory requirements. Consequently, the model can be loaded and evaluated on a single node of 8xH100 GPUs instead of requiring multiple nodes.
The quantization process involves symmetric per-channel quantization, where a linear scaling per output dimension maps the FP8 representations of the quantized weights and activations. Activations are quantized dynamically on a per-token basis. This was accomplished using LLM Compressor with 512 sequences from UltraChat, ensuring optimal performance.
Deployment and Evaluation
Neural Magic’s quantized model can be deployed efficiently using the vLLM backend. The deployment process involves using the `vllm` and `transformers` libraries in Python, as demonstrated in the provided code snippets. The example highlights the integration of the model with vLLM, showcasing the ease of generating text using the optimized model.
The model was evaluated on several benchmarks, including MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande, and TruthfulQA. The evaluation utilized Neural Magic’s fork of the ‘lm-evaluation-harness’ and the vLLM engine. The quantized model, Meta-Llama-3.1-405B-Instruct-FP8-dynamic, achieved an average score of 86.55 on the OpenLLM benchmark, closely mirroring the unquantized model’s score of 86.63, demonstrating a near-perfect recovery of 99.91%.
Reproduction and Accuracy
Neural Magic provides detailed commands for reproducing the evaluation results across various benchmarks. These commands illustrate the robustness of the quantized model, maintaining high accuracy across different tasks and few-shot settings. For instance, the model achieved a 99.91% recovery rate on MMLU (5-shot) and 100.2% on Winogrande (5-shot), underscoring its reliability and precision.
Conclusion
In conclusion, the release of the fully quantized FP8 version of Meta’s Llama 3.1 405B model by Neural Magic by effectively reducing memory requirements and enhancing inference speeds, this model opens new avenues for efficient and scalable AI applications. The success of this quantization effort, with minimal loss in accuracy, highlights the potential for further innovations in the field, making powerful AI models more accessible & practical for various users.
Check out the FP8 Dynamic Quantization and FP8 Static Quantization. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group.
If you like our work, you will love our newsletter..
Don’t Forget to join our 47k+ ML SubReddit
Find Upcoming AI Webinars here
The post Neural Magic Releases Fully Quantized FP8 Version of Meta’s Llama 3.1 405B Model: FP8 Dynamic Quantization and FP8 Static Quantization appeared first on MarkTechPost.