This AI Paper Introduces PARSCALE (Parallel Scaling): A Parallel Computation Method for Efficient and Scalable Language Model Deployment

The pursuit of improved performance in language models has led researchers to scale them up, typically by increasing the number of parameters or extending computational capacity. This trend has made the development and deployment of language models heavily reliant on substantial computational resources and memory.

However, increasing model size or generating more tokens to enhance reasoning presents significant challenges. Parameter scaling methods, such as dense scaling and Mixture-of-Experts (MoE) scaling, demand much more memory because of the growth in trainable weights. Inference-time scaling, which relies on generating longer sequences or running multiple reasoning steps, adds latency and slows deployment. These approaches also tend to lack adaptability across scenarios and do little to improve deployment efficiency in low-resource settings such as mobile devices or embedded systems.

Researchers from Zhejiang University and Alibaba Group have proposed a new approach called PARSCALE, which stands for Parallel Scaling. This method shifts the focus from increasing model size or output length to enhancing the model’s parallel computations during training and inference. By applying multiple learnable transformations to the input, the model executes several forward passes in parallel and dynamically aggregates their outputs.
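In plain terms, if g1, …, gP are the P learnable input transformations and f is the shared model, the parallel-scaled output is a dynamically weighted combination of the parallel forward passes (the notation here is ours, paraphrasing the description above): y = w1(x)·f(g1(x)) + … + wP(x)·f(gP(x)), where the weights wi(x) sum to 1 and are produced by the learned aggregation step.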

PARSCALE retains the original parameter count of the model while boosting computational diversity, making it an adaptable solution for various tasks and model architectures without requiring specialized datasets or changes in training protocols.

Technically, PARSCALE appends several distinct, learnable prefixes to the same input, producing multiple parallel versions. The model processes these simultaneously, and the outputs are aggregated using a dynamic weighted sum calculated by a multilayer perceptron. This structure introduces only about 0.2% extra parameters per stream, a minor addition compared to full parameter scaling. The model employs prefix tuning to distinguish each parallel stream via unique key-value caches, allowing for efficient memory reuse. Additionally, the approach benefits from GPU-friendly parallelization, helping to keep latency low despite the increased computation. This design ensures scalability without altering the core architecture and enables application even in frozen pretrained models by only training the new prefix and aggregation parameters.
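As a rough illustration of these mechanics, the sketch below shows how such a design could be wired up in PyTorch. It is not the authors' released code: the class name ParScaleWrapper and its arguments are hypothetical, and for simplicity the learnable prefixes are prepended to the input embeddings rather than injected into per-layer key-value caches as in true prefix tuning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParScaleWrapper(nn.Module):
    """Illustrative sketch (not the official implementation) of PARSCALE-style
    parallel scaling: P learnable prefixes create P parallel streams over one
    shared backbone, and an MLP computes dynamic weights to fuse the streams."""

    def __init__(self, backbone, d_model, vocab_size, num_streams=8, prefix_len=16):
        super().__init__()
        self.backbone = backbone            # shared (possibly frozen) transformer trunk: (N, S, D) -> (N, S, D)
        self.num_streams = num_streams
        # One learnable prefix per stream -- only a small fraction of total parameters.
        self.prefixes = nn.Parameter(0.02 * torch.randn(num_streams, prefix_len, d_model))
        # Small MLP that scores each stream per token for the dynamic weighted sum.
        self.aggregator = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, 1)
        )
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, token_embeds):
        # token_embeds: (B, T, D) already-embedded input tokens
        B, T, D = token_embeds.shape
        P = self.num_streams
        L = self.prefixes.shape[1]
        # Build P parallel copies of the input, each with its own learnable prefix.
        streams = token_embeds.unsqueeze(0).expand(P, B, T, D)
        prefixes = self.prefixes.unsqueeze(1).expand(P, B, L, D)
        prefixed = torch.cat([prefixes, streams], dim=2)            # (P, B, L+T, D)
        # One batched pass through the shared backbone (GPU-friendly parallelism).
        hidden = self.backbone(prefixed.reshape(P * B, L + T, D))
        hidden = hidden.reshape(P, B, L + T, D)[:, :, L:, :]        # drop prefix positions
        # Dynamic weighted sum over the P streams, computed per token by the MLP.
        weights = F.softmax(self.aggregator(hidden), dim=0)         # (P, B, T, 1)
        fused = (weights * hidden).sum(dim=0)                       # (B, T, D)
        return self.lm_head(fused)                                  # (B, T, vocab_size)
```

Only the prefixes and the aggregator introduce new trainable parameters in this sketch, which mirrors the paper's point that PARSCALE can be applied to a frozen pretrained model by training just the prefix and aggregation components.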

Extensive experiments were conducted on models ranging from 0.5B to 4.4B parameters, with the number of parallel streams (P) set from 1 to 8. When trained on 42 billion tokens, models with P = 8 matched the performance of models with up to 4.4 billion parameters while requiring far less memory and incurring far less latency. Specifically, for a 1.6B model, PARSCALE incurred 22× less memory increase and 6× less latency increase than parameter scaling that reaches the same performance. On downstream tasks, PARSCALE yielded up to a 34% improvement on GSM8K and 23% on MMLU. Coding performance also improved markedly: models with 1.6B parameters and P = 8 achieved results comparable to those of a 4.4B-parameter model. The method remained effective during post-training and parameter-efficient fine-tuning, maintaining strong performance even when the core model parameters were left unchanged.
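Continuing the hypothetical ParScaleWrapper sketch from above, the parameter-efficient setting (core weights frozen, only the new prefix and aggregation parameters trained) could look roughly like this; the dimensions and learning rate are placeholder values:

```python
import torch

# pretrained_trunk: a placeholder for your frozen, pretrained transformer trunk.
model = ParScaleWrapper(backbone=pretrained_trunk, d_model=2048,
                        vocab_size=32_000, num_streams=8)

# Freeze the backbone so the core model parameters remain unchanged.
for param in model.backbone.parameters():
    param.requires_grad = False

# Train only the PARSCALE-specific additions: per-stream prefixes + aggregator MLP.
trainable = [model.prefixes, *model.aggregator.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```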

This paper introduces a strategy that rethinks how language models can be scaled. Instead of inflating model size or multiplying inference steps, it focuses on reusing existing computation more efficiently. The researchers' approach addresses time and memory inefficiencies while maintaining or improving performance, marking a compelling shift in scaling methods. It also points to a practical path for deploying advanced models in constrained environments by making effective use of parallel computation.

Check out the Paper. All credit for this research goes to the researchers of this project.
