
StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows

Understanding the Target Audience

The target audience for StreamTensor includes:

  • AI researchers and developers focused on optimizing machine learning models, particularly those working with large language models (LLMs).
  • Data scientists and machine learning engineers who require efficient inference solutions for deployment on FPGA hardware.
  • Enterprise decision-makers in tech companies looking for compilation frameworks that extract more performance from their existing hardware.

Pain Points: These users often face challenges such as high latency during inference, inefficiencies with existing GPU solutions, and difficulties in optimizing hardware resources for machine learning tasks.

Goals: Their goals include reducing latency and improving energy efficiency of LLMs, integrating robust compilation tools into their existing workflows, and leveraging hardware optimizations for better performance.

Interests: This audience is keen on exploring the latest advancements in AI compilation techniques, understanding performance metrics, and discovering practical implementations of emerging technologies in real-world applications.

Communication Preferences: They prefer detailed technical documentation, real-world case studies, and peer-reviewed research that provide deep insights into solutions and methodologies.

Overview of StreamTensor

StreamTensor is a compiler that transforms PyTorch graphs of LLMs such as GPT-2, Llama, Qwen, and Gemma into stream-scheduled dataflow accelerators on AMD’s Alveo U55C FPGA. Its central abstraction is an iterative tensor (itensor) type that records how a tensor is produced and consumed, enabling provably correct inter-kernel streaming and automated sizing of DMA engines, FIFOs, and layout converters.
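The entry point is an ordinary PyTorch model. As a rough illustration of the graph-capture step that feeds such a flow, the sketch below exports a toy transformer block to an ATen-level graph with torch.export; the TinyBlock module is invented for this example, and StreamTensor's actual pipeline lowers graphs through Torch-MLIR rather than this exact API.

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """A stand-in for one transformer layer (illustrative only)."""
    def __init__(self, d=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        a, _ = self.attn(x, x, x)
        return x + self.mlp(x + a)

# Capture an ATen-level graph. A dataflow compiler like StreamTensor
# starts from a graph of this kind before lowering it to stream-scheduled
# hardware kernels; printing it shows the ops available for fusion.
model = TinyBlock().eval()
exported = torch.export.export(model, (torch.randn(1, 16, 64),))
print(exported.graph)
```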

Why Stream LLM Inference via Dataflow Compiler?

Traditional approaches treat LLM inference as a sequence of batched kernels that round-trip every intermediate tensor through DRAM, which inflates latency and wastes bandwidth. StreamTensor instead streams tiles between kernels through on-chip FIFOs and stream converters. The paper reports latency as low as 0.64× that of a GPU baseline and up to 1.99× higher energy efficiency for LLM decoding workloads.
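To see why streaming helps, compare a pipeline that materializes every intermediate against one that forwards tiles as they are produced. The NumPy sketch below is only an analogy for on-chip FIFO streaming, with invented tile sizes; it is not StreamTensor output.

```python
import numpy as np

def matmul_tiles(a, w, tile_rows=4):
    """Producer: emit output tiles of A @ W one at a time, standing in
    for a kernel that pushes into an on-chip FIFO instead of writing
    the whole intermediate tensor to DRAM."""
    for i in range(0, a.shape[0], tile_rows):
        yield a[i:i + tile_rows] @ w

def gelu_tiles(tiles):
    """Consumer: apply an elementwise op tile-by-tile as tiles arrive
    (tanh approximation of GELU)."""
    for t in tiles:
        yield 0.5 * t * (1.0 + np.tanh(0.7978845608 * (t + 0.044715 * t**3)))

a = np.random.randn(16, 8).astype(np.float32)
w = np.random.randn(8, 8).astype(np.float32)

# The full (16, 8) intermediate never exists as one buffer here; only one
# tile is "in flight" at a time, which is the essence of inter-kernel
# streaming through FIFOs.
out = np.concatenate(list(gelu_tiles(matmul_tiles(a, w))))
```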

For further details, refer to the StreamTensor research paper.

Key Innovations of StreamTensor

  • Hierarchical Design Space Exploration: StreamTensor explores three design spaces: (i) tiling, unrolling, vectorization, and permutation at the Linalg level; (ii) kernel fusion under memory and resource constraints; and (iii) resource allocation and stream widths, targeting sustained throughput under bandwidth limits.
  • End-to-End Compilation: The framework enables a seamless flow from PyTorch → Torch-MLIR → dataflow compiler, producing explicit streams and host/runtime glue without manual RTL assembly.
  • Iterative Tensor (Itensor) Typing: This first-class tensor type expresses iteration order, tiling, and affine maps, making it explicit when a producer’s stream order matches a consumer’s, so kernels can be fused safely and converter buffers are synthesized only where needed (see the first sketch after this list).
  • Formal FIFO Sizing: Inter-kernel buffering is formulated as a linear program that minimizes on-chip memory use while provably avoiding stalls and deadlocks (see the second sketch after this list).
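As a concrete intuition for the itensor type, the Python sketch below models the compatibility check it enables. All class and field names here are hypothetical; the real itensor is an MLIR type, not a Python class.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ITensor:
    """Illustrative stand-in for StreamTensor's iterative tensor type.
    Field names are invented for this sketch."""
    shape: tuple       # logical tensor shape, e.g. (1024, 768)
    tile: tuple        # tile extents streamed per step, e.g. (64, 768)
    iter_order: tuple  # loop-nest order over tiles, e.g. ("i", "j")

    def compatible(self, other: "ITensor") -> bool:
        # Two kernels can be linked by a raw stream only if the producer
        # emits tiles in exactly the order the consumer expects; otherwise
        # the compiler must insert a layout-converter buffer.
        return (self.shape, self.tile, self.iter_order) == \
               (other.shape, other.tile, other.iter_order)

producer_out = ITensor((1024, 768), (64, 768), ("i",))
consumer_in  = ITensor((1024, 768), (64, 768), ("i",))
assert producer_out.compatible(consumer_in)  # direct FIFO link is safe
```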
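The flavor of the FIFO-sizing linear program can be shown on a toy graph with two reconvergent stream paths: the FIFO on the lower-latency branch must absorb the latency gap, or the join stalls and the pipeline can deadlock. The latencies, bounds, and objective below are invented for illustration; the paper’s actual formulation covers the full dataflow graph.

```python
from scipy.optimize import linprog

# Toy deadlock-avoidance model for two reconvergent stream paths.
# Variables: x = [d_long, d_short], FIFO depths in elements.
lat_long, lat_short = 120, 20    # hypothetical kernel latencies (cycles)

c = [1, 1]                       # minimize total on-chip buffering
A_ub = [[0, -1]]                 # -d_short <= -(lat_long - lat_short),
b_ub = [-(lat_long - lat_short)] # i.e. d_short must absorb the skew
bounds = [(2, None), (2, None)]  # every FIFO needs at least depth 2

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x)  # -> [2., 100.]: a deep FIFO only on the low-latency branch
```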

Performance Results

In benchmarking, StreamTensor achieves:

  • Latency as low as 0.76× that of prior FPGA-based LLM accelerators.
  • Latency as low as 0.64× that of a GPU baseline on GPT-2.
  • Up to 1.99× higher energy efficiency than an A100 GPU on emerging LLMs (model-dependent).

The framework targets the Alveo U55C platform, which pairs 16 GB of HBM2 at 460 GB/s with dual QSFP28 ports and a PCIe Gen3 x16 host interface, a combination well suited to the streaming dataflow design.

Concluding Remarks

StreamTensor marks a significant step in the compilation landscape, turning PyTorch models into efficient stream-scheduled kernels for AMD’s Alveo U55C. Its iterative tensor type and linear-programming FIFO sizing enable correct inter-kernel streaming and sharply reduce DRAM round-trips during LLM inference. The reported latency and energy-efficiency results make it a compelling option for teams building out their machine learning infrastructure.

For more information, check out the full research paper. Additional resources, including tutorials and code, are available on the GitHub page.