
Meet oLLM: A Lightweight Python Library that brings 100K-Context LLM Inference to 8 GB Consumer GPUs via SSD Offload—No Quantization Required


Understanding the Target Audience

The target audience for oLLM includes data scientists, machine learning engineers, and AI researchers who are looking for efficient ways to run large-context language models on consumer-grade hardware. Their pain points include:

  • Limited GPU memory (8 GB) for running large models.
  • High costs associated with multi-GPU setups.
  • Need for efficient inference methods without sacrificing model precision.

Their goals include maximizing the use of available hardware, reducing operational costs, and maintaining high performance in tasks such as document analysis and summarization. They prefer clear, technical communication that provides actionable insights and detailed specifications.

What is oLLM?

oLLM is a lightweight Python library built on Huggingface Transformers and PyTorch, designed to run large-context Transformers on NVIDIA GPUs. It achieves this by offloading weights and the KV cache to fast local SSDs, targeting offline, single-GPU workloads without quantization. The library keeps weights in FP16/BF16 and uses FlashAttention-2 together with disk-backed KV caching to keep VRAM usage within an 8 GB budget while handling up to ~100K tokens of context.
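
To make the disk-backed KV cache idea concrete, here is a minimal sketch of the general technique in plain PyTorch: per-layer key/value tensors are written to files on a fast SSD and read back only when the corresponding layer runs. This is an illustration of the concept only, not oLLM's implementation, which uses higher-throughput I/O paths (KvikIO/cuFile) rather than torch.save.

```python
# Illustrative sketch of a disk-backed KV cache (not oLLM's internal code).
# Per-layer key/value tensors live on an NVMe SSD and are loaded to the GPU
# only while that layer is being computed.
import os
import torch

class DiskKVCache:
    """Stores each layer's KV tensors as files in a cache directory on SSD."""

    def __init__(self, cache_dir: str):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, layer_idx: int) -> str:
        return os.path.join(self.cache_dir, f"layer_{layer_idx}.pt")

    def save(self, layer_idx: int, k: torch.Tensor, v: torch.Tensor) -> None:
        # Move to CPU before serializing so the VRAM is freed immediately.
        torch.save({"k": k.cpu(), "v": v.cpu()}, self._path(layer_idx))

    def load(self, layer_idx: int, device: str = "cuda") -> tuple[torch.Tensor, torch.Tensor]:
        blob = torch.load(self._path(layer_idx), map_location=device)
        return blob["k"], blob["v"]
```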

Key Features

Recent updates to oLLM include:

  • KV cache read/writes that bypass mmap to reduce host RAM usage.
  • DiskCache support for Qwen3-Next-80B.
  • Llama-3 FlashAttention-2 for improved stability.
  • Memory reductions for GPT-OSS via “flash-attention-like” kernels and chunked MLP.

Performance Metrics

The following list summarizes the end-to-end memory and I/O footprints for various models on an RTX 3060 Ti (8 GB):

  • Qwen3-Next-80B (bf16, 160 GB weights, 50K ctx) → ~7.5 GB VRAM + ~180 GB SSD; throughput ≈ 1 token every 2 s (~0.5 tok/s).
  • GPT-OSS-20B (packed bf16, 10K ctx) → ~7.3 GB VRAM + 15 GB SSD.
  • Llama-3.1-8B (fp16, 100K ctx) → ~6.6 GB VRAM + 69 GB SSD.

How oLLM Works

oLLM streams layer weights directly from SSD into the GPU, offloads the attention KV cache to SSD, and can also offload layers to CPU. It employs FlashAttention-2 with online softmax, ensuring the full attention matrix is never fully materialized. This design shifts the bottleneck from VRAM to storage bandwidth and latency, emphasizing the use of NVMe-class SSDs and KvikIO/cuFile (GPUDirect Storage) for high-throughput file I/O.
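
The "online softmax" point is easiest to see in code. The minimal single-head sketch below processes keys and values in chunks while keeping running max, normalizer, and output accumulators, so only a (queries × chunk) tile of attention scores ever exists in memory. It illustrates the technique only; oLLM itself relies on the fused FlashAttention-2 CUDA kernels rather than a Python loop like this.

```python
import torch

def chunked_attention(q, k, v, chunk_size=1024):
    """Single-head attention with online softmax over key/value chunks.

    q: (Tq, d), k: (Tk, d), v: (Tk, d). The full (Tq, Tk) score matrix is
    never materialized; only a (Tq, chunk_size) tile exists at any time.
    """
    scale = q.shape[-1] ** -0.5
    m = torch.full((q.shape[0], 1), float("-inf"), dtype=q.dtype, device=q.device)  # running max
    l = torch.zeros((q.shape[0], 1), dtype=q.dtype, device=q.device)                # running denominator
    o = torch.zeros_like(q)                                                         # running output

    for start in range(0, k.shape[0], chunk_size):
        k_c = k[start:start + chunk_size]
        v_c = v[start:start + chunk_size]
        s = (q @ k_c.T) * scale                                   # (Tq, chunk) score tile
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)                                  # probabilities for this tile
        correction = torch.exp(m - m_new)                         # rescale earlier accumulators
        l = l * correction + p.sum(dim=-1, keepdim=True)
        o = o * correction + p @ v_c
        m = m_new
    return o / l
```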

Supported Models and GPUs

oLLM supports models such as Llama-3 (1B/3B/8B), GPT-OSS-20B, and Qwen3-Next-80B. It targets NVIDIA Ampere (RTX 30xx, A-series), Ada (RTX 40xx, L4), and Hopper architectures. Notably, Qwen3-Next-80B is a sparse MoE model typically intended for multi-GPU deployments, but oLLM allows it to be executed offline on a single consumer GPU by leveraging SSD storage.

Installation and Usage

oLLM is MIT-licensed and available on PyPI. Users can install it via:

pip install ollm

For high-speed disk I/O, an additional kvikio-cu{cuda_version} dependency is required. The library includes examples in the README demonstrating usage with Inference(…).DiskCache(…) and generate(…) with a streaming text callback.
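
The README's examples follow a pattern roughly like the sketch below. This is a hypothetical reconstruction for orientation only: the model identifier, cache path, and parameter names are assumptions, and the README should be treated as the authoritative reference.

```python
# Hypothetical usage sketch of the pattern described above (Inference(...),
# DiskCache(...), generate(...) with a streaming text callback). The model id,
# cache path, and argument names are assumptions; consult the oLLM README for
# the real examples.
from ollm import Inference

llm = Inference("llama3-1B-chat")                          # model id is an assumption
kv_cache = llm.DiskCache(cache_dir="/mnt/nvme/kv_cache")   # keep the KV cache on a fast NVMe SSD

def stream_text(chunk: str) -> None:
    # Streaming callback: print text as it is generated.
    print(chunk, end="", flush=True)

llm.generate(
    "Summarize the key obligations in the attached contract.",
    past_key_values=kv_cache,   # assumed parameter name
    callback=stream_text,       # assumed parameter name
)
```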

Performance Expectations and Trade-offs

Throughput for Qwen3-Next-80B at 50K context on an RTX 3060 Ti is reported at roughly 0.5 tok/s (about 1 token every 2 seconds); at that rate a 1,000-token output takes on the order of 30 minutes, which makes the setup suitable for batch or offline analytics rather than interactive applications. The design trades VRAM for SSD capacity and bandwidth, so long contexts put real pressure on storage: the on-disk footprint (weights plus KV cache) ranges from ~15 GB to ~180 GB in the configurations listed above.

While oLLM enables running large models on consumer hardware, high-throughput inference for models of this size still typically requires multi-GPU setups. oLLM is therefore best viewed as a solution for large-context, offline processing rather than a direct replacement for production serving stacks.

Conclusion

oLLM keeps weights in FP16/BF16 and shifts memory pressure to SSD rather than resorting to quantization, making ultra-long contexts feasible on a single 8 GB NVIDIA GPU. Although it cannot match the throughput of data-center deployments, it offers a practical approach for offline document analysis, compliance review, and large-context summarization.

Additional Resources

For more information, check out the GitHub Repo for tutorials, code, and notebooks.
