LLMQ: Efficient Lower-Precision LLM Training for Consumer GPUs

Erik Schultheis, Dan Alistarh
IST Austria
The Third Conference on Parsimony and Learning (CPAL’26), 2026

@inproceedings{schultheis2026llmq,
  title={LLMQ: Efficient Lower-Precision LLM Training for Consumer GPUs},
  author={Schultheis, Erik and Alistarh, Dan},
  booktitle={The Third Conference on Parsimony and Learning (Proceedings Track)},
  year={2026}
}

We present LLMQ, an end-to-end CUDA/C++ implementation for training medium-sized language models, e.g., 3B to 32B parameters, on affordable commodity GPUs. These devices offer little memory and slow communication compared to datacentre-grade GPUs. Consequently, we showcase a range of optimizations that target these bottlenecks, including activation checkpointing, offloading, and copy-engine-based collectives. LLMQ can train or fine-tune a 7B model on a single 16GB mid-range gaming card, or a 32B model on a workstation equipped with four RTX 4090s. This is achieved while executing a standard 8-bit training pipeline, without additional algorithmic approximations, and while maintaining FLOP utilization of around 50%. The efficiency of LLMQ rivals that of production-scale systems running on much more expensive cloud-grade GPUs.
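The offloading and copy-engine-based transfers mentioned in the abstract can be illustrated with a minimal CUDA sketch. The kernel and variable names below are hypothetical and not taken from LLMQ's codebase; the sketch only shows the general technique: activations are copied to pinned host memory on a dedicated transfer stream, so the GPU's DMA copy engines move data while the SMs remain free for compute.

#include <cuda_runtime.h>
#include <cstdio>

// Dummy kernel standing in for a transformer layer's forward pass (hypothetical).
__global__ void forward_layer(float* act, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) act[i] = act[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_act, *h_act;
    cudaMalloc(&d_act, n * sizeof(float));
    // Pinned host memory is required for truly asynchronous DMA transfers.
    cudaMallocHost(&h_act, n * sizeof(float));

    cudaStream_t compute, transfer;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&transfer);
    cudaEvent_t produced;
    cudaEventCreate(&produced);

    // Produce activations on the compute stream.
    cudaMemsetAsync(d_act, 0, n * sizeof(float), compute);
    forward_layer<<<(n + 255) / 256, 256, 0, compute>>>(d_act, n);
    cudaEventRecord(produced, compute);

    // The transfer stream waits only for the producing kernel; the copy
    // engine then offloads activations while later kernels keep running
    // on the compute stream.
    cudaStreamWaitEvent(transfer, produced, 0);
    cudaMemcpyAsync(h_act, d_act, n * sizeof(float),
                    cudaMemcpyDeviceToHost, transfer);

    cudaStreamSynchronize(transfer);
    printf("offloaded %d activations, h_act[0] = %f\n", n, h_act[0]);

    cudaFree(d_act);
    cudaFreeHost(h_act);
    cudaStreamDestroy(compute);
    cudaStreamDestroy(transfer);
    cudaEventDestroy(produced);
    return 0;
}

Because cudaMemcpyAsync between device and pinned host memory is executed by the GPU's copy engines rather than its SMs, such transfers cost no compute resources; the copy-engine-based collectives described in the abstract presumably exploit the same hardware path for inter-GPU communication.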
