LLMQ: Efficient Lower-Precision LLM Training for Consumer GPUs
IST Austria
The Third Conference on Parsimony and Learning (CPAL’26), 2026
@inproceedings{schultheis2026llmq,
  title={LLMQ: Efficient Lower-Precision LLM Training for Consumer GPUs},
  author={Schultheis, Erik and Alistarh, Dan},
  booktitle={The Third Conference on Parsimony and Learning (Proceedings Track)},
  year={2026}
}
We present LLMQ, an end-to-end CUDA/C++ implementation for medium-sized language-model training, e.g. 3B to 32B parameters, on affordable, commodity GPUs. These devices are characterized by low memory capacity and slow communication compared to datacentre-grade GPUs. Consequently, we showcase a range of optimizations that target these bottlenecks, including activation checkpointing, offloading, and copy-engine-based collectives. LLMQ is able to train or fine-tune a 7B model on a single 16GB mid-range gaming card, or a 32B model on a workstation equipped with 4 RTX 4090s. This is achieved while executing a standard 8-bit training pipeline, without additional algorithmic approximations, while maintaining FLOP utilization of around 50%. The efficiency of LLMQ rivals that of production-scale systems running on much more expensive cloud-grade GPUs.
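Activation checkpointing, one of the memory optimizations the abstract mentions, trades compute for memory: instead of keeping every layer's activation alive for the backward pass, only segment-boundary activations are stored, and the rest are recomputed during backpropagation. The following is a minimal, self-contained Python sketch of the idea on a toy chain of scalar layers f_i(x) = w_i * x; it is illustrative only and does not reflect LLMQ's actual CUDA/C++ implementation, and all function names here are hypothetical.

```python
# Toy activation checkpointing on a chain of layers f_i(x) = w_i * x.
# Without checkpointing, backward needs every intermediate activation.
# With checkpointing, only segment-boundary activations are stored in
# the forward pass; each segment's internals are recomputed on demand.

def forward_segment(x, weights):
    """Run one segment of layers, returning all intermediate activations."""
    acts = [x]
    for w in weights:
        acts.append(w * acts[-1])
    return acts

def checkpointed_backward(x0, weights, seg_len, grad_out):
    """Forward storing only segment boundaries, then backward with recompute."""
    # Forward: keep only activations at segment boundaries.
    boundaries = [x0]
    a = x0
    for i in range(0, len(weights), seg_len):
        for w in weights[i:i + seg_len]:
            a = w * a
        boundaries.append(a)
    # Backward: recompute each segment's activations, then apply chain rule.
    grad = grad_out
    grads_w = [0.0] * len(weights)
    for s in reversed(range(len(boundaries) - 1)):
        i = s * seg_len
        seg_w = weights[i:i + seg_len]
        acts = forward_segment(boundaries[s], seg_w)  # recompute segment
        for j in reversed(range(len(seg_w))):
            grads_w[i + j] = grad * acts[j]   # d(out)/d(w_j) = grad * input
            grad = grad * seg_w[j]            # propagate grad to layer input
    return grads_w, grad  # weight gradients and gradient w.r.t. x0

# Example: out = 2 * 3 * 4 * 5 = 120; gradients match the unrolled product rule.
grads_w, grad_x = checkpointed_backward(2.0, [3.0, 4.0, 5.0], 2, 1.0)
```

With a segment length near the square root of the layer count, peak activation memory drops from O(n) to roughly O(sqrt(n)) stored tensors, at the cost of one extra forward recomputation per segment.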
March 22, 2026 by hgpu