high performance computing on graphics processing units: hgpu.org

Cramming: Training a Language Model on a Single GPU in One Day

Jonas Geiping, Tom Goldstein

University of Maryland, College Park

arXiv:2212.14034 [cs.CL], (28 Dec 2022)

DOI:10.48550/arXiv.2212.14034

@misc{https://doi.org/10.48550/arxiv.2212.14034,
  doi={10.48550/ARXIV.2212.14034},
  url={https://arxiv.org/abs/2212.14034},
  author={Geiping, Jonas and Goldstein, Tom},
  keywords={Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title={Cramming: Training a Language Model on a Single GPU in One Day},
  publisher={arXiv},
  year={2022},
}

Source code package: Cramming the training of a (BERT-type) language model into limited compute

Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.

Tags: Computer science, Deep learning, NLP, nVidia, nVidia GeForce RTX 2080 Ti, nVidia RTX A4000, nVidia RTX A6000, Package

January 8, 2023 by hgpu
