high performance computing on graphics processing units: hgpu.org

hgpu.org » Programming » Algorithms » Energy-and cost-efficient Lattice-QCD computations using graphics processing units

Energy-and cost-efficient Lattice-QCD computations using graphics processing units

Matthias Bach

Johann Wolfgang Goethe-Universitat in Frankfurt am Main

Johann Wolfgang Goethe-Universitat in Frankfurt am Main, 2014

@article{bach2014energy,

title={Energy-and cost-efficient Lattice-QCD computations using graphics processing units},

author={Bach, Matthias},

year={2014}

}

Download (PDF)

View

Source

2606

views

Quarks and gluons are the building blocks of all hadronic matter, like protons and neutrons. Their interaction is described by Quantum Chromodynamics (QCD), a theory under test by large scale experiments like the Large Hadron Collider (LHC) at CERN and in the future at the Facility for Antiproton and Ion Research (FAIR) at GSI. However, perturbative methods can only be applied to QCD for high energies. Studies from first principles are possible via a discretization onto an Euclidean space-time grid. This discretization of QCD is called Lattice QCD (LQCD) and is the only ab-initio option outside of the high-energy regime. LQCD is extremely compute and memory intensive. In particular, it is by definition always bandwidth limited. Thus-despite the complexity of LQCD applications-it led to the development of several specialized compute platforms and influenced the development of others. However, in recent years General-Purpose computation on Graphics Processing Units (GPGPU) came up as a new means for parallel computing. Contrary to machines traditionally used for LQCD, graphics processing units (GPUs) are a massmarket product. This promises advantages in both the pace at which higher-performing hardware becomes available and its price. CL2QCD is an OpenCL based implementation of LQCD using Wilson fermions that was developed within this thesis. It operates on GPUs by all major vendors as well as on central processing units (CPUs). On the AMD Radeon HD 7970 it provides the fastest double-precision D= kernel for a single GPU, achieving 120GFLOPS. D=-the most compute intensive kernel in LQCD simulations-is commonly used to compare LQCD platforms. This performance is enabled by an in-depth analysis of optimization techniques for bandwidth-limited codes on GPUs. Further, analysis of the communication between GPU and CPU, as well as between multiple GPUs, enables high-performance Krylov space solvers and linear scaling to multiple GPUs within a single system. LQCD calculations require a sampling of the phase space. The hybrid Monte Carlo (HMC) algorithm performs this. For this task, a single AMD Radeon HD 7970 GPU provides four times the performance of two AMD Opteron 6220 running an optimized reference code. The same advantage is achieved in terms of energy-efficiency. In terms of normalized total cost of acquisition (TCA), GPU-based clusters match conventional large-scale LQCD systems. Contrary to those, however, they can be scaled up from a single node. Examples of large GPU-based systems are LOEWE-CSC and SANAM. On both, CL2QCD has already been used in production for LQCD studies.

Tags: Algorithms, AMD FirePro S10000, ATI, ATI Radeon HD 5870, ATI Radeon HD 7970, Energy-efficient computing, High Energy Physics – Lattice, OpenCL, Physics, QCD, Thesis

March 3, 2015 by hgpu

Rating: 0.5/5. From 1 vote.

Please wait...

Your response

You must be logged in to post a comment.

* * *

high performance computing on graphics processing units: hgpu.org

Energy-and cost-efficient Lattice-QCD computations using graphics processing units

Your response

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

Most viewed papers (last 30 days)

Energy-and cost-efficient Lattice-QCD computations using graphics processing units

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)