high performance computing on graphics processing units: hgpu.org

Understanding the impact of CUDA tuning techniques for Fermi

Yuri Torres, Arturo Gonzalez-Escribano, Diego R. Llanos

Departamento de Informatica, Universidad de Valladolid, Spain

International Conference on High Performance Computing and Simulation (HPCS), 2011

DOI:10.1109/HPCSim.2011.5999886

While the correctness of an NVIDIA CUDA program is easy to achieve, exploiting the GPU capabilities to obtain the best possible performance is a task for experienced CUDA programmers. Typical code tuning strategies, such as choosing an appropriate size and shape for the threadblocks, ensuring coalesced memory accesses, or maximizing occupancy, are interdependent. Moreover, these choices also depend on the details of the underlying architecture and on the global-memory access pattern of the designed solution. For example, the size and shape of threadblocks are usually chosen to simplify coding (e.g. square shapes) while maximizing the multiprocessors’ occupancy. However, this simple choice does not usually provide the best performance. In this paper we discuss important relations between the size and shape of threadblocks, occupancy, global memory access patterns, and other Fermi architecture features, such as the configuration of the new transparent cache. We present an insight-based approach to tuning techniques, providing guidelines to understand these complex relations and to easily avoid bad tuning settings.
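To make the interplay between threadblock shape, occupancy, and coalescing concrete, the following is a minimal toy model written for illustration; it is not the methodology of the paper. It assumes the documented compute capability 2.0 (Fermi) limits per multiprocessor (1536 resident threads, 8 resident blocks, 32-thread warps) and 128-byte L1 cache lines, and it ignores register and shared-memory pressure. The function names `occupancy` and `lines_per_warp`, and the access pattern in which thread (tx, ty) reads element [ty][tx] of a row-major float matrix, are assumptions chosen for this sketch.

```python
"""Toy model: how threadblock shape affects occupancy and coalescing on Fermi."""

# Documented compute capability 2.0 (Fermi) limits per multiprocessor (SM).
WARP_SIZE = 32
MAX_THREADS_PER_SM = 1536
MAX_BLOCKS_PER_SM = 8
MAX_THREADS_PER_BLOCK = 1024
CACHE_LINE_BYTES = 128  # L1 line size when global loads are cached


def occupancy(block_x, block_y):
    """Fraction of the SM's warp slots in use.

    Simplification: register and shared-memory limits are ignored.
    """
    threads = block_x * block_y
    if not 0 < threads <= MAX_THREADS_PER_BLOCK:
        raise ValueError("invalid threadblock size")
    warps_per_block = -(-threads // WARP_SIZE)  # ceiling division
    max_warps = MAX_THREADS_PER_SM // WARP_SIZE
    resident_blocks = min(MAX_BLOCKS_PER_SM, max_warps // warps_per_block)
    return resident_blocks * warps_per_block / max_warps


def lines_per_warp(block_x, block_y, row_elems, elem_bytes=4):
    """Distinct 128-byte lines touched by the first warp of block (0, 0).

    Hypothetical access pattern: thread (tx, ty) reads element [ty][tx] of a
    row-major matrix with row_elems elements per row, stored at a
    128-byte-aligned base address.
    """
    lines = set()
    for tid in range(min(WARP_SIZE, block_x * block_y)):
        # Warps are formed from consecutive linear thread ids, x fastest.
        tx, ty = tid % block_x, tid // block_x
        address = (ty * row_elems + tx) * elem_bytes
        lines.add(address // CACHE_LINE_BYTES)
    return len(lines)


if __name__ == "__main__":
    for shape in [(16, 16), (32, 8), (8, 8), (32, 32)]:
        occ = occupancy(*shape)
        lines = lines_per_warp(*shape, row_elems=1024)
        print(f"{shape}: occupancy={occ:.2f}, lines per warp={lines}")
```

Under these assumptions, a square 16x16 block and a 32x8 block both reach full occupancy, yet a warp of the 16x16 block spans two matrix rows and touches two cache lines, while a warp of the 32x8 block touches one. This is one simple way to see why a shape chosen for coding convenience and occupancy alone may not give the best memory behaviour; actual performance also depends on factors the model leaves out, such as register usage and the L1/shared-memory configuration.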

Tags: Computer science, CUDA, nVidia, nVidia GeForce GTX 480, Performance

November 10, 2011 by hgpu
