Data access optimized applications on the GPU using NVIDIA CUDA

Dheevatsa Mudigere
Technische Universität München, 2009


@mastersthesis{mudigere2009,
   title={Data access optimized applications on the GPU using NVIDIA CUDA},
   author={Mudigere, D.},
   school={Technische Universit\"at M\"unchen},
   year={2009}
}





This work addresses the bandwidth-limited performance of data-intensive GPGPU applications. Performance limited by memory bandwidth is a common issue for data-intensive HPC applications in general; on the GPU, the problem is more pronounced owing to its unique memory architecture. It is tackled here by optimizing basic data rearrangement operations on the GPU. To this end, methods and approaches for optimizing data rearrangement on GPU architectures in general have been identified and formulated, and these are employed to develop near-optimal, generic GPU kernels for a set of data rearrangement operations. In particular, a library of GPU kernels has been developed for operations that rearrange generic m-dimensional data into n dimensions. These kernels have been hand-tuned for maximum throughput, reaching up to 90% of the bandwidth utilization of the intrinsic memcpy function. They are implemented as templatized, generic kernels, allowing seamless integration into existing applications. The target GPU architectures considered in this work are the NVIDIA Tesla C1060 and NVIDIA Tesla C870, and the kernels have been developed using NVIDIA CUDA. All kernels achieve or surpass the best known performance in terms of bandwidth utilization.

Furthermore, as a case study, a simple Navier-Stokes-based CFD flow solver incorporating these optimal data rearrangement principles has been developed for the GPU and tested on a 2D lid-driven cavity flow. The GPU implementation is comprehensively compared with optimized serial and parallel CPU implementations on an Intel Nehalem X5550 platform. A maximum speedup of 252x over the serial CPU code, and of 13x over the parallel CPU code (16 MPI processes on 8 cores of 2 quad-core Nehalem X5550 processors), has been attained.
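To illustrate the kind of data rearrangement optimization the abstract describes, the sketch below shows the canonical technique on these architectures: staging a 2D transpose through shared memory so that both global-memory reads and writes are coalesced. This is a generic textbook example, not the thesis's actual library; the tile size, kernel name, and row-major layout are assumptions.

```cuda
#define TILE 16  // assumed tile size; tuned per architecture in practice

// Coalesced 2D transpose of a width x height row-major matrix.
__global__ void transpose_coalesced(float *out, const float *in,
                                    int width, int height)
{
    // +1 column of padding avoids shared-memory bank conflicts
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read

    __syncthreads();

    // swap the block indices so the write side is also coalesced
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}
```

Without the shared-memory stage, one of the two global accesses is strided and throughput collapses on Tesla-class hardware; with it, such kernels can approach memcpy bandwidth, which is the figure of merit used in the abstract.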

