high performance computing on graphics processing units: hgpu.org

Using Fermi architecture knowledge to speed up CUDA and OpenCL programs

Yuri Torres, Arturo Gonzalez-Escribano, Diego R. Llanos

Dpto. Informatica, Univ. Valladolid, Spain

International Workshop on Heterogeneous Architectures and Computing (ISPA 2012), 2012

NVIDIA graphics processing units (GPUs) play an important role as general-purpose programming devices. Implementing parallel codes that exploit the GPU hardware architecture is a task for experienced programmers. The choice of threadblock size and shape is one of the most important decisions the user makes when coding a parallel problem. The threadblock configuration has a significant impact on the overall performance of the program. While the CUDA parallel programming model always requires the threadblock size and shape to be specified, the OpenCL standard also offers an automatic mechanism to make this delicate decision. In this paper we present a study of these criteria for the Fermi architecture, introducing a general approach for threadblock choice and showing that there is considerable room for improvement in the OpenCL automatic strategy.
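
To illustrate why threadblock size matters on Fermi, the following is a minimal sketch of a warp-occupancy estimate. It is not the approach presented in the paper. It assumes only the documented per-multiprocessor limits for compute capability 2.0 (warp size 32, at most 1024 threads per block, at most 48 resident warps and 8 resident blocks per multiprocessor) and deliberately ignores register and shared-memory pressure as well as the effect of block shape on memory coalescing. The function name `warp_occupancy` is hypothetical.

```python
# Per-multiprocessor limits for compute capability 2.0 (Fermi).
WARP_SIZE = 32
MAX_THREADS_PER_BLOCK = 1024
MAX_WARPS_PER_SM = 48
MAX_BLOCKS_PER_SM = 8


def warp_occupancy(block_x, block_y=1):
    """Estimate the fraction of resident warp slots used on one multiprocessor.

    Simplifying assumption: only the warp and block limits are considered;
    registers and shared memory are treated as unlimited.
    """
    threads = block_x * block_y
    if not 1 <= threads <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threadblock must contain between 1 and 1024 threads")
    # A partially filled warp still occupies a whole warp slot.
    warps_per_block = -(-threads // WARP_SIZE)
    # Resident blocks are capped by both the block limit and the warp limit.
    blocks_per_sm = min(MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM // warps_per_block)
    return blocks_per_sm * warps_per_block / MAX_WARPS_PER_SM


if __name__ == "__main__":
    for shape in [(64, 1), (16, 16), (192, 1), (32, 32)]:
        print(shape, round(warp_occupancy(*shape), 3))
```

Under these assumptions, a 64-thread block is limited by the 8-block cap and fills only a third of the warp slots, while 192-thread and 256-thread blocks can fill all of them. Shapes with the same thread count, such as 16x16 and 256x1, get the same estimate here even though their memory access patterns may differ, which is one reason an occupancy figure alone does not settle the choice.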

Tags: Computer science, CUDA, nVidia, nVidia GeForce GTX 480, OpenCL, Performance

June 13, 2012 by hgpu
