high performance computing on graphics processing units: hgpu.org

A Fast and Generic GPU-Based Parallel Reduction Implementation

Walid Jradi, Hugo do Nascimento, Wellington Martins

Universidade Federal de Goias – Instituto de Informatica

arXiv:1710.07358 [cs.DC], (19 Oct 2017)

Reduction operations are used extensively in many computational problems. Given a finite set of numeric elements, a reduction combines all elements of the set into a single value by means of a combiner function. A parallel reduction, in turn, is a reduction performed concurrently when multiple execution units are available. This work reports an investigation of the subject and describes a GPU-based parallel approach to it. By employing techniques such as loop unrolling, persistent threads, and algebraic expressions to avoid thread divergence, the presented approach achieved a 2.8x speedup over the work of Catanzaro while using generic, simple, and easily portable code. Experiments conducted to evaluate the approach show that the strategy performs efficiently on both AMD and NVIDIA hardware, and in both OpenCL and CUDA.
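To make the definition of a reduction concrete, here is a minimal Python sketch of a pairwise (tree-shaped) reduction with a generic combiner function. It is not the authors' GPU implementation and does not use loop unrolling or persistent threads. It only illustrates why a reduction parallelizes: within each pass, every pair can be combined independently, so on a GPU each pair could be given to a separate thread. The name `tree_reduce` and its parameters are hypothetical, and the combiner is assumed to be associative.

```python
import operator


def tree_reduce(values, combine, identity):
    """Combine all values into one using an associative `combine` function.

    Hypothetical helper for illustration only: the work is done in passes,
    and in each pass element i is combined with element i + stride.
    """
    data = list(values)
    if not data:
        # An empty input reduces to the identity of the combiner.
        return identity
    n = len(data)
    stride = 1
    while stride < n:
        # The iterations of this inner loop do not depend on one another,
        # so a parallel version could run them concurrently.
        for i in range(0, n - stride, 2 * stride):
            data[i] = combine(data[i], data[i + stride])
        stride *= 2
    # After about log2(n) passes the result sits in the first slot.
    return data[0]


if __name__ == "__main__":
    print(tree_reduce(range(1, 11), operator.add, 0))          # sum of 1..10
    print(tree_reduce([3, 9, 2, 7, 5], max, float("-inf")))    # maximum
```

The same function works for any associative combiner (sum, product, minimum, maximum) because only the pairing pattern is fixed. A sequential loop needs n - 1 dependent steps, whereas this pattern needs roughly log2(n) passes when the pairs in each pass run concurrently.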

Tags: Computer science, CUDA, nVidia, OpenCL, Performance, Tesla C2075

October 24, 2017 by hgpu
