
Posts

Apr, 10

Dynamic Programming with CUDA – Part II

This module is largely stand-alone. It is "Part II" only in the sense that it does not contain the overview of dynamic programming seen in Part I, and does not recapitulate the introduction to CUDA. We will continue to refer the reader to various NVIDIA references where appropriate, particularly the NVIDIA CUDA C Programming Guide, […]
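The excerpt above contains no code, but to give a flavor of the kind of kernel such a module builds toward, here is a minimal, hypothetical sketch (not taken from the module) of a row-parallel 0/1-knapsack update in CUDA: each cell of DP row i depends only on row i-1, so the whole row can be computed by independent threads. The kernel name, buffer names, and the host-side loop are placeholders.

```cuda
// Hypothetical sketch, not code from the module: one 0/1-knapsack DP step.
// dp_prev holds row i-1 (best value using the first i-1 items for each capacity),
// dp_curr receives row i.  Every cell of row i depends only on row i-1, so the
// whole row can be filled by independent threads.
__global__ void knapsack_row(const int *dp_prev, int *dp_curr,
                             int weight, int value, int capacity)
{
    int w = blockIdx.x * blockDim.x + threadIdx.x;     // capacity index 0..capacity
    if (w > capacity) return;

    int best = dp_prev[w];                              // skip item i
    if (w >= weight)
        best = max(best, dp_prev[w - weight] + value);  // take item i
    dp_curr[w] = best;
}

// Host side (sketch): launch the kernel once per item, swapping the row buffers.
// for (int i = 0; i < n_items; ++i) {
//     knapsack_row<<<(capacity + 256) / 256, 256>>>(d_prev, d_curr,
//                                                   wt[i], val[i], capacity);
//     std::swap(d_prev, d_curr);
// }
```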
Apr, 10

A Comparative Study of Parallel Algorithms for the Girth Problem

In this paper we introduce efficient parallel algorithms for finding the girth in a graph or digraph, where girth is the length of a shortest cycle. We empirically compare our algorithms by using two common APIs for parallel programming in C++, which are OpenMP for multiple CPUs and CUDA for multi-core GPUs. We conclude that […]
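The excerpt stops before the algorithms themselves; purely as an illustration of the brute-force baseline for digraphs (one BFS per source vertex, which is not necessarily what the paper implements), the hedged CUDA sketch below assumes a CSR graph (`row_ptr`, `col_idx`) and per-source scratch buffers of n integers each, so it is only sensible for small graphs. `*girth` is initialized to INT_MAX on the host.

```cuda
// Hedged sketch (not the paper's algorithm): brute-force digraph girth.
// One thread runs a BFS from one source vertex; an edge u -> src closes a
// directed cycle of length dist[u] + 1 through src.  The minimum over all
// sources is the girth.  dist_buf and queue_buf each hold n*n ints of scratch.
__global__ void girth_bfs(const int *row_ptr, const int *col_idx, int n,
                          int *dist_buf, int *queue_buf, int *girth)
{
    int src = blockIdx.x * blockDim.x + threadIdx.x;
    if (src >= n) return;

    int *dist  = dist_buf  + (size_t)src * n;   // per-source distance array
    int *queue = queue_buf + (size_t)src * n;   // per-source BFS queue
    for (int i = 0; i < n; ++i) dist[i] = -1;

    int head = 0, tail = 0;
    dist[src] = 0;
    queue[tail++] = src;

    while (head < tail) {
        int u = queue[head++];
        for (int e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
            int v = col_idx[e];
            if (v == src) {                      // edge closes a cycle through src
                atomicMin(girth, dist[u] + 1);
            } else if (dist[v] < 0) {
                dist[v] = dist[u] + 1;
                queue[tail++] = v;
            }
        }
    }
}
```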
Apr, 10

Hadoop+Aparapi: Making heterogeneous MapReduce programming easier

Lately, programmers have started to take advantage of the GPU capabilities of cloud-based machines. Using GPUs can decrease the number of nodes required to perform a computation by increasing the productivity per node. We combine Hadoop, a widely-used MapReduce framework, with Aparapi, a new Java-to-OpenCL conversion tool from AMD. We propose an easy-to-use API which […]
Apr, 10

An innovative compilation tool-chain for embedded multi-core architectures

In this paper, we propose a compilation tool-chain supporting the effective exploitation of multi-core architectures offering hundreds of cores. The tool-chain leverages both the application requirements and the platform-specific features to provide developers with a powerful parallel-programming environment able to generate efficient parallel code. The design of parallel applications follows a semi-automatic approach enabling […]
Apr, 9

New Basic Linear Algebra Methods for Simulation on GPUs

We have used Graphics Processing Units (GPUs) to accelerate the solution of the types of equations typically encountered in dynamic system simulators. Compared to commercial matrix solvers running on a CPU, we achieved speedups ranging from 5 (for a system size of ~700) to 460 (for a system size of ~5800). While calculation time for the commercial matrix […]
Apr, 9

A Study of Productivity and Performance of Modern Vector Processors

This bachelor thesis carries out a case study describing the performance and productivity of modern vector processors such as graphics processing units (GPUs) and central processing units (CPUs) based on three different computational routines arising from a magnetoencephalography application. I apply different programming paradigms to these routines targeting either the CPU or the GPU. Furthermore, […]
Apr, 9

Tiled Shading

In this article we describe and investigate tiled shading. The tiled techniques, though simple, enable substantial improvements to both deferred and forward shading. Tiled shading has previously been discussed only in terms of deferred shading (tiled deferred shading). We contribute a more detailed description of the technique, introduce tiled forward shading (a generalization of […]
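The excerpt describes the idea (split the screen into tiles, build a per-tile light list once, then reuse it during shading) but not the details; the CUDA sketch below is only a rough illustration of the culling step, not the authors' implementation. One block handles one 16x16 tile, the per-tile depth bounds are assumed precomputed, and the lights are assumed already projected to screen space with a pixel-space radius.

```cuda
// Hedged illustration of tiled light culling (not the paper's code).
// One block = one screen tile; tile_zmin/tile_zmax are precomputed per-tile
// view-space depth bounds; tile_lists stores surviving light indices per tile.
struct ScreenLight { float x, y, radius, zmin, zmax; };

#define TILE 16
#define MAX_LIGHTS_PER_TILE 256

__global__ void cull_lights(const ScreenLight *lights, int num_lights,
                            const float *tile_zmin, const float *tile_zmax,
                            int tiles_x, int *tile_counts, int *tile_lists)
{
    int tile = blockIdx.y * tiles_x + blockIdx.x;
    float x0 = blockIdx.x * TILE, x1 = x0 + TILE;    // tile rectangle in pixels
    float y0 = blockIdx.y * TILE, y1 = y0 + TILE;
    float zmin = tile_zmin[tile], zmax = tile_zmax[tile];

    __shared__ int count;
    if (threadIdx.x == 0) count = 0;
    __syncthreads();

    // Each thread tests a strided subset of the lights against this tile.
    for (int i = threadIdx.x; i < num_lights; i += blockDim.x) {
        ScreenLight L = lights[i];
        bool overlaps_xy = L.x + L.radius >= x0 && L.x - L.radius <= x1 &&
                           L.y + L.radius >= y0 && L.y - L.radius <= y1;
        bool overlaps_z  = L.zmax >= zmin && L.zmin <= zmax;
        if (overlaps_xy && overlaps_z) {
            int slot = atomicAdd(&count, 1);
            if (slot < MAX_LIGHTS_PER_TILE)
                tile_lists[tile * MAX_LIGHTS_PER_TILE + slot] = i;
        }
    }
    __syncthreads();
    if (threadIdx.x == 0)
        tile_counts[tile] = min(count, MAX_LIGHTS_PER_TILE);
}
```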
Apr, 9

A GPU-Based Accelerator for Chinese Word Segmentation

The task of Chinese word segmentation is to split a sequence of Chinese characters into tokens so that Chinese information can be more easily retrieved by web search engines. Due to the dramatic increase in the amount of Chinese literature in recent years, it has become a big challenge for web search engines to analyze massive […]
Apr, 9

Efficient computational noise in GLSL

We present GLSL implementations of Perlin noise and Perlin simplex noise that run fast enough for practical consideration on current generation GPU hardware. The key benefits are that the functions are purely computational, i.e. they use neither textures nor lookup tables, and that they are implemented in GLSL version 1.20, which means they are compatible […]
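The functions in the paper are Perlin and simplex noise written in GLSL; the sketch below is neither (it is simple hash-based value noise, written in CUDA for consistency with the other sketches on this page) and is included only to show what "purely computational" means in practice: no textures or lookup tables, just arithmetic on the coordinates.

```cuda
// Hedged illustration only: 2D value noise with an integer hash in place of a
// permutation texture.  The paper's functions are Perlin/simplex noise in GLSL;
// this is a simpler analogue showing a texture- and LUT-free formulation.
__device__ unsigned int hash2(int x, int y)
{
    unsigned int h = (unsigned int)x * 374761393u + (unsigned int)y * 668265263u;
    h = (h ^ (h >> 13)) * 1274126177u;
    return h ^ (h >> 16);
}

__device__ float lattice(int x, int y)          // pseudo-random value in [0, 1)
{
    return (hash2(x, y) & 0xFFFFFF) / 16777216.0f;
}

__device__ float fade(float t)                  // Perlin's quintic smoothstep
{
    return t * t * t * (t * (t * 6.0f - 15.0f) + 10.0f);
}

__device__ float value_noise(float x, float y)
{
    int   xi = (int)floorf(x), yi = (int)floorf(y);
    float xf = x - xi,         yf = y - yi;
    float u = fade(xf), v = fade(yf);

    float n00 = lattice(xi,     yi    );
    float n10 = lattice(xi + 1, yi    );
    float n01 = lattice(xi,     yi + 1);
    float n11 = lattice(xi + 1, yi + 1);

    // Bilinear interpolation of the four corner values with smoothed weights.
    float nx0 = n00 + u * (n10 - n00);
    float nx1 = n01 + u * (n11 - n01);
    return nx0 + v * (nx1 - nx0);
}
```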
Apr, 7

A Scalable Framework for Heterogeneous GPU-Based Clusters

GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and much improved single-node computational performance; however, there is little parallel software available that can utilize all CPU cores and all GPUs on such a heterogeneous system efficiently. On a heterogeneous cluster, the performance of a GPU (or […]
Apr, 7

Bound the Peak Performance of SGEMM on GPU with software-controlled fast memory

In this paper, we studied the characteristics of the NVIDIA GPU architecture relevant to the SGEMM routine and the potential peak performance of SGEMM on Fermi GPUs. Guided by this analysis, our SGEMM routine achieved about 11% (NN), 4.5% (TN), 3% (NT), and 9% (TT) better performance than CUBLAS in the CUDA 4.1 package for large matrices on a GTX 580 […]
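The paper is about squeezing the last few percent over CUBLAS with carefully tuned code; as a point of reference only, here is the textbook shared-memory-tiled SGEMM baseline (nothing like the paper's optimized routine), assuming square row-major n x n matrices with n a multiple of the tile size.

```cuda
// Baseline shared-memory-tiled SGEMM sketch (far from the paper's tuned kernel).
// C = A * B, all matrices n x n, row-major, n assumed to be a multiple of TILE.
#define TILE 16

__global__ void sgemm_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n; t += TILE) {
        // Stage one TILE x TILE block of A and B in shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

// Launch (sketch): dim3 block(TILE, TILE); dim3 grid(n / TILE, n / TILE);
// sgemm_tiled<<<grid, block>>>(dA, dB, dC, n);
```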
Apr, 7

Robust Computational Tools for Multiple Testing With Genetic Association Studies

Resolving the interplay of the genetic components of a complex disease is a challenging endeavor. Over the past several years, genome-wide association studies (GWAS) have emerged as a popular approach for locating common genetic variation within the human genome associated with disease risk. Assessing genetic-phenotype associations across hundreds of thousands of genetic markers using the […]
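The excerpt is cut off before the paper's actual testing machinery, so the sketch below is only a stand-in that illustrates why GWAS-scale multiple testing maps well to GPUs: each marker's statistic (here a simple 2x2 allelic chi-square, which may not be the test the paper uses) can be computed by an independent thread across hundreds of thousands of markers.

```cuda
// Hedged stand-in, not the paper's test: a per-marker allelic chi-square
// statistic from a 2x2 table (case/control x allele A/allele a).  One thread
// handles one marker, which is the embarrassingly parallel structure that a
// GWAS-scale multiple-testing workload exposes.
__global__ void chi_square_per_marker(const int *case_a, const int *case_b,
                                      const int *ctrl_a, const int *ctrl_b,
                                      float *stat, int num_markers)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= num_markers) return;

    float a = case_a[m], b = case_b[m];   // allele counts in cases
    float c = ctrl_a[m], d = ctrl_b[m];   // allele counts in controls
    float n = a + b + c + d;

    // chi^2 = N (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)) for a 2x2 table.
    float denom = (a + b) * (c + d) * (a + c) * (b + d);
    float diff  = a * d - b * c;
    stat[m] = denom > 0.0f ? n * diff * diff / denom : 0.0f;
}
```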

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hgpu.org