high performance computing on graphics processing units: hgpu.org

Posts

Jul, 10

Towards Good Practices for Very Deep Two-Stream ConvNets

Deep convolutional networks have achieved great success for object recognition in still images. However, for action recognition in videos, the improvement of deep convolutional networks is not so evident. We argue that there are two reasons that could probably explain this result. First the current network architectures (e.g. Two-stream ConvNets) are relatively shallow compared with […]

CUDA

Jul, 10

Multiple String Matching on a GPU using CUDAs

Multiple pattern matching algorithms are used to locate the occurrences of patterns from a finite pattern set in a large input string. Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG, five of the most well known algorithms for multiple matching require an increased computing power, particularly in cases where large-size datasets must be […]

CUDA

Jul, 8

Learning Better Encoding for Approximate Nearest Neighbor Search with Dictionary Annealing

We introduce a novel dictionary optimization method for high-dimensional vector quantization employed in approximate nearest neighbor (ANN) search. Vector quantization methods first seek a series of dictionaries, then approximate each vector by a sum of elements selected from these dictionaries. An optimal series of dictionaries should be mutually independent, and each dictionary should generate a […]

Jul, 8

Model-based optimization of MPDATA on Intel Xeon Phi through load imbalancing

Load balancing is a widely accepted technique for performance optimization of scientific applications on parallel architectures. Indeed, balanced applications do not waste processor cycles on waiting at points of synchronization and data exchange, maximizing this way the utilization of processors. In this paper, we challenge the universality of the load-balancing approach to optimization of the […]

Jul, 8

Autotuning OpenACC Work Distribution via Direct Search

OpenACC provides a high-productivity API for programming GPUs and similar accelerator devices. One of the last steps in tuning OpenACC programs is selecting values for the num_gangs and vector length clauses, which control how a parallel workload is distributed to an accelerator’s processing units. In this paper, we present OptACC, an autotuner that can assist […]

Jul, 8

Experiments on Parallel Training of Deep Neural Network using Model Averaging

In this work we apply model averaging to parallel training of deep neural network (DNN). Parallelization is done in a model averaging manner. Data is partitioned and distributed to different nodes for local model updates, and model averaging across nodes is done every few minibatches. We use multiple GPUs for data parallelization, and Message Passing […]

CUDA

Jul, 8

Sorting and Permuting without Bank Conflicts on GPUs

In this paper, we look at the complexity of designing algorithms without any bank conflicts in the shared memory of Graphical Processing Units (GPUs). Given input of size $n$, $w$ processors and $w$ memory banks, we study three fundamental problems: sorting, permuting and $w$-way partitioning (defined as sorting an input containing exactly $n/w$ copies of […]

Jul, 6

High Performance Extreme Learning Machines: A Complete Toolbox for Big Data Applications

This work presents a complete approach to a successful utilization of a high performance Extreme Learning Machines (ELMs) Toolbox for Big Data. It summarizes recent advantages in algorithmic performance; gives a fresh view on the ELM solution in relation to the traditional linear algebraic performance; and reaps the latest software and hardware performance achievements. The […]

CUDA

•

OpenCL

Jul, 6

LTTng CLUST: A system-wide unified CPU and GPU tracing tool for OpenCL applications

As computation schemes evolve and many new tools become available to programmers to enhance the performance of their applications, many programmers started to look towards highly parallel platforms such as Graphical Processing Unit (GPU). Offloading computations that can take advantage of the architecture of the GPU is a technique that has proven fruitful in recent […]

OpenCL

Jul, 6

Hetero-DB: Next Generation High-Performance Database Systems by Best Utilizing Heterogeneous Computing and Storage Resources

With recent advancement on hardware technologies, new general-purpose high-performance devices have been widely adopted, such as the graphics processing unit (GPU) and solid state drive (SSD). GPU may offer an order of higher throughput for applications with massive data parallelism, compared with the multicore CPU. Moreover, new storage device SSD is also capable of offering […]

CUDA

•

OpenCL

Jul, 6

Mega-KV: A Case for GPUs to Maximize the Throughput of In-Memory Key-Value Stores

In-memory key-value stores play a critical role in data processing to provide high throughput and low latency data accesses. In-memory key-value stores have several unique properties that include (1) data intensive operations demanding high memory bandwidth for fast data accesses, (2) high data parallelism and simple computing operations demanding many slim parallel computing units, and […]

CUDA

Jul, 6

Best bang for your buck: GPU nodes for GROMACS biomolecular simulations

The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware from commodity workstations to high performance computing clusters. Hardware features are well exploited with a combination of SIMD, multi-threading, and MPI-based SPMD/MPMD parallelism, while GPUs can be used as accelerators to compute interactions offloaded from the CPU. Here we evaluate which […]

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

GigaAPI: a user-space API that simplifies multi-GPU programming, bridging the gap between the capabilities of parallel GPU systems and the ability of developers to harness their full potential

GigaAPI for GPU Parallelization

high performance computing on graphics processing units: hgpu.org

Posts

Towards Good Practices for Very Deep Two-Stream ConvNets

Multiple String Matching on a GPU using CUDAs

Learning Better Encoding for Approximate Nearest Neighbor Search with Dictionary Annealing

Model-based optimization of MPDATA on Intel Xeon Phi through load imbalancing

Autotuning OpenACC Work Distribution via Direct Search

Experiments on Parallel Training of Deep Neural Network using Model Averaging

Sorting and Permuting without Bank Conflicts on GPUs

High Performance Extreme Learning Machines: A Complete Toolbox for Big Data Applications

LTTng CLUST: A system-wide unified CPU and GPU tracing tool for OpenCL applications

Hetero-DB: Next Generation High-Performance Database Systems by Best Utilizing Heterogeneous Computing and Storage Resources

Mega-KV: A Case for GPUs to Maximize the Throughput of In-Memory Key-Value Stores

Best bang for your buck: GPU nodes for GROMACS biomolecular simulations

Recent source codes

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA

Data-efficient LLM Fine-tuning for Code Generation

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

GigaAPI: a user-space API that simplifies multi-GPU programming, bridging the gap between the capabilities of parallel GPU systems and the ability of developers to harness their full potential

Coccinelle: a C code transformation engine using SmPL for matches, refactorings, and bug fixing

DuoReduce: MLIR's benchmark

Shamrock: Multi-GPU hydrodynamics for astrophysics

LLMPerf: GPU Performance Modeling meets Large Language Models

Most viewed papers (last 30 days)