
Posts

Dec 4

An Accelerator based on the rho-VEX Processor: an Exploration using OpenCL

In recent years, the use of co-processors to accelerate specific tasks has become increasingly common. To simplify the use of these accelerators in software, the OpenCL framework has been developed. This framework provides programs with a cross-platform interface to accelerators. The rho-VEX processor is a run-time reconfigurable VLIW processor. It allows run-time switching of configurations, […]
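
A minimal sketch of the portability OpenCL standardizes: the same host code enumerates any vendor's devices, whether a GPU or a custom accelerator such as one built on the rho-VEX. The 16-entry caps are illustrative assumptions, not part of the API.

    #include <CL/cl.h>
    #include <cstdio>

    int main() {
        // Enumerate every OpenCL platform and device visible on this machine.
        cl_uint np = 0;
        clGetPlatformIDs(0, nullptr, &np);              // query the count first
        if (np > 16) np = 16;                           // illustrative cap
        cl_platform_id platforms[16];
        clGetPlatformIDs(np, platforms, nullptr);
        for (cl_uint p = 0; p < np; ++p) {
            cl_uint nd = 0;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 0, nullptr, &nd);
            if (nd > 16) nd = 16;
            cl_device_id devices[16];
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, nd, devices, nullptr);
            for (cl_uint d = 0; d < nd; ++d) {
                char name[256];
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, nullptr);
                printf("platform %u, device %u: %s\n", p, d, name);
            }
        }
        return 0;
    }
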
Dec 4

Optimizing CUDA Shared Memory Usage

CUDA shared memory is fast, on-chip storage. However, the bank conflict issue could cause a performance bottleneck. Current NVIDIA Tesla GPUs support memory bank accesses with configurable bit-widths. While this feature provides an efficient bank mapping scheme for 32-bit and 64-bit data types, it becomes trickier to solve the bank conflict problem through manual code […]
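
For context, a minimal sketch of the classic conflict-avoidance idiom the paper builds beyond: padding a shared-memory tile by one element so column-wise accesses spread across the 32 banks. The TILE size is illustrative; the API in the trailing comment is the configurable bank-width feature the abstract refers to.

    #define TILE 32

    // The +1 column shifts each row by one bank, so the column-wise reads in
    // the second phase no longer hit the same bank (32 banks of 4-byte words).
    __global__ void transpose(float *out, const float *in, int n) {
        __shared__ float tile[TILE][TILE + 1];
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];
        __syncthreads();
        x = blockIdx.y * TILE + threadIdx.x;   // transposed block origin
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < n && y < n) out[y * n + x] = tile[threadIdx.x][threadIdx.y];
    }

    // The configurable bank width on Kepler-class Tesla GPUs: request 8-byte
    // banks so 64-bit accesses map one element per bank.
    // cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
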
Dec 4

Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations

The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naive mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations. In […]
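
One mechanism commonly used for irregular nested loops on NVIDIA GPUs (not necessarily the templates proposed here) is CUDA dynamic parallelism: each parent thread launches a child grid sized to its own inner-loop length, so short rows do not tie up whole blocks. The names (row_ptr, cols) and the doubling body are illustrative assumptions.

    // Compile (assumption): nvcc -arch=sm_35 -rdc=true nested.cu
    __global__ void child(const int *cols, float *out, int begin, int end) {
        int i = begin + blockIdx.x * blockDim.x + threadIdx.x;
        if (i < end) out[i] = 2.0f * cols[i];          // stand-in inner-loop body
    }

    __global__ void parent(const int *row_ptr, const int *cols, float *out, int rows) {
        int r = blockIdx.x * blockDim.x + threadIdx.x;
        if (r >= rows) return;
        int begin = row_ptr[r], end = row_ptr[r + 1];  // irregular inner range
        int len = end - begin;
        if (len > 0) {
            int threads = 128;
            // Nested launch: child grid sized to this row's actual work.
            child<<<(len + threads - 1) / threads, threads>>>(cols, out, begin, end);
        }
    }
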
Dec 4

An Efficient Parallel Algorithm for Graph Isomorphism on GPU using CUDA

Modern Graphics Processing Units (GPUs) offer high computational power at low cost. Recently, many applications in various fields have been accelerated on the GPU using CUDA. In this paper, we propose an efficient parallel algorithm for graph isomorphism that runs on the GPU using CUDA for matching large graphs. Parallelization of a sequential graph […]
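
A hedged sketch of the kind of data parallelism such an algorithm exploits (the names and layout are hypothetical, not the paper's): each GPU thread verifies one candidate vertex mapping against the adjacency matrices of the two graphs.

    // adj_q: n x n adjacency of the query graph; adj_g: m x m adjacency of the
    // data graph; mappings holds one candidate mapping (query vertex -> data
    // vertex) per thread; valid[c] records whether candidate c preserves edges.
    __global__ void check_candidates(const char *adj_q, const char *adj_g,
                                     const int *mappings, char *valid,
                                     int n, int m, int num_candidates) {
        int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c >= num_candidates) return;
        const int *map = &mappings[c * n];
        char ok = 1;
        for (int u = 0; u < n && ok; ++u)
            for (int v = 0; v < n && ok; ++v)
                if (adj_q[u * n + v] && !adj_g[map[u] * m + map[v]])
                    ok = 0;              // query edge missing under this mapping
        valid[c] = ok;
    }
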
Dec 1

Bridging OpenCL and CUDA: A Comparative Analysis and Translation

Heterogeneous systems are widening their user-base, and heterogeneous computing is becoming popular in supercomputing. Among others, OpenCL and CUDA are the most popular programming models for heterogeneous systems. Although OpenCL inherited many features from CUDA and they have almost the same platform model, they are not compatible with each other. In this paper, we present […]
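
The near-identical platform models are easiest to see side by side. A CUDA vector-add kernel, with the OpenCL spelling of each construct noted in comments (a minimal illustration, not the paper's translator output):

    //   CUDA                                    OpenCL
    //   __global__                              __kernel
    //   blockIdx.x * blockDim.x + threadIdx.x   get_global_id(0)
    //   __shared__                              __local
    //   __syncthreads()                         barrier(CLK_LOCAL_MEM_FENCE)
    __global__ void vec_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // get_global_id(0)
        if (i < n) c[i] = a[i] + b[i];
    }
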
Dec 1

Programming in CUDA for Kepler and Maxwell Architecture

Since the first version of CUDA was launched, many improvements have been made in GPU computing. Each new CUDA version has included important novel features, making the architecture resemble a typical parallel high-performance language ever more closely. This tutorial will present the GPU architecture and CUDA principles, trying to conceptualize the novel features included by […]
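
As one example of the features such a tutorial covers: Kepler introduced warp shuffle instructions, which exchange register values between lanes without going through shared memory. A warp-level sum reduction (a sketch; the intrinsic was originally named __shfl_down and gained an explicit lane mask as __shfl_down_sync in later CUDA releases):

    __inline__ __device__ float warp_reduce_sum(float v) {
        // Halve the distance each step; lane i accumulates lane i+offset.
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        return v;   // lane 0 ends up holding the sum of all 32 lanes
    }
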
Dec 1

Auxiliary Image Regularization for Deep CNNs with Noisy Labels

Precisely labeled data sets with a sufficient number of samples are notably important for training deep convolutional neural networks (CNNs). However, many of the available real-world data sets contain erroneously labeled samples, and these label errors make it a daunting task to learn a well-performing deep CNN model. In this work, we consider […]
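
The generic shape of such a training objective, with the paper's specific auxiliary-image regularizer left abstract as R (the notation and the trade-off weight lambda below are assumptions for illustration):

    \min_{w}\; \frac{1}{N}\sum_{i=1}^{N} \ell\big(f(x_i; w),\, y_i\big) \;+\; \lambda\, R(w;\, \mathcal{A})

where \ell is the per-sample loss, y_i are the (possibly noisy) labels, and \mathcal{A} denotes the set of auxiliary images that regularizes the fit to mislabeled samples.
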
Dec 1

A General Framework for Constrained Bayesian Optimization using Information-based Search

We present an information-theoretic framework for solving global black-box optimization problems that also have black-box constraints. Of particular interest to us is to efficiently solve problems with decoupled constraints, in which subsets of the objective and constraint functions may be evaluated independently. For example, when the objective is evaluated on a CPU and the constraints […]
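
The problem class in question, written out (the standard formulation; the inequality direction is a convention, not taken from the paper):

    \min_{x \in \mathcal{X}} f(x) \quad \text{subject to} \quad c_k(x) \ge 0, \;\; k = 1, \dots, K

where f and every c_k are black boxes observable only pointwise, and in the decoupled setting each of the K + 1 functions may be evaluated independently, possibly on different hardware.
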
Dec 1

Efficient Static and Dynamic Memory Management Techniques for Multi-GPU Systems

There are four trends in modern high-performance computing (HPC) that have led to an increased need for efficient memory management techniques for heterogeneous systems (such as those fitted with GPUs). First, the average size of datasets for HPC applications is rapidly increasing. Read-only input matrices that used to be on the order of megabytes or […]
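
One concrete technique in this space, sketched under the assumption of unified-memory-capable (Pascal-class or later) hardware and not taken from the paper: mark a large read-only input as read-mostly so the driver replicates it to each GPU rather than migrating pages back and forth.

    #include <cuda_runtime.h>

    // Allocate a read-only input once and let the driver keep a copy on each
    // GPU that touches it, instead of hand-written per-device copies.
    void share_readonly_input(float **ptr, size_t bytes, int num_gpus) {
        cudaMallocManaged(ptr, bytes);                    // visible to all GPUs
        cudaMemAdvise(*ptr, bytes, cudaMemAdviseSetReadMostly, 0);
        for (int d = 0; d < num_gpus; ++d)
            cudaMemPrefetchAsync(*ptr, bytes, d);         // replicate to GPU d
    }
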
Dec 1

Neural GPUs Learn Algorithms

Learning an algorithm from examples is a fundamental problem that has been widely studied. Recently it has been addressed using neural networks, in particular by Neural Turing Machines (NTMs). These are fully differentiable computers that use backpropagation to learn their own programming. Despite their appeal, NTMs have a weakness that is caused by their sequential […]
Nov 29

Reordering GPU Kernel Launches to Enable Efficient Concurrent Execution

Contemporary GPUs allow concurrent execution of small computational kernels in order to prevent idling of GPU resources. Despite the potential concurrency between independent kernels, the order in which kernels are issued to the GPU can significantly influence application performance. A technique for deriving suitable kernel launch orders is therefore presented, with the aim of […]
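
A minimal illustration of why issue order matters (the kernel bodies and sizes are placeholders): independent kernels only run concurrently when placed in different streams, and the hardware dispatches work in launch order, so issuing the small kernel first lets it fill resources left over by the large one.

    #include <cuda_runtime.h>

    __global__ void small_kernel(float *x) { x[threadIdx.x] += 1.0f; }
    __global__ void large_kernel(float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMalloc(&x, 128 * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        // Issue order matters: launched first, the small kernel can slot in
        // beside the large one; reversed, it may wait for the device to drain.
        small_kernel<<<1, 128, 0, s1>>>(x);
        large_kernel<<<(n + 255) / 256, 256, 0, s2>>>(y, n);
        cudaDeviceSynchronize();
        cudaFree(x); cudaFree(y);
        return 0;
    }
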
Nov 29

Design, Implementation and Performance Evaluation of a Stochastic Gradient Descent Algorithm on CUDA

Stochastic Gradient Descent (SGD), a stochastic variant of Gradient Descent, is an algorithm used in a range of problems, such as linear regression and logistic regression. After the Netflix Prize, SGD also began to be used in recommender systems to compute matrix factorizations. Considering the large amounts of data that this kind of system […]
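
A hedged sketch of the per-rating SGD update at the heart of matrix factorization on CUDA (the names, layout, and hyper-parameters are illustrative; real implementations must also manage conflicting updates to shared user and item rows, which is the hard part on a GPU):

    // P: num_users x k user factors; Q: num_items x k item factors.
    // One rating per thread; lr = learning rate, reg = L2 regularization.
    __global__ void sgd_update(const int *user, const int *item, const float *rating,
                               float *P, float *Q, int k, int num_ratings,
                               float lr, float reg) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= num_ratings) return;
        float *p = &P[user[t] * k];
        float *q = &Q[item[t] * k];
        float pred = 0.0f;
        for (int f = 0; f < k; ++f) pred += p[f] * q[f];
        float err = rating[t] - pred;              // residual for this rating
        for (int f = 0; f < k; ++f) {
            float pf = p[f], qf = q[f];
            p[f] += lr * (err * qf - reg * pf);    // step on user factors
            q[f] += lr * (err * pf - reg * qf);    // step on item factors
        }
    }
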

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
