high performance computing on graphics processing units: hgpu.org

Posts

Dec, 6

Parallelization Methods of the Template Matching Method on Graphics Accelerators

Template matching is a classic technique used in image processing for object detection. It is based on multiple matrix-based calculations, where there are no dependencies on partial results, so parallel solutions could be created. In this article two GPU implemented methods are presented and compared to the CPU-based sequential solution.

CUDA

Dec, 6

A Study of Parallel Sorting Algorithms Using CUDA and OpenMP

This thesis compares parallel languages by their time complexity, using sorting algorithms coded in CUDA and OpenMP. It evaluates solutions for parallelism at a maintainable cost in money and effort, aiming for acceptable timing results when the parallel languages are compared with each other, as well as […]

CUDA

Dec, 4

The Genetic Convolutional Neural Network Model Based on Random Sample

The training result of a convolutional neural network (CNN) is affected by the initial values of its weights, so the trained model does not necessarily give the best feature representation. A genetic algorithm can help choose better features. However, there has been almost no literature on combining genetic […]

CUDA

Dec, 4

An Accelerator based on the rho-VEX Processor: an Exploration using OpenCL

In recent years the use of co-processors to accelerate specific tasks has become more common. To simplify the use of these accelerators in software, the OpenCL framework has been developed. This framework gives programs a cross-platform interface for using accelerators. The rho-VEX processor is a run-time reconfigurable VLIW processor. It allows run-time switching of configurations, […]

OpenCL

Dec, 4

Optimizing CUDA Shared Memory Usage

CUDA shared memory is fast, on-chip storage. However, the bank conflict issue could cause a performance bottleneck. Current NVIDIA Tesla GPUs support memory bank accesses with configurable bit-widths. While this feature provides an efficient bank mapping scheme for 32-bit and 64-bit data types, it becomes trickier to solve the bank conflict problem through manual code […]

CUDA
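
The bank conflict problem in the abstract above can be made concrete with a small model. This is a sketch under stated assumptions, not the paper's method: it assumes 32 banks of 32-bit words and a 32-thread warp, and the helper names `conflict_degree` and `column_access` are hypothetical. It counts how many distinct words a warp requests from the most heavily loaded bank when reading one column of a row-major tile.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, each serving one 32-bit word per cycle
WARP_SIZE = 32  # assumption: 32 threads issue the access together


def conflict_degree(word_indices):
    """Largest number of distinct words that fall into a single bank.

    1 means conflict-free; N means the access is serialized into N passes.
    Threads reading the same word are counted once (broadcast).
    """
    per_bank = Counter(word % NUM_BANKS for word in set(word_indices))
    return max(per_bank.values())


def column_access(row_stride, column=0):
    """Word indices touched when each lane reads one row of a fixed column."""
    return [lane * row_stride + column for lane in range(WARP_SIZE)]


unpadded = conflict_degree(column_access(row_stride=32))  # 32x32 tile
padded = conflict_degree(column_access(row_stride=33))    # 32x33 tile (one pad word per row)
print(unpadded, padded)
```

In this model a stride of 32 words sends every lane to the same bank, while padding each row by one word spreads the lanes across all banks. This is the kind of manual change the abstract describes as getting trickier once bank widths are configurable.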

Dec, 4

Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations

The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naive mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations. In […]

OpenCL

Dec, 4

An Efficient Parallel Algorithm for Graph Isomorphism on GPU using CUDA

Modern Graphics Processing Units (GPUs) offer high computation power at low cost. Recently, many applications in various fields have been run efficiently on the GPU using CUDA. In this paper, we propose an efficient parallel algorithm for graph isomorphism which runs on the GPU using CUDA for matching large graphs. Parallelization of a sequential graph […]

CUDA

Dec, 1

Programming in CUDA for Kepler and Maxwell Architecture

Since the first version of CUDA was launched, many improvements have been made in GPU computing. Every new CUDA version has included important novel features, bringing this architecture closer and closer to a typical parallel high-performance language. This tutorial will present the GPU architecture and CUDA principles, trying to conceptualize novel features included by […]

CUDA

Dec, 1

Auxiliary Image Regularization for Deep CNNs with Noisy Labels

Precisely labeled data sets with a sufficient number of samples are notably important for training deep convolutional neural networks (CNNs). However, many of the available real-world data sets contain erroneously labeled samples, and errors in the labels of training samples make it a daunting task to learn a well-performing deep CNN model. In this work, we consider […]

CUDA

Dec, 1

A General Framework for Constrained Bayesian Optimization using Information-based Search

We present an information-theoretic framework for solving global black-box optimization problems that also have black-box constraints. Of particular interest to us is to efficiently solve problems with decoupled constraints, in which subsets of the objective and constraint functions may be evaluated independently. For example, when the objective is evaluated on a CPU and the constraints […]

CUDA

Dec, 1

Efficient Static and Dynamic Memory Management Techniques for Multi-GPU Systems

There are four trends in modern high-performance computing (HPC) that have led to an increased need for efficient memory management techniques for heterogeneous systems (such as one fitted with GPUs). First, the average size of datasets for HPC applications is rapidly increasing. Read-only input matrices that used to be on the order of megabytes or […]

CUDA

Dec, 1

Bridging OpenCL and CUDA: A Comparative Analysis and Translation

Heterogeneous systems are widening their user-base, and heterogeneous computing is becoming popular in supercomputing. Among others, OpenCL and CUDA are the most popular programming models for heterogeneous systems. Although OpenCL inherited many features from CUDA and they have almost the same platform model, they are not compatible with each other. In this paper, we present […]

CUDA • OpenCL

* * *

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA
