high performance computing on graphics processing units: hgpu.org

Posts

Apr, 9

An approach of tool paths generation for CNC machining based on CUDA

This paper presents a new tool path generation method for CNC machining based on GPU-CPU fusion calculation. CUDA, a general-purpose parallel computing architecture, is provided by NVIDIA to address massively data-parallel computing problems. The tool path generation algorithm, based on the isoparametric method, was redesigned to use CUDA. The final comparison experiment […]

CUDA

Apr, 9

Scalable instruction set simulator for thousand-core architectures running on GPGPUs

Simulators are still the primary tools for development and performance evaluation of applications running on massively parallel architectures. However, current virtual platforms are not able to tackle the complexity issues introduced by 1000-core future scenarios. We present a fast and accurate simulation framework targeting extremely large parallel systems by specifically taking advantage of the inherent […]

CUDA

Apr, 9

Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU

As one of the most popular accelerators, the Graphics Processing Unit (GPU) has demonstrated high computing power in several application fields. On the other hand, the GPU also consumes a great deal of power and has become one of the largest power consumers in desktop and supercomputer systems. However, software power optimization method targeted for GPU has not […]

CUDA

Apr, 9

A simple and efficient way to compute depth maps for multi-view videos

This paper deals with depth map extraction from multi-view video. Contrary to standard stereo matching-based approaches, depth maps are computed here using optical flow estimations between consecutive views. We compare our approach with the one proposed in the Depth Estimation Reference Software (DERS) for normalization purposes in the ISO-MPEG 3DV group. Experiments conducted on sequences […]

Apr, 9

Data-parallel algorithms for large-scale real-time simulation of the cellular Potts model on graphics processing units

In this paper we present techniques for data-parallel execution of the cellular Potts model (CPM) on graphics processing units (GPUs). We have developed data structures and algorithms that are optimized to use available hardware resources on the GPU. To the best of our knowledge, this is the first attempt at using data-parallel techniques for simulating […]

CUDA • OpenGL

Apr, 9

Real-time parallel remote rendering for mobile devices using graphics processing units

Demand for 3D visualization is increasing on mobile devices as users have come to expect more realistic, immersive experiences. However, the limited networking and computing resources of mobile devices remain a challenge. One solution is a proxy-based framework that offloads the burden of rendering computation from mobile devices to more powerful servers. We present the […]

Apr, 9

A Parallel Gibbs Sampling Algorithm for Motif Finding on GPU

A motif is an overrepresented pattern in biological sequences, and motif finding is an important problem in bioinformatics. Because of the high computational complexity of motif finding, ever more computing capability is required as the volume of available biological data, such as gene transcription data, grows rapidly. Among many motif finding algorithms, Gibbs sampling is an effective method […]

CUDA

Apr, 9

The method of improving performance of the GPU-accelerated 2D FDTD simulator

In this paper, several methods of optimizing a parallel implementation of the 2D FDTD algorithm are presented. Some practical problems occurring in real simulations are taken into consideration. Moreover, the presented methods are supported with appropriate tests and practical examples.

Apr, 8

Parallel Dense Gauss-Seidel Algorithm on Many-Core Processors

The Gauss-Seidel method is very efficient for solving problems such as tightly coupled constraints with possible redundancies. However, the underlying algorithm is inherently sequential. Previous works have exploited sparsity in the system matrix to extract parallelism. In this paper, we propose to study several parallelization schemes for fully coupled systems that cannot be parallelized by existing methods, […]

CUDA

Apr, 8

Optimize or Wait? Using llc Fast-Prototyping Tool to Evaluate CUDA Optimizations

Over the last few years, we have witnessed the proliferation of GPU devices in HPC environments. Manufacturers produce new versions of their devices every few years, though, posing a new problem for scientists and engineers using their technology: is it worth the time and effort spent optimizing the codes for the current version? Or it is […]

CUDA

Apr, 8

Support Vector Machines on GPU with Sparse Matrix Format

The emerging general-purpose Graphics Processing Unit (GPU) provides a multi-core platform for a wide range of applications, including machine learning algorithms. In this paper, we propose several techniques to accelerate Support Vector Machines (SVM) on GPUs. A sparse matrix format is introduced into parallel SVM to achieve better performance. Experimental results show that a speedup of 55x-133.8x over LIBSVM can […]

Apr, 8

High-Speed Implementations of Block Cipher ARIA Using Graphics Processing Units

The power of the graphics processing unit (GPU) has been increasing more rapidly than that of the CPU. It is not surprising that many software libraries have been developed that enable us to use the power of the GPU for general computations, especially parallel data processing. In this paper, we propose implementations of the standard block cipher ARIA of […]

CUDA • OpenGL

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

GigaAPI: a user-space API that simplifies multi-GPU programming, bridging the gap between the capabilities of parallel GPU systems and the ability of developers to harness their full potential

GigaAPI for GPU Parallelization

Coccinelle: a C code transformation engine using SmPL for matches, refactorings, and bug fixing

Advances in Semantic Patching for HPC-oriented Refactorings with Coccinelle

DuoReduce: MLIR's benchmark

Hardware-Assisted Software Testing and Debugging for Heterogeneous Computing


* * *

Recent source codes

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA

Data-efficient LLM Fine-tuning for Code Generation

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

GigaAPI: a user-space API that simplifies multi-GPU programming, bridging the gap between the capabilities of parallel GPU systems and the ability of developers to harness their full potential

Coccinelle: a C code transformation engine using SmPL for matches, refactorings, and bug fixing

DuoReduce: MLIR's benchmark

Most viewed papers (last 30 days)