high performance computing on graphics processing units: hgpu.org

Posts

Apr, 8

Support Vector Machines on GPU with Sparse Matrix Format

Emerging general-purpose Graphics Processing Units (GPUs) provide a multi-core platform for a wide range of applications, including machine learning algorithms. In this paper, we propose several techniques to accelerate Support Vector Machines (SVM) on GPUs. A sparse matrix format is introduced into parallel SVM to achieve better performance. Experimental results show that the speedup of 55x-133.8x over LIBSVM can […]
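
As a rough illustration of why a sparse matrix format helps here, the sketch below stores a small matrix in compressed sparse row (CSR) form and multiplies it by a vector while touching only the stored non-zeros. The abstract does not say which sparse format the paper uses, so CSR is an assumption, and the function names `to_csr` and `csr_matvec` are hypothetical. This is plain Python for readability, not the authors' GPU code.

```python
def to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays.

    Returns (values, col_idx, row_ptr), where row i owns the slice
    values[row_ptr[i]:row_ptr[i + 1]].
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x using only the stored non-zero entries."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Hypothetical 3x3 feature matrix with mostly zero entries.
dense = [[1, 0, 2],
         [0, 0, 3],
         [4, 0, 0]]
values, col_idx, row_ptr = to_csr(dense)
y = csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0])
```

Each output row depends only on its own slice of `values`, so on a GPU the outer loop could plausibly be assigned one row per thread or per warp.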

Apr, 8

High-Speed Implementations of Block Cipher ARIA Using Graphics Processing Units

The power of the graphics processing unit (GPU) has been increasing more rapidly than that of the CPU. It is not surprising that many software libraries have been developed that enable us to use the power of the GPU for general computations, especially in parallel data processing. In this paper, we propose implementations of the standard block cipher ARIA of […]

CUDA • OpenGL

Apr, 8

Record Setting Software Implementation of DES Using CUDA

The increase in computational power of off-the-shelf hardware offers more and more advantageous tradeoffs among efficiency, cost and availability, thus enhancing the feasibility of cryptanalytic attacks aiming to lower the security of widely used cryptosystems. In this paper we illustrate a GPU-based software implementation of the most efficient variant of Data Encryption Standard (DES), […]

CUDA

Apr, 8

On accelerating iterative algorithms with CUDA: A case study on Conditional Random Fields training algorithm for biological sequence alignment

The accuracy of Conditional Random Fields (CRF) is achieved at the cost of a huge amount of computation to train the model. In this paper we design a parallelized algorithm for Gradient Ascent based CRF training methods for biological sequence alignment. Our contribution is mainly on two aspects: 1) We flexibly parallelized the different iterative computation […]

CUDA

Apr, 8

A Comparative Study on ASIC, FPGAs, GPUs and General Purpose Processors in the O(N^2) Gravitational N-body Simulation

In this paper, we describe the implementation of gravitational force calculation for N-body simulations in the context of astrophysics. We describe high-performance implementations on general purpose processors, GPUs, and FPGAs, and compare them using a number of criteria including speed, power efficiency and cost of development. These results show that, for gravitational […]
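
For readers unfamiliar with the O(N^2) formulation named in the title, here is a minimal direct-summation sketch in plain Python. The function name `accelerations`, the softening parameter `eps`, and the unit choice `G = 1` are assumptions made for illustration; the abstract does not describe the paper's actual kernels.

```python
import math


def accelerations(pos, mass, G=1.0, eps=0.0):
    """Direct-sum gravitational accelerations, O(N^2) pair interactions.

    pos  : list of [x, y, z] positions
    mass : list of masses
    eps  : softening length (assumed parameter; avoids division by
           zero when two bodies come very close)
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(d * d for d in dx) + eps * eps
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                # a_i += G * m_j * (r_j - r_i) / |r_j - r_i|^3
                acc[i][k] += G * mass[j] * dx[k] * inv_r3
    return acc


# Two unit masses one unit apart attract each other with |a| = 1.
acc = accelerations([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], [1.0, 1.0])
```

The inner loop over `j` is independent for every `i`, which is the property that makes this calculation a common target for parallel hardware.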

Apr, 8

Program Optimization of Array-Intensive SPEC2k Benchmarks on Multithreaded GPU Using CUDA and Brook+

The Graphics Processing Unit (GPU), with many light-weight data-parallel cores, can provide substantial parallel computing power to accelerate general purpose applications. Both AMD and NVIDIA provide their own high-performance GPUs and software platforms. As floating-point computing capacity continues to increase, the “memory wall” problem becomes more serious, especially for array-intensive applications. In […]

CUDA

Apr, 8

CANSCID-CUDA

The 2010 MEMOCODE Hardware Software Co-design challenge is to implement a Deep Packet Inspection architecture, called the CANSCID – Combined Architecture for Stream Categorization and Intrusion Detection. In this short paper, we present the design details of our submission, which utilizes a Graphics Processing Unit (GPU) to accelerate parallel regular expression matching. The target […]

CUDA

Apr, 8

Shape Manipulation on GPU

This paper proposes a novel hardware-accelerated deformation algorithm based on a curve-skeleton model for 2D shape manipulation. The deformation algorithm can achieve real-time interactive shape manipulation without any pre-computing step. The deforming regions of shapes are demarcated with a simple skeleton frame and are simulated by a curve-skeleton model consisting of triangle-strips. The algorithm obtains two […]

OpenGL

Apr, 7

High throughput multiple-precision GCD on the CUDA architecture

Investigation of the cryptanalytic strength of RSA cryptography requires computing many GCDs of two long integers (e.g., of length 1024 bits). This paper presents a high throughput parallel algorithm to perform many GCD computations concurrently on a GPU based on the CUDA architecture. The experiments with an NVIDIA GeForce GTX285 GPU and a single core […]

CUDA
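
To make the RSA connection in the abstract above concrete: if two moduli share a prime factor, their GCD reveals it. The sketch below uses the binary (Stein) GCD, which needs only shifts, comparisons and subtractions. Choosing this algorithm is an assumption for illustration, since the abstract does not state which GCD variant the paper parallelizes; `binary_gcd` and `batch_gcd` are hypothetical names, and the moduli are toy values.

```python
def binary_gcd(a, b):
    """Binary (Stein) GCD for non-negative integers."""
    if a == 0:
        return b
    if b == 0:
        return a
    # Factor out the powers of two common to both operands.
    shift = 0
    while ((a | b) & 1) == 0:
        a >>= 1
        b >>= 1
        shift += 1
    # Make a odd; it stays odd for the rest of the loop.
    while (a & 1) == 0:
        a >>= 1
    while b != 0:
        while (b & 1) == 0:
            b >>= 1
        if a > b:
            a, b = b, a
        b -= a
    return a << shift


def batch_gcd(pairs):
    """Compute many independent GCDs; each pair could map to one GPU thread."""
    return [binary_gcd(a, b) for a, b in pairs]


# Toy RSA moduli: 3233 = 61 * 53 and 3599 = 61 * 59 share the prime 61.
results = batch_gcd([(3233, 3599), (48, 18), (17, 13)])
```

Python integers are arbitrary precision, so the same function works on 1024-bit operands; a GPU version would have to implement the multi-word shifts and subtractions explicitly.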

Apr, 7

Program Optimization of Stencil Based Application on the GPU-Accelerated System

The Graphics Processing Unit (GPU), with many light-weight data-parallel cores, can provide substantial parallel computational power to accelerate general purpose applications. But this computing capacity cannot be fully utilized by memory-intensive applications, which are limited by off-chip memory bandwidth and latency. Stencil computation has abundant parallelism and low computational intensity which make it a […]

Apr, 7

Scalability of Higher-Order Discontinuous Galerkin FEM Computations for Solving Electromagnetic Wave Propagation Problems on GPU Clusters

A highly parallel implementation of Maxwell’s equations in the time domain using a cluster of Graphics Processing Units (GPUs) is presented. The higher-order Discontinuous Galerkin Finite Element Method (DG-FEM) is used for spatial discretization since its characteristics match the parallelization design aspects of the NVIDIA Compute Unified Device Architecture (CUDA) programming model. Asynchronous data […]

CUDA

Apr, 7

Using GPU to Accelerate Cache Simulation

Caches play a major role in the performance of high speed computer systems. Trace-driven simulation is the most widely used method to evaluate cache architectures. However, as cache design moves to more complicated architectures, the size of the trace is also becoming larger and larger. Traditional simulation methods are no longer practical […]

CUDA
