Posts
Dec, 24
GPU Asynchronous Stochastic Gradient Descent to Speed Up Neural Network Training
The ability to train large-scale neural networks has resulted in state-of-the-art performance in many areas of computer vision. These results have largely come from computational breakthroughs of two forms: model parallelism, e.g. GPU-accelerated training, which has seen quick adoption in computer vision circles, and data parallelism, e.g. A-SGD, whose large scale has been […]
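As a rough illustration of the data-parallel half of that split, here is a minimal sketch of asynchronous SGD on a toy least-squares problem; the problem, the thread-based workers, and all sizes are illustrative assumptions, not taken from the post.

import threading
import numpy as np

# Toy least-squares data (illustrative, not from the post).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.arange(5.0) + 0.01 * rng.normal(size=1000)

w = np.zeros(5)   # shared parameters, updated without a lock
lr = 0.01

def worker(shard, seed):
    global w
    local_rng = np.random.default_rng(seed)
    for _ in range(2000):
        i = local_rng.choice(shard)
        grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (X[i] @ w - y[i])**2
        w = w - lr * grad                 # asynchronous update; races are tolerated

shards = np.array_split(np.arange(1000), 4)
threads = [threading.Thread(target=worker, args=(s, k)) for k, s in enumerate(shards)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("recovered weights:", np.round(w, 2))   # close to [0. 1. 2. 3. 4.]

In CPython the global interpreter lock serializes these updates, so the sketch only mimics the structure of lock-free A-SGD, not its speed; real systems run the workers on separate GPUs or machines against shared parameters.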
Dec, 24
Large-Scale Paralleled Sparse Principal Component Analysis
Principal component analysis (PCA) is a statistical technique commonly used in multivariate data analysis. However, PCA can be difficult to interpret and explain since the principal components (PCs) are linear combinations of the original variables. Sparse PCA (SPCA) aims to balance statistical fidelity and interpretability by approximating sparse PCs whose projections capture the maximal variance […]
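For readers new to the trade-off, a standard formulation of the first sparse PC (the excerpt does not pin down the exact variant the paper uses) is

\max_{v \in \mathbb{R}^p} \; v^\top \Sigma v \quad \text{subject to} \quad \|v\|_2 = 1, \;\; \|v\|_0 \le k,

where \Sigma is the sample covariance matrix and k caps the number of nonzero loadings. With k = p this reduces to ordinary PCA; smaller k sacrifices some explained variance for components that involve only a few of the original variables.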
Dec, 23
GPU Acceleration of Melody Accurate Matching in Query-by-Humming
With the increasing scale of melody databases, query-by-humming systems face a trade-off between response speed and retrieval accuracy. Accurate melody matching is the key factor limiting response speed. In this paper, we present a GPU acceleration method for accurate melody matching, in order to improve response speed without reducing retrieval accuracy. […]
Dec, 23
Enabling High Performance Computing in Cloud Infrastructure using Virtualized GPUs
With the advent of virtualization and Infrastructure-as-a-Service (IaaS), the broader scientific computing community is considering the use of clouds for their technical computing needs. This is due to the relative scalability, ease of use, and advanced user-environment customization that clouds provide, as well as the many novel computing paradigms available for data-intensive applications. However, there is […]
Dec, 23
Hardware Acceleration Technologies in Computer Algebra: Challenges and Impact
The objective of high performance computing (HPC) is to ensure that the computational power of hardware resources is well utilized to solve a problem. Various techniques are usually employed to achieve this goal: improving the algorithm to reduce the number of arithmetic operations, and modifying data accesses or rearranging data in order to reduce […]
Dec, 23
Single Server Multi-GPU Training of ConvNets
In this work we evaluate different approaches to parallelize computation of convolutional neural networks across several GPUs within the same server.
Dec, 23
Fast Training of Convolutional Networks through FFTs
Convolutional networks are one of the most widely employed architectures in computer vision and machine learning. In order to leverage their ability to learn complex functions, large amounts of data are required for training. Training a large convolutional network to produce state-of-the-art results can take weeks, even when using modern GPUs. Producing labels using a […]
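The gain the title refers to rests on the convolution theorem: convolution in the spatial domain becomes pointwise multiplication in the frequency domain. A minimal numpy check of that identity, in 1-D with illustrative sizes:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=64)   # input signal
k = rng.normal(size=16)   # filter

direct = np.convolve(x, k)   # O(n*m) direct convolution

n = len(x) + len(k) - 1      # full output length; rfft zero-pads to n
via_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

print(np.allclose(direct, via_fft))   # True: identical up to round-off

For a convolutional layer the same identity is applied per 2-D feature map, and the cost of the transforms is amortized because one transformed input is reused across many filters.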
Dec, 22
Resource Centered Computing delivering high parallel performance
Modern parallel programming requires a combination of different paradigms, expertise, and tuning that correspond to the different levels in today’s hierarchical architectures. To cope with the inherent difficulty, ORWL (ordered read-write locks) presents a new paradigm and toolbox centered around local or remote resources, such as data, processors, or accelerators. ORWL programmers describe their computation […]
Dec, 22
Energy Auto-tuning using the Polyhedral Approach
As the HPC community moves into the exascale computing era, application energy has become a major concern. Tuning for energy will be essential in the effort to overcome the limited power envelope. How is tuning for lower energy related to tuning for faster execution? Understanding that relationship can guide both performance and energy tuning for […]
Dec, 22
Speed-Up Improvement Using Parallel Approach in Image Steganography
This paper presents a parallel approach to address the long running times associated with sequential algorithms. An image steganography algorithm in the transform domain is considered for implementation. Image steganography is a technique for hiding a secret message in an image. With the parallel implementation, a large message can be hidden in a large image since it does not […]
Dec, 22
Numerical Simulation for the MHD System in 2D Using OpenCL
In this work we solve the MHD equations with divergence cleaning on the GPU. The method is based on the finite volume approach and Strang dimensional splitting. The simplicity of the approach makes it a good candidate for a GPU implementation with OpenCL. With adequate memory access optimization, we achieve very high speedups compared to a […]
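Strang dimensional splitting, mentioned in the excerpt, advances the 2-D system by composing one-dimensional solves. Writing S_x and S_y for the 1-D solution operators, a common form of the scheme is

u^{n+1} = S_x(\Delta t / 2) \, S_y(\Delta t) \, S_x(\Delta t / 2) \, u^n,

with a local error of O(\Delta t^3), hence second-order accuracy overall; each sweep touches contiguous lines of the grid, which is part of what makes the approach a natural fit for a GPU implementation.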
Dec, 22
Accelerating Pairwise DNA Sequence Alignment using the CUDA Compatible GPU
We present a novel approach to the pairwise DNA sequence alignment problem, as an alternative to the dynamic programming solution of the Smith-Waterman algorithm. The proposed implementation uses CUDA, the parallel computing platform and programming model created by NVIDIA. The main idea of the proposed implementation is to assign different nucleotide weights and then merge the sub-sequences of […]
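For context, the dynamic-programming baseline the authors move away from is the Smith-Waterman recurrence; with a substitution score s and a linear gap penalty g it reads

H_{i,j} = \max\{\, 0,\; H_{i-1,j-1} + s(a_i, b_j),\; H_{i-1,j} - g,\; H_{i,j-1} - g \,\}, \qquad H_{i,0} = H_{0,j} = 0.

Each cell depends on its left, upper, and upper-left neighbors, so only cells on the same anti-diagonal can be computed in parallel, which is the usual bottleneck that alternative formulations such as the one proposed here try to avoid.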