high performance computing on graphics processing units: hgpu.org

Posts

Feb, 13

Large-Scale Deep Learning on the YFCC100M Dataset

We present a work-in-progress snapshot of learning with a 15-billion-parameter deep learning network on HPC architectures applied to the largest publicly available natural image and video dataset released to date. Recent advancements in unsupervised deep neural networks suggest that scaling up such networks in both model and training dataset size can yield significant improvements […]

CUDA

Feb, 13

Primal Dual Affine Scaling on GPUs

Here we present an implementation of the primal-dual affine scaling method to solve linear optimization problems on GPU-based systems. Strategies to convert the system generated by the complementary slackness theorem into a symmetric system are given. A new CUDA-friendly technique to solve the resulting symmetric positive definite subsystem is also developed. Various strategies to reduce […]

CUDA

Feb, 12

A Real-time GPU Implementation of the SIFT Algorithm for Large-Scale Video Analysis Tasks

The SIFT algorithm is one of the most popular feature extraction methods and is therefore widely used in all sorts of video analysis tasks, such as instance search and duplicate/near-duplicate detection. We present an efficient GPU implementation of the SIFT descriptor extraction algorithm using CUDA. The major steps of the algorithm are presented and for each step […]

CUDA

Feb, 10

FSCL: Homogeneous programming, scheduling and execution on heterogeneous platforms

The last few years have seen activity towards programming models, languages and frameworks to address the increasingly wide range and broad availability of heterogeneous computing resources through raised programming abstraction and portability across different platforms. The effort spent in simplifying parallel programming across heterogeneous platforms is often outweighed by the need for low-level control over […]

OpenCL

Feb, 10

GPU-accelerated HMM for Speech Recognition

Speech recognition is used in a wide range of applications and devices such as mobile phones, in-car entertainment systems and web-based services. Hidden Markov Models (HMMs) are one of the most popular algorithmic approaches applied in speech recognition. Training and testing an HMM is computationally intensive and time-consuming. Running multiple applications concurrently with speech recognition […]

CUDA

Feb, 10

Analysis and Modeling of the Timing Behavior of GPU Architectures

Graphics processing units (GPUs) offer massive parallelism. For the past few years, GPUs have also been usable for more general-purpose applications; a wide variety of applications can be accelerated efficiently with the use of the CUDA and OpenCL programming models. Real-time systems frequently use many sensors that produce a large amount of data. GPUs […]

CUDA

Feb, 10

Patterns and Rewrite Rules for Systematic Code Generation (From High-Level Functional Patterns to High-Performance OpenCL Code)

Computing systems have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort. This results in a tension between achieving performance and code portability. Code is either tuned using device-specific optimizations to achieve maximum performance or is […]

OpenCL

Feb, 10

CAVE-CL: An OpenCL version of the package for detection and quantitative analysis of internal cavities in a system of overlapping balls: application to proteins

Here we present the revised and newly rewritten version of our previously published CAVE package [J. Busa et al., Comput. Phys. Commun. 181 (2010) 2116], which was originally written in FORTRAN. The package has been rewritten in C, and the algorithm has been parallelized and implemented using OpenCL. This makes the program convenient to run […]

OpenCL

Feb, 9

A Survey of Architectural Techniques For DRAM Power Management

Recent trends of CMOS technology scaling and widespread use of multicore processors have dramatically increased the power consumption of main memory. It has been estimated that modern data centers spend more than 30% of their total power consumption on main memory alone. This excessive power dissipation has created the problem of the “memory power wall”, which has […]

Feb, 9

FIR filtering and AES encryption with OpenCL 2.0

OpenCL has become a popular standard to leverage the unique power/performance opportunities found on heterogeneous systems. In this short contribution, we evaluate the latest parallel programming features supported in the OpenCL 2.0 standard. We explore using shared virtual memory and dynamic parallelism to accelerate two example applications.

OpenCL

Feb, 9

Speech Recognition on Modern Graphic Processing Units

Speech Recognition running on Graphic Processing Units (GPUs) has shown promising performance improvements, with speedups ranging from 2x to 10x when compared to execution on CPUs. GPUs have continued to introduce new programming features, such as Dynamic Parallelism and Hyper-Q, that could further benefit Speech Recognition processing. In this paper we describe a framework developed at Northeastern describing […]

CUDA

Feb, 9

Fast Subgraph Matching on Large Graphs using Graphics Processors

Subgraph matching is the task of finding all matches of a query graph in a large data graph, which is known to be an NP-complete problem. Many algorithms have been proposed to solve this problem using CPUs. In recent years, Graphics Processing Units (GPUs) have been adopted to accelerate fundamental graph operations such as breadth-first search and […]

CUDA
