high performance computing on graphics processing units: hgpu.org

Posts

Jun, 22

CNNLab: a Novel Parallel Framework for Neural Networks using GPU and FPGA-a Practical Study with Trade-off Analysis

Designing and implementing efficient, provably correct parallel neural network processing is challenging. Existing high-level parallel abstractions like MapReduce are insufficiently expressive, while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. However, data diversity and large data sizes pose a significant challenge to constructing a flexible and high-performance […]

CUDA • OpenCL

Jun, 22

Tensor Contractions with Extended BLAS Kernels on CPU and GPU

Tensor contractions constitute a key computational ingredient of numerical multi-linear algebra. However, as the order and dimension of tensors grow, the time and space complexities of tensor-based computations grow quickly. Existing approaches for tensor contractions typically involve explicit copy and transpose operations. In this paper, we propose and evaluate a new BLAS-like primitive STRIDEDBATCHEDGEMM that […]

CUDA

Jun, 22

Soft GPGPUs for Embedded FPGAs: An Architectural Evaluation

We present a customizable soft architecture which allows for the execution of GPGPU code on an FPGA without the need to recompile the design. Issues related to scaling the overlay architecture to multiple GPGPU multiprocessors are considered along with application-class architectural optimizations. The overlay architecture is optimized for FPGA implementation to support efficient use of […]

CUDA

Jun, 21

YodaNN: An Ultra-Low Power Convolutional Neural Network Accelerator Based on Binary Weights

Convolutional Neural Networks (CNNs) have revolutionized the world of image classification over the last few years, pushing computer vision close to, and even beyond, human accuracy. The computational effort of CNNs today requires power-hungry parallel processors and GP-GPUs. Recent efforts in designing CNN Application-Specific Integrated Circuits (ASICs) and accelerators for System-On-Chip (SoC) integration have achieved very […]

Jun, 21

Combinatorial Optimization of Work Distribution on Heterogeneous Systems

We describe an approach that uses combinatorial optimization and machine learning to share the work between the host and device of heterogeneous computing systems such that the overall application execution time is minimized. We propose to use combinatorial optimization to search for the optimal system configuration in the given parameter space (such as the number […]

Jun, 21

cltorch: a Hardware-Agnostic Backend for the Torch Deep Neural Network Library, Based on OpenCL

This paper presents cltorch, a hardware-agnostic backend for the Torch neural network framework. cltorch enables training of deep neural networks on GPUs from diverse hardware vendors, including AMD, NVIDIA, and Intel. cltorch contains sufficient implementation to run models such as AlexNet, VGG, Overfeat, and GoogleNet. It is written using the OpenCL language, a portable compute […]

OpenCL

Jun, 21

Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump Using CUDA-enabled GPU Hardware

This paper focuses on the anticipatory enhancement of methods of detecting stealth software. Cyber security detection tools are insufficiently powerful to reveal the most recent cyber-attacks which use malware. In this paper, we first present an idea of the highest-stealth malware, as this is the most complicated scenario for detection because it combines […]

CUDA

Jun, 21

A Parallel Algorithm for LZW Decompression, with GPU Implementation

The main contribution of this paper is to present a parallel algorithm for LZW decompression and to implement it on a CUDA-enabled GPU. Since sequential LZW decompression creates a dictionary table by reading codes in a compressed file one by one, its parallelization is not an easy task. We first present a parallel LZW decompression […]

CUDA
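To make the sequential dependency mentioned in the LZW abstract concrete, here is a minimal Python sketch of textbook sequential LZW decompression. It is not the parallel algorithm from the paper; the function names `lzw_compress` and `lzw_decompress`, the 256-symbol initial alphabet, and the use of plain integer code lists (rather than a packed file format) are all assumptions made for illustration.

```python
def lzw_compress(data, alphabet_size=256):
    """Textbook LZW compression: bytes in, list of integer codes out."""
    table = {bytes([i]): i for i in range(alphabet_size)}
    codes = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate
        else:
            codes.append(table[current])
            table[candidate] = len(table)  # next free code
            current = candidate[-1:]
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes, alphabet_size=256):
    """Textbook sequential LZW decompression: integer codes in, bytes out."""
    # The table starts with every single-symbol string.
    table = [bytes([i]) for i in range(alphabet_size)]
    if not codes:
        return b""
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code < len(table):
            entry = table[code]
        elif code == len(table):
            # The code refers to the entry that is being built right now.
            entry = prev + prev[:1]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        out.append(entry)
        # Each new table entry depends on the previously decoded string,
        # so codes have to be processed one after another.
        table.append(prev + entry[:1])
        prev = entry
    return b"".join(out)


if __name__ == "__main__":
    message = b"TOBEORNOTTOBEORTOBEORNOT"
    print(lzw_decompress(lzw_compress(message)) == message)
```

Every iteration of the decompression loop reads the table state left by the previous iteration, which is one way to see why a straightforward split of the code stream across threads does not work.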

Jun, 16

Electric potential and field calculation of charged BEM triangles and rectangles by Gaussian cubature

It is a widely held view that analytical integration is more accurate than numerical integration. In some special cases, however, numerical integration can be more advantageous than analytical integration. In our paper we show this benefit for the case of electric potential and field computation of charged triangles and rectangles applied in the boundary […]

OpenCL

Jun, 16

NCAM: Near-Data Processing for Nearest Neighbor Search

Deep down in many applications like natural language processing (NLP), vision, and robotics is a form of the k-nearest neighbor search algorithm (kNN). The kNN algorithm is primarily bottlenecked by data movement, limiting throughput and incurring latency in these applications. While there do exist well-bounded kNN approximations that improve the performance of kNN, these […]

CUDA
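As a rough illustration of the kNN search described in the NCAM abstract, the following Python sketch performs an exact brute-force k-nearest neighbor query. It is not the near-data design from the paper; the function name `knn_brute_force`, the tuple-based point representation, and the choice of Euclidean distance are assumptions made for this example.

```python
import heapq
import math


def knn_brute_force(points, query, k):
    """Return the k points closest to `query` by Euclidean distance."""
    # Every stored point is read once per query, and only a distance
    # is computed from it, so memory traffic grows with the dataset size.
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))


if __name__ == "__main__":
    dataset = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (2.0, 2.0)]
    print(knn_brute_force(dataset, (0.9, 0.9), k=2))
```

The arithmetic per point is small compared with the cost of fetching the point, which is consistent with the abstract's remark that data movement is the primary bottleneck.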

Jun, 16

Splotch: porting and optimizing for the Xeon Phi

With the increasing size and complexity of data produced by large-scale numerical simulations, it is of primary importance for scientists to be able to exploit all available hardware in heterogeneous High Performance Computing environments for increased throughput and efficiency. We focus on the porting and optimization of Splotch, a scalable visualization algorithm, to utilize […]

CUDA • OpenCL

Jun, 16

Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs

We perform a study of the factors affecting training time in multi-device deep learning systems. Given a specification of a convolutional neural network, we study how to minimize the time to train this model on a cluster of commodity CPUs and GPUs. Our first contribution focuses on the single-node setting, in which we show that […]

CUDA

Engineering Supercomputing Platforms for Biomolecular Applications

Recent source codes

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)