
Posts

Jan, 19

Hardware Implementation and Quantization of Tiny-Yolo-v2 using OpenCL

The trend of increasing model size in Deep Neural Network (DNN) algorithms boosts the performance of visual recognition tasks. These gains in performance have come at the cost of increased computational complexity and memory bandwidth. Recent studies have explored the fixed-point implementation of DNN algorithms such as AlexNet and VGG on Field Programmable Gate […]
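As a rough illustration of what a fixed-point representation involves, the sketch below maps float weights into a Q2.6 int8 format; the format choice and helper names are assumptions for illustration, not the quantization scheme used in the paper.

```cpp
// Minimal sketch of mapping float weights into a Q2.6 fixed-point int8 format
// (2 integer bits, 6 fractional bits). The format choice is an assumption for
// illustration, not the paper's quantization scheme.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

constexpr int   FRAC_BITS = 6;               // Q2.6 layout (assumed)
constexpr float SCALE     = 1 << FRAC_BITS;  // 64.0f

int8_t to_q2_6(float x) {
    // Round to the nearest representable step and saturate to the int8 range.
    return static_cast<int8_t>(std::clamp(std::lround(x * SCALE), -128L, 127L));
}

float from_q2_6(int8_t q) { return q / SCALE; }

int main() {
    std::vector<float> weights = {0.42f, -1.3f, 0.07f, 1.99f};
    for (float w : weights) {
        int8_t q = to_q2_6(w);
        std::cout << w << " -> " << from_q2_6(q) << "\n";  // rounding error is visible
    }
    return 0;
}
```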
Jan, 19

Towards High Performance Java-based Deep Learning Frameworks

The advent of modern cloud services, along with the huge volume of data produced on a daily basis, has driven the demand for fast and efficient data processing. This demand is common among numerous application domains, such as deep learning, data mining, and computer vision. Prior research has focused on employing hardware accelerators as a […]
Jan, 12

Static Analysis and Dynamic Adaptation of Parallelism

Scientific applications have an increasing need for resources, and many grand scientific challenges require exascale compute capabilities to be addressed. One major concern in achieving exascale is programmability. New automatic methods are required to fill the gap between developers of scientific applications and HPC experts. In addition, as scientific applications are becoming more and more […]
Jan, 12

Performance-Oriented Neural Architecture Search

Hardware-Software Co-Design is a highly successful strategy for improving performance of domain-specific computing systems. We argue for the application of the same methodology to deep learning; specifically, we propose to extend neural architecture search with information about the hardware to ensure that the model designs produced are highly efficient in addition to the typical criteria […]
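As one hedged illustration of what folding hardware information into the search objective can look like, the sketch below weights a candidate architecture's accuracy by a latency penalty in the style of multiplicative hardware-aware rewards; the struct, target latency, and exponent are hypothetical, not the paper's formulation.

```cpp
// Illustrative sketch (not the paper's method) of weighting a candidate
// architecture's accuracy by a hardware latency penalty, the basic idea
// behind hardware-aware neural architecture search objectives.
#include <cmath>
#include <iostream>

struct Candidate {
    double accuracy;    // validation accuracy of the candidate model
    double latency_ms;  // measured or predicted latency on the target device
};

// Reward = accuracy * (target / latency)^beta, a common multiplicative form;
// beta and target_ms are hypothetical tuning knobs.
double hw_aware_reward(const Candidate& c, double target_ms, double beta) {
    return c.accuracy * std::pow(target_ms / c.latency_ms, beta);
}

int main() {
    Candidate fast{0.74, 8.0};
    Candidate slow{0.76, 20.0};
    std::cout << hw_aware_reward(fast, 10.0, 0.5) << " vs "
              << hw_aware_reward(slow, 10.0, 0.5) << "\n";  // fast model wins
    return 0;
}
```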
Jan, 12

An Efficient Hardware Accelerator for Structured Sparse Convolutional Neural Networks on FPGAs

Deep Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in a wide range of applications. However, deeper CNN models, which are usually computationally expensive, are widely required for complex Artificial Intelligence (AI) tasks. Though recent progress on network compression, such as pruning, has emerged as a promising direction for mitigating the computational burden, existing accelerators […]
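The appeal of structured sparsity for hardware can be seen in a toy form below: when whole rows (output channels) are pruned, the surviving work stays dense and regular, with no per-element index decoding. The storage layout and names are illustrative assumptions, not the accelerator design from the paper.

```cpp
// Toy illustration of why structured (whole-row) pruning is hardware-friendly:
// pruned rows are dropped from storage and skipped entirely, so the remaining
// work stays dense and regular with no per-element index bookkeeping.
#include <iostream>
#include <vector>

struct RowPrunedMatrix {
    int cols;                     // columns per row
    std::vector<int> kept_rows;   // indices of rows that survived pruning
    std::vector<float> values;    // kept rows stored densely, back to back
};

std::vector<float> matvec_row_pruned(const RowPrunedMatrix& m,
                                     const std::vector<float>& x, int out_rows) {
    std::vector<float> y(out_rows, 0.f);
    for (size_t r = 0; r < m.kept_rows.size(); ++r) {
        float acc = 0.f;
        for (int c = 0; c < m.cols; ++c)
            acc += m.values[r * m.cols + c] * x[c];  // dense, regular inner loop
        y[m.kept_rows[r]] = acc;
    }
    return y;
}

int main() {
    // 4x3 weight matrix with rows 1 and 3 pruned away.
    RowPrunedMatrix m{3, {0, 2}, {1, 2, 3, 4, 5, 6}};
    std::vector<float> x = {1.f, 1.f, 1.f};
    for (float v : matvec_row_pruned(m, x, 4)) std::cout << v << " ";
    std::cout << "\n";  // prints: 6 0 15 0
    return 0;
}
```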
Jan, 12

Fast Turnaround HLS Debugging using Dependency Analysis and Debug Overlays

High-level synthesis (HLS) has gained considerable traction in recent years, as it allows for faster development and verification of hardware accelerators than traditional RTL design. While HLS allows most bugs to be caught during software verification, certain non-deterministic or data-dependent bugs still require debugging the actual hardware system during execution. Recent work has […]
Jan, 12

A Parallel Sparse Tensor Benchmark Suite on CPUs and GPUs

Tensor computations present significant performance challenges that impact a wide spectrum of applications, ranging from machine learning, healthcare analytics, social network analysis, and data mining to quantum chemistry and signal processing. Efforts to improve the performance of tensor computations include exploring data layout, execution scheduling, and parallelism in common tensor kernels. This work presents a benchmark […]
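For readers unfamiliar with the kernel shapes such suites typically cover, the sketch below shows a sparse tensor-times-vector contraction over a COO-stored 3-way tensor; the data structures and names are generic assumptions, not code from the benchmark suite itself.

```cpp
// Sketch of a sparse tensor-times-vector kernel over a COO-stored 3-way
// tensor: Y(i,j) = sum_k X(i,j,k) * v(k). Kernels of this shape (TTV, TTM,
// MTTKRP) are typical of sparse tensor benchmarks; this is a generic sketch.
#include <iostream>
#include <map>
#include <utility>
#include <vector>

struct CooEntry { int i, j, k; float val; };

std::map<std::pair<int, int>, float>
ttv(const std::vector<CooEntry>& X, const std::vector<float>& v) {
    std::map<std::pair<int, int>, float> Y;
    for (const auto& e : X)
        Y[{e.i, e.j}] += e.val * v[e.k];  // contract the third mode
    return Y;
}

int main() {
    std::vector<CooEntry> X = {{0, 0, 1, 2.f}, {0, 0, 2, 3.f}, {1, 2, 0, 4.f}};
    std::vector<float> v = {1.f, 10.f, 100.f};
    for (const auto& [ij, val] : ttv(X, v))
        std::cout << "Y(" << ij.first << "," << ij.second << ") = " << val << "\n";
    return 0;  // prints Y(0,0) = 320 and Y(1,2) = 4
}
```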
Jan, 5

A Unified Iteration Space Transformation Framework for Sparse and Dense Tensor Algebra

We address the problem of optimizing mixed sparse and dense tensor algebra in a compiler. We show that standard loop transformations, such as strip-mining, tiling, collapsing, parallelization and vectorization, can be applied to irregular loops over sparse iteration spaces. We also show how these transformations can be applied to the contiguous value arrays of sparse […]
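A small sketch of the idea, under the assumption of a CSR-like format: standard strip-mining is applied to the contiguous nonzero position loop of a sparse row rather than to a dense index range. This is illustrative only, not the paper's compiler-generated code.

```cpp
// Sketch of strip-mining applied to a sparse iteration space: the contiguous
// nonzero *position* range of a CSR row is tiled, rather than a dense index
// range. Illustrative only.
#include <algorithm>
#include <iostream>
#include <vector>

void spmv_csr_tiled(const std::vector<int>& rowptr, const std::vector<int>& colidx,
                    const std::vector<float>& vals, const std::vector<float>& x,
                    std::vector<float>& y, int tile) {
    int rows = static_cast<int>(rowptr.size()) - 1;
    for (int i = 0; i < rows; ++i) {
        float acc = 0.f;
        // Strip-mine the position loop [rowptr[i], rowptr[i+1]) into tiles.
        for (int pp = rowptr[i]; pp < rowptr[i + 1]; pp += tile) {
            int end = std::min(pp + tile, rowptr[i + 1]);
            for (int p = pp; p < end; ++p)   // inner tile: unroll/vectorize target
                acc += vals[p] * x[colidx[p]];
        }
        y[i] = acc;
    }
}

int main() {
    // 2x3 matrix [1 0 2; 0 3 0] in CSR form.
    std::vector<int> rowptr = {0, 2, 3}, colidx = {0, 2, 1};
    std::vector<float> vals = {1.f, 2.f, 3.f}, x = {1.f, 1.f, 1.f}, y(2);
    spmv_csr_tiled(rowptr, colidx, vals, x, y, 2);
    std::cout << y[0] << " " << y[1] << "\n";  // prints: 3 3
    return 0;
}
```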
Jan, 5

Sparse matrix partitioning for optimizing SpMV on CPU-GPU heterogeneous platforms

The sparse matrix–vector multiplication (SpMV) kernel dominates the computing cost in numerous applications. Most of the existing studies dedicated to improving this kernel have targeted just one type of processing unit, mainly multicore CPUs or graphics processing units (GPUs), and have not explored the potential of the rapidly emerging CPU-GPU heterogeneous platforms. To take […]
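To make the partitioning problem concrete, here is a toy sketch that splits a CSR matrix's rows between GPU and CPU by balancing the nonzero count; the split ratio and helper function are hypothetical illustrations, not the scheme proposed in the paper.

```cpp
// Toy sketch of partitioning a CSR matrix's rows between GPU and CPU by
// balancing nonzero counts. The gpu_share ratio is a hypothetical knob.
#include <iostream>
#include <vector>

// Rows [0, split) are assigned to the GPU, rows [split, n) to the CPU.
int partition_rows(const std::vector<int>& rowptr, double gpu_share) {
    int nnz = rowptr.back();
    double target = gpu_share * nnz;  // nonzeros we want on the GPU
    int rows = static_cast<int>(rowptr.size()) - 1;
    for (int i = 0; i < rows; ++i)
        if (rowptr[i + 1] >= target) return i + 1;
    return rows;
}

int main() {
    // Row nonzero counts 4, 4, 2, 2; rowptr is their prefix sum.
    std::vector<int> rowptr = {0, 4, 8, 10, 12};
    int split = partition_rows(rowptr, 0.7);  // aim ~70% of nonzeros at the GPU
    std::cout << "GPU rows: [0, " << split << "), CPU rows: [" << split << ", 4)\n";
    return 0;
}
```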
Jan, 5

Pipelined Training with Stale Weights of Deep Convolutional Neural Networks

The growth in the complexity of Convolutional Neural Networks (CNNs) is increasing interest in partitioning a network across multiple accelerators during training and pipelining the backpropagation computations over the accelerators. Existing approaches avoid or limit the use of stale weights through techniques such as micro-batching or weight stashing. These techniques either underutilize accelerators or […]
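As a back-of-the-envelope picture of where staleness comes from, the toy schedule below assumes a fixed-depth pipeline in which the forward pass of each micro-batch sees weights that are roughly pipeline-depth-minus-one updates old; the numbers are illustrative, not the paper's analysis.

```cpp
// Toy schedule showing where weight staleness comes from in pipelined
// backpropagation: with a pipeline of depth S and no stalling or stashing,
// the forward pass of micro-batch t sees weights roughly S - 1 updates old.
#include <algorithm>
#include <iostream>

int main() {
    const int stages = 4;         // pipeline depth (assumed)
    const int microbatches = 8;
    for (int t = 0; t < microbatches; ++t) {
        int weight_version = std::max(0, t - (stages - 1));
        std::cout << "micro-batch " << t << ": forward uses weights from update "
                  << weight_version << " (staleness " << t - weight_version << ")\n";
    }
    return 0;
}
```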
Jan, 5

Towards Unified INT8 Training for Convolutional Neural Network

Recently, low-bit (e.g., 8-bit) network quantization has been extensively studied to accelerate inference. Beyond inference, low-bit training with quantized gradients can bring further considerable acceleration, since the backward process is often computation-intensive. Unfortunately, inappropriate quantization of backward propagation usually makes training unstable and can even cause it to crash. There is as yet no successful unified low-bit […]
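A minimal sketch of what quantizing a gradient tensor to INT8 involves, assuming simple symmetric per-tensor scaling (an assumption, not the paper's method); the closing comment hints at why naive scaling of small, long-tailed gradients can feed instability.

```cpp
// Minimal sketch of symmetric per-tensor INT8 quantization of a gradient
// tensor, the kind of operation a low-bit training scheme applies in the
// backward pass. The scaling choice is an assumption, not the paper's method.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

struct QuantizedGrad {
    std::vector<int8_t> q;
    float scale;
};

QuantizedGrad quantize_grad_int8(const std::vector<float>& g) {
    float max_abs = 0.f;
    for (float v : g) max_abs = std::max(max_abs, std::fabs(v));
    QuantizedGrad out{std::vector<int8_t>(g.size()),
                      max_abs > 0.f ? max_abs / 127.f : 1.f};
    for (size_t i = 0; i < g.size(); ++i)
        out.q[i] = static_cast<int8_t>(
            std::lround(std::clamp(g[i] / out.scale, -127.f, 127.f)));
    return out;
}

int main() {
    std::vector<float> grad = {1e-3f, -2e-2f, 5e-4f, 0.1f};  // long-tailed values
    auto qg = quantize_grad_int8(grad);
    for (size_t i = 0; i < grad.size(); ++i)
        std::cout << grad[i] << " -> " << int(qg.q[i]) * qg.scale << "\n";
    // Small gradients lose most of their relative precision here, one intuition
    // for why naive backward quantization can destabilize training.
    return 0;
}
```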
Jan, 5

LLVM-based automation of memory decoupling for OpenCL applications on FPGAs

The availability of OpenCL High-Level Synthesis (OpenCL-HLS) has made FPGAs an attractive platform for power-efficient, high-performance execution of massively parallel applications. At the same time, new design challenges emerge for massive thread-level parallelism on FPGAs. One major execution bottleneck is the high number of memory stalls exposed to the data-path, which overshadows the benefits of data-path […]
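The general restructuring that memory decoupling refers to can be sketched in plain C++: bulk-copy a block into a small local buffer first (the access phase), then run the compute loop only on local data (the execute phase). In OpenCL-HLS the buffer would be on-chip local memory and the copy a burst transfer; the block size and function below are assumptions for illustration, not the tool's output.

```cpp
// Generic sketch of the decoupled access/execute restructuring: copy a block
// of global data into a small local buffer (access phase), then compute only
// on local data (execute phase), so memory latency is not interleaved with
// the datapath.
#include <algorithm>
#include <cstring>
#include <iostream>
#include <vector>

constexpr int BLOCK = 256;  // local buffer size (assumed)

float sum_decoupled(const std::vector<float>& global_data) {
    float local_buf[BLOCK];
    float acc = 0.f;
    for (size_t base = 0; base < global_data.size(); base += BLOCK) {
        size_t n = std::min<size_t>(BLOCK, global_data.size() - base);
        // Access phase: one contiguous, burst-friendly copy.
        std::memcpy(local_buf, global_data.data() + base, n * sizeof(float));
        // Execute phase: computation touches only the local buffer.
        for (size_t i = 0; i < n; ++i) acc += local_buf[i];
    }
    return acc;
}

int main() {
    std::vector<float> data(1000, 1.f);
    std::cout << sum_decoupled(data) << "\n";  // prints: 1000
    return 0;
}
```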

* * *


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
