high performance computing on graphics processing units: hgpu.org

Posts

Aug, 23

Evaluating the Performance of NVIDIA’s A100 Ampere GPU for Sparse Linear Algebra Computations

GPU accelerators have become an important backbone for scientific high performance computing, and the performance advances obtained from adopting new GPU hardware are significant. In this paper we take a first look at NVIDIA’s newest server line GPU, the A100 architecture part of the Ampere generation. Specifically, we assess its performance for sparse linear algebra […]

CUDA

Aug, 9

Heterogeneous parallel computing for image registration and linear algebra applications

This doctoral thesis focuses on GPU acceleration of medical image registration and sparse general matrix-matrix multiplication (SpGEMM). The comprehensive work presented here aims to enable new possibilities in Image Guided Surgery (IGS). IGS provides the surgeon with advanced navigation tools during surgery. Image registration, which is a part of IGS, is computationally demanding, therefore GPU […]

CUDA

Aug, 9

Ignite-GPU: a GPU-enabled in-memory computing architecture on clusters

In recent years, the big data explosion and growing main memory capacity, on the one hand, and the need for faster data processing, on the other, have driven the development of various in-memory processing tools for managing and analyzing data. Exploiting the speed of main memory and taking advantage of data locality, these tools […]

CUDA

Aug, 9

Parallel acceleration of CPU and GPU range queries over large data sets

Data management systems commonly use bitmap indices to increase the efficiency of querying scientific data. Bitmaps are usually highly compressible and can be queried directly using fast hardware-supported bitwise logical operations. The processing of bitmap queries is inherently parallel in structure, which suggests they could benefit from concurrent computer systems. In particular, bitmap-range queries offer […]

CUDA
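The abstract above notes that bitmap indices can be queried directly with bitwise logical operations. A minimal sketch of that idea follows, assuming an uncompressed, equality-encoded index with one bitmap per bin and using Python integers as bit vectors. The function names and sample data are hypothetical and are not taken from the paper, and the loop runs sequentially rather than on a GPU.

```python
from typing import List


def build_bitmap_index(values: List[int], num_bins: int) -> List[int]:
    """Build one bitmap per bin; bit i of bitmap b is set when values[i] == b."""
    bitmaps = [0] * num_bins
    for row, v in enumerate(values):
        bitmaps[v] |= 1 << row
    return bitmaps


def range_query(bitmaps: List[int], lo: int, hi: int) -> int:
    """Answer lo <= value <= hi by OR-ing the bitmaps of every bin in the range."""
    result = 0
    for b in range(lo, hi + 1):
        result |= bitmaps[b]
    return result


def rows_of(bitmap: int) -> List[int]:
    """Decode a result bitmap back into a sorted list of row numbers."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical data: two binned attributes over six rows.
temperature = [3, 0, 2, 1, 3, 2]
pressure = [1, 1, 0, 2, 2, 1]

t_idx = build_bitmap_index(temperature, num_bins=4)
p_idx = build_bitmap_index(pressure, num_bins=3)

# Rows with temperature in [2, 3] AND pressure in [1, 2].
hits = range_query(t_idx, 2, 3) & range_query(p_idx, 1, 2)
matching_rows = rows_of(hits)
print(matching_rows)
```

Each bin's OR is independent of the others, which is the kind of structure a parallel implementation could split across threads before a final reduction.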

Aug, 9

PERCH 2.0: Fast and Accurate GPU-based Perception via Search for Object Pose Estimation

Pose estimation of known objects is fundamental to tasks such as robotic grasping and manipulation. The need for reliable grasping imposes stringent accuracy requirements on pose estimation in cluttered, occluded scenes in dynamic environments. Modern methods employ large sets of training data to learn features in order to find correspondence between 3D models and observed […]

CUDA

Aug, 9

AnyHLS: High-Level Synthesis with Partial Evaluation

FPGAs excel in low power and high throughput computations, but they are challenging to program. Traditionally, developers rely on hardware description languages like Verilog or VHDL to specify the hardware behavior at the register-transfer level. High-Level Synthesis (HLS) raises the level of abstraction, but still requires FPGA design knowledge. Programmers usually write pragma-annotated C/C++ programs […]

OpenCL

Aug, 2

Bounds Checking on GPU

We present a simple compilation strategy for safety-checking array indexing in high-level languages on GPUs. Our technique does not depend on hardware support for abnormal termination, and is designed to be efficient in the non-failing case. We rely on certain properties of array languages, namely the absence of arbitrary cross-thread communication, to ensure well-defined execution […]

CUDA • OpenCL

Aug, 2

OpenSBLI: Automated code-generation for heterogeneous computing architectures applied to compressible fluid dynamics on structured grids

OpenSBLI is an open-source code-generation system for compressible fluid dynamics (CFD) on heterogeneous computing architectures. Written in Python, OpenSBLI is an explicit high-order finite-difference solver on structured curvilinear meshes. Shock-capturing is performed by a choice of high-order Weighted Essentially Non-Oscillatory (WENO) or Targeted Essentially Non-Oscillatory (TENO) schemes. OpenSBLI generates a complete CFD solver in the […]

CUDA • OpenCL

Aug, 2

Deep Learning Application in Plant Stress Imaging: A Review

Plant stress is one of the major issues causing significant economic loss for growers. Conventional methods for identifying stressed plants are labor-intensive, which limits their application. To address this issue, rapid methods are urgently needed. Developments in advanced sensing and machine learning techniques are triggering revolutions in precision agriculture based on deep learning and big […]

OpenCL

Aug, 2

Biomedical and Clinical English Model Packages in the Stanza Python NLP Library

We introduce biomedical and clinical English model packages for the Stanza Python NLP library. These packages offer accurate syntactic analysis and named entity recognition capabilities for biomedical and clinical text, by combining Stanza’s fully neural architecture with a wide variety of open datasets as well as large-scale unsupervised biomedical and clinical text data. We show […]

Aug, 2

Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM

We implemented and optimized matrix multiplications between dense and block-sparse matrices on CUDA. We leveraged TVM, a deep learning compiler, to explore the schedule space of the operation and generate efficient CUDA code. With the automatic parameter tuning in TVM, our cross-thread reduction based implementation achieved competitive or better performance compared with other state-of-the-art frameworks.

CUDA
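For readers unfamiliar with the operation named in the abstract above, here is a reference-level sketch of multiplying a block-sparse matrix by a dense one. It assumes a block compressed sparse row (BSR) layout with square blocks; the layout, the function name `bsr_matmul`, and the sample matrices are assumptions for illustration. It is plain sequential Python, not the authors' TVM schedule or their cross-thread reduction.

```python
from typing import List

Matrix = List[List[float]]


def bsr_matmul(data: List[Matrix], indices: List[int], indptr: List[int],
               block: int, dense: Matrix) -> Matrix:
    """Compute A @ dense, where A is stored in BSR form.

    data[k]    : the k-th stored block (block x block)
    indices[k] : block-column index of the k-th stored block
    indptr[r]  : position in data/indices where block row r starts
    """
    n_block_rows = len(indptr) - 1
    n_cols = len(dense[0])
    out = [[0.0] * n_cols for _ in range(n_block_rows * block)]
    for br in range(n_block_rows):
        # Visit only the stored blocks of this block row; zero blocks are skipped.
        for k in range(indptr[br], indptr[br + 1]):
            bc = indices[k]
            blk = data[k]
            for i in range(block):
                dst = out[br * block + i]
                for j in range(block):
                    a = blk[i][j]
                    src = dense[bc * block + j]
                    for c in range(n_cols):
                        dst[c] += a * src[c]
    return out


# Hypothetical 4x4 matrix with 2x2 blocks; only the diagonal blocks are stored.
data = [[[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 0.0], [0.0, 6.0]]]
indices = [0, 1]
indptr = [0, 1, 2]
dense = [[1.0, 0.0],
         [0.0, 1.0],
         [1.0, 1.0],
         [2.0, 0.0]]

result = bsr_matmul(data, indices, indptr, 2, dense)
print(result)
```

Work is proportional to the number of stored blocks rather than to the full matrix size, which is the property an optimized GPU kernel would build on.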

Jul, 26

Darknet on OpenCL: a multi-platform tool for object detection and classification

The article gives an overview of the challenges and problems encountered on the way from state-of-the-art CUDA-accelerated neural network code to multi-GPU code. To this end, the authors describe porting the existing, fully featured, CUDA-accelerated Darknet engine available on GitHub to OpenCL. The article presents lessons learned and the […]

CUDA • OpenCL

Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Most viewed papers (last 30 days)