high performance computing on graphics processing units: hgpu.org

Posts

Jun, 18

EfficientBioAI: Making Bioimaging AI Models Efficient in Energy, Latency and Representation

Artificial intelligence (AI) is now widely used in bioimage analysis, but the efficiency of AI models, such as their energy consumption and latency, cannot be ignored given the growing model size and complexity, as well as the fast-growing analysis needs of modern biomedical studies. Like we can compress large images for efficient storage […]

CUDA

Jun, 11

GPUHarbor: Testing GPU Memory Consistency at Large

Memory consistency specifications (MCSs) are a difficult, yet critical, part of a concurrent programming framework. Existing MCS testing tools are not immediately accessible, and thus, they have only been applied to a limited number of platforms. However, in the post-Dennard scaling landscape, there has been an explosion of new architectures and frameworks, especially for GPUs. […]

OpenCL

Jun, 11

Program Analysis and Machine Learning based Approach to Predict Power Consumption of CUDA Kernel

The General-Purpose Graphics Processing Unit (GPGPU) has secured a prominent position in the High-Performance Computing (HPC) world due to its performance gains and programmability. Understanding the relationship between GPU power consumption and program features can aid developers in building energy-efficient, sustainable applications. In this work, we propose a static-analysis-based power model built using […]

CUDA

Jun, 11

minimap2-fpga: Integrating hardware-accelerated chaining for efficient end-to-end long-read sequence mapping

minimap2 is the gold-standard software for reference-based sequence mapping in third-generation long-read sequencing. While minimap2 is relatively fast, further speedup is desirable, especially when processing a multitude of large datasets. In this work, we present minimap2-fpga, a hardware-accelerated version of minimap2 that speeds up the mapping process by integrating an FPGA kernel optimised for chaining. […]

OpenCL

Jun, 11

SIMULATeQCD: A simple multi-GPU lattice code for QCD calculations

The rise of exascale supercomputers has fueled competition among GPU vendors, driving lattice QCD developers to write code that supports multiple APIs. Moreover, new developments in algorithms and physics research require frequent updates to existing software. These challenges have to be balanced against constantly changing personnel. At the same time, there is a wide range […]

CUDA

Jun, 11

Accelerating 128-bit Floating-Point Matrix Multiplication on FPGAs

General Matrix Multiplication (GEMM) is a fundamental operation widely used in scientific computations. Its performance and accuracy significantly impact the performance and accuracy of applications that depend on it. One such application is semidefinite programming (SDP), and it often requires binary128 or higher precision arithmetic to solve problems involving SDP stably. However, only some processors […]

OpenCL

Jun, 4

Hybrid CPU/GPU/APU accelerated query, insert, update and erase operations in hash tables with string keys

Modern computer systems can use different types of hardware acceleration to achieve massive performance improvements. Some accelerators like FPGA and dedicated GPU (dGPU) need optimized data structures for the best performance and often use dedicated memory. In contrast, APUs, which are a combination of a CPU and an integrated GPU (iGPU), support shared memory and […]

CUDA

Jun, 4

GPU-Acceleration of Tensor Renormalization with PyTorch using CUDA

We show that numerical computations based on tensor renormalization group (TRG) methods can be significantly accelerated with PyTorch on graphics processing units (GPUs) by leveraging NVIDIA’s Compute Unified Device Architecture (CUDA). We find improvement in the runtime and its scaling with bond dimension for two-dimensional systems. Our results establish that the utilization of GPU resources […]

CUDA

Jun, 4

Compiler Technologies in Deep Learning Co-Design: A Survey

With the rapid development of deep learning applications, general-purpose processors no longer suffice for deep learning workloads because of the end of Moore’s Law. Thus, computer architecture innovation has entered a golden age for domain-specific design, which has led to a demand for new compilation technologies to facilitate cross-layer optimization. Historically, hardware and software have […]

OpenCL

Jun, 4

A High-Performance Computing Cluster for Distributed Deep Learning: A Practical Case of Weed Classification Using Convolutional Neural Network Models

One of the main concerns in precision agriculture (PA) is the growth of weeds within a crop field. Currently, to prevent the spread of weeds, automatic techniques and computational tools are used to help to identify, classify, and detect the different types of weeds found in agricultural fields. One of the technologies that can help […]

CUDA

Jun, 4

Implementation Techniques for SPMD Kernels on CPUs

More and more frameworks and simulations are developed using heterogeneous programming models such as OpenCL, SYCL, CUDA, or HIP. A significant hurdle to mapping these models to CPUs in a performance-portable manner is that implementing work-group barriers for such kernels requires providing forward-progress guarantees so that all work-items can reach the barrier. This work provides […]

CUDA, OpenCL

May, 28

Genomics-GPU: A Benchmark Suite for GPU-accelerated Genome Analysis

Genomic analysis is the study of genes, which includes the identification, measurement, or comparison of genomic features. Genomics research is of great importance to our society because it can be used to detect diseases, create vaccines, and develop drugs and treatments. As general-purpose accelerators with massive parallel processing capability, GPUs have been […]

CUDA

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)