high performance computing on graphics processing units: hgpu.org

Posts

Jan, 8

Gunrock: GPU Graph Analytics

For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or […]

CUDA

Jan, 4

Deep Neural Networks to Enable Real-time Multimessenger Astrophysics

We introduce a new methodology for time-domain signal processing, based on deep learning neural networks, which has the potential to revolutionize data analysis in science. To illustrate how this enables real-time multimessenger astrophysics, we designed two deep convolutional neural networks that can analyze time-series data from observatories including advanced LIGO. The first neural network recognizes […]

CUDA

Jan, 4

Massively Parallel Computation of Accurate Densities for N-body Dark Matter Simulations using the Phase-Space-Element Method

In 2012, a method to analyze N-body dark matter simulations using a tetrahedral tessellation of the three-dimensional dark matter manifold in six-dimensional phase space was introduced. This paper presents an accurate density computation approach for large N-body datasets that is based on this technique and designed for massively parallel GPU clusters. The densities are obtained by […]

CUDA

Jan, 4

Design and optimization of a portable LQCD Monte Carlo code using OpenACC

The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPUs, which support a wide class of applications but deliver moderate computing performance, to many-core GPUs, which exploit aggressive data parallelism and deliver higher performance for streaming computing applications. In this scenario, code portability (and performance portability) becomes necessary for easy maintainability of applications; […]

Jan, 4

Evaluation of Multi-Threading in Vulkan

Today, processor development focuses heavily on parallel performance by providing multiple cores that programs can use. The problem with the current version of OpenGL is that it lacks support for issuing rendering commands from multiple CPU threads. Vulkan is a new low-level graphics API that gives more control to the […]

OpenGL

Jan, 4

An initial performance review of software components for a heterogeneous computing platform

The design of embedded systems is a complex activity that involves many decisions. Given the high performance demands of present-day usage scenarios and software, these systems often include energy-hungry, state-of-the-art computing units. While attention focuses on the power consumption of computing units, the physical properties of software are often ignored. Recently, there has been a […]

OpenCL

Dec, 31

Synthesizing Benchmarks for Predictive Modeling

Predictive modeling using machine learning is an effective method for building compiler heuristics, but there is a shortage of benchmarks. Typical machine learning experiments outside of the compilation field train over thousands or millions of examples. In machine learning for compilers, however, there are typically only a few dozen common benchmarks available. This limits the […]

OpenCL

Dec, 31

Automatic OpenCL Task Adaptation for Heterogeneous Architectures

OpenCL defines a common parallel programming language for all devices, although writing tasks adapted to the devices, managing communication and load-balancing issues are left to the programmer. In this work, we propose a novel automatic compiler and runtime technique to execute single OpenCL kernels on heterogeneous multi-device architectures. The technique proposed is completely transparent to […]

OpenCL

Dec, 31

Android Malware Classification Using Parallelized Machine Learning Methods

Android is the most popular mobile operating system, with a market share of over 80%. Due to its popularity and its open-source nature, Android is now the platform most targeted by malware, creating an urgent need for effective defense mechanisms to protect Android-enabled devices. In this dissertation, we present a novel characterization and […]

OpenCL

Dec, 31

Parallel Digital Predistortion Design on Mobile GPU and Embedded Multicore CPU for Mobile Transmitters

Digital predistortion (DPD) is a widely adopted baseband processing technique in current radio transmitters. While DPD can effectively suppress unwanted spurious spectrum emissions stemming from imperfections of analog RF and baseband electronics, it also introduces extra processing complexity and poses challenges on efficient and flexible implementations, especially for mobile cellular transmitters, considering their limited computing […]

CUDA

Dec, 31

dOpenCL – Evaluation of an API-Forwarding Implementation

Running parallel workloads on compute resources such as GPUs and accelerators is a rapidly developing trend in the field of high performance computing. At the same time, virtualization is a generally accepted solution for sharing compute resources with remote users in a secure and isolated way. However, accessing compute resources from inside virtualized environments still poses […]

OpenCL

Dec, 26

Batched Shift Reduce Parsing with Lists of Vectors on CUDA

Shift Reduce Parsing is a common algorithm used in compilers and natural language processing, and can be used to compose a sequence of fixed-length vectors into a single vector of equal length. Previous versions are implemented using predetermined computational graphs that trade excessive memory and computation to minimize transfers of memory from the device to […]

CUDA

* * *

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery
