Posts
Mar, 1
Telekine: Secure Computing with Cloud GPUs
GPUs have become ubiquitous in the cloud due to the dramatic performance gains they enable in domains such as machine learning and computer vision. However, offloading GPU computation to the cloud requires placing enormous trust in providers and administrators. Recent proposals for GPU trusted execution environments (TEEs) are promising but fail to address very real […]
Mar, 1
Evaluating the Energy Efficiency of OpenCL-accelerated AutoDock Molecular Docking
AUTODOCK is a molecular docking application that consists of a genetic algorithm coupled with the Solis-Wets local search method. Despite its wide usage, its power consumption on heterogeneous systems has not been evaluated extensively. In this work, we evaluate the energy efficiency of an OpenCL-accelerated version of AUTODOCK that, along with the traditional Solis-Wets method, newly […]
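For readers unfamiliar with the local-search half of that combination, the following is a minimal sketch of Solis-Wets adaptive random search for minimizing a scoring function; the function and parameter names are illustrative and it makes no attempt to mirror AUTODOCK's actual implementation.

```python
import numpy as np

def solis_wets(f, x, rho=1.0, iters=300, seed=0):
    """Minimal Solis-Wets adaptive random search (minimization sketch)."""
    rng = np.random.default_rng(seed)
    best = f(x)
    bias = np.zeros_like(x)
    successes = failures = 0
    for _ in range(iters):
        step = rng.normal(bias, rho, size=x.shape)
        if f(x + step) < best:                       # forward probe
            x, best = x + step, f(x + step)
            bias = 0.2 * bias + 0.4 * step
            successes, failures = successes + 1, 0
        elif f(x - step) < best:                     # reverse probe
            x, best = x - step, f(x - step)
            bias = bias - 0.4 * step
            successes, failures = successes + 1, 0
        else:                                        # both probes failed
            bias = 0.5 * bias
            successes, failures = 0, failures + 1
        if successes >= 5:                           # expand the step size on repeated success
            rho, successes = 2.0 * rho, 0
        if failures >= 3:                            # contract it on repeated failure
            rho, failures = 0.5 * rho, 0
    return x, best

# toy usage: minimize a simple quadratic
x_min, f_min = solis_wets(lambda v: float(np.sum(v ** 2)), np.ones(4))
```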
Mar, 1
A Systematic Survey of General Sparse Matrix-Matrix Multiplication
SpGEMM (General Sparse Matrix-Matrix Multiplication) has attracted much attention from researchers in the fields of multigrid methods and graph analysis. Many optimization techniques have been developed for specific application fields and computing architectures over the decades. The objective of this paper is to provide a structured and comprehensive overview of the research on SpGEMM. Existing optimization […]
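As a point of reference for what the surveyed kernels compute, here is a compact Gustavson-style (row-by-row) SpGEMM over CSR inputs in plain Python; it is a readable baseline, not one of the optimized GPU or multicore formulations the survey covers.

```python
# Gustavson-style row-by-row SpGEMM over CSR inputs (reference sketch only).
def spgemm_csr(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val, n_rows):
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(n_rows):
        acc = {}                                      # sparse accumulator for row i of C
        for jj in range(a_ptr[i], a_ptr[i + 1]):      # nonzeros A[i, k]
            k, a_ik = a_idx[jj], a_val[jj]
            for kk in range(b_ptr[k], b_ptr[k + 1]):  # nonzeros B[k, j]
                j = b_idx[kk]
                acc[j] = acc.get(j, 0.0) + a_ik * b_val[kk]
        for j in sorted(acc):                         # emit row i in column order
            c_idx.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val
```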
Feb, 23
Performance Counters based Power Modeling of Mobile GPUs using Deep Learning
GPUs have recently become important computational units on mobile devices, resulting in heterogeneous devices that can run a variety of parallel processing applications. While developing and optimizing such applications, estimating power consumption is of immense importance as energy efficiency has become the key design constraint to optimize for on these platforms. In this work, we […]
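A minimal sketch of the general approach — regressing measured power against per-kernel performance-counter vectors — is shown below using scikit-learn; the counter set, file names and network size are placeholders, not the paper's actual model.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Hypothetical counter matrix: one row per sampled kernel execution, one
# column per GPU performance counter (e.g. ALU utilization, memory reads).
X = np.load("gpu_counters.npy")      # shape (n_samples, n_counters) -- placeholder file
y = np.load("measured_power_w.npy")  # measured board power in watts -- placeholder file

scaler = StandardScaler().fit(X)
model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0)
model.fit(scaler.transform(X), y)

# Predicted power for a new counter sample
print(model.predict(scaler.transform(X[:1])))
```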
Feb, 23
Verified Instruction-Level Energy Consumption Measurement for NVIDIA GPUs
Graphics processor units (GPUs) are prevalent in modern computing systems at all scales. They consume a significant fraction of the energy in these systems. However, vendors do not publish the actual power/energy cost of their internal microarchitecture. In this paper, we accurately measure the energy consumption of various instructions found in modern […]
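At a much coarser granularity, board-level energy can be estimated by sampling GPU power through NVML and integrating over a measurement window, as in the sketch below; the paper's per-instruction methodology is considerably more involved, so this only illustrates the power-sampling side.

```python
import time
import pynvml

# Coarse board-level energy estimate: sample power while a workload runs,
# then multiply average power by elapsed time.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples, t0 = [], time.time()
while time.time() - t0 < 5.0:                      # sample for ~5 s while the kernel runs
    samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW -> W
    time.sleep(0.01)

elapsed = time.time() - t0
energy_j = sum(samples) / len(samples) * elapsed   # average power x time
print(f"~{energy_j:.1f} J over {elapsed:.2f} s")
pynvml.nvmlShutdown()
```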
Feb, 23
Let’s sort this out: GPGPU Verification of Radix Sort
This paper shows how the VerCors verification toolset can be used to prove data race freedom and functional correctness of a parallel radix sort algorithm for GPUs. This is a widely used standard sorting implementation for GPGPU programming frameworks and therefore its correctness is of utmost importance. Additionally, it presents the usefulness of VerCors as […]
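For orientation, a sequential least-significant-digit radix sort looks like the following; the GPU version verified with VerCors parallelizes the histogram, scan and scatter phases, which this sketch deliberately omits.

```python
# Sequential LSD radix sort on non-negative integers, one byte per pass.
def radix_sort(keys, key_bytes=4):
    for shift in range(0, 8 * key_bytes, 8):
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[(k >> shift) & 0xFF].append(k)   # stable scatter by current byte
        keys = [k for b in buckets for k in b]       # gather in bucket order
    return keys

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
```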
Feb, 23
From English To Foreign Languages: Transferring Pre-trained Language Models
Pre-trained models have demonstrated their effectiveness in many downstream natural language processing (NLP) tasks. The availability of multilingual pre-trained models enables zero-shot transfer of NLP tasks from high-resource languages to low-resource ones. However, recent research in improving pre-trained models focuses heavily on English. While it is possible to train the latest neural architectures […]
Feb, 23
High-Performance High-Order Stencil Computation on FPGAs Using OpenCL
In this paper we evaluate the performance of FPGAs for high-order stencil computation using High-Level Synthesis. We show that despite the higher computation intensity and on-chip memory requirement of such stencils compared to first-order ones, our design technique with combined spatial and temporal blocking remains effective. This allows us to reach similar, or even higher, […]
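A high-order stencil in its simplest form is just a wider-neighborhood update, e.g. the fourth-order 1D sweep below in NumPy; the paper's contribution lies in the combined spatial and temporal blocking applied to such sweeps on FPGAs, which this sketch does not reproduce.

```python
import numpy as np

# Fourth-order central-difference stencil sweep in 1D (5-point neighborhood).
def stencil_step(u, c=0.1):
    v = u.copy()
    v[2:-2] = u[2:-2] + c * (-u[:-4] + 16*u[1:-3] - 30*u[2:-2] + 16*u[3:-1] - u[4:]) / 12.0
    return v

u = np.sin(np.linspace(0, np.pi, 1024))
for _ in range(100):      # 100 time steps; temporal blocking would fuse several of these
    u = stencil_step(u)
```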
Feb, 16
EASYPAP: a Framework for Learning Parallel Programming
This paper presents EASYPAP, an easy-to-use programming environment designed to help students learn parallel programming. EASYPAP features a wide range of 2D computation kernels that the students are invited to parallelize using Pthreads, OpenMP, OpenCL or MPI. Execution of kernels can be interactively visualized, and powerful monitoring tools allow students to observe both the […]
Feb, 16
The Deep Learning Compiler: A Comprehensive Survey
The difficulty of deploying various deep learning (DL) models on diverse DL hardware has boosted the research and development of DL compilers in the community. Several DL compilers have been proposed by both industry and academia, such as TensorFlow XLA and TVM. Generally, the DL compilers take the DL models described in different DL frameworks […]
Feb, 16
ISM2: Optimizing Irregular-Shaped Matrix-Matrix Multiplication on GPUs
Linear algebra operations have been widely used in big data analytics and scientific computations. Much work has been done on optimizing linear algebra operations on GPUs with regular-shaped input. However, few works focus on fully utilizing GPU resources when the input is not regular-shaped. Current optimizations fall short in fully utilizing the memory bandwidth […]
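To make the shape problem concrete, the blocked multiply below uses a non-square tile chosen to match a tall-skinny operand; it is a CPU-side NumPy illustration of why fixed square tiles waste resources on irregular shapes, not the paper's GPU kernel.

```python
import numpy as np

# Blocked matrix multiply with a tile shape adapted to a tall-skinny input.
def blocked_matmul(A, B, tile_m=128, tile_n=8, tile_k=64):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, tile_m):
        for j in range(0, N, tile_n):
            for k in range(0, K, tile_k):
                C[i:i+tile_m, j:j+tile_n] += A[i:i+tile_m, k:k+tile_k] @ B[k:k+tile_k, j:j+tile_n]
    return C

A = np.random.rand(4096, 64)   # tall-skinny A
B = np.random.rand(64, 16)     # small N makes square tiles wasteful
np.testing.assert_allclose(blocked_matmul(A, B), A @ B, rtol=1e-6)
```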
Feb, 16
LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment
Pairwise sequence alignment is one of the most computationally intensive kernels in genomic data analysis, accounting for more than 90% of the runtime for key bioinformatics applications. This method is particularly expensive for third-generation sequences due to the high computational cost of analyzing sequences of length between 1Kb and 1Mb. Given the quadratic overhead of […]
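The X-drop idea itself is simple: extend an alignment outward from a seed and stop once the running score falls more than X below the best score seen so far. The ungapped, sequential sketch below conveys that rule; LOGAN's contribution is the gapped, banded, GPU-parallel version.

```python
# Ungapped X-drop extension to the right of a seed (illustrative only).
def xdrop_extend(query, target, qpos, tpos, x=20, match=1, mismatch=-1):
    score = best = best_len = 0
    i = 0
    while qpos + i < len(query) and tpos + i < len(target):
        score += match if query[qpos + i] == target[tpos + i] else mismatch
        if score > best:
            best, best_len = score, i + 1
        if best - score > x:          # score dropped more than X below the best: stop
            break
        i += 1
    return best, best_len

print(xdrop_extend("ACGTACGTTT", "ACGTACGAAA", 0, 0, x=3))
```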