high performance computing on graphics processing units: hgpu.org

Posts

Nov, 27

2nd International Conference on Robotics and Automation Engineering (ICRAE), 2017

The ICRAE conference is an international forum for the presentation of technological advances and research results in the fields of Robotics and Automation Engineering. Researchers from across the world are welcome to attend and submit their best work to the ICRAE 2017 conference to exchange ideas about the latest theories, technology, data, and videos furthering the state-of-the-art […]

Nov, 27

2nd International Conference on Computational Intelligence and Applications (ICCIA), 2017

The aim of ICCIA 2017 is to present the latest research and results of scientists related to Computational Intelligence and Applications topics. This conference provides opportunities for delegates from different areas to exchange new ideas and application experiences face to face, to establish business or research relations and to find global partners for future […]

Nov, 25

A Metric for Performance Portability

The term "performance portability" has been informally used in computing to refer to a variety of notions which generally include: 1) the ability to run one application across multiple hardware platforms; and 2) achieving some notional level of performance on these platforms. However, there has been a noticeable lack of consensus on the precise meaning […]

CUDA • OpenCL

Nov, 25

Fast and Energy-Efficient CNN Inference on IoT Devices

Convolutional Neural Networks (CNNs) exhibit remarkable performance in various machine learning tasks. As sensor-equipped internet of things (IoT) devices permeate into every aspect of modern life, it is increasingly important to run CNN inference, a computationally intensive application, on resource constrained devices. We present a technique for fast and energy-efficient CNN inference on mobile SoC […]

Nov, 25

PVR: Patch-to-Volume Reconstruction for Large Area Motion Correction of Fetal MRI

In this paper we present a novel method for the correction of motion artifacts that are present in fetal Magnetic Resonance Imaging (MRI) scans of the whole uterus. Contrary to current slice-to-volume registration (SVR) methods, requiring an inflexible anatomical enclosure of a single investigated organ, the proposed patch-to-volume reconstruction (PVR) approach is able to reconstruct […]

CUDA

Nov, 25

Efficient Kernel Synthesis for Performance Portable Programming

The diversity of microarchitecture designs in heterogeneous computing systems allows programs to achieve high performance and energy efficiency, but results in substantial software re-development cost for each type or generation of hardware. To mitigate this cost, a performance portable programming system is required. One fundamental difference between architectures that makes performance portability challenging is the […]

CUDA

Nov, 25

dMath: Distributed Linear Algebra for DL

The paper presents a parallel math library, dMath, that demonstrates leading scaling when using intranode, internode, and hybrid-parallelism for deep learning (DL). dMath provides easy-to-use distributed primitives and a variety of domain-specific algorithms including matrix multiplication, convolutions, and others allowing for rapid development of scalable applications like deep neural networks (DNNs). Persistent data stored in […]

CUDA

Nov, 23

Performance Analysis of CUDA and OpenCL By Implementation of Cryptographic Algorithms

This paper presents a performance analysis of CUDA and OpenCL. Three different cryptographic algorithms, i.e. DES, MD5, and SHA-1, have been selected as the benchmarks for extensive analysis of the performance gaps between the two. Our results show that, in the average scenario, CUDA performs 27% better than OpenCL while in the best case scenario […]

CUDA • OpenCL

Nov, 23

A Metaprogramming and Autotuning Framework for Deploying Deep Learning Applications

In recent years, deep neural networks (DNNs) have yielded strong results on a wide range of applications. Graphics Processing Units (GPUs) have been one key enabling factor leading to the current popularity of DNNs. However, despite increasing hardware flexibility and software programming toolchain maturity, high-efficiency GPU programming remains difficult: it suffers from high complexity, […]

CUDA • OpenCL

Nov, 23

Deep Tensor Convolution on Multicores

Deep convolutional neural networks (ConvNets) have become a de facto standard for image classification and segmentation problems. These networks have also had early success in the video domain, despite failing to capture motion continuity and other rich temporal correlations. Evidence has since emerged that extending ConvNets to 3-dimensions leads to state-of-the-art performance across a broad […]

CUDA

Nov, 23

GA3C: GPU-based A3C for Deep Reinforcement Learning

We introduce and analyze the computational aspects of a hybrid CPU/GPU implementation of the Asynchronous Advantage Actor-Critic (A3C) algorithm, currently the state-of-the-art method in reinforcement learning for various gaming tasks. Our analysis concentrates on the critical aspects to leverage the GPU’s computational power, including the introduction of a system of queues and a dynamic scheduling […]

CUDA

Nov, 23

Optimization and Evaluation of VLPL-S Particle-in-cell Code on Knights Landing

The VLPL-S code is developed based on the particle-in-cell (PIC) algorithm, which is the mainstream algorithm of plasma behavior research. In this paper, we report our early experience on porting and optimizing the VLPL-S particle-in-cell code on Knights Landing. By applying general optimization methods such as memory access optimization, thread level parallelism and vectorization to […]

Recent source codes

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

Agentic Code Optimization via Compiler-LLM Cooperation

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
