high performance computing on graphics processing units: hgpu.org

Posts

Oct, 31

Performance optimizations for scalable CFD applications on hybrid CPU+MIC heterogeneous computing system with millions of cores

For computational fluid dynamics (CFD) applications with a large number of grid points/cells, parallel computing is a common efficient strategy to reduce the computational time. How to achieve the best performance in the modern supercomputer system, especially with heterogeneous computing resources such as hybrid CPU+GPU, or a CPU + Intel Xeon Phi (MIC) co-processors, is […]

CUDA

Oct, 29

Early Results of Deep Learning on the Stampede2 Supercomputer

We present early results of the deep learning work on the Stampede2 supercomputer. Our goal is to enable scalable and efficient deep learning model training and serving to expedite scientific discovery. We build three popular deep learning frameworks, namely, IntelCaffe, MXNet, and TensorFlow. With the built-in applications of these frameworks (CaffeNet, AlexNet, GoogLeNet, and Cifar10), […]

Oct, 29

A Study of Time and Energy Efficient Algorithms for Parallel and Heterogeneous Computing

This PhD project is motivated by the need to develop and achieve better and energy efficient computing through the use of parallelism and heterogeneous systems. Our contribution consists of both theoretical aspects, as well as in-depth and comprehensive empirical studies that aim to provide more insight into parallel and heterogeneous computing. Our first problem is […]

OpenCL

Oct, 29

Hybrid Fortran: High Productivity GPU Porting Framework Applied to Japanese Weather Prediction Model

In this work we use the GPU porting task for the operative Japanese weather prediction model "ASUCA" as an opportunity to examine productivity issues with OpenACC when applied to structured grid problems. We then propose "Hybrid Fortran", an approach that combines the advantages of directive based methods (no rewrite of existing code necessary) with that […]

Oct, 29

GooFit 2.0

The GooFit package provides physicists a simple, familiar syntax for manipulating probability density functions and performing fits, and is highly optimized for data analysis on NVIDIA GPUs and multithreaded CPU backends. GooFit was updated to version 2.0, bringing a host of new features. A completely revamped and redesigned build system makes GooFit easier to install, […]

CUDA

Oct, 29

Strategy Preserving Compilation for Parallel Functional Code

Graphics Processing Units (GPUs) and other parallel devices are widely available and have the potential for accelerating a wide class of algorithms. However, expert programming skills are required to achieving maximum performance. hese devices expose low-level hardware details through imperative programming interfaces where programmers explicity encode device-specific optimisation strategies. This inevitably results in non-performance-portable programs […]

OpenCL

Oct, 24

Deep Voice 3: 2000-Speaker Neural Text-to-Speech

We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In […]

CUDA

Oct, 24

BENCHIP: Benchmarking Intelligence Processors

The increasing attention on deep learning has tremendously spurred the design of intelligence processing hardware. The variety of emerging intelligence processors requires standard benchmarks for fair comparison and system optimization (in both software and hardware). However, existing benchmarks are unsuitable for benchmarking intelligence processors due to their non-diversity and nonrepresentativeness. Also, the lack of a […]

Oct, 24

A Fast and Generic GPU-Based Parallel Reduction Implementation

Reduction operations are extensively employed in many computational problems. A reduction consists of, given a finite set of numeric elements, combining into a single value all elements in that set, using for this a combiner function. A parallel reduction, in turn, is the reduction operation concurrently performed when multiple execution units are available. The current […]

CUDA

•

OpenCL

Oct, 24

Parallel Computing for the Inverse of SPD matrix

In this paper, we propose a High performance Parallel Computing method for the Inverse of a symmetric positive definite (SPD) matrix. Brought in the reuse of the inverse of diagonal sub blocks technique and Combined with the newest OpenCL parallel computing framework, this methods can improve computing the inverse of SPD matrix effectively. Computing the […]

OpenCL

Oct, 24

GPU acceleration and performance of the particle-beam-dynamics code Elegant

Elegant is an accelerator physics and particle-beam dynamics code widely used for modeling and design of a variety of high-energy particle accelerators and accelerator-based systems. In this paper we discuss a recently developed version of the code that can take advantage of CUDA-enabled graphics processing units (GPUs) to achieve significantly improved performance for a large […]

CUDA

Oct, 24

Architecting SOT-RAM Based GPU Register File

With increase in GPU register file (RF) size, its power consumption has also increased. Since RF exists at the highest level in cache hierarchy, designing it with memories with high write latency/energy (e.g., spin transfer torque RAM) can lead to large energy loss. In this paper, we present an spin orbit torque RAM (SOT-RAM) based […]

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Performance optimizations for scalable CFD applications on hybrid CPU+MIC heterogeneous computing system with millions of cores

Early Results of Deep Learning on the Stampede2 Supercomputer

A Study of Time and Energy Efficient Algorithms for Parallel and Heterogeneous Computing

Hybrid Fortran: High Productivity GPU Porting Framework Applied to Japanese Weather Prediction Model

GooFit 2.0

Strategy Preserving Compilation for Parallel Functional Code

Deep Voice 3: 2000-Speaker Neural Text-to-Speech

BENCHIP: Benchmarking Intelligence Processors

A Fast and Generic GPU-Based Parallel Reduction Implementation

Parallel Computing for the Inverse of SPD matrix

GPU acceleration and performance of the particle-beam-dynamics code Elegant

Architecting SOT-RAM Based GPU Register File

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

Most viewed papers (last 30 days)