
Posts

Feb, 16

Finding, Measuring, and Reducing Inefficiencies in Contemporary Computer Systems

Computer systems have become increasingly diverse and specialized in recent years. This complexity supports a wide range of new computing uses and users, but is not without cost: it has become difficult to maintain the efficiency of contemporary general purpose computing systems. Computing inefficiencies, which include nonoptimal runtimes, excessive energy use, and limits to scalability, […]
Feb, 16

SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures

Modern servers have become heterogeneous, often combining multicore CPUs with many-core GPGPUs. Such heterogeneous architectures have the potential to improve the performance of data-intensive stream processing applications, but they are not supported by current relational stream processing engines. For an engine to exploit a heterogeneous architecture, it must execute streaming SQL queries with sufficient data-parallelism […]
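As a hedged illustration of the data-parallel window evaluation the abstract alludes to (not SABER's actual engine), the CUDA sketch below reduces one tumbling window of stream tuples per thread block; the kernel name windowSum and the fixed window layout are assumptions made for this example.

    #include <cuda_runtime.h>

    // Hypothetical sketch: a tumbling-window SUM over a batch of stream values.
    // One thread block reduces one window. This illustrates the data-parallel
    // evaluation style only; it is not SABER's implementation.
    __global__ void windowSum(const float* values, float* windowResults,
                              int windowSize)
    {
        extern __shared__ float partial[];
        int w = blockIdx.x;                 // window index
        int base = w * windowSize;

        // Each thread accumulates a strided slice of its window.
        float acc = 0.0f;
        for (int i = threadIdx.x; i < windowSize; i += blockDim.x)
            acc += values[base + i];
        partial[threadIdx.x] = acc;
        __syncthreads();

        // Standard shared-memory tree reduction (blockDim.x is a power of two).
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            windowResults[w] = partial[0];
    }

    int main()
    {
        const int windowSize = 1024, numWindows = 256;
        float *d_vals, *d_res;
        cudaMalloc(&d_vals, windowSize * numWindows * sizeof(float));
        cudaMalloc(&d_res, numWindows * sizeof(float));
        // ... fill d_vals with a batch of tuples from the stream ...
        windowSum<<<numWindows, 256, 256 * sizeof(float)>>>(d_vals, d_res, windowSize);
        cudaDeviceSynchronize();
        cudaFree(d_vals); cudaFree(d_res);
        return 0;
    }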
Feb, 16

CaffeLink: Mathematica binding for Caffe Deep Learning Framework

In this paper we present CaffeLink, an open-source library for Mathematica that provides a binding to the well-established Caffe deep learning framework. Caffe is a highly optimized, CUDA-accelerated library focused on convolutional neural networks, written in C++ with Python and Matlab bindings. CaffeLink is based upon Mathematica’s LibraryLink. It makes accessible most features of […]
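As a rough sketch of the LibraryLink plumbing such a binding builds on (this is not CaffeLink's code, and the exported function demo_plus_one is a hypothetical placeholder), a minimal LibraryLink entry point looks like this:

    #include "WolframLibrary.h"

    /* Boilerplate every LibraryLink library provides. */
    DLLEXPORT mint WolframLibrary_getVersion() { return WolframLibraryVersion; }
    DLLEXPORT int WolframLibrary_initialize(WolframLibraryData libData) { return LIBRARY_NO_ERROR; }
    DLLEXPORT void WolframLibrary_uninitialize(WolframLibraryData libData) { }

    /* Hypothetical exported function: adds one to a machine integer. Loaded
       from Mathematica with:
       LibraryFunctionLoad["demo", "demo_plus_one", {Integer}, Integer]    */
    DLLEXPORT int demo_plus_one(WolframLibraryData libData, mint Argc,
                                MArgument *Args, MArgument Res)
    {
        mint x = MArgument_getInteger(Args[0]);
        MArgument_setInteger(Res, x + 1);
        return LIBRARY_NO_ERROR;
    }

A real binding like CaffeLink would marshal tensors rather than integers, but the entry-point shape is the same.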
Feb, 11

Writing a performance-portable matrix multiplication

There are several frameworks that, while providing functional portability of code across different platforms, do not automatically provide performance portability. As a consequence, programmers have to hand-tune the kernel codes for each device. The Heterogeneous Programming Library (HPL) is one of these libraries, but it has the interesting feature that the kernel codes, which implement […]
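HPL kernels are written in the library's own embedded language; the hedged CUDA analogue below shows the kind of device-dependent parameter, here a tile size staged through shared memory, that per-device hand-tuning has to choose. It assumes square matrices whose dimension N is a multiple of TILE.

    // Hedged CUDA analogue of a tunable matrix-multiplication kernel; TILE is
    // the sort of parameter a tuner compares across devices. Not HPL code.
    template <int TILE>
    __global__ void matmulTiled(const float* A, const float* B, float* C, int N)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < N; t += TILE) {
            // Stage one tile of A and one tile of B in on-chip shared memory.
            As[threadIdx.y][threadIdx.x] = A[row * N + t + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * N + col] = acc;   // assumes N is a multiple of TILE
    }
    // Instantiations a tuner might compare: matmulTiled<8>, <16>, <32>.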
Feb, 10

Programming GPUs with C++14 and Just-In-Time Compilation

Systems that comprise accelerators (e.g., GPUs) promise high performance, but their programming is still a challenge, mainly for two reasons: 1) two distinct programming models have to be used within an application: one for the host CPU (e.g., C++), and one for the accelerator (e.g., OpenCL or CUDA); 2) using Just-In-Time (JIT) compilation and […]
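One concrete, existing mechanism for the JIT side of this is NVRTC, CUDA's runtime compilation library. The sketch below compiles a kernel from a C++ string at run time and loads the resulting PTX with the driver API (error checking elided); it is a generic NVRTC example, not the system proposed in the paper.

    #include <nvrtc.h>
    #include <cuda.h>
    #include <vector>

    // Kernel source held as an ordinary string until run time.
    const char* kernelSrc =
        "extern \"C\" __global__ void scale(float* x, float s, int n) {\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    if (i < n) x[i] *= s;\n"
        "}\n";

    int main()
    {
        // JIT-compile the string to PTX with NVRTC.
        nvrtcProgram prog;
        nvrtcCreateProgram(&prog, kernelSrc, "scale.cu", 0, nullptr, nullptr);
        const char* opts[] = { "--gpu-architecture=compute_50" };
        nvrtcCompileProgram(prog, 1, opts);

        size_t ptxSize;
        nvrtcGetPTXSize(prog, &ptxSize);
        std::vector<char> ptx(ptxSize);
        nvrtcGetPTX(prog, ptx.data());
        nvrtcDestroyProgram(&prog);

        // Load the freshly generated PTX with the driver API.
        cuInit(0);
        CUdevice dev;   cuDeviceGet(&dev, 0);
        CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
        CUmodule mod;   cuModuleLoadData(&mod, ptx.data());
        CUfunction fn;  cuModuleGetFunction(&fn, mod, "scale");
        // ... allocate arguments with cuMemAlloc and launch via cuLaunchKernel ...
        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return 0;
    }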
Feb, 10

BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

We introduce BinaryNet, a method which trains DNNs with binary weights and activations that are used when computing the parameters’ gradients. We show that it is possible to train a Multi-Layer Perceptron (MLP) on MNIST and ConvNets on CIFAR-10 and SVHN with BinaryNet and achieve nearly state-of-the-art results. At run-time, BinaryNet drastically reduces memory usage and replaces most […]
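The forward-pass quantization at the core of the method is just a sign function mapped to +1/-1; a minimal CUDA sketch is below. Note that BinaryNet keeps real-valued weights for the parameter update and propagates gradients through the sign with a straight-through estimator, which this kernel does not show.

    #include <cuda_runtime.h>

    // Deterministic binarization: sign(x) mapped to +1 / -1. Forward-pass
    // quantization only; the real-valued master weights used for updates
    // and the straight-through gradient are not shown here.
    __global__ void binarize(const float* realWeights, float* binWeights, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            binWeights[i] = (realWeights[i] >= 0.0f) ? 1.0f : -1.0f;
    }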
Feb, 10

GPU-Accelerated High-Level Synthesis for Bitwidth Optimization of FPGA Datapaths

Bitwidth optimization of FPGA datapaths can save hardware resources by choosing the minimum number of bits required for each datapath variable to achieve a desired quality of result. However, it is an NP-hard problem that requires unacceptably long runtimes when using sequential CPU-based heuristics. We show how to parallelize the key steps of bitwidth optimization […]
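The per-variable arithmetic underneath is small: assuming a proven value range [lo, hi], the minimum width is ceil(log2(hi - lo + 1)) bits. A minimal sketch follows; the helper name bitsForRange is invented for illustration, and the paper's actual contribution, parallelizing the analysis across the whole datapath on a GPU, is not attempted here.

    #include <cstdint>

    // Fewest bits needed to represent hi - lo + 1 distinct values,
    // i.e. ceil(log2(span + 1)) with a floor of one bit.
    int bitsForRange(int64_t lo, int64_t hi)
    {
        uint64_t span = static_cast<uint64_t>(hi - lo);
        int bits = 0;
        while (span > 0) { ++bits; span >>= 1; }
        return bits == 0 ? 1 : bits;
    }
    // e.g. bitsForRange(0, 255) == 8; bitsForRange(-4, 3) == 3.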
Feb, 10

FARGO3D: A new GPU-oriented MHD code

We present the recently publicly released FARGO3D code. It is a magnetohydrodynamics code developed with special emphasis on the physics of protoplanetary disks and planet-disk interactions, and parallelized with MPI. The hydrodynamics algorithms are based on finite-difference, upwind, dimensionally split methods. The magnetohydrodynamics algorithms consist of the constrained transport method to preserve the divergence-free property of […]
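For readers unfamiliar with such schemes, here is a hedged 1-D sketch of a first-order upwind finite-difference update for linear advection (du/dt + c du/dx = 0 with c > 0); FARGO3D's actual solver is 3-D, dimensionally split, and handles full MHD.

    // One explicit upwind step for 1-D linear advection with c > 0:
    // uNew[i] = u[i] - c * dt/dx * (u[i] - u[i-1]).
    __global__ void upwindStep(const float* u, float* uNew, int n,
                               float c, float dt, float dx)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n)              // skip the inflow boundary cell
            uNew[i] = u[i] - c * (dt / dx) * (u[i] - u[i - 1]);
    }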
Feb, 10

Performance Portable GPU Code Generation for Matrix Multiplication

Parallel accelerators such as GPUs are notoriously hard to program; exploiting their full performance potential is a job best left for ninja programmers. High-level programming languages coupled with optimizing compilers have been proposed to address this issue. However, they rely on device-specific heuristics or hard-coded library implementations to achieve good performance, resulting in […]
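One small, existing illustration of replacing a hard-coded launch heuristic with a device query is CUDA's occupancy API, sketched below; this is a generic technique, unrelated to the code generator the paper describes, and saxpy is just an example kernel.

    #include <cuda_runtime.h>

    __global__ void saxpy(float a, const float* x, float* y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // Ask the runtime for a good block size on whatever device is present,
    // instead of baking in a number tuned for one architecture.
    void launchSaxpy(float a, const float* x, float* y, int n)
    {
        int minGrid = 0, block = 0;
        cudaOccupancyMaxPotentialBlockSize(&minGrid, &block, saxpy, 0, 0);
        int grid = (n + block - 1) / block;
        saxpy<<<grid, block>>>(a, x, y, n);
    }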
Feb, 9

Guided Profiling for Auto-Tuning Array Layouts on GPUs

Auto-tuning for Graphics Processing Units (GPUs) has become very popular in recent years. It removes the necessity of hand-tuning GPU code, especially when a new hardware architecture is released. Our auto-tuner optimizes memory access patterns, a key aspect of exploiting the full performance of modern GPUs. As the memory hierarchy has historically changed […]
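The classic instance of such a layout choice is array-of-structures versus structure-of-arrays; the hedged CUDA sketch below shows why the latter usually wins on GPUs (consecutive threads touch consecutive addresses, so loads coalesce). It illustrates the problem space, not this paper's tuner.

    // AoS: thread i reads p[i].x, so neighboring threads touch addresses
    // two floats apart and the accesses are strided.
    struct ParticleAoS { float x; float y; };

    __global__ void shiftAoS(ParticleAoS* p, float d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i].x += d;          // strided: every other float
    }

    // SoA: the x components live in their own array, so neighboring threads
    // read consecutive floats and the loads coalesce.
    __global__ void shiftSoA(float* x, float d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += d;            // coalesced: consecutive floats
    }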
Feb, 8

Portable Programming Models for Heterogeneous Platforms

With the end of Dennard scaling and the emergence of dark silicon, hopes are pinned on heterogeneous architectures to achieve both application performance and energy efficiency. However, diversity in heterogeneous architectures poses severe programming challenges in terms of data layout, memory coherence, task partitioning, data distribution, and sharing of virtual addresses. Existing high-level programming languages […]
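As one existing data point on the "sharing of virtual addresses" challenge, CUDA unified memory gives host and device a single pointer, shown in the hedged sketch below; this illustrates a current mechanism, not the programming model the paper surveys or proposes.

    #include <cuda_runtime.h>

    __global__ void increment(int* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    int main()
    {
        const int n = 1 << 20;
        int* data = nullptr;
        cudaMallocManaged(&data, n * sizeof(int));   // one pointer, both sides
        for (int i = 0; i < n; ++i) data[i] = i;     // CPU writes directly
        increment<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();                     // CPU can now read results
        cudaFree(data);
        return 0;
    }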
Feb, 8

High performance high-order numerical methods: applications in ocean modeling

This thesis presents high-order numerical methods for time-dependent simulations of oceanic wave propagation on modern many-core hardware architectures. Simulation of waves such as tsunamis is challenging because of varying fluid depths, propagation across many regions, the need for high resolution near the shore, complex nonlinear wave phenomena, and the necessity of faster-than-real-time predictions. […]

* * *


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
