high performance computing on graphics processing units: hgpu.org

Posts

Dec, 12

Behavioral Non-portability in Scientific Numeric Computing

The precise semantics of floating-point arithmetic programs depends on the execution platform, including the compiler and the target hardware. Platform dependencies are particularly pronounced for arithmetic-intensive parallel numeric programs and infringe on the highly desirable goal of software portability (which is nonetheless promised by heterogeneous computing frameworks like OpenCL): the same program run on the […]
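One source of such platform dependence can be sketched in a few lines of Python. This is an illustration only, not taken from the paper: floating-point addition is not associative, so the order in which a compiler or a parallel reduction combines the terms can change the rounded result. The helper name `sum_in_order` and the input values are hypothetical.

```python
import math

def sum_in_order(values):
    """Accumulate left to right in double precision, like a sequential loop."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same three numbers, two evaluation orders (hypothetical inputs).
# At magnitude 1e16 adjacent doubles are 2 apart, so adding 1.0 is rounded away.
absorbed = sum_in_order([1e16, 1.0, -1e16])   # (1e16 + 1.0) - 1e16
preserved = sum_in_order([1e16, -1e16, 1.0])  # (1e16 - 1e16) + 1.0

# math.fsum returns the correctly rounded sum regardless of ordering.
reference = math.fsum([1e16, 1.0, -1e16])

print(absorbed, preserved, reference)
```

A reduction that is split across work-items in a different way on each device reorders the additions in much the same manner, which is one reason the same kernel may not give bitwise-identical results on every platform.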

OpenCL

Dec, 12

GRATER: An Approximation Workflow for Exploiting Data-Level Parallelism in FPGA Acceleration

Modern applications including graphics, multimedia, web search, and data analytics not only can benefit from acceleration, but also exhibit significant degrees of tolerance to imprecise computation. This amenability to approximation provides an opportunity to trade quality of the results for higher performance and better resource utilization. Exploiting this opportunity is particularly important for FPGA accelerators […]

OpenCL

Dec, 4

Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations

The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naive mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations. In […]

OpenCL

Nov, 24

A parallel algorithm for the constrained shortest path problem on lattice graphs

We present a parallel algorithm for finding the shortest path whose total weight is smaller than a pre-determined value. The passage times over the edges are assumed to be positive integers. In each step, the processing elements do not analyze the entire graph; instead, they focus on a subset of vertices called active vertices. […]

OpenCL

Nov, 13

GEMMbench: a framework for reproducible and collaborative benchmarking of matrix multiplication

The generic matrix-matrix multiplication (GEMM) is arguably the most popular computational kernel of the 20th century. Yet, surprisingly, no common methodology for evaluating GEMM performance has been established over the many decades of using GEMM for comparing architectures, compilers and ninja-class programmers. We introduce GEMMbench, a framework and methodology for evaluating performance of GEMM implementations. […]
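For readers unfamiliar with the kernel, here is a minimal reference sketch of what a GEMM computes, using the standard BLAS definition C ← αAB + βC. It is not GEMMbench code and does not reflect its API; the function names `gemm` and `gflops` are hypothetical. The 2·m·n·k operation count is the usual convention for reporting GEMM throughput.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices stored as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    if (any(len(row) != k for row in a) or len(c) != m
            or any(len(row) != n for row in c)):
        raise ValueError("incompatible matrix shapes")
    result = []
    for i in range(m):
        out_row = []
        for j in range(n):
            acc = 0.0
            for p in range(k):  # dot product of row i of a and column j of b
                acc += a[i][p] * b[p][j]
            out_row.append(alpha * acc + beta * c[i][j])
        result.append(out_row)
    return result

def gflops(m, n, k, seconds):
    """Convert a measured runtime into GFLOP/s using the 2*m*n*k convention."""
    return 2.0 * m * n * k / seconds / 1e9

# Tiny hypothetical inputs, chosen so the result is easy to check by hand.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
out = gemm(2.0, a, b, 0.5, c)
print(out)
```

A triple loop like this is useful only as a correctness reference; comparing tuned implementations against it is separate from measuring their performance.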

OpenCL

Nov, 11

Climbing Mont Blanc – A Training Site for Energy Efficient Programming on Heterogeneous Multicore Processors

Climbing Mont Blanc (CMB) is an open online judge used for training in energy efficient programming of state-of-the-art heterogeneous multicores. It uses an Odroid-XU3 board from Hardkernel with an Exynos Octa processor and integrated power sensors. This processor is three-way heterogeneous containing 14 different cores of three different types. The board currently accepts C and […]

OpenCL

Nov, 11

Integrating a large-scale testing campaign in the CK framework

We consider the problem of conducting large experimental campaigns in computer science research. Most research efforts require a certain level of bookkeeping of results. This is manageable via quick, on-the-fly infrastructure implementations. However, it becomes a problem for large-scale testing initiatives, especially as the needs of the project evolve along the way. We look at […]

OpenCL

Nov, 8

High Level Synthesis and Evaluation of the Secure Hash Standard for FPGAs

Secure hash algorithms (SHAs) are important components of cryptographic applications. SHA performance on central processing units (CPUs) is slow; therefore, acceleration must be done using hardware such as Field Programmable Gate Arrays (FPGAs). Considerable work has been done in academia using FPGAs to accelerate SHAs. These designs were implemented using Hardware Description Language (HDL) based […]

OpenCL

Nov, 4

Heterogeneous CPU/(GP) GPU Memory Hierarchy Analysis and Optimization

Heterogeneous systems, more specifically CPU – GPGPU platforms, have gained a lot of attention due to the excellent speedups GPUs can achieve with so little energy consumption. However, the picture is not entirely positive: the complex programming models needed to fully exploit the devices and the data movement overheads are some […]

CUDA

Oct, 31

Energy-Efficient Execution of Data-Parallel Applications on Heterogeneous Mobile Platforms

State-of-the-art mobile system-on-chips (SoCs) include heterogeneity in various forms for accelerated and energy-efficient execution of a diverse range of applications. Modern SoCs now include programmable cores such as CPUs and GPUs with very different functionality. The SoCs also integrate performance-heterogeneous cores with different power-performance characteristics but the same instruction-set architecture, such as ARM big.LITTLE. […]

OpenCL

Oct, 27

The 1st International SYCL Workshop (SYCL), 2016

1st SYCL workshop (SYCL’16) – co-located with PPoPP’16
Barcelona, Spain
Sunday, 13th March, 2016
http://conf.researchr.org/track/PPoPP-2016/SYCL-2016-papers

SYCL (sɪkəl – as in sickle) is a royalty-free, cross-platform C++ abstraction layer that builds on the underlying concepts, portability and efficiency of OpenCL, while adding the ease-of-use and flexibility of C++. For example, SYCL enables single source development where […]

Oct, 27

Compiling and Optimizing Java 8 Programs for GPU Execution

GPUs can enable significant performance improvements for certain classes of data parallel applications and are widely used in recent computer systems. However, GPU execution currently requires explicit low-level operations such as 1) managing memory allocations and transfers between the host system and the GPU, 2) writing GPU kernels in a low-level programming model such as […]

CUDA • OpenCL



Recent source codes

CuPBoP-AMD: Extending CUDA to AMD Platforms

Adopter: Automated Deep Learning Optimization via DSL-based Source Code Transformation

ROCm's implementation of Gromacs

Code examples for paper on SYCL backend of Kokkos - IWOCL 2024

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds
