high performance computing on graphics processing units: hgpu.org

Posts

Dec, 31

A Comparison of the performance of HPC Accelerators

This project aims to port the scientific application GADGET-3 to multiple accelerators, study the performance achieved, and compare the porting and optimisations across accelerators with different architectures. The most time-consuming functions of GADGET-3 were identified through profiling. Some functions in GADGET-3 were ported to the NVIDIA K40 accelerator card […]

OpenCL

Dec, 31

Parallel 3D Fast Wavelet Transform comparison on CPUs and GPUs

We present in this paper several implementations of the 3D Fast Wavelet Transform (3D-FWT) on multicore CPUs and manycore GPUs. On the GPU side, we focus on CUDA and OpenCL programming to develop methods for an efficient mapping on manycores. On multicore CPUs, OpenMP and Pthreads are used as counterparts to maximize parallelism, and renowned […]

CUDA, OpenCL

Dec, 22

OpenDwarfs: Characterization of Dwarf-Based Benchmarks on Fixed and Reconfigurable Architectures

The proliferation of heterogeneous computing platforms presents the parallel computing community with new challenges. One such challenge entails evaluating the efficacy of such parallel architectures and identifying the architectural innovations that ultimately benefit applications. To address this challenge, we need benchmarks that capture the execution patterns (i.e., dwarfs or motifs) of applications, both present and […]

OpenCL

Dec, 19

Autotuning Stencils Codes with Algorithmic Skeletons

The physical limitations of microprocessor design have forced the industry towards increasingly heterogeneous architectures to extract performance. This trend has not been matched with software tools to cope with such parallelism, leading to a growing disparity between the levels of available performance and the ability for application developers to exploit it. Algorithmic skeletons simplify parallel […]

OpenCL

Dec, 12

Behavioral Non-portability in Scientific Numeric Computing

The precise semantics of floating-point arithmetic programs depends on the execution platform, including the compiler and the target hardware. Platform dependencies are particularly pronounced for arithmetic-intensive parallel numeric programs and infringe on the highly desirable goal of software portability (which is nonetheless promised by heterogeneous computing frameworks like OpenCL): the same program run on the […]

OpenCL
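One well-known reason the same numeric program can give different results on different platforms is that floating-point addition is not associative, so the order in which a reduction is evaluated matters. The sketch below is only an illustration of that general effect, not the method of the paper above; the function names `sequential_sum` and `pairwise_sum` are hypothetical, and the tree-shaped order merely stands in for what a parallel device might do.

```python
def sequential_sum(values):
    """Add left to right, as a simple serial loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values):
    """Add in a tree-shaped order, similar to a parallel reduction."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


values = [0.1, 0.2, 0.3]
serial = sequential_sum(values)   # evaluates (0.1 + 0.2) + 0.3
tree = pairwise_sum(values)       # evaluates 0.1 + (0.2 + 0.3)
print(serial == tree)             # False: the two orders round differently
```

Both functions compute the same mathematical sum, yet the results differ in the last bit. A compiler or runtime that reorders a reduction can therefore change program output without any change to the source.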

Dec, 12

GRATER: An Approximation Workflow for Exploiting Data-Level Parallelism in FPGA Acceleration

Modern applications including graphics, multimedia, web search, and data analytics not only can benefit from acceleration, but also exhibit significant degrees of tolerance to imprecise computation. This amenability to approximation provides an opportunity to trade quality of the results for higher performance and better resource utilization. Exploiting this opportunity is particularly important for FPGA accelerators […]

OpenCL

Dec, 4

Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations

The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naive mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations. In […]

OpenCL

Nov, 24

A parallel algorithm for the constrained shortest path problem on lattice graphs

We present a parallel algorithm for finding the shortest path whose total weight is smaller than a pre-determined value. The passage times over the edges are assumed to be positive integers. In each step, the processing elements do not analyze the entire graph; instead, they focus on a subset of vertices called active vertices. […]

OpenCL
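The idea of stepping through time and touching only the currently active vertices can be sketched sequentially. The code below is a minimal illustration under stated assumptions, not the algorithm of the paper above: it assumes each edge carries a positive integer passage time and a non-negative weight, that "shortest" means smallest total passage time, and that the function name `constrained_shortest_time` and its `max_time` cutoff are hypothetical.

```python
import math
from collections import defaultdict


def constrained_shortest_time(edges, source, target, weight_limit, max_time):
    """Return (time, weight) of the fastest path with weight < weight_limit.

    edges maps (u, v) -> (passage_time, weight) for undirected edges.
    Returns None if no such path arrives within max_time.
    """
    if weight_limit <= 0:
        return None  # even the empty path has weight 0, which is not below the limit
    adj = defaultdict(list)
    for (u, v), (t, w) in edges.items():
        adj[u].append((v, t, w))
        adj[v].append((u, t, w))

    # labels[t][v] = smallest weight of a route reaching v at exactly time t
    labels = defaultdict(dict)
    labels[0][source] = 0
    for now in range(max_time + 1):
        active = labels.get(now, {})  # only vertices reached at this time step
        if target in active:
            return now, active[target]
        for u, weight_so_far in active.items():
            for v, t, w in adj[u]:
                arrive, total = now + t, weight_so_far + w
                if arrive > max_time or total >= weight_limit:
                    continue  # prune routes that are too slow or too heavy
                if total < labels[arrive].get(v, math.inf):
                    labels[arrive][v] = total
    return None


# A 2x2 lattice: a fast but heavy route and a slow but light route.
edges = {
    ((0, 0), (0, 1)): (1, 5), ((0, 1), (1, 1)): (1, 5),
    ((0, 0), (1, 0)): (2, 1), ((1, 0), (1, 1)): (2, 1),
}
print(constrained_shortest_time(edges, (0, 0), (1, 1), weight_limit=6, max_time=10))
```

Because passage times are positive integers, every update writes to a later time step than the one being processed, so the vertices active at a given step could in principle be handled independently of each other. That is the property a parallel version would rely on.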

Nov, 13

GEMMbench: a framework for reproducible and collaborative benchmarking of matrix multiplication

The generic matrix-matrix multiplication (GEMM) is arguably the most popular computational kernel of the 20th century. Yet, surprisingly, no common methodology for evaluating GEMM performance has been established over the many decades of using GEMM for comparing architectures, compilers and ninja-class programmers. We introduce GEMMbench, a framework and methodology for evaluating performance of GEMM implementations. […]

OpenCL

Nov, 11

Integrating a large-scale testing campaign in the CK framework

We consider the problem of conducting large experimental campaigns in computer science research. Most research efforts require a certain level of bookkeeping of results. This is manageable via quick, on-the-fly infrastructure implementations. However, it becomes a problem for large-scale testing initiatives, especially as the needs of the project evolve along the way. We look at […]

OpenCL

Nov, 11

Climbing Mont Blanc – A Training Site for Energy Efficient Programming on Heterogeneous Multicore Processors

Climbing Mont Blanc (CMB) is an open online judge used for training in energy efficient programming of state-of-the-art heterogeneous multicores. It uses an Odroid-XU3 board from Hardkernel with an Exynos Octa processor and integrated power sensors. This processor is three-way heterogeneous containing 14 different cores of three different types. The board currently accepts C and […]

OpenCL

Nov, 8

High Level Synthesis and Evaluation of the Secure Hash Standard for FPGAs

Secure hash algorithms (SHAs) are important components of cryptographic applications. SHA performance on central processing units (CPUs) is slow; therefore, acceleration must be done using hardware such as Field Programmable Gate Arrays (FPGAs). Considerable work has been done in academia using FPGAs to accelerate SHAs. These designs were implemented using Hardware Description Language (HDL) based […]

OpenCL

