high performance computing on graphics processing units: hgpu.org

Posts

Nov, 20

Multi-GPU Support on the Marrow Algorithmic Skeleton Framework

With the proliferation of general purpose GPUs, workload parallelization and data transfer optimization became an increasing concern. The natural evolution from using a single GPU is multiplying the number of available processors, which presents new challenges, such as tuning the workload decompositions and load balancing, when dealing with heterogeneous systems. Higher-level programming is a very important asset in […]

CUDA • OpenCL

Nov, 20

HyPHI – task based hybrid execution C++ library for the Intel Xeon Phi coprocessor

The Intel Threading Building Blocks (TBB) C++ library introduced task parallelism to a wide audience of application developers. The library is easy to use and powerful, but it is limited to shared-memory machines. In this paper we present HyPHI, a novel library for the Intel Xeon Phi coprocessor for building applications which execute using a […]

Nov, 20

International Workshop on OpenCL, IWOCL 2014

The International Workshop on OpenCL (IWOCL) is an annual meeting of OpenCL users, researchers, developers and suppliers to share OpenCL best practice, and to promote the evolution and advancement of the OpenCL standard. The meeting is open to anyone who is interested in contributing to, and participating in, the OpenCL community. IWOCL is the premier […]

Nov, 19

Real-time rendering of large surface-scanned range data natively on a GPU

This thesis presents research carried out for the visualisation of surface anatomy data stored as large range images such as those produced by stereo-photogrammetric, and other triangulation-based capture devices. As part of this research, I explored the use of points as a rendering primitive as opposed to polygons, and the use of range images as […]

OpenGL

Nov, 19

Adaptive implementation selection in the SkePU skeleton programming library

In earlier work, we have developed the SkePU skeleton programming library for modern multicore systems equipped with one or more programmable GPUs. The library internally provides four types of implementations (implementation variants) for each skeleton: serial C++, OpenMP, CUDA and OpenCL targeting either CPU or GPU execution respectively. Deciding which implementation would run faster for […]

CUDA • OpenCL

Nov, 19

A study of the speed and the accuracy of the Boundary Element Method as applied to the computational simulation of biological organs

In this work, first a Fortran code is developed for three dimensional linear elastostatics using constant boundary elements; the code is based on a MATLAB code developed by the author earlier. Next, the code is parallelized using BLACS, MPI, and ScaLAPACK. Later, the parallelized code is used to demonstrate the usefulness of the Boundary Element […]

CUDA

Nov, 19

Implementation of the twisted mass fermion operator in the QUDA library

We discuss an extension of the QUDA library for the Wilson twisted mass operator. A performance analysis is presented for both degenerate and non-degenerate flavor doublets. The degenerate twisted mass fermion operator runs at up to 190, 487 and 856 Gflops, for double, single and half precisions respectively on recent NVIDIA Kepler GPUs, while our […]

CUDA

Nov, 19

An implicit multigrid solver for high-order compressible flow simulations on GPUs

The multigrid method has proved to be effective for a large class of numerical methods. In this study, a strategy based on Full Approximation Storage (FAS) scheme is implemented together with Full Multigrid Algorithm (FMG) to accelerate convergence of steady state solutions of the two-dimensional compressible Euler equations on Graphics Processing Unit (GPU). The Beam […]

CUDA

Nov, 18

Neurokernel: An Open Scalable Software Framework for Emulation and Validation of Drosophila Brain Models on Multiple GPUs

The brain of the fruit fly Drosophila melanogaster is an extremely attractive model system for reverse engineering the emergent properties of neural circuits because it implements complex sensory-driven behaviors with a nervous system comprising a number of components that is five orders of magnitude smaller than those of mammals. A powerful toolkit of well-developed genetic […]

CUDA

Nov, 18

Integrating Multi-GPU Execution in an OpenACC Compiler

GPUs have become promising computing devices in current and future computer systems due to their high performance, high energy efficiency, and low price. However, the lack of high-level GPU programming models hinders the widespread adoption of GPU applications. To resolve this issue, OpenACC was developed as the first industry standard of a directive-based GPU programming […]

CUDA

Nov, 18

Specification and verification of GPGPU programs

Graphics Processing Units (GPUs) are increasingly used for general-purpose applications because of their low price, energy efficiency and enormous computing power. Considering the importance of GPU applications, it is vital that the behaviour of GPU programs can be specified and proven correct formally. This paper presents a logic to verify GPU kernels written in OpenCL, […]

OpenCL

Nov, 18

Probing the Statistical Validity of the Ductile-to-Brittle Transition in Metallic Nanowires Using GPU Computing

We perform a large-scale statistical analysis (> 2000 independent simulations) of the elongation and rupture of gold nanowires, probing the validity and scope of the recently proposed ductile-to-brittle transition that occurs with increasing nanowire length [Wu et al., Nano Lett., 12, 910-914 (2012)]. To facilitate a high-throughput simulation approach, we implement the second-moment approximation to […]

CUDA

