high performance computing on graphics processing units: hgpu.org

Posts

Nov, 18

Auto-tunable GPU BLAS

OpenCL is fast becoming the preferred framework used to make programs for heterogeneous platforms consisting of at least one CPU and one or more accelerators. The GPU being readily available in almost all computers, it is the most common accelerator in use.Good libraries are important to reduce development time and to make particular development environments, […]

OpenCL

Nov, 18

The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing

Multi2Sim is a simulation framework for heterogeneous computing, including models for superscalar, multithreaded, multicore, and graphics processors. Multi2Sim is an application-only simulator, which allows one or more applications to be run on top of it without booting a guest operating system first. In this chapter, an introduction to Multi2Sim is presented, and it is shown […]

OpenCL

Nov, 18

Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units

This paper presents a number of algorithms to run the fast multipole method (FMM) on NVIDIA CUDA-capable graphical processing units (GPUs) (Nvidia Corporation, Sta. Clara, CA, USA). The FMM is a class of methods to compute pairwise interactions between N particles for a given error tolerance and with computational cost of O(N). The methods described […]

CUDA

Nov, 18

Neon: A Domain-Specific Programming Language for Image Processing

Neon is a high-level domain-specific programming language for writing efficient image processing programs which can run on either the CPU or the GPU. End users write Neon programs in a C# programming environment. When the Neon program is executed, our optimizing code generator outputs human-readable source files for either the CPU or GPU. These source […]

Nov, 17

Dax Toolkit: A Proposed Framework for Data Analysis and Visualization at Extreme Scale

Experts agree that the exascale machine will comprise processors that contain many cores, which in turn will necessitate a much higher degree of concurrency. Software will require a minimum of a 1,000 times more concurrency. Most parallel analysis and visualization algorithms today work by partitioning data and running mostly serial algorithms concurrently on each data […]

CUDA

Nov, 17

Compilation for Heterogeneous Computing: Automating Analyses, Transformations and Decisions

Hardware accelerators, such as fpga boards or gpu, are an interesting alternative or a valuable complement to classic multi-core processors for computational-intensive software. However it proves to be both costly and difficult to use legacy applications with these new heterogeneous targets. In particular, existing compilers are generally targeted toward code generation for sequential processors and […]

CUDA

Nov, 17

Programming Future Parallel Architectures with Haskell and Intel ArBB

New parallel architectures, such as Cell, Intel MIC, GPUs, and tiled architectures, enable high performance but are often hard to program. What is needed is a bridge between high-level programming models where programmers are most productive and modern parallel architectures. We propose that that bridge is Embedded Domain Specific Languages (EDSLs). One attractive target for […]

CUDA

Nov, 17

Scientific GPU Programming with Data-Flow Languages

Graphical Processing Units or GPUs are processors used primarily to render images from computer models for domains ranging from gaming to design engineering. As the generation of very accurate images often in real time is extremely computationally intensive, they have developed into extremely powerful processors. To achieve this they have relied on being able to […]

CUDA

Nov, 17

FPGA and ASIC Convergence

The growing demands on multimedia applications and high-speed high-quality telecommunication systems with real-time constrains oriented to portable, low power consumption, devices, have being driven technologies development, methodologies and design flows of embedded systems during the last years. Through the analysis of design methodologies and strategies facing multi-core, reconfigurability and power consumption challenges, this educational survey […]

Nov, 17

Characterization and Transformation of Unstructured Control Flow in GPU Applications

Hardware and compiler techniques for mapping data-parallel programs with divergent control flow to SIMD architectures have recently enabled the emergence of new GPGPU programming models such as CUDA and OpenCL. Although this technology is widely used, commodity GPUs use different schemes to implement it, and the performance limitations of these different schemes under real workloads […]

CUDA

•

OpenCL

Nov, 17

Massive Image Editing on the Cloud

Processing massive imagery in a distributed environment currently requires the effort of a skilled team to efficiently handle communication, synchronization, faults, and data/process distribution. Moreover, these implementations are highly optimized for a specific system or cluster, therefore portability or improved performance due to system improvements is rarely considered. Much like early GPU computing, cluster computing […]

Nov, 17

Adaboost GPU-based Classifier for Direct Volume Rendering

In volume visualization, the voxel visibitity and materials are carried out through an interactive editing of Transfer Function. In this paper, we present a two-level GPU-based labeling method that computes in times of rendering a set of labeled structures using the Adaboost machine learning classifier. In a pre-processing step, Adaboost trains a binary classifier from […]

CUDA

•

OpenCL

•

OpenGL

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Auto-tunable GPU BLAS

The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing

Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units

Neon: A Domain-Specific Programming Language for Image Processing

Dax Toolkit: A Proposed Framework for Data Analysis and Visualization at Extreme Scale

Compilation for Heterogeneous Computing: Automating Analyses, Transformations and Decisions

Programming Future Parallel Architectures with Haskell and Intel ArBB

Scientific GPU Programming with Data-Flow Languages

FPGA and ASIC Convergence

Characterization and Transformation of Unstructured Control Flow in GPU Applications

Massive Image Editing on the Cloud

Adaboost GPU-based Classifier for Direct Volume Rendering

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery

Most viewed papers (last 30 days)