high performance computing on graphics processing units: hgpu.org

Posts

Nov, 24

Accelerating QDP++/Chroma on GPUs

Extensions to the C++ implementation of the QCD Data Parallel Interface are provided enabling acceleration of expression evaluation on NVIDIA GPUs. Single expressions are off-loaded to the device memory and execution domain leveraging the Portable Expression Template Engine and using Just-in-Time compilation techniques. Memory management is automated by a software implementation of a cache controlling […]

CUDA

Nov, 24

A GPU-Enabled, High-Resolution Cosmological Microlensing Parameter Survey

In the era of synoptic surveys, the number of known gravitationally lensed quasars is set to increase by over an order of magnitude. These new discoveries will enable a move from single-quasar studies to investigations of statistical samples, presenting new opportunities to test theoretical models for the structure of quasar accretion discs and broad emission […]

CUDA

Nov, 23

Automated architecture-aware mapping of streaming applications onto GPUs

Graphics Processing Units (GPUs) are made up of many streaming multiprocessors, each consisting of processing cores that interleave the execution of a large number of threads. Groups of threads, called warps in NVIDIA literature and wavefronts in AMD literature, are selected by the hardware scheduler and executed in lockstep on the available […]

CUDA, OpenCL

Nov, 23

A Parallel Deconvolution Algorithm in Perfusion Imaging

In this paper, we will present the implementation of a deconvolution algorithm for brain perfusion quantification on GPGPU (General Purpose Graphics Processor Units) using the CUDA programming model. GPUs originated as graphics generation dedicated co-processors, but the modern GPUs have evolved to become a more general processor capable of executing scientific computations. It provides a […]

CUDA

Nov, 23

Real-World Constraints of GPUs in Real-Time Systems

Graphics processing units (GPUs) are becoming increasingly important in today’s platforms as their increased generality allows for them to be used as powerful coprocessors. In this paper, we explore possible applications for GPUs in real-time systems, discuss the limitations and constraints imposed by current GPU technology, and present a summary of our research addressing many […]

Nov, 23

Soren: Adaptive MapReduce for Programmable GPUs

In recent years the MapReduce programming model has been widely used for developing parallel data-intensive applications. As a result of its popularity, there exist many implementations of the MapReduce model on different parallel architectures including on massively parallel programmable GPUs. A basic challenge in implementing a MapReduce runtime system is the wide diversity of applications […]

CUDA

Nov, 23

Towards solving the Table Maker’s Dilemma on GPU

Since 1985, the IEEE 754 standard has defined formats, rounding modes and basic operations for floating-point arithmetic. In 2008 the standard was extended, and recommendations were added about the rounding of some elementary functions such as trigonometric functions (cosine, sine, tangent and their inverses), exponentials, and logarithms. However, to guarantee the exact rounding of […]

CUDA

Nov, 23

Accelerating Protein Sequence Search in a Heterogeneous Computing System

The “Basic Local Alignment Search Tool” (BLAST) is arguably the most widely used computational tool in bioinformatics. However, the computational power required for routine BLAST analysis has been outstripping Moore’s Law due to the exponential growth in the size of the genomic sequence databases that BLAST searches on. To address the above issue, we propose […]

Nov, 23

Building-Blocks for Performance Oriented DSLs

Domain-specific languages raise the level of abstraction in software development. While it is evident that programmers can more easily reason about very high-level programs, the same holds for compilers only if the compiler has an accurate model of the application domain and the underlying target platform. Since mapping high-level, general-purpose languages to modern, heterogeneous hardware […]

CUDA

Nov, 23

TEG: GPU Performance Estimation Using a Timing Model

Modern Graphics Processing Units (GPUs) offer significant performance speedup over conventional processors. Programming on GPUs for general-purpose applications has become an important research area. The CUDA programming model provides a C-like interface and is widely accepted. However, since hardware vendors do not disclose enough underlying architecture details, programmers have to optimize their applications without fully […]

CUDA

Nov, 23

Accelerating the Rate of Astronomical Discovery with GPU-Powered Clusters

In recent years, the Graphics Processing Unit (GPU) has emerged as a low-cost alternative for high performance computing, enabling impressive speed-ups for a range of scientific computing applications. Early adopters in astronomy are already benefiting in adapting their codes to take advantage of the GPU’s massively parallel processing paradigm. I give an introduction to, and […]

Nov, 23

An efficient mixed-precision, hybrid CPU-GPU implementation of a fully implicit particle-in-cell algorithm

Recently, a fully implicit, energy- and charge-conserving particle-in-cell method has been proposed for multi-scale, full-f kinetic simulations [G. Chen, et al., J. Comput. Phys. 230,18 (2011)]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver, capable of using very large timesteps without loss of numerical stability or accuracy. A fundamental feature of the method is the […]

CUDA

Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
