high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Portable C++ Code that can Look and Feel Like Fortran Code with Yet Another Kernel Launcher (YAKL)

Portable GPU-Based Artificial Neural Networks for Accelerated Data-Driven Modeling

Portable high-order finite element kernels I: Streaming Operations

Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi

Portable Mapping of Data Parallel Programs to OpenCL for Heterogeneous Systems

Portable OpenCL Out-of-Order Execution Framework for Heterogeneous Platforms

Portable Parallel Kernels for High-Speed Beamforming in Synthetic Aperture Ultrasound Imaging

Portable parallelized blowfish via RenderScript

Portable Performance on Heterogeneous Architectures

Portable Programming Models for Heterogeneous Platforms

Portable Real-Time DCT Based Steganography Using OpenCL

Portable, high-performance containers for HPC

Portable, Scalable Approaches for Improving Asynchronous Many-Task Runtime Node Use

Portage: Bringing Hackers’ Wisdom to Science

Porting a high-order finite-element earthquake modeling application to NVIDIA graphics cards using CUDA

Porting a sparse linear algebra math library to Intel GPUs

Porting and optimizing MAGFLOW on CUDA

Porting Batched Iterative Solvers onto Intel GPUs with SYCL

Porting estimation of distribution algorithms to the cell broadband engine

Porting FEASTFLOW to the Intel Xeon Phi: Lessons Learned

Porting HPC Applications to AMD Instinct MI300A Using Unified Memory and OpenMP

Porting Large HPC Applications to GPU Clusters: The Codes GENE and VERTEX

Porting marine ecosystem model spin-up using transport matrices to GPUs

Porting NAHUJ to CUDA

Porting numerical integration codes from CUDA to oneAPI: a case study

Porting of an Edge-Based CFD Solver to GPUs

Porting OpenACC to OpenMP on heterogeneous systems

Porting to the Intel Xeon Phi: Opportunities and Challenges

Porting tree-based hash table compression to GPGPU model checking

Poseidon: A System Architecture for Efficient GPU-based Deep Learning on Multiple Machines

Position-Dependent Arrays and Their Application for High Performance Code Generation

Possible planet-forming regions on submillimetre images

Poster: CUDA-Accelerated Continuous 2D Scatterplots

Poster: GPU-accelerated artificial neural network for QSAR modeling

Poster: GPU-accelerated rigid body fitting of atomic structures into electron density maps

Potential contribution of CNN-based solving of stiff ODEs and PDEs to enabling real-time Computational Engineering

Potential Energy Landscapes for the 2D XY Model: Minima, Transition States and Pathways

Potential of General Purpose Graphic Processing Unit for Energy Management System

Power analysis and optimizations for GPU architecture using a power simulator

Power analysis of sorting algorithms on FPGA using OpenCL

Power and Performance Analysis of GPU-Accelerated Systems

Power and Performance Characterization of Computational Kernels on the GPU

Power and Performance Studies of the Explicit Multi-Threading (XMT) Architecture

Power Consumption Modeling and Prediction in a Hybrid CPU-GPU-MIC Supercomputer

Power Consumption of GPUs from a Software Perspective

Power consumption of mixed precision in the iterative solution of sparse linear systems

Power Control for GPU Clusters in processing large-scale streams

Power Efficient Large Matrices Multiplication by Load Scheduling on Multi-core and GPU Platform with CUDA

Power Flow Analysis on CUDA-based GPU

Power Management and Optimization

Power Management for GPU-CPU Heterogeneous Systems

Power Management Techniques for Data Centers: A Survey

Power Modeling and Optimization for GPGPUs

Power performance analysis of 3-D finite element mesh refinement with tetrahedra by CUDA/MPI on multi-core and GPU architecture

Power Profiling and Optimization for Heterogeneous Multi-Core Systems

Power Profiling of GeMTC Many Task Computing

Power-aware Performance of Mixed Precision Linear Solvers for FPGAs and GPGPUs

Power-Efficient Accelerators for High-Performance Applications

Power-efficient medical image processing using PUMA

Power-Efficient Time-Sensitive Mapping in Heterogeneous Systems

Power-Efficient Work Distribution Method for CPU-GPU Heterogeneous System

Power-performance comparison of single-task driven many-cores

Power, Energy and Speed of Embedded and Server Multi-Cores applied to Distributed Simulation of Spiking Neural Networks: ARM in NVIDIA Tegra vs Intel Xeon quad-cores

PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusion

Practical Algorithms for Finding Extremal Sets

Practical and Robust Stenciled Shadow Volumes for Hardware-Accelerated Rendering

Practical and Theoretical Aspects of a Parallel Twig Join Algorithm for XML Processing using a GPGPU

Practical CFD Simulations on Programmable Graphics Hardware using SMAC

Practical considerations for GPU-accelerated CT

Practical craniofacial surgery simulator based on GPU accelerated lattice shape matching

Practical examples of GPU computing optimization principles

Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs

Practical Implementation of Lattice QCD Simulation on Intel Xeon Phi Knights Landing

Practical logarithmic rasterization for low-error shadow maps

Practical parallel imaging compressed sensing MRI: Summary of two years of experience in accelerating body MRI of pediatric patients

Practical Patient-Specific Cardiac Blood Flow Simulations Using SPH

Practical Pre-stack Kirchhoff Time Migration of Seismic Processing on General Purpose GPU

Practical Random Linear Network Coding on GPUs

Practical Symbolic Execution Analysis and Methodology for GPU Programs

Practical Symbolic Race Checking of GPU Programs

Practical Symmetric Key Cryptography on Modern Graphics Hardware

Practically efficient methods for performing bit-reversed permutation in C++11 on the x86-64 architecture

Pragma Directed Shared Memory Centric Optimizations on GPUs

PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization

PRAND: GPU accelerated parallel random number generation library: Using most reliable algorithms and applying parallelism of modern GPUs and CPUs

Pre-Training LLMs on a budget: A comparison of three optimizers

Precise dynamic analysis for slack elasticity: adding buffering without adding bugs

Precise Energy Consumption Measurements of Heterogeneous Artificial Intelligence Workloads

Precision and Performance Analysis of C Standard Math Library Functions on GPUs

Precision and Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs

Precision-Aware Soft Error Protection for GPUs

Precomputed Atmospheric Scattering

Precomputed compressive sensing for light transport acquisition

Precomputed Visibility Cuts for Interactive Relighting with Dynamic BRDFs

Preconditioned conjugate gradient solver for structural problems

Predictable GPGPU Computing in DNN-Driven Autonomous Systems

Predicting GPUDirect Benefits for HPC Workloads

Predicting NVIDIA’s Next-Day Stock Price: A Comparative Analysis of LSTM, MLP, ARIMA, and ARIMA-GARCH Models

Predicting the Execution Time of a kernel on a specific GPU using PTX code

Prediction of Performance and Power Consumption of GPGPU Applications

Brief statistics for this page

Titles: 100

Download open PDFs: 91

Packages: 24
