high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Floating Textures

Floating-Point Arithmetic in Transport Triggered Architectures

Floating-point data compression at 75 Gb/s on a GPU

Floating-point Mixed-radix FFT Core Generation for FPGA and Comparison with GPU and CPU

Flocking Implementation for the Blender Game Engine

Flow Charts: Visualization of Vector Fields on Arbitrary Surfaces

FLOWER: A Comprehensive Dataflow Compiler for High-Level Synthesis

FlowPM: Distributed TensorFlow Implementation of the FastPM Cosmological N-body Solver

FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow

FlowTour: An Automatic Guide for Exploring Internal Flow Features

Fluid Dynamics Simulations on Multi-GPU Systems

Fluid Motion Modelling Using Vortex Particle Method on GPU

Fluid Simulation and Generating Textures with Reaction-Diffusion Systems on Surfaces in the GPU

Fluid Simulation by the Smoothed Particle Hydrodynamics Method: A Survey

Fluid Simulation on Surfaces in the GPU

Fluid simulation with SIMPLE method using graphic processors

Fluid Simulation: Smoothed Particle Hydrodynamics on the GPU

Fluid-solid coupling on a cluster of GPU graphics cards for seismic wave propagation

FluidFFT: common API (C++ and Python) for Fast Fourier Transform HPC libraries

FluoroSim: A Visual Problem-Solving Environment for Fluorescence Microscopy

Flux tubes at Finite Temperature

FMM-based vortex method for simulation of isotropic turbulence on GPUs, compared with a spectral method

fMRI analysis on the GPU-possibilities and challenges

Focus measurement on programmable graphics hardware for all in-focus rendering from light fields

Focused Volumetric Visual Hull with Color Extraction

Forecasting high frequency financial time series using parallel FFN with CUDA and ZeroMQ

Forecasting time series with constraints

Forensics on GPU Coprocessing in Databases – Research Challenges, First Experiments, and Countermeasures

Formal Analysis of GPU Programs with Atomics via Conflict-Directed Delay-Bounding

Formal Description and Optimization Based High – Performance Computing on CUDA

Formal Semantics of Heterogeneous CUDA-C: A Modular Approach with Applications

Formal specification and verification of OpenCL Kernel optimization

Formalizing Address Spaces with application to Cuda, OpenCL, and beyond

ForOpenCL: Transformations Exploiting Array Syntax in Fortran for Accelerator Programming

Fortran High-Level Synthesis: Reducing the barriers to accelerating HPC codes on FPGAs

Fortran performance optimisation and auto-parallelisation by leveraging MLIR-based domain specific abstractions in Flang

FortranX: Harnessing Code Generation, Portability, and Heterogeneity in Fortran

Four styles of parallel and net programming

Four-dimensional Cone Beam CT Reconstruction and Enhancement using a Temporal Non-Local Means Method

Fourier Volume Rendering on the GPU Using a Split-Stream-FFT

FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error

FPGA accelerated 3D reconstruction using compressive sensing

FPGA Accelerated Simulation of Biologically Plausible Spiking Neural Networks

FPGA Acceleration of Multifunction Printer Image Processing using OpenCL

FPGA acceleration of rigid-molecule docking codes

FPGA Acceleration of Structured-Mesh-Based Explicit and Implicit Numerical Solvers using SYCL

FPGA acceleration of the phylogenetic likelihood function for Bayesian MCMC inference methods

FPGA Accelerators on Heterogeneous Systems: An Approach Using High Level Synthesis

FPGA and ASIC Convergence

FPGA and GPU implementation of large scale SpMV

FPGA Based Acceleration of Decimal Operations

FPGA Based High Performance and Scalable Block LU Decomposition Architecture

FPGA Based Implementation of Deep Neural Networks Using On-chip Memory Only

FPGA Based Satisfiability Checking

FPGA based Speeded Up Robust Features

FPGA implementation of a Convolutional Neural Network for "Wake up word" detection

FPGA Implementation of Bluetooth Low Energy Physical Layer with OpenCL

FPGA Implementation of Reduced Precision Convolutional Neural Networks

FPGA in HPC: High Level Synthesys of OpenCL kernels for Molecular Dynamics

FPGA vs. GPU for sparse matrix vector multiply

FPGA vs. multi-core CPUs vs. GPUs: hands-on experience with a sorting application

FPGA-Accelerated Image Processing Using High Level Synthesis with OpenCL

FPGA-based acceleration of a particle simulation High Performance Computing application

FPGA-based acceleration of CHARMM-potential minimization

FPGA-based Acceleration of FT Convolution for Pulsar Search Using OpenCL

FPGA-Based Accelerator Design from a Domain-Specific Language

FPGA-Based Design of Numerical Algorithms for Kernel Density Estimation Using High Level Synthesis Approach

FPGA-based Tsunami Simulation: Performance Comparison with GPUs, and Roofline Model for Scalability Analysis

FPGA-GPU architecture for kernel SVM pedestrian detection

FPGA-GPU-CPU Heterogenous Architecture for Real-time Cardiac Physiological Optical Mapping

FPGA: An Efficient And Promising Platform For Real-Time Image Processing Applications

fpgaConvNet: A Toolflow for Mapping Diverse Convolutional Neural Networks on Embedded FPGAs

FPGAs, GPUs and the PS2 – A Single Programming Methodology

Fractal Art Generation using GPUs

Fractal Based Method on Hardware Acceleration for Natural Environments

Fractal Video Compression in OpenCL: An Evaluation of CPUs, GPUs, and FPGAs as Acceleration Platforms

Fractals Image Rendering and Compression using GPUs

Frame-based parallelization of MPEG-4 on compute unified device architecture (CUDA)

Framework for Batched and GPU-resident Factorization Algorithms Applied to Block Householder Transformations

Framework for Parallel Kernels Auto-tuning

Framework for utilizing computational devices within simulation

Frameworks for GPU Accelerators: A comprehensive evaluation using 2D/3D image registration

Frameworks for multi-core architectures: a comprehensive evaluation using 2D/3D image registration

Frameworks in Medical Image Analysis with Deep Neural Networks

Free Launch: Optimizing GPU Dynamic Kernel Launches through Thread Reuse

Free surface flow simulations on GPGPUs using the LBM

Free-form interest rate term structure decomposition: a 2nd order optimization problem

Frequent itemset mining on graphics processors

From Constraint Programming to Heterogeneous Parallelism

From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming

From English To Foreign Languages: Transferring Pre-trained Language Models

From Experiment to Design – Fault Characterization and Detection in Parallel Computer Systems Using Computational Accelerators

From GPUs to AI and quantum: three waves of acceleration in bioinformatics

From MPI to MPI+OpenACC: Conversion of a legacy FORTRAN PCG solver for the spherical Laplace equation

From OpenCL to Gates: the FFT

From Parallel Programs to Customized Parallel Processors

From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation

From Pixels to Torques: Policy Learning using Deep Dynamical Convolutional Networks

From Prompts to Performance: Evaluating LLMs for Task-based Parallel Code Generation

From Rendering to Tracking Point-based 3D Models

Brief statistics for this page

Titles: 100

Download open PDFs: 97

Package packages: 16

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems
