high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Collaborative Diffusion on the GPU for Path-Finding in Games

Collaborative diffusion: programming antiobjects

Collaborative execution environment for heterogeneous parallel systems

Collage: Automated Integration of Deep Learning Backends

Collection skeletons: declarative abstractions for data collections

Collective Communication for 100k+ GPUs

Collision Detection Based on Fuzzy Scene Subdivision

Collision Detection of Triangle Meshes using GPU

Collision detection on the GPU

Collision Detection: Broad Phase Adaptation from Multi-Core to Multi-GPU Architecture

Collision for 75-step SHA-1: Intensive Parallelization with GPU

Collision-Driven Volumetric Deformation on the GPU

Collision-streams: fast GPU-based collision detection for deformable models

Color and motion-based particle filter target tracking in a network of overlapping cameras with multi-threading and GPGPU

Color Correction Acceleration Using a Color Cube and OpenCL

Color Me Noisy: Example-based Rendering of Hand-colored Animations with Temporal Noise Control

Color Seamlessness in Multi-Projector Displays Using Constrained Gamut Morphing

Colored stochastic shadow maps

Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

Colour flux-tubes in static Pentaquark and Tetraquark systems

Column-Oriented Datalog on the GPU

Combinatorial Optimization of Work Distribution on Heterogeneous Systems

Combined acoustic and optical trapping

Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL

Combining approximate inference methods for efficient learning on large computer clusters

Combining Belief Propagation and Successive Cancellation List Decoding of Polar Codes on a GPU Platform

Combining computer vision and physics simulations using GPGPU

Combining Data Parallelism and Task Parallelism for Efficient Performance on Hybrid CPU and GPU Systems

Combining Multiple Optimised FPGA-based Pulsar Search Modules Using OpenCL

Combining Performance and Productivity: Accelerating the Network Sensing Graph Challenge with GPUs and Commodity Data Science Software

Combining recent HPC techniques for 3D geophysics acceleration

Combustion Simulations Using Graphic Processing Units

Coming Soon: Research in a Cloud

Communication and Coordination Paradigms for Highly-Parallel Accelerators

Communication Architectures for Scalable GPU-centric Computing Systems

Communication Optimization for Multi GPU Implementation of Smith-Waterman Algorithm

Communication-Avoiding Optimization of Geometric Multigrid on GPUs

Communication-avoiding QR decomposition for GPUs

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey

Communication-Minimizing 2D Convolution in GPU Registers

Communication-minimizing Asynchronous Tensor Parallelism

Community Structure Discovery algorithm on GPU with CUDA

Compact data structure and scalable algorithms for the sparse grid technique

Comparative Analysis of OpenACC, OpenMP and CUDA using Sequential and Parallel Algorithms

Comparative Evaluation of Binary Features

Comparative evaluation of platforms for parallel Ant Colony Optimization

Comparative Performance Analysis of Intel Xeon Phi, GPU, and CPU

Comparative Performance and Scalability Analysis of GPU-accelerated Database Operations

Comparative Study of Caffe, Neon, Theano, and Torch for Deep Learning

Comparative Study of Frequent Itemset Mining Techniques on Graphics Processor

Comparative Study of High Performance Computing Using Multi-core Parallel Systems

Comparative study of parallel programming models for multicore computing

Comparative Study of the Parallelization of the Smith-Waterman Algorithm on OpenMP and Cuda C

Comparing CUDA and OpenGL implementations for a Jacobi iteration

Comparing CUDA, OpenCL and OpenGL Implementations of the Cardiac Monodomain Equations

Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels

Comparing FPGAs to Graphics Accelerators and the Playstation 2 Using a Unified Source Description

Comparing GPU and CPU in OLAP Cubes Creation

Comparing GPU-based multi-volume ray casting techniques

Comparing Hardware Accelerators in Scientific Applications: A Case Study

Comparing Intra- and Inter-Processor Parallelism on Multi-Core Cell Processors for Scientific Simulations

Comparing Linear and Convex Relaxations for Stereo and Motion

Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation

Comparing Many-Core Accelerator Frameworks

Comparing Parallel Functional Array Languages: Programming and Performance

Comparing Parallel Hardware Architectures for Visually Guided Robot Navigation

Comparing Parallel Simulation of Social Agents using Cilk and OpenCL

Comparing performance and energy efficiency of FPGAs and GPUs for high productivity computing

Comparing Performance and Portability between CUDA and SYCL for Protein Database Search on NVIDIA, AMD, and Intel GPUs

Comparing Programmer Productivity in OpenACC and CUDA: an Empirical Investigation

Comparing SYCL data transfer strategies for tracking use cases

Comparing the Performance of Different x86 SIMD Instruction Sets for a Medical Imaging Application on Modern Multi- and Manycore Chips

Comparing the Power and Performance of Intel’s SCC to State-of-the-Art CPUs and GPUs

Comparing the Treecode with FMM on GPUs for vortex particle simulations of a leapfrogging vortex ring

Comparing Two Generations of Embedded GPUs Running a Feature Detection Algorithm

Comparison and Analysis of GPGPU and Parallel Computing on Multi-Core CPU

Comparison and Analysis of GPU Energy Effciency For CUDA and OpenCL

Comparison and Analysis of GPU Energy Efficiency For CUDA and OpenCL

Comparison based sorting for systems with multiple GPUs

Comparison between GPU and parallel CPU optimizations in viewshed analysis

Comparison of Cilk, Kaapi and CUDA for the Jacobi Method

Comparison of CPML Implementations for the GPU-Accelerated FDTD Solver

Comparison of different n-body algorithms on various hardware platforms using SYCL

Comparison of Different Parallel Implementations of the 2+1-Dimensional KPZ Model and the 3-Dimensional KMC Model

Comparison of FPGA and GPU implementations of real-time stereo vision

Comparison of Fragmentation/Dispersion Models for Asteroid Nuclear Disruption Mission Design

Comparison of GPU Architectures for Asynchronous Communication with Finite-Differencing Applications

Comparison of HPC Architectures for Computing All-Pairs Shortest Paths. Intel Xeon Phi KNL vs NVIDIA Pascal

Comparison of Hybrid Sorting Algorithms Implemented on Different Parallel Hardware Platforms

Comparison of OpenCL performance on different platforms using VexCL and Blaze

Comparison of OpenMP & OpenCL Parallel Processing Technologies

Comparison of OpenMP and OpenCL Parallel Processing Technologies

Comparison of parallel sorting algorithms

Comparison of Parallelisation Approaches, Languages, and Compilers for Unstructured Mesh Algorithms on GPUs

Comparison of Random Number Generators in Particle Swarm Optimization Algorithm

Comparison of Rectangular Matrix Multiplication with and without Border Conditions

Comparison of several parallel API for cloth modelling on modern GPUs

Comparison of SPMV performance on matrices with different matrix format using CUSP, cuSPARSE and ViennaCL

Comparison of Technologies for General-Purpose Computing on Graphics Processing Units

Comparison of Thread Execution Methods for GPU-oriented OpenCL Programs on Multicore Processors

Brief statistics for this page

Titles: 100

Download open PDFs: 95

Packages: 13

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs


* * *


Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems
