high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Dense photometric stereo reconstruction on many core GPUs

Dense Photometric Stereo: A Markov Random Field Approach

Dense point trajectories by GPU-accelerated large displacement optical flow

Dense Real-Time Mapping of Object-Class Semantics from RGB-D Video

Dense Symmetric Indefinite Factorization on GPU Accelerated Architectures

DenseCut: Densely Connected CRFs for Realtime GrabCut

Density Estimations for Approximate Query Processing on SIMD Architectures

Density functional theory calculation on many-cores hybrid central processing unit-graphic processing unit architectures

Density Functional Theory calculation on many-cores hybrid CPU-GPU architectures

Density-based clustering using graphics processors

Density-based parallel skin lesion border detection with webCL

Dependable Embedded Systems

Deploying Graph Algorithms on GPUs: an Adaptive Solution

Deployment of CPU and GPU-based genetic programming on heterogeneous devices

Deployment of parallel linear genetic programming using GPUs on PC and video game console platforms

Depth Enhanced Panoramas

Depth Estimation using Open Compute Language (OpenCL)

Depth Images: Representations and Real-Time Rendering

Depth Map Based Superresolution Method in 3D Reconstruction

Depth map enhanced macroblock partitioning for H.264 video coding of computer graphics content

Depth-Dependent Halos: Illustrative Rendering of Dense Line Data

Depth-First Search versus Jurema Search on GPU Branch-and-Bound Algorithms: a case study

Depth-of-Field Blur Effects for First-Person Navigation in Virtual Environments

Deriving Shape Grammars on the GPU

Descend: A Safe GPU Systems Programming Language

Design and Analysis of Soft-Error Resilience Mechanisms for GPU Register File

Design and Development of an Efficient H.264 Video Encoder for CPU/GPU using OpenCL

Design and Development of Optical Flow Based Obstacle Avoidance Using CUDA

Design and evaluation of a parallel k-nearest neighbor algorithm on CUDA-enabled GPU

Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures

Design and implementation of a high-performance stream-based computing platform on multigenerational GPUs

Design and Implementation of a PTX Emulation Library

Design and implementation of a time-division multiplexing scan architecture using serializer and deserializer in GPU chips

Design and Implementation of Centrally-Coordinated Peer-to-Peer Live-streaming

Design and Implementation of CNN-FPGA accelerator based on Open Computing Language

Design and Implementation of GPU-Based Prim’s Algorithm

Design and implementation of MPEG audio layer III decoder using graphics processing units

Design and Implementation of ShenWei Universal C/C++

Design and implementation of software-managed caches for multicores with local memory

Design and Implementation of the Futhark Programming Language

Design and implementation of the Smith-Waterman algorithm on the CUDA-compatible GPU

Design and Modeling of a Non-blocking Checkpointing System

Design and optimization of a portable LQCD Monte Carlo code using OpenACC

Design and optimization of DBSCAN Algorithm based on CUDA

Design and Optimization of Hybrid MD5-Blowfish Encryption on GPUs

Design and Optimization of Image Processing Algorithms on Mobile GPU

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

Design and Optimization of OpenFOAM-based CFD Applications for Modern Hybrid and Heterogeneous HPC Platforms

Design and Performance Analysis of Parallel Processing of SRTP Packets

Design and performance evaluation of a digital wideband receiver on a hybrid computing platform

Design and Performance Evaluation of a Software Framework for Multi-Physics Simulations on Heterogeneous Supercomputers

Design and Performance Evaluation of Image Processing Algorithms on GPUs

Design and Performance Evaluation of Optimizations for OpenCL FPGA Kernels

Design and Performance of the OP2 Library for Unstructured Mesh Applications

Design and Storage Optimization of GPU-based Parallel Program of Image Registration for Remote Sensing

Design and study of a massively multi threaded shared memory architecture

Design Exploration of AES Accelerators on FPGAs and GPUs

Design Exploration of Quadrature Methods in Option Pricing

Design of 3D FFT on Multi-GPU Clusters

Design of a fully programmable shader processor for low power mobile devices

Design of a Hybrid Memory System for General-Purpose Graphics Processing Units

Design of a parallel AES for graphics hardware using the CUDA framework

Design of a programmable micro-ultrasound research platform

Design of an FPGA-Based FDTD Accelerator Using OpenCL

Design of FPGA-Based Accelerator for Convolutional Neural Network under Heterogeneous Computing Framework with OpenCL

Design of Hardware Accelerator for Lempel-Ziv 4 (LZ4) Compression

Design of high-performance parallelized gene predictors in MATLAB

Design of MILC Lattice QCD Application for GPU Clusters

Design optimization of automotive electronic control unit using the analysis of common-mode current by fast electromagnetic field solver

Design Principles for Sparse Matrix Multiplication on the GPU

Design Space Exploration for GPU-Based Architecture

Design Space Exploration of an OpenCL Based SAXPY Kernel Implementation on FPGAs

Design Space Exploration of Concurrency Mapping to FPGAs in Weather and Climate Applications with Xilinx SDSoC OpenCL, SDSoC C++ and Vivad

Design Space Exploration of OpenCL Applications on Heterogeneous Parallel Platforms

Design Space Exploration of Real-time Bedside and Portable Medical Ultrasound Adaptive Beamformer Acceleration

Design space exploration towards a realtime and energy-aware GPGPU-based analysis of biosensor data

Design Tools for Accelerating Development and Usage of Multi-Core Computing Platforms

Design, Implementation and Performance Evaluation of a Stochastic Gradient Descent Algorithm on CUDA

Design, Implementation and Test of Efficient GPU to GPU Communication Methods

Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs

Designing a high-performance boundary element library with OpenCL and Numba

Designing a Modern Skeleton Programming Framework for Parallel and Heterogeneous Systems

Designing a Unified Programming Model for Heterogeneous Machines

Designing and optimizing compute kernels on NVIDIA GPUs

Designing Bit-Reproducible Portable High-Performance Applications

Designing Efficient Barriers and Semaphores for Graphics Processing Units

Designing Efficient Many-Core Parallel Algorithms for All-Pairs Shortest-Paths Using CUDA

Designing Efficient MPI and UPC Runtime for Multicore Clusters with InfiniBand, Accelerators and Co-Processors

Designing efficient sorting algorithms for manycore GPUs

Designing Fast Architecture Sensitive Tree Search on Modern Multi-Core/Many-Core Processors

Designing Fast LTL Model Checking Algorithms for Many-Core GPUs

Designing Numerical Solvers for Next Generation High Performance Computing

Designing OP2 for GPU architectures

Designing scalable many-core parallel algorithms for min graphs using CUDA

Designing Scientific Applications on GPUs

Designing the Language Liszt for Building Portable Mesh-based PDE Solvers

Detecting Computer Viruses using GPUs

Detecting Data Races on OpenCL Kernels with Symbolic Execution

Detecting multiple periodicities in observational data with the multi-frequency periodogram. II. Frequency Decomposer, a parallelized time-series analysis algorithm

Detecting parametric objects in large scenes by Monte Carlo sampling

Brief statistics for this page

Titles: 100

Open PDFs: 89

Packages: 11

