Papers on hgpu.org

Origami: A Convolutional Network Accelerator

Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications

Orthogonalization on a General Purpose Graphics Processing Unit with Double Double and Quad Double Arithmetic

Orthogonalization on a general purpose graphics processing unit with double double and quad double arithmetic

Orthorectification by Using GPGPU Method

Out of kernel tuning and optimizations for portable large-scale docking experiments on GPUs

Out-of-core cone beam reconstruction using multiple GPUs

Out-of-core Implementation for Accelerator Kernels on Heterogeneous Clouds

Out-of-core singular value decomposition

Out-of-core Training for Extremely Large-Scale Neural Networks With Adaptive Window-Based Scheduling

Out-of-the-box library support for DBMS operations on GPUs

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Over-synchronization in GPU Programs

Overcoming the GPU memory limitation on FDTD through the use of overlapping subgrids

Overcomplete Dictionary Learning with Jacobi Atom Updates

Overdetermined Shooting Methods for Computing Standing Water Waves with Spectral Accuracy

Overhauling SC atomics in C11 and OpenCL

Overlapping Computation and Communication for Advection on Hybrid Parallel Computers

Overlapping computation and communication of three-dimensional FDTD on a GPU cluster

Overtaking CPU DBMSes with a GPU in Whole-Query Analytic Processing with Parallelism-Friendly Execution Plan Optimization

Overview of approaches for accelerating scale invariant feature detection algorithm

Overview of implementation of DARPA GPU program in SAIC

OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance

Owl: Differential-based Side-Channel Leakage Detection for CUDA Applications

P-HGRMS: A Parallel Hypergraph Based Root Mean Square Algorithm for Image Denoising

P4OMP: Retrieval-Augmented Prompting for OpenMP Parallelism in Serial Code

PacketShader: a GPU-accelerated software router

Padding Free Bank Conflict Resolution for CUDA-Based Matrix Transpose Algorithm

Pairwise Sequence Alignment for Very Long Sequences on GPUs

Pairwise Sequence Alignment with Gaps with GPU

PAKCK: Performance and Power Analysis of Key Computational Kernels on CPUs and GPUs

Panda: A Compiler Framework for Concurrent CPU-GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers

PANDA: Extreme Scale Parallel K-Nearest Neighbor on Distributed Architectures

Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor

Pangolin: An Efficient and Flexible Graph Mining System on CPU and GPU

PanJoin: A Partition-based Adaptive Stream Join

PANNA: Properties from Artificial Neural Network Architectures

Pannotia: Understanding Irregular GPGPU Graph Applications

PantaRay: fast ray-traced occlusion caching of massive scenes

PAPER – Accelerating parallel evaluations of ROCS

ParaCodex: A Profiling-Guided Autonomous Coding Agent for Reliable Parallel Code Generation and Translation

ParadisEO-MO-GPU: a Framework for Parallel GPU-based Local Search Metaheuristics

Paragon: Collaborative Speculative Loop Execution on GPU and CPU

ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels

Paraiso: An Automated Tuning Framework for Explicit Solvers of Partial Differential Equations

Parakeet: A Just-In-Time Parallel Accelerator for Python

Parallax: Automatic Data-Parallel Training of Deep Neural Networks

Parallelizing AwSpPCA for robust facial recognition using CUDA

Parallel 3D Fast Wavelet Transform comparison on CPUs and GPUs

Parallel 3D Finite Difference Time Domain Simulations on Graphics Processors with Cuda

Parallel 3D Image Segmentation of Large Data Sets on a GPU Cluster

Parallel 3D multigrid methods on the STI cell BE architecture

Parallel 5 point SOR for solving the Convection Diffusion equation using graphics processing units

Parallel acceleration of CPU and GPU range queries over large data sets

Parallel Acceleration on Manycore Systems and Its Performance Analysis: OpenCL Case Study

Parallel accelerators for GlimmerHMM bioinformatics algorithm

Parallel Actors and Learners: A Framework for Generating Scalable RL Implementations

Parallel AES algorithm for fast Data Encryption on GPU

Parallel AES Encryption Engines for Many-Core Processor Arrays

Parallel Agent systems on a GPU for use with Simulations and Games

Parallel Algorithm Design and Implementation of Regular/Irregular Problems: An In-depth Performance Study on Graphics Processing Units

Parallel Algorithm for BSDEs Based High Dimensional American Option Pricing on the GPU

Parallel Algorithm for Generation of Test Recommended Path using CUDA

Parallel Algorithm for GPU Processing; for use in High Speed Machine Vision Sensing of Cotton Lint Trash

Parallel Algorithm for Solving Kepler’s Equation on Graphics Processing Units: Application to Analysis of Doppler Exoplanet Searches

Parallel Algorithm of IDCT with GPUs and CUDA for Large-scale Video Quality of 3G

Parallel algorithms for approximation of distance maps on parametric surfaces

Parallel Algorithms for Constructing Data Structures for Fast Multipole Methods

Parallel Algorithms for Counting Problems on Graphs Using Graphics Processing Units

Parallel Algorithms for GPU accelerated Probabilistic Inference

Parallel Algorithms for Hybrid Multi-core CPU-GPU Implementations of Component Labelling in Critical Phase Models

Parallel algorithms for problems of cluster analysis with very large amount of data

Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations

Parallel algorithms to a parallel hardware: Designing vision algorithms for a GPU

Parallel and Concurrent Programming in Haskell: Techniques for Multicore and Multithreaded Programming

Parallel and Distributed Deep Learning

Parallel and Distributed Implementations of Multiple and Two-Dimensional Pattern Matching Algorithms

Parallel and efficient Boolean on polygonal solids

Parallel and Heterogeneous Timing Analysis: Partition, Algorithm, and System

Parallel and Improved PageRank Algorithm for GPU-CPU Collaborative Environment

Parallel and in-process compilation of individuals for genetic programming on GPU

Parallel and Scalable Sparse Basic Linear Algebra Subprograms

Parallel ant colony for nonlinear function optimization with graphics hardware acceleration

Parallel Application Library for Object Recognition

Parallel Approach for Longest Common Subsequence problem on GPU

Parallel Approach for Time Series Analysis with General Regression Neural Networks

Parallel Approaches for SWAMP Sequence Alignment

Parallel Approaches to Edit Distance and Approximate String Matching

Parallel Approaches to Shortest-Path Problems for Multilevel Heterogeneous Computing

Parallel Arbitrary-precision Integer Arithmetic

Parallel Asynchronous Modelization and Execution of Cholesky Algorithm using Petri Nets

Parallel Banding Algorithm to compute exact distance transform with the GPU

Parallel Batch Training of the Self-Organizing Map Using OpenCL

Parallel Benefit on Different Programming Paradigms

Parallel Bio-Inspired Methods for Model Optimization and Pattern Recognition

Parallel birth and death process for cell nuclei extraction in histopathology images

Parallel Branch and Bound on a CPU-GPU System

Parallel Branch Prediction on GPU Platform

Titles: 100
open PDFs: 90
packages: 18
