Papers on hgpu.org (.txt-file)

Asynchronous Communication Schemes for Finite Difference Methods on Multiple GPUs

Asynchronous Methods for Deep Reinforcement Learning

Asynchronous OpenCL/MPI numerical simulations of conservation laws

Asynchronous Parallel Computing Algorithm implemented in 1D Heat Equation with CUDA

Asynchronous Parallel Computing Model of Global Motion Estimation with CUDA

Asynchronous Task-Based Polar Decomposition on Single Node Manycore Architectures

Asynchronous-Many-Task Systems: Challenges and Opportunities – Scaling an AMR Astrophysics Code on Exascale machines using Kokkos and HPX

ATI Stream Profiler: a tool to optimize an OpenCL kernel on ATI Radeon GPUs

Atmospheric turbulence removal using convolutional neural network

Atomic-free Irregular Computations on GPUs

Atos: A Task-Parallel GPU Dynamic Scheduling Framework for Dynamic Irregular Computations

Attack Signature Matching using Graphics Processors in High-Performance Intrusion Detection Systems

Attention-based NMT Models as Feature Functions in Phrase-based SMT

ATTILA: a cycle-level execution-driven simulator for modern GPU architectures

Audiovisual Voice Activity Detection and Localization of Simultaneous Speech Sources

Augmented reality live-action compositing

Augmented reality usage for prototyping speed up

Augmenting Operating Systems With the GPU

Augur: a Modeling Language for Data-Parallel Probabilistic Inference

Aurally and visually enhanced audio search with soundtorch

AUTO-GC: Automatic translation of data mining applications to GPU clusters

Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters

Auto-Generation and Auto-Tuning of 3D Stencil Codes on Homogeneous and Heterogeneous GPU Clusters

Auto-Generation of Parallel Finite-Differencing Code for MPI, TBB and CUDA

Auto-optimization of a Feature Selection Algorithm

Auto-SpMV: Automated Optimizing SpMV Kernels on GPU

Auto-tunable GPU BLAS (thesis)

Auto-tuned OpenCL kernel co-execution in OmpSs for heterogeneous systems

Auto-tuning 3-D FFT library for CUDA GPUs

Auto-tuning a High-Level Language Targeted to GPU Codes

Auto-tuning a LOFAR radio astronomy pipeline in JavaCL

Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs

Auto-Tuning Dedispersion for Many-Core Accelerators

Auto-tuning Dense Matrix Multiplication for GPGPU with Cache

Auto-tuning Dense Vector and Matrix-Vector Operations for Fermi GPUs

Auto-tuning Hybrid CPU-GPU Execution of Algorithmic Skeletons in SkePU

Auto-tuning interactive ray tracing using an analytical GPU architecture model

Auto-tuning of fast fourier transform on graphics processors

Auto-Tuning of Level 1 and Level 2 BLAS for GPUs

Auto-tuning on the macro scale: high level algorithmic auto-tuning for scientific applications

Auto-tuning Shallow water simulations on GPUs

Auto-tuning SkePU: a multi-backend skeleton programming framework for multi-GPU systems

Auto-tuning Streamed Applications on Intel Xeon Phi

Auto-Tunning of Data Communication on Heterogeneous Systems

Auto-Vectorizing a Large-scale Production Unstructured-mesh CFD Application

AutoDDL: Automatic Distributed Deep Learning with Asymptotically Optimal Communication

AutoFreeze: Automatically Freezing Model Blocks to Accelerate Fine-tuning

AutoMat – Automatic Differentiation for Generalized Standard Materials on GPUs

Automated and interactive approaches for optimal surface finding based segmentation of medical image data

Automated and parallel code generation for finite-differencing stencils with arbitrary data types

Automated Architecture Design for Deep Neural Networks

Automated architecture-aware mapping of streaming applications onto GPUs

Automated Buffer Sizing of Dataflow Applications in a High-Level Synthesis Workflow

Automated C/C++ Program Repair for High-Level Synthesis via Large Language Models

Automated Deep Learning Optimization via DSL-Based Source Code Transformation

Automated development of applications for graphical processing units using rewriting rules

Automated Dynamic Analysis of CUDA Programs

Automated Enhanced Parallelization of Sequential C to Parallel OpenMP

Automated Generation of OpenCL Programs Based on Algebra-Algorithmic Approach

Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications

Automated image alignment for 2D gel electrophoresis in a high-throughput proteomics pipeline

Automated Long-Term Monitoring of Parallel Microfluidic Operations Applying a Machine Vision-Assisted Positioning Method

Automated Partitioning of Data-Parallel Kernels using Polyhedral Compilation

Automated pose estimation in 3D point clouds applying annealing particle filters and inverse kinematics on a GPU

Automated Runtime Analysis and Adaptation for Scalable Heterogeneous Computing

Automated Software Testing of Memory Performance in Embedded GPUs

Automated Techniques for Enabling Efficient MPI Application Migration

Automated test generation for OpenCL kernels using fuzzing and constraint solving

Automated Testing of Graphics Shader Compilers

Automated Tool to Generate Parallel CUDA code from a Serial C Code

Automatic abstraction and fault tolerance in cortical microarchitectures

Automatic acceleration of Numpy applications on GPUs and multicore CPUs

Automatic and Explicit Parallelization Approaches for Mathematical Simulation Models

Automatic and portable mapping of data parallel programs to OpenCL for GPU-based heterogeneous systems

Automatic bi-layer video segmentation based on sensor fusion

Automatic BLAS Offloading on Unified Memory Architecture: A Study on NVIDIA Grace-Hopper

Automatic C-to-CUDA Code Generation for Affine Programs

Automatic classification of object code using machine learning

Automatic Code Generation and Adaptive Grid Scheduling for GPU Cluster Computing

Automatic code generation and tuning for stencil kernels on modern shared memory architectures

Automatic code generation for solvers of cardiac cellular membrane dynamics in GPUs

Automatic Code Generation for Stencil Computations on GPU Architectures

Automatic code generation methods applied to numerical linear algebra in high performance computing

Automatic Code Rewriting for Performance Portability

Automatic Command Queue Scheduling for Task-Parallel Workloads in OpenCL

Automatic Compilation for Heterogeneous Architectures with Single Assignment C

Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors

Automatic Compiler Based FPGA Accelerator for CNN Training

Automatic contention detection and amelioration for data-intensive operations

Automatic CPU-GPU communication management and optimization

Automatic CUDA Code Synthesis Framework for Multicore CPU and GPU architectures

Automatic Data Layout Generation and Kernel Mapping for CPU+GPU Architectures

Automatic Data Layout Optimizations for GPUs

Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories

Automatic Detection and Denoising of Signals in Large Geophysical Datasets

Automatic Discovery of Algorithms for Multi-Agent Systems

Automatic Dynamic Task Distribution between CPU and GPU for Real-Time Systems

Titles: 100
open PDFs: 92
packages: 15
