Papers on hgpu.org (.txt-file)
Automatic Parallelization of Tiled Loop Nests with Enhanced Fine-Grained Parallelism on GPUs
Automatic Parallelization of Tiled Stencil Loop Nests on GPUs
Automatic Parallelization: Executing Sequential Programs on a Task-Based Parallel Runtime
Automatic Performance Optimisation of Parallel Programs for GPUs via Rewrite Rules
Automatic Performance Optimization in ViennaCL for GPUs
Automatic Performance Optimization on Heterogeneous Computer Systems using Manycore Coprocessors
Automatic Performance Tuning of Pipeline Patterns for Heterogeneous Parallel Architectures
Automatic Performance Tuning of Stencil Computations on Graphics Processing Units
Automatic Point Target Detection for Interactive Visual Analysis of SAR Images
Automatic Pose Estimation for Range Images on the GPU
Automatic program analysis for data parallel kernels
Automatic program parallelization for multicore processors
Automatic Resource-Constrained Static Task Parallelization
Automatic run-time mapping of polyhedral computations to heterogeneous devices with memory-size restrictions
Automatic safety proofs for asynchronous memory operations
Automatic Scan Parallelization in OpenMP
Automatic scanning of nuclear emulsions with wide-angle acceptance for nuclear fragment detection
Automatic Scheduling of Compute Kernels Across Heterogeneous Architectures
Automatic Selection of Sparse Matrix Representation on GPUs
Automatic shader level of detail
Automatic SIMD Code Generation
Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification
Automatic Software Synthesis from High-Level ForSyDe Models Targeting Massively Parallel Processors
Automatic source code adaptation for heterogeneous platforms
Automatic Synthesis of Heterogeneous CPU-GPU Embedded Applications from a UML Profile
Automatic Termination Analysis for GPU Kernels
Automatic Test Case Reduction for OpenCL
Automatic test case reduction of randomly generated OpenCL kernels
Automatic transformation and optimization of applications on GPUs and GPU clusters
Automatic Translation of CUDA to OpenCL and Comparison of Performance Optimizations on GPUs
Automatic tuning matrix multiplication performance on graphics hardware
Automatic Tuning of Local Memory Use on GPGPUs
Automatic Virtualization of Accelerators
Automatically Exploiting the Memory Hierarchy of GPUs through Just-in-Time Compilation
Automatically generating and tuning GPU code for sparse matrix-vector multiplication from a high-level representation
Automatically Generating Efficient Simulation Codes on GPUs from Partial Differential Equations
Automatically Harnessing Sparse Acceleration
Automatically Selecting Profitable Thread Block Sizes Using Machine Learning
Automatically translating a general purpose C++ image processing library for GPUs
Automatically Tuned Dense Linear Algebra for Multicore+GPU
Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures
Automating a Labour Performance Measurement and Risk Assessment: An Evaluation of Methods for a Computer Vision based System
Automating elimination of idle functions by run-time reconfiguration
Automating GPU computing in MATLAB
Automating Heterogeneous Parallelism in Numerical Differential Equations
Automating the Last-Mile for High Performance Dense Linear Algebra
AutOMP: An Automatic OpenMP Parallelization Generator for Variable-Oriented High-Performance Scientific Codes
AutoParBench: A Unified Test Framework for OpenMP-based Parallelizers
AutoPhase: Compiler Phase-Ordering for High Level Synthesis with Deep Reinforcement Learning
Autotuning CUDA Compiler Parameters for Heterogeneous Applications using the OpenTuner Framework
Autotuning CUDA: Applying NLP Techniques to LS-CAT
Autotuning for Automatic Parallelization on Heterogeneous Systems
Autotuning GPU Kernels via Static and Predictive Analysis
Autotuning of Pattern Runtimes for Accelerated Parallel Systems
Autotuning OpenACC Work Distribution via Direct Search
Autotuning OpenCL Workgroup Size for Stencil Patterns
Autotuning Programs with Algorithmic Choice
Autotuning Stencil-Based Computations on GPUs
Autotuning Stencils Codes with Algorithmic Skeletons
Autotuning Tensor Contraction Computations on GPUs
Autotuning Wavefront Abstractions for Heterogeneous Architectures
Autotuning Wavefront Patterns for Heterogeneous Architectures
Autotuning, Code Generation and Optimizing Compiler Technology for GPUs
Auxiliary Image Regularization for Deep CNNs with Noisy Labels
AvA: Accelerated Virtualization of Accelerators
AVEC: Accelerator Virtualization in Cloud-Edge Computing for Deep Learning Libraries
AVSS2011 demo session: GPU enabled Smart Video Node
AVX-512 extension to OpenQCD 1.6
AXC: A new format to perform the SpMV oriented to Intel Xeon Phi architecture in OpenCL
Axel: a heterogeneous cluster with FPGAs and GPUs
AZP: Automatic Specialization for Zero Values in Gaming Applications
b-Bit Minwise Hashing in Practice: Large-Scale Batch and Online Learning and Using GPUs for Fast Preprocessing with Simple Hash Functions
B-CALM: An open-source GPU-based 3D-FDTD with multi-pole dispersion for plasmonics
B-Calm: an Open-Source Multi-Gpu-Based 3D-FDTD with Multi-Pole Dispersion for Plasmonics
Back Ground Subtraction Algorithm For Moving Object Detection In FPGA
Backpropagation Training for Fisher Vectors within Neural Networks
BaCO: A Fast and Portable Bayesian Compiler Optimization Framework
Bacon: A GPU Programming System With Just in Time Specialization
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
Balancing locality and concurrency: solving sparse triangular systems on GPUs
Balancing Tracking Granularity and Parallelism in Many-Task Systems: The Horizons Approach
Bamboo: Automatic Translation of MPI Source into a Latency-Tolerant Form
Bandicoot: C++ Library for GPU Linear Algebra and Scientific Computing
Bandwidth intensive 3-D FFT kernel for GPUs using CUDA
Bandwidth Reduction Through Multithreaded Compression of Seismic Images
Bandwidth Requirements of GPU Architectures
BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU
Barra, a Modular Functional GPU Simulator for GPGPU
Barra: A Parallel Functional Simulator for GPGPU
BarraCUDA – a fast short read sequence aligner using graphics processing units
Barrier Invariants: A Shared State Abstraction for the Analysis of Data-Dependent GPU Kernels
Barycentric coordinates computation in homogeneous coordinates
BASEMENT v3: a modular freeware for river process modelling over multiple computational backends
Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts
BAT: A Benchmark suite for AutoTuners
Batch Method for Efficient Resource Sharing in Real-time Multi-GPU Systems
Batch Records Insertion into Multidimensional Linear Dynamic Hashing Table on GPU
Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs
Titles: 100
open PDFs: 95
packages: 28