__host__ __device__ -- Generic programming in Cuda
.NET High Performance Computing
"Local Rank Differences" Image Feature Implemented on GPU
[Serbian] The Methods and Procedures for Accelerating Operations and Queries in Large Database Systems and Data Warehouse (Big Data Systems)
10x10: A General-purpose Architectural Approach to Heterogeneity and Energy Efficiency
190 TFlops Astrophysical N-body Simulation on a Cluster of GPUs
2-D Impulse Noise Suppression by Recursive Gaussian Maximum Likelihood Estimation
24.77 Pflops on a Gravitational Tree-Code to Simulate the Milky Way Galaxy with 18600 GPUs
2D and 3D level-set algorithms on GPU
2D Image Convolution using Three Parallel Programming Models on the Xeon Phi
2D Triangulation of Polygons on CUDA
2D/3D image registration on the GPU
2HOT: An Improved Parallel Hashed Oct-Tree N-Body Algorithm for Cosmological Simulation
2PARMA: Parallel Paradigms and Run-time Management Techniques for Many-Core Architectures
3-SAT on CUDA: Towards a massively parallel SAT solver
3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs
3D data denoising via Non-Local means filter by using parallel GPU strategies
3D Edge Bundling for Geographical Data Visualization
3D FFT on a Single FPGA
3D finite difference computation on GPUs using CUDA
3D finite element numerical integration on GPUs
3D GPU Architecture using Cache Stacking: Performance, Cost, Power and Thermal analysis
3D Haar-Like Elliptical Features for Object Classification in Microscopy
3D Hydrodynamic Simulation of Classical Nova Explosions
3D Information Extraction Based on GPU
3D Modeling, Distance and Gradient Computation for Motion Planning: A Direct GPGPU Approach
3D Non-Local Means denoising via multi-GPU
3D nonrigid registration via optimal mass transport on the GPU
3D Object Recognition using Convolutional Neural Networks with Transfer Learning between Input Channels
3D Object Recognition with Convolutional Neural Networks
3D Objects Tracking by GPGPU-Enhanced Particle Filter Algorithms
3D Recursive Gaussian IIR on GPU and FPGAs: A Case Study for Accelerating Bandwidth-Bounded Applications
3D Registration Based on Normalized Mutual Information: Performance of CPU vs. GPU Implementation
3D simulation of complex shading affecting PV systems taking benefit from the power of graphics cards developed for the video game industry
3D Skeleton Extraction Method using Potential Field on OpenCL
3D tumor localization through real-time volumetric x-ray imaging for lung cancer radiotherapy
3D vision of electromagnetic fields in antenna and microwave technique
3D visualization of astronomy data cubes using immersive displays
3D-color video camera
3DES ECB Optimized for Massively Parallel CUDA GPU Architecture
3I: A tool for visualizing and processing in parallel 2D & 3D images
42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence
4kUHD H264 wireless live video streaming using CUDA
5.6: GPU enhancement of FDTD-PIC plasma-wave simulations
8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other Tricks
86 PFLOPS Deep Potential Molecular Dynamics simulation of 100 million atoms with ab initio accuracy
94% on CIFAR-10 in 3.29 Seconds on a Single GPU
A (ir)regularity-aware task scheduler for heterogeneous platforms
A (Somewhat Dated) Comparative Study of Betweenness Centrality Algorithms on GPU
A 3D Convex Hull Algorithm for Graphics Hardware
A 3D radiative transfer framework: XIII. OpenCL implementation
A 3D radiative transfer framework. VIII. OpenCL implementation
A 57mW embedded mixed-mode neuro-fuzzy accelerator for intelligent multi-core processor
A 7.663-TOPS 8.2-W Energy-efficient FPGA Accelerator for Binary Convolutional Neural Networks
A balanced programming model for emerging heterogeneous multicore systems
A Batched GPU Algorithm for Set Intersection
A Benchmark Set of Highly-efficient CUDA and OpenCL Kernels and its Dynamic Autotuning with Kernel Tuning Toolkit
A Bi-objective Optimization Framework for Query Plans
A biomolecular electrostatics solver using Python, GPUs and boundary elements that can handle solvent-filled cavities and Stern layers
A block-asynchronous relaxation method for graphics processing units
A Braille Conversion Service Using GPU and Human Interaction by Computer Vision
A breadth-first course in multicore and manycore programming
A capabilities-aware framework for using computational accelerators in data-intensive computing
A Case Against Small Data Types on GPGPUs
A case for neuromorphic ISAs
A Case for Work-stealing on FPGAs with OpenCL Atomics
A Case Study for Petascale Applications in Astrophysics: Simulating Gamma-Ray Bursts
A Case Study in Using OpenCL on FPGAs: Creating an Open-Source Accelerator of the AutoDock Molecular Docking Software
A Case Study of OpenCL on an Android Mobile GPU
A Case Study of SWIM: Optimization of Memory Intensive Application on GPGPU
A case study on porting scientific applications to GPU/CUDA
A Case Study: Exploiting Neural Machine Translation to Translate CUDA to OpenCL
A CG-based Poisson solver on a GPU-cluster
A characterization and analysis of PTX kernels
A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads
A Chunking Method for Euclidean Distance Matrix Calculation on Large Dataset Using Multi-GPU
A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines
A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures
A closer look at GPUs
A Cloud Computing Service Architecture of a Parallel Algorithm Oriented to Scientific Computing with CUDA and Monte Carlo
A cluster for CS education in the manycore era
A Co-Design Framework with OpenCL Support for Low-Energy Wide SIMD Processor
A Co-Prime Blur Scheme for Data Security in Video Surveillance
A Coarse Grain Reconfigurable Architecture for sequence alignment problems in bio-informatics
A code motion technique for accelerating general-purpose computation on the GPU
A Code Optimization Framework for Performance Portability of GPU Kernels onto Custom Accelerators
A Code Transformation Framework for Scientific Applications on Structured Grids
A code-based analytical approach for using separate device coprocessors in computing systems
A Collective Knowledge workflow for collaborative research into multi-objective autotuning and machine learning techniques
A collision detection algorithm using adaptive particle sensor
A combined MPI-CUDA parallel solution of linear and nonlinear Poisson-Boltzmann equation
A Common GPU n-Dimensional Array for Python and C
A Comparative Analysis of GPU Implementations of Spectral Unmixing Algorithms
A comparative analysis of the performance and deployment overhead of parallelized Finite Difference Time Domain (FDTD) algorithms on a selection of high performance multiprocessor computing systems
A comparative benchmarking of the FFT on Fermi and Evergreen GPUs
A Comparative Measurement Study of Deep Learning as a Service Framework
A Comparative Study of 2D Numerical Methods with GPU Computing
A Comparative Study of Asynchronous Many-Tasking Runtimes: Cilk, Charm++, ParalleX and AM++
A Comparative Study of Game Tree Searching Methods
A comparative study of GPU programming models and architectures using neural networks
A Comparative Study of Neighborhood Filters for Artifact Reduction in Iterative Low-Dose CT
A Comparative Study of OpenACC Implementations
A Comparative Study of Parallel Algorithms for the Girth Problem
A Comparative Study on ASIC, FPGAs, GPUs and General Purpose Processors in the O(N^2) Gravitational N-body Simulation
A Comparative Study on Exact Triangle Counting Algorithms on the GPU
A Comparison between GPU-based Volume Ray Casting Implementations: Fragment Shader, Compute Shader, OpenCL, and CUDA
A comparison between parallelization approaches in molecular dynamics simulations on GPUs
A Comparison of Algebraic Multigrid Preconditioners using Graphics Processing Units and Multi-Core Central Processing Units
A comparison of CPU and GPU performance for Fourier pseudospectral simulations of the Navier-Stokes, Cubic Nonlinear Schrodinger and Sine Gordon Equations
A Comparison of CPU and OpenCL Parallelization Methods for Correlation and Graph Layout Algorithms used in the Network Analysis of High Dimensional Data
A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation
A Comparison of FPGA and GPU for Real-Time Phase-based Optical Flow, Stereo, and Local Image Features
A Comparison of GPU Execution Time Prediction using Machine Learning and Analytical Modeling
A Comparison of Gradient Estimation Methods for Volume Rendering on Unstructured Meshes
A Comparison of High-Level Design Tools for SoC-FPGA on Disparity Map Calculation Example
A comparison of HPC-based quantum computing simulators using Quantum Volume
A Comparison of Many-threaded Differential Evolution and Genetic Algorithms on CUDA
A Comparison of Massively Parallel Programming Models Through Applications in Sound Propagation and Jitter Measurement
A Comparison of Modern GPU and CPU Architectures: And the Common Convergence of Both
A Comparison of OpenCL, CUDA, and HIP as Compilation Targets for a Functional Array Language
A Comparison of Optimal Scanline Voxelization Algorithms
A comparison of period finding algorithms
A Comparison of Potential Interfaces for Batched BLAS Computations
A Comparison of Sequential and GPU Implementations of Iterative Methods to Compute Reachability Probabilities
A Comparison of Serial & Parallel Particle Filters for Time Series Analysis
A Comparison of Statistical Techniques for Detecting Side-Channel Information Leakage in Cryptographic Devices
A Comparison of Support Vector Machines Training GPU-Accelerated Open Source Implementations
A Comparison of the performance of HPC Accelerators
A Comparison of the Performance of the Molecular Dynamics Simulation Package GROMACS Implemented in the SYCL and CUDA Programming Models
A Comparison of Two Methods for Geometric Milling Simulation Accelerated by GPU
A Comparison of xPU Platforms Exemplified with Ray Tracing Algorithms
A Compile-Time Managed Multi-Level Register File Hierarchy
A Compiler and Runtime for Heterogeneous Computing
A compiler for high performance computing with many-core accelerators
A Compiler for Throughput Optimization of Graph Algorithms on GPUs
A compiler framework for optimization of affine loop nests for gpgpus
A Compiler Framework for Optimizing Dynamic Parallelism on GPUs
A Compiler Infrastructure for Accelerator Generators
A Compiler Infrastructure for Embedded Multicore SoCs
A compiler toolkit for array-based languages targeting CPU/GPU hybrid systems
A Complete and Efficient CUDA-Sharing Solution for HPC Clusters
A Complete Description of the UnPython and Jit4GPU Framework
A complete modular resultant algorithm targeted for realization on graphics hardware
A comprehensive analysis and parallelization of an image retrieval algorithm
A Comprehensive Benchmark of Deep Learning Libraries on Mobile Devices
A Comprehensive Deep Learning Library Benchmark and Optimal Library Selection
A Comprehensive Performance Analysis of HSA and OpenCL 2.0
A Comprehensive Performance Comparison of CUDA and OpenCL
A comprehensive study of Dynamic Memory Management in OpenCL kernels
A Comprehensive Survey on Various Evolutionary Algorithms on GPU
A Computational Comparison of Basis Updating Schemes for the Simplex Algorithm on a CPU-GPU System
A Computational Model of Afterimages
A Computational Realization of a Semi-Lagrangian Method for Solving the Advection Equation
A computationally efficient and scalable approach for privacy preserving kNN classification
A Computationally Efficient Approach for Exemplar-based Color Image Inpainting using GPU
A Computationally Efficient Parallel Kernel Regression for Image Reconstruction
A Compute Graph Simulation and Implementation Framework Targeting AMD Versal AI Engines
A Compute Unified System Architecture for Graphics Clusters Incorporating Data Locality
A Computing Kernel for Network Binarization on PyTorch
A computing origami: Optimized code generation for emerging parallel platforms
A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors
A constant-space belief propagation algorithm for stereo matching
A Consumer Application for GPGPUs: Desktop Search
A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters
A Contour-Guided Deformable Image Registration Algorithm for Adaptive Radiotherapy
A control-structure splitting optimization for GPGPU
A convex formulation for color image segmentation in the context of passive emitter localization
A Convex Relaxation Approach to Space Time Multi-view 3D Reconstruction
A Convolutional Neural Network Cascade for Face Detection
A CPU and GPU Heterogeneous Processing of Multimedia Data by using OpenCL
A CPU-GPU Hybrid Runtime for the Aeminium Language
A CPU+FPGA OpenCL Heterogeneous Computing Platform for Multi-Kernel Pipeline
A Cross-Input Adaptive Framework for GPU Programs Optimization
A Cross-platform Evaluation of Graphics Shader Compiler Optimization
A CUDA Back-End for the Equelle Compiler
A CUDA Based Implementation of an Image Authentication Algorithm
A CUDA based Solution to the Multidimensional Knapsack Problem Using the Ant Colony Optimization
A CUDA Implementation of Independent Component Analysis in the Time-Frequency Domain
A CUDA implementation of the High Performance Conjugate Gradient benchmark
A CUDA Kernel Scheduler Exploiting Static Data Dependencies
A CUDA Monte Carlo simulator for radiation therapy dosimetry based on Geant4
A CUDA SIMT Interpreter for Genetic Programming
A CUDA SIMT interpreter for genetic programming. Revised
A CUDA-Based Cooperative Evolutionary Multi-Swarm Optimization Applied to Engineering Problems
A CUDA-Based Implementation of Stable Fluids in 3D with Internal and Moving Boundaries
A CUDA-based parallel implementation of K-nearest neighbor algorithm
A CUDA-Based Real Parameter Optimization Benchmark
A CUDA-enabled Parallel Implementation of Collaborative Filtering
A curved-element unstructured discontinuous Galerkin method on GPUs for the Euler equations
A Customized 3D GPU Poisson Solver for Free BCs
A Data Communication Scheduler for Stream Programs on CPU-GPU Platform
A Data Parallel Algorithm for Seismic Raytracing
A data parallel approach to genetic programming using programmable graphics hardware
A data parallel view on polyhedral process networks
A Data-Driven Model for Anisotropic Heterogeneous Subsurface Scattering
A Data-oriented Method for Scheduling Dependent Tasks on High-density Multi-GPU Systems
A Data-Parallel Algorithmic Modelica Extension for Efficient Execution on Multi-Core Platforms
A Data-Parallel Extension to Ruby for GPGPU
A Data-Parallel Graphics Pipeline Implemented in OpenCL
A dataflow-like programming model for future hybrid clusters
A Datalog Engine for GPUs
A declarative API for particle systems
A decompression pipeline for accelerating out-of-core volume rendering of time-varying data
A Deep Generative Deconvolutional Image Model
A Deep Learning Approach for Automatic Code Optimization in the Tiramisu Compiler
A deep learning approach to autonomous lunar landing
A Deep Learning Based Cost Model for Automatic Code Optimization
A Deep Learning Model for Loop Interchange
A design case study: CPU vs. GPGPU vs. FPGA
A Design Framework for Mapping Dataflow Graphs onto Heterogeneous Multiprocessor Platforms
A Design Methodology for Efficient Implementation of Deconvolutional Neural Networks on an FPGA
A design pattern language for engineering (parallel) software: merging the PLPP and OPL projects
A design tool for efficient mapping of multimedia applications onto heterogeneous platforms
A Detailed GPU Cache Model Based on Reuse Distance Theory
A development of an accelerator board dedicated for multi-precision arithmetic operations and its application to Feynman loop integrals II
A Development Platform for Embedded Domain-Specific Languages
A directionally adaptive edge anti-aliasing filter
A Discussion of Selected Vienna-Libraries for Computational Science
A Distributed Approximation Algorithm for Mixed Packing-Covering Linear Programs
A Distributed Architecture for Smart Recycling Using Machine Learning
A distributed computing approach to improve the performance of the Parallel Ocean Program (v2.1)
A Distributed CPU-GPU Framework for Pairwise Alignments on Large-Scale Sequence Datasets
A Distributed Data Mining Framework Accelerated with Graphics Processing Units
A Distributed GPU-based Framework for real-time 3D Volume Rendering of Large Astronomical Data Cubes
A distributed multi-GPU system for high speed electron microscopic tomographic reconstruction
A Distributed-memory Tridiagonal Solver Based on a Specialised Data Structure Optimised for CPU and GPU Architectures
A Diversified Multi-Start Algorithm for Unconstrained Binary Quadratic Problems Leveraging the Graphics Processor Unit
A Domain Specific Approach to Heterogeneous Computing: From Availability to Accessibility
A Domain Specific Language for Performance Portable Molecular Dynamics Algorithms
A Domain-Extensible Compiler with Controllable Automation of Optimisations
A Domain-Specific Approach To Heterogeneous Parallelism
A Domain-Specific Language and Compiler for Stencil Computations on Short-Vector SIMD and GPU Architectures
A domain-specific language for geospatial computations on the GPU
A Domain-specific Language to Facilitate Software Defined Radio Parallel Executable Patterns Deployment on Heterogeneous Architectures
A Duality Based Approach for Realtime TV-L1 Optical Flow
A Dynamic Approach to Weighted Suffix Tree Construction Algorithm
A Dynamic Hash Table for the GPU
A Dynamic IP Lookup Architecture using Parallel Multiple Hash in GPU-based Software Router
A Dynamic Offload Scheduler for spatial multitasking on Intel Xeon Phi Coprocessor
A Dynamic Programming Model To Solve Optimisation Problems Using GPUs
A Dynamic Resource Management and Scheduling Environment for Embedded Multimedia and Communications Platforms
A Dynamic Resource Management System for Network-Attached Accelerator Clusters
A dynamic scheduling runtime and tuning system for heterogeneous multi and many-core desktop platforms
A dynamically configurable coprocessor for convolutional neural networks
A Fair Comparison of Modern CPUs and GPUs Running the Genetic Algorithm under the Knapsack Benchmark
A Fast 3D Spatial Analysis Technique Using Graphic Process Units
A Fast Algorithm for Constructing Inverted Files on Heterogeneous Platforms
A Fast and Accurate GHT Implementation on CUDA
A Fast and Efficient SIFT Detector Using the Mobile GPU
A Fast and Efficient Simulation Framework for Modeling Heat Transport
A Fast and Generic GPU-Based Parallel Reduction Implementation
A fast and intuitive visual programming language (VPL) for constructing Computer Vision and Image processing systems on GPUs
A Fast and Rigorously Parallel Surface Voxelization Technique for GPU-Accelerated CFD Simulations
A fast and robust seed flooding algorithm on GPU for Voronoi diagram generation
A Fast and Secure Way to Prevent SQL Injection Attacks using Bitslice Technique and GPU Support
A Fast and Simple Approach to Merge and Merge Sort using Wide Vector Instructions
A Fast Batched Cholesky Factorization on a GPU
A Fast GEMM Implementation On a Cypress GPU
A fast GEMM implementation on the cypress GPU
A fast GPU algorithm for graph connectivity
A Fast GPU Implementation for Solving Sparse Ill-Posed Linear Equation Systems
A fast GPU-based Monte Carlo simulation of proton transport with detailed modeling of non-elastic interactions
A Fast GPU-Based Motion Estimation Algorithm for H.264/AVC
A Fast GVF Snake Algorithm on the GPU
A Fast High Quality Pseudo Random Number Generator for Graphics Processing Units
A fast high quality pseudo random number generator for nVidia CUDA
A fast hybrid time-synchronous/event approach to parallel discrete event simulation of queuing networks
A Fast Implementation of Parallel Discrete-Event Simulation on GPGPU
A Fast Implementation of the Octagon Abstract Domain on Graphics Hardware
A Fast Jet Finder Algorithm Using Graphic Processing Unit
A fast marching method based back projection algorithm for photoacoustic tomography in heterogeneous media
A Fast Method For Computing Principal Curvatures From Range Images
A Fast Mixed-Band Lifting Wavelet Transform on the GPU
A Fast Parallel Implementation of Queue-based Morphological Reconstruction using GPUs
A Fast Poisson Solver with Periodic Boundary Conditions for GPU Clusters in Various Configurations
A Fast Similarity Join Algorithm Using Graphics Processing Units
A fast stereo matching algorithm suitable for embedded real-time systems
A fast Texture-by-numbers synthesis method based on texture optimization
A Fast, GPU based, Dictionary Attack to OpenPGP Secret Keyrings
A Feedback Approach to Task Partitioning in Heterogeneous Architectures
A Field Guide to Genetic Programming
A fight for performance and accuracy of the matrix multiplication routines: CUBLAS on Nvidia Tesla versus MKL and ATLAS on Intel Nehalem
A File System Using GPU-Accelerated File-wise Reliability Scheme
A Financial Benchmark for GPGPU Compilation
A Fine Grained Cycle Sharing System with Cooperative Multitasking on GPUs
A finite volume approach for the simulation of nonlinear dissipative acoustic wave propagation
A First Look at Bugs in LLM Inference Engines
A first look at integrated GPUs for green high-performance computing
A First Order Primal-Dual Algorithm for Nonconvex TV^q Regularization
A First Step Towards GPU-assisted Query Optimization
A Fixed-Complexity Sphere Decoder for MIMO Systems on Graphics Processing Units
A flexible algorithm for calculating pair interactions on SIMD architectures
A flexible high-performance Lattice Boltzmann GPU code for the simulations of fluid flows in complex geometries
A Flexible Kernel for Adaptive Mesh Refinement on GPU
A Flexible Multi-Volume Shader Framework for Arbitrarily Intersecting Multi-Resolution Datasets
A Flexible Patch-Based Lattice Boltzmann Parallelization Approach for Heterogeneous GPU-CPU Clusters
A flexible simulation framework for graphics architectures
A fluid simulation system based on the MPS method
A Foray into Efficient Mapping of Algorithms to Hardware Platforms on Heterogeneous Systems
A Framework for 3D Model-Based Visual Tracking Using a GPU-Accelerated Particle Filter
A Framework for Automated Generation of Specialized Function Variants
A Framework for Automated Performance Tuning and Code Verification on GPU Computing Platforms
A Framework for Automatic OpenMP Code Generation
A Framework for Composing High-Performance OpenCL from Python Descriptions
A framework for cost based optimization of hybrid CPU/GPU query plans in database systems
A framework for data-access strategies in GPGPU programs
A Framework for Dense Triangular Matrix Kernels on Various Manycore Architectures
A Framework for Developing Real-Time OLAP algorithm using Multi-core processing and GPU: Heterogeneous Computing
A framework for dynamically instrumenting GPU compute applications within GPU Ocelot
A framework for efficient and scalable execution of domain-specific templates on GPUs
A framework for efficient execution on GPU and CPU+GPU systems
A framework for exploring numerical solutions of advection-reaction-diffusion equations using a GPU-based approach
A Framework for Fast and Efficient Neural Network Compression
A Framework for General Sparse Matrix-Matrix Multiplication on GPUs and Heterogeneous Processors
A Framework for Genetic Algorithms in Parallel Environments
A framework for GPU-based application-independent 3D interactions
A framework for lab-based real-time video analysis on distributed camera networks
A Framework for Lattice QCD Calculations on GPUs
A Framework for Management of Distributed Data Processing and Event Selection for the Icecube Neutrino Observatory
A Framework for Megascale Agent Based Model Simulations on Graphics Processing Units
A Framework for Megascale Agent Based Model Simulations on the GPU
A Framework for multisensor image fusion using graphics hardware
A framework for network traffic analysis using GPUs
A framework for parallel unstructured grid applications on GPUs
A Framework for Productive, Efficient and Portable Parallel Computing
A Framework for Profiling and Performance Monitoring of Heterogeneous Applications
A framework for simulating and estimating the state and functional topology of complex dynamic geometric networks
A Framework for the Volumetric Integration of Depth Images
A Framework for Transparent Execution of Massively-Parallel Applications on CUDA and OpenCL
A framework for volume segmentation and visualization using Augmented Reality
A Framework of Large-Scale Terrain Visualization Based on GPU
A Framework to Generate High-Performance Time-stepped Agent-based Simulations on Heterogeneous Hardware
A framework to implement a multifrontal scheme on GPU architectures with OpenCL
A Full-Depth Amalgamated Parallel 3D Geometric Multigrid Solver for GPU Clusters
A fully parallel, high precision, N-body code running on hybrid computing platforms
A GaBP-GPU Algorithm of Solving Large-Scale Sparse Linear Systems
A Game Architecture Based on Multiple GPUs With Energy Management
A game loop architecture for the GPU used as a math coprocessor in real-time applications
A Gb/s Parallel Block-based Viterbi Decoder for Convolutional Codes on GPU
A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices
A General Framework for Constrained Bayesian Optimization using Information-based Search
A general relativistic evolution code on CUDA architectures
A general tridiagonal solver for coprocessors: Adapting g-Spike for the Intel Xeon Phi
A General-Purpose GPU Reservoir Computer
A generalized GPU-based connected component labeling algorithm
A Generic and Scalable Pipeline for GPU Tetrahedral Grid Rendering
A Generic Approach for Developing Highly Scalable Particle-Mesh Codes for GPUs
A Generic Approach to Topic Models
A Generic Inverted Index Framework for Similarity Search on the GPU
A Generic Library for Stencil Computations
A generic library for structured real-time computations: GPU implementation applied to retinal and cortical vision processes
A GPGPU based program to solve the TDSE in intense laser fields through the finite difference approach
A GPGPU compiler for memory optimization and parallelism management
A GPGPU Implementation of Approximate String Matching with Regular Expression Operators and Comparison with Its FPGA Implementation
A GPGPU solution of the FMM near interactions for acoustic scattering problems
A GPGPU Transparent Virtualization Component for High Performance Computing Clouds
A GPGPU-Based Collision Detection Algorithm
A GPGPU-based Pipeline for Accelerated Rendering of Point Clouds
A GPU Accelerated Aggregation Algebraic Multigrid Method
A GPU accelerated algorithm for 3D Delaunay triangulation
A GPU Accelerated Algorithm for Compressive Sensing Based Image Super-Resolution
A GPU accelerated Barnes-Hut Tree Code for FLASH4
A GPU Accelerated BiConjugate Gradient Stabilized Solver for Speeding-up Large Scale Model Evaluation
A GPU Accelerated Continuous and Discontinuous Galerkin Non-hydrostatic Atmospheric Model
A GPU Accelerated High Performance Cloud Computing Infrastructure for Grid Computing Based Virtual Environmental Laboratory
A GPU accelerated interactive interface for exploratory functional connectivity analysis of fMRI data
A GPU Accelerated Navier-Stokes Solver with Multi-level Granularity for Solving Sparse Implicit Systems
A GPU Accelerated Simulator for CO2 Storage
A GPU accelerated spring mass system for surgical simulation
A GPU accelerated storage system
A GPU Accelerated Volumetric Ray Tracer for Incandescent Gas
A GPU acceleration for FFT-based fast solvers for the integral equation
A GPU Algorithm for 3D Convex Hull
A GPU Algorithm for Greedy Graph Matching
A GPU Algorithm for IC Floorplanning: Specification, Analysis and Optimization
A GPU approach to FDTD for Radio Coverage Prediction
A GPU Approach to Fortran Legacy Systems
A GPU approach to parallel replica-exchange polymer simulations
A GPU Based 3D Object Retrieval Approach Using Spatial Shape Information
A GPU based Algorithm for Determining the Optimal Cutting Direction in Deep Mold Machining
A GPU based implementation of Center-Surround Distribution Distance for feature extraction and matching
A GPU Based Implementation of Side Effect Analysis
A GPU based interactive modeling approach to designing fine level features
A GPU Based Memory Optimized Parallel Method For FFT Implementation
A GPU based Parallel Hierarchical Fuzzy ART Clustering
A GPU based real-time GPS software receiver
A GPU based real-time software correlation system for the Murchison Widefield Array prototype
A GPU based real-time video compression method for video conferencing
A GPU based saliency map for high-fidelity selective rendering
A GPU cluster optimized multigrid scheme for computing unsteady incompressible fluid flow
A GPU framework for parallel segmentation of volumetric images using discrete deformable model
A GPU framework for parallel segmentation of volumetric images using discrete deformable models
A GPU Framework for Sparse Matrix Vector Multiplication
A GPU Framework for the Visualization and On-the-Fly Amplification of Real Terrains
A GPU implementation for improved granular simulations with LAMMPS
A GPU implementation for LBG and SOM training
A GPU implementation for two MIMO-OFDM detectors
A GPU Implementation for Two-Dimensional Shallow Water Modeling
A GPU Implementation of a Jacobi Method for Lattice Basis Reduction
A GPU implementation of a real-time MIMO detector
A GPU implementation of a track-repeating algorithm for proton radiotherapy dose calculations
A GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation
A GPU implementation of EGSnrc's Monte Carlo photon transport for imaging applications
A GPU Implementation of Fast Parallel Markov Clustering in Bioinformatics Using EllPACK-R Sparse Data Format
A GPU Implementation of Inclusion-based Points-to Analysis
A GPU Implementation of Large Neighborhood Search for Solving Constraint Optimization Problems
A GPU Implementation of Local Search Operators for Symmetric Travelling Salesman Problem
A GPU implementation of massively parallel direction splitting for the incompressible Navier-Stokes equations
A GPU Implementation of Parallel Constraint-based Local Search
A GPU implementation of the Simulated Annealing Heuristic for the Quadratic Assignment Problem
A GPU Memory System Comparison for an Elliptic Test Problem
A GPU operations framework for WattDB
A GPU Parallelized Spectral Method for Elliptic Equations
A GPU persistent grid mapping for terrain rendering
A GPU solvent-solvent interaction calculation accelerator for biomolecular simulations using the GROMOS software
A GPU Sub-pixel Algorithm for Autostereoscopic Virtual Reality
A GPU Support for Large Scale Quantum Chemistry Applications
A GPU Tile-Load-Map architecture for terrain rendering: theory and applications
A GPU Tool for Efficient, Accurate, and Realistic Simulation of Cone Beam CT Projections
A GPU vs CPU performance evaluation of an experimental video compression algorithm
A GPU-Accelerated Algorithm for Self-Organizing Maps in a Distributed Environment
A GPU-accelerated Boundary Element Method and Vortex Particle Method
A GPU-accelerated Branch-and-Bound Algorithm for the Flow-Shop Scheduling Problem
A GPU-accelerated Direct-sum Boundary Integral Poisson-Boltzmann Solver
A GPU-Accelerated Framework for Image Processing and Computer Vision
A GPU-accelerated immersive audio-visual framework for interaction with molecular dynamics using consumer depth sensors
A GPU-accelerated local search algorithm for the Correlation Clustering problem
A GPU-accelerated Navier-Stokes Solver for Steady Turbomachinery Simulations
A GPU-Accelerated Parallel Preconditioner for the Solution of the Boltzmann Transport Equation for Semiconductors
A GPU-Accelerated Two Stage Visual Matching
A GPU-Based 3D Image Synthesizing Method for Real-Time Multiview Autostereoscopic Displays
A GPU-Based Accelerator for Chinese Word Segmentation
A GPU-based Affine and Scale Invariant Feature Transform Algorithm
A GPU-based Algorithm for Estimating 3D Geometry and Motion in Near Real-time
A GPU-based Algorithm-specific Optimization for High-performance Background Subtraction
A GPU-based Approximate SVD Algorithm
A GPU-based architecture for improved online rebinning performance in clinical 3-D PET
A GPU-based architecture for real-time data assessment at synchrotron experiments
A GPU-based calculation using the three-dimensional FDTD method for electromagnetic field analysis
A GPU-based closed frequent itemsets mining algorithm over stream
A GPU-based computing framework for CSCW
A GPU-Based Enhanced Genetic Algorithm for Power-Aware Task Scheduling Problem in HPC Cloud
A GPU-based finite-size pencil beam algorithm with 3D-density correction for radiotherapy dose calculation
A GPU-based Flood Simulation Framework
A GPU-based framework for efficient image processing
A GPU-based Framework for Real-time Free Viewpoint Television
A GPU-based hyperbolic SVD algorithm
A GPU-based implementation for Range Queries on Spaghettis Data Structure
A GPU-Based Implementation of Differential Evolution for Solving the Gene Regulatory Network Model Inference Problem
A GPU-based implementation of motion detection from a moving platform
A GPU-based implementation of the MRF algorithm in ITK package
A GPU-based interactive bio-inspired visual clustering
A GPU-based iterated tabu search for solving the quadratic 3-dimensional assignment problem
A GPU-based Large-scale Monte Carlo Simulation Method for Systems with Long-range Interactions
A GPU-based light hierarchy for real-time approximate illumination
A GPU-based matting Laplacian solver for high resolution image matting
A GPU-based maximal frequent itemsets mining algorithm over stream
A GPU-based Method for Computing Eigenvector Centrality of Gene-expression Networks
A GPU-based Multi-level Subspace Decomposition Scheme for Hierarchical Tensor Product Bases
A GPU-based Multiresolution Pipeline for Compressed Volume Rendering
A GPU-Based Parallel Algorithm for Design Structure Matrix (DSM) Partition
A GPU-based parallel algorithm for time series pattern mining
A GPU-based Parallel Ant Colony Algorithm for Scientific Workflow Scheduling
A GPU-based Parallel Fireworks Algorithm for Optimization
A GPU-based Parallel Procedure for Nonlinear Analysis of Complex Structures Using a Coupled FEM/DEM Approach
A GPU-based platform for cancer-treatment planning
A GPU-based real time trigger for rare kaon decays at NA62
A GPU-based Simulation for Stochastic Computing
A GPU-Based Simulation Kernel within Heterogeneous Collaborative Computation on Large-Scale Artificial Society
A GPU-Based Solution to Fast Calculation of Betweenness Centrality on Large Weighted Networks
A GPU-based survey for millisecond radio transients using ARTEMIS
A GPU-Based Track-Repeating Algorithm for Dose Calculation for Photon Radiotherapy
A GPU-Based Transient Stability Simulation Using Runge-Kutta Integration Algorithm
A GPU-based vision system for real time detection of fastening elements in railway inspection
A GPU-Based Wide-Band Radio Spectrometer
A GPU-Computing Approach to Solar Stokes Profile Inversion
A GPU-enabled solver for time-constrained linear sum assignment problems
A GPU-Enabled, High-Resolution Cosmological Microlensing Parameter Survey
A GPU-enhanced cluster for accelerated FMS
A GPU-inspired soft processor for high-throughput acceleration
A GPU-inspired soft processor for high-throughput acceleration (thesis)
A GPU-supported High-Level Programming Language for Image Processing
A GPU-tailored approach for training kernelized SVMs
A GPU/CUDA implementation of the collection-diffusion model to compute SER of large area and complex circuits
A Graph-based Model for GPU Caching Problems
A Graph-Partition-Based Scheduling Policy for Heterogeneous Architectures
A Graphics Hardware-Based Vortex Detection and Visualization System
A Graphics Parallel Memory Organization Exploiting Request Correlations
A Graphics Processing Unit Implementation of Coulomb Interaction in Molecular Dynamics
A graphics processor-based intranuclear cascade and evaporation simulation
A group theoretical toolbox for color image operators
A Haptic Device Interface for Medical Simulations using OpenCL
A Hardware Multithreaded SpMV Kernel for the Convey HC-2ex
A hardware redundancy and recovery mechanism for reliable scientific computation on graphics processors
A Hardware-Accelerated Parallel Implementation of a Two-Dimensional Scheme for Free Surface Flows
A Hardware-Accelerated Patch Search Engine for Image Completion
A hardware-aware debugger for the OpenGL shading language
A Heterogeneous Accelerated Matrix Multiplication: OpenCL + APU + GPU+ Fast Matrix Multiply
A Heterogeneous Inference Framework for a Deep Neural Network
A Heterogeneous Parallel Framework for Domain-Specific Languages
A Hierarchical Thread Scheduler and Register File for Energy-efficient Throughput Processors
A hierarchically blocked Jacobi SVD algorithm for single and multiple graphics processing units
A High Memory Bandwidth FPGA Accelerator for Sparse Matrix-Vector Multiplication
A high performance agent based modelling framework on graphics card hardware with CUDA
A high performance computing for AOM stock trading order matching using GPU
A high performance computing framework for physics-based modeling and simulation of military ground vehicles
A High Performance Framework for Coupled Urban Microclimate Models
A High Performance Image Authentication Algorithm on GPU with CUDA
A High Performance Massively Parallel Approach for Real Time Deformable Body Physics Simulation
A High Performance Parallel FDTD Method Enhanced By Using SSE Instruction Set
A High Performance Parallel Sparse Linear Equation Solver Using CUDA
A High Performance Random Number Generator Using Heterogeneous Computing Platform
A High Quality Reflectance Model in Medical Image Visualization
A High-efficiency FPGA-based Accelerator for Convolutional Neural Networks using Winograd Algorithm
A High-Performance Brownian Bridge for GPUs: Lessons for Bandwidth Bound Applications
A High-Performance Computing Cluster for Distributed Deep Learning: A Practical Case of Weed Classification Using Convolutional Neural Network Models
A high-performance fault-tolerant software framework for memory on commodity GPUs
A High-Performance Multi-user Service System for Financial Analytics Based on Web Service and GPU Computation
A High-Performance Parallel FDTD Method Enhanced by Using SSE Instruction Set
A High-productivity Framework for Multi-GPU computation of Mesh-based applications
A High-resolution approach for Tsunami impact simulation on graphics processing units
A high-speed multi-GPU implementation of bottom-up attention using CUDA
A High-Throughput GPU Framework for Adaptive Lossless Compression of Floating-Point Data
A high-throughput screening approach to discovering good forms of biologically inspired visual representation
A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition
A Highly Efficient GPU-CPU Hybrid Parallel Implementation of Sparse LU Factorization
A Highly Extensible Framework for Molecule Dynamic Simulation on GPUs
A Highly Parallel Reuse Distance Analysis Algorithm on GPUs
A Highly Parameterizable Framework for Conditional Restricted Boltzmann Machine Based Workloads Accelerated With FPGAs and OpenCL
A Highly Scalable Solution of an NP-Complete Problem Using CUDA
A Highly-Efficient Memory-Compression Scheme for GPU-Accelerated Intrusion Detection Systems
A History-Based Performance Prediction Model with Profile Data Classification for Automatic Task Allocation in Heterogeneous Computing Systems
A Human–Machine Collaborative Tuning Framework for Triton Kernel Optimization on SIMD Platforms
A hybrid algorithm for parallel molecular dynamics simulations
A Hybrid Analytical DRAM Performance Model
A Hybrid Approach to Parallel Connected Component Labeling Using CUDA
A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs
A Hybrid Computational Grid Architecture for Comparative Genomics
A Hybrid Computing Platform Digital Wideband Receiver Design and Performance Measurement
A hybrid condensed finite element model with GPU acceleration for interactive 3D soft tissue cutting
A hybrid CPU-GPU parallelization scheme of variable neighborhood search for inventory optimization problems
A Hybrid CPU/GPU Cluster for Encryption and Decryption of Large Amounts of Data
A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection
A Hybrid Framework for Fast and Accurate GPU Performance Estimation through Source-Level Analysis and Trace-Based Simulation
A Hybrid GPU-FPGA-based Computing Platform for Machine Learning
A Hybrid GPU/CPU FFT Library for Large FFT Problems
A hybrid Hermitian general eigenvalue solver
A Hybrid Method for Computing Apparent Ridges
A Hybrid Multi-GPU Implementation of Simplex Algorithm with CPU Collaboration
A Hybrid Parallel Algorithm for Computing and Tracking Level Set Topology
A hybrid parallel framework for computational solid mechanics
A Hybrid Parallel Implementation of the Aho-Corasick and Wu-Manber Algorithms Using NVIDIA CUDA and MPI Evaluated on a Biological Sequence Database
A Hybrid Parallelization Approach for Distributed and Scalable Deep Learning
A Hybrid Programming Model for Compressible Gas Dynamics Using OpenCL
A Hybrid Software Framework for the GPU Acceleration of Multi-Threaded Monte Carlo Applications
A Hybrid-parallel Architecture for Applications in Bioinformatics
A Hyperelastic Finite-Element Model of Human Skin for Interactive Real-Time Surgical Simulation
A journey from single-GPU to optimized multi-GPU SPH with CUDA
A Kinetic Vlasov Model for Plasma Simulation Using Discontinuous Galerkin Method on Many-Core Architectures
A Language for Describing Optimization Strategies
A Language for Nested Data Parallel Design-space Exploration on GPUs
A Lattice Boltzmann Method Simulator for Microfluidics on GPU Cluster
A Lattice-Preserving Multigrid Method for Solving the Inhomogeneous Poisson Equations Used in Image Analysis
A Light-weight API for Portable Multicore Programming
A Light-Weight Approach to Dynamical Runtime Linking Supporting Heterogenous, Parallel, and Reconfigurable Architectures
A lighting model for fast rendering of forest ecosystems
A Lightweight Approach to Performance Portability with targetDP
A Lightweight, GPU-Based Software RAID System
A Linear Algebra Approach to Fast DNA Mixture Analysis Using GPUs
A linguistic approach to concurrent, distributed, and adaptive programming across heterogeneous platforms
A load balance multi-scheduling model for OpenCL kernel tasks in an integrated cluster
A local diffusion wavelet approach for scattered data registration based on GPU
A Locality-Aware Memory Hierarchy for Energy-Efficient GPU Architectures
A low-cost 3D human interface device using GPU-based optical flow algorithms
A Low-Cost Solution For Excavator Simulation With Realistic Visual Effect
A low-power handheld GPU using logarithmic arithmetic and triple DVFS power domains
A Low-Power Hybrid CPU-GPU Sort
A low-power integrated x86-64 and graphics processor for mobile computing devices
A Machine-Learning Framework for Design for Manufacturability
A Many Threaded CUDA Interpreter for Genetic Programming
A Many-core Machine Model for Designing Algorithms with Minimum Parallelism Overheads
A map reduce framework for programming graphics processors
A Map-Reduce-Like System for Programming and Optimizing Data-Intensive Computations on Emerging Parallel Architectures
A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction
A MapReduce Framework for Heterogeneous Computing Architectures
A Markovian event-based framework for stochastic spiking neural networks
A Massive Data Parallel Computational Framework on Petascale/Exascale Hybrid Computer Systems
A massively multicore parallelization of the Kohn-Sham energy gradients
A Massively Parallel Adaptive Fast Multipole Method on Heterogeneous Architectures
A massively parallel adaptive fast-multipole method on heterogeneous architectures
A Massively Parallel Algorithm for Cell Classification Using CUDA
A massively parallel algorithm for constructing the BWT of large string sets
A Massively Parallel Approach for Nonlinear Interdependency Analysis of Multivariate Signals with GPGPU
A Massively Parallel Architecture for Bioinformatics
A Massively Parallel Associative Memory Based on Sparse Neural Networks
A massively parallel framework using P systems and GPUs
A massively parallel implementation of QC-LDPC decoder on GPU
A massively parallel program to solve the phase field formulation for crack propagation
A master-slave robotic simulator based on GPUDirect
A matrix approach to tomographic reconstruction and its implementation on GPUs
A mechanism for balancing accuracy and scope in cross-machine black-box GPU performance modeling
A memory access model for highly-threaded many-core architectures
A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs
A Memory Centric Kernel Framework for Accelerating Short-Range, Interactive Particle Simulation
A Memory Efficient Algorithm for Adaptive Multidimensional Integration with Multiple GPUs
A Memory Efficient and Fast Sparse Matrix Vector Product on a GPU
A Memory Model for Scientific Algorithms on Graphics Processors
A memory optimization technique for software-managed scratchpad memory in GPUs
A Memory-Efficient Algorithm for Large-Scale Symmetric Tridiagonal Eigenvalue Problem on Multi-GPU Systems
A meshless hierarchical representation for light transport
A Metaprogramming and Autotuning Framework for Deploying Deep Learning Applications
A Method for Accelerating Bronchoscope Tracking Based on Image Registration by GPGPU
A method for decompilation of AMD GCN kernels to OpenCL
A Method for Large-Scale Terrain Rendering Based-on GPU
A method for speeding up beam-tracing simulation using thread-level parallelization
A Method to Improve Interest Point Detection and its GPU Implementation
A methodology for comparing optimization algorithms for auto-tuning
A Methodology for Translating C-Programs to OpenCL
A Metric for Performance Portability
A Micro-benchmark Suite for AMD GPUs
A Microbenchmark Framework for Performance Evaluation of OpenMP Target Offloading
A middleware for efficient stream processing in CUDA
A minimal model for acoustic forces on Brownian particles
A Mixed Hierarchical Algorithm for Nearest Neighbor Search
A mixed precision semi-Lagrangian algorithm and its performance on accelerators
A Mixed-Precision Algorithm for the Solution of Lyapunov Equations on Hybrid CPU-GPU Platforms
A ML-based resource utilization OpenCL GPU-kernel fusion model
A mobile robot navigation with use of CUDA parallel architecture
A Model Extraction Attack on Deep Neural Networks Running on GPUs
A Model for Real Time Ocean Breaking Waves Animation
A model of dynamic compilation for heterogeneous compute platforms
A model-driven partitioning and auto-tuning integrated framework for sparse matrix-vector multiplication on GPUs
A Modeling Approach based on UML/MARTE for GPU Architecture
A Modular Framework for Deformation and Fracture using GPU Shaders
A modular GPU raytracer using OpenCL for non-interactive graphics
A Modular System Architecture for Online Parallel Vision Pipelines
A molecular docking system using CUDA
A Monte Carlo Neutron Transport Code for Eigenvalue Calculations on a Dual-GPU System and CUDA Environment
A Moving Least Squares Based Approach for Contour Visualization of Multi-Dimensional Data
A MPI back-end for the OpenACC accULL. Exploiting OpenACC semantics in Message Passing Clusters
A Multi GPU Read Alignment Algorithm with Model-based Performance Optimization
A multi-agent architecture for scheduling of high performance services in a GPU cluster
A multi-GPU accelerated solver for the three-dimensional two-phase incompressible Navier-Stokes equations
A multi-GPU acceleration for 3D imaging of the prostate
A Multi-GPU Compute Solution for Optimized Genomic Selection Analysis
A Multi-GPU Programming Library for Real-Time Applications
A Multi-GPU Sources Reconstruction Method for Imaging Applications
A Multi-GPU Spectrometer System for Real-Time Wide Bandwidth Radio Signal Analysis
A multi-lane traffic simulation model via continuous cellular automata
A multi-platform linear algebra toolbox for finite element solvers on heterogeneous clusters
A Multi-Stage CUDA Kernel for Floyd-Warshall
A multi-Teraflop Constituency Parser using GPUs
A Multi-View Stereo Implementation on Massively Parallel Hardware
A multigrid solver for boundary value problems using programmable graphics hardware
A multiphysics and multiscale software environment for modeling astrophysical systems
A Mutable Hardware Abstraction to Replace Threads
A near real-time framework for extracting tip-sample forces in dynamic atomic force microscopy (dAFM)
A Nearest Neighbor Data Structure for Graphics Hardware
A Neighborhood Grid Data Structure for Massive 3D Crowd Simulation on GPU
A Network Intrusion Detection System Framework based on Hadoop and GPGPU
A Networked Dataflow Simulation Environment for Signal Processing and Data Mining Applications
A new adaptive model for real-time fluid simulation with complex boundaries
A New Approach for Color Character Extraction Based on Parallel Clustering
A new approach for sparse matrix vector product on NVIDIA GPUs
A New Approach of Performance Analysis of Certain Graph Algorithms
A New Approach to rCUDA
A new approach to the lattice Boltzmann method for graphics processing units
A New Architecture for Games and Simulations Using GPUs
A New Architecture for Optimization Modeling Frameworks
A New Class of Parallel Scheduling Algorithms
A New Compilation Path: From Python/NumPy to OpenCL
A New Cooperative Evolutionary Multi-Swarm Optimizer Algorithm Based on CUDA Parallel Architecture Applied to Solve Engineering Optimization Problems
A new CUDA-based GPU implementation of the two-dimensional Athena code
A New Data Layout For Set Intersection on GPUs
A New Digital Repository for Hyperspectral Imagery with Unmixing-Based Retrieval Functionality Implemented on GPUs
A New Era in Scientific Computing: Domain Decomposition Methods in Hybrid CPU-GPU Architectures
A new GPU-accelerated hydrodynamical code for numerical simulation of interacting galaxies
A New GPU-based Approach to the Shortest Path Problem
A New GPU-Based Neighbor Search Algorithm for Fluid Simulations
A new gravitational N-body simulation algorithm for investigation of cosmological chaotic advection
A New High Performance GPU-based Approach to Prime Numbers Generation
A new method for GPU based irregular reductions and its application to k-means clustering
A New method to Design Accurate Images with Tree Structural Transformations
A New Morphological Anomaly Detection Algorithm for Hyperspectral Images and its GPU Implementation
A new multi-core pipelined architecture for executing sequential programs for parallel geospatial computing
A New Non-Blocking Approach on GPU Dynamical Memory Management
A New Parallel Implementation of DSI Based Disparity Computation Using CUDA
A New Parallel Method of Smith-Waterman Algorithm on a Heterogeneous Platform
A new parallel tool for classification of remotely sensed imagery
A new parallel video understanding and retrieval system
A new parallelisation technique for heterogeneous CPUs
A new physics engine with automatic process distribution between CPU-GPU
A new ray-tracing scheme for 3D diffuse radiation transfer on highly parallel architectures
A new representation of intensity atlas for GPU-accelerated instance generation
A New Software Based GPU Framework
A New Sparse Matrix Vector Multiplication GPU Algorithm Designed for Finite Element Problems
A New Tool for Classification of Satellite Images Available from Google Maps: Efficient Implementation in Graphics Processing Units
A new way in few-body scattering calculations: discretized Faddeev equations solved on GPU
A Newcomer In The PGAS World - UPC++ vs UPC: A Comparative Study
A Non-linear GPU Thread Map for Triangular Domains
A Normalized Particle Swarm Optimization Algorithm to Price Complex Chooser Option and Accelerating its Performance with GPU
A Note on Auto-tuning GEMM for GPUs
A Note on Particle Filters Applied to DSGE Models
A note on the GPU acceleration of eigenvalue computations
A novel and scalable Multigrid algorithm for many-core architectures
A novel approach for implementing Steganography with computing power obtained by combining Cuda and Matlab
A novel approach to evaluating compact finite differences and similar tridiagonal schemes on GPU-accelerated clusters
A Novel Approach to Visualizing Dark Matter Simulations
A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs
A Novel Computational Model for GPUs with Applications to Efficient Algorithms
A Novel Computing-Enhanced Cloud Storage Model Supporting Combined Service Aware
A Novel CPU/GPU Simulation Environment for Large-Scale Biologically-Realistic Neural Modeling
A Novel CSR-Based Sparse Matrix-Vector Multiplication on GPUs
A Novel Data Structure for Particle System Simulation based on GPU with the Use of Neighborhood Grids
A novel FPGA-based SVM classifier
A Novel GPU Implementation of Eigen Analysis for Risk Management
A Novel GPU-Based Deformation Pipeline
A Novel GPU-based Parallel Implementation Scheme and Performance Analysis of Robot Forward Dynamics Algorithms
A Novel Graphical Processing Unit Method for Power Systems Security Analysis
A novel hardware acceleration technique for high performance parallel FDTD method
A Novel Implementation of QuickHull Algorithm on the GPU
A Novel Interface for Interactive Exploration of DTI Fibers
A Novel Learning Algorithm for Bayesian Network and Its Efficient Implementation on GPU
A Novel Mapping of Arbitrary Precision Integer Operations to the GPU
A Novel Memory-Efficient Deep Learning Training Framework via Error-Bounded Lossy Compression
A Novel Monte Carlo Noise Reduction Operator
A Novel Multi-GPU Neural Simulator
A novel multiple-walk parallel algorithm for the Barnes-Hut treecode on GPUs - towards cost effective, high performance N-body simulation
A Novel Open Source Morphology Using GPU Processing With LTU-CUDA
A novel parallel Tier-1 coder for JPEG2000 using GPUs
A Novel Scheme for High Performance Finite-Difference Time-Domain (FDTD) Computations Based on GPU
A novel sorting algorithm for many-core architectures based on adaptive bitonic sort
A novel stereo camera based collision warning system for automotive applications
A NPR System for Generating Floral Patterns based on L-System
A Numerical Study of Continuous Data Assimilation for the 2D-NS Equations Using Nodal Points
A numerical tour of wave propagation
A Package for Multi-Dimensional Monte Carlo Integration on Multi-GPUs
A Package for OpenCL Based Heterogeneous Computing on Clusters with Many GPU Devices
A parallel accelerator for semantic search
A Parallel Access Method for Spatial Data Using GPU
A Parallel Active-Set Method for Solving Frictional Contact Problems
A Parallel Algorithm Development Model for the GPU Architecture
A Parallel Algorithm for Calculation of Large Determinants with High Accuracy for GPUs and MPI clusters
A Parallel Algorithm for Dot Product over Word-Size Finite Field Using Floating-Point Arithmetic
A Parallel Algorithm for Enumerating Joint Weight of a Binary Linear Code in Network Coding
A Parallel Algorithm for Error Correction in High-Throughput Short-Read Data on CUDA-Enabled Graphics Hardware
A Parallel Algorithm for Flight Route Planning on GPU Using CUDA (thesis)
A parallel algorithm for implicit depletant simulations
A Parallel Algorithm for LZW Decompression, with GPU Implementation
A parallel algorithm for the constrained shortest path problem on lattice graphs
A Parallel Algorithm for UAV Flight Route Planning on GPU
A Parallel Algorithm of PCA-SIFT Based on CUDA
A Parallel Algorithm to Test Chordality of Graphs
A Parallel Ant Colony Optimization Algorithm for the Travelling Salesman Problem: Improving Performance Using CUDA
A parallel Ant Colony Optimization algorithm with GPU-acceleration based on All-In-Roulette selection
A Parallel Auxiliary Grid AMG Method for GPU
A Parallel Cellular Automaton Simulation Framework using CUDA
A Parallel Compression Pipeline for Improving GPU Virtualization Data Transfers
A parallel decoding algorithm of LDPC codes using CUDA
A Parallel Deconvolution Algorithm in Perfusion Imaging
A Parallel Depth-aided Exemplar-based Inpainting for Real-time View Synthesis on GPU
A Parallel Edge Preserving Algorithm for Salt and Pepper Image Denoising
A parallel error diffusion implementation on a GPU
A parallel evolutionary algorithm to optimize dynamic memory managers in embedded systems
A Parallel Framework for Parametric Maximum Flow Problems in Image Segmentation
A parallel Genetic Programming algorithm for classification
A Parallel Gibbs Sampling Algorithm for Motif Finding on GPU
A Parallel GPU Version of the Traveling Salesman Problem
A Parallel Image Segmentation Algorithm on GPUs
A Parallel Immune Algorithm Based on Fine-Grained Model with GPU-Acceleration
A parallel immune algorithm for traveling salesman problem and its application on cold rolling scheduling
A parallel implementation of a derivative pricing model incorporating SABR calibration and probability lookup tables
A Parallel Implementation of the Galerkin Method for Solving Partial Differential Equations on a Triangular Mesh
A Parallel Implementation of the Self Organising Map using OpenCL
A Parallel Intermediate Representation for Embedded Languages
A Parallel Jacobi-Type Lattice Basis Reduction Algorithm
A parallel mapping of optical flow to Compute Unified Device Architecture for motion-based image segmentation
A Parallel Mediated Reality Platform
A Parallel Method for Impulsive Image Noise Removal on Hybrid CPU/GPU Systems
A parallel method for tuning Fuzzy TSK Systems with CUDA
A Parallel Monte Carlo Code for Simulating Collisional N-body Systems
A Parallel Multi-view Rendering Architecture
A parallel pattern for iterative stencil + reduce
A Parallel Preconditioned Bi-Conjugate Gradient Stabilized Solver for the Poisson Problem
A Parallel Preconditioned Conjugate Gradient Solver for the Poisson Problem on a Multi-GPU Platform
A Parallel PSO Algorithm for a Watermarking Application on a GPU
A Parallel Ray Tracing Architecture Suitable for Application-Specific Hardware and GPGPU Implementations
A Parallel Recursive Approach for Solving All Pairs Shortest Path Problem on GPU using OpenCL
A parallel search tree algorithm for vertex cover on graphical processing units
A Parallel Solution to Finding Nodal Neighbors in Generic Meshes
A Parallel Solver for Markov Decision Process in Crowd Simulations
A Parallel Sparse Tensor Benchmark Suite on CPUs and GPUs
A Parallel Streaming Motion Estimation for Real-Time HD H.264 Encoding on Programmable Processors
A Parallel Supercomputer Implementation of a Biological Inspired Neural Network and its use for Pattern Recognition
A Parallel Tree Pattern Query Processing Algorithm for Graph Databases using a GPGPU
A Parallel Twig Join Algorithm for XML Processing using a GPGPU
A parallelization cost model for GPU
A Parallelized Algorithm for Hyperspectral Biometrics
A Parallelized Implementation for H.264 Real-time Encoding Scheme
A Parallelizing Matlab Compiler Framework and Run time for Heterogeneous Systems
A parameterisable and scalable Smith-Waterman algorithm implementation on CUDA-compatible GPUs
A particle system for interactive visualization of 3D flows
A particle-based method for viscoelastic fluids animation
A pattern recognition system for prostate mass spectra discrimination based on the CUDA parallel programming model
A Pattern Specification and Optimizations Framework for Accelerating Scientific Computations on Heterogeneous Clusters
A PC-based fully-programmable medical ultrasound imaging system using a graphics processing unit
A PCG Implementation of an Elliptic Kernel in an Ocean Global Circulation Model Based on GPU Libraries
A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications
A Performance Analysis Framework for Optimizing OpenCL Applications on FPGAs
A Performance and Scalability Analysis of the Tsunami Simulation EasyWave for Different Multi-Core Architectures and Programming Models
A Performance Comparison of Algebraic Multigrid Preconditioners on CPUs, GPUs, and Xeon Phis
A Performance Comparison of CUDA and OpenCL
A Performance Comparison of Different Graphics Processing Units Running Direct N-Body Simulations
A Performance Comparison of Sort and Scan Libraries for GPUs
A Performance Criteria for parallel Computation on basis of block size using CUDA Architecture
A Performance Model and Optimization Strategies for Automatic GPU Code Generation of PDE Systems Described by a Domain-Specific Language
A Performance Model for Memory Bandwidth Constrained Applications on Graphics Engines
A Performance Model for the Communication in Fast Multipole Methods on HPC Platforms
A Performance Modeling and Optimization Analysis Tool for Sparse Matrix-Vector Multiplication on GPUs
A Performance Optimization Support Framework for GPU-based Traffic Simulations with Negotiating Agents
A Performance Portable Matrix Free Dense MTTKRP in GenTen
A performance prediction model for the CUDA GPGPU platform
A performance spectrum for parallel computational frameworks that solve PDEs
A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations
A performance study of general-purpose applications on graphics processors using CUDA
A Performance Study of Zero Crossing Rate (ZCR) on Graphics Processors (GPUs) Using CUDA
A Performance-Portable SYCL Implementation of CRK-HACC for Exascale
A performance/cost evaluation for a GPU-based drug discovery application on volunteer computing
A Personal Surround Environment: Projective Display with Correction for Display Surface Geometry and Extreme Lens Distortion
A Pervasive Parallel Framework for Visualization
A pilgrimage to gravity on GPUs
A platform-independent tool for modeling parallel programs
A Polyphase Filter For GPUs And Multi-Core Processors
A polyphase filter for many-core architectures
A portable and high-performance matrix operations library for CPUs, GPUs and beyond
A portable C++ library for memory and compute abstraction on multi-core CPUs and GPUs
A Portable High-Productivity Approach to Program Heterogeneous Systems
A portable implementation of the radix sort algorithm in OpenCL
A Portable OpenCL Lattice Boltzmann Code for Multi- and Many-core Processor Architectures
A portable platform for accelerated PIC codes and its application to GPUs using OpenACC
A Power Efficient Neural Network Implementation on Heterogeneous FPGA and GPU Devices
A power-aware symbiotic scheduling algorithm for concurrent GPU kernels
A Power-Efficient Scheduling Approach in a Cpu-Gpu Computing System by Thread-Based Parallel Programming
A practical and robust bump-mapping technique for today's GPUs
A practical approach of curved ray prestack Kirchhoff Time Migration on GPGPU
A practical multi-viewer tabletop autostereoscopic display
A Practical Performance Model for Compute and Memory Bound GPU Kernels
A Practical Quicksort Algorithm for Graphics Processors
A Practical Visualization Strategy for Large-Scale Supernovae CFD Simulations
A Practical, Targeted, and Stealthy Attack Against WPA Enterprise Authentication
A Predictive Model for Solving Small Linear Algebra Problems in GPU Registers
A Predictive Shutdown Technique for GPU Shader Processors
A Preliminary Review of Literature on Parallel Constraint Solving
A preliminary study of OpenCL for accelerating CT reconstruction and image recognition
A Problem-Based Learning Approach to GPU Computing
A Program Behavior Study of Block Cryptography Algorithms on GPGPU
A Programmable Processing Array Architecture Supporting Dynamic Task Scheduling and Module-Level Prefetching
A programming framework for data streaming on the Xeon Phi
A programming language interface to describe transformations and code generation
A Programming Model for GPU Load Balancing
A programming model for GPU-based parallel computing with scalability and abstraction
A progressive mesh method for physical simulations using lattice Boltzmann method on single-node multi-gpu architectures
A prototyping environment for high performance reconfigurable computing
A pseudospectral matrix method for time-dependent tensor fields on a spherical shell
A PTX Code Generator for LLVM
A pure vision-based approach to topological SLAM
A Push-Relabel-Based Maximum Cardinality Bipartite Matching Algorithm on GPUs
A Qualitative Comparison Study Between Common GPGPU Frameworks
A Quantitative Comparison of Emulated Shared Memory Architectures to Current Multicore CPUs and GPUs
A Quantitative Performance Analysis Model for GPU Architectures
A Quantitative Study of Irregular Programs on GPUs
A Quasi-Parallel GPU-Based Algorithm for Delaunay Edge-Flips
A QUDA-branch to compute disconnected diagrams in GPUs
A Ray Tracing Implementation Performance Comparison between the CPU and the GPU
A readahead prefetcher for GPU file system layer
A real time Breast Microwave Radar imaging reconstruction technique using simt based interpolation
A real-time 1080p 2D-to-3D video conversion system
A real-time augmented view synthesis system for transparent car pillars
A Real-Time Capable Software-Defined Receiver Using GPU for Adaptive Anti-Jam GPS Sensors
A real-time coarse-to-fine multiview capture system for all-in-focus rendering on a light-field display
A Real-time Coherent Dedispersion Pipeline for the Giant Metrewave Radio Telescope
A Real-Time Computer Vision Library for Heterogeneous Processing Environments
A Real-time GPU Implementation of the SIFT Algorithm for Large-Scale Video Analysis Tasks
A Real-Time Multigrid Finite Hexahedra Method for Elasticity Simulation using CUDA
A Real-Time ProCam System for Interaction with Chinese Ink-and-Wash Cartoons
A real-time procedural shading system for programmable graphics hardware
A Real-time Single Pulse Detection Algorithm for GPUs
A Real-Time Soft Shadow Rendering Algorithm by Occluder-Discretization
A real-time subsurface scattering rendering method for dynamic objects
A Real-Time, GPU-Based, Non-Imaging Back-End for Radio Telescopes
A realtime GPU subdivision kernel
A Reconfigurable GPU Implementation for Tomlinson-Harashima Precoding
A Reconfigurable Processor for Phylogenetic Inference
A reduced order explicit dynamic finite element algorithm for surgical simulation
A Reduction of the Elastic Net to Support Vector Machines with an Application to GPU Computing
A refactoring tool to extract GPU kernels
A Region Growing Segmentation Algorithm for GPUs
A Reliable Throughput Gain on GPUs
A rendering method for simulated emission nebulae
A Reproducible Research Methodology for Designing and Conducting Faithful Simulations of Dynamic HPC Applications
A Research of MapReduce with GPU Acceleration
A Resource Selection System for Cycle Stealing in GPU Grids
A Resource-Efficient Computing Paradigm for Computational Protein Modeling Applications
A Restructuring Algorithm for CUDA
A Reverse-Projecting Pixel-Level Painting Algorithm
A Review of CUDA, MapReduce, and Pthreads Parallel Computing Models
A Review of the Parallelization Strategies for Iterative Algorithms
A Review on Parallelization of Node based Game Tree Search Algorithms on GPU
A Rigid Body Physics Engine for Interactive Applications
A Road Marking Extraction Method Using GPGPU
A Run-Time Adaptive FPGA Architecture for Monte Carlo Simulations
A Runtime Controller for OpenCL Applications on Heterogeneous System Architectures
A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
A Scala Prototype to Generate Multigrid Solver Implementations for Different Problems and Target Multi-Core Platforms
A Scalable and Reconfigurable Shared-Memory Graphics Cluster Architecture
A Scalable Approach to Solving Dense Linear Algebra Problems on Hybrid CPU-GPU Systems
A Scalable End-to-End Optimized Real-Time Image-Based Rendering Framework on Graphics Hardware
A Scalable Framework for Heterogeneous GPU-Based Clusters
A Scalable Framework for Monte Carlo Simulation Using FPGA-based Hardware Accelerators with Application to SPECT Imaging
A Scalable GPU-based Approach to Accelerate the Multiple-Choice Knapsack Problem
A scalable GPU-based approach to shading and shadowing for photorealistic real-time augmented reality
A Scalable graph-cut algorithm for N-D grids
A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches
A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators
A scalable hybrid algorithm based on domain decomposition and algebraic multigrid for solving partial differential equations on a cluster of CPU/GPUs
A Scalable Hybrid FPGA/GPU FX Correlator
A Scalable Lane Detection Algorithm on COTSs with OpenCL
A Scalable Multi-Path Microarchitecture for Efficient GPU Control Flow
A Scalable, Efficient Scheme for Evaluation of Stencil Computations over Unstructured Meshes
A scalable, numerically stable, high-performance tridiagonal solver using GPUs
A scheduling and runtime framework for a cluster of heterogeneous machines with multiple accelerators
A Scheduling Framework for a Heterogeneous Parallel Architecture
A Screen Space Quality Method for Data Abstraction
A scripting language for Digital Content Creation applications
A second generation of DEFG: Declarative Framework for GPUs
A Second-Order Distributed Trotter-Suzuki Solver with a Hybrid Kernel
A Self-Optimizing Framework for Developing Metrology Software on Massive Parallel Processor Architectures
A self-organization based optical flow estimator with GPU implementation
A self-organization based optical flow estimator with GPU implementation (thesis)
A Semi-Automated Tool Flow for Roofline Analysis of OpenCL Kernels on Accelerators
A Shader Library for OpenGL 4 and GLSL 4.3 Learning and Development
A shared file system abstraction for heterogeneous architectures
A shared-scene-graph image-warping architecture for VR: Low latency versus image quality
A short guide to CUDA C: For physicists with multi-core graphics cards
A Short Note on Gaussian Process Modeling for Large Datasets using Graphics Processing Units
A SIMD Interpreter for Genetic Programming on GPU Graphics Cards
A SIMD-efficient 14 instruction shader program for high-throughput microtriangle rasterization
A Similarity Measure for GPU Kernel Subgraph Matching
A Similarity-Based Analysis Tool for Scientific Application Porting
A simple and efficient way to compute depth maps for multi-view videos
A simple and flexible volume rendering framework for graphics-hardware-based raycasting
A simple GPU-based approach for 3D Voronoi diagram construction and visualization
A simple method to accelerate fringe analysis algorithms based on graphics processing unit and MATLAB
A Simplified and Accurate Model of Power-Performance Efficiency on Emergent GPU Architectures
A Simulation Framework for Scheduling Performance Evaluation on CPU-GPU Heterogeneous System
A simulation suite for lattice Boltzmann based real time CFD applications exploiting multi-level parallelism on modern multi-and many-core architectures
A Simulation Suite for Lattice-Boltzmann based Real-Time CFD Applications Exploiting Multi-Level Parallelism on modern Multi- and Many-Core Architectures
A Simulator for the Cafadis Real Time 3DTV Camera
A Single (Unified) Shader GPU Microarchitecture for Embedded Systems
A single-pass GPU ray casting framework for interactive out-of-core rendering of massive volumetric datasets
A small-world network model for distributed storage of semantic metadata
A Smart GPU Implementation of an Elliptic Kernel for an Ocean Global Circulation Model
A smooth particle hydrodynamics code to model collisions between solid, self-gravitating objects
A Software Framework for the Detection and Classification of Biological Targets in Bio-Nano Sensing
A Software-Based Self Test of CUDA Fermi GPUs
A Sorting Library for FPGA Implementation in OpenCL Programming
A Sparse Matrix Personality for the Convey HC-1
A sparse octree gravitational N-body code that runs entirely on the GPU processor
A Spiking Neural P system simulator based on CUDA
A Splitting Algorithm for Directional Regularization and Sparsification
A stand-alone Finite Difference Time Domain (FDTD) simulation for Integrated Optoelectronics Laboratory
A state-of-the-art password strength analysis demonstrator
A Static Analysis-based Cross-Architecture Performance Prediction Using Machine Learning
A Static Load Balancing Scheme for Parallel Volume Rendering on Multi-GPU Clusters
A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL
A Stencil DSEL for Single Code Accelerated Computing with SYCL
A stencil-based implementation of Parareal in the C++ domain specific embedded language STELLA
A Step towards Energy Efficient Computing: Redesigning A Hydrodynamic Application on CPU-GPU
A stereoscopic movie player with real-time content adaptation to the display geometry
A Stochastic-based Optimized Schwarz Method for the Gravimetry Equations on GPU Clusters
A straightforward CUDA implementation for interactive ray-tracing
A Straightforward Preprocessing Approach for Accelerating Convex Hull Computations on the GPU
A Strategy for Automatic Performance Tuning of Stencil Computations on GPUs
A Strategy for Automatically Generating High Performance CUDA Code for a GPU Accelerator from a Specialized Fortran Code Expression
A Stream Processor Cluster Architecture Model with the Hybrid Technology of MPI and CUDA
A stream-computing extension to OpenMP
A streaming model for nested data parallelism
A streaming narrow-band algorithm: interactive computation and visualization of level sets
A structural analysis of the A5/1 state transition graph
A structured parallel periodic arnoldi shooting algorithm for RF-PSS analysis based on GPU platforms
A Study of Complex Deep Learning Networks on High Performance, Neuromorphic, and Quantum Computers
A Study of CUDA Acceleration and Impact of Data Transfer Overhead in Heterogeneous Environment
A Study of Data Partitioning on OpenCL-based FPGAs
A Study of Floating-Point Precision Tuning in Deep Learning Operators Implementations
A study of integer sorting on multicores
A Study of Mixed Precision Strategies for GMRES on GPUs
A study of parallel evolution strategy: pattern search on a GPU computing platform
A Study of Parallel Sorting Algorithms Using CUDA and OpenMP
A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture
A Study of Productivity and Performance of Modern Vector Processors
A Study of Real-Time Lighting Effects
A Study of Scheduling a Neuro-imaging Application On a Heterogeneous CPU-GPU Cluster
A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs
A Study of Successive Over-relaxation Method Parallelization Over Modern HPC Languages
A Study of the Parallelization of Hybrid SAT Solver using CUDA
A Study of the Potential of Locality-Aware Thread Scheduling for GPUs
A study of the speed and the accuracy of the Boundary Element Method as applied to the computational simulation of biological organs
A Study of Time and Energy Efficient Algorithms for Parallel and Heterogeneous Computing
A Study on Efficient Application Mapping on Parallel Computing Accelerators
A Study on GPU Computing and Accelerating Simulation of Sedimentary Rock Structure
A Study on Neural-based Code Summarization in Low-resource Settings
A Study on Parallel Imaging Algorithm of 3D Geological Data
A study on tetrahedron-based inhomogeneous Monte Carlo optical simulation
A Study on the Acceleration of Arrival Curve Construction and Regular Specification Mining using GPUs
A Study on the Intersection of GPU Utilization and CNN Inference
A Summary of Recent GPU Developments and Key Enabling Technologies for Digital Media Applications
A Superresolution Framework for High-Accuracy Multiview Reconstruction
A Survey Of Architectural Approaches for Data Compression in Cache and Main Memory Systems
A Survey Of Architectural Approaches for Managing Embedded DRAM and Non-volatile On-chip Caches
A Survey of Architectural Techniques For DRAM Power Management
A Survey of Architectural Techniques For Improving Cache Power Efficiency
A Survey Of Architectural Techniques for Managing Process Variation
A Survey Of Architectural Techniques for Near-Threshold Computing
A Survey of Big Data, High Performance Computing, and Machine Learning Benchmarks
A survey of BRDF models for computer graphics
A Survey of Cache Bypassing Techniques
A Survey of Cache Partitioning Techniques for Multicore Processors
A Survey of Cloud Lighting and Rendering Techniques
A Survey of Cloud-Based GPU Threats and Their Impact on AI, HPC, and Cloud Computing
A Survey of Convolutional Neural Networks on Edge with Reconfigurable Computing
A Survey of CPU-GPU Heterogeneous Computing Techniques
A Survey of CUDA-based Multidimensional Scaling on GPU Architecture
A Survey of Deep Learning Library Testing Methods
A Survey of FPGA Based Deep Learning Accelerators: Challenges and Opportunities
A Survey of FPGA Based Neural Network Accelerator
A Survey of FPGA-based Accelerators for Convolutional Neural Networks
A Survey of General-Purpose Computation on Graphics Hardware
A Survey of General-purpose Polyhedral Compilers
A survey of GPU-based medical image computing techniques
A Survey of Machine Learning for Computer Architecture and Systems
A survey of medical image registration on graphics hardware
A Survey of Medical Image Registration on Multicore and the GPU
A Survey of Methods For Analyzing and Improving GPU Energy Efficiency
A Survey of Neural Computation on Graphics Processing Hardware
A survey of point-based techniques in computer graphics
A Survey of Power Management Techniques for Phase Change Memory
A Survey of Recent Developments in SYCL Compiler Implementations
A Survey of Recent Prefetching Techniques for Processor Caches
A Survey of ReRAM-based Architectures for Processing-in-memory and Neural Networks
A Survey of Soft-Error Mitigation Techniques for Non-Volatile Memories
A Survey of Software Techniques for Using Non-Volatile Memories for Storage and Main Memory Systems
A survey of sparse matrix-vector multiplication performance on large matrices
A Survey of System Architectures and Techniques for FPGA Virtualization
A Survey Of Techniques for Approximate Computing
A Survey Of Techniques for Architecting and Managing Asymmetric Multicore Processors
A Survey of Techniques for Architecting and Managing GPU Register File
A Survey Of Techniques for Architecting DRAM Caches
A Survey of Techniques for Architecting Processor Components using Domain Wall Memory
A Survey of Techniques for Architecting SLC/MLC/TLC Hybrid Flash Memory based SSDs
A Survey of Techniques for Architecting TLBs
A Survey Of Techniques for Cache Locking
A Survey of Techniques for Designing and Managing CPU Register File
A Survey of Techniques for Dynamic Branch Prediction
A Survey of Techniques For Improving Energy Efficiency in Embedded Computing Systems
A Survey of Techniques for Improving Security of GPUs
A Survey of Techniques for Improving Security of Non-volatile Memories
A Survey Of Techniques for Managing and Leveraging Caches in GPUs
A Survey of Techniques for Modeling and Improving Reliability of Computing Systems
A Survey on Agent-based Simulation using Hardware Accelerators
A Survey on Compiler Autotuning using Machine Learning
A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures
A Survey on Evaluating and Optimizing Performance of Intel Xeon Phi
A survey on FPGA-based accelerator for ML models
A Survey on GPU System Considering its Performance on Different Applications
A survey on graphic processing unit computing for large-scale data mining
A Survey on Hardware Accelerators for Large Language Models
A Survey on Optimization Techniques for Edge Artificial Intelligence (AI)
A Survey on Optimized Implementation of Deep Learning Models on the NVIDIA Jetson Platform
A Survey On Parallelization Of Data Mining Techniques
A survey on various computationally intensive parallel applications in High performance Computing System with OpenCL-MPI
A Survey Paper on Solving TSP using Ant Colony Optimization on GPU
A Switched Dynamical System Framework for Analysis of Massively Parallel Asynchronous Numerical Algorithms
A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX Code
A symbolic verifier for CUDA programs
A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves
A System for Capturing, Rendering and Multiplexing Images on Multi-view Autostereoscopic Display
A Systematic Literature Survey of Sparse Matrix-Vector Multiplication
A systematic performance study of the parallel programming framework SkePU 3 using HPC-benchmarks
A Systematic Survey of General Sparse Matrix-Matrix Multiplication
A Task Parallel Algorithm for Computing the Costs of All-Pairs Shortest Paths on the CUDA-Compatible GPU
A Task-centric Memory Model for Scalable Accelerator Architectures
A task-driven implementation of a simple numerical solver for hyperbolic conservation laws
A taxonomy of accelerator architectures and their programming models
A TBB-CUDA Implementation for Background Removal in a video-based Fire Detection System
A Template Metaprogramming Approach to Support Parallel Programs for Multicores
A Tensor Compiler for Unified Machine Learning Prediction Serving
A Test Drive of the NVIDIA Jetson TX1 Developer Kit for Deep Learning and Computer Vision Applications
A tile-based parallel Viterbi algorithm for biological sequence alignment on GPU with CUDA
A Time Optimal Parallel Algorithm for the Dynamic Programming on the Hierarchical Memory Machine
A time-energy performance analysis of MapReduce on heterogeneous systems with GPUs
A Tool for Automatic Suggestions for Irregular GPU Kernel Optimization
A Tool for Automatically Suggesting Source-Code Optimizations for Complex GPU Kernels
A Tool for Interactive Parallelization
A tool for mapping Single Nucleotide Polymorphisms using Graphics Processing Units
A tool set for random number generation on GPUs in R
A Toolkit for Building Dynamic Compilers for Array-Based Languages Targeting CPUs and GPUs
A toolkit to describe and interactively display three-manifolds embedded in four-space
A Training Framework and Architectural Design for Distributed Deep Learning
A training roadmap for new HPC users
A training-free nose tip detection method from face range images
A Translation Framework for Executing the Sequential Binary Code on CPU/GPU Based Architectures
A Translation Framework from RVC-CAL Dataflow Programs to OpenCL/SYCL based Implementations
A translation system for enabling data mining applications on GPUs
A translator framework for Dynamic Programming problems
A trigger system based on Graphics Processing Unit (GPU)
A Tuned and Scalable Fast Multipole Method as a Preeminent Algorithm for Exascale Systems
A Tuned, Concurrent-Kernel Approach to Speed Up the APSP Problem
A Tuning Framework for Software-Managed Memory Hierarchies
A tutorial on the implementations of linear image filters in CPU and GPU
A tutorial overview on the properties of the discrete cosine transform for encoded image and video processing
A two-fluid finite-volume solver based on OpenCL
A two-level real-time vision machine combining coarse- and fine-grained parallelism
A two-level simulator for spaceborne SAR
A two-level task scheduler on Multiple DSP system for OpenCL
A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization
A Two-stage Query by Singing/Humming System on GPU
A Unified Approach for Registration and Depth in Depth from Defocus
A Unified Approach to Variable Renaming for Enhanced Vectorization
A Unified FPGA Virtualization Framework for General-Purpose Deep Neural Networks in the Cloud
A Unified Framework for Multi-Sensor HDR Video Reconstruction
A Unified Iteration Space Transformation Framework for Sparse and Dense Tensor Algebra
A Unified Optimization Approach for CNN Model Inference on Integrated GPUs
A Unified Optimization Approach for Sparse Tensor Operations on GPUs
A Unified Optimizing Compiler Framework for Different GPGPU Architectures
A Unified Rolling Shutter and Motion Blur Model for 3D Visual Registration
A Unified Runtime System for Heterogeneous Multi-core Architectures
A unified sparse matrix data format for modern processors with wide SIMD units
A Unified, Hardware-Fitted, Cross-GPU Performance Model
A uniform approach for programming distributed heterogeneous computing systems
A Uniform Platform to Support Multigenerational GPUs for High Performance Stream-based Computing
A University-Industry Collaboration Case Study: Intel Real-Time Multi-View Face Detection Capstone Design Projects
A User's Guide to KSig: GPU-Accelerated Computation of the Signature Kernel
A Validation Testsuite for OpenACC 1.0
A Variant of Concurrent Constraint Programming on GPU
A Variant of Mersenne Twister Suitable for Graphic Processors
A Variant RSA Acceleration with Parallelization
A Variational Model for Interactive Shape Prior Segmentation and Real-Time Tracking
A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels
A very fast census-based stereo matching implementation on a graphics processing unit
A Very Simple Approach for 3-D to 2-D Mapping
A Video Deblurring Optimization Algorithm Based on Motion Detection
A view-dependent adaptivity metric for real time mesh tessellation
A Virtual Machine Model for Accelerating Relational Database Joins using a General Purpose GPU
A virtual memory based runtime to support multi-tenancy in clusters with GPUs
A visibility-based approach for occupancy grid computation in disparity space
A Vision for GPU-accelerated Parallel Computation on Geo-Spatial Datasets
A Visual Approach to Investigating Shared and Global Memory Behavior of CUDA Kernels
A volume segmentation approach based on GrabCut
A Watermarking Co-Processor for New Generation Graphics Processing Units
A Way For Accelerating The DNA Sequence Reconstruction Problem By CUDA
A work-efficient GPU algorithm for level set segmentation
A Workload Balanced MapReduce Framework on GPU Platforms
A Wrapper of OpenCL library for gVirtus Framework
A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing
ab-Stream: A Framework for programming Many-core
ABC-SysBio--approximate Bayesian computation in Python with GPU support
Abelian: A Compiler for Graph Analytics on Distributed, Heterogeneous Platforms
Abstract shade trees
Abstracting OpenCL for Multi-Application Workloads on CPU-FPGA Clusters
Abstraction and Implementation of Unstructured Grid Algorithms on Massively Parallel Heterogeneous Architectures
Abstractions for C++ code optimizations in parallel high-performance applications
Abstractions for Programming Graphics Processors in High-Level Programming Languages
Abundance Estimation Algorithms using NVIDIA CUDA Technology
ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code
Accelerate Cache Simulation with Generic GPU
Accelerate Deep Learning Inference with MCTS in the game of Go on the Intel Xeon Phi
Accelerate Local Tone Mapping for High Dynamic Range Images Using OpenCL with GPU
Accelerate micromagnetic simulations with GPU programming in MATLAB
Accelerate Scientific Deep Learning Models on Heterogeneous Computing Platform with FPGA
Accelerate Smoothed Particle Hydrodynamics using GPU
Accelerate video decoding with generic GPU
Accelerated 2D Image Processing on GPUs
Accelerated Approximate Nearest Neighbors Search Through Hierarchical Product Quantization
Accelerated Combinatorial Optimization using Graphics Processing Units and C++ AMP
Accelerated composite distribution function methods for computational fluid dynamics using GPU
Accelerated Computation of Minimum Enclosing Balls by GPU Parallelization and Distance Filtering
Accelerated cone beam CT reconstruction based on OpenCL
Accelerated cryo-EM structure determination with parallelisation using GPUs in relion-2
Accelerated Deep Learning using Intel Xeon Phi
Accelerated Dictionary Learning with GPU/Multicore CPU and Its Application to Music Classification
Accelerated dimension-independent adaptive Metropolis
Accelerated discovery and design of Fe-Co-Zr magnets with tunable magnetic anisotropy through machine learning and parallel computing
Accelerated Dynamic Programming on GPU: A Study of Speed Up and Programming Approach
Accelerated Event-by-Event Neutrino Oscillation Reweighting with Matter Effects on a GPU
Accelerated Flow Visualization of Advective-Diffusive Mixing Processes Using GPUs
Accelerated GPU Powered Methods for Auditing Security of Wireless Networks Using Probabilistic Password Generation
Accelerated GPU Simulation of Compressible Flow by the Discontinuous Evolution Galerkin Method
Accelerated Large-Scale Multiple Sequence Alignment
Accelerated Matrix Element Method with Parallel Computing
Accelerated MD Program Using CUDA Technology
Accelerated molecular dynamics force evaluation on graphics processing units for thermal conductivity calculations
Accelerated multi-view stereo using parallel processing capabilities of the GPUs
Accelerated Network Coding with Dynamic Stream Decomposition on Graphics Processing Unit
Accelerated Neural Networks on OpenCL Devices Using SYCL-DNN
Accelerated Nodal Discontinuous Galerkin Simulations for Reverse Time Migration with Large Clusters
Accelerated People Tracking Using Texture in a Camera Network
Accelerated polyhedral visual hulls using OpenCL
Accelerated Pressure Projection using OpenCL on GPUs
Accelerated Primality Testing Using GPUs
Accelerated protein structure comparison using TM-score-GPU
Accelerated ray tracing for radiotherapy dose calculations on a GPU
Accelerated realization method of infrared targets detection based on GPU
Accelerated regular grid traversals using extended anisotropic chessboard distance fields on a parallel stream processor
Accelerated rescaling of single Monte Carlo simulation runs with the Graphics Processing Unit (GPU)
Accelerated Root Finding for Computational Finance
Accelerated Runtime Verification of LTL Specifications with Counting Semantics
Accelerated simulation of spiking neural networks using GPUs
Accelerated Sparse Matrix Operations in Nonlinear Least Squares Solvers
Accelerated SQLite Database using GPUs
Accelerated Variance Reduction Methods on GPU
Accelerated video encoding using render context information
Accelerated Wide Baseline Matching using OpenCL
Accelerating 128-bit Floating-Point Matrix Multiplication on FPGAs
Accelerating 3D Fourier migration with graphics processing units
Accelerating a Bayesian Phylogenetic Inference Application with OpenACC
Accelerating a climate physics model with OpenCL
Accelerating a Cloud-Based Software GNSS Receiver
Accelerating a Linear Programming Algorithm on AMD GPUs
Accelerating a Movie Recommender System Using VirtualCL on a Heterogeneous GPU Cluster
Accelerating a Novel Particle-based Fluid Simulation on the GPU
Accelerating a three-dimensional finite-difference wave propagation code using GPU graphics cards
Accelerating a TV based JPEG decompression algorithm with Cuda
Accelerating Ab Initio Nuclear Physics Calculations with GPUs
Accelerating adaptive background subtraction with GPU and CBEA architecture
Accelerating Adaptive IDW Interpolation Algorithm on a Single GPU
Accelerating advanced MRI reconstructions on GPUs
Accelerating Algebraic Reconstruction Using CUDA-Enabled GPU
Accelerating Algorithms on GPUs in SCIRun: the Conjugate Gradient Case Study
Accelerating All-Atom Normal Mode Analysis with Graphics Processing Unit
Accelerating an imaging spectroscopy algorithm for submerged marine environments using heterogeneous computing
Accelerating and Characterizing Seam Carving Using a Heterogeneous CPU-GPU System
Accelerating Ant Colony Optimization-based Edge Detection on the GPU using CUDA
Accelerating Applications with Pattern-specific Optimizations on Accelerators and Coprocessors
Accelerating astrophysical particle simulations with programmable hardware (FPGA and GPU)
Accelerating AutoDock VINA with GPUs
Accelerating AutoDock4 with GPUs and Gradient-Based Local Search
Accelerating Band Linear Algebra Operations on GPUs with Application in Model Reduction
Accelerating batched 1D-FFT with a CUDA-capable computer
Accelerating Beam Dynamics Simulations with GPUs
Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC
Accelerating Binary Genetic Algorithm Driven Missile Design Optimization Routine with a CUDA Coded Six Degrees-Of-Freedom Simulator
Accelerating bioinformatics applications on CUDA-enabled multi-GPU systems
Accelerating biomedical signal processing algorithms with parallel programming on graphic processor units
Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism
Accelerating Bit Error Rate Simulation in MATLAB with Graphics Processors
Accelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design
Accelerating Blockchain Search of Full Nodes Using GPUs
Accelerating Boosting-based Face Detection on GPUs
Accelerating BP Neural Network-Based Image Compression by CPU and GPU Cooperation
Accelerating Braided B+ Tree Searches on a GPU with CUDA
Accelerating calculations of RNA secondary structure partition functions using GPUs
Accelerating cellular automata simulations using AVX and CUDA
Accelerating Clustering Coefficient Calculations on a GPU Using OPENCL
Accelerating CNN inference on FPGAs: A Survey
Accelerating CNN on FPGA: An Implementation of MobileNet on FPGA
Accelerating code on multi-cores with FastFlow
Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA Compatible Devices
Accelerating complex brain-model simulations on GPU platforms
Accelerating Component-Based Dataflow Middleware with Adaptivity and Heterogeneity
Accelerating Computational Algorithms
Accelerating Computational Finance Simulations with OpenCL
Accelerating Compute-Intensive Applications with GPUs and FPGAs
Accelerating Computer Vision Algorithms Using OpenCL on Mobile GPU - A Case Study
Accelerating Concurrent Heap on GPUs
Accelerating Constraint Automata Composition with GPGPU Parallelization
Accelerating Content-Based Image Retrieval via GPU-adaptive Index Structure
Accelerating convolutions on the sphere with hybrid GPU/CPU kernel splitting
Accelerating Correlated Quantum Chemistry Calculations Using Graphical Processing Units
Accelerating Correlation Power Analysis Using Graphics Processing Units
Accelerating Cosmological Data Analysis with Graphics Processors
Accelerating cosmological simulations on GPUs: a portable approach using OpenMP
Accelerating Cost Aggregation for Real-Time Stereo Matching
Accelerating Cryptographic Primitives with GPUs
Accelerating Cryptosystems on Hardware Platforms
Accelerating CUDA Graph Algorithms at Maximum Warp
Accelerating data clustering on GPU-based clusters under shared memory abstraction
Accelerating data mining workloads: current approaches and future challenges in system architecture design
Accelerating Database Query Processing on OpenCL-based FPGAs
Accelerating Deep Convolutional Neural Networks Using Specialized Hardware
Accelerating Deep Learning Inference with Cross-Layer Data Reuse on GPUs
Accelerating Deep Neural Network Training with Inconsistent Stochastic Gradient Descent
Accelerating Deep Neural Networks implementation: A survey
Accelerating Deep Neural Networks on Low Power Heterogeneous Architectures
Accelerating DEM simulations on GPUs by reducing the impact of warp divergences
Accelerating Density Functional Calculations with Graphics Processing Unit
Accelerating Deterministic and Stochastic Binarized Neural Networks on FPGAs Using OpenCL
Accelerating digital forensic searching through GPGPU parallel processing techniques
Accelerating Direction-Optimized Breadth First Search on Hybrid Architectures
Accelerating Discrete Wavelet Transforms on GPUs
Accelerating Discrete Wavelet Transforms on Parallel Architectures
Accelerating Dissipative Particle Dynamics Simulations on GPUs: Algorithms, Numerics and Applications
Accelerating distance matrix calculations utilizing GPU
Accelerating DNA analysis applications on GPU clusters
Accelerating DNA Sequence Analysis using Intel Xeon Phi
Accelerating Double Precision FEM Simulations with GPUs
Accelerating Double Precision Floating-point Hessenberg Reduction on FPGA and Multicore Architectures
Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores
Accelerating Dust Temperature Calculations with Graphics Processing Units
Accelerating Dynamic Binary Translation with GPUs
Accelerating Dynamic Time Warping Subsequence Search with GPUs and FPGAs
Accelerating Electron Tomography Reconstruction Algorithm ICON Using the Intel Xeon Phi Coprocessor on Tianhe-2 Supercomputer
Accelerating electrostatic surface potential calculation with multi-scale approximation on graphics processing units
Accelerating Encrypted Computing on Intel GPUs
Accelerating encryption using commodity hardware
Accelerating Energy Minimization using Graphics Processors
Accelerating epistasis analysis in human genetics with consumer graphics hardware
Accelerating error correction in high-throughput short-read DNA sequencing data with CUDA
Accelerating Euler Equations Numerical Solver on Graphics Processing Units
Accelerating Eulerian Fluid Simulation With Convolutional Networks
Accelerating evolutionary computation with graphics processing units
Accelerating Exact and Approximate Inference for (Distributed) Discrete Optimization with GPUs
Accelerating Exact Similarity Search on CPU-GPU Systems
Accelerating exotic option pricing and model calibration using GPUs
Accelerating Fast Fourier Transform for Wideband Channelization
Accelerating Fast Fourier Transforms Using Hadoop and CUDA
Accelerating feature extraction for patch-based Multi-View Stereo algorithm
Accelerating Financial Applications on the GPU
Accelerating finite-rate chemical kinetics with coprocessors: comparing vectorization methods on GPUs, MICs, and CPUs
Accelerating floating-point fitness functions in evolutionary algorithms: a FPGA-CPU-GPU performance comparison
Accelerating Fluids Simulation Using SPH and Implementation on GPU
Accelerating Foreign-Key Joins using Asymmetric Memory Channels
Accelerating Fruchterman-Reingold with OpenCL
Accelerating Fully Homomorphic Encryption on GPUs
Accelerating Fully Homomorphic Encryption Using GPU
Accelerating Genetic Programming through Graphics Processing Units
Accelerating Genetic Programming Using Graphics Processing Units
Accelerating Genome-Wide Association Studies Using CUDA Compatible Graphics Processing Units
Accelerating Genomics Research with OpenCL and FPGAs
Accelerating geometric queries using the GPU
Accelerating geoscience and engineering system simulations on graphics hardware
Accelerating Geospatial Analysis on GPUs using CUDA
Accelerating glassy dynamics using graphics processing units
Accelerating global sequence alignment using CUDA compatible multi-core GPU
Accelerating GPU Implementation of Contourlet Transform
Accelerating GPU kernels for dense linear algebra
Accelerating GPU Programs by Reducing Irregular Control Flow and Memory Access
Accelerating Graph Analysis with Heterogeneous Systems
Accelerating gravitational microlensing simulations using the Xeon Phi coprocessor
Accelerating H.264 Advanced Video Coding with GPU/CUDA Technology
Accelerating H.264 inter prediction in a GPU by using CUDA
Accelerating Habanero-Java Programs with OpenCL Generation
Accelerating Haskell Array Codes with Algorithmic Skeletons on GPUs
Accelerating Haskell array codes with multicore GPUs
Accelerating high-level engineering computations by automatic compilation of Geometric Algebra to hardware accelerators
Accelerating High-Order Stencils on GPUs
Accelerating High-Throughput Computing through OpenCL
Accelerating HPC codes on Intel(R) Omni-Path Architecture networks: From particle physics to Machine Learning
Accelerating IISPH: A Parallel GPGPU Solution Using CUDA
Accelerating Image Feature Comparisons using CUDA on Commodity Hardware
Accelerating image recognition on mobile devices using GPGPU
Accelerating Image Reconstruction in Dual-Head PET System by GPU and Symmetry Properties
Accelerating Image Reconstruction in Three-Dimensional Optoacoustic Tomography on Graphics Processing Units
Accelerating image registration of MRI by GPU-based parallel computation
Accelerating Image Retrieval Using Factorial Correspondence Analysis on GPU
Accelerating In-Memory Graph Database traversal using GPGPUS
Accelerating Inclusion-based Pointer Analysis on Heterogeneous CPU-GPU Systems
Accelerating incoherent dedispersion
Accelerating incompressible flow computations with a Pthreads-CUDA implementation on small-footprint multi-GPU platforms
Accelerating InSAR raw data simulation on GPU using CUDA
Accelerating Interpreted Programming Languages on GPUs with Just-In-Time Compilation and Runtime Optimisations
Accelerating iterative field-compensated MR image reconstruction on GPUs
Accelerating Iterative SpMV for Discrete Logarithm Problem using GPUs
Accelerating Java on Embedded GPU
Accelerating JPEG Decompression on GPUs
Accelerating K-Means on the Graphics Processor via CUDA
Accelerating Kernel Density Estimation on the GPU Using the CUDA Framework
Accelerating Kirchhoff Migration by CPU and GPU Cooperation
Accelerating Krylov Subspace Solvers on Graphics Processing Units
Accelerating Lagrangian Particle Dispersion in the Atmosphere with OpenCL across Multiple Platforms
Accelerating Lambert's Problem on the GPU in MATLAB
Accelerating Large Graph Algorithms on the GPU Using CUDA
Accelerating Large Scale Image Analyses on Parallel CPU-GPU Equipped Systems
Accelerating Large-Scale Convolutional Neural Networks with Parallel Graphics Multiprocessors
Accelerating large-scale protein structure alignments with graphics processing units
Accelerating large-scale simulations of cortical neuronal network development
Accelerating Lattice Boltzmann Fluid Flow Simulations Using Graphics Processors
Accelerating Lattice QCD Multigrid on GPUs Using Fine-Grained Parallelization
Accelerating LBM on a Tightly-Coupled Field Programmable Gate Array
Accelerating leukocyte tracking using CUDA: A case study in leveraging manycore coprocessors
Accelerating light scattering simulations of nanostructures by reconfigurable computing
Accelerating linear system solutions using randomization techniques
Accelerating Linpack Performance with Mixed Precision Algorithm on CPU+GPGPU Heterogeneous Cluster
Accelerating linpack with CUDA on heterogenous clusters
Accelerating Live Graph-Cut-Based Object Tracking Using CUDA
Accelerating Lossless Data Compression with GPUs
Accelerating Low-End Edge Computing with Cross-Kernel Functionality Abstraction
Accelerating Low-Fidelity Aerodynamic Codes
Accelerating mahout on heterogeneous clusters using HadoopCL
Accelerating MapReduce on a coupled CPU-GPU architecture
Accelerating marching cubes with graphics hardware
Accelerating MATLAB Image Processing Toolbox functions on GPUs
Accelerating Mean Shift Segmentation Algorithm on Hybrid CPU/GPU Platforms
Accelerating Mixed-Abstraction SystemC Models on Multi-Core CPUs and GPUs
Accelerating moderately stiff chemical kinetics in reactive-flow simulations using GPUs
Accelerating molecular docking and binding site mapping using FPGAs and GPUs
Accelerating Molecular Docking by Parallelized Heterogeneous Computing - A Case Study of Performance, Quality of Results, and Energy-Efficiency using CPUs, GPUs, and FPGAs
Accelerating Molecular Docking Calculations Using Graphics Processing Units
Accelerating molecular dynamic simulation on graphics processing units
Accelerating molecular dynamics simulations using Graphics Processing Units with CUDA
Accelerating Molecular Dynamics Simulations with GPUs
Accelerating molecular modeling applications with graphics processors
Accelerating Molecular Simulations with Triton: Fused GPU Kernels for TensorNet Neural Potentials
Accelerating Monte Carlo simulations with an NVIDIA graphics processor
Accelerating Multi-layer Perceptron based short term demand forecasting using Graphics Processing Units
Accelerating Multi-Scale Flows for LDDKBM Diffeomorphic Registration
Accelerating Multi-scale Image Fusion Algorithms Using CUDA
Accelerating Multi-Sensor Image Fusion Using Graphics Hardware
Accelerating Multiple Compound Comparison Using LINGO-based Load-Balancing Strategies on Multi-GPUs
Accelerating NBODY6 with Graphics Processing Units
Accelerating Nearest Neighbor Search on Manycore Systems
Accelerating non-linear image registration with GPUs
Accelerating Noninvasive Transmural Electrophysiological Imaging with CUDA
Accelerating NTRU based Homomorphic Encryption using GPUs
Accelerating NTRU Encryption with Graphics Processing Units
Accelerating numerical solution of stochastic differential equations with CUDA
Accelerating Outlier Detection with Uncertain Data using Graphics Processors
Accelerating Pairwise DNA Sequence Alignment using the CUDA Compatible GPU
Accelerating Parameter Sweep Applications Using CUDA
Accelerating parameter synthesis for stochastic models
Accelerating Particle Image Velocimetry Using Hybrid Architectures
Accelerating Particle Swarm Algorithm with GPGPU
Accelerating Partitional Algorithms for Flow Cytometry on GPUs
Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems
Accelerating Phase Correlation Functions Using GPU and FPGA
Accelerating phase unwrapping and affine transformations for optical quadrature microscopy using CUDA
Accelerating Phylogenetic Inference on GPUs: an OpenACC and CUDA comparison
Accelerating POCS interpolation of 3D irregular seismic data with Graphics Processing Units
Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic
Accelerating popular tomographic reconstruction algorithms on commodity PC graphics hardware
Accelerating Population Balance Model-based particulate process simulations via parallel computing
Accelerating Power Flow studies on Graphics Processing Unit
Accelerating Preconditioned Iterative Linear Solvers on GPU
Accelerating Protein Coordinate Conversion using GPUs
Accelerating Protein Sequence Search in a Heterogeneous Computing System
Accelerating Protein Structure Prediction using Particle Swarm Optimization on GPU
Accelerating QDP++ using GPUs
Accelerating QDP++/Chroma on GPUs
Accelerating Quadrature Methods for Option Valuation
Accelerating Quantum Chromodynamics Calculations with GPUs
Accelerating Quantum Monte Carlo Simulations with Emerging Architectures
Accelerating Radio Astronomy Cross-Correlation with Graphics Processing Units
Accelerating Radio Astronomy with Auto-Tuning
Accelerating Random Forests on CPUs and GPUs for Object-Class Image Segmentation
Accelerating reaction-diffusion simulations with general-purpose graphics processing units
Accelerating Real-time processing of the ATST Adaptive Optics System using Coarse-grained Parallel Hardware Architectures
Accelerating Recommender Systems using GPUs
Accelerating recurrent neural network language model based online speech recognition system
Accelerating recurrent neural network training using sequence bucketing and multi-GPU data parallelization
Accelerating Reed-Solomon coding in RAID systems with GPUs
Accelerating Regular LDPC Code Decoders on GPUs
Accelerating Regular-Expression Matching on FPGAs with High-Level Synthesis
Accelerating Regularized Iterative CT Reconstruction on Commodity Graphics Hardware (GPU)
Accelerating Resolution-of-the-Identity Second-Order Moller-Plesset Quantum Chemistry Calculations with Graphical Processing Units
Accelerating S3D: A GPGPU Case Study
Accelerating scientific applications using GPU's
Accelerating Scientific Computations with Mixed Precision Algorithms
Accelerating Scientific Research with Gemini: Case Studies and Common Techniques
Accelerating SELECT WHERE and SELECT JOIN Queries on a GPU
Accelerating Sequential Computer Vision Algorithms Using Commodity Parallel Hardware
Accelerating SIFT on parallel architectures
Accelerating Simulation Codes through the GeMTC Framework
Accelerating Simulation of Agent-Based Models on Heterogeneous Architectures
Accelerating Simulations of Light Scattering Based on Finite-Difference Time-Domain Method with General Purpose GPUs
Accelerating simultaneous algebraic reconstruction technique with motion compensation using CUDA-enabled GPU
Accelerating Smith-Waterman Local Sequence Alignment on GPU Cluster
Accelerating Smith-Waterman on Heterogeneous CPU-GPU Systems
Accelerating solutions of PDEs with GPU-based swept time-space decomposition
Accelerating Spark RDD Operations with Local and Remote GPU Devices
Accelerating Sparse Approximate Matrix Multiplication on GPUs
Accelerating Sparse Graph Neural Networks with Tensor Core Optimization
Accelerating Sparse Matrix Kernels on Graphics Processing Units
Accelerating Sparse Matrix Vector Multiplication on Many-Core GPUs
Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores
Accelerating spatial clustering detection of epidemic disease with graphics processing unit
Accelerating SQL Database Operations on a GPU with CUDA
Accelerating SSL with GPUs
Accelerating Statistical Static Timing Analysis Using Graphics Processing Units
Accelerating Stochastic Simulations on GPUs Using OpenCL
Accelerating String Matching Using Multi-Threaded Algorithm on GPU
Accelerating string tokenization with FPGAs for IoT data handling equipment
Accelerating Swarm Intelligence Algorithms with GPU-Computing
Accelerating SWHE based PIRs using GPUs
Accelerating System-Level Design Tasks Using Commodity Graphics Hardware: A Case Study
Accelerating SystemC Simulations using GPUs
Accelerating Template-Based Matching on the GPU for AR Applications
Accelerating ternary quantized convolutional neural networks using OpenCL for FPGA
Accelerating tetrahedral interpolation with data-level and Thread-Level Parallel optimization
Accelerating Text Mining Workloads in a MapReduce-based Distributed GPU Environment
Accelerating the ANSYS Direct Sparse Solver with GPUs
Accelerating The Cloud with Heterogeneous Computing
Accelerating the Conjugate Gradient Algorithm with GPUs in CFD Simulations
Accelerating the Critical Line Algorithm for Portfolio Optimization Using GPUs
Accelerating the D3Q19 Lattice Boltzmann Model with OpenACC and MPI
Accelerating the FDTD Method Using SSE and Graphics Processing Units
Accelerating the Fourier split operator method via graphics processing units
Accelerating the Gillespie Exact Stochastic Simulation Algorithm Using Hybrid Parallel Execution on Graphics Processing Units
Accelerating the Hough Transform with CUDA on Graphics Processing Units
Accelerating the local outlier factor algorithm on a GPU for intrusion detection systems
Accelerating the Nonequispaced Fast Fourier Transform on Commodity Graphics Hardware
Accelerating the Nonuniform Fast Fourier Transform Using FPGAs
Accelerating the numerical simulation of magnetic field lines in tokamaks using the GPU
Accelerating the Nussinov RNA folding algorithm with CUDA/GPU
Accelerating the pre-processing stages of JPEG encoder on a heterogenous system using OpenCL
Accelerating the Rate of Astronomical Discovery with GPU-Powered Clusters
Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing
Accelerating the scoring module of mass spectrometry-based peptide identification using GPUs
Accelerating the Simulations of the Ising Model by the GPU under the CUDA Environment
Accelerating the Smith-Waterman Algorithm for Bio-sequence Matching on GPU
Accelerating the Smoldyn Spatial Stochastic Biochemical Reaction Network Simulator Using GPUs
Accelerating the solution of families of shifted linear systems with CUDA
Accelerating the Stochastic Simulation Algorithm
Accelerating the Stochastic Simulation Algorithm using Emerging Architectures
Accelerating the Sweep3D for a Graphic Processor Unit
Accelerating Topic Model Training on a Single Machine
Accelerating total variation regularization for matrix-valued images on GPUs
Accelerating Twisted Mass LQCD with QPhiX
Accelerating Two Algorithms for Large-Scale Compound Selection on GPUs
Accelerating Unstructured Mesh Computational Fluid Dynamics on the NVidia Tesla GPU Architecture
Accelerating urban fast response Lagrangian dispersion simulations using inexpensive graphics processor parallelism
Accelerating Vector Calculations on GPU
Accelerating video decoding using GPU
Accelerating Viola-Jones Face Detection to FPGA-Level Using GPUs
Accelerating Volume Image Registration through Correlation Ratio based Methods on GPUs
Accelerating Wavelet Lifting on Graphics Hardware Using CUDA
Accelerating wavelet-based video coding on graphics hardware using CUDA
Accelerating Web Search using GPUs
Accelerating Winograd Convolutions using Symbolic Computation and Meta-programming
Accelerating Workloads on FPGAs via OpenCL: A Case Study with OpenDwarfs
Accelerating Wright-Fisher Forward Simulations on the Graphics Processing Unit
Acceleration and Energy Efficiency of a Geometric Algebra Computation using Reconfigurable Computers and GPUs
Acceleration and Optimisation of a Monte Carlo Code for Light Propagation in Sprays and Other Scattering Media
Acceleration as a Service (XaaS) Source Containers
Acceleration for the many, not the few
Acceleration Methods for Bayesian Network Sampling
Acceleration of a CFD Code with a GPU
Acceleration of a Full-scale Industrial CFD Application with OP2
Acceleration of a Locally Tuned Sine Non Linear Video Enhancement Algorithm on GPGPU
Acceleration of a QM/MM-QMC simulation using GPU
Acceleration of Acoustic Emission Signal Processing Algorithms using CUDA Standard
Acceleration of AES encryption on CUDA GPU
Acceleration of Agent-Based Pandemic Modeling on Multiple GPUs
Acceleration of an improved Retinex algorithm
Acceleration of bilateral filtering algorithm for manycore and multicore architectures
Acceleration of Binomial Options Pricing via Parallelizing along time-axis on a GPU
Acceleration of Block-Aware Matrix Factorization on Heterogeneous Platforms
Acceleration of calculation of Third Party Risk around an airport using OpenCL
Acceleration of cardiac tissue simulation with graphic processing units
Acceleration of Cellular Automata through Parallel Computing with OpenCL
Acceleration of CFD and data analysis using graphics processors
Acceleration of Coarse Grain Molecular Dynamics on GPU Architectures
Acceleration of Composite Order Bilinear Pairing on Graphics Hardware
Acceleration of computation speed for elastic wave simulation using a Graphic Processing Unit
Acceleration of computational quantum chemistry by heterogeneous computer architectures
Acceleration of Deep Learning on FPGA
Acceleration of Deterministic Boltzmann Solver with Graphics Processing Units
Acceleration of Diagrammatic Determinantal Quantum Monte Carlo Calculations using GPUs
Acceleration of direct volume rendering with programmable graphics hardware
Acceleration of Distance-to-Default with GPU
Acceleration of ensemble machine learning methods using many-core devices
Acceleration of FDTD mode solver by high-performance computing techniques
Acceleration of Feynman loop integrals in high-energy physics on many core GPUs
Acceleration of finite-difference time-domain (FDTD) using graphics processor units (GPU)
Acceleration of Functional Validation Using GPGPU
Acceleration of genetic algorithms for sudoku solution on many-core processors
Acceleration of GPU-based ultrasound simulation via data compression
Acceleration of grammatical evolution using graphics processing units: computational intelligence on consumer games and graphics hardware
Acceleration of Hardware Testing and Validation Algorithms using Graphics Processing Units
Acceleration of Hessenberg Reduction for Nonsymmetric Eigenvalue Problems in a Hybrid CPU-GPU Computing Environment
Acceleration of Hessenberg Reduction for Nonsymmetric Eigenvalue Problems Using GPU
Acceleration of Hessenberg Reduction for Nonsymmetric Matrix
Acceleration of information-theoretic data analysis with graphics processing units
Acceleration of Intrusion Detection in Encrypted Network Traffic Using Heterogeneous Hardware
Acceleration of iterative Navier-Stokes solvers on graphics processing units
Acceleration of k-Nearest Neighbor and SRAD Algorithms Using Intel FPGA SDK for OpenCL
Acceleration of large-scale FDTD simulations on high performance GPU clusters
Acceleration of Linear Finite-Difference Poisson-Boltzmann Methods on Graphics Processing Units
Acceleration of LOD-FDTD Method Using Fundamental Scheme on Graphics Processor Units
Acceleration of low-latency gravitational wave searches using Maxwell-microarchitecture GPUs
Acceleration of LSB Algorithm in GPU
Acceleration of Medical Image Registration using Graphics Process Units in Computing Normalized Mutual Information
Acceleration of Monte-Carlo Molecular Simulations on Hybrid Computing Architectures
Acceleration of Multiresolution Imaging Algorithms: A Comparative Study
Acceleration of multivariate analysis techniques in TMVA using GPUs
Acceleration of PET Monte Carlo simulation using the graphics hardware ray-tracing engine
Acceleration of physics simulation engine through OpenCL
Acceleration of PIC Simulation with GPU
Acceleration of Radiance for Lighting Simulation by Using Parallel Computing with OpenCL
Acceleration of real-life stencil codes on GPUs
Acceleration of recovery simulation on big model using GPU
Acceleration of Scientific Deep Learning Models on Heterogeneous Computing Platform with Intel FPGAs
Acceleration of Selective Cationic Antibacterial Peptides computation: A comparison of FPGA and GPU approaches
Acceleration of Solving Maxwell's Equations Using Cluster of GPUs
Acceleration of spiking neural networks in emerging multi-core and GPU architectures
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump Using CUDA-enabled GPU Hardware
Acceleration of stereo-matching on multi-core CPU and GPU
Acceleration of Streamed Tensor Contraction Expressions on GPGPU-Based Clusters
Acceleration of tensor-product operations for high-order finite element methods
Acceleration of the 3D ADI-FDTD method using graphics processor units
Acceleration of the GAMESS-UK electronic structure package on graphical processing units
Acceleration of the Method of Moments Calculations by Using Graphics Processing Units
Acceleration of the MMFF94 routines within OpenBabel using Eigen and OpenCL
Acceleration of the Smith-Waterman Algorithm using Single and Multiple Graphics Processors
Acceleration of the speed of tissue characterization algorithm for coronary plaque by employing GPGPU technique
Acceleration of Time-Domain Finite Element Method (TD-FEM) Using Graphics Processor Units (GPU)
Acceleration of TM cylinder EFIE with CUDA
Acceleration of Tsunami Wave Propagation Modeling based on Re-engineering of Computational Components
Acceleration of Variance of Color Differences-Based Demosaicing Using CUDA
Acceleration of Various Direct/Iterative Solvers for MoM by GPU and Its Computational Cost
Acceleration technique for volume rendering using 2D texture based ray plane casting on GPU
Acceleration Techniques for GPU-based Volume Rendering
Acceleration-as-a-Service: Exploiting Virtualised GPUs for a Financial Application
Accelerator Aware MPI Micro-benchmarking using CUDA, OpenACC and OpenCL
Accelerator weather forecasting
Accelerator-Oriented Algorithm Transformation for Temporal Data Mining
Accelerator: using data parallelism to program GPUs for general-purpose uses
AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
AccFFT: A library for distributed-memory 3-D FFT on CPU and GPU architectures
AccFFT: A library for distributed-memory FFT on CPU and GPU architectures
Accounting for Secondary Uncertainty: Efficient Computation of Portfolio Risk Measures on Multi and Many Core Architectures
Accounting for Uncertainty in Medical Data: A CUDA Implementation of Normalized Convolution
ACCTuner: OpenACC Auto-Tuner For Accelerated Scientific Applications
AccUDNN: A GPU Memory Efficient Accelerator for Training Ultra-deep Deep Neural Networks
accULL: An User-directed Approach to Heterogeneous Programming
Accuracy and performance of graphics processors: A Quantum Monte Carlo application case study
Accuracy, Memory, and Speed Strategies in GPU-Based Finite-Element Matrix-Generation
Accurate Analytic Models to Estimate Execution Time on GPU Applications
Accurate and Efficient Filtering using Anisotropic Filter Decomposition
Accurate Cross-Architecture Performance Modeling for Sparse Matrix-Vector Multiplication (SpMV) on GPUs
Accurate CUDA Performance Modeling for Sparse Matrix-Vector Multiplication
Accurate Energy and Performance Prediction for Frequency-Scaled GPU Kernels
Accurate Measurements and Precise Modeling of Power Dissipation of CUDA Kernels toward Power Optimized High Performance CPU-GPU Computing
Accurate Models of NVIDIA Tensor Cores
Accurate multi-view reconstruction using robust binocular stereo and surface meshing
Accurate real-time stereo correspondence using intra- and inter-scanline optimization
Accurate Sequence Alignment using Distributed Filtering on GPU Clusters
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale
Achieving a single compute device image in OpenCL for multiple GPUs
Achieving High Throughput Sequencing with Graphics Processing Units
Achieving high-performance with a sparse direct solver on Intel KNL
Achieving near native runtime performance and cross-platform performance portability for random number generation through SYCL interoperability
Achieving O(1) IP lookup on GPU-based software routers
Achieving Speedup in Aggregate Risk Analysis using Multiple GPUs
Achieving TeraCUPS on Longest Common Subsequence Problem using GPGPUs
ACL2 Meets the GPU: Formalizing a CUDA-based Parallelizable All-Pairs Shortest Path Algorithm in ACL2
ACO on Multiple GPUs with CUDA for Faster Solution of QAPs
ACO with tabu search on a GPU for solving QAPs using move-cost adjusted thread assignment
Acquisition Method of Spread Spectrum Signals Based on GPU Acceleration
ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time
Action Spotting and Recognition Based on a Spatiotemporal Orientation Analysis
Action-Based Multifield Video Visualization
Active Structured Learning for High-Speed Object Detection
Active thread compaction for GPU path tracing
Activity recognition from videos with parallel hypergraph matching on GPUs
Adaboost GPU-based Classifier for Direct Volume Rendering
AdaNet: A Scalable and Flexible Framework for Automatically Learning Ensembles
Adaptable particle-in-cell algorithms for graphical processing units
Adaptable Two-Dimension Sliding Windows on NVIDIA GPUs with Runtime Compilation
Adaptation of algorithms for underwater sonar data processing to GPU-based systems
Adaptation of an acoustic propagation model to the parallel architecture of a graphics processor
Adaptation of High Performance and High Capacity Reconfigurable Systems to OpenCL Programming Environments
Adaptation of the MapReduce programming framework to compute-intensive data-analytics kernels
Adapting a message-driven parallel application to GPU-accelerated clusters
Adapting data processing methods to modern GPU architecture
Adapting database components to heterogeneous environments
Adapting Irregular Computations to Large CPU-GPU Clusters in the MADNESS Framework
Adapting MoM with RWG Basis Functions to GPU Technology Using CUDA
Adapting Particle Filter Algorithms to Many-Core Architectures
Adapting the GA Approach to Solve Traveling Salesman Problems on CUDA Architecture
Adaptive algebraic multigrid on SIMD architectures
Adaptive and Hybrid Machine Learning Approaches Utilizing General Purpose Computing on Graphical Processing Units
Adaptive and Transparent Cache Bypassing for GPUs
Adaptive Data Migration in Load-Imbalanced HPC Applications
Adaptive discrete cosine transform-based image compression method on a heterogeneous system platform using Open Computing Language
Adaptive Dynamic Load Balancing in Heterogeneous Multiple GPUs-CPUs Distributed Setting: Case Study of B&B Tree Search
Adaptive enhancement and noise reduction in very low light-level video
Adaptive fast multipole methods on the GPU
Adaptive GPU Array Layout Auto-Tuning
Adaptive Hardware-accelerated Terrain Tessellation
Adaptive implementation selection in the SkePU skeleton programming library
Adaptive Input-aware Compilation for Graphics Engines
Adaptive Kinetic-Fluid Solvers for Heterogeneous Computing Architectures
Adaptive Line Tracking with Multiple Hypotheses for Augmented Reality
Adaptive load balancing for raycasting of non-uniformly bricked volumes
Adaptive Mesh Fluid Simulations on GPU
Adaptive Multi-GPU Exchange Monte Carlo for the 3D Random Field Ising Model
Adaptive Multi-level Blocking Optimization for Sparse Matrix Vector Multiplication on GPU
Adaptive OpenCL (ACL) Execution in GPU Architectures
Adaptive Optimization for OpenCL Programs on Embedded Heterogeneous Systems
Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing
Adaptive Optimization Techniques for High-Performance Computing
Adaptive parallelism mapping in dynamic environments using machine learning
Adaptive Partitioning for Iterated Sequences of Irregular OpenCL Kernels
Adaptive proxy geometry for direct volume manipulation
Adaptive Row-grouped CSR Format for Storing of Sparse Matrices on GPU
Adaptive sampling in three dimensions for volume rendering on GPUs
Adaptive sampling of intersectable models exploiting image and object-space coherence
Adaptive Sequential Posterior Simulators for Massively Parallel Computing Environments
Adaptive Simulation of Large-Scale Ocean Surface
Adaptive SpMV/SpMSpV on GPUs for Input Vectors of Varied Sparsity
Adaptive Task Size Control on High Level Programming for GPU/CPU Work Sharing
Adaptive Treelet Meshes for Efficient Streak-Surface Visualization on the GPU
Adaptive Video Encoding Based on OpenCL Face Recognition
Adaptive Work-Efficient Connected Components on the GPU
Adaptive, real-time visual simultaneous localization and mapping
Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation
Adding fault tolerance to OpenCL: Through redundant heterogeneous computing
Adding GPU Computing to Computer Organization Courses
Adding special-purpose processor support to the Erlang VM
Address Selection for Efficient Barriers on the Intel Xeon Phi
Addressing Challenges in Utilizing GPUs for Accelerating Privacy-Preserving Computation
ADHA: Automatic Data layout framework for Heterogeneous Architectures
Adhoc On-Demand Distance Vector Protocol For Energy Efficiency
Adjoint Algorithmic Differentiation of a GPU Accelerated Application
Adjoint Lattice Boltzmann for Topology Optimization on multi-GPU architecture
Adjustable GPU Acceleration for Hermitian Eigensystems
ADMM-NN: An Algorithm-Hardware Co-Design Framework of DNNs Using Alternating Direction Method of Multipliers
Advanced 2D Rasterization on Modern CPUs
Advanced Architectures for Astrophysical Supercomputing
Advanced CFD Modeling Using GeForce GPUs
Advanced Concurrency Control Algorithm Design and GPU System Support for High Performance In-Memory Data Management
Advanced illumination techniques for GPU volume raycasting
Advanced Joins on GPUs
Advanced MRI reconstruction toolbox with accelerating on GPU
Advanced Multi-Frame Rate Rendering Techniques
Advanced Optimization Techniques for Sparse Grids on Modern Heterogeneous Systems
Advanced Optimizations of An Implicit Navier-Stokes Solver on GPGPU
Advanced Programming Platform for efficient use of Data Parallel Hardware
Advanced Simulation Library: Expanding software ecosystem for the DSP/FPGA/GPU market
Advanced Techniques for the Rendering and Visualization of Volumetric Seismic Data
Advanced Trends of Heterogeneous Computing with CPU-GPU Integration: Comparative Study
Advanced ultrasound beam forming using GPGPU technology
Advanced Video Coding on CPUs and GPUs: Parallelization and RD Analysis
Advances in Electron Microscopy with Deep Learning
Advances in Semantic Patching for HPC-oriented Refactorings with Coccinelle
Advancing Large Scale Many-Body QMC Simulations on GPU Accelerated Multicore Systems
Advancing the distributed Multi-GPU ChASE library through algorithm optimization and NCCL library
Advantages and GPU implementation of high-performance indexed DNA search based on suffix arrays
Adventures in the microlensing cloud: large datasets, eResearch tools, and GPUs
ADWPNAS: Architecture-Driven Weight Prediction for Neural Architecture Search
AeminiumGPU: A CPU-GPU Hybrid Runtime for the Aeminium Language
AeminiumGPU: An Intelligent Framework for GPU Programming
Aeolian Sand Movement and Interacting with Vegetation: A GPU Based Simulation and Visualization Method
AES Algorithm Adapted on GPU Using CUDA for Small Data and Large Data Volume Encryption
AES and DES Encryption with GPU
AES Encryption Algorithm Based on the High Performance Computing of GPU
AES Encryption and Decryption Using Direct3D 10 API
AES Encryption Implementation and Analysis on Commodity Graphics Processing Units
AES Encryption Implementation on CUDA GPU and Its Analysis
AES encryption on modern consumer architectures
AES finalists implementation for GPU and multi-core CPU based on OpenCL
AES on GPU: a CUDA Implementation
AES Performance Analysis on Several Programming Environments, Operating Systems or Computational Platforms
Affine Vector Cache for memory bandwidth savings
AFiD-GPU: a versatile Navier-Stokes Solver for Wall-Bounded Turbulent Flows on GPU Clusters
AFOCL: Portable OpenCL Programming of FPGAs via Automated Built-in Kernel Management
Age and Gender Classification using Convolutional Neural Networks
Ageing at the Spin-Glass/Ferromagnet Transition: Monte Carlo Simulation using GPUs
Agent-based crowd simulation using GPU computing
Agent-Based Modeling on High Performance Computing Architectures
Agentic Code Optimization via Compiler-LLM Cooperation
AgentServe: Algorithm-System Co-Design for Efficient Agentic AI Serving on a Consumer-Grade GPU
AGFT: An Adaptive GPU Frequency Tuner for Real-Time LLM Inference Optimization
Aggregate Gaze Visualization with Real-time Heatmaps
Aging in the three-dimensional Random Field Ising Model
AI Benchmark: All About Deep Learning on Smartphones in 2019
AI Benchmark: Running Deep Neural Networks on Android Smartphones
AI Factories: It's time to rethink the Cloud-HPC divide
AIPerf: Automated machine learning as an AI-HPC benchmark
Air pollution modelling using a graphics processing unit with CUDA
Airborne Downward Looking Sparse Linear Array 3-D SAR Heterogeneous Parallel Simulation
Airborne radar clutter simulation using GPU (CUDA)
AIvailable: A Software-Defined Architecture for LLM-as-a-Service on Heterogeneous and Legacy GPUs
AKG kernel Agent: A Multi-Agent Framework for Cross-Platform Kernel Synthesis
Akid: A Library for Neural Network Research and Production from a Dataism Approach
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Algebraic 3D Reconstruction of Planetary Nebulae
Algebraic Splats Representation for Point Based Models
Algorithm 9xx: Sparse QR Factorization on the GPU
Algorithm Acceleration from GPGPUs for the ATLAS Upgrade
Algorithm and implementation of multi-channel spike sorting using GPU in a home-care surveillance system
Algorithm Construction for GPGPU
Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method
Algorithm level power efficiency optimization for CPU-GPU processing element in data intensive SIMD/SPMD computing
Algorithmic and Software System Support to Accelerate Data Processing in CPU-GPU Hybrid Computing Environments
Algorithmic Contributions to the Theory of Regular Chains
Algorithmic Differentiation: Application to Variational Problems in Computer Vision
Algorithmic GPGPU Memory Optimization
Algorithmic performance studies on graphics processing units
Algorithmic Skeleton Framework for the Orchestration of GPU Computations
Algorithmic Trading: A brief, computational finance case study on data centre FPGAs
Algorithms acceleration of pattern-matching in multi-core architectures
Algorithms and Data Structures for Interactive Ray Tracing on Commodity Hardware
Algorithms and Heuristics for Scalable Betweenness Centrality Computation on Multi-GPU Systems
Algorithms for Compression on GPUs
Algorithms for Large-Scale Power Delivery Network Analysis on Massively Parallel Architectures
Algorithms for manipulating large geometric data
Algorithms for Rapid Characterization and Optimization of Aperture and Reflector Antennas
Algorithms for representation of 3D regions in radiotherapy planning software
Algorithms for Solving Non-Stationary Heat Conduction Problem for Design of a Technical Device
Algorithms for the mapping of genome sequences in GPGPU
ALICE HLT High Speed Tracking on GPU
Alignator: A GPU powered software package for robust fiducial-less alignment of cryo tilt-series
Alignment invariant image comparison implemented on the GPU
All You Need Is Binary Search! A Practical View on Lightweight Database Indexing on GPUs
All-pairs Shortest Path Algorithm based on MPI+CUDA Distributed Parallel Programming Model
All-Pairs Shortest Path Algorithms Using CUDA
All-pairs shortest-paths for large graphs on the GPU
Alpaka - An Abstraction Library for Parallel Kernel Acceleration
Alpha-Beta Divergences Discover Micro and Macro Structures in Data
ALPINIST: An Annotation-Aware GPU Program Optimizer
ALPyNA: Acceleration of Loops in Python for Novel Architectures
Alternating Maximization: Unifying Framework for 8 Sparse PCA Formulations and Efficient Parallel Codes
Ambient Occlusion and Edge Cueing for Enhancing Real Time Molecular Visualization
Ambient occlusion volumes
AMD MI300X GPU Performance Analysis
Ameliorating Memory Contention of OLAP operators on GPU Processors
American Basket Option Pricing on a multi GPU Cluster
American Options Based on Malliavin Calculus and Nonparametric Variance Reduction Methods
American Options Pricing on Multi-core Graphic Cards
AMGCL - A C++ library for efficient solution of large sparse linear systems
AMGCL: an Efficient, Flexible, and Extensible Algebraic Multigrid Implementation
An 8.6 mW 25 Mvertices/s 400-MFLOPS 800-MOPS 8.91 mm² Multimedia Stream Processor Core for Mobile Applications
An 80-Fold Speedup, 15.0 TFlops Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code
An abstract object oriented runtime system for heterogeneous parallel architecture
An Accelerated 3D Navier-Stokes Solver for Flows in Turbomachines
An Accelerated IHS Transform Fusion of Remote Sensing Image Data Based on GPU
An acceleration of the algorithm for the nurse rerostering problem on a graphics processing unit
An Accelerator based on the rho-VEX Processor: an Exploration using OpenCL
An adaptative game loop architecture with automatic distribution of tasks between CPU and GPU
An Adaptative Multi-GPU based Branch-and-Bound. A Case Study: the Flow-Shop Scheduling Problem
An adaptive Expectation-Maximization algorithm with GPU implementation for electron cryomicroscopy
An Adaptive Framework for Managing Heterogeneous Many-Core Clusters
An adaptive framework for visualizing unstructured grids with time-varying scalar fields
An Adaptive Hybrid Multiprocessor technique for bioinformatics sequence alignment
An Adaptive Multi-Spline Refinement Algorithm in Simulation Based Sailboat Trajectory Optimization Using Onboard Multi-Core Computer Systems
An Adaptive Multiresolution Mesh Representation for CPU-GPU Coupled Computation
An adaptive octree textures painting algorithm
An adaptive performance modeling tool for GPU architectures
An Adaptive Step Size GPU ODE Solver for Simulating the Electric Cardiac Activity
An algebraic parallel treecode in arbitrary dimensions
An Algorithm for Detecting Cycles in Undirected Graphs using CUDA Technology
An algorithm for efficient computation of spatial impulse response on the GPU with application in ultrasound simulation
An Algorithm for Fast Edit Distance Computation on GPUs
An algorithm-architecture co-design framework for gridding reconstruction using FPGAs
An algorithmic incremental and iterative development method to parallelize dusty-deck FORTRAN HPC codes in GPGPUs using CUDA
An Analysis of Conventional and Heterogeneous Workloads on Production Supercomputing Resources
An Analysis of OpenACC Programming Model: Image Processing Algorithms as a Case Study
An Analysis of Programmer Productivity versus Performance for High Level Data Parallel Programming
An Analysis of Variation Between Cores For Intel Xeon Phi Knights Corner And Xeon Phi Knights Landing
An Analytical Approach of Mars Rovers by Using GPU Technology and Genetic Algorithm
An Analytical Approach to the Design of Parallel Block Cipher Encryption/Decryption: A CPU/GPU Case Study
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness
An application of graphical numerical accelerators in simulations of ion-transport through biological membranes
An Approach for Maximizing Performance on Heterogeneous Clusters of CPU and GPU
An approach for the effective utilization of GP-GPUs in parallel combined simulation
An Approach for Traffic Forecast with GPU Computing & Cellular Automata Model
An approach of tool paths generation for CNC machining based on CUDA
An Approach to Efficient FEM Simulations on Graphics Processing Units Using CUDA
An approach to performance portability through generic programming
An Architectural Journey into RISC Architectures for HPC Workloads
An architecture design of GPU-accelerated VoD streaming servers with network coding
An Architecture for Distributed Behavioral Models with GPUs
An architecture for real time fluid simulation using multiple GPUs
An asymmetric distributed shared memory model for heterogeneous parallel systems
An Asynchronous Dataflow-Driven Execution Model For Distributed Accelerator Computing
An Asynchronous Event Communication Technique for Soft Real-Time GPGPU Applications
An Auto-Programming Approach to Vulkan
An Auto-tuned Method for Solving Large Tridiagonal Systems on the GPU
An auto-tuning framework for parallel multicore stencil computations
An Auto-tuning Solution to Data Streams Clustering in OpenCL
An Automated Approach for SIMD Kernel Generation for GPU based Software Acceleration
An Automated Tool for Converting Directive Based C Code Into Parallel CUDA Code
An Automated Video Surveillance System Using Viewpoint Feature Histogram and CUDA-enabled GPUs
An Automatic Host and Device Memory Allocation Method for OpenMPC
An Automatic Input-Sensitive Approach for Heterogeneous Task Partitioning
An Automatic OpenCL Compute Kernel Generator for Basic Linear Algebra Operations
An Automatic Speech Recognition Application Framework for Highly Parallel Implementations on the GPU
An Autonomous Data Language
An Autotuning Framework for Intel Xeon Phi Platforms
An effective GPU implementation of breadth-first search
An Effective Model of CPU/GPU Collaborative Computing in GPU Clusters
An Efficient Acceleration of Digital Forensics Search Using GPGPU
An Efficient Acceleration of Symmetric Key Cryptography Using General Purpose Graphics Processing Unit
An Efficient Approach for Generating Pencil Filter and Its Implementation on GPU
An Efficient Block Cipher Implementation on Many-Core Graphics Processing Units
An Efficient Cell List Implementation for Monte Carlo Simulation on GPUs
An Efficient Common Substrings Algorithm for On-the-Fly Behavior-Based Malware Detection and Analysis
An Efficient Deterministic Parallel Algorithm for Adaptive Multidimensional Numerical Integration on GPUs
An Efficient Dispatcher for Large Scale Graph Processing on OpenCL-based FPGAs
An Efficient Fine-grained Parallel Genetic Algorithm Based on GPU-Accelerated
An efficient GPU acceptance-rejection algorithm for the selection of the next reaction to occur for Stochastic Simulation Algorithms
An efficient GPU algorithm for tetrahedron-based Brillouin-zone integration
An Efficient GPU Implementation of Modified Discrete Cosine Transform Using CUDA
An efficient GPU implementation of the revised simplex method
An efficient GPU-based approach for interactive global illumination
An efficient GPU-based time domain solver for the acoustic wave equation
An Efficient Hardware Accelerator for Structured Sparse Convolutional Neural Networks on FPGAs
An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU
An Efficient Implementation of Double Precision 1-D FFT for GPUs Using CUDA
An Efficient Implementation of GPU Virtualization in High Performance Clusters
An efficient implementation of Smith Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases
An Efficient Implementation of the Longest Common Subsequence Algorithm with Bit-Parallelism on GPUs
An efficient KNN algorithm implemented on FPGA based heterogeneous computing system using OpenCL
An Efficient Load Balancing Method for Tree Algorithms
An efficient midpoint-radius representation format to deal with symmetric fuzzy numbers
An efficient mixed-precision, hybrid CPU-GPU implementation of a fully implicit particle-in-cell algorithm
An efficient MPI/OpenMP parallelization of the Hartree-Fock method for the second generation of Intel Xeon Phi processor
An Efficient Multiway Mergesort for GPU Architectures
An efficient numerical method for solving the Boltzmann equation in multidimensions
An efficient out-of-core volume rendering method based on ray casting and GPU acceleration
An efficient parallel algorithm for accelerating computational protein design
An Efficient Parallel Algorithm for Graph Isomorphism on GPU using CUDA
An Efficient Parallel Data Clustering Algorithm Using Isoperimetric Number of Trees
An Efficient Parallel GPU Evaluation of Small Angle X-Ray Scattering Profiles
An Efficient Parallel ISODATA Algorithm Based on Kepler GPUs
An Efficient Parallel Motion Estimation Algorithm and X264 Parallelization in CUDA
An Efficient SAR Processor Based on GPU via CUDA
An efficient scheduling scheme using estimated execution time for heterogeneous computing systems
An Efficient Signal Processor of Synthetic Aperture Radar Based on GPU
An Efficient Simulation Environment for Modeling Large-Scale Cortical Processing
An efficient solution for hazardous geophysical flows simulation using GPUs
An efficient stochastic approach to groupwise non-rigid image registration
An Efficient Stream Buffer Mechanism for Dataflow Execution on Heterogeneous Platforms with GPUs
An Efficient Work-Distribution Strategy for Gridding Radio-Telescope Data on GPUs
An Efficient WSN Simulator for GPU-Based Node Performance
An Efficient, Automatic Approach to High Performance Heterogeneous Computing
An efficient, model-based CPU-GPU heterogeneous FFT library
An Embedded Stream Processor Core Based on Logarithmic Arithmetic for a Low-Power 3-D Graphics SoC
An Embedding Method for Interactive Simulation on Dynamic Surfaces
An emotionally biased ant colony algorithm for pathfinding in games
An Empirical Performance Evaluation of GPU-Enabled Graph-Processing Systems
An Empirical Study of Intel Xeon Phi
An Empirically Guided Optimization Framework for FPGA OpenCL
An Empirically Optimized Radix Sort for GPU
An End-to-End Programming Model for AI Engine Architectures
An End-to-End System for Unconstrained Face Verification with Deep Convolutional Neural Networks
An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition
An Energy Consumption Model for GPU Computing at Instruction Level
An Energy Efficient GPGPU Memory Hierarchy with Tiny Incoherent Caches
An energy model for graphics processing units
An Energy Optimization of a GPU Application by Grid Design Space Exploration
An Energy-Efficient Heterogeneous System for Embedded Learning and Classification
An Environment to Support GPU and Multicore Programming for Rapid, High Performance, Application Deployment
An EoS-meter of QCD transition from deep learning
An error correction solver for linear systems: Evaluation of mixed precision implementations
An Evaluation of Directive-Based Parallelization on the GPU Using a Parboil Benchmark
An evaluation of GPU acceleration for sparse reconstruction
An Evaluation of the GAMA/StarPU Frameworks for Heterogeneous Platforms: the Progressive Photon Mapping Algorithm
An Evaluative Comparison of Performance Portability across GPU Programming Models
An events based algorithm for distributing concurrent tasks on multi-core architectures
An Evolutionary Approach to Parallel Computing Using GPU
An Evolutionary Optimization Strategy Using Graphics Processing Units to Efficiently Investigate Gene-Gene Interactions in Genetic Association Studies
An Execution Model and Runtime For Heterogeneous Many-Core Systems
An execution model for adaptive load-balancing on multicore and multi-GPU systems
An Execution Model for OpenCL 2.0
An Experiment in Parallelizing the Fast Fourier Transform
An experimental approach to performance measurement of heterogeneous parallel applications using CUDA
An Experimental Distributed Visualization System for Petascale Computing
An experimental study of group-by and aggregation on CPU-GPU processors
An Experimental Study of SYCL Task Graph Parallelism for Large-Scale Machine Learning Workloads
An experimental study on performance portability of OpenCL kernels
An Explicit Algorithm for Porous Media Flow Simulation using GPUs
An exploration of CUDA and CBEA for a gravitational wave data-analysis application (Einstein@Home)
An exploration of CUDA and CBEA for a gravitational wave source-modelling application
An Exploration of OpenCL for a Numerical Relativity Application
An Exploration of OpenCL on Multiple Hardware Platforms for a Numerical Relativity Application
An Exploratory Study of High Performance Graphics Application Programming Interfaces
An extended GPU radiosity solver
An Extensible Component-based Approach to Simulation Systems on Heterogeneous Clusters
An Extensible Framework for Composing Stencils with Common Scientific Computing Patterns
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs
An FPGA Accelerator for Molecular Dynamics Simulation Using OpenCL
An FPGA Implementation of Information Theoretic Visual-Saliency System and Its Optimization
An FPGA-based processing pipeline for high definition stereo video
An FPGA-based Torus Communication Network
An FPGA-specific algorithm for direct generation of multi-variate Gaussian random numbers
An hardware architecture for 3D object tracking and motion estimation
An HPC Benchmark Survey and Taxonomy for Characterization
An hybrid AES-256-GCM implementation for NEON CPU & CUDA GPU
An Image Enhancing Pattern-based Sparsity for Real-time Inference on Mobile Devices
An image-warping VR-architecture: design, implementation and applications
An implementation and its evaluation of password cracking tool parallelized on GPGPU
An implementation for quad-tree based solid object coloring using CUDA
An implementation of a reordering approach for increasing the product of diagonal entries in a sparse matrix
An Implementation of Coincidence Algorithm on Graphic Processing Units
An Implementation of Conflict-Free Offline Permutation on the GPU
An Implementation of Differential Evolution for Independent Tasks Scheduling on GPU
An implementation of level set based topology optimization using GPU
An Implementation of Real-Time Phased Array Radar Fundamental Functions on a DSP-Focused, High-Performance, Embedded Computing Platform
An implementation of tensor product patch smoothers on GPU
An Implementation of the Discontinuous Galerkin Method on Graphics Processing Units
An Implementation of the Smooth Particle Mesh Ewald Method on GPU Hardware
An implementation of the tile QR factorization for a GPU and multiple CPUs
An implicit multigrid solver for high-order compressible flow simulations on GPUs
An implicit Tensor-Mass solver on the GPU for soft bodies simulation
An Improved CUDA-Based Implementation of Differential Evolution on GPU
An Improved Image Segmentation Algorithm Based on GPU Parallel Computing
An improved implementation of Preconditioned Conjugate Gradient Method on GPU
An Improved Magma Gemm For Fermi Graphics Processing Units
An Improved Monte Carlo Ray Tracing for Large-Scale Rendering in Hadoop
An Improved Parallel Algorithm using GPU for Siting Observers on Terrain
An improved parallel contrast-aware halftoning
An Improved Parallel Implementation of 3D DRIE Simulation on GPU
An improved scheme of an interactive finite element model for 3D soft-tissue cutting and deformation
An Improved Study of Physically Based Fluid Simulation on GPU
An improved study of real-time fluid simulation on GPU
An improved visual inspection system using visual servo
An in-depth performance analysis of irregular workloads on VLIW APU
An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures
An Incompressible Navier-Stokes Equations Solver on the GPU Using CUDA
An initial performance review of software components for a heterogeneous computing platform
An innovative compilation tool-chain for embedded multi-core architectures
An instruction-systolic programmable shader architecture for multi-threaded 3D graphics processing
An Integrated Framework for Feature Extraction, Object Recognition and Stereo Vision with GPU support
An integrated GPU power and performance model
An intelligent semi-automatic application porting system for application accelerators
An intelligent system for accelerating parallel SVM classification problems on large datasets using GPU
An Interest Point Based Illumination Condition Matching Approach to Photometric Registration Within Augmented Reality Worlds
An Interface for Halo Exchange Pattern
An Intermediate Library for Multi-GPUs Computing Skeletons
An Interrupt-Driven Work-Sharing For-Loop Scheduler
An Introduction to GPU Accelerated Surgical Simulation
An Introduction to High Performance Computing on AWS
An Introduction to OpenCL C++
An Introduction to the OpenCL Programming Model
An introductory tour of interactive rendering
An Investigation into Concurrent Expectation Propagation
An Investigation of Atomic Synchronization for Sort-Based Group-By Aggregation on GPUs
An investigation of GPU-based stiff chemical kinetics integration methods
An Investigation of the Performance Portability of OpenCL
An Investigation of Unified Memory Access Performance in CUDA
An MDE Approach for Automatic Code Generation from MARTE to OpenCL
An MLIR pipeline for offloading Fortran to FPGAs via OpenMP
An MPI-Based Python Framework for Distributed Training with Keras
An MPI-CUDA Implementation and Optimization for Parallel Sparse Equations and Least Squares (LSQR)
An MPI-CUDA Implementation for Massively Parallel Incompressible Flow Computations on Multi-GPU Clusters
An MPI-CUDA Implementation for the Compression of DEM
An MPI-CUDA implementation of an improved Roe method for two-layer shallow water systems
An N log N Parallel Fast Direct Solver for Kernel Matrices
An octree-based proxy for collision detection in large-scale particle systems
An On-Demand Fast Parallel Pseudo Random Number Generator with Applications
An open framework for rapid prototyping of signal processing applications
An open source finite-difference time-domain solver for room acoustics using graphics processing units
An open source MATLAB program for fast numerical Feynman integral calculations for open quantum system dynamics on GPUs
An Open-source FPGA Library for Data Sorting
An Open-Source GPU-Accelerated Feature Extraction Tool
An OpenCL 3D FFT for Molecular Dynamics Simulations on Multiple FPGAs
An OpenCL design of the Bob Jenkins lookup3 hash function using the Xilinx SDAccel Development Environment
An OpenCL Fast Fourier Transformation
An OpenCL framework for heterogeneous multicores with local memory
An OpenCL implementation for the solution of TDSE on GPU and CPU architectures
An OpenCL implementation of a forward sampling algorithm for CP-logic
An OpenCL Method of Parallel Sorting Algorithms for GPU Architecture
An OpenCL Runtime and Scheduler for Embedded Multicore DSP Parallel Systems
An OpenCL-Based FPGA Accelerator for Faster R-CNN
An OpenCL-based Implementation of H.264 Encoder
An OpenCL-based Monte Carlo dose calculation engine (oclMC) for coupled photon-electron transport
An OpenCL(TM) Deep Learning Accelerator on Arria 10
An OpenMP Programming Environment on Mobile Devices
An optimal k-exclusion real-time locking protocol motivated by multi-GPU systems
An Optimal Offline Permutation Algorithm on the Hierarchical Memory Machine, with the GPU implementation
An optimised multi-baseline approach for on-line MR-temperature monitoring on commodity graphics hardware
An optimised radial basis function algorithm for fast non-rigid registration of medical images
An Optimization for Fast Generation of Digital Hologram
An optimized algorithm for discrete element system analysis using CUDA
An optimized GPU implementation of a 2D free surface simulation model on unstructured meshes
An Optimized GPU Memory Hierarchy Design for an OpenCL Kernel
An Optimized Large-Scale Hybrid DGEMM Design for CPUs and ATI GPUs
An Optimized Multiple Right-Hand Side Dslash Kernel for Intel Xeon Phi
An Optimized Parallel IDCT on Graphics Processing Units
An optimizing multi-platform source-to-source compiler framework for the NEURON MODeling Language
An Out-of-core GPU Approach for Accelerating Geostatistical Interpolation
An Overview of Miscellaneous Applications of GPU Computing
An Overview of Selected Hybrid and Reconfigurable Architectures
An overview of techniques for predicting the performance of GPU accelerated applications
An Overview on the Latest Nature-Inspired and Metaheuristics-Based Image Registration Algorithms
An Ultra-Fast, Optimized and Massively-Parallelized Curvelet Transform Algorithm on GP-GPUs
An Ultrafast Scalable Many-core Motif Discovery Algorithm for Multiple GPUs
An ultrasonic imaging system based on a new SAFT approach and a GPU beamformer
An unsupervised parallel genetic cluster algorithm for graphics processing units
Analysing Astronomy Algorithms for GPUs and Beyond
Analysing the Performance of GPU Hash Tables for State Space Exploration
Analysis & Design of Efficient Cryptographic Systems
Analysis Acceleration in TMVA for the ATLAS Experiment at CERN using GPU Computing
Analysis and Comparison of Performance and Power Consumption of Neural Networks on CPU, GPU, TPU and FPGA
Analysis and implementation of a BLAST-Like algorithm for MIC architectures
Analysis and Implementation of eSTREAM and SHA-3 Cryptographic Algorithms
Analysis and Modeling of the Timing Behavior of GPU Architectures
Analysis and optimization of power consumption in the iterative solution of sparse linear systems on multi-core and many-core platforms
Analysis and Optimization Techniques for Massively Parallel Processors
Analysis and Parameter Prediction of Compiler Transformation for Graphics Processors
Analysis and performance estimation of the conjugate gradient method on multiple GPUs
Analysis and Review of Sorting Algorithms
Analysis of 3-dimensional electromagnetic fields in dispersive media using CUDA
Analysis of a Computational Biology Simulation Technique on Emerging Processing Architectures
Analysis of A Splitting Approach for the Parallel Solution of Linear Systems on GPU Cards
Analysis of Genetic Expression with Microarrays using GPU Implemented Algorithms
Analysis of GPGPU Platforms Efficiency in General-Purpose Computations
Analysis of GPU accelerated OpenCL applications on the Intel HD 4600 GPU
Analysis of GPU Parallel Computing based on Matlab
Analysis of GPU-based convolution for acoustic wave propagation modeling with finite differences: Fortran to CUDA-C step-by-step
Analysis of High Level implementations for Recursive Methods on GPUs
Analysis of illumination conditions at the lunar south pole using parallel computing techniques
Analysis of KECCAK Tree Hashing on GPU Architectures
Analysis of Metallic Nanostructures by a Discontinuous Galerkin Time-Domain Maxwell Solver on Graphics Processing Units
Analysis of Multicore CPU and GPU Toward Parallelization of Total Focusing Method Ultrasound Reconstruction
Analysis of Parallel Montgomery Multiplication in CUDA
Analysis of Parallel Sorting Algorithms on Heterogeneous Processors with OpenCL
Analysis of periodic anisotropic media by means of split-field FDTD method and GPU computing
Analysis of periodic structures with GPU accelerating
Analysis of Real-Time Stereo Vision Algorithms On GPU
Analysis of RSA algorithm using GPU programming
Analysis of Single Phase Fluid Flow and Heat Transfer in Slip Flow Regime by Parallel Implementation of Lattice Boltzmann Method on GPUs
Analysis of SuperLU Solvers on Intel MIC Architecture
Analysis of Surface Folding Patterns of DICCCOLS Using the GPU-Optimized Geodesic Field Estimate
Analysis of the Performance of the Fish School Search Algorithm Running in Graphic Processing Units
Analysis-Driven Design of Parallel Floating-Point Matrix Multiplication for Implementation in Reconfigurable Logic
Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs
Analytic Anti-Aliasing of Linear Functions on Polytopes
Analytic Antialiasing for Selective High Fidelity Rendering
Analytic Visibility on the GPU
Analytical motion blur rasterization with compression
Analytical Performance Estimation during Code Generation on Modern GPUs
Analytical Study of Various High Performance Computing Paradigms
Analyzing and Improving the Performance of Spatial Database Processing
Analyzing CUDA workloads using a detailed GPU simulator
Analyzing CUDA's Compiler through the Visualization of Decoded GPU Binaries
Analyzing GPU Performance in Virtualized Environments: A Case Study
Analyzing GPU Tensor Core Potential for Fast Reductions
Analyzing Locality of Memory References in GPU Architectures
Analyzing Memory Accesses for Performance and Correctness of Parallel Programs
Analyzing Modern NVIDIA GPU cores
Analyzing Optimization Techniques for Power Efficiency on Heterogeneous Platforms
Analyzing Password Strength and Efficient Password Cracking
Analyzing program flow within a many-kernel OpenCL application
Analyzing Resource Utilization in an HPC System: A Case Study of NERSC’s Perlmutter
Analyzing Soft-Error Vulnerability on GPGPU Microarchitecture
Analyzing the CUDA Applications with its Latency and Bandwidth Tolerance
Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study
Analyzing the Performance Portability of SYCL across CPUs, GPUs, and Hybrid Systems with Protein Database Search
Analyzing throughput of GPGPUs exploiting within-die core-to-core frequency variation
Analyzing Use of OpenCL on the Cell Broadband Engine and a Proposal for OpenCL Extensions
Anatomizing Deep Learning Inference in Web Browsers
Anatomy Of High-Performance Deep Learning Convolutions On SIMD Architectures
Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs
Anatomy of High-Performance Many-Threaded Matrix Multiplication
Android Malware Classification Using Parallelized Machine Learning Methods
ANGHABENCH: a Suite with One Million Compilable C Benchmarks for Code-Size Reduction
Animating physically based explosions in real-time
Animation of Orthogonal Texture Patterns for Vector Field Visualization
Anisotropic interfacial tension, contact angles, and line tensions: A graphics-processing-unit-based Monte Carlo study of the Ising model
Anisotropic Kuwahara Filtering on the GPU
Anisotropic mesh coarsening and refinement on GPU architecture
Anisotropic noise
AnnotationGym: A Generic Framework for Automatic Source Code Annotation
Anomalous behaviour detection using spatiotemporal oriented energies, subset inclusion histogram comparison and event-driven processing
Anomalous metastability in a temperature-driven transition
Anomalous Structure and Scaling of Ring Polymer Brushes
Anonymized Network Sensing using C++26 std::execution on GPUs
Ansor: Generating High-Performance Tensor Programs for Deep Learning
Anti-parallel Patterns in Fine-grain Data-parallel Programs
ANTS2 package: simulation and experimental data processing for Anger camera type detectors
AnyHLS: High-Level Synthesis with Partial Evaluation
AnySeq/GPU: A Novel Approach for Faster Sequence Alignment on GPUs
AnySL: efficient and portable shading for ray tracing
Anytime Algorithms for GPU Architectures
APACE: AlphaFold2 and advanced computing as a service for accelerated discovery in biophysics
APEnet+: a 3D toroidal network enabling Petaflops scale Lattice QCD simulations on commodity clusters
APEnet+: high bandwidth 3D torus direct network for petaflops scale commodity clusters
APHOG: A Framework for Fast Object Detection Using Histograms of Oriented Gradients
API-Compiling for Image Hardware Accelerators
APL on GPUs: A TAIL from the Past, Scribbled in Futhark
APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores
APOGEE: adaptive prefetching on GPUs for energy efficiency
Apple Silicon Performance in Scientific Computing
Applicability of GPU Computing for Efficient Merge in In-Memory Databases
Application level energy measurements and models for hybrid platform with accelerators
Application of Assembly of Finite Element Methods on Graphics Processors for Real-Time Elastodynamics
Application of Deep-Learning to Compiler-Based Graphs
Application of GPGPU for Acceleration of Short DNA Sequence Alignment in Unipro UGENE Project
Application of GPU Computing to Some Urban Traffic Problems
Application of GPU Smooth Particle Hydrodynamics: Wave Runup and Overtopping on Composite Slopes
Application of GPUs for the Calculation of Two Point Correlation Functions in Cosmology
Application of Graphics Processing Units to Search Pipeline for Gravitational Waves from Coalescing Binaries of Compact Objects
Application of performance portability solutions for GPUs and many-core CPUs to track reconstruction kernels
Application of the Characteristic Basis Function Method using CUDA
Application of the Mean Field Methods to MRF Optimization in Computer Vision
Application of the OpenCL API for Implementation of the NIPALS Algorithm for Principal Component Analysis of Large Data Sets
Application Performance Profiling on Intel GPUs with Oneprof and Onetrace
Application Synthesis and Optimization on Heterogeneous Parallel Processing Systems
Application-guided tool development for architecturally diverse computation
Application-independent accurate mouse placements on surfaces of arbitrary geometry
Applications of Deep Neural Networks
Applications of Linux-Based QT-CUDA Parallel Architecture
Applications of Many-Core Technologies to On-line Event Reconstruction in High Energy Physics Experiments
Applications Performance on GPGPUs with the Fermi Architecture
Applying Contact Angle to a Two-Dimensional Smoothed Particle Hydrodynamics (SPH) model on a Graphics Processing Unit (GPU) Platform
Applying Genetic Algorithms to Tune Heterogeneous Platform Configurations
Applying GPU Dynamic Parallelism to High-Performance Normalization of Gene Expressions
Applying graphics processor acceleration in a software defined radio prototyping environment
Applying Object Oriented Design Patterns to CUDA based Pyramidal Image Blending - An Experience
Applying OOC Techniques in the Reduction to Condensed Form for Very Large Symmetric Eigenproblems on GPUs
Applying software-managed caching and CPU/GPU task scheduling for accelerating dynamic workloads
Applying Source Level Auto-Vectorization to Aparapi Java
Applying the “Simple Accelerator Modelling in MATLAB” (SAMM) Code to High Luminosity LHC Upgrade
Applying the Midas Touch of Reproducibility to High-Performance Computing
Applying the Parallel GPU Model to Radiation Therapy Treatment
Approaches for parallelizing reductions on modern GPUs
Approaches for the Parallelization of Software Implementation of Integer Multiplication
Approximate Belief Propagation by Hierarchical Averaging of Outgoing Messages
Approximate Dynamic Programming and Neural Networks on Game Hardware
Approximate dynamic programming with post-decision states as a solution method for dynamic economic models
Approximate Principal Direction Trees
Approximate Similarity Search for Online Multimedia Services on Distributed CPU-GPU Platforms
Approximate Subdivision Surface Evaluation in the Language of Linear Algebra
Approximation of BEM matrices using GPGPUs
Approximation of Loop Subdivision Surfaces for Fast Rendering
Approximative inference for multivariate functional data on massively parallel processors
APPy: Annotated Parallelism for Python on GPUs
APTCC: Auto Parallelizing Translator From C To CUDA
APUNet: Revitalizing GPU as Packet Processing Accelerator
AQsort: Scalable Multi-Array In-Place Sorting with OpenMP
AQUAgpusph, a free 3D SPH solver accelerated with OpenCL
Aquila 2.0: Software Architecture for Cognitive Robotics
Aquila: An Open-Source GPU-Accelerated Toolkit for Cognitive Robotics Research
Arax: a runtime framework for decoupling applications from heterogeneous accelerators
Arbitrarily large iterative tomographic reconstruction on multiple GPUs using the TIGRE toolbox
Arbitrary dimension Reed-Solomon coding and decoding for extended RAID on GPUs
Arbitrary-Precision Arithmetics on the GPU
ArborX: A Performance Portable Search Library
ARC: Adaptive Ray-tracing with CUDA, a New Ray Tracing Code for Parallel GPUs
ArchesWeather: An efficient AI weather forecasting model at 1.5° resolution
Architecting an LTE Base Station with Graphics Processing Units
Architecting graphics processors for non-graphics compute acceleration
Architecting SOT-RAM Based GPU Register File
Architecting Tensor Core-Based Reductions for Irregular Molecular Docking Kernels
Architectural Analysis and Performance Characterization of NVIDIA GPUs using Microbenchmarking
Architectural Comparisons for a Quantum Monte Carlo Application
Architectural Considerations for Compiler-guided Unroll-and-Jam of CUDA Kernels
Architectural Exploration and Scheduling Methods for Coarse Grained Reconfigurable Arrays
Architectural explorations for streaming accelerators with customized memory layouts
Architectural improvements and 28 nm FPGA implementation of the APEnet+ 3D Torus network for hybrid HPC systems
Architectural Principles and Experimentation of Distributed High Performance Virtual Clusters
Architectural Support for the Stream Execution Model on General-Purpose Processors
Architectural Support for Virtual Memory in GPUs
Architecture Comparisons between Nvidia and ATI GPUs: Computation Parallelism and Data Communications
Architecture of the real-time target detection processing in an airborne hyperspectral demonstrator system
Architecture-Adaptive Code Variant Tuning
Architecture- and Workload-Aware Heterogeneous Algorithms for Sparse Matrix Vector Multiplication
Architecture-Aware Algorithms and Software for Peta and Exascale Computing
Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study
Architecture-Aware Mapping and Optimization on a 1600-Core GPU
Architecture-Aware Mapping and Optimization on Heterogeneous Computing Systems
Architecture-Aware Optimization on a 1600-core Graphics Processor
Architecture-Aware Optimization Targeting Multithreaded Stream Computing
Architecture-based Performance Evaluation of Genetic Algorithms on Multi/Many-core Systems
Architecture, Design, and Experimental Evaluation of a Lightfield Descriptor Depth Buffer Algorithm on Reconfigurable Logic and on a GPU
Are Very Deep Neural Networks Feasible on Mobile Devices?
ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants
Arioc: high-throughput read alignment with GPU-accelerated exploration of the seed-and-extend search space
Aristotle: A Performance Impact Indicator for the OpenCL Kernels Using Local Memory
ARK: GPU-driven Code Execution for Distributed Deep Learning
ARKCoS: Artifact-Suppressed Accelerated Radial Kernel Convolution on the Sphere
Array Languages Make Neural Networks Fast
Array Program Transformation with Loo.py by Example: High-Order Finite Elements
Array-Oriented Languages and Polyhedral Compilation
ART vs. NDK vs. GPU acceleration: A study of performance of image processing algorithms on Android
Articulated object tracking by rendering consistent appearance parts
Artifact-Free Decompression and Zooming of JPEG Compressed Images with Total Generalized Variation
Artifact-Free JPEG Decompression with Total Generalized Variation
Artificial Intelligence in Electric Machine Drives: Advances and Trends
Artificial neural network computation on graphic process unit
Artificial Neural Network Simulation on CUDA
ARVO-CL: The OpenCL version of the ARVO package - An efficient tool for computing the accessible surface area and the excluded volume of proteins via analytical equations
ASAMgpu V1.0 - a moist fully compressible atmospheric model using graphics processing units (GPUs)
Aspect-Driven Mixed-Precision Tuning Targeting GPUs
Aspects of GPU for general purpose high performance computing
Assembling large mosaics of electron microscope images using GPU
Assembly of finite element methods on graphics processors
Assembly-Free Large-Scale Modal Analysis on the GPU
Assembly-Free Structural Dynamics On CPU and GPU
Assessing Accelerator-Based HPC Reverse Time Migration
Assessing Application Efficiency and Performance Portability in Single-Source Programming for Heterogeneous Parallel Systems
Assessing Intel OneAPI capabilities and cloud-performance for heterogeneous computing
Assessing Opportunities of SYCL and Intel oneAPI for Biological Sequence Alignment
Assessing opportunities of SYCL for biological sequence alignment on GPU-based systems
Assessing the feasibility of OpenCL CPU implementations for agent-based simulations
Assessing the hardness of SVP algorithms in the presence of CPUs and GPUs
Assessing the Impact of Compiler Optimizations on GPUs Reliability
Assessing the Performance-Energy Balance of Graphics Processors for Spectral Unmixing
Assessment of GPU computational enhancement to a 2D flood model
Assessment of various GPU acceleration strategies in text categorization processing flow
Astronomical Photometric Data Reduction Using GPGPU
Astrophysical data mining with GPU. A case study: genetic classification of globular clusters
Astrophysical Particle Simulations on Heterogeneous CPU-GPU Systems
Astrophysical Particle Simulations with Custom GPU Clusters
Astrophysical particle simulations with large custom GPU clusters on three continents
Astrophysical Supercomputing with GPUs: Critical Decisions for Early Adopters
Astrophysical-oriented Computational multi-Architectural Framework
ASW: Accelerating Smith-Waterman Algorithm on Coupled CPU-GPU Architecture
AsymML: An Asymmetric Decomposition Framework for Privacy-Preserving DNN Training and Inference
Asymptotic Peak Utilisation in Heterogeneous Parallel CPU/GPU Pipelines: A Decentralised Queue Monitoring Strategy
Asynchronous Communication for Finite-Difference Simulations on GPU Clusters using CUDA and MPI
Asynchronous Communication Schemes for Finite Difference Methods on Multiple GPUs
Asynchronous Methods for Deep Reinforcement Learning
Asynchronous OpenCL/MPI numerical simulations of conservation laws
Asynchronous Parallel Computing Algorithm implemented in 1D Heat Equation with CUDA
Asynchronous Parallel Computing Model of Global Motion Estimation with CUDA
Asynchronous Task-Based Polar Decomposition on Single Node Manycore Architectures
Asynchronous-Many-Task Systems: Challenges and Opportunities - Scaling an AMR Astrophysics Code on Exascale machines using Kokkos and HPX
ATI Stream Profiler: a tool to optimize an OpenCL kernel on ATI Radeon GPUs
Atmospheric Chemistry
Atmospheric turbulence removal using convolutional neural network
Atomic-free Irregular Computations on GPUs
Atos: A Task-Parallel GPU Dynamic Scheduling Framework for Dynamic Irregular Computations
Attack Signature Matching using Graphics Processors in High-Performance Intrusion Detection Systems
Attaining system performance points: revisiting the end-to-end argument in system design for heterogeneous many-core systems
Attention-based NMT Models as Feature Functions in Phrase-based SMT
ATTILA: a cycle-level execution-driven simulator for modern GPU architectures
Audiovisual Voice Activity Detection and Localization of Simultaneous Speech Sources
Augmented reality live-action compositing
Augmented reality usage for prototyping speed up
Augmenting LLM Code Translation with Compiler Analysis for C to Triton Kernel Generation
Augmenting Operating Systems With the GPU
Augur: a Modeling Language for Data-Parallel Probabilistic Inference
Aurally and visually enhanced audio search with soundtorch
AUTO-GC: Automatic translation of data mining applications to GPU clusters
Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters
Auto-Generation and Auto-Tuning of 3D Stencil Codes on Homogeneous and Heterogeneous GPU Clusters
Auto-Generation of Parallel Finite-Differencing Code for MPI, TBB and CUDA
Auto-optimization of a Feature Selection Algorithm
Auto-SpMV: Automated Optimizing SpMV Kernels on GPU
Auto-tunable GPU BLAS
Auto-tunable GPU BLAS (thesis)
Auto-tuned OpenCL kernel co-execution in OmpSs for heterogeneous systems
Auto-tuning 3-D FFT library for CUDA GPUs
Auto-tuning a High-Level Language Targeted to GPU Codes
Auto-tuning a LOFAR radio astronomy pipeline in JavaCL
Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs
Auto-Tuning Dedispersion for Many-Core Accelerators
Auto-tuning Dense Matrix Multiplication for GPGPU with Cache
Auto-tuning Dense Vector and Matrix-Vector Operations for Fermi GPUs
Auto-tuning Hybrid CPU-GPU Execution of Algorithmic Skeletons in SkePU
Auto-tuning interactive ray tracing using an analytical GPU architecture model
Auto-tuning of fast Fourier transform on graphics processors
Auto-Tuning of Level 1 and Level 2 BLAS for GPUs
Auto-tuning on the macro scale: high level algorithmic auto-tuning for scientific applications
Auto-tuning Shallow water simulations on GPUs
Auto-tuning SkePU: a multi-backend skeleton programming framework for multi-GPU systems
Auto-tuning Streamed Applications on Intel Xeon Phi
Auto-Tunning of Data Communication on Heterogeneous Systems
Auto-Vectorizing a Large-scale Production Unstructured-mesh CFD Application
AutoDDL: Automatic Distributed Deep Learning with Asymptotically Optimal Communication
AutoFreeze: Automatically Freezing Model Blocks to Accelerate Fine-tuning
AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search
AutoMat - Automatic Differentiation for Generalized Standard Materials on GPUs
Automated and interactive approaches for optimal surface finding based segmentation of medical image data
Automated and parallel code generation for finite-differencing stencils with arbitrary data types
Automated Architecture Design for Deep Neural Networks
Automated architecture-aware mapping of streaming applications onto GPUs
Automated Buffer Sizing of Dataflow Applications in a High-Level Synthesis Workflow
Automated C/C++ Program Repair for High-Level Synthesis via Large Language Models
Automated Deep Learning Optimization via DSL-Based Source Code Transformation
Automated development of applications for graphical processing units using rewriting rules
Automated Dynamic Analysis of CUDA Programs
Automated Enhanced Parallelization of Sequential C to Parallel OpenMP
Automated Generation of OpenCL Programs Based on Algebra-Algorithmic Approach
Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications
Automated image alignment for 2D gel electrophoresis in a high-throughput proteomics pipeline
Automated Long-Term Monitoring of Parallel Microfluidic Operations Applying a Machine Vision-Assisted Positioning Method
Automated Partitioning of Data-Parallel Kernels using Polyhedral Compilation
Automated pose estimation in 3D point clouds applying annealing particle filters and inverse kinematics on a GPU
Automated Runtime Analysis and Adaptation for Scalable Heterogeneous Computing
Automated Software Testing of Memory Performance in Embedded GPUs
Automated Techniques for Enabling Efficient MPI Application Migration
Automated test generation for OpenCL kernels using fuzzing and constraint solving
Automated Testing of Graphics Shader Compilers
Automated Tool to Generate Parallel CUDA code from a Serial C Code
Automatic abstraction and fault tolerance in cortical microarchitectures
Automatic acceleration of Numpy applications on GPUs and multicore CPUs
Automatic and Explicit Parallelization Approaches for Mathematical Simulation Models
Automatic and portable mapping of data parallel programs to OpenCL for GPU-based heterogeneous systems
Automatic bi-layer video segmentation based on sensor fusion
Automatic BLAS Offloading on Unified Memory Architecture: A Study on NVIDIA Grace-Hopper
Automatic C-to-CUDA Code Generation for Affine Programs
Automatic classification of object code using machine learning
Automatic Code Generation and Adaptive Grid Scheduling for GPU Cluster Computing
Automatic code generation and tuning for stencil kernels on modern shared memory architectures
Automatic code generation for solvers of cardiac cellular membrane dynamics in GPUs
Automatic Code Generation for Stencil Computations on GPU Architectures
Automatic code generation methods applied to numerical linear algebra in high performance computing
Automatic Code Rewriting for Performance Portability
Automatic Command Queue Scheduling for Task-Parallel Workloads in OpenCL
Automatic Compilation for Heterogeneous Architectures with Single Assignment C
Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors
Automatic Compiler Based FPGA Accelerator for CNN Training
Automatic contention detection and amelioration for data-intensive operations
Automatic CPU-GPU communication management and optimization
Automatic CUDA Code Synthesis Framework for Multicore CPU and GPU architectures
Automatic Data Layout Generation and Kernel Mapping for CPU+GPU Architectures
Automatic Data Layout Optimizations for GPUs
Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories
Automatic Detection and Denoising of Signals in Large Geophysical Datasets
Automatic Discovery of Algorithms for Multi-Agent Systems
Automatic Dynamic Task Distribution between CPU and GPU for Real-Time Systems
Automatic efficient data layout for multithreaded stencil codes on CPUs and GPUs
Automatic fitting of spiking neuron models to electrophysiological recordings
Automatic Fusions of CUDA-GPU Kernels for Parallel Map
Automatic Generation Of Application-Specific Accelerators for FPGAs from Python Loop Nests
Automatic generation of CUDA code performing tensor manipulations using C++ expression templates
Automatic Generation of FFT Libraries for GPU Platforms
Automatic generation of heterogeneous spectrometers for radio astronomy
Automatic Generation of Multicore Chemical Kernels
Automatic Generation of OpenCL Code for ARM Architectures
Automatic Generation of OpenCL Code through Polyhedral Compilation with LLM
Automatic generation of software pipelines for heterogeneous parallel systems
Automatic generation of warp-level primitives and atomic instructions for fast and portable parallel reduction on GPUs
Automatic GPU optimization through higher-order functions in functional languages
Automatic Hepatic Vessel Segmentation Using Graphics Hardware
Automatic Implementation of Evolutionary Algorithms on GPUs using ESDL
Automatic Kernel Generation for Volta Tensor Cores
Automatic library generation for BLAS3 on GPUs
Automatic Loop Partitioning for Heterogeneous Systems
Automatic Mapping for OpenCL-Programs on CPU/GPU Heterogeneous Platforms
Automatic Mapping of Stream Programs on Multicore Architectures
Automatic Multi-Camera Setup Optimization for Optical Tracking
Automatic Multi-GPU Code Generation applied to Simulation of Electrical Machines
Automatic NUMA Characterization using Cbench
Automatic Online Tuning (AutoTune): Fully Extended Analysis
Automatic OpenCL code generation for multi-device heterogeneous architectures
Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design
Automatic OpenCL Task Adaptation for Heterogeneous Architectures
Automatic Optimization of In-Flight Memory Transactions for GPU Accelerators based on a Domain-Specific Language for Medical Imaging
Automatic Optimization of OpenCL-Based Stencil Codes for FPGAs and Its Evaluation
Automatic Optimization of Thread Mapping for a GPGPU Programming Framework
Automatic Parallelization for GPUs
Automatic parallelization for graphics processing units
Automatic Parallelization for Heterogeneous Embedded Systems
Automatic Parallelization of a Gap Model using Java and OpenCL
Automatic Parallelization of Tiled Loop Nests with Enhanced Fine-Grained Parallelism on GPUs
Automatic Parallelization of Tiled Stencil Loop Nests on GPUs
Automatic Parallelization: Executing Sequential Programs on a Task-Based Parallel Runtime
Automatic Performance Optimisation of Parallel Programs for GPUs via Rewrite Rules
Automatic Performance Optimization in ViennaCL for GPUs
Automatic Performance Optimization on Heterogeneous Computer Systems using Manycore Coprocessors
Automatic Performance Tuning of Pipeline Patterns for Heterogeneous Parallel Architectures
Automatic Performance Tuning of Stencil Computations on Graphics Processing Units
Automatic Point Target Detection for Interactive Visual Analysis of SAR Images
Automatic Pose Estimation for Range Images on the GPU
Automatic program analysis for data parallel kernels
Automatic program parallelization for multicore processors
Automatic Resource-Constrained Static Task Parallelization
Automatic run-time mapping of polyhedral computations to heterogeneous devices with memory-size restrictions
Automatic safety proofs for asynchronous memory operations
Automatic Scan Parallelization in OpenMP
Automatic scanning of nuclear emulsions with wide-angle acceptance for nuclear fragment detection
Automatic Scheduling of Compute Kernels Across Heterogeneous Architectures
Automatic Selection of Sparse Matrix Representation on GPUs
Automatic shader level of detail
Automatic SIMD Code Generation
Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification
Automatic Software Synthesis from High-Level ForSyDe Models Targeting Massively Parallel Processors
Automatic source code adaptation for heterogeneous platforms
Automatic Synthesis of Heterogeneous CPU-GPU Embedded Applications from a UML Profile
Automatic Termination Analysis for GPU Kernels
Automatic Test Case Reduction for OpenCL
Automatic test case reduction of randomly generated OpenCL kernels
Automatic transformation and optimization of applications on GPUs and GPU clusters
Automatic Translation of CUDA to OpenCL and Comparison of Performance Optimizations on GPUs
Automatic tuning matrix multiplication performance on graphics hardware
Automatic Tuning of Local Memory Use on GPGPUs
Automatic Virtualization of Accelerators
Automatically Exploiting the Memory Hierarchy of GPUs through Just-in-Time Compilation
Automatically generating and tuning GPU code for sparse matrix-vector multiplication from a high-level representation
Automatically Generating Efficient Simulation Codes on GPUs from Partial Differential Equations
Automatically Harnessing Sparse Acceleration
Automatically Selecting Profitable Thread Block Sizes Using Machine Learning
Automatically translating a general purpose C++ image processing library for GPUs
Automatically Tuned Dense Linear Algebra for Multicore+GPU
Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures
Automating a Labour Performance Measurement and Risk Assessment: An Evaluation of Methods for a Computer Vision based System
Automating elimination of idle functions by run-time reconfiguration
Automating Energy-Efficient GPU Kernel Generation: A Fast Search-Based Compilation Approach
Automating GPU computing in MATLAB
Automating Heterogeneous Parallelism in Numerical Differential Equations
Automating the Last-Mile for High Performance Dense Linear Algebra
AutOMP: An Automatic OpenMP Parallelization Generator for Variable-Oriented High-Performance Scientific Codes
Autonomous heterogeneous catalyst discovery with a self-evolving multi-agent digital twin
AutoParBench: A Unified Test Framework for OpenMP-based Parallelizers
AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning
AutoPhase: Compiler Phase-Ordering for High Level Synthesis with Deep Reinforcement Learning
Autotuning CUDA Compiler Parameters for Heterogeneous Applications using the OpenTuner Framework
Autotuning CUDA: Applying NLP Techniques to LS-CAT
Autotuning for Automatic Parallelization on Heterogeneous Systems
Autotuning GEMMs for Fermi
Autotuning GPU Kernels via Static and Predictive Analysis
Autotuning of Pattern Runtimes for Accelerated Parallel Systems
Autotuning OpenACC Work Distribution via Direct Search
Autotuning OpenCL Workgroup Size for Stencil Patterns
Autotuning Programs with Algorithmic Choice
Autotuning Stencil-Based Computations on GPUs
Autotuning Stencils Codes with Algorithmic Skeletons
Autotuning Tensor Contraction Computations on GPUs
Autotuning Wavefront Abstractions for Heterogeneous Architectures
Autotuning Wavefront Patterns for Heterogeneous Architectures
Autotuning, Code Generation and Optimizing Compiler Technology for GPUs
Auxiliary Image Regularization for Deep CNNs with Noisy Labels
AvA: Accelerated Virtualization of Accelerators
AVEC: Accelerator Virtualization in Cloud-Edge Computing for Deep Learning Libraries
AVSS2011 demo session: GPU enabled Smart Video Node
AVX-512 extension to OpenQCD 1.6
AXC: A new format to perform the SpMV oriented to Intel Xeon Phi architecture in OpenCL
Axel: a heterogeneous cluster with FPGAs and GPUs
AZP: Automatic Specialization for Zero Values in Gaming Applications
b-Bit Minwise Hashing in Practice: Large-Scale Batch and Online Learning and Using GPUs for Fast Preprocessing with Simple Hash Functions
B-CALM: An open-source GPU-based 3D-FDTD with multi-pole dispersion for plasmonics
B-Calm: an Open-Source Multi-Gpu-Based 3D-FDTD with Multi-Pole Dispersion for Plasmonics
Back Ground Subtraction Algorithm For Moving Object Detection In FPGA
Backpropagation Training for Fisher Vectors within Neural Networks
BaCO: A Fast and Portable Bayesian Compiler Optimization Framework
Bacon: A GPU Programming System With Just in Time Specialization
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
Balancing locality and concurrency: solving sparse triangular systems on GPUs
Balancing Tracking Granularity and Parallelism in Many-Task Systems: The Horizons Approach
Bamboo: Automatic Translation of MPI Source into a Latency-Tolerant Form
Bandicoot: A Templated C++ Library for GPU Linear Algebra
Bandicoot: C++ Library for GPU Linear Algebra and Scientific Computing
Bandwidth intensive 3-D FFT kernel for GPUs using CUDA
Bandwidth Reduction Through Multithreaded Compression of Seismic Images
Bandwidth Requirements of GPU Architectures
BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU
Barnes-Hut treecode on GPU
Barra, a Modular Functional GPU Simulator for GPGPU
Barra: A Parallel Functional Simulator for GPGPU
BarraCUDA - a fast short read sequence aligner using graphics processing units
Barrier Invariants: A Shared State Abstraction for the Analysis of Data-Dependent GPU Kernels
Barycentric coordinates computation in homogeneous coordinates
BASEMENT v3: a modular freeware for river process modelling over multiple computational backends
Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts
BAT: A Benchmark suite for AutoTuners
Batch Method for Efficient Resource Sharing in Real-time Multi-GPU Systems
Batch Records Insertion into Multidimensional Linear Dynamic Hashing Table on GPU
Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs
Batched Linear Algebra Problems on GPU Accelerators
Batched Matrix Computations on Hardware Accelerators
Batched Matrix Computations on Hardware Accelerators Based on GPUs
Batched Multi Triangulation
Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression
Batched Shift Reduce Parsing with Lists of Vectors on CUDA
Bayesian Image Restoration Using A Large-scale Total Patch Variation Prior
Bayesian inference for artificial perception using OpenCL on FPGAs and GPUs
Bayesian model comparison via sequential Monte Carlo
Bayesian neural networks for detecting epistasis in genetic association studies
Bayesian Neural Networks for Genetic Association Studies of Complex Disease
Bayesian Neural Networks in Data-Intensive High Energy Physics Applications
Bayesian Optimization for auto-tuning GPU kernels
Bayesian real-time perception algorithms on GPU
Bayesian Sparse Unsupervised Learning for Probit Models of Binary Data
Bayesian Sparsity-Path-Analysis of Genetic Association Signal using Generalized t Priors
Bayesian State-Space Modelling on High-Performance Hardware Using LibBi
BbmTTP: Beat-based Parallel Simulated Annealing Algorithm on GPGPUs for the Mirrored Traveling Tournament Problem
BEAGLE: an Application Programming Interface and High-Performance Computing Library for Statistical Phylogenetics
Beam Dynamics Simulations Using GPUs
Beam Dynamics Simulations with a GPU-accelerated Version of ELEGANT
Beauty And The Beast: Exploiting GPUs In Haskell
Beehive SPIR-V Toolkit: A Composable and Functional API for Runtime SPIR-V Code Generation
Behavioral graph fraud detection in E-commerce
Behavioral Non-portability in Scientific Numeric Computing
Behavioral Spherical Harmonics for Long-Range Agents' Interaction
Belief Propagation by Message Passing in Junction Trees: Computing Each Message Faster Using GPU Parallelization
Belief Propagation on the GPU for Stereo Vision
Believe it or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!
Bempp-cl: A fast Python based just-in-time compiling boundary element library
BenchDirect: A Directed Language Model for Compiler Benchmarks
BenchFriend: Correlating the Performance of GPU Benchmarks
BENCHIP: Benchmarking Intelligence Processors
Benchmarking a Proof-of-Concept Performance Portable SYCL-based Fast Fourier Transformation Library
Benchmarking Across Platforms: European Option Pricing
Benchmarking and Dissecting the Nvidia Hopper GPU Architecture
Benchmarking and Implementation of Probability-Based Simulations on Programmable Graphics Cards
Benchmarking and modelling of POWER7, Westmere, BG/P, and GPUs: an industry case study
Benchmarking and Optimization of Gradient Boosted Decision Tree Algorithms
Benchmarking Data Analysis and Machine Learning Applications on the Intel KNL Many-Core Processor
Benchmarking Deep Learning Models on Jetson TX2
Benchmarking GPU and CPU codes for Heisenberg spin glass overrelaxation
Benchmarking GPU and TPU Performance with Graph Neural Networks
Benchmarking GPU Devices with N-Body Simulations
Benchmarking GPUs to tune dense linear algebra
Benchmarking Harp-DAAL: High Performance Hadoop on KNL Clusters
Benchmarking Intel Xeon Phi to Guide Kernel Design
Benchmarking Modern Edge Devices for AI Applications
Benchmarking Next Generation Hardware Platforms: An Experimental Approach
Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: programming productivity, performance, and energy consumption
Benchmarking optimization algorithms for auto-tuning GPU kernels
Benchmarking Parallel Performance on Many-Core Processors
Benchmarking performance of a hybrid Xeon/Xeon Phi system for parallel computation of similarity measures between large vectors
Benchmarking State-of-the-Art Deep Learning Software Tools
Benchmarking the cost of thread divergence in CUDA
Benchmarking the Intel Xeon Phi Coprocessor
Benchmarking the Memory Hierarchy of Modern GPUs
Benchmarking the Nvidia GPU Lineage: From Early K80 to Modern A100 with Asynchronous Memory Transfers
Benchmarking Thread Block Cluster
Benchmarking TPU, GPU, and CPU Platforms for Deep Learning
Benchmarks Based on Anti-Parallel Patterns for the Evaluation of GPUs
Benchmarks for Intel MIC Architecture
BenchPress: A Deep Active Benchmark Generator
BePilot: An AI Programming Assistant for Compiler Backend Development
Berkeley Dwarfs on CUDA
Best bang for your buck: GPU nodes for GROMACS biomolecular simulations
Best Practice Guide - GPGPU
Best Practice Guide - Intel Xeon Phi
Best Practice Guide Intel Xeon Phi v2.0
Best-effort semantic document search on GPUs
Betatron tune measurement with the LHC damper using a GPU
Better GPU Hash Tables
Better speedups using simpler parallel programming for graph connectivity and biconnectivity
Betweenness Centrality on GPUs and Heterogeneous Architectures
Beyond 16GB: Out-of-Core Stencil Computations
Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising
Beyond Amdahl's Law: An Objective Function That Links Multiprocessor Performance Gains To Delay and Energy
Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation
Beyond Desktop Computation: Challenges in Scaling a GPU Infrastructure
Beyond programmable shading (parts I and II)
Beyond Straightforward Vectorization of Lightweight Data Compression Algorithms for Larger Vector Sizes
BFROST: Binary Features from Robust Orientation Segment Tests accelerated on the GPU
Bi-directional Path Tracing on GPU
Bidimensional Median Filter for Parallel Computing Architectures
BIDMach: Large-scale Learning with Zero Memory Allocation
Bifrost: a Python/C++ Framework for High-Throughput Stream Processing in Astronomy
Big Integer Multiplication with CUDA FFT (cuFFT) Library
Bigger Buffer k-d Trees on Multi-Many-Core Systems
BigKernel — High Performance CPU-GPU Communication Pipelining for Big Data-style Applications
Bilateral Filtering with CUDA
Billion-scale similarity search with GPUs
Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models
Binary Interval Search (BITS): A Scalable Algorithm for Counting Interval Intersections
Binary Interval Search: a scalable algorithm for counting interval intersections
Binary Mesh Partitioning for Cache-Efficient Visualization
Binary Segmentation of Video Sequences in Real Time
BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
Binaural Simulations Using Audio Rate FDTD Schemes and CUDA
Binomial American Option Pricing on CPU-GPU Heterogeneous System
Bio-inspired computer visual system using GPU and Visual Pattern Assessment Language (ViPAL): Application on breast cancer prognosis
Bio-Inspired Optimization of Ultra-Wideband Patch Antennas Using Graphics Processing Unit Acceleration
Bio-sequence database scanning on a GPU
BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics
BioEM: GPU-accelerated computing of Bayesian inference of electron microscopy images
Bioinformatics Sequence Comparisons on Manycore Processors
Biomedical and Clinical English Model Packages in the Stanza Python NLP Library
Biomedical image analysis on a cooperative cluster of GPUs and multicores
Biomolecular electrostatics simulation with a parallel FMM-based BEM, using up to 512 GPUs
Biomolecular electrostatics using a fast multipole BEM on up to 512 GPUs and a billion unknowns
Bit-GraphBLAS: Bit-Level Optimizations of Matrix-Centric Graph Processing on GPU
Bit-level Parallelization of 3DES Encryption on GPU
Bit-Packed Damaged Lattice Potts Model Simulations with CUDA and GPUs
Bit-Parallel Multiple Pattern Matching
Bit-Vectorized GPU Implementation of a Stochastic Cellular Automaton Model for Surface Growth
Bitcoin and The Age of Bespoke Silicon
BitCracker: BitLocker meets GPUs
Bitmap Filter: Speeding up Exact Set Similarity Joins with Bitwise Operations
Bitstream Database-Driven FPGA Programming Flow Based on Standard OpenCL
BlaBla: Linguistic Feature Extraction for Clinical Analysis in Multiple Languages
Black-Box Side-Channel Attacks Highlight the Importance of Countermeasures: An Analysis of the Xilinx Virtex-4 and Virtex-5 Bitstream Encryption Mechanism
BLAS Comparison on FPGA, CPU and GPU
Blasting through lattice calculations using CUDA
BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing
Blind image deconvolution algorithm on NVIDIA CUDA platform
Blink: Fast and Generic Collectives for Distributed ML
Blister: GPU-based rendering of Boolean combinations of free-form triangulated shapes
Block based Singular Value Decomposition approach to matrix factorization for recommender systems
Block Conjugate Gradient Solver in OpenCL
Block Time Step Storage Scheme for Astrophysical N-body Simulations
Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems
Block-Parallel IDA* for GPUs
Block-Relaxation Methods for 3D Constant-Coefficient Stencils on GPUs and Multicore CPUs
Block-Size Independence for GPU Programs
Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling
Blockchain Goes Green? Part II: Characterizing the Performance and Cost of Blockchains on the Cloud and at the Edge
Blocked All-Pairs Shortest Paths Algorithm on Intel Xeon Phi KNL Processor: A Case Study
Blocking Self-avoiding Walks Stops Cyber-epidemics: A Scalable GPU-based Approach
Blocks and Fuel: Frameworks for deep learning
Blum Blum Shub on the GPU
Boda-RTC: Productive Generation of Portable, Efficient Code for Convolutional Neural Networks on Mobile Computing Platforms
Bohrium: Unmodified NumPy Code on CPU, GPU, and Cluster
Boids that see: Using self-occlusion for simulating large groups on GPUs
Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance
BoltzGen: Toward Universal Binder Design
Bone structure analysis on multiple GPGPUs
Bone Structure Analysis with GPGPUs
Bonsai: A GPU Tree-Code
Boosted Algorithms for Visual Object Detection on Graphics Processing Units
Boosting GPU Virtualization Performance with Hybrid Shadow Page Tables
Boosting Java Performance using GPGPUs
Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs
Boosting quantum evolutions using Trotter-Suzuki algorithms on GPUs
Boosting sphere decoding speed through Graphic Processing Units
BootCMatchG: An adaptive Algebraic MultiGrid linear solver for GPUs
BOPM implemented on a GPU-architecture
Bothnia: a dual-personality extension to the Intel integrated graphics driver
Bottleneck Analysis of Dynamic Graph Neural Network Inference on CPU and GPU
Bouncing Behavior of Microscopic Dust Aggregates
Bound the Peak Performance of SGEMM on GPU with software-controlled fast memory
Bounding the effect of partition camping in GPU kernels
Bounds Checking on GPU
Bounds on the Energy Consumption of Computational Kernels
Brain perfusion imaging: performance and accuracy
BrainCove: A Tool for Voxel-wise fMRI Brain Connectivity Visualization
BrainFrame: A heterogeneous accelerator platform for neuron simulations
BrainSlug: Transparent Acceleration of Deep Learning Through Depth-First Parallelism
Branch and Data Herding: Reducing Control and Memory Divergence for Error-tolerant GPU Applications
Breadth First Search Vectorization on the Intel Xeon Phi
Breadth-First Search using Dynamic Parallelism on the GPU
Breaking DVB-CSA
Breaking ECC2K-130
Breaking the GPU programming barrier with the auto-parallelising SAC compiler
Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers
Bridging Control-Centric and Data-Centric Optimization
Bridging OpenCL and CUDA: A Comparative Analysis and Translation
Bridging parallel and reconfigurable computing with multilevel PGAS and SHMEM+
Bridging the Gap between FPGAs and Multi-Processor Architectures: A Video Processing Perspective
Bridging the GPGPU-FPGA efficiency gap
Bridging the Performance-Programmability Gap for FPGAs via OpenCL: A Case Study with OpenDwarfs
Bridging the Semantic Gaps of GPU Acceleration for Scaleout CNN-based Big Data Processing: Think Big, See Small
Brief announcement: better speedups for parallel max-flow
Brief Announcement: On the Limits of Parallelizing Convolutional Neural Networks on GPUs
Bringing Auto-tuning to HIP: Analysis of Tuning Impact and Difficulty on AMD and Nvidia GPUs
Bringing OpenCL to Commodity RISC-V CPUs
Bringing Parallel Performance to Python with Domain-Specific Selective Embedded Just-in-Time Specialization
Brook for GPUs: Stream Computing on Graphics Hardware
Brownian Dynamics of Active Sphere Suspensions Confined Near a No-Slip Boundary
Brownian dynamics simulations on CPU and GPU with BD_BOX
Browsing a Large Collection of Community Photos Based on Similarity on GPU
Browsing Large Image Datasets through Voronoi Diagrams
Brute force de-shredding algorithm using the GPU
Brute-Force k-Nearest Neighbors Search on the GPU
BSGP: bulk-synchronous GPU programming
Buffer k-d Trees: Processing Massive Nearest Neighbor Queries on GPUs
Buffer overflow vulnerabilities in CUDA: a preliminary analysis
Bufferless NOC Simulation of Large Multicore System on GPU Hardware
Build and Travel KD-Tree with CUDA
Building a Performance Model for Deep Learning Recommendation Model Training on GPUs
Building a Personal High Performance Computer with Heterogeneous Processors
Building a Real-Time Multi-GPU Platform: Robust Real-Time Interrupt Handling Despite Closed-Source Drivers
Building Correlators with Many-Core Hardware
Building Human Brain Network in 3D Coefficient Map Determined by X-ray Microtomography
Building Multiclass Nonlinear Classifiers with GPUs
Building Source-to-Source Compilers for Heterogeneous Targets
Building-Blocks for Performance Oriented DSLs
Bulk Execution of Oblivious Algorithms on the Unified Memory Machine, with GPU Implementation
Bulk GCD Computation Using a GPU to Break Weak RSA Keys
Bump Mapping Unparametrized Surfaces on the GPU
Bundled depth-map merging for multi-view stereo
Burrows-Wheeler Aligner: A Parallel Approach
BVH for efficient raytracing of dynamic metaballs on GPU
C and CUDA Implementation for SIRT and SART Reconstruction Algorithms
C Language Extensions for Hybrid CPU/GPU Programming with StarPU
C to Cellular Automata and Execution on CPU, GPU and FPGA
C-DAC's Efforts - Application Kernels on HPC Cluster with GPU Accelerators
C-for-Metal: High Performance SIMD Programming on Intel GPUs
C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++
Cache and bandwidth aware matrix multiplication on the GPU
Cache Miss Analysis for GPU Programs Based on Stack Distance Profile
Cache-efficient numerical algorithms using graphics hardware
CADDIES: A New Framework for Rapid Development of Parallel Cellular Automata Algorithms for Flood Simulation
Caffe con Troll: Shallow Ideas to Speed Up Deep Learning
Caffe: Convolutional Architecture for Fast Feature Embedding
Caffeinated FPGAs: FPGA Framework For Convolutional Neural Networks
Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks
CaffeLink: Mathematica binding for Caffe Deep Learning Framework
CaffePresso: An Optimized Library for Deep Learning on Embedded Accelerator-based platforms
Calamari - A High-Performance Tensorflow-based Deep Learning Package for Optical Character Recognition
Calculation by artificial compressibility method and virtual flux method on GPU
Calculation of fermion loops for eta-prime and nucleon scalar and electromagnetic form factors
Calculation of Force Field Grids for Molecular Docking Using Graphics Processing Unit
Calculation of HELAS amplitudes for QCD processes using graphics processing unit (GPU)
Calculation of Stochastic Heating and Emissivity of Cosmic Dust Grains with Optimization for the Intel Many Integrated Core Architecture
Calculation of weight vectors for wideband beamforming using Graphics Processing Units
CAMPAIGN: An open-source Library of GPU-accelerated Data Clustering Algorithms
Can CUDA be exposed through web services?
Can GPGPU Programming Be Liberated from the Data-Parallel Bottleneck?
Can GPUs Sort Strings Efficiently?
Can Large Language Models Predict Parallel Code Performance?
Can PCM Benefit GPU? Reconciling Hybrid Memory Design with GPU Massive Parallelism for Energy Efficiency
Can Portability Improve Performance? An Empirical Study of Parallel Graph Analytics
Can Tensor Cores Benefit Memory-Bound Kernels? (No!)
Can We Run in Parallel? Automating Loop Parallelization for TornadoVM
Canadian Hydrogen Intensity Mapping Experiment (CHIME) Pathfinder
Candidate set parallelization strategies for Ant Colony Optimization on the GPU
CANNA: Neural Network Acceleration using Configurable Approximation on GPGPU
Canny edge detection on NVIDIA CUDA
CANSCID-CUDA
Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL
CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures
Capturing the Memory Topology of GPUs
Caracal: dynamic translation of runtime environments for GPUs
[French] Arithmetic Characteristics of Graphics Processors
CaravelaMPI: Message Passing Interface for Parallel GPU-Based Applications
Cardiac Dysrhythmia Detection with GPU-Accelerated Neural Networks
Cardiac simulation on multi-GPU platform
Cardiac tissue simulation using graphics hardware
Cartesian SENSE and k-t SENSE reconstruction using commodity graphics hardware
Cascaded Segmentation-Detection Networks for Word-Level Text Spotting
Case Studies in Acceleration of Heston's Stochastic Volatility Financial Engineering Model: GPU, Cloud and FPGA Implementations
Case Study: GPU-based implementation of sequence pair based floorplanning using CUDA
Case study: Interactive rendering of adaptive mesh refinement data
Case study: Runtime reduction of a buffer insertion algorithm using GPU parallel programming
CASE: A Compiler-Assisted SchEduling Framework for Multi-GPU Systems
CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark
Casting Shadows in Real Time
Catalyst-Agent: Autonomous heterogeneous catalyst screening and optimization with an LLM Agent
CATBench: A Compiler Autotuning Benchmarking Suite for Black-box Optimization
Caustics Mapping: An Image-Space Technique for Real-Time Caustics
CAVE-CL: An OpenCL version of the package for detection and quantitative analysis of internal cavities in a system of overlapping balls: application to proteins
CBench: Analyzing Compute Performance for Modern NVIDIA and AMD GPUs
CBESW: sequence alignment on the Playstation 3
CBinfer: Change-Based Inference for Convolutional Neural Networks on Video Data
CDFC: Collision Detection Based on Fuzzy Clustering for Deformable Objects on GPU's
Celeris: A GPU-accelerated open source software with a Boussinesq-type wave solver for real-time, interactive simulation and visualization
CELES: CUDA-accelerated simulation of electromagnetic scattering by large ensembles of spheres
Cell Charge Approximation for Accelerating Molecular Simulation on CUDA-Enabled GPU
cellGPU: massively parallel simulations of dynamic vertex models
Cellular automaton for ultra-fast watershed transform on GPU
Cellular genetic algorithms
Cellular Genetic Algorithms and Local Search for 3-SAT problem on Graphic Hardware
Cellular GPU Models to Euclidean Optimization Problems
Cellular Level Agent Based Modelling on the Graphics Processing Unit
Central Force Optimization on a GPU: A case study in high performance metaheuristics using multiple topologies
cf4ocl: a C framework for OpenCL
CFD code adaptation to the FPGA architecture
CFD Simulation of Jet Cooling and Implementation of Flow Solvers in GPU
CFD-based analysis and two-level aerodynamic optimization on Graphics Processing Units
CFMDS: CUDA-based fast multidimensional scaling for genome-scale data
CFU Playground: Full-Stack Open-Source Framework for Tiny Machine Learning (tinyML) Acceleration on FPGAs
Cg in Two Pages
Cg: a system for programming graphics hardware in a C-like language
CGiS, a new Language for Data-parallel GPU Programming
CGO: G: Intelligent Heuristic Construction with Active Learning
CGP-Tuning: Structure-Aware Soft Prompt Tuning for Code Vulnerability Detection
Chai: Collaborative Heterogeneous Applications for Integrated-architectures
ChainerMN: Scalable Distributed Deep Learning Framework
Challenge benchmarks that must be conquered to sustain the gpu revolution
Challenges Adapting CUDA PIC Codes to multiple GPUs
Challenges and Opportunities in C/C++ Source-To-Source Compilation
Challenges and opportunities of obtaining performance from multi-core CPUs and many-core GPUs
Challenges and Techniques for Transparent Acceleration of Unmodified Big Data Applications
Challenges for a GPU-Accelerated Dynamic Programming Approach for Join-Order Optimization
Challenges for compiler support for exascale computing
Challenges of mapping financial analytics to many-core architecture
Challenges of medical image processing
Challenging cloning related problems with GPU-based algorithms
Challenging Portability Paradigms: FPGA Acceleration Using SYCL and OpenCL
Chameleon: Virtualizing idle acceleration cores of a heterogeneous multicore processor for caching and prefetching
ChamNet: Towards Efficient Network Design through Platform-Aware Model Adaptation
CHAOS: A Parallelization Scheme for Training Convolutional Neural Networks on Intel Xeon Phi
Character-level Transformer-based Neural Machine Translation
Charactering and Detecting CUDA Program Bugs
Characterising Across-Stack Optimisations for Deep Convolutional Neural Networks
Characterising Bipartite Graph Matching Algorithms on GPUs
Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications
Characterization and Exploitation of GPU Memory Systems
Characterization and Performance Analysis for 3D Benchmarks
Characterization and Transformation of Unstructured Control Flow in Bulk Synchronous GPU Applications
Characterization and Transformation of Unstructured Control Flow in GPU Applications
Characterization of FPGA-based High Performance Computers
Characterization of Lossy SIW Resonators Based on Multilayer Perceptron Neural Networks on Graphics Processing Unit
Characterization of OpenCL on a Scalable FPGA Architecture
Characterization of Speech Recognition Systems on GPU Architectures
Characterizing and Enhancing Global Memory Data Coalescing on GPUs
Characterizing and Evaluating a Key-value Store Application on Heterogeneous CPU-GPU Systems
Characterizing and Improving the Use of Demand-Fetched Caches in GPUs
Characterizing and Optimizing Irregular Applications on Graphics Processing Units
Characterizing and Predicting Scientific Workloads for Heterogeneous Computing Systems
Characterizing CUDA and OpenMP Synchronization Primitives
Characterizing Dataset Dependence for Sparse Matrix-Vector Multiplication on GPUs
Characterizing Deep Learning Training Workloads on Alibaba-PAI
Characterizing Optimizations to Memory Access Patterns using Architecture-Independent Program Features
Characterizing the Challenges and Evaluating the Efficacy of a CUDA-to-OpenCL Translator
Characterizing the Performance of Parallel Data-Compression Algorithms across Compilers and GPUs
Charged particles constrained to a curved surface
CHARM-SYCL: New Unified Programming Environment for Multiple Accelerator Types
Chat AI: A Seamless Slurm-Native Solution for HPC-Based Services
Chebyshev Filter Diagonalization on Modern Manycore Processors and GPGPUs
CheCL: Transparent Checkpointing and Process Migration of OpenCL Applications
CheCUDA: A Checkpoint/Restart Tool for CUDA Applications
chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations
Chest CT automatic analysis for lung nodules detection implemented on a GPU computing system
Chestnut: A GPU Programming Language for Non-Experts
CHO: A Benchmark Suite for OpenCL-based FPGA Accelerators
CHO: Towards a Benchmark Suite for OpenCL FPGA Accelerators
Cholla: A New Massively-Parallel Hydrodynamics Code For Astrophysical Simulation
Chopper: A Multi-Level GPU Characterization Tool & Derived Insights Into LLM Training Inefficiency
CHPS: An Environment for Collaborative Execution on Heterogeneous Desktop Systems
Chrono: a parallel multi-physics library for rigid-body, flexible-body, and fluid dynamics
Chunkflow: Distributed Hybrid Cloud Processing of Large 3D Images by Convolutional Nets
CI/CD Efforts for Validation, Verification and Benchmarking OpenMP Implementations
Cinematic Particle Systems with OpenCL
Circular Hough Transform in OpenCL
CitiusSynapse: A Deep Learning Framework for Embedded Systems
CL-VIS: Visualization Platform for Understanding and Checking the OpenCL Programs
CL2QCD - Lattice QCD based on OpenCL
CL4SE: A Context Learning Benchmark For Software Engineering Tasks
Clacc: Translating OpenACC to OpenMP in Clang
Classical Mechanical Hard-Core Particles Simulated in a Rigid Enclosure using Multi-GPU Systems
Classical Simulation of Quantum Adiabatic Algorithms using Mathematica on GPUs
Classification-based Financial Markets Prediction using Deep Neural Networks
Classification of Higgs Boson Tau-Tau decays using GPU accelerated Neural Networks
Classification Performance of Convolutional Neural Networks
Classify QCD phase transition with deep learning
ClawHMMER: A Streaming HMMer-Search Implementation
CLBlast: A Tuned OpenCL BLAS Library
ClearPath: highly parallel collision avoidance for multi-agent simulation
ClearView: An Interactive Context Preserving Hotspot Visualization Technique
CLgrep: A Parallel String Matching Tool
Climbing Mont Blanc - A Training Site for Energy Efficient Programming on Heterogeneous Multicore Processors
Clinically applicable Monte Carlo-based biological dose optimization for the treatment of head and neck cancers with spot-scanning proton therapy
Clipmapping on the GPU
clMAGMA: High Performance Dense Linear Algebra with OpenCL
clMF: A fine-grained and portable alternating least squares algorithm for parallel matrix factorization
Clock Math - A System for Solving SLEs Exactly
CLOP: A Multi-stage Compiler to Seamlessly Embed Heterogeneous Code
clOpenCL - Supporting Distributed Heterogeneous Computing in HPC Clusters
CLort: High Throughput and Low Energy Network Intrusion Detection on IoT Devices with Embedded GPUs
Closing the Ninja Performance Gap through Traditional Programming and Compiler Technology
Cloth Simulation on the GPU
Cloth Simulation Using AABB Hierarchies and GPU Parallelism
CloudCL: Single-Paradigm Distributed Heterogeneous Computing for Cloud Infrastructures
Cloudlet-screen computing: A multi-core-based, cloud-computing-oriented, traditional-computing-compatible parallel computing Paradigm for the masses
clpeak - peak performance of your opencl device
clRNG: A Random Number API with Multiple Streams for OpenCL
clSPARSE: A Vendor-Optimized Open-Source Sparse BLAS Library
clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs
CLTestCheck: Measuring Test Effectiveness for GPU Kernels
cltorch: a Hardware-Agnostic Backend for the Torch Deep Neural Network Library, Based on OpenCL
CLTune: A Generic Auto-Tuner for OpenCL Kernels
CLUEstering: a high-performance density-based clustering library for scientific computing
ClusCo: clustering and comparison of protein models
Cluster and Fast-Update Simulations of Regular and Rewired Lattice Ising Models Using CUDA and Graphical Processing Units
Cluster versus GPU implementation of an Orthogonal Target Detection Algorithm for Remotely Sensed Hyperspectral Images
Cluster-Level Tuning of a Shallow Water Equation Solver on the Intel MIC Architecture
Cluster-SkePU: A Multi-Backend Skeleton Programming Library for GPU Clusters
Clustering Based Search Algorithm For Motion Estimation
Clustering billions of data points using GPUs
Clustering coefficient queries on massive dynamic social networks
Clustering on GPU - A Brief Survey
Clustering Throughput Optimization on the GPU
ClusterWatch: Flexible, Lightweight Monitoring for High-end GPGPU Clusters
CMA-ES for Hyperparameter Optimization of Deep Neural Networks
CMCpy: Genetic Code-Message Coevolution Models in Python
CMLCompiler: A Unified Compiler for Classical Machine Learning
CnC-CUDA: declarative programming for GPUs
CNN2Gate: An Implementation of Convolutional Neural Networks Inference on FPGAs with Automated Design Space Exploration
CNNLab: a Novel Parallel Framework for Neural Networks using GPU and FPGA-a Practical Study with Trade-off Analysis
Co-design of a particle-in-cell plasma simulation code for Intel Xeon Phi: a first look at Knights Landing
Co-processing SPMD Computation on GPUs and CPUs on Shared Memory System
Co-processor acceleration of an unmodified parallel solid mechanics code with FEASTGPU
Co-tuning of Software Specializers and Hardware Accelerators within a CNN Application
Coalition Structure Generation with the Graphic Processor Unit
Coalition Structure Generation with the Graphics Processing Unit
Coarse grain computation-communication overlap for efficient application-level checkpointing for GPUs
Coarse grain parallelization of evolutionary algorithms on GPGPU cards with EASEA
Coating Process Monitoring Using Computer Vision
CoCoNet: Co-Optimizing Computation and Communication for Distributed Machine Learning
Code Generation Compiler for the OpenMP 4.0 Accelerator Model onto OMPSS
Code Generation for a Variety of Accelerators for a Graph DSL
Code Generation for Cryptographic Kernels using Multi-word Modular Arithmetic on GPU
Code Generation for Embedded Heterogeneous Architectures on Android
Code Generation for High-Level Synthesis of Multiresolution Applications on FPGAs
Code Generation from Functional to Imperative: Combining Destination-Passing Style and Views
Code Optimization and Performance Analysis of Oceanographic Software Package NEMO for GPGPU Systems
Code Optimization and Scaling of the Astrophysics Software Gadget on Intel Xeon Phi
Code optimization based on source to source transformations using profile guided metrics
Code Optimization on GPUs
Code Optimization on Kepler GPUs and Xeon Phi
Code Optimization Techniques for Graphics Processing Units
Code Refinement of Stencil Codes
CodegenBench: Can LLMs Write Efficient Code Across Architectures?
CodePy
CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models
Coding Ants: Using Ant Colony Optimization to Accelerate CT Reconstruction
CoDL: Efficient CPU-GPU Co-execution for Deep Learning Inference on Mobile Devices
Cofactorization on Graphics Processing Units
COFFEE: an Optimizing Compiler for Finite Element Local Assembly
Cognitive radio network for the smart grid: Experimental system architecture, control algorithms, security, and microgrid testbed
Coherence aware GPU-based ray casting for virtual colonoscopy
Coherent Photon Mapping on the Intel MIC Architecture
Coherent Spatiotemporal Filtering, Upsampling and Rendering of RGBZ Videos
Coherent transport by adiabatic passage on atom chips
Collaborative design and optimization using Collective Knowledge
Collaborative Diffusion on the GPU for Path-Finding in Games
Collaborative diffusion: programming antiobjects
Collaborative execution environment for heterogeneous parallel systems
Collage: Automated Integration of Deep Learning Backends
Collection skeletons: declarative abstractions for data collections
Collective Communication for 100k+ GPUs
Collision Detection Based on Fuzzy Scene Subdivision
Collision Detection of Triangle Meshes using GPU
Collision detection on the GPU
Collision Detection: Broad Phase Adaptation from Multi-Core to Multi-GPU Architecture
Collision for 75-step SHA-1: Intensive Parallelization with GPU
Collision-Driven Volumetric Deformation on the GPU
Collision-streams: fast GPU-based collision detection for deformable models
Color and motion-based particle filter target tracking in a network of overlapping cameras with multi-threading and GPGPU
Color Correction Acceleration Using a Color Cube and OpenCL
Color Me Noisy: Example-based Rendering of Hand-colored Animations with Temporal Noise Control
Color Seamlessness in Multi-Projector Displays Using Constrained Gamut Morphing
Colored stochastic shadow maps
Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
Colour flux-tubes in static Pentaquark and Tetraquark systems
Column-Oriented Datalog on the GPU
Combinatorial Optimization of Work Distribution on Heterogeneous Systems
Combined acoustic and optical trapping
Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL
Combining approximate inference methods for efficient learning on large computer clusters
Combining Belief Propagation and Successive Cancellation List Decoding of Polar Codes on a GPU Platform
Combining computer vision and physics simulations using GPGPU
Combining Data Parallelism and Task Parallelism for Efficient Performance on Hybrid CPU and GPU Systems
Combining Multiple Optimised FPGA-based Pulsar Search Modules Using OpenCL
Combining Performance and Productivity: Accelerating the Network Sensing Graph Challenge with GPUs and Commodity Data Science Software
Combining recent HPC techniques for 3D geophysics acceleration
Combustion Simulations Using Graphic Processing Units
Coming Soon: Research in a Cloud
Communication and Coordination Paradigms for Highly-Parallel Accelerators
Communication Architectures for Scalable GPU-centric Computing Systems
Communication Optimization for Multi GPU Implementation of Smith-Waterman Algorithm
Communication-Avoiding Optimization of Geometric Multigrid on GPUs
Communication-avoiding QR decomposition for GPUs
Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey
Communication-Minimizing 2D Convolution in GPU Registers
Communication-minimizing Asynchronous Tensor Parallelism
Community Structure Discovery algorithm on GPU with CUDA
Compact data structure and scalable algorithms for the sparse grid technique
Comparative Analysis of OpenACC, OpenMP and CUDA using Sequential and Parallel Algorithms
Comparative Evaluation of Binary Features
Comparative evaluation of platforms for parallel Ant Colony Optimization
Comparative Performance Analysis of Intel Xeon Phi, GPU, and CPU
Comparative Performance and Scalability Analysis of GPU-accelerated Database Operations
Comparative Study of Caffe, Neon, Theano, and Torch for Deep Learning
Comparative Study of Frequent Itemset Mining Techniques on Graphics Processor
Comparative Study of High Performance Computing Using Multi-core Parallel Systems
Comparative study of parallel programming models for multicore computing
Comparative Study of the Parallelization of the Smith-Waterman Algorithm on OpenMP and Cuda C
Comparing CUDA and OpenGL implementations for a Jacobi iteration
Comparing CUDA, OpenCL and OpenGL Implementations of the Cardiac Monodomain Equations
Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels
Comparing FPGAs to Graphics Accelerators and the Playstation 2 Using a Unified Source Description
Comparing GPU and CPU in OLAP Cubes Creation
Comparing GPU-based multi-volume ray casting techniques
Comparing Hardware Accelerators in Scientific Applications: A Case Study
Comparing Intra- and Inter-Processor Parallelism on Multi-Core Cell Processors for Scientific Simulations
Comparing Linear and Convex Relaxations for Stereo and Motion
Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation
Comparing Many-Core Accelerator Frameworks
Comparing Parallel Functional Array Languages: Programming and Performance
Comparing Parallel Hardware Architectures for Visually Guided Robot Navigation
Comparing Parallel Simulation of Social Agents using Cilk and OpenCL
Comparing performance and energy efficiency of FPGAs and GPUs for high productivity computing
Comparing Performance and Portability between CUDA and SYCL for Protein Database Search on NVIDIA, AMD, and Intel GPUs
Comparing Programmer Productivity in OpenACC and CUDA: an Empirical Investigation
Comparing SYCL data transfer strategies for tracking use cases
Comparing the Performance of Different x86 SIMD Instruction Sets for a Medical Imaging Application on Modern Multi- and Manycore Chips
Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs
Comparing the Treecode with FMM on GPUs for vortex particle simulations of a leapfrogging vortex ring
Comparing Two Generations of Embedded GPUs Running a Feature Detection Algorithm
Comparison and Analysis of GPGPU and Parallel Computing on Multi-Core CPU
Comparison and Analysis of GPU Energy Efficiency For CUDA and OpenCL
Comparison based sorting for systems with multiple GPUs
Comparison between GPU and parallel CPU optimizations in viewshed analysis
Comparison of Cilk, Kaapi and CUDA for the Jacobi Method
Comparison of CPML Implementations for the GPU-Accelerated FDTD Solver
Comparison of different n-body algorithms on various hardware platforms using SYCL
Comparison of Different Parallel Implementations of the 2+1-Dimensional KPZ Model and the 3-Dimensional KMC Model
Comparison of FPGA and GPU implementations of real-time stereo vision
Comparison of Fragmentation/Dispersion Models for Asteroid Nuclear Disruption Mission Design
Comparison of GPU Architectures for Asynchronous Communication with Finite-Differencing Applications
Comparison of HPC Architectures for Computing All-Pairs Shortest Paths. Intel Xeon Phi KNL vs NVIDIA Pascal
Comparison of Hybrid Sorting Algorithms Implemented on Different Parallel Hardware Platforms
Comparison of OpenCL performance on different platforms using VexCL and Blaze
Comparison of OpenMP and OpenCL Parallel Processing Technologies
Comparison of parallel sorting algorithms
Comparison of Parallelisation Approaches, Languages, and Compilers for Unstructured Mesh Algorithms on GPUs
Comparison of Random Number Generators in Particle Swarm Optimization Algorithm
Comparison of Rectangular Matrix Multiplication with and without Border Conditions
Comparison of several parallel API for cloth modelling on modern GPUs
Comparison of SPMV performance on matrices with different matrix format using CUSP, cuSPARSE and ViennaCL
Comparison of Technologies for General-Purpose Computing on Graphics Processing Units
Comparison of Thread Execution Methods for GPU-oriented OpenCL Programs on Multicore Processors
COMPASS: a programmable data prefetcher using idle GPU shaders
Compensated Visual Hull for Defective Segmentation and Occlusion
Compensated Visual Hull with GPU-Based Optimization
Compensating Indirect Scattering for Immersive and Semi-Immersive Projection Displays
Competing computational approaches to reaction-diffusion equations in clusters of cells
Compilation and Design Space Exploration of Dataflow Programs for Heterogeneous CPU-GPU Platforms
Compilation for Heterogeneous Computing: Automating Analyses, Transformations and Decisions
Compilation techniques and language support to facilitate dependence-driven computation
Compile-time GPU memory access optimizations
Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework
Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations
Compiler and Runtime Systems for Generative AI Models
Compiler and runtime techniques for bulk-synchronous programming models on CPU architectures
Compiler Assisted Runtime Adaptation
Compiler Fuzzing through Deep Learning
Compiler optimizations for directive-based programming for accelerators
Compiler Optimizations for Industrial Unstructured Mesh CFD Applications on GPUs
Compiler Optimizations for SIMD/GPU/Multicore Architectures
Compiler support for general-purpose computation on GPUs
Compiler Support for High-level GPU Programming
Compiler Support for Speculation in Decoupled Access/Execute Architectures
Compiler Technologies in Deep Learning Co-Design: A Survey
Compiler-assisted distribution of OpenMP code for improved scalability
Compiler-Assisted Workload Consolidation For Efficient Dynamic Parallelism on GPU
Compiler-based Data Prefetching and Streaming Non-temporal Store Generation for the Intel Xeon Phi Coprocessor
Compiler-Based Tools to Aid in Data Transfer Optimization and On-Chip Debug of Heterogeneous Compute Systems
Compiler-centric across-stack deep learning acceleration
Compiler-directed memory management for heterogeneous MPSoCs
Compiler-Driven Performance on Heterogeneous Computing Platforms
Compiler-Level Explicit Cache for a GPGPU Programming Framework
CompilerGym: Robust, Performant Compiler Optimization Environments for AI Research
Compilers for Portable Programming of Heterogeneous Parallel & Approximate Computing Systems
Compiling a High-level Directive-Based Programming Model for GPGPUs
Compiling a high-level language for GPUs: (via language support for architectures and compilers)
Compiling an Array Language to a Graphics Processor
Compiling and Optimizing Java 8 Programs for GPU Execution
Compiling and Optimizing OpenMP 4.X Programs to OpenCL and SPIR
Compiling for a heterogeneous vector image processor
Compiling High Performance Recursive Filters
Compiling Parallel Functional Code with Data Parallel Idealised Algol
Compiling Python to a hybrid execution environment
Compiling Stream Applications for Heterogeneous Architectures
Complete PISO and SIMPLE solvers on Graphics Processing Units
Complexity Analysis and Algorithm Design for Reorganizing Data to Minimize Non-Coalesced Memory Accesses on GPU
Complexity effective memory access scheduling for many-core accelerator architectures
Composability of parallel codes on heterogeneous architectures
Composing Distributed Computations Through Task and Kernel Fusion
Composing multiple StarPU applications over heterogeneous machines: a supervised approach
Composition and Reuse with Compiled Domain-Specific Languages
Compositional Compilation for Sparse, Irregular Data Parallelism
Compositional Deep Learning in Futhark
Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs
Compoundly weighted Voronoi: a sequential and parallel implementation
Comprehensive Analysis of High-Performance Computing Methods for Filtered Back-Projection
Comprehensive Evaluation of OpenCL-based Convolutional Neural Network Accelerators in Xilinx and Altera FPGAs
Comprehensive Evaluations of Cone-beam CT dose in Image-guided Radiation Therapy via GPU-based Monte Carlo simulations
Comprehensive Optimization of Parametric Kernels for Graphics Processing Units
Comprehensive Performance Monitoring for GPU Cluster Systems
Compressed Dynamic Mode Decomposition for Real-Time Object Detection
Compressed Facade Displacement Maps
Compressed Learning of Deep Neural Networks for OpenCL-Capable Embedded Systems
Compressed Multiple-Row Storage Format
Compressed Real Numbers for AI: a case-study using a RISC-V CPU
Compressed sensing using hidden Markov models with application to vision based aircraft tracking
Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks
Compressing Floating-Point Number Stream for Numerical Applications
Compression Domain Volume Rendering
Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications
Compressive Phase Contrast Tomography
Computation of Air-Vortices Based on GPU Technology: Optimizing and Parallelizing a Model for Wake-Vortex Prediction Using OpenCL
Computation of electron quantum transport in graphene nanoribbons using GPU
Computation of Galois field expressions for quaternary logic functions on GPUs
Computation of gray-level co-occurrence matrix based on CUDA and its optimization
Computation of Large Covariance Matrices by SAMMY on Graphical Processing Units and Multicore CPUs
Computation of the Isogeometric Analysis Stiffness Matrix on GPU
Computation of the Spatial Impulse Response for Ultrasonic Fields on the Graphics Processing Units (GPU)
Computation of Troposphere Slant Delays on a GPU
Computation of Voronoi diagrams using a graphics processing unit
Computation on GPU of Eigenvalues and Eigenvectors of a Large Number of Small Hermitian Matrices
Computation on programmable graphics hardware
Computational advances in gravitational microlensing: a comparison of CPU, GPU, and parallel, large data codes
Computational Biology and Applied Bioinformatics
Computational cost estimates for parallel shared memory isogeometric multi-frontal solvers
Computational Experiments in Markov Chain Monte Carlo
Computational Fluid Dynamic on GPU
Computational Fluid Dynamics Simulations using Many Graphics Processors
Computational Fluid Dynamics Using Graphics Processing Units: Challenges and Opportunities
Computational Fluid Dynamics using OpenCL - a Practical Introduction
Computational Gravitational Dynamics with Modern Numerical Accelerators
Computational investigation of intense short-wavelength laser interaction with rare gas clusters
Computational kinetics of a large scale biological process on GPU workstations: DNA bending
Computational modeling of synthetic microbial biofilms
Computational Modelling of Galaxy Formation using FLAME GPU
Computational Optimization of a Time-Domain Beamforming Algorithm Using CPU and GPU
Computational Performance Predictions for Deep Neural Network Training: A Runtime-Based Approach
Computational Physics on Graphics Processing Units
Computational Simulation of Freely Falling Water Droplets on Graphics Processing Units
Computational stereo camera system with programmable control loop
Computational wave optics library for C++: CWO++ library
Computationally Efficient Algorithms for Evaluation of Statistical Descriptors
Computationally Efficient Implementation of a Hamming Code Decoder using a Graphics Processing Unit
Computationally Efficient Tsunami Modelling on Graphics Processing Units (GPU)
Compute Distance Matrices with GPU
Compute Pairwise Manhattan Distance and Pearson Correlation Coefficient of Data Points with GPU
Compute Unified Device Architecture Application Suitability
Compute units in OpenMP: Extensions for heterogeneous parallel programming
Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards
Computer Finite-Difference Time-Domain Simulation of Electromagnetic Wave Propagation using GPUs
Computer generated holography using parallel commodity graphics hardware
Computer Graphics: From Pixels to Programmable Graphics Hardware
Computer Simulation of Dark Matter Effects on Galaxy Rotation
Computer Simulation of Saturn's Ring Structure
Computer Tomography and Ultrasonography Image Registration Based on the Cooperation of GPU and CPU
Computer Vision Accelerators for Mobile Systems based on OpenCL GPGPU Co-Processing
Computer Vision and Image Segmentation Implemented on GPU Using Compute Unified Device Architecture as Applied on Quality Inspection of Pre-etched Printed Circuit Board
Computer Vision Application in Graphic Processors
Computer vision based geometric calibration in curved multi-projector displays
Computer vision for continuous plankton monitoring
Computer Vision Models in Surveillance Robotics
Computer Vision on the GPU -- Tools, Algorithms and Frameworks
Computer vision signal processing on graphics processing units
Computer-Generated Marbling Textures: A GPU-Based Design System
Computing 2D Alpha Shapes Using GPU
Computing Best Possible Pseudo-Solutions to Interval Linear Systems of Equations
Computing dynamics of thin films via large scale GPU-based simulations
Computing finite models using free Boolean generators
Computing High Resolution Explicit Corridor Maps using Parallel Technologies
Computing least squares condition numbers on hybrid multicore/GPU systems
Computing Nash Equilibria in Bimatrix Games: GPU-based Parallel Support Enumeration
Computing of high breakdown regression estimators without sorting on graphics processing units
Computing on Knights and Kepler Architectures
Computing OpenSURF on OpenCL and General Purpose GPU
Computing optical flow using fast total variation
Computing Optimal Cycle Mean in Parallel on CUDA
Computing Performance Benchmarks among CPU, GPU, and FPGA
Computing Prestack Kirchhoff Time Migration on General Purpose GPU
Computing Privacy-Preserving Edit Distance and Smith-Waterman Problems on the GPU Architecture
Computing Reachable Sets via Barrier Methods on SIMD Architectures
Computing resultants on Graphics Processing Units: Towards GPU-accelerated computer algebra
Computing room acoustics with CUDA - 3D FDTD schemes with boundary losses and viscosity
Computing Spatial Distance Histograms for Large Scientific Datasets On-the-Fly
Computing Spectral Transforms Used in Digital Logic on the GPU
Computing spike-based convolutions on GPUs
Computing Strongly Connected Components in Parallel on CUDA
Computing Strongly Connected Components with CUDA
Computing the distance between two finite element solutions defined on different 3D meshes on a GPU
Computing the Mertens function on a GPU
Computing Treewidth on the GPU
Computing trends using graphic processor in high energy physics
Computing virtual acoustics using the 3D finite difference time domain method and Kepler architecture GPUs
Computing without processors
Computational intensive Tasks in Multimedia Signal Processing
Compyle: a Python package for parallel computing
CONCUR: Benchmarking LLMs for Concurrent Code Generation
ConCuR: Conciseness Makes State-of-the-Art Kernel Generation
Concurrency Mapping to FPGAs with OpenCL: A Case Study with a Shallow Water Kernel
Concurrent Algorithms and Data Structures for Many-Core Processors
Concurrent Analytical Query Processing with GPUs
Concurrent CPU-GPU Task Programming using Modern C++
Concurrent GPU Programming
Concurrent kernel execution on Graphic Processing Units
Concurrent learning of a Probabilistic Graphical Model on the GPU
Concurrent Manipulation of Dynamic Data Structures in OpenCL
Concurrent Number Cruncher: An Efficient Sparse Linear Solver on the GPU
Concurrent query processing in a GPU-based database system
Concurrent Scheduling of High-Level Parallel Programs on Multi-GPU Systems
Concurrent Solutions to Linear Systems using Hybrid CPU/GPU Nodes
Concurrent Task Execution on the Intel Xeon Phi
Conditional component composition for GPU-based systems
Cone-beam Computed tomography image reconstruction based on GPU
Confidential Computing on Heterogeneous Systems: Survey and Implications
Confidentiality Issues on a GPU in a Virtualized Environment
Configuration and Benchmarks of Peer-to-Peer Communication over Gigabit Ethernet and InfiniBand in a Cluster with Intel Xeon Phi Coprocessors
Conflux: Embedding Massively Parallel Semantics in a High-Level Programming Language
Conjugate gradient solvers on Intel Xeon Phi and NVIDIA GPUs
Connected component identification and cluster update on GPU
Connected component labeling on a 2D grid using CUDA
Connected-component identification and cluster update on graphics processing units
Connecting Architecture, Fitness, Optimizations and Performance using an Anisotropic Diffusion Filter
Connectivity-Based Segmentation for GPU-Accelerated Mesh Decompression
Considerations when evaluating microprocessor platforms
Considering GPGPU for HPC Centers: Is It Worth the Effort?
Consolidating Applications for Energy Efficiency in Heterogeneous Computing Systems
Constrained inverse volume rendering for planetary nebulae
Constraint Fluids on GPU
Constraint-based LN-curves
Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition
Constructing Natural Neighbor Interpolation Based Grid DEM Using CUDA
Constructing Two-Dimensional Voronoi Diagrams via Divide-and-Conquer of Envelopes in Space
Constructing Two-Dimensional Voronoi Diagrams via Divide-and-Conquer of Envelopes in Space (thesis)
Construction and Implementation of a Simple Agent-Based System on GPU-Architectures
Construction and Rendering of Trimmed Blending Surfaces with Sharp Features on a GPU
Construction of a Virtual Cluster by Integrating PCI Pass-Through for GPU and InfiniBand Virtualization in Cloud
Construction of Efficient Kd-Trees for Static Scenes Using Voxel-visibility Heuristic
Content Based Image Retrieval with Graphical Processing Unit
Context Parallelism for Scalable Million-Token Inference
Context-aware volume navigation
Continual surface-based multi-projector blending for moving objects
Continuous Level of Detail on Graphics Hardware
Continuous Representation of Projected Attribute Spaces of Multifields over Any Spatial Sampling
Contour-based algorithm for vectorization of satellite images
Contouring for Power Systems Using Graphical Processing Units
Contract-Based General-Purpose GPU Programming
ConTraPh: Contrastive Learning for Parallelization and Performance Optimization
Contributions of hybrid architectures to depth imaging: a CPU, APU and GPU comparative study
Contributions to Music Semantic Analysis and Its Acceleration Techniques
Contributions to Parallel Simulation of Equation-Based Models on Graphics Processing Units
Contributions to parallel stochastic simulation: Application of good software engineering practices to the distribution of pseudorandom streams in hybrid Monte-Carlo simulations
Contributions to the Efficient Use of General Purpose Coprocessors: Kernel Density Estimation as Case Study
Convergence and Scalarization for Data-Parallel Architectures
Converting Data to Task-Parallelism by Rewrites
Converting Data-Parallelism to Task-Parallelism by Rewrites: Purely Functional Programs Across Multiple GPUs
Convex Clustering: An Attractive Alternative to Hierarchical Clustering
Convolution of large 3D images on GPU and its decomposition
Convolutional Neural Network for Sentence Classification
Convolutional Neural Network-Based Image Representation for Visual Loop Closure Detection
Convolutional Neural Networks for Human Activity Recognition using Mobile Sensors
Convolutional Neural Networks for Large-Scale Bird Song Classification in Noisy Environment
COOK Access Control on an embedded Volta GPU
Cooperative CPU, GPU, and FPGA heterogeneous execution with EngineCL
Cooperative Heterogeneous Computing for Parallel Processing on CPU/GPU Hybrids
Cooperative Kernels: GPU Multitasking for Blocking Algorithms
Cooperative Multitasking for GPU-Accelerated Grid Systems
Coordinate strip-mining and kernel fusion to lower power consumption on GPU
Coordinated system level resource management for heterogeneous many-core platforms
Copperhead: Compiling an embedded data parallel language
Coprocessor Computing with FPGA and GPU
CoreTSAR: Task Scheduling for Accelerator-aware Runtimes
Correctly rounding elementary functions on GPU
Correctly Treating Synchronizations in Compiling Fine-Grained SPMD-Threaded Programs for CPU
Correlating Radio Astronomy Signals with Many-Core Hardware
Correlation analysis on GPU systems using NVIDIA's CUDA
Cortical architectures on a GPGPU
CosmoFlow: Using Deep Learning to Learn the Universe at Scale
Cosmological Calculations on the GPU
Cost Efficient PageRank Computation using GPU
Cost-aware function migration in heterogeneous systems
Cost-effective low-power graphics processing unit for handheld devices
Cost-effective medical image reconstruction: from clusters to graphics processing units
Cost-Effective Methodology for Complex Tuning Searches in HPC: Navigating Interdependencies and Dimensionality
Cost-Effective Soft-Error Protection for SRAM-Based Structures in GPGPUs
Cost-Performance Analysis: A Comparative Study of CPU-Based Serverless and GPU-Based Training Architectures
COTS cluster-based sort-last rendering: performance evaluation and pipelined implementation
Coulomb and Landau Gauge Fixing in GPUs using CUDA and MILC
Coulomb, Landau and Maximally Abelian Gauge Fixing in Lattice QCD with Multi-GPUs
Count Sort for GPU Computing
Counting and Occurrence Sort for GPUs using an Embedded Language
Counting Triangles in Large Graphs on GPU
Coupled Vlasov and two-fluid codes on GPUs
Coupler Design and Optimization by GPU-Accelerated DG-FEM
Coupling a Generalized DEM and an SPH Models Under a Heterogeneous Massively Parallel Framework
Coupling between Meshless FEM Modeling and Rendering on GPU for Real-time Physically-based Volumetric Deformation
Coupling Lattice Boltzmann Gas and Level Set Method for Simulating Free Surface Flow in GPU/CUDA Environment
COVRA: A compression-domain output-sensitive volume rendering architecture based on a sparse representation of voxel blocks
COX: CUDA on X86 by Exposing Warp-Level Functions to CPUs
COX: Exposing CUDA Warp-Level Functions to CPUs
cphVB: A System for Automated Runtime Optimization and Parallelization of Vectorized Applications
Cpp-Taskflow: A General-purpose Parallel and Heterogeneous Task Programming System at Scale
CPPJoules: An Energy Measurement Tool for C++
CPU and GPU Co-processing for Sound
CPU and GPU Implementation of QCD by using OpenCL
CPU and/or GPU: Revisiting the GPU Vs. CPU Myth
CPU-GPU Algorithms for Triangular Surface Mesh Simplification
CPU-GPU co-execution through the exploitation of hybrid technologies via SYCL
CPU-GPU Collaboration for Output Quality Monitoring
CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction application
CPU-GPU Hybrid Parallel Binomial American Option Pricing
CPU-GPU Layer-Switched Low Latency CNN Inference
CPU, GPU and FPGA Implementations of MALD: Ceramic Tile Surface Defects Detection Algorithm
CPU, SMP and GPU implementations of Nohalo level 1, a fast co-convex antialiasing image resampler
CPU/GPGPU/HW comparison of an Eigenfaces face recognition system
CPU/GPU Code Acceleration on Heterogeneous Systems and Code Verification for CFD Applications
CPU/GPU computing for long-wave radiation physics on large GPU clusters
CPUless PCs inside networked control systems
CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM
Crack-free rendering of dynamically tesselated B-Rep models
Cracks in the Sky: Abelian-Higgs Cosmic String Evolution with CUDA
Cramming: Training a Language Model on a Single GPU in One Day
Crane - Fast and Migratable GPU Passthrough for OpenCL applications
Creating a Dataset for High-Performance Computing Code Translation using LLMs: A Bridge Between OpenMP Fortran and C++
Creating a Dataset Supporting Translation Between OpenMP Fortran and C++ Code
Creating HW/SW co-designed MPSoPC's from high level programming models
Creating Optimal Code for GPU-Accelerated CT Reconstruction Using Ant Colony Optimization
Creation and control of rain in virtual environments
CRINK: Automatic CUDA code generation for affine C programs
Critical Comparison of the Classification Ability of Deep Convolutional Neural Network Frameworks with Support Vector Machine Techniques in the Image Classification Process
Critical Links Detection using CUDA
Criticality of the XY model in complex topologies
CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads
Cropped Quad-Tree Based Solid Object Colouring with CUDA
Cross Teaching Parallelism and Ray Tracing: A Project-based Approach to Teaching Applied Parallel Computing
Cross-Compiling Shading Languages
Cross-Platform OpenCL Code and Performance Portability for CPU and GPU Architectures Investigated with a Climate and Weather Physics Model
Cross-Platform Performance Portability Using Highly Parametrized SYCL Kernels
Cross-platform programming model for many-core lattice Boltzmann simulations
CrossTL: A Universal Programming Language Translator with Unified Intermediate Representation
CrowdCL: Web-Based Volunteer Computing with WebCL
CRUM: Checkpoint-Restart Support for CUDA's Unified Memory
Cryptanalysis of the Full AES Using GPU-Like Special-Purpose Hardware
Cryptanalysis of the McEliece Cryptosystem on GPGPUs
CryptGPU: Fast Privacy-Preserving Machine Learning on the GPU
CryptoGraphics: Secret Key Cryptography Using Graphics Cards
Cryptography on Graphics Processing Unit: A Survey
CrystalGPU: Transparent and Efficient Utilization of GPU Power
CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication
CST: Constructive Solid Trimming for Rendering BReps and CSG
CT image reconstruction using hexagonal grids
CT image reconstruction with half precision floating-point values
CT to Cone-beam CT Deformable Registration With Simultaneous Intensity Correction
CU2CL: A CUDA-to-OpenCL Translator for Multi-and Many-core Architectures
CU2rCU: A CUDA-to-rCUDA Converter
CuBA - a CUDA implementation of BAMPS
cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on a GPU
CUBPT: Lock-free bulk insertions to B+ tree on GPU architecture
CuBridge: An LLM-Based Framework for Understanding and Reconstructing High-Performance Attention Kernels
cuCatch: A Debugging Tool for Efficiently Catching Memory Safety Violations in CUDA Applications
CUD@ASP: Experimenting with GPUs in ASP solving
CUD@SAT: SAT Solving on GPUs
CUDA 2D Stencil Computations for the Jacobi Method
CUDA Accelerated Entropy Constrained Vector Quantization and Multiple K-Means
CUDA Accelerated Face Recognition Using Local Binary Patterns
CUDA accelerated iris template matching on Graphics Processing Units (GPUs)
CUDA accelerated large scale vehicular area network simulator
CUDA Accelerated LTL Model Checking
CUDA Accelerated Robot Localization and Mapping
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging
CUDA and OpenCL-based asynchronous PSO
CUDA Application Design and Development
CUDA au Coq: A Framework for Machine-validating GPU Assembly Programs
CUDA Based CAMshift Algorithm for Object Tracking Systems
CUDA Based Enhanced Differential Evolution: a Computational Analysis
CUDA Based Fast Implementation of Very Large Matrix Computation
CUDA Based GPU Programming to Simulate 3D Tissue Deformation
CUDA based iterative methods for linear systems
CUDA Based Multi Objective Parallel Genetic Algorithms: Adapting Evolutionary Algorithms for Document Searches
CUDA Based Performance Evaluation of the Computational Efficiency of the DCT Image Compression Technique on Both the CPU and GPU
CUDA Based Polyphase Filter
CUDA by Example: An Introduction to General-Purpose GPU Programming
CUDA Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography
CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment
CUDA cuts: Fast graph cuts on the GPU
CUDA Enhanced Filtering in a Pipelined Video Processing Framework
CUDA Enhanced Simulated Annealing for Chip Layout Problem
CUDA Expression Templates
CUDA Fortran for Scientists and Engineers
CUDA Implementation in the EM Scattering of a Three-Layer Canopy
CUDA Implementation of ${\rm TE}^{z}$-FDTD Solution of Maxwell's Equations in Dispersive Media
CUDA Implementation of a Lattice Boltzmann Method and Code Optimization
CUDA Implementation of Parallel Algorithms for Animal Noseprint Identification
CUDA implementation of the algorithm for simulating the epidemic spreading over large networks
CUDA implementation of the solution of a system of linear equations arising in an hp-Finite Element code
CUDA implementation of Wagener's 2D convex hull PRAM algorithm
Cuda K-Nn: application to the segmentation of the retinal vasculature within SD-OCT volumes of mice
CUDA Kernel Design for GPU-Based Beam Dynamics Simulations
CUDA Kernel Design for GPU-Based Beam Dynamics Simulations
CUDA Leaks: Information Leakage in GPU Architectures
CUDA Memory Optimizations for Large Data-Structures in the Gravit Simulator
CUDA method for the FDTD simulation by GPU
CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms
CUDA Parallel Algorithms for Forward and Inverse Structural Gravity Problems
CUDA performance analyzer
CUDA Programming: A Developer's Guide to Parallel Computing with GPUs
CUDA programs for GPU computing of Swendsen-Wang multi-cluster spin flip algorithm: 2D and 3D Ising, Potts, and XY models
CUDA raytracing algorithm for visualizing discrete element model output
CUDA simulations of active dumbbell suspensions
CUDA Tutorial - Cryptanalysis of Classical Ciphers Using Modern GPUs and CUDA
CUDA-Accelerated Data-Mining for Putative Heteromeric Transcription Factors and Target Genes Using Microarray Gene Expression Profiles
CUDA-Accelerated Geodesic Ray-Tracing for Fiber Tracking
CUDA-Accelerated HD-ODETLAP: Lossy High Dimensional Gridded Data Compression
CUDA-accelerated Hierarchical K-means
CUDA-Accelerated ODETLAP: A Parallel Lossy Compression Implementation
CUDA-API-wrappers: Thin C++-flavored wrappers for the CUDA runtime API
CUDA-based acceleration and algorithm refinement for volume image registration
CUDA-based AES parallelization with fine-tuned GPU memory utilization
CUDA-based GPU Implementation of Hierarchical Belief Propagation for Fast Stereo Matching
CUDA-Based Jacobi's Iterative Method
CUDA-Based Radiative Transfer Method with Application to the EM Scattering from a Two-Layer Canopy Model
CUDA-based real time surgery simulation
CUDA-based Signed Distance Field Calculation for Adaptive Grids
CUDA-BLASTP: Accelerating BLASTP on CUDA-Enabled Graphics Hardware
CUDA-C implementation of the ADER-DG method for linear hyperbolic PDEs
CUDA-enabled LBM Flow Simulation around Three Equilateral Cylinders using GPU Computing Processor
CUDA-enabled Optimisation of Technical Analysis Parameters
cuda-kat: The CUDA Kernel Author's Toolkit
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
CUDA-level performance with python-level productivity for Gaussian mixture model applications
CUDA-Lite: Reducing GPU programming complexity
CUDA-LLM: LLMs Can Write Efficient CUDA Kernels
CUDA-MEME: Accelerating Motif Discovery in Biological Sequences Using CUDA-enabled Graphics Processing Units
CUDA-OpenGL Interoperability to Visualize Electromagnetic Fields Calculated by FDTD
CUDA-Zero: a framework for porting shared memory GPU applications to multi-GPUs
CUDA: Scalable parallel programming for high-performance scientific computing
cudaBayesreg: Parallel Implementation of a Bayesian Multilevel Model for fMRI Data Analysis
CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
CUDABench: Benchmarking LLMs for Text-to-CUDA Generation
CudaChain: A Practical GPU-accelerated 2D Convex Hull Algorithm
CUDACL: A tool for CUDA and OpenCL programmers
CUDACLAW: a Data Parallel Solution Framework for Hyperbolic PDEs
CUDACS: securing the cloud with CUDA-enabled secure virtualization
CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization
CUDAEASY - a GPU Accelerated Cosmological Lattice Program
CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization
CudaGIS: Report on the Design and Realization of a Massive Data Parallel GIS on GPUs
Cudagrind: A Valgrind Extension for CUDA
CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs
CudaHull: Fast Parallel 3D Convex Hull on the GPU
CUDAICA: GPU optimization of Infomax-ICA EEG analysis
CUDAlign: using GPU to accelerate the comparison of megabase genomic sequences
cudaMap: a GPU accelerated program for gene expression connectivity mapping
CudaRF: A CUDA-based Implementation of Random Forests
CUDArray: CUDA-based NumPy
CUDASA: Compute Unified Device and Systems Architecture
CUDASW++ 2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions
CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions
CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units
cuDNN: Efficient Primitives for Deep Learning
CUDT: A CUDA Based Decision Tree Algorithm
Cue-independent extending inverse kinematics for robust pose estimation in 3D point clouds
CUED-RNNLM - An Open-Source Toolkit for Efficient Training and Evaluation of Recurrent Neural Network Language Models
cufftShift: High Performance CUDA-accelerated FFT-shift Library
cuFINUFFT: a load-balanced GPU library for general-purpose nonuniform FFTs
CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries
CUgrep: A GPU-based high performance multi-string matching system
cuGWAM: Genome-wide association multifactor dimensionality reduction using CUDA-enabled high-performance graphics processing unit
CuHMMer: A load-balanced CPU-GPU cooperative bioinformatics application
cuIBM -- A GPU-accelerated Immersed Boundary Method
cuInspiral: prototype gravitational waves detection pipeline fully coded on GPU using CUDA
CUKNN: A parallel implementation of K-nearest neighbor on CUDA-enabled GPU
CULA: hybrid GPU accelerated linear algebra routines
CuLDA_CGS: Solving Large-scale LDA Problems on GPUs
cuLGT: Lattice Gauge Fixing on GPUs
CULLIDE: interactive collision detection between complex models in large environments using graphics hardware
CuMAPz: a tool to analyze memory access patterns in CUDA
CuMF_SGD: Fast and Scalable Matrix Factorization
CuMF: scale matrix factorization using just ONE machine with GPUs
CuNesl: Compiling Nested Data-Parallel Languages for SIMT Architectures
CuNeuQuant: A CUDA Implementation of the NeuQuant Image Quantization Algorithm
CuParcone: A High-Performance Evolvable Neural Network Model
CuPBoP-AMD: Extending CUDA to AMD Platforms
CuPBoP: CUDA for Parallelized and Broad-range Processors
CuPBoP: Making CUDA a Portable Language
cuPC: CUDA-based Parallel PC Algorithm for Causal Structure Learning on GPU
cuPentBatch - A batched pentadiagonal solver for NVIDIA GPUs
cuPilot: A Strategy-Coordinated Multi-agent Framework for CUDA Kernel Evolution
CuPP - A framework for easy CUDA integration
cuPSO: GPU Parallelization for Particle Swarm Optimization Algorithms
CURFIL: Random Forests for Image Labeling on GPU
Curling and clumping fur represented by texture layers
Curracurrong: a stream processing system for distributed environments
Current and Nascent SETI Instruments in the Radio and Optical
Current performance gains from utilizing the GPU or the ASIC MDGRAPE-3 within an enhanced Poisson Boltzmann approach
CUSA and CUDE: GPU-accelerated methods for estimating solvent accessible surface area and desolvation
cusFFT: A High-Performance Sparse Fast Fourier Transform Algorithm on GPUs
CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform
CUSIMANN: An optimized simulated annealing software for GPUs
cuSLINK: Single-linkage Agglomerative Clustering on the GPU
cuSten - CUDA Finite Difference and Stencil Library
Custom Code Generation for a Graph DSL
Customizable Domain-Specific Computing
Customizable Memory Schemes for Data Parallel Accelerators
Customization of OpenCL Applications for Efficient Task Mapping under Heterogeneous Platform Constraints
Customizing Driving Directions with GPUs
Customizing Instruction Set Extensible Reconfigurable Processors using GPUs
cuSZ-I: High-Fidelity Error-Bounded Lossy Compression for Scientific Data on GPUs
cuSZ(x): Optimizing Error-Bounded Lossy Compression for Scientific Data on GPUs
cuSZp2: A GPU Lossy Compressor with Extreme Throughput and Optimized Compression Ratio
CUTE solutions for two-point correlation functions from large cosmological datasets
CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe
cuTT: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs
CUVLE: Variable-Length Encoding on CUDA
cuZK: Accelerating Zero-Knowledge Proof with A Faster Parallel Multi-Scalar Multiplication Algorithm on GPUs
CVC: The Contourlet Video Compression algorithm for real-time applications
CVPI: A Computer Vision Library For Mobile and Embedded Platforms
Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed-Precision Multigrid
Cytochrome P450 site of metabolism prediction from 2D topological fingerprints using GPU accelerated probabilistic classifiers
CytonRL: an Efficient Reinforcement Learning Open-source Toolkit Implemented in C++
D-face: Parallel Implementation of CNN Based Face Classifier using Drone Data On K40 & Jetson TK1
D5.5.2 - Architectural Techniques to exploit SLACK & ACCURACY trade-offs
D5.5.3 - Design and implementation of the SIMD-MIMD GPU architecture
D5.5.4 - Characterization of Redundancy and Definition of Work Reuse
Daino: A High-level Framework for Parallel and Efficient AMR on GPUs
Daisen: A Framework for Visualizing Detailed GPU Execution
DAMS: distributed adaptive metaheuristic selection
Dandelion: a Compiler and Runtime for Heterogeneous Systems
Dank Learning: Generating Memes Using Deep Neural Networks
Dark Sky Simulations: Early Data Release
Darknet on OpenCL: a multi-platform tool for object detection and classification
DarKnight: An Accelerated Framework for Privacy and Integrity Preserving Deep Learning Using Trusted Hardware
Data access optimized applications on the GPU using NVIDIA CUDA
Data Acquisition with GPUs: The DAQ for the Muon g-2 Experiment at Fermilab
Data analysis and 3D evolution in High Energy Physics using graphic processor
Data Analysis of Minimally-Structured Heterogeneous Logs: An experimental study of log template extraction and anomaly detection based on Recurrent Neural Network and Naive Bayes
Data Assimilation using a GPU Accelerated Path Integral Monte Carlo Approach
Data Buffering Optimization Methods toward a Uniform Programming Interface for GPU-based Applications
Data classification for artificial intelligence construct training to aid in network incident identification using network telescope data
Data Coherence Analysis and Optimization for Heterogeneous Computing
Data Compression using CUDA programming in GPU
Data driven scheduling approach for the multi-node multi-GPU Cholesky decomposition
Data handling inefficiencies between CUDA, 3D rendering, and system memory
Data Layout Optimization for Multi-Valued Containers in OpenCL
Data Layout Oriented Compilation Techniques in Vectorization for Multi-/Many-cores
Data Layout Pruning on GPU
Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications
Data Layout Transformation for Structured-Grid Codes on GPU
Data Mining and Machine Learning in Astronomy
Data Mining Techniques in Parallel and Distributed Environment - A Comprehensive Survey
Data Mining Using Graphics Processing Units
Data Movement Optimization for High-Performance Computing
Data parallel acceleration of decision support queries using Cell/BE and GPUs
Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL
Data parallel execution challenges and runtime performance of agent simulations on GPUs
Data parallel loop statement extension to CUDA: GpuC
Data parallel patterns on CPU/GPU mix
Data Parallel Quadtree Indexing and Spatial Query Processing of Complex Polygon Data on GPUs
Data Parallel Three-Dimensional Cahn-Hilliard Field Equation Simulation on GPUs with CUDA
Data Parallel Visualization and Rendering on the RAMSES Supercomputer with ANARI
Data Parallelism Exploiting for H.264 Encoder
Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications
Data registration module - a component of semantic simulation engine
Data Regression with Normal Equation on GPU using CUDA
Data Remanence and Digital Forensic Investigation for CUDA Graphics Processing Units
Data Sorting Using Graphics Processing Units
Data Stream Classification using Random Feature Functions and Novel Method Combinations
Data structure design for GPU based heterogeneous systems
Data Structures and Algorithms for Counting Problems on Graphs using GPU
Data Structures and Transformations for Physically Based Simulation on a GPU
Data Structures for Task-based Priority Scheduling
Data Transfer Matters for GPU Computing
Data transfer optimizations for heterogeneous managed runtime systems
Data transformations enabling loop vectorization on multithreaded data parallel architectures
Data Triage and Visual Analytics for Scientific Visualization
Data Visualization and Mining using the GPU
Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory
Data-Aware Task Scheduling on Multi-accelerator Based Platforms
Data-Driven Analysis and Design of Vulkan Ray-Tracing Applications using Automatic Instrumentation
Data-Driven Dynamic Autotuning: Optimizing Autotuning Overhead with Prior Tuning Data
Data-driven Forecasting of Deep Learning Performance on GPUs
Data-driven Performance Optimization for Data-intensive Applications
Data-Driven Programming Abstractions and Optimization for Multi-Core Platforms
Data-driven versus Topology-driven Irregular Computations on GPUs
Data-efficient LLM Fine-tuning for Code Generation
Data-intensive document clustering on GPU clusters
Data-intensive document clustering on graphics processing unit (GPU) clusters
Data-Oriented Language Implementation of Lattice-Boltzmann Method for Dense and Sparse Geometries
Data-parallel Acceleration of PARSEC Black-Scholes Benchmark
Data-parallel algorithms and data structures
Data-parallel algorithms for large-scale real-time simulation of the cellular potts model on graphics processing units
Data-parallel computing
Data-Parallel Construction of delta_N-Nets with Maximum Dispersion
Data-Parallel Flattening by Expansion
Data-Parallel Hashing Techniques for GPU Architectures
Data-Parallel Octrees for Surface Reconstruction
Data-Parallelism and GPUs for Lattice Gas Fluid Simulations
Data-rich astronomy: mining synoptic sky surveys
Database Operation Development on the GPU
Dataflow-based Design and Implementation of Image Processing Applications
Dataflow-Based Implementation of Layered Sensing Applications
Dataflow-driven GPU performance projection for multi-kernel transformations
Dataloader Parameter Tuner: An Automated Dataloader Parameter Tuner for Deep Learning Models
Datalog for GPUs
Dato: A Task-Based Programming Model for Dataflow Accelerators
Daubechies wavelets for high performance electronic structure calculations: The BigDFT project
daVinci-kernel: Co-Evolving Skill Selection, Summarization, and Utilization via RL for GPU Kernel Optimization
Dawn of GPU Era-Potentials of Chaos Theory
DawnCC: a Source-to-Source Automatic Parallelizer of C and C++ Programs
Dax Toolkit: A Proposed Framework for Data Analysis and Visualization at Extreme Scale
Dax: Data Analysis at Extreme
DBCSR: A Library for Dense Matrix Multiplications on Distributed GPU-Accelerated Systems
DBMS Index for Hierarchical Data Using Nested Intervals and Residue Classes
DC Power Flow Based Contingency Analysis Using Graphics Processing Units
DC Power Flow Based Contingency Analysis Using Graphics Processing Units (thesis)
DCT-JPEG Image Coding Based on GPU
dCUDA: hardware supported overlap of computation and communication
De-specializing an HLS library for Deep Neural Networks: improvements upon hls4ml
Dealing With Big Data Outside Of The Cloud: GPU Accelerated Sort
Debugging CUDA
Debugging GPU stream programs through automatic dataflow recording and visualization
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
Debunking the CUDA Myth Towards GPU-based AI Systems
DecGPU: distributed error correction on massively parallel graphics processing units using CUDA and MPI
Declarative Parallel Programming for GPUs
Decoding with Finite-State Transducers on GPUs
Decompilation of LLVM IR
Decompiling x86 Deep Neural Network Executables
Decoupled Access/Execute Metaprogramming for GPU-Accelerated Systems
Decoupled Block-Wise ILU(k) Preconditioner on GPU
Decoupled Deferred Shading for Hardware Rasterization
Decoupled Triton: A Block-Level Decoupled Language for Writing and Exploring Efficient Machine-Learning Kernels
Decoupled Vector-Fetch Architecture with a Scalarizing Compiler
Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines
Decoupling algorithms from the organization of computation for high performance image processing
Decreasing NAME III Solution Time Using GP-GPU
Decryption-decompression of AES protected ZIP files on GPUs
Deductive verification for SYCL
Deep and Shallow convections in Atmosphere Models on Intel Xeon Phi Coprocessor Systems
Deep API Learning
Deep Architectures for Neural Machine Translation
Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition
Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition
Deep Convolutional Network evaluation on the Intel Xeon Phi: Where Subword Parallelism meets Many-Core
Deep convolutional networks for pancreas segmentation in CT imaging
Deep Convolutional Neural Networks for Smile Recognition
Deep Dynamic Neural Networks for Gesture Segmentation and Recognition
Deep Feature-based Face Detection on Mobile Devices
Deep Fluids: A Generative Network for Parameterized Fluid Simulations
Deep Graph Learning for Program Analysis and System Optimization
Deep Graph Library Optimizations for Intel(R) x86 Architecture
Deep In-GPU Experience Replay
Deep Kernel Fusion for Transformers
Deep Language Models for Software Testing and Optimisation
Deep Learning and Machine Learning with GPGPU and CUDA: Unlocking the Power of Parallel Computing
Deep Learning Application in Plant Stress Imaging: A Review
Deep Learning Approaches to Source Code Analysis for Optimization of Heterogeneous Systems: Recent Results, Challenges and Opportunities
Deep Learning At Scale and At Ease
Deep Learning Based FPGA-CPU Acceleration
Deep Learning by Doing: The NVIDIA Deep Learning Institute and University Ambassador Program
Deep Learning for Compilers
Deep Learning for Computational Chemistry
Deep Learning for Computer Vision: A comparison between Convolutional Neural Networks and Hierarchical Temporal Memories on object recognition tasks
Deep Learning for Digital Asset Limit Order Books
Deep learning for galaxy surface brightness profile fitting
Deep Learning for Mortgage Risk
Deep Learning for Obfuscated Code Analysis
Deep Learning For Smile Recognition
Deep Learning in the Automotive Industry: Applications and Tools
Deep Learning Inference on Heterogeneous Mobile Processors: Potentials and Pitfalls
Deep Learning Model Security: Threats and Defenses
Deep Learning Models on CPUs: A Methodology for Efficient Training
Deep Learning on FPGAs
Deep Learning on FPGAs: Past, Present, and Future
Deep learning review and its applications
Deep learning with COTS HPC systems
Deep Learning with GO
Deep Learning Workload Scheduling in GPU Datacenters: A Survey
Deep learning: A guide for practitioners in the physical sciences
Deep Neural Machine Translation with Weakly-Recurrent Units
Deep neural networks for direct, featureless learning through observation: the case of 2d spin models
Deep Neural Networks to Enable Real-time Multimessenger Astrophysics
Deep Roots: Improving CNN Efficiency with Hierarchical Filter Groups
Deep Shadow Maps from Volumetric Data on the GPU
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Deep Tensor Convolution on Multicores
Deep Voice 3: 2000-Speaker Neural Text-to-Speech
Deep Voice: Real-time Neural Text-to-Speech
Deep-Edge: An Efficient Framework for Deep Learning Model Update on Heterogeneous Edge
Deep, Big, Simple Neural Nets for Handwritten Digit Recognition
Deep, Dense, and Low-Rank Gaussian Conditional Random Fields
DeepAxe: A Framework for Exploration of Approximation and Reliability Trade-offs in DNN Accelerators
DeepBach: a Steerable Model for Bach chorales generation
DeepBE: Learning Deep Binary Encoding for Multi-Label Classification
DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training
DeepDSL: A Compilation-based Domain-Specific Language for Deep Learning
DeeperLab: Single-Shot Image Parser
DeepfakeUCL: Deepfake Detection via Unsupervised Contrastive Learning
DeepFont: Identify Your Font from An Image
DeepLearningKit - an GPU Optimized Deep Learning Framework for Apple's iOS, OS X and tvOS developed in Metal and Swift
DeepLearningKit - an Open Source Deep Learning Framework for Apple's iOS, OS X and tvOS developed in Metal and Swift
DeepMetabolism: A Deep Learning System to Predict Phenotype from Genome Sequencing
DeepMon: Mobile GPU-based Deep Learning Framework for Continuous Vision Applications
DeepProf: Performance Analysis for Deep Learning Applications via Mining GPU Execution Patterns
DeepPy: Pythonic deep learning
DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence
DeepSmith: Compiler Fuzzing through Deep Learning
DeepSpark: Spark-Based Deep Learning Supporting Asynchronous Updates and Caffe Compatibility
DeepSpeech: Scaling up end-to-end speech recognition
DeepSZ: A Novel Framework to Compress Deep Neural Networks by Using Error-Bounded Lossy Compression
DeepX: A Software Accelerator for Low-Power Deep Learning Inference on Mobile Devices
DEF-G: Declarative Framework for GPU Environment
Defocus Magnification with CUDA
Deformable model collision detection using A-buffer
Deformable object simulation in virtual environment
Deformation modeling using global medial representation structures and evaluation by biset mesh matching
Deformation of skeleton based implicit objects
Deforming a High-Resolution Mesh in Real-Time by Mapping onto a Low-Resolution Physical Model
Delaunay Triangulation in R3 on the GPU
Delivering Performance-Portable Stencil Computations on CPUs and GPUs Using Bricks
Delta-stepping: a parallelizable shortest path algorithm
DEM based simulation of concrete structures on GPU
DEMCMC-GPU: An Efficient Multi-Objective Optimization Method with GPU Acceleration on the Fermi Architecture
Democratic Population Decisions Result in Robust Policy-Gradient Learning: A Parametric Study with GPU Simulations
Democratizing General Purpose GPU Programming through OpenCL and Scala
Demonstrating Self-Learning Algorithm Adaptivity in a Hardware-Oblivious Database Engine
Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs
Demystifying Dependency Bugs in Deep Learning Stack
Demystifying GPU microarchitecture through microbenchmarking
Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
Demystifying the MLPerf Benchmark Suite
Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis
Denoising Volumetric Data on GPU
Dense and sparse parallel linear algebra algorithms on graphics processing units
Dense Arithmetic over Finite Fields with the CUMODP Library
Dense Dynamic Programming on Multi GPU
Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach
Dense linear algebra solvers for multicore with GPU accelerators
Dense Matrix Algebra on the GPU
Dense Matrix Computation on a Heterogenous Architecture: A Block Synchronous Approach
Dense optical flow by iterative local window registration
Dense photometric stereo reconstruction on many core GPUs
Dense Photometric Stereo: A Markov Random Field Approach
Dense point trajectories by GPU-accelerated large displacement optical flow
Dense Real-Time Mapping of Object-Class Semantics from RGB-D Video
Dense Symmetric Indefinite Factorization on GPU Accelerated Architectures
DenseCut: Densely Connected CRFs for Realtime GrabCut
Density Estimations for Approximate Query Processing on SIMD Architectures
Density functional theory calculation on many-cores hybrid central processing unit-graphic processing unit architectures
Density Functional Theory calculation on many-cores hybrid CPU-GPU architectures
Density-based clustering using graphics processors
Density-based parallel skin lesion border detection with webCL
Dependable Embedded Systems
Deploying Graph Algorithms on GPUs: an Adaptive Solution
Deployment of CPU and GPU-based genetic programming on heterogeneous devices
Deployment of parallel linear genetic programming using GPUs on PC and video game console platforms
Depth Enhanced Panoramas
Depth Estimation using Open Compute Language (OpenCL)
Depth Images: Representations and Real-Time Rendering
Depth Map Based Superresolution Method in 3D Reconstruction
Depth map enhanced macroblock partitioning for H.264 video coding of computer graphics content
Depth-Dependent Halos: Illustrative Rendering of Dense Line Data
Depth-First Search versus Jurema Search on GPU Branch-and-Bound Algorithms: a case study
Depth-of-Field Blur Effects for First-Person Navigation in Virtual Environments
Deriving Shape Grammars on the GPU
Descend: A Safe GPU Systems Programming Language
Design and Analysis of Soft-Error Resilience Mechanisms for GPU Register File
Design and Development of an Efficient H.264 Video Encoder for CPU/GPU using OpenCL
Design and Development of Optical Flow Based Obstacle Avoidance Using CUDA
Design and evaluation of a parallel k-nearest neighbor algorithm on CUDA-enabled GPU
Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures
Design and implementation of a high-performance stream-based computing platform on multigenerational GPUs
Design and Implementation of a PTX Emulation Library
Design and implementation of a time-division multiplexing scan architecture using serializer and deserializer in GPU chips
Design and Implementation of Centrally-Coordinated Peer-to-Peer Live-streaming
Design and Implementation of CNN-FPGA accelerator based on Open Computing Language
Design and Implementation of GPU-Based Prim's Algorithm
Design and implementation of MPEG audio layer III decoder using graphics processing units
Design and Implementation of ShenWei Universal C/C++
Design and implementation of software-managed caches for multicores with local memory
Design and Implementation of the Futhark Programming Language
Design and implementation of the Smith-Waterman algorithm on the CUDA-compatible GPU
Design and Modeling of a Non-blocking Checkpointing System
Design and optimization of a portable LQCD Monte Carlo code using OpenACC
Design and optimization of DBSCAN Algorithm based on CUDA
Design and Optimization of Hybrid MD5-Blowfish Encryption on GPUs
Design and Optimization of Image Processing Algorithms on Mobile GPU
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
Design and Optimization of OpenFOAM-based CFD Applications for Modern Hybrid and Heterogeneous HPC Platforms
Design and Performance Analysis of Parallel Processing of SRTP Packets
Design and performance evaluation of a digital wideband receiver on a hybrid computing platform
Design and Performance Evaluation of a Software Framework for Multi-Physics Simulations on Heterogeneous Supercomputers
Design and Performance Evaluation of Image Processing Algorithms on GPUs
Design and Performance Evaluation of Optimizations for OpenCL FPGA Kernels
Design and Performance of the OP2 Library for Unstructured Mesh Applications
Design and Storage Optimization of GPU-based Parallel Program of Image Registration for Remote Sensing
Design and study of a massively multi threaded shared memory architecture
Design Exploration of AES Accelerators on FPGAs and GPUs
Design Exploration of Quadrature Methods in Option Pricing
Design of 3D FFT on Multi-GPU Clusters
Design of a fully programmable shader processor for low power mobile devices
Design of a Hybrid Memory System for General-Purpose Graphics Processing Units
Design of a parallel AES for graphics hardware using the CUDA framework
Design of a programmable micro-ultrasound research platform
Design of an FPGA-Based FDTD Accelerator Using OpenCL
Design of FPGA-Based Accelerator for Convolutional Neural Network under Heterogeneous Computing Framework with OpenCL
Design of Hardware Accelerator for Lempel-Ziv 4 (LZ4) Compression
Design of high-performance parallelized gene predictors in MATLAB
Design of MILC Lattice QCD Application for GPU Clusters
Design optimization of automotive electronic control unit using the analysis of common-mode current by fast electromagnetic field solver
Design Principles for Sparse Matrix Multiplication on the GPU
Design Space Exploration for GPU-Based Architecture
Design Space Exploration of an OpenCL Based SAXPY Kernel Implementation on FPGAs
Design Space Exploration of Concurrency Mapping to FPGAs in Weather and Climate Applications with Xilinx SDSoC OpenCL, SDSoC C++ and Vivad
Design Space Exploration of OpenCL Applications on Heterogeneous Parallel Platforms
Design Space Exploration of Real-time Bedside and Portable Medical Ultrasound Adaptive Beamformer Acceleration
Design space exploration towards a realtime and energy-aware GPGPU-based analysis of biosensor data
Design Tools for Accelerating Development and Usage of Multi-Core Computing Platforms
Design, Implementation and Performance Evaluation of a Stochastic Gradient Descent Algorithm on CUDA
Design, Implementation and Test of Efficient GPU to GPU Communication Methods
Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs
Designing a high-performance boundary element library with OpenCL and Numba
Designing a Modern Skeleton Programming Framework for Parallel and Heterogeneous Systems
Designing a Unified Programming Model for Heterogeneous Machines
Designing and optimizing compute kernels on NVIDIA GPUs
Designing Bit-Reproducible Portable High-Performance Applications
Designing Efficient Barriers and Semaphores for Graphics Processing Units
Designing Efficient Many-Core Parallel Algorithms for All-Pairs Shortest-Paths Using CUDA
Designing Efficient MPI and UPC Runtime for Multicore Clusters with InfiniBand, Accelerators and Co-Processors
Designing efficient sorting algorithms for manycore GPUs
Designing Fast Architecture Sensitive Tree Search on Modern Multi-Core/Many-Core Processors
Designing Fast LTL Model Checking Algorithms for Many-Core GPUs
Designing Numerical Solvers for Next Generation High Performance Computing
Designing OP2 for GPU architectures
Designing scalable many-core parallel algorithms for min graphs using CUDA
Designing Scientific Applications on GPUs
Designing the Language Liszt for Building Portable Mesh-based PDE Solvers
Detecting Computer Viruses using GPUs
Detecting Data Races on OpenCL Kernels with Symbolic Execution
Detecting multiple periodicities in observational data with the multi-frequency periodogram. II. Frequency Decomposer, a parallelized time-series analysis algorithm
Detecting parametric objects in large scenes by Monte Carlo sampling
Detection of a faint fast-moving near-Earth asteroid using synthetic tracking technique
Detection of collisions and self-collisions using image-space techniques
Detection of retransmissions in 10G Ethernet using GPUs
Determinant Computation on the GPU using the Condensation Method
Determining the difficulty of accelerating problems on a GPU
Deterministic Parallelism
Deterministic Sample Sort For GPUs
Developing a compiler for the XeonPhi
Developing a CUDA solver for large sparse matrices for MARIN
Developing a High Performance GPGPU Compiler Using Cetus
Developing a High Performance Software Library with MPI and CUDA for Matrix Computations
Developing a massive real-time crowd simulation framework on the GPU
Developing a New Storage Format and a Warp-Based SpMV Kernel for Configuration Interaction Sparse Matrices on the GPU
Developing acquisition systems based on FPGA with OpenCL
Developing an OO Model for Generalized Matrix Multiplication: Preliminary Considerations
Developing and Deploying Advanced Algorithms to Novel Supercomputing Hardware
Developing and Evaluating clOpenCL Applications for Heterogeneous Clusters
Developing Extensible Lattice-Boltzmann Simulators for General-Purpose Graphics-Processing Units
Developing Performance-Portable Molecular Dynamics Kernels in OpenCL
Development and evaluation of a GPU-optimized N-body term for the simulation of biomolecules
Development and evaluation of scalable video motion estimators on GPU
Development methodologies for GPU and cluster of GPUs
Development of a Chemically Reacting Flow Solver on the Graphic Processing Units
Development of a CUDA Implementation of the 3D FDTD Method
Development of a Flow Solver with Complex Kinetics on the Graphic Processing Units
Development of a GPU based two-way time transfer modem
Development of a GPU-accelerated MIKE 21 Solver for Water Wave Dynamics
Development of a GPU-based High-Performance Radiative Transfer Model for the Infrared Atmospheric Sounding Interferometer (IASI)
Development of a GPU-based Monte Carlo dose calculation code for coupled electron-photon transport
Development of a GPU-based multithreaded software application to calculate digitally reconstructed radiographs for radiotherapy
Development of a new framework for high performance volunteer computing
Development of a Restricted Additive Schwarz Preconditioner for Sparse Linear Systems on NVIDIA GPU
Development of a volume rendering system using 3D texture compression techniques on general-purpose personal computers
Development of an Algorithm for Extracting Parallelism and Pipeline Structure from Stream-based Processing flow with Spanning Tree
Development of an explicit pressure-based unstructured solver for three-dimensional incompressible flows with graphics hardware acceleration
Development of an unified FDTD-FEM library for electromagnetic analysis with CPU and GPU computing
Development of Bayesian analysis program for extraction of polarisation observables at CLAS
Development of Generic Scheduling Concepts for OpenGL ES 2.0
Development of High-Performance Software Components for Emerging Architectures
Development of JavaScript-based deep learning platform and application to distributed training
Development of Krylov and AMG linear solvers for large-scale sparse matrices on GPUs
Development of methods for the processing of mining images using genetic algorithms
Development of nonlinear filter bank system for real-time beautification of facial video using GPGPU
Development of Parallel Architectures for Radar/Video Signal Processing Applications
Development of Parallel Computation Tools
Development of Virtual Machine Tool for Simulation and Evaluation
Developmental Directions in Parallel Accelerators
Device Placement Optimization with Reinforcement Learning
Device specialization in heterogeneous multi-GPU environments
Devito: automated fast finite difference computation
DFG Implementation on Multi GPU Cluster with Computation-Communication Overlap
DGEMM on Integer Matrix Multiplication Unit
DGEMM without FP64 Arithmetic - using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme
Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4
Diagnosing Performance Bottlenecks in HPC Applications
Diagnosis, Tuning, and Redesign for Multicore Performance: A Case Study of the Fast Multipole Method
Diagrammatic Determinantal Quantum Monte Carlo Calculations on GPUs
DIANNE: Distributed Artificial Neural Networks for the Internet of Things
DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels
Diderot: A Parallel DSL for Image Analysis and Visualization
DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation
Different Optimization Strategies and Performance Evaluation of Reduction on Multicore CUDA Architecture
Differential evolution algorithm on the GPU with C-CUDA
Differential Evolution with parallelised objective functions using CUDA
Diffusion Curves: A Vector Representation for Smooth-Shaded Images
Digital beamforming using a GPU
Digital Marbling: a GPU Approach with Precomputed Velocity Field
Digital Signal Processing using Stream High Performance Computing: A 512-input Broadband Correlator for Radio Astronomy
Digitize Your Body and Action in 3-D at Over 10 FPS: Real Time Dense Voxel Reconstruction and Marker-less Motion Tracking via GPU Acceleration
Diplomat: Mapping of multi-kernel applications using a static dataflow abstraction
Direct Communication Methods for Distributed GPUs
Direct deconvolution of radio synthesis images using L1 minimisation
Direct evaluation of NURBS curves and surfaces on the GPU
Direct GPU Compilation and Execution for Host Applications with OpenMP Parallelism
Direct GPU/FPGA Communication Via PCI Express
Direct N-body code on low-power embedded ARM GPUs
Direct N-body Kernels for Multicore Platforms
Direct N-body simulations of globular clusters: (I) Palomar 14
Direct Numeric Simulation of Sheared Convective Boundary Layer Entrainment with GPUs
Direct Numerical Simulation and Large Eddy Simulation on a Turbulent Wall-Bounded Flow Using Lattice Boltzmann Method and Multiple GPUs
Direct numerical simulation of sub-grid structures in gas-solid flow --- GPU implementation of macro-scale pseudo-particle modeling
Direct Numerical Simulation of Turbulence on Heterogeneous Computer Systems: Architectures, Algorithms, and Applications
Direct Numerical Simulation of Turbulent Flows with Parallel Algorithms for Various Computing Architectures
Direct Point Rendering on GPU
Direct Self-Consistent Field Computations on GPU Clusters
Direct solution of the Boltzmann equation for a binary mixture on GPUs
Direct Visualization of Particle-Partition of Unity Data
Direct Volume Editing
Direct-to-indirect transfer for cinematic relighting
directCell: hybrid systems with tightly coupled accelerators
Directionally Unsplit Hydrodynamic Schemes with Hybrid MPI/OpenMP/GPU Parallelization in AMR
Directive-based Approach to Heterogeneous Computing
Directive-Based Compilers for GPUs
Directive-Based Data Partitioning and Pipelining and Auto-Tuning for High-Performance GPU Computing
Directive-Based Partitioning and Pipelining for Graphical Processing Units
Directive-Based, High-Level Programming and Optimizations for High-Performance Computing with FPGAs
Directives Based Programming of GPU Accelerated Systems
DISC: A Dynamic Shape Compiler for Machine Learning Workloads
Disc: Approximative Nearest Neighbor Search using Ellipsoids for Photon Mapping on GPUs
Discontinuous Galerkin Methods on Graphics Processing Units for Nonlinear Hyperbolic Conservation Laws
Discontinuous Galerkin Time Domain for Maxwell's equations on GPUs
Discrete fourier transform on multicore
Discrete Planning Unit Look-ahead Velocity Control Strategy and Parallelization Research based on GPU
Discrete Shearlet Transform on GPU with Applications in Anomaly Detection and Denoising
Discrete Wavelet Transform on Consumer-Level Graphics Hardware
Discrete-event Execution Alternatives on General Purpose Graphical Processing Units (GPGPUs)
Discriminative Convolutional Sum-Product Networks on GPU
Disjunctive Normal Networks
Dispersion Simulation and Visualization For Urban Security
Displacement Mapping on the GPU - State of the Art
Dissecting CPU-GPU Unified Physical Memory on AMD MI300A APUs
Dissecting GPU Memory Hierarchy through Microbenchmarking
Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numerical Behaviors
Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks
Dissecting the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level Analysis
Dissecting the NVidia Turing T4 GPU via Microbenchmarking
Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking
DISTAL: The Distributed Tensor Algebra Compiler
Distance field transform with an adaptive iteration method
Distance Fields Accelerated with OpenCL
Distance Threshold Similarity Searches on Spatiotemporal Trajectories using GPGPU
DistCL: A Framework for the Distributed Execution of OpenCL Kernels
Distortion correction algorithm for UAV remote sensing image based on CUDA
Distributed Calculations with Algorithmic Skeletons for Heterogeneous Computing Environments
Distributed computer emulation: Using OpenCL framework
Distributed Deep Learning Strategies For Automatic Speech Recognition
Distributed genetic programming on GPUs using CUDA
Distributed GPU Password Cracking Research Project
Distributed GPU Volume Rendering of ASKAP Spectral Data Cubes
Distributed learning of CNNs on heterogeneous CPU/GPU architectures
Distributed Massive Model Rendering
Distributed multi-node, multi-GPU, heterogeneous system for 3D image reconstruction in Electrical Capacitance Tomography - network performance and application analysis
Distributed OpenCL: Distributing OpenCL Platform on Network Scale
Distributed OpenCL: a platform for distributed, heterogeneous computing for domain scientists
Distributed OpenMP Offloading of OpenMC on Intel GPU MAX Accelerators
Distributed Password Cracking Platform
Distributed Texture Memory in a Multi-GPU Environment
Distributed time, conservative parallel logic simulation on GPUs
Distributed Training Large-Scale Deep Architectures
Distributed Training of Deep Neuronal Networks: Theoretical and Practical Limits of Parallel Scalability
Distributed wideband software-defined radio receiver for heterogeneous systems
Distributed-Shared CUDA: Virtualization of Large-Scale GPU Systems for Programmability and Reliability
Distributed, combined CPU and GPU profiling within HPX using APEX
DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs
Divergence Analysis
Divergence Analysis and Optimizations
Divergence Analysis with Affine Constraints
Divide and Conquer G-Buffer Ray Tracing
Divide-and-Conquer 3D Convex Hulls on the GPU
DiVinE-CUDA - A Tool for GPU Accelerated LTL Model Checking
DjiNN and Tonic: DNN as a Service and Its Implications for Future Warehouse Scale Computers
DL: A data layout transformation system for heterogeneous computing
DLIO: A Data-Centric Benchmark for Scientific Deep Learning Applications
DLL: A Blazing Fast Deep Neural Network Library
DMA-Assisted, Intranode Communication in GPU Accelerated Systems
dMath: A Scalable Linear Algebra and Math Library for Heterogeneous GP-GPU Architectures
dMath: Distributed Linear Algebra for DL
DNA sequence alignment: An assignment for OpenMP, MPI, and CUDA/OpenCL
DNN is not all you need: Parallelizing Non-Neural ML Algorithms on Ultra-Low-Power IoT Processors
DNNVM: End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-based CNN Accelerators
Doctor AI: Interpretable Deep Learning for Modeling Electronic Health Records
Document Classification Using KNN on GPU
Document Image Binarization Using Image Segmentation Algorithm in Parallel Environment
Document Stream Clustering using GPUs
Dogwild! - Distributed Hogwild for CPU & GPU
Domain Decomposition method on GPU cluster
Domain Specific Languages for High Performance Computing
Domain-Specific Acceleration and Auto-Parallelization of Legacy Scientific Code in FORTRAN 77 using Source-to-Source Compilation
Domain-Specific Code Language Models: Unraveling the Potential for HPC Codes and Tasks
Domain-Specific Languages for Heterogeneous Parallel Computing
Domain-Specific On-Device Object Detection Method
Domain-Specific Optimizations Supporting Real-Time Image Compression
DOPA: GPU-based protein alignment using database and memory access optimizations
dOpenCL - Evaluation of an API-Forwarding Implementation
Dopia: Online Parallelism Management for Integrated CPU/GPU Architectures
Double-Precision Floating-Point Data Visualizations Using Vulkan API
Double-precision FPUs in High-Performance Computing: an Embarrassment of Riches?
Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations
Dr.Jit: A Just-In-Time Compiler for Differentiable Rendering
Dragon-Alpha&cu32: A Java-based Tensor Computing Framework With its High-Performance CUDA Library
DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function
DRiVE: An Example of Distributed Rendering in Virtual Environments
Dropbear: Machine Learning Marketplaces made Trustworthy with Byzantine Model Agreement
DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation
Drug Drug Interaction Extraction from Biomedical Literature Using Syntax Convolutional Neural Network
DSDP: A Blind Docking Strategy Accelerated by GPUs
DSPSR: Digital Signal Processing Software for Pulsar Astronomy
DTAM: Dense tracking and mapping in real-time
Dual-RBF based surface reconstruction
Duality based optical flow algorithms with applications
DUODECIM - a structure for point scan compression and rendering
Duplicate Detection on GPUs
Dust-Dust Collisional Charging and Lightning in Protoplanetary Discs
DVM: Real-Time Kernel Generation for Dynamic AI Models
Dwarfs on Accelerators: Enhancing OpenCL Benchmarking for Heterogeneous Computing Architectures
Dymaxion: Optimizing Memory Access Patterns for Heterogeneous Systems
Dymaxion++: A Directive-based API to Optimize Data Layout and Memory Mapping for Heterogeneous Systems
Dynamic adaptation and distribution of binaries to heterogeneous architectures
Dynamic adaptation of broad phase collision detection algorithms
Dynamic Adaptation Techniques and Opportunities to Improve HPC Runtimes
Dynamic Application Autotuning for Self-Aware Approximate Computing
Dynamic autotuning of adaptive fast multipole methods on hybrid multicore CPU & GPU systems
Dynamic autotuning of SpMV kernel in CUSP library
Dynamic Buffer Overflow Detection for GPGPUs
Dynamic Compilation of Data-Parallel Kernels for Vector Processors
Dynamic Data Management Among Multiple Databases for Optimization of Parallel Computations in Heterogeneous HPC Systems
Dynamic Data Structures for Taskgraph Scheduling Policies with Applications in OpenCL Accelerators
Dynamic deformation textures: GPU-accelerated simulation of deformable models in contact
Dynamic detection of uniform and affine vectors in GPGPU computations
Dynamic Distribution Pruning for Efficient Network Architecture Search
Dynamic Feature-Adaptive Subdivision
Dynamic Fine-Grain Scheduling of Pipeline Parallelism
Dynamic GPU Energy Optimization for Machine Learning Training Workloads
Dynamic Heterogeneous Scheduling Decisions Using Historical Runtime Data
Dynamic IBR Techniques for Fixed Cost Stereoscopic Support
Dynamic Instrumentation and Optimization for GPU Applications
Dynamic Kernel/Device Mapping Strategies for GPU-assisted HPC Systems
Dynamic label placement for improved interactive exploration
Dynamic Load Balancing in GPU-Based Systems - Early Experiments
Dynamic load balancing on heterogeneous multicore/multiGPU systems
Dynamic Load Balancing on Massively Parallel Computer Architectures
Dynamic load balancing on single- and multi-GPU systems
Dynamic Load Balancing Strategies for Graph Applications on GPUs
Dynamic Load Balancing using Graphics Processors
Dynamic LOD on GPU
Dynamic loop vectorization for executing OpenCL kernels on CPUs
Dynamic Memory Allocation for OpenCL
Dynamic Memory Management on GPUs with SYCL
Dynamic Orchestration of Massively Data Parallel Execution
Dynamic Overset Grid Computations for CFD Applications on Graphics Processing Units
Dynamic Parallelism in GPU Optimized Barnes Hut Trees for Molecular Dynamics Simulations
Dynamic particle coupling for gpu-based fluid simulation
Dynamic Partitioning-based JPEG Decompression on Heterogeneous Multicore Architectures
Dynamic Programming with CUDA - Part II
Dynamic real-time 4D cardiac MDCT image display using GPU-accelerated volume rendering
Dynamic Sampling and Rendering of Algebraic Point Set Surfaces
Dynamic Scheduling for Large-Scale Distributed-Memory Ray Tracing
Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters
Dynamic scheduling Monte-Carlo framework for multi-accelerator heterogeneous clusters
Dynamic Scheduling of Parallel Code for Heterogeneous Systems
Dynamic Self-Rescheduling of Tasks over a Heterogeneous Platform
Dynamic Shader Generation for Flexible Multi-Volume Visualization
Dynamic Sparse-Matrix Allocation on GPUs
Dynamic Task Parallelism with a GPU Work-Stealing Runtime System
Dynamic Task-Scheduling and Resource Management for GPU Accelerators in Medical Imaging
Dynamic Translation of Runtime Environments for Heterogeneous Computing
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow
Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware
Dynamic Warp Resizing in High-Performance SIMT
Dynamic Workload Division in GPU-CPU Heterogeneous Systems
Dynamical heterogeneities as fingerprints of a backbone structure in Potts models
Dynamical simulations of extrasolar planetary systems with debris disks using a GPU accelerated N-body code
Dynamically Finding Optimal Kernel Launch Parameters for CUDA Programs
Dynamically Managed Data for CPU-GPU Architectures
Dynamically scheduled Cholesky factorization on multicore architectures with GPU accelerators
Dynamically tuned push-relabel algorithm for the maximum flow problem on CPU-GPU-Hybrid platforms
DynaProg for Scala: A Scala DSL for Dynamic Programming on CPU and GPU
DySel: Lightweight Dynamic Selection for Kernel-based Data-parallel Programming Model
E-MOGA: A General Purpose Platform for Multi Objective Genetic Algorithm running on CUDA
E(A+M)PEC - An OpenCL Atomic and Molecular Plasma Emission Code For Interstellar Medium Simulations
E2C: A Visual Simulator to Reinforce Education of Heterogeneous Computing Systems
Early Application Experiences on a Modern GPU-Accelerated Arm-based HPC Platform
Early evaluation of directive-based GPU programming models for productive exascale computing
Early Experiences in Running Many-Task Computing Workloads on GPGPUs
Early Experiences Migrating CUDA codes to oneAPI
Early Experiences Running the 3D Stencil Jacobi Method in Intel Xeon Phi
Early experiences with the Intel many integrated cores accelerated computing technology
Early Experiences With The OpenMP Accelerator Model
Early Results of Deep Learning on the Stampede2 Supercomputer
EASEA parallelization of tree-based Genetic Programming
EASEA: A Generic Optimization Tool for GPU Machines in Asynchronous Island Model
EASEA: specification and execution of evolutionary algorithms on GPGPU
Easy and Efficient Agent-based Simulations with the OpenABL Language and Compiler
Easy and Efficient Transformer: Scalable Inference Solution For large NLP model
Easy-to-Use On-the-Fly Binary Program Acceleration on Many-Cores
EASYPAP: a Framework for Learning Parallel Programming
EasyPBR: A Lightweight Physically-Based Renderer
Ebb: A DSL for Physical Simulation on CPUs and GPUs
ECC2K-130 on NVIDIA GPUs
eccCL: parallelized GPU implementation of Ensemble Classifier Chains
ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX
ECM on Graphics Cards
EcoG: A Power-Efficient GPU Cluster Architecture for Scientific Computing
Edge AI for Internet of Energy: Challenges and Perspectives
Edge coloring in unstructured CFD codes
Edge Stream Oriented LDPC Decoding
Edify 3D: Scalable High-Quality 3D Asset Generation
EDSSA: An Encoder-Decoder Semantic Segmentation Networks Accelerator on OpenCL-Based FPGA Platform
Effect And Analysis of Elastic Fidelity Computing On GPUs
Effect of GPU Communication-Hiding for SpMV Using OpenACC
Effective Dynamic Scheduling on Heterogeneous Multi/Manycore Desktop Platforms
Effective Extensible Programming: Unleashing Julia on GPUs
Effective GPU Sharing Under Compiler Guidance
Effective GPU Strategies for LU Decomposition
Effective Mapping of Grammatical Evolution to CUDA Hardware Model
Effective Multi-Modal Retrieval based on Stacked Auto-Encoders
Effective Parallelization of Non-bonded Interactions Kernel for Virtual Screening on GPUs
Effective Sparse Matrix Representation for the GPU Architectures
Effectiveness of GPGPU for Solving the Magnetohydrodynamics Equations Using the CIP-MOCCT Method
Effectiveness of program transformations and compilers for directive-based GPU programming models
Effects of Compiler Optimizations in OpenMP to CUDA Translation
Effects of compression on data intensive algorithms
Effects of Concurrency Techniques and Algorithm Performance: A Comparative Analysis of Single-Threaded, Multi-Threaded, and GPGPU Programming Techniques
Effects of Dynamic Voltage and Frequency Scaling on a K20 GPU
Effects of Easy Hybrid Parallelization with CUDA for Numerical-Atomic-Orbital Density Functional Theory Calculation
Effects of GPU and CPU Loads on Performance of CUDA Applications
Effects of OpenCL-Based Parallelization Methods on Explicit Numerical Methods to Solve the Heat Equation
EFFEX: an embedded processor for computer vision based feature extraction
Efficacy of Images Versus Data Buffers: Optimizing Interactive Applications Utilizing OpenCL for Scientific Visualization
Efficient multiple pass, multiple output algorithms on the GPU
Efficiency analysis of a physical problem: Different parallel computational approaches for a dynamical integrator evolution
Efficiency Considerations of Cauchy Reed-Solomon Implementations on Accelerator and Multi-Core Platforms
Efficiency of general Krylov methods on GPUs - An experimental study
Efficiency of Parallelization of Neural Network Algorithm on Graphic Cards
Efficiency of the energy transfer in the Fenna-Matthews-Olson complex using hierarchical equations on graphics processing units
Efficiency without Tears: Securing Multilingual Programs with TRINITY
Efficient 2D Software Rendering
Efficient 3D Isotropic Volume Reconstruction Based On 2D Localized Ultrasound Images
Efficient 3D reconstruction of large-scale urban environments from street-level video
Efficient Acceleration of Mutual Information Computation for Nonrigid Registration using CUDA
Efficient Algorithm for RSA Text Encryption Using CUDA-C
Efficient Algorithms for Sorting on GPUs
Efficient algorithms for the realistic simulation of fluids
Efficient all-against-all protein similarity matrix computation using OpenCL
Efficient allocation of image recognition and LLM tasks on multi-GPU system
Efficient and Accurate Sound Propagation Using Adaptive Rectangular Decomposition
Efficient and Cryptographically Secure Generation of Chaotic Pseudorandom Numbers on GPU
Efficient and Good Delaunay Meshes From Random Points
Efficient and High-quality Sparse Graph Coloring on the GPU
Efficient and portable acceleration of quantum chemical many-body methods in mixed floating point precision using OpenACC compiler directives
Efficient and portable multi-tasking for heterogeneous systems
Efficient and Quality Contouring Algorithms on the GPU
Efficient and Scalable k-Means on GPUs
Efficient and Scalable Parallel Zonal Statistics on Large-Scale Species Occurrence Data on GPUs
Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs
Efficient Approximate Visibility of Point Sets on the GPU
Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores
Efficient Bayesian inference in stochastic chemical kinetic models using graphical processing units
Efficient bayesian multi-view deconvolution
Efficient Calculation of Pairwise Nonbonded Forces
Efficient Canny Edge Detection Using a GPU
Efficient characterizations of composite materials electrical properties based on GPU accelerated finite difference method
Efficient code generation for hardware accelerators by refining partially specified implementation
Efficient Collision Detection and Physics-Based Deformation for Haptic Simulation with Local Spherical Hash
Efficient Communications in Training Large Scale Neural Networks
Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs
Efficient computation of condition estimates for linear least squares problems
Efficient computation of constrained parameterizations on parallel platforms
Efficient Computation of k-Nearest Neighbour Graphs for Large High-Dimensional Data Sets on GPU Clusters
Efficient Computation of SOM for Outage Database
Efficient computation of sum-products on GPUs through software-managed cache
Efficient Computation of the Kleene Star in Max-Plus Algebra using a CUDA GPU
Efficient Computational Methods for Uncertainty Quantification of Large Systems
Efficient computational noise in GLSL
Efficient Configuration of Heterogeneous Resources and Task Scheduling Strategies in Deep Learning Auto-Tuning Systems
Efficient Convex Optimization Approaches to Variational Image Fusion
Efficient Convolutional Neural Networks for Pixelwise Classification on Heterogeneous Hardware Systems
Efficient Convolutional Patch Networks for Scene Understanding
Efficient Cross-Device Query Processing
Efficient CSR-Based Sparse Matrix-Vector Multiplication on GPU
Efficient Cubic B-spline Image Interpolation on a GPU
Efficient CUDA polynomial preconditioned Conjugate Gradient solver for Finite Element computation of elasticity problems
Efficient Data Management for GPU Databases
Efficient data structures for piecewise-smooth video processing
Efficient deconvolution methods for astronomical imaging: algorithms and IDL-GPU codes
Efficient deep learning inference on end devices
Efficient Deep Neural Network Inference for Embedded Systems: A Mixture of Experts Approach
Efficient design and implementation of visual computing algorithms on the GPU
Efficient Detection of Sunspots with GPU Acceleration Through CUDA
Efficient dictionary learning implementation on the GPU using OpenCL
Efficient Discrete Range Searching primitives on the GPU with applications
Efficient Dynamic Derived Field Generation on Many-Core Architectures Using Python
Efficient Dynamic Program Monitoring on Multi-Core Platforms
Efficient Embarrassingly Parallel on Graphics Processor Unit
Efficient Emission Computation in Hidden Semi-Markov Models on Diverse Hardware
Efficient Energy Minimization in Finite-Difference Micromagnetics: Speeding up Hysteresis Computations
Efficient evaluation methods of elementary functions suitable for SIMD computation
Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets
Efficient Execution of AMR Computations on GPU Systems
Efficient Execution of OpenMP on GPUs
Efficient Execution on GPUs of Field-Based Vehicular Mobility Models
Efficient Exploitation of Heterogeneous Platforms for Images Features Extraction
Efficient Exploitation of Heterogeneous Platforms for Vertebra Detection in X-Ray Images
Efficient fault simulation on many-core processors
Efficient FFT mapping on GPU for radar processing application: modeling and implementation
Efficient fine grained shared buffer management for multiple OpenCL devices
Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs
Efficient floating-point texture decompression
Efficient fMRI Analysis and Clustering on GPUs
Efficient gather and scatter operations on graphics processors
Efficient Geometry Compression for GPU-based Decoding in Realtime Terrain Rendering
Efficient GPGPU-based parallel packet classification
Efficient GPU Implementation for Particle in Cell Algorithm
Efficient GPU Implementation for Single Block Orthogonal Dictionary Learning
Efficient GPU implementation of a class of array permutations
Efficient GPU implementation of a two waves WAF method for the two-dimensional one layer Shallow Water system on structured meshes
Efficient GPU Implementation of Multi-Precision Integer Division
Efficient GPU implementation of parameter estimation of a statistical model for online advertisement optimization
Efficient GPU implementation of the integral histogram
Efficient GPU-Accelerated Elastic Image Registration
Efficient GPU-based Construction of Occupancy Grids Using several Laser Range-finders
Efficient GPU-based Graph Cuts for Stereo Matching
Efficient GPU-Based Texture Interpolation using Uniform B-Splines
Efficient GPU-based Training of Recurrent Neural Network Language Models Using Spliced Sentence Bunch
Efficient GPU-Implementation of Adaptive Mesh Refinement for the Shallow-Water Equations
Efficient gradient-domain compositing using quadtrees
Efficient Graph Comparison and Visualization Using GPU
Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration
Efficient Hardware Acceleration on SoC-FPGA with OpenCL
Efficient Hash Tables on the GPU
Efficient Heterogeneous Execution on Large Multicore and Accelerator Platforms: Case Study Using a Block Tridiagonal Solver
Efficient heterogeneous matrix profile on a CPU + High Performance FPGA with integrated HBM
Efficient hierarchical parallel genetic algorithms using grid computing
Efficient High-Quality Volume Rendering of SPH Data
Efficient High-Speed WPA2 Brute Force Attacks using Scalable Low-Cost FPGA Clustering
Efficient Hybrid Execution of C++ Applications using Intel(R) Xeon Phi(TM) Coprocessor
Efficient image reconstruction for point-based and line-based rendering
Efficient Implementation and Evaluation of Methods for the Estimation of Motion in Image Sequences
Efficient Implementation and Optimization of Geometric Multigrid Operations in the LIFT Framework
Efficient implementation for MD5-RC4 encryption using GPU with CUDA
Efficient implementation for QUAD stream cipher with GPUs
Efficient Implementation of Bi-directional Path Tracer on GPU
Efficient implementation of computationally intensive algorithms on parallel computing platforms
Efficient implementation of data flow graphs on multi-gpu clusters
Efficient implementation of GPGPU synchronization primitives on CPUs
Efficient Implementation of Hyperspectral Anomaly Detection Techniques on GPUs and Multicore Processors
Efficient Implementation of MrBayes on multi-GPU
Efficient implementation of multiuser precoding algorithms on GPU for MIMO-OFDM systems
Efficient Implementation of Optical Flow Algorithm Based on Directional Filters on a GPU Using CUDA
Efficient Implementation of RLS-Based Adaptive Filters on nVIDIA GeForce Graphics Processing Unit
Efficient Implementation of the CPR Formulation for the Navier-Stokes Equations on GPUs
Efficient Implementation of the eta_T Pairing on GPU
Efficient implementation of the overlap operator on multi-GPUs
Efficient Implementation of the Simplex Method on a CPU-GPU System
Efficient Incremental Text-to-Speech on GPUs
Efficient Independent Component Analysis on a GPU
Efficient Inference For Neural Machine Translation
Efficient Integral Image Computation on the GPU
Efficient Interleaved Batch Matrix Solvers for CUDA
Efficient Intranode Communication in GPU-Accelerated Systems
Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines
Efficient JPEG2000 EBCOT Context Modeling for Massively Parallel Architectures
Efficient Kernel Fusion Techniques for Massive Video Data Analysis on GPGPUs
Efficient Kernel Synthesis for Performance Portable Programming
Efficient Knowledge Extraction from Structured Data
Efficient Large-scale Approximate Nearest Neighbor Search on OpenCL FPGA
Efficient Large-scale Approximate Nearest Neighbor Search on the GPU
Efficient Large-Scale Graph Processing on Hybrid CPU and GPU Systems
Efficient Large-Scale Language Model Training on GPU Clusters
Efficient LBM Visual Simulation on Face-Centered Cubic Lattices
Efficient linear-scaling quantum transport calculations on graphics processing units and applications on electron transport in graphene
Efficient lists intersection by CPU-GPU cooperative computing
Efficient magnetohydrodynamic simulations on graphics processing units with CUDA
Efficient Mapping of Streaming Applications for Image Processing on Graphics Cards
Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster
Efficient Matrix Factorization on Heterogeneous CPU-GPU Systems
Efficient MIMD architectures for high-performance ray tracing
Efficient Model-based 3D Tracking of Hand Articulations using Kinect
Efficient molecular dynamics simulations with many-body potentials on graphics processing units
Efficient Monte Carlo sampler for detecting parametric objects in large scenes
Efficient MPI-based Communication for GPU-Accelerated Dask Applications
Efficient Multi-GPU Algorithm for All-Pairs Shortest Paths
Efficient Multi-GPU Computation of All-Pairs Shortest Paths
Efficient Multiplication of Polynomials on Graphics Hardware
Efficient nearest-neighbor computation for GPU-based motion planning
Efficient Nearest-Neighbor Data Sharing in GPUs
Efficient Neural Network Acceleration on GPGPU using Content Addressable Memory
Efficient nonbonded interactions for molecular dynamics on a graphics processing unit
Efficient Numerical Evaluation of Feynman Integral
Efficient occupancy grid computation on the GPU with lidar and radar for road boundary detection
Efficient On-the-fly Category Retrieval using ConvNets and GPUs
Efficient OpenCL system integration of non-blocking FPGA accelerators
Efficient OpenCL-based concurrent tasks offloading on accelerators
Efficient PageRank and SpMV Computation on AMD GPUs
Efficient Parallel Algorithm for Nonlinear Dimensionality Reduction on GPU
Efficient parallel algorithms for maximum-density segment problem
Efficient Parallel and External Matching
Efficient Parallel CKY Parsing on GPUs
Efficient Parallel Evaluation of Multivariate Quadratic Polynomials on GPUs
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU
Efficient Parallel Implementation for Single Block Orthogonal Dictionary Learning
Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU
Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms
Efficient parallel implementation of the lattice Boltzmann method on large clusters of graphic processing units
Efficient Parallel Intra-prediction Mode Selection Scheme for 4x4 Blocks in H.264
Efficient parallel lists intersection and index compression algorithms using graphics processing units
Efficient Parallel Methods for Deep Reinforcement Learning
Efficient Parallel Nonnegative Least Squares on Multicore Architectures
Efficient Parallel Proximity Queries and an Application to Highly Complex Motion Planning Problems with Many Narrow Passages
Efficient Parallel RSA Decryption Algorithm for Many-core GPUs with CUDA
Efficient Parallel Scan Algorithms for GPUs
Efficient Parallel Strategy Improvement for Parity Games
Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation
Efficient Parallelization of Natural Language Applications using GPUs
Efficient Parallelization of Stochastic Simulation Algorithm for Chemically Reacting Systems on the Graphics Processing Unit
Efficient Parallelization of the Stochastic Simulation Algorithm for Chemically Reacting Systems On the Graphics Processing Unit
Efficient parallelized particle filter design on CUDA
Efficient Particle-Mesh Spreading on GPUs
Efficient Partitioning Based Hierarchical Agglomerative Clustering Using Graphics Accelerators with CUDA
Efficient partitioning of fragment shaders for multipass rendering on programmable graphics hardware
Efficient Password and Key recovery using Graphic Cards
Efficient Pattern-Based Time Series Classification on GPU
Efficient Performance Evaluation of Memory Hierarchy for Highly Multithreaded Graphics Processors
Efficient planar features matching for robot localization using GPU
Efficient Preconditioned Conjugate Gradient Parallelization on GPU
Efficient Probabilistic and Geometric Anatomical Mapping Using Particle Mesh Approximation on GPUs
Efficient Probabilistic Latent Semantic Indexing using Graphics Processing Unit
Efficient Probabilistic Model Checking on General Purpose Graphics Processors
Efficient Processing of MRFs for Unconstrained-Pose Face Recognition
Efficient pseudo-random number generation for Monte-Carlo simulations using graphic processors
Efficient pseudo-random number generators for biomolecular simulations on graphics processors
Efficient Quantized Sparse Matrix Operations on Tensor Cores
Efficient Query Processing in Co-Processor-accelerated Databases
Efficient Quicksort and 2D Convex Hull for CUDA, and MSIMD as a Realistic Model of Massively Parallel Computations
Efficient Radial Pattern Keyword Search on Knowledge Graphs in Parallel
Efficient Random Sampling - Parallel, Vectorized, Cache-Efficient, and Online
Efficient Rasterization for Outdoor Radio Wave Propagation
Efficient Ray Tracing of Dynamic Scenes on the GPU
Efficient Realization of Householder Transform through Algorithm-Architecture Co-design for Acceleration of QR Factorization
Efficient reconfigurable design for pricing Asian options
Efficient reconstruction of biological networks via transitive reduction on general purpose graphics processors
Efficient Relational Algebra Algorithms and Data Structures for GPU
Efficient relational database management using graphics processors
Efficient Rendering of Scenes with Dynamic Lighting Using a Photons Queue and Incremental Update Algorithm
Efficient Resource Scheduling for Big Data Processing on Accelerator-based Heterogeneous Systems
Efficient Resource Sharing Through GPU Virtualization on Accelerated High Performance Computing Systems
Efficient scan-window based object detection using GPGPU
Efficient SDS Simulations on Multi-GPU Nodes of XSEDE High-end Clusters
Efficient Shadows for GPU-based Volume Raycasting
Efficient Shallow Water Simulations on GPUs
Efficient shallow water simulations on GPUs: Implementation, visualization, verification, and validation
Efficient SIMD Vectorization for Hashing in OpenCL
Efficient similarity search on multimedia databases
Efficient simulation of agent-based models on multi-GPU and multi-core clusters
Efficient Simulation of Fluid Flow and Transport in Heterogeneous Media Using Graphics Processing Units (GPUs)
Efficient simulation of large-scale spiking neural networks using CUDA graphics processors
Efficient Simulation of Ocean and Land Scenes Based on Digital Earth
Efficient Simulation Techniques for Large-Scale Applications
Efficient simulations of long wave propagation and runup using a LBM approach on GPGPU hardware
Efficient softmax approximation for GPUs
Efficient Sparse Matrix-Vector Multiplication on CUDA
Efficient Sparse Matrix-Vector Multiplication on GPUs using the CSR Storage Format
Efficient Sparse Matrix-Vector Multiplication on x86-Based Many-Core Processors
Efficient sparse voxel octrees
Efficient Sparse Voxel Octrees - Analysis, Extensions, and Implementation
Efficient Sparse-Dense Matrix-Matrix Multiplication on GPUs Using the Customized Sparse Storage Format
Efficient Spatial Anti-Aliasing Rendering for Line Joins on Vector Maps
Efficient Spatial Binning on the GPU
Efficient spectral and pseudospectral algorithms for 3D simulations of whistler-mode waves in a plasma
Efficient Stack-less BVH Traversal for Ray Tracing
Efficient Static and Dynamic Memory Management Techniques for Multi-GPU Systems
Efficient stream reduction on the GPU
Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures
Efficient Surface Reconstruction From Noisy Data Using Regularized Membrane Potentials
Efficient SVM Training Using Parallel Primal-Dual Interior Point Method on GPU
Efficient Synchronization Primitives for GPUs
Efficient Target and Application Specific Selection and Ordering of Compiler Passes
Efficient Triangle and Quadrilateral Clipping within Shaders
Efficient Two-Level Preconditioned Conjugate Gradient Method on the GPU
Efficient Use of In-Game Ray-Tracing Techniques
Efficient Video Compression via Content-Adaptive Super-Resolution
Efficient Virtual Shadow Maps for Many Lights
Efficient visual hull computation for real-time 3D reconstruction using CUDA
Efficient Volume Rendering in CUDA Path Tracer
Efficient Wave Propagation in Discontinuous Media and Complex Geometry for Many-core Architectures
Efficient Weighted Histogramming on GPUs with CUDA
Efficient Workload Balancing on Heterogeneous GPUs using Mixed-Integer Non-Linear Programming
Efficient XML Path Filtering Using GPUs
Efficient, High-Quality Bayer Demosaic Filtering on GPUs
EfficientBioAI: Making Bioimaging AI Models Efficient in Energy, Latency and Representation
Efficiently Computing Tensor Eigenvalues on a GPU
Efficiently GPU-accelerating long kernel convolutions in 3-D DIRECT TOF PET reconstruction via a kernel decomposition scheme
Efficiently Mapping the AES Encryption Algorithm on GPUs
Efficiently Processing Large Relational Joins on GPUs
Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs
Efficiently Using a CUDA-enabled GPU as Shared Resource
eGPU: A 750 MHz Class Soft GPGPU for FPGA
EIE: Efficient Inference Engine on Compressed Deep Neural Network
EigenCFA: accelerating flow analysis with GPUs
Eigentransport for efficient and accurate all-frequency relighting
Elastic deep learning in multi-tenant GPU cluster
Elastic pipeline: addressing GPU on-chip shared memory bank conflicts
Elastic stream cloud (ESC): A stream-oriented cloud computing platform for Rich Internet Application
Elastically Deformable Models based on the Finite Element Method Accelerated on Graphics Hardware using CUDA
ElastiFace: Matching and Blending Textured Faces
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Electric polarizability of hadrons with overlap fermions on multi-GPUs
Electric potential and field calculation of charged BEM triangles and rectangles by Gaussian cubature
Electrical distribution grid visualization using programmable GPUs
Electrical-Level Attacks on CPUs, FPGAs, and GPUs: Survey and Implications in the Heterogeneous Era
Electromagnetic Computation and Visualization of Transmission Particle Model and its Simulation Based on GPU
Electromagnetic effects in capacitively coupled plasma simulated with a PIC-MCC Darwin code
Electromagnetic transient simulation of large-scale electrical power networks using graphics processing units
Elementary functions: towards automatically generated, efficient, and vectorizable implementations
Elevation-based MRF stereo implemented in real-time on a GPU
EM+TV for Reconstruction of Cone-beam CT with Curved Detectors using GPU
Embedded Ensemble Propagation for Improving Performance, Portability and Scalability of Uncertainty Quantification on Emerging Computational Architectures
Embedded real-time stereo estimation via Semi-Global Matching on the GPU
Embedded Software Synthesis using Heterogeneous Dataflow Models
Embedding GPU Computations in Hadoop
Embedding OpenCL in C++ for Expressive GPU Programming
Embedding OpenCL in GHC Haskell
Embracing Heterogeneity: Parallel Programming for Changing Hardware
Emerging technology about GPGPU
EMMA: an AMR cosmological simulation code with radiative transfer
EmoNets: Multimodal deep learning approaches for emotion recognition in video
Empirical analysis of a parallel data mining algorithm on a graphic processor
Empirical performance modeling of GPU kernels using active learning
Employ Bump Mapping to Enrich the 3D NPR Image
Employing Directive Based Compression Solutions on Accelerators Global Memory under OpenACC
Employing GPU Accelerators for Efficient Enforcement of Data Integrity in Outsourced Data
Employing OpenCL as a Standard Hardware Abstraction in a Distributed Embedded System: A Case Study
Empower Sequence Labeling with Task-Aware Neural Language Model
Empowering Visual Categorization With the GPU
Empty Space Skipping and Occlusion Clipping for Texture-based Volume Rendering
Enabling a High Throughput Real Time Data Pipeline for a Large Radio Telescope Array with GPUs
Enabling active storage on parallel I/O software stacks
Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems
Enabling Computational Dynamics in Distributed Computing Environments Using a Heterogeneous Computing Template
Enabling CP2K Application for Exascale Computing with Accelerators using OpenACC and OpenCL
Enabling Data Movement and Computation Pipelining in Deep Learning Compiler
Enabling Development of OpenCL Applications on FPGA platforms
Enabling Efficient Online Profiling of Homogeneous and Heterogeneous Multicore Systems
Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects
Enabling Energy-Efficient Analysis of Massive Neural Signals Using GPGPU
Enabling Energy-Efficient DNN Training on Hybrid GPU-FPGA Accelerators
Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments
Enabling full-speed random access to the entire memory on the A100 GPU
Enabling High Performance Computing in Cloud Infrastructure using rCUDA
Enabling High Performance Computing in Cloud Infrastructure using Virtualized GPUs
Enabling Inter-Machine Parallelism in High-Level Languages with SEJITS and MapReduce
Enabling multiple accelerator acceleration for Java/OpenMP
Enabling New Uses for GPUs
Enabling On-Device Smartphone GPU based Training: Lessons Learned
Enabling OpenCL on a Configurable, VLIW Chip-Multiprocessor
Enabling OpenMP Task Parallelism on Multi-FPGAs
Enabling OS Research by Inferring Interactions in the Black-Box GPU Stack
Enabling Profile Guided Optimizations (PGO) for Graphics
Enabling Quantum Computer Simulations on AMD GPUs: a HIP Backend for Google’s qsim
Enabling task-level scheduling on heterogeneous platforms
Enabling the use of Heterogeneous Computing for Bioinformatics
Enabling Traceability in an MDE Approach to Improve Performance of GPU Applications
Enabling Traceability in MDE to Improve Performance of GPU Applications
Encapsulated synchronization and load-balance in heterogeneous programming
Encrypting video and image streams using OpenCL code on-demand
Encrypting video streams using OpenCL code on-demand
End-to-end data reduction and hardware accelerated rendering techniques for visualizing time-varying non-uniform grid volume data
End-to-end Deep Learning of Optimization Heuristics
End-to-end Mapping in Heterogeneous Systems Using Graph Representation Learning
End-to-end Optimization of Machine Learning Prediction Queries
EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models
Energy Auto-tuning using the Polyhedral Approach
Energy conservation techniques for GPU computing
Energy Consumption of Algorithms for Solving the Compressible Navier-Stokes Equations on CPU's, GPU's and KNL's
Energy consumption of Graphic Processing Units with respect to automotive use-cases
Energy Efficiency Analysis of GPUs
Energy Efficiency Benefits of Reducing the Voltage Guardband on the Kepler GPU Architecture
Energy efficiency of finite difference algorithms on multicore CPUs, GPUs, and Intel Xeon Phi processors
Energy efficiency of mixed precision iterative refinement methods using hybrid hardware platforms
Energy Efficiency Studies of Mont Blanc Applications
Energy efficiency vs. performance of the numerical solution of PDEs: an application study on a low-power ARM-based cluster
Energy efficient biomolecular simulations with FPGA-based reconfigurable computing
Energy Efficient Computing on Multi-core Processors: Vectorization and Compression Techniques
Energy Efficient Parallel K-Means Clustering for an Intel Hybrid Multi-Chip Package
Energy Evaluation for Applications with Different Thread Affinities on the Intel Xeon Phi
Energy Transfer Ray Tracing with OptiX
Energy- and cost-efficient Lattice-QCD computations using graphics processing units
Energy-aware metrics for benchmarking heterogeneous systems
Energy-aware Task Scheduling with Deadline Constraint in DVFS-enabled Heterogeneous Clusters
Energy-based Tuning of Convolutional Neural Networks on Multi-GPUs
Energy-efficient algorithms
Energy-Efficient Collective Reduce and Allreduce Operations on Distributed GPUs
Energy-efficient computing for extreme-scale science
Energy-efficient Computing on Distributed GPUs using Dynamic Parallelism and GPU-controlled Communication
Energy-Efficient Execution of Data-Parallel Applications on Heterogeneous Mobile Platforms
Energy-Efficient FPGA Implementation for Binomial Option Pricing Using OpenCL
Energy-efficient FPGA Implementation of the k-Nearest Neighbors Algorithm Using OpenCL
Energy-Efficient GPU Clusters Scheduling for Deep Learning
Energy-efficient mechanisms for managing thread context in throughput processors
Energy-optimized mapping of application to smartphone platform - A case study of mobile face recognition
Energy-saving techniques for low-power graphics processing unit
EngineCL: Usability and Performance in Heterogeneous Computing
Engineering a static verification tool for GPU kernels
Engineering Concurrent Software Guided by Statistical Performance Analysis
Engineering of Computer Vision Algorithms Using Evolutionary Algorithms
Engineering Supercomputing Platforms for Biomolecular Applications
Enhanced implementation of the NTRUEncrypt algorithm using graphics cards
Enhanced molecular dynamics performance with a programmable graphics processor
Enhanced Parallel ILU(p)-based Preconditioners for Multi-core CPUs and GPUs - The Power(g)-pattern Method
Enhanced Parallel NegaMax Tree Search Algorithm on GPU
Enhancing and Porting the HPC-Lab Snow Simulator to OpenCL on Mobile Platforms
Enhancing Code Portability, Problem Scale, and Storage Efficiency in Exascale Applications
Enhancing Data Locality for Dynamic Simulations through Asynchronous Data Transformations and Adaptive Control
Enhancing Data Parallelism for Ant Colony Optimisation on GPUs
Enhancing data parallelism for Ant Colony Optimization on GPUs
Enhancing Deployment-Time Predictive Model Robustness for Code Analysis and Optimization
Enhancing Depth-Perception with Flexible Volumetric Halos
Enhancing Efficiency of the RRTMG Radiation Code with GPU and MIC Approaches for Numerical Weather Prediction Models
Enhancing Fluid Modeling with Turbulence and Acceleration
Enhancing GPU Parallelism in Nature-Inspired Algorithms
Enhancing Performance for Solving Finite Element Mesh using Heterogeneous Platforms
Enhancing Performance of Meshfree Methods by Hybrid Computing
Enhancing Performance of Simulations using GPGPU
Enhancing Productivity and Performance Portability of General-Purpose Parallel Programming
Enhancing productivity and performance portability of OpenCL applications on heterogeneous systems using runtime optimizations
Enhancing R with Advanced Compilation Tools and Methods
Enhancing the Performance Analysis of NCCL GPU Collectives
Enhancing the Performance Portability of Heterogeneous Circuit Analysis Programs
Enhancing the simulation of P systems for the SAT problem on GPUs
Enhancing Transformer Performance and Portability through Auto-tuning Frameworks
Enhancing Ubiquitous Systems through System Call Mining
Ensemble K-Means on Modern Many Core Hardware
Ensemble K-means on multi-core architectures
Entropy-based High Performance Computation of Boolean SNP-SNP Interactions Using GPUs
Environment Lighting for Point Sampled Geometry
Environment Segmentation in Service Robotics
EPEM: A General and Validated Energy Complexity Model for Multithreaded Algorithms
EpiGPU
EPSILOD: efficient parallel skeleton for generic iterative stencil computations in distributed GPUs
Equalizer 2.0 - Convergence of a Parallel Rendering Framework
Equalizer: A Scalable Parallel Rendering Framework
Equilibrium and Non-Equilibrium Ising Models by Means of PCA
EQUIPE: Parallel equivalence checking with GP-GPUs
Equivalence Checking of ML GPU Kernels
Error Resilience Evaluation on GPGPU Applications
Error-bounded GPU-supported terrain visualisation
ESE: Efficient Speech Recognition Engine with Compressed LSTM on FPGA
Espresso: A Fast End-to-end Neural Speech Recognition Toolkit
Espresso: Efficient Forward Propagation for BCNNs
Estimating GPU Speedups for Programs Without Writing a Single Line of GPU Code
Estimating the WCET of GPU-Accelerated Applications using Hybrid Analysis
Estimation of numerical reproducibility on CPU and GPU
Estimation of Skin Optical Parameters for Real-Time Hyperspectral Imaging Applications using GPGPU Parallel Computing
Evacuation Route Modeling and Planning with General Purpose GPU Computing
Evaluating 3-D Stencil codes on Intel Xeon Phi: Limitations and Trade-offs
Evaluating CP2K on Exascale Hardware: Intel Xeon Phi
Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs
Evaluating different Java bindings for OpenCL
Evaluating force field accuracy with long-time simulations of a beta-hairpin tryptophan zipper peptide
Evaluating FPGA Accelerator Performance with a Parameterized OpenCL Adaptation of the HPCChallenge Benchmark Suite
Evaluating GPU Passthrough in Xen for High Performance Cloud Computing
Evaluating GPUs for network packet signature matching
Evaluating graph coloring on GPUs
Evaluating High-Level Synthesis Techniques for Scalable Hardware-Accelerated Computing
Evaluating kernels on Xeon Phi to accelerate Gysela application
Evaluating multi-core platforms for HPC data-intensive kernels
Evaluating one-sided programming models for GPU cluster computations
Evaluating Operators in Deep Neural Networks for Improving Performance Portability of SYCL
Evaluating performance and portability of OpenCL programs
Evaluating Performance Portability of Accelerator Programming Models using SPEC ACCEL 1.2 Benchmarks
Evaluating Performance Portability of OpenACC
Evaluating Performance Tradeoffs on the Radeon Open Compute Platform
Evaluating polynomials in several variables and their derivatives on a GPU computing processor
Evaluating Reconfigurable Dataflow Computing Using the Himeno Benchmark
Evaluating the Arm Ecosystem for High Performance Computing
Evaluating the capabilities of the Xeon Phi platform in the context of software-only, thread-level speculation
Evaluating the cell broadband engine as a platform to run estimation of distribution algorithms
Evaluating the Efficiency of CPUs, GPUs and FPGAs on a Near-Duplicate Document Detection Via OpenCL
Evaluating the Energy Efficiency of OpenCL-accelerated AutoDock Molecular Docking
Evaluating the impact of reordering unstructured meshes on the performance of finite volume GPU solvers
Evaluating the Performance and Energy Efficiency of N-Body Codes on Multi-Core CPUs and GPUs
Evaluating the Performance and Portability of Contemporary SYCL Implementations
Evaluating the Performance and Portability of OpenCL
Evaluating the Performance Impact of Multiple Streams on the MIC-based Heterogeneous Platform
Evaluating the performance of HPC-style SYCL applications
Evaluating the Performance of Legacy Applications on Emerging Parallel Architectures
Evaluating the Performance of NVIDIA's A100 Ampere GPU for Sparse Linear Algebra Computations
Evaluating the Performance of Processing Medical Volume Data on Graphics Hardware
Evaluating the Performance of the DeepSeek Model in Confidential Computing Environment
Evaluating the performance portability of SYCL across CPUs and GPUs on bandwidth-bound applications
Evaluating the potential of graphics processors for high performance embedded computing
Evaluating the Power of GPU Acceleration for IDW Interpolation Algorithm
Evaluating the use of GPUs in liver image segmentation and HMMER database searches
Evaluating the Viability of Application-Driven Cooperative CPU/GPU Fault Detection
Evaluating the Wide Area Classroom After 24,000 HPC Students
Evaluating tradeoff between recall and performance of GPU permutation index
Evaluation and enhancement of memory efficiency targeting general-purpose computations on scalable data-parallel GPU architectures
Evaluation and Improvement of GPU Ray Tracing with a Thread Migration Technique
Evaluation and tuning of the Level 3 CUBLAS for graphics processors
Evaluation Framework for GPU Performance Based on OpenCL Standard
Evaluation iterative solver for pCDR on GPU accelerator
Evaluation of an accelerator architecture for Speckle Reducing Anisotropic Diffusion
Evaluation of an OpenCL-Based FPGA Platform for Particle Filter
Evaluation of autoparallelization toolkits for commodity graphics hardware
Evaluation of computational and energy performance in matrix multiplication algorithms on CPU and GPU using MKL, cuBLAS and SYCL
Evaluation of DGEMM Implementation on Intel Xeon Phi Coprocessor
Evaluation of disconnected quark loops for hadron structure using GPUs
Evaluation of Fermi Features for Data Mining Algorithms
Evaluation of FPGA-based high performance computing platforms
Evaluation of GPU Architectures Using Spiking Neural Networks
Evaluation of GPU-based track-triggering for the CMS detector at CERN's HL-LHC
Evaluation of Intel's DPC++ Compatibility Tool in heterogeneous computing
Evaluation of Libraries for Parallel Computing in Haskell - A Case Study with a Super-resolution Application
Evaluation of likelihood functions on CPU and GPU devices
Evaluation of Machine Learning Frameworks on Finis Terrae II
Evaluation of Multi-Threading in Vulkan
Evaluation of OpenAI Codex for HPC Parallel Programming Models Kernel Generation
Evaluation of P-Scheme/G Algorithm for Solving Recurrence Equations
Evaluation of parallel particle swarm optimization algorithms within the CUDA architecture
Evaluation of Pseudo-Random Number Generation on GPU Cards
Evaluation of Rust for GPGPU high-performance computing
Evaluation of Speedup of Monte Carlo Calculations of Two Simple Reactor Physics Problems Coded for the GPU/CUDA Environment
Evaluation of Standardized Password-based Key Derivation against Parallel Processing Platforms
Evaluation of state-of-the-art polyhedral tools for automatic code generation on GPUs
Evaluation of streaming aggregation on parallel hardware architectures
Evaluation of the Intel Xeon Phi and NVIDIA K80 as accelerators for two-dimensional panel codes
Evaluation of the Stability and Performance of a Multi-Stage Riemann Solver in Relativistic Hydrodynamic Simulations
Evaluation of Two Parallel Finite Element Implementations of the Time-Dependent Advection Diffusion Problem: GPU versus Cluster Considering Time and Energy Consumption
Evenly Spaced Streamlines for Surfaces: An Image-Based Approach
Event-Based OpenMP Tasks for Time-Sensitive GPU-Accelerated Systems
Event-driven gate-level simulation with GP-GPUs
EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models
Evolution of a double-front Rayleigh-Taylor system using a GPU-based high resolution thermal Lattice-Boltzmann model
Evolution of image filters on graphics processor units using Cartesian Genetic Programming
Evolution of Kernels: Automated RISC-V Kernel Optimization with Large Language Models
Evolution of thread-level parallelism in desktop applications
Evolutionary Algorithm for Optimizing Parameters of GPGPU-based Image Segmentation
Evolutionary Clustering on CUDA
Evolutionary Computing on Consumer-Level Graphics Hardware
Evolutionary Quantum Logic Synthesis of Boolean Reversible Logic Circuits Embedded in Ternary Quantum Space using Heuristics
Evolutionary Simulation of Life Using CUDA
Evolving a CUDA kernel from an nVidia template
Evolving CUDA PTX programs by quantum inspired linear genetic programming
Evolving GeneChip correlation predictors on parallel graphics hardware
Evolving gzip matches Kernel from an nVidia CUDA Template
Evolving Neural Networks on GPUs
Evolving Soft Robotic Locomotion in PhysX
EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery
EvoTorch: Scalable Evolutionary Computation in Python
exa-AMD: An Exascale-Ready Framework for Accelerating the Discovery and Design of Functional Materials
EXA2PRO: A Framework for High Development Productivity on Heterogeneous Computing Systems
Exact and complete short read alignment to microbial genomes using GPU programming
Exact and complete short-read alignment to microbial genomes using Graphics Processing Unit programming
Exact calculation of disconnected loops
Exact diagonalization of quantum lattice models on coprocessors
Exact diagonalization of the Hubbard model on graphics processing units
Exact Selectivity Computation for Modern In-Memory Database Query Optimization
Exact Sparse Matrix-Vector Multiplication on GPU's and Multicore Architectures
Exact Symbolic-Numeric Computation of Planar Algebraic Curves
Examining the Analytic Structure of Green's Functions: Massive Parallel Complex Integration using GPUs
Example-based volume illustrations
ExaNBody: a HPC framework for N-Body applications
Exascale Deep Learning for Climate Analytics
Exascale Deep Learning for Scientific Inverse Problems
Executing Dynamic Data Rate Actor Networks on OpenCL Platforms
Executing Process Networks on Heterogeneous Platforms using OpenCL
Execution of Compound Multi-Kernel OpenCL Computations in Multi-CPU/Multi-GPU Environments
Execution-Centric Characterization of FP8 Matrix Cores, Asynchronous Execution, and Structured Sparsity on AMD MI300A
Exercising high-level parallel programming on streams: a systems biology use case
EXOCHI: architecture and programming environment for a heterogeneous multi-core multithreaded system
Expanding the boundaries of GPU computing
Expanding the VPE-qGM Environment Towards a Parallel Quantum Simulation of Quantum Processes Using GPUs
Expansion Techniques for Collisionless Stellar Dynamical Simulations
Experience Applying Fortran GPU Compilers to Numerical Weather Prediction
Experience Migrating OpenCL to SYCL: A Case Study on Searches for Potential Off-Target Sites of Cas9 RNA-Guided Endonucleases on AMD GPUs
Experience of Migrating a Parallel Graph Coloring Program from CUDA to SYCL
Experience of parallelizing cryo-EM 3D reconstruction on a CPU-GPU heterogeneous system
Experience Report: Writing A Portable GPU Runtime with OpenMP 5.1
Experience with Intel's Many Integrated Core architecture in ATLAS software
Experiences Building an MLIR-based SYCL Compiler
Experiences Developing the OpenUH Compiler and Runtime Infrastructure
Experiences in Building a Composable and Functional API for Runtime SPIR-V Code Generation
Experiences in Data-Parallel Simulation and Analysis of Complex Systems with Irregular Graph Structures
Experiences in Speeding Up Computer Vision Applications on Mobile Computing Platforms
Experiences in Teaching a Specialty Multicore Computing Course
Experiences Migrating CUDA to SYCL: A Molecular Docking Case Study
Experiences Porting a Molecular Dynamics Code to GPUs on a Cray XK7
Experiences with Achieving Portability across Heterogeneous Architectures
Experiences with Cell-BE and GPU for Tomography
Experiences with High-Level Programming Directives for Porting Applications to GPUs
Experiences with hybrid clusters
Experiences with implementing Kokkos' SYCL backend
Experiences with Mapping Non-linear Memory Access Patterns into GPUs
Experimental B+-tree for GPU
Experimental Evaluation of Multiprecision Strategies for GMRES on GPUs
Experimental Evaluation of Thread Distribution Effects on Multiple Output Errors in GPUs
Experimental Fault-Tolerant Synchronization for Reliable Computation on Graphics Processors
Experimentation Procedure for Offloaded Mini-Apps Executed on Cluster Architectures with Xeon Phi Accelerators
Experiments on Parallel Training of Deep Neural Network using Model Averaging
Experiments with Massively Parallel Matrix Multiplication
Experiments with Single Core, Multi-core, and GPU Based Computation of Cellular Automata
Explainable Deep Behavioral Sequence Clustering for Transaction Fraud Detection
Explicit Cache Management for Volume Ray-Casting on Parallel Architectures
Explicit caching HYB: a new high-performance SpMV framework on GPGPU
Explicit Control of Vector Field Based Shape Deformations
Explicit Fourth-Order Runge-Kutta Method on Intel Xeon Phi Coprocessor
Explicit Integration with GPU Acceleration for Large Kinetic Networks
Explicit platform descriptions for heterogeneous many-core architectures
Explicit Shallow Water Simulations on GPUs: Guidelines and Best Practices
Exploded Views for Volume Data
Exploitation of GPUs for the Parallelisation of Probably Parallel Legacy Code
Exploiting BSP Abstractions for Compiler Based Optimizations of GPU Applications on multi-GPU Systems
Exploiting co-execution with oneAPI: heterogeneity from a modern perspective
Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU
Exploiting Computational Resources in Distributed Heterogeneous Platforms
Exploiting Computing Power on Graphics Processing Unit
Exploiting Concurrency Patterns with Heterogeneous Task and Data Parallelism
Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs
Exploiting concurrent kernel execution on graphic processing units
Exploiting contextual information for image re-ranking and rank aggregation
Exploiting Data Parallelism in GPUs
Exploiting Data Parallelism in the yConvex Hypergraph Algorithm for Image Representation using GPGPUs
Exploiting dynamic sparse matrices for performance portable linear algebra operations
Exploiting frame-to-frame coherence for accelerating high-quality volume raycasting on graphics hardware
Exploiting GPU On-chip Shared Memory for Accelerating Schedulability Analysis
Exploiting GPU Parallelism to Optimize Real-World Problems
Exploiting GPUs to investigate an inversion method that retrieves cardiac conductivities from potential measurements
Exploiting Graphic Processing Units Parallelism to Improve Intelligent Data Acquisition System Performance in JET's Correlation Reflectometer
Exploiting graphical processing units for data-parallel scientific applications
Exploiting graphics processing units for computational biology and bioinformatics
Exploiting Heterogeneity for Energy Efficiency in Chip Multiprocessors
Exploiting Heterogeneous Computing Platforms By Cataloging Best Solutions For Resource Intensive Seismic Applications
Exploiting Heterogeneous Systems: Keccak on OpenCL
Exploiting Hyper-Loop Parallelism in Vectorization to Improve Memory Performance on CUDA GPGPU
Exploiting Limited Access Distance of ODE Systems for Parallelism and Locality in Explicit Methods
Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures
Exploiting More Parallelism from Applications Having Generalized Reductions on GPU Architectures
Exploiting multi-level parallelism in streaming applications for heterogeneous platforms with GPUs
Exploiting Multi-level Parallelism on a Many-core System for the Application of Hyperheuristics to a Molecular Docking Problem
Exploiting Multiple Levels of Parallelism and Online Refinement of Unstructured Meshes in Atmospheric Model Application
Exploiting OpenMP & OpenACC to Accelerate a Molecular Docking Mini-App in Heterogeneous HPC Nodes
Exploiting parallel features of modern computer architectures in bioinformatics
Exploiting parallel features of modern computer architectures in bioinformatics: applications to genetics, structure comparison and large graph analysis
Exploiting Parallel Processing Power of GPU for High Speed Frequent Pattern Mining
Exploiting Parallelism in GPUs
Exploiting Parallelism in Iterative Irregular Maxflow Computations on GPU Accelerators
Exploiting Segmentation for Robust 3D Object Matching
Exploiting SIMD extensions for linear image processing with OpenCL
Exploiting Space and Time Coherence in Grid-based Sorting
Exploiting SPMD Horizontal Locality
Exploiting SPMD Horizontal Locality to Improve Memory Efficiency
Exploiting Task Parallelism with OpenCL: A Case Study
Exploiting Task-Parallelism on GPU Clusters via OmpSs and rCUDA Virtualization
Exploiting the Parallelism of Heterogeneous Systems using Dataflow Graphs on Top of OpenCL
Exploiting the Power of GPUs for Asymmetric Cryptography
Exploiting two-level parallelism by aggregating computing resources in task-based applications over accelerator-based machines
Exploiting Unexploited Computing Resources for Computational Logics
Exploiting Uniform Vector Instructions for GPGPU Performance, Energy Efficiency, and Opportunistic Reliability Enhancement
Exploration of Cryptocurrency Mining-Specific GPUs in AI Applications: A Case Study of CMP 170HX
Exploration of cyber-physical systems for GPGPU computer vision-based detection of biological viruses
Exploration of Low Numeric Precision Deep Learning Inference Using Intel FPGAs
Exploration of Multifrontal Method with GPU in Power Flow Computation
Exploration of Optimization Options for Increasing Performance of a GPU Implementation of a Three-Dimensional Bilateral Filter
Exploration of Parallelization Frameworks for Computational Finance
Explorations of the Viability of ARM and Xeon Phi for Physics Processing
Exploratory Data Analysis of Software Repositories via GPU Processing
Exploratory research on embedding CUDA code into heterogeneous MP-SOC architectures programmed with the Daedalus framework
Exploring 2D tensor fields using stress nets
Exploring Applications in CUDA
Exploring complex quantum systems with a hybrid CPU-GPU computing platform
Exploring computational capabilities of GPUs using H.264 prediction algorithms
Exploring Computer Vision and Image Processing Algorithms in Teaching Parallel Programming
Exploring CPU-GPU Coherence
Exploring data flow design and vectorization with oneAPI for streaming applications on CPU+GPU
Exploring Design Space of 3D NVM and eDRAM Caches Using DESTINY Tool (open-source code)
Exploring Different Automata Representations for Efficient Regular Expression Matching on GPUs
Exploring Fine-Grained Task-based Execution on Multi-GPU Systems
Exploring FPGA Optimizations to Compute Sparse Numerical Linear Algebra Kernels
Exploring FPGA-specific Optimizations for Irregular OpenCL Applications
Exploring GPGPU Acceleration of Process-Oriented Simulations
Exploring GPGPU workloads: Characterization methodology, analysis and microarchitecture evaluation implications
Exploring GPGPUs Workload Characteristics and Power Consumption
Exploring GPU Memory Performance Using Digital Image Processing Algorithms
Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects
Exploring Graphics Processing Unit (GPU) Resource Sharing Efficiency for High Performance Computing
Exploring graphics processing units as parallel coprocessors for online aggregation
Exploring graphics processor performance for general purpose applications
Exploring Heterogeneous Scheduling using the Task-Centric Programming Model
Exploring High Performance SQL Databases with Graphics Processing Units
Exploring LLVM Infrastructure for Simplified Multi-GPU Programming
Exploring Many-Core Design Templates for FPGAs and ASICs
Exploring Microcontrollers in GPUs
Exploring Multi-level Parallelism for Large-Scale Spiking Neural Networks
Exploring Multiple Dimensions of Parallelism in Junction Tree Message Passing
Exploring Multiple Levels of Performance Modeling for Heterogeneous Systems
Exploring new architectures in accelerating CFD for Air Force applications
Exploring Novel Parallelization Technologies for 3-D Imaging Applications
Exploring Optimisations for the Local Assembly phase of Finite Element Methods on GPUs
Exploring Parallel Algorithms for Volumetric Mass-Spring-Damper Models in CUDA
Exploring Portability and Performance of OpenCL FPGA Kernels on Intel HARPv2
Exploring power efficiency and optimizations targeting heterogeneous applications
Exploring Programming Multi-GPUs using OpenMP & OpenACC-based Hybrid Model
Exploring reconfigurable architectures for explicit finite difference option pricing models
Exploring Reconfigurable Architectures for Tree-Based Option Pricing Models 
Exploring Scalability in C++ Parallel STL Implementations
Exploring scalability of FIR filter realizations on Graphics Processing Units
Exploring SIMD for Molecular Dynamics, Using Intel Xeon Processors and Intel Xeon Phi Coprocessors
Exploring SYCL as a Portability Layer for High-Performance Computing on CPUs
Exploring SYCL for batched kernels with memory allocations
Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API
Exploring the acceleration of Nekbone on reconfigurable architectures
Exploring the Feasibility of Fully Homomorphic Encryption
Exploring The Latency and Bandwidth Tolerance of CUDA Applications
Exploring the Limits of Generic Code Execution on GPUs via Direct (OpenMP) Offload
Exploring the Limits of GPUs With Parallel Graph Algorithms
Exploring the Millennium Run - Scalable Rendering of Large-Scale Cosmological Datasets
Exploring the multiple-GPU design space
Exploring the Multitude of Real-Time Multi-GPU Configurations
Exploring the Optimization Space of Multi-Core Architectures with OpenCL Benchmarks
Exploring the power of GPU's for training Deep Belief Networks
Exploring the Suitability of Remote GPGPU Virtualization for the OpenACC Programming Model Using rCUDA
Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators
Exploring the use of glossy light volumes for interactive global illumination
Exploring Thread Coarsening on FPGA
Exploring Traditional and Emerging Parallel Programming Models using a Proxy Application
Exploring utilisation of GPU for database applications
Exploring weak scalability for FEM calculations on a GPU-enhanced cluster
Exponential integrators on graphic processing units
Exponential Integrators on Graphics Processing Units
Exposing Errors Related to Weak Memory in GPU Applications
Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods
Exposing non-standard architectures to embedded software using compile-time virtualisation
Exposure Render: An Interactive Photo-Realistic Volume Rendering Framework
Expressed Sequence Tag Clustering using Commercial Gaming Hardware
Expressive Array Constructs in an Embedded GPU Kernel Programming Language
Extendable pattern-oriented optimization directives
Extendable Pattern-Oriented Optimization Directives (extended version)
Extended Data Collection: Analysis of Cache Behavior and Performance of Different BVH Memory Layouts for Tracing Incoherent Rays
Extended Dynamic Programming and Fast Multidimensional Search Algorithm for Energy Minimization in Stereo and Motion
Extended-precision floating-point numbers for GPU computation
Extending a C-like Language for Portable SIMD Programming
Extending a Run-time Resource Management framework to support OpenCL and Heterogeneous Systems
Extending abstract GPU APIs to shared memory
Extending adaptive sparse grids for stochastic collocation to hybrid parallel architectures
Extending High-Level Synthesis for Task-Parallel Programs
Extending Lyapack for the Solution of Band Lyapunov Equations on Hybrid CPU-GPU Platforms
Extending MAGMA Portability with OneAPI
Extending MPI to Accelerators
Extending OmpSs for OpenCL kernel co-execution in heterogeneous systems
Extending OmpSs to support CUDA and OpenCL in C, C++ and Fortran Applications
Extending Scala with General Purpose GPU Programming
Extending SYCL's Programming Paradigm with Tensor-based SIMD Abstractions
Extending the Computational Application of Reaction-Diffusion Chemistry by Modelling Artificial Neural Networks
Extending the Generalized Fermat Prime Number Search Beyond One Million Digits Using GPUs
Extending the Gotran framework: LATEX and GPU acceleration
Extending the Scalability of Single Chip Stream Processors with On-chip Caches
Extending the SkelCL Skeleton Library for Stencil Computations on Multi-GPU Systems
Extension of the SkePU Skeleton Programming Framework for Multi-core CPU and Multi-GPU Systems for MPI-based Clusters
Extensions and Limitations of the Neural GPU
Extensions of Parallel Coordinates for Interactive Exploration of Large Multi-Timepoint Data Sets
Extinction-Based Shading and Illumination in GPU Volume Ray-Casting
Extracting Flow Features Using Bag-of-Features and Supervised Learning Techniques
Extracting Maximal Exact Matches on GPU
Extremely fast simulator for decoding LDPC codes
Extremely large scale simulation of a Kardar-Parisi-Zhang model using graphics cards
Eye-Full Tower: A GPU-based variable multibaseline omnidirectional stereovision system with automatic baseline selection for outdoor mobile robot navigation
Face Detection CUDA Accelerating
Face Detection for Human Identification in Surveillance
Face Detection on CUDA
Face Detection with Improved Local Binary Patterns in CUDA
Face Recognition Using OpenCL
Face Recognition with Hybrid Efficient Convolution Algorithms on FPGAs
Face Recognition: A Tutorial on Computational Aspects
Face Retriever: Pre-filtering the Gallery via Deep Neural Net
Face Search at Scale: 80 Million Gallery
Face.evoLVe: A High-Performance Face Recognition Library
Facial Expression Recognition - Review
Facial Recognition Using Neural Networks over GPGPU
FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow
fairseq: A Fast, Extensible Toolkit for Sequence Modeling
Falcon: A Graph Manipulation Language for Heterogeneous Systems
FAMOUS, faster: using parallel computing techniques to accelerate the FAMOUS/HadCM3 climate model with a focus on the radiative transfer algorithm
Fancier: A Unified Framework for Java, C, and OpenCL Integration
FANN-on-MCU: An Open-Source Toolkit for Energy-Efficient Neural Network Inference at the Edge of the Internet of Things
FANS: FPGA-Accelerated Near-Storage Sorting
FARGO3D: A new GPU-oriented MHD code
Fast 2-D Ultrasound Strain Imaging: The Benefits of Using a GPU
Fast 2D-3D registration using GPU-based preprocessing
Fast 3D Graphics Rendering Technique with CUDA Parallel Processing
Fast 3D Salient Region Detection in Medical Images using GPUs
Fast 3D Structure Localization in Medical Volumes using CUDA-enabled GPUs
Fast 3D Wavelet Transform on Multicore and Manycore Computing Platforms
Fast 4D Sheared Filtering for Interactive Rendering of Distribution Effects
Fast 4pi track reconstruction in nuclear emulsion detectors based on GPU technology
Fast Acceleration of 2D Wave Propagation Simulations Using Modern Computational Accelerators
Fast acoustic computations using graphics processors
Fast Adaptive Sampling Technique for Multi-Dimensional Integral Estimation Using GPUs
Fast algorithm of ray tracing based on KD-tree structure
Fast algorithms and efficient GPU implementations for the Radon transform and the back-projection operator represented as convolution operators
Fast Algorithms for Convolutional Neural Networks
Fast Algorithms for the Solution of Stochastic Partial Differential Equations
Fast American Basket Option Pricing on a multi-GPU Cluster
Fast Analysis of Molecular Dynamics Trajectories with Graphics Processing Units - Radial Distribution Function Histogramming
Fast analysis of molecular dynamics trajectories with graphics processing units-Radial distribution function histogramming
Fast analytical modeling of compton scatter using point clouds and graphics processing unit (GPU)
Fast and accurate digital signal processing realized with GPGPU technology
Fast and Accurate Finite-Element Multigrid Solvers for PDE Simulations on GPU Clusters
Fast and Accurate Generalized Harmonic Analysis and Its Parallel Computation by GPU
Fast and accurate PIV computation using highly parallel iterative correlation maximization
Fast and Accurate Poisson Denoising with Optimized Nonlinear Diffusion
Fast and accurate protein substructure searching with simulated annealing and GPUs
Fast and approximate stream mining of quantiles and frequencies using graphics processors
Fast and automatic object pose estimation for range images on the GPU
Fast and Efficient Automatic Memory Management for GPUs using Compiler-Assisted Runtime Coherence Scheme
Fast and Efficient Dense Variational Stereo on GPU
Fast and Efficient FPGA-Based Feature Detection Employing the SURF Algorithm
Fast and Efficient Lossless Image Compression Based on CUDA Parallel Wavelet Tree Encoding
Fast and Energy-Efficient CNN Inference on IoT Devices
Fast and exact solution of Total Variation models on the GPU
Fast and Flexible GPU Accelerated Binding Free Energy Calculations within the AMBER Molecular Dynamics Package
Fast and Flexible: Parallel Packet Processing with GPUs and Click
Fast and informative flow simulations in a building by using fast fluid dynamics model on graphics processing unit
Fast and Maliciously Secure Two-Party Computation Using the GPU
Fast and Memory Efficient GPU-Based Rendering of Tensor Data
Fast and Memory-Efficient Minimum Spanning Tree on the GPU
Fast and Practical Strassen's Matrix Multiplication using FPGAs
Fast and reliable collision culling using graphics hardware
Fast and Robust 3D Correspondence Matching and Its Application to Volume Registration
Fast and robust CAMShift tracking
Fast and Robust Linear Motion Deblurring
Fast and Robust Pyramid-based Image Processing
Fast and Scalable CPU/GPU Collision Detection for Rigid and Deformable Surfaces
Fast and scalable list ranking on the GPU
Fast and sleek glyph rendering for interactive HARDI data exploration
Fast Antenna Characterization Using the Sources Reconstruction Method on Graphics Processors
Fast approximate k-nearest neighbours search using GPGPU
Fast Approximation of High-Order Voronoi Diagrams and Distance Transforms on the GPU
Fast Arbitrary Precision Floating Point on FPGA
Fast Automatic Heuristic Construction Using Active Learning
Fast binding site mapping using GPUs and CUDA
Fast Bio-Inspired Computation using a GPU-based Systemic Computer
Fast Boolean Calculations Using the GPU
Fast boosting trees for classification, pose detection, and boundary detection on a GPU
Fast Burrows Wheeler Compression Using CPU and GPU
Fast BVH Construction on GPUs
Fast calculation of computer-generated-hologram on AMD HD5000 series GPU and OpenCL
Fast Calculation of Electrostatic Potentials on the GPU or the ASIC MD-GRAPE-3
Fast calculation of HELAS amplitudes using graphics processing unit (GPU)
Fast Calculation of the Lomb-Scargle Periodogram Using Graphics Processing Units
Fast Camera Image Denoising on Mobile GPUs with Deep Learning, Mobile AI 2021 Challenge: Report
Fast cell detection in high-throughput imagery using GPU-accelerated machine learning
Fast CGH computation using S-LUT on GPU
Fast circuit simulation on graphics processing units
Fast Code Exploration for Pipeline Processing in FPGA Accelerators
Fast Collision Culling in Large-Scale Environments Using GPU Mapping Function
Fast collision detection using the A-buffer
Fast computation of computer-generated hologram using Xeon Phi coprocessor
Fast Computation of Computer-generated Hologram Using Xeon Phi Coprocessors
Fast computation of database operations using graphics processors
Fast Computation of Dipole Radiation in Stratified Background Using Graphics Processing Unit
Fast computation of general Fourier Transforms on GPUs
Fast computation of generalized Voronoi diagrams using graphics hardware
Fast computation of MadGraph amplitudes on graphics processing unit (GPU)
Fast computation of scattering maps of nanostructures using graphical processing units
Fast Computing Adaptively Sampled Distance Field on GPU
Fast computing of scattering maps of nanostructures using graphical processing units
Fast Conjugate Gradients with Multiple GPUs
Fast Construction of SAH BVHs on the Intel Many Integrated Core (MIC) Architecture
Fast continuous collision detection among deformable models using graphics processors
Fast convolution kernels on pascal GPU with high memory efficiency
Fast Convolutional Nets With fbfft: A GPU Performance Evaluation
Fast convolutional neural networks on FPGAs with hls4ml
Fast CT Image Processing using Parallelized Non-local Means
Fast CUDA-Aware MPI Datatypes without Platform Support
Fast Deformable Registration on the GPU: A CUDA Implementation of Demons
Fast Detection of Overlapping Communities via Online Tensor Methods on GPUs
Fast Determination of the Number of Endmembers for Real-Time Hyperspectral Unmixing on GPUs
Fast development of dense linear algebra codes on graphics processors
Fast Diameter Computation of Large Sparse Graphs using GPUs
Fast Disk Encryption through GPGPU Acceleration
Fast distributed phononic band-structure calculations through a GPU accelerated mixed-variational formulation
Fast Dynamic Voronoi Treemaps
Fast Effective Deterministic Primality Test Using CUDA/GPGPU
Fast Efficient Artificial Neural Network for Handwritten Digit Recognition
Fast End-to-End Multi-Conjugate AO Simulations Using Graphical Processing Units and the MAOS Simulation Code
Fast Endmember Extraction for Massive Hyperspectral Sensor Data on GPUs
Fast Estimation of Gaussian Mixture Model Parameters on GPU using CUDA
Fast Evaluation of GP Trees on GPGPU by Optimizing Hardware Scheduling
Fast evaluation of Helmholtz potential on graphics processing units (GPUs) 
Fast evolutionary image processing using Multi-GPUs
Fast Exact Bayesian Inference for High-Dimensional Models
Fast Exact Hyper-Graph Matching with Dynamic Programming for Spatio-Temporal Data
Fast Exact String Matching on the GPU
Fast exhaustive search for polynomial systems in F2
Fast Exhaustive Search for Quadratic Systems in F2 on FPGAs - Extended Version
Fast extraction of neuron morphologies from large-scale SBFSEM image stacks
Fast Face Detection Using Graphics Processor
Fast face recognition approach using a graphical processing unit (GPU)
Fast face tracking using parallel particle filter algorithm
Fast Feature Selection in a GPU Cluster Using the Delta Test
Fast Finite Solar Radiation Pressure Model Integration Using OpenGL
Fast fluid dynamics simulation on the GPU
Fast forwarding table lookup exploiting GPU memory architecture
Fast Fourier Transforms over Prime Fields of Large Characteristic and their Implementation on Graphics Processing Units
Fast free-form deformation using graphics processing units
Fast Frequent Itemset Mining from Uncertain Databases using GPGPU
Fast gain-adaptive KLT tracking on the GPU
Fast Galactic Structure Finding using Graphics Processing Units
Fast Gather-based Construction of Stereoscopic Images Using Reprojection
Fast generating of a digital hologram using general-purpose computation on graphics processing units
Fast Genetic Programming and Artificial Developmental Systems on GPUs
Fast Genetic Programming on GPUs
Fast Global Illumination for Interactive Volume Visualization
Fast GPGPU Data Rearrangement Kernels using CUDA
Fast GPGPU-Based Elliptic Curve Scalar Multiplication
Fast GPU bounding boxes on tree-structured scenes
Fast GPU Garment Simulation and Collision Detection
Fast GPU implementation of large scale dictionary and sparse representation based vision problems
Fast GPU Implementation of Sparse Signal Recovery from Random Projections
Fast GPU-based Adaptive Tessellation with CUDA
Fast GPU-Based Automatic Time Gain Compensation for Ultrasound Imaging
Fast GPU-based calculations in few-body quantum scattering
Fast GPU-Based CT Reconstruction using the Common Unified Device Architecture (CUDA)
Fast GPU-based fluid simulations using SPH
Fast GPU-based image warping and inpainting for frame interpolation
Fast GPU-Based Interpolation for SAR Backprojection
Fast GPU-based Locality Sensitive Hashing for K-Nearest Neighbor Computation
Fast GPU-based normal map generation for simplified models
Fast GPU-Based Seismogram Simulation from Microseismic Events in Marine Environments Using Heterogeneous Velocity Models
Fast GPU-based space-time correlation for activity recognition in video sequences
Fast Graph Cuts using Shrink-Expand Reparameterization
Fast Greeks: Case of Credit Valuation Adjustments
Fast Gunrock Subgraph Matching (GSM) on GPUs
Fast Hair Simulation and Rendering Using CUDA and OpenGL
Fast Hamiltonian Monte Carlo Using GPU Computing
Fast Hardware-Accelerated Volume Rendering of CT Scans
Fast heterogeneous computing with CUDA compatible Tesla GPU computing processor (personal supercomputing)
Fast High-Quality Volume Ray Casting with Virtual Samplings
Fast Histograms using Adaptive CUDA Streams
Fast hough transform on GPUs: exploration of algorithm trade-offs
Fast Human Detection with Cascaded Ensembles
Fast Human Detection with Cascaded Ensembles on the GPU
Fast Hydraulic and Thermal Erosion on GPU
Fast Hydraulic Erosion Simulation and Visualization on GPU
Fast hydrodynamics on heterogenous many-core hardware
Fast hyperbolic Radon transform represented as convolutions in log-polar coordinates
Fast Image Alignment with Fourier Moment Matching on GPU
Fast Image Processing with Embedded Microprocessors
Fast Image Scanning with Deep Max-Pooling Convolutional Neural Networks
Fast imaging by a single-slice-detector helical CT
Fast Implementation of DGEMM on Fermi GPU
Fast implementation of fully iterative scatter corrected OSEM for HRRT using GPU
Fast Implementation of Scale Invariant Feature Transform Based on CUDA
Fast Implementation of Two Hash Algorithms on nVidia CUDA GPU
Fast implementation of Wyner-Ziv Video codec using GPGPU
Fast in-place sorting with CUDA based on bitonic sort
Fast inference of deep neural networks in FPGAs for particle physics
Fast interpolated cameras by combining a GPU based plane sweep with a max-flow regularisation algorithm
Fast Isosurface Rendering on a GPU by Cell Rasterization
Fast JND-Based Video Carving With GPU Acceleration for Real-Time Video Retargeting
Fast k Nearest Neighbor Search using GPU
Fast k-NNG construction with GPU-based quick multi-select
Fast K-selection Algorithms for Graphics Processing Units
Fast Katsevich Algorithm Based on GPU for Helical Cone-Beam Computed Tomography
Fast Knowledge Graph Completion using Graphics Processing Units
Fast LBP Face Detection on low-power SIMD architectures
Fast Level Set Segmentation of Biomedical Images using Graphics Processing Units
Fast Linear Algebra on GPU
Fast Locality Sensitive Hashing for Beam Search on GPU
Fast locally consistent dense stereo on multicore
Fast LZW compression using a GPU
Fast Makespan Estimation for GPU Threads on a Single Streaming Multiprocessor
Fast matrix multiplies using graphics hardware 
Fast Merge Tree Computation via SYCL
Fast Mersenne prime testing on the GPU
Fast minimum spanning tree for large graphs on the GPU
Fast Monte Carlo Simulation for Patient-specific CT/CBCT Imaging Dose Calculation
Fast Morphological Image Processing on GPU using CUDA
Fast Morphological Image Processing Open-Source Extensions for GPU processing with CUDA
Fast motion detection from airborne videos using graphics processing unit
Fast Motion Estimation on Graphics Hardware for H.264 Video Encoding
Fast MPEG-CDVS Encoder with GPU-CPU Hybrid Computing
Fast Multidimensional Image Processing with OpenCL
Fast Multipole Method vs. Spectral Method for the Simulation of Isotropic Turbulence on GPUs
Fast Multipole Methods and High Performance Computing
Fast multipole methods on a cluster of GPUs for the meshless simulation of turbulence
Fast multipole methods on graphics processors
Fast N-body Simulations on GPUs
Fast network centrality analysis using GPUs
Fast network communities visualization on massively parallel GPU architecture
Fast Neural Network Training on General Purpose Computers
Fast Neural Representations for Direct Volume Rendering
Fast Neuromimetic Object Recognition using FPGA Outperforms GPU Implementations
Fast numerical reconstruction of digital holography based on graphic processing unit
Fast OBJ file importing and parsing in CUDA
Fast Object Re-Detection and Localization in Video for Spatio-Temporal Fragment Creation
Fast On-line Statistical Learning on a GPGPU
Fast Optimal Mass Transport for Dynamic Active Contour Tracking on the GPU
Fast parallel algorithm for audio content retrieval on GPUs
Fast Parallel Algorithm for Enumerating All Chordless Cycles in Graphs
Fast parallel GPU-sorting using a hybrid algorithm
Fast Parallel Image Registration on CPU and GPU for Diagnostic Classification of Alzheimer's Disease
Fast Parallel Implementation of Fractional Packing and Covering Linear Programs
Fast Parallel Machine Learning Algorithms for Large Datasets Using Graphic Processing Unit
Fast Parallel Markov Clustering in Bioinformatics Using Massively Parallel Graphics Processing Unit Computing
Fast parallel Particle-To-Grid interpolation for plasma PIC simulations on the GPU
Fast parallel simulation of fiber optical communication systems accelerated by a graphics processing unit
Fast Parallel Sorting Algorithms on GPUs
Fast parallel surface and solid voxelization on GPUs
Fast Parallel Tandem Mass Spectral Library Searching Using GPU Hardware Acceleration
Fast parallel volume visualization on CUDA technology
Fast PCA-Based Face Recognition on GPUs
Fast perspective volume ray casting method using GPU-based acceleration techniques for translucency rendering in 3D endoluminal CT colonography
Fast Poisson Solvers for Graphics Processing Units
Fast Polynomial Approximation Acceleration on the GPU
Fast Positron Range Calculation in Heterogeneous Media for 3D PET Reconstruction
Fast Predictive Image Registration
Fast QAP Solver with ACO and Taboo Search on GPU using Move-Cost Adjusted Thread Assignment
Fast quantum Monte Carlo on a GPU
Fast Query for Exemplar-Based Image Completion
Fast Radix Sort for Sparse Linear Algebra on GPU
Fast Random Graph Generation
Fast Ray Sorting and Breadth-First Packet Traversal for GPU Ray Tracing
Fast RCS prediction using multiresolution shooting and bouncing ray method on the GPU
Fast reconstruction of 3D volumes from 2D CT projection data with GPUs
Fast recursive filters for simulating nonlinear dynamic systems
Fast reduction of undersampling artifacts in radial MR angiography with 3D total variation on graphics hardware
Fast Regularization of Matrix-Valued Images
Fast Retinal Vessel Analysis
Fast scale invariant feature detection and matching on programmable graphics hardware
Fast scale invariant textured synthesis with GPU acceleration
Fast scan algorithms on graphics processors
Fast scene voxelization and applications
Fast Schedulability Analysis Using Commodity Graphics Hardware
Fast seismic modeling and Reverse Time Migration on a GPU cluster
Fast Semantic Segmentation of RGB-D Scenes with GPU-Accelerated Deep Neural Networks
Fast Sequence Alignment Method Using CUDA-enabled GPU
Fast short exact repeats finding on GPU
Fast Simulation of Large-Scale Floods Based on GPU Parallel Computing
Fast simulation of nonlinear radio frequency ultrasound images in inhomogeneous nonlinear media: CREANUIS
Fast Simulations of Gravitational Many-body Problem on RV770 GPU
Fast Soft Self-Shadowing on Dynamic Height Fields
Fast Software AES Encryption
Fast Solving of Influence Diagrams for Multiagent Planning on GPU-enabled Architectures
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort
Fast Sorting Algorithms using AVX-512 on Intel Knights Landing
Fast Sparse Level Sets on Graphics Hardware
Fast Sparse Matrix Multiplication on GPU
Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining
Fast Speaker Diarization Using a High-Level Scripting Language
Fast Speaker Diarization Using a Specialization Framework for Gaussian Mixture Model Training
Fast Spoken Query Detection Using Lower-Bound Dynamic Time Warping on Graphical Processing Units
Fast Subgraph Matching on Large Graphs using Graphics Processors
Fast support vector machine training and classification on graphics processors
Fast Surface Extraction and Visualization of Medical Images using OpenCL and GPUs
Fast thermal simulation of 2D/3D integrated circuits exploiting neural networks and GPUs
Fast Training of Convolutional Networks through FFTs
Fast tridiagonal solvers on the GPU
Fast Truncated SVD of Sparse and Dense Matrices on Graphics Processors
Fast Turnaround HLS Debugging using Dependency Analysis and Debug Overlays
Fast TV-L1 Optical Flow for Interactivity
Fast Two Dimensional Convex Hull on the GPU
Fast Ultrasound Image Simulation Using the Westervelt Equation
Fast Universal Background Model (UBM) Training on GPUs using Compute Unified Device Architecture (CUDA)
Fast Variable Center-Biased Windowing for High-Speed Stereo on Programmable Graphics Hardware
Fast variational static IR-drop analysis on the graphical processing unit
Fast view synthesis using GPU for 3D display
Fast Virus Signature Matching Based on the High Performance Computing of GPU
Fast volumetric deformation on general purpose hardware
Fast-Coding Robust Motion Estimation Model in a GPU
Fast-Fourier-Transform-Based Electrical Noise Measurements
Fast, Accurate and Shift-Varying Line Projections for Iterative Reconstruction Using the GPU
Fast, large volume, GPU enabled simulations for the Ly-alpha forest: power spectrum forecasts for baryon acoustic oscillation experiments
Fast, Memory-Efficient Construction of Voxelized Shadows
Fast, parallel and secure cryptography algorithm using Lorenz's attractor
Fast, parallel implementation of particle filtering on the GPU architecture
Fast, parallel, GPU-based construction of space filling curves and octrees
Fast, Processor-Cardinality Agnostic PRNG with a Tracking Application
Fast, Realistic Terrain Synthesis
FAST: fast architecture sensitive tree search on modern CPUs and GPUs
FastCollect: Offloading Generational Garbage Collection to Integrated GPUs
Faster across the PCIe bus: A GPU library for lightweight decompression
Faster Algorithms for RNA-folding using the Four-Russians method
Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs
Faster Dark Matter Calculations Using the GPU
Faster File Matching using GPGPUs
Faster GPU Based Genetic Programming Using A Two Dimensional Stack
Faster GPU-based convolutional gridding via thread coarsening
Faster Maliciously Secure Two-Party Computation Using the GPU
Faster matrix-vector multiplication on GeForce 8800GTX
Faster Multipattern Matching System on GPU Based on Aho-Corasick Algorithm
Faster Multiple Pattern Matching System on GPU based on Bit-Parallelism
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Faster Radix Sort via Virtual Memory and Write-Combining
Faster sequence alignment through GPU-accelerated restriction of the seed-and-extend search space
Faster than FAST: GPU-Accelerated Frontend for High-Speed VIO
Faster Upper Body Pose Estimation and Recognition Using CUDA
Faster Upper Body Pose Estimation Using CUDA
FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours
fastHOG - a real-time GPU implementation of HOG
FastMag: Fast micromagnetic simulator for complex magnetic structures
Fastplay: A Parallelization Model and Implementation of SMC on CUDA Based GPU Cluster Architecture
Fastrack: Fast IO for Secure ML using GPU TEEs
FastSpMM: An Efficient Library for Sparse Matrix Matrix Product on GPUs
FastSVC: Fast Cross-Domain Singing Voice Conversion with Feature-wise Linear Modulation
FastTree: A Hardware KD-Tree Construction Acceleration Engine for Real-Time Ray Tracing
Fat versus Thin Threading Approach on GPUs: Application to Stochastic Simulation of Chemical Reactions
Fat vs. Thin Threading Approach on GPUs: Application to Stochastic Simulation of Chemical Reactions
FATSEA-An Architectural Simulator for General Purpose Computing on GPUs
Fault Injection techniques for GPU Reliability Evaluation
Fault Table Computation on GPUs
Fault table generation using Graphics Processing Units
Fault Tree Analysis Speed-up with GPU Parallel Computing
FBLAS: Streaming Linear Algebra Kernels on FPGA
FBLAS: Streaming Linear Algebra on FPGA
FC_ACCEL: Enabling Efficient, Low-Latency and Flexible Inference in DNN Fully Connected Layers, using Optimized Checkerboard Block matrix decomposition, fast scheduling, and a resource efficient 1D PE array with a custom HBM2 memory subsystem
FCBench: Cross-Domain Benchmarking of Lossless Compression for Floating-Point Data
FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs
FDTD calculations using graphical processing units
FDTD on Distributed Heterogeneous Multi-GPU Systems
Fearless Concurrency on the GPU
Feasibility Analysis of Bilateral Filtering by General Purpose Graphical Processing Unit Computing
Feasibility Analysis of Low Cost Graphical Processing Units for Electromagnetic Field Simulations by Finite Difference Time Domain Method
FEAST - Realisation of hardware-oriented Numerics for HPC simulations with Finite Elements
Feature Aligned Volume Manipulation for Illustration and Visualization
Feature based terrain generation using diffusion equation
Feature Extraction and Visualization from Higher-Order CFD Data
Feature Generation for Quantification of Visual Similarity
Feature tracking and matching in video using programmable graphics hardware
Feature Tracking in Time-Varying Volumetric Data through Scale Invariant Feature Transform
Feature-based speed limit sign detection using a graphics processing unit
Feature-preserving triangular geometry images for level-of-detail representation of static and skinned meshes
FeCaffe: FPGA-enabled Caffe with OpenCL for Deep Learning Training and Inference on Intel Stratix 10
FELARE: Fair Scheduling of Machine Learning Applications on Heterogeneous Edge Systems
Fermi GF100 GPU Architecture
Ferrofluid Simulations with the Barnes-Hut Algorithm on Graphics Processing Units
Feynman Machine: The Universal Dynamical Systems Computer
FFT and Convolution Performance in Image Filtering on GPU
FFT Implementation on a Streaming Architecture
FFT Parallel Implementation for MRI Image Reconstruction
FFT-SPA Non-Binary LDPC Decoding on GPU
FIELA: A Fast Image Encryption with Lorenz Attractor using Hybrid Computing
Field modelling acceleration on ultrasonic systems using graphic hardware
FIESTA 4: optimized Feynman integral calculations with GPU support
FIKIT: Priority-Based Real-time GPU Multi-tasking Scheduling with Kernel Identification
File I/O on Intel Xeon Phi Coprocessors: RAM disks, VirtIO, NFS and Lustre
Filtered Blending: A new, minimal Reconstruction Filter for Ghosting-Free Projective Texturing with Multiple Images
Final Project Implementing Extremely Randomized Trees in CUDA
Financial Derivatives Modeling Using GPU's
Financial modeling on the cell broadband engine
Finding Convex Hulls Using Quickhull on the GPU
Finding faint HI structure in and around galaxies: scraping the barrel
Finding Longest Common Subsequences by GPU-Based Parallel Ant Colony Optimization
Finding Missed Code Size Optimizations in Compilers using LLMs
Finding Next Best Views for Autonomous UAV Mapping through GPU-Accelerated Particle Simulation
Finding the Force - Consistent Particle Seeding for Satellite Aerodynamics
Finding, Measuring, and Reducing Inefficiencies in Contemporary Computer Systems
Fine-Grain Acceleration of Graph Algorithms on a Heterogeneous Chip
Fine-grain Parallelism using Multi-core, Cell/BE, and GPU Systems
Fine-grain Parallelism Using Multi-core, Cell/BE, and GPU Systems: Accelerating the Phylogenetic Likelihood Function
Fine-grain Task Aggregation and Coordination on GPUs
Fine-grained Parallel ILU Preconditioners with Fill-ins for Multi-core CPUs and GPUs
Fine-Grained Parallel Incomplete LU Factorization
Fine-grained parallelization of a Vlasov-Poisson application on GPU
Fine-Grained Resource Sharing for Concurrent GPGPU Kernels
Fine-Grained Synchronizations and Dataflow Programming on GPUs
Fine-Grained Treatment to Synchronizations in GPU-to-CPU Translation
Fine-Granular Parallel EBCOT and Optimization with CUDA for Digital Cinema Image Compression
Fine-sorting One-dimensional Particle-In-Cell Algorithm with Monte-Carlo Collisions on a Graphics Processing Unit
Fine-Tuning GPT-5 for GPU Kernel Generation
Fine-Tuning Vectorization and Memory Traffic on Intel Xeon Phi Coprocessors: LU Decomposition of Small Matrices
Fingerprint grid enhancement on GPU
Fingerprint Local Invariant Feature Extraction on GPU with CUDA
Finite Difference Time Domain (FDTD) Simulations Using Graphics Processors
Finite Difference Time-Domain Modelling of Metamaterials: GPU Implementation of Cylindrical Cloak
Finite differences numerical method for two-dimensional superlattice Boltzmann transport equation and case comparison of CPU(C) and GPGPU(CUDA) implementations
Finite element assembly strategies on multi-and many-core architectures
Finite Element Integration on GPUs
Finite Element Integration with Quadrature on the GPU
Finite Element Matrix Generation on a GPU
Finite Element Modelling of Prostate Deformation and Needle-Tissue Interactions
Finite element numerical integration for first order approximations on multi-core architectures
Finite Element Numerical Integration on Xeon Phi coprocessor
Finite Pointset Method for 2D Dam-Break Problem with GPU-Acceleration
Finite temperature lattice QCD with GPUs
Finite Volume Errors in B_K
Finite-difference time-domain simulations of metamaterials
Finite-difference time-domain solver for room acoustics using graphics processing units
Finite-size scaling method for the Berezinskii-Kosterlitz-Thouless transition
FIR filtering and AES encryption with OpenCL 2.0
Fireflies: New software for interactively exploring dynamical systems using GPU computing
Fireiron: A Scheduling Language for High-Performance Linear Algebra on GPUs
Firepile: Run-time Compilation for GPUs in Scala
First Application of Lattice QCD to Pezy-SC Processor
First Evaluation of the CPU, GPGPU and MIC Architectures for Real Time Particle Tracking based on Hough Transform at the LHC
First Experiences Optimizing Smith-Waterman on Intel's Knights Landing Processor
First experiences with the Intel MIC architecture at LRZ
First Steps Towards More Numerical Reproducibility
Fitting Galaxies on GPUs
Fitting multi-planet transit models to photometric time-data series by evolution strategies
Fixing Performance Bugs: An Empirical Study of Open-Source GPGPU Programs
FLASH: Fast All-to-All Communication in GPU Clusters
FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
Flashlight: Enabling Innovation in Tools for Machine Learning
FlexGrip: A Soft GPGPU for FPGAs
Flexible FPGA design for FDTD using OpenCL
Flexible Hardware Mapping for Finite Element Simulations on Hybrid CPU / GPU Clusters
Flexible Linear Algebra Development and Scheduling with Cholesky Factorization
Flexible N-Way MIMO Detector on GPU
Flexible neuronal network simulation framework using code generation for NVidia CUDA
Flexible OpenCL accelerated disparity estimation for video communication applications
Flexible Performant GEMM Kernels on GPUs
Flexible Pixel Compositor for Plug-and-Play Multi-Projector Displays
Flexible Software Profiling of GPU Architectures
Flexible, Fast and Accurate Sequence Alignment Profiling on GPGPU with PaSWAS
Flexible, high performance convolutional neural networks for image classification
FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System
FLIA: Architecture of Collaborated Mobile GPU and FPGA Heterogeneous Computing
Flip-Flop: Convex Hull Construction via Star-Shaped Polyhedron in 3D
Floating Point Arithmetic for Transport Triggered Architectures
Floating Textures
Floating-Point Arithmetic in Transport Triggered Architectures
Floating-point data compression at 75 Gb/s on a GPU
Floating-point Mixed-radix FFT Core Generation for FPGA and Comparison with GPU and CPU
Flocking Implementation for the Blender Game Engine
Flow Charts: Visualization of Vector Fields on Arbitrary Surfaces
FLOWER: A Comprehensive Dataflow Compiler for High-Level Synthesis
FlowPM: Distributed TensorFlow Implementation of the FastPM Cosmological N-body Solver
FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow
FlowTour: An Automatic Guide for Exploring Internal Flow Features
Fluid Dynamics Simulations on Multi-GPU Systems
Fluid Motion Modelling Using Vortex Particle Method on GPU
Fluid Simulation and Generating Textures with Reaction-Diffusion Systems on Surfaces in the GPU
Fluid Simulation by the Smoothed Particle Hydrodynamics Method: A Survey
Fluid Simulation on Surfaces in the GPU
Fluid simulation with SIMPLE method using graphic processors
Fluid Simulation: Smoothed Particle Hydrodynamics on the GPU
Fluid-solid coupling on a cluster of GPU graphics cards for seismic wave propagation
FluidFFT: common API (C++ and Python) for Fast Fourier Transform HPC libraries
FluoroSim: A Visual Problem-Solving Environment for Fluorescence Microscopy
Flux tubes at Finite Temperature
FMM-based vortex method for simulation of isotropic turbulence on GPUs, compared with a spectral method
fMRI analysis on the GPU-possibilities and challenges
Focus measurement on programmable graphics hardware for all in-focus rendering from light fields
Focused Volumetric Visual Hull with Color Extraction
Forecasting high frequency financial time series using parallel FFN with CUDA and ZeroMQ
Forecasting time series with constraints
Forensics on GPU Coprocessing in Databases - Research Challenges, First Experiments, and Countermeasures
Formal Analysis of GPU Programs with Atomics via Conflict-Directed Delay-Bounding
Formal Description and Optimization Based High-Performance Computing on CUDA
Formal Semantics of Heterogeneous CUDA-C: A Modular Approach with Applications
Formal specification and verification of OpenCL Kernel optimization
Formalizing Address Spaces with application to Cuda, OpenCL, and beyond
ForOpenCL: Transformations Exploiting Array Syntax in Fortran for Accelerator Programming
Fortran High-Level Synthesis: Reducing the barriers to accelerating HPC codes on FPGAs
Fortran performance optimisation and auto-parallelisation by leveraging MLIR-based domain specific abstractions in Flang
FortranX: Harnessing Code Generation, Portability, and Heterogeneity in Fortran
Four styles of parallel and net programming
Four-dimensional Cone Beam CT Reconstruction and Enhancement using a Temporal Non-Local Means Method
Fourier Volume Rendering on the GPU Using a Split-Stream-FFT
FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error
FPGA accelerated 3D reconstruction using compressive sensing
FPGA Accelerated Simulation of Biologically Plausible Spiking Neural Networks
FPGA Acceleration of Multifunction Printer Image Processing using OpenCL
FPGA acceleration of rigid-molecule docking codes
FPGA Acceleration of Structured-Mesh-Based Explicit and Implicit Numerical Solvers using SYCL
FPGA acceleration of the phylogenetic likelihood function for Bayesian MCMC inference methods
FPGA Accelerators on Heterogeneous Systems: An Approach Using High Level Synthesis
FPGA and ASIC Convergence
FPGA and GPU implementation of large scale SpMV
FPGA Based Acceleration of Decimal Operations
FPGA Based High Performance and Scalable Block LU Decomposition Architecture
FPGA Based Implementation of Deep Neural Networks Using On-chip Memory Only
FPGA Based Satisfiability Checking
FPGA based Speeded Up Robust Features
FPGA implementation of a Convolutional Neural Network for "Wake up word" detection
FPGA Implementation of Bluetooth Low Energy Physical Layer with OpenCL
FPGA Implementation of Reduced Precision Convolutional Neural Networks
FPGA in HPC: High Level Synthesis of OpenCL kernels for Molecular Dynamics
FPGA vs. GPU for sparse matrix vector multiply
FPGA vs. multi-core CPUs vs. GPUs: hands-on experience with a sorting application
FPGA-Accelerated Image Processing Using High Level Synthesis with OpenCL
FPGA-based acceleration of a particle simulation High Performance Computing application
FPGA-based acceleration of CHARMM-potential minimization
FPGA-based Acceleration of FT Convolution for Pulsar Search Using OpenCL
FPGA-Based Accelerator Design from a Domain-Specific Language
FPGA-Based Design of Numerical Algorithms for Kernel Density Estimation Using High Level Synthesis Approach
FPGA-based Tsunami Simulation: Performance Comparison with GPUs, and Roofline Model for Scalability Analysis
FPGA-GPU architecture for kernel SVM pedestrian detection
FPGA-GPU-CPU Heterogenous Architecture for Real-time Cardiac Physiological Optical Mapping
FPGA: An Efficient And Promising Platform For Real-Time Image Processing Applications
fpgaConvNet: A Toolflow for Mapping Diverse Convolutional Neural Networks on Embedded FPGAs
FPGAs, GPUs and the PS2 - A Single Programming Methodology
Fractal Art Generation using GPUs
Fractal Based Method on Hardware Acceleration for Natural Environments
Fractal Video Compression in OpenCL: An Evaluation of CPUs, GPUs, and FPGAs as Acceleration Platforms
Fractals Image Rendering and Compression using GPUs
Frame-based parallelization of MPEG-4 on compute unified device architecture (CUDA)
Framework for Batched and GPU-resident Factorization Algorithms Applied to Block Householder Transformations
Framework for Parallel Kernels Auto-tuning
Framework for utilizing computational devices within simulation
Frameworks for GPU Accelerators: A comprehensive evaluation using 2D/3D image registration
Frameworks for multi-core architectures: a comprehensive evaluation using 2D/3D image registration
Frameworks in Medical Image Analysis with Deep Neural Networks
Free Launch: Optimizing GPU Dynamic Kernel Launches through Thread Reuse
Free surface flow simulations on GPGPUs using the LBM
Free-form interest rate term structure decomposition: a 2nd order optimization problem
Frequent itemset mining on graphics processors
From Constraint Programming to Heterogeneous Parallelism
From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming
From English To Foreign Languages: Transferring Pre-trained Language Models
From Experiment to Design - Fault Characterization and Detection in Parallel Computer Systems Using Computational Accelerators
From GPUs to AI and quantum: three waves of acceleration in bioinformatics
From MPI to MPI+OpenACC: Conversion of a legacy FORTRAN PCG solver for the spherical Laplace equation
From OpenCL to Gates: the FFT
From Parallel Programs to Customized Parallel Processors
From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation
From Pixels to Torques: Policy Learning using Deep Dynamical Convolutional Networks
From Prompts to Performance: Evaluating LLMs for Task-based Parallel Code Generation
From Rendering to Tracking Point-based 3D Models
From Sparse Matrix to Optimal GPU CUDA Sparse Matrix Vector Product Implementation
From Task-Based GPU Work Aggregation to Stellar Mergers: Turning Fine-Grained CPU Tasks into Portable GPU Kernels
From Tokens to Regions: CUDA-Sensitive Instruction Tuning for GPU Kernel Generation
FSAI preconditioned CG algorithm combined with GPU technique for the finite element analysis of electromagnetic scattering problems
FSCL: Homogeneous programming, scheduling and execution on heterogeneous platforms
FSimGP^2: An Efficient Fault Simulator with GPGPU
FSpGEMM: An OpenCL-based HPC Framework for Accelerating General Sparse Matrix-Matrix Multiplication on FPGAs
FTTN: Feature-Targeted Testing for Numerical Properties of NVIDIA & AMD Matrix Accelerators
Full Covariance Gaussian Mixture Models Evaluation on GPU
Full reconstruction of a 14-qubit state within four hours
Full Speed Ahead: 3D Spatial Database Acceleration with GPUs
Full system simulation of many-core heterogeneous SoCs using GPU and QEMU semihosting
Full-Parallax Hologram Synthesis of Triangular Meshes using a Graphical Processing Unit
Full-resolution interactive CPU volume rendering with coherent BVH traversal
Full-Scale File System Acceleration on GPU
Full-Speed Deterministic Bit-Accurate Parallel Floating-Point Summation on Multi- and Many-Core Architectures
Full-stack Optimization for Accelerating CNNs with FPGA Validation
Full-System Simulation of Mobile CPU/GPU Platforms
Fully 3-D List-Mode OSEM Accelerated by Graphics Processing Units
Fully 3D list-mode time-of-flight PET image reconstruction on GPUs using CUDA
Fully accelerating quantum Monte Carlo simulations of real materials on GPU clusters
Fully automatic extraction of salient objects from videos in near real-time
Fully Concurrent GPU Data Structures
Fully GPU based real time corrections and reconstruction for cone beam micro CT
Fully Parallel Particle Learning for GPGPUs and Other Parallel Devices
Fully-3D GPU PET reconstruction
Fully-Automated Code Generation for Efficient Computation of Sparse Matrix Permanents on GPUs
Function Call Re-Vectorization
Functional and dynamic programming in the design of parallel prefix networks
Functional High Performance Financial IT
Functional Programming for High-Performance Computing on Heterogeneous Architectures
Functional Signal Processing with Pure and Faust Using the LLVM Toolkit
Fused DTI/HARDI Visualization
Fusion of Morphological Images for Airborne Target Detection
FusionAccel: A General Re-configurable Deep Learning Inference Accelerator on FPGA for Convolutional Neural Networks
FusionSim: Characterizing the Performance Benefits of Fused CPU/GPU Systems
FusionStitching: Boosting Execution Efficiency of Memory Intensive Computations for DL Workloads
FusionStitching: Deep Fusion and Code Generation for Tensorflow Computations on GPUs
Futhark Vulkan Backend
Future graphics architectures
Future of GPGPU Micro-Architectural Parameters
FUX-Sim: Implementation of a fast universal simulation/reconstruction framework for X-ray systems
Fuzz4cuda: Fuzzing Your NVIDIA GPU Libraries Through Debug Interface
Fuzzing Loop Optimizations in Compilers for C++ and Data-Parallel Languages
Fuzzy ART Neural Network Parallel Computing on the GPU
FuzzyGPU: a fuzzy arithmetic library for GPU
FZ-GPU: A Fast and High-Ratio Lossy Compressor for Scientific Computing Applications on GPUs
G-CP: Providing Fault Tolerance on the GPU through Software Checkpointing
G-Heart: A GPU-based System for Electrophysiological Simulation and Multi-modality Cardiac Visualization
G-NET: Effective GPU Sharing in NFV Systems
G-NetMon: A GPU-accelerated Network Performance Monitoring System
G-NetMon: A GPU-accelerated Network Performance Monitoring System for Large Scale Scientific Collaborations
G-SNPM - A GPU-based SNP mapping tool
GA3C: GPU-based A3C for Deep Reinforcement Learning
GACO: A GPU-based High Performance Parallel Multi-ant Colony Optimization Algorithm
GaDei: On Scale-up Training As A Service For Deep Learning
GAIN: GPU-based Constraint Checking for Context Consistency
Gaining Cross-Platform Parallelism for HAL's Molecular Dynamics Package using SYCL
Gaiwan: a Size-Polymorphic Typesystem for GPU Programs
GALAMOST: GPU-accelerated large-scale molecular simulation toolkit
GALARIO: a GPU Accelerated Library for Analysing Radio Interferometer Observations
Galerkin-based multi-scale time integration for nonlinear structural dynamics
Gallatin: A General-Purpose GPU Memory Manager
Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism
GamePipe: A Virtualized Cloud Platform Design and Performance Evaluation
GAMER with out-of-core computation
GAMER-2: a GPU-accelerated adaptive mesh refinement code -- accuracy, performance, and scalability
GAMER: a GPU-Accelerated Adaptive Mesh Refinement Code for Astrophysics
GAMUT: GPU accelerated microRNA analysis to uncover target genes through CUDA-miRanda
GARDENIA: A Domain-specific Benchmark Suite for Next-generation Accelerators
GAROP: Genetic Algorithm framework for Running On Parallel environments
GASPP: A GPU-Accelerated Stateful Packet Processing Framework
Gate-Level Simulation with GPU Computing
Gauge Field Generation on Large-Scale GPU-Enabled Systems
Gauge Fixing in Lattice QCD on GPUs
Gauge fixing in lattice QCD with multi-GPUs
Gauge fixing using overrelaxation and simulated annealing on GPUs
Gaussian Mixture Model Based Volume Visualization
Gaussian Process Models with Parallelization and GPU acceleration
Gaussian split Ewald: A fast Ewald mesh method for molecular simulation
GBOOST: A GPU-based tool for detecting gene-gene interactions in genome-wide case control studies
GBOTuner: Autotuning of OpenMP Parallel Codes with Bayesian Optimization and Code Representation Transfer Learning
GC3: An Optimizing Compiler for GPU Collective Communication
GCN Inference Acceleration using High-Level Synthesis
GCS: High-Performance Gate-Level Simulation with GP-GPUs
GCStack+GCScaler: Fast and Accurate GPU Performance Analyses Using Fine-Grained Stall Cycle Accounting and Interval Analysis
Gdev: First-Class GPU Resource Management in the Operating System
GDlog: A GPU-Accelerated Deductive Engine
GE-SpMM: General-purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Networks
Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks
GeantV: from CPU to accelerators
GEARS: A General and Efficient Algorithm for Rendering Shadows
gearshifft - The FFT Benchmark Suite for Heterogeneous Platforms
GeauxDock: Accelerating Structure-Based Virtual Screening with Heterogeneous Computing
GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server
gem5-gpu: A Heterogeneous CPU-GPU Simulator
gEMfitter: A Highly Parallel FFT-Based 3D Density Fitting Tool With GPU Texture Memory Acceleration
GEMM on a GPU
Gemma in April: A matrix-like parallel programming architecture on OpenCL
GEMMbench: a framework for reproducible and collaborative benchmarking of matrix multiplication
gEMpicker: A Highly Parallel GPU-Accelerated Particle Picking Tool for Cryo-Electron Microscopy
GEMTC: GPU Enabled Many-Task Computing
GenBase: A Complex Analytics Genomics Benchmark
General Purpose Computation on Graphics Processing Units Using OpenCL
General purpose computing on graphics processing units using OpenCL
General Purpose Computing on Low-Power Embedded GPUs: Has It Come of Age?
General purpose lattice QCD code set Bridge++ 2.0 for high performance computing
General purpose molecular dynamics simulations fully implemented on graphics processing units
General purpose Molecular Dynamics Simulations on GPUs: Issues of Pair Forces and Scaling to large Clusters
General Transformations for GPU Execution of Tree Traversals
General-Purpose Computing on Tensor Processors
General-purpose GPU computing: practice and experience
General-purpose molecular dynamics simulations on GPU-based clusters
Generalisation in genetic programming
Generalized Resource Allocation for the Cloud
Generalized Voronoi Diagram Computation on GPU
Generalizing Execution of Vectorizable Computations by Generating Vector Oriented Byte Code
Generalizing the Utility of GPUs in Large-Scale Heterogeneous Computing Systems
Generating 3D Topologies with Multiple Constraints on the GPU
Generating and Rendering Procedural Clouds in Real Time on Programmable 3D Graphics Hardware
Generating Binary Optimal Codes Using Heterogeneous Parallel Computing
Generating Custom Code for Efficient Query Execution on Heterogeneous Processors
Generating Device-specific GPU code for Local Operators in Medical Imaging
Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory
Generating Efficient Tensor Contractions for GPUs
Generating GPU Code from a High-level Representation for Image Processing Kernels
Generating GPU Compiler Heuristics using Reinforcement Learning
Generating Literature-Driven Scientific Theories at Scale
Generating massive high-quality random numbers using GPU
Generating Null Models for Large-Scale Networks on GPU
Generating optimal CUDA sparse matrix-vector product implementations for evolving GPU hardware
Generating Parallel OpenCL and OpenMP Programs from Dataflow Graphs
Generating Performance Portable Code using Rewrite Rules: From High-level Functional Expressions to High-Performance OpenCL Code
Generating SU(Nc) pure gauge lattice QCD configurations on GPUs with CUDA and OpenMP
Generating subdivision curves with L-systems on a GPU
Generating textures on Surfaces with Reaction-Diffusion systems in the GPU
Generating, Optimizing, and Scheduling a Compiler Level Representation of Stream Parallelism
Generation of Kernels for Calculating Electron Repulsion Integrals of High Angular Momentum Functions on GPUs - Preliminary Results
Generation of planar radiographs from 3D anatomical models using the GPU
Generation of Random Numbers on Graphics Processors: Forced Indentation In Silico of the Bacteriophage HK97
Generation of the Scrambled Halton Sequence Using Accelerators
Generative programming methods for parallel partial differential field equation solvers
Generative Video Compression: Towards 0.01% Compression Rate for Video Transmission
Generic Inverted Index on the GPU
Generic System Calls for GPUs
Genetic Algorithm Modeling with GPU Parallel Computing Technology
Genetic Improvement of GPU Software
Genetic Programming: An Introductory Tutorial and a Survey of Techniques and Applications
Genetic programming on GPUs for image processing
Genetic programming on graphics processing units
Genetic Programming using the Karva Gene Expression Language on Graphical Processing Units
Genetically Improved BarraCUDA
Genetically Improved CUDA C++ Software
Genetically Improved CUDA kernels for StereoCamera
GenGNN: A Generic FPGA Framework for Graph Neural Network Acceleration
GENIE: a software package for gene-gene interaction analysis in genetic association studies using multiple GPU or CPU cores
GeNN: a code generation framework for accelerated brain simulations
Genomics-GPU: A Benchmark Suite for GPU-accelerated Genome Analysis
GenVectorX: A performance-portable SYCL library for Lorentz Vectors operations
Geo-Correction of High-Resolution Imagery Using Fast Template Matching on a GPU in Emergency Mapping Contexts
Geodesic tree-based dynamic programming for fast stereo reconstruction
Geometric Algebra Computing Technology for Accelerated Processing Units
Geometric Algebra enhanced Precompiler for C++ and OpenCL
Geometric Algebra Enhanced Precompiler for C++, OpenCL and Mathematica's OpenCLLink
Geometric Optimisation using Karva for Graphical Processing Units
Geometry Based Visualization with OpenCL
Geometry Construction from Caustic Images
Geometry Textures and Applications
Geospatial visualization using hardware accelerated real-time volume rendering
Gerbil: A Fast and Memory-Efficient k-mer Counter with GPU-Support
Getting Started with GPU Programming
GEVO-ML: Optimizing Machine Learning Code with Evolutionary Computation
GEVO: GPU Code Optimization using Evolutionary Computation
GGArray: A Dynamically Growable GPU Array
GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters
GGNN: Graph-based GPU Nearest Neighbor Search
GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
GHOST: GPGPU-Offloaded High Performance Storage I/O Deduplication for Primary Storage System
GHOSTM: A GPU-Accelerated Homology Search Tool for Metagenomics
GIFT: A Real-time and Scalable 3D Shape Search Engine
GigaAPI for GPU Parallelization
GiMMiK - Generating Bespoke Matrix Multiplication Kernels for Various Hardware Accelerators; Applications in High-Order Computational Fluid Dynamics
Ginkgo - A Math Library designed for Platform Portability
Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing
GIS Polygon Overlay Processing: New Parallel Algorithm and System Prototype
GiST Scan Acceleration using Coprocessors
GIST: an interactive, GPU-based level set segmentation tool for 3D medical images
GKLEE: Concolic Verification and Test Generation for GPUs
GL4D: A GPU-based Architecture for Interactive 4D Visualization
gLBM: A GPU enabled Lattice Boltzmann Method Library
Glider: A GPU Library Driver for Improved System Security
Glift: Generic, efficient, random-access GPU data structures
Global Depth from Epipolar Volumes - A General Framework for Reconstructing Non-Lambertian Surfaces
Global finite element matrix construction based on a CPU-GPU implementation
Global Illumination for Advanced Computer Graphics
Global Illumination for Interactive Lighting Design Using Light Path Pre-Computation and Hierarchical Histogram Estimation
Global memory access modelling for efficient implementation of the lattice Boltzmann method on graphics processing units
Global optimization model on power efficiency of GPU and multicore processing element for SIMD computing with CUDA
Global Point Mascon Models for Simple, Accurate and Parallel Geopotential Computation
Globally scheduled real-time multiprocessor systems with GPUs
GLoP: Enabling Massively Parallel Incident Response Through GPU Log Processing
GLOpenCL: OpenCL support on hardware- and software-managed cache multicores
Glow: Graph Lowering Compiler Techniques for Neural Networks
GLSL Essentials
GLSV: Graphics library stereo vision for OpenGL
GLU3.0: Fast GPU-based Parallel Sparse LU Factorization for Circuit Simulation
GMH: A Message Passing Toolkit for GPU Clusters
GMM based Fisher vector calculation on GPGPU
GMP implementation on CUDA - A Backward Compatible Design With Performance Tuning
GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs
gNek: A GPU Accelerated Incompressible Navier Stokes Solver
Go game move prediction using convolutional neural network
Going Deeper with Embedded FPGA Platform for Convolutional Neural Network
Going green: optimizing GPUs for energy efficiency through model-steered auto-tuning
Good things come in small packages: Should we adopt Lite-GPUs in AI infrastructure?
GooFit 2.0
GooFit: A library for massively parallelising maximum-likelihood fits
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
GOST-28147 Encryption Implementation on Graphics Processing Units
GOTHIC: Gravitational oct-tree code accelerated by hierarchical time step controlling
GP on SPMD parallel graphics hardware for mega Bioinformatics data mining
GP-GPU: Bridging the Gap between Modelling & Experimentation
GPA: A GPU Performance Advisor Based on Instruction Sampling
GPApriori: GPU-Accelerated Frequent Itemset Mining
GPF: a framework for general packet classification on GPU co-processors
GPflow: A Gaussian process library using TensorFlow
GPGPU Accelerated Cardiac Arrhythmia Simulations
GPGPU Accelerated Deep Object Classification on a Heterogeneous Mobile Platform
GPGPU accelerated optimization method of Interconnection Network Topology
GPGPU Accelerated Texture-Based Radiosity
GPGPU Accelerated Texture-Based Radiosity Calculation
GPGPU Acceleration Algorithm for Medical Image Reconstruction
GPGPU Acceleration for Skeletal Animation - Comparing OpenCL with CUDA and GLSL
GPGPU and MIC in Accelerated Cluster for Remote Sensed Image Processing Software
GPGPU and Multi-Core Architectures for Computing Clustering Coefficients of Irregular Graphs
GPGPU Based Aeroacoustic Optimization of a Contra-Rotating Fan
GPGPU Based Non-photorealistic Rendering of Volume Data
GPGPU based simulations for one and two dimensional quantum walks
GPGPU calculations of gas thermodynamic quantities
GPGPU computation and visualization of three-dimensional cellular automata
GPGPU flow
GPGPU for orbital function evaluation with a new updating scheme
GPGPU Implementation of a Generative Modelling Language
GPGPU implementation of a synaptically optimized, anatomically accurate spiking network simulator
GPGPU Implementation of Matrix Formalism for Beam Dynamics Simulation
GPGPU Multi Object Bayesian Tracking with an embedded System on a Chip
GPGPU opportunities for the LHCb trigger
GPGPU Performance and Power Estimation Using Machine Learning
GPGPU Performance Estimation with Core and Memory Frequency Scaling
GPGPU Processing in CUDA Architecture
GPGPU Programming for Games and Science
GPGPU supported cooperative acceleration in molecular dynamics
GPGPU Task Scheduling Technique for Reducing the Performance Deviation of Multiple GPGPU Tasks in RPC-Based GPU Virtualization Environments
GPGPU Test Suite Minimisation: Search Based Software Engineering Performance Improvement Using Graphics Cards
GPGPU Volume Classification using SimpleOpenCL
GPGPU workload analysis and media performance studies
GPGPU-Accelerated Instruction Accurate and Fast Simulation of Thousand-core Platforms
GPGPU-accelerated Interesting Interval Discovery and other Computations on GeoSpatial Datasets - A Summary of Results
GPGPU-Accelerated Parallel and Fast Simulation of Thousand-Core Platforms
GPGPU-Aided 3D Staggered-grid Finite-difference Seismic Wave Modeling
GPGPU-Aided Ensemble Empirical-Mode Decomposition for EEG Analysis During Anesthesia
GPGPU-assisted prediction of ion binding sites in proteins
GPGPU-Assisted Subpixel Tracking Method for Fiducial Markers
GPGPU-Based Cortical Modeling
GPGPU-based Gaussian Filtering for Surface Metrological Data Processing
GPGPU-based Latency Insertion Method: Application to PDN simulations
GPGPU-compatible archive based stochastic ranking evolutionary algorithm (G-ASREA) for multi-objective optimization
GPGPU-FDTD method for 2-dimensional electromagnetic field simulation and its estimation
GPGPU-Sim
GPGPU: general purpose computation on graphics hardware
GPGPUs in computational finance: Massive parallel computing for American style options
GPGPUs: How to Combine High Computational Power with High Reliability
GPIC - GPU Power Iteration Cluster
GPL: A GPU-based Pipelined Query Processing Engine
GPPE: a GPU-based Parallel Processing Environment for Large Scale Concurrent Data Streams
GPRM: a high performance programming framework for manycore processors
gProximity: Hierarchical GPU-based Operations for Collision and Distance Queries
GPS forward model computing study on CPU/GPU co-processing parallel system using CUDA
GPTPU: Accelerating Applications using Edge Tensor Processing Units
GPU & CPU implementation of Young - Van Vliet's Recursive Gaussian Smoothing Filter
GPU Accelerated Multi-Block Lattice Boltzmann Solver for Viscous Flow Problems
GPU accelerated 2-D staggered-grid finite difference seismic modelling
GPU Accelerated 3-D Modeling and Simulation of a Blended Kinetic Impact and Nuclear Subsurface Explosion
GPU Accelerated Adams-Bashforth Multirate Discontinuous Galerkin FEM Simulation of High-Frequency Electromagnetic Fields
GPU accelerated alignment of 3-D CTA with 2-D X-ray data for improved guidance in coronary interventions
GPU accelerated atmospheric chemical kinetics in the ECHAM/MESSy (EMAC) Earth system model
GPU Accelerated Automated Feature Extraction from Satellite Images
GPU accelerated biochemical network simulation
GPU Accelerated Blood Flow Computation using the Lattice Boltzmann Method
GPU Accelerated Cardiac Electrophysiology
GPU Accelerated Chemical Similarity Calculation for Compound Library Comparison
GPU Accelerated Computation and Real-time Rendering of Cellular Automata Model for Spatial Simulation
GPU Accelerated Computation and Visualization of Hexagonal Cellular Automata
GPU Accelerated Computation of Fast Spectral Transforms
GPU accelerated computation of Polarized Subsurface BRDF for Flat Particulate Layers
GPU Accelerated Computation of the ICON Model
GPU Accelerated Conjunction Assessment with Applications to Formation Flight and Space Debris Tracking
GPU Accelerated Direct Volume Rendering on an Interactive Light Field Display
GPU Accelerated Discontinuous Galerkin Methods for Shallow Water Equations
GPU Accelerated Discrete Element Method (DEM) Molecular Dynamics for Conservative, Faceted Particle Simulations
GPU Accelerated Dissipative Particle Dynamics with Parallel Cell-list Updating
GPU accelerated elliptic curve cryptography in GF(2^m)
GPU Accelerated Face Detection
GPU Accelerated Face Detection (thesis)
GPU accelerated fast FEM deformation simulation
GPU accelerated FDTD solver and its application in MRI
GPU accelerated feature algorithms for mobile devices
GPU Accelerated Finite Element Assembly with Runtime Compilation
GPU Accelerated Fluid Flow Computations Using the Lattice Boltzmann Method
GPU Accelerated Fractal Image Compression for Medical Imaging in Parallel Computing Platform
GPU Accelerated framework for financial nested simulations
GPU accelerated fuzzy connected image segmentation by using CUDA
GPU Accelerated Gesture Detection for Real Time Interaction
GPU Accelerated Graph SLAM and Occupancy Voxel Based ICP For Encoder-Free Mobile Robots
GPU Accelerated Greedy Algorithms for Compressed Sensing
GPU accelerated high intensity ultrasound acoustical power computation
GPU accelerated Hybrid Tree Algorithm for Collision-less N-body Simulations
GPU accelerated image aligned splatting
GPU accelerated image reconstruction in a two-strip J-PET tomograph
GPU Accelerated Image Registration in Two and Three Dimensions
GPU Accelerated Inverse Photon Mapping for Real-Time Surface Reflectance Modeling
GPU Accelerated Keccak (SHA3) Algorithm
GPU Accelerated Lambert Solution Methods for the Orbital Targeting Problem
GPU accelerated maximum cardinality matching algorithms for bipartite graphs
GPU accelerated molecular dynamics simulation of thermal conductivities
GPU Accelerated Molecular Dynamics Simulation, Visualization, and Analysis
GPU Accelerated Molecular Surface Computing
GPU accelerated Monte Carlo simulation of Brownian motors dynamics with CUDA
GPU accelerated Monte Carlo simulation of pulsed-field gradient NMR experiments
GPU accelerated Monte Carlo simulation of the 2D and 3D Ising model
GPU accelerated Monte Carlo simulations of lattice spin models
GPU Accelerated Multimodal Background Subtraction
GPU Accelerated Multiple Deoxyribose Nucleic Acid Sequence Parallel Matching
GPU Accelerated Nature Inspired Methods for Modelling Large Scale Bi-Directional Pedestrian Movement
GPU Accelerated NIDS Search
GPU Accelerated Nonlinear Optimization in Radio Interferometric Calibration
GPU accelerated Nonlinear Soft Tissue Deformation
GPU Accelerated Numerical Solutions to Chaotic PDEs
GPU Accelerated Parallel Iris Localization
GPU Accelerated Parallel Iris Segmentation
GPU Accelerated Parallel Occupancy Voxel Based ICP for Position Tracking
GPU Accelerated Parameter Estimation by Global Optimization using Interval Analysis
GPU Accelerated Particle System for Triangulated Surface Meshes
GPU Accelerated Particle Visualization with Splotch
GPU Accelerated Path-Planning for Multi-agents in Virtual Environments
GPU accelerated pathfinding
GPU Accelerated Pattern Matching Algorithm for DNA Sequences to Detect Cancer using CUDA
GPU Accelerated PK-means Algorithm for Gene Clustering
GPU accelerated population annealing algorithm
GPU accelerated preprocessing for potential-visible set
GPU Accelerated Process Planning For CNC-Machined Parts: Industrial Components to Bone Implants
GPU accelerated QTL detection
GPU accelerated radio astronomy signal convolution
GPU Accelerated Radio Wave Propagation Modeling Using Ray Tracing
GPU Accelerated Randomized Singular Value Decomposition and Its Application in Image Compression
GPU Accelerated Range Trees with Applications
GPU accelerated real time polarimetric image processing through the use of CUDA
GPU Accelerated Real-Time Collision Handling in Virtual Disassembly
GPU Accelerated Real-Time Object Detection on High Resolution Videos Using Modified Census Transform
GPU Accelerated Registration of a Statistical Shape Model of the Lumbar Spine to 3D Ultrasound Images
GPU accelerated rendering of vector based maps on iOS
GPU Accelerated RNA Folding Algorithm
GPU accelerated rotation-based emission tomography reconstruction
GPU Accelerated Scalable Parallel Random Number Generators
GPU Accelerated Semiclassical Initial Value Representation Molecular Dynamics
GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method
GPU accelerated simulations of bluff body flows using vortex particle methods
GPU Accelerated Smith-Waterman
GPU Accelerated Solver of Time-Dependent Air Pollutant Transport Equations
GPU accelerated spectral finite elements on all-hex meshes
GPU accelerated statistical image reconstruction for Compton cameras
GPU Accelerated Stochastic Simulation
GPU Accelerated Strong and Branching Bisimilarity Checking
GPU accelerated surgical simulators for complex morphology
GPU accelerated tensor contractions in the plaquette renormalization scheme
GPU accelerated toolbox for real-time beam-shaping in multimode fibres
GPU accelerated Trotter-Suzuki solver for quantum spin dynamics
GPU Accelerated Vessel Segmentation Using Laplacian Eigenmaps
GPU accelerated viscous-fluid deformable registration for radiotherapy
GPU Accelerated VLSI Design Verification
GPU Accelerated X-Ray Image Enhancement
GPU accelerating the FEniCS Project
GPU acceleration and performance of the particle-beam-dynamics code Elegant
GPU Acceleration for General Conservation Equations and its Application to several Engineering Problems
GPU acceleration for statistical gene classification
GPU Acceleration for the C++ Standard Template Library
GPU Acceleration of 2D-DWT Image Compression in MATLAB with CUDA
GPU Acceleration of a Basket Option Pricing Engine
GPU acceleration of a fully 3D Iterative Reconstruction Software for PET using CUDA
GPU Acceleration of a Genetic Algorithm for the Synthesis of FSM-based Bimodal Predictors
GPU Acceleration of a High-Order Discontinuous Galerkin Incompressible Flow Solver
GPU acceleration of a production molecular docking code
GPU Acceleration of Algebraic Multigrid for Low-Frequency Finite Element Methods
GPU Acceleration of an Unmodified Parallel Finite Element Navier-Stokes Solver
GPU Acceleration of BCP Procedure for SAT Algorithms
GPU acceleration of Compton reconstruction for the PEDRO
GPU acceleration of cutoff pair potentials for molecular modeling applications
GPU Acceleration of Equations Assembly in Finite Elements Method - Preliminary Results
GPU Acceleration of Genetic Algorithms for Subset Selection for Partial Fault Tolerance
GPU Acceleration of Graph Matching, Clustering, and Partitioning
GPU Acceleration of High-Speed Collision Molecular Dynamics Simulation
GPU Acceleration of Image Convolution using Spatially-varying Kernel
GPU Acceleration of Iterative Clustering
GPU Acceleration of k-Nearest Neighbor Search in Face Classifier based on Eigenfaces
GPU acceleration of linear systems for computational electromagnetic simulations
GPU Acceleration of Many Independent Mid-Sized Simulations on Graphs
GPU Acceleration of Matrix-based Methods in Computational Electromagnetics
GPU acceleration of matrix-based methods in computational electromagnetics (thesis)
GPU Acceleration of Melody Accurate Matching in Query-by-Humming
GPU acceleration of method of moments matrix assembly using Rao-Wilton-Glisson basis functions
GPU acceleration of MOLAR for HRRT List-Mode OSEM reconstructions
GPU Acceleration of Multilevel Solvers for Analysis of Microwave Components With Finite Element Method
GPU Acceleration of Near-Minimal Logic Minimization
GPU acceleration of Newton's method for large systems of polynomial equations in double double and quad double arithmetic
GPU acceleration of numerical weather prediction
GPU Acceleration of Particle Advection Workloads in a Parallel, Distributed Memory Setting
GPU Acceleration of Particle-based Volume Rendering using CUDA
GPU acceleration of preconditioned solvers for ill-conditioned linear systems
GPU Acceleration of PROPELLER MRI Using CUDA
GPU Acceleration of Pyrosequencing Noise Removal
GPU Acceleration of Real-time Feature Based Algorithms
GPU acceleration of Runge Kutta-Fehlberg and its comparison with Dormand-Prince method
GPU Acceleration of Runge-Kutta Integrators
GPU Acceleration of Solving Parabolic Partial Differential Equations Using Difference Equations
GPU Acceleration of SQL Analytics on Compressed Data
GPU acceleration of the dynamics routine in the HIRLAM weather forecast model
GPU Acceleration of the Generalized Interpolation Material Point Method
GPU acceleration of the iterative physical optics (IPO) method
GPU acceleration of the particle filter: the Metropolis resampler
GPU Acceleration of the Variational Monte Carlo Method for Many Body Physics
GPU Acceleration of Transmural Electrophysiological Imaging
GPU acceleration of Zernike moments for large-scale images
GPU Accelerators for Evolvable Cellular Automata
GPU algorithms for comparison-based sorting and for merging based on multiway selection
GPU Algorithms for Diamond-based Multiresolution Terrain Processing
GPU Algorithms for Efficient Exascale Discretizations
GPU algorithms for radiosity and subsurface scattering
GPU Algorithms for the Estimation of Environmental Models Based on Large Datasets
GPU and CPU cooperation parallel visualisation for large seismic data
GPU and CPU Cooperative Acceleration for Face Detection on Modern Processors
GPU and CPU Cooperative Accelerated Road Detection
GPU Architecture and the Programming Environment
GPU architecture evaluation for multispectral and hyperspectral image analysis
GPU architecture for stationary multisensor pedestrian detection at smart intersections
GPU architecture overview
GPU Array Access Auto-Tuning
GPU as a General Purpose Computing Resource
GPU as a Parallel Machine: Sorting on the GPU
GPU Asynchronous Stochastic Gradient Descent to Speed Up Neural Network Training
GPU Auto-tuning Framework for Optimal Performance and Power Consumption
GPU backed Data Mining on Android Devices
GPU based acceleration architecture for image enhancement in spatial domain
GPU based acceleration of first principles calculation
GPU Based Acceleration of Telegraph Equation
GPU Based Computation of the Structural Tensor for Real-Time Figure Detection
GPU based detection and mapping of collisions for haptic rendering in Immersive Virtual Reality
GPU Based Detection of Topological Changes in Voronoi Diagrams
GPU Based Dose Calculation
GPU based Eulerian Assembly of Genomes
GPU based extraction of moving objects without shadows under intensity changes
GPU Based Fast Free-Wake Calculations For Multiple Horizontal Axis Wind Turbine Rotors
GPU based FDTD method for investigation on the electromagnetic scattering from 1-D rough soil surface
GPU Based Fluid Animation over Elastic Surface Models
GPU Based Generation and Real-Time Rendering of Semi-Procedural Terrain Using Features
GPU based Implementation of Film Flicker Reduction Algorithms
GPU Based Implementation of Recursive Digital Filtering Algorithms
GPU Based Massive Parallel Kawasaki Kinetics In Monte Carlo Modelling of Lipid Microdomains
GPU Based Methods for Interactive Information Visualization of Big Data
GPU Based Optical Character Transcription for Ancient Inscription Recognition
GPU Based Parallel Computing on Blast Program
GPU based Partially Connected Neural Evolutionary network and its application on gender recognition with face images
GPU based particle system
GPU Based Path Integral Control with Learned Dynamics
GPU Based Performance Acceleration of Radar Imaging Algorithms
GPU Based Real-time Correction for Optical Distortions in Head-Mounted Displays
GPU Based Real-Time Instrument Tracking with Three Dimensional Ultrasound
GPU Based Real-Time Welding Simulation with Smoothed-Particle Hydrodynamics
GPU based sparse grid technique for solving multidimensional options pricing PDEs
GPU Based Spot Noise Parallel Algorithm for 2D Vector Field Visualization
GPU Based Tissue Doppler Imaging
GPU based video stylization
GPU Cluster for High Performance Computing
GPU Cluster with MATLAB
GPU clusters for high-performance computing
GPU Collision Detection in Conformal Geometric Space
GPU Color Constancy
GPU Computation in Bioinspired Algorithms: A Review
GPU Computations in Heterogeneous Grid Environments
GPU Computing
GPU Computing and CUDA technology used to accelerate a mesh generator application
GPU computing architecture for irregular parallelism
GPU computing for 2-d spin systems: CUDA vs OpenGL
GPU Computing for Atmospheric Modeling
GPU Computing for Machine Learning Algorithms
GPU Computing for Meshfree Particle Method
GPU Computing for Parallel Local Search Metaheuristics
GPU Computing for Particle Tracking
GPU computing for shallow water flow simulation based on finite volume schemes
GPU computing for systems biology
GPU Computing Gems: Emerald Edition
GPU Computing Gems: Jade Edition
GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model
GPU Computing in Discrete Optimization - Part I: Introduction to the GPU
GPU Computing in Discrete Optimization - Part II: Survey Focused on Routing Problems
GPU Computing in Discrete Optimization. Part I: Introduction to the GPU
GPU Computing in Discrete Optimization. Part II: Survey Focused on Routing Problems
GPU Computing in Economics
GPU Computing in EGI Environment Using a Cloud Approach
GPU computing in medical physics: A review
GPU Computing to Improve Game Engine Performance
GPU Computing with Applications in Digital Logic
GPU computing with Kaczmarz's and other iterative algorithms for linear systems
GPU computing with OpenCL to model 2D elastic wave propagation: exploring memory usage
GPU Computing with Orientation Maps for Extracting Local Invariant Features
GPU Computing with Python: Performance, Energy Efficiency and Usability
GPU Computing: Image Convolution
GPU Computing: Programming a Massively Parallel Processor
GPU Concurrency Choices in Graph Analytics
GPU concurrency: Weak behaviours and programming assumptions
GPU Coprocessing for Wireless Network Simulation
GPU coprocessors as a service for deep learning inference in high energy physics
GPU CUDA Performance on Two-Dimensional and Three-Dimensional VAWT Vortex Models
GPU Declarative Framework
GPU detectors for interference cancellation in chaos-based CDMA communications
GPU Encrypt: AES Encryption on Mobile Devices
GPU Enhanced Simulation of Angiogenesis
GPU Enhanced Stream-Based Matrix Multiplication
GPU Enhancement of the Trigger to Extend Physics Reach at the Large Hadron Collider
GPU Enhancement of the Trigger to Extend Physics Reach at the LHC
GPU Environmental Delegation of Agent Perceptions for MABS
GPU First - Execution of Legacy CPU Codes on GPUs
GPU Floating-Point Paranoia
GPU Fluid Simulation using Smoothed Particle Hydrodynamics
GPU fluids in production: a compiler approach to parallelism
GPU for CAD
GPU for Parallel On-Board Hyperspectral Image Processing
GPU friendly fast Poisson solver for structured power grid network analysis
GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation
GPU Gems 3
GPU Gems: Programming Techniques, Tips and Tricks for Real-Time Graphics
GPU hardware acceleration for industrial applications: using computation to push beyond physical limitations
GPU histogram computation
GPU implementation of 3D object selection by conic volume techniques in virtual environments
GPU Implementation of a Deep Learning Network for Financial Prediction
GPU implementation of a deep learning network for image recognition tasks
GPU implementation of a Helmholtz Krylov Solver preconditioned by a shifted Laplace multigrid method
GPU implementation of a hybrid lattice Boltzmann method for non-isothermal flows
GPU implementation of a Landau gauge fixing algorithm
GPU Implementation of a Multiobjective Search Algorithm
GPU implementation of a road sign detector based on particle swarm optimization
GPU implementation of a shell element structural solver aimed at fluid-structure interaction problems
GPU Implementation of an Automatic Target Detection and Classification Algorithm for Hyperspectral Image Analysis
GPU Implementation of Bayesian Neural Network Construction for Data-Intensive Applications
GPU implementation of belief propagation using CUDA for cloud tracking and reconstruction
GPU implementation of epidemiological behaviour in large social networks
GPU Implementation of Extended Gaussian Mixture Model for Background Subtraction
GPU Implementation of Fuzzy Anisotropic Diffusion
GPU Implementation of Gaussian Processes
GPU Implementation of Iterative Solvers in Numerical Weather Predicting Models
GPU implementation of JPEG XR
GPU implementation of JPEG2000 for hyperspectral image compression
GPU implementation of map-MRF for microscopy imagery segmentation
GPU implementation of motion estimation for visual saliency
GPU implementation of neural networks
GPU Implementation of Parallel Support Vector Machine Algorithm with Applications to Intruder Detection
GPU Implementation of Real-Time Biologically Inspired Face Detection using CUDA
GPU Implementation of Spiking Neural Networks for Color Image Segmentation
GPU Implementation of Split-Field Finite-Difference Time-Domain Method for Drude-Lorentz Dispersive Media
GPU Implementation of the Branch and Bound method for knapsack problems
GPU Implementation of the DP code
GPU Implementation of the Keccak Hash Function Family
GPU Implementation of the LFT Shape Matching Algorithm
GPU Implementation of the Particle Filter
GPU implementation of the pixel purity index algorithm for hyperspectral image analysis
GPU implementation of the Rosenbluth generation method for static Monte Carlo simulations
GPU Implementation of the STA Algorithm on I/Q Data
GPU implementation of volume reconstruction and object detection in Digital Holographic Microscopy
GPU Implementations for Midsize Integer Addition and Multiplication
GPU Implementations of Object Detection using HOG Features and Deformable Models
GPU implementations of scheduling heuristics for heterogeneous computing environments
GPU implementation of fast Gabor filters
GPU in Physics Computation: Case Geant4 Navigation
GPU Isosurface Raycasting of FCC Datasets
GPU Join Processing Revisited
GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs
GPU Kernels for High-Speed 4-Bit Astrophysical Data Processing
GPU Load Balancing
GPU LSM: A Dynamic Dictionary Data Structure for the GPU
GPU Matrix Multiplication
GPU merge path: a GPU merging algorithm
GPU methods for real-time haptic interaction with 3D fluids
GPU Monte Carlo scatter calculations for Cone Beam Computed Tomography
GPU Multiple Sequence Alignment Fourier-Space Cross-Correlation Alignment
GPU Multisplit
GPU Nonlinear Fixed Points, with an application to GPU IFS Rendering
GPU Objects
GPU Octrees and Optimized Search
GPU Offloading in ExaHyPE Through C++ Standard Algorithms
GPU Optimized Code for Long Term Simulations of Beam-beam Effects in Colliders
GPU packet classification using OpenCL: a consideration of viable classification methods
GPU Parallel Algorithms for Reporting Movement Behaviour Patterns in Spatiotemporal Databases
GPU Parallel Collections For Scala
GPU parallel computing: Programming language, debugging tools and data structures
GPU Parallel Implementation of the Approximate K-SVD Algorithm Using OpenCL
GPU Parallel Statistical and Cube Test Analysis of the SHA-3 Finalist Candidate Hash Functions
GPU Parallelization for Unstructured Sparse Matrix Problems with OpenMP 4.5 and OpenACC
GPU parallelization of a hybrid pseudospectral fluid turbulence framework using CUDA
GPU Parallelization of Algebraic Dynamic Programming
GPU Parallelization of an Unstructured Overset Grid Incompressible Navier-Stokes Solver for Moving Bodies
GPU Parallelization of Astronomical Image Subtraction
GPU Passthrough Performance: A Comparison of KVM, Xen, VMWare ESXi, and LXC for CUDA and OpenCL Applications
GPU Path Tracing
GPU Pathfinding Optimization
GPU peer-to-peer techniques applied to a cluster interconnect
GPU performance analysis of a nodal discontinuous Galerkin method for acoustic and elastic models
GPU performance comparison for accelerated radar data processing
GPU Performance Modeling and Optimization
GPU Performance Portability needs Autotuning
GPU performance prediction using parametrized models
GPU phase-field lattice Boltzmann simulations of growth and motion of a binary alloy dendrite
GPU physics
GPU powered artificial immune system for visual applications
GPU powered CNN simulator (SIMCNN) with graphical flow based programmability
GPU Predictor-Corrector Interior Point Method for Large-Scale Linear Programming
GPU Prefilter for Accurate Cubic B-spline Interpolation
GPU Pro 2
GPU Pro 5: Advanced Rendering Techniques
GPU Pro 6: Advanced Rendering Techniques
GPU Pro 7: Advanced Rendering
GPU Processing for UAS-Based LFM-CW Stripmap SAR
GPU processing of particle system animation
GPU Programming - Speeding Up the 3D Surface Generator VESTA
GPU Programming for Physics Applications
GPU Programming in a High Level Language: Compiling X10 to CUDA
GPU Programming in Functional Languages: A Comparison of Haskell GPU Embedded Domain Specific Languages
GPU Programming in Rust: Implementing High Level Abstractions in a Systems Level Language
GPU Programming Strategies and Trends in GPU Computing
GPU Programming with CUDA: A brief overview
GPU Random Numbers via the Tiny Encryption Algorithm
GPU ray casting of virtual globes
GPU Ray Casting with Arbitrary Shaped Proxy
GPU Ray Marching for Real-Time Rendering of Participating Media
GPU Ray Tracing - Comparative Study of Ray-Triangle Intersection Algorithms
GPU Ray Tracing with CUDA
GPU Ray Tracing with Monte Carlo Methods
GPU Ray-Traced Collision Detection for Cloth Simulation
GPU Ray-Traced Collision Detection: Fine Pipeline Reorganization
GPU Remote Memory Access Programming
GPU rendering for tiled multi-projector autostereoscopic display based on chromium
GPU Rendering of Relief Mapped Conical Frusta
GPU Rendering of Secondary Effects
GPU Rendering of the Thin Film on Paints with Full Spectrum
GPU Rigid Skinning based on a Refined Skeletonization Method
GPU Robot Motion Planning using Semi-Infinite Nonlinear Programming
GPU sample sort
GPU schedulers: how fair is fair enough?
GPU Scripting and Code Generation with PyCUDA
GPU Shape Grammars
GPU Simulation and Rendering of Volumetric Effects for Computer Games and Virtual Environments
GPU Simulation of Radiation in Matter
GPU simulations for risk assessment in CO2 geologic sequestration
GPU smoothing of quad meshes
GPU Sparse Matrix Multiplication with CUDA
GPU SQL Query Accelerator
GPU support for batch oriented workloads
GPU Supported Patch-Based Tessellation for Dual Subdivision
GPU Surface Flow Simulation and Multiresolution Animation in Digital Terrain Models
GPU System Call
GPU Techniques Applied to Euler Flow Simulations and Comparison to CPU Performance
GPU techniques for creating visually diverse crowds in real-time
GPU Tensor Cores for fast Arithmetic Reductions
GPU TV-L1 Optical Flow
GPU Versus FPGA for High Productivity Computing
GPU Virtualization
GPU Virtualization and Scheduling Methods: A Comprehensive Survey
GPU Virtualization for High Performance General Purpose Computing on the ESX Hypervisor
GPU virtualization on VMware's hosted I/O architecture
GPU Vision: Accelerating Computer Vision algorithms with Graphics Processing Units
GPU volume rendering in 3D echocardiography: Real-time pre-processing and ray-casting
GPU Volume Voxelization: Exploration of the performance characteristics of different GPU-based implementations
GPU vs FPGA: A Comparative Analysis for Non-standard Precision
gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments
GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures
GPU-accelerated 3D Bayesian image reconstruction from Compton scattered data
GPU-accelerated Adaptively Sampled Distance Fields
GPU-accelerated adjoint algorithmic differentiation
GPU-accelerated affordance cueing based on visual attention
GPU-accelerated algorithms for many-particle continuous-time quantum walks
GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement
GPU-Accelerated Atari Emulation for Reinforcement Learning
GPU-accelerated atom and dynamic bond visualization using hyperballs: A unified algorithm for balls, sticks, and hyperboloids
GPU-accelerated automatic identification of robust beam setups for proton and carbon-ion radiotherapy
GPU-Accelerated Background Generation Algorithm with Low Latency
GPU-Accelerated Bayesian Learning and Forecasting in Simultaneous Graphical Dynamic Linear Models
GPU-accelerated Bernstein-Bezier discontinuous Galerkin methods for wave problems
GPU-Accelerated BWT Construction for Large Collection of Short Reads
GPU-accelerated Chemical Similarity Assessment for Large Scale Databases
GPU-accelerated computation for robust motion tracking using the CUDA framework
GPU-accelerated Computation for Statistical Analysis of the Next-Generation Sequencing Data
GPU-accelerated Convex Multi-phase Image Segmentation
GPU-Accelerated Crack Path Computation Based on a Phase Field Approach for Brittle Fracture
GPU-accelerated Database Systems: Survey and Open Challenges
GPU-accelerated deep shadow maps for direct volume rendering
GPU-accelerated differential evolutionary Markov Chain Monte Carlo method for multi-objective optimization over continuous space
GPU-Accelerated Direct Volume Rendering of Finite Element Data Sets
GPU-accelerated discontinuous Galerkin methods on hybrid meshes
GPU-Accelerated DNA Distance Matrix Computation
GPU-Accelerated Drug Discovery with Docking on the Summit Supercomputer: Porting, Optimization, and Application to COVID-19 Research
GPU-Accelerated Dynamic Functional Connectivity Analysis for Functional MRI Data Using OpenCL
GPU-accelerated dynamic programming for join-order optimization
GPU-accelerated elastic 3D image registration for intra-surgical applications
GPU-Accelerated Evaluation Platform for High Fidelity Network Modeling
GPU-Accelerated Face Detection Algorithm
GPU-accelerated Faster Mean Shift with euclidean distance metrics
GPU-accelerated fault simulation and its new applications
GPU-Accelerated Feature Tracking
GPU-Accelerated First-Order Scattering Simulation for X-Ray CT Image Reconstruction
GPU-accelerated Fourier-continuation solvers and physically exact computational boundary conditions for wave scattering problems
GPU-accelerated generation of correctly-rounded elementary functions
GPU-accelerated Gibbs Sampling
GPU-accelerated hierarchical dense correspondence for real-time aerial video processing
GPU-Accelerated High-Accuracy Molecular Docking using Guided Differential Evolution
GPU-Accelerated High-Level Synthesis for Bitwidth Optimization of FPGA Datapaths
GPU-accelerated HMM for Speech Recognition
GPU-accelerated indirect boundary element method for voxel model analyses with fast multipole method
GPU-Accelerated Interactive Visualization and Planning of Neurosurgical Interventions
GPU-Accelerated Joint 1D and 2D Barcode Localization on Smartphones
GPU-Accelerated KLT Tracking with Monte-Carlo-Based Feature Reselection
GPU-Accelerated Large-Eddy Simulation of Turbulent Channel Flows
GPU-accelerated large-scale quantum molecular dynamics simulation of 3-dimensional C60 polymers
GPU-Accelerated Large-Scale Simulation of Seismic-Wave Propagation
GPU-Accelerated Light Stemmer for the Arabic Language
GPU-accelerated method for real-time shadow generation
GPU-Accelerated Method of Moments by Example: Monostatic Scattering
GPU-accelerated micromagnetic simulations using cloud computing
GPU-accelerated Model Checking of Periodic Self-Suspending Real-Time Tasks
GPU-accelerated molecular dynamics simulation for study of liquid crystalline flows
GPU-accelerated molecular modeling coming of age
GPU-accelerated MoM-based broadband simulations using Stoer-Bulirsch algorithm
GPU-Accelerated Monte Carlo Simulations of Dense Stellar Systems
GPU-accelerated MRF segmentation algorithm for SAR images
GPU-Accelerated Nearest Neighbor Search for 3D Registration
GPU-Accelerated Non-negative Matrix Factorization for Text Mining
GPU-Accelerated Numerical Simulations of the Knudsen Gas on Time-Dependent Domains
GPU-Accelerated parallel FDTD on Distributed Heterogeneous Platform
GPU-Accelerated Parallel Finite-Difference Time-Domain Method for Electromagnetic Waves Propagation in Unmagnetized Plasma Media
GPU-accelerated phase-field simulation of dendritic solidification in a binary alloy
GPU-Accelerated Point-Based Color Bleeding
GPU-accelerated power pattern synthesis of aperiodic linear arrays
GPU-Accelerated Preconditioned Iterative Linear Solvers
GPU-accelerated protein family identification for metagenomics
GPU-accelerated ray tracing for electromagnetic propagation analysis
GPU-accelerated ray-tracing for real-time treatment planning
GPU-accelerated real-time 3D tracking for humanoid locomotion and stair climbing
GPU-accelerated real-time stixel computation
GPU-Accelerated Real-Time Surveillance De-Weathering
GPU-Accelerated Real-Time Visualization and Interaction for Coupled Fluid Dynamics
GPU-Accelerated Recurrent Neural Networks: OpenCLLink and SymbolicC
GPU-accelerated Red Blood Cells Simulations with Transport Dissipative Particle Dynamics
GPU-Accelerated Robotic Intra-operative Laparoscopic 3D Reconstruction
GPU-Accelerated Scalable Solver for Banded Linear Systems
GPU-Accelerated Shape Simplification for Mechanical-Based Applications
GPU-accelerated simulation of colloidal suspensions with direct hydrodynamic interactions
GPU-Accelerated SPH Model for Water Waves and Other Free Surface Flows
GPU-Accelerated Standard and Multi-Population Cultural Algorithms
GPU-accelerated stochastic predictive control of drinking water networks
GPU-accelerated surface denoising and morphing with lattice Boltzmann scheme
GPU-Accelerated SVM Training Algorithm Based on PC and Mobile Device
GPU-accelerated synthesis of echo generators
GPU-accelerated synthetic aperture radar backprojection in CUDA
GPU-Accelerated Text Mining
GPU-accelerated time-domain circuit simulation
GPU-accelerated triangle-triangle intersection tester algorithm
GPU-accelerated WZ Factorization with the Use of the CUBLAS Library
GPU-acceleration for Large-scale Tree Boosting
GPU-acceleration for Moving Particle Semi-implicit Method
GPU-Acceleration of In-Memory Data Analytics
GPU-Acceleration of Linear Algebra using OpenCL
GPU-acceleration of parallel unconditionally stable group explicit finite difference method
GPU-Acceleration of Tensor Renormalization with PyTorch using CUDA
GPU-acceleration of the Discontinuous Galerkin Shallow Water Equations Solver (DG-SWEM) using CUDA and OpenACC
GPU-accelerated regularisation of large diffusion-tensor volumes
GPU-acceleration of image rendering and sorting algorithms with the OpenCL framework
GPU-ArraySort: A parallel, in-place algorithm for sorting large number of arrays
GPU-Assisted Computation of Centroidal Voronoi Tessellation
GPU-Assisted Cryptography of Log-Structured Indices
GPU-assisted decoding of video samples represented in the YCoCg-R color space
GPU-Assisted High Quality Particle Rendering
GPU-Assisted Malware
GPU-assisted positive mean value coordinates for mesh deformations
GPU-Assisted Ray Casting of Large Scenes
GPU-Assisted Z-Field Simplification
GPU-aware Communication with UCX in Parallel Programming Models: Charm++, MPI, and Python
GPU-Aware Non-contiguous Data Movement In Open MPI
GPU-Based 3D Texture Advection for the Visualization of Unsteady Flow Fields
GPU-based 3D Wavelet Transform
GPU-based accelerated FDTD simulations for double negative (DNG) materials applications
GPU-based Acceleration of Deep Convolutional Neural Networks on Mobile Platforms
GPU-based acceleration of free energy calculations in solid state physics
GPU-based acceleration of MPIE/MoM matrix calculation for the analysis of microstrip circuits
GPU-based Acceleration of System-level Design Tasks
GPU-Based Acceleration of the MLEM Algorithm for SPECT Parallel Imaging with Attenuation Correction and Compensation for Detector Response
GPU-Based Acceleration on ACEnet for FDTD Method of Electromagnetic Field Analysis
GPU-based acoustic feature extraction for electronic media processing
GPU-Based Airway Tree Segmentation and Centerline Extraction
GPU-Based approaches for multiobjective local search algorithms. A case study: the flowshop scheduling problem
GPU-based Assembly of Stiffness Matrices in the Parallel Multilevel Partition of Unity Method
GPU-Based Asynchronous Global Optimization with Particle Swarm
GPU-based asynchronous particle swarm optimization
GPU-Based Background Illumination Correction for Blue Screen Matting
GPU-based Batched Spatial Query Processing on R-Trees
GPU-Based Cell Projection for Interactive Volume Rendering
GPU-based cellular automata simulations of laser dynamics
GPU-based Cloud Computing for Comparing the Structure of Protein Binding Sites
GPU-based cloud performance for LiDAR data processing
GPU-Based Cloud Service for Smith-Waterman algorithm Using Frequency Distance Filtration Scheme
GPU-based collision detection between deformable objects
GPU-based Collision Detection for Deformable Parameterized Surfaces
GPU-based color Doppler ultrasound processing
GPU-Based Computation of 2D Least Median of Squares with Applications to Fast and Robust Line Detection
GPU-Based Computation of Discrete Periodic Centroidal Voronoi Tessellation in Hyperbolic Space
GPU-Based Computation of Voxelized Minkowski Sums with Applications
GPU-based cone beam computed tomography
GPU-Based Conjugate Gradient Solver for Lattice QCD with Domain-Wall Fermions
GPU-based digital hologram reconstruction and particle detection
GPU-Based Distance Map Calculation for Vector Field Haptic Rendering
GPU-based DVB-S2 LDPC decoder with high throughput and fast error floor detection
GPU-based Dynamic Tubular Grids for Sparse Volume Rendering
GPU-based Efficient Join Algorithms on Hadoop
GPU-based efficient realistic techniques for bleeding and smoke generation in surgical simulators
GPU-based elastic-object deformation for enhancement of existing haptic applications
GPU-based Fast Cone Beam CT Reconstruction from Undersampled and Noisy Projection Data via Total Variation
GPU-based fast gamma index calculation
GPU-based Fast Low-dose Cone Beam CT Reconstruction via Total Variation
GPU-Based Fast Minimum Spanning Tree Using Data Parallel Primitives
GPU-based fast Monte Carlo simulation for radiotherapy dose calculation
GPU-based fast pencil beam algorithm for proton therapy
GPU-based Fast Ray Casting for a Large Number of Metaballs
GPU-Based Feature-Preserving Distance Field Computation
GPU-Based FFT Computation for Multi-Gigabit WirelessHD Baseband Processing
GPU-Based flow simulation with complex boundaries
GPU-Based Foreground-Background Segmentation Using an Extended Colinearity Criterion
GPU-based framework for distributed interactive 3D visualization of multimodal remote sensing data
GPU-based frequency domain volume rendering
GPU-Based Fuzzy C-Means Clustering Algorithm for Image Segmentation
GPU-Based Global Illumination Using Lightcuts
GPU-Based Heuristic Solver for Linear Sum Assignment Problems Under Real-time Constraints
GPU-Based Hierarchical Computations for View Independent Visibility
GPU-based high-performance computing for radiation therapy
GPU-based high-speed and high-precision visual tracking
GPU-based image manipulation and enhancement techniques for dynamic volumetric medical image visualization
GPU-Based Image Processing Use Cases: A High-Level Approach
GPU-Based Image Segmentation Using Level Set Method With Scaling Approach
GPU-based Implementation of 128-bit Secure Eta Pairing Over a Binary Field
GPU-based implementation of a cerebellar spiking network model for realtime robot control
GPU-Based Implementation of JPEG2000 Encoder
GPU-based Implementation of the Variational Path Integral Method
GPU-Based Implementations of the Noniterative Regularized-CCSD(T) Corrections: Applications to Strongly Correlated Systems
GPU-based infrared thermography for NDE of minefields
GPU-based interactive visualization framework for ultrasound datasets
GPU-Based Interactive Visualization of Billion Point Cosmological Simulations
GPU-Based Interactive Visualization Techniques (Mathematics and Visualization)
GPU-Based Interactive, Stereoscopic Visualization of Automotive Crash Simulations
GPU-based intrinsic collision detection for deformable surfaces
GPU-Based Inverse Rendering With Multi-Objective Particle Swarm Optimization
GPU-based Island Model for Evolutionary Algorithms
GPU-based Iterative Cone Beam CT Reconstruction Using Tight Frame Regularization
GPU-Based Iterative Relative Fuzzy Connectedness Image Segmentation
GPU-based JSON data processing using structural indexes
GPU-based Line Probing Techniques for Mikami Routing Algorithm
GPU-Based Liquid Crystal Display Processing Platform
GPU-Based Local-Dimming for Power Efficient Imaging
GPU-based Low Dose CT Reconstruction via Edge-preserving Total Variation Regularization
GPU-based Low-dose 4DCT Reconstruction via Temporal Non-local Means
GPU-based LU decomposition for large method of moments problems
GPU-based matrix-free finite element solver exploiting symmetry of elemental matrices
GPU-based Monte Carlo radiotherapy dose calculation using phase-space sources
GPU-based Monte Carlo simulation for light propagation in complex heterogeneous tissues
GPU-based Monte Carlo simulation in neutron transport and finite differences heat equation evaluation
GPU-Based Monte-Carlo Volume Raycasting
GPU-based motion correction of contrast-enhanced liver MRI scans: An OpenCL implementation
GPU-based Motion Planning under Uncertainties using POMDP
GPU-based Multi-start Local Search Algorithms
GPU-based Multi-stream Analyzer on Application Layer for Service-oriented Router
GPU-based multi-view rendering for spatial-multiplex autostereoscopic displays
GPU-based Multi-Volume Rendering of Complex Data in Neuroscience and Neurosurgery
GPU-based Multilevel Clustering
GPU-based non-parametric background subtraction for a practical surveillance system
GPU-Based Nonlinear Ray Tracing
GPU-Based Nonlocal Filtering for Large Scale SAR Processing
GPU-based NSEC3 Hash Breaking
GPU-based Numerical Integration in the Partition of Unity Method
GPU-based object-order ray-casting for large datasets
GPU-based Offset Surface Computation using Point Samples
GPU-based parallel collision detection for fast motion planning
GPU-based parallel collision detection for real-time motion planning
GPU-based Parallel Computation Support for Stan
GPU-based Parallel Computing for Nonlinear Finite Element Deformation Analysis
GPU-based parallel computing for the simulation of complex multibody systems with unilateral and bilateral constraints: an overview
GPU-Based Parallel Computing: A New Computational Approach and its Applications to Nuclear Engineering
GPU-Based Parallel Multi-objective Particle Swarm Optimization
GPU-based parallel particle swarm optimization
GPU-based Parallel Reservoir Simulators
GPU-Based Parallel Signature Scanning and Hash Generation
GPU-based parallel solver via the Kantorovich theorem for the nonlinear Bernstein polynomial systems
GPU-based parallel-beam and cone-beam forward- and backprojection using CUDA
GPU-based parallelization for fast circuit optimization
GPU-based particle simulation with inter-collisions
GPU-based password cracking
GPU-based Pedestrian Detection for Autonomous Driving
GPU-based physical cut in interactive haptic simulations
GPU-based point radiation for interactive volume sculpting and segmentation
GPU-based Private Information Retrieval for On-Device Machine Learning Inference
GPU-based ray casting of stacked out-of-core height fields
GPU-Based Ray Tracing of Splats
GPU-based ray-casting of non-rigid deformations: a comparison between direct and indirect approaches
GPU-Based Ray-Casting of Spherical Functions Applied to High Angular Resolution Diffusion Imaging
GPU-based real-time acoustical occlusion modeling
GPU-based Real-Time Execution of Vehicular Mobility Models in Large-Scale Road Network Scenarios
GPU-Based Real-Time Imaging Software Suite for Medical Ultrasound
GPU-based real-time simulation and rendering of unbounded ocean surface
GPU-based real-time small displacement estimation with ultrasound
GPU-based Real-Time Soft Tissue Deformation with Cutting and Haptic Feedback
GPU-based reconstruction and display for 4D ultrasound data
GPU-based rendering for deformable translucent objects
GPU-based rendering of point-sampled water surfaces
GPU-Based Research of Highly Efficient Ray Tracing
GPU-Based Road Sign Detection Using Particle Swarm Optimization
GPU-Based Shooting and Bouncing Ray Method for Fast RCS Prediction
GPU-based Signal Processing Scheme for Bioinspired Optical Flow
GPU-based simulation of 3D blood flow in abdominal aorta using OpenFOAM
GPU-based simulation of brain neuron models
GPU-based simulation of cellular neural networks for image processing
GPU-Based Simulation of Large-Scope Ocean Wave
GPU-based simulation of side-looking sonar images
GPU-based simulation of the long-range Potts model via parallel tempering
GPU-based single-cluster algorithm for the simulation of the Ising model
GPU-based smart visibility techniques for tumor surgery planning
GPU-based solution of Continuous Time Markov Chains using CUSP
GPU-based Space Situational Awareness Simulation utilising parallelism for enhanced multi-sensor management
GPU-Based Space-Time Adaptive Processing (STAP) for Radar
GPU-Based Sparse Voxel Octree Raytracing for Rendering of Procedurally Generated Terrain
GPU-based spatial interaction force simulation
GPU-Based Spherical Light Field Rendering with Per-Fragment Depth Correction
GPU-based Steady-State Solution of the Chemical Master Equation
GPU-based Streaming Algorithm for High-Resolution Cloth Simulation
GPU-based streaming architectures for fast cone-beam CT image reconstruction and demons deformable registration
GPU-based Streaming for Parallel Level of Detail on Massive Model Rendering
GPU-Based Super-union for Minkowski Sum
GPU-based surface oriented interslice directional interpolation for volume visualization
GPU-based Swendsen-Wang multi-cluster algorithm for the simulation of two-dimensional classical spin systems
GPU-Based Techniques For Global Illumination Effects
GPU-based time parallel cache simulator
GPU-based timetable generation
GPU-based tolerance volumes for mesh processing
GPU-Based Tracking Algorithms for the ATLAS High-Level Trigger
GPU-Based Translation-Invariant 2D Discrete Wavelet Transform for Image Processing
GPU-based triangulation of the van der Waals surface
GPU-based tuning of quantum-inspired genetic algorithm for a combinatorial optimization problem
GPU-based ultra fast dose calculation using a finite pencil beam model
GPU-based ultra fast IMRT plan optimization
GPU-based ultra-fast direct aperture optimization for online adaptive radiation therapy
GPU-based ultrafast IMRT plan optimization
GPU-based Video Feature Tracking and Matching
GPU-based visualization of domain-coloured algebraic Riemann surfaces
GPU-based Volume Rendering for Medical Image Visualization
GPU-Based Volume Rendering for Medical Imagery
GPU-Based Volume Rendering of Noisy Multi-Spectral Astronomical Data
GPU-based X server on top of EGL and openVG
GPU-BLAST: Using graphics processors to accelerate protein sequence alignment
GPU-boosted online image matching
GPU-BSM: A GPU-Based Tool to Map Bisulfite-Treated Reads
GPU-CC: a Reconfigurable GPU Architecture with Communicating Cores
GPU-centric Communication Schemes for HPC and ML Applications
GPU-ClustalW: Using Graphics Hardware to Accelerate Multiple Sequence Alignment
GPU-computing in econophysics and statistical physics
GPU-CPU multi-core for real-time signal processing
GPU-Disasm: A GPU-based x86 Disassembler
GPU-driven Parallel Processing for Realtime Creation of Tree Animation
GPU-Enabled AI
GPU-enabled Efficient Executions of Radiation Calculations in Climate Modelling
GPU-enabled FREALIGN: Accelerating single particle 3D reconstruction and refinement in Fourier space on graphic processors
GPU-enabled high performance feature modeling for ATR applications
GPU-Enabled Particle-Particle Particle-Tree Scheme for Simulating Dense Stellar Cluster System
GPU-Euler: Sequence Assembly Using GPGPU
GPU-EvR: Run-time Event Based Real-time Scheduling Framework on GPGPU Platform
GPU-Framework for Teamwork Action Recognition
GPU-friendly gallbladder modeling for laparoscopic cholecystectomy simulation
GPU-Friendly Local Regression for Voice Conversion
GPU-Friendly Multi-View Stereo Reconstruction Using Surfel Representation and Graph Cuts
GPU-friendly shape interpolation based on trajectory warping
GPU-friendly warped display for scope-maintained video surveillance
GPU-FS-kNN: A Software Tool for Fast and Scalable kNN Computation Using GPUs
GPU-FV: Realtime Fisher Vector and Its Applications in Video Monitoring
GPU-Initiated Networking for NCCL
GPU-Mapping: Robotic Map Building with Graphical Multiprocessors
GPU-MEME: Using Graphics Hardware to Accelerate Motif Finding in DNA Sequences
GPU-Optimized Coarse-Grained MD Simulations of Protein and RNA Folding and Assembly
GPU-Optimized Hybrid Neighbor/Cell List Algorithm for Coarse-Grained Molecular Dynamics Simulations
GPU-Optimized Molecular Dynamics Simulations
GPU-Parallel Implementation of Color based Medical Image Retrieval in Compressed Domain
GPU-Parallel simulation of rigid fibers in Stokes flow
GPU-PIV
GPU-Powered Coherent Beamforming
GPU-powered Simulation Methodologies for Biological Systems
GPU-powered tools boost molecular visualization
GPU-PRISM: An Extension of PRISM for General Purpose Graphics Processing Units
GPU-Qin: A Methodology for Evaluating the Error Resilience of GPGPU Applications
GPU-Quicksort: A practical Quicksort algorithm for graphics processors
GPU-RMAP: Accelerating Short-Read Mapping on Graphics Processors
GPU-S2S: A Compiler for Source-to-Source Translation on GPU
GPU-SD and DPD Parallelization for Gromacs tools for molecular dynamics simulations
GPU-SPARC: Accelerating Parallelism in Multi-GPU Real-Time Systems
GPU-Specific Kalman Filtering and Retrodiction for Large-Scale Target Tracking
GPU-TLS: An Efficient Runtime for Speculative Loop Parallelization on GPUs
GPU-to-CPU callbacks
GPU-to-GPU and Host-to-Host multipattern string matching on a GPU
GPU: Power vs Performance
GPU's for event reconstruction in the FairRoot framework
GPU/CPU Parallel Computation of Material Damage
GPUAPI: Multi-level Chapel Runtime API for GPUs
GPUburn: A System to Test and Mitigate GPU Hardware Failures
GpuC: Data parallel language extension to CUDA
gpucc: an open-source GPGPU compiler
GPUCV: A Framework for Image Processing Acceleration with Graphics Processors
GpuCV: A GPU-Accelerated Framework for Image Processing and Computer Vision
GPUDet: A Deterministic GPU Architecture
GPUdmm: A High-Performance and Memory-Oblivious GPU Architecture Using Dynamic Memory Management
GPUdrive: Reconsidering Storage Accesses for GPU Acceleration
Gpufit: An open-source toolkit for GPU-accelerated curve fitting
GPUfs: Integrating a File System with GPUs
GPUGI: Global Illumination Effects on the GPU
GPUHammer: Rowhammer Attacks on GPU Memories are Practical
GPUHarbor: Testing GPU Memory Consistency at Large
GPUinspiral - a low-latency, high-performance implementation of the matched-filter gravitational wave search algorithm
GPULib: GPU Computing in High-Level Languages
GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs
GPUMap: A Transparently GPU-Accelerated Map Function
GPUMC: A Stateless Model Checker for GPU Weak Memory Concurrency
GPUMCD: a new GPU-oriented Monte Carlo dose calculation platform
GPUMLib: A new Library to combine Machine Learning algorithms with Graphics Processing Units
GPUmotif: An Ultra-Fast and Energy-Efficient Motif Analysis Program Using Graphics Processing Units
GPUMP: A Multiple-Precision Integer Library for GPUs
GPUNet: Searching the Deployable Convolution Neural Networks for GPUs
gpuPairHMM: High-speed Pair-HMM Forward Algorithm for DNA Variant Calling on GPUs
GPUQP: query co-processing using graphics processors
GPUQT: An efficient linear-scaling quantum transport code fully implemented on graphics processing units
GPURepair: Automated Repair of GPU Kernels
GPUs as an Opportunity for Offloading Garbage Collection
GPUs as Storage System Accelerators
GPUs for data processing in the MWA
GPUs for fast pattern matching in the RICH of the NA62 experiment
GPUs for fast triggering and pattern matching at the CERN experiment NA62
GPUs for real-time processing in HEP trigger systems
GPUs, a New Tool of Acceleration in CFD: Efficiency and Reliability on Smoothed Particle Hydrodynamics Methods
GPUs: A Closer Look
GPUs: An Oasis in the Supercomputing Desert
gpuSPHASE - A shared memory caching implementation for 2D SPH using CUDA
gpustats: GPU Library for Statistical Computing in Python
GPUstore: Harnessing GPU Computing for Storage Systems in the OS Kernel
GPUSync: A Framework for Real-Time GPU Management
GPUSync: Architecture-Aware Management of GPUs for Predictable Multi-GPU Real-Time Systems
GPUTeraSort: high performance graphics co-processor sorting for large database management
GPUVerify: A Verifier for GPU Kernels
GPUVM: GPU-driven Unified Virtual Memory
GPUvm: Why Not Virtualizing GPUs at the Hypervisor?
GpuWars: Design and Implementation of a GPGPU Game
GPUWattch: Enabling Energy Optimizations in GPGPUs
gR: A GPU-based Router
Grace: a Cross-platform Micromagnetic Simulator On Graphics Processing Units
GRace: a low-overhead mechanism for detecting data races in GPU programs
Gradient based dominant motion estimation with integral projections for real time video stabilisation
Graduate Operating Systems: Project Report
GRAFX: An Open-Source Library for Audio Processing Graphs in PyTorch
Granular visibility queries on the GPU
Graph Analysis with High-Performance Computing
Graph Coarsening and Clustering on the GPU
Graph Generation on GPUs using Dynamic Memory Allocation
Graph grammar based multi-frontal direct solver for isogeometric FEM simulations on GPU
Graph Processing on GPU
Graph Processing on GPUs: A Survey
Graph-based Parallel Analysis of Large Analog Circuits Based on GPU Platforms
Graph-Based Substructure Pattern Mining Using CUDA Dynamic Parallelism
GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU
Graphic Processing Unit Simulation of Axon Growth and Guidance through Cue Diffusion on Massively Parallel Processors
Graphic processing unit-accelerated mutual information-based 3D image rigid registration
Graphic processors to speed-up simulations for the design of high performance solar receptors
Graphic-Card Cluster for Astrophysics (GraCCA) - Performance Tests
Graphic-Processing-Units Based Adaptive Parameter Estimation of a Visual Psychophysical Model
Graphical Asian Options
Graphical future
Graphical processing unit implementation of an integrated shape-based active contour: Application to digital pathology
Graphical Processing Units (GPU) acceleration of finite-difference frequency-domain (FDFD) technique
Graphical Processing Units (GPU)-based modeling for Acoustic and Ultrasonic NDE
Graphical Processing Units for Quantum Chemistry
Graphics Card as a Cheap Supercomputer
Graphics hardware & GPU computing: past, present, and future
Graphics Hardware based Efficient and Scalable Fuzzy C-Means Clustering
Graphics Hardware Implementation of the Parameter-Less Self-organising Map
Graphics Hardware-Based Level-Set Method for Interactive Segmentation and Visualization
Graphics Processing Unit (GPU) Implementation Methodology of AERMOD Model
Graphics processing unit (GPU) programming strategies and trends in GPU computing
Graphics processing unit accelerated non-uniform fast Fourier transform for ultrahigh-speed, real-time Fourier-domain OCT
Graphics Processing Unit Accelerated O(N) Micromagnetic Solver
Graphics Processing Unit Acceleration of the Explicit Solution of the Time Domain Volume Integral Equation Using OpenACC
Graphics Processing Unit acceleration of the Random Phase Approximation in the projector augmented wave method
Graphics Processing Unit Audio Signals Processing in Pure Data and PdCUDA an Implementation with the CUDA Runtime API
Graphics Processing Unit based searching the critical slip surface of slopes by the Vector Sum Analysis Method
Graphics Processing Unit Bloom Filters: Classical and Probabilistic
Graphics processing unit implementation of lattice Boltzmann models for flowing soft systems
Graphics processing unit implementations of relative expression analysis algorithms enable dramatic computational speedup
Graphics processing unit parallel accelerated solution of the discrete ordinates for photon transport in biological tissues
Graphics Processing Unit Utilization in Circuit Simulation
Graphics processing unit-accelerated holography by simulated annealing
Graphics Processing Unit-Accelerated Quantitative Trait Loci Detection
Graphics Processing Unit-Based Computer-Aided Design Algorithms for Electronic Design Automation
Graphics Processing Units and Genetic Programming: An overview
Graphics Processing Units and High-Dimensional Optimization
Graphics Processing Units for Handhelds
Graphics Processing Units for the Real-time Linear Elastostatic Simulation of Liver
Graphics Processing Units in Acceleration of Bandwidth Selection for Kernel Density Estimation
Graphics Processing Units: More Than the Pathway to Realistic Video-Games
Graphics Processor Clusters for High Speed Backpropagation
Graphics Processor Unit (GPU) Acceleration of Finite-Difference Frequency-Domain (FDFD) Method
Graphics processor unit (GPU) acceleration of finite-difference time-domain (FDTD) algorithm
Graphics Programming on the Web WebCL Course Notes
Graphics Supercomputing Applied to Brain Image Analysis with NiftyReg
Graphtoy: Fast Software Simulation of Applications for AMD's AI Engines
GraphVite: A High-Performance CPU-GPU Hybrid System for Node Embedding
GRATER: An Approximation Workflow for Exploiting Data-Level Parallelism in FPGA Acceleration
GraviDy: a GPU modular, parallel N-body integrator
Gravitational tree-code on graphics processing units: implementation in CUDA
Gravitational wave astrophysics, data analysis and multimessenger astronomy
GrAVity: a massively parallel antivirus engine
GRay: a Massively Parallel GPU-Based Code for Ray Tracing in Relativistic Spacetimes
Green AI: A Preliminary Empirical Study on Energy Consumption in DL Models Across Different Runtime Infrastructures
GreenGPU: A Holistic Approach to Energy Efficiency in GPU-CPU Heterogeneous Architectures
Grex: An efficient MapReduce framework for graphics processing units
Grid-based SAH BVH construction on a GPU
Grids, Clouds and Virtualization
grim: A Flexible, Conservative Scheme for Relativistic Fluid Theories
GRIM: A General, Real-Time Deep Learning Inference Framework for Mobile Devices based on Fine-Grained Structured Weight Sparsity
GrIP: A Framework for Experiments with Screen Space Algorithms
GRN: Gated Relation Network to Enhance Convolutional Neural Network for Named Entity Recognition
GROMACS on AMD GPU-Based HPC Platforms: Using SYCL for Performance and Portability
GROMACS on Hybrid CPU-GPU and CPU-MIC Clusters: Preliminary Porting Experiences, Results and Next Steps
GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers
GROPHECY: GPU performance projection from CPU code skeletons
Group Marching Tree: Sampling-Based Approximately Optimal Motion Planning on GPUs
Grover: Looking for Performance Improvement by Disabling Local Memory Usage in OpenCL Kernels
GRS - GPU radix sort for multifield records
gScan: Accelerating Graham Scan on the GPU
gSLIC: a real-time implementation of SLIC superpixel segmentation
gSLICr: SLIC superpixels at over 250Hz
gSMat: A Scalable Sparse Matrix-based Join for SPARQL Query Processing
GSNP: A DNA Single-Nucleotide Polymorphism Detection System with GPU Acceleration
GSParLib: A multi-level programming interface unifying OpenCL and CUDA for expressing stream and data parallelism
GStream: A General-Purpose Data Streaming Framework on GPU Clusters
gSuite: A Flexible and Framework Independent Benchmark Suite for Graph Neural Network Inference on GPUs
GT4Py: High Performance Stencils for Weather and Climate Applications using Python
Guardian: Safe GPU Sharing in Multi-Tenant Environments
GUESS-ing Polygenic Associations with Multiple Phenotypes Using a GPU-Based Evolutionary Stochastic Search Algorithm
Guided Profiling for Auto-Tuning Array Layouts on GPUs
Gunrock: A High-Performance Graph Processing Library on the GPU
Gunrock: GPU Graph Analytics
GViM: GPU-accelerated virtual machines
Gyrofluid Modeling of Turbulent, Kinetic Physics
Gyrokinetic Particle-in-Cell Optimization on Emerging Multi- and Manycore Platforms
Gyrokinetic Toroidal Simulations on Leading Multi- and Manycore HPC Systems
gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
H- and C-level WFST-based large vocabulary continuous speech recognition on Graphics Processing Units
H-LU Factorization on Many-Core Systems
H.264 Parallel Optimization on Graphics Processors
H.264/AVC motion estimation implementation on Compute Unified Device Architecture (CUDA)
HACC: Simulating Sky Surveys on State-of-the-Art Supercomputing Architectures
HAccRG: Hardware-Accelerated Data Race Detection in GPUs
Hacking Neural Networks: A Short Introduction
Hadoop Mapreduce OpenCL Plugin
Hadoop+Aparapi: Making heterogenous MapReduce programming easier
HadoopCL: MapReduce on Distributed Heterogeneous Platforms Through Seamless Integration of Hadoop and OpenCL
Hadoopcl2: Motivating the design of a distributed, heterogeneous programming system with machine-learning applications
HALF: Holistic Auto Machine Learning for FPGAs
HALO 1.0: A Hardware-agnostic Accelerator Orchestration Framework for Enabling Hardware-agnostic Programming with True Performance Portability for Heterogeneous HPC
Halo Gathering Scalability for Large Scale Multi-dimensional Sznajd Opinion Models Using Data Parallelism with GPUs
HAM - Heterogenous Active Messages for Efficient Offloading on the Intel Xeon Phi
Hand Tracking based on Hierarchical Clustering of Range Data
Handbook of open source tools
Handwritten Digit Recognition with a Committee of Deep Neural Nets on GPUs
HaoCL: Harnessing Large-scale Heterogeneous Processors Made Easy
HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis
Haptic and graphic rendering of deformable objects based on GPUs
Haptic feedback for the GPU-based surgical simulator
Haptic guided 3-D deformable image registration
Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU
Hard-Sphere Collision Simulations with Multiple GPUs, PCIe Extension Buses and GPU-GPU Communications
Hardware Accelerated Molecular Docking: A Survey
Hardware accelerated multi-resolution geometry synthesis
Hardware Accelerated Skin Deformation for Animated Crowds
Hardware accelerated symmetric condensed node TLM procedure for NVIDIA graphics processing units
Hardware Acceleration for Neural Networks: A Comprehensive Survey
Hardware Acceleration for Unstructured Big Data and Natural Language Processing
Hardware Acceleration of EDA Algorithms: Custom ICs, FPGAs and GPUs
Hardware Acceleration of EDA Algorithms: GPU Architecture and the CUDA Programming Model
Hardware Acceleration of HPC Computational Flow Dynamics using HBM-enabled FPGAs
Hardware Acceleration Technologies in Computer Algebra: Challenges and Impact
Hardware acceleration vs. algorithmic acceleration: can GPU-based processing beat complexity optimization for CT?
Hardware Accelerators for Artificial Intelligence
Hardware accelerators for biocomputing: A survey
Hardware Accelerators for Cartesian Genetic Programming
Hardware and Software Optimizations for Accelerating Deep Neural Networks: Survey of Current Trends, Challenges, and the Road Ahead
Hardware Checkpointing and Productive Debugging Flows for FPGAs
Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems
Hardware Implementation and Quantization of Tiny-Yolo-v2 using OpenCL
Hardware thread reordering to boost OpenCL throughput on FPGAs
Hardware Transactional Memory for GPU Architectures
Hardware-accelerated 3D visualization of mass spectrometry data
Hardware-Accelerated Adaptive EWA Volume Splatting
Hardware-accelerated parallel non-photorealistic volume rendering
Hardware-Accelerated Raycasting: Towards an Effective Brain MRI Visualization
Hardware-Accelerated Volume Rendering for Real-Time Medical Data Visualization
Hardware-assisted feature analysis and visualization of procedurally encoded multifield volumetric data
Hardware-Assisted High-Efficiency Ray Casting of Unstructured Time-Varying Flows Using Temporal Coherence
Hardware-Assisted Projected Tetrahedra
Hardware-assisted Rendering of CSG Models
Hardware-Assisted Software Testing and Debugging for Heterogeneous Computing
Hardware-assisted visibility sorting for unstructured volume rendering
Hardware-based nonlinear filtering and segmentation using high-level shading languages
Hardware-based simulation and collision detection for large particle systems
Hardware-Efficient Belief Propagation
Hardware-Oblivious Parallelism for In-Memory Column-Stores
Hardware-Oriented Multigrid Finite Element Solvers on GPU-Accelerated Clusters
Hardware/Software Co-Design for Data-Intensive Genomics Workloads
Hardware/Software Co-design for Energy-Efficient Seismic Modeling
Hardware/Software Vectorization for Closeness Centrality on Multi-/Many-Core Architectures
Harmonic CUDA: Asynchronous Programming on GPUs
Harnessing Aspect Oriented Programming on GPU: Application to Warp-Level Parallelism (WLP)
Harnessing Batched BLAS/LAPACK Kernels on GPUs for Parallel Solutions of Block Tridiagonal Systems
Harnessing GPU Computing in System-Level Software
Harnessing graphics processors for the fast computation of acoustic likelihoods in speech recognition
Harnessing Integrated CPU-GPU System Memory for HPC: a first look into Grace Hopper
Harnessing the GPU for Real-Time Haptic Tissue Simulation
Harnessing the Power of GPUs without Losing Abstractions in SaC and ArrayOL: A Comparative Study
Harnessing the power of idle GPUs for acceleration of biological sequence alignment
Harvesting graphics power for MD simulations
Hash-Based Authentication Revisited in the Age of High-Performance Computers
HashGraph - Scalable Hash Tables Using A Sparse Graph Data Structure
Hashing, Caching, and Synchronization: Memory Techniques for Latency Masking Multithreaded Applications
Hauberk: Lightweight Silent Data Corruption Error Detector for GPGPU
Have GPUs made FPGAs redundant in the field of video processing?
HCudaBLAST: an implementation of BLAST on Hadoop and Cuda
HCW 2009 keynote talk: GPU computing: Heterogeneous computing for future systems
HDArray: Parallel Array Interface for Distributed Heterogeneous Devices
Head Pose Tracking Using GPU Based Real-time 3D Registration
Heat Load Modelling for District Heating Plants Using an OpenCL-based Algorithm
HEATS: Heterogeneity- and Energy-Aware Task-based Scheduling
HELIOS-K: An Ultrafast, Open-source Opacity Calculator for Radiative Transfer
Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs
Hera-JVM: a runtime system for heterogeneous multi-core architectures
Hercules: A Compiler for Productive Programming of Heterogeneous Systems
Hermes: an integrated CPU/GPU microarchitecture for IP routing
HeSP: a simulation framework for solving the task scheduling-partitioning problem on heterogeneous architectures
HetCCL: Accelerating LLM Training with Heterogeneous GPUs
Hetero-DB: Next Generation High-Performance Database Systems by Best Utilizing Heterogeneous Computing and Storage Resources
Hetero-Mark, A Benchmark Suite for CPU-GPU Collaborative Computing
HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing
Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads
Heterogeneity-aware Fault Tolerance using a Self-Organizing Runtime System
Heterogeneity-Aware Resource Allocation and Scheduling in the Cloud
Heterogeneous (CPU+GPU) Working-set Hash Tables
Heterogeneous Accelerated Bioinformatics-Perspectives for Cancer Research
Heterogeneous Acceleration of Volumetric JPEG 2000
Heterogeneous Active Messages (HAM) - Implementing Lightweight Remote Procedure Calls in C++
Heterogeneous Clustering with Homogeneous Code: Accelerate MPI Applications Without Code Surgery Using Intel Xeon Phi Coprocessors
Heterogeneous Computing and Grid Scheduling with Hierarchically Parallel Evolutionary Algorithms
Heterogeneous Computing and Load Balancing Techniques for Monte Carlo Simulation in a Distributed Environment
Heterogeneous Computing for Data Stream Mining
Heterogeneous Computing for Real-Time Stereo Matching
Heterogeneous Computing for Solving System of the Linear Equations by the Conjugate Gradient Method
Heterogeneous Computing for Vertebra Detection and Segmentation in X-Ray Images
Heterogeneous Computing in Economics: a Simplified Approach
Heterogeneous Computing on Mixed Unstructured Grids with PyFR
Heterogeneous computing with an algorithmic skeleton framework
Heterogeneous Computing with OpenCL
Heterogeneous CPU/(GP)GPU Memory Hierarchy Analysis and Optimization
Heterogeneous CPU/GPU co-execution of CFD simulations on the POWER9 architecture: Application to airplane aerodynamics
Heterogeneous Distributed Big Data Clustering on Sparse Grids
Heterogeneous Energy-aware Load Balancing for Industry 4.0 and IoT Environments
Heterogeneous FTDT for Seismic Processing
Heterogeneous GPU and CPU acceleration of a finite volume compressible flow solver for multiblock structured grids
Heterogeneous GPU&CPU cluster for High Performance Computing in cryptography
Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi
Heterogeneous Highly Parallel Implementation of Matrix Exponentiation Using GPU
Heterogeneous multicore parallel programming for graphics processing units
Heterogeneous Network Embedding via Deep Architectures
Heterogeneous NPACI-Rocks/MPI/CUDA distributed multi-GPGPU application for seeking counterexamples to Beal's Conjecture: MPI/CUDA integration component
Heterogeneous parallel algorithms for Computational Fluid Dynamics on unstructured meshes
Heterogeneous parallel computing for image registration and linear algebra applications
Heterogeneous Parallelization and Acceleration of Molecular Dynamics Simulations in GROMACS
Heterogeneous Programming with Single Operation Multiple Data
Heterogeneous Resource-Elastic Management for FPGAs: Concepts, Theory and Implementation
Heterogeneous Resource-Elastic Scheduling for CPU+FPGA Architectures
Heterogeneous Task Scheduling for Accelerated OpenMP
Heterogenous Acceleration for Linear Algebra in Multi-Coprocessor Environments
HeteroMap: A Runtime Performance Predictor for Efficient Processing of Graph Analytics on Heterogeneous Multi-Accelerators
HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments
HetExchange: Encapsulating heterogeneous CPU-GPU parallelism in JIT compiled engines
HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism
Heuristic Adaptability to Input Dynamics for SpMM on GPUs
Heuristic Optimization Methods for Improving Performance of Recursive General Purpose Applications on GPUs
Heuristics for Conversion Process of GPU's Kernels for Multiples Kernels with Concurrent Optimization Divergence
Heuristics for the Variable Sized Bin Packing Problem Using a Hybrid P-System and CUDA Architecture
HexServer: an FFT-based protein docking server powered by graphics processors
HG-Caffe: Mobile and Embedded Neural Network GPU (OpenCL) Inference Engine with FP16 Supporting
HHT-based time-frequency analysis method for biomedical signal applications
HiAL-Ckpt: A hierarchical application-level checkpointing for CPU-GPU hybrid systems
HiCCL: A Hierarchical Collective Communication Library
hiCUDA: a high-level directive-based language for GPU programming
hiCUDA: High-Level GPGPU Programming
Hidden Surface Removal Using BSP Tree with CUDA
HiDP: A Hierarchical Data Parallel Language
Hierarchical Agglomerative Clustering Using Graphics Processor with Compute Unified Device Architecture
Hierarchical belief propagation to reduce search space using CUDA for stereo and motion estimation
Hierarchical clustering of gene expression profiles with graphics hardware acceleration
Hierarchical DAG Scheduling for Hybrid Distributed Systems
Hierarchical Exploration of Volumes Using Multilevel Segmentation of the Intensity-Gradient Histograms
Hierarchical fractional-step approximations and parallel kinetic Monte Carlo algorithms
Hierarchical Line Integration
Hierarchical Mapping Techniques for Signal Processing Systems on Parallel Platforms
Hierarchical Markov Random Fields Applied to Model Soft Tissue Deformations on Graphics Hardware
Hierarchical Matrix Operations on GPUs: Matrix-Vector Multiplication and Compression
Hierarchical N-body simulations with auto-tuning for heterogeneous systems
Hierarchical Octree and Sub-Volume Texture Block Projection for GPU Accelerated Ray Casting Volume Rendering
Hierarchical overlapped tiling
Hierarchical parallel processing of large scale data clustering on a PC cluster with GPU co-processing
Hierarchical Partitioning Algorithm for Scientific Computing on Highly Heterogeneous CPU + GPU Clusters
Hierarchical QR factorization algorithms for multi-core cluster systems
Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach
Hierarchical Roofline Analysis: How to Collect Data using Performance Tools on Intel CPUs and NVIDIA GPUs
Hierarchical Semantic Parsing for Object Pose Estimation in Densely Cluttered Scenes
Hierarchical Stochastic Motion Blur Rasterization
Hierarchical Transparent Programming for Heterogeneous Computing
Hierarchical Visualization and Compression of Large Volume Datasets Using GPU Clusters
High accuracy electron beam model development: MICHELLE eBEAM
High Accuracy Gravitational Waveforms from Black Hole Binary Inspirals Using OpenCL
High accuracy solutions to energy gradient flows from material science models
High dimensional pricing of exotic European contracts on a GPU Cluster, and comparison to a CPU cluster
High Dimensional Spaces and Modelling in the task of Speaker Recognition
High energy electromagnetic particle transportation on the GPU
High Level High Performance Computing for Multitask Learning of Time-varying Models
High Level Programming for Heterogeneous Architectures
High Level Synthesis and Evaluation of the Secure Hash Standard for FPGAs
High locality and increased intra-node parallelism for solving finite element models on GPUs by novel element-by-element implementation
High Performance Adaptive Image Processing on Multi-scale Hybrid Architectures
High Performance Algorithms for Counting Collisions and Pairwise Interactions
High Performance Algorithms to Improve the Runtime Computation of Spacecraft Trajectories
High Performance and Scalable GPU Graph Traversal
High Performance and Scalable Radix Sorting: A case study of implementing dynamic parallelism for GPU computing
High Performance Approximate Sort Algorithm Using GPUs
High performance bioinformatics and computational biology on general-purpose graphics processing units
High performance cellular level agent-based simulation with FLAME for the GPU
High Performance Client-Side Web Programming with SPOC and Js_of_ocaml
High Performance Code Generation for Stencil Computation on Heterogeneous Multi-device Architectures
High performance comparison-based sorting algorithm on many-core GPUs
High performance computation and interactive display of molecular orbitals on GPUs and multi-core CPUs
High performance computing for deformable image registration: Towards a new paradigm in adaptive radiotherapy
High Performance Computing for Large Graphs of Internet Applications using GPU
High performance computing for linear acoustic wave simulation
High Performance Computing for solving large sparse systems. Optical Diffraction Tomography as a case of study
High Performance Computing Image Analysis for Radiotherapy Planning
High Performance Computing of Dynamic Structural Response Analysis for the Integrated Earthquake Simulation
High Performance Computing of Meshless Time Domain Method on Multi-GPU Cluster
High performance computing on Android devices - a case study
High Performance Computing on Astrophysics with Artificial Intelligence Algorithms
High Performance Computing on GPU for Electromagnetic Logging
High Performance Computing using GPGPU's
High Performance Computing Using MPI and OpenMP on Multi-core Parallel Systems
High Performance Computing via a GPU
High Performance Computing via High Level Synthesis
High Performance Computing with Accelerators
High Performance Computing with FPGAs and OpenCL
High Performance Computing with GPUs
High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning
High performance content-based matching using GPUs
High Performance Data Leak Detection
High Performance Data Mining Using R on Heterogeneous Platforms
High performance dense linear system solver with soft error resilience
High Performance Direct Gravitational N-body Simulations on Graphics Processing Unit I: An implementation in Cg
High Performance Direct Gravitational N-body Simulations on Graphics Processing Units
High performance direct gravitational N-body simulations on graphics processing units II: An implementation in CUDA
High Performance Direct Gravitational N-body Simulations on Graphics Processing Units: An implementation in CUDA (thesis)
High performance discrete Fourier transforms on graphics processors
High Performance Error Correction for Quantum Key Distribution using Polar Codes
High Performance Extreme Learning Machines: A Complete Toolbox for Big Data Applications
High Performance FFT Based Poisson Solver on a CPU-GPU Heterogeneous Platform
High Performance Financial Simulation Using Randomized Quasi-Monte Carlo Methods
High performance finite difference PDE solvers on GPUs
High performance gate-level simulation with GP-GPU computing
High performance genetic programming on GPU
High Performance GPU Accelerated Local Optimization in TSP
High Performance GPU Code Generation for Matrix-Matrix Multiplication using MLIR: Some Early Results
High Performance GPU Implementation of KNN Algorithm: A Review
High Performance GPU-based Fourier Volume Rendering
High Performance GPU-based Proximity Queries using Distance Fields
High performance high-order numerical methods: applications in ocean modeling
High performance histogramming on massively parallel processors
High Performance Histograms on SIMT and SIMD Architectures
High Performance Hybrid Functional Petri Net Simulations of Biological Pathway Models on CUDA
High performance implementation of hydrodynamic interactions and applications with the sub-cellular element method
High Performance Implementation of Ultrasound Color Doppler Imaging on GPU platform
High performance in silico virtual drug screening on many-core processors
High Performance Iterative Solver for Linear System using Multi GPU
High Performance Lattice Boltzmann Solvers on Massively Parallel Architectures with Applications to Building Aeraulics
High Performance Low Power Embedded Vision Systems
High performance massively parallel direct N-body simulations on large GPU clusters
High Performance Matrix Inversion on a Multi-core Platform with Several GPUs
High Performance Matrix Multiplication
High performance memetic algorithm particle filter for multiple object tracking on modern GPUs
High performance methods for frequent pattern mining
High Performance Monte Carlo and Time-Stepping Dynamics for the Classical Spin Heisenberg Model on GPUs
High Performance Monte Carlo Simulation of Ising Model on TPU Clusters
High performance MRI simulations of motion on multi-GPU systems
High Performance Multi-agent System based Simulations
High Performance Multi-dimensional (2D/3D) FFT-Shift Implementation on Graphics Processing Units (GPUs)
High Performance N-Body Simulation and Visualization through CUDA Architecture
High Performance Non-Blocking Collective Communication for Next Generation Infiniband Clusters
High Performance Parallel Design Based on Session Programming
High Performance Parallel Implementation of Compressive Sensing SAR Imaging
High performance pattern matching and data remanence on graphics processing units
High Performance Poisson Equation Solver for Hybrid CPU/GPU Systems
High Performance Portable Tsunami Simulations on Many-core CPU, GPU, and FPGA
High Performance Power Spectrum Analysis Using a FPGA Based Reconfigurable Computing Platform
High performance predictable histogramming on GPUs: exploring and evaluating algorithm trade-offs
High Performance Privacy Preserving AI
High Performance Processor Development for Consumer Electronics Game Processor Perspective
High Performance Programming for Soft Computing
High Performance Radiation Transport Simulations: Preparing for TITAN
High performance realtime vision for mobile robots on the GPU
High Performance Relevance Vector Machine on GPUs
High Performance Remote Sensing Image Processing Using CUDA
High performance sequence mining using pairwise statistical significance
High Performance Simulation for Scalable Multi-Agent Reinforcement Learning
High Performance Stencil Code Algorithms for GPGPUs
High Performance Stencil Code Generation with Lift
High Performance Stereo Vision Designed for Massively Data Parallel Platforms
High performance stream computing for particle beam transport simulations
High Performance Streaming Smith-Waterman Implementation with Implicit Synchronization on Intel FPGA using OpenCL
High performance system for the Interactive rendering of a 3D Model into MPEG-4
High Performance System in GPU and CUDA Media Processing System
High performance technique for database applications using a hybrid GPU/CPU platform
High performance transcription factor-DNA docking with GPU computing
High performance volume splatting for visualization of neurovascular data
High Precision Integer Multiplication with a GPU Using Strassen's Algorithm with Multiple FFT Sizes
High precision integer multiplication with a graphics processing unit
High productivity multi-device exploitation with the Heterogeneous Programming Library
High Quality Cone-beam CT Reconstruction on the GPU
High Quality Elliptical Texture Filtering on GPU
High Quality Image Reconstruction of Point Models
High Quality Interactive Rendering of Massive Point Models Using Multi-way kd-Trees
High Rayleigh Number Mantle Convection on GPU
High Resolution Program Flow Visualization of Hardware Accelerated Hybrid Multi-core Applications
High Resolution Sparse Voxel DAGs
High speed 3-D registration using GPU
High Speed Articulated Object Tracking Using GPUs: A Particle Filter Approach
High speed cipher cracking: the case of Keeloq on CUDA
High Speed Compressed Sensing Reconstruction in Dynamic Parallel MRI Using Augmented Lagrangian and Parallel Processing
High speed view interpolation for tele-teaching and tele-conferencing
High Throughput Low Latency LDPC Decoding on GPU for SDR Systems
High throughput multiple-precision GCD on the CUDA architecture
High Throughput Variable Size Non-square Gabor Engine with Feature Pooling Based on GPU
High-accuracy Optimization by Parallel Iterative Discrete Approximation and GPU Cluster Computing
High-accuracy Optimization by Parallel Iterative Discrete Approximation and Multi-GPU Computing
High-Dimensional Adaptive Particle Swarm Optimization on Heterogeneous Systems
High-dimensional Planning on the GPU
High-dimensional wave atoms and compression of seismic datasets
High-Efficient Parallel CAVLC Encoders on Heterogeneous Multicore Architectures
High-Level Design for FPGA-based Multiprocessor Accelerators
High-Level Energy Model of Embedded GPU for Real-Time Graphic Rendering
High-level GPU computing with jacket for MATLAB and C/C++
High-level GPU programming in Julia
High-Level Manipulation of OpenCL-Based Subvectors and Submatrices
High-level Parallel Programming Support for Heterogeneous Systems
High-Level Programming Framework for Executing Streaming Applications on Heterogeneous OpenCL Platforms
High-Level programming of graphics hardware to increase performance of electromagnetics simulation
High-level Programming of Vulkan-based GPUs Through OpenMP
High-Level Support for Pipeline Parallelism on Many-Core Architectures
High-Level Synthesis for FPGAs: From Prototyping to Deployment
High-Order Algorithms for Compressible Reacting Flow with Complex Chemistry
High-Order Discontinuous Galerkin Methods by GPU Metaprogramming
High-Order Error-Optimized FDTD Algorithm With GPU Implementation
High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster
High-Order Schemes for the Shallow Water Equations on GPUs
High-order thread-safe lattice Boltzmann model for HPC turbulent flow simulations
High-performance 3D Compressive Sensing MRI reconstruction
High-Performance 3D Compressive Sensing MRI Reconstruction Using Many-Core Architectures
High-performance and Embedded Systems for Cryptography
High-performance and Hardware-aware Computing: Proceedings of the First International Workshop on New Frontiers in High-performance and Hardware-aware Computing (HipHaC'08)
High-performance astrophysical visualization using Splotch
High-performance bankruptcy prediction model using Graphics Processing Units
High-performance biocomputing for simulating the spread of contagion over large contact networks
High-performance Blob-based iterative reconstruction of electron tomography on multi-GPUs
High-performance blob-based iterative three-dimensional reconstruction in electron tomography using multi-GPUs
High-Performance Code Generation for Stencil Computations on GPU Architectures
High-Performance Computation of a Jet in Cross Flow by Lattice Boltzmann Based Parallel Direct Numerical Simulation
High-Performance Computing Algorithms for Constructing Inverted Files on Emerging Multicore Processors
High-performance Computing in China: Research and Applications
High-Performance Computing using GPUs
High-Performance Computing with Accelerators
High-Performance Computing: from Optimization to Automation
High-performance cone beam reconstruction using CUDA compatible GPUs
High-performance CUDA kernel execution on FPGAs
High-Performance Deep Learning via a Single Building Block
High-Performance Diagnostic Fault Simulation on GPUs
High-Performance Distributed Multi-Model / Multi-Kernel Simulations: A Case-Study in Jungle Computing
High-performance Dynamic Programming on FPGAs with OpenCL
High-Performance Energy-Efficient Multicore Embedded Computing
High-Performance General Solver for Extremely Large-Scale Semidefinite Programming Problems
High-Performance GPGPU Programming with OCaml
High-performance GPU based Rendering for Real-Time, rigid 2D/3D-Image Registration in Radiation Oncology
High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs
High-Performance High-Order Stencil Computation on FPGAs Using OpenCL
High-Performance Holistic XML Twig Filtering Using GPUs
High-Performance Image Synthesis for Radio Interferometry
High-performance Implementations and Large-scale Validation of the Link-wise Artificial Compressibility Method
High-Performance Interactive Scientific Visualization With Datoviz via the Vulkan Low-Level GPU API
High-Performance Iterative Electron Tomography Reconstruction with Long-Object Compensation using Graphics Processing Units (GPUs)
High-Performance Location-Aware Publish-Subscribe on GPUs
High-Performance Matrix-Vector Multiplication on the GPU
High-Performance Monte Carlo Radiosity on GPU based on Scene Partitioning
High-Performance Multi-View Reconstruction
High-Performance Neural Networks for Visual Object Classification
High-Performance Online Spatial and Temporal Aggregations on Multi-core CPUs and Many-Core GPUs
High-Performance Out-of-core Block Randomized Singular Value Decomposition on GPU
High-Performance Physics Simulations Using Multi-Core CPUs and GPGPUs in a Volunteer Computing Context
High-performance polynomial GCD computations on graphics processors
High-Performance Pseudo-Random Number Generation on Graphics Processing Units
High-Performance Reverse Time Migration on GPU
High-performance SIMT code generation in an active visual effects library
High-performance software rasterization on GPUs
High-performance sparse matrix-matrix products on Intel KNL and multicore architectures
High-performance sparse matrix-vector multiplication on GPUs for structured grid computations
High-Performance Spatial Join Processing on GPGPUs with Applications to Large-Scale Taxi Trip Data
High-Performance Spatial Query Processing on Big Taxi Trip Data using GPGPUs
High-Performance Symmetric Block Ciphers on Multicore CPU and GPUs
High-Performance Tensor Contractions for GPUs
High-Performance Zonal Histogramming on Large-Scale Geospatial Rasters Using GPUs and GPU-Accelerated Clusters
High-precision molecular dynamics simulation of UO2-PuO2: Anion self-diffusion in UO2
High-precision molecular dynamics simulation of UO2-PuO2: pair potentials comparison
High-precision molecular dynamics simulation of UO2-PuO2: superionic transition in uranium dioxide
High-precision Monte Carlo study of the three-dimensional XY model on GPU
High-Precision Numerical Simulations of Rotating Black Holes Accelerated by CUDA
High-quality cardiac image dynamic visualization with feature enhancement and virtual surgical tool inclusion
High-Quality Point-Based Rendering on Modern GPUs
High-quality pre-integrated volume rendering using hardware-accelerated pixel shading
High-quality Real-time Stereo using Adaptive Cost Aggregation and Dynamic Programming
High-Quality Rendering of Quartic Spline Surfaces on the GPU
High-Quality Rendering of Varying Isosurfaces with Cubic Trivariate C1-Continuous Splines
High-quality surface splatting on today's GPUs
High-Quality, Semi-Analytical Volume Rendering for AMR Data
High-resolution stereo video rectification through a cost-efficient real-time GPU implementation using intrinsic and extrinsic camera parameters
High-Speed Dense Stereo Via Directional Center-Biased Windows on Graphics Hardware
High-speed electromagnetic field simulation by HIE-FDTD method with GPGPU
High-Speed GPU-Based Fully Three-Dimensional Diffuse Optical Tomographic System
High-Speed Implementations of Block Cipher ARIA Using Graphics Processing Units
High-Speed Object Detection: Design, Study and Implementation of a Detection Framework using Channel Features and Boosting
High-speed parallel wavelet algorithm based on CUDA and its application in three-dimensional surface texture analysis
High-Speed Private Information Retrieval Computation on GPU
High-Speed Stream-Centric Dense Stereo and View Synthesis on Graphics Hardware
High-Speed Turbo Equalization for GPP-based Software Defined Radios
High-speed volume ray casting with CUDA
High-Throughput All-Atom Molecular Dynamics Simulations Using Distributed Computing
High-throughput Analysis of Large Microscopy Image Datasets on CPU-GPU Cluster Platforms
High-throughput bayesian computing machine with reconfigurable hardware
High-throughput Bayesian network learning using heterogeneous multicore computers
High-throughput Execution of Hierarchical Analysis Pipelines on Hybrid Cluster Platforms
High-Throughput parallel blind Virtual Screening using BINDSURF
High-Throughput Parallel Viterbi Decoder on GPU Tensor Cores
High-throughput protein crystallization on the World Community Grid and the GPU
High-throughput sequence alignment using Graphics Processing Units
High-Throughput Sequence Translation Using CUDA
High-throughput stream categorization and intrusion detection on GPU
High-Throughput Transaction Executions on Graphics Processors
Higher order FEM numerical integration on GPUs with OpenCL
Higher-order CFD and Interface Tracking Methods on Highly-Parallel MPI and GPU systems
Highly accelerated feature detection in proteomics data sets using modern graphics processing units
Highly accelerated simulations of glassy dynamics using GPUs: caveats on limited floating-point precision
Highly Efficient 8-bit Low Precision Inference of Convolutional Neural Networks with IntelCaffe
Highly Efficient Forward and Backward Propagation of Convolutional Neural Networks for Pixelwise Classification
Highly Efficient Lattice-Boltzmann Multiphase Simulations of Immiscible Fluids at High-Density Ratios on CPUs and GPUs through Code Generation
Highly efficient mapping of the Smith-Waterman algorithm on CUDA-compatible GPUs
Highly interactive computational steering for coupled 3D flow problems utilizing multiple GPUs
Highly Optimized Full GPU-Acceleration of Non-hydrostatic Weather Model SCALE-LES
Highly optimized simulations on single- and multi-GPU systems of 3D Ising spin glass
Highly parallel decoding of space-time codes on graphics processing units
Highly Parallel Rate-Distortion Optimized Intra-Mode Decision on Multicore Graphics Processors
Highly Scalable Multi Objective Test Suite Minimisation Using Graphics Cards
Highly Scalable Multiplication for Distributed Sparse Multivariate Polynomials on Many-core Systems
Hinomiyagura Infrastructure Competition TDP: Platform of rescue simulation using GPGPU
HIPAcc: A Domain-Specific Language and Compiler for Image Processing
HipBone: A performance-portable GPU-accelerated C++ version of the NekBone benchmark
HipKittens: Fast and Furious AMD Kernels
HIPRT: A Ray Tracing Framework in HIP
HiRace: Accurate and Fast Source-Level Race Checking of GPU Programs
HISQ inverter on Intel Xeon Phi and NVIDIA GPUs
Histogram Computations on GPUs Kernel using Global and Shared Memory Atomics
Historic Learning Approach for Auto-tuning OpenACC Accelerated Scientific Applications
Historygrams: Enabling Interactive Global Illumination in Direct Volume Rendering using Photon Mapping
HLS Portability from Intel to Xilinx: A Case Study
hls4ml: A Flexible, Open-Source Platform for Deep Learning Acceleration on Reconfigurable Hardware
hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices
HLSDataset: Open-Source Dataset for ML-Assisted FPGA Design using High Level Synthesis
hlslib: Software Engineering for Hardware Design
HOCL: A Family of Embedded Languages
Home-made Diffusion Model from Scratch to Hatch
Homomorphic Autocomplete
Homomorphic-Encrypted Volume Rendering
Homunculus Warping: Conveying importance using self-intersection-free non-homogeneous mesh deformation
HONEI: A collection of libraries for numerical computations targeting multiple processor architectures
HORIZON: Accelerated General Relativistic Magnetohydrodynamics
Hotspot Analysis Based Partial CUDA Acceleration of HMMER 3.0 on GPGPUs
How a Single Chip Causes Massive Power Bills. GPUSimPow: A GPGPU Power Simulator
How GPUs Can Improve the Quality of Magnetic Resonance Imaging
How GPUs Work
How much can we gain from Tensor Kernel Fusion on GPUs?
How to Benefit from AMD, Intel and Nvidia Accelerator Technologies in Scilab
How to Correctly Deal With Pseudorandom Numbers in Manycore Environments - Application to GPU programming with Shoverand
How to distribute most efficiently a computation intensive calculation on an Android device to external compute units with an Android API
How to obtain efficient GPU kernels: an illustration using FMM & FGT algorithms
How to Render FDTD Computations More Effective Using a Graphics Accelerator
How to Rent GPUs on a Budget
How to scale distributed deep learning?
How to Train BERT with an Academic Budget
How well do STARLAB and NBODY compare? II: Hardware and accuracy
HPAC-Offload: Accelerating HPC Applications with Portable Approximate Computing on the GPU
HPC acceleration of large (min, +) matrix products to compute domination-type parameters in graphs
HPC on the Intel Xeon Phi: Homomorphic Word Searching
HPC-Coder-V2: Studying Code LLMs Across Low-Resource Parallel Languages
HPC++: An LLVM-Based Automatic Parallelization Framework with Heterogeneous CPU–GPU Execution
HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration
HPerf: A Lightweight Profiler for Task Distribution on CPU+GPU Platforms
HPP-Controller: An intra-node controller designed for connecting heterogeneous CPUs
HPVM: A Portable Virtual Instruction Set for Heterogeneous Parallel Systems
HPVM: Heterogeneous Parallel Virtual Machine
HPX - The C++ Standard Library for Parallelism and Concurrency
HSApriori: High Speed Association Rule Mining using Apriori Based Algorithm for GPU
HSPA+/LTE-A Turbo Decoder on GPU and Multicore CPU
HSTREAM: A directive-based language extension for heterogeneous stream computing
HTML5 WebSocket protocol and its application to distributed computing
HUGO: Hierarchical mUlti-reference Genome cOmpression for aligned reads
Human Re-identification System On Highly Parallel GPU and CPU Architectures
Humanoid navigation planning using future perceptive capability
Hunting CUDA Bugs at Scale with cuFuzz
Hybrid Acceleration of a Molecular Dynamics Simulation Using Short-Ranged Potentials
Hybrid algorithms for efficient Cholesky decomposition and matrix inverse using multicore CPUs with GPU accelerators
Hybrid Algorithms for List Ranking and Graph Connected Components
Hybrid coherence for scalable multicore architectures
Hybrid computational voxelization using the graphics pipeline
Hybrid Core Acceleration of UWB SIRE Radar Signal Processing
Hybrid CPU and GPGPU Volunteer Computing Framework over the Extensible Messaging and Presence Protocol for Parallel Branch and Bound Optimization of Truss Structures
Hybrid CPU-GPU Distributed Framework for Large Scale Mobile Networks Simulation
Hybrid CPU-GPU execution support in the skeleton programming framework SkePU
Hybrid CPU-GPU Framework for Network Motifs
Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods
Hybrid CPU-GPU Implementation of Tracking-Learning-Detection Algorithm
Hybrid CPU-GPU Pipeline Framework
Hybrid CPU/GPU KD-Tree Construction for Versatile Ray Tracing
Hybrid CPU/GPU/APU accelerated query, insert, update and erase operations in hash tables with string keys
Hybrid CUDA, OpenMP, and MPI parallel programming on multicore GPU clusters
Hybrid Embarrassingly Parallel on heterogeneous platform
Hybrid Fortran: High Productivity GPU Porting Framework Applied to Japanese Weather Prediction Model
Hybrid Framework for pairwise DNA Sequence Alignment Using the CUDA compatible GPU
Hybrid GATE: A GPU/CPU implementation for imaging and therapy applications
Hybrid general-purpose computation on GPU (GPGPU) and computer graphics synthetic aperture radar simulation for complex scenes
Hybrid GPU-Based Single- and Double-Bounce SAR Simulation
Hybrid GPU-CPU Adaptive Precision Ray-Triangle Intersection Tests for Robust High-Performance GPU Dosimetry Computations
Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters
Hybrid Map Task Scheduling for GPU-Based Heterogeneous Clusters
Hybrid Monte Carlo CT Simulation on GPU
Hybrid Monte Carlo with Wilson Dirac operator on the Fermi GPU
Hybrid MPI and CUDA Parallelization for CFD Applications on Multi-GPU HPC Clusters
Hybrid MPI/GPU Interpolation for Grid DEM Construction
Hybrid Multicore Algorithms for Some Semi-Numerical Applications and Graphs
Hybrid of genetic algorithm and local search to solve MAX-SAT problem using nVidia CUDA framework
Hybrid OpenCL over high speed networks
Hybrid OpenCL: Connecting Different OpenCL Implementations over Network
Hybrid OpenCL: Enhancing OpenCL for Distributed Processing
Hybrid Parallel Light-Weight Programming of Hybrid Systems
Hybrid parallel programming - evaluation of OpenACC
Hybrid Parallel Streamline Extraction Combining MPI and OpenCL
Hybrid Parallelism for Volume Rendering on Large, Multi- and Many-core Systems
Hybrid Particle Lattice Boltzmann Shallow Water for interactive fluid simulations
Hybrid Programming using OpenSHMEM and OpenACC
Hybrid quantum programming with PennyLane Lightning on HPC platforms
Hybrid Ray Tracing and Path Tracing of Bezier Surfaces Using A Mixed Hierarchy
Hybrid Sample-based Surface Rendering
Hybrid Scheduling for Event-driven Simulation over Heterogeneous Computers
Hybrid Single/Double Precision Floating-Point Computation on GPU Accelerators for 2-D FDTD
Hybrid smoothed particle hydrodynamics
Hybrid strategy for stencil computations on the APU
Hybrid Update Algorithms for Regular Lattice and Small-World Ising Models on Graphical Processing Units
Hybrid Use of OmpSs for a Shock Hydrodynamics Proxy Application
Hybrid Visualization for White Matter Tracts using Triangle Strips and Point Sprites
Hydra: a C++11 framework for data analysis in massively parallel platforms
Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters
Hyper neural network on OpenCL
Hypercubic Storage Layout and Transforms in Arbitrary Dimensions using GPUs and CUDA
Hyperfast Parallel-Beam Backprojection
Hyperfast Perspective Cone-Beam Backprojection
Hyperspectral Unmixing on GPUs and Multi-Core Processors: A Comparison
HyPHI - task based hybrid execution C++ library for the Intel Xeon Phi coprocessor
I/O Lower Bounds for Auto-tuning of Convolutions in CNNs
I3DC: Interactive Three-Dimensional Cubes
IA-SpGEMM: An Input-aware Auto-tuning Framework for Parallel Sparse Matrix-Matrix Multiplication
IBM Deep Learning Service
Ice Simulation Using GPGPU
IceCubes GPGPU's cluster for extensive MC production
ICNet for Real-Time Semantic Segmentation on High-Resolution Images
Identification and Elimination of Platform-Specific Code Smells in High Performance Computing Applications
Identifying scalar behavior in CUDA kernels
Identifying the Key Features of Intel Xeon Phi: A Comparative Approach
IgNet. A Super-precise Convolutional Neural Network
Ignite-GPU: a GPU-enabled in-memory computing architecture on clusters
iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud
iGPU: Exception Support and Speculative Execution on GPUs
iGUARD: In-GPU Advanced Race Detection
Ikra-Cpp: A C++/CUDA DSL for Object-Oriented Programming with Structure-of-Arrays Layout
Ilargi: a GPU Compatible Factorized ML Model Training Framework
Illustrative Rendering of Particle Systems
Illustrative Stream Surfaces
Illustrative Volume Visualization Using GPU-Based Particle Systems
Image and Video Processing on CUDA: State of the Art and Future Directions
Image and Video Processing on GPU: Implementation Scheme, Applications and Future Directions
Image Classification with Pyramid Representation and Rotated Data Augmentation on Torch 7
Image Convolution Processing: a GPU versus FPGA Comparison
Image Denoising Using Wavelet Transform and CUDA
Image Encryption Using Parallel RSA Algorithm on CUDA
Image Noise Removal on Heterogeneous CPU-GPU Configurations
Image Object Tracking System Using Parallel Mean Shift Algorithm
Image parallel processing based on GPU
Image processing algorithm optimization with CUDA for Pure Data
Image processing applications on a low power highly parallel SIMD architecture
Image Processing on Graphical Processing Units for faster DNA Sequencing
Image Processing using Parallel Computing
Image Processing with CUDA
Image reconstruction in digital holographic microscopy on GPU
Image registration on GPU
Image representation by blob and its application in CT reconstruction from few projections
Image segmentation using CUDA implementations of the Runge-Kutta-Merson and GMRES methods
Image selection for improved Multi-View Stereo
Image Space Gathering
Image spatial diffusion on GPUs
Image super-resolution by vectorizing edges
Image Super-Resolution Using Deep Convolutional Networks
Image-based fast three-dimensional leaf modeling
Image-Based Material Restyling with Fast Non-local Means Filtering
Image-Based Proxy Accumulation for Real-Time Soft Global Illumination
Image-Space Caustics and Curvatures
Image-Space Collision Detection Through Alternate Surface Peeling
Image-Space GPU Metaballs for Time-Dependent Particle Data Sets
ImageCL: An Image Processing Language for Performance Portability on Heterogeneous Systems
ImageCL: Language and source-to-source compiler for performance portability, load balancing, and scalability prediction on heterogeneous systems
Impact of asynchronism on GPU accelerated parallel iterative computations
Impact of communication times on mixed CPU/GPU applications scheduling using KAAPI
Impact of data layouts on the efficiency of GPU-accelerated IDW interpolation
Impact of Floating-Point Precision on Boundary Layer Instabilities Modeled on Fermi GPU
Impact of GPU Memory Access Patterns on FDTD
Impact of Modern OpenGL on FPS
Impact of the channel count on the nonlinear tolerance in coherently-detected POLMUX-QPSK modulation
Impact of the Random Number generator quality on particle swarm optimization algorithm running on graphic processor units
Impact of Warp Formation on GPU Performance
Impacts of Parallel Programming on Limited-Resource Hardware
Implementability of shading models for current game engines
Implementation & Parallelisation of FDTD code for Electromagnetic Scattering
Implementation and Analysis of AES Encryption on GPU
Implementation and Evaluation of Recurrence Equation Solvers on GPGPU systems using Rearrangement of Array Configurations
Implementation and Evaluation of Scientific Simulations on High Performance Computing Architectures
Implementation and evaluation of various demons deformable image registration algorithms on GPU
Implementation and Experimental Evaluation of a CUDA Core under Single Event Effects
Implementation and Optimization of Image Processing Algorithms on Embedded GPU
Implementation and optimization of image processing algorithms on handheld GPU
Implementation and Performance Analysis of Many-body Quantum Chemical Methods on the Intel Xeon Phi Coprocessor and NVIDIA GPU Accelerator
Implementation and Performance Analysis of SEAL Encryption on FPGA, GPU and Multi-core Processors
Implementation and performance analysis of the AXPY, DOT, and SpMV functions on Intel Xeon Phi and NVIDIA Tesla using OpenCL
Implementation and Performance Comparison of the Motion Compensation Kernel of the AVS Video Decoder on FPGA, GPU and Multicore Processors
Implementation and performance evaluation of a GPU particle-in-cell code
Implementation and performance evaluation of reconstruction algorithms on graphics processors
Implementation Details of GPU-based Out-of-Core Many-Lights Rendering
Implementation of 2-D Discrete Cosine Transform Algorithm on GPU
Implementation of 3D FFTs Across Multiple GPUs in Shared Memory Environments
Implementation of 3D Monte Carlo PET reconstruction algorithm on GPU
Implementation of 802.11n on 128-CORE Processor
Implementation of a 3GPP LTE turbo decoder accelerator on GPU
Implementation of a distributed real-time video panorama pipeline for creating high quality virtual views
Implementation of a Fast Image Coding and Retrieval System Using a GPU
Implementation of a High Throughput 3GPP Turbo Decoder on GPU
Implementation of a High Throughput Soft MIMO Detector on GPU
Implementation of a Lattice Boltzmann kernel using the Compute Unified Device Architecture developed by nVIDIA
Implementation of a Lattice–Boltzmann method for numerical fluid mechanics using the nVIDIA CUDA technology
Implementation of a motion estimation algorithm for Intel FPGAs using OpenCL
Implementation of a Multi-User Detector for Satellite Return Links on a GPU Platform
Implementation of a multigrid solver on GPU for Stokes equations with strongly variable viscosity based on Matlab and CUDA
Implementation of a Parallel Tree Method on a GPU
Implementation of a PIC simulation using WebGL
Implementation of a Power Efficient Synthetic Aperture Radar Back Projection Algorithm on FPGAs Using OpenCL
Implementation of a Practical Distributed Calculation System with Browsers and JavaScript, and Application to Distributed Deep Learning
Implementation of a programming environment with a multithread model for reconfigurable systems
Implementation of a Soft Morphological Filter Based on GPU Framework
Implementation of Advanced Encryption Standard for encryption and decryption of images and text on a GPU
Implementation of algorithms for relativistic hydrodynamics using graphics processing units in CUDA framework
Implementation of algorithms with a fine-grained parallelism on GPUs
Implementation of Ant Colony Algorithm Based on GPU
Implementation of association rule mining using CUDA
Implementation of Autoencoders with Systolic Arrays through OpenCL
Implementation Of Decoders for LDPC Block Codes and LDPC Convolutional Codes Based on GPUs
Implementation of Diamond Search Algorithm Using Parallel Processing Architecture
Implementation of digital down converter in GPU
Implementation of Fast Artificial Neural Network for Pattern Classification on Heterogeneous System
Implementation of FDTD-Compatible Green's Function on Heterogeneous CPU-GPU Parallel Processing System
Implementation of Filtering Beamforming Algorithms for Sonar Devices Using GPU
Implementation of float-float operators on graphics hardware
Implementation of Frequency Domain Convolution for the Caffe-Framework
Implementation of high speed hash function Keccak on GPU
Implementation of Jacobi iterative method on graphics processor unit
Implementation of Just In Time Value Specialization for the Optimization of Data Parallel Kernels
Implementation of k-Means Clustering Algorithm in CUDA
Implementation of K-shortest Path Algorithm in GPU Using CUDA
Implementation of Kd-Trees on the GPU to Achieve Real Time Graphics Processing
Implementation of Keccak hash function in Tree hashing mode on Nvidia GPU
Implementation of Kernel Methods on the GPU
Implementation of Kirchhoff prestack depth migration on GPU
Implementation of large-scale FIR adaptive filters on NVIDIA GeForce graphics processing unit
Implementation of LTE Mini receiver on GPUs
Implementation of Massive Artificial Neural Networks with CUDA
Implementation of medical image segmentation in CUDA
Implementation of Motion Estimation Based on Heterogeneous Parallel Computing System with OpenCL
Implementation of Parallel Fast Hartley Transform (FHT) Using Cuda
Implementation of Parallel Genetic Algorithms on Graphics Processing Units
Implementation of Parallel Simplified Swarm Optimization in CUDA
Implementation of PDE models of cardiac dynamics on GPUs using OpenCL
Implementation of QR Updating Algorithms on the GPU
Implementation of random linear network coding on OpenGL-enabled graphics cards
Implementation of Sequential Importance Sampling in GPGPU
Implementation of Smith-Waterman Algorithm in OpenCL for GPUs
Implementation of Smith-Waterman algorithm in OpenCL for GPUs
Implementation of Spectral Angle Mapper (SAM) Algorithm on a Graphic processing unit (GPU)
Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration
Implementation of stereophonic acoustic echo canceller on nVIDIA GeForce graphics processing unit
Implementation of the "Local Rank Differences" Image Feature Using SIMD Instructions of CPU
Implementation of the FDTD Method Based on Lorentz-Drude Dispersive Model on GPU for Plasmonics Applications
Implementation of the genetic algorithm by means of CUDA technology involved in travelling salesman problem
Implementation of the Lucas-Kanade image registration algorithm on a GPU for 3D computational platform stabilisation
Implementation of the Neuberger-Dirac operator on GPUs
Implementation of the optimization algorithms on GPGPU architecture and multi-cores
Implementation of the r.cuda.los module in the open source GRASS GIS by using parallel computation on the NVIDIA CUDA graphic cards
Implementation of the SYCL Heterogeneous Computing Library
Implementation of the twisted mass fermion operator in the QUDA library
Implementation of usual computerized tomography methods on GPU using the Compute Unified Device Architecture (CUDA)
Implementation of Variable Preconditioned GCR with mixed precision on GPU using CUDA
Implementation of Virtual Embryology using the Thrust library for CUDA
Implementation Techniques for SPMD Kernels on CPUs
Implementations of a Parallel Algorithm for Computing Euclidean Distance Map in Multicore Processors and GPUs
Implementations of hardware acceleration for MD4-family algorithms based on GPU
Implementations of Parallel Computation of Euclidean Distance Map in Multicore Processors and GPUs
Implementations of the FFT algorithm on GPU
Implementations of the Hough Transform on the Embedded Multicore Processors
Implementing a Code Generator for Fast Matrix Multiplication in OpenCL on the GPU
Implementing a Finite Difference-Based Real-time Sound Synthesizer using GPUs
Implementing a GPU Programming Model on a non-GPU Accelerator Architecture
Implementing a GPU-Enhanced Cluster for Large-Scale Simulations
Implementing a Photorealistic Rendering System using GLSL
Implementing a Preconditioned Iterative Linear Solver Using Massively Parallel Graphics Processing Units
Implementing a Sparse Matrix Vector Product for the SELL-C/SELL-C-sigma formats on NVIDIA GPUs
Implementing AES on GPU: Final Report
Implementing an architecture for efficient network traffic processing on modern graphics hardware
Implementing an efficient method of check-pointing on CPU-GPU
Implementing an embedded GPU language by combining translation and generation
Implementing an Interior Point Method for Linear Programs on a CPU-GPU System
Implementing and evaluating an heterogeneous, scalable, tridiagonal linear system solver with OpenCL to target FPGAs, GPUs, and CPUs
Implementing and Evaluating Candidate-Based Invariant Generation
Implementing cartesian genetic programming classifiers on graphics processing units using GPU.NET
Implementing CFD (Computational Fluid Dynamics) in OpenCL for Building Simulation
Implementing Computer Vision Functions with OpenCL on the Qualcomm Adreno 420
Implementing Continuous Integration Software in an Established Computational Chemistry Software Package
Implementing Decision Trees and Forests on a GPU
Implementing Deep Neural Networks for Financial Market Prediction on the Intel Xeon Phi
Implementing density functional theory (DFT) methods on many-core GPGPU accelerators
Implementing Domain-Specific Languages for Heterogeneous Parallel Computing
Implementing Efficient, Portable Computations for Machine Learning
Implementing general matrix-matrix multiplication algorithm on the Intel Xeon Phi Knights Landing Processor
Implementing Genetic Algorithms to CUDA Environment Using Data Parallelization
Implementing implicit OpenMP data sharing on GPUs
Implementing Independent Component Analysis in General-Purpose GPU Architectures
Implementing Interactive 3D Segmentation on CUDA Using Graph-Cuts and Watershed Transformation
Implementing Level-3 BLAS Routines in OpenCL on Different Processing Units
Implementing LNS using filtering units of GPUs
Implementing Machine Learning Algorithms on GPUs for Real-Time Traffic Sign Classification
Implementing mesh-based approaches for deformable objects on GPU
Implementing modular arithmetic using OpenCL
Implementing Molecular Dynamics on Hybrid High Performance Computers - Particle-Particle Particle-Mesh
Implementing molecular dynamics on hybrid high performance computers - short range forces
Implementing Molecular Dynamics on Hybrid High Performance Computers - Three-Body Potentials
Implementing Neural Networks Efficiently
Implementing Open-Source CUDA Runtime
Implementing Parallel SMO to Train SVM on CUDA-Enabled Systems
Implementing Push-Pull Efficiently in GraphBLAS
Implementing QR Factorization Updating Algorithms on GPUs
Implementing sparse matrix-vector multiplication on throughput-oriented processors
Implementing Sparse Matrix-Vector multiplication using CUDA based on a hybrid sparse matrix format
Implementing Sparse Matrix-Vector Multiplication with QCSR on GPU
Implementing Stereo Vision of GPU-Accelerated Scientific Simulations using Commodity Hardware
Implementing Strassen's Algorithm with CUTLASS on NVIDIA Volta GPUs
Implementing the Approximate Message Passing (AMP) Algorithm on a GPU
Implementing the Himeno benchmark with CUDA on GPU clusters
Implementing the PGI Accelerator model
Implementing the Projected Spatial Rich Features on a GPU
Implementing Ultrasound Beamforming on the GPU using CUDA
Implications of the Turing completeness of reaction-diffusion models, informed by GPGPU simulations on an XBox 360: cardiac arrhythmias, re-entry and the Halting problem
Implicit Adaptive Volume Ray Casting
Implicit and dynamic trees for high performance rendering
Implicit Boundary Control of Vector Field Based Shape Deformations
Implicit Feature-Based Alignment System for Radiotherapy
Implicit Methods for Real-Time simulation of Interactive Waves
Implicit Parallel Time Integrators
Implicit Skinning: Real-Time Skin Deformation with Contact Modeling
Importance of Data Loading Pipeline in Training Deep Neural Networks
Importance of Explicit Vectorization for CPU and GPU Software Performance
Importance Point Projection for GPU-based Final Gathering
Importance sampling algorithms for first passage time probabilities in the infinite server queue
Importance Sampling of Realistic Light Sources
Importance-driven compositing window management
Importance-Driven Isosurface Decimation for Visualization of Large Simulation Data Based on OpenCL
Importance-Driven Particle Techniques for Flow Visualization
Impostors and pseudo-instancing for GPU crowd rendering
Impostors, Pseudo-instancing and Image Maps for GPU Crowd Rendering
Improved automated lattice perturbation theory in background field gauge
Improved Distance Weighted GPU-based 3D Ultrasound Reconstruction Methods
Improved FCM algorithm for Clustering on Web Usage Mining
Improved Finite Difference Schemes for a 3-D Viscothermal Wave Equation on a GPU
Improved GPU Co-processor Sorting Algorithm with Barrier Synchronization
Improved Implementation of Simulation for Membrane Computing on the Graphic Processing Unit
Improved Integral Histogram Algorithm for Big Sized Images in CUDA Environment
Improved Lossless Image Compression Model Using Coefficient Based Discrete Wavelet Transform
Improved OpenCL-based Implementation of Social Field Pedestrian Model
Improved Performance of CaFE and IRIS Model Fitting Using CUDA
Improved Poisson Matting for a Real Time Tele-presence System Using GPU
Improved Programming of GPU Architectures through Automated Data Allocation and Loop Restructuring
Improved Real-Time Stereo on Commodity Graphics Hardware
Improved Row-Grouped CSR Format for Storing of Sparse Matrices on GPU
Improved Sequential & Parallel Designs and Implementations of the Eight Direction Prewitt Edge Detection
Improvement of the fused CUDA kernels performance prediction
Improvement Study of EEMD Decomposition Efficiency Based on CUDA Architecture
Improvements to Physically Based Cloth Simulation
Improving 3D Lattice Boltzmann Method stencil with asynchronous transfers on many-core processors
Improving accuracy for matrix multiplications on GPUs
Improving Atmospheric Model Performance on a Multi-Core Cluster System
Improving Automatic Parallel Training via Balanced Memory Workload Optimization
Improving Cache Locality for GPU-based Volume Rendering
Improving Cache Locality for Ray Casting with CUDA
Improving Code Generation via Small Language Model-as-a-judge
Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters
Improving Communication Performance in GPU-Accelerated HPC Clusters
Improving CUDA DNA Analysis Software with Genetic Programming
Improving CUDASW++, a Parallelization of Smith-Waterman for CUDA Enabled Devices
Improving energy and power efficiency using NComputing and approaches for predicting reliability of complex computing systems
Improving Energy Efficiency of Basic Linear Algebra Routines on Heterogeneous Systems with Multiple GPUs
Improving Energy Efficiency of GPU based General-Purpose Scientific Computing through Automated Selection of Near Optimal Configurations
Improving GPGPU Concurrency with Elastic Kernels
Improving GPU particle filter with shader model 3.0 for visual tracking
Improving GPU Performance by Regrouping CPU-Memory Data
Improving GPU Performance Prediction with Data Transfer Modeling
Improving GPU Performance through Instruction Redistribution and Diversification
Improving GPU Performance via Large Warps and Two-Level Warp Scheduling
Improving GPU Performance: Reducing Memory Conflicts and Latency
Improving GPU programming models through hardware cache coherence
Improving GPU Robustness by Making Use of Faulty Parts
Improving GPU Simulations of Spiking Neural P Systems
Improving GPU Sparse Matrix-Vector Multiplication for Probabilistic Model Checking
Improving GPU-accelerated Adaptive IDW Interpolation Algorithm Using Fast kNN Search
Improving HPC Code Generation Capability of LLMs via Online Reinforcement Learning with Real-Machine Benchmark Rewards
Improving Hybrid OpenCL Performance by High Speed Networks
Improving Locality of Unstructured Mesh Algorithms on GPUs
Improving Loop Parallelization by a Combination of Static and Dynamic Analyses in HLS
Improving many flavor QCD simulations using multiple GPUs
Improving Numerical Accuracy for Non-Negative Matrix Multiplication on GPUs using Recursive Algorithms
Improving numerical reproducibility and stability in large-scale numerical simulations on GPUs
Improving OpenACC compatibility within accULL
Improving OpenCL Performance by Specializing Compiler Phase Selection and Ordering
Improving OpenCL Programmability with the Heterogeneous Programming Library
Improving Parallel Program Performance Through DSL-Driven Code Generation with LLM Optimizers
Improving Performance and Energy Consumption of Runtime Schedulers for Dense Linear Algebra
Improving Performance and Energy Efficiency of GPUs through Locality Analysis
Improving Performance and Energy Efficiency of Heterogeneous Systems with rCUDA
Improving performance for emergent environments parameter tuning and simulation in games using GPU
Improving Performance of Hardware Accelerators by Optimizing Data Movement: A Bioinformatics Case Study
Improving Performance of Iterative Applications through Interleaved Execution of Approximated CUDA Kernels
Improving Performance of Matrix Multiplication and FFT on GPU
Improving Performance of OpenCL on CPUs
Improving performance of SYCL applications on CPU architectures using LLVM-directed compilation flow
Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations
Improving Performance Portability in OpenCL Programs
Improving processing time for visual measurements of displacements of IPMC actuators using CUDA
Improving programmability of heterogeneous many-core systems via explicit platform descriptions
Improving Resource Efficiency in Virtualized Datacenters
Improving Resource Utilization in Heterogeneous CPU-GPU Systems
Improving Scheduling Techniques in Heterogeneous Systems with Dynamic, On-Line Optimisations
Improving SIMD efficiency for parallel Monte Carlo light transport on the GPU
Improving SIMT Efficiency of Global Rendering Algorithms with Architectural Support for Dynamic Micro-Kernels
Improving SMT performance: an application of genetic algorithms to configure resizable caches
Improving Student Learning in Computer Science Courses by Using Virtual OpenCL Laboratory
Improving Synchronization and Data Access in Parallel Programming Models
Improving tasks throughput on accelerators using OpenCL command concurrency
Improving the Efficiency of GPU Clusters
Improving the Efficiency of OpenCL Kernels through Pipes
Improving the GPU space of computation under triangular domain problems
Improving the Mapping of Smith-Waterman Sequence Database Searches onto CUDA-Enabled GPUs
Improving the Neural GPU Architecture for Algorithm Learning
Improving the Performance of a Ray Tracing Algorithm Using a GPU
Improving the Performance of CA-GMRES on Multicores with Multiple GPUs
Improving the Performance of Fully Connected Neural Networks by Out-of-Place Matrix Transpose
Improving the Performance of Hyperspectral Image and Signal Processing Algorithms Using Parallel, Distributed and Specialized Hardware-Based Systems
Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network
Improving the performance of PIR Protocol in Outsourced Databases
Improving the performance of spatial raster analysis in GIS using GPU
Improving the Performance of the Contextual Spaces Re-Ranking Algorithm on Heterogeneous Systems
Improving the Performance of the Linear Systems Solvers Using CUDA
Improving the Performance of the Sparse Matrix Vector Product with GPUs
Improving the Performance, Portability, and Productivity of Hardware Accelerators
Improving the Programmability of GPU Architectures
Improving the scalability of modern applications by parallel multi-core and many-core programming
Improving the speed of neural networks on CPUs
Improving the Speed of Virtual Rear Projection: A GPU-Centric Architecture
Improving the usability of hierarchical representations for interactively labeling large image data sets
In Search of Self-Organization
In Situ Power Analysis of General Purpose Graphical Processing Units
In vivo interactive visualization of four-dimensional blood flow patterns
In-Datacenter Performance Analysis of a Tensor Processing Unit
In-Memory Data Analytics on Coupled CPU-GPU Architectures
In-memory database acceleration on FPGAs: a survey
In-memory grid files on graphics processors
In-Place Recursive Approach for All-Pairs Shortest Paths Problem Using OpenCL
In-process optical characterization method for sub-100-nm nanostructures
In-Situ Statistical Analysis of Autotune Simulation Data using Graphical Processing Units
In-Situ Techniques on GPU-Accelerated Data-Intensive Applications
Incoherent Ray tracing on GPU
Incomplete-LU and Cholesky Preconditioned Iterative Methods Using CUSPARSE and CUBLAS
Increased reliability on Intel GPUs via software diverse redundancy
Increasing Deep Neural Network Acoustic Model Size for Large Vocabulary Continuous Speech Recognition
Increasing GPU Throughput using Kernel Interleaved Thread Block Scheduling
Increasing Memory Miss Tolerance for SIMD Cores
Increasing precision of uniform pseudorandom number generators
Increasing predictability of GPU's
Increasing programmability of an embedded domain specific language for GPGPU kernels using static analysis
Increasing Realism and Supporting Content Planning for Dynamic Scenes in a Mixed Reality System incorporating a Time-of-Flight Camera
Increasing the Accuracy of the Space-Sweeping Approach to Stereo Reconstruction, using Spherical Backprojection Surfaces
Increasing the performance of AllToAll variant of self-organizing migration algorithm using CUDA
Incremental Bounded Model Checking of Artificial Neural Networks in CUDA
Incremental Raycasting of Piecewise Quadratic Surfaces on the GPU
Indexing million of packets per second using GPUs
Indexing of Spatiotemporal Trajectories for Efficient Distance Threshold Similarity Searches on the GPU
Indigo: A Domain-Specific Language for Fast, Portable Image Reconstruction
Industrial Robot Collision Handling in Harsh Environments
Inertial Coupling Method for particles in an incompressible fluctuating fluid
Inertial-aided KLT feature tracking for a moving camera
Inexpensive Immersive Projection
iNFAnt: NFA pattern matching on GPGPU devices
Inferring the Scheduling Policies of an Embedded CUDA GPU
Infiniband-Verbs on GPU: A case study of controlling an Infiniband network device from the GPU
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU
Influence of InfiniBand FDR on the Performance of Remote GPU Virtualization
Information Visualization of Multi-dimensional Cellular Automata using GPU Programming
Initial condition for efficient mapping of level set algorithms on many-core architectures
Initial Experiences Porting a Bioinformatics Application to a Graphics Processor
Initial Explorations of ARM Processors for Scientific Computing
Inline Vector Compression for Computational Physics
Innovative prospective of Antenna-Gain removing the pain of EMI engineers
Input Sensitivity of GPU Program Optimizations
Input Space Splitting for OpenCL
Input-Aware Auto-Tuning for Directive-based GPU Programming
Input-Aware Auto-Tuning of Compute-Bound HPC Kernels
Inside VOLT: Designing an Open-Source GPU Compiler
Inside VOLT: Designing an Open-Source GPU Compiler (Tool)
INSPIRE: an interactive image assisted non-photorealistic rendering system
INSTA-YOLO: Real-Time Instance Segmentation
Instructions' Latencies Characterization for NVIDIA GPGPUs
Instruments of Productivity for High Performance Computing
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
Integer sorting on multicores: some (experiments and) observations
Integrated Arrival and Departure Schedule Optimization Under Uncertainty
Integrated Framework for Heterogeneous Embedded Platforms Using OpenCL
Integrated GPUs: how useful are they in HPC?
Integrated Modelling of Hydrodynamic Processes, Faecal Indicator Organisms and Related Parameters with Improved Accuracy using Parallel (GPU) Computing
Integrating a large-scale testing campaign in the CK framework
Integrating Accelerators in Heterogeneous Systems
Integrating GPGPU computations with CPU coroutines in C++
Integrating GPUs as fast co-processors into the existing parallel FE package FEAST
Integrating Multi-GPU Execution in an OpenACC Compiler
Integrating multi-threading and accelerators into DUNE-ISTL
Integrating Object Detection with 3D Tracking Towards a Better Driver Assistance System
Integrating Occlusion Culling with Parallel LOD for Rendering Complex 3D Environments on GPU
Integrating Post-Newtonian Equations on Graphics Processing Units
Integrating Profiling into MDE Compilers
Integrating SkePU's algorithmic skeletons with GPI on a cluster
Integrating Two-Way Interaction Between Fluids and Rigid Bodies in the Real-Time Particle Systems Library
Integration of CUDA Processing within the C++ library for parallelism and concurrency (HPX)
Integrative multicellular biological modeling: a case study of 3D epidermal development using GPU algorithms
Intel FPGA SDK for OpenCL
Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning
Intel oneAPI DPC++ FPGA Optimization Guide
Intel Xeon Phi acceleration of Hybrid Total FETI solver
Intel Xeon Phi Coprocessor High-Performance Programming
Intel's Array Building Blocks: A retargetable, dynamic compiler and embedded language
Intel(R) SHMEM: GPU-initiated OpenSHMEM using SYCL
Intelligent Edge Detection using a CUDA Simulator of Multilayer Neural Network Based on Multi-Valued Neurons
Intelligent GPGPU Classification in Volume Visualization: A framework based on Error-Correcting Output Codes
Intensity model with blur effect on GPUs applied to large-scale star simulators
Inter-APU Communication on AMD MI300A Systems via Infinity Fabric: a Deep Dive
Inter-Block GPU Communication via Fast Barrier Synchronization
Inter-block synchronization on a GPGPU
Inter-cluster communication on clustered SIMD architectures
Inter-Warp Instruction Temporal Locality in Deep-Multithreaded GPUs
Interacting with Volume Data: Deformations using Forward Projection
Interaction and Visualization Techniques for Immersive Exploration and Perception of 3D datasets
Interactive 3D distance field computation using linear factorization
Interactive Approximate Rendering of Reflections, Refractions, and Caustics
Interactive Bi-scale Editing of Highly Glossy Materials
Interactive BRDF Estimation for Mixed-Reality Applications
Interactive collision detection for complex and deformable models using programmable graphics hardware
Interactive Collision Detection for Deformable Models Using Streaming AABBs
Interactive Computer Graphics: A Top-Down Approach Using OpenGL (5th Edition)
Interactive Deformation and Visualization of Level Set Surfaces Using Graphics Hardware
Interactive Design Exploration for Constrained Meshes
Interactive Dynamic Water Surface Fast Rendering Algorithm
Interactive exploration of unsteady 3D flow with linked 2D/3D texture advection
Interactive fluid-particle simulation using translating Eulerian grids
Interactive free form deformer for point-based objects by GPU acceleration
Interactive global illumination in dynamic environments using commodity graphics hardware
Interactive GPU active contours for segmenting inhomogeneous objects
Interactive GPU Ray Casting using Progressive Blue Noise Sampling
Interactive GPU-based adaptive cartoon-style rendering
Interactive GPU-based Collision Detection
Interactive Histology of Large-Scale Biomedical Image Stacks
Interactive Illustrative Line Styles and Line Style Transfer Functions for Flow Visualization
Interactive Indirect Illumination Using Adaptive Multiresolution Splatting
Interactive Isogeometric Volume Visualization with Pixel-Accurate Geometry
Interactive Isosurfaces with Quadratic C1 Splines on Truncated Octahedral Partitions
Interactive landscape visualization using GPU ray casting
Interactive Level-of-Detail Selection Using Image-Based Quality Metric for Large Volume Visualization
Interactive machinability analysis of free-form surfaces using multiple-view image space techniques on the GPU
Interactive Manycore Photon Mapping
Interactive multi-pass programmable shading
Interactive multiple anisotropic scattering in clouds
Interactive Out-of-core Visualisation of Very Large Landscapes on Commodity Graphics Platform
Interactive Parallelization of C Programs in SAPFOR
Interactive physically-based X-ray simulation: CPU or GPU?
Interactive Pixel-Accurate Free Viewpoint Rendering from Images with Silhouette Aware Sampling
Interactive Point-based Isosurface Exploration and High-quality Rendering
Interactive Point-Based Rendering of Higher-Order Tetrahedral Data
Interactive Program Debugging and Optimization for Directive-Based, Efficient GPU Computing
Interactive Quantum Chemistry: A Divide-and-Conquer ASED-MO Method
Interactive Ray Tracing with Data Locality Optimizations
Interactive Ray-tracing Based on OptiX to Visualize Signed Distance Fields
Interactive Reaction-Diffusion on Surface Tiles
Interactive Refactoring for GPU Parallelization of Affine Loops
Interactive rendering of acquired materials on dynamic geometry using bandwidth prediction
Interactive Rendering of Dynamic Geometry
Interactive rendering of large unstructured grids using dynamic level-of-detail
Interactive Separating Streak Surfaces
Interactive Simulation and Visualization of Fluids with Surface Raycasting
Interactive Simulations with Navier-Stokes Equations on many-core Architectures
Interactive Soft Tissue for Surgical Simulation
Interactive soft-fabrics watering simulation on GPU
Interactive SPH Simulation and Rendering on the GPU
Interactive Streak Surface Visualization on the GPU
Interactive transparency rendering for large CAD models
Interactive Two-sided Refraction for Dynamic Object on GPU
Interactive visibility culling in complex environments using occlusion-switches
Interactive visual analysis of contrast-enhanced ultrasound data based on local neighborhood statistics
Interactive visualisation of spins and clusters in regular and small-world Ising models with CUDA on GPUs
Interactive Visualization of Molecular Surface Dynamics
Interactive visualization of streaming data with Kernel Density Estimation
Interactive Visualization of the Largest Radioastronomy Cubes
Interactive Visualization of Volumetric White Matter Connectivity in DT-MRI Using a Parallel-Hardware Hamilton-Jacobi Solver
Interactive volume illustration
Interactive Volume Rendering Aurora on the GPU
Interactive Volume Rendering of Functional Representations in Quantum Chemistry
Interactive volumetric lighting simulating scattering and shadowing
Interactive water streams with sphere scan conversion
Interactive Wave Simulations
Interactive, GPU-Based Level Sets for 3D Segmentation
Interactively Rendering Dynamic Caustics on GPU
Interactively Simulating Fluid based on SPH and CUDA
Interconnect Bandwidth Heterogeneity on AMD MI250x and Infinity Fabric
Interactive Point Clouds Fairing on Many-Core System
Interference-driven resource management for GPU-based heterogeneous clusters
Interlanguages and synchronic models of computation
Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR
Interleaving and Lock-Step Semantics for Analysis and Verification of GPU Kernels
Intermediate fabrics: virtual architectures for circuit portability and fast placement and routing
Intermediate Language Extensions for Parallelism
Interoperable GPU Kernels as Latency Improver for MEC
InteropUnityCUDA: A Tool for Interoperability Between Unity and CUDA
Interpolated pressure laws in two-fluid simulations and hyperbolicity
Interpolation with Radial Basis Functions on GPGPUs using CUDA
Interpretive OpenGL for computer graphics
Intersecting two families of sets on the GPU
Interventional 4-D Motion Estimation and Reconstruction of Cardiac Vasculature without Motion Periodicity Assumption
Intra-Application Data-Communication Characterization
Intra-node Memory Safe GPU Co-Scheduling
Introducing 'Bones': A Parallelizing Source-to-Source Compiler Based on Algorithmic Skeletons
Introducing CURRENNT - the Munich open-source CUDA RecurREnt Neural Network Toolkit
Introducing CURRENNT: The Munich Open-Source CUDA RecurREnt Neural Network Toolkit
Introducing Energy Efficiency into Graphics Processors
Introducing Parallelism to the Ranges TS
Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM
Introduction to GPGPU programming
Introduction to GPGPU, a hardware and software background
Introduction to GPU Computing and CUDA Programming: A Case Study on FDTD [EM Programmer's Notebook]
Introduction to GPU programming for EDA
Introduction to GPU Programming with GLSL
Introduction to GPU Radix Sort
Introduction to the Report "Interlanguages and Synchronic Models of Computation."
Introduction to the Special Issue on Digital Signal Processing in Radio Astronomy
Intrusion Detection Architecture Utilizing Graphics Processors
Intrusion Detection using Spiking Neural Networks
Inverse scattering and refraction corrected reflection for breast cancer imaging
Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers
Investigating Host-Device communication in a GPU-based H.264 encoder
Investigating Input Representations and Representation Models of Source Code for Machine Learning
Investigating performance portability of a highly scalable particle-in-cell simulation code on various multi-core architectures
Investigating performance variations of an optimized GPU-ported granulometry algorithm
Investigating Single Precision Floating General Matrix Multiply in Heterogeneous
Investigating SRAM PUFs in large CPUs and GPUs
Investigating the Impact of Data Parallelism and GPU Technology on Computer Gaming
Investigating the Performance of Motion Estimation Block-Matching Algorithms on GPU Cards
Investigating the use of GPU-accelerated nodes for SAR image formation
Investigating the use of GPUs with a Monte Carlo Astrophysical Simulation
Investigating Warp Size Impact in GPUs
Investigation of General-Purpose Computing on Graphics Processing Units and its Application to the Finite Element Analysis of Electromagnetic Problems
Investigation of GPU-based Pattern Matching
Investigation of heterogeneous computing through novel parallel programming platforms
Investigation of Parallel Computation - MPI, CUDA and Parallel Visualization
Investigation of the OpenCL SYCL Programming Model
Investigation of the SYCL for OpenCL Programming Model
Investigation on the Use of GPGPU for Fast Sparse Matrix Factorization
Invitation to a Standard Programming Interface for Massively Parallel Computing Environment: OpenCL
Invited paper: Accelerating neuromorphic vision on FPGAs
IODA: an Input/Output Deep Architecture for image labeling
IP routing processing with graphic processors
IPMACC: Open Source OpenACC to CUDA/OpenCL Translator
IPMACC: Translating OpenACC API to OpenCL
Iris Matching Algorithm on Many-Core Platforms
Iris recognition on GPU with the usage of Non-Negative Matrix Factorization
Iris: First-Class Multi-GPU Programming Experience in Triton
IRIS: Illustrative Rendering for Integral Surfaces
Irradiation Instability at the Inner Edges of Accretion Disks
Irregular algorithms on the Xeon Phi
Irregularity Mitigation and Portability Abstractions for Accelerated Sparse Matrix Factorization
Is GPGPU CCL worth it? A performance comparison between some GPU and CPU algorithms for solving connected components labeling on binary images
Is OpenCL a suitable platform for algorithm development in health care systems?
Is the game worth the candle? Evaluation of OpenCL for object detection algorithm optimization
Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs
ISM2: Optimizing Irregular-Shaped Matrix-Matrix Multiplication on GPUs
Isocube: Exploiting the Cubemap Hardware
Isolated Scheduling for Distributed Training Tasks in GPU Clusters
Isosurface Extraction and View-Dependent Filtering from Time-Varying Fields Using Persistent Time-Octree (PTOT)
Issues and challenges in compiling for graphics processors
Issues in Heterogeneous GPU Clusters
It's all about data movement: Optimising FPGA data access to boost performance
Iterative and Predictive Ray-Traced Collision Detection for Multi-GPU Architectures
Iterative CT Reconstruction on the GPU
Iterative GPGPU Linear Solvers for Sparse Matrices
Iterative Hard Thresholding for Model Selection in Genome-Wide Association Studies
Iterative induced dipoles computation for molecular mechanics on GPUs
Iterative Krylov solution methods for geophysical electromagnetic simulations on throughput-oriented processing units
Iterative layer-based raytracing on CUDA
Iterative Methods for Visualization of Implicit Surfaces On GPU
Iterative optimization methods for efficient image restoration on multicore architectures
Iterative SLE Solvers over a CPU-GPU Platform
Iterative Solution of Linear Systems in Electromagnetics (and not only): Experiences with CUDA
Iterative Statistical Kernels on Contemporary GPUs
iTree: Exploring Time-Varying Data using Indexable Tree
Jacobian-free Newton-Krylov methods with GPU acceleration for computing nonlinear ship wave patterns
Jailbreaking LLM-Controlled Robots
Java on CUDA architecture
Java with Auto-Parallelization on Graphics Coprocessing Architecture
JAX, M.D.: End-to-End Differentiable, Hardware Accelerated, Molecular Dynamics in Pure Python
JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA
JCudaMP: OpenMP/Java on CUDA
JIT-Compilation for Interactive Scientific Visualization
Jit4OpenCL: a compiler from Python to OpenCL
Jitter analysis of PLL-generated clock propagation using Jitter Mitigation techniques with laser voltage probing
Job Parallelism using Graphical Processing Unit individual Multi-Processors and Highly Localised Memory
Job Parallelism using Graphical Processing Unit Individual Multi-Processors and Localised Memory
Join Algorithms on GPUs: A Revisit After Seven Years
Join Execution Using Fragmented Columnar Indices on GPU and MIC
Joint Forces: From Multithreaded Programming to GPU Computing
Joint Training on AMD and NVIDIA GPUs
Joint-MAP Tomographic Reconstruction with Patch Similarity Based Mixture Prior Model
JPEG 2000 Wireless Image Transmission System using Encryption Domain Authentication
JPEG-GPU: a GPGPU Implementation of JPEG Core Coding Systems
JSDoop and TensorFlow.js: Volunteer Distributed Web Browser-Based Neural Network Training
Julia as a unifying end-to-end workflow language on the Frontier exascale system
Jump flooding in GPU with applications to Voronoi diagram and distance transform
Just-in-time Acceleration of JavaScript
Just-in-Time Catching Test Generation at Meta
Just-in-Time Compilation and Link-Time Optimization for OpenMP Target Offloading
K-Means on Commodity GPUs with CUDA
K-Means on GPU: A Review
K-nearest neighbor search: Fast GPU-based implementations and application to high-dimensional feature matching
k+-buffer: Fragment Synchronized k-buffer
K3 Moore's Law in the Era of GPU Computing
KAdvice: inferring synchronization patterns from an existing codebase
KAISA: An Adaptive Second-order Optimizer Framework for Deep Neural Networks
Kalman Filter Tracking on Parallel Architectures
Kalman-Filter-Based Particle Tracking on Parallel Architectures at Hadron Colliders
kANN on the GPU with Shifted Sorting
Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras
Kargus: a Highly-scalable Software-based Intrusion Detection System
KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators
Kd-Jump: a Path-Preserving Stackless Traversal for Faster Isosurface Raytracing on GPUs
KD-tree acceleration structures for a GPU raytracer
Kd-tree Based N-Body Simulations with Volume-Mass Heuristic on the GPU
kEDM: A Performance-portable Implementation of Empirical Dynamic Modeling using Kokkos
Keeneland: Bringing heterogeneous GPU computing to the computational science community
KEET: Explaining Performance of GPU Kernels Using LLM Agents
Keras Sig: Efficient Path Signature Computation on GPU in Keras 3
Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs
Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU
Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications
Kernel Specialization for Improved Adaptability and Performance on Graphics Processing Units (GPUs)
Kernel Tuner: A search-optimizing GPU code auto-tuner
Kernel Tuning Toolkit
Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation
Kernel-as-a-Service: A Serverless Interface to GPUs
Kernel-Centric Optimizations for Deep Neural Networks on GPGPU
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
KernelBand: Boosting LLM-based Kernel Optimization with a Hierarchical and Hardware-aware Multi-armed Bandit
KernelBench: Can LLMs Write Efficient GPU Kernels?
KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning
Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling
KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta
KernelFoundry: Hardware-aware evolutionary GPU kernel optimization
KERNELGEN - A Toolchain for Automatic GPU-centric Applications Porting
KernelGen - the design and implementation of a next generation compiler platform for accelerating numerical models on GPUs
KernelInterceptor: automating GPU kernel verification by intercepting kernels and their parameters
Kernelized Renyi distance for speaker recognition
KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization
KeSCo: Compiler-based Kernel Scheduling for Multi-task GPU Applications
Kevin: Multi-Turn RL for Generating CUDA Kernels
Key derivation functions and their GPU implementation
Key Reconciliation with Low-Density Parity-Check Codes for Long-Distance Quantum Cryptography
Keynote address: Immersive exploration of large datasets
KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators
KFusion: Obtaining Modularity and Performance with Regards to General Purpose GPU Computing and Co-processors
Kinematic Modelling of Disc Galaxies using Graphics Processing Units
Kinetics of liquid-solid phase transition in large nickel clusters
KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling
Kite: Braided Parallelism for Heterogeneous Systems
KLARAPTOR: A Tool for Dynamically Finding Optimal Kernel Launch Parameters Targeting CUDA Programs
kNN Query Processing in Metric Spaces Using GPUs
Kokkidio: Fast, expressive, portable code, based on Kokkos and Eigen
Kokkos: Enabling performance portability across manycore architectures
Krylov Subspace Accelerated Algebraic Multigrid for Mimetic Finite Differences on GPUs
KUDA: GPU Accelerated Split Race Checker
LAMDA: Learning-Assisted Multi-Stage Autotuning for FPGA Design Closure
LAMMPS' PPPM Long-Range Solver for the Second Generation Xeon Phi
LAMMPScuda - a new GPU accelerated Molecular Dynamics Simulations Package and its Application to Ion-Conducting Glasses
Landau Gauge Fixing on GPUs
Landau Gauge Fixing on GPUs and String Tension
Langevin dynamics simulations of biomolecules on graphics processors
Language Modeling with Gated Convolutional Networks
Language virtualization for heterogeneous parallel computing
Large calculation of the flow over a hypersonic vehicle using a GPU
Large data real-time classification with Non-negative Matrix Factorization and Self-Organizing Maps on GPU
Large data visualization on distributed memory multi-GPU clusters
Large Graphs on multi-GPUs
Large Integer Arithmetic in GPU for Cryptography
Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework
Large neighborhood local search optimization on graphics processing units
Large scale 3D shape retrieval by exploiting multi-core and GPU
Large Scale Artificial Neural Network Training Using Multi-GPUs
Large Scale Bioinformatics Data Mining with Parallel Genetic Programming on Graphics Processing Units
Large Scale DNA Sequence Alignment and Kernel Method Implemented with GPUs
Large Scale Finite Element Analysis Using GPU Parallel Computing
Large Scale GPU Accelerated PPMLR-MHD Simulations for Space Weather Forecast
Large Scale GPU Based Simulations of Turbulent Bubbly Flow in a Square Duct
Large Scale Language Modeling: Converging on 40GB of Text in Four Hours
Large Scale Monte Carlo Tree Search on GPU
Large scale parallel state space search utilizing graphics processing units and solid state disks
Large Scale Physical Modeling Sound Synthesis
Large Scale Plane Wave Pseudopotential Density Functional Theory Calculations on GPU Clusters
Large Scale Simulations of the Euler Equations on GPU Clusters
Large Speed Increase Using Novel GPU Based Algorithms to Simulate Cardiac Excitation Waves in a Rabbit Ventricle
Large steps in GPU-based deformable bodies simulation
Large-eddy simulations with ClimateMachine: a new open-source code for atmospheric simulations on GPUs and CPUs
Large-Scale Compute-Intensive Analysis via a Combined In-Situ and Co-Scheduling Workflow Approach
Large-Scale Data Computing Performance Comparisons on SYCL Heterogeneous Parallel Processing Layer Implementations
Large-Scale Deep Learning on the YFCC100M Dataset
Large-scale deep unsupervised learning using graphics processors
Large-Scale DNS of Gas-Solid Flow on Mole-8.5
Large-scale ferrofluid simulations on graphics processing units
Large-scale FFT on GPU clusters
Large-Scale Geospatial Processing on Multi-Core and Many-Core Processors: Evaluations on CPUs, GPUs and MICs
Large-Scale High-Lundquist Number Reduced MHD Simulations of the Solar Corona Using GPU Accelerated Machines
Large-scale image analysis using docker sandboxing
Large-scale mixer simulations using massively parallel GPU architectures
Large-scale Monte Carlo simulation of two-dimensional classical XY model using multiple GPUs
Large-Scale Motion Modelling using a Graphical Processing Unit
Large-scale multi-dimensional document clustering on GPU clusters
Large-scale Nanostructure Simulations from X-ray Scattering Data On Graphics Processor Clusters
Large-scale network simulation over heterogeneous computing architecture
Large-Scale Paralleled Sparse Principal Component Analysis
Large-Scale Physics-Based Terrain Editing Using Adaptive Tiles on the GPU
Large-Scale Sound Field Rendering in Rectangular Room with Specular Reflection
Large-Scale Stereo Display Wall Using Programmable Graphics Hardware
Large-Scale Stochastic Learning using GPUs
Large-scale transient stability simulation on graphics processing units
Large-scale Virtual Acoustics Simulation at Audio Rates Using Three Dimensional Finite Difference Time Domain and Multiple GPUs
Large, Pruned or Continuous Space Language Models on a GPU for Statistical Machine Translation
Larrabee: a many-core x86 architecture for visual computing
Latency considerations of depth-first GPU ray tracing
Lattice Based Volumetric Global Illumination
Lattice Boltzmann based PDE solver on the GPU
Lattice Boltzmann Method for Simulating Turbulent Flows
Lattice Boltzmann Simulation of Binary Mixture Diffusion Using Modern Graphics Processors
Lattice Boltzmann Simulations of Multiphase Flows
Lattice Boltzmann simulations of the permeability and capillary adsorption of cement model microstructures
Lattice Boltzmann Simulations on a GPU: An optimization approach using C++ AMP
Lattice Group Models: GPU Acceleration and Numerics
Lattice QCD as a video game
Lattice QCD based on OpenCL
Lattice QCD on Intel Xeon Phi
Lattice QCD on new chips: a community summary
Lattice QCD simulations using the OpenACC platform
Lattice QCD with Domain Decomposition on Intel Xeon Phi Co-Processors
Lattice Quantum Chromodynamics on Intel Xeon Phi based supercomputers
Lattice Simulations using OpenACC compilers
Lattice SU(2) on GPU's
Lattice-based flow field modeling
Lattice-Boltzmann Simulation of the Shallow-Water Equations with Fluid-Structure Interaction on Multi- and Manycore Processors
Lattice-Boltzmann simulation of the shallow-water equations with fluid-structure interaction on multi-and manycore processors
Lattice-boltzmann water waves
LatticeQCD using OpenCL
Launch-time Optimization of OpenCL Kernels
Layered Interpretation of Street View Images
Lazy Solid Texture Synthesis
LazyTensor: combining eager execution with domain-specific compilers
LBCL: multi-device automatic load balancing
LBM based flow simulation using GPU computing processor
LDetector: A Low Overhead Race Detector For GPU Programs
Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models
Learnergy: Energy-based Machine Learners
Learning a Metric Embedding for Face Recognition using the Multibatch Method
Learning Better Encoding for Approximate Nearest Neighbor Search with Dictionary Annealing
Learning Blood Management in Orthopedic Surgery through Gameplay
Learning hash codes for efficient content reuse detection
Learning Massive Graph Embeddings on a Single Machine
Learning Random Forests on the GPU
Learning Representation for Scene Understanding: Epitomes, CRFs, and CNNs
Learning Sparse Recurrent Neural Networks in Language Modeling
Learning Structured Sparsity in Deep Neural Networks
Learning to Detect Roads in High-Resolution Aerial Images
Learning to Optimize Tensor Programs
Learning Two-View Stereo Matching
Least Squares on GPUs in Multiple Double Precision
Lectures on Parallel Computing
LeetDecoding: A PyTorch Library for Exponentially Decaying Causal Linear Attention with CUDA Implementations
LeFlow: Enabling Flexible FPGA High-Level Synthesis of Tensorflow Deep Neural Networks
LeftoverLocals: Listening to LLM Responses Through Leaked GPU Local Memory
Legion: Programming Distributed Heterogeneous Architectures with Logical Regions
Legolizer: A Real-Time System for Modeling and Rendering LEGO Representations of Boundary Models
Lensed: a code for the forward reconstruction of lenses and sources from strong lensing observations
Leo: A Profile-Driven Dynamic Optimization Framework for GPU Applications
Lessons learned from contrasting BLAS kernel implementations
Lessons learned in a decade of research software engineering GPU applications
Lessons Learned Migrating CUDA to SYCL: A HEP Case Study with ROOT RDataFrame
Let's sort this out: GPGPU Verification of Radix Sort
Lettuce: PyTorch-based Lattice Boltzmann Framework
Level Sets and Voronoi based Feature Extraction from any Imagery
Level-of-Detail Triangle Strips for Deforming Meshes
Leveraging AI Ecosystem for Portable and Sustainable GPU Kernels in HPC
Leveraging Binary Translation for Heterogeneous Profiling
Leveraging Computation Sharing and Parallel Processing in Location-Based Services
Leveraging Data-Flow Information for Efficient Scheduling of Task-Parallel Programs on Heterogeneous Systems
Leveraging LLVM OpenMP GPU Offload Optimizations for Kokkos Applications
Leveraging Memory Copy Overlap for Efficient Sparse Matrix-Vector Multiplication on GPUs
Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study
Leveraging Parallelism with CUDA and OpenCL
Leveraging the potential of task-based programming with OpenMP task graphs
Levy Flights for Particle Swarm Optimisation Algorithms on Graphical Processing Units
LeXInt: GPU-accelerated Exponential Integrators package
LHCb GPU acceleration project
libcloudph++ 0.1: single-moment bulk, double-moment bulk, and particle-based warm-rain microphysics library in C++
libCudaOptimize: an Open Source Library of GPU-based Metaheuristics
libhclooc: Software Library Facilitating Out-of-core Implementations of Accelerator Kernels on Hybrid Computing Platforms
libmolgrid: GPU Accelerated Molecular Gridding for Deep Learning Applications
Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication
libWater: Heterogeneous Distributed Computing Made Easy
LIFT: LLM-Based Pragma Insertion for HLS via GNN Supervised Fine-Tuning
Light Loss-Less Data Compression, with GPU Implementation
Light propagation for mixed polygonal and volumetric data
Light Propagation Maps on Parallel Graphics Architectures
Lighting Details Preserving Photon Density Estimation
LightNet: A Versatile, Standalone Matlab-based Environment for Deep Learning
Lightning: Scaling the GPU Programming Model Beyond a Single GPU
LightPlay: Efficient Replay with GPUs
LightRNN: Memory and Computation-Efficient Recurrent Neural Networks
LightScan: Faster Scan Primitive on CUDA Compatible Manycore Processors
Lightweight bleeding and smoke effect for surgical simulators
Lightweight Modular Staging and Embedded Compilers: Abstraction Without Regret for High-Level High-Performance Programming
Lightweight modular staging: a pragmatic approach to runtime code generation and compiled DSLs
Lina: a fast design optimisation tool for software-based FPGA programming
linalg: Matrix Computations in Apache Spark
Line-art Illustration of Dynamic and Specular Surfaces
Linear Algebra Algorithms for Hybrid Architectures with XKaapi
Linear algebra operators for GPU implementation of numerical algorithms
Linear Feature Detection on GPUs
Linear genetic programming GPGPU on Microsoft's Xbox 360
Linear optimization on modern GPUs
Linear Performance-Breakdown Model: A Framework for GPU kernel programs performance analysis
Linear Solvers for Stable Fluids: GPU vs CPU
Linearised inversion with GPUs
Linpack evaluation on a supercomputer with heterogeneous accelerators
linus: Conveniently explore, share, and present large-scale biological trajectory data from a web browser
liquidSVM: A Fast and Versatile SVM package
List Mode PET reconstruction
Liszt: A Domain Specific Language for Building Portable Mesh-based PDE Solvers
LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters
Literature Review and Implementation Overview: High Performance Computing with Graphics Processing Units for Classroom and Research Use
Literature review: Build and Travel KD-Tree with CUDA
Literature Review: Parallel Computing on linear equations of linear elastic FEM simulation with CUDA
LithOS: An Operating System for Efficient Machine Learning on GPUs
Live Migration for OpenCL FPGA Accelerators
Live Migration of FPGA Applications
Live, Video-Rate Super-Resolution Microscopy Using Structured Illumination and Rapid GPU-Based Parallel Processing
Living Flows: Enhanced Exploration of Edge-Bundled Graphs Based on GPU-Intensive Edge Rendering
LLload: An Easy-to-Use HPC Utilization Tool
LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLMPerf: GPU Performance Modeling meets Large Language Models
LLMQ: Efficient Lower-Precision LLM Training for Consumer GPUs
LLOR: Automated Repair of OpenMP Programs
LLVM to PTX Backend
LLVM-based automation of memory decoupling for OpenCL applications on FPGAs
LN-Annote: An Alternative Approach to Information Extraction from Emails using Locally-Customized Named-Entity Recognition
LNA: Fast Protein Classification Using A Laplacian Characterization of Tertiary Structure
LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs
Load Balanced Parallel GPU Out-of-Core for Continuous LOD Model Visualization
Load Balancing for Constraint Solving with GPUs
Load Balancing in a Changing World: Dealing with Heterogeneity and Performance Variability
Load Balancing in Data Warehouse - Evolution and Perspectives
Load Balancing Utilizing Data Redundancy in Distributed Volume Rendering
Load Balancing versus Occupancy Maximization on Graphics Processing Units: The Generalized Hough Transform as a Case Study
Load-Balanced Multi-GPU Ambient Occlusion for Direct Volume Rendering
Local Alignment Tool Based on Hadoop Framework and GPU Architecture
Local Histogram Modification Based Contrast Enhancement with GPU Acceleration
Local Laplacian Filters: Edge-aware Image Processing with a Laplacian Pyramid
Local Search Algorithms on Graphics Processing Units. A Case Study: The Permutation Perceptron Problem
Local Volatility FX Basket Option on CPU and GPU
Local vs. Global Optimization: Operator Placement Strategies in Heterogeneous Environments
Locality Analysis for Characterizing Applications Based on Sparse Matrices
Locality and parallelism optimization for dynamic programming algorithm in bioinformatics
Locality Aware Work-Stealing Based Scheduling in Hybrid CPU-GPU
Locality optimization on a NUMA architecture for hybrid LU factorization
Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives
Locality-Aware Mapping of Nested Parallel Patterns on GPUs
Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model
Locality-Aware Work Stealing on Multi-CPU and Multi-GPU Architectures
LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs
Locally-Oriented Programming: A Simple Programming Model for Stencil-Based Computations on Multi-Level Distributed Memory Architectures
Location-based Matching in Publish/Subscribe Revisited
LOD Terrain Rendering by Local Parallel Processing on GPU
Log File Regular Expression Pattern Matching And Capture With GPUs
LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment
LoGV: Low-overhead GPGPU Virtualization
Long Code for Code Search
Long time-scale simulations of in vivo diffusion using GPU hardware
Long Timestep Molecular Dynamics on the Graphical Processing Unit
Long-time Simulations with Complex Code Using Multiple Nodes of Intel Xeon Phi Knights Landing
Loo.py: From Fortran to performance via transformation and substitution rules
Loo.py: transformation-based code generation for GPUs and CPUs
Looking at the surprise: Bottom-up attentional control of an active camera system
LookNN: Neural Network with No Multiplication
Loop Perforation in OpenACC
Loop Transformation Recipes for Code Generation and Auto-Tuning
LoopBench: An Evaluation of Loop Acceleration in Heterogeneous Systems
LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers
Loose capacity-constrained representatives for the qualitative visual analysis in molecular dynamics
Lossless Acceleration for Seq2seq Generation with Aggressive Decoding
Lossless Compression of Variable-Precision Floating-Point Buffers on GPUs
Lossless data compression on GPGPU architectures
Lossless LZW Data Compression Algorithm on CUDA
Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level
Lost in Translation: Challenges in Automating CUDA-to-OpenCL Translation
Low Complexity Corner Detector Using CUDA for Multimedia Applications
Low cost approach to real-time vehicle to vehicle communication using parallel CPU and GPU processing
Low cost, high performance GPU computing solution for atomic resolution cryoEM single-particle reconstruction
Low Latency Complex Event Processing on Parallel Hardware
Low latency photon mapping using block hashing
Low viscosity flow simulations for animation
Low-complexity Distributed Tomographic Backprojection for large datasets
Low-cost edge computing using upcycled smartphones
Low-cost, high-speed computer vision using NVIDIA's CUDA architecture
Low-Frequency MLFMA on Graphics Processors
Low-Impact Profiling of Streaming, Heterogeneous Applications
Low-Latency Elliptic Curve Scalar Multiplication
Low-latency Image Recognition with GPU-accelerated Convolutional Networks for Web-based Services
Low-overhead diskless checkpoint for hybrid computing systems
Low-Overhead Trace Collection and Profiling on GPU Compute Kernels
Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II
Low-power Task Scheduling for GPU Energy Reduction
Lowering IrGL to CUDA
LS-CAT: A Large-Scale CUDA AutoTuning Dataset
LTE Physical Layer Implementation Using GPU Based High Performance Computing
LTTng CLUST: A system-wide unified CPU and GPU tracing tool for OpenCL applications
LU Factorization for Accelerator-based Systems
LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System
LU Factorization with Partial Pivoting for a Multicore System with Accelerators
LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware
LU, QR, and Cholesky factorizations: Programming Model, Performance Analysis and Optimization Techniques for the Intel Knights Landing Xeon Phi
LUDA: Boost LSM Key Value Store Compactions with GPUs
Luthier: Bridging Auto-Tuning and Vendor Libraries for Efficient Deep Learning Inference
Lynx: A Dynamic Instrumentation System for Data-Parallel Applications on GPGPU Architectures
Lyra2: Password Hashing Scheme with improved security against time-memory trade-offs
MACC: An OpenACC Transpiler for Automatic Multi-GPU Use
Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey
Machine Learning at the Limit
Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability
Machine Learning Based Intrusion Detection in Controller Area Networks
Machine learning enhanced code optimization for high-level synthesis (ML-ECOHS)
Machine Learning for CUDA+MPI Design Rules
Machine Learning for Predictive Auto-Tuning with Boosted Regression Trees
Machine learning for ultrafast X-ray diffraction patterns on large-scale GPU clusters
Machine Learning from Streaming Data in Heterogeneous Computing Environments
Machine Learning in Compilers: Past, Present and Future
Machine Learning-Driven Adaptive OpenMP For Portable Performance on Heterogeneous Systems
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machines and Algorithms
MacroSS: macro-SIMDization of streaming applications
Maestro: Data Orchestration and Tuning for OpenCL Devices
MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs
MAGMA Embedded: Towards a Dense Linear Algebra Library for Energy Efficient Extreme Computing
MagmaDNN: Towards High-Performance Data Analytics and Machine Learning for Data-Driven Scientific Computing
Magneto-hydrodynamics simulation in astrophysics
Magnetohydrodynamics on Heterogeneous architectures: a performance comparison
Magnetohydrodynamics simulations on graphics processing units
Maintaining constant frame rates in 3D texture-based volume rendering
Makespan computation for GPU threads running on a single streaming multiprocessor
Making Human Connectome Faster: GPU Acceleration of Brain Network Analysis
Making LLMs Optimize Multi-Scenario CUDA Kernels Like Experts
Making the case of GPUs in courses on computational physics
MALBEC: a new CUDA-C ray-tracer in General Relativity
MambaCPU: Enhanced Correlation Mining with State Space Models for CPU Performance Prediction
Managing coherent groups
Managing Extreme Heterogeneity in Next Generation HPC Systems
Managing heterogeneous device memory using C++17 memory resources
Managing Multi Instance GPUs for High Throughput and Energy Savings
Managing the Topology of Heterogeneous Cluster Nodes with Hardware Locality (hwloc)
Managing, Profiling, and Optimizing Heterogeneous GPU Workloads
Manas: Mining Software Repositories to Assist AutoML
ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills
Many Cores, Many Models: GPU Programming Model vs. Vendor Compatibility Overview
Many-body quantum chemistry on graphics processing units
Many-Core Algorithms for Combinatorial Optimization
Many-core algorithms for statistical phylogenetics
Many-core applications to online track reconstruction in HEP experiments
Many-Core Architectures: Hardware-Software Optimization and Modeling Techniques
Many-Core Compiler Fuzzing
Many-core GPU computing with NVIDIA CUDA
Many-core parallel computing - Can compilers and tools do the heavy lifting?
Many-Core vs. Many-Thread Machines: Stay Away From the Valley
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
Many-threaded Differential Evolution on the GPU
Many-threaded implementation of differential evolution for the CUDA platform
Manycore high-performance computing in bioinformatics
Manycore processing of repeated k-NN queries over massive moving objects observations
Manycore processing of repeated range queries over massive moving objects observations
MAP-based Brain Tissue Segmentation using Manifold Learning and Hierarchical Max-Flow regularization
Map-reduce as a Programming Model for Custom Computing Machines
MapCG: writing parallel program portable between CPU and GPU
MapGraph: A High Level API for Fast Development of High Performance Graph Analytics on GPUs
Mapping a Data-Flow Programming Model onto Heterogeneous Platforms
Mapping a Dataflow Programming Model onto Heterogeneous Architectures
Mapping a Guided Image Filter on the HARP Reconfigurable Architecture Using OpenCL
Mapping computational concepts to GPUs
Mapping dynamic programming algorithms on graphics processing units
Mapping High-Fidelity Volume Rendering for Medical Imaging to CPU, GPU and Many-Core Architectures
Mapping Iterative Medical Imaging Algorithm on Cell Accelerator
Mapping of a film grain removal algorithm to a heterogeneous reconfigurable architecture
Mapping parallel programs to heterogeneous multi-core systems
Mapping Streaming Applications to OpenCL
Mapping the Arnold web with a GPU-supercomputer
Mapping the Arnold web with a graphic processing unit
Mapping the SBR and TW-ILDCs to Heterogeneous CPU-GPU Architecture for Fast Computation of Electromagnetic Scattering
MapReduce for Counting Word Frequencies with MPI and GPUs
MapSQ: A MapReduce-based Framework for SPARQL Queries on GPU
MARC: A Many-Core Approach to Reconfigurable Computing
March of the Froblins: simulation and rendering massive crowds of intelligent and detailed creatures on GPU
Marian: Cost-effective High-Quality Neural Machine Translation in C++
Markerless View-Independent Registration of Multiple Distorted Projectors on Extruded Surfaces Using an Uncalibrated Camera
Markov Chain Monte Carlo on the GPU
Mars: a MapReduce framework on graphics processors
Mars: Accelerating MapReduce with Graphics Processors
Mascar: Speeding up GPU Warps by Reducing Memory Pitstops
MASCOT: Fast and Highly Scalable SVM Cross-validation using GPUs and SSDs
Mashing load balancing algorithm to boost hybrid kernels in molecular dynamics simulations
Masivo: Parallel Simulation Model Based on OpenCL for Massive Public Transportation Systems' Routes
Mass Estimation from Images using Deep Neural Network and Sparse Ground Truth
Mass-spring systems on the GPU
Massive Exploration of Neural Machine Translation Architectures
Massive exploration of perturbed conditions of the blood coagulation cascade through GPU parallelization
Massive Image Editing on the Cloud
Massive Parallel Implementation of ODE Solvers
Massive parallel LDPC decoding on GPU
Massive Parallelism with GPUs for Centrality Ranking in Complex Networks
Massive parallelization of combinatorial statistical genetics analyses porting machine learning methods on general purpose graphics processing units (GPU)
Massive parallelization of serial inference algorithms for a complex generalized linear model
Massively Deep Artificial Neural Networks for Handwritten Digit Recognition
Massively LDPC Decoding on Multicore Architectures
Massively Parallel A* Search on a GPU
Massively Parallel Algorithms for CFD Simulation and Optimization on Heterogeneous Many-Core Architectures
Massively Parallel Analysis of Similarity Matrices on Heterogeneous Hardware
Massively parallel approximate Gaussian process regression
Massively Parallel Computation of Accurate Densities for N-body Dark Matter Simulations using the Phase-Space-Element Method
Massively parallel computation using graphics processors with application to optimal experimentation in dynamic control
Massively Parallel Computing in Economics
Massively Parallel Construction of the Cell Graph
Massively parallel differential evolution-pattern search optimization with graphics hardware acceleration: an investigation on bound constrained optimization problems
Massively Parallel Finite Element Simulator for Full-Chip STI Stress Analysis
Massively Parallel GPU Computing of Continuum Robotic Dynamics
Massively Parallel GPU Memory Compaction
Massively Parallel Identification of Intersection Points for GPGPU Ray Tracing
Massively parallel implementation of cyclic LDPC codes on a general purpose graphics processing unit
Massively Parallel Jacobian Computation
Massively Parallel kNN using CUDA on Spam-Classification
Massively Parallel Localization of Pulsed Signal Transitions Using a GPU
Massively Parallel Logic Simulation with GPUs
Massively Parallel Lossless Compression of Medical Images Using Least-Squares Prediction and Arithmetic Coding
Massively parallel Monte Carlo for many-particle simulations on GPUs
Massively Parallel Network Coding on GPUs
Massively Parallel Neural Encoding and Decoding of Visual Stimuli
Massively Parallel Ray Tracing Algorithm Using GPU
Massively parallel read mapping on GPUs with PEANUT
Massively parallel read mapping on GPUs with the q-group index and PEANUT
Massively Parallel Sequential Monte Carlo for Bayesian Inference
Massively parallel simulations of relativistic fluid dynamics on graphics processing units with CUDA
Massively Parallel Suffix Array Queries and On-Demand Phrase Extraction for Statistical Machine Translation Using GPUs
Massively parallel two-dimensional TLM algorithm on graphics processing units
Massively parallelizable list-mode reconstruction using a Monte Carlo-based elliptical Gaussian model
Massively Parallelized Monte Carlo Simulation and its Applications in Finance
Massively parallelized replica-exchange simulations of polymers on GPUs
Massively-Parallel Lossless Data Decompression
Mastering Atari with Discrete World Models
Mastering Software Variant Explosion for GPU Accelerators
Matched Filter Computation on FPGA, Cell and GPU
MatConvNet - Convolutional Neural Networks for MATLAB
Material Removal Simulation and Cutting Force Prediction of Multi-Axis Machining Processes on General-Purpose Graphics Processing Units
Mathematical limits of parallel computation for embedded systems
MATLAB and Python for GPU Computing
MATLAB graphical interface for GPU based FDTD method
MATLAB Medical Images Classification on Graphics Processors
MATLAB Parallelization through Scalarization
Matrix Computations and Optimization in Apache Spark
Matrix Convolution using Parallel Programming
Matrix Factorization on GPUs with Memory Optimization and Approximate Computing
Matrix inversion speed up with CUDA
Matrix Multiplication Beyond Auto-Tuning: Rewrite-based GPU Code Generation
Matrix Multiplication on GPUs with On-Line Fault Tolerance
Matrix Multiplication Using Only Addition
Matrix Multiplication with CUDA - A basic introduction to the CUDA programming model
Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs
Matrix-Matrix Multiplications on GPUs for Accelerating a Parallel Fluid Dynamics Code
maxDNN: An Efficient Convolution Kernel for Deep Learning with Maxwell GPUs
Maximal Information Coefficient Analysis
Maximize Performance on GPUs Using the Rake-based Optimization: A Case Study
Maximizing Parallelism and GPU Utilization For Direct GPU Compilation Through Ensemble Execution
Maximum likelihood event estimation and list-mode image reconstruction on GPU hardware
Maximum mipmaps for fast, accurate, and scalable dynamic height field rendering
MaxSSmap: A GPU program for short read mapping with the maximum scoring subsequence
MC-RANSAC: A Pre-processing Model for RANSAC using Monte Carlo method implemented on a GPU
MCBooster: a library for fast Monte Carlo generation of phase-space decays on massively parallel platforms
MCMini: Monte Carlo on GPGPU
MCS 572: Introduction to Supercomputing
MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs
MCUDA: An Efficient Implementation of CUDA Kernels on Multi-cores
md_poly: A Performance-Portable Polyhedral Compiler Based on Multi-Dimensional Homomorphisms
MDLab: A molecular dynamics simulation prototyping environment
MDR: performance model driven runtime for heterogeneous parallel platforms
Mean shift for graph bundling
Mean Shift Parallel Tracking on GPU
Measurement and Analysis of GPU-accelerated Applications with HPCToolkit
Measurements of performance of hardware and general purpose classical molecular dynamics simulation software
Measuring Bandwidth for Super Computer Workloads
Measuring the evolving Internet ecosystem with exchange points
Measuring the Impact of Configuration Parameters in CUDA Through Benchmarking
Measuring the Performance of Realtime DSP Using Pure Data and GPU
Mechanical Characterization and Performance Optimization for GPU Fan-Sink Cooling Module Assembly
Median Based Parallel Steering Kernel Regression for Image Reconstruction
Medical Image Registration using OpenCL
Medical imaging using CUDA
MEDINA: MECCA Development in Accelerators - KPP Fortran to CUDA source-to-source Preprocessor
Medium-Grained Functions Mapping using Modern GPUs
Medusa: A Parallel Graph Processing System on Graphics Processors
Medusa: Simplified Graph Processing on GPUs
Mega-KV: A Case for GPUs to Maximize the Throughput of In-Memory Key-Value Stores
MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph
Megakernels Considered Harmful: Wavefront Path Tracing on GPUs
Megapixel Topology Optimization on a Graphics Processing Unit
MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Melia: A MapReduce Framework on OpenCL-based FPGAs
MELT - a Translated Domain Specific Language Embedded in the GCC Compiler
MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning
MemcachedGPU: Scaling-up Scale-out Key-value Stores
Memory Access Optimized Implementation of Cyclic and Quasi-Cyclic LDPC Codes on a GPGPU
Memory Bandwidth and Latency in HPC: System Requirements and Performance Impact
Memory Bandwidth Efficient Two-Dimensional Fast Fourier Transform Algorithm and Implementation for Large Problem Sizes
Memory Efficient Mixed-Precision Optimizers
Memory Interference and Performance Prediction in GPU-Accelerated Heterogeneous Systems
Memory layout in GPU implementation of lattice Boltzmann method for sparse 3D geometries
Memory Optimization for Deep Networks
Memory Saving Discrete Fourier Transform on GPUs
Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs
Memory-Efficient Acceleration of Block Low-Rank Foundation Models on Resource Constrained GPUs
Memory-efficient Adaptive Subdivision for Software Rendering on the GPU
Memory-efficient implementation of a graphics processor-based cluster detection algorithm for large spatial databases
Memory-Efficient Implementation of DenseNets
Memory-Efficient Object-Oriented Programming on GPUs
Memory-Efficient Single-Pass GPU Rendering of Multi-fragment Effects
Memory-level and Thread-level Parallelism Aware GPU Architecture Performance Analytical Model
Memory-Scalable GPU Spatial Hierarchy Construction
Merge or Separate? Multi-job Scheduling for OpenCL Kernels on CPU/GPU Platforms
Merge: a programming model for heterogeneous multi-core systems
Mersenne Twister Random Number Generation on FPGA, CPU and GPU
Mesh deformations in X3D via CUDA with freeform deformation lattices
Mesh Independent Loop Fusion for Unstructured Mesh Applications
Mesh mutation in programmable graphics hardware
Meshfree/GFEM in hardware-efficiency prospective
Message passing for GPGPU clusters: CudaMPI
Message Passing Interface support for the runtime adaptive multi-processor system-on-chip RAMPSoC
Message passing on data-parallel architectures
Meta Networks for Neural Style Transfer
Meta-Programming and Auto-Tuning in the Search for High Performance GPU Code
Meta-programming and Multi-stage Programming for GPGPUs
Meta-simulation of large WSN on multi-core computers
MetaBinG: Using GPUs to Accelerate Metagenomic Sequence Classification
MetaCL - A Model-Based Approach to Programming Heterogeneous Architectures Using OpenCL
MetaFork: A Compilation Framework for Concurrency Models Targeting Hardware Accelerators and Its Application to the Generation of Parametric CUDA Kernels
MetaMorph: A Library Framework for Interoperable Kernels on Multi- and Many-core Clusters
Metamorphic Testing for (Graphics) Compilers
Metaprogramming GPUs with Sh
Method for simulation of coastal terrain on GPU
Methodology of control and supervision of web connected mobile robots with CUDA technology application
Methods and Metrics for Fair Server Assessment under Real-Time Financial Workloads
Methods for Accelerating Machine Learning in High Performance Computing
Methods for GPU Acceleration of Big Data Applications
Methods for Optimizing OpenCL Applications on Heterogeneous Multicore Architectures
MGARD: A multigrid framework for high-performance, error-controlled data compression and refactoring
MGPUSim: Enabling Multi-GPU Performance Modeling and Optimization
MIC-SVM: Designing A Highly Efficient Support Vector Machine For Advanced Modern Multi-Core and Many-Core Architectures
MICA: A fast short-read aligner that takes full advantage of Intel Many Integrated Core Architecture (MIC)
Microarchitectural Performance Characterization of Irregular GPU Kernels
Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures
Microbenchmarking NVIDIA's Blackwell Architecture: An in-depth Architectural Analysis
Microbenchmarks for GPU characteristics: the occupancy roofline and the pipeline model
Microbranching in mode-I fracture using large scale simulations of amorphous and perturbed lattice models
Microlensing Observations Rapid Search for Exoplanets: MORSE code for GPUs
Micropolygon ray tracing with defocus and motion blur
MIDeA: a multi-parallel intrusion detection architecture
Migrating CUDA to oneAPI: A Smith-Waterman Case Study
Migrating from OpenGL ES to Vulkan
Migrating real-time depth image-based rendering from traditional to next-gen GPGPU
MILC Code Performance on High End CPU and GPU Supercomputer Clusters
MILC on GPUs
MILC staggered conjugate gradient performance on Intel KNL
MILJS: Brand New JavaScript Libraries for Matrix Calculation and Machine Learning
MiMatrix: A Massively Distributed Deep Learning Framework on a Petascale High-density Heterogeneous Cluster
MIMD Interpretation on a GPU
Mimetic Methods for Lagrangian Relaxation of Magnetic Fields
Mìmir: A real-time interactive visualization library for CUDA programs
MIML Learning with CNNs: Yelp Restaurant Photo Classification
Mind the gap!: bridging the dichotomy of design and implementation
Minerals detection for hyperspectral images using adapted linear unmixing: LinMin
Minerva: A Scalable and Highly Efficient Training Platform for Deep Learning
MinGPU: a minimum GPU library for computer vision
miniLB: A Performance Portability Study of Lattice-Boltzmann Simulations
Minimal models for finite particles in fluctuating hydrodynamics
minimap2-fpga: Integrating hardware-accelerated chaining for efficient end-to-end long-read sequence mapping
Minimising Testing in Genetic Programming
Mining Rare Features in Fingerprints Using Core Points and Triplet-based Features
Mint: realizing CUDA performance in 3D stencil methods with annotated C
Minuet: Accelerating 3D Sparse Convolutions on GPUs
MIOpen: An Open Source Library For Deep Learning Primitives
Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU
Mirovia: A Benchmarking Suite for Modern Heterogeneous Computing
MITHRA: Multiple data independent tasks on a heterogeneous resource architecture
Mix-and-Match: A Model-driven Runtime Optimisation Strategy for BFS on GPUs
Mixed precision in Graphics Processing Unit
Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Systems
Mixed Precision Solver Scalable to 16000 MPI Processes for Lattice Quantum Chromodynamics Simulations on the Oakforest-PACS System
Mixed-Precision Embedding Using a Cache
Mixed-precision finite element kernels and assembly: Rounding error analysis and hardware acceleration
Mixed-Precision GPU-Multigrid Solvers with Strong Smoothers
Mixed-precision numerics in scientific applications: survey and perspectives
Mixed-precision Orthogonalization Scheme and Adaptive Step Size for CA-GMRES on GPUs
Mixed-precision orthogonalization scheme and its case studies with CA-GMRES on a GPU
Mixed-Resolution Patch-Matching
Mixed-Tool Performance Analysis on Hybrid Multicore Architectures
Mixing Low-Precision Formats in Multiply-Accumulate Units for DNN Training
Mixing Multi-Core CPUs and GPUs for Scientific Simulation Software
MKPipe: A Compiler Framework for Optimizing Multi-Kernel Workloads in OpenCL for FPGA
ML Inference Scheduling with Predictable Latency
ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming
MLitB: Machine Learning in the Browser
MLS-based scalar fields over triangle meshes and their application in mesh processing
MNN: A Universal and Efficient Inference Engine
Mobile 3D Graphics
Mobile GPGPU Acceleration of Embodied Robot Simulation
Mobile GPU Computing Based Filter Bank Convolution for Three-dimensional Wavelet Transform
Mobile visual computing
MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?
MobiRNN: Efficient Recurrent Neural Network Execution on Mobile GPU
MobiRT: an implementation of OpenGL ES-based CPU-GPU hybrid ray tracer for mobile devices
Model Coupling between the Weather Research and Forecasting Model and the DPRI Large Eddy Simulator for Urban Flows on GPU-accelerated Multicore Systems
Model-Based 3D Object Tracking Using an Extended-Extended Kalman Filter and Graphics Rendered Measurements
Model-based optimization of MPDATA on Intel Xeon Phi through load imbalancing
Model-Based Warp-Level Tiling for Image Processing Programs on GPUs
Model-driven autotuning of sparse matrix-vector multiply on GPUs
Model-driven optimisation of memory hierarchy and multithreading on GPUs
Model-Driven Tile Size Selection for DOACROSS Loops on GPUs
Model-independent partial wave analysis using a massively-parallel fitting framework
Model-T: Rethinking the OS for terabit speeds
Modeling and Evaluation of Synchronous Stochastic Gradient Descent in Distributed Deep Learning on Multiple GPUs
Modeling and generating complex motion blur for real-time tracking
Modeling and Optimization of Parallel Matrix-based Computations on GPU
Modeling and Simulation of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures
Modeling Deep Learning Accelerator Enabled GPUs
Modeling GPU Dynamic Parallelism for Self Similar Density Workloads
Modeling GPU-CPU Workloads and Systems
Modeling Image Patches with a Generic Dictionary of Mini-Epitomes
Modeling of Heat Diffusion Through Isotropic Media Using Graphical Processing Units
Modeling of Heterogeneous Architecture with GPU to Exascale System
Modeling of High Performance Programs to Support Heterogeneous Computing
Modeling of the behavior of 222Rn progeny in diffusion chamber using CUDA
Modeling of tsunami waves and atmospheric swirling flows with graphics processing unit
Modeling Parallel Programs for Heterogeneous Computing
Modeling Parallel Programs using Large Language Models
Modeling Rotor Wakes with a Hybrid OVERFLOW-Vortex Method on a GPU Cluster
Modeling system for GPU parallel tasks performance simulation
Modeling the propagation of elastic waves using spectral elements on a cluster of 192 GPUs
Modeling the Resource Requirements of Convolutional Neural Networks on Mobile Devices
Modeling the spatio-temporal evolution of fracture networks and fluid-rock interactions in GPU: Applications to lithospheric geodynamics
Modelling sea water intrusion in coastal aquifers using heterogeneous computing
Modelling the Formation of Ordered Acentrosomal Microtubule Arrays
Modelling, simulating and visualising the Cahn-Hilliard-Cook field equation
Modern GPGPU Frameworks and their Application to the Physical Core of the ASUCA Weather Prediction Model
Modern GPU-Based Forward-Projection Algorithm with a New Sampling Method
Modern Gyrokinetic Particle-In-Cell Simulation of Fusion Plasmas on Top Supercomputers
Modern Platform for Parallel Algorithms Testing: Java on Intel Xeon Phi
Modernization and Optimization of MPI Codes
Modernizing the core quantum chemistry algorithms
MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures
Modification of self-organizing migration algorithm for OpenCL framework
Modified Bloom filter for high performance hybrid NoSQL systems
Modified Levels of Parallel Odd-Even Transposition Sorting Network (OETSN) with GPU Computing using CUDA
Modular & Scalable Ultrasound Platform with GPU Processing
Modular Arithmetic for Solving Linear Equations on the GPU
Modular FPGA Systems with Support for Dynamic Workloads and Virtualisation
Modular Resultant Algorithm for Graphics Processors
Modular Technology in the Modelling of Large Virtual Environments in Driving Simulators
Moim: A Multi-GPU MapReduce Framework
Mojo: MLIR-Based Performance-Portable HPC Science Kernels on GPUs for the Python Ecosystem
Molecular Activity Prediction using Deep Learning Software Library
Molecular Distance Geometry Optimization Using Geometric Build-up and Evolutionary Techniques on GPU
Molecular Docking on FPGA and GPU Platforms
Molecular dynamics for long-range interacting systems on Graphic Processing Units
Molecular Dynamics on a Grand Scale
Molecular dynamics recipes for genome research
Molecular Dynamics Simulation Based on Hadoop MapReduce
Molecular dynamics simulation of complex multiphase flow on a computer cluster with GPUs
Molecular Dynamics Simulation of Macromolecules Using Graphics Processing Unit
Molecular Dynamics Simulation of Multi-Scale Flows on GPUs
Molecular dynamics simulation of the supercooled Al melt on GPUs
Molecular dynamics simulation of UO2 nanocrystals melting
Molecular dynamics simulations of the relaxation processes in the condensed matter on GPUs
Molecular Dynamics Simulations on Commodity GPUs with CUDA
Molecular dynamics simulations through GPU video games technologies
Molecular Dynamics Simulations Using Graphics Processing Units
Molecular dynamics simulations with many-body potentials on multiple GPUs - the implementation, package and performance
Molecular Simulation of ab Initio Protein Folding for a Millisecond Folder NTL9(1-39)
Molecular Simulations using CUDA
Molecular structural mechanics approach to carbon nanotubes on graphics processing units
Monadic Deep Learning
Monitoring Collective Communication Among GPUs
Monitoring Large-scale Microblog on GPUs
Monitoring Multiple Streams with Dynamic Time Warping using Graphic Processors
Montage: A Neural Network Language Model-Guided JavaScript Engine Fuzzer
Montblanc: GPU accelerated Radio Interferometer Measurement Equations in support of Bayesian Inference for Radio Observations
Monte Carlo integration on GPU
Monte Carlo methods for massively parallel computers
Monte Carlo Modeling of Electron Transport Using CUDA Technology
Monte Carlo Path Tracing with OpenCL
Monte Carlo Radiative Transport on the GPU
Monte Carlo randomization tests for large-scale abundance datasets on the GPU
Monte Carlo simulation of photon migration in 3D turbid media accelerated by graphics processing units
Monte Carlo simulations on Graphics Processing Units
Monte-Carlo Black-Scholes Implementation using OpenCL Standard
More Bang For Your Buck(et): Fast and Space-efficient Hardware-accelerated Coarse-granular Indexing on GPUs
Morph Algorithms on GPUs
Morphological Proximity Priors: Spatial Relationships for Semantic Segmentation
Motion Compensation and Reconstruction of H.264/AVC Video Bitstreams using the GPU
Motion Estimation for H.264/AVC using Programmable Graphics Hardware
Motion Estimation with Non-Local Total Variation Regularization
Motion planning for autonomous driving with a conformal spatiotemporal lattice
Movement Tracking in Terrain Conditions Accelerated with CUDA
Moving Least-Squares Reconstruction of Large Models with GPUs
Mpache: Interaction Aware Multi-level Cache Bypassing on GPUs
MPC Toolbox with GPU Accelerated Optimization Algorithms
MPC: A Massively Parallel Compression Algorithm for Scientific Data
MPI Derived Datatypes Processing on Noncontiguous GPU-resident Data
MPI Parallelization of GPU-based Lattice Boltzmann Simulations
MPI within a GPU
MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-Based Systems
MPI-CUDA parallelization of a finite-strip program for geometric nonlinear analysis: A hybrid approach
MPI-GIS: New Parallel Overlay Algorithm and System Prototype
MPI-GPU parallelism in iterative eigensolvers for block-tridiagonal matrices
MQBench: Towards Reproducible and Deployable Model Quantization Benchmark
MR-API: A Comprehensive API Framework for Heterogeneous Multi-core Systems using Map Reduce Programming Model
Mr. Scan: Extreme Scale Density-Based Clustering using a Tree-Based Network of GPGPU Nodes
MrBayes on a Graphics Processing Unit
MrBayes tgMC3: A Tight GPU Implementation of MrBayes
MRCUDA: MapReduce Acceleration Framework Based on GPU
MRPB: Memory Request Prioritization for Massively Parallel Processors
MSA-CUDA: Multiple Sequence Alignment on Graphics Processing Units with CUDA
MSCCL++: Rethinking GPU Communication Abstractions for Cutting-edge AI Applications
MSREP: A Fast yet Light Sparse Matrix Framework for Multi-GPU Systems
MSTg: Cryptographically strong pseudorandom number generator and its realization
MT4G: A Tool for Reliable Auto-Discovery of NVIDIA and AMD GPU Compute and Memory Topologies
mu-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching
mu-grind: A Framework for Dynamically Instrumenting HLS-Generated RTL
Multi Agent Navigation on the GPU
Multi GPU Implementation of Iterative Tomographic Reconstruction Algorithms
Multi GPU Implementation of the Simplex Algorithm
Multi GPU Performance of Conjugate Gradient Algorithm with Staggered Fermions
Multi GPU Performance of Conjugate Gradient Solver with Staggered Fermions in Mixed Precision
Multi scale block histogram of template feature for pedestrian detection
Multi- and many-core data mining with adaptive sparse grids
Multi-Agent Systems and General-Purpose Computing on Graphics Processing Units: A Survey
Multi-agent traffic simulation with CUDA
Multi-camera real-time depth estimation with discontinuity handling on PC graphics hardware
Multi-Centroid PSO Classification Learning on the GPU
Multi-core CPU or GPU-accelerated Multiscale Modeling for Biomolecular Complexes
Multi-core CUDA Architecture for Parallelization of Hierarchical Text Clustering
Multi-core parallelism in a column-store
Multi-Core Programming Design Patterns: Stream Processing Algorithms for Dynamic Scene Perceptions
Multi-core programming with OpenCL: performance and portability: OpenCL in a memory bound scenario
Multi-dimensional characterization of electrostatic surface potential computation on graphics processors
Multi-dimensional characterization of temporal data mining on graphics processors
Multi-dimensional Functional Principal Component Analysis
Multi-Directional Optimisation on the GPU
Multi-domain, Higher Order Level Set Scheme for 3D Image Segmentation on the GPU
Multi-Elimination ILU Preconditioners on GPUs
Multi-fragment effects on the GPU using the k-buffer
Multi-GPGPU Cellular Automata Simulations using OpenACC
Multi-GPU accelerated multi-spin Monte Carlo simulations of the 2D Ising model
Multi-GPU Accelerated Parallel Algorithm of Wallis Transformation for Image Enhancement
Multi-GPU Acceleration of Black-Scholes Equation based Option Pricing
Multi-GPU and Multi-CPU Parallelization for Interactive Physics Simulations
Multi-GPU Based Lattice Boltzmann Method for Hemodynamic Simulation in Patient-Specific Cerebral Aneurysm
Multi-GPU based on multicriteria optimization for motion estimation system
Multi-GPU cluster wave propagation and OpenGL visualization
Multi-GPU Computing for Achieving Speedup in Real-time Aggregate Risk Analysis
Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling
Multi-GPU Graph Analytics
Multi-GPU Implementation for Iterative MR Image Reconstruction with Field Correction
Multi-GPU Implementation of a Hybrid Thermal Lattice Boltzmann Solver using the TheLMA Framework
Multi-GPU implementation of a VMAT treatment plan optimization algorithm
Multi-GPU Implementation of Machine Learning Algorithm using CUDA and OpenCL
Multi-GPU Implementation of the Minimum Volume Simplex Analysis Algorithm for Hyperspectral Unmixing
Multi-GPU implementation of the NICAM atmospheric model
Multi-GPU Implementation of the Uniformization Method for Solving Markov Models
Multi-GPU Island-Based Genetic Algorithm
Multi-GPU Island-Based Genetic Algorithm for Solving the Knapsack Problem
Multi-GPU Load Balancing for In-Situ Simulation and Visualization
Multi-GPU Load Balancing for In-situ Visualization
Multi-GPU numerical simulation of electromagnetic waves
Multi-GPU Parallel Computing and Task Scheduling under Virtualization
Multi-GPU parallel memetic algorithm for capacitated vehicle routing problem
Multi-GPU parallelization of a 3D Bayesian CT algorithm and its application on real foam reconstruction with incomplete data set
Multi-GPU Performance of Incompressible Flow Computation by Lattice Boltzmann Method on GPU Cluster
Multi-GPU Performance Optimization of a CFD Code using OpenACC on Different Platforms
Multi-GPU performance optimization of a computational fluid dynamics code using OpenACC
Multi-GPU Rendering with Vulkan API
Multi-GPU Support on Shared Memory System using Directive-based Programming Model
Multi-GPU Support on Single Node Using Directive-Based Programming Model
Multi-GPU Support on the Marrow Algorithmic Skeleton Framework
Multi-GPU thermal lattice Boltzmann simulations using OpenACC and MPI
Multi-GPU volume rendering using MapReduce
Multi-GPU-based Swendsen-Wang multi-cluster algorithm for the simulation of two-dimensional q-state Potts model
Multi-grain Parallel Processing of Data-Clustering on Programmable Graphics Hardware
Multi-hetero Acceleration by GPU and FPGA for Astrophysics Simulation on oneAPI Environment
Multi-Kepler GPU vs. Multi-Intel MIC for spin systems simulations
Multi-kernel Data Partitioning with Channel on OpenCL-based FPGAs
Multi-layer depth peeling via fragment sort
Multi-level Debugging for Multi-stage, Parallelizing Compilers
Multi-Level Ewald: A Hybrid Multigrid/Fast Fourier Transform Approach to the Electrostatic Particle-Mesh Problem
Multi-Level Graph Layout on the GPU
Multi-level Parallelism for Incompressible Flow Computations on GPU Clusters
Multi-level Parallelism for Time- and Cost-efficient Parallel Discrete Event Simulation on GPUs
Multi-level Parallelism with MPI and OpenACC for CFD Applications
Multi-level parallelism, global arrays, GPGPU Programming: Unify programming paradigms on Grid computing with efficiency
Multi-level parallelization for hybrid ACO
Multi-level Parallelization of Advanced Video Coding on Hybrid CPU/GPU Platform
Multi-line AI-assisted Code Authoring
Multi-Lingual Speech Recognition with Low-Rank Multi-Task Deep Neural Networks
Multi-mass solvers for lattice QCD on GPUs
Multi-Moment Methods for PDEs and GPUs for Large-Scale Scientific Computations
Multi-Object Geodesic Active Contours (MOGAC): A Parallel Sparse-Field Algorithm for Image Segmentation
Multi-Pass and Frame Parallel Algorithms of Motion Estimation in H.264/AVC for Generic GPU
Multi-platform Linear Algebra
Multi-Platform LU-Decomposition Solution in OpenCL
Multi-scale modeling of nano scale phenomenon using CUDA based HPC setup
Multi-scale neural texture classification using the GPU as a stream processing engine
Multi-scale problems, high performance computing and hybrid numerical methods
Multi-Scale Scheduling Techniques for Signal Processing Systems
Multi-Scale, Multi-Level, Heterogeneous Features Extraction and Classification of Volumetric Medical Images
Multi-Science Applications with Single Codebase - GAMER - for Massively Parallel Architectures
Multi-swarm PSO algorithm for the Quadratic Assignment Problem: a massive parallel implementation on the OpenCL platform
Multi-target DPA attacks: Pushing DPA beyond the limits of a desktop computer
Multi-target vectorization with MTPS C++ generic library
Multi-Tasking Scheduling for Heterogeneous Systems
Multi-Tenant Virtual GPUs for Optimising Performance of a Financial Risk Application
Multi-thread implementations of the lattice Boltzmann method on non-uniform grids for CPUs and GPUs
Multi-Threaded Automatic Integration Using OpenMP and CUDA
Multi-threaded Geant4 on the Xeon-Phi with Complex High-Energy Physics Geometry
Multi-threaded Kernel Offloading to GPGPU Using Hyper-Q on Kepler Architecture
Multi-tier Dynamic Vectorization for Translating GPU Optimizations into CPU Performance
Multi-user real-time speech recognition with a GPU
Multi-view Rendering Approach for Cloud-based Gaming Services
Multi-walk Parallel Pattern Search Approach on a GPU Computing Platform
Multi2Sim: a simulation framework for CPU-GPU computing
Multicore and GPU Algorithms for Nussinov RNA Folding
Multicore and GPU Parallelization of Neural Networks for Face Recognition
Multicore and Manycore Algorithms for Octrees
Multicore architecture and cache optimization techniques for solving graph problems
Multicore bundle adjustment
Multicore Computing: Algorithms, Architectures, and Applications
Multicore performance optimization using partner cores
Multicore Processing for Classification and Clustering Algorithms
Multicore Processing for Clustering Algorithms
Multicore Scheduling of Parallel Real-Time Tasks with Multiple Parallelization Options
Multidimensional Costas Arrays and Their Enumeration Using GPUs and FPGAs
Multidimensional Dataflow Graph Modeling and Mapping for Efficient GPU Implementation
Multidimensional Parallelization for Streaming Text Processing Applications Based on Parabix Framework
Multidimensional upwind hydrodynamics on unstructured meshes using Graphics Processing Units I. Two-dimensional uniform meshes
Multifactor dimensionality reduction for graphics processing units enables genome-wide testing of epistasis in sporadic ALS
Multifold Acceleration of Neural Network Computations Using GPU
Multifrontal computations on GPUs and their multi-core hosts
Multifrontal Factorization of Sparse SPD Matrices on GPUs
Multifrontal Sparse Matrix Factorization on Graphics Processing Units
MultiGPU computing using MPI or OpenMP
Multigrid on GPU: Tackling Power Grid Analysis on parallel SIMT platforms
Multigrid Optimization Methods for High Performance Computing
Multikernel Data Partitioning With Channel on OpenCL-Based FPGAs
Multilayered Abstractions for Partial Differential Equations
Multilevel Granularity Parallelism Synthesis on FPGAs
Multilevel Multidimensional Scaling on the GPU
Multilevel summation of electrostatic potentials using graphics processing units
Multilevel Tile Load Map on Massive Terrain Visualization
Multimodal collaboration and human-computer interaction
Multimodal Image Registration Using GPU Parallel Computing Technology
Multimodality imaging and state-of-art GPU technology in discriminating benign from malignant breast lesions on real time decision support system
Multipattern String Matching On A GPU
Multiphase Flow Simulations in Inclined Tubes with Lattice Boltzmann Method on GPU
Multiphase Fluid Simulations on a Multiple GPGPU PC Using Unsplit Time Integration VSIAM3
Multiple Bounding Boxes Algorithm in Collision Detection and Its Performances in Sequential vs CUDA Parallel Processing
Multiple String Matching on a GPU using CUDAs
Multiple Time Scales Recurrent Neural Network for Complex Action Acquisition
Multiple-GPU Scalability of Phase-Field Simulation for Dendritic Solidification
Multiple-GPUs Algorithm for Lattice Boltzmann Method
Multiple-Tasks on Multiple-Devices (MTMD): Exploiting Concurrency in Heterogeneous Managed Runtimes
Multiprocessing Acceleration of H.264/AVC Motion Estimation Full Search Algorithm under CUDA Architecture
Multireduce and Multiscan on Modern GPUs
Multiresolution Flow Simulations on Multi/many-core Architectures
Multiresolution MIP Rendering of Large Volumetric Data Accelerated on Graphics Hardware
Multiscale Hemodynamics Using GPU Clusters
Multiscale texture synthesis
Multithread Content Based File Chunking System in CPU-GPGPU Heterogeneous Architecture
Multithreaded Dense Linear Algebra on Asymmetric Multi-core Processors
Multithreaded Transposition of Square Matrices with Common Code for Intel Xeon Processors and Intel Xeon Phi Coprocessors
Multithreading for Visual Effects
MuMax: a new high-performance micromagnetic simulation tool
MUPPET: Optimizing Performance in OpenMP via Mutation Testing
MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU
Muscle pushing based skin deformation on GPU
Mutual information computation and maximization using GPU
Mutual-Supervised Learning for Sequential-to-Parallel Code Translation
MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters
MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
MyCaffe: A Complete C# Re-Write of Caffe with Reinforcement Learning
MYRIAD: A new N-body code for simulations of Star Clusters
Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks
Myths and Legends in High-Performance Computing
N-body Simulation for Astronomical Collisional Systems with a New SIMD Instruction Set Extension to the x86 Architecture, Advanced Vector Extensions
N-Body Simulation Using GP-GPU: Evaluating Host/Device Memory Transference Overhead
N-Body Simulations on GPUs
N-Cloth: Predicting 3D Cloth Deformation with Mesh-Based Networks
NaNet: a Low-Latency, Real-Time, Multi-Standard Network Interface Card with GPUDirect Features
NaNet: a low-latency NIC enabling GPU-based, real-time low level trigger systems
NAS Parallel Benchmarks for GPGPUs using a Directive-based Programming Model
Native Offload of Haskell Repa Programs to GPGPU
Natural HPC substrate: Exploitation of mixed multicore CPU and GPUs
NaturalCC: A Toolkit to Naturalize the Source Code Corpus
Navier-Stokes on programmable graphics hardware using SMAC
Navigating An Evolutionary Fast Path to Exascale - Expanded Version
NBODY6++GPU: Ready for the gravitational million-body problem
NBSymple, a double parallel, symplectic N-body code running on Graphic Processing Units
NCAM: Near-Data Processing for Nearest Neighbor Search
NCRF++: An Open-source Neural Sequence Labeling Toolkit
ndzip-gpu: Efficient Lossless Compression of Scientific Floating-Point Data on GPUs
Near Memory Similarity Search on Automata Processors
Near real-time Fast Bilateral Stereo on the GPU
Near-LSPA Performance at MSA Complexity
Near-real-time simulations of bioelectric activity in small mammalian hearts using graphical processing units
Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs
Nemo: A parallelized Lagrangian particle-tracking model
NeMo: A Platform for Neural Modelling of Spiking Neurons Using GPUs
Neneta: Heterogeneous Computing Complex-Valued Neural Network Framework
Nengo: a Python tool for building large-scale functional brain models
NengoDL: Combining deep learning and neuromorphic modelling methods
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
Neon: A Domain-Specific Programming Language for Image Processing
neoSYCL: a SYCL implementation for SX-Aurora TSUBASA
Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
Neptune: An astrophysical smooth particle hydrodynamics code for massively parallel computer architectures
NEPTUNE: Network- and GPU-aware Management of Serverless Functions at the Edge
Nested Data-Parallelism on the GPU
Nested Intervals Tree Encoding with System of Residual Classes
Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations
NetKet 3: Machine Learning Toolbox for Many-Body Quantum Systems
Network Simulator Tools and GPU Parallel Systems
Network-on-Chip Hardware Accelerators for Biological Sequence Alignment
Neural Architecture Search for Lightweight Non-Local Networks
Neural Architecture Search without Training
Neural Code Comprehension: A Learnable Representation of Code Semantics
Neural Decoding using a Parallel Sequential Monte Carlo method on Point Processes with Ensemble Effect
Neural GPUs Learn Algorithms
Neural Multi-scale Image Compression
Neural Network Computing Using On-Chip Accelerators
Neural Network Implementation Using CUDA and OpenMP
Neural Network Inference on Mobile SoCs
Neural Network Libraries: A Deep Learning Framework Designed from Engineers' Perspectives
Neural network modeling on evolution of hydration reaction for Portland cement
Neural Network Simulation: The recognition application
Neural Networks for Beginners. A fast implementation in Matlab, Torch, TensorFlow
Neural Networks through Shared Maps in Mobile Devices
Neural Query Language: A Knowledge Base Query Language for Tensorflow
Neural scene representation and rendering
Neurokernel: An Open Scalable Software Framework for Emulation and Validation of Drosophila Brain Models on Multiple GPUs
Neurokernel: An Open Source Platform for Emulating the Fruit Fly Brain
Neuromorphic models on a GPGPU cluster
Neville elimination on multi- and many-core systems: OpenMP, MPI and CUDA
New Basic Linear Algebra Methods for Simulation on GPUs
New efficient integral algorithms for quantum chemistry
New Efficient Method To Solve Longest Overlap Region Problem For Noncoding DNA Sequence
New High Performance GPGPU Code Transformation Framework Applied to Large Production Weather Prediction Code
New Row-grouped CSR format for storing the sparse matrices on GPU with implementation in CUDA
New Sparse Matrix Storage Format to Improve The Performance of Total SPMV Time
New Techniques for Spectral Image Acquisition and Analysis
Next-generation acceleration and code optimization for light transport in turbid media using GPUs
nGFSIM: A GPU-based fault simulator for 1-to-n detection and its applications
Nikola: embedding compiled GPU functions in Haskell
NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on mapreduce
Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning
NLSEmagic: Nonlinear Schrödinger Equation Multidimensional Matlab-based GPU-accelerated Integrators using Compact High-order Schemes
NMF-mGPU: non-negative matrix factorization on multi-GPU systems
nmfgpu4R: GPU-Accelerated Computation of the Non-Negative Matrix Factorization (NMF) Using CUDA Capable Hardware
NNP/MM: Fast molecular dynamics simulations with machine learning potentials and molecular mechanics
NNS: The Case For Neural Network-based Sorting
No More Shading Languages: Compiling C++ to Vulkan Shaders
Nodal Discontinuous Galerkin Methods on Graphics Processors
Noise Removal from Remote Sensed Images by NonLocal Means with OpenCL Algorithm
Noise-resistant fitting for spherical harmonics
Non-blocking programming on multi-core graphics processors: (extended abstract)
Non-Determinism in TensorFlow ResNets
Non-deterministic parallelism considered useful
Non-Hydrostatic Pressure Shallow Flows: GPU Implementation Using Finite-Volume and Finite-Difference Scheme
Non-intrusive Performance Analysis of Parallel Hardware Accelerated Applications on Hybrid Architectures
Non-local means denoising algorithm accelerated by GPU
Non-Local Total Generalized Variation for Optical Flow Estimation
Non-Parametric Adaptive Network Pruning
Non-recursive beam search on GPU for formal concept analysis
Non-rigid multi-modal registration on the GPU
Non-separable 2D, 3D and 4D filtering with CUDA
Non-steady relaxation and critical exponents at the depinning transition
Non-symmetric magnetohydrostatic equilibria: a multigrid approach
Non-Uniform Domain Decomposition for Heterogeneous Accelerated Processing Units
Non-Uniformly Partitioned Block Convolution on Graphics Processing Units
Nondissipative Marbling
Nonlinear Dynamic Analysis Efficiency by Using a GPU Parallelization
Nonlinear dynamic finite element analysis with GPU
Nonlinear optimization framework for image-based modeling on programmable graphics hardware
Nonlinear optimization with a massively parallel Evolution Strategy-Pattern Search algorithm on graphics hardware
Nonmetric Priors for Continuous Multilabel Optimization
Nonnegative Tensor Factorization Accelerated Using GPGPU
Nonperturbative Quantum Field Theory in Astrophysics
Not Half Bad: Exploring Half-Precision in Graph Convolutional Neural Networks
NOVA: A Functional Language for Data Parallelism
Novel Architectures: Solving Computational Problems with GPU Computing
Novel Computing Architectures
Novel Data-Partitioning Algorithms for Performance and Energy Optimization of Data-Parallel Applications on Modern Heterogeneous HPC Platforms
Novel GPU Implementation of Jacobi Algorithm for Karhunen-Loeve Transform of Dense Matrices
Novel implementations of recursive discrete wavelet transform for real time computation with multicore systems on chip (SOC)
Novel insights on atomic synchronization for sort-based group-by on GPUs
Novel Methodologies for Predictable CPU-To-GPU Command Offloading
Novel Multi-Layer Network Decomposition Boosting Acceleration of Multi-core Algorithms
Novel Parallel Approaches to Efficiently Solve Spatial Problems on Heterogeneous CPU-GPU Systems
Novel Parallelization Strategies for High-Performance DNN Training on HPC Systems
NPBench: A Benchmarking Suite for High-Performance NumPy
NPUEval: Optimizing NPU Kernels with LLMs and Open Source Compilers
NQueens on CUDA: Optimization Issues
Nsight Python: A Python-First Profiling Toolkit for Seamless GPU Kernel Analysis (Tool)
NT-SIM: A Co-Simulator for Networked Signal Processing Applications
Nucleation of nanoparticles in a coarse grained fluid using OpenCL
Nucleation Studies on Graphics Processing Units
Nuclei: GPU-Accelerated Many-Core Network Coding
NUMA Data-Access Bandwidth Characterization and Modeling
NUMA-Aware Image Compositing on Multi-GPU Platform
Numerical Accuracy Analysis Based on the Discrete Stochastic Arithmetic on Multiprocessor Platforms
Numerical Accuracy Differences in CPU and GPGPU Codes
Numerical computations in Java with CUDA
Numerical Computations with GPUs
Numerical cosmology on the GPU with Enzo and Ramses
Numerical integration on GPUs for higher order finite elements
Numerical investigations on nonlinear nonparaxial beam propagation using graphics processing units
Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects
Numerical Model of Shallow Water: the Use of NVIDIA CUDA Graphics Processors
Numerical Modeling of Atmospheric Vortices
Numerical modeling of gravitational wave sources accelerated by OpenCL
Numerical Ocean Modeling and Simulation with CUDA
Numerical Parallel Processing Based on GPU with CUDA Architecture
Numerical Precision and Benchmarking Very-High-Order Integration of Particle Dynamics on GPU Accelerators
Numerical resolution of conservation laws with OpenCL
Numerical Simulation for the MHD System in 2D Using OpenCL
Numerical simulation of 3D particulate flows based on GPU technology
Numerical Simulation of Melting with Natural Convection Based on Lattice Boltzmann Method and Performed with CUDA Enabled GPU
Numerical Simulation of the Complex Ginzburg-Landau Equation on GPUs with CUDA
Numerical Simulation of the Frank-Kamenetskii PDE: GPU vs. CPU Computing
Numerical simulations of acoustic waves with the graphic acceleration GAMER code
Numerical solution of PDEs with hybrid and heterogeneous computing models
Numerical Solutions of Heat and Mass Transfer with the Third Kind Boundary and Initial Conditions in Capillary Porous Media Using Programmable Graphics Hardware
Numerical Study of Geometric Multigrid Methods on CPU-GPU Heterogeneous Computers
NUPAR: A Benchmark Suite for Modern GPU Architectures
NVIDIA CUDA software and gpu parallel computing architecture
NVIDIA Nemotron Parse 1.1
NVIDIA SimNet: an AI-accelerated multi-physics simulation framework
NVIDIA Tensor Core Programmability, Performance & Precision
NVIDIA Tesla: A Unified Graphics and Computing Architecture
Object Detection Based Handwriting Localization
Object Oriented Framework for CUDA based Pyramidal Image Blending
Object oriented framework for real-time image processing on GPU
Object Space Based Collision Detection for Cloth Simulation on the GPU
Object support for OpenMP-style programming of GPU clusters in Java
Object-oriented stream programming using aspects
Object-oriented stream programming using Aspects: a high-productivity programming paradigm for hybrid platforms
Objective-Driven Workload Allocation in Heterogeneous Computing Systems
Obsidian: GPU Kernel Programming in Haskell (thesis)
Obsidian: GPU Programming in Haskell
Obtaining a 35x Speedup in 2D Phase Unwrapping Using Commodity Graphics Processors
OCCA: A unified approach to multi-threading languages
Ocean wave simulation in real-time using GPU
Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems
Ocelot/HyPE: Optimized Data Processing on Heterogeneous Hardware
OCLoptimizer: An Iterative Optimization Tool for OpenCL
OCT on CUDA: Speeding up the image reconstruction algorithm for an Optical Coherence Tomography system using NVIDIA's CUDA platform
Oct-tree Method on GPU
Octree Light Propagation Volumes
Octree-based, GPU implementation of a continuous cellular automaton for the simulation of complex, evolving surfaces
Odeint - Solving ordinary differential equations in C++
Odyssey: A Public GPU-Based Code for General-Relativistic Radiative Transfer in Kerr Spacetime
Off-axis quantitative phase imaging processing using CUDA: toward real-time applications
Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads
Offload Compiler Runtime for the Intel Xeon Phi Coprocessor
Offloading Critical Security Operations to the GPU
Offloading IDS Computation to the GPU
Offloading Java to Graphics Processors
Offloading Region Matching of Data Distribution Management with CUDA
Offset, Bisector and Medial Axis Construction on NURBS Surface Based on GPU
OKL: A Unified Language for Parallel Architectures
OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems
OmniDB: Towards Portable and Efficient Query Processing on Parallel CPU/GPU Architectures
Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs
Omniwise: Predicting GPU Kernels Performance with LLMs
OMP2HMPP: Compiler Framework for Energy-Performance Trade-off Analysis of Automatically Generated Codes
OMP2HMPP: HMPP Source Code Generation from Programs with Pragma Extensions
OmpSs task offload
On a Simplified Approach to Achieve Parallel Performance and Portability Across CPU and GPU Architectures
On accelerating iterative algorithms with CUDA: A case study on Conditional Random Fields training algorithm for biological sequence alignment
On algorithmic reductions in task-parallel programming models
On Benchmarking the Matrix Multiplication Algorithm using OpenMP, MPI and CUDA Programming Languages
On Binaural Spatialization and the Use of GPGPU for Audio Processing
On continuous maximum flow image segmentation algorithm
On CUDA implementation of a multichannel room impulse response reshaping algorithm based on p-norm optimization
On Demand Solid Texture Synthesis Using Deep 3D Networks
On Development, Feasibility, and Limits of Highly Efficient CPU and GPU Programs in Several Fields
On Dynamic Load Balancing on Graphics Processors
On Efficient GPGPU Computing for Integrated Heterogeneous CPU-GPU Microprocessors
On Expressing Different Concurrency Paradigms on Virtual Execution Systems
On Expressing Different Concurrency Paradigms on Virtual Execution Systems (thesis)
On GPU Fourier Transformations
On GPU-Accelerated Fast Direct Solvers and Their Applications in Image Denoising
On GPU's viability as a middleware accelerator
On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest
On learning optimized reaction diffusion processes for effective image restoration
On Leveraging GPUs for Security: discussing k-anonymity and pattern matching
On Longest Repeat Queries Using GPU
On Migration and Consolidation of VMs in Hybrid CPU-GPU Environments
On modelling of anisotropic viscoelasticity for soft tissue simulation: numerical solution and GPU execution
On optimization of finite-difference time-domain (FDTD) computation on heterogeneous and GPU clusters
On optimization techniques for the matrix multiplication on hybrid CPU+GPU platforms
On Optimizing Complex Stencils on GPUs
On Parallel Software Verification using Boolean Equation Systems
On Password Guessing with GPUs and FPGAs
On Performance of GPU and DSP Architectures for Computationally Intensive Applications
On Pre-Trained Image Features and Synthetic Images for Deep Learning
On Reinforcement Learning for Full-length Game of StarCraft
On Runtime Systems for Task-based Programming on Heterogeneous Platforms
On Scheduling Ring-All-Reduce Learning Jobs in Multi-Tenant GPU Clusters with Communication Contention
On Simplifying and Optimizing Programs for Heterogeneous Computing Systems
On sorting and load balancing on GPUs
On Static Timing Analysis of GPU Kernels
On testing GPU memory for hard and soft errors
On the Accelerating of Two-dimensional Smart Laplacian Smoothing on the GPU
On the accuracy and performance of the lattice Boltzmann method with 64-bit, 32-bit and novel 16-bit number formats
On the Characterization of OpenCL Dwarfs on Fixed and Reconfigurable Platforms
On the Choice of Tensor Estimation for Corner Detection, Optical Flow and Denoising
On the Compilation Performance of Current SYCL Implementations
On the Correctness of the SIMT Execution Model of GPUs
On the Cryptanalysis of Public-Key Cryptography
On the design of architecture-aware algorithms for emerging applications
On the design of sparse hybrid linear solvers for modern parallel architectures
On the Development and Implementation of High-Order Flux Reconstruction Schemes for Computational Fluid Dynamics
On the Effect of Using Multiple GPUs in Solving QAPs with CUDA
On the Effectiveness of OpenMP teams for Programming Embedded Manycore Accelerators
On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
On the Efficacy of GPU-Integrated MPI for Scientific Applications
On the Efficiency of CPU and Hybrid CPU-GPU Systems in Computational Biology Tasks
On the efficiency of iterative ordered subset reconstruction algorithms for acceleration on GPUs
On the energy efficiency of graphics processing units for scientific computing
On the evaluation of matrix polynomials using several GPGPUs
On the Fly Porn Video Blocking Using Distributed Multi-GPU and Data Mining Approach
On the GPGPU parallelization issues of finite element approximate inverse preconditioning
On the limits of GPU acceleration
On the numerical sensitivity of computer simulations on hybrid and parallel computing systems
On the numerical solution of chaotic dynamical systems using extend precision floating point arithmetic and very high order numerical methods
On the origin of yet another channel
On the Parallelization of Integer Polynomial Multiplication
On the Partitioning of GPU Power among Multi-Instances
On the Performance and Energy-efficiency of Multi-core SIMD CPUs and CUDA-enabled GPUs
On the performance of a highly-scalable Computational Fluid Dynamics code on AMD, ARM and Intel processors
On the performance of GPU public-key cryptography
On the Performance Portability of Structured Grid Codes on Many-Core Computer Architectures
On the Portability of CPU-Accelerated Applications via Automated Source-to-Source Translation
On the Portability of GPU-Accelerated Applications via Automated Source-to-Source Translation
On the Portability of the OpenCL Dwarfs on Fixed and Reconfigurable Parallel Platforms
On the Programmability and Performance of Heterogeneous Platforms
On the programmability of multi-GPU computing systems
On the Relation between Anisotropic Diffusion and Iterated Adaptive Filtering
On the Representation of Partially Specified Implementations and its Application to the Optimization of Linear Algebra Kernels on GPU
On the Robust Mapping of Dynamic Programming onto a Graphics Processing Unit
On the Simulations of Evolution-Communication P Systems with Energy without Antiport Rules for GPUs
On the technology roadmap of Free-Viewpoint 3DTV receivers
On the Three P's of Parallel Programming for Heterogeneous Computing: Performance, Productivity, and Portability
On the type of the temperature phase transition in phi-4 model
On the Usage of GPUs for Efficient Motion Estimation in Medical Image Sequences
On the Use of a GPU-Accelerated Mobile Device Processor for Sound Source Localization
On the Use of an Algebraic Language Interface for Waveform Definition
On the use of deep Boltzmann machines for road signs classification
On the Use of GPUs in Realizing Cost-Effective Distributed RAID
On the Use of Graphic Processing Units for the Efficient Implementation of MIMO Detectors
On the Use of Graphics Processing Units (GPUs) for Molecular Dynamics Simulation of Spherical Particles
On the Use of Remote GPUs and Low-Power Processors for the Acceleration of Scientific Applications
On the Use of Small 2D Convolutions on GPUs
On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods
On the Validation and Applications of a Parallel Flexible Multi-Body Dynamics Implementation
On the Visualization of Social and other Scale-Free Networks
On the Way to Future's High Energy Particle Physics Transport Code
On Using GPU to Compute Options and Derivatives
On Vectorization of Deep Convolutional Neural Networks for Vision Tasks
On-Demand Generating and Scheduling Optimised Parallel Applications on Heterogeneous Platforms
On-Demand Source Code Generation & Scheduling Optimised Parallel Applications on Heterogeneous Platforms
On-line free-viewpoint video: From single to multiple view rendering
On-the-Fly Computing on Many-Core Processors in Nuclear Applications
On-the-fly elimination of dynamic irregularities for GPU computing
On-the-fly Generation and Rendering of Infinite Cities on the GPU
On-The-Fly Parallel Data Shuffling for Graph Processing on OpenCL-based FPGAs
Oncilla: A GAS Runtime for Efficient Resource Allocation and Data Movement in Accelerated Clusters
One machine, one minute, three billion tetrahedra
One OpenCL to Rule Them All?
One Stone Two Birds: Synchronization Relaxation and Redundancy Removal in GPU-CPU Translation
One weird trick for parallelizing convolutional neural networks
One-shot tuner for deep learning compilers
oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
Onesweep: A Faster Least Significant Digit Radix Sort for GPUs
Online Adaptive Code Generation and Tuning
Online Dynamic Graph Drawing
Online Energy Optimization in GPUs: A Multi-Armed Bandit Approach
Online Performance Projection for Clusters with Heterogeneous GPUs
Online rapid prototyping of 3D objects using GPU-based 3D cloud computing: Application to 3D face modelling
Online video synthesis for removing occluding objects using multiple uncalibrated cameras via plane sweep algorithm
OP2: An Active Library Framework for Solving Unstructured Mesh-based Applications on Multi-Core and Many-Core Architectures
Opal: A Modular Framework for Optimizing Performance using Analytics and LLMs
Open Source Face Recognition API
Open SYCL on heterogeneous GPU systems: A case of study
Open-source FPGA-ML codesign for the MLPerf Tiny Benchmark
OpenABLext: An automatic code generation framework for agent-based simulations on CPU-GPU-FPGA heterogeneous platforms
OpenACC - First Experiences with Real-World Applications
OpenACC cache Directive: Opportunities and Optimizations
OpenACC Implementations Comparison
OpenACC offloading of the MFC compressible multiphase flow solver on AMD and NVIDIA GPUs
OpenACC-based GPU Acceleration of a 3-D Unstructured Discontinuous Galerkin Method
OpenACC-based Snow Simulation
OpenCL - An effective programming model for data parallel computations at the Cell Broadband Engine
OpenCL + OpenSHMEM Hybrid Programming Model for the Adapteva Epiphany Architecture
OpenCL 2.0 for FPGAs using OCLAcc
OpenCL 2.2 API Specification
OpenCL Accelerated Multi-GPU Cone-Beam Reconstruction
OpenCL Acceleration for TensorFlow
OpenCL Actors - Adding Data Parallelism to Actor-based Programming with CAF
OpenCL and parallel primitives for digital TV applications
OpenCL and the 13 Dwarfs: A Work in Progress
OpenCL API Extensions to achieve Multi-level Parallelism for Efficient Implementation of Strassen's Matrix Multiplication on GPUs
OpenCL Based Digital Image Projection Acceleration
OpenCL Based High-Quality HEVC Motion Estimation on GPU
OpenCL based machine learning labeling of biomedical datasets
OpenCL C++
OpenCL Cryptographic Library
OpenCL embedded profile prototype in mobile device
OpenCL Evaluation for Numerical Linear Algebra Library Development
OpenCL Fast Fourier Transform
OpenCL Floating Point Software on Heterogeneous Architectures - Portable or Not?
OpenCL for Database Query Processing
OpenCL for FPGAs: Prototyping a Compiler
OpenCL for programming shared memory multicore CPUs
OpenCL FPGA Optimization guided by memory accesses and roofline model analysis applied to tomography acceleration
OpenCL framework for a CPU, GPU, and FPGA Platform
OpenCL Implementation of a Color Based Object Tracking
OpenCL Implementation of a Parallel Universal Kriging Algorithm for Massive Spatial Data Interpolation on Heterogeneous Systems
OpenCL Implementation of LiDAR Data Processing
OpenCL Implementation of Montgomery Multiplication on FPGA
OpenCL Implementation of Motion Estimation for Cloud Video Processing
OpenCL in Action: How to Accelerate Graphics and Computations
OpenCL JIT Compilation for Dynamic Programming Languages
OpenCL Library for Parallel Graph Search Algorithms
OpenCL Numerical Simulations of Two-Fluid Compressible Flows With a 2D Random Choice Method
OpenCL parallel Processing using General Purpose Graphical Processing units - TiViPE software development
OpenCL Parallel Programming Development Cookbook
OpenCL Performance Evaluation on Modern Multi Core CPUs
OpenCL Performance on the Intel Heterogeneous Architecture Research Platform
OpenCL Performance Prediction using Architecture-Independent Features
OpenCL Programming by Example
OpenCL Programming Guide
OpenCL Programming Guide for Mac
OpenCL programming using Python syntax
OpenCL simulations of two-fluid compressible flows with a random choice method
OpenCL Sparse Linear Solver for Circuit Simulation
OpenCL Task Partitioning in the Presence of GPU Contention
OpenCL Vector Swizzling Optimization under Global Value Numbering
OpenCL vs: Accelerated Finite-Difference Digital Synthesis
OpenCL vs. OpenMP: A Programmability Debate
OpenCL-Accelerated Computation of a 3D SPECT Projection Operator for the Content Adaptive Mesh Model
OpenCL-accelerated object classification in video streams using Spatial Pooler of Hierarchical Temporal Memory
OpenCL-accelerated Point Feature Histogram and Its Application in Railway Track Point Cloud Data Processing
OpenCL-Accelerated Simplified General Perturbations 4 Algorithm
OpenCL-based Algorithm for Heat Load Modelling of District Heating System
OpenCL-based design methodology for application-specific processors
OpenCL-Based Design of an FPGA Accelerator for Phase-Based Correspondence Matching
OpenCL-Based Erasure Coding on Heterogeneous Architectures
OpenCL-Based FPGA Accelerator for 3D FDTD with Periodic and Absorbing Boundary Conditions
OpenCL-Based Implementation of an FPGA Accelerator for Molecular Dynamics Simulation
OpenCL-Based Mobile GPGPU Benchmarking: Methods and Challenges
OpenCL-based optimizations for acceleration of object tracking on FPGAs and GPUs
OpenCL-Darknet: implementation and optimization of OpenCL-based deep learning object detection framework
OpenCL-HPX Integration
OpenCL-ready High Speed FPGA Network for Reconfigurable High Performance Computing
OpenCL-Z Android Released on Google Play
OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems
OpenCL: a viable solution for high-performance medical image reconstruction?
OpenCL: Make Ubiquitous Supercomputing Possible
OpenCL/CUDA algorithms for parallel decoding of any irregular LDPC code using GPU
OpenCL/OpenGL approach for studying active Brownian motion
OpenCLIPER: an OpenCL-based C++ Framework for Overhead-Reduced Medical Image Processing and Reconstruction on Heterogeneous Devices
OpenCUDA+MPI: A Framework for Heterogeneous GP-GPU Distributed Computing
OpenDNN: An Open-source, cuDNN-like Deep Learning Primitive Library
OpenDwarfs 2025: Modernizing the OpenDwarfs Benchmark Suite for Heterogeneous Computing
OpenDwarfs: Characterization of Dwarf-Based Benchmarks on Fixed and Reconfigurable Architectures
OpenFace: A general-purpose face recognition library with mobile applications
OpenGL application live migration with GPU acceleration in personal cloud
OpenGL SuperBible: Comprehensive Tutorial and Reference (5th Edition)
OpenGL-Based Control of Semi-Active 3D Display
OpenGL(R) ES 2.0 Programming Guide
OpenGL(R) Programming Guide: The Official Guide to Learning OpenGL(R), Version 2 (5th Edition)
OpenGL(R) Shading Language (2nd Edition)
OpenGL(R) SuperBible: Comprehensive Tutorial and Reference (4th Edition)
Opening the Black Box: Performance Estimation during Code Generation for GPUs
OpenMM 8: Molecular Dynamics Simulation with Machine Learning Potentials
OpenMM: A Hardware-Independent Framework for Molecular Simulations
OpenMP Advisor
OpenMP as a High-Level Specification Language for Parallelism And its use in Evaluating Parallel Programming Systems
OpenMP for Accelerators
OpenMP in Multicore Architectures (tech. report)
OpenMP Kernel Language Extensions for Performance Portable GPU Codes
OpenMP offload at the Exascale using Intel GPU Max 1550: evaluation of STREAmS compressible solver
OpenMP Offloading in the Jetson Nano Platform
OpenMP on Multicore Architectures
OpenMP Parallelization and Optimization of Graph-based Machine Learning Algorithms
OpenMP performance analysis for many-core platforms with non-uniform memory access
OpenMP Programming on Intel(R) Xeon Phi(TM) Coprocessors: An Early Performance Comparison
OpenMP to GPGPU: a compiler framework for automatic translation and optimization
OpenMP, OpenMP/MPI, and CUDA/MPI C programs for solving the time-dependent dipolar Gross-Pitaevskii equation
OpenMPC: Extended OpenMP for Efficient Programming and Tuning on GPUs
OpenMPC: Extended OpenMP Programming and Tuning for GPUs
OpenNMT: Open-Source Toolkit for Neural Machine Translation
OpenOF: Framework for Sparse Non-linear Least Squares Optimization on a GPU
OpenRAND: A Performance Portable, Reproducible Random Number Generation Library for Parallel Computations
OpenRCL: Low-Power High-Performance Computing with Reconfigurable Devices
OpenSBLI: A framework for the automated derivation and parallel execution of finite difference solvers on a range of computer architectures
OpenSBLI: Automated code-generation for heterogeneous computing architectures applied to compressible fluid dynamics on structured grids
OpenSSL acceleration using Graphics Processing Units
OpenVIDIA: parallel GPU computer vision
Operating Systems Challenges for GPU Resource Management
Operating systems must support GPU abstractions
OPNET: An Integrated Design Paradigm for Simulations
Opportunities for Heterogeneous CPU-GPU Task Scheduling
Opportunities for Nonvolatile Memory Systems in Extreme-Scale High Performance Computing
Opportunities for Parallelism in Matrix Multiplication
Opt: A Domain Specific Language for Non-linear Least Squares Optimization in Graphics and Imaging
Optical Flow Computation on Compute Unified Device Architecture
Optical Flow via Locally Adaptive Fusion of Complementary Data Costs
Optimal Alignment of Three Sequences On A GPU
Optimal automatic multi-pass shader partitioning by dynamic programming
Optimal Configuration of GPU Cache Memory to Maximize the Performance
Optimal Control of the Process Systems Using Graphic Processing Unit
Optimal Control Problem and Power-Efficient Medical Image Processing Using Puma
Optimal Image Upscaling Using Pixel Classification
Optimal Kernel Orchestration for Tensor Programs with Korch
Optimal loop unrolling for GPGPU programs
Optimal loop unrolling for GPGPU programs (thesis)
Optimal Periods for Probing Convergence of Infinite-stage Dynamic Programmings on GPUs
Optimal Piecewise Linear Function Approximation for GPU-based Applications
Optimal polygonal L1 linearization and fast interpolation of nonlinear systems
Optimal program variant generation for hybrid manycore systems
Optimal rotation alignment of 3D objects using a GPU-based similarity function
Optimal similarity registration of volumetric images
Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs
Optimal structure of face detection algorithm using GPU architecture
Optimal Utilization of Heterogeneous Resources for Biomolecular Simulations
Optimal Workload Placement on Multi-Instance GPUs
Optimisation and GPU code generation of Stencils for Futhark
Optimisation and Parallelism in Synchronous Digital Circuit Simulators
Optimising Convolutional Neural Networks Inference on Low-Powered GPUs
Optimising Cosmological N-body Simulations in GPU Clusters
Optimising GPR modelling: A practical, multi-threaded approach to 3D FDTD numerical modelling
Optimising Hydrodynamics applications for the Cray XC30 with the application tool suite
Optimising Monte Carlo option pricing using GPUs
Optimising OpenCL kernels for the ARM Mali-T600 GPUs
Optimising Purely Functional GPU Programs
Optimising Purely Functional GPU Programs (Thesis)
Optimising Reconfigurable Systems for Real-time Applications
Optimising the DBCSR GPU Implementation
Optimising Unstructured Mesh Computational Fluid Dynamics Applications on Multicores via Machine Learning and Code Transformation
Optimistic Parallelism on GPUs
Optimization and Evaluation of VLPL-S Particle-in-cell Code on Knights Landing
Optimization and Implementation of LBM Benchmark on Multithreaded GPU
Optimization and Large Scale Computation of an Entropy-Based Moment Closure
Optimization and Parallelization Methods for the Design of Next-Generation Radio Networks
Optimization and parallelization of B-spline based orbital evaluations in QMC on multi/many-core shared memory processors
Optimization and parameter exploration using GPU based FDTD solvers
Optimization and Portability of a Fusion OpenACC-based FORTRAN HPC Code from NVIDIA to AMD GPUs
Optimization of a discontinuous finite element solver with OpenCL and StarPU
Optimization of a discontinuous Galerkin solver with OpenCL and StarPU
Optimization of a FDTD code for graphical processing units
Optimization of a finite element code implemented in MATLAB: On the use of GPUs for High Performance Computing
Optimization of a GPU Implementation of Multi-Dimensional RF Pulse Design Algorithm
Optimization of a Machine Learning Algorithm on the Heterogeneous system using OpenCL
Optimization of Compiler-generated OpenCL CNN Kernels and Runtime for FPGAs
Optimization of Data Assignment for Parallel Processing in a Hybrid Heterogeneous Environment Using Integer Linear Programming
Optimization of Data-Parallel Scientific Applications on Highly Heterogeneous Modern HPC Platforms
Optimization of GPU workloads using natural language processing based on deep learning techniques
Optimization of HEP codes on GPUs
Optimization of Heterogeneous Parallel Computing Systems using Machine Learning
Optimization of Heterogeneous Systems with AI Planning Heuristics and Machine Learning: A Performance and Energy Aware Approach
Optimization of Hierarchical Matrix Computation on GPU
Optimization of Large-Scale Sparse Matrix-Vector Multiplication on Multi-GPU Systems
Optimization of Lattice Boltzmann Simulations on Heterogeneous Computers
Optimization of linked list prefix computations on multithreaded GPUs using CUDA
Optimization of mapped functions sequences using fusions on GPU
Optimization of massive data applications on heterogeneous architectures
Optimization of Molecular Dynamics Simulation Code and Applications to Biomolecular Systems
Optimization of OpenCL applications on FPGA
Optimization of parallel Genetic Algorithms for nVidia GPUs
Optimization of Pattern Matching Algorithms for Multi- and Many-Core Platforms
Optimization of Ported CFD Kernels on Intel Data Center GPU Max 1550 using oneAPI ESIMD
Optimization of power consumption in the iterative solution of sparse linear systems on graphics processors
Optimization of RAID Erasure Coding Algorithms for Intel Xeon Phi
Optimization of real-time ultrasound PCIe data streaming and OpenCL processing for SAFT imaging
Optimization of solver for gas flow modeling
Optimization of Spatial Convolution in ConvNets on Intel KNL
Optimization of tele-immersion codes
Optimization of the Brillouin operator on the KNL architecture
Optimization of the Gaussian Mixture Model Evaluation on GPU
Optimization of the HEFT algorithm for a CPU-GPU environment
Optimization of the Oktay-Kronfeld Action Conjugate Gradient Inverter
Optimization of the Particle-based Volume Rendering for GPUs with Hiding Data Transfer Latency
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Optimization procedures during parallelization of specialized software for fluid flow simulations
Optimization Solutions for Improving the Performance of the Parallel Reduction Algorithm Using Graphics Processing Units
Optimization solutions for the segmented sum algorithmic function
Optimization strategies for parallel CPU and GPU implementations of a meshfree particle method
Optimization Techniques for CUDA Application
Optimization Techniques for GPU Programming
Optimization Techniques for Mapping Algorithms and Applications onto CUDA GPU Platforms and CPU-GPU Heterogeneous Platforms
Optimization Techniques on GPU: A Survey
Optimization, Specification and Verification of the Prefix Sum Program in an OpenCL Environment
Optimizations and Performance of a Robotics Grasping Algorithm Described in Geometric Algebra
Optimizations in Bioinformatics using GPU Processing on Binary Data
Optimize or Wait? Using llc Fast-Prototyping Tool to Evaluate CUDA Optimizations
Optimize Overall System Performance Through Workload Sequencing for GPUs Data Offloading
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Optimized Code Generation for Parallel and Polyhedral Loop Nests using MLIR
Optimized Composition: Generating Efficient Code for Heterogeneous Systems from Multi-Variant Components, Skeletons and Containers
Optimized Data Transfers Based on the OpenCL Event Management Mechanism
Optimized Deep Learning Architectures with Fast Matrix Operation Kernels on Parallel Platform
Optimized Event-Driven Runtime Systems for Programmability and Performance
Optimized GPU Framework for Pulsed Wave Doppler Ultrasound
Optimized GPU Framework for Speckle Reduction Using Histogram Matching and Region Growing
Optimized GPU Framework for Ultrasound B-Mode Imaging
Optimized GPU Framework for Ultrasound Color Flow Imaging
Optimized GPU Framework for Ultrasound Strain Imaging
Optimized GPU histograms for multi-modal registration
Optimized GPU Implementation and Performance Analysis of HC Series of Stream Ciphers
Optimized GPU simulation of continuous-spin glass models
Optimized HPL for AMD GPU and multi-core CPU usage
Optimized MFCC Feature Extraction on GPU
Optimized Parallel Implementation of Gillespie's First Reaction Method on Graphics Processing Units
Optimized parallel implementation of pedestrian tracking using HOG features on GPU
Optimized Password Recovery for Encrypted RAR on GPUs
Optimized Pattern-Based Adaptive Mesh Refinement Using GPU
Optimized Private Information Retrieval Protocol Using Graphics Processing Unit With Reduced Accessibility
Optimized Strategies for Mapping Three-dimensional FFTs onto CUDA GPUs
Optimizing 3D Convolutions for Wavelet Transforms on CPUs with SSE Units and GPUs
Optimizing a Biomedical Imaging Orientation Score Framework
Optimizing a Hardware Network Stack to Realize an In-Network ML Inference Application
Optimizing a High Energy Physics (HEP) Toolkit on Heterogeneous Architectures
Optimizing a Near-duplicate Document Detection System with SIMD Technologies
Optimizing a Semantic Comparator using CUDA-enabled Graphics Hardware
Optimizing a shared virtual memory system for a heterogeneous CPU-accelerator platform
Optimizing All-to-All and Allgather Communications on GPGPU Clusters
Optimizing an OpenCL Application for Video Watermarking in FPGAs
Optimizing and Auto-tuning Belief Propagation on the GPU
Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures
Optimizing ASP.NET with C++ AMP on the GPU
Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM
Optimizing Communication by Compression for Multi-GPU Scalable Breadth-First Searches
Optimizing Communication for Clusters of GPUs
Optimizing CUDA Code By Kernel Fusion - Application on BLAS
Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization
Optimizing CUDA Shared Memory Usage
Optimizing data intensive GPGPU computations for DNA sequence alignment
Optimizing Data Locality for Iterative Matrix Solvers on CUDA
Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission
Optimizing dataflow applications on heterogeneous environments
Optimizing Deep CNN-Based Queries over Video Streams at Scale
Optimizing Deep Learning Models For Raspberry Pi
Optimizing exact computation of Betweenness Centrality for CUDA
Optimizing for a Many-Core Architecture without Compromising Ease-of-Programming
Optimizing Full Correlation Matrix Analysis of fMRI Data on Intel Xeon Phi Coprocessors
Optimizing GPU to GPU Communication on Cray XK7
Optimizing GPU Volume Rendering
Optimizing GPU-accelerated Group-By and Aggregation
Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps
Optimizing High-Performance Linpack for Exascale Accelerated Architectures
Optimizing Huffman Decoding for Error-Bounded Lossy Compression on GPUs
Optimizing Krylov Subspace Solvers on Graphics Processing Units
Optimizing Lempel-Ziv Factorization for the GPU Architecture
Optimizing Linpack Benchmark on GPU-Accelerated Petascale Supercomputer
Optimizing LZSS Compression on GPGPUs
Optimizing MapReduce for GPUs with effective shared memory usage
Optimizing Memory Efficiency for Convolution Kernels on Kepler GPUs
Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs
Optimizing memory management on heterogeneous systems using polyhedral, compile-time techniques
Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators
Optimizing Monte Carlo radiosity on graphics hardware
Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes
Optimizing OpenCL Kernels for Iterative Statistical Applications on GPUs
Optimizing OpenCL Local Work Group Size With Machine Learning
Optimizing Performance and Energy Efficiency in Massively Parallel Systems
Optimizing Performance of Recurrent Neural Networks on GPUs
Optimizing Performance of Stencil Code with SPL Conqueror
Optimizing performance per watt on GPUs in High Performance Computing: temperature, frequency and voltage effects
Optimizing RDF stores by coupling General-purpose Graphics Processing Units and Central Processing Units
Optimizing Real Time GPU Kernels Using Fuzzy Inference System
Optimizing Similarity Computations for Ontology Matching - Experiences from GOMMA
Optimizing simulated annealing on GPU: A case study with IC floorplanning
Optimizing Smith-Waterman algorithm on Graphics Processing Unit
Optimizing Sparse Matrix-Matrix Multiplication for the GPU
Optimizing Sparse Matrix-Vector Multiplication on Emerging Many-Core Architectures
Optimizing Stencil Computations for NVIDIA Kepler GPUs
Optimizing strassen matrix multiply on GPUs
Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures
Optimizing Sweep3D for Graphic Processor Unit
Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs
Optimizing the Computation of Eigenvalues Using Graphics Processing Units
Optimizing the exploitation of multicore processors and GPUs with OpenMP and OpenCL
Optimizing the Linear Fascicle Evaluation Algorithm for Multi-Core and Many-Core Systems
Optimizing the MapReduce Framework on Intel Xeon Phi Coprocessor
Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units
Optimizing the optimizer: increasing performance efficiency of modern compilers
Optimizing the Performance of Parallel and Concurrent Applications Based on Asynchronous Many-Task Runtimes
Optimizing the SUSAN corner detection algorithm for a high speed FPGA implementation
Optimizing the Weather Research and Forecasting Model with OpenMP Offload and Codee
Optimizing Urban Environmental Simulations using Boinc
Optimizing Web Virtual Reality
Optimizing Xeon Phi for Interactive Data Analysis
OptiML: An End-to-End Framework for Program Synthesis and CUDA Kernel Optimization
OptiML: An implicitly parallel domain-specific language for machine learning
Optimum Application Deployment Technology for Heterogeneous IaaS Cloud
Option Pricing on the GPU
Option pricing with COS method on graphics processing units
Option pricing with multi-dimensional quadrature architectures
OptiX: a general purpose ray tracing engine
Orca: FSS-based Secure Training with GPUs
Orchestrated Scheduling and Prefetching for GPGPUs
Orchestrating Multiple Data-Parallel Kernels on Multiple Devices
Orchestrating Thread Scheduling and Cache Management to Improve Memory System Throughput in Throughput Processors
Orchestration by approximation: mapping stream programs onto multicore architectures
Orders-of-magnitude performance increases in GPU-accelerated correlation of images from the International Space Station
Origami: A Convolutional Network Accelerator
Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications
Orthogonalization on a General Purpose Graphics Processing Unit with Double Double and Quad Double Arithmetic
Orthogonalization on a general purpose graphics processing unit with double double and quad double arithmetic
Orthorectification by Using GPGPU Method
Out of kernel tuning and optimizations for portable large-scale docking experiments on GPUs
Out-of-core cone beam reconstruction using multiple GPUs
Out-of-core Implementation for Accelerator Kernels on Heterogeneous Clouds
Out-of-core singular value decomposition
Out-of-core Training for Extremely Large-Scale Neural Networks With Adaptive Window-Based Scheduling
Out-of-the-box library support for DBMS operations on GPUs
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Over-synchronization in GPU Programs
Overcoming the GPU memory limitation on FDTD through the use of overlapping subgrids
Overcomplete Dictionary Learning with Jacobi Atom Updates
Overdetermined Shooting Methods for Computing Standing Water Waves with Spectral Accuracy
Overhauling SC atomics in C11 and OpenCL
Overlap fermions on GPUs
Overlapping Computation and Communication for Advection on Hybrid Parallel Computers
Overlapping computation and communication of three-dimensional FDTD on a GPU cluster
Overtaking CPU DBMSes with a GPU in Whole-Query Analytic Processing with Parallelism-Friendly Execution Plan Optimization
Overview of approaches for accelerating scale invariant feature detection algorithm
Overview of implementation of DARPA GPU program in SAIC
OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance
Owl: Differential-based Side-Channel Leakage Detection for CUDA Applications
P-HGRMS: A Parallel Hypergraph Based Root Mean Square Algorithm for Image Denoising
P4OMP: Retrieval-Augmented Prompting for OpenMP Parallelism in Serial Code
PacketShader: a GPU-accelerated software router
Padding Free Bank Conflict Resolution for CUDA-Based Matrix Transpose Algorithm
Pairwise Sequence Alignment for Very Long Sequences on GPUs
Pairwise Sequence Alignment with Gaps with GPU
PAKCK: Performance and Power Analysis of Key Computational Kernels on CPUs and GPUs
Panda: A Compiler Framework for Concurrent CPU-GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers
PANDA: Extreme Scale Parallel K-Nearest Neighbor on Distributed Architectures
Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor
Pangolin: An Efficient and Flexible Graph Mining System on CPU and GPU
PanJoin: A Partition-based Adaptive Stream Join
PANNA: Properties from Artificial Neural Network Architectures
Pannotia: Understanding Irregular GPGPU Graph Applications
PantaRay: fast ray-traced occlusion caching of massive scenes
PAPER - Accelerating parallel evaluations of ROCS
ParaCodex: A Profiling-Guided Autonomous Coding Agent for Reliable Parallel Code Generation and Translation
ParadisEO-MO-GPU: a Framework for Parallel GPU-based Local Search Metaheuristics
Paragon: Collaborative Speculative Loop Execution on GPU and CPU
ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels
Paraiso: An Automated Tuning Framework for Explicit Solvers of Partial Differential Equations
Parakeet: A Just-In-Time Parallel Accelerator for Python
Parallax: Automatic Data-Parallel Training of Deep Neural Networks
Parallelizing AwSpPCA for robust facial recognition using CUDA
Parallel 3D Fast Wavelet Transform comparison on CPUs and GPUs
Parallel 3D Finite Difference Time Domain Simulations on Graphics Processors with Cuda
Parallel 3D Image Segmentation of Large Data Sets on a GPU Cluster
Parallel 3D multigrid methods on the STI cell BE architecture
Parallel 5 point SOR for solving the Convection Diffusion equation using graphics processing units
Parallel acceleration of CPU and GPU range queries over large data sets
Parallel Acceleration on Manycore Systems and Its Performance Analysis: OpenCL Case Study
Parallel accelerators for GlimmerHMM bioinformatics algorithm
Parallel Actors and Learners: A Framework for Generating Scalable RL Implementations
Parallel AES algorithm for fast Data Encryption on GPU
Parallel AES Encryption Engines for Many-Core Processor Arrays
Parallel Agent systems on a GPU for use with Simulations and Games
Parallel Algorithm Design and Implementation of Regular/Irregular Problems: An In-depth Performance Study on Graphics Processing Units
Parallel Algorithm for BSDEs Based High Dimensional American Option Pricing on the GPU
Parallel Algorithm for Generation of Test Recommended Path using CUDA
Parallel Algorithm for GPU Processing; for use in High Speed Machine Vision Sensing of Cotton Lint Trash
Parallel Algorithm for Solving Kepler's Equation on Graphics Processing Units: Application to Analysis of Doppler Exoplanet Searches
Parallel Algorithm of IDCT with GPUs and CUDA for Large-scale Video Quality of 3G
Parallel algorithms for approximation of distance maps on parametric surfaces
Parallel Algorithms for Constructing Data Structures for Fast Multipole Methods
Parallel Algorithms for Counting Problems on Graphs Using Graphics Processing Units
Parallel Algorithms for GPU accelerated Probabilistic Inference
Parallel Algorithms for Hybrid Multi-core CPU-GPU Implementations of Component Labelling in Critical Phase Models
Parallel algorithms for problems of cluster analysis with very large amount of data
Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations
Parallel algorithms to a parallel hardware: Designing vision algorithms for a GPU
Parallel and Concurrent Programming in Haskell: Techniques for Multicore and Multithreaded Programming
Parallel and Distributed Deep Learning
Parallel and Distributed Implementations of Multiple and Two-Dimensional Pattern Matching Algorithms
Parallel and distributed seismic wave field modeling with combined Linux clusters and graphics processing units
Parallel and efficient Boolean on polygonal solids
Parallel and Heterogeneous Timing Analysis: Partition, Algorithm, and System
Parallel and Improved PageRank Algorithm for GPU-CPU Collaborative Environment
Parallel and in-process compilation of individuals for genetic programming on GPU
Parallel and Scalable Sparse Basic Linear Algebra Subprograms
Parallel ant colony for nonlinear function optimization with graphics hardware acceleration
Parallel Application Library for Object Recognition
Parallel Approach for Longest Common Subsequence problem on GPU
Parallel Approach for Time Series Analysis with General Regression Neural Networks
Parallel Approaches for SWAMP Sequence Alignment
Parallel Approaches to Edit Distance and Approximate String Matching
Parallel Approaches to Shortest-Path Problems for Multilevel Heterogeneous Computing
Parallel Arbitrary-precision Integer Arithmetic
Parallel Asynchronous Modelization and Execution of Cholesky Algorithm using Petri Nets
Parallel Banding Algorithm to compute exact distance transform with the GPU
Parallel Batch Training of the Self-Organizing Map Using OpenCL
Parallel Benefit on Different Programming Paradigms
Parallel Bio-Inspired Methods for Model Optimization and Pattern Recognition
Parallel birth and death process for cell nuclei extraction in histopathology images
Parallel Branch and Bound on a CPU-GPU System
Parallel Branch Prediction on GPU Platform
Parallel Breadth First Search on GPU Clusters
Parallel BTF Compression with Multi-Level Vector Quantization in OpenCL
Parallel calculation of the median and order statistics on GPUs with application to robust regression
Parallel Catmull-Rom Spline Interpolation Algorithm for Image Zooming Based on CUDA
Parallel centerline extraction on the GPU
Parallel Chen-Han (PCH) Algorithm for Discrete Geodesics
Parallel Circuit Simulation on Graphical Processing Unit
Parallel Cloth Simulation Using OpenMP and CUDA
Parallel Compact Genetic Algorithm on CUDA-C Platform
Parallel compact roadmap construction of 3D virtual environments on the GPU
Parallel Compression Checkpointing for Socket-Level Heterogeneous Systems
Parallel Computation for Discrete Orthogonal Moments of Images Using Graphic Processing Unit
Parallel Computation of 2D Morse-Smale Complexes
Parallel computation of a SPECT projection operator for a content adaptative mesh model
Parallel Computation of Functions on Set Partitions
Parallel computation of mutual information on the GPU with application to real-time registration of 3D medical images
Parallel Computation of Non-Bonded Interactions in Drug Discovery: Nvidia GPUs vs. Intel Xeon Phi
Parallel computation of spherical parameterizations for mesh analysis
Parallel Computational Fluid Dynamics With the Intel Xeon Phi Coprocessor
Parallel Computational Intelligence-Based Multi-Camera Surveillance System
Parallel Computations for Hierarchical Agglomerative Clustering using CUDA
Parallel computations on GPU in 3D using the vortex particle method
Parallel Computer Vision: Person Data Extraction
Parallel Computing based on GPGPU using Compute Unified Device Architecture
Parallel Computing Experiences with CUDA
Parallel Computing for Accelerated Texture Classification with Local Binary Pattern Descriptors using OpenCL
Parallel Computing for the Inverse of SPD matrix
Parallel computing in a quantitative trading firm
Parallel Computing Methods For Particle Accelerator Design
Parallel Computing Model of Multiple Dimensions Data Streams Canonical Correlation Analysis with GPU
Parallel computing of 3D smoking simulation based on OpenCL heterogeneous platform
Parallel Computing of Discrete Element Method on GPU
Parallel Computing of Particle Trajectory Sonification to Enable Real-Time Interactivity
Parallel computing system for the efficient calculation of molecular similarity based on negative electrostatic potential
Parallel Computing the Longest Common Subsequence (LCS) on GPUs: Efficiency and Language Suitability
Parallel Computing Using GPU for Efficient Traffic Simulation
Parallel computing with CUDA
Parallel computing with graphics processing units for high-speed Monte Carlo simulation of photon migration
Parallel Computing: Accelerating Computational Science and Engineering (CSE)
Parallel Computing: The Elephant in the Room
Parallel connected-component labeling algorithm for GPGPU applications
Parallel Contour-Buildup Algorithm for the Molecular Surface
Parallel Cosegmentation via Submodular Optimization on Anisotropic Diffusion
Parallel CPU and GPU computations to solve the job shop scheduling problem with blocking
Parallel cross-layer optimization of high-level synthesis and physical design
Parallel Cryptanalysis
Parallel Cycle Based Logic Simulation Using Graphics Processing Units
Parallel CYK Membership Test on GPUs
Parallel Data List Processing on Multicore-GPU Platforms
Parallel data mining algorithms for multi-dimensional points on GPUs
Parallel data mining on graphics processors
Parallel Deblocking Filtering in MPEG-4 AVC/H.264 on Massively-Parallel Architectures
Parallel Decompression of Seismic Data on GPU Using a Lifting Wavelet Algorithm
Parallel Dense Gauss-Seidel Algorithm on Many-Core Processors
Parallel Dictionary Learning Algorithms for Sparse Representations
Parallel Digital Predistortion Design on Mobile GPU and Embedded Multicore CPU for Mobile Transmitters
Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs
Parallel discrete wavelet transform using the Open Computing Language: a performance and portability study
Parallel Distance Threshold Query Processing for Spatiotemporal Trajectory Databases on the GPU
Parallel Distributed Breadth First Search on the Kepler Architecture
Parallel Distributed Face Search System for National and Border Security
Parallel divide-and-evolve: experiments with OpenMP on a multicore machine
Parallel drainage network computation on CUDA
Parallel dual tree traversal on multi-core and many-core architectures for astrophysical N-body simulations
Parallel Dynamic Solidification Model of Continuous Steel Casting on GPU
Parallel Dynamics Computation using Prefix Sum Operations
Parallel Evaluation of a Spatial Traversability Cost Function on GPU for Efficient Path Planning
Parallel Evolutionary Algorithms on Consumer-Level Graphics Processing Unit
Parallel evolutionary algorithms on graphics processing unit
Parallel Exact Inference on a CPU-GPGPU Heterogenous System
Parallel execution of a parameter sweep for molecular dynamics simulations in a hybrid GPU/CPU environment
Parallel Execution of AES-CTR Algorithm Using Extended Block Size
Parallel Execution of Constraint Handling Rules on a Graphical Processing Unit
Parallel Execution of the ASP Computation - an Investigation on GPUs
Parallel experiments with RARE-BLAS
Parallel Explicit FEM Algorithms Using GPU's
Parallel external sorting for CUDA-enabled GPUs with load balancing and low transfer overhead
Parallel face Detection and Recognition on GPU
Parallel Fast Gauss Transform
Parallel FDTD Arithmetic Simulation Based on Distributed Heterogeneous Cluster System
Parallel FEM Simulation Using GPUs
Parallel FIM Approach on GPU using OpenCL
Parallel Finite Volume Algorithm on Graphic Processing Units (GPU)
Parallel Firewalls on General-Purpose Graphics Processing Units
Parallel For Loops on Heterogeneous Resources
Parallel frequent patterns mining algorithm on GPU
Parallel fuzzy connected image segmentation on GPU
Parallel Game Tree Search Using GPU
Parallel garment drape simulation of triangular mesh using GPU programming
Parallel Gaussian process with kernel approximation in CUDA
Parallel genetic algorithm on the CUDA architecture
Parallel Genetic Algorithm Solving 0/1 Knapsack Problem Running on the GPU
Parallel Genetic Algorithms on a GPU to Solve the Travelling Salesman Problem
Parallel Genetic Algorithms on Programmable Graphics Hardware
Parallel Genetic Programming on Graphics Processing Units
Parallel GMRES implementation for solving sparse linear systems on GPU clusters
Parallel GPGPU Evaluation of Small Angle X-ray Scattering Profiles in a Markov Chain Monte Carlo Framework
Parallel GPU algorithms for alternate-triangular finite difference schemes
Parallel GPU Architecture Simulation Framework Exploiting Work Allocation Unit Parallelism
Parallel GPU Implementation of Hough Transform for Circles
Parallel GPU Implementation of Iterated Local Search for the Travelling Salesman Problem
Parallel GPU Implementation of Iterative PCA Algorithms
Parallel GPU Processing for Fast Radio Signal Propagation Computation in GRASS-RaPlaT
Parallel GPU-accelerated Recursion-based Generators of Pseudorandom Numbers
Parallel GPU-based data-dependent triangulations
Parallel graduated assignment algorithm for multiple graph matching based on a common labelling
Parallel Graph Algorithms on the Xeon Phi Coprocessor
Parallel Graph Component Labelling with GPUs and CUDA
Parallel Graph Mining with GPUs
Parallel Graph Processing on Graphics Processors Made Easy
Parallel Gravitation Field Algorithm Based on the CUDA Platform
Parallel grid-based recursive Bayesian estimation using GPU for real-time autonomous navigation
Parallel H-Tree Based Data Cubing on Graphics Processors
Parallel Hashing, Compression and Encryption with OpenCL under OS X
Parallel heterogeneous Branch and Bound algorithms for multi-core and multi-GPU environments
Parallel Hierarchical Clustering on the GPU
Parallel hierarchical cross entropy optimization for on-chip decap budgeting
Parallel High Resolution Real-time Visual Hull On GPU
Parallel hybrid evolutionary algorithms on GPU
Parallel hybrid genetic algorithms on Consumer-Level graphics hardware
Parallel hybrid metaheuristics for the flexible job shop problem
Parallel hybrid SAT solving using OpenCL
Parallel hyperbolic PDE simulation on clusters: Cell versus GPU
Parallel hyperspectral image processing on commodity graphics hardware
Parallel Hyperspectral Unmixing on GPUs
Parallel ID Shadow-Map Decompression on GPU
Parallel Image Processing Based on CUDA
Parallel Image Segmentation Using Reduction-Sweeps On Multicore Processors and GPUs
Parallel implementation of flow and matching algorithms
Parallel Implementation Algorithm of Motion Estimation for GPU Applications
Parallel implementation of 3D protein structure similarity searches using a GPU and the CUDA
Parallel implementation of a Quantization algorithm for pricing American style options on GPGPU
Parallel implementation of a ray tracer for underwater sound waves using the cuda libraries: description and application to the simulation of underwater networks
Parallel implementation of a spatio-temporal visual saliency model
Parallel implementation of a spiking neuronal network model of unsupervised olfactory learning on NVidia CUDA
Parallel implementation of artificial neural network training
Parallel implementation of Artificial Neural Network training for speech recognition
Parallel Implementation of Color Based Image Retrieval Using CUDA on the GPU
Parallel Implementation of Compressive Sensing Based SAR Imaging with GPU
Parallel implementation of conjugate gradient method on graphics processors
Parallel Implementation of Devanagari Text Line and Word Segmentation Approach on GPU
Parallel Implementation of Dynamic Programming Algorithm Using Graphics Processing Unit
Parallel implementation of endmember extraction algorithms using NVidia graphical processing units
Parallel Implementation of Finite Element Codes using CUDA
Parallel Implementation of Lightweight Secure Hash Algorithm on CPU and GPU Environments
Parallel implementation of linear repetitive processes identification using subspace algorithms
Parallel Implementation of Moving Averages and Stock Market Prediction
Parallel implementation of Multi-dimensional Ensemble Empirical Mode Decomposition
Parallel Implementation of Niblack's Binarization Approach on CUDA
Parallel Implementation of Otsu's Binarization Approach on GPU
Parallel Implementation of Shape based Image Retrieval Approach on CUDA in Compressed Domain
Parallel Implementation of Similarity Measures on GPU Architecture using CUDA
Parallel Implementation of Sauvola's Binarization Approach on GPU
Parallel Implementation of Texture Based Image Retrieval on The GPU
Parallel Implementation of the Finite Element Method on Graphics Processors for the Solution of Incompressible Flows
Parallel implementation of the Finite-Difference Time-Domain method in Open Computing Language
Parallel Implementation of the Heisenberg Model Using Monte Carlo on GPGPU
Parallel implementation of the wideband DOA algorithm on single core, multicore, GPU and IBM cell BE processor
Parallel Implementation of the Wu-Manber Algorithm Using the OpenCL Framework
Parallel Implementation of Travelling Salesman Problem using Ant Colony Optimization
Parallel Implementation of Vortex Element Method on CPUs and GPUs
Parallel implementation of wavelet-based image denoising on programmable PC-grade graphics hardware
Parallel Implementation on GPUs of ADI Finite Difference Methods for Parabolic PDEs with Applications in Finance
Parallel Implementations for Solving Shortest Path Problem using Bellman-Ford
Parallel Implementations of a Disparity Estimation Algorithm Based on a Proximal Splitting Method
Parallel Implementations of Beamforming Design and Filtering for Microphone Array Applications
Parallel Implementations of Hopfield Neural Networks On GPU
Parallel implementations of probabilistic latent semantic analysis on graphic processing units
Parallel Implementations of the Cholesky Decomposition on CPUs and GPUs
Parallel implementations of the MinMin heterogeneous computing scheduler in GPU
Parallel In-Memory Distance Threshold Queries on Trajectory Databases
Parallel Inference on Structured Data with CRFs on GPUs
Parallel Interpretation of L-system Based on CUDA
Parallel Irradiance Caching on the GPU
Parallel Iteration to the Radiative Transport in Inhomogeneous Media with Bootstrapping
Parallel Iterative Linear Solvers on GPU: A Financial Engineering Case
Parallel k-Means Image Segmentation Using Sort, Scan & Connected Components on a GPU
Parallel kinetic Monte Carlo simulation of Coulomb glasses
Parallel kNN on GPU Architecture Using OpenCL
Parallel Language Programming In Different Platforms
Parallel latent semantic analysis using a graphics processing unit
Parallel LDPC Decoder Implementation on GPU Based on Unbalanced Memory Coalescing
Parallel LDPC Decoding on a Heterogeneous Platform using OpenCL
Parallel LDPC Decoding on GPUs Using a Stream-Based Computing Approach
Parallel LDPC decoding using CUDA and OpenMP
Parallel Level set algorithm with MPI and accelerated on GPU
Parallel Lexicographic Names Construction with CUDA
Parallel local search on GPU and CPU with OpenCL
Parallel Loopy Belief Propagation in Conditional Random Fields
Parallel LZ77 Decoding using a GPU
Parallel Matching and Clustering Algorithms on GPUs
Parallel medical image reconstruction: from graphics processing units (GPU) to Grids
Parallel Medical Image Reconstruction: From Graphics Processors to Grids
Parallel Memory Defragmentation on a GPU
Parallel mesh adaptation and graph analysis using graphics processing units
Parallel Mining of Neuronal Spike Streams on Graphics Processing Units
Parallel Monte Carlo on Intel MIC Architecture
Parallel Morphological Endmember Extraction Using Commodity Graphics Hardware
Parallel Motion Estimation Implementation for Different Block Matching Algorithms onto GPGPU
Parallel Multi Channel Convolution using General Matrix Multiplication
Parallel multi-agent path planning in dynamic environments for real-time applications
Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation
Parallel Multi-dimensional Range Query Processing with R-Trees on GPU
Parallel multi-level analytical global placement on graphics processing units
Parallel multi-objective evolutionary algorithms on graphics processing units
Parallel multiclass classification using SVMs on GPUs
Parallel multigrid preconditioning on graphics processing units (GPUs) for robust power grid analysis
Parallel mutual information estimation for inferring gene regulatory networks on GPUs
Parallel N-Body Simulation using GPUs
Parallel Neural Network Training with OpenCL
Parallel Neutrino Triggers using GPUs for an underwater telescope
Parallel Nonbinary LDPC Decoding on GPU
Parallel numerical simulation of two-phase flow model in porous media using distributed and shared memory architectures
Parallel On-Chip Power Distribution Network Analysis on Multi-Core-Multi-GPU Platforms
Parallel one-versus-rest SVM training on the GPU
Parallel Optical Flow Detection Using CUDA
Parallel Optimization of Queries in XML Dataset Using GPU
Parallel option pricing with Fourier space time-stepping method on graphics processing units
Parallel Outlier Detection on Uncertain Data for GPUs
Parallel packet classification using GPU co-processors
Parallel Pairwise Correlation Computation On Intel Xeon Phi Clusters
Parallel paradigms in optimal structural design
Parallel Parametric Optimisation with Firefly Algorithms on Graphical Processing Units
Parallel particle filter algorithm in face tracking
Parallel Particle Swarm Optimization for Image Segmentation
Parallel Particle Swarm Optimization on Graphical Processing Unit for Pose Estimation
Parallel particle swarm optimization using GPGPU
Parallel Particle-Based Reaction Diffusion: A GPU Implementation
Parallel Peeling Algorithms
Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs
Parallel perfusion imaging processing using GPGPU
Parallel Position Weight Matrices Algorithms
Parallel power flow solutions using a biconjugate gradient algorithm and a Newton method: A GPU-based approach
Parallel preconditioned conjugate gradient algorithm on GPU
Parallel preconditioning for spherical harmonics expansions of the Boltzmann transport equation
Parallel Prefix Scan with Compute Unified Device Architecture (CUDA)
Parallel Prefix Sum (Scan) with CUDA
Parallel Primitive Optimization for GPU and Multicore
Parallel Primitives based Spatial Join of Geospatial Data on GPGPUs
Parallel probabilistic model checking on general purpose graphics processors
Parallel processing between GPU and CPU: Concepts in a game architecture
Parallel Processing for Normal Mixture Models of Hyperspectral Data Using a Graphics Processor
Parallel processing for SAR image generation in CUDA - GPGPU platform
Parallel Processing of Matrix Multiplication in a CPU and GPU Heterogeneous Environment
Parallel Processing of the Building-Cube Method on a GPU Platform
Parallel processing on NVIDIA graphics processing units using CUDA
Parallel Processing using FPGAs and GPUs
Parallel Programming and Compressed Material Data for an Eulerian Code
Parallel Programming for FPGAs
Parallel programming for multimedia applications
Parallel Programming in Actor-Based Applications via OpenCL
Parallel programming in mobile devices with FancyJCL
Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems
Parallel Programming Models for Heterogeneous Many-Cores: A Survey
Parallel Programming on a Soft-Core Based Multi-core System
Parallel programming on GPU using Intel Array Building Blocks
Parallel Programming using OpenCL on Modern Architectures
Parallel programming with CUDA
Parallel programming with inductive synthesis
Parallel programming with NVIDIA CUDA
Parallel Progressive Mesh Editing
Parallel Pseudo-Random Number Generation
Parallel Quadtree Coding of Large-Scale Raster Geospatial Data on GPGPUs
Parallel Quadtree Coding of Large-Scale Raster Geospatial Data on Multicore CPUs and GPGPUs
Parallel QuadTree Encoding of Large-Scale Raster Geospatial Data on Multicore CPUs and GPGPUs
Parallel Random Numbers: As Easy as 1, 2, 3
Parallel random variates generator for GPUs based on normal numbers
Parallel rate-distortion optimized intra mode decision on multi-core graphics processors using greedy-based encoding orders
Parallel Ray Tracing in Scientific Visualization
Parallel Ray Tracing Simulations with MATLAB for Dynamic Lens Systems
Parallel reconstruction of neighbor-joining trees for large multiple sequence alignments using CUDA
Parallel Rendering on Hybrid Multi-GPU Clusters
Parallel SAT solvers and their application in automatic parallelization
Parallel SAT-Solving with OpenCL
Parallel scalable simulations of biological neural networks using TensorFlow: A beginner's guide
Parallel Search of k-Nearest Neighbors with Synchronous Operations
Parallel search on video cards
Parallel Selectivity Estimation for Optimizing Multidimensional Spatial Join Processing on GPUs
Parallel Semi-Implicit Time Integrators
Parallel Sequential Monte Carlo for Efficient Density Combination: The Deco Matlab Toolbox
Parallel Shooting and Bouncing Ray Method on GPU Clusters for Analysis of Electro-Magnetic Scattering
Parallel Shortest Path Algorithm for Voronoi Diagrams with Generalized Distance Functions
Parallel SIFT-detector implementation for images matching
Parallel SimRank computation on large graphs with iterative aggregation
Parallel simulation of mixed-abstraction SystemC models on GPUs and multicore CPUs
Parallel simulation of Petri nets on desktop PC hardware
Parallel simulation of population balance model-based particulate processes using multi-core CPUs and GPUs
Parallel Simulation of Population Balance Model-Based Particulate Processes Using Multicore CPUs and GPUs
Parallel Simulations for Analysing Portfolios of Catastrophic Event Risk
Parallel Smoothers for Matrix-based Multigrid Methods on Unstructured Meshes Using Multicore CPUs and GPUs
Parallel smoothing of quad meshes
Parallel solutions of static Hamilton-Jacobi equations for simulations of geological folds
Parallel Solving Massive Linear Equations with CUDA
Parallel Sorting on the Heterogeneous AMD Fusion Accelerated Processing Unit
Parallel source code transformation techniques using design patterns
Parallel Sparse Coding for Seafloor Image Analysis
Parallel Sparse Linear Algebra for Multi-core and Many-core Platforms: Parallel Solvers and Preconditioners
Parallel Sparse Matrix Solver on the GPU Applied to Simulation of Electrical Machines
Parallel spatial data structures for interactive rendering
Parallel Spectral Graph Partitioning on CUDA
Parallel Spherical Harmonic Transforms on heterogeneous architectures (GPUs/multi-core CPUs)
Parallel Statistical Analysis of Analog Circuits by GPU-accelerated Graph-based Approach
Parallel Statistical Multi-resolution Estimation
Parallel Streaming Intra Prediction for Full HD H.264 Encoding
Parallel Subgraph Mining on Hybrid Platforms: HPC Systems, Multi-Cores and GPUs
Parallel Support Vector Machines in Practice
Parallel Surface Reconstruction for Particle-Based Fluids
Parallel Surface Reconstruction on GPU
Parallel Symbolic Analysis of Large Analog Circuits on GPU Platforms
Parallel technologies for solving system of the linear equations by the conjugate gradient method
Parallel Tempering Simulation of the three-dimensional Edwards-Anderson Model with Compact Asynchronous Multispin Coding on GPU
Parallel time integration using Batched BLAS (Basic Linear Algebra Subprograms) routines
Parallel track reconstruction in CMS using the cellular automaton approach
Parallel training of Deep Neural Networks with Natural Gradient and Parameter Averaging
Parallel Trajectory Planning on GPU
Parallel Tree Traversal for Nearest Neighbor Query on the GPU
Parallel tree-ensemble algorithms for GPUs using CUDA
Parallel Triangular Solvers on GPU
Parallel Two-Stage Least Squares algorithms for Simultaneous Equations Models on GPU
Parallel unmixing of remotely sensed hyperspectral images on commodity graphics processing units
Parallel Unsmoothed Aggregation Algebraic Multigrid Algorithms on GPUs
Parallel Unsteady Flow Line Integral Convolution for High-Performance Dense Visualization
Parallel Variable Distribution Algorithm for Constrained Optimization with Nonmonotone Technique
Parallel Variable Pre-Selection and Lookahead Solving on GPUs
Parallel Verlet neighbor list algorithm for GPU-optimized MD simulations
Parallel View-Dependent Level-of-Detail Control
Parallel view-dependent refinement of progressive meshes
Parallel Viewshed Analysis on GPU Using CUDA
Parallel Volume Rendering for Large Scientific Data
Parallel volume rendering implementation on graphics cards using CUDA
Parallel Voronoi Diagram computation on scaled distance planes using CUDA
Parallel waveform extraction algorithms for the Cherenkov Telescope Array Real-Time Analysis
Parallel Wavelet Schemes for Images
Parallel Worldline Numerics: Implementation and Error Analysis
Parallel Zigzag Scanning and Huffman Coding for a GPU-based MPEG-2 Encoder
Parallel Zonal Summations of Large-Scale Species Occurrence Data on Hybrid CPU-GPU Systems
Parallel-META: A high-performance computational pipeline for metagenomic data analysis
Parallel-META: efficient metagenomic data analysis based on high-performance computation
Parallel, distributed and GPU computing technologies in single-particle electron microscopy
Parallel, stochastic measurement of molecular surface area
Paralleling Variable Block Size Motion Estimation of HEVC on Multi-Core CPU Plus GPU Platform
Parallelisation of Java for Graphics Processors
Parallelisation of Shallow Water Simulation for Heterogeneous Architectures
Parallelising the Transfer-Matrix Method using Graphics Processors
Parallelism in Database Operations
Parallelism of Clonal Selection for PSP on CUDA
Parallelism, Patterns, and Performance in Iterative MRI Reconstruction
Parallelization & checkpointing of GPU applications through program transformation
Parallelization and characterization of GARCH option pricing on GPUs
Parallelization and Optimization of Feature Detection Algorithms on Embedded GPU
Parallelization and Performance of the NIM Weather Model for CPU, GPU and MIC Processors
Parallelization and Performance of the NIM Weather Model on CPU, GPU and MIC Processors
Parallelization Design of Irregular Algorithms of Video Processing on GPUs
Parallelization Methods of the Template Matching Method on Graphics Accelerators
Parallelization of a Block-Matching Algorithm
Parallelization of a Monte Carlo Ray Tracing Algorithm for Channel Modelling in Underwater Wireless Optical Communications
Parallelization of a novel frequent itemset hiding algorithm on a CPU-GPU platform
Parallelization of algorithms for solving the Boltzmann equation for GPU-based computations
Parallelization of an Ultrasound Reconstruction Algorithm for non Destructive Testing on Multicore CPU and GPU
Parallelization of an Unsteady ALE Solver with Deforming Mesh Using OpenACC
Parallelization of BFS Graph Algorithm using CUDA
Parallelization of Binary and Real-Coded Genetic Algorithms on CUDA
Parallelization of binary and real-coded genetic algorithms on GPU using CUDA
Parallelization of BVH and BSP on the GPU
Parallelization of calculations using GPU in optimization approach for macromodels construction
Parallelization of cellular neural networks on GPU
Parallelization of Coherent Point Drift for patient registration
Parallelization of Data Intensive Code Using Computer Unified Device Architecture (CUDA)
Parallelization of DIRA and CTmod using OpenMP and OpenCL
Parallelization of DNA alignment algorithms using GPUs
Parallelization of Encryption and Hashing Algorithm Using GPU
Parallelization of Hierarchical Text Clustering on Multi-core CUDA Architecture
Parallelization of KMP String Matching Algorithm on Different SIMD architectures: Multi-Core and GPGPU's
Parallelization of maximum likelihood fits with OpenMP and CUDA
Parallelization of Mesh Contraction and Fairing using OpenCL
Parallelization of Multipattern Matching on GPU
Parallelization of Myers Fast Bit-Vector Algorithm using GPGPU
Parallelization of PageRank on Multicore Processors
Parallelization of Particle Filter Algorithms
Parallelization of RSA Algorithm Based on Compute Unified Device Architecture
Parallelization of SAT Algorithms on GPUs
Parallelization of Shape Diameter Function Computation using OpenCL
Parallelization of Single Threaded Applications using OpenMP and CUDA/C
Parallelization of specialized fluid flow simulator based on lattice Boltzmann method on a multi GPU system
Parallelization of Synthetic Aperture Radar (SAR) Imaging Algorithms on GPU
Parallelization of tau-leap coarse-grained Monte Carlo simulations on GPUs
Parallelization of the Algorithm WHAM with NVIDIA CUDA
Parallelization of the Ant Colony Optimization for the Shortest Path Problem using OpenMP and CUDA
Parallelization of the Cuckoo Search using CUDA Architecture
Parallelization of the distinct lattice spring model
Parallelization of the Generalized Hough Transform on GPU
Parallelization of the Honeybee Search Algorithm for Object Tracking
Parallelization of the Local Threshold and Boolean Function Based Edge Detection Algorithm Using CUDA
Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors
Parallelization of the Symmetric Indefinite Factorization
Parallelization of the x264 encoder using OpenCL
Parallelization of Weighted Sequence Comparison by using EBWT
Parallelization Research of Circle Detection Based on Hough Transform
Parallelization Strategies for Ant Colony Optimisation on GPUs
Parallelization Strategies for Local Search Algorithms on Graphics Processing Units
Parallelization Strategies of the Canny Edge Detector for Multi-core CPUs and Many-core GPUs
Parallelization techniques of the x264 video encoder
Parallelization the Job-shop Problem on Distributed and Shared Memory Architectures
Parallelization with Different API on Multicore Architecture
Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis
Parallelize L-BFGS-B on the GPU
Parallelized agent-based simulation on CPU and graphics hardware for spatial and stochastic models in biology
Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU
Parallelized generation of photon texture and real-time rendering on GPU
Parallelized Hierarchical Expected Matching Probability for Multiple Sequence Alignment
Parallelized Incomplete Poisson Preconditioner in Cloth Simulation
Parallelized Kendall's Tau Coefficient Computation via SIMD Vectorized Sorting On Many-Integrated-Core Processors
Parallelized Local Volatility Estimation Using GP-GPU Hardware Acceleration
Parallelized Physical Optics computations for Scattering Center Models in radio channel simulations
Parallelized Seeded Region Growing using CUDA
Parallelized Segmentation of CT-Angiography datasets using CUDA
Parallelized Vlasov-Fokker-Planck solver for desktop personal computers
Parallelizing a high-order WENO scheme for complicated flow structures on GPU and MIC
Parallelizing AES on multicores and GPUs
Parallelizing Alternating Direction Implicit Solver on GPUs
Parallelizing compiler framework and API for power reduction and software productivity of real-time heterogeneous multicores
Parallelizing Exact and Approximate String Matching via Inclusive Scan on a GPU
Parallelizing flow-accumulation calculations on graphics processing units - From iterative DEM preprocessing algorithm to recursive multiple-flow-direction algorithm
Parallelizing FPGA Technology Mapping Using Graphics Processing Units (GPUs)
Parallelizing fuzzy rule generation using GPGPU
Parallelizing General Histogram Application for CUDA Architectures
Parallelizing Kernel Polynomial Method Applying Graphics Processing Units
Parallelizing LINQ Program for GPGPU
Parallelizing Map Projection of Raster Data on Multi-core CPU and GPU Parallel Programming Frameworks
Parallelizing Motion JPEG 2000 with CUDA
Parallelizing Multicore Cache Simulations using Heterogeneous Computing on General Purpose and Graphics Processors
Parallelizing Multiple Flow Accumulation Algorithm using CUDA and OpenACC
Parallelizing of digital signal processing with using GPU
Parallelizing Peptide-Spectrum scoring using modern graphics processing units
Parallelizing Simulated Annealing-Based Placement Using GPGPU
Parallelizing the cellular Potts model on GPU and multi-core CPU: An OpenCL cross-platform study
Parallelizing the Cellular Potts Model on graphics processing units
Parallelizing the Edge application for GPU-based systems using the SkePU skeleton programming library
Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics
Parallelizing Word2Vec in Multi-Core and Many-Core Architectures
Parallelizing Word2Vec in Shared and Distributed Memory
ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels
Parameter Selection and Pre-Conditioning for a Graph Form Solver
Parameter Tuning of a Hybrid Treecode-FMM on GPUs
Parameterized Verification of GPU Kernel Programs
Parametric Flows: Automated Behavior Equivalencing for Symbolic Analysis of Races in CUDA Programs
Parametric GPU Code Generation for Affine Loop Programs
Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing
ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks
PARIS: A Parallel RSA-Prime Inspection Tool
Parle: parallelizing stochastic gradient descent
ParPaRaw: Massively Parallel Parsing of Delimiter-Separated Raw Data
PARRAY: A Unifying Array Representation for Heterogeneous Parallelism
Parsing in Parallel on Multiple Cores and GPUs
Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network
ParTeCL: parallel testing using OpenCL
Partial Demosaicing for Stereo Matching of CFA Images on GPU and CPU
Partial Parallelization of the Successive Projections Algorithm using Compute Unified Device Architecture
Partial Volume Effect Correction using Anisotropic Backward Diffusion
Partial wave analysis at BES III harnessing the power of GPUs
Partial Wave Analysis using Graphics Cards
PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs
Particle and texture based spatiotemporal visualization of time-dependent vector fields
Particle filter on GPUs for real-time tracking
Particle filtering with rendered models: A two pass approach to multi-object 3D tracking with the GPU
Particle Filters on Multi-Core Processors
Particle Level Set Advection for the Interactive Visualization of Unsteady 3D Flow
Particle method on GPU
Particle Simulation on a GPU with PyCUDA
Particle Swarm Optimization of Model Parameters: Simulation of Deep Reactive Ion Etching by the Continuous Cellular Automaton
Particle-Based Fluid Simulation on the GPU
Particle-Based Multiple Irregular Volume Rendering on CUDA
Particle-based Visualization of Large Cosmological Datasets
Particle-based volume rendering
Particle-in-cell algorithms for plasma simulations on heterogeneous architectures
Particle-in-Cell Laser-Plasma Simulation on Xeon Phi Coprocessors
Particle-in-cell Simulations with Charge-Conserving Current Deposition on Graphic Processing Units
Partitioned Memory Parallel Programming Framework
Partitioning Large Scale Deep Belief Networks Using Dropout
Partitioning streaming parallelism for multi-cores: a machine learning based approach
Pass a Pointer: Exploring Shared Virtual Memory Abstractions in OpenCL Tools for FPGAs
PASSATA - Object oriented numerical simulation software for adaptive optics
Passive-Active Geometric Calibration for View-Dependent Projections onto Arbitrary Surfaces
Password Cracking in the Cloud
Password recovery for encrypted ZIP archives using GPUs
Password Recovery for RAR Files Using CUDA
Password Recovery Using MPI and CUDA
Patch-Based Image Vectorization with Automatic Curvilinear Feature Alignment
Path Integral Approaches and Graphics Processing Unit Tools for Quantum Molecular Dynamics Simulations
Pathological Image Analysis Using the GPU: Stroma Classification for Neuroblastoma
Pathological image segmentation for neuroblastoma using the GPU
Patient-Specific Non-Linear Finite Element Modelling for Predicting Soft Organ Deformation in Real-Time; Application to Non-Rigid Neuroimage Registration
Pattern Matching in OpenCL: GPU vs CPU Energy Consumption on Two Mobile Chipsets
Pattern Recognition with Embedded Systems Technology: A Survey
Pattern Recognition with OpenCL Heterogeneous Platform
Pattern-based Programming Abstractions for Heterogeneous Parallel Computing
Patterns and Rewrite Rules for Systematic Code Generation (From High-Level Functional Patterns to High-Performance OpenCL Code)
Patterns of Inefficient Performance Behavior in GPU Applications
PATUS: A Code Generation and Auto-Tuning Framework For Parallel Stencil Computations
PATUS: A Code Generation and Autotuning Framework For Parallel Iterative Stencil Computations on Modern Microarchitectures
PCIeHLS: an OpenCL HLS framework
PConG: A novel platform available for pervasive computing based on GPU
PDAWL: Profile-based Iterative Dynamic Adaptive WorkLoad Balance on Heterogeneous Architectures
PEAK: A Performance Engineering AI-Assistant for GPU Kernels Powered by Natural Language Transformations
Pedestrian Detection at Warp Speed: Exceeding 500 Detections per Second
Pedestrian detection system based on stereo vision for mobile robot
Pegasus: coordinated scheduling for virtualized accelerator-based systems
PENCIL: A Platform-Neutral Compute Intermediate Language for Accelerator Programming
People detection method using graphics processing units for a mobile robot with an omnidirectional camera
PEPPHER: Efficient and Productive Usage of Hybrid Computing Systems
PEPSC: A Power-Efficient Processor for Scientific Computing
PErasure: a Parallel Cauchy Reed-Solomon Coding Library for GPUs
Perception of Acoustical Spatial Attributes and Impression in Virtually Rendered Sound Field
Perception-aware Depth Cueing for Illustrative Vascular Visualization
Perceptual enhancement of two-level volume rendering
Perceptually Optimized Real-Time Computer Graphics
PERCH 2.0: Fast and Accurate GPU-based Perception via Search for Object Pose Estimation
Percolation study of samples on 2D lattices using GPUs
perf4sight: A toolflow to model CNN training performance on Edge GPUs
Perfect Hashing Structures for Parallel Similarity Searches
Perfect spatial hashing
PerforatedCNNs: Acceleration through Elimination of Redundant Convolutions
Performance Acceleration of Kernel Polynomial Method Applying Graphics Processing Units
Performance Analysis and Automatic Tuning of Hash Aggregation on GPUs
Performance Analysis and Benchmarking of the Intel SCC
Performance Analysis and Efficient Execution on Systems with multi-core CPUs, GPUs and MICs
Performance Analysis and Improvement of Parallel Differential Evolution
Performance Analysis and Optimisation of the OP2 Framework on Many-core Architectures
Performance analysis and optimization of a CFD application
Performance Analysis and Optimization of a Distributed Processing Framework for Data Mining Accelerated with Graphics Processing Units
Performance Analysis and Optimization of Hermite Methods on NVIDIA GPUs Using CUDA
Performance analysis and optimization of highly diverging algorithms on GPUs
Performance analysis and optimization of the OP2 framework on many-core architectures
Performance analysis and optimization of three-dimensional FDTD on GPU using roofline model
Performance Analysis and Optimization Opportunities for NVIDIA Automotive GPUs
Performance analysis and optimization strategies for a D3Q19 lattice Boltzmann kernel on nVIDIA GPUs using CUDA
Performance Analysis and Tuning For: General-Purpose Graphics Processing Units (GPGPU)
Performance Analysis Cluster and GPU Computing Environment on Molecular Dynamic Simulation of BRV-1 and REM2 with GROMACS
Performance Analysis for GPU-based Ray-triangle Algorithms
Performance analysis of a 240 thread tournament level MCTS Go program on the Intel Xeon Phi
Performance Analysis of a High-level Abstractions-based Hydrocode on Future Computing Systems
Performance Analysis of a Hybrid MPI/CUDA Implementation of the NAS-LU Benchmark
Performance analysis of a hybrid MPI/CUDA implementation of the NASLU benchmark
Performance Analysis of a Large Memory Application on Multiple Architectures
Performance Analysis of a New Real-Time Elastographic Time Constant Estimator
Performance Analysis of a Novel GPU Computation-to-core Mapping Scheme for Robust Facet Image Modeling
Performance Analysis of a Particle-in-Cell Plasma Physics Code on Homogeneous and Heterogeneous HPC Systems
Performance Analysis of a Stereo Matching Implementation in OpenCL
Performance Analysis of a Symmetric Cryptographic Algorithm on Multicore Architectures
Performance Analysis of a Symmetric Cryptography Algorithm on GPU and GPU Cluster
Performance analysis of accelerated image registration using GPGPU
Performance Analysis of an Astrophysical Simulation Code on the Intel Xeon Phi Architecture
Performance Analysis of an Ultrasound Reconstruction Algorithm for Non Destructive Testing
Performance Analysis of CUDA and OpenCL By Implementation of Cryptographic Algorithms
Performance Analysis of Deep Learning Workloads on Leading-edge Systems
Performance Analysis of General-Purpose Computation on Commodity Graphics Hardware: A Case Study Using Bioinformatics
Performance analysis of GPGPU and CPU On AES Encryption
Performance Analysis of GPU Accelerators with Realizable Utilization of Computational Density
Performance Analysis of GPU compared to Single-core and Multi-core CPU for Natural Language Applications
Performance Analysis of GPU-Accelerated Filter-Based Source Finding for HI Spectral Line Image Data
Performance Analysis of GPU-based SAR and Interferometric SAR image processing
Performance Analysis of IBM Cell Broadband Engine on Sequence Alignment
Performance Analysis of Join Algorithms on GPUs
Performance Analysis of kNN on large datasets using CUDA & Pthreads
Performance analysis of matrix-free conjugate gradient kernels using SYCL
Performance analysis of memory transfers and GEMM subroutines on NVIDIA Tesla GPU cluster
Performance analysis of multi-core CPUs and GPU computing on SF-FDTD scheme for third order nonlinear materials and periodic media
Performance Analysis of Open Source Machine Learning Frameworks for Various Parameters in Single-Threaded and Multi-Threaded Modes
Performance analysis of parallel gravitational N-body codes on large GPU cluster
Performance Analysis of Parallel Sorting Algorithms using GPU Computing
Performance Analysis of Roberts Edge Detection Using CUDA and OpenGL
Performance analysis of single-phase, multiphase, and multicomponent lattice-Boltzmann fluid flow simulations on GPU clusters
Performance Analysis of Sobel Edge Detection Filter on GPU using CUDA & OpenGL
Performance Analysis of Sobel Edge Filter on Heterogeneous System Using OpenCL
Performance Analysis of Sparse Matrix-Vector Multiplication (SpMV) on Graphics Processing Units (GPUs)
Performance analysis of SSE instructions in multi-core CPUs and GPU computing on FDTD scheme for solid and fluid vibration problems
Performance Analysis of the OP2 Framework on Many-core Architectures
Performance Analysis on Energy Efficient High-Performance Architectures
Performance Analysis on Several GPU Architectures of an Algorithm for Noise Removal
Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations
Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations (Part 2: Double Precision GPUs)
Performance and accuracy of Lattice-Boltzmann kernels on multi- and manycore architectures
Performance and Efficiency Analysis of Modern Accelerators: Fine-Grained Parallelism on the Intel Xeon Phi
Performance and energy footprint assessment of FPGAs and GPUs on HPC systems using Astrophysics application
Performance and energy optimization of the iterative solution of sparse linear systems on multicore processors
Performance and numerical accuracy evaluation of heterogeneous multicore systems for Krylov orthogonal basis computation
Performance and Numerical Aspects of Decompositional Factorizations with FP64 Floating-Point Emulation in INT8
Performance and Portability of Accelerated Lattice Boltzmann Applications with OpenACC
Performance and Power Analysis of ATI GPU: A Statistical Approach
Performance and Power Comparisons Between Fermi and Cypress GPUs
Performance and Power Comparisons Between Nvidia and ATI GPUs
Performance and power consumption investigation for execution of integer operations on CPU and GPU processors for multimedia applications
Performance and Power Efficiency Analysis of the Symmetric Cryptograph on Two Stream Processor Architectures
Performance and Power Evaluation of AI Accelerators for Training Deep Learning Models
Performance and Power Optimization of GPU Architectures for General-purpose Computing
Performance and Productivity of Parallel Python Programming: A study with a CFD Test Case
Performance and Quality of Random Number Generators
Performance and scalability of Fourier domain optical coherence tomography acceleration using graphics processing units
Performance and Scalability of GPU-Based Convolutional Neural Networks
Performance Assessment of A Multi-block Incompressible Navier-Stokes Solver using Directive-based GPU Programming in a Cluster Environment
Performance assessment of CUDA and OpenACC in large scale combustion simulations
Performance Assessment of OpenMP Compilers Targeting NVIDIA V100 GPUs
Performance Assessment of using OpenCL on FPGA Systems for ODE Solvers
Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs
Performance benchmarking of deep learning framework on Intel Xeon Phi
Performance Characterization and Optimization of Atomic Operations on AMD GPUs
Performance characterization of data-intensive kernels on AMD Fusion architectures
Performance Characterization of Multi-threaded Graph Processing Applications on Intel Many-Integrated-Core Architecture
Performance Comparison Between Cg-based and CUDA-based Matrix Multiplications
Performance Comparison for Neuroscience Application Benchmarks
Performance comparison of CFD-DEM solver MFiX-Exa, on GPUs and CPUs
Performance Comparison of Cholesky Decomposition on GPUs and FPGAs
Performance Comparison of Different OpenCL Implementations of LBM Simulation on Commodity Computer Hardware
Performance comparison of FPGA, GPU and CPU in image processing
Performance comparison of Gauss-Jordan elimination method using OpenMP and CUDA
Performance comparison of GPU and FPGA architectures for the SVM training problem
Performance Comparison of GPU, DSP and FPGA implementations of image processing and computer vision algorithms in embedded systems
Performance Comparison of GPUs with a Genetic Algorithm based on CUDA
Performance Comparison of Graphics Processors to Reconfigurable Logic: A Case Study
Performance comparison of Lattice Boltzmann fluid flow simulation using OpenCL and CUDA frameworks
Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors
Performance Comparison with OpenMP Parallelization for Multi-core Systems
Performance Considerations When Using a Dedicated Ray Traversal Engine
Performance Counters based Power Modeling of Mobile GPUs using Deep Learning
Performance Debugging Frameworks for FPGA High-Level Synthesis
Performance Debugging of GPGPU Applications with the Divergence Map
Performance Degradation Analysis of GPU Kernels
Performance Drawbacks for Matrix Multiplication using Set Associative Cache in GPU devices
Performance Efficient DNA Sequence Detection on GPU Using Parallel Pattern Matching Approach
Performance Engineering for a Medical Imaging Application on the Intel Xeon Phi Accelerator
Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs
Performance engineering for the Lattice Boltzmann method on GPGPUs: Architectural requirements and performance results
Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
Performance enhancement of MAGIC FDTD-PIC plasma-wave simulations using GPU processing
Performance Evaluation and Analysis of Sparse Matrix and Graph Kernels on Heterogeneous Processors
Performance Evaluation and Improvements of the PoCL Open-Source OpenCL Implementation on Intel CPUs
Performance Evaluation and Optimization of HPCG benchmark on CPU + MIC platform
Performance evaluation and optimization of random memory access on multicores with high productivity
Performance Evaluation and Tuning of An OpenCL based Matrix Multiplier
Performance Evaluation of Advanced Features in CUDA Unified Memory
Performance Evaluation of Blocking and NonBlocking Concurrent Queues on GPUs
Performance Evaluation of Concurrent Lock-free Data Structures on GPUs
Performance Evaluation of Container-based Virtualization for High Performance Computing Environments
Performance Evaluation of CPU-GPU communication Depending on the Characteristic of Co-Located Workloads
Performance evaluation of CUDA programming for machining simulation
Performance evaluation of deep learning on smartphones
Performance Evaluation of Deep Learning Tools in Docker Containers
Performance Evaluation of Discrete Wavelet Transform Based on Image Compression Technique on Both CPU and GPU
Performance Evaluation of Edge Detection Techniques on GPU Using OpenCL
Performance Evaluation of Feature Extraction Algorithm on GPGPU
Performance evaluation of GPU memory hierarchy using the FFT
Performance evaluation of H.264/AVC decoding and visualization using the GPU
Performance Evaluation of Heterogeneous GPU Programming Frameworks for Hemodynamic Simulations
Performance evaluation of image processing algorithms on the GPU
Performance Evaluation of Intel Xeon Phi Coprocessor using XKaapi
Performance Evaluation of Mixed Precision Algorithms for Solving Sparse Linear Systems
Performance Evaluation of OpenMP's Target Construct on GPUs - Exploring Compiler Optimizations
Performance Evaluation of OpenMP's Target Construct on GPUs: Exploring Compiler Optimizations
Performance Evaluation of Optimized Implementations of Finite Difference Method for Wave Propagation Problems on GPU Architecture
Performance Evaluation of Parallel AES Implementations over CUDA GPU Framework
Performance Evaluation of Parallel Count Sort using GPU Computing with CUDA
Performance Evaluation of Particle Swarm Optimization Algorithms on GPU Using CUDA
Performance Evaluation of Python Parallel Programming Models: Charm4Py and mpi4py
Performance Evaluation of Query Processing Algorithms on GPGPUs
Performance Evaluation of Quicksort with GPU Dynamic Parallelism for Gene-Expression Quantile Normalization
Performance Evaluation of R with Intel Xeon Phi Coprocessor
Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi
Performance Evaluation of the Intel Many Integrated Core Architecture for 3D Image Reconstruction in Computed Tomography
Performance evaluation of the multi-device OpenCL FDTD solver
Performance Evaluation of the NVIDIA GeForce 8800 GTX GPU for Machine Learning
Performance Evaluation of the Ocean-Land-Atmosphere Model Using Graphics Processing Units
Performance Evaluations of Document-Oriented Databases using GPU and Cache Structure
Performance Evaluations of Graph Database using CUDA and OpenMP-Compatible Libraries
Performance Exploration of Selected Manually and Automatically Parallelized Codes on GPUs
Performance Gains in Conjugate Gradient Computation with Linearly Connected GPU Multiprocessors
Performance Impact of Data Layout on the GPU-accelerated IDW Interpolation
Performance impact of dynamic parallelism on different clustering algorithms
Performance Impact of Memory Channels on Sparse and Irregular Algorithms
Performance Improvement of Data Mining in Weka through GPU Acceleration
Performance Improvement of Multichannel Audio by Graphics Processing Units
Performance Improvement of Optical Algorithms on Multicore Platforms
Performance Improvement of TOUGH2 Simulation with Graphics Processing Unit
Performance improvements for iterative electron tomography reconstruction using graphics processing units (GPUs)
Performance improvements of real-time crowd simulations
Performance in GPU Architectures: Potentials and Distances
Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs
Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs
Performance modeling of atomic additions on GPU scratchpad memory
Performance Modeling, Optimization, and Characterization on Heterogeneous Architectures
Performance Modelling and Traffic Characterisation of Optical Networks
Performance Modelling of Deep Learning on Intel Many Integrated Core Architectures
Performance models for CPU-GPU data transfers
Performance models for CUDA streams on NVIDIA GeForce series
Performance Models for Heterogeneous Iterative Programs
Performance Monitoring of Multi-FPGA Systems
Performance of a code migration for the simulation of supersonic ejector flow to SMP, MIC and GPU using OpenMP, OpenMP+LEO, and OpenACC directives
Performance of a GPU-based Direct Summation Algorithm for Computation of Small Angle Scattering Profile
Performance of Confidential Computing GPUs
Performance of CPU and GPU HPC Architectures for off-design aircraft simulation
Performance of FORTRAN and C GPU Extensions for a Benchmark Suite of Fourier Pseudospectral Algorithms
Performance of GPU for Pricing Financial Derivatives: Convertible Bonds
Performance of GTX Titan X GPUs and Code Optimization
Performance of Implicit Solver Strategies on GPUs
Performance of inverse atomistic scale fracture modeling on GPGPU architectures
Performance of Kepler GTX Titan GPUs and Xeon Phi System
Performance of OpenCL
Performance of Optical Flow Techniques on Graphics Hardware
Performance of PETSc GPU Implementation with Sparse Matrix Storage Schemes
Performance Optimisation of Smoothed Particle Hydrodynamics Algorithms for Multi/Many-Core Architectures
Performance Optimisations for Heterogeneous Managed Runtime Systems
Performance Optimization of 3-D Lattice Boltzmann Flow Solver on a GPU
Performance Optimization of Clustering On GPU
Performance Optimization of Deep Learning Sparse Matrix Kernels on Intel Max Series GPU
Performance Optimization of GPU ELF-Codes
Performance Optimization of Memory Intensive Applications on FPGA Accelerator
Performance Optimization of Vision Apps on Mobile Application Processor
Performance Optimization using Multimodal Modeling and Heterogeneous GNN
Performance Optimization Using Partitioned SpMV on GPUs and Multicore CPUs
Performance optimizations for scalable CFD applications on hybrid CPU+MIC heterogeneous computing system with millions of cores
Performance portability analysis of SYCL with a classical CG on CPU, GPU, and FPGA
Performance Portability and Evaluation of Heterogeneous Components of SeisSol Targeted to Upcoming Intel HPC GPUs
Performance Portability Challenges for Fortran Applications
Performance Portability Evaluation for OpenACC on Intel Knights Corner and Nvidia Kepler
Performance portability evaluation of blocked stencil computations on GPUs
Performance Portability in Accelerated Parallel Kernels
Performance Portability of a GPU Enabled Factorization with the DAGuE Framework
Performance Portability of the Aeras Atmosphere Model to Next Generation Architectures using Kokkos
Performance Portability Strategies for Computational Fluid Dynamics (CFD) Applications on HPC Systems
Performance portability study of epistasis detection using SYCL on NVIDIA GPU
Performance Portability Study of Linear Algebra Kernels in OpenCL
Performance portability through machine learning guided kernel selection in SYCL libraries
Performance portability via C++ PSTL, SYCL, OpenMP, and HIP: the Gaia AVU-GSR case study
Performance Portability with the Chapel Language
Performance Portable GPU Code Generation for Matrix Multiplication
Performance Portable Gradient Computations Using Source Transformation
Performance Portable Monte Carlo Particle Transport on Intel, NVIDIA, and AMD GPUs
Performance potential for simulating spin models on GPU
Performance prediction of deep learning applications training in GPU as a service systems
Performance Predictions for General-Purpose Computation on GPUs
Performance study of filtered back-projection algorithms implemented on GPUs
Performance study of interference on GPU and CPU resources with multiple applications
Performance Study of LU Decomposition on the Programmable GPU
Performance study of mapping irregular computations on GPUs
Performance Study of Satellite Image Processing on Graphics Processors Unit Using CUDA
Performance study of using the Direct Compute API for implementing Support vector machines on GPUs
Performance study on GPU offloading techniques using the Gauss matrix inverse algorithm
Performance Testing of GPU-Based Approximate Matching Algorithm on Network Traffic
Performance Tradeoff Spectrum of Integer and Floating Point Applications
Performance Tradeoff Spectrum of Integer and Floating Point Applications Kernels on Various GPUs
Performance Traps in OpenCL for CPUs
Performance Tuning for CUDA-Accelerated Neighborhood Denoising Filters
Performance Tuning for GPU-Embedded Systems: Machine-Learning-based and Analytical Model-driven Tuning Methodologies
Performance Upper Bound Analysis and Optimization of SGEMM on Fermi and Kepler GPUs
Performance-Analysis-Based Acceleration of Image Quality Assessment
Performance-aware component composition for GPU-based systems
Performance-Correctness Challenges in Emerging Heterogeneous Multicore Processors
Performance-efficient mechanisms for managing irregularity in throughput processors
Performance-Oriented Neural Architecture Search
Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond
Performance/power assessment of CNN packages on embedded automotive platforms
Performant Automatic BLAS Offloading on Unified Memory Architecture with OpenMP First-Touch Style Data Movement
Performant low-order matrix-free finite element kernels on GPU architectures
Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision
Performing DCT8x8 Computation on GPU Using NVIDIA CUDA Technology
Performing efficient NURBS modeling operations on the GPU
Performing with CUDA
PeriPy - A High Performance OpenCL Peridynamics Package
permGPU: Using graphics processing units in RNA microarray association studies
Permutation Index and GPU to Solve efficiently Many Queries
Persistent Kernels for Iterative Memory-bound GPU Applications
Persistent RNNs: Stashing Recurrent Weights On-Chip
Perturbation Functions in Computer Graphics
Petaflop biofluidics simulations on a two million-core system
Petascale Application of a Coupled CPU-GPU Algorithm for Simulation and Analysis of Multiphase Flow Solutions in Porous Medium Systems
Petascale computations for Large-scale Atomic and Molecular collisions
Petascale Direct Numerical Simulation of Blood Flow on 200K Cores and Heterogeneous Architectures
Petascale elliptic solvers for anisotropic PDEs on GPU clusters
Petascale turbulence simulation using a highly parallel fast multipole method
Petascale visualization: Approaches and initial results
PFAC Library: GPU-based string matching algorithm
PFunc: modern task parallelism for modern high performance computing
PG-PuReMD: A Parallel-GPU Reactive Molecular Dynamics Package
PGEM: Preemptive GPGPU Execution Model for Runtime Engines
Pgx: Hardware-accelerated parallel game simulation for reinforcement learning
Phase Aware Memory Scheduling
Phase Based Volume Registration on the GPU with Application to Quantitative MRI
Phase Based Volume Registration Using CUDA
Phase diagram and critical behavior of the square-lattice Ising model with competing nearest- and next-nearest-neighbor interactions
Phase Transition in 3d Heisenberg Spin Glasses with Strong Random Anisotropies, through a Multi-GPU Parallelization
phiGEMM: a CPU-GPU library for porting Quantum ESPRESSO on hybrid systems
Phoenix: A Runtime Environment for High Performance Computing on Chip Multiprocessors
Photon mapping on programmable graphics hardware
Physical and graphical effects in OpenCL by example
Physical modeling and high-performance GPU computing for characterization, interception, and disruption of hazardous near-Earth objects
Physically Based Rendering: Implementation of Path Tracer
Physically-Based Interactive Flow Visualization Based on Schlieren and Interferometry Experimental Techniques
Physically-based interactive schlieren flow visualization
Physically-based painting style 3D image synthesis using GPU
Physically-Based Sound Synthesis on GPUs
Physically-based visual simulation on graphics hardware
Physics and Computing Performance of the Exa.TrkX TrackML Pipeline
Physis: An Implicitly Parallel Programming Model for Stencil Computations on Large-Scale GPU-Accelerated Supercomputers
PhysProver: Advancing Automatic Theorem Proving for Physics
Piccolo: building fast, distributed programs with partitioned tables
PIConGPU: A Fully Relativistic Particle-in-Cell Code for a GPU Cluster
PIConGPU: Predictive Simulations of Laser-Particle Accelerators with Manycore Hardware
Piecewise Tri-linear Contouring for Multi-material Volumes
PIGEON: Optimizing CUDA Code Generator for End-to-End Training and Inference of Relational Graph Neural Networks
Piko: A Design Framework for Programmable Graphics Pipelines
PILC: Practical Image Lossless Compression with an End-to-end GPU Oriented Neural Framework
PipeCNN: An OpenCL-Based FPGA Accelerator for Large-Scale Convolution Neuron Networks
Pipeline strategies to accelerate range query processing on a multi-GPU environment
Pipelined Iterative Solvers with Kernel Fusion for Graphics Processing Units
Pipelined MapReduce: A Decoupled MapReduce RunTime for Shared Memory Multi-Processors
Pipelined Training with Stale Weights of Deep Convolutional Neural Networks
Pipelining the Fast Multipole Method over a Runtime System
PIPS Is not (just) Polyhedral Software
PIR: PMaC's Idiom Recognizer
PISTON: A Portable Cross-Platform Framework for Data-Parallel Visualization Operators
Pixel-Exact Rendering of Spacetime Finite Element Solutions
PixelPie: Maximal Poisson-disk Sampling with Rasterization
Places205-VGGNet Models for Scene Recognition
Planetary-Scale Terrain Composition
Plant Leaf Modeling and Rendering Based-On GPU
Plasma Visualization in Parallel using Particle Systems on Graphical Processing Units
Platform 2012, a Many-Core Computing Accelerator for Embedded SoCs: Performance Evaluation of Visual Analytics Applications
Platform Characterization for Domain-Specific Computing
Platform-independent parallelization of the Lattice Boltzmann method with OpenCL
Platform-Specific Optimization and Mapping of Stencil Codes through Refinement
Playdoh: A lightweight Python library for distributed computing and optimisation
PLB-HeC: A Profile-based Load-Balancing algorithm for Heterogeneous CPU-GPU Clusters
Plenoptic Rendering With Interactive Performance Using GPUs
PlinkGPU: A Framework for GPU Acceleration of Whole Genome Data Analysis
PM4Py-GPU: a High-Performance General-Purpose Library for Process Mining
PMT: Power Measurement Toolkit
PNG1 triangles for tangent plane continuous surfaces on the GPU
PoCL-R: A Scalable Low Latency Distributed OpenCL Runtime
PoCL-R: An Open Standard Based Offloading Layer for Heterogeneous Multi-Access Edge Computing with Server Side Scalability
pocl: A Performance-Portable OpenCL Implementation
Point Based Approximate Color Bleeding With Cuda
Point Based Color Bleeding with CUDA and Caching
Point Rendering in CUDA Path Tracer
Point Spread Function Estimation of Solar Surface Images with a Cooperative Particle Swarm Optimization on GPUs
Point to Line Mappings and Other Line Parameterizations not only for Hough Transform
Point to point processing of digital images using parallel computing
Point-wise Adaptive Filtering for Fast Monte Carlo Noise Reduction
Pointer Analysis for Semi-Automatic Code Parallelizers
Poisson-Boltzmann model for protein-surface electrostatic interactions and grid-convergence study using the PyGBe code
Policy-based Tuning for Performance Portability and Library Co-optimization
Polly - Polyhedral optimization in LLVM
Polly-ACC: Transparent compilation to heterogeneous hardware
Polyconvexification of the multi-label optical flow problem
Polymer Field-Theory Simulations on Graphics Processing Units
POMPEI: Programming with OpenMP4 for Exascale Investigations
PONDER - A Real time software backend for pulsar and IPS observations at the Ooty Radio Telescope
PopSparse: Accelerated block sparse matrix multiplication on IPU
Population Parallel GP on the G80 GPU
Porous Rock Simulations and Lattice Boltzmann on GPUs
Portability and Scalability of OpenMP Offloading on State-of-the-art Accelerators
Portability of Fortran's 'do concurrent' on GPUs
Portable and Performant GPU/Heterogeneous Asynchronous Many-Task Runtime System
Portable and Transparent Software Managed Scheduling on Accelerators for Fair Resource Sharing
Portable C++ Code that can Look and Feel Like Fortran Code with Yet Another Kernel Launcher (YAKL)
Portable GPU-Based Artificial Neural Networks for Accelerated Data-Driven Modeling
Portable high-order finite element kernels I: Streaming Operations
Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi
Portable Mapping of Data Parallel Programs to OpenCL for Heterogeneous Systems
Portable OpenCL Out-of-Order Execution Framework for Heterogeneous Platforms
Portable Parallel Kernels for High-Speed Beamforming in Synthetic Aperture Ultrasound Imaging
Portable parallelized blowfish via RenderScript
Portable Performance on Heterogeneous Architectures
Portable Programming Models for Heterogeneous Platforms
Portable Real-Time DCT Based Steganography Using OpenCL
Portable, high-performance containers for HPC
Portable, Scalable Approaches for Improving Asynchronous Many-Task Runtime Node Use
Portage: Bringing Hackers' Wisdom to Science
Porting a high-order finite-element earthquake modeling application to NVIDIA graphics cards using CUDA
Porting a sparse linear algebra math library to Intel GPUs
Porting and optimizing MAGFLOW on CUDA
Porting Batched Iterative Solvers onto Intel GPUs with SYCL
Porting estimation of distribution algorithms to the cell broadband engine
Porting FEASTFLOW to the Intel Xeon Phi: Lessons Learned
Porting HPC Applications to AMD Instinct MI300A Using Unified Memory and OpenMP
Porting Large HPC Applications to GPU Clusters: The Codes GENE and VERTEX
Porting marine ecosystem model spin-up using transport matrices to GPUs
Porting NAHUJ to CUDA
Porting numerical integration codes from CUDA to oneAPI: a case study
Porting of an Edge-Based CFD Solver to GPUs
Porting OpenACC to OpenMP on heterogeneous systems
Porting to the Intel Xeon Phi: Opportunities and Challenges
Porting tree-based hash table compression to GPGPU model checking
Poseidon: A System Architecture for Efficient GPU-based Deep Learning on Multiple Machines
Position-Dependent Arrays and Their Application for High Performance Code Generation
Possible planet-forming regions on submillimetre images
Poster: CUDA-Accelerated Continuous 2D Scatterplots
Poster: GPU-accelerated artificial neural network for QSAR modeling
Poster: GPU-accelerated rigid body fitting of atomic structures into electron density maps
Potential contribution of CNN-based solving of stiff ODEs and PDEs to enabling real-time Computational Engineering
Potential Energy Landscapes for the 2D XY Model: Minima, Transition States and Pathways
Potential of General Purpose Graphic Processing Unit for Energy Management System
Power analysis and optimizations for GPU architecture using a power simulator
Power analysis of sorting algorithms on FPGA using OpenCL
Power and Performance Analysis of GPU-Accelerated Systems
Power and Performance Characterization of Computational Kernels on the GPU
Power and Performance Studies of the Explicit Multi-Threading (XMT) Architecture
Power Consumption Modeling and Prediction in a Hybrid CPU-GPU-MIC Supercomputer
Power Consumption of GPUs from a Software Perspective
Power consumption of mixed precision in the iterative solution of sparse linear systems
Power Control for GPU Clusters in processing large-scale streams
Power Efficient Large Matrices Multiplication by Load Scheduling on Multi-core and GPU Platform with CUDA
Power Flow Analysis on CUDA-based GPU
Power Management and Optimization
Power Management for GPU-CPU Heterogeneous Systems
Power Management Techniques for Data Centers: A Survey
Power Modeling and Optimization for GPGPUs
Power performance analysis of 3-D finite element mesh refinement with tetrahedra by CUDA/MPI on multi-core and GPU architecture
Power Profiling and Optimization for Heterogeneous Multi-Core Systems
Power Profiling of GeMTC Many Task Computing
Power-aware Performance of Mixed Precision Linear Solvers for FPGAs and GPGPUs
Power-Efficient Accelerators for High-Performance Applications
Power-efficient medical image processing using PUMA
Power-Efficient Time-Sensitive Mapping in Heterogeneous Systems
Power-Efficient Work Distribution Method for CPU-GPU Heterogeneous System
Power-performance comparison of single-task driven many-cores
Power, Energy and Speed of Embedded and Server Multi-Cores applied to Distributed Simulation of Spiking Neural Networks: ARM in NVIDIA Tegra vs Intel Xeon quad-cores
PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusion
Practical Algorithms for Finding Extremal Sets
Practical and Robust Stenciled Shadow Volumes for Hardware-Accelerated Rendering
Practical and Theoretical Aspects of a Parallel Twig Join Algorithm for XML Processing using a GPGPU
Practical CFD Simulations on Programmable Graphics Hardware using SMAC
Practical considerations for GPU-accelerated CT
Practical craniofacial surgery simulator based on GPU accelerated lattice shape matching
Practical examples of GPU computing optimization principles
Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs
Practical Implementation of Lattice QCD Simulation on Intel Xeon Phi Knights Landing
Practical logarithmic rasterization for low-error shadow maps
Practical parallel imaging compressed sensing MRI: Summary of two years of experience in accelerating body MRI of pediatric patients
Practical Patient-Specific Cardiac Blood Flow Simulations Using SPH
Practical Pre-stack Kirchhoff Time Migration of Seismic Processing on General Purpose GPU
Practical Random Linear Network Coding on GPUs
Practical Symbolic Execution Analysis and Methodology for GPU Programs
Practical Symbolic Race Checking of GPU Programs
Practical Symmetric Key Cryptography on Modern Graphics Hardware
Practically efficient methods for performing bit-reversed permutation in C++11 on the x86-64 architecture
Pragma Directed Shared Memory Centric Optimizations on GPUs
PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization
PRAND: GPU accelerated parallel random number generation library: Using most reliable algorithms and applying parallelism of modern GPUs and CPUs
Pre-Training LLMs on a budget: A comparison of three optimizers
Precise dynamic analysis for slack elasticity: adding buffering without adding bugs
Precise Energy Consumption Measurements of Heterogeneous Artificial Intelligence Workloads
Precision and Performance Analysis of C Standard Math Library Functions on GPUs
Precision and Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs
Precision-Aware Soft Error Protection for GPUs
Precomputed Atmospheric Scattering
Precomputed compressive sensing for light transport acquisition
Precomputed Visibility Cuts for Interactive Relighting with Dynamic BRDFs
Preconditioned conjugate gradient solver for structural problems
Predictable GPGPU Computing in DNN-Driven Autonomous Systems
Predicting GPUDirect Benefits for HPC Workloads
Predicting NVIDIA's Next-Day Stock Price: A Comparative Analysis of LSTM, MLP, ARIMA, and ARIMA-GARCH Models
Predicting the Execution Time of a kernel on a specific GPU using PTX code
Prediction of Performance and Power Consumption of GPGPU Applications
Predictive Data Race Detection for GPUs
Predictive Lazy Amplification: Synthesis and Rendering of Massive Procedural Scenes in Real Time
Predictive Modeling and Analysis of OP2 on Distributed Memory GPU Clusters
Predictive Runtime Code Scheduling for Heterogeneous Architectures
Preemptive Thread Block Scheduling with Online Structural Runtime Prediction for Concurrent GPGPU Kernels
Prefiltered Single Scattering
Preliminary Experiences with the Uintah Framework on Intel Xeon Phi and Stampede
Preliminary Experiments with XKaapi on Intel Xeon Phi Coprocessor
Preliminary implementation of two parallel programs for fractal image coding on GPUs
Preliminary implementation of VQ image coding using GPGPU
Preliminary report: Initial evaluation of StdPar implementations on AMD GPUs for HPC
Preliminary results of autotuning GEMM kernels for the NVIDIA Kepler architecture-GeForce GTX 680
Preparing Ginkgo for AMD GPUs - A Testimonial on Porting CUDA Code to HIP
Pretraining large language models with MXFP4 on Native FP4 Hardware
Pretty Good Accuracy in Matrix Multiplication with GPUs
Pricing composable contracts on the GP-GPU
Pricing of cross-currency interest rate derivatives on Graphics Processing Units
Pricing the American Option Using Reconfigurable Hardware
Primal Dual Affine Scaling on GPUs
Principal Kernel Analysis: A Tractable Methodology to Simulate Scaled GPU Workloads
Principles for Automated and Reproducible Benchmarking
Principles towards Real-Time Simulation of Material Point Method on Modern GPUs
Principles, Techniques, and Tools for Explicit and Automatic Parallelization
Priority-Based Task Management in a GPGPU Megakernel
PRISM-PSY: Precise GPU-Accelerated Parameter Synthesis for Stochastic Systems
Prius: A Runtime for Hybrid Computing
Private LLM Inference on Consumer Blackwell GPUs: A Practical Guide for Cost-Effective Local Deployment in SMEs
PRNG Random Numbers on GPU
Probabilistic View-based 3D Curve Skeleton Computation on the GPU
Probe-and-Refine Tuning of Repository Guidance for Coding Agents
Probing biomolecular machines with graphics processors
Probing the Statistical Validity of the Ductile-to-Brittle Transition in Metallic Nanowires Using GPU Computing
Process Time Comparison between GPU and CPU
Processing Big Data in Main Memory and on GPU
Processing data streams with hard real-time constraints on heterogeneous systems
Processing Hard Sphere Collisions on a GPU Using OpenCL
Processing Large-scale XML Files on GPGPU Cluster
Processing Markov Logic Networks with GPUs
Processing MPI Derived Datatypes on Noncontiguous GPU-Resident Data
Processing Neocognitron of Face Recognition on High Performance Environment Based on GPU with CUDA Architecture
Processing of Synthetic Aperture Radar data with GPGPU
Processing OLTP Workloads on Hybrid CPU/GPU Systems
Processing Posting Lists Using OpenCL
Processing XPath Structural Constraints on GPU
Production Floating Point Applications on FPGAs
Production Level CFD Code Acceleration for Hybrid Many-Core Architectures
Productive and Efficient Computational Science Through Domain-specific Abstractions
Productive High Performance Parallel Programming with Auto-tuned Domain-Specific Embedded Languages
Productive Performance Engineering for Weather and Climate Modeling with Python
Productivity, Portability, Performance: Data-Centric Python
Professional CUDA C Programming
Profile Util library: A quick and easy way to get MPI, OpenMP and GPU runtime information
Profile-guided optimization of critical medical imaging algorithms
Profiling Apple Silicon Performance for ML Training
Profiling based Out-of-core Hybrid Method for Large Neural Networks
Profiling Concurrent Vision Inference Workloads on NVIDIA Jetson - Extended
Profiling General Purpose GPU Applications
Profiling Heterogeneous Multi-GPU Systems to Accelerate Cortically Inspired Learning Algorithms
Profiling High Level Heterogeneous Programs: Using the SPOC GPGPU framework for OCaml
Profiling of Data-Parallel Processors
ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler
Program Acceleration in a Heterogeneous Computing Environment Using OpenCL, FPGA, and CPU
Program Analysis and Machine Learning based Approach to Predict Power Consumption of CUDA Kernel
Program optimization carving for GPU computing
Program Optimization of Array-Intensive SPEC2k Benchmarks on Multithreaded GPU Using CUDA and Brook+
Program Optimization of Stencil Based Application on the GPU-Accelerated System
Program optimization space pruning for a multithreaded gpu
Program Optimization Strategies for Data-Parallel Many-Core Processors
Program Optimization Study on a 128-Core GPU
PROGRAML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations
ProGraML: Graph-based Deep Learning for Program Optimization and Analysis
Programmability and Performance Portability Aspects of Heterogeneous Multi-/Manycore Systems
Programmability: Design Costs and Payoffs using AMD GPU Streaming Languages and Traditional Multi-Core Libraries
Programmable and Scalable Architecture for Graphics Processing Units
Programmable shaders for deformation rendering
Programming Abstractions and Optimization Techniques for GPU-based Heterogeneous Systems
Programming and Performance of Graphics Processors in Shock Waves Simulation by Finite Volume Method
Programming and Scheduling Model for Supporting Heterogeneous Accelerators in Linux
Programming Challenges for the Implementation of Numerical Quadrature in Atomic Physics on FPGA and GPU Accelerators
Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries
Programming Dense Linear Algebra Kernels on Vectorized Architectures
Programming Embedded Manycore: Refinement and Optimizing Compilation of a Parallel Action Language for Hierarchical State Machines
Programming finite-difference time-domain for graphics processor units using compute unified device architecture
Programming for scientific computing on peta-scale heterogeneous parallel systems
Programming framework for clusters with heterogeneous accelerators
Programming Frameworks for Distributed Smartphone Computing
Programming Future Parallel Architectures with Haskell and Intel ArBB
Programming GPUs with C++14 and Just-In-Time Compilation
Programming Heterogeneous Systems from an Image Processing DSL
Programming Heterogeneous Systems with General and Domain-Specific Frameworks
Programming hybrid systems with implicit memory based synchronization
Programming in CUDA for Kepler and Maxwell Architecture
Programming issues for video analysis on Graphics Processing Units
Programming Many-Core Chips
Programming Massively Parallel Architectures using MARTE: a Case Study
Programming massively parallel processors: A Hands-on approach
Programming Massively Parallel Processors with CUDA (audio course)
Programming model for a heterogeneous x86 platform
Programming Models and Runtimes for Heterogeneous Systems
Programming Models and Scheduling Techniques for Heterogeneous Architectures
Programming Models and Tools for Many-Core Platforms
Programming NVIDIA cards by means of transitive closure based parallelization algorithms
Programming of shared memory GPUs shared memory systems
Programming on Parallel Machines: GPU, Multicore, Clusters and More
Programming video cards for computational electromagnetics applications
Programming with Explicit Dependencies. A Framework for Portable Parallel Programming
Programming-Model Centric Debugging for Multicore Embedded Systems
Progressive Clustering of Big Data with GPU Acceleration and Visualization
Progressive High-Quality Response Surfaces for Visually Guided Sensitivity Analysis
Progressive Photon Mapping on GPUs
Progressive Semantic Segmentation
Projected tetrahedra revisited: a barycentric formulation applied to digital radiograph reconstruction using higher-order attenuation functions
Projectile Monte-Carlo Trajectory Analysis Using a Graphics Processing Unit
Projecting Tetrahedra with a Simplified Basis Graph
PROJECTION Algorithm for Motif Finding on GPUs
Promise of embedded system with GPU in artificial leg control: Enabling time-frequency feature extraction from electromyography
ProofWright: Towards Agentic Formal Verification of CUDA
Proposition for propagated occupation grids for non-rigid moving objects tracking
Prospects for scalable 3D FFTs on heterogeneous exascale systems
Prospects of GPGPU in the Auger Offline Software Framework
pROST: A Smoothed Lp-norm Robust Online Subspace Tracking Method for Realtime Background Subtraction in Video
PROST: Parallel robust online simple tracking
Protecting Real-Time GPU Applications on Integrated CPU-GPU SoC Platforms
Protein alignment algorithms with an efficient backtracking routine on multiple GPUs
Proteus: Efficient Resource Use in Heterogeneous Architectures
Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks
Prototyping flexible touch screen devices using collocated haptic-graphic elastic-object deformation on the GPU
Prototyping methodology of image processing applications on heterogeneous parallel systems
ProtoX: A First Look
Provably Efficient GPU Algorithms
Providing performance portable numerics for Intel GPUs
Providing Source Code Level Portability Between CPU and GPU with MapCG
PSCToolkit: solving sparse linear systems with a large number of GPUs
Pseudo Random Number Generators on Graphics Processing Units, with Applications in Finance
Pseudo-random number generation for Brownian Dynamics and Dissipative Particle Dynamics simulations on GPU devices
Pseudo-Random Number Generation on GP-GPU
Pseudo-random number generators for Monte Carlo simulations on ATI Graphics Processing Units
Pseudo-random number generators for Monte Carlo simulations on Graphics Processing Units
Pseudorandom number generation on the GPU
Pseudorandom Numbers Generation for Monte Carlo Simulations on GPUs: OpenCL Approach
Pseudoscalar Meson in Two Flavors QCD with the Optimal Domain-Wall Fermion
pSTL-Bench: A Micro-Benchmark Suite for Assessing Scalability of C++ Parallel STL Implementations
PTask: Operating System Abstractions To Manage GPUs as Compute Devices
PTX2Kernel: Converting PTX Code into Compilable Kernels
PUGACE, a cellular Evolutionary Algorithm framework on GPUs
Pulsar Acceleration Searches on the GPU for the Square Kilometre Array
Pulsar search acceleration using FPGAs and OpenCL templates
Pulse-coupled neural network performance for real-time identification of vegetation during forced landing
Purine: A bi-graph based deep learning framework
Pushing the Envelope: Extreme Network Coding on the GPU
Pushing the limit of molecular dynamics with ab initio accuracy to 100 million atoms with machine learning
Pushing the limits for medical image reconstruction on recent standard multicore processors
Putting Automatic Polyhedral Compilation for GPGPU to Work
pVOCL: Power-Aware Dynamic Placement and Migration in Virtualized GPU Environments
PVR: Patch-to-Volume Reconstruction for Large Area Motion Correction of Fetal MRI
pyATF: Constraint-Based Auto-Tuning in Python
PyCOOL - a Cosmological Object-Oriented Lattice code written in Python
PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation
PyCUDA: GPU Run-Time Code Generation for High-Performance Computing
PyFAI, a versatile library for azimuthal regrouping
PyFAI: a Python library for high performance azimuthal integration on GPU
PyFR: An Open Source Framework for Solving Advection-Diffusion Type Problems on Streaming Architectures using the Flux Reconstruction Approach
PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch
pyGSL: A Graph Structure Learning Toolkit
pyJac: analytical Jacobian generator for chemical kinetics
PyMatting: A Python Library for Alpha Matting
pyMIC: A Python Offload Module for the Intel Xeon Phi Coprocessor
PyOMP: Parallel programming for CPUs and GPUs with OpenMP and Python
pyPaSWAS: Python-based multi-core CPU and GPU sequence alignment
PyPs, a programmable pass manager
Pyramid Methods in GPU-Based Image Processing
Pyramidal Image Blending Using CUDA Framework
PySAGES: flexible, advanced sampling methods accelerated with GPUs
PySchedCL: Leveraging Concurrency in Heterogeneous Data-Parallel Systems
PySPH: A Python framework for SPH
PySPH: a Python-based framework for smoothed particle hydrodynamics
PystachIO: Efficient Distributed GPU Query Processing with PyTorch over Fast Networks & Fast Storage
Python for Development of OpenMP and CUDA Kernels for Multidimensional Data
Python Non-Uniform Fast Fourier Transform (PyNUFFT): An Accelerated Non-Cartesian MRI Package on a Heterogeneous Platform (CPU/GPU)
Python Workflows on HPC Systems
Python-Based Quantum Chemistry Calculations with GPU Acceleration
PyTorch Hyperparameter Tuning - A Tutorial for spotPython
PyTorch: An Imperative Style, High-Performance Deep Learning Library
PyTorchPipe: a framework for rapid prototyping of pipelines combining language and vision
PyTransit: Fast and Easy Exoplanet Transit Modelling in Python
q-state Potts model metastability study using optimized GPU-based Monte Carlo algorithms
QArray: a GPU-accelerated constant capacitance model simulator for large quantum dot arrays
QCD on GPUs: cost effective supercomputing
QCD simulations with staggered fermions on GPUs
QCDGPU: open-source package for Monte Carlo lattice simulations on OpenCL-compatible multi-GPU systems
qecGPT: decoding Quantum Error-correcting Codes with Generative Pre-trained Transformers
QGTC: Accelerating Quantized GNN via GPU Tensor Core
Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping
QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation
QMCPACK: An open source ab initio Quantum Monte Carlo package for the electronic structure of atoms, molecules, and solids
QP: A Heterogeneous Multi-Accelerator Cluster
QPACE 2 and Domain Decomposition on the Intel Xeon Phi
QR decomposition on GPUs
QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators
QSL Squasher: A Fast Quasi-Separatrix Layer Map Calculator
Quadratic Pseudo-Boolean Optimization for Scene Analysis using CUDA
Qualcomm Snapdragon Mobile Platform OpenCL General Programming and Optimization
Quality comparison and acceleration for digital hologram generation method based on segmentation
Quality-score guided error correction for short-read sequencing data using CUDA
Quantifying NUMA and contention effects in multi-GPU systems
Quantifying OpenMP: Statistical Insights into Usage and Adoption
Quantifying the Energy Efficiency of FFT on Heterogeneous Platforms
Quantifying the Energy Efficiency of Object Recognition and Optical Flow
Quantifying the Impact of GPUs on Performance and Energy Efficiency in HPC Clusters
Quantile Mechanics II: Changes of Variables in Monte Carlo methods and a GPU-Optimized Normal Quantile
Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
Quantum Boolean Image Denoising
Quantum chemical many-body theory on heterogeneous nodes
Quantum Chemistry for Solvated Molecules on Graphical Processing Units (GPUs) using Polarizable Continuum Models
Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation
Quantum Chemistry on Graphical Processing Units. 2. Direct Self-Consistent-Field Implementation
Quantum Chemistry on Graphical Processing Units. 3. Analytical Energy Gradients, Geometry Optimization, and First Principles Molecular Dynamics
Quantum computer simulation using the CUDA programming model
Quantum Monte Carlo on graphical processing units
Quantum.Ligand.Dock: protein-ligand docking with quantum entanglement refinement on a GPU system
Quartile and Outlier Detection on Heterogeneous Clusters Using Distributed Radix Sort
Quasars spectra classification with the help of GPU computing
Quasi-maximum Accuracy Floating-point Computations with GPGPU for Applications in Digital Signal Processing
Quasi-real-time analysis of dynamic near field scattering data using a graphics processing unit
QUDA programming for staggered quarks
Query Optimization in Heterogeneous CPU/GPU Environment for Time Series Databases
Query Processing on Tensor Computation Runtimes
Query-Driven Visualization of Time-Varying Adaptive Mesh Refinement Data
Quick-CULLIDE: fast inter- and intra-object collision culling using graphics hardware
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
QuickProbs - A Fast Multiple Sequence Alignment Algorithm Designed for Graphics Processors
Quine-McCluskey algorithm on GPGPU
QYMSYM: A GPU-Accelerated Hybrid Symplectic Integrator That Permits Close Encounters
R2GUESS: A Graphics Processing Unit-Based R Package for Bayesian Variable Selection Regression of Multivariate Responses
Radeon PRO Solid State Graphics (SSG) API User Manual
Radial Basis Function Networks GPU-Based Implementation
Radiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System
Radiative Heat Transfer Simulation Using Programmable Graphics Hardware
Radio astronomy beam forming on GPUs
Radio Astronomy Beam Forming on Many-Core Architectures
Radiometric Compensation through Inverse Light Transport
Radionuclides migration modelling using artificial neural networks and parallel computing
RadixBoost: A Hardware Acceleration Structure for Scalable Radix Sort on Graphic Processors
Rain Scene Animation through Particle Systems and Surface Flow Simulation by SPH
Raising the Bar for Using GPUs in Software Packet Processing
Raising the level of many-core programming with compiler technology: meeting a grand challenge
Raising the Performance of the Tinker-HP Molecular Modeling Package on Intel's HPC Architectures: a Living Review [Article v1.0]
Random Address Permute-Shift Technique for the Shared Memory on GPUs
Random Fields Generation on the GPU with the Spectral Turning Bands Method
Random Finite Set Based Bayesian Filtering with OpenCL in a Heterogeneous Platform
Random Forests of Very Fast Decision Trees on GPU for Mining Evolving Big Data Streams
Random number generators for massively parallel simulations on GPU
Random Walks based Multi-Image Segmentation: Quasiconvexity Results and GPU-based Solutions
Random Walks for Image Cosegmentation
Random Walks for Interactive Organ Segmentation in Two and Three Dimensions: Implementation and Validation
Random-access rendering of general vector graphics
Randomized selection on the GPU
Range Cell Migration Correction using texture mapping on GPU
Range query processing in a multi-GPU environment
Rank k Cholesky Up/Down-dating on the GPU: gpucholmodV0.2
RankBoost Acceleration on both NVIDIA CUDA and ATI Stream Platforms
Rapid Computation of Sodium Bioscales Using GPU-Accelerated Image Reconstruction
Rapid evaluation and evolution of neural models using graphics card hardware
Rapid Modelling of Interactive Geological Illustrations with Faults and Compaction
Rapid motion compensation for prostate biopsy using GPU
Rapid Multipole Graph Drawing on the GPU
Rapid Performance of a Generalized Distance Calculation
Rapid Rabbit: Highly Optimized GPU Accelerated Cone-Beam CT Reconstruction
Rapid RNA Folding: Analysis and Acceleration of the Zuker Recurrence
Rapid star map simulation based on GPU
Rapid Texture-based Volume Rendering
RapidMind: Portability across Architectures and its Limitations
RAPIDNN: In-Memory Deep Neural Network Acceleration Framework
RAR password decryption by utilizing GPU
Raspberry Pi based System for Visual Object Detection and Tracking
RASR/NN: The RWTH Neural Network Toolkit for Speech Recognition
Raster Time Series: Learning and Processing
Raster2Mesh: Rasterization based CVT meshing
RaVioli: a GPU Supported High-Level Pseudo Real-time Video Processing Library
Ray Casting Deformable Models on the GPU
Ray Casting of Trimmed NURBS Surfaces on the GPU
Ray Reordering Techniques for GPU Ray-Cast Ambient Occlusion
Ray Traced Rendering Using GPGPU Devices
Ray Tracing in Real-Time Games
Ray Tracing in the Cloud using MapReduce
Ray Tracing of Volumetric Data in Real Time
Ray Tracing on GPUs
Ray Tracing on Graphics Hardware
Ray Tracing using HIP
Ray Tracing Visualization Toolkit
Ray-Casted BlockMaps for Large Urban Models Visualization
Ray-Traced Collision Detection: Interpenetration Control and Multi-GPU Performance
Ray-traced Radiative Transfer on Massively Threaded Architectures
Ray-Tracing Based Interactive Camera Simulation
Raytracing Dynamic Scenes on GPU
Raytracing Dynamic Scenes on the GPU using Grids
RBMD: A molecular dynamics package enabling to simulate 10 million all-atom particles in a single graphics processing unit
rCUDA: Reducing the number of GPU-based accelerators in high performance clusters
RDMA Point-to-Point Communication for LLM Systems
RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs
RDMA-Based Job Migration Framework for MPI over InfiniBand
Re-Introduction of Communication-Avoiding FMM-Accelerated FFTs with GPU Acceleration
Reaction-diffusion model Monte Carlo simulations on the GPU
Real FP4 Tensor-Core Code in Pure Rust on a Gaming GPU - with NVIDIA's Own Compiler
Real root isolation for univariate polynomials on GPUs and multicores
Real Time Background Subtraction On GPU Using CUDA
Real Time Capture of Audio Images and their Use with Video
Real time data analysis using GPU for High energy physics experiments
Real Time Face Detection on GPU Using OpenCL
Real Time Feature-Based Parallel Morphing in GPU Applied to Texture-Based Animation
Real time image reconstruction using GPUs for a surgical PET imaging probe system
Real Time KAP Systems for Image Enhancement/Reconstruction of Remote Sensing Imagery
Real time mitigation of atmospheric turbulence in long distance imaging using the lucky region fusion algorithm with FPGA and GPU hardware acceleration
Real time Multi-GPU-based Event Detection in High Definition Videos
Real Time Pixel Art Remasterization on GPUs
Real Time Simulation of Tissue Cutting Based on GPU and CUDA for Surgical Training
Real Time Stereo Vision Using Exponential Step Cost Aggregation On GPU
Real time ultrasound image denoising
Real world applications of Artificial Intelligence on constrained hardware
Real-space density functional theory on graphical processing units: computational approach and comparison to Gaussian basis set methods
Real-Time 2-D Temperature Imaging Using Ultrasound
Real-time 3-D object recognition using scale invariant feature transform and stereo vision
Real-Time 3D Face Identification from a Depth Camera
Real-time 3D fluid simulation on GPU with complex obstacles
Real-time 3D reconstruction and pose estimation for human motion analysis
Real-time 3D Reconstruction for FPGAs: A Case Study for Evaluating the Performance, Area, and Programmability Trade-offs of the Altera OpenCL SDK
Real-time 3D reconstruction for mobile robot using catadioptric cameras
Real-time 3D registration of stereo-vision based range images using GPU
Real-time 3D registration using GPU
Real-time 3D semi-local surface patch extraction using GPGPU
Real-time 3D surface modeling for image based relighting
Real-time 3D video synthesis from binocular capture system based on commodity graphic hardware
Real-time adaptive algorithms using a Graphics Processing Unit
Real-time adaptive fluid simulation with complex boundaries
Real-Time Adaptive Image Compression
Real-Time Adaptive Radiometric Compensation
Real-time Adaptive Tone Mapping for Monitoring High Contrast Hemispherical Image Capture with the GPU
Real-Time All-in-Focus Video-Based Rendering Using A Network Camera Array
Real-time ambient occlusion and halos with summed area tables
Real-Time and Realistic Simulation for Cardiac Intervention with GPU
Real-time and Realistic Simulation of Large-scale Deep Ocean Wave Foams Based on GPU
Real-Time Animating and Rendering of Large Scale Grass Scenery on GPU
Real-time Animation of Sand-Water Interaction
Real-time anomaly detection in hyperspectral images using multivariate normal mixture models and GPU processing
Real-Time Approaches to Computer Vision
Real-time arbitrary view rendering on GPU from stereo video and time-of-flight camera
Real-Time Automatic Object Classification and Tracking using Genetic Programming and NVIDIA CUDA
Real-time blood flow visualization using the graphics processing unit
Real-time Building Airflow Simulation Aided by GPU and FFD
Real-time color holographic video display system
Real-time colouring and filtering with graphics shaders
Real-time Compressive Sensing MRI Reconstruction using GPU Computing and Split Bregman Methods
Real-time computation of interactive waves using the GPU
Real-Time Computation of Parameter Fitting and Image Reconstruction Using Graphical Processing Units
Real-time computation of photic extremum lines (PELs)
Real-time Computer Simulation of Three Dimensional Elastostatics using the Finite Point Method
Real-Time Computer Vision with OpenCV
Real-Time Concurrent Linked List Construction on the GPU
Real-time continuum grass
Real-Time Creased Approximate Subdivision Surfaces with Displacements
Real-Time Crowd Rendering and Interactions on GPU
Real-Time Dedispersion for Fast Radio Transient Surveys, using Auto Tuning on Many-Core Accelerators
Real-Time Deformation of Subdivision Surfaces from Object Collisions
Real-time depth estimation for immersive 3D videoconferencing
Real-Time Depth-of-Field Rendering Using Anisotropically Filtered Mipmap Interpolation
Real-Time Depth-of-Field Rendering Using Point Splatting on Per-Pixel Layers
Real-time digital holographic microscopy observable in multi-view and multi-resolution
Real-time digital holographic microscopy using the graphic processing unit
Real-Time Discriminative Background Subtraction
Real-time dual-mode standard/complex Fourier-domain OCT system using graphics processing unit accelerated 4D signal processing and visualization
Real-time DVB-S2 LDPC decoding on many-core GPU accelerators
Real-time dynamic tone-mapping operator on GPU
Real-Time Electroholography Using a Multi-GPU Environmental PC
Real-Time Exact Graph Matching with Application in Human Action Recognition
Real-time execution of image change detection
Real-time eye blink detection with GPU-based SIFT tracking
Real-Time Face Pose Estimation from Single Range Images
Real-time Flame Rendering with GPU and CUDA
Real-time foreground segmentation on GPUs using local online learning and global graph cut optimization
Real-time Forest Simulation for a Flight Simulator using a GPU; Graphics Card Acceleration
Real-time free viewpoint video from uncalibrated cameras using plane-sweep algorithm
Real-time Geometric Calibration on graphics processing unit with CUDA
Real-Time Geometry Decompression on Graphics Hardware
Real-Time Global Illumination for VR Applications
Real-time GPU color-based segmentation of football players
Real-Time GPU Implementation of Transverse Oscillation Vector Velocity Flow Imaging
Real-Time GPU Path Tracing
Real-time GPU rendering of piecewise algebraic surfaces
Real-Time GPU Silhouette Refinement using Adaptively Blended Bezier Patches
Real-time GPU-based Simulation of Dynamic Terrain in Virtual Battlefield
Real-Time GPU-Based Visualization of Tile Tracks in Dynamic Terrain
Real-Time GPU-Based Voxel Carving with Systematic Occlusion Handling
Real-time gradient-domain painting
Real-Time Grasp Detection Using Convolutional Neural Networks
Real-Time Hair Rendering
Real-Time Hair Simulation and Rendering with OpenCL and OpenGL
Real-time hair simulation on GPU with a dynamic wisp model
Real-Time Handling of GPU Interrupts in LITMUS RT
Real-Time Handling of GPU Interrupts in LITMUSRT
Real-time High Resolution Fusion of Depth Maps on GPU
Real-Time High-Performance Computing for Embedded Control Systems
Real-time human detection using contour cues
Real-time human detection using histograms of oriented gradients on a GPU
Real-Time Illustration of Vascular Structures
Real-time Image Processing on Low Cost Embedded Computers
Real-Time Image Segmentation on a GPU
Real-time image-based rendering system for virtual city based on image compression technique and eigen texture method
Real-Time Implementation of a Full Hyperspectral Unmixing Chain on Graphics Processing Units
Real-Time Implementation of Remotely Sensed Hyperspectral Image Unmixing on GPUs
Real-Time Implementation of the Pixel Purity Index Algorithm for Endmember Identification on GPUs
Real-Time Implementation of the Vertex Component Analysis Algorithm on GPUs
Real-Time Incompressible Fluid Simulation on the GPU
Real-time interactive object extraction system for high resolution remote sensing images based on parallel computing architecture
Real-time intraoperative full-range complex FD-OCT guided cerebral blood vessel identification and brain tumor resection in neurosurgery
Real-time Kd-tree Based Importance Sampling of Environment Maps
Real-time KD-tree construction on graphics hardware
Real-Time Marker Level Set on GPU
Real-time massive convolution for audio applications on GPU
Real-time massively parallel processing of spectral optical coherence tomography data on graphics processing units
Real-time Medical Image Volume Rendering Based on GPU Accelerated Method
Real-time medical video processing, enabled by hardware accelerated correlations
Real-time mesh simplification using the GPU
Real-time Minute Change Detection on GPU for Cellular and Remote Sensor Imaging
Real-time Model-based Articulated Object Pose Detection and Tracking with Variable Rigidity Constraints
Real-Time Motion Artifact Compensation for PMD-ToF Images
Real-time Motion Estimation for 1080p videos on graphics processing units with shared memory optimization
Real-time multi-agent path planning on arbitrary surfaces
Real-time multi-band synthesis of ocean water with new iterative up-sampling technique
Real-time multi-stereo depth estimation on GPU with approximative discontinuity handling
Real-time multi-view deconvolution
Real-time Multi-view Depth Generation Using CUDA Multi-GPU
Real-Time Multiprocessor Systems with GPUs
Real-Time Non-rigid Registration of Medical Images on a Cooperative Parallel Architecture
Real-time nonlinear finite element computations on GPU - Application to neurosurgical simulation
Real-time numerical dispersion compensation using graphics processing unit for Fourier-domain optical coherence tomography
Real-time object detection on CUDA
Real-Time Object Tracking by CUDA-accelerated Neural Network
Real-Time Object-Space Edge Detection using OpenCL
Real-time ocean wave motion simulation based on statistic model and GPU programming
Real-Time Online Video Object Silhouette Extraction Using Graph Cuts on the GPU
Real-Time Optical Flow Calculations on FPGA and GPU Architectures: A Comparison Study
Real-time optical manipulation of micron sized structures using GPU generated holograms
Real-time optical micro-manipulation using optimized holograms generated on the GPU
Real-Time Painterly Rendering of Terrains
Real-time parallel remote rendering for mobile devices using graphics processing units
Real-time particle filtering with heuristics for 3D motion capture by monocular vision
Real-time particle simulation of fluids
Real-time particle systems on the GPU in dynamic environments
Real-time path-based surface detail
Real-time PCA calculation for spectral imaging (using SIMD and GP-GPU)
Real-Time Pedestrian Detection With Deep Networks Cascades
Real-Time Phase Masks for Interactive Stimulation of Optogenetic Neurons
Real-time photo style transfer
Real-Time Photon Mapping on GPU
Real-time physically cloth simulation with CUDA
Real-time planar flow velocity measurements using an optical flow algorithm implemented on GPU
Real-Time Plane-Sweeping Stereo with Multiple Sweeping Directions
Real-Time Prediction of Brain Shift Using Nonlinear Finite Element Algorithms
Real-Time Radio Wave Propagation for Mobile Ad-Hoc Network Emulation using GPGPUs
Real-time rain simulation in cartoon style
Real-time ray casting of algebraic B-spline surfaces
Real-time Ray tracing and Editing of Large Voxel Scenes
Real-time ray tracing of implicit surfaces on the GPU
Real-Time Reconstruction of Sensitivity Encoded Radial Magnetic Resonance Imaging Using a Graphics Processing Unit
Real-time relief mapping on arbitrary polygonal surfaces
Real-Time Rendering Algorithm for Virtual Endoscopy Based on GPU
Real-time rendering and dynamic updating of 3-d volumetric data
Real-Time Rendering and Editing of Vector-based Terrains
Real-Time Rendering and Manipulation of Large Terrains
Real-Time Rendering for 3D Game Terrain with GPU Optimization
Real-time Rendering of Heterogeneous Translucent Objects with Arbitrary Shapes
Real-time rendering of large surface-scanned range data natively on a GPU
Real-time rendering of large-scale tree scene
Real-time Rendering of Melting Objects in Video Games
Real-Time Rendering of Molecular Dynamics Simulation Data: A Tutorial
Real-Time Rendering of Point Based Water Surfaces
Real-Time Rendering of Temporal Volumetric Data on a GPU
Real-time restoration algorithm based on one-dimensional Wiener filters for different rates of image motion blur
Real-Time Rigid Body Interactions
Real-time S-MRTD simulation of electrically large indoor wireless channels with commodity GPUs
Real-Time SAH BVH Construction for Ray Tracing Dynamic Scenes
Real-time saliency-aware video abstraction
Real-Time Scheduling for GPUs with Applications in Advanced Automotive Systems
Real-Time Scheduling Using GPUs - Advanced and More Accurate Proof of Feasibility
Real-time screen image scaling and its GPU acceleration
Real-Time Screen Space Rendering of Cartoon Water
Real-time semantic segmentation on FPGAs for autonomous vehicles with hls4ml
Real-time Semi-Global Matching on the CPU
Real-Time Shadow Volume Algorithm for Subdivision Surface Based Models
Real-Time Simulation and Rendering of 3D Smoke on GPU Programme
Real-Time Simulation and Visualization of Subject-Specific 3D Lung Dynamics
Real-time simulation of a spiking neural network model of the basal ganglia circuitry using general purpose computing on graphics processing units
Real-time simulation of a spiking neural network model of the basal ganglia circuitry using general-purpose computing on graphics processing units
Real-time Simulation of Foam and Sprays Based on the Weber Number
Real-Time Simulation of Granular Materials Using Graphics Hardware
Real-time simulation of large-scale dynamic forest with GPU
Real-Time Simulation of Medical Ultrasound from CT Images
Real-time Simultaneous Pose and Shape Estimation for Articulated Objects Using a Single Depth Camera
Real-time Sliding Phase Vocoder using a Commodity GPU
Real-Time Soft-Finger Grasping of Physically Based Quasi-rigid Objects
Real-time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid
Real-Time Spherical Panorama Image Stitching Using OpenCL
Real-Time Stereo Matching using Adaptive Window based Disparity Refinement
Real-time stereo matching using orthogonal reliability-based dynamic programming
Real-time stereo matching: A cross-based local approach
Real-Time Stereo on GPGPU using Progressive Multi-Resolution Adaptive Windows
Real-time Stereo Vision: Optimizing Semi-Global Matching
Real-time stereographic rendering and display of medical images with programmable GPUs
Real-Time Stochastic Kinodynamic Motion Planning via Multiobjective Search on GPUs
Real-time Stochastic Optimization of Complex Energy Systems on High Performance Computers
Real-time Stochastic Rasterization on Conventional GPU Architectures
Real-time Subsurface Scattering for Particle-based Fluids using Finite Volume Method
Real-Time Surface Extraction and Visualization of Medical Images using OpenCL and GPUs
Real-Time Systems with Radiation-Hardened Processors: A GPU-based Framework to Explore Tradeoffs
Real-time task reconfiguration support applied to an UAV-based surveillance system
Real-time Terrain Modeling using CPU-GPU Coupled Computation
Real-Time Tone Mapping for High-Resolution HDR Images
Real-Time Tracking of Visually Attended Objects in Virtual Environments and Its Application to LOD
Real-Time Tracking with Non-Rigid Geometric Templates Using the GPU
Real-time Traffic Sign Recognition with Map Fusion on Multicore/Many-core Architectures
Real-Time Translucent Rendering Using GPU-based Texture Space Importance Sampling
Real-Time Ultrasound Biomicroscopy with Optoacoustic Arrays
Real-Time Use of GPUs in NA62 Experiment
Real-time video breakup detection for multiple HD video streams on a single GPU
Real-time video denoising for 2D ultrasound streaming video on GPUs
Real-time video watermarking on programmable graphics hardware
Real-time view synthesis system with multi-texture structure of GPU
Real-time virtual environment signal extraction and denoising using programmable graphics hardware
Real-Time Virtual Viewpoint Generation on the GPU for Scene Navigation
Real-Time Visibility-Based Fusion of Depth Maps
Real-time Visual Tracker by Stream Processing
Real-time visualization of large volume datasets on standard PC hardware
Real-time Visualization of Streaming Text with Force-Based Dynamic System
Real-time Volumetric Haptic and Visual Burrhole Simulation
Real-time volumetric image reconstruction and 3D tumor localization based on a single x-ray projection image for lung cancer radiotherapy
Real-Time Volumetric Shadows using 1D Min-Max Mipmaps
Real-time voxelization for complex polygonal models
Real-Time Weighted Pose-Space Deformation on the GPU
Real-time, accurate depth of field using anisotropic diffusion and programmable graphics cards
Real-time, fast radio transient searches with GPU de-dispersion
Real-world comparison of CPU and GPU implementations of SNPrank: a network analysis tool for GWAS
Real-World Constraints of GPUs in Real-Time Systems
Realisation of a holographic microlaser scalpel using a digital micromirror device
Realistic Lighting Simulation for Interactive VR Applications
Realistic real-time rendering for large-scale forest scenes
Realistic real-time rendering for ocean waves on GPU
Realistic real-time sound re-synthesis and processing for interactive virtual worlds
Realistic rendering of surface appearance using GPU
Realizing Accelerated Cost-Effective Distributed RAID
Realtime affine-photometric KLT feature tracker on GPU in CUDA framework
Realtime background subtraction from dynamic scenes
Realtime Computation of a VST Audio Effect Plugin on the Graphics Processor
Realtime Deformation of Constrained Meshes Using GPU
RealTime GPU-Based Motion Planning for Task Executions
Realtime Loop Subdivision on the GPU
Realtime phase-based optical flow on the GPU
Realtime Ray Tracing on a Hybrid Parallel Architecture
Realtime Ray Tracing on GPU with BVH-based Packet Traversal
Realtime scheduling using GPUs - proof of feasibility
Realtime Simulation of Burning Solids on GPU with CUDA
Realtime Two-Way Coupling of Meshless Fluids and Nonlinear FEM
Recent Advances on GPU Computing in Operations Research
Recent algorithm and machine developments for lattice QCD
Recent progress and challenges in exploiting graphics processors in computational fluid dynamics
Recent trends in software and hardware for GPGPU computing: A comprehensive survey
Reconfigurable Control Variate Monte-Carlo Designs for Pricing Exotic Options
Reconfigurable real-time MIMO detector on GPU
Reconstructing hash reversal based proof of work schemes
Reconstruction and visualization of planetary nebulae
Record Setting Software Implementation of DES Using CUDA
Recovering Historical Climate Records using Artificial Neural Networks in GPU
Recurrence quantification analysis in images with CUDA
Recurrent Neural Networks for anomaly detection in the Post-Mortem time series of LHC superconducting magnets
Recurrent neural networks for language modeling
Recurrent Neural Networks Hardware Implementation on FPGA
Recursive MIS Computation for Streaming BDPT on the GPU
Redco: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs
Redefining the Role of the CPU in the Era of CPU-GPU Integration
Redesigning combustion modeling algorithms for the Graphics Processing Unit (GPU): Chemical kinetic rate evaluation and ordinary differential equation integration
[Portuguese] Time Complexity Reduction on GPUs
Reduce, Reuse, Recycle (R^3): a Design Methodology for Sparse Matrix Vector Multiplication on Reconfigurable Platforms
Reduced Vlasov-Maxwell simulations
Reducing Beamforming Calculation Time with GPU Accelerated Algorithms
Reducing branch divergence in GPU programs
Reducing branch divergence to speed up parallel execution of unit testing on GPUs
Reducing data access latency in SDSM systems using runtime optimizations
Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization
Reducing IO bandwidth for GPU based moment invariant classifier systems
Reducing overheads of dynamic scheduling on heterogeneous chips
Reducing shading on GPUs using quad-fragment merging
Reducing Synchronous GPU Memory Transfers: Design and implementation of a Futhark compiler optimisation
Reducing the Code Degree Of Parallelism to Increase GPUs Reliability
Reducing the Cost of Heuristic Generation with Machine Learning
Reducing the Disk IO Bandwidth Bottleneck through Fast Floating Point Compression using Accelerators
Reducing the Size of Nurbs Controls Nets Using Genetic Algorithms and CUDA
Reducing thread divergence in a GPU-accelerated branch-and-bound algorithm
Reducing Thread Divergence in GPU-based B&B Applied to the Flow-shop problem
Reduction of a Symmetrical Matrix to Tridiagonal Form on GPUs
Redwood: Flexible and Portable Heterogeneous Tree Traversal Workloads
Refinements in Syntactic Parsing
Refining HPCToolkit for application performance analysis at exascale
Reflective Shadow Map Clustering for Real-Time Global Illumination
Reflector Antenna Analysis using Physical Optics on Graphics Processing Units
Refresh Rate Modulation for Perceptually Optimized Computer Graphics
ReGen: Optimizing Genetic Selection Algorithms for Heterogeneous Computing
Region Templates: Data Representation and Management for Large-Scale Image Analysis
Regional Heritability Advanced Complex Trait Analysis for GPU and Traditional Parallel Architectures
Register packing for cyclic reduction: a case study
Register-leaning kernels in CUDA
Regression Modelling of Power Consumption for Heterogeneous Processors
Regular Expression Matching and Operational Semantics
Regular Expression Matching on Graphics Hardware for Intrusion Detection
Regular Lattice and Small-World Spin Model Simulations Using CUDA and GPUs
Regularity versus Load-Balancing on GPU for treefix computations
Regularization and nonlinearities for neural language models: when are they needed?
Reinforcement Learning Strategies for Compiler Optimization in High level Synthesis
Reionization simulations powered by GPUs I: the structure of the Ultraviolet radiation field
Reionization Simulations Powered by Graphics Processing Units. I. On the Structure of the Ultraviolet Radiation Field
Relational Algorithms for Multi-Bulk-Synchronous Processors
Relational joins on graphics processors
Relational query coprocessing on graphics processors
Relativistic Hydrodynamics on Graphic Cards
Relativistic hydrodynamics on graphics processing units
Relax-Miracle: GPU Parallelization of Semi-Analytic Fourier-Domain solvers for Earthquake Modeling
Reliability modeling of MEMS devices on CUDA based HPC setup
Reliable Initialization of GPU-enabled Parallel Stochastic Simulations Using Mersenne Twister for Graphics Processors
REMODE: Probabilistic, Monocular Dense Reconstruction in Real Time
Remote GPU-Accelerated Online Pre-processing of Raster Maps for Terrain Rendering
Remote Sensing Processing: From Multicore to GPU
Remotely Keyed Cryptographics: Secure Remote Display Access Using (Mostly) Untrusted Hardware
Removing the Barrier for FPGA-Based OpenCL Data Center Servers
RenderAnts: Interactive REYES Rendering on GPUs
Rendering Forest Scenes in Real-Time
Rendering of 3D Dynamic Virtual Environments
Rendering Volumetric Haptic Shapes in Mid-Air using Ultrasound
RenderKernel: High-level programming for real-time rendering systems
REOH: Runtime Energy Optimization for Heterogeneous Systems
Reordering GPU Kernel Launches to Enable Efficient Concurrent Execution
Reordering strategy for blocking optimization in sparse linear solvers
RepoLaunch: Automating Build & Test Pipeline of Code Repositories on ANY Language and ANY Platform
Report on the Feasibility of Implementing PIC Codes on a GPU
Report: Performance comparison between C2075 and P100 GPU cards using cosmological correlation functions
Representing Higher-Order Singularities in Vector Fields on Piecewise Linear Surfaces
Reproducible and Accurate Matrix Multiplication for GPU Accelerators
Reproducible Study and Performance Analysis of GPU Programming Paradigms: OpenACC vs. CUDA in Key Linear Algebra Computations
Reproducible Triangular Solvers for High-Performance Computing
Research and Application of Parallel Computing Technologies based on CUDA and OpenCL
Research and Development of Porting SYCL on QNX Operating System for High Parallelism
Research for Chinese Spam Filtering Based on GPU
Research on a Parallel BD-tree Index Structure
Research on ATI-CAL for accelerating FBP reconstruction
Research on CUDA-based Kriging Interpolation Algorithm
Research on double negative materials by using FDTD method based on GPUs
Research on DSP-GPU Heterogeneous Computing System
Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method
Research on OpenCL optimization for FPGA deep learning application
Research on Parallel DVH Statistic Based on CUDA
Research on Real-Time LLL Imaging Generation Method Based on GPU
Research on the fast Fourier transform of image based on GPU
Research on the simulation of PF-LBM model based on MPI+CUDA mixed granularity parallel
Research on Three-Dimensional Playing Video Technology in Virtual Education Environment
Reservoir Simulation on NVIDIA Tesla GPUs
Resolution of Linear Algebra for the Discrete Logarithm Problem using GPU and Multi-core Architectures
Resolution of the Vlasov-Maxwell system by PIC Discontinuous Galerkin method on GPU with OpenCL
Resolving the conflict between generality and plausibility in verified computation
Resource Centered Computing delivering high parallel performance
Resource Elastic Virtualization for FPGAs using OpenCL
Resource Sharing in GPU-Accelerated Windowing Systems
Resource-Aware Compiler Prefetching for Fine-Grained Many-Cores
Resource-Aware Just-in-Time OpenCL Compiler for Coarse-Grained FPGA Overlays
ReSYCLator: Transforming CUDA C++ source code into SYCL
Retargeting and Respecializing GPU Workloads for Performance Portability
Rethinking resampling in the particle filter on graphics processing units
Rethinking Runtime Verification on Hundreds of Cores: Challenges and Opportunities
Rethinking the Union of Computed Tomography Reconstruction and GPGPU Computing
Returning control to the programmer: SIMD intrinsics for virtual machines
RETURNN: The RWTH Extensible Training framework for Universal Recurrent Neural Networks
Reusable OpenCL FPGA Infrastructure
Reusable software components for accelerator-based clusters
Reuse and Refactoring of GPU Kernels to Design Complex Applications
Reusing Auto-Schedules for Efficient DNN Compilation
Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment
Revealing NVIDIA Closed-Source Driver Command Streams for CPU-GPU Runtime Behavior Insight
Reverberant speech recognition combining deep neural networks and deep autoencoders augmented with a phone-class feature
Reverse Computation for Rollback-based Fault Tolerance in Large Parallel Systems: Evaluating the Potential Gains and Systems Effects
Reverse-Mode AD of Reduce-by-Index and Scan in Futhark
Review and Comparative Study of Ray Traversal Algorithms on a Modern GPU Architecture
Review of Memory/Cache Management Technologies used on Heterogeneous Computing Systems
Review: Kd-tree Traversal Algorithms for Ray Tracing
Reviewing GPU architectures to build efficient back projection for parallel geometries
Revision of Relational Joins for Multi-Core and Many-Core Architectures
Revisit Long Short-Term Memory: An Optimization Perspective
Revisiting Actor Programming in C++
Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture
Revisiting Edge and Node Parallelism for Dynamic GPU Graph Analytics
Revisiting Online Autotuning for Sparse-Matrix Vector Multiplication Kernels on High-Performance Accelerators
Revisiting Online Autotuning for Sparse-Matrix Vector Multiplication Kernels on Next-Generation Architectures
Revisiting Query Performance in GPU Database Systems
Revisiting sorting for GPGPU stream architectures
Revisiting the Case of ARM SoCs in High-Performance Computing Clusters
Revolutionary technologies for acceleration of emerging petascale applications
RGEM: A Responsive GPGPU Execution Model for Runtime Engines
Rgtsvm: Support Vector Machines on a GPU in R
Ringing: Frugal Subdivision of Curves and Surfaces
Rinnegan: Efficient Resource Use in Heterogeneous Architectures
Ripple: Simplified Large-Scale Computation on Heterogeneous Architectures with Polymorphic Data Layout
Rise of the Graphics Processor
Risk Estimation Without Using Stein's Lemma -- Application to Image Denoising
Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks
RNA secondary structure prediction using dynamic programming algorithm - A review and proposed work
RNS-Based Elliptic Curve Point Multiplication for Massive Parallel Architectures
RoadRunner: a fast and flexible exoplanet transit model
Roberts edge detection algorithm based on GPU
Robotic approach to multi-beam optical tweezers with Computer Generated Hologram
Robust Adaptive 3-D Segmentation of Vessel Laminae From Fluorescence Confocal Microscope Images and Parallel GPU Implementation
Robust Computational Tools for Multiple Testing With Genetic Association Studies
Robust Edge Detection and GPU-Based Smoothing for Extracting Surface Primitives from Range Images
Robust foreground segmentation for GPU architecture in an immersive 3D videoconferencing system
Robust GPGPU plugin development for RapidMiner
Robust GPU-assisted camera tracking using free-form surface models
Robust LLM Training Infrastructure at ByteDance
Robust Low Complexity Feature Tracking using CUDA
Robust mesh reconstruction from unoriented noisy points
Robust modified L2 local optical flow estimation and feature tracking
Robust non-local denoising of colored depth data
Robust real time face recognition and tracking on GPU using fusion of RGB and depth image
Robust Real-Time Multiprocessor Interrupt Handling Motivated by GPUs
Rodinia: A benchmark suite for heterogeneous computing
Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs
Room acoustics modelling using GPU-accelerated finite difference and finite volume methods on a face-centered cubic grid
Rootbeer: Seamlessly using GPUs from Java
Rotationally invariant sparse patch matching on GPU and FPGA
Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 1. Generalized Born
RSVDPACK: Subroutines for computing partial singular value decompositions via randomized sampling on single core, multi core, and GPU architectures
RTCUDB: Building Databases with RT Processors
RTIndeX: Exploiting Hardware-Accelerated GPU Raytracing for Database Indexing
RTSL: a Ray Tracing Shading Language
RTX Beyond Ray Tracing: Exploring the Use of Hardware Ray Tracing Cores for Tet-Mesh Point Location
RubiCL, a Library Providing Automatic Parallelisation on CPU and GPU devices
Rubus: A compiler for seamless and extensible parallelism
RUMD: A general purpose molecular dynamics package optimized to utilize GPU hardware down to a few thousand particles
Run-time Image and Video Resizing Using CUDA-enabled GPUs
Run-time Reconfigurable Multiprocessors
Run-time support for multi-level disjoint memory address spaces
Run, Stencil, Run! - A Comparison of Modern Parallel Programming Paradigms
Running Financial Risk Management Applications on FPGA in the Amazon Cloud
Running the NIM Next-Generation Weather Model on GPUs
Running unstructured grid-based CFD solvers on modern graphics hardware
Runtime Code Generation and Data Management for Heterogeneous Computing in Java
Runtime Comparison of CPU and GPU Using Portable Programming Models
Runtime Compilation of Array-Oriented Python Programs
Runtime Configurable Deep Neural Networks for Energy-Accuracy Trade-off
Runtime Performances Benchmark for Knowledge Graph Embedding Methods
Runtime Specialization for Heterogeneous CPU-GPU Platforms
Runtime Support for Adaptive Power Capping on Heterogeneous SoCs
Runtime Support for Performance Portability on Heterogeneous Distributed Platforms
Runtime Support toward Transparent Memory Access in GPU-accelerated Heterogeneous Systems
Runtime Systems and Scheduling Support for High-End CPU-GPU Architectures
Runtime Visualization of Application Progress and Monitoring of a GPU-enabled Parallel Environment
S-buffer: Sparsity-aware Multi-fragment Rendering
SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures
SaberLDA: Sparsity-Aware Learning of Topic Models on GPUs
Saddle Vertex Graph (SVG): A Novel Solution to the Discrete Geodesic Problem
Safe and Practical GPU Acceleration in TrustZone
Safe Asynchronous Multicore Memory Operations
Safe, Seamless, And Scalable Integration Of Asynchronous GPU Streams In PETSc
SafeGPU: Contract- and Library-Based GPGPU for Object-Oriented Languages
SAGA: SystemC Acceleration on GPU Architectures
SAGE: Self-Tuning Approximation for Graphics Engines
SAIH: A Scalable Evaluation Methodology for Understanding AI Performance Trend on HPC Systems
Sailfish: a flexible multi-GPU implementation of the lattice Boltzmann method
SaLoBa: Maximizing Data Locality and Workload Balance for Fast Sequence Alignment on GPUs
Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications
Sample distribution shadow maps
SAPPORO: A way to turn your graphics cards into a GRAPE-6
Sapporo2: A versatile direct N-body library
SAR focusing of P-band ice sounding data using back-projection
SAR raw signal simulation based on GPU parallel computation
Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10
SBArt4 - Breeding abstract animations in realtime
SBLOCK: A Framework for Efficient Stencil-Based PDE Solvers on Multi-core Platforms
SC-DCNN: Highly-Scalable Deep Convolutional Neural Network using Stochastic Computing
Scalability Analysis of Parallel Algorithms on GPU Clusters
Scalability Analysis of Synchronous Data-Parallel Artificial Neural Network (ANN) Learners
Scalability and Optimization Strategies for GPU Enhanced Neural Networks (GeNN)
Scalability Evaluation of HPC Multi-GPU Training for ECG-based LLMs
Scalability of Higher-Order Discontinuous Galerkin FEM Computations for Solving Electromagnetic Wave Propagation Problems on GPU Clusters
Scalability of Incompressible Flow Computations on Multi-GPU Clusters Using Dual-Level and Tri-Level Parallelism
Scalability of Self-organizing Maps on a GPU cluster using OpenCL and CUDA
Scalability Study of Deep Learning Algorithms in High Performance Computer Infrastructures
Scalable Access-Pattern Aware I/O Acceleration and Multi-Tiered Data Management for HPC and AI Workloads
Scalable and deterministic timing-driven parallel placement for FPGAs
Scalable and High Performance Betweenness Centrality on the GPU
Scalable and highly parallel implementation of Smith-Waterman on graphics processing unit using CUDA
Scalable and Interactive Segmentation and Visualization of Neural Processes in EM Datasets
Scalable and massively parallel Monte Carlo photon transport simulations for heterogeneous computing platforms
Scalable and Parallel Implementation of a Financial Application on a GPU: With Focus on Out-of-Core Case
Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework
Scalable approximate k-NN in multidimensional big data
Scalable Breadth-First Search on a GPU Cluster
Scalable Clustering for Vision using GPUs
Scalable Clustering Using Graphics Processors
Scalable communication for high-order stencil computations using CUDA-aware MPI
Scalable Data Clustering using GPU Clusters
Scalable Dense Linear Algebra on Heterogeneous Hardware
Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
Scalable Distributed Fast Multipole Methods
Scalable Engine and the Performance of Different LLM Models in a SLURM based HPC architecture
Scalable Fast Multipole Methods on Distributed Heterogeneous Architectures
Scalable Fast Multipole Methods on Heterogeneous Architecture
Scalable framework for mapping streaming applications onto multi-GPU systems
Scalable GPU Acceleration of B-Spline Signal Processing Operations
Scalable GPU rendering of CSG models
Scalable GPU-Based Integrity Verification for Large Machine Learning Models
Scalable heterogeneous parallelism for atmospheric modeling and simulation
Scalable instruction set simulator for thousand-core architectures running on GPGPUs
Scalable Kernel Fusion for Memory-Bound GPU Applications
Scalable Lattice Boltzmann Solvers for CUDA GPU Clusters
Scalable learning for object detection with GPU hardware
Scalable Metropolis Monte Carlo for simulation of hard shapes
Scalable Molecular Dynamics Simulation Using FPGAs and Multicore Processors
Scalable Multi Agent Simulation on the GPU
Scalable Multi-Cache Simulation Using GPUs
Scalable Multi-GPU 3-D FFT for TSUBAME 2.0 Supercomputer
Scalable multi-GPU implementation of the MAGFLOW simulator
Scalable Multi-GPU Simulation of Long-Range Molecular Dynamics
Scalable packet classification via GPU metaprogramming
Scalable Parallel Minimum Spanning Forest Computation
Scalable parallel programming with CUDA
Scalable Parallel Tridiagonal Algorithms with Diagonal Pivoting and Their Optimization for Many-Core Architectures
Scalable Programming Models for Massively Multicore Processors
Scalable Query Evaluation in Relational Databases
Scalable Simulation of 3D Wave Propagation in Semi-Infinite Domains Using the Finite Difference Method on a GPU Based Cluster
Scalable Simulation of Tsunamis Generated by Submarine Landslides on GPU clusters
Scalable SMT-based verification of GPU kernel functions
Scalable Software Defined FM-radio receiver running on desktop computers
Scalable software defined receivers running on desktop computers using General Purpose Graphics Processing Units
Scalable Solution of Radiative Heat Transfer Problems by the Photon Monte Carlo Algorithm on Hybrid Computing Architectures
Scalable Streaming Tools for Analyzing N-body Simulations: Finding Halos and Investigating Excursion Sets in One Pass
Scalable Streaming-Array of Simple Soft-Processors for Stencil Computations with Constant Memory-Bandwidth
Scalable Techniques for Scheduling and Mapping DSP Applications onto Embedded Multiprocessor Platforms
Scalable Tuning of (OpenMP) GPU Applications via Kernel Record and Replay
Scalable Verification Techniques for Data-Parallel Programs
Scalable, High Performance Fourier Domain Optical Coherence Tomography: Why FPGAs and Not GPGPUs
Scalar collapse in AdS with an OpenCL open source code
SCALE - Ahead-Of-Time Compilation of CUDA for AMD GPUs
Scale-dependent and example-based grayscale stippling
Scale-space ridge detection with GPU acceleration
Scaleable Sparse Matrix-Vector Multiplication with Functional Memory and GPUs
ScaleHLS: Scalable High-Level Synthesis through MLIR
Scaling behavior of topologically constrained polymer rings in a melt
Scaling Coupled Climate Models to Exascale: OpenACC-enabled ECEarth3 Earth System Model
Scaling CUDA for Distributed Heterogeneous Processors
Scaling Deep Learning on GPU and Knights Landing clusters
Scaling Deep Learning on Multiple In-Memory Processors
Scaling Fast Multipole Methods up to 4000 GPUs
Scaling GPU-Accelerated Databases beyond GPU Memory Size
Scaling GPU-to-CPU Migration for Efficient Distributed Execution on CPU Clusters
Scaling GRPC Tensorflow on 512 nodes of Cori Supercomputer
Scaling Hierarchical N-body Simulations on GPU Clusters
Scaling High Performance Domain-Specific Language Implementation with Delite
Scaling IDS construction based on Non-negative Matrix factorization using GPU computing
Scaling LAPACK panel operations using parallel cache assignment
Scaling Lattice QCD beyond 100 GPUs
Scaling Monte Carlo Tree Search on Intel Xeon Phi
Scaling Multifluid Compressible Fluid Dynamics to 700,000 cores, 1.5 Pflop/s, and a Trillion Grid Cells
Scaling On-Device GPU Inference for Large Generative Models
Scaling Performance of FFT Computation on an Industrial Integrated GPU Co-processor: Experiments with Algorithm Adaptation
Scaling Radio Astronomy Signal Correlation on Heterogeneous Supercomputers Using Various Data Distribution Methodologies
Scaling Recurrent Neural Network Language Models
Scaling Results for a Discontinuous Galerkin Finite-Element Wave Solver on Multi-GPU Systems
Scaling Soft Matter Physics to Thousands of GPUs in Parallel
Scaling SU(2) to 1000 GPUs using HiRep
Scaling up scientific computations by using map-reduce-like control flow on NUMA architectures
Scaling-up spatially-explicit ecological models using graphics processors
SCALSALE: Scalable SALE Benchmark Framework for Supercomputers
Scan primitives for GPU computing
Scan Test Power Simulation on GPGPUs
Scandalously Parallelizable Mesh Generation
ScatterAlloc: Massively Parallel Dynamic Memory Allocation for the GPU
Scattering Parameters and Surface Normals from Homogeneous Translucent Materials using Photometric Stereo
Scattering Points in Parallel Coordinates
SCELib3.0: The new revision of SCELib, the parallel computational library of molecular properties in the Single Center Approach
Scene Boundary Detection Technique Based on Bottom-Up Attention System and OpenCL Parallel Implementation
Scene image classifying via the Partially Connected Neural Network
Scene independent real-time indirect illumination
Scene Recognition Acceleration Using CUDA and OpenMP
SCF: a device- and language-independent task coordination framework for reconfigurable, heterogeneous systems
SCGPSim: A fast SystemC simulator on GPUs
Scheduling (ir)regular applications on heterogeneous platforms
Scheduling a Parallel Sparse Direct Solver to Multiple GPUs
Scheduling by Work-Stealing in Hybrid Parallel Architectures
Scheduling Computation Graphs of Deep Learning Models on Manycore CPUs
Scheduling data flow program in xkaapi: A new affinity based Algorithm for Heterogeneous Architectures
Scheduling Dataflow Execution Across Multiple Accelerators
Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing
Scheduling for new computing platforms with GPUs
Scheduling Languages: A Past, Present, and Future Taxonomy
Scheduling of Linear Algebra Kernels on Multiple Heterogeneous Resources
Scheduling on Manycore and Heterogeneous Graphics Processors
Scheduling Parallel Tasks under Multiple Resources: List Scheduling vs. Pack Scheduling
Scheduling processing of real-time data streams on heterogeneous multi-GPU systems
Scheduling Tasks over Multicore machines enhanced with Accelerators: a Runtime System's Perspective
SciAI4Industry - Solving PDEs for industry-scale problems with deep learning
SciDef: Automating Definition Extraction from Academic Literature with Large Language Models
Scientific and Engineering Computing Using ATI Stream Technology
Scientific computation for simulations on programmable graphics hardware
Scientific Computation on Graphics Processing Unit using CUDA
Scientific Computation Through a GPU
Scientific Computing on Heterogeneous Architectures
Scientific Computing on Hybrid Architectures
Scientific Computing Using Consumer Video-Gaming Hardware Devices
Scientific Computing with Python on GPUs
Scientific GPU Programming with Data-Flow Languages
Scientific Programming for Heterogeneous Systems - Bridging the Gap between Algorithms and Applications
Scientific Visualization in Astronomy: Towards the Petascale Astronomy Era
Scope for performance enhancement of CMU Sphinx by parallelising with OpenCL
Scope is all you need: Transforming LLMs for HPC Code
Scout: a data-parallel programming language for graphics processors
Seamless acceleration of Fortran intrinsics via AMD AI engines
Seamless Dynamic Runtime Reconfiguration in a Software-Defined Radio
Seamless GPU acceleration for C++ based physics with the Metal Shading Language on Apple's M series unified chips
Searching CUDA code autotuning spaces with hardware performance counters: data from benchmarks running on various GPU architectures
Searching for a counterexample of Kurepa's Conjecture
Searching for Concurrent Design Patterns in Video Games
Searching for sinks of Henon map using a multiple-precision GPU arithmetic library
Second Order Pre-Integrated Volume Rendering
Secret Key Cryptography Using Graphics Cards
Secrets from the GPU
Secure 3D graphics for virtual machines
Secure Distributed Computing on a Manycore Cloud
SecureMed: Secure Medical Computation using GPU-Accelerated Homomorphic Encryption Scheme
Securing GPU via Region-based Bounds Checking
Seeded ND medical image segmentation by cellular automaton on GPU
SeedFold: Scaling Biomolecular Structure Prediction
Seeing through the fog: an algorithm for fast and accurate touch detection in optical tabletop surfaces
Seer: Predictive Runtime Kernel Selection for Irregular Problems
Seismic Attributes Extraction Based on GPU
Seismic damage simulation for urban buildings based on high-performance GPU computing
Seismic imaging based on spectral differentiation matrix and GPU implementation
Seismic volume visualization for horizon extraction
Seismic Wave Propagation Simulation Using Accelerated Support Operator Rupture Dynamics on Multi-GPU
Seismic Wave Propagation Simulation Using Support Operator Method on multi-GPU system
Selecting the Best Tridiagonal System Solver Projected on Multi-Core CPU and GPU Platforms
Selection algorithm of graphic accelerators in heterogeneous cluster for optimization computing
Selection of Task Implementations in the Nanos++ Runtime
Self-Adapting Parallel Framework for Long-Term Object Tracking
Self-Adaptive Multiprecision Preconditioners on Multicore and Manycore Architectures
Self-calibration of geometric and radiometric parameters for cone-beam computed tomography
self-CD: Interactive Self-collision Detection for Deformable Body Simulation Using GPUs
Self-Configuring Applications for Heterogeneous Systems: Program Composition and Optimization Using Cognitive Techniques
Self-Supervised Clustering for Codebook Construction: An Application to Object Localization
Self-Tuning Distribution of DB-Operations on Hybrid CPU/GPU Platforms
Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs
Semantic Pose using Deep Networks Trained on Synthetic RGB-D
Semantic Product Search
Semantic Segmentation of Colon Glands with Deep Convolutional Neural Networks and Total Variation Segmentation
SemCache: Semantics-aware Caching for Efficient GPU Offloading
Semi-Analytic Solutions to the Radiative Transfer Equations via Heterogeneous Computing
Semi-Global Filtering of Airborne LiDAR Data for Fast Extraction of Digital Terrain Models
Semi-Global Matching - Motivation, Developments and Applications
Separable projection integrals for higher-order correlators of the cosmic microwave sky: Acceleration by factors exceeding 100
Separate Compilation in a Language-Integrated Heterogeneous Environment
Sequence alignment with GPU: Performance and design challenges
Sequence Data Indexing Method Exploiting the Parallel Processing Resources of GPGPU
Sequence Homology Search using Fine-Grained Cycle Sharing of Idle GPUs
Sequence Parallelism: Making 4D Parallelism Possible
Sequential Code Parallelization for Multi-core Embedded Systems: A Survey of Models, Algorithms and Tools
Sequential Consistency for Heterogeneous-Race-Free: Programmer-centric Memory Models for Heterogeneous Platforms
Sequential Monte Carlo Optimisation for Air Traffic Management
Serial and Parallel Bayesian Spam Filtering using Aho-Corasick and PFAC
Serpent encryption algorithm implementation on Compute Unified Device Architecture (CUDA)
Serve Programs, Not Prompts
Serverless Computing Strategies on Cloud Platforms
Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs
SESH framework: A Space Exploration Framework for GPU Application and Hardware Codesign
Sgap: Towards Efficient Sparse Tensor Algebra Compilation for GPU
SGO: An ultrafast engine for atomic structure global optimization by differential evolution
SGPU 2: a runtime system for using large applications on clusters of hybrid nodes
Shader algebra
Shader Performance Analysis on a Modern GPU Architecture
Shader-based tessellation to save memory bandwidth in a mobile multimedia processor
Shader-based visual simulation of ocean wave
SHADOW3 API: The Application Programming Interface for the ray tracing code SHADOW
Shadowfax: scaling in heterogeneous cluster systems via GPGPU assemblies
Shallow Water Simulation on GPUs for Sparse Domains
Shallow water simulations on multiple GPUs
Shape Manipulation on GPU
Shape Modeling and GPU Based Image Warping
Shape Transformation of Multidimensional Density Functions using Distribution Interpolation of the Radon Transforms
Shape-merging and interpolation using class estimation for unseen voxels with a GPU-based efficient implementation
SHARC: A streaming model for FPGA accelerators and its application to Saliency
Shared Memory Multiplexing: A Novel Way to Improve GPGPU Throughput
Shared Sampling for Real-Time Alpha Matting
ShearLab 3D: Faithful Digital Shearlet Transforms based on Compactly Supported Shearlets
Shell: A Spatial Decomposition Data Structure for 3D Curve Traversal on Many-core Architectures
SHEsisEpi, a GPU-enhanced genome-wide SNP-SNP interaction scanning algorithm, efficiently reveals the risk of genetic epistasis in bipolar disorder
Ship Detection from SAR Imagery Using CUDA and Performance Analysis of the System
Short-time Fourier transform laser Doppler holography
Shortening design time through multiplatform simulations with a portable OpenCL golden-model: the LDPC decoder case
Shortest-Path Queries in Planar Graphs on GPU-Accelerated Architectures
Should I use TensorFlow?
ShoveRand: a model-driven framework to easily generate random numbers on GP-GPU
Shredder: GPU-Accelerated Incremental Storage and Computation
Shuffle Reduction Based Sparse Matrix-Vector Multiplication on Kepler GPU
Sieve: Stratified GPU-Compute Workload Sampling
SiftCU: An Accelerated Cuda Based Implementation of SIFT
Sigma*: Symbolic Learning of Input-Output Specifications
Sigma*: Symbolic Learning of Stream Filters
SIGMo: High-Throughput Batched Subgraph Isomorphism on GPUs for Molecular Matching
Sigmoid: An auto-tuned load balancing algorithm for heterogeneous systems
Signal Processing and General Purpose Computing on GPU
SignalPU: A programming model for DSP applications on parallel and heterogeneous clusters
Signed distance transform using graphics hardware
Significantly Improved Performances Of The Cryptographically Generated Addresses Thanks To ECC And GPGPU
Silhouette Extraction using Graphics Processing Units
Silhouette Smoothing for Real-Time Rendering of Mesh Surfaces
Simbuca, using a graphics card to simulate Coulomb interactions in a Penning trap
SimCommSys: taking the errors out of error-correcting code simulations
SIMD Divergence Optimization through Intra-Warp Compaction
SIMD Floating Point Extension for Ray Tracing
SIMD Implementation of a Multiplicative Schwarz Smoother for a Multigrid Poisson Solver on an Intel Xeon Phi Coprocessor
SIMD Optimization of Linear Expressions for Programmable Graphics Hardware
SIMD Parallel Gibbs Sampling of Probabilistic Directed Acyclic Graphs
SIMD Re-Convergence At Thread Frontiers
SIMD-Based Large-Scale Transient Stability Simulation on the Graphics Processing Unit
SIMD-X: Programming and Processing of Graph Algorithms on GPUs
Similarity Search in Metric Spaces on Parallel multi-core and multi-GPU Platforms
SIML: A Fast SIMD Algorithm for Calculating LINGO Chemical Similarities on GPUs and CPUs
Simple and efficient GPU accelerated topology optimisation: Codes and applications
Simple dynamic LOD for geometry images
Simple Geometry Compression for Ray Tracing on GPU
Simple Iterative Incompressible Smoothed Particle Hydrodynamics
Simple optimizations for an applicative array language for graphics processors
Simple sorting algorithm test based on CUDA
Simpler and faster HLBVH with work queues
SimSYCL: A SYCL Implementation Targeting Development, Debugging, Simulation and Conformance
SIMULATeQCD: A simple multi-GPU lattice code for QCD calculations
Simulating 3-D Lung Dynamics Using a Programmable Graphics Processing Unit
Simulating a Family of Tissue P Systems Solving SAT on the GPU
Simulating a P system based efficient solution to SAT by using GPUs
Simulating Active Membrane Systems Using GPUs
Simulating and Benchmarking the Shallow-Water Fluid Dynamical Equations on Multiple Graphical Processing Units
Simulating and Visualizing Real-Time Crowds on GPU Clusters
Simulating anomalous diffusion on graphics processing units
Simulating Biological-Inspired Spiking Neural Networks with OpenCL
Simulating Dam-Break Flooding with Floating Objects through Intricate City Layouts Using GPU-based SPH Method
Simulating flows of incompressible and weakly compressible fluids on multicore hybrid computer systems
Simulating Lattice Spin Models on Graphics Processing Units
Simulating of query processing on multiprocessor database systems with modern coprocessors
Simulating Photon Mapping for Real-time Applications
Simulating Quantum Computers Using OpenCL
Simulating soft tissues using a GPU approach of the mass-spring model
Simulating spiking neural networks on GPU
Simulating spiking neural networks on massively parallel graphical processing units using a code generation approach with GeNN
Simulating Spiking Neural P systems without delays using GPUs
Simulating spin models on GPU
Simulating the Cardinal Movements of Childbirth Using Finite Element Analysis on the Graphics Processing Unit
Simulating the Spread of Epidemics in Real-world Trading Networks using OpenCL
Simulating the universe with GPU-accelerated supercomputers: n-body methods, tests, and examples
Simulating vehicle kinematics with SimVis3D and Newton
Simulation and interaction of fluid dynamics
Simulation and modeling of physical broadcasts
Simulation and modelling of gravitational microlensing events using graphical processing units
Simulation and visualization of the Saint-Venant system using GPUs
Simulation Methodologies for Mobile GPUs
Simulation Modelling and Visualisation: Toolkits for Building Artificial Worlds
Simulation of 1+1 dimensional surface growth and lattices gases using GPUs
Simulation of a flowing snow avalanche using molecular dynamics
Simulation of atmospheric binary mixtures based on two-fluid model
Simulation of bevel gear cutting with GPGPUs-performance and productivity
Simulation of Biological Tissue using Mass-Spring-Damper Models
Simulation of cloud dynamics on graphics hardware
Simulation of Coarse-Grained Protein-Protein Interactions with Graphics Processing Units
Simulation of deformable environment with haptic feedback on GPU
Simulation of earthquake sloshing loads in a nuclear reactor
Simulation of one-layer shallow water systems on multicore and CUDA architectures
Simulation of P systems with active membranes on CUDA
Simulation of pollutant transport in shallow water on a CUDA architecture
Simulation of reaction-diffusion processes in three dimensions using CUDA
Simulation of real-time explosion smoke based on Simplex-Noise
Simulation of shallow water based on shader
Simulation of Shallow-Water systems using Graphics Processing Units
Simulation of stochastic processes using graphics hardware
Simulation of the hydrogen ground state in Stochastic Electrodynamics
Simulation Studies of Viral Advertisement Diffusion on Multi-GPU
Simulation Valuation of Multiple Exercise Options
Simulations of Complex and Microscopic Models of Cardiac Electrophysiology Powered by Multi-GPU Platforms
Simulations of Large Membrane Regions using GPU-enabled Computations - Preliminary Results
Simulations of Large Particle Systems in Real Time
Simultaneous and fast 3D tracking of multiple faces in video by GPU-based stream processing
Simultaneous Branch and Warp Interweaving for Sustained GPU Performance
Simultaneous estimation of super-resolved depth and all-in-focus images from a plenoptic camera
Simultaneous floating-point sine and cosine for VLIW integer processors
Simultaneous Use of CPU and GPU to Real Time Inverted Index Updating in Microblogs
SINGA: Putting Deep Learning in the Hands of Multimedia Users
Single Chain Slip-Spring Model for Fast Rheology Simulations of Entangled Polymers on GPU
Single molecule detection of tuberculosis nucleic acid using dark field Tethered Particle Motion
Single Scattering of Aspherical Particles in DDA Calculations on GPUs Using OpenCL
Single Server Multi-GPU Training of ConvNets
Single stream parallelization of generalized LSTM-like RNNs on a GPU
Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?
Single-particle 3D reconstruction from cryo-electron microscopy images on GPU
Single-pass GPU solid voxelization for real-time applications
Single-Pass GPU-Raycasting for Structured Adaptive Mesh Refinement Data
Singular value decomposition for collaborative filtering on a GPU
Singular value decomposition on GPU using CUDA
Sinus Endoscopy - Application of Advanced GPU Volume Rendering for Virtual Endoscopy
Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors
Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance
Size-based Transfer Functions: A New Volume Exploration Technique
Skeletal rigid skinning with blending patches on the GPU
Skeleton and Shape Adjustment and Tracking in Multicamera Environments
Skeleton Programming for Heterogeneous GPU-based Systems
Skeleton-based Automatic Parallelization of Image Processing Algorithms for GPUs
Skeleton-based edge bundling for graph visualization
SkePU 2: Flexible and Type-Safe Skeleton Programming for Heterogeneous Parallel Systems
SkePU 2: Language Embedding and Compiler Support for Flexible and Type-Safe Skeleton Programming
SkePU: a multi-backend skeleton programming library for multi-GPU systems
Sketch Based Facial Expression Recognition Using Graphics Hardware
Sketching MLS Image Deformations On the GPU
Skew Handling in Aggregate Streaming Queries on GPUs
Skinning with dual quaternions
SKMD: Single Kernel on Multiple Devices for Transparent CPU-GPU Collaboration
SkyFlow: Heterogeneous streaming for skyline computation using FlowGraph and SYCL
SLATE port to AMD and Intel platforms
Sliding-Tris: A Sliding Window Level-of-Detail Scheme
Sliding-Windows for Rapid Object Class Localization: A Parallel Technique
SMAA: Enhanced Subpixel Morphological Antialiasing
Small Discrete Fourier Transforms on GPUs
Small-Bench NLP: Benchmark for small single GPU trained models in Natural Language Processing
Small-ruleset regular expression matching on GPGPUs: quantitative performance analysis and optimization
Smart Multi-Task Scheduling for OpenCL Programs on CPU/GPU Heterogeneous Platforms
SMCGen: Generating Reconfigurable Design for Sequential Monte Carlo Applications
Smith-Waterman Acceleration in Multi-GPUs: A Performance per Watt Analysis
Smooth Mixed-Resolution GPU Volume Rendering
Smoothed Particle Hydrodynamics Simulation for Continuous Casting
Smoothed-Particle Hydrodynamics Models: Implementation Features on GPUs
SneakySnake: A Fast and Accurate Universal Genome Pre-Alignment Filter for CPUs, GPUs, and FPGAs
Snowflake: A Lightweight Portable Stencil DSL
SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters
SnuHPL: high performance LINPACK for heterogeneous GPUs
SoaAlloc: Accelerating Single-Method Multiple-Objects Applications on GPUs
SOAP3-dp: Fast, Accurate and Sensitive GPU-based Short Read Aligner
SoAx: A generic C++ Structure of Arrays for handling Particles in HPC Codes
SOCL: An OpenCL Implementation with Automatic Multi-Device Adaptation Support
SODECL: An Open Source Library for Calculating Multiple Orbits of a System of Stochastic Differential Equations in Parallel
SOFF: An OpenCL High-Level Synthesis Framework for FPGAs
Soft Error Resilient QR Factorization for Hybrid System
Soft Error Resilient QR Factorization for Hybrid System with GPGPU
Soft GPGPUs for Embedded FPGAs: An Architectural Evaluation
Softassign and EM-ICP on GPU
Softshell: Dynamic Scheduling on GPUs
Software architecture and system validation of an open, unified model for accelerated multicore computing
Software Challenges for Extreme Scale Computing: Going From Petascale to Exascale Systems
Software Compilation Techniques for Heterogeneous Embedded Multi-Core Systems
Software Defined Radio over CUDA
Software Development Tools Using GPGPU Potentialities
Software Model Checking for GPGPU Programs, Towards a Verification Tool
Software Optimization and Orchestration for Heterogeneous and Distributed Architectures
Software parallel CAVLC encoder based on stream processing
Software Performance Analysis with Parallel Programming Approaches
Software Pipelined Execution of Stream Programs on GPUs
Software Platform for Hybrid Resource Management of Many-core Accelerators
Software Polarization Spectrometer "PolariS"
Software Prefetching for Indirect Memory Accesses
Software Reliability Enhancements for GPU Applications
Software Testing - Test Suite Compilation and Execution Optimizations
Software-Based Algorithm for Modeling and Correction of Gradient Nonlinearity Distortions in Magnetic Resonance Imaging
Software-based branch predication for AMD GPUs
Software-Based ECC for GPUs
Software-Based Hardening Strategies for Neutron Sensitive FFT Algorithms on GPUs
Software-Defined FPGA Accelerator Design for Mobile Deep Learning Applications
SoK: A Systems Perspective on Compound AI Threats and Countermeasures
SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
SOL: Effortless Device Support for AI Frameworks without Source Code Changes
SOL: Reducing the Maintenance Overhead for Integrating Hardware Support into AI Frameworks
Solution Level Parallelization of Local Search Metaheuristic Algorithm on GPU
Solutions for Optimizing the Monte Carlo Option Pricing Method's Implementation Using the Compute Unified Device Architecture
Solutions For Optimizing The Radix Sort Algorithmic Function Using The Compute Unified Device Architecture
Solver for Systems of Linear Equations with Infinite Precision on a GPU Cluster
Solving k-Nearest Vector Problem on Multiple Graphics Processors
Solving 2D Nonlinear Unsteady Convection-Diffusion Equations on Heterogenous Platforms with Multiple GPUs
Solving 3D Anisotropic Elastic Wave Equations on Parallel GPU Devices
Solving 3D incompressible Navier-Stokes equations on hybrid CPU/GPU systems
Solving 3D viscous incompressible Navier-Stokes equations using CUDA
Solving a kind of BVP for ODEs on heterogeneous CPU + CUDA-enabled GPU systems
Solving Batched Linear Programs on GPU and Multicore CPU
Solving Bivariate Polynomial Systems on a GPU
Solving convex optimization problems on FPGA using OpenCL
Solving Dense Generalized Eigenproblems on Multi-threaded Architectures
Solving Dense Linear Systems on Graphics Processors
Solving dense linear systems on platforms with multiple hardware accelerators
Solving diffractive optics problems using graphics processing units
Solving Discrete Logarithms in Smooth-Order Groups with CUDA
Solving incompressible Navier-Stokes equations on heterogeneous parallel architectures
Solving Incompressible Two-Phase Flows on Massively Parallel Multi-GPU Clusters
Solving incompressible two-phase flows on multi-GPU clusters
Solving Kinetic Equations on GPUs I: Model Kinetic Equations
Solving knapsack problems on GPU
Solving large permutation flow-shop scheduling problems on GPU-accelerated supercomputers
Solving Large Regression Problems using an Ensemble of GPU-accelerated ELMs
Solving lattice QCD systems of equations using mixed precision solvers on GPUs
Solving Linear Equations with Conjugate Gradient Method on OpenCL Platforms
Solving Linear Recurrences on Hybrid GPU Accelerated Manycore Systems
Solving MaxSAT with Matrix Multiplication
Solving Mixed Integer Programs Using Neural Networks
Solving Molecular Distance Geometry Problems in OpenCL
Solving Multiple Queries through a Permutation Index in GPU
Solving Parabolic Problems Using Multithread and GPU
Solving Path Problems on the GPU
Solving prime-field ECDLPs on GPUs with OpenCL
Solving quadratic assignment problems by genetic algorithms with GPU computation: a case study
Solving Quadratic Programming Problems on Graphics Processing Unit
Solving RFIC Simulation Tasks Using GPU Computations
Solving Rigid Multibody Physics Dynamics Using Proximal Point Functions on the GPU
Solving Sparse Linear Systems on NVIDIA Tesla GPUs
Solving Stochastic Differential Equations Using General Purpose Graphics Processing Unit
Solving Systems of Polynomial Equations on a GPU
Solving the Boltzmann Equation on GPU
Solving the Boltzmann equation on GPUs
Solving the Caputo Fractional Reaction-Diffusion Equation on GPU
Solving the Coalition Structure Generation Problem on a GPU
Solving the Euler Equations on Graphics Processing Units
Solving the Examination Timetabling Problem in GPUs
Solving the Flexible Job Shop Problem on Multi-GPU
Solving the Ghost-Gluon System of Yang-Mills Theory on GPUs
Solving the Quadratic Assignment Problem on heterogeneous environment (CPUs and GPUs) with the application of Level 2 Reformulation and Linearization Technique
Solving the Vlasov equation for one-dimensional models with long range interactions on a GPU
Solving very large instances of the scheduling of independent tasks problem on the GPU
Solving Wave Equations on Unstructured Geometries
Some examples of instant computations of fluid dynamics on GPU
Some Graph Algorithms And Related Primitives For The GPU
Some of the What?, Why?, How?, Who? and Where? of Graphics Processing Unit Computing for Bayesian Analysis
SOMGPU: An unsupervised pattern classifier on Graphical Processing Unit
Somoclu: An Efficient Distributed Library for Self-Organizing Maps
Sop-GPU: Accelerating biomolecular simulations in the centisecond timescale using graphics processors
Soren: Adaptive MapReduce for Programmable GPUs
Sort-First Parallel Volume Rendering
Sorting and Permuting without Bank Conflicts on GPUs
Sorting On A Graphics Processing Unit (GPU)
Sorting on FPGAs using Merge Trees
Sorting on GPUs for large scale datasets: A thorough comparison
Sorting with GPUs: A Survey
Sound and Partially-Complete Static Analysis of Data-Races in GPU Programs
Sound Speed Optimization Using Image Texture on CUDA
Sound Synthesis Using Physical Modeling on Heterogeneous Computing Platforms
Source-to-Source Automatic Differentiation of OpenMP Parallel Loops
Source-to-Source Automatic Program Transformations for GPU-like Hardware Accelerators
Source-to-Source Optimization of CUDA C for GPU Accelerated Cardiac Cell Modeling
Source-to-Source Transformations for GPU Code Generation
Source-to-source transformations for irregular and multithreaded code optimization
Space and the Synchronic A-Ram
Space Charge Dominated Envelope Dynamics Using GPUs
Space-Time Finite Element Analysis on Graphics Processing Unit Computing Platform
Spark-GPU: An Accelerated In-Memory Data Processing Engine on Clusters
Spark: modular, composable shaders for graphics hardware
SparkCL: A Unified Programming Framework for Accelerators on Heterogeneous Clusters
SparkJNI: A Reference Design for a Heterogeneous Apache Spark Framework
Sparse Approximate Inverse Preconditioners for Iterative Solvers on GPUs
Sparse array representations and some selected array operations on GPUs
Sparse Convex Optimization on GPUs
Sparse direct solvers with accelerators over DAG runtimes
Sparse GPU Kernels for Deep Learning
Sparse LU Factorization for Parallel Circuit Simulation on GPU
Sparse Matrix Algorithms Using GPGPU
Sparse matrix computations on manycore GPU's
Sparse Matrix Formats Evaluation and Optimization on a GPU
Sparse Matrix Matrix Multiplication on Hybrid CPU+GPU Platforms
Sparse Matrix Multiplication using CUDA and Mex Interface
Sparse matrix partitioning for optimizing SpMV on CPU-GPU heterogeneous platforms
Sparse matrix solvers on the GPU: conjugate gradients and multigrid
Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures: Algorithms and Experiments
Sparse matrix-vector multiplication on GPGPU clusters: A new storage format and a scalable implementation
Sparse Matrix-Vector Multiplication on GPGPUs
Sparse Matrix-Vector Multiplication on GPU
Sparse Matrix-Vector Multiplication on NVIDIA GPU
Sparse Recovery on GPUs: Accelerating the Iterative Soft-Thresholding Algorithm
Sparse regularization in MRI iterative reconstruction using GPUs
Sparse systems solving on GPUs with GMRES
Sparse Winograd Convolutional neural networks on small-scale systolic arrays
Sparse-Matrix support for the SkePU library for portable CPU/GPU programming
Sparse-Matrix-CG-Solver in CUDA
Sparselet Models for Efficient Multiclass Object Detection
Sparser, Better, Faster GPU Parsing
Spatial Data Structures, Sorting and GPU Parallelism for Situated-agent Simulation and Visualisation
Spatial Indexing of Large-Scale Geo-Referenced Point Data on GPGPUs Using Parallel Primitives
Spatial interpolation in massively parallel computing environments
Spatial interpolation of scattered geoscientific data
Spatial Join with R-Tree on Graphics Processing Units
Spatial Sorting Algorithms for Parallel Computing in Networks
Spatial splits in bounding volume hierarchies
Spatial: A Language and Compiler for Application Accelerators
Spatio-temporal upsampling on the GPU
Spatter: A Benchmark Suite for Evaluating Sparse Access Patterns
SpecGen: Accelerating Agentic Kernel Optimization with Speculative Generation
Special Relativistic Visualization by Local Ray Tracing
Specification and verification of GPGPU programs
Specification and Verification of GPGPU Programs using Permission-Based Separation Logic
Speckle Reduction with Trained Nonlinear Diffusion Filtering
Spectral classification using convolutional neural networks
Spectral Ewald Acceleration of Stokesian Dynamics for polydisperse suspensions
Spectral Method Characterization on FPGA and GPU Accelerators
Spectral volume rendering using GPU-based raycasting
Specular Effects on the GPU: State of the Art
Speculative Execution of Parallel Programs with Precise Exception Semantics on GPUs
Speculative Execution on GPU: An Exploratory Study
Speculative Execution on Multi-GPU Systems
Speculative Parallel Evaluation Of Classification Trees On GPGPU Compute Engines
Speculative Parallelization on GPGPUs
Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors
Specx: a C++ task-based runtime system for heterogeneous distributed architectures
Speech Recognition on Modern Graphic Processing Units
Speech Recognition on Multi-Core Processors and GPUs
Speed and Portability issues for Random Number Generation on Graphical Processing Units with CUDA and other Processing Accelerators
Speed Records for NTRU
Speed sign detection and recognition by convolutional neural networks
Speed up Large Integer Multiplication Using Fourier Transforms and CUDA Technology
Speed-Up Improvement Using Parallel Approach in Image Steganography
Speed, power and cost implications for GPU acceleration of Computational Fluid Dynamics on HPC systems
Speeding up a few orders of magnitude the Jacobi method: high order Chebyshev-Jacobi over GPUs
Speeding up a Video Summarization Approach Using GPUs and Multicore CPUs
Speeding up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves
Speeding Up Computer Vision Applications on Mobile Computing Platforms
Speeding Up Cycle Based Logic Simulation Using Graphics Processing Units
Speeding Up Geospatial Polygon Rasterization on GPGPUs
Speeding Up Homomorphic Hashing Using GPUs
Speeding up K-Means Algorithm by GPUs
Speeding up Large-Scale Point-in-Polygon Test Based Spatial Join on GPUs
Speeding up lattice sieve with Xeon Phi coprocessor
Speeding up LIP-Canny with CUDA programming
Speeding Up Model Building for ECGA on CUDA Platform
Speeding up Mutual Information Computation Using NVIDIA CUDA Hardware
Speeding Up Object Detection: Fast Resizing in the Integral Image Domain
Speeding Up Particle Trajectory Simulations under Moving Force Fields using GPUs
Speeding Up Reinforcement Learning with Graphics Processing Units
Speeding up Scoring Module of Mass Spectrometry Based Protein Identification by GPU
Speeding up subset seed algorithm for intensive protein sequence comparison
Speeding up the evaluation of evolutionary learning systems using GPGPUs
Speeding up the evaluation phase of GP classification algorithms on GPUs
Speeding up the MATLAB complex networks package using graphic processors
Speeding up the MATLAB Hyperspectral Image Analysis Toolbox using GPUs and the Jacket Toolbox
Speeding up the small progress measures algorithm for parity games using the GPU
Speeding-up Pearson Correlation Coefficient calculation on graphical processing units
Speeding-up the Verification Phase of Set Similarity Joins in the GPGPU paradigm
Speedup and Parallelization Models for Energy-Efficient Many-Core Systems Using Performance Counters
Speedup for quantum optimal control from GPU-based automatic differentiation
Speedup of Fuzzy Clustering Through Stream Processing on Graphics Processing Units
Speedup of Micromagnetic Simulations with C++ AMP On Graphics Processing Units
Speedup of Type-1 Fuzzy Logic Systems on Graphics Processing Units Using CUDA
Speedups between x70 and x120 for a generic local search (memetic) algorithm on a single GPGPU chip
sPEGG: high throughput eco-evolutionary simulations on commodity graphics processors
SPH Based Fluid Animation Using CUDA Enabled GPU
SPH Fluids for Viscous Jet Buckling
SPH on GPU with CUDA
Spherical harmonic transform on heterogeneous architectures using hybrid programming
Spherical harmonic transform with GPUs
Spiking Neural Networks for Real-Time Infrared Images Processing in Thermo Vision Systems
SPIRE, a Sequential to Parallel Intermediate Representation Extension
Split tiling for GPUs: automatic parallelization using trapezoidal tiles
Splotch: porting and optimizing for the Xeon Phi
SpMV: A Memory-Bound Application on the GPU Stuck Between a Rock and a Hard Place
SPOC: GPGPU Programming Through Stream Processing With OCaml
Sponge: portable stream programming on graphics engines
Spotting Radio Transients with the help of GPUs
SPRAT: Runtime processor selection for energy-aware computing
Spring-Bead Animation of Viscoelastic Materials
Springald: GPU-Accelerated Window-Based Aggregates Over Out-of-Order Data Streams
Spyx: A Library for Just-In-Time Compiled Optimization of Spiking Neural Networks
SqueezCL: Squeezing OpenCL Kernels for Approximate Computing on Contemporary GPUs
SRAM-DRAM hybrid memory with applications to efficient register files in fine-grained multi-threading
SRP Based Natural Interaction between Real and Virtual Worlds in Augmented Reality
SSE Vectorized and GPU Implementations of Arakawa's Formula for Numerical Integration of Equations of Fluid Motion
SSLPV: subsurface light propagation volumes
SSLShader: Cheap SSL Acceleration with Commodity Processors
Stability and Performance of Various Singular Value QR Implementations on Multicore CPU with a GPU
Stabilized Backward Diffusion for Partial Volume Correction
Stable fluids
Stable large-scale solver for Ginzburg-Landau equations for superconductors
Stack-less SIMT reconvergence at low cost
Stackless KD-Tree Traversal for High Performance GPU Ray Tracing
Stadium Hashing: Scalable and Flexible Hashing on GPUs
Staggered fermions simulations on GPUs
STAR-RT: Visual attention for real-time video game playing
Starchart: Hardware and Software Optimization Using Recursive Partitioning Regression Trees
Stargazer: Automated Regression-Based GPU Design Space Exploration
STARK: Strategic Team of Agents for Refining Kernels
StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators
StarPU: a Runtime System for Scheduling Tasks over Accelerator-Based Multicore Machines
StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures
State Lattice-based Motion Planning for Autonomous On-Road Driving
State of The Art Report on GPU
State of the Art Report on Real-time Rendering with Hardware Tessellation
State-Based Gauss-Seidel Framework for Real-time 2D Ultrasound Image Sequence Denoising on GPUs
State-of-the-art in heterogeneous computing
Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
Static Analysis and Dynamic Adaptation of Parallelism
Static and Dynamic Analyses for Efficient GPU Execution
Static Compilation Analysis for Host-Accelerator Communication Optimization
Static GPU threads and an improved scan algorithm
Static Memory Access Pattern Analysis on a Massively Parallel GPU
Statistical Computing With Graphics Processing Units
Statistical constraints on binary black hole inspiral dynamics
Statistical Power Consumption Analysis and Modeling for GPU-based Computing
Statistical power modeling of GPU kernels using performance counters
Statistical testing of random number sequences using CUDA
stdgpu: Efficient STL-like Data Structures on the GPU
Stealing Webpages Rendered on Your Browser by Exploiting GPU Vulnerabilities
Stellar Mergers with HPX-Kokkos and SYCL: Methods of using an Asynchronous Many-Task Runtime System with SYCL
Stellar-mass black holes in star clusters: implications for gravitational wave radiation
Stencil and Lattice Structures for Field Equation Model Simulations on GPUs
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
Stencil Computations on AMD and Nvidia Graphics Processors: Performance and Tuning Strategies
Stencil shadow volumes for complex and deformable objects
Stencil-Aware GPU Optimization of Iterative Solvers
StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems
StePS: A Multi-GPU Cosmological N-body Code for Compactified Simulations
Stereo depth with a Unified Architecture GPU
Stereo Matching Algorithm Using Population-Based Incremental Learning on GPU
Stereo Matching using Multi-Resolution Images on CUDA
Stereoscopic Ray Tracing on Graphics Processors
Stereoscopic Scene Flow Computation for 3D Motion Understanding
Stereovision On GPU
StitchCUDA: An Automated Multi-Agents End-to-End GPU Programming Framework with Rubric-based Agentic Reinforcement Learning
Stochastic Analysis of a Queue Length Model Using a Graphics Processing Unit
Stochastic Differential Equations simulation using GPU
Stochastic DT-MRI Connectivity Mapping on the GPU
Stochastic Gradient Descent on GPUs
Stochastic Progressive Photon Mapping for Dynamic Scenes
Stochastic transparency
STOCHSIMGPU: Parallel stochastic simulation for the Systems Biology Toolbox 2 for MATLAB
Stock trading strategy creation using GP on GPU
StoreGPU: exploiting graphics processing units to accelerate distributed storage systems
Strain Visualization of Ultra Sound Signals Processed by General Purpose Graphic Process Unit
Strassen's Matrix Multiplication on GPUs
Strategies for Maximizing Utilization in multi-CPU & multi-GPU Heterogeneous Architectures
Strategies for Optimization of Parallel Programs
Strategies for preparing computer science students for the multicore world
Strategies for Protecting Intellectual Property when Using CUDA Applications on Graphics Processing Units
Strategies for the Heterogeneous Execution of Large-Scale Simulations on Hybrid Supercomputers
Strategies to minimise the total run time of cyclic graph based genetic programming with GPUs
Strategy Preserving Compilation for Parallel Functional Code
Stream computing on graphics hardware
Stream Join Processing on Heterogeneous Processors
Stream processing for fast and efficient rotated Haar-like features using rotated integral images
Stream Processing of Integral Images for Real-Time Object Detection
Stream processing of moment invariants for real-time classifiers
Stream-Centric Stereo Matching and View Synthesis: A High-Speed Approach on GPUs
StreamBlocks: A compiler for heterogeneous dataflow computing
StreamBrain: An HPC Framework for Brain-like Neural Networks on CPUs, GPUs and FPGAs
Streamed Watershed Transform on GPU for Processing of Large Volume Data
Streaming Algorithms for Biological Sequence Alignment on GPUs
Streaming Applications on Heterogeneous Platforms
Streaming architectures and technology trends
Streaming Data from HDD to GPUs for Sustained Peak Performance
Streaming Dynamic Coarse-Grained CPU/GPU Workloads with Heterogeneous Pipelines in FastFlow
Streaming GPU Singular Value and Dynamic Mode Decompositions
Streaming Parallel GPU Acceleration of Large-Scale filter-based Spiking Neural Networks
Streaming-Oriented Parallelization of Domain-Independent Irregular Kernels
STREAMIT: Dynamic visualization and interactive exploration of text streams
Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping
StreamMR: An Optimized MapReduce Framework for AMD GPUs
StreamWorks: An Energy-efficient Embedded Co-processor for Stream Computing
Strega: An HTTP Server for FPGAs
Stress Tensor Field Visualization for Implant Planning in Orthopedics
Stressing the BER simulation of LDPC codes in the error floor region using GPU clusters
String Algorithm on GPGPU
String Matching on a Multicore GPU Using CUDA
Striped Smith-Waterman speeds database searches six times over other SIMD implementations
Strong scaling of general-purpose molecular dynamics simulations on GPUs
Structural Agnostic SpMV: Adapting CSR-Adaptive for Irregular Matrices
Structural, dynamic, and electrostatic properties of fully hydrated DMPC bilayers from molecular dynamics simulations accelerated with graphical processing units (GPUs)
Structured Orthogonal Inversion of Block p-Cyclic Matrices on Multicore with GPU Accelerators
STT-RAM for Shared Memory in GPUs
Studies Concerning the ATLAS IBL Calibration Architecture
Studies of quantum dots: Ab initio coupled-cluster analysis using OpenCL and GPU programming
Studies on CUDA Offloading for Real-Time Simulation and Visualization
Study and evaluation of an Irregular Graph Algorithm on Multicore and GPU Processor Architectures
Study and evaluation of improved automatic GPU offloading method
Study for measurement method for coal volume on base of GPU
Study of Bandwidth Partitioning for Co-executing GPU Kernels
Study of basic vector operations on Intel Xeon Phi and NVIDIA Tesla using OpenCL
Study of Convolution Algorithms using CPU and Graphics Hardware
Study of low density nuclear matter with quantum molecular dynamics: the role of the symmetry energy
Study of OpenCL Processing Models for FPGA Devices
Study of Sparse-Matrix Vector Multiplication (SpMV) on Different Architectures and Libraries
Study on acceleration technique for calculating near field of horn antenna based on GPU
Study on acceleration technique for two-dimensional FDTD algorithm based on GPU
Study on GPU-accelerated extraction of interconnects parasitic using CUDA and MPI
Study on semi-global matching algorithm extended for multi baseline matching and parallel processing method based on GPU
Study on Transient Temperature Field Parallel Computing in Cooling Control Based on a GPU Fourier Method
Study on volume rendering of CT slices based on ray casting
Study, Modelling and Implementation of the Level Set Method Used in Micromachining Processes
Studying the core-cusp problem in cold dark matter halos using N-body simulations on GPU clusters
Studying the Potential of Automatic Optimizations in the Intel FPGA SDK for OpenCL
Studying Thermal Management for Graphics-Processor Architectures
STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning
SU(2) Lattice Gauge Theory Simulations on Fermi GPUs
SU(2) Lattice QCD Simulations on Fermi GPUs
Sub-seasonal forecasting with a large ensemble of deep-learning weather prediction models
Subdivision Surface Evaluation as Sparse Matrix-Vector Multiplication
Subpixel reconstruction antialiasing for deferred shading
Suitability of NVIDIA GPUs for SKA1-Low
Super Earths and Dynamical Stability of Planetary Systems: First Parallel GPU Simulations Using GENGA
Supercharging Federated Learning with Flower and NVIDIA FLARE
Supercomputing and stellar dynamics
Supercomputing with toys: harnessing the power of NVIDIA 8800GTX and playstation 3 for bioinformatics problem
Superconducting proximity effect in graphene under inhomogeneous strain
SUPERGLUE: A Shared Memory Framework Using Data Versioning for Dependency-Aware Task-Based Parallelization
SUperman: Efficient Permanent Computation on GPUs
SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks
SuperNeurons: FFT-based Gradient Sparsification in the Distributed Training of Deep Neural Networks
Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models
Supervised Hashing with Deep Neural Networks
Support for Parallel Scan in OpenMP
Support Operator Rupture Dynamics on GPU
Support Vector Machines on GPU with Sparse Matrix Format
Supporting Applications Involving Dynamic Data Structures and Irregular Memory Access on Emerging Parallel Platforms
Supporting CUDA for an extended RISC-V GPU architecture
Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework
Supporting Heterogeneous Computing Environments in SaC
Supporting input dependent access pattern algorithms on GPUs using GPUfs
Supporting Iteration in a Heterogeneous Data Flow Engine
Supporting mixed-datatype matrix multiplication within the BLIS framework
Supporting Preemptive Task Executions and Memory Copies in GPGPUs
Supporting x86-64 Address Translation for 100s of GPU Lanes
Surface Compression Using Dynamic Color Palettes
Surface Normal Integration for Convex Space-time Multi-view Reconstruction
Surface quality assessment of subdivision surfaces on programmable graphics hardware
Surface Reconstruction from Scattered Point via RBF Interpolation on GPU
Survey and Benchmarking of Machine Learning Accelerators
Survey of Domain-Specific Languages for FPGA Computing
Survey of GPU water simulation in game engine
Survey of HPC in US Research Institutions
Survey on Benchmarks for a GPU Based Multi Camera Stereo Matching Algorithm
Survey on Efficient Linear Solvers for Porous Media Flow Models on Recent Hardware Architectures
Survey On The Off-Chip Scheduling of Memory Accesses in the Memory Interface Of GPUs
Survey paper on Deep Learning on GPUs
Sustainable GPU Computing at Scale
Sustainable Supercomputing for AI: GPU Power Capping at HPC Scale
SW# - GPU enabled exact alignments on genome scale
SW#db: GPU-accelerated exact sequence similarity database search
Swan: A tool for porting CUDA programs to OpenCL
SWAPHI: Smith-Waterman Protein Database Search on Xeon Phi Coprocessors
Swarm-NG: a CUDA Library for Parallel n-body Integrations with focus on Simulations of Planetary Systems
Swarm's flight: Accelerating the particles using C-CUDA
swCaffe: a Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight
swCUDA: Auto parallel code translation framework from CUDA to ATHREAD for new generation sunway supercomputer
Swendsen-Wang Multi-Cluster Algorithm for the 2D/3D Ising Model on Xeon Phi and GPU
Swept Volume approximation of polygon soups
SWIFOLD: Smith-Waterman implementation on FPGA with OpenCL for long DNA sequences
Switching to High Gear: Opportunities for Grand-Scale Real-Time Parallel Simulations
Swizzle Inventor: Data Movement Synthesis for GPU Kernels
SWM: Simplified Wu-Manber for GPU-based Deep Packet Inspection
SWPS3 - fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2
SYCL Code Generation for Multigrid Methods
SYCL compute kernels for ExaHyPE
SYCL in the edge: performance and energy evaluation for heterogeneous acceleration
SYCL in the Edge: Performance Evaluation for Heterogeneous Acceleration
SYCL-Bench 2020: Benchmarking SYCL 2020 on AMD, Intel, and NVIDIA GPUs
SYCL-Bench: A Versatile Cross-Platform Benchmark Suite for Heterogeneous Computing
SYCL-Bench: A Versatile Single-Source Benchmark Suite for Heterogeneous Computing
SYCLops: A SYCL Specific LLVM to MLIR Converter
Sylkan: Towards a Vulkan Compute Target Platform for SYCL
Symbolic Crosschecking of Data-Parallel Floating Point Code
Symbolic crosschecking of floating-point and SIMD code
Symbolic Differentiation in GPU Shaders
Symbolic Testing of OpenCL Code
Symphony: A Scheduler for Client-Server Applications on Coprocessor-based Heterogeneous Clusters
Synchronization and Coordination in Heterogeneous Processors
Synchronization and Ordering Semantics in Hybrid MPI+GPU Programming
Synergia CUDA: GPU-accelerated accelerator modeling package
Synergistic CPU-FPGA Acceleration of Sparse Linear Algebra
Synergistic execution of stream programs on multicores with accelerators
SYnergy: Fine-grained Energy-Efficient Heterogeneous Computing for Scalable Energy Saving
Synkhronos: a Multi-GPU Theano Extension for Data Parallelism
SynPerf: A Hybrid Analytical-ML Framework for GPU Performance Prediction
Synthesis and rendering of bidirectional texture functions on arbitrary surfaces
Synthesis of Custom Networks of Heterogeneous Processing Elements for Complex Physical System Emulation
Synthesis of Embedded Software using Dataflow Schedule Graphs
Synthesis of GPU Programs from High-Level Models
Synthesis of Platform Architectures from OpenCL Programs
Synthesizing Benchmarks for Predictive Modeling
Synthesizing Software from a ForSyDe Model Targeting GPGPUs
Synthesizing Structured Traversals from Attribute Grammars
Synthesizing Subdivision Meshes Using Real Time Tessellation
Synthetic Aperture Beamformation using the GPU
Synthetic Aperture Radar imaging on a CUDA-enabled mobile platform
Synthetic Aperture Radar Processing with GPGPU
Syntix: A Profiling Based Resource Estimator for CUDA Kernels
System Design Principles for Heterogeneous Resource Management and Scheduling in Accelerator-Based Systems
System integration of FastSPECT III, a dedicated SPECT rodent-brain imager based on BazookaSPECT detector technology
System-Level Optimization and Code Generation for Graphics Processors using a Domain-Specific Language
Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU
Systematic construction, verification and implementation methodology for LDPC codes
Systematic Performance Optimization of Cone-Beam Back-Projection on the Kepler Architecture
Systematic Physics Constrained Parameter Estimation of Stochastic Differential Equations
SystemC simulation on GP-GPUs: CUDA vs. OpenCL
Systolic-CNN: An OpenCL-defined Scalable Run-time-flexible FPGA Accelerator Architecture for Accelerating Convolutional Neural Network Inference in Cloud/Edge Computing
SZx: an Ultra-fast Error-bounded Lossy Compressor for Scientific Datasets
TABLA: A Unified Template-based Framework for Accelerating Statistical Machine Learning
Tabu Search on GPU
Tabu Search with two approaches to parallel flowshop evaluation on CUDA platform
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS
Tactics to Directly Map CNN graphs on Embedded FPGAs
Taichi: A Language for High-Performance Computation on Spatially Sparse Data Structures
Takagi Factorization on GPU using CUDA
Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes
Taking the graphics processor beyond graphics
Taming irregular EDA applications on GPUs
Taming the complexities of the C11 and OpenCL memory models
Tamp: A Library for Compact Deep Neural Networks with Structured Matrices
Tangible video teleconference system using real-time image-based relighting
Tango: A Deep Neural Network Benchmark Suite for Various Accelerators
Tangram: a High-level Language for Performance Portable Code Synthesis
Tangram: Hiding GPU Heterogeneity for Efficient LLM Parallelization
TAP: A TLP-Aware Cache Management Policy for a CPU-GPU Heterogeneous Architecture
Tapping the supercomputer under your desk: Solving dynamic equilibrium models with graphics processors
Target Marker: A Visual Marker for Long Distances and Detection in Realtime on Mobile Devices
targetDP: an Abstraction of Lattice Based Parallelism with Portable Performance
Targeted Testing of Compiler Optimizations via Grammar-Level Composition Styles
Targeting GPUs with OpenMP Directives on Summit: A Simple and Effective Fortran Experience
Targeting heterogeneous architectures via macro data flow
Task and Data Distribution in Hybrid Parallel Systems
Task management for irregular-parallel workloads on the GPU
Task migration of DSP application specified with a DFG and implemented with the BSP computing model on a CPU-GPU cluster
Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout
Task Parallelism and Data Distribution: An Overview of Explicit Parallel Programming Languages
Task Parallelism and Synchronization: An Overview of Explicit Parallel Programming Languages
Task parallelism-based architectures on FPGA to optimize the energy efficiency of AI at the edge
Task Partition Comparison between Multi-core System and GPU
Task Performance with List-Mode Data
Task Scheduling for Heterogeneous Multicore Systems
Task scheduling in hybrid CPU-GPU systems
Task Scheduling of Parallel Processing in CPU-GPU Collaborative Environment
Task Superscalar: An Out-of-Order Task Pipeline
Task superscalar: using processors as functional units
Task-based Conjugate-Gradient for multi-GPUs platforms
Task-based FMM for heterogeneous architectures
Task-Based Parallel Strategies for CFD Application in Heterogeneous CPU/GPU Resources
Task-based, GPU-accelerated and Robust Library for Solving Dense Nonsymmetric Eigenvalue Problems
Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing System
Tausch: A halo exchange library for large heterogeneous computing systems using MPI, OpenCL, and CUDA
TBD: Benchmarking and Analyzing Deep Neural Network Training
TC-CIM: Empowering Tensor Comprehensions for Computing-In-Memory
tcFFT: Accelerating Half-Precision FFT through Tensor Cores
TCUDB: Accelerating Database with Tensor Processors
TDDFT in massively parallel computer architectures: the OCTOPUS project
Teaching An Old Dog New Tricks: Porting Legacy Code to Heterogeneous Compute Architectures With Automated Code Translation
Teaching cardiac electrophysiology modeling to undergraduate students: laboratory exercises and GPU programming for the study of arrhythmias and spiral wave dynamics
Teaching graphics processing and architecture using a hardware prototyping approach
Teaching Parallel Programming in Containers: Virtualization of a Heterogeneous Local Infrastructure
Teaching Parallel Programming Models on a Shallow-Water Code
Teaching Parallel Programming Using Java
Technical aspects of the GPU accelerated surgical simulator
Technical Report about Tiramisu: a Three-Layered Abstraction for Hiding Hardware Complexity from DSL Compilers
Techniques for designing GPGPU games
Techniques for efficient DCT/IDCT implementation on generic GPU
Techniques for efficient, real-time, 3D visualization of multi-modality cardiac data using consumer graphics hardware
Techniques for Mapping Synthetic Aperture Radar Processing Algorithms to Multi-GPU Clusters
Techniques to maximize memory bandwidth on the Rigel compute accelerator
TEDI: efficient shortest path query answering on graphs
TEG: GPU Performance Estimation Using a Timing Model
Telekine: Secure Computing with Cloud GPUs
Template Library for Multi-GPU Pseudorandom Number Recursion-based Generators
Temporal Blending for Adaptive SPH
Temporally Consistent Disparity and Optical Flow via Efficient Spatio-temporal Filtering
Temporospatial Epidemic Simulations Using Heterogeneous Computing
TENSILE: A Tensor granularity dynamic GPU memory scheduler method towards multiple dynamic workloads system
Tensor Computation Based on Heterogeneous Memory
Tensor Contractions with Extended BLAS Kernels on CPU and GPU
Tensor Processing Units for Financial Monte Carlo
Tensor Voting Accelerated by Graphics Processing Units (GPU)
TensorFlow Doing HPC
TensorFlow: A system for large-scale machine learning
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
TensorFlow.js: Machine Learning for the Web and Beyond
TensorNetwork for Machine Learning
TensorNetwork: A Library for Physics and Machine Learning
Tera-scale Astronomical Data Analysis and Visualization
TeraFLOP computing on a desktop PC with GPUs for 3D CFD
Teraflop per second gravitational lensing ray-shooting using graphics processing units
Termination Analysis for GPU Kernels
TESLA GPUs versus MPI with OpenMP for the Forward Modeling of Gravity and Gravity Gradient of Large Prisms Ensemble
Tesla vs. Xeon Phi vs. Radeon: A Compiler Writer's Perspective
Test-driving Intel Xeon Phi
Testing and Exposing Weak Graphics Processing Unit Memory Models
Testing and Mutation Testing for GPU Kernels
Testing fine-grained parallelism for the ADMM on a factor-graph
Testing GPU Numerics: Finding Numerical Differences Between NVIDIA and AMD GPUs
Testing Tesla architecture for scientific computing: The performance of matrix-vector product
Tetrahedral Interpolation for Deformable Image Registration on GPUs
Text2Gestures: A Transformer-Based Network for Generating Emotive Body Gestures for Virtual Agents
Texture Cache Approximation on GPUs
Texture compression of light maps using smooth profile functions
Texture-based visualization of uncertainty in flow fields
Texture-Based Visualization of Unsteady 3D Flow by Real-Time Advection and Volumetric Illumination
Texturing and Modeling, Third Edition: A Procedural Approach (The Morgan Kaufmann Series in Computer Graphics)
TH-1: China's first petaflop supercomputer
The 'Chimera': an off-the-shelf CPU/GPGPU/FPGA hybrid computing platform
The 3D Flow Field Around an Embedded Planet
The Accelerated Universe
The accelerating implementation of BLAST with stream processor
The Accelerator Wall: Limits of Chip Specialization
The AES Implantation Based on OpenCL for Multi/many Core Architecture
The AGILE library for image reconstruction in biomedical sciences using graphics card hardware acceleration
The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition
The AlexNet Moment for Homomorphic Encryption: HCNN, the First Homomorphic CNN on Encrypted Data with GPUs
The Anatomy of a Triton Attention Kernel
The Anatomy of High-Performance 2D Similarity Calculations
The ANTAREX Approach to Autotuning and Adaptivity for Energy Efficient HPC Systems
The ANTAREX Domain Specific Language for High Performance Computing
The Application of AI Technology in GPU Scheduling Algorithm Optimization
The Application of CUDA Architecture in Facial Expression Recognition
The application of GPU particle tracing to diffusion tensor field visualization
The Application Perspective: Seeking Productivity and Performance
The Arcane development framework
The Architecture and Evolution of CPU-GPU Systems for General Purpose Computing
The architecture of the DecentVM: towards a decentralized virtual machine for many-core computing
The Art of Balance: A RateupDB Experience of Building a CPU/GPU Hybrid Database Product
The Astrophysical Multipurpose Software Environment
The battle of the giants: a case study of GPU vs FPGA optimisation for real-time image processing
The BiConjugate gradient method on GPUs
The Boat Hull Model: Adapting the Roofline Model to Enable Performance Prediction for Parallel Computing
The BondMachine toolkit: Enabling Machine Learning on FPGA
The Bones Source-to-Source Compiler Manual
The Case for Higher Computational Density in the Memory-Bound FDTD Method within Multicore Environments
The case for VOS: the vector operating system
The Celerity High-level API: C++20 for Accelerator Clusters
The Chamomile Scheme: An Optimized Algorithm for N-body simulations on Programmable Graphics Processing Units
The Comparisons of OpenCL and OpenMP Computing Paradigm
The Complete Rank Transform: A Tool for Accurate and Morphologically Invariant Matching of Structures
The computer graphics wars heat up
The conjugate gradient solver accelerated by GPU for solving wave-propagation problems
The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding
The Correctness Illusion in LLM-Generated GPU Kernels
The CUBLAS and CULA based GPU acceleration of adaptive finite element framework for bioluminescence tomography
The CUDA Handbook: A Comprehensive Guide to GPU Programming
The CUDA implementation of the method of lines for the curvature dependent flows
The CUDA LATCH Binary Descriptor: Because Sometimes Faster Means Better
The DabR - A multitouch system for intuitive 3D scene navigation
The Deep Learning Compiler: A Comprehensive Survey
The density matrix renormalization group algorithm on kilo-processor architectures: implementation and trade-offs
The Design and Implementation of a GPU-enabled Multi-objective Tabu-search Intended for Real World and High-dimensional Applications
The Design and Implementation of a Verification Technique for GPU Kernels
The design and verification of Mumax3
The development and expansion of HOOMD-blue through six years of GPU proliferation
The development of Mellanox/NVIDIA GPUDirect over InfiniBand - a new model for GPU to GPU communications
The discrete dipole approximation code DDscat.C++: features, limitations and plans
The distributed diagonal force decomposition method for parallelizing molecular dynamics simulations
The Distribution of OpenCL Kernel Execution Across Multiple Devices
The Dual-Path Execution Model for Efficient GPU Control Flow
The Dynamical Kernel Scheduler - Part 1
The Ecological Footprint of Neural Machine Translation Systems
The effects of nutrient chemotaxis on bacterial aggregation patterns with non-linear degenerate cross diffusion
The Fast and Wideband MoM Based on GPU and Two-Path AFS Acceleration
The fast evaluation of hidden Markov models on GPU
The fast multipole method on parallel clusters, multicore processors, and graphics processing units
The Fast Multipole Method on the Cell processor
The Fat-Link Computation On Large GPU Clusters for Lattice QCD
The Feasibility of Using OpenCL Instead of OpenMP for Parallel CPU Programming
The FFT on a GPU
The Flocking Based and GPU Accelerated Internet Traffic Classification
The Framework and Compilation Techniques for Directive-based GPU Cluster Programming
The Fused Kernel Library: A C++ API to Develop Highly-Efficient GPU Libraries
The Future in Mobile Multicore Computing
The Future of Accelerator Programming: Abstraction, Performance or Can We Have Both?
The future of microprocessors
The GASPI API specification and its implementation GPI 2.0
The Geant4 Visualisation System - a multi-driver graphics system
The GeForce 6 series GPU architecture
The GeForce 6800
The Genetic Convolutional Neural Network Model Based on Random Sample
The GENGA Code: Gravitational Encounters in N-body simulations with GPU Acceleration
The GPU as a high performance computational resource
The GPU as numerical simulation engine
The GPU Computing Era
The GPU Computing Revolution: From Multi-Core CPUs To Many-Core Graphics Processors
The GPU Enhanced Parallel Computing for Large Scale Data Clustering
The GPU enters computing's mainstream
The GPU on biomedical image processing for color and phenotype analysis
The GPU on irregular computing: performance issues and contributions
The GPU on the simulation of cellular computing models
The GPU vs Phi Debate: Risk Analytics Using Many-Core Computing
The GPU-based High-performance Pattern-matching Algorithm for Intrusion Detection
The GPU-based Parallel Ant Colony System
The GPU-based String Matching System in Advanced AC Algorithm
The gputools package enables GPU computing in R
The GPUVerify Method: a Tutorial Overview
The Graphics Card as a Streaming Computer
The Graphics Processor as a Mathematical Coprocessor in MATLAB
The Heisenberg spin glass model on GPU: myths and actual facts
The Hierarchical Memory Machine Model for GPUs
The Hitchhiker's Guide to Cross-Platform OpenCL Application Development
The impact of accelerator processors for high-throughput molecular modeling and simulation
The impact of diverse memory architectures on multicore consumer software: an industrial perspective from the video games domain
The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study
The impact of GPU/Multicore in Signal Processing: a quantitative approach
The Impact of Modern Consumer GPUs on Commonly Used Secure Password Standards
The Implement of Common Beam Forming Using GPU
The implementation and optimization of Bitonic sort algorithm based on CUDA
The Implementation of a Real-Time Polyphase Filter
The implementation of Multi-Scale Retinex image enhancement algorithm based on GPU via CUDA
The Infrared behavior of SU(3) Nf=12 gauge theory -about the existence of conformal fixed point-
The integrated implementation of surgical simulations through modeling by means of imaging, comprehension, visualization, deformation, and collision detection in virtual environments
The International Exascale Software Project roadmap
The K-Anonymity Approach in Preserving the Privacy of E-Services that Implement Data Mining
The Landscape of GPU-Centric Communication
The Lattice Boltzmann Equation Method for Complex Flows
The Lattice Boltzmann Simulation on Multi-GPU Systems
The lattice-Boltzmann method for simulating gaseous phenomena
The Linear Direct Sparse Solver on GPU for Bundle Adjustment Method
The Living Application: a Self-Organising System for Complex Grid Tasks
The magic volume lens: an interactive focus+context technique for volume rendering
The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface
The method of improving performance of the GPU-accelerated 2D FDTD simulator
The Model of Computation of CUDA and its Formal Semantics
The MOPED framework: Object recognition and pose estimation for manipulation
The More We Share, The More We Have: Improving GPU performance through Register Sharing
The MOSIX Cluster Operating System for High-Performance Computing on Linux Clusters, Multi-Clusters, GPU Clusters and Clouds
The MOSIX Virtual OpenCL (VCL) Cluster Platform
The multi-GPU System with ExpEther
The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing
The multikernel: a new OS architecture for scalable multicore systems
The New Compiler Stack: A Survey on the Synergy of LLMs and Compilers
The nonequispaced FFT on graphics processing units
The Ocean Tensor Package
The OoO VLIW JIT Compiler for GPU Inference
The Open MatSci ML Toolkit: A Flexible Framework for Machine Learning in Materials Science
The openip open source image processing library
The OpenMP Cluster Programming Model
The Optimization of Algorithms in the Process of Temporal Data Mining Using the Compute Unified Device Architecture
The optimization of parallel Smith-Waterman sequence alignment using on-chip memory of GPGPU
The orthorectified technology for UAV aerial remote sensing image based on the Programmable GPU
The Parallel Bayesian Toolbox for High-performance Bayesian Filtering in Metrology
The Parallel Processing Based on CUDA for Convolution Filter FDK Reconstruction of CT
The PEPPHER Approach to Programmability and Performance Portability for Heterogeneous many-core Architectures
The PEPPHER Composition Tool: Performance-Aware Dynamic Composition of Applications for GPU-based Systems
The Performance Analysis Based on Heterogeneous Parallel Processors for Anisotropic Diffusion Filters
The performances of R GPU implementations of the GMRES method
The Physics of Singular Dislocation Structures in Continuum Dislocation Dynamics
The Plasma Simulation Code: A modern particle-in-cell code with load-balancing and GPU support
The Possibility of Fast Large-Scale Numerical Simulation Implemented with Graphics Processing Units
The Potential for a GPU-Like Overlay Architecture for FPGAs
The Potential of the Intel Xeon Phi for Supervised Deep Learning
The Power-Performance Tradeoffs of the Intel Xeon Phi on HPC Applications
The Promises of Hybrid Hexagonal/Classical Tiling for GPU
The Q Continuum Simulation: Harnessing the Power of GPU Accelerated Supercomputers
The Reconstruction Toolkit (RTK), an open-source cone-beam CT reconstruction toolkit based on the Insight Toolkit (ITK)
The Reduction Problem in CUDA and Its Simulation with P Systems
The Research of Large-Scale 3D Scenes Rendering Optimization
The Research of Real-Time Shadow Rendering Algorithm of Virtual Scenes
The Rewriting of DataRaceBench Benchmark for OpenCL Program Validations
The Rhombic Dodecahedron Map: An Efficient Scheme for Encoding Panoramic Video
The Risks of WebGL: Analysis, Evaluation and Detection
The Rodinia Benchmark Suite in SYCL
The role of GPU computing in medical image analysis and visualization
The role of multigrid algorithms for LQCD
The Saga of Landau-Gauge Propagators: Gathering New Ammo
The Scalable Heterogeneous Computing (SHOC) benchmark suite
The scoring sequences on profile Hidden Markov Models with delete states elimination by GPUs
The Security of Key Derivation Functions in WINRAR
The Shamrock code: I- Smoothed Particle Hydrodynamics on GPUs
The Sharing Tracker: Using Ideas from Cache Coherence Hardware to Reduce Off-Chip Memory Traffic with Non-Coherent Caches
The sparse matrix vector product on GPUs
The State of the Art in Interactive Global Illumination
The Stencil Processing Unit: GPGPU Done Right
The Study of the OpenCL Processing Models for the FPGA Devices
The system for visualization of synoptic objects
The Test and Evaluation Uses of Heterogeneous Computing: GPGPUs and Other Approaches
The TheLMA project: Multi-GPU implementation of the lattice Boltzmann method
The Tradeoffs of Fused Memory Hierarchies in Heterogeneous Computing Architectures
The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System
The Use of Automated Search in Deriving Software Testing Strategies
The Use of GPUs for Solving the Computed Tomography Problem
The use of overlapping subgrids to accelerate the FDTD on GPU devices
The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability
The VerCors Verifier: A Progress Report
The Virtual Marathon: Parallel Computing Supports Crowd Simulations
The Virtual OpenCL (VCL) Cluster Platform
The visible ear surgery simulator
The visual vulnerability spectrum: characterizing architectural vulnerability for graphics hardware
The VOLNA-OP2 Tsunami Code (Version 1.0)
The VRE volume rendering engine
The Yin and Yang of Processing Data Warehousing Queries on GPU Devices
Theano-based Large-Scale Visual Recognition with Multiple GPUs
Theano-MPI: a Theano-based Distributed Training Framework
Theano: A CPU and GPU Math Compiler in Python
Theano: A Python framework for fast computation of mathematical expressions
Theano: Deep Learning on GPUs with Python
TheanoLM - An Extensible Toolkit for Neural Network Language Modeling
Themis: Fair and Efficient GPU Cluster Scheduling for Machine Learning Workloads
Theoretical and Numerical Analysis of Three Approaches to the GPGPU Application of the Explicit FDTD Method
Theory of square, rectangular, and microband electrodes through explicit GPU simulation
Thermal and Athermal Swarms of Self-Propelled Particles
Thermal Safety and Real-Time Predictability on Heterogeneous Embedded SoC Platforms
Theseus: A Library for Differentiable Nonlinear Optimization
Thickness computation of trimmed B-Rep model using GPU ray tracing
THOR: A New and Flexible Global Circulation Model to Explore Planetary Atmospheres
THOR: A Transparent Heterogeneous Open Resource framework
Thorough Evaluation of GPU Shared Memory Load and Store Instructions
Thousand core chips: a technology perspective
Thread Block Compaction for Efficient SIMT Control Flow
Thread-safe lattice Boltzmann for high-performance computing on GPUs
Thread-Scalable Evaluation of Multi-Jet Observables
Three Contributions to the Theory and Practice of Optimizing Compilers
Three Dimensional Fast Fourier Transform CUDA Implementation
Three dimensional tracking of gold nanoparticles using digital holographic microscopy
Three storage formats for sparse matrices on GPGPUs
Three-Dimension Fountain Simulation Based on GPU and Particle System
Three-Dimensional Image Warping on Programmable Graphics Hardware
Three-dimensional LBM simulations of buoyancy-driven flow using Graphics processing units
Three-Dimensional Modeling of Long-Wave Runup: Simulation of Tsunami Inundation with GPU-SPHysics
Throughput-Effective On-Chip Networks for Manycore Accelerators
Throughput-Oriented Analytical Models for Performance Estimation on Programmable Hardware Accelerators
ThunderGBM: Fast GBDTs and Random Forests on GPUs
ThunderSVM: A Fast SVM Library on GPUs and CPUs
Thwarting Piracy: Anti-debugging Using GPU-assisted Self-healing Codes
Tight Binding Molecular Dynamics on CPU and GPU clusters
Tile Based Procedural Terrain Generation in Real-Time
Tile-based Lightweight Integer Compression in GPU
Tileable BTF
Tiled QR Decomposition and Its Optimization on CPU and GPU Computing System
Tiled Shading
TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
Tiling for Performance Tuning on Different Models of GPUs
Tiling optimizations for stencil computations
Tilus: A Tile-Level GPGPU Programming Language for Low-Precision Computation
Time dependent simulation of the Driven Lid Cavity at High Reynolds Number
Time Predictability of GPU Kernel on an HSA Compliant Platform
Time-dependent density-functional theory in massively parallel computer architectures: the OCTOPUS project
Time-stepping methods for the simulation of the self-assembly of nano-crystals in Matlab on a GPU
Time-varying clustering for local lighting and material design
TimeGraph: GPU scheduling for real-time multi-tasking environments
Tinker-HP: Accelerating Molecular Dynamics Simulations of Large Complex Systems with Advanced Point Dipole Polarizable Force Fields using GPUs and Multi-GPUs systems
TinyDL: Just-In-Time Deep Learning Solution For Constrained Embedded Systems
Tiramisu: A Code Optimization Framework for High Performance Systems
Titan: A Parallel Asynchronous Library for Multi-Agent and Soft-Body Robotics using NVIDIA CUDA
TLP: A Deep Learning-based Cost Model for Tensor Program Tuning
tntorch: Tensor Network Learning with PyTorch
To Co-Run, or Not To Co-Run: A Performance Study on Integrated Architectures
To GPU Synchronize or Not GPU Synchronize?
To Use or Not to Use: Graphics Processing Units for Pattern Matching Algorithms
Togpu: Automatic Source Transformation from C++ to CUDA using Clang/LLVM
TonY: An Orchestrator for Distributed Machine Learning Jobs
Toolchain for programming, simulating and studying the XMT many-core architecture
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
Tools for GPU Computing – Debugging and Performance Analysis of Heterogenous HPC Applications
Tools for Reduced Precision Computation: A Survey
Top ten ways to make formal methods for HPC practical
Top-k Queries Processing With Uncertain Data on Graphics Processing Units
Top-Performance Tokenization and Small-Ruleset Regular Expression Matching: A Quantitative Performance Analysis and Optimization Study on the Cell/B.E. Processor
Topical perspective on massive threading and parallelism
TopicBERT for Energy Efficient Document Classification
Topology optimization design of 3D electrothermomechanical actuators by using GPU as a co-processor
Topology Optimization with Unstructured Meshes on Graphics Processing Units (GPUs)
Torch7: A Matlab-like Environment for Machine Learning
TorchAudio: Building Blocks for Audio and Speech Processing
TorchBench: Benchmarking PyTorch with High API Surface Coverage
Torchnet: An Open-Source Platform for (Deep) Learning Research
torchode: A Parallel ODE Solver for PyTorch
TorchOpt: An Efficient Library for Differentiable Optimization
TorchQC - A framework for efficiently integrating machine and deep learning methods in quantum dynamics and control
Toward a Generic Hybrid CPU-GPU Parallelization of Divide-and-Conquer Algorithms
Toward a GPU-Accelerated Immersed Boundary Method for Wind Forecasting Over Complex Terrain
Toward a Multi-level Parallel Framework on GPU Cluster with PetSC-CUDA for PDE-based Optical Flow Computation
Toward a multicore architecture for real-time ray-tracing
Toward a Practical Implementation of Exemplar-Based Noise Robust ASR
Toward Accelerating the Matrix Inversion Computation of Symmetric Positive-Definite Matrices on Heterogeneous GPU-Based Systems
Toward Acceleration of RSA Using 3D Graphics Hardware
Toward Accurate Platform-Aware Performance Modeling for Deep Neural Networks
Toward Auto-tuned Krylov Basis Computations with minimized Communication on Clusters of Accelerators
Toward Automatic Translation: From OpenACC to OpenMP 4
Toward Better Computation Models for Modern Machines
Toward efficient GPU-accelerated N-body simulations
Toward GPU Accelerated Data Stream Processing
Toward GPU-accelerated Traffic Simulation and Its Real-Time Challenge
Toward GPUs being mainstream in analytic processing: An initial argument using simple scan-aggregate queries
Toward Harnessing DOACROSS Parallelism for Multi-GPGPUs
Toward improved aeromechanics simulations using recent advancements in scientific computing
Toward large-scale Hybrid Monte Carlo simulations of the Hubbard model on graphics processing units
Toward OpenCL Automatic Multi-Device Support
Toward optimised skeletons for heterogeneous parallel architecture with performance cost model
Toward Performance Portability for CPUs and GPUs Through Algorithmic Compositions
Toward Practical Real-Time Photon Mapping: Efficient GPU Density Estimation
Toward Real-Time Dense 3D Reconstruction using Stereo Vision
Toward real-time kernel density estimate display for instrumentation
Towards a Benchmarking Suite for Kernel Tuners
Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers
Towards a Distributed GPU-Accelerated Matrix Inversion
Towards a functional run-time for dense NLA domain
Towards a GPU-based Implementation of Interaction Nets
Towards a GPU-Based Simulation Framework for Deformable Surface Meshes
Towards a GPU-Parallelization of the neXtSIM-DG Dynamical Core
Towards a More Efficient Use of GPUs
Towards a Performance-Portable FFT Library for Heterogeneous Computing
Towards a Portable and Future-proof Particle-in-Cell Plasma Physics Code
Towards a robust, real-time face processing system using CUDA-enabled GPUs
Towards a Software Transactional Memory for Graphics Processors
Towards a Tunable Multi-Backend Skeleton Programming Framework for Multi-GPU Systems
Towards a Unified CPU-GPU code hybridization: A GPU Based Optimization Strategy Efficient on Other Modern Architectures
Towards a unified framework for rapid 3D computed tomography on commodity GPUs
Towards a Unified Sentiment Lexicon (USL) based on Graphics Processing Units (GPUs)
Towards a Unified Sentiment Lexicon Based on Graphics Processing Units
Towards Accelerated Computation of Atmospheric Equations Using CUDA
Towards accelerating molecular modeling via multi-scale approximation on a GPU
Towards accelerating Smoothed Particle Hydrodynamics simulations for free-surface flows on multi-GPU clusters
Towards acceleration of fault simulation using graphics processing units
Towards ad-hoc GPU acceleration of parallel eigensystem computations
Towards Adaptive GPU Resource Management for Embedded Real-Time Systems
Towards Alignment of Parallelism in SYCL and ISO C++
Towards an automatic generation of dense linear algebra solvers on parallel architectures
Towards an Effective Unified Programming Model for Many-Cores
Towards an embedded biologically-inspired machine vision processor
Towards an interactive and automated script feature analysis of 3D scanned cuneiform tablets
Towards Automated Kernel Generation in the Era of LLMs
Towards automated kernel selection in machine learning systems: A SYCL case study
Towards Automated Learning of Object Detectors
Towards Automatic C Programs Optimization and Parallelization using the PIPS-PoCC Integration
Towards automatic Digital Surface Model generation using a Graphics Processing Unit
Towards Automatic Learning of Heuristics for Mechanical Transformations of Procedural Code
Towards Automatic Transformation of Legacy Scientific Code into OpenCL for Optimal Performance on FPGAs
Towards Automating Multi-dimensional Data Decomposition for Executing a Single-GPU Code on a Multi-GPU System
Towards autonomous resource management: Deep learning prediction of CPU-GPU load balancing
Towards Building Error Resilient GPGPU Applications
Towards Calculating HPC CUDA Kernel Performance on Nvidia GPUs
Towards Chip-on-Chip Neuroscience: Fast Mining of Frequent Episodes Using Graphics Processors
Towards chip-on-chip neuroscience: fast mining of neuronal spike streams using graphics hardware
Towards Co-execution on Commodity Heterogeneous Systems: Optimizations for Time-Constrained Scenarios
Towards Code Generation from Design Models for Embedded Systems on Heterogeneous CPU-GPU Platforms
Towards Comprehensive Parametric Code Generation Targeting Graphics Processing Units in Support of Scientific Computation
Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems
Towards Distortion-Predictable Embedding of Neural Networks
Towards Distributed Heterogenous High-Performance Computing with ViennaCL
Towards Domain-specific Computing for Stencil Codes in HPC
Towards dynamic reconfigurable load-balancing for hybrid desktop platforms
Towards Efficient and Practical GPU Multitasking in the Era of LLM
Towards Efficient and Scalable Acceleration of Online Decision Tree Learning on FPGA
Towards Efficient GPU Sharing on Multicore Processors
Towards Efficient Indexing of Spatiotemporal Trajectories on the GPU for Distance Threshold Similarity Searches
Towards Efficient Large-Scale Graph Neural Network Computing
Towards Efficient Risk Quantification-Using GPUs and Variance Reduction Technique
Towards energy efficiency and productivity for decision making in mobile robot navigation
Towards Enhancing Performance, Programmability, and Portability in Heterogeneous Computing
Towards fast and certified multiple-precision libraries
Towards Faster Cloth Simulation: Examining the Preconditioned Conjugate Gradient
Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation
Towards fully user transparent task and data parallel image processing
Towards global composition of performance-aware components for GPU-based systems
Towards Good Practices for Very Deep Two-Stream ConvNets
Towards GPGPU Assisted Computing in Virtualized Environments
Towards GPU Parallelism Abstractions in Rust: A Case Study with Linear Pipelines
Towards GPU-Accelerated Large-Scale Graph Processing in the Cloud
Towards Green Computing: A Survey of Performance and Energy Efficiency of Different Platforms using OpenCL
Towards High Performance Java-based Deep Learning Frameworks
Towards High Speed Aerial Tracking of Agile Targets
Towards High-Performance and Cost-Effective Distributed Storage Systems with Information Dispersal Algorithms
Towards Improving Programmability of Heterogeneous Parallel Architectures
Towards Intelligent Runtime Framework for Distributed Heterogeneous Systems
Towards Interactive Visual Exploration of Parallel Programs using a Domain-specific Language
Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors
Towards large-scale network analytics
Towards Lattice Quantum Chromodynamics on FPGA devices
Towards making the most of NLP-based device mapping optimization for OpenCL kernels
Towards Memory-Efficient Answering of Tree-Shaped SPARQL Queries using GPUs
Towards metaprogramming for parallel systems on a chip
Towards microsecond biological molecular dynamics simulations on hybrid processors
Towards Modeling Energy Consumption of Xeon Phi
Towards multi-GPU support for visualization
Towards Multi-GPU Support in the Marrow Skeleton Framework
Towards On-Chip Optical FFTs for Convolutional Neural Networks
Towards On-Line Digital Doubles
Towards paradisEO-MO-GPU: a framework for GPU-based local search metaheuristics
Towards Parallel Programming Models for Predictability
Towards Path Tracing in Games
Towards Performance Portable Programming for Distributed Heterogeneous Systems
Towards Performance-Aware Allocation for Accelerated Machine Learning on GPU-SSD Systems
Towards Performance-Portable, Scalable, and Convenient Linear Algebra
Towards Portable Performance for Explicit Hydrodynamics Codes
Towards Porting a Real-World Seismological Application to the Intel MIC Architecture
Towards Predictable Real-Time Performance on Multi-Core Platforms
Towards Rapid Prototyping of Parallel and HPC Applications (GPU Focus)
Towards real time 2D to 3D registration for ultrasound-guided endoscopic and laparoscopic procedures
Towards real time 3D tracking and reconstruction on a GPU using Monte Carlo simulations
Towards real time vision based UUV navigation using GPU technology
Towards real-time radiation therapy: GPU accelerated superposition/convolution
Towards real-time tomography: Fast reconstruction algorithms and GPU implementation
Towards reverse engineering the brain: Modeling abstractions and simulation frameworks
Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization
Towards robust automatic detection of vulnerable road users: monocular pedestrian tracking from a moving vehicle
Towards scalar synchronization in SIMT architectures
Towards shared memory consistency models for GPUs
Towards smart-pixel-based implementation of wideband active sonar echolocation system for multi-target detection
Towards solving the Table Maker's Dilemma on GPU
Towards Studying the Effect of Compiler Optimizations and Software Randomization on GPU Reliability
Towards systematic exploration of tradeoffs for medical image registration on heterogeneous platforms
Towards Understanding and Mitigating Memory-Access Challenges in Computing Systems
Towards Unified Analysis of GPU Consistency
Towards Unified INT8 Training for Convolutional Neural Network
Towards user transparent parallel multimedia computing on GPU-clusters
Towards Utilizing GPUs in Information Visualization: A Model and Implementation of Image-Space Operations
Towards Utilizing Remote GPUs for CUDA Program Execution
TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s
Track finding in ATLAS using GPUs
Tracking 3D Pose of Rigid Object by Sparse Template Matching
Tracking and Clustering Salient Features in Image Sequences
Tracking humans interacting with the environment using efficient hierarchical sampling and layered observation models
Tracking Many Solution Paths of a Polynomial Homotopy on a Graphics Processing Unit
Tradeoff analysis and optimization of power delivery networks with on-chip voltage regulation
Tradeoffs in designing accelerator architectures for visual computing
Trainable Nonlinear Reaction Diffusion: A Flexible Framework for Fast and Effective Image Restoration
Training a Feedback Loop for Hand Pose Estimation
Training a Vision Transformer from scratch in less than 24 hours with 1 GPU
Training DNN Models over Heterogeneous Clusters with Optimal Performance
Training Logistic Regression and SVM on 200GB Data Using b-Bit Minwise Hashing and Comparisons with Vowpal Wabbit (VW)
Training Neural Networks Without Gradients: A Scalable ADMM Approach
Transformation of CPU-based Applications To Leverage on Graphics Processors using CUDA
TransAxx: Efficient Transformers with Approximate Computing
TransCAIP: A Live 3D TV System Using a Camera Array and an Integral Photography Display with Interactive Control of Viewing Parameters
TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework
Transfer Time Reduction of Data Transfers between CPU and GPU
Transform Coding for Hardware-accelerated Volume Rendering
Transformation of Scientific Algorithms to Parallel Computing Code: Single GPU and MPI multi GPU Backends with Subdomain Support
Transformations of High-Level Synthesis Codes for High-Performance Computing
Transforming and Optimizing Irregular Applications for Parallel Architectures
Transforming C OpenMP Programs for Verification in CIVL
Translating GPU binaries to tiered SIMD architectures with Ocelot
Translating OpenMP Device Constructs to OpenCL using Unnecessary Data Transfer Elimination
Translation-invariant two-dimensional discrete wavelet transform on graphics processing units
Transparent Acceleration for Heterogeneous Platforms With Compilation to OpenCL
Transparent Acceleration of Java-based Deep Learning Engines
Transparent Accelerator Migration in a Virtualized GPU Environment
Transparent Checkpoint-Restart for Hardware-Accelerated 3D Graphics
Transparent Checkpointing for OpenGL Applications on GPUs
Transparent Compiler and Runtime Specializations for Accelerating Managed Languages on FPGAs
Transparent CPU-GPU Collaboration for Data-Parallel Kernels on Heterogeneous Systems
Transparent FPGA Acceleration with TensorFlow
Transparent use of Java objects on the GPU in the JaMP/OpenMP framework
Trapping of giant-planet cores - I. vortex aided trapping at the outer dead zone edge
Tree Structured Analysis on GPU Power Study
Treecode and fast multipole method for N-body simulation with CUDA
TREES: A CPU/GPU Task-Parallel Runtime with Explicit Epoch Synchronization
Trellis: Portability Across Architectures with a High-level Framework
Tri-Hybrid Computational Fluid Dynamics on DOE's Cray XK7, Titan
Triangular matrix inversion on Graphics Processing Unit
Triangular mesh simplification on the GPU
Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems
Trie Compression for GPU Accelerated Multi-Pattern Matching
TrimZero: A Torch Recurrent Module for Efficient Natural Language Processing
triSYCL for Xilinx FPGA
Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context
Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
tritonBLAS: Triton-based Analytical Approach for GEMM Kernel Parameter Selection
TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization
True 4-Bit Quantized Convolutional Neural Network Training on CPU: Achieving Full-Precision Parity
True 4D Image Denoising on the GPU
TRUST: the HPC open-source CFD platform – from CPU to GPU
TTC: A Tensor Transposition Compiler for Multiple Architectures
TuCCompi: A Multi-Layer Programming Model for Heterogeneous Systems with Auto-Tuning Capabilities
Tuned and asynchronous stencil kernels for CPU/GPU systems (thesis)
Tuned and GPU-accelerated parallel data mining from comparable corpora
Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems
Tuning a Finite Difference Computation for Parallel Vector Processors
Tuning A Hybrid GPU-CPU V-cycle Multilevel Preconditioner for Solving Large Real and Complex Systems of FEM Equations
Tuning Manifold Harmonics Filters
Tuning Stencil Codes in OpenCL for FPGAs
Tuning Streamed Applications on Intel Xeon Phi: A Machine Learning Based Approach
Turbo Bayesian Compressed Sensing
Tutorial 3: Methodologies and Performance Impacts of General Purpose Computing on GPUs
Tutoring LLM into a Better CUDA Optimizer
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
TVM: End-to-End Optimization Stack for Deep Learning
Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors
Two Algorithms for Sorting On Heterogeneous Clusters
Two Approaches to Particle Simulation: OpenMPI and CUDA
Two improved GPU acceleration strategies for force-directed graph layout
Two Simple Single-pass GPU methods for Multi-channel Surface Voxelization of Dynamic Scenes
Two Stage Data Mining Technique for Fast Monsoon Onset Prediction
Two-electron integral evaluation on the graphics processor unit
Two-fluid compressible simulations on GPU cluster
Two-Level Approach to Efficient Visualization of Protein Dynamics
Two-stage compression for fast volume rendering of time-varying scalar data
Two-way partitioning of a recursive Gaussian filter in CUDA
Two-Way Real Time Fluid Simulation Using a Heterogeneous Multicore CPU and GPU Architecture
TWQCD's dynamical DWF project
Type-safe Runtime Code Generation: Accelerate to LLVM
U-Net: Convolutional Networks for Biomedical Image Segmentation
UAV Path Planning with Parallel Genetic Algorithms on CUDA Architecture
uBench: Performance Impact of CUDA Block Geometry
UberFlow: a GPU-based particle engine
Ubiquitous Parallel Computing from Berkeley, Illinois, and Stanford
UCHPC - UnConventional High Performance Computing for Finite Element Simulations
Ultra-Fast Detection of Higher-Order Epistatic Interactions on GPUs
Ultra-Fast Displaying Spectral Domain Optical Doppler Tomography System Using a Graphics Processing Unit
Ultra-fast FFT protein docking on graphics processors
Ultra-Fast Hybrid CPU-GPU Multiple Scatter Simulation for 3D PET
Ultra-fast treatment plan optimization for volumetric modulated arc therapy (VMAT)
Ultra-low latency recurrent neural network inference on FPGAs for physics applications with hls4ml
Ultrasound goes GPU: real-time simulation using CUDA
Ultrasound Image Simulation with GPU-based Ray Tracing
Uncertainty-Aware Guided Volume Segmentation
Uncluttering Graph Layouts Using Anisotropic Diffusion and Mass Transport
Uncontracted Rys Quadrature Implementation of up to G Functions on Graphical Processing Units
Under the Hood of SYCL - An Initial Performance Analysis With an Unstructured-mesh CFD Application
Understanding and Modeling the Synchronization Cost in the GPU Architecture
Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric
Understanding GEMM Performance and Energy on NVIDIA Ada Lovelace: A Machine Learning-Based Analytical Approach
Understanding GPU Programming for Statistical Computation: Studies in Massively Parallel Massive Mixtures
Understanding GPU Triggering APIs for MPI+X Communication
Understanding GPU-Based Lossy Compression for Extreme-Scale Cosmological Simulations
Understanding Latency Hiding on GPUs
Understanding Performance Portability of Bioinformatics Applications in SYCL on an NVIDIA GPU
Understanding Protein Dynamics with L1-Regularized Reversible Hidden Markov Models
Understanding software approaches for GPGPU reliability
Understanding the Costs of Many-Task Computing Workloads on Intel Xeon Phi Coprocessors
Understanding the design trade-offs among current multicore systems for numerical computations
Understanding the efficiency of GPU algorithms for matrix-matrix multiplication
Understanding the efficiency of parallel incomplete Cholesky preconditioners on the performance of ICCG solvers for multi-core and GPU systems
Understanding the efficiency of ray traversal on GPUs
Understanding the impact of CUDA tuning techniques for Fermi
Understanding the Impact of Hybrid Programming on Software Energy Efficiency
Understanding the Impact of Input Entropy on FPU, CPU, and GPU Power
Understanding the ISA impact on GPU Architecture
Understanding the Landscape of Ampere GPU Memory Errors
Understanding the Performance of HPC Applications
Understanding the Power of Evolutionary Computation for GPU Code Optimization
Understanding the SIMD Efficiency of Graph Traversal on GPU
Understanding the Topics and Challenges of GPU Programming by Classifying and Analyzing Stack Overflow Posts
Unfolding and Shrinking Neural Machine Translation Ensembles
UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization
UNICORN: A Bulk Synchronous Programming Model, Framework and Runtime for Hybrid CPU-GPU Clusters
Unified - A Sharp Turn in the Latest Era of Graphic Processors
Unified Deep Learning with CPU, GPU, and FPGA Technologies
Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment
Unified Parallel C for GPU Clusters: Language Extensions and Compiler Implementation
Unified Particle Physics for Real-Time Applications
Unified schemes for directive-based GPU offloading
Unified Shader Programming in C++
Unified Shared Memory: Friend or Foe?
Unified system of code transformation and execution for heterogeneous multi-core architectures
Unified Tables for Exponential and Logarithm Families
UniFL: Accelerating Federated Learning Using Heterogeneous Hardware Under a Unified Framework
Uniform partitioning of Monte Carlo radiosity on GPUs
Unifying stream based and reconfigurable computing to design application accelerators
Unleashing the Power of Distributed CPU/GPU Architectures: Massive Astronomical Data Analysis and Visualization case study
Unlocking Bandwidth for GPUs in CC-NUMA Systems
Unsafe Floating-point to Unsigned Integer Casting Check for GPU Programs
Unstructured grid applications on GPU: performance analysis and improvement
Unsupervised Asset Cluster Analysis Implemented with Parallel Genetic Algorithms on the NVIDIA CUDA Platform
Unsupervised Deep Learning of Incompressible Fluid Dynamics
Unsupervised Markovian Segmentation on Graphics Hardware
Up to 700k GPU cores, Kepler, and the Exascale future for simulations of star clusters around black holes
UPC on MIC: Early Experiences with Native and Symmetric Modes
Urban Regional Seismic Damage Prediction Based On GPU-CPU Hybrid Computing
Usable assembly language for GPUs: a success story
Usage of GPU in LS-DYNA
Use NVIDIA CUDA technology to create genetic algorithms with extensive population
Use of Checkpoint-Restart for Complex HEP Software on Traditional Architectures and Intel MIC
Use of CUDA for the Continuous Space Language Model
Use of CUDA Parallel Computing Technology in Modeling of Solid Mineral Deposits
Use of FPGA or GPU-based architectures for remotely sensed hyperspectral image processing
Use of modern GPUs in Design Optimization
Use of Multi-GPU Systems for Larger Than Device FFTs: With Applications in Ultrasound Simulations
Use of Multiple GPUs on Shared Memory Multiprocessors for Ultrasound Propagation Simulations
Use of Multiple GPUs to Speedup the Execution of a Three-Dimensional Computational Model of the Innate Immune System
User-Driven Online Kernel Fusion for SYCL
User's needs influencing HPC technologies
Uses of GPU Powered Interval Optimization for Parameter Identification in the Context of SO Fuel Cells
Using a GPU to accelerate die and mold fabrication
Using a GPU-CPU architecture to speed up a GA-based real-time system for trading the stock market
Using a GPU, Online Diarization - Offline Diarization
Using AI libraries for Incompressible Computational Fluid Dynamics
Using an OpenCL Framework to Evaluate Interconnect Implementations on FPGAs
Using Artificial Intelligence in Computational Games
Using Butterfly-Patterned Partial Sums to Optimize GPU Memory Accesses for Drawing from Discrete Distributions
Using Commodity Coprocessors for Host Intrusion Detection
Using Commodity Graphics Hardware for Real-Time Digital Hologram View-Reconstruction
Using common graphics hardware for multi-agent traffic simulation with CUDA
Using Compiler Directives for Performance Portability in Scientific Computing: Kernels from Molecular Simulation
Using Compiler Snippets to Exploit Parallelism on Heterogeneous Hardware: A Java Reduction Case Study
Using Compute Unified Device Architecture (CUDA) in Parallelizing Different Digital Image Processing Techniques
Using CUDA architecture for computer simulations of thermomechanical phenomena
Using CUDA Architecture for the Computer Simulation of the Casting Solidification Process
Using CUDA for Exhaustive Password Recovery
Using CUDA GPU to Accelerate the Ant Colony Optimization Algorithm
Using Data Compression for Increasing Efficiency of Data Transfer Between Main Memory and Intel Xeon Phi Coprocessor or NVidia GPU in Parallel DBMS
Using Deep Convolutional Neural Networks in Monte Carlo Tree Search
Using Deep Reinforcement Learning for Automatic Code Optimization in the MLIR Compiler
Using DRBL to Deploy MPICH2 and CUDA on Green Computing
Using efficient parallelization in Graphic Processing Units to parameterize stochastic fire propagation models
Using Fermi architecture knowledge to speed up CUDA and OpenCL programs
Using generalized ensemble simulations and Markov state models to identify conformational states
Using GPU for query of email spam detection systems and IDS
Using GPU shaders for visualization
Using GPU Shaders for Visualization, Part 2
Using GPU Simulation to Accurately Fit to the Power-Law Distribution
Using GPU to Accelerate Cache Simulation
Using GPU to exploit parallelism on cryptography
Using GPU VSIPL & CUDA to Accelerate RF Clutter Simulation
Using GPU-based Computing To Accelerate Finite Element Problems
Using GPUs for beamforming acceleration on SAFT imaging
Using GPUs for Machine Learning Algorithms
Using GPUs for Realtime Prediction of Optical Forces on Microsphere Ensembles
Using GPUs to Accelerate Installed Antenna Performance Simulations
Using GPUs to Crack Android Pattern-based Passwords
Using GPUs to Improve Multigrid Solver Performance on a Cluster
Using Graph Properties to Speed-up GPU-based Graph Traversal: A Model-driven Approach
Using Graphic Processing Unit in Block Cipher Calculations (thesis)
Using Graphic Processor Units for the Study of Electric Propagation in Realistic Heart Models
Using Graphical Processing Units for Deterministic Single Machine Scheduling Problems
Using Graphical Processing Units in Scheduling Problems
Using graphics devices in reverse: GPU-based Image Processing and Computer Vision
Using Graphics Hardware for Enhancing Edge and Circle Detection
Using Graphics Processing Unit to Accelerate Database Query Execution
Using Graphics Processing Units for Logic Simulation of Electronic Designs
Using graphics processing units to generate random numbers
Using Graphics Processing Units to Parallelize the FDK Algorithm for Tomographic Image Reconstruction
Using Graphics Processing Units to solve the classical N-body problem in physics and astrophysics
Using Graphics Processor Units (GPUs) for Automatic Video Structuring
Using Graphics Processors for a High Performance Normalization of Gene Expressions
Using graphics processors for high performance IR query processing
Using Graphics Processors for High-Performance Computation and Visualization of Plasma Turbulence
Using Graphics Processors for Parallelizing Hash-based Data Carving
Using Graphics Processors to Accelerate Synthetic Aperture Sonar Imaging via Backpropagation
Using graphics processors to accelerate the computation of the matrix inverse
Using Graphics Processors to Accelerate the Solution of Out-of-Core Linear Systems
Using Graphics Processors to Facilitate Explicit Digital Electrochemical Simulation: Theory of Elliptical Disc Electrodes
Using hardware performance counters to speed up autotuning convergence on GPUs
Using high performance computing and Monte Carlo simulation for pricing American options
Using High Performance Computing for Optimizing Credit Risk Calculation
Using High Performance Computing to Improve Image Guided Cancer Treatment
Using Hybrid CPU-GPU Platforms to Accelerate the Computation of the Matrix Sign Function
Using hybrid GPU/CPU kernel splitting to accelerate spherical convolutions
Using Hybrid Shared and Distributed Caching for Mixed-Coherency GPU Workloads
Using Image Morphing for Memory-Efficient Impostor Rendering on GPU
Using Intel oneAPI for Multi-hybrid Acceleration Programming with GPU and FPGA Coupling
Using JavaScript and WebCL for Numerical Computations: A Comparative Study of Native and Web Technologies
Using Machine Learning to Estimate Utilization and Throughput for OpenCL-Based SpMV Implementation on an FPGA
Using many-core hardware to correlate radio astronomy signals
Using Meta-heuristics and Machine Learning for Software Optimization of Parallel Computing Systems: A Systematic Literature Review
Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy
Using mobile GPU for general-purpose computing - a case study of face recognition on smartphones
Using modern C++ to improve CUDA programs
Using modern graphics architectures for general-purpose computing: a framework and analysis
Using Modularity Metrics to assist Move Method Refactoring of Large System
Using multiple GPUs to accelerate string searching for digital forensic analysis
Using of GPUs for cluster analysis of large data by K-means method
Using of New Possibilities of Fermi Architecture by Development of GPGPU Programs
Using OpenCL for image analysis
Using OpenCL for Implementing Simple Parallel Graph Algorithms
Using OpenCL to Calculate a Pressure Field
Using OpenCL to Implement Median Filtering and RSA Algorithms: Two GPGPU Application Case Studies
Using OpenCL: Programming Massively Parallel Computers
Using OpenGL State History for Graphics Debugging
Using P System with GPU Model to Design and Implement a Public Key Cryptography
Using Parallel Computing for the Display and Simulation of the Space Debris Environment
Using parallel GPU architecture for simulation of planar I/F networks
Using Parallel Programming Models for Automotive Workloads on Heterogeneous Systems - a Case Study
Using reconfigurable computing technology to accelerate matrix decomposition and applications
Using Reconfigurable Logic to Optimise GPU Memory Accesses
Using RenderScript and RCUDA for Compute Intensive tasks on Mobile Devices: a Case Study
Using scheduling entropy amplification in CUDA/OpenMP code to exhibit non-reproducibility issues
Using Shared Memory as a Cache in Cellular Automata Water Flow Simulations on GPUs
Using SIMD and SIMT vectorization to evaluate sparse chemical kinetic Jacobian matrices and thermochemical source terms
Using sparse optical flow for multiple Kinect applications
Using the CPU and GPU for real-time video enhancement on a mobile computer
Using the CPU to Improve Performance in 3D Applications
Using the GPGPU for Scaling Up Mining Software Repositories
Using the GPU for Fast Symmetry-Based Dense Stereo Matching in High Resolution Images
Using the High Productivity Language Chapel to Target GPGPU Architectures
Using the physics-based rendering toolkit for medical reconstruction
Using the PhysX engine for physics-based virtual surgery with force feedback
Using the pyMIC Offload Module in PyFR
Using the Tsetlin Machine to Learn Human-Interpretable Rules for High-Accuracy Text Categorization with Medical Applications
Using visualization to reveal weak cryptosystems
Using Workload Characterization to Guide High Performance Graph Processing
UT-OCL: An OpenCL Framework for Embedded Systems Using Xilinx FPGAs
Utilising OpenCL Framework for Ray-Tracing Acceleration
Utilization of GPU for real-time vision in robotics
Utilizing GPGPU in Computer Emulation
Utilizing GPUs to Accelerate Turbomachinery CFD Codes
Utilizing Graphics Processing Units for Network Anomaly Detection
Utilizing Graphics Processing Units for Rapid Facial Recognition Using Video Input
Utilizing Hierarchical Multiprocessing for Medical Image Registration
Utilizing jump flooding in image-based soft shadows
Utilizing massive parallelism in decoding of modern error-correcting codes for accelerating communication systems simulations
Utilizing state-of-art NeuroES and GPGPU to optimize Mario AI
Utilizing Tensor Cores in Futhark
UVMBench: A Comprehensive Benchmark Suite for Researching Unified Virtual Memory in GPUs
Valar: A Benchmark Suite to Study the Dynamic Behavior of Heterogeneous Systems
Validation of GPU Computation in Decentralized, Trustless Networks
Validation of the PyGBe code for Poisson-Boltzmann equation with boundary element methods
Value Prediction and Speculative Execution on GPU
ValuePack: value-based scheduling framework for CPU-GPU clusters
Variable Bit Rate GPU Texture Decompression
Variable selection in a GPU cluster using delta test
Variants of Jump Flooding Algorithm for Computing Discrete Voronoi Diagrams
Variants of Mersenne Twister Suitable for Graphic Processors
Variational Bayesian Image Super-Resolution with GPU Acceleration
Various String Matching Algorithms for DNA Sequences to Detect Breast Cancer using CUDA Processors
VASP on a GPU: application to exact-exchange calculations of the stability of elemental boron
vCUDA Framework Development for GPU Virtualization
vCUDA: GPU accelerated high performance computing in virtual machines
VDBSCAN+: Performance Optimization Based on GPU Parallelism
Vector and Line Quantization for Billion-scale Similarity Search on GPUs
Vector graphics depicting marbling flow
Vector Quantization: A Many-Core Approach
Vectorization of Hybrid Breadth First Search on the Intel Xeon Phi
Vectorized algorithm for multidimensional Monte Carlo integration on modern GPU, CPU and MIC architectures
Vectorized Higher Order Finite Difference Kernels
Vectorized OpenCL implementation of numerical integration for higher order finite elements
Vendors Draw up a New Graphics-Hardware Approach
Vergence Using GPU Cepstral Filtering
Verifiable Computation with Massively Parallel Interactive Proofs
Verification of GPU Program Optimizations in Lean
Verification of Producer-Consumer Synchronization in GPU Programs
Verification of Program Parallelization
Verified Instruction-Level Energy Consumption Measurement for NVIDIA GPUs
Verifying CUDA Programs using SMT-Based Context-Bounded Model Checking
Verifying GPU Kernels by Test Amplification
VertexAPI2 - A Vertex-Program API for Large Graph Computations on the GPU
Very fast ellipse detection using GPU-based RHT
Very Fast Non-Dominated Sorting
VHF SAR image formation implemented on a GPU
Viability of Feature Detection on Sony Xperia Z3 using OpenCL
VibeCodeHPC: An Agent-Based Iterative Prompting Auto-Tuner for HPC Code Generation Using LLMs
Video architecture and real-time lighting technology for tangible teleconference
Video Coding on Multicore Graphics Processors
Video coding on multicore graphics processors (GPUs)
Videogame Graphics, BigData & Analytics
View-dependent exploration of massive volumetric models on large-scale light field displays
View-Dependent Real-Time Rendering of Large Outdoor Scenes
View-Dependent Streamlines for 3D Vector Fields
Viewpoints: A high-performance high-dimensional exploratory data analysis tool
VirtCL: a framework for OpenCL device abstraction and management
Virtual open heart surgery: obtaining models suitable for surgical simulation
Virtual Rheoscopic Fluids
Virtual Texturing with WebGL
Virtual Viewpoint Disparity Estimation and Convergence Check for Real-Time View Synthesis
Virtualization and Migration with GPGPUs
Virtualizing Data Parallel Systems for Portability, Productivity, and Performance
Virtualizing Deep Neural Networks for Memory-Efficient Neural Network Design
Visibility Cuts: A System for Rendering Dynamic Virtual Environments
Visibility Sampling on GPU and Applications
Vision based Navigation (VBN) of Unmanned Aerial Vehicles (UAV)
Vispark: GPU-Accelerated Distributed Visual Computing Using Spark
VisPy: Harnessing The GPU For Fast, High-Level Visualization
Visual Analysis Algorithms for Embedded Systems
Visual Computing in Biology and Medicine: Interactive visual analysis of contrast-enhanced ultrasound data based on small neighborhood statistics
Visual cortex on the GPU: Biologically inspired classifier and feature descriptor for rapid recognition
Visual Data Mining Using the Point Distribution Tensor
Visual Human - Machine Learning
Visual Performance Analysis of Memory Behavior in a Task-Based Runtime on Hybrid Platforms
Visual Signatures in Video Visualization
Visual Simulation of Breaking Waves in Shallow Water
Visual Simulation of Flow
Visual Simulation of Heat Shimmering and Mirage
Visual simulation of shockwaves
Visual simulation of thermal fluid dynamics in a pressurized water reactor
Visual system design for excavator simulator with deformable terrain
Visual-model-based, real-time 3D pose tracking for autonomous navigation: methodology and experiments
Visual, Spatial and Temporal Quality in Video-Based Reconstruction of People: Achieving, Prototyping and Evaluating
Visualisation of Physical Lung Simulation: an Interactive Application to Assist Physicians
Visualising Interfaces in Scalar and Vector Field-Model Simulations
Visualising spins and clusters in regular and small-world Ising models with GPUs
Visualization and Analysis of GPU Summer School Applicants and Participants
Visualization and Correction of Automated Segmentation, Tracking and Lineaging from 5-D Stem Cell Image Sequences
Visualization and GPU-accelerated simulation of medical ultrasound from CT images
Visualization assisted by parallel processing
Visualization in the Einstein Year 2005: a case study on explanatory and illustrative visualization of relativity and astrophysics
Visualization of Astronomical Nebulae via Distributed Multi-GPU Compressed Sensing Tomography
Visualization of Fibrous and Thread-like Data
Visualization of large multidimensional data sets by using multi-core CPU, GPU and MPI cluster
Visualization of Large Volumetric Multi-Channel Microscopy Data Streams on Standard PCs
Visualization of level-of-detail meshes on the GPU
Visualization of LIDAR datasets using point-based rendering technique
Visualization of OpenCL Application Execution on CPU-GPU Systems
Visualization of Pareto Solutions by Spherical Self-Organizing Map and Its acceleration on a GPU
Visualization of structured nonuniform grids
Visualization Tool for GPGPU Programming
Visualization with stylized line primitives
Visualizing and Analyzing the Mona Lisa
Visualizing complex dynamics in many-core accelerator architectures
Visualizing Complex Functions Using GPUs
Visualizing Multiwavelength Astrophysical Data
Visualizing the Radiation of the Kelvin-Helmholtz Instability
Visualizing Trends on Twitter
VitBit: Enhancing Embedded GPU Performance for AI Workloads through Register Operand Packing
Vivaldi: A Domain-Specific Language for Volume Processing and Visualization on Distributed Heterogeneous Systems
Vlasov on GPU (VOG Project)
VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units
Voice Command Recognition with Dynamic Time Warping (DTW) using Graphics Processing Units (GPU) with Compute Unified Device Architecture (CUDA)
VolQD: Direct Volume Rendering of Multi-million Atom Quantum Dot Simulations
Volume and Isosurface Rendering with GPU-Accelerated Cell Projection
Volume exploration using ellipsoidal Gaussian transfer functions
Volume Raycasting Performance Using DirectCompute
Volume rendering visualization of 3D spherical mantle convection with an unstructured mesh
Volume Visualization: A Technical Overview with a Focus on Medical Applications
Volume-preserving FFD for programmable graphics hardware
Volumetric Ambient Occlusion
Volumetric Ambient Occlusion for Real-Time Rendering and Games
Volumetric Rendering Techniques for Scientific Visualization
Voreen: A Rapid-Prototyping Environment for Ray-Casting-Based Volume Visualizations
Voronoi Toolpaths for PCB Mechanical Etch: Simple and Intuitive Algorithms with the 3D GPU
Vortex Methods for Fluid Simulation in Computer Graphics
Vortex methods for incompressible flow simulations on the GPU
Vortex particle method and parallel computing
Vortex: Overcoming Memory Capacity Limitations in GPU-Accelerated Large-Scale Data Analytics
Voxelized Minkowski sum computation on the GPU with robust culling
VoxelPipe: a programmable pipeline for 3D voxelization
Voxels on fire
VSIPL++ Acceleration Using Commodity Graphics Processors
vSMC: Parallel Sequential Monte Carlo in C++
Vulkan 1.1.97 - A Specification (with all registered Vulkan extensions)
Vulnerability Analysis and Attacks on Intel Xeon Phi Coprocessor
Vulnerable GPU Memory Management: Towards Recovering Raw Data from GPU
Wait-free programming for general purpose computations on graphics processors
waLBerla: A block-structured high-performance framework for multiphysics simulations
Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning
Wanted: Floating-Point Add Round-off Error instruction
Warp Size Impact in GPUs: Large or Small?
Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation
Warp-Level Parallelism: Enabling Multiple Replications In Parallel on GPU
WarpCore: A Library for fast Hash Tables on GPUs
WarpDrive: Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU
Warped Register File: A Power Efficient Register File for GPGPUs
Warps and Atomics: Beyond Barrier Synchronization in the Verification of GPU Kernels
Wasserstein-Fisher-Rao Document Distance
Waste Not, Want Not! Managing relational data in asymmetric memories
Waste Not... Efficient Co-Processing of Relational Data
Water simulation based on HLSL
Water simulation for cell based sandbox games
Water Surface Animation using Damped Wave Equation and CUDA Acceleration
wav2letter++: The Fastest Open-source Speech Recognition System
Wave field synthesis for 3D audio: architectural prospectives
Wavefront raycasting using larger filter kernels for on-the-fly GPU gradient reconstruction
Wavelet Encoding and Multi-GPU Programming
Wavelet Model-based Stereo for Fast, Robust Face Reconstruction
WAYPOINT: scaling coherence to thousand-core architectures
WCCV: Improving the Vectorization of IF-statements with Warp-Coherent Conditions
Weak execution ordering - exploiting iterative methods on many-core GPUs
WebCL for Hardware-Accelerated Web Applications
Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems
Weighted Residuals for Very Deep Networks
WgPy: GPU-accelerated NumPy-like array library for web browsers
What you see is what you snap: snapping to geometry deformed on the GPU
When HLS Meets FPGA HBM: Benchmarking and Bandwidth Optimization
When Machine Learning Meets Quantum Computers: A Case Study
Where is the data? Why you cannot debate CPU vs. GPU performance without the answer
Whippletree: Task-based Scheduling of Dynamic Workloads on the GPU
Whole-function vectorization
Why does PHM matter? - Nvidia's GPU problems reviewed
Why is FPGA-GPU Heterogeneity the Best Option for Embedded Deep Neural Networks?
Why it is time for a HyPE: A Hybrid Query Processing Engine for Efficient GPU Coprocessing in DBMS
Wideband Channelization for Software-Defined Radio via Mobile Graphics Processors
WiLLM: An Open Wireless LLM Communication System
Wilson and Domainwall Kernels on Oakforest-PACS
Winograd Algorithm for AdderNet
Wire Speed Name Lookup: A GPU-based Approach
Wireless Interference Identification with Convolutional Neural Networks
word2ket: Space-efficient Word Embeddings inspired by Quantum Entanglement
Work Efficient Parallel Algorithms for Large Graph Exploration
Work in Progress: Vortex Detection and Visualization for Design of Micro Air Vehicles and Turbomachinery
Work Stealing Inside GPUs
Work-Efficient Parallel GPU Methods for Single-Source Shortest Paths
Working With Incremental Spatial Data During Parallel (GPU) Computation
Workload Analysis and Efficient OpenCL-based Implementation of SIFT Algorithm on a Smartphone
Workload and network-optimized computing systems
Workload Aware Algorithms for Heterogeneous Platforms
Workload Balancing on Heterogeneous Systems: A Case Study of Sparse Grid Interpolation
Workload Characterization of 3D Games
Workload distribution and balancing in FPGAs and CPUs with OpenCL and TBB
Workload Scheduling on Heterogeneous Devices
Workload-aware Automatic Parallelization for Multi-GPU DNN Training
Worst-Case Execution Time Guarantees for Runtime-Reconfigurable Architectures
WPA/WPA2 Password Security Testing using Graphics Processing Units
Wrinkling Coarse Meshes on the GPU
Writing a modular GPGPU program in Java
Writing a performance-portable matrix multiplication
Writing self-adaptive codes for heterogeneous systems
X-Device Query Processing by Bitwise Distribution
X-ray CT on the GPU
X-toon: an extended toon shader
XBOOLE-CUDA: Fast Boolean Operations on the GPU
Xbox 360 System Architecture
Xbox360 Front Side Bus - A 21.6 GB/s End-to-End Interface Design
Xeon Phi: A comparison between the newly introduced MIC architecture and a standard CPU through three types of problems
XeonPhi Meets Astrophysical Fluid Dynamics
XGBoost: Scalable GPU Accelerated Learning
XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures
XMalloc: A Scalable Lock-free Dynamic Memory Allocator for Many-core Machines
XML3D: interactive 3D graphics for the web
XMT-GPU: A PRAM Architecture for Graphics Computation
XSD: Accelerating MapReduce by Harnessing the GPU inside an SSD
YaDiV-an open platform for 3D visualization and 3D segmentation of medical data
Yang-Mills lattice on CUDA
YodaNN: An Ultra-Low Power Convolutional Neural Network Accelerator Based on Binary Weights
You Can Type, but You Can't Hide: A Stealthy GPU-based Keylogger
Ypnos: declarative, parallel structured grid programming
ytopt: Autotuning Scientific Applications for Energy Efficiency at Large Scales
ZAME: Interactive Large-Scale Graph Visualization
Zero-copy I/O processing for low-latency GPU computing
Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training
Zippy: A Framework for Computation and Visualization on a GPU Cluster
ZNN - A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-Core and Many-Core Shared Memory Machines
Zorua: Enhancing Programming Ease, Portability, and Performance in GPUs by Decoupling Programming Models from Resource Management
ZUCL: A ZYNQ UltraScale+ Framework for OpenCL HLS Applications