Papers on hgpu.org (.txt-file)
SW#db: GPU-accelerated exact sequence similarity database search

Swan: A tool for porting CUDA programs to OpenCL

SWAPHI: Smith-Waterman Protein Database Search on Xeon Phi Coprocessors

Swarm-NG: a CUDA Library for Parallel n-body Integrations with focus on Simulations of Planetary Systems

Swarm’s flight: Accelerating the particles using C-CUDA

swCaffe: a Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight

swCUDA: Auto parallel code translation framework from CUDA to ATHREAD for new generation sunway supercomputer

Swendsen-Wang Multi-Cluster Algorithm for the 2D/3D Ising Model on Xeon Phi and GPU

Swept Volume approximation of polygon soups

SWIFOLD: Smith-Waterman implementation on FPGA with OpenCL for long DNA sequences

Switching to High Gear: Opportunities for Grand-Scale Real-Time Parallel Simulations

Swizzle Inventor: Data Movement Synthesis for GPU Kernels

SWM: Simplified Wu-Manber for GPU-based Deep Packet Inspection

SWPS3 – fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2

SYCL Code Generation for Multigrid Methods

SYCL compute kernels for ExaHyPE

SYCL in the edge: performance and energy evaluation for heterogeneous acceleration

SYCL in the Edge: Performance Evaluation for Heterogeneous Acceleration

SYCL-Bench 2020: Benchmarking SYCL 2020 on AMD, Intel, and NVIDIA GPUs

SYCL-Bench: A Versatile Cross-Platform Benchmark Suite for Heterogeneous Computing

SYCL-Bench: A Versatile Single-Source Benchmark Suite for Heterogeneous Computing

SYCLops: A SYCL Specific LLVM to MLIR Converter

Sylkan: Towards a Vulkan Compute Target Platform for SYCL

Symbolic Crosschecking of Data-Parallel Floating Point Code

Symbolic crosschecking of floating-point and SIMD code

Symbolic Differentiation in GPU Shaders

Symbolic Testing of OpenCL Code

Symphony: A Scheduler for Client-Server Applications on Coprocessor-based Heterogeneous Clusters

Synchronization and Coordination in Heterogeneous Processors

Synchronization and Ordering Semantics in Hybrid MPI+GPU Programming

Synergia CUDA: GPU-accelerated accelerator modeling package

Synergistic CPU-FPGA Acceleration of Sparse Linear Algebra

Synergistic execution of stream programs on multicores with accelerators

SYnergy: Fine-grained Energy-Efficient Heterogeneous Computing for Scalable Energy Saving

Synkhronos: a Multi-GPU Theano Extension for Data Parallelism

SynPerf: A Hybrid Analytical-ML Framework for GPU Performance Prediction

Synthesis and rendering of bidirectional texture functions on arbitrary surfaces

Synthesis of Custom Networks of Heterogeneous Processing Elements for Complex Physical System Emulation

Synthesis of Embedded Software using Dataflow Schedule Graphs

Synthesis of GPU Programs from High-Level Models

Synthesis of Platform Architectures from OpenCL Programs

Synthesizing Benchmarks for Predictive Modeling

Synthesizing Software from a ForSyDe Model Targeting GPGPUs

Synthesizing Structured Traversals from Attribute Grammars

Synthesizing Subdivision Meshes Using Real Time Tessellation

Synthetic Aperture Beamformation using the GPU

Synthetic Aperture Radar imaging on a CUDA-enabled mobile platform

Synthetic Aperture Radar Processing with GPGPU

Syntix: A Profiling Based Resource Estimator for CUDA Kernels

System Design Principles for Heterogeneous Resource Management and Scheduling in Accelerator-Based Systems

System integration of FastSPECT III, a dedicated SPECT rodent-brain imager based on BazookaSPECT detector technology

System-Level Optimization and Code Generation for Graphics Processors using a Domain-Specific Language

Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU

Systematic construction, verification and implementation methodology for LDPC codes

Systematic Performance Optimization of Cone-Beam Back-Projection on the Kepler Architecture

Systematic Physics Constrained Parameter Estimation of Stochastic Differential Equations

SystemC simulation on GP-GPUs: CUDA vs. OpenCL

Systolic-CNN: An OpenCL-defined Scalable Run-time-flexible FPGA Accelerator Architecture for Accelerating Convolutional Neural Network Inference in Cloud/Edge Computing

SZx: an Ultra-fast Error-bounded Lossy Compressor for Scientific Datasets

TABLA: A Unified Template-based Framework for Accelerating Statistical Machine Learning

Tabu Search with two approaches to parallel flowshop evaluation on CUDA platform

Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS

Tactics to Directly Map CNN graphs on Embedded FPGAs

Taichi: A Language for High-Performance Computation on Spatially Sparse Data Structures

Takagi Factorization on GPU using CUDA

Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes

Taking the graphics processor beyond graphics

Taming irregular EDA applications on GPUs

Taming the complexities of the C11 and OpenCL memory models

Tamp: A Library for Compact Deep Neural Networks with Structured Matrices

Tangible video teleconference system using real-time image-based relighting

Tango: A Deep Neural Network Benchmark Suite for Various Accelerators

Tangram: a High-level Language for Performance Portable Code Synthesis

TAP: A TLP-Aware Cache Management Policy for a CPU-GPU Heterogeneous Architecture

Tapping the supercomputer under your desk: Solving dynamic equilibrium models with graphics processors

Tapping the supercomputer under your desk: solving dynamic equilibrium models with graphics processors

Target Marker: A Visual Marker for Long Distances and Detection in Realtime on Mobile Devices

targetDP: an Abstraction of Lattice Based Parallelism with Portable Performance

Targeted Testing of Compiler Optimizations via Grammar-Level Composition Styles

Targeting GPUs with OpenMP Directives on Summit: A Simple and Effective Fortran Experience

Targeting heterogeneous architectures via macro data flow

Task and Data Distribution in Hybrid Parallel Systems

Task management for irregular-parallel workloads on the GPU

Task migration of DSP application specified with a DFG and implemented with the BSP computing model on a CPU-GPU cluster

Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

Task Parallelism and Data Distribution: An Overview of Explicit Parallel Programming Languages

Task Parallelism and Synchronization: An Overview of Explicit Parallel Programming Languages

Task parallelism-based architectures on FPGA to optimize the energy efficiency of AI at the edge

Task Partition Comparison between Multi-core System and GPU

Task Performance with List-Mode Data

Task Scheduling for Heterogeneous Multicore Systems

Task scheduling in hybrid CPU-GPU systems

Task Scheduling of Parallel Processing in CPU-GPU Collaborative Environment

Task Superscalar: An Out-of-Order Task Pipeline

Task superscalar: using processors as functional units

Task-based Conjugate-Gradient for multi-GPUs platforms

Task-based FMM for heterogeneous architectures

Task-Based Parallel Strategies for CFD Application in Heterogeneous CPU/GPU Resources

Task-based, GPU-accelerated and Robust Library for Solving Dense Nonsymmetric Eigenvalue Problems

Titles: 100
open PDFs: 95
packages: 34
