Papers on hgpu.org (.txt-file)
WarpCore: A Library for fast Hash Tables on GPUs

WarpDrive: Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU

Warped Register File: A Power Efficient Register File for GPGPUs

Warps and Atomics: Beyond Barrier Synchronization in the Verification of GPU Kernels

Wasserstein-Fisher-Rao Document Distance

Waste Not, Want Not! Managing relational data in asymmetric memories

Waste Not… Efficient Co-Processing of Relational Data

Water simulation based on HLSL

Water simulation for cell based sandbox games

Water Surface Animation using Damped Wave Equation and CUDA Acceleration

wav2letter++: The Fastest Open-source Speech Recognition System

Wave field synthesis for 3D audio: architectural prospectives

Wavefront raycasting using larger filter kernels for on-the-fly GPU gradient reconstruction

Wavelet Encoding and Multi-GPU Programming

Wavelet Model-based Stereo for Fast, Robust Face Reconstruction

WAYPOINT: scaling coherence to thousand-core architectures

WCCV: Improving the Vectorization of IF-statements with Warp-Coherent Conditions

Weak execution ordering – exploiting iterative methods on many-core GPUs

WebCL for Hardware-Accelerated Web Applications

Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems

Weighted Residuals for Very Deep Networks

WgPy: GPU-accelerated NumPy-like array library for web browsers

What you see is what you snap: snapping to geometry deformed on the GPU

When HLS Meets FPGA HBM: Benchmarking and Bandwidth Optimization

When Machine Learning Meets Quantum Computers: A Case Study

Where is the data? Why you cannot debate CPU vs. GPU performance without the answer

Whippletree: Task-based Scheduling of Dynamic Workloads on the GPU

Why does PHM matter? – Nvidia’s GPU problems reviewed

Why is FPGA-GPU Heterogeneity the Best Option for Embedded Deep Neural Networks?

Why it is time for a HyPE: A Hybrid Query Processing Engine for Efficient GPU Coprocessing in DBMS

Wideband Channelization for Software-Defined Radio via Mobile Graphics Processors

WiLLM: An Open Wireless LLM Communication System

Wilson and Domainwall Kernels on Oakforest-PACS

Winograd Algorithm for AdderNet

Wire Speed Name Lookup: A GPU-based Approach

Wireless Interference Identification with Convolutional Neural Networks

word2ket: Space-efficient Word Embeddings inspired by Quantum Entanglement

Work Efficient Parallel Algorithms for Large Graph Exploration

Work in Progress: Vortex Detection and Visualization for Design of Micro Air Vehicles and Turbomachinery

Work-Efficient Parallel GPU Methods for Single-Source Shortest Paths

Working With Incremental Spatial Data During Parallel (GPU) Computation

Workload Analysis and Efficient OpenCL-based Implementation of SIFT Algorithm on a Smartphone

Workload and network-optimized computing systems

Workload Aware Algorithms for Heterogeneous Platforms

Workload Balancing on Heterogeneous Systems: A Case Study of Sparse Grid Interpolation

Workload Characterization of 3D Games

Workload distribution and balancing in FPGAs and CPUs with OpenCL and TBB

Workload Scheduling on Heterogeneous Devices

Workload-aware Automatic Parallelization for Multi-GPU DNN Training

Worst-Case Execution Time Guarantees for Runtime-Reconfigurable Architectures

WPA/WPA2 Password Security Testing using Graphics Processing Units

Wrinkling Coarse Meshes on the GPU

Writing a modular GPGPU program in Java

Writing a performance-portable matrix multiplication

Writing self-adaptive codes for heterogeneous systems

X-Device Query Processing by Bitwise Distribution

X-toon: an extended toon shader

XBOOLE-CUDA: Fast Boolean Operations on the GPU

Xbox360 Front Side Bus – A 21.6 GB/s End-to-End Interface Design

Xeon Phi: A comparison between the newly introduced MIC architecture and a standard CPU through three types of problems

XeonPhi Meets Astrophysical Fluid Dynamics

XGBoost: Scalable GPU Accelerated Learning

XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures

XMalloc: A Scalable Lock-free Dynamic Memory Allocator for Many-core Machines

XML3D: interactive 3D graphics for the web

XMT-GPU: A PRAM Architecture for Graphics Computation

XSD: Accelerating MapReduce by Harnessing the GPU inside an SSD

YaDiV-an open platform for 3D visualization and 3D segmentation of medical data

YodaNN: An Ultra-Low Power Convolutional Neural Network Accelerator Based on Binary Weights

You Can Type, but You Can’t Hide: A Stealthy GPU-based Keylogger

Ypnos: declarative, parallel structured grid programming

ytopt: Autotuning Scientific Applications for Energy Efficiency at Large Scales

ZAME: Interactive Large-Scale Graph Visualization

Zero-copy I/O processing for low-latency GPU computing

Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training

Zippy: A Framework for Computation and Visualization on a GPU Cluster

ZNN – A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-Core and Many-Core Shared Memory Machines

Zorua: Enhancing Programming Ease, Portability, and Performance in GPUs by Decoupling Programming Models from Resource Management

ZUCL: A ZYNQ UltraScale+ Framework for OpenCL HLS Applications

Titles: 84
open PDFs: 80
packages: 22
