high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Optimizing Deep Learning Models For Raspberry Pi

Optimizing exact computation of Betweenness Centrality for CUDA

Optimizing for a Many-Core Architecture without Compromising Ease-of-Programming

Optimizing Full Correlation Matrix Analysis of fMRI Data on Intel Xeon Phi Coprocessors

Optimizing GPU to GPU Communication on Cray XK7

Optimizing GPU Volume Rendering

Optimizing GPU-accelerated Group-By and Aggregation

Optimizing High-Performance Linpack for Exascale Accelerated Architectures

Optimizing Huffman Decoding for Error-Bounded Lossy Compression on GPUs

Optimizing Krylov Subspace Solvers on Graphics Processing Units

Optimizing Lempel-Ziv Factorization for the GPU Architecture

Optimizing Linpack Benchmark on GPU-Accelerated Petascale Supercomputer

Optimizing LZSS Compression on GPGPUs

Optimizing MapReduce for GPUs with effective shared memory usage

Optimizing Memory Efficiency for Convolution Kernels on Kepler GPUs

Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs

Optimizing memory management on heterogeneous systems using polyhedral, compile-time techniques

Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators

Optimizing Monte Carlo radiosity on graphics hardware

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

Optimizing OpenCL Kernels for Iterative Statistical Applications on GPUs

Optimizing OpenCL Local Work Group Size With Machine Learning

Optimizing Performance and Energy Efficiency in Massively Parallel Systems

Optimizing Performance of Recurrent Neural Networks on GPUs

Optimizing Performance of Stencil Code with SPL Conqueror

Optimizing performance per watt on GPUs in High Performance Computing: temperature, frequency and voltage effects

Optimizing RDF stores by coupling General-purpose Graphics Processing Units and Central Processing Units

Optimizing Real Time GPU Kernels Using Fuzzy Inference System

Optimizing Similarity Computations for Ontology Matching – Experiences from GOMMA

Optimizing simulated annealing on GPU: A case study with IC floorplanning

Optimizing Smith-Waterman algorithm on Graphics Processing Unit

Optimizing Sparse Matrix-Matrix Multiplication for the GPU

Optimizing Sparse Matrix-Vector Multiplication on Emerging Many-Core Architectures

Optimizing Stencil Computations for NVIDIA Kepler GPUs

Optimizing Strassen matrix multiply on GPUs

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures

Optimizing Sweep3D for Graphic Processor Unit

Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs

Optimizing the Computation of Eigenvalues Using Graphics Processing Units

Optimizing the exploitation of multicore processors and GPUs with OpenMP and OpenCL

Optimizing the Linear Fascicle Evaluation Algorithm for Multi-Core and Many-Core Systems

Optimizing the MapReduce Framework on Intel Xeon Phi Coprocessor

Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units

Optimizing the Performance of Parallel and Concurrent Applications Based on Asynchronous Many-Task Runtimes

Optimizing the SUSAN corner detection algorithm for a high speed FPGA implementation

Optimizing Urban Environmental Simulations using BOINC

Optimizing Web Virtual Reality

Optimizing Xeon Phi for Interactive Data Analysis

OptiML: An implicitly parallel domain-specific language for machine learning

Optimum Application Deployment Technology for Heterogeneous IaaS Cloud

Option Pricing on the GPU

Option pricing with COS method on graphics processing units

Option pricing with multi-dimensional quadrature architectures

OptiX: a general purpose ray tracing engine

Orca: FSS-based Secure Training with GPUs

Orchestrated Scheduling and Prefetching for GPGPUs

Orchestrating Multiple Data-Parallel Kernels on Multiple Devices

Orchestrating Thread Scheduling and Cache Management to Improve Memory System Throughput in Throughput Processors

Orchestration by approximation: mapping stream programs onto multicore architectures

Orders-of-magnitude performance increases in GPU-accelerated correlation of images from the International Space Station

Origami: A Convolutional Network Accelerator

Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications

Orthogonalization on a General Purpose Graphics Processing Unit with Double Double and Quad Double Arithmetic

Orthogonalization on a general purpose graphics processing unit with double double and quad double arithmetic

Orthorectification by Using GPGPU Method

Out of kernel tuning and optimizations for portable large-scale docking experiments on GPUs

Out-of-core cone beam reconstruction using multiple GPUs

Out-of-core Implementation for Accelerator Kernels on Heterogeneous Clouds

Out-of-core singular value decomposition

Out-of-core Training for Extremely Large-Scale Neural Networks With Adaptive Window-Based Scheduling

Out-of-the-box library support for DBMS operations on GPUs

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Overcoming the GPU memory limitation on FDTD through the use of overlapping subgrids

Overcomplete Dictionary Learning with Jacobi Atom Updates

Overdetermined Shooting Methods for Computing Standing Water Waves with Spectral Accuracy

Overhauling SC atomics in C11 and OpenCL

Overlap fermions on GPUs

Overlapping Computation and Communication for Advection on Hybrid Parallel Computers

Overlapping computation and communication of three-dimensional FDTD on a GPU cluster

Overtaking CPU DBMSes with a GPU in Whole-Query Analytic Processing with Parallelism-Friendly Execution Plan Optimization

Overview of approaches for accelerating scale invariant feature detection algorithm

Overview of implementation of DARPA GPU program in SAIC

OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance

P-HGRMS: A Parallel Hypergraph Based Root Mean Square Algorithm for Image Denoising

PacketShader: a GPU-accelerated software router

Padding Free Bank Conflict Resolution for CUDA-Based Matrix Transpose Algorithm

Pairwise Sequence Alignment for Very Long Sequences on GPUs

Pairwise Sequence Alignment with Gaps with GPU

PAKCK: Performance and Power Analysis of Key Computational Kernels on CPUs and GPUs

Panda: A Compiler Framework for Concurrent CPU-GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers

PANDA: Extreme Scale Parallel K-Nearest Neighbor on Distributed Architectures

Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor

Pangolin: An Efficient and Flexible Graph Mining System on CPU and GPU

PanJoin: A Partition-based Adaptive Stream Join

PANNA: Properties from Artificial Neural Network Architectures

Pannotia: Understanding Irregular GPGPU Graph Applications

PantaRay: fast ray-traced occlusion caching of massive scenes

PAPER – Accelerating parallel evaluations of ROCS

ParadisEO-MO-GPU: a Framework for Parallel GPU-based Local Search Metaheuristics

Paragon: Collaborative Speculative Loop Execution on GPU and CPU

Brief statistics for this page

Titles: 100

Download open PDFs: 89

Packages: 21

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers

OpenMC Monte Carlo Code

Performance Portable Monte Carlo Particle Transport on Intel, NVIDIA, and AMD GPUs

Polygeist: C/C++ frontend for MLIR

Retargeting and Respecializing GPU Workloads for Performance Portability

Parallel Gaussian process with kernel approximation in CUDA

Optical flow algorithms for SYCL

SYCL in the edge: performance and energy evaluation for heterogeneous acceleration

OpenMP5-Offload-OpenMC-Intel-PVC

Distributed OpenMP Offloading of OpenMC on Intel GPU MAX Accelerators

Recent source codes

QArray

Celerity: High-level C++ for Accelerator Clusters

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Optical flow algorithms for SYCL

OpenMP5-Offload-OpenMC-Intel-PVC

Most viewed papers (last 30 days)