high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

__host__ __device__ — Generic programming in Cuda

“Local Rank Differences” Image Feature Implemented on GPU

[Serbian] The Methods and Procedures for Accelerating Operations and Queries in Large Database Systems and Data Warehouse (Big Data Systems)

10×10: A General-purpose Architectural Approach to Heterogeneity and Energy Efficiency

190 TFlops Astrophysical N-body Simulation on a Cluster of GPUs

2-D Impulse Noise Suppression by Recursive Gaussian Maximum Likelihood Estimation

24.77 Pflops on a Gravitational Tree-Code to Simulate the Milky Way Galaxy with 18600 GPUs

2D and 3D level-set algorithms on GPU

2D Image Convolution using Three Parallel Programming Models on the Xeon Phi

2D Triangulation of Polygons on CUDA

2D/3D image registration on the GPU

2HOT: An Improved Parallel Hashed Oct-Tree N-Body Algorithm for Cosmological Simulation

2PARMA: Parallel Paradigms and Run-time Management Techniques for Many-Core Architectures

3-SAT on CUDA: Towards a massively parallel SAT solver

3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs

3D data denoising via Non-Local means filter by using parallel GPU strategies

3D Edge Bundling for Geographical Data Visualization

3D FFT on a Single FPGA

3D finite difference computation on GPUs using CUDA

3D finite element numerical integration on GPUs

3D GPU Architecture using Cache Stacking: Performance, Cost, Power and Thermal analysis

3D Haar-Like Elliptical Features for Object Classification in Microscopy

3D Hydrodynamic Simulation of Classical Nova Explosions

3D Information Extraction Based on GPU

3D Modeling, Distance and Gradient Computation for Motion Planning: A Direct GPGPU Approach

3D Non-Local Means denoising via multi-GPU

3D nonrigid registration via optimal mass transport on the GPU

3D Object Recognition using Convolutional Neural Networks with Transfer Learning between Input Channels

3D Object Recognition with Convolutional Neural Networks

3D Objects Tracking by GPGPU-Enhanced Particle Filter Algorithms

3D Recursive Gaussian IIR on GPU and FPGAs: A Case Study for Accelerating Bandwidth-Bounded Applications

3D Registration Based on Normalized Mutual Information: Performance of CPU vs. GPU Implementation

3D simulation of complex shading affecting PV systems taking benefit from the power of graphics cards developed for the video game industry

3D Skeleton Extraction Method using Potential Field on OpenCL

3D tumor localization through real-time volumetric x-ray imaging for lung cancer radiotherapy

3D vision of electromagnetic fields in antenna and microwave technique

3D visualization of astronomy data cubes using immersive displays

3D-color video camera

3DES ECB Optimized for Massively Parallel CUDA GPU Architecture

3I: A tool for visualizing and processing in parallel 2D & 3D images

42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence

4kUHD H264 wireless live video streaming using CUDA

5.6: GPU enhancement of FDTD-PIC plasma-wave simulations

8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other Tricks

86 PFLOPS Deep Potential Molecular Dynamics simulation of 100 million atoms with ab initio accuracy

94% on CIFAR-10 in 3.29 Seconds on a Single GPU

A (ir)regularity-aware task scheduler for heterogeneous platforms

A (Somewhat Dated) Comparative Study of Betweenness Centrality Algorithms on GPU

A 3D Convex Hull Algorithm for Graphics Hardware

A 3D radiative transfer framework: XIII. OpenCL implementation

A 3D radiative transfer framework. VIII. OpenCL implementation

A 57mW embedded mixed-mode neuro-fuzzy accelerator for intelligent multi-core processor

A 7.663-TOPS 8.2-W Energy-efficient FPGA Accelerator for Binary Convolutional Neural Networks

A balanced programming model for emerging heterogeneous multicore systems

A Batched GPU Algorithm for Set Intersection

A Benchmark Set of Highly-efficient CUDA and OpenCL Kernels and its Dynamic Autotuning with Kernel Tuning Toolkit

A Bi-objective Optimization Framework for Query Plans

A biomolecular electrostatics solver using Python, GPUs and boundary elements that can handle solvent-filled cavities and Stern layers

A block-asynchronous relaxation method for graphics processing units

A Braille Conversion Service Using GPU and Human Interaction by Computer Vision

A breadth-first course in multicore and manycore programming

A capabilities-aware framework for using computational accelerators in data-intensive computing

A Case Against Small Data Types on GPGPUs

A case for neuromorphic ISAs

A Case for Work-stealing on FPGAs with OpenCL Atomics

A Case Study for Petascale Applications in Astrophysics: Simulating Gamma-Ray Bursts

A Case Study in Using OpenCL on FPGAs: Creating an Open-Source Accelerator of the AutoDock Molecular Docking Software

A Case Study of OpenCL on an Android Mobile GPU

A Case Study of SWIM: Optimization of Memory Intensive Application on GPGPU

A case study on porting scientific applications to GPU/CUDA

A Case Study: Exploiting Neural Machine Translation to Translate CUDA to OpenCL

A CG-based Poisson solver on a GPU-cluster

A characterization and analysis of PTX kernels

A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads

A Chunking Method for Euclidean Distance Matrix Calculation on Large Dataset Using Multi-GPU

A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines

A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures

A closer look at GPUs

A Cloud Computing Service Architecture of a Parallel Algorithm Oriented to Scientific Computing with CUDA and Monte Carlo

A cluster for CS education in the manycore era

A Co-Design Framework with OpenCL Support for Low-Energy Wide SIMD Processor

A Co-Prime Blur Scheme for Data Security in Video Surveillance

A Coarse Grain Reconfigurable Architecture for sequence alignment problems in bio-informatics

A code motion technique for accelerating general-purpose computation on the GPU

A Code Optimization Framework for Performance Portability of GPU Kernels onto Custom Accelerators

A Code Transformation Framework for Scientific Applications on Structured Grids

A code-based analytical approach for using separate device coprocessors in computing systems

A Collective Knowledge workflow for collaborative research into multi-objective autotuning and machine learning techniques

A collision detection algorithm using adaptive particle sensor

A combined MPI-CUDA parallel solution of linear and nonlinear Poisson-Boltzmann equation

A Common GPU n-Dimensional Array for Python and C

A Comparative Analysis of GPU Implementations of Spectral Unmixing Algorithms

A comparative analysis of the performance and deployment overhead of parallelized Finite Difference Time Domain (FDTD) algorithms on a selection of high performance multiprocessor computing systems

A comparative benchmarking of the FFT on Fermi and Evergreen GPUs

A Comparative Measurement Study of Deep Learning as a Service Framework

A Comparative Study of 2D Numerical Methods with GPU Computing

A Comparative Study of Asynchronous Many-Tasking Runtimes: Cilk, Charm++, ParalleX and AM++

A Comparative Study of Game Tree Searching Methods

A comparative study of GPU programming models and architectures using neural networks

Brief statistics for this page

Titles: 100

Download open PDFs: 88

Packages: 9

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

KISim: Kubernetes Intelligent Scheduling Simulator

KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling

Efficient GPU Implementation of Multi-Precision Integer Division

exa-AMD: Exascale Accelerated Materials Discovery

Accelerated discovery and design of Fe-Co-Zr magnets with tunable magnetic anisotropy through machine learning and parallel computing

ParEval: A Parallel Code Evaluation Benchmark

ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

No More Shading Languages: Compiling C++ to Vulkan Shaders

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Engineering Supercomputing Platforms for Biomolecular Applications
