high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

MyCaffe: A Complete C# Re-Write of Caffe with Reinforcement Learning

MYRIAD: A new N-body code for simulations of Star Clusters

Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks

Myths and Legends in High-Performance Computing

N-body Simulation for Astronomical Collisional Systems with a New SIMD Instruction Set Extension to the x86 Architecture, Advanced Vector Extensions

N-Body Simulation Using GP-GPU: Evaluating Host/Device Memory Transference Overhead

N-Body Simulations on GPUs

N-Cloth: Predicting 3D Cloth Deformation with Mesh-Based Networks

NaNet: a Low-Latency, Real-Time, Multi-Standard Network Interface Card with GPUDirect Features

NaNet: a low-latency NIC enabling GPU-based, real-time low level trigger systems

NAS Parallel Benchmarks for GPGPUs using a Directive-based Programming Model

Native Offload of Haskell Repa Programs to GPGPU

Natural HPC substrate: Exploitation of mixed multicore CPU and GPUs

NaturalCC: A Toolkit to Naturalize the Source Code Corpus

Navier-Stokes on programmable graphics hardware using SMAC

Navigating An Evolutionary Fast Path to Exascale – Expanded Version

NBODY6++GPU: Ready for the gravitational million-body problem

NBSymple, a double parallel, symplectic N-body code running on Graphic Processing Units

NCAM: Near-Data Processing for Nearest Neighbor Search

NCRF++: An Open-source Neural Sequence Labeling Toolkit

ndzip-gpu: Efficient Lossless Compression of Scientific Floating-Point Data on GPUs

Near Memory Similarity Search on Automata Processors

Near real-time Fast Bilateral Stereo on the GPU

Near-LSPA Performance at MSA Complexity

Near-real-time simulations of bioelectric activity in small mammalian hearts using graphical processing units

Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs

Nemo: A parallelized Lagrangian particle-tracking model

NeMo: A Platform for Neural Modelling of Spiking Neurons Using GPUs

Neneta: Heterogeneous Computing Complex-Valued Neural Network Framework

Nengo: a Python tool for building large-scale functional brain models

NengoDL: Combining deep learning and neuromorphic modelling methods

NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference

Neon: A Domain-Specific Programming Language for Image Processing

neoSYCL: a SYCL implementation for SX-Aurora TSUBASA

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Neptune: An astrophysical smooth particle hydrodynamics code for massively parallel computer architectures

NEPTUNE: Network- and GPU-aware Management of Serverless Functions at the Edge

Nested Data-Parallelism on the GPU

Nested Intervals Tree Encoding with System of Residual Classes

Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations

NetKet 3: Machine Learning Toolbox for Many-Body Quantum Systems

Network Simulator Tools and GPU Parallel Systems

Network-on-Chip Hardware Accelerators for Biological Sequence Alignment

Neural Architecture Search for Lightweight Non-Local Networks

Neural Architecture Search without Training

Neural Code Comprehension: A Learnable Representation of Code Semantics

Neural Decoding using a Parallel Sequential Monte Carlo method on Point Processes with Ensemble Effect

Neural GPUs Learn Algorithms

Neural Multi-scale Image Compression

Neural Network Computing Using On-Chip Accelerators

Neural Network Implementation Using CUDA and OpenMP

Neural Network Inference on Mobile SoCs

Neural Network Libraries: A Deep Learning Framework Designed from Engineers’ Perspectives

Neural network modeling on evolution of hydration reaction for Portland cement

Neural Network Simulation: The recognition application

Neural Networks for Beginners. A fast implementation in Matlab, Torch, TensorFlow

Neural Networks through Shared Maps in Mobile Devices

Neural Query Language: A Knowledge Base Query Language for Tensorflow

Neural scene representation and rendering

Neurokernel: An Open Scalable Software Framework for Emulation and Validation of Drosophila Brain Models on Multiple GPUs

Neurokernel: An Open Source Platform for Emulating the Fruit Fly Brain

Neuromorphic models on a GPGPU cluster

Neville elimination on multi- and many-core systems: OpenMP, MPI and CUDA

New Basic Linear Algebra Methods for Simulation on GPUs

New efficient integral algorithms for quantum chemistry

New Efficient Method To Solve Longest Overlap Region Problem For Noncoding DNA Sequence

New High Performance GPGPU Code Transformation Framework Applied to Large Production Weather Prediction Code

New Row-grouped CSR format for storing the sparse matrices on GPU with implementation in CUDA

New Sparse Matrix Storage Format to Improve The Performance of Total SPMV Time

New Techniques for Spectral Image Acquisition and Analysis

Next-generation acceleration and code optimization for light transport in turbid media using GPUs

nGFSIM: A GPU-based fault simulator for 1-to-n detection and its applications

Nikola: embedding compiled GPU functions in Haskell

NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on mapreduce

Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning

NLSEmagic: Nonlinear Schrödinger Equation Multidimensional Matlab-based GPU-accelerated Integrators using Compact High-order Schemes

NMF-mGPU: non-negative matrix factorization on multi-GPU systems

nmfgpu4R: GPU-Accelerated Computation of the Non-Negative Matrix Factorization (NMF) Using CUDA Capable Hardware

NNP/MM: Fast molecular dynamics simulations with machine learning potentials and molecular mechanics

NNS: The Case For Neural Network-based Sorting

No More Shading Languages: Compiling C++ to Vulkan Shaders

Nodal Discontinuous Galerkin Methods on Graphics Processors

Noise Removal from Remote Sensed Images by NonLocal Means with OpenCL Algorithm

Noise-resistant fitting for spherical harmonics

Non-blocking programming on multi-core graphics processors: (extended abstract)

Non-Determinism in TensorFlow ResNets

Non-deterministic parallelism considered useful

Non-Hydrostatic Pressure Shallow Flows: GPU Implementation Using Finite-Volume and Finite-Difference Scheme

Non-intrusive Performance Analysis of Parallel Hardware Accelerated Applications on Hybrid Architectures

Non-local means denoising algorithm accelerated by GPU

Non-Local Total Generalized Variation for Optical Flow Estimation

Non-Parametric Adaptive Network Pruning

Non-recursive beam search on GPU for formal concept analysis

Non-rigid multi-modal registration on the GPU

Non-separable 2D, 3D and 4D filtering with CUDA

Non-steady relaxation and critical exponents at the depinning transition

Non-symmetric magnetohydrostatic equilibria: a multigrid approach

Non-Uniform Domain Decomposition for Heterogeneous Accelerated Processing Units

Non-Uniformly Partitioned Block Convolution on Graphics Processing Units

Nondissipative Marbling

Brief statistics for this page

Titles: 100

Download open PDFs: 91

Package packages: 37

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs


