high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Low viscosity flow simulations for animation

Low-complexity Distributed Tomographic Backprojection for large datasets

Low-cost edge computing using upcycled smartphones

Low-cost, high-speed computer vision using NVIDIA’s CUDA architecture

Low-Frequency MLFMA on Graphics Processors

Low-Impact Profiling of Streaming, Heterogeneous Applications

Low-Latency Elliptic Curve Scalar Multiplication

Low-latency Image Recognition with GPU-accelerated Convolutional Networks for Web-based Services

Low-overhead diskless checkpoint for hybrid computing systems

Low-Overhead Trace Collection and Profiling on GPU Compute Kernels

Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II

Low-power Task Scheduling for GPU Energy Reduction

Lowering IrGL to CUDA

LS-CAT: A Large-Scale CUDA AutoTuning Dataset

LTE Physical Layer Implementation Using GPU Based High Performance Computing

LTTng CLUST: A system-wide unified CPU and GPU tracing tool for OpenCL applications

LU Factorization for Accelerator-based Systems

LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System

LU Factorization with Partial Pivoting for a Multicore System with Accelerators

LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware

LU, QR, and Cholesky factorizations: Programming Model, Performance Analysis and Optimization Techniques for the Intel Knights Landing Xeon Phi

LUDA: Boost LSM Key Value Store Compactions with GPUs

Luthier: Bridging Auto-Tuning and Vendor Libraries for Efficient Deep Learning Inference

Lynx: A Dynamic Instrumentation System for Data-Parallel Applications on GPGPU Architectures

Lyra2: Password Hashing Scheme with improved security against time-memory trade-offs

MACC: An OpenACC Transpiler for Automatic Multi-GPU Use

Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey

Machine Learning at the Limit

Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability

Machine Learning Based Intrusion Detection in Controller Area Networks

Machine learning enhanced code optimization for high-level synthesis (ML-ECOHS)

Machine Learning for CUDA+MPI Design Rules

Machine Learning for Predictive Auto-Tuning with Boosted Regression Trees

Machine learning for ultrafast X-ray diffraction patterns on large-scale GPU clusters

Machine Learning from Streaming Data in Heterogeneous Computing Environments

Machine Learning in Compilers: Past, Present and Future

Machine Learning-Driven Adaptive OpenMP For Portable Performance on Heterogeneous Systems

Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection

Machines and Algorithms

MacroSS: macro-SIMDization of streaming applications

Maestro: Data Orchestration and Tuning for OpenCL Devices

MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs

MAGMA Embedded: Towards a Dense Linear Algebra Library for Energy Efficient Extreme Computing

MagmaDNN: Towards High-Performance Data Analytics and Machine Learning for Data-Driven Scientific Computing

Magneto-hydrodynamics simulation in astrophysics

Magnetohydrodynamics on Heterogeneous architectures: a performance comparison

Magnetohydrodynamics simulations on graphics processing units

Maintaining constant frame rates in 3D texture-based volume rendering

Makespan computation for GPU threads running on a single streaming multiprocessor

Making Human Connectome Faster: GPU Acceleration of Brain Network Analysis

Making LLMs Optimize Multi-Scenario CUDA Kernels Like Experts

Making the case of GPUs in courses on computational physics

MALBEC: a new CUDA-C ray-tracer in General Relativity

MambaCPU: Enhanced Correlation Mining with State Space Models for CPU Performance Prediction

Managing coherent groups

Managing Extreme Heterogeneity in Next Generation HPC Systems

Managing heterogeneous device memory using C++17 memory resources

Managing Multi Instance GPUs for High Throughput and Energy Savings

Managing the Topology of Heterogeneous Cluster Nodes with Hardware Locality (hwloc)

Managing, Profiling, and Optimizing Heterogeneous GPU Workloads

Manas: Mining Software Repositories to Assist AutoML

ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills

Many Cores, Many Models: GPU Programming Model vs. Vendor Compatibility Overview

Many-body quantum chemistry on graphics processing units

Many-Core Algorithms for Combinatorial Optimization

Many-core algorithms for statistical phylogenetics

Many-core applications to online track reconstruction in HEP experiments

Many-Core Architectures: Hardware-Software Optimization and Modeling Techniques

Many-Core Compiler Fuzzing

Many-core GPU computing with NVIDIA CUDA

Many-core parallel computing – Can compilers and tools do the heavy lifting?

Many-Core vs. Many-Thread Machines: Stay Away From the Valley

Many-Thread Aware Prefetching Mechanisms for GPGPU Applications

Many-threaded Differential Evolution on the GPU

Many-threaded implementation of differential evolution for the CUDA platform

Manycore high-performance computing in bioinformatics

Manycore processing of repeated k-NN queries over massive moving objects observations

Manycore processing of repeated range queries over massive moving objects observations

MAP-based Brain Tissue Segmentation using Manifold Learning and Hierarchical Max-Flow regularization

Map-reduce as a Programming Model for Custom Computing Machines

MapCG: writing parallel program portable between CPU and GPU

MapGraph: A High Level API for Fast Development of High Performance Graph Analytics on GPUs

Mapping a Data-Flow Programming Model onto Heterogeneous Platforms

Mapping a Dataflow Programming Model onto Heterogeneous Architectures

Mapping a Guided Image Filter on the HARP Reconfigurable Architecture Using OpenCL

Mapping computational concepts to GPUs

Mapping dynamic programming algorithms on graphics processing units

Mapping High-Fidelity Volume Rendering for Medical Imaging to CPU, GPU and Many-Core Architectures

Mapping Iterative Medical Imaging Algorithm on Cell Accelerator

Mapping of a film grain removal algorithm to a heterogeneous reconfigurable architecture

Mapping parallel programs to heterogeneous multi-core systems

Mapping Streaming Applications to OpenCL

Mapping the Arnold web with a GPU-supercomputer

Mapping the Arnold web with a graphic processing unit

Mapping the SBR and TW-ILDCs to Heterogeneous CPU-GPU Architecture for Fast Computation of Electromagnetic Scattering

MapReduce for Counting Word Frequencies with MPI and GPUs

MapSQ: A MapReduce-based Framework for SPARQL Queries on GPU

MARC: A Many-Core Approach to Reconfigurable Computing

March of the Froblins: simulation and rendering massive crowds of intelligent and detailed creatures on GPU

Marian: Cost-effective High-Quality Neural Machine Translation in C++

Brief statistics for this page

Titles: 100

Download open PDFs: 98

Package packages: 23

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

* * *

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems
