Papers on hgpu.org (.txt-file)
Locality optimization on a NUMA architecture for hybrid LU factorization

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives

Locality-Aware Mapping of Nested Parallel Patterns on GPUs

Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model

Locality-Aware Work Stealing on Multi-CPU and Multi-GPU Architectures

LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs

Locally-Oriented Programming: A Simple Programming Model for Stencil-Based Computations on Multi-Level Distributed Memory Architectures

Location-based Matching in Publish/Subscribe Revisited

LOD Terrain Rendering by Local Parallel Processing on GPU

Log File Regular Expression Pattern Matching And Capture With GPUs

LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment

LoGV: Low-overhead GPGPU Virtualization

Long time-scale simulations of in vivo diffusion using GPU hardware

Long Timestep Molecular Dynamics on the Graphical Processing Unit

Long-time Simulations with Complex Code Using Multiple Nodes of Intel Xeon Phi Knights Landing

Loo.py: From Fortran to performance via transformation and substitution rules

Loo.py: transformation-based code generation for GPUs and CPUs

Looking at the surprise: Bottom-up attentional control of an active camera system

LookNN: Neural Network with No Multiplication

Loop Transformation Recipes for Code Generation and Auto-Tuning

LoopBench: An Evaluation of Loop Acceleration in Heterogeneous Systems

LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers

Loose capacity-constrained representatives for the qualitative visual analysis in molecular dynamics

Lossless Acceleration for Seq2seq Generation with Aggressive Decoding

Lossless Compression of Variable-Precision Floating-Point Buffers on GPUs

Lossless data compression on GPGPU architectures

Lossless LZW Data Compression Algorithm on CUDA

Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level

Lost in Translation: Challenges in Automating CUDA-to-OpenCL Translation

Low Complexity Corner Detector Using CUDA for Multimedia Applications

Low cost approach to real-time vehicle to vehicle communication using parallel CPU and GPU processing

Low Latency Complex Event Processing on Parallel Hardware

Low latency photon mapping using block hashing

Low viscosity flow simulations for animation

Low-complexity Distributed Tomographic Backprojection for large datasets

Low-cost edge computing using upcycled smartphones

Low-cost, high-speed computer vision using NVIDIA’s CUDA architecture

Low-Frequency MLFMA on Graphics Processors

Low-Impact Profiling of Streaming, Heterogeneous Applications

Low-Latency Elliptic Curve Scalar Multiplication

Low-latency Image Recognition with GPU-accelerated Convolutional Networks for Web-based Services

Low-overhead diskless checkpoint for hybrid computing systems

Low-Overhead Trace Collection and Profiling on GPU Compute Kernels

Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II

Low-power Task Scheduling for GPU Energy Reduction

LS-CAT: A Large-Scale CUDA AutoTuning Dataset

LTE Physical Layer Implementation Using GPU Based High Performance Computing

LTTng CLUST: A system-wide unified CPU and GPU tracing tool for OpenCL applications

LU Factorization for Accelerator-based Systems

LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System

LU Factorization with Partial Pivoting for a Multicore System with Accelerators

LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware

LU, QR, and Cholesky factorizations: Programming Model, Performance Analysis and Optimization Techniques for the Intel Knights Landing Xeon Phi

LUDA: Boost LSM Key Value Store Compactions with GPUs

Luthier: Bridging Auto-Tuning and Vendor Libraries for Efficient Deep Learning Inference

Lynx: A Dynamic Instrumentation System for Data-Parallel Applications on GPGPU Architectures

Lyra2: Password Hashing Scheme with improved security against time-memory trade-offs

MACC: An OpenACC Transpiler for Automatic Multi-GPU Use

Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey

Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability

Machine Learning Based Intrusion Detection in Controller Area Networks

Machine learning enhanced code optimization for high-level synthesis (ML-ECOHS)

Machine Learning for CUDA+MPI Design Rules

Machine Learning for Predictive Auto-Tuning with Boosted Regression Trees

Machine learning for ultrafast X-ray diffraction patterns on large-scale GPU clusters

Machine Learning from Streaming Data in Heterogeneous Computing Environments

Machine Learning in Compilers: Past, Present and Future

Machine Learning-Driven Adaptive OpenMP For Portable Performance on Heterogeneous Systems

Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection

MacroSS: macro-SIMDization of streaming applications

Maestro: Data Orchestration and Tuning for OpenCL Devices

MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs

MAGMA Embedded: Towards a Dense Linear Algebra Library for Energy Efficient Extreme Computing

MagmaDNN: Towards High-Performance Data Analytics and Machine Learning for Data-Driven Scientific Computing

Magneto-hydrodynamics simulation in astrophysics

Magnetohydrodynamics on Heterogeneous architectures: a performance comparison

Magnetohydrodynamics simulations on graphics processing units

Maintaining constant frame rates in 3D texture-based volume rendering

Makespan computation for GPU threads running on a single streaming multiprocessor

Making Human Connectome Faster: GPU Acceleration of Brain Network Analysis

Making the case of GPUs in courses on computational physics

MALBEC: a new CUDA-C ray-tracer in General Relativity

MambaCPU: Enhanced Correlation Mining with State Space Models for CPU Performance Prediction

Managing Extreme Heterogeneity in Next Generation HPC Systems

Managing heterogeneous device memory using C++17 memory resources

Managing Multi Instance GPUs for High Throughput and Energy Savings

Managing the Topology of Heterogeneous Cluster Nodes with Hardware Locality (hwloc)

Managing, Profiling, and Optimizing Heterogeneous GPU Workloads

Manas: Mining Software Repositories to Assist AutoML

ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills

Many Cores, Many Models: GPU Programming Model vs. Vendor Compatibility Overview

Many-body quantum chemistry on graphics processing units

Many-Core Algorithms for Combinatorial Optimization

Titles: 100
Open PDFs: 99
Packages: 27
