Papers on hgpu.org (.txt-file)
Literature Review: Parallel Computing on linear equations of linear elastic FEM simulation with CUDA

LithOS: An Operating System for Efficient Machine Learning on GPUs

Live Migration for OpenCL FPGA Accelerators

Live Migration of FPGA Applications

Living Flows: Enhanced Exploration of Edge-Bundled Graphs Based on GPU-Intensive Edge Rendering

LLload: An Easy-to-Use HPC Utilization Tool

LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

LLMPerf: GPU Performance Modeling meets Large Language Models

LLOR: Automated Repair of OpenMP Programs

LLVM-based automation of memory decoupling for OpenCL applications on FPGAs

LN-Annote: An Alternative Approach to Information Extraction from Emails using Locally-Customized Named-Entity Recognition

LNA: Fast Protein Classification Using A Laplacian Characterization of Tertiary Structure

LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs

Load Balanced Parallel GPU Out-of-Core for Continuous LOD Model Visualization

Load Balancing for Constraint Solving with GPUs

Load Balancing in a Changing World: Dealing with Heterogeneity and Performance Variability

Load Balancing in Data Warehouse – Evolution and Perspectives

Load Balancing Utilizing Data Redundancy in Distributed Volume Rendering

Load-Balanced Multi-GPU Ambient Occlusion for Direct Volume Rendering

Local Alignment Tool Based on Hadoop Framework and GPU Architecture

Local Histogram Modification Based Contrast Enhancement with GPU Acceleration

Local Laplacian Filters: Edge-aware Image Processing with a Laplacian Pyramid

Local Search Algorithms on Graphics Processing Units. A Case Study: The Permutation Perceptron Problem

Local Volatility FX Basket Option on CPU and GPU

Local vs. Global Optimization: Operator Placement Strategies in Heterogeneous Environments

Locality Analysis for Characterizing Applications Based on Sparse Matrices

Locality and parallelism optimization for dynamic programming algorithm in bioinformatics

Locality Aware Work-Stealing Based Scheduling in Hybrid CPU-GPU

Locality optimization on a NUMA architecture for hybrid LU factorization

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives

Locality-Aware Mapping of Nested Parallel Patterns on GPUs

Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model

Locality-Aware Work Stealing on Multi-CPU and Multi-GPU Architectures

LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs

Locally-Oriented Programming: A Simple Programming Model for Stencil-Based Computations on Multi-Level Distributed Memory Architectures

Location-based Matching in Publish/Subscribe Revisited

LOD Terrain Rendering by Local Parallel Processing on GPU

Log File Regular Expression Pattern Matching And Capture With GPUs

LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment

LoGV: Low-overhead GPGPU Virtualization

Long time-scale simulations of in vivo diffusion using GPU hardware

Long Timestep Molecular Dynamics on the Graphical Processing Unit

Long-time Simulations with Complex Code Using Multiple Nodes of Intel Xeon Phi Knights Landing

Loo.py: From Fortran to performance via transformation and substitution rules

Loo.py: transformation-based code generation for GPUs and CPUs

Looking at the surprise: Bottom-up attentional control of an active camera system

LookNN: Neural Network with No Multiplication

Loop Transformation Recipes for Code Generation and Auto-Tuning

LoopBench: An Evaluation of Loop Acceleration in Heterogeneous Systems

LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers

Loose capacity-constrained representatives for the qualitative visual analysis in molecular dynamics

Lossless Acceleration for Seq2seq Generation with Aggressive Decoding

Lossless Compression of Variable-Precision Floating-Point Buffers on GPUs

Lossless data compression on GPGPU architectures

Lossless LZW Data Compression Algorithm on CUDA

Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level

Lost in Translation: Challenges in Automating CUDA-to-OpenCL Translation

Low Complexity Corner Detector Using CUDA for Multimedia Applications

Low cost approach to real-time vehicle to vehicle communication using parallel CPU and GPU processing

Low Latency Complex Event Processing on Parallel Hardware

Low latency photon mapping using block hashing

Low viscosity flow simulations for animation

Low-complexity Distributed Tomographic Backprojection for large datasets

Low-cost edge computing using upcycled smartphones

Low-cost, high-speed computer vision using NVIDIA’s CUDA architecture

Low-Frequency MLFMA on Graphics Processors

Low-Impact Profiling of Streaming, Heterogeneous Applications

Low-Latency Elliptic Curve Scalar Multiplication

Low-latency Image Recognition with GPU-accelerated Convolutional Networks for Web-based Services

Low-overhead diskless checkpoint for hybrid computing systems

Low-Overhead Trace Collection and Profiling on GPU Compute Kernels

Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II

Low-power Task Scheduling for GPU Energy Reduction

LS-CAT: A Large-Scale CUDA AutoTuning Dataset

LTE Physical Layer Implementation Using GPU Based High Performance Computing

LTTng CLUST: A system-wide unified CPU and GPU tracing tool for OpenCL applications

LU Factorization for Accelerator-based Systems

LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System

LU Factorization with Partial Pivoting for a Multicore System with Accelerators

LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware

LU, QR, and Cholesky factorizations: Programming Model, Performance Analysis and Optimization Techniques for the Intel Knights Landing Xeon Phi

LUDA: Boost LSM Key Value Store Compactions with GPUs

Luthier: Bridging Auto-Tuning and Vendor Libraries for Efficient Deep Learning Inference

Lynx: A Dynamic Instrumentation System for Data-Parallel Applications on GPGPU Architectures

Lyra2: Password Hashing Scheme with improved security against time-memory trade-offs

MACC: An OpenACC Transpiler for Automatic Multi-GPU Use

Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey

Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability

Machine Learning Based Intrusion Detection in Controller Area Networks

Machine learning enhanced code optimization for high-level synthesis (ML-ECOHS)

Machine Learning for CUDA+MPI Design Rules

Titles: 100
open PDFs: 97
packages: 24
