high performance computing on graphics processing units: hgpu.org

Applications

SoAx: A generic C++ Structure of Arrays for handling Particles in HPC Codes

Holger Homann, Francois Laenen

Source codes

Tags: Benchmarking, Computational Physics, Computer science, CUDA, Heterogeneous systems, nVidia, nVidia GeForce GT 755 M, Package, Performance, Physics, Tesla M2050

October 15, 2017 by hgpu

FPGA implementation of a Convolutional Neural Network for "Wake up word" detection

Ole Martin Skafsa

Tags: CNN, Computer science, Deep learning, FPGA, Machine learning, Neural networks, NLP, nVidia, nVidia Tegra TX1, OpenCL, Thesis

October 3, 2017 by hgpu

Energy efficiency of finite difference algorithms on multicore CPUs, GPUs, and Intel Xeon Phi processors

Satya P. Jammy, Christian T. Jacobs, David J. Lusher, Neil D. Sandham

Tags: Algorithms, Computer science, CUDA, Energy-efficient computing, Finite difference, Intel Xeon Phi, nVidia, Tesla K40

October 3, 2017 by hgpu

An Efficient Load Balancing Method for Tree Algorithms

Osama Talaat Ibrahim, Ahmed El-Mahdy

Tags: Algorithms, Computer science, Intel Xeon Phi, load balancing, Performance

October 3, 2017 by hgpu

Computing Treewidth on the GPU

Tom C. van der Zanden, Hans L. Bodlaender

Source codes

Tags: Algorithms, Bloom filter, Computer science, Graph theory, nVidia, nVidia GeForce GTX 1060, OpenCL, Package

October 3, 2017 by hgpu

Performance Evaluation of Container-based Virtualization for High Performance Computing Environments

Carlos Arango, Remy Dernat, John Sanabria

Source codes

Tags: Benchmarking, Computer science, CUDA, MPI, nVidia, Operating systems, Package, Performance, Tesla K20, Virtualization

October 3, 2017 by hgpu

OpenCL Actors – Adding Data Parallelism to Actor-based Programming with CAF

Raphael Hiesgen, Dominik Charousset, Thomas C. Schmidt

Source codes

Tags: Computer science, Data parallelism, Heterogeneous systems, Intel Xeon Phi, nVidia, nVidia GeForce GTX 780 M, OpenCL, Package, Tesla C2075

September 28, 2017 by hgpu

Mixed Precision Solver Scalable to 16000 MPI Processes for Lattice Quantum Chromodynamics Simulations on the Oakforest-PACS System

Taisuke Boku, Ishikawa Ken-Ichi, Yoshinobu Kuramashi, Lawrence Meadows

Source codes

Tags: Computational Physics, High Energy Physics - Lattice, Intel Xeon Phi, Mixed precision, MPI, OpenMP, Package, Physics, QCD

September 28, 2017 by hgpu

Modeling the Resource Requirements of Convolutional Neural Networks on Mobile Devices

Zongqing Lu, Swati Rallapalli, Kevin Chan, Thomas La Porta

Tags: Computer science, Computer vision, CUDA, Deep learning, Neural networks, nVidia, nVidia Jetson TK1, nVidia Tegra TX1

September 28, 2017 by hgpu

GALARIO: a GPU Accelerated Library for Analysing Radio Interferometer Observations

Marco Tazzari, Frederik Beaujean, Leonardo Testi

Source codes

Tags: Astrophysics, CUDA, Instrumentation and Methods for Astrophysics, nVidia, nVidia GeForce GTX 1060, Package, Python, Tesla P100

September 28, 2017 by hgpu

Accelerating Electron Tomography Reconstruction Algorithm ICON Using the Intel Xeon Phi Coprocessor on Tianhe-2 Supercomputer

Zihao Wang, Yu Chen, Jingrong Zhang, Lun Li, Xiaohua Wan, Zhiyong Liu, Fei Sun, Fa Zhang

Tags: intel mic, Intel Xeon Phi, NUFFT, supercomputer, Tianhe-2

September 28, 2017 by holy

Asynchronous Task-Based Polar Decomposition on Single Node Manycore Architectures

Dalal Sukkari, Hatem Ltaief, Mathieu Faverge, David Keyes

Tags: Algorithms, Benchmarking, Computer science, Factorization, Intel Xeon Phi, nVidia, StarPU, Task scheduling, Tesla K80, Tesla P100

September 21, 2017 by hgpu
