Neural networks | hgpu.org

GBOTuner: Autotuning of OpenMP Parallel Codes with Bayesian Optimization and Code Representation Transfer Learning

Kimsong Lor

Tags: Bayesian, Computer science, Fortran, Neural networks, OpenMP, Thesis

August 3, 2025 by hgpu

Using Deep Reinforcement Learning for Automatic Code Optimization in the MLIR Compiler

M. Ameur Nassim, M. Tirichine Mohammed

Tags: Computer science, High Energy Physics - Lattice, Neural networks, Physics, QCD, Thesis

July 20, 2025 by hgpu

Accelerated discovery and design of Fe-Co-Zr magnets with tunable magnetic anisotropy through machine learning and parallel computing

Weiyi Xia, Maxim Moraru, Ying Wai Li, Timothy Liao, James R. Chelikowsky, Cai-Zhuang Wang

Tags: Anisotropy, Computational Physics, Condensed matter, CUDA, Machine learning, Materials Science, Neural networks, nVidia, nVidia A100, Package, Physics

July 6, 2025 by hgpu

Omniwise: Predicting GPU Kernels Performance with LLMs

Zixian Wang, Cole Ramos, Muhammad A. Awad, Keith Lowery

Tags: AMD, AMD Radeon Instinct MI250, AMD Radeon Instinct MI300X, Artificial intelligence, Benchmarking, Computer science, LLM, Neural networks, Performance, ROCm

June 29, 2025 by hgpu

A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs

Hossein Albakri, Kazem Cheshmi

Tags: Compilers, Computer science, CUBLAS, CUDA, Machine learning, Matrix multiplication, Neural networks, nVidia, nVidia A100, Programming Languages, Sparse matrix

June 22, 2025 by hgpu

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations

Paul Fuchs, Weilong Chen, Stephan Thaler, Julija Zavadlav

Tags: Chemistry, Computational Physics, Computer science, CUDA, Machine learning, Molecular dynamics, Neural networks, nVidia, nVidia A100, nVidia GH200, nVidia H100, Package, Physics

June 15, 2025 by hgpu

Exploring SYCL for batched kernels with memory allocations

Aymeric Millan, Thomas Padioleau, Julien Bigot

Tags: AMD Radeon Instinct MI250X, ATI, Computer science, CUDA, FFT, Neural networks, nVidia, nVidia A100, Package, performance portability, SYCL

May 25, 2025 by hgpu

LIFT: LLM-Based Pragma Insertion for HLS via GNN Supervised Fine-Tuning

Neha Prakriya, Zijian Ding, Yizhou Sun, Jason Cong

Tags: AMD Radeon Instinct MI250, ATI, Computer science, FPGA, HLS, LLM, Machine learning, Neural networks, PyTorch

May 4, 2025 by hgpu

Accelerating Sparse Graph Neural Networks with Tensor Core Optimization

Ka Wai Wu

Tags: Computer science, CUDA, Graph, Heterogeneous systems, Neural networks, nVidia, PyTorch, Tesla V100

December 24, 2024 by hgpu

Automating Energy-Efficient GPU Kernel Generation: A Fast Search-Based Compilation Approach

Yijia Zhang, Zhihong Gou, Shijie Cao, Weigang Feng, Sicheng Zhang, Guohao Dai, Ningyi Xu

Tags: Computer science, CUDA, Energy-efficient computing, Machine learning, Neural networks, nVidia, nVidia A100, nVidia GeForce RTX 4090, Performance

December 8, 2024 by hgpu

VitBit: Enhancing Embedded GPU Performance for AI Workloads through Register Operand Packing

Jaebeom Jeon, Minseong Gil, Junsu Kim, Jaeyong Park, Gunjae Koo, Myung Kuk Yoon, Yunho Oh

Tags: AI, Artificial intelligence, Computer science, CUDA, Deep learning, Neural networks, nVidia, nVidia Jetson AGX Orin, Performance

September 1, 2024 by hgpu

Evaluating Operators in Deep Neural Networks for Improving Performance Portability of SYCL

Zheming Jin

Tags: Benchmarking, Computer science, CUDA, HIP, Machine learning, Neural networks, nVidia, nVidia GeForce RTX 2080, nVidia GeForce RTX 3090, nVidia H100, oneAPI, Performance, performance portability, SYCL, Tesla A100, Tesla V100

August 14, 2024 by hgpu

Recent source codes

Luthier: Bridging Auto-Tuning and Vendor Libraries for Efficient Deep Learning Inference

Fused Kernel Library (FKL)

The Fused Kernel Library: A C++ API to Develop Highly-Efficient GPU Libraries

GPUHammer: Rowhammer Attacks on GPU Memories are Practical

Block: Balance Loader of LLM Serving with Context, Knowledge and Predictive Scheduling

Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling

SIGMo: Scalable Isomorphism Graph Matching on GPUs

SIGMo: High-Throughput Batched Subgraph Isomorphism on GPUs for Molecular Matching

DGEMM without FP64 Arithmetic - using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme

DGEMM without FP64 Arithmetic – using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme

GEAK-agent: LLM-based AI agent, which can write correct and efficient GPU kernels automatically

Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

OpenDwarfs 2025: re-engineered version of the OpenDwarfs benchmark suite, for compatibility with modern platforms

OpenDwarfs 2025: Modernizing the OpenDwarfs Benchmark Suite for Heterogeneous Computing

Specx: Speculative task-based runtime system

Specx: a C++ task-based runtime system for heterogeneous distributed architectures

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation
