high performance computing on graphics processing units: hgpu.org

Optimization of the Brillouin operator on the KNL architecture

Stephan Durr

University of Wuppertal, Gaussstrasse 20, D-42119 Wuppertal, Germany

arXiv:1709.01828 [hep-lat], (6 Sep 2017)

@article{durr2017optimization,
  title={Optimization of the Brillouin operator on the KNL architecture},
  author={Durr, Stephan},
  year={2017},
  month={sep},
  eprint={1709.01828},
  archivePrefix={arXiv},
  primaryClass={hep-lat}
}

Experiences with optimizing the matrix-times-vector application of the Brillouin operator on the Intel KNL processor are reported. Without adjustments to the memory layout, performance figures of 360 Gflop/s in single and 270 Gflop/s in double precision are observed. This is with N_c=3 colors, N_v=12 right-hand-sides, N_{thr}=256 threads, on lattices of size 32^3*64, using exclusively OMP pragmas. Interestingly, the same routine performs quite well on Intel Core i7 architectures, too. Some observations on the much harder Wilson fermion matrix-times-vector optimization problem are added.
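To make the setup in the abstract concrete, the following is a minimal sketch of an OpenMP-parallel stencil application to several right-hand sides at once. It is not the author's routine: it applies a free (gauge-field-free) 81-point hypercubic stencil to real numbers, omitting the color and spinor indices and the gauge links that the actual Brillouin operator carries. The function name `stencil81_apply`, the helpers `wrap` and `site`, the caller-supplied weight array `w` (indexed by the number of non-zero offsets) and the layout with the right-hand-side index running fastest are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Periodic wrap of coordinate x + d on an axis of length n, d in {-1,0,+1}. */
static int wrap(int x, int d, int n) { return (x + d + n) % n; }

/* Lexicographic site index on an nx*ny*nz*nt lattice, x running fastest. */
static size_t site(int x, int y, int z, int t, int nx, int ny, int nz)
{
    return (size_t)x + (size_t)nx * ((size_t)y + (size_t)ny *
           ((size_t)z + (size_t)nz * (size_t)t));
}

/* Hypothetical helper: out = S * in for nv right-hand sides, where S is a
 * periodic 81-point stencil whose weight w[h] depends only on the number h
 * of non-zero offsets (0..4). Assumed layout: index = site * nv + v. */
void stencil81_apply(const float *in, float *out, const float w[5],
                     int nx, int ny, int nz, int nt, int nv)
{
    assert(in != out && nx > 0 && ny > 0 && nz > 0 && nt > 0 && nv > 0);

    /* Threads share the (t,z) planes; each output site is written once. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int t = 0; t < nt; t++) {
        for (int z = 0; z < nz; z++) {
            for (int y = 0; y < ny; y++) {
                for (int x = 0; x < nx; x++) {
                    float *o = out + site(x, y, z, t, nx, ny, nz) * (size_t)nv;
                    for (int v = 0; v < nv; v++) o[v] = 0.0f;

                    /* Visit all 3^4 = 81 points of the surrounding hypercube. */
                    for (int dt = -1; dt <= 1; dt++)
                    for (int dz = -1; dz <= 1; dz++)
                    for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++) {
                        int hops = (dx != 0) + (dy != 0) + (dz != 0) + (dt != 0);
                        const float *i = in + (size_t)nv *
                            site(wrap(x, dx, nx), wrap(y, dy, ny),
                                 wrap(z, dz, nz), wrap(t, dt, nt), nx, ny, nz);
                        /* Innermost loop over right-hand sides vectorizes. */
                        #pragma omp simd
                        for (int v = 0; v < nv; v++) o[v] += w[hops] * i[v];
                    }
                }
            }
        }
    }
}
```

Compiled without OpenMP support the pragmas are ignored and the routine runs serially with the same result. In this sketch, putting the right-hand-side index innermost gives the compiler a contiguous loop of length `nv` to vectorize, and each neighbor site that is loaded is reused for all right-hand sides.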

Tags: High Energy Physics – Lattice, Intel Xeon Phi, OpenMP, Physics, QCD

September 12, 2017 by hgpu
