high performance computing on graphics processing units: hgpu.org

Programming

Program Acceleration in a Heterogeneous Computing Environment Using OpenCL, FPGA, and CPU

Herman Noel Hoffman

Tags: Computer science, FPGA, Heterogeneous systems, OpenCL, SoC, Thesis

June 5, 2017 by hgpu

UT-OCL: An OpenCL Framework for Embedded Systems Using Xilinx FPGAs

Vincent Mirian

Tags: Computer science, FPGA, Heterogeneous systems, OpenCL, SoC, Thesis

June 5, 2017 by hgpu

SparkJNI: A Reference Design for a Heterogeneous Apache Spark Framework

Tudor Alexandru Voicu

Tags: big data, Computer science, FPGA, Heterogeneous systems, Hybrid computing, Java, OpenCL, Spark, Thesis

June 1, 2017 by hgpu

Accelerating Discrete Wavelet Transforms on GPUs

David Barina, Michal Kula, Michal Matysek, Pavel Zemcik

Tags: ATI, ATI Radeon HD 6970, Computer science, Discrete Wavelet Transform, HLSL, Image processing, nVidia, nVidia GeForce GTX Titan X, OpenCL, Pixel shaders

May 24, 2017 by hgpu

Implementing Efficient, Portable Computations for Machine Learning

Matthew Walter Moskewicz

Source codes

Tags: Algorithms, AMD R9 Nano, ATI, Computer science, Computer vision, CUDA, Deep learning, Machine learning, Neural networks, nVidia, nVidia GeForce GTX Titan X, OpenCL, Package, Thesis

May 24, 2017 by hgpu

CLBlast: A Tuned OpenCL BLAS Library

Cedric Nugteren

Source codes

Tags: AMD Radeon R9 M370X, ARM, ATI, BLAS, Computer science, Intel HD 5100, Linear Algebra, Machine learning, nVidia, nVidia GeForce GTX 750 Ti, nVidia GeForce GTX Titan X, OpenCL, Package

May 18, 2017 by hgpu

Resource-Aware Just-in-Time OpenCL Compiler for Coarse-Grained FPGA Overlays

Abhishek Kumar Jain, Douglas L. Maskell, Suhaib A. Fahmy

Tags: ARM, Computer science, DSP, FPGA, Heterogeneous systems, OpenCL

May 11, 2017 by hgpu

Towards Enhancing Performance, Programmability, and Portability in Heterogeneous Computing

Konstantinos Krommydas

Source codes

Tags: ATI, ATI Radeon HD 6550, ATI Radeon HD 7660, ATI Radeon HD 7970, Code generation, Compilers, Computer science, FPGA, Heterogeneous systems, Intel Xeon Phi, nVidia, OpenCL, Package, Performance, Tesla K20, Thesis

May 9, 2017 by hgpu

Acceleration of Deep Learning on FPGA

Huyuan Li

Tags: CNN, Computer science, Deep learning, FPGA, Neural networks, nVidia, OpenCL, Tesla K40, Thesis

May 9, 2017 by hgpu

Adaptive Optimization for OpenCL Programs on Embedded Heterogeneous Systems

Ben Taylor, Vicent Sanz Marco, Zheng Wang

Tags: ARM, Computer science, Energy-efficient computing, Heterogeneous systems, Machine learning, OpenCL

April 30, 2017 by hgpu

Accelerating Discrete Wavelet Transforms on Parallel Architectures

David Barina, Michal Kula, Michal Matysek, Pavel Zemcik

Tags: Algorithms, ATI, ATI Radeon HD 6970, Discrete Wavelet Transform, Image processing, nVidia, nVidia GeForce GTX Titan X, OpenCL, OpenGL, Performance, Pixel shaders

April 30, 2017 by hgpu

Developing a massive real-time crowd simulation framework on the GPU

Guillaume Payet

Tags: Algorithms, Computer science, Crowd simulation, nVidia, nVidia GeForce GTX 850 M, OpenCL, Thesis

April 26, 2017 by hgpu
