high performance computing on graphics processing units: hgpu.org

Matrix multiplication

Design Principles for Sparse Matrix Multiplication on the GPU

Carl Yang, Aydin Buluc, John D. Owens

Tags: Computer science, CUDA, Matrix multiplication, nVidia, Sparse matrix, Tesla K40

March 31, 2018 by hgpu

Technical Report about Tiramisu: a Three-Layered Abstraction for Hiding Hardware Complexity from DSL Compilers

Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Patricia Suriana, Shoaib Kamil, Saman Amarasinghe

Tags: Compilers, Computer science, DSL, FPGA, Matrix multiplication, nVidia, OpenMPI, Performance, Programming Languages, PTX, Tesla K40

March 3, 2018 by hgpu

Rubus: A compiler for seamless and extensible parallelism

Muhammad Adnan, Faisal Aslam, Zubair Nawaz, Syed Mansoor Sarwar

Source codes

Tags: Aparapi, Benchmarking, Computer science, CUDA, Java, Matrix multiplication, nVidia, nVidia GeForce GT 630 M, OpenCL, OpenGL, Package

January 6, 2018 by hgpu

Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II

Gaurav Mitra

Tags: ARM, Benchmarking, Computer science, DSP, Heterogeneous systems, Matrix multiplication, nVidia, nVidia Jetson TK1, nVidia Tegra TX1, OpenCL, SoC, Thesis

November 12, 2017 by hgpu

How to distribute most efficiently a computation intensive calculation on an Android device to external compute units with an Android API

Lander Beckers, Henning Lakiere

Tags: Android, Computer science, FPGA, Java, Matrix multiplication, OpenCL, SoC, Thesis

October 21, 2017 by hgpu

Out-of-core Implementation for Accelerator Kernels on Heterogeneous Clouds

Hamidreza Khaleghzadeh, Ziming Zhong, Ravi Reddy, Alexey Lastovetsky

Source codes

Tags: BLAS, Cloud, Computer science, CUBLAS, CUDA, FPGA, Heterogeneous systems, Intel Xeon Phi, Matrix multiplication, nVidia, OpenCL, Package, Virtualization

September 16, 2017 by hgpu

A Linear Algebra Approach to Fast DNA Mixture Analysis Using GPUs

Siddharth Samsi, Brian Helfer, Jeremy Kepner, Albert Reuther, Darrell O. Ricke

Tags: Algorithms, Computer science, CUDA, Genomics, Linear Algebra, Matrix multiplication, nVidia, Performance, Tesla K80

July 5, 2017 by hgpu

Investigation of heterogeneous computing through novel parallel programming platforms

Andrei-Alexandru Dafinoiu

Tags: ARM, ATI, ATI Radeon HD 6450, ATI Radeon HD 6570, Computer science, FPGA, Heterogeneous systems, Matrix multiplication, nVidia, nVidia GeForce GTX 750 Ti, OpenCL, SoC, Thesis

April 17, 2017 by hgpu

Parallel Multi Channel Convolution using General Matrix Multiplication

Aravind Vasudevan, Andrew Anderson, David Gregg

Tags: Algorithms, ARM, Computer science, CUDA, Deep learning, Machine learning, Matrix multiplication, Neural networks, nVidia, nVidia Tegra TX1, Performance

April 17, 2017 by hgpu

Improving the Performance of Fully Connected Neural Networks by Out-of-Place Matrix Transpose

Shaohuai Shi, Pengfei Xu, Xiaowen Chu

Source codes

Tags: Algorithms, BLAS, Caffe, Computer science, CUBLAS, CUDA, Deep learning, Linear Algebra, Matrix multiplication, Neural networks, nVidia, nVidia GeForce GTX 1080, Package, Performance

February 14, 2017 by hgpu

GPU-Accelerated SVM Training Algorithm Based on PC and Mobile Device

Yi-Yan Nan, Quan-Zhe Li, Jin-Chun Piao, Shin-Dug Kim

Tags: Algorithms, AMD Radeon R9 M200X, Android, ATI, Computer science, Image processing, Matrix multiplication, OpenCL, Pattern recognition

February 10, 2017 by hgpu

A Framework for Dense Triangular Matrix Kernels on Various Manycore Architectures

Ali Charara, David Keyes, Hatem Ltaief

Source codes

Tags: BLAS, Computer science, CUDA, Intel Xeon Phi, Linear Algebra, Matrix multiplication, nVidia, Package, Tesla K40

January 8, 2017 by hgpu
