high performance computing on graphics processing units: hgpu.org

Posts

Jul, 28

Benchmarking TPU, GPU, and CPU Platforms for Deep Learning

Training deep learning models is compute-intensive and there is an industry-wide trend towards hardware specialization to improve performance. To systematically benchmark deep learning platforms, we introduce ParaDnn, a parameterized benchmark suite for deep learning that generates end-to-end models for fully connected (FC), convolutional (CNN), and recurrent (RNN) neural networks. Along with six real-world models, we […]

CUDA

Jul, 26

MagmaDNN: Towards High-Performance Data Analytics and Machine Learning for Data-Driven Scientific Computing

In this paper, we present work towards the development of a new data analytics and machine learning (ML) framework, called MagmaDNN. Our main goal is to provide scalable, high-performance data analytics and ML solutions for scientific applications running on current and upcoming heterogeneous many-core GPU-accelerated architectures. To this end, since many of the functionalities needed […]

CUDA

Jul, 24

Assessing the feasibility of OpenCL CPU implementations for agent-based simulations

Agent-based modeling (ABM) is a bottom-up modeling approach, where each entity of the system being modeled is uniquely represented as a self-determining agent. Large scale emergent behavior in ABMs is population sensitive. As such, it is advisable that the number of agents in a simulation is able to reflect the reality of the system being […]

OpenCL

Jul, 21

Sorting on FPGAs using Merge Trees

Hardware Mergers can be used to implement sorting algorithms on Field-Programmable Gate Arrays (FPGAs) by inductively merging elements as in the Merge Sort algorithm.[1][2] These Hardware Mergers have also been laid out onto the FPGA in a complete binary tree pattern (called a Hardware Merge Tree) which further enhances performance of the sorting procedure by […]
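As a rough software illustration of the merge-tree idea described above, the sketch below feeds single-element runs through a binary tree of two-input mergers. It is only an analogy under stated assumptions: the names `merge_two` and `merge_tree_sort` are hypothetical, each "merger" is emulated with Python's `heapq.merge`, and nothing here reflects the paper's actual FPGA design.

```python
import heapq
from typing import Iterable, Iterator, List


def merge_two(left: Iterable[int], right: Iterable[int]) -> Iterator[int]:
    """Hypothetical software stand-in for one two-input merger."""
    return heapq.merge(left, right)


def merge_tree_sort(values: List[int]) -> List[int]:
    """Sort by passing single-element runs through a binary tree of mergers."""
    # Leaves of the tree: each input element is a sorted run of length 1.
    level: List[Iterable[int]] = [[v] for v in values]
    if not level:
        return []
    # Each pass builds one tree level by merging neighbouring streams pairwise.
    while len(level) > 1:
        next_level: List[Iterable[int]] = []
        for i in range(0, len(level) - 1, 2):
            next_level.append(merge_two(level[i], level[i + 1]))
        if len(level) % 2 == 1:
            # An unpaired stream passes through to the next level unchanged.
            next_level.append(level[-1])
        level = next_level
    # The root merger emits the fully sorted sequence.
    return list(level[0])
```

Calling `merge_tree_sort([5, 2, 9, 1])` drains the root merger, which lazily pulls values up through the tree, loosely mirroring how data would stream through chained mergers.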

Jul, 21

A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels

This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that shifts partial sums by CUDA warp primitives for the computation. We also employ register files as a cache resource in order to operate the entire model efficiently. We demonstrate […]

CUDA

Jul, 21

GRN: Gated Relation Network to Enhance Convolutional Neural Network for Named Entity Recognition

The dominant approaches for named entity recognition (NER) mostly adopt complex recurrent neural networks (RNN), e.g., long-short-term-memory (LSTM). However, RNNs are limited by their recurrent nature in terms of computational efficiency. In contrast, convolutional neural networks (CNN) can fully exploit the GPU parallelism with their feedforward architectures. However, little attention has been paid to performing […]

Jul, 21

A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition

Modern Automatic Speech Recognition (ASR) systems rely on distributed deep learning for quick training completion. To enable efficient distributed training, it is imperative that the training algorithms can converge with a large mini-batch size. In this work, we discovered that Asynchronous Decentralized Parallel Stochastic Gradient Descent (ADPSGD) can work with much larger batch size […]

CUDA

Jul, 21

Block based Singular Value Decomposition approach to matrix factorization for recommender systems

With the abundance of data in recent years, interesting challenges are posed in the area of recommender systems. Producing high quality recommendations with scalability and performance is the need of the hour. Singular Value Decomposition (SVD) based recommendation algorithms have been leveraged to produce better results. In this paper, we extend the SVD technique further for […]

CUDA

Jul, 16

Out-of-core singular value decomposition

Singular value decomposition (SVD) is a standard matrix factorization technique that produces optimal low-rank approximations of matrices. It has diverse applications, including machine learning, data science and signal processing. However, many common problems involve very large matrices that cannot fit in the main memory of commodity computers, making it impractical to use standard SVD algorithms […]

Jul, 14

On the Portability of GPU-Accelerated Applications via Automated Source-to-Source Translation

Over the past decade, accelerator-based supercomputers have grown from 0% to 42% performance share on the TOP500. Ideally, GPU-accelerated code on such systems should be "write once, run anywhere," regardless of the GPU device (or for that matter, any parallel device, e.g., CPU or FPGA). In practice, however, portability can be significantly more limited due […]

CUDA • OpenCL

Jul, 14

HashGraph – Scalable Hash Tables Using A Sparse Graph Data Structure

Hash tables are ubiquitous and used in a wide range of applications for efficient probing of large and unsorted data. If designed properly, hash tables can enable efficient lookups in a constant number of operations, commonly referred to as O(1) operations. As data sizes continue to grow and data becomes less structured (as is […]

CUDA

Jul, 14

A Translation Framework from RVC-CAL Dataflow Programs to OpenCL/SYCL based Implementations

Conventional programming languages still rely on sequential Models of Computation (MoC). However, hardware increasingly uses parallelism to improve performance, e.g. through a growing number of cores. Programming languages that still rely on sequential MoCs are not well suited to fully utilise this hardware. Dataflow programming languages like RVC-CAL […]

OpenCL

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)