high performance computing on graphics processing units: hgpu.org

Posts

May, 23

Instructions’ Latencies Characterization for NVIDIA GPGPUs

The last decade has seen a shift in the computer systems industry where heterogeneous computing has become prevalent. Nowadays, Graphics Processing Units (GPUs) are found in a variety of systems, from supercomputers to mobile phones and tablets. They are used not only for graphics operations but also as general-purpose special hardware (GPGPUs) to boost the performance […]

CUDA

May, 19

Neural Query Language: A Knowledge Base Query Language for Tensorflow

Large knowledge bases (KBs) are useful for many AI tasks, but are difficult to integrate into modern gradient-based learning systems. Here we describe a framework for accessing a soft symbolic database using only differentiable operators. For example, this framework makes it easy to write neural models that adjust confidences associated with facts in a soft […]

May, 19

Optimizing the Linear Fascicle Evaluation Algorithm for Multi-Core and Many-Core Systems

Sparse matrix-vector multiplication (SpMV) operations are commonly used in various scientific applications. The performance of the SpMV operation often depends on exploiting regularity patterns in the matrix. Various representations have been proposed to minimize the memory bandwidth bottleneck arising from the irregular memory access pattern involved. Among recent representation techniques, tensor decomposition is a popular […]

CUDA
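To show where the irregular memory access in SpMV comes from, here is a minimal sketch of y = A·x using the widely used Compressed Sparse Row (CSR) layout. This is an illustrative baseline only, not the tensor-decomposition representation discussed in the paper; the function name `spmv_csr` and the example matrix are hypothetical.

```python
from typing import List


def spmv_csr(values: List[float], col_idx: List[int],
             row_ptr: List[int], x: List[float]) -> List[float]:
    """Compute y = A @ x for a sparse matrix A stored in CSR format.

    values  -- nonzero entries of A, stored row by row
    col_idx -- column index of each entry in `values`
    row_ptr -- row i's nonzeros occupy values[row_ptr[i]:row_ptr[i + 1]]
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect read: x is indexed through col_idx, so the order in
            # which x is touched depends on A's sparsity pattern. This is
            # the irregular access that makes SpMV memory-bandwidth bound.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Hypothetical example matrix:
#   A = [[4, 0, 1],
#        [0, 2, 0],
#        [3, 0, 5]]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]
print(spmv_csr(values, col_idx, row_ptr, x))  # [7.0, 4.0, 18.0]
```

Only the nonzeros are stored and streamed contiguously; the cost is the gather through `col_idx`, which is why representations that expose regularity in the matrix can reduce memory traffic.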

May, 19

Accelerating Deterministic and Stochastic Binarized Neural Networks on FPGAs Using OpenCL

Recent technological advances have proliferated the available computing power, memory, and speed of modern Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Field Programmable Gate Arrays (FPGAs). Consequently, the performance and complexity of Artificial Neural Networks (ANNs) are burgeoning. While GPU-accelerated Deep Neural Networks (DNNs) currently offer state-of-the-art performance, they consume large amounts […]

OpenCL

May, 19

Automatic Virtualization of Accelerators

Applications are migrating en masse to the cloud, while accelerators such as GPUs, TPUs, and FPGAs proliferate in the wake of Moore’s Law. These technological trends are incompatible. Cloud applications run on virtual platforms, but traditional I/O virtualization techniques have not provided production-ready solutions for accelerators. As a result, cloud providers expose accelerators by using […]

OpenCL

May, 19

OpenDNN: An Open-source, cuDNN-like Deep Learning Primitive Library

Deep neural networks (DNNs) are a key enabler of today’s intelligent applications and services. cuDNN is the de facto standard library of deep learning primitives, which makes it easy to develop sophisticated DNN models. However, cuDNN is proprietary software from NVIDIA, and thus does not allow the user to customize it based on her needs. […]

OpenCL

May, 15

CUDA au Coq: A Framework for Machine-validating GPU Assembly Programs

A prototype framework for formal, machine-checked validation of GPU pseudo-assembly code algorithms using the Coq proof assistant is presented and discussed. The framework is the first to afford GPU programmers a reliable means of formally machine-validating high-assurance GPU computations without trusting any specific source-to-assembly compilation toolchain. A formal operational semantics for the PTX pseudo-assembly language […]

CUDA

May, 15

A Unified Approach to Variable Renaming for Enhanced Vectorization

Despite the fact that compiler technologies for automatic vectorization have been under development for over four decades, there are still considerable gaps in the capabilities of modern compilers to perform automatic vectorization for SIMD units. One such gap can be found in the handling of loops with dependence cycles that involve memory-based anti (write-after-read) and […]

CUDA

May, 15

An optimizing multi-platform source-to-source compiler framework for the NEURON MODeling Language

Domain-specific languages (DSLs) play an increasingly important role in the generation of high-performance software. They allow the user to exploit specific knowledge encoded in the constructs for the generation of code adapted to a particular hardware architecture; at the same time, they make it easier to generate optimized code for a multitude of platforms […]

CUDA

May, 12

Improving Resource Efficiency in Virtualized Datacenters

In recent years there has been an extraordinary growth of the Internet of Things (IoT) and its protocols. The increasing diffusion of electronic devices with identification, computing and communication capabilities is laying the ground for the emergence of a highly distributed service and networking environment. This situation implies that there is an increasing demand […]

CUDA

May, 12

FPGA Implementation of Reduced Precision Convolutional Neural Networks

With the improvement in processing systems, machine learning applications are finding widespread use in almost all sectors of technology. Image recognition is one application of machine learning that has become widely popular, with various architectures and systems aimed at improving recognition performance. With classification accuracy now approaching the saturation point, many researchers are now focusing on […]

CUDA

May, 12

Arbitrarily large iterative tomographic reconstruction on multiple GPUs using the TIGRE toolbox

Tomographic image sizes keep increasing over time, and while the GPUs that compute the tomographic reconstruction are also growing in memory size, they are not doing so fast enough to reconstruct the largest datasets. This problem is often solved by reconstructing data in large clusters of GPUs with enough devices to fit the measured X-ray […]

CUDA


* * *


Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container
