Posts
Jun, 30
Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations
The validation and deployment of novel research ideas in the field of Deep Learning is often limited by the availability of efficient compute kernels for certain basic primitives. In particular, operations that cannot leverage existing vendor libraries (e.g., cuBLAS, cuDNN) are at risk of facing poor device utilization unless custom implementations are written by experts […]
Jun, 30
HEATS: Heterogeneity- and Energy-Aware Task-based Scheduling
Cloud providers usually offer diverse types of hardware for their users. Customers exploit this option to deploy cloud instances featuring GPUs, FPGAs, architectures other than x86 (e.g., ARM, IBM Power8), or featuring certain specific extensions (e.g, Intel SGX). We consider in this work the instances used by customers to deploy containers, nowadays the de facto […]
Jun, 27
ReSYCLator: Transforming CUDA C++ source code into SYCL
CUDA while very popular, is not as flexible with respect to target devices as OpenCL. While parallel algorithm research might address problems first with a CUDA C++ solution, those results are not easily portable to a target not directly supported by CUDA. In contrast, a SYCL C++ solution can operate on the larger variety of […]
Jun, 27
Heterogeneous Active Messages (HAM) – Implementing Lightweight Remote Procedure Calls in C++
We present HAM (Heterogeneous Active Messages), a C++-only active messaging solution for heterogeneous distributed systems.Combined with a communication protocol, HAM can be used as a generic Remote Procedure Call (RPC) mechanism. It has been used in HAM-Offload to implement a low-overhead offloading framework for inter- and intra-node offloading between different architectures including accelerators like the […]
Jun, 27
Mirovia: A Benchmarking Suite for Modern Heterogeneous Computing
This paper presents Mirovia, a benchmark suite developed for modern day heterogeneous computing. Previous benchmark suites such as Rodinia [1] and SHOC [2] are well written and have many desirable features. However, these tools were developed years ago when hardware was less powerful and software had fewer features. For example, unified memory was introduced in […]
Jun, 23
A Static Analysis-based Cross-Architecture Performance Prediction Using Machine Learning
Porting code from CPU to GPU is costly and time-consuming; Unless much time is invested in development and optimization, it is not obvious, a priori, how much speed-up is achievable or how much room is left for improvement. Knowing the potential speed-up a priori can be very useful: It can save hundreds of engineering hours, […]
Jun, 23
Data-Parallel Flattening by Expansion
We present a higher-order programmer-level technique for compiling particular kinds of irregular data-parallel problems to parallel hardware. The technique, which we have named "flattening-by-expansion" builds on a number of segmented data-parallel operations but is itself implemented as a higher-order generic function, which makes it useful for many irregular problems. Concretely, the implementation is given in […]
Jun, 23
IA-SpGEMM: An Input-aware Auto-tuning Framework for Parallel Sparse Matrix-Matrix Multiplication
Sparse matrix-matrix multiplication (SpGEMM) is a sparse kernel that is used in a number of scientific applications. Although several SpGEMM algorithms have been proposed, almost all of them are restricted to the compressed sparse row (CSR) format, and the possible performance gain from exploiting other formats has not been well studied. The particular format and […]
Jun, 23
GPU Volume Voxelization: Exploration of the performance characteristics of different GPU-based implementations
In recent years, voxel-based modelling has seen a reintroduction to computer game development through massive graphics hardware improvements. Nevertheless, polygons continue to be the default building block of 3D objects, introducing a need for the transformation of polygon meshes into voxel-based models; this process is known as voxelization. Efficient voxelization algorithms take advantage of the […]
Jun, 23
MGPUSim: Enabling Multi-GPU Performance Modeling and Optimization
The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in raw computational power of Graphics Processing Units (GPUs). As single-GPU platforms struggle to satisfy these performance demands, multi-GPU platforms have started to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU […]
Jun, 20
TensorNetwork for Machine Learning
We demonstrate the use of tensor networks for image classification with the TensorNetwork open source library. We explain in detail the encoding of image data into a matrix product state form, and describe how to contract the network in a way that is parallelizable and well-suited to automatic gradients for optimization. Applying the technique to […]
Jun, 20
Accelerating Concurrent Heap on GPUs
Priority queue, often implemented as a heap, is an abstract data type that has been used in many well-known applications like Dijkstra’s shortest path algorithm, Prim’s minimum spanning tree, Huffman encoding, and the branch-and-bound algorithm. However, it is challenging to exploit the parallelism of the heap on GPUs since the control divergence and memory irregularity […]