## Posts

Jun, 23

### A Static Analysis-based Cross-Architecture Performance Prediction Using Machine Learning

Porting code from CPU to GPU is costly and time-consuming; Unless much time is invested in development and optimization, it is not obvious, a priori, how much speed-up is achievable or how much room is left for improvement. Knowing the potential speed-up a priori can be very useful: It can save hundreds of engineering hours, […]

Jun, 23

### Data-Parallel Flattening by Expansion

We present a higher-order programmer-level technique for compiling particular kinds of irregular data-parallel problems to parallel hardware. The technique, which we have named "flattening-by-expansion" builds on a number of segmented data-parallel operations but is itself implemented as a higher-order generic function, which makes it useful for many irregular problems. Concretely, the implementation is given in […]

Jun, 23

### IA-SpGEMM: An Input-aware Auto-tuning Framework for Parallel Sparse Matrix-Matrix Multiplication

Sparse matrix-matrix multiplication (SpGEMM) is a sparse kernel that is used in a number of scientific applications. Although several SpGEMM algorithms have been proposed, almost all of them are restricted to the compressed sparse row (CSR) format, and the possible performance gain from exploiting other formats has not been well studied. The particular format and […]

Jun, 23

### GPU Volume Voxelization: Exploration of the performance characteristics of different GPU-based implementations

In recent years, voxel-based modelling has seen a reintroduction to computer game development through massive graphics hardware improvements. Nevertheless, polygons continue to be the default building block of 3D objects, introducing a need for the transformation of polygon meshes into voxel-based models; this process is known as voxelization. Efficient voxelization algorithms take advantage of the […]

Jun, 23

### MGPUSim: Enabling Multi-GPU Performance Modeling and Optimization

The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in raw computational power of Graphics Processing Units (GPUs). As single-GPU platforms struggle to satisfy these performance demands, multi-GPU platforms have started to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU […]

Jun, 20

### TensorNetwork for Machine Learning

We demonstrate the use of tensor networks for image classification with the TensorNetwork open source library. We explain in detail the encoding of image data into a matrix product state form, and describe how to contract the network in a way that is parallelizable and well-suited to automatic gradients for optimization. Applying the technique to […]

Jun, 20

### Accelerating Concurrent Heap on GPUs

Priority queue, often implemented as a heap, is an abstract data type that has been used in many well-known applications like Dijkstra’s shortest path algorithm, Prim’s minimum spanning tree, Huffman encoding, and the branch-and-bound algorithm. However, it is challenging to exploit the parallelism of the heap on GPUs since the control divergence and memory irregularity […]

Jun, 20

### High-Performance Deep Learning via a Single Building Block

Deep learning (DL) is one of the most prominent branches of machine learning. Due to the immense computational cost of DL workloads, industry and academia have developed DL libraries with highly-specialized kernels for each workload/architecture, leading to numerous, complex code-bases that strive for performance, yet they are hard to maintain and do not generalize. In […]

Jun, 16

### Performance Evaluation and Analysis of Sparse Matrix and Graph Kernels on Heterogeneous Processors

Heterogeneous processors integrate very distinct compute resources such as CPUs and GPUs into the same chip, thus can exploit the advantages and avoid disadvantages of those compute units. We in this work evaluate and analyze eight sparse matrix and graph kernels on an AMD CPU-GPU heterogeneous processor by using 956 sparse matrices. Five characteristics, i.e., […]

Jun, 16

### RTX Beyond Ray Tracing: Exploring the Use of Hardware Ray Tracing Cores for Tet-Mesh Point Location

We explore a first proof-of-concept example of creatively using the Turing generation’s hardware ray tracing cores to solve a problem other than classical ray tracing, specifically, point location in unstructured tetrahedral meshes. Starting with a CUDA reference method, we describe and evaluate three different approaches to reformulate this problem in a manner that allows it […]

Jun, 16

### SYCL Code Generation for Multigrid Methods

Multigrid methods are fast and scalable numerical solvers for partial differential equations (PDEs) that possess a large design space for implementing their algorithmic components. Code generation approaches allow formulating multigrid methods on a higher level of abstraction that can then be used to define a problem- and hardwarespecific solution. Since these problems have considerable implementation […]

Jun, 16

### Software Compilation Techniques for Heterogeneous Embedded Multi-Core Systems

The increasing demands of modern embedded systems, such as highperformance and energy-efficiency, have motivated the use of heterogeneous multicore platforms enabled by Multiprocessor System-on-Chips(MPSoCs). To fully exploit the power of these platforms, new tools are needed to address the increasing software complexity to achieve a high productivity. An MPSoC compiler is a toolchain to tackle […]