Posts

Feb, 23

Let’s sort this out: GPGPU Verification of Radix Sort

This paper shows how the VerCors verification toolset can be used to prove data race freedom and functional correctness of a parallel radix sort algorithm for GPUs. This is a widely used standard sorting implementation for GPGPU programming frameworks and therefore its correctness is of utmost importance. Additionally, it presents the usefulness of VerCors as […]
Feb, 23

From English To Foreign Languages: Transferring Pre-trained Language Models

Pre-trained models have demonstrated their effectiveness in many downstream natural language processing (NLP) tasks. The availability of multilingual pre-trained models enables zero-shot transfer of NLP tasks from high resource languages to low resource ones. However, recent research in improving pre-trained models focuses heavily on English. While it is possible to train the latest neural architectures […]
Feb, 23

High-Performance High-Order Stencil Computation on FPGAs Using OpenCL

In this paper we evaluate the performance of FPGAs for high-order stencil computation using High-Level Synthesis. We show that despite the higher computation intensity and on-chip memory requirement of such stencils compared to first-order ones, our design technique with combined spatial and temporal blocking remains effective. This allows us to reach similar, or even higher, […]
Feb, 16

The Deep Learning Compiler: A Comprehensive Survey

The difficulty of deploying various deep learning (DL) models on diverse DL hardware has boosted the research and development of DL compilers in the community. Several DL compilers have been proposed from both industry and academia, such as TensorFlow XLA and TVM. Similarly, the DL compilers take the DL models described in different DL frameworks […]
Feb, 16

EASYPAP: a Framework for Learning Parallel Programming

This paper presents EASYPAP, an easy-to-use programming environment designed to help students to learn parallel programming. EASYPAP features a wide range of 2D computation kernels that the students are invited to parallelize using Pthreads, OpenMP, OpenCL or MPI. Execution of kernels can be interactively visualized, and powerful monitoring tools allow students to observe both the […]
Feb, 16

ISM2: Optimizing Irregular-Shaped Matrix-Matrix Multiplication on GPUs

Linear algebra operations are widely used in big data analytics and scientific computations. Much work has been done on optimizing linear algebra operations on GPUs with regular-shaped input. However, few works focus on fully utilizing GPU resources when the input is not regular-shaped. Current optimizations do not consider fully utilizing the memory bandwidth […]
Feb, 16

LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment

Pairwise sequence alignment is one of the most computationally intensive kernels in genomic data analysis, accounting for more than 90% of the runtime for key bioinformatics applications. This method is particularly expensive for third-generation sequences due to the high computational cost of analyzing sequences of length between 1Kb and 1Mb. Given the quadratic overhead of […]
Feb, 16

Task-based, GPU-accelerated and Robust Library for Solving Dense Nonsymmetric Eigenvalue Problems

In this paper, we present the StarNEig library for solving dense nonsymmetric standard and generalized eigenvalue problems. The library is built on top of the StarPU runtime system and targets both shared and distributed memory machines. Some components of the library have support for GPU acceleration. The library is currently in an early beta state […]
Feb, 9

Working With Incremental Spatial Data During Parallel (GPU) Computation

Central to many complex systems, spatial actors require an awareness of their local environment to enable behaviours such as communication and navigation. Complex system simulations represent this behaviour with Fixed Radius Near Neighbours (FRNN) search. This algorithm allows actors to store data at spatial locations and then query the data structure to find all data […]
Feb, 9

Automated Runtime Analysis and Adaptation for Scalable Heterogeneous Computing

In the last decade, there have been tectonic shifts in computer hardware as sequential CPU performance has reached its physical limits. As a consequence, current high-performance computing (HPC) systems integrate a wide variety of compute resources with different capabilities and execution models, ranging from multi-core CPUs to many-core accelerators. While such heterogeneous systems […]
Feb, 9

TC-CIM: Empowering Tensor Comprehensions for Computing-In-Memory

Memristor-based, non-von-Neumann architectures performing tensor operations directly in memory are a promising approach to address the ever-increasing demand for energy-efficient, high-throughput hardware accelerators for Machine Learning (ML) inference. A major challenge for the programmability and exploitation of such Computing-In-Memory (CIM) architectures consists in the efficient mapping of tensor operations from high-level ML frameworks to fixed-function […]
Feb, 9

MKPipe: A Compiler Framework for Optimizing Multi-Kernel Workloads in OpenCL for FPGA

OpenCL for FPGA enables developers to design FPGAs using a programming model similar to that for processors. Recent works have shown that code optimization at the OpenCL level is important to achieve high computational efficiency. However, existing works either focus primarily on optimizing single kernels or solely depend on channels to design multi-kernel pipelines. In this paper, […]

* * *

HGPU group © 2010-2021 hgpu.org

All rights belong to the respective authors
