Posts
Mar, 1
Telekine: Secure Computing with Cloud GPUs
GPUs have become ubiquitous in the cloud due to the dramatic performance gains they enable in domains such as machine learning and computer vision. However, offloading GPU computation to the cloud requires placing enormous trust in providers and administrators. Recent proposals for GPU trusted execution environments (TEEs) are promising but fail to address very real […]
Mar, 1
Evaluating the Energy Efficiency of OpenCL-accelerated AutoDock Molecular Docking
AUTODOCK is a molecular docking application that consists of a genetic algorithm coupled with the Solis-Wets local search method. Despite its wide usage, its power consumption on heterogeneous systems has not been evaluated extensively. In this work, we evaluate the energy efficiency of an OpenCL-accelerated version of AUTODOCK that, along with the traditional Solis-Wets method, newly […]
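For readers unfamiliar with the local-search half of that combination, the following is a minimal sketch of Solis-Wets adaptive random search for minimizing a scoring function; the function and parameter names are illustrative and it makes no attempt to mirror AUTODOCK's actual implementation.

```python
import numpy as np

def solis_wets(f, x, rho=1.0, iters=300, seed=0):
    """Minimal Solis-Wets adaptive random search (minimization sketch)."""
    rng = np.random.default_rng(seed)
    best = f(x)
    bias = np.zeros_like(x)
    successes = failures = 0
    for _ in range(iters):
        step = rng.normal(bias, rho, size=x.shape)
        if f(x + step) < best:                       # forward probe
            x, best = x + step, f(x + step)
            bias = 0.2 * bias + 0.4 * step
            successes, failures = successes + 1, 0
        elif f(x - step) < best:                     # reverse probe
            x, best = x - step, f(x - step)
            bias = bias - 0.4 * step
            successes, failures = successes + 1, 0
        else:                                        # both probes failed
            bias = 0.5 * bias
            successes, failures = 0, failures + 1
        if successes >= 5:                           # expand the step size on repeated success
            rho, successes = 2.0 * rho, 0
        if failures >= 3:                            # contract it on repeated failure
            rho, failures = 0.5 * rho, 0
    return x, best

# toy usage: minimize a simple quadratic
x_min, f_min = solis_wets(lambda v: float(np.sum(v ** 2)), np.ones(4))
```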
Mar, 1
A Systematic Survey of General Sparse Matrix-Matrix Multiplication
SpGEMM (General Sparse Matrix-Matrix Multiplication) has attracted much attention from researchers in the fields of multigrid methods and graph analysis. Many optimization techniques have been developed for specific application fields and computing architectures over the decades. The objective of this paper is to provide a structured and comprehensive overview of the research on SpGEMM. Existing optimization […]
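As a point of reference for what the surveyed kernels compute, here is a compact Gustavson-style (row-by-row) SpGEMM over CSR inputs in plain Python; it is a readable baseline, not one of the optimized GPU or multicore formulations the survey covers.

```python
# Gustavson-style row-by-row SpGEMM over CSR inputs (reference sketch only).
def spgemm_csr(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val, n_rows):
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(n_rows):
        acc = {}                                      # sparse accumulator for row i of C
        for jj in range(a_ptr[i], a_ptr[i + 1]):      # nonzeros A[i, k]
            k, a_ik = a_idx[jj], a_val[jj]
            for kk in range(b_ptr[k], b_ptr[k + 1]):  # nonzeros B[k, j]
                j = b_idx[kk]
                acc[j] = acc.get(j, 0.0) + a_ik * b_val[kk]
        for j in sorted(acc):                         # emit row i in column order
            c_idx.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val
```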
Feb, 23
Performance Counters based Power Modeling of Mobile GPUs using Deep Learning
GPUs have recently become important computational units on mobile devices, resulting in heterogeneous devices that can run a variety of parallel processing applications. While developing and optimizing such applications, estimating power consumption is of immense importance as energy efficiency has become the key design constraint to optimize for on these platforms. In this work, we […]
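A minimal sketch of the general approach — regressing measured power against per-kernel performance-counter vectors — is shown below using scikit-learn; the counter set, file names and network size are placeholders, not the paper's actual model.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Hypothetical counter matrix: one row per sampled kernel execution, one
# column per GPU performance counter (e.g. ALU utilization, memory reads).
X = np.load("gpu_counters.npy")      # shape (n_samples, n_counters) -- placeholder file
y = np.load("measured_power_w.npy")  # measured board power in watts -- placeholder file

scaler = StandardScaler().fit(X)
model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0)
model.fit(scaler.transform(X), y)

# Predicted power for a new counter sample
print(model.predict(scaler.transform(X[:1])))
```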
Feb, 23
Verified Instruction-Level Energy Consumption Measurement for NVIDIA GPUs
Graphics processor units (GPUs) are prevalent in modern computing systems at all scales. They consume a significant fraction of the energy in these systems. However, vendors do not publish the actual power/energy cost of their internal microarchitecture. In this paper, we accurately measure the energy consumption of various instructions found in modern […]
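At a much coarser granularity, board-level energy can be estimated by sampling GPU power through NVML and integrating over a measurement window, as in the sketch below; the paper's per-instruction methodology is considerably more involved, so this only illustrates the power-sampling side.

```python
import time
import pynvml

# Coarse board-level energy estimate: sample power while a workload runs,
# then multiply average power by elapsed time.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples, t0 = [], time.time()
while time.time() - t0 < 5.0:                      # sample for ~5 s while the kernel runs
    samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW -> W
    time.sleep(0.01)

elapsed = time.time() - t0
energy_j = sum(samples) / len(samples) * elapsed   # average power x time
print(f"~{energy_j:.1f} J over {elapsed:.2f} s")
pynvml.nvmlShutdown()
```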
Feb, 23
Let’s sort this out: GPGPU Verification of Radix Sort
This paper shows how the VerCors verification toolset can be used to prove data race freedom and functional correctness of a parallel radix sort algorithm for GPUs. This is a widely used standard sorting implementation for GPGPU programming frameworks and therefore its correctness is of utmost importance. Additionally, it presents the usefulness of VerCors as […]
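For orientation, a sequential least-significant-digit radix sort looks like the following; the GPU version verified with VerCors parallelizes the histogram, scan and scatter phases, which this sketch deliberately omits.

```python
# Sequential LSD radix sort on non-negative integers, one byte per pass.
def radix_sort(keys, key_bytes=4):
    for shift in range(0, 8 * key_bytes, 8):
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[(k >> shift) & 0xFF].append(k)   # stable scatter by current byte
        keys = [k for b in buckets for k in b]       # gather in bucket order
    return keys

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
```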
Feb, 23
From English To Foreign Languages: Transferring Pre-trained Language Models
Pre-trained models have demonstrated their effectiveness in many downstream natural language processing (NLP) tasks. The availability of multilingual pre-trained models enables zero-shot transfer of NLP tasks from high-resource languages to low-resource ones. However, recent research in improving pre-trained models focuses heavily on English. While it is possible to train the latest neural architectures […]
Feb, 23
High-Performance High-Order Stencil Computation on FPGAs Using OpenCL
In this paper we evaluate the performance of FPGAs for high-order stencil computation using High-Level Synthesis. We show that despite the higher computation intensity and on-chip memory requirement of such stencils compared to first-order ones, our design technique with combined spatial and temporal blocking remains effective. This allows us to reach similar, or even higher, […]
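A high-order stencil in its simplest form is just a wider-neighborhood update, e.g. the fourth-order 1D sweep below in NumPy; the paper's contribution lies in the combined spatial and temporal blocking applied to such sweeps on FPGAs, which this sketch does not reproduce.

```python
import numpy as np

# Fourth-order central-difference stencil sweep in 1D (5-point neighborhood).
def stencil_step(u, c=0.1):
    v = u.copy()
    v[2:-2] = u[2:-2] + c * (-u[:-4] + 16*u[1:-3] - 30*u[2:-2] + 16*u[3:-1] - u[4:]) / 12.0
    return v

u = np.sin(np.linspace(0, np.pi, 1024))
for _ in range(100):      # 100 time steps; temporal blocking would fuse several of these
    u = stencil_step(u)
```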
Feb, 16
EASYPAP: a Framework for Learning Parallel Programming
This paper presents EASYPAP, an easy-to-use programming environment designed to help students learn parallel programming. EASYPAP features a wide range of 2D computation kernels that the students are invited to parallelize using Pthreads, OpenMP, OpenCL or MPI. Execution of kernels can be interactively visualized, and powerful monitoring tools allow students to observe both the […]
Feb, 16
The Deep Learning Compiler: A Comprehensive Survey
The difficulty of deploying various deep learning (DL) models on diverse DL hardware has boosted the research and development of DL compilers in the community. Several DL compilers have been proposed by both industry and academia, such as TensorFlow XLA and TVM. Generally, the DL compilers take the DL models described in different DL frameworks […]
Feb, 16
ISM2: Optimizing Irregular-Shaped Matrix-Matrix Multiplication on GPUs
Linear algebra operations have been widely used in big data analytics and scientific computations. Much work has been done on optimizing linear algebra operations on GPUs with regular-shaped input. However, few works focus on fully utilizing GPU resources when the input is not regular-shaped. Current optimizations fall short in fully utilizing the memory bandwidth […]
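To make the shape problem concrete, the blocked multiply below uses a non-square tile chosen to match a tall-skinny operand; it is a CPU-side NumPy illustration of why fixed square tiles waste resources on irregular shapes, not the paper's GPU kernel.

```python
import numpy as np

# Blocked matrix multiply with a tile shape adapted to a tall-skinny input.
def blocked_matmul(A, B, tile_m=128, tile_n=8, tile_k=64):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, tile_m):
        for j in range(0, N, tile_n):
            for k in range(0, K, tile_k):
                C[i:i+tile_m, j:j+tile_n] += A[i:i+tile_m, k:k+tile_k] @ B[k:k+tile_k, j:j+tile_n]
    return C

A = np.random.rand(4096, 64)   # tall-skinny A
B = np.random.rand(64, 16)     # small N makes square tiles wasteful
np.testing.assert_allclose(blocked_matmul(A, B), A @ B, rtol=1e-6)
```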
Feb, 16
LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment
Pairwise sequence alignment is one of the most computationally intensive kernels in genomic data analysis, accounting for more than 90% of the runtime for key bioinformatics applications. This method is particularly expensive for third-generation sequences due to the high computational cost of analyzing sequences of length between 1Kb and 1Mb. Given the quadratic overhead of […]
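The X-drop idea itself is simple: extend an alignment outward from a seed and stop once the running score falls more than X below the best score seen so far. The ungapped, sequential sketch below conveys that rule; LOGAN's contribution is the gapped, banded, GPU-parallel version.

```python
# Ungapped X-drop extension to the right of a seed (illustrative only).
def xdrop_extend(query, target, qpos, tpos, x=20, match=1, mismatch=-1):
    score = best = best_len = 0
    i = 0
    while qpos + i < len(query) and tpos + i < len(target):
        score += match if query[qpos + i] == target[tpos + i] else mismatch
        if score > best:
            best, best_len = score, i + 1
        if best - score > x:          # score dropped more than X below the best: stop
            break
        i += 1
    return best, best_len

print(xdrop_extend("ACGTACGTTT", "ACGTACGAAA", 0, 0, x=3))
```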