Posts
Sep, 22
Performance and Power Evaluation of AI Accelerators for Training Deep Learning Models
Deep neural networks (DNNs) have become widely used in many AI applications. Yet training a DNN requires an enormous amount of computation, and producing a satisfactory model consumes considerable time and energy. Nowadays, many-core AI accelerators (e.g., GPUs and TPUs) play a key role in training DNNs. However, different many-core processors from […]
Sep, 22
Model-Based Warp-Level Tiling for Image Processing Programs on GPUs
The efficient execution of image processing pipelines on GPUs is an area of active research. The state of the art involves 1) dividing portions of an image into overlapped tiles, where each tile can be processed by a single thread block and 2) fusing loops together to improve memory locality. However, the state-of-the-art has two limitations: 1) synchronization […]
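The overlapped-tiling idea from this abstract can be illustrated on the host side: each tile is extracted together with a halo of neighbouring pixels so that a stencil can be applied to it without touching other tiles. This is a minimal NumPy sketch of the decomposition only (the function name, tile size, and halo width are illustrative, not from the paper):

```python
import numpy as np

def overlapped_tiles(image, tile, halo):
    """Split a 2D image into tile x tile blocks, each padded with a
    `halo` of neighbouring pixels so a stencil of radius `halo` can be
    applied to every tile independently (e.g. one tile per thread block).
    """
    h, w = image.shape
    # Replicate border pixels so edge tiles also get a full halo.
    padded = np.pad(image, halo, mode="edge")
    tiles = []
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            # Each slice covers the tile plus `halo` pixels on every side.
            tiles.append(padded[y:y + tile + 2 * halo,
                                x:x + tile + 2 * halo])
    return tiles
```

The overlap means neighbouring tiles recompute some pixels, trading redundant work for the ability to process tiles without inter-block synchronization.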
Sep, 22
ALPyNA: Acceleration of Loops in Python for Novel Architectures
We present ALPyNA, an automatic loop parallelization framework for Python, which analyzes data dependences within nested loops and dynamically generates CUDA kernels for GPU execution. The ALPyNA system applies classical dependence analysis techniques to discover and exploit potential parallelism. The skeletal structure of the dependence graph is determined statically (if possible) or at runtime; this […]
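The dependence analysis the abstract describes has to distinguish loops whose iterations are independent (and can become GPU kernels) from loops that carry a dependence between iterations. A small illustrative sketch of the two cases, in plain Python (these functions are examples of loop shapes, not ALPyNA's API):

```python
import numpy as np

def saxpy_serial(a, x, y):
    """No cross-iteration dependence: out[i] reads only x[i] and y[i],
    so iterations can run in any order, e.g. one GPU thread each."""
    out = np.empty_like(y)
    for i in range(len(y)):
        out[i] = a * x[i] + y[i]
    return out

def prefix_serial(x):
    """Loop-carried dependence: out[i] reads out[i-1], so iterations
    cannot be naively mapped to independent threads."""
    out = np.empty_like(x)
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = out[i - 1] + x[i]
    return out
```

A framework like ALPyNA can parallelize the first loop directly, while the second needs either serial execution or a restructured parallel algorithm (e.g. a parallel scan).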
Sep, 22
Espresso: A Fast End-to-end Neural Speech Recognition Toolkit
We present Espresso, an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit fairseq. Espresso supports distributed training across GPUs and computing nodes, and features various decoding approaches commonly employed in ASR, including look-ahead word-based language model fusion, for which […]
Sep, 15
Code optimization based on source to source transformations using profile guided metrics
Modern high-performance processor architectures rely heavily on longer vector units and advanced memory hierarchies to deliver performance, and manual optimization has become a difficult task. Developers usually trust compilers to address these performance issues automatically, but compilers rely on static performance models and heuristics that force them to remain conservative. On […]
Sep, 15
Parallelizing Multiple Flow Accumulation Algorithm using CUDA and OpenACC
Watershed analysis, as a fundamental component of digital terrain analysis, is based on the Digital Elevation Model (DEM), which is a grid (raster) model of the Earth's surface and topography. Watershed analysis consists of computationally and data-intensive algorithms that need to be implemented by leveraging parallel and high-performance computing methods and techniques. In […]
Sep, 15
Characterizing and Predicting Scientific Workloads for Heterogeneous Computing Systems
The next-generation of supercomputers will feature a diverse mix of accelerator devices. The increase in heterogeneity is explained by the nature of supercomputing workloads – certain devices offer acceleration, or a shorter time to completion, for particular application programs. Certain characteristics of these programs are fixed and impose fundamental limitations on the workloads regardless of […]
Sep, 15
PySPH: a Python-based framework for smoothed particle hydrodynamics
PySPH is a Python-based framework for particle methods in general and Smoothed Particle Hydrodynamics (SPH) in particular. PySPH allows a user to define a complete SPH simulation using pure Python. High-performance code is generated from this high-level Python code and executed on either multiple cores, or on GPUs, seamlessly. It also supports distributed execution using […]
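The core of any SPH method, including the one PySPH implements, is the kernel summation: a field at particle i is estimated as a weighted sum over neighbouring particles, e.g. density rho_i = sum_j m_j W(|r_i - r_j|, h). A minimal 1D NumPy sketch of this idea using a Gaussian kernel (an illustrative kernel choice, not PySPH's API):

```python
import numpy as np

def sph_density(positions, masses, h):
    """Estimate density at each particle by SPH summation,
    rho_i = sum_j m_j * W(|x_i - x_j|, h),
    with a normalized 1D Gaussian smoothing kernel of width h."""
    x = np.asarray(positions, dtype=float)
    m = np.asarray(masses, dtype=float)
    r = np.abs(x[:, None] - x[None, :])              # pairwise distances
    w = np.exp(-(r / h) ** 2) / (h * np.sqrt(np.pi))  # kernel weights
    return w @ m
```

Frameworks like PySPH generate tuned multi-core or GPU code for exactly this kind of neighbour summation from a high-level description.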
Sep, 15
Efficient Interleaved Batch Matrix Solvers for CUDA
In this paper we present a new methodology for data accesses when solving batches of tridiagonal and pentadiagonal matrices that all share the same LHS matrix. Storing only one copy of this matrix significantly reduces storage overheads, and the authors show that there is also a performance increase in terms […]
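The shared-LHS setting in this abstract can be sketched with the Thomas algorithm: the forward-elimination multipliers depend only on the matrix, so one stored copy of the tridiagonal LHS serves every right-hand side in the batch. A minimal CPU reference in NumPy (function and parameter names are illustrative, and this ignores the paper's interleaved GPU data layout):

```python
import numpy as np

def thomas_batch(a, b, c, d_batch):
    """Solve A x = d for a batch of RHS vectors sharing one tridiagonal A.

    a: sub-diagonal (n-1,), b: main diagonal (n,), c: super-diagonal (n-1,)
    d_batch: (batch, n) right-hand sides. A is stored once, not per system.
    """
    n = b.size
    d = np.array(d_batch, dtype=float)
    bp = b.astype(float).copy()
    # Forward elimination: multipliers m depend only on A, so the same
    # elimination is applied to every RHS in the batch at once.
    for i in range(1, n):
        m = a[i - 1] / bp[i - 1]
        bp[i] = b[i] - m * c[i - 1]
        d[:, i] -= m * d[:, i - 1]
    # Back substitution, again vectorized over the batch dimension.
    x = np.empty_like(d)
    x[:, -1] = d[:, -1] / bp[-1]
    for i in range(n - 2, -1, -1):
        x[:, i] = (d[:, i] - c[i] * x[:, i + 1]) / bp[i]
    return x
```

On a GPU the same observation lets one matrix copy live in shared or constant memory while each thread (or thread group) handles its own interleaved RHS.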
Sep, 8
ArborX: A Performance Portable Search Library
Searching for geometric objects that are close in space is a fundamental component of many applications. The performance of search algorithms comes to the forefront as the size of a problem increases both in terms of total object count as well as in the total number of search queries performed. Scientific applications requiring modern leadership-class […]
Sep, 8
Fast Code Exploration for Pipeline Processing in FPGA Accelerators
The increasing demand for energy-efficient computing has driven the adoption of Field-Programmable Gate Arrays to create hardware accelerators for large and complex codes. However, implementing such accelerators involves two complex decisions. The first lies in deciding which code snippet is the best candidate for an accelerator, and the second lies in how […]
Sep, 8
FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow
Most sequence-to-sequence (seq2seq) models are autoregressive; they generate each token by conditioning on previously generated tokens. In contrast, non-autoregressive seq2seq models generate all tokens in one pass, which leads to increased efficiency through parallel processing on hardware such as GPUs. However, directly modeling the joint distribution of all tokens simultaneously is challenging, and even with […]