Posts
Dec, 23
wav2letter++: The Fastest Open-source Speech Recognition System
This paper introduces wav2letter++, the fastest open-source deep learning speech recognition framework. wav2letter++ is written entirely in C++, and uses the ArrayFire tensor library for maximum efficiency. Here we explain the architecture and design of the wav2letter++ system and compare it to other major open-source speech recognition systems. In some cases wav2letter++ is more than […]
Dec, 23
Deep Learning by Doing: The NVIDIA Deep Learning Institute and University Ambassador Program
Over the past two decades, High-Performance Computing (HPC) communities have developed many models for delivering education aiming to help students understand and harness the power of parallel and distributed computing. Most of these courses either lack a hands-on component or heavily focus on theoretical characterization behind complex algorithms. To bridge the gap between application and […]
Dec, 23
On Runtime Systems for Task-based Programming on Heterogeneous Platforms
Simulation has become pervasive in science. Real experimentation remains an essential step in scientific research, but simulation has replaced a wide range of costly, lengthy, or even dangerous experiments. It requires massive computing power, however, and scientists will always welcome bigger and faster computing platforms so that they can keep simulating more and more accurately […]
Dec, 23
Targeting GPUs with OpenMP Directives on Summit: A Simple and Effective Fortran Experience
We use OpenMP directives to target hardware accelerators (GPUs) on Summit, a newly deployed supercomputer at the Oak Ridge Leadership Computing Facility (OLCF), demonstrating simplified access to GPU devices for users of our astrophysics code GenASiS and useful speedup on a sample fluid dynamics problem. At a lower level, we use the capabilities of Fortran […]
Dec, 23
cuPC: CUDA-based Parallel PC Algorithm for Causal Structure Learning on GPU
The main goal in many fields of the empirical sciences is to discover causal relationships among a set of variables from observational data. The PC algorithm is one of the promising solutions for learning the underlying causal structure by performing a number of conditional independence tests. In this paper, we propose a novel GPU-based parallel algorithm, called […]
Dec, 16
Software Platform for Hybrid Resource Management of Many-core Accelerators
The ever-increasing computational demand from the workload mix of concurrent applications characterizes modern embedded systems. In response to this trend, many-core accelerators are becoming more popular in high-end embedded systems. However, embedded systems usually have many constraints compared to general-purpose computers. Various constraints such as low computing power, lack of an operating system and restriction […]
Dec, 16
Performance Analysis of a Stereo Matching Implementation in OpenCL
Stereo matching is one of the first steps in the process of calculating 3D information from two 2D images. To triangulate a 3D point from two corresponding 2D features, the displacement in pixels, or the so-called disparity, must be estimated. From the estimated per-pixel disparity, using a projective camera model, 3D data for large portions […]
Dec, 16
Developing acquisition systems based on FPGA with OpenCL
Nuclear fusion is a phenomenon in which hydrogen nuclei collide and fuse, producing helium. The resulting nucleus is heavier than the hydrogen nuclei but lighter than the sum of the masses of the nuclei involved in the process. This phenomenon releases huge amounts of energy. The research group i2a2 develops the […]
Dec, 16
Delivering Performance-Portable Stencil Computations on CPUs and GPUs Using Bricks
Achieving high performance on stencil computations poses a number of challenges on modern architectures. The optimization strategy varies significantly across architectures, types of stencils, and types of applications. The standard approach to adapting stencil computations to different architectures, used by both compilers and application programmers, is through the use of iteration space tiling, whereby the […]
Dec, 16
SIMD-X: Programming and Processing of Graph Algorithms on GPUs
With high computation power and memory bandwidth, graphics processing units (GPUs) lend themselves to accelerating data-intensive analytics, especially when such applications fit the single instruction multiple data (SIMD) model. However, graph algorithms such as breadth-first search and k-core often fail to take full advantage of GPUs due to irregularity in memory access and control flow. […]
Dec, 12
3rd International Workshop on Theoretical Approaches to Performance Evaluation, Modeling and Simulation (TAPEMS), 2019
The objective of the 3rd TAPEMS International Workshop on Theoretical Approaches to Performance Evaluation, Modeling and Simulation is to bring together researchers and practitioners from academia and industry to discuss current advances and trends in both theoretical and experimental approaches to the performance evaluation, modeling and analysis of parallel applications and algorithms on multicore […]
Dec, 9
Clacc: Translating OpenACC to OpenMP in Clang
OpenACC was launched in 2010 as a portable programming model for heterogeneous accelerators. Although various implementations already exist, no extensible, open-source, production-quality compiler support is available to the community. This deficiency poses a serious risk for HPC application developers targeting GPUs and other accelerators, and it limits experimentation and progress for the OpenACC specification. To […]