Posts
Sep, 27
Efficient and portable acceleration of quantum chemical many-body methods in mixed floating point precision using OpenACC compiler directives
It is demonstrated how the non-proprietary OpenACC standard of compiler directives may be used to compactly and efficiently accelerate the rate-determining steps of two of the most routinely applied many-body methods of electronic structure theory, namely the second-order M{o}ller-Plesset (MP2) model in its resolution-of-the-identity (RI) approximated form and the (T) triples correction to the coupled […]
Sep, 27
FastCollect: Offloading Generational Garbage Collection to Integrated GPUs
Generational Mark-Sweep Garbage Collection is a widely used garbage collection technique. However, the garbage collector has poor execution efficiency for large programs. Aggressive collection causes execution pauses in the program, while reducing the collection frequency leads to memory wastage. In this work, we develop FastCollect, a parallel version of the generational mark-sweep garbage collector running […]
Sep, 27
Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
We introduce a method to train Quantized Neural Networks (QNNs) — neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At train-time the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with […]
Sep, 27
Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered […]
Sep, 27
A Novel GPU-based Parallel Implementation Scheme and Performance Analysis of Robot Forward Dynamics Algorithms
We propose a novel unifying scheme for parallel implementation of articulated robot dynamics algorithms. It is based on a unified Lie group notation for deriving the equations of motion of articulated robots, where various well-known forward algorithms differ only by their joint inertia matrix inversion strategies. This new scheme leads to a unified abstraction of […]
Sep, 27
Solving Batched Linear Programs on GPU and Multicore CPU
Linear Programs (LPs) appear in a large number of applications and offloading them to the GPU is viable to gain performance. Existing work on offloading and solving an LP on GPU suggests that performance is gained from large sized LPs (typically 500 constraints, 500 variables and above). In order to gain performance from GPU for […]
Sep, 22
Bridging the Semantic Gaps of GPU Acceleration for Scaleout CNN-based Big Data Processing: Think Big, See Small
Convolutional Neural Networks (CNNs) have substantially advanced the state-of-the-art accuracies of object recognition, which is the core function of a myriad of modern multimedia processing techniques such as image/video processing, speech recognition, and natural language processing. GPU-based accelerators gained increasing attention because a large amount of highly parallel neurons in CNN naturally matches the GPU […]
Sep, 22
Tuning Stencil Codes in OpenCL for FPGAs
OpenCL is designed as a parallel programming framework to support heterogeneous computing platforms. The implicit or explicit parallelism in OpenCL kernel code enables efficient FPGA implementation from a high-level programming abstraction. However, FPGA architecture is completely different from GPU architecture, for which OpenCL is widely used. Tuning OpenCL codes to achieve high performance on FPGAs […]
Sep, 22
Characterization of Speech Recognition Systems on GPU Architectures
Automatic speech recognition is one of the most important applications in the area of cognitive computing. Mobile devices, such as smartphones, have incorporated speech recognition as one of the main interfaces for user interaction. This trend towards voice-based user interfaces is likely to continue in the next years. Effective speech recognition systems require real-time recognition, […]
Sep, 22
Efficient dictionary learning implementation on the GPU using OpenCL
The dictionary learning field offers a wide range of algorithms that are able to provide good sparse approximations and well trained dictionaries. These algorithms are very complex and this is reflected in the slow execution of their computationally intensive implementations. This article proposes efficient parallel implementations for the main algorithms in the field that significantly […]
Sep, 22
MCS 572: Introduction to Supercomputing
The goal of the course is to study parallel algorithms and their implementation on distributed and shared memory computers, using message passing, OpenMP, and threads. In the second half of the course we will consider general purpose graphics processing units. Prerequisites are a working knowledge of C (or willingness to acquire programming skills) and a […]
Sep, 20
Acceleration of Block-Aware Matrix Factorization on Heterogeneous Platforms
Block-structured matrices arise in several contexts in circuit simulation problems. These matrices typically inherit the pattern of sparsity from the circuit connectivity. However, they are also characterized by dense spots or blocks. Direct factorization of those matrices has emerged as an attractive approach if the host memory is sufficiently large to store the block-structured matrix. […]