## Posts

Sep, 22

### Characterization of Speech Recognition Systems on GPU Architectures

Automatic speech recognition is one of the most important applications in the area of cognitive computing. Mobile devices, such as smartphones, have incorporated speech recognition as one of the main interfaces for user interaction. This trend towards voice-based user interfaces is likely to continue in the coming years. Effective speech recognition systems require real-time recognition, […]

Sep, 22

### Efficient dictionary learning implementation on the GPU using OpenCL

The dictionary learning field offers a wide range of algorithms that provide good sparse approximations and well-trained dictionaries. These algorithms are very complex, which is reflected in the slow execution of their computationally intensive implementations. This article proposes efficient parallel implementations for the main algorithms in the field that significantly […]

Sep, 22

### MCS 572: Introduction to Supercomputing

The goal of the course is to study parallel algorithms and their implementation on distributed and shared memory computers, using message passing, OpenMP, and threads. In the second half of the course we will consider general purpose graphics processing units. Prerequisites are a working knowledge of C (or willingness to acquire programming skills) and a […]

Sep, 20

### Acceleration of Block-Aware Matrix Factorization on Heterogeneous Platforms

Block-structured matrices arise in several contexts in circuit simulation problems. These matrices typically inherit the pattern of sparsity from the circuit connectivity. However, they are also characterized by dense spots or blocks. Direct factorization of those matrices has emerged as an attractive approach if the host memory is sufficiently large to store the block-structured matrix. […]

Sep, 20

### Parallel Computational Fluid Dynamics With the Intel Xeon Phi Coprocessor

The Intel Xeon Phi coprocessor is a PCI Express form factor card designed to work in tandem with Intel Xeon processors in order to allow faster execution of highly parallelizable code. Efficient execution of highly parallel applications is achieved through the use of many smaller, lower clock speed cores, allowing for many more simultaneous execution […]

Sep, 20

### A Compiler for Throughput Optimization of Graph Algorithms on GPUs

Writing high-performance GPU implementations of graph algorithms can be challenging. In this paper, we argue that three optimizations, called throughput optimizations, are key to high performance for this application class. These optimizations describe a large implementation space, making it unrealistic for programmers to implement them by hand. To address this problem, we have implemented these optimizations […]

Sep, 20

### Runtime Support for Adaptive Power Capping on Heterogeneous SoCs

Power capping is a fundamental method for reducing the energy consumption of a wide range of modern computing environments, ranging from mobile embedded systems to datacentres. Unfortunately, maximising performance and system efficiency under static power caps remains challenging, while maximising performance under dynamic power caps has been largely unexplored. We present an adaptive power capping […]

Sep, 20

### Feynman Machine: The Universal Dynamical Systems Computer

Efforts at understanding the computational processes in the brain have met with limited success, despite their importance and potential uses in building intelligent machines. We propose a simple new model which draws on recent findings in Neuroscience and the Applied Mathematics of interacting Dynamical Systems. The Feynman Machine is a Universal Computer for Dynamical Systems, […]

Sep, 17

### Devito: automated fast finite difference computation

Domain specific languages have successfully been used in a variety of fields to cleanly express scientific problems as well as to simplify implementation and performance optimization on different computer architectures. Although a large number of stencil languages are available, finite difference domain specific languages have proved challenging to design because most practical use cases require […]

Sep, 17

### The CUDA LATCH Binary Descriptor: Because Sometimes Faster Means Better

Accuracy, descriptor size, and the time required for extraction and matching are all important factors when selecting local image descriptors. To optimize over all these requirements, this paper presents a CUDA port for the recent Learned Arrangement of Three Patches (LATCH) binary descriptors to the GPU platform. The design of LATCH makes it well suited […]

Sep, 17

### Parallel Dynamics Computation using Prefix Sum Operations

We propose a new parallel framework for fast computation of inverse and forward dynamics of articulated robots based on prefix sums (scans). We re-investigate the well-known recursive Newton-Euler formulation of robot dynamics and show that the forward-backward propagation process for robot inverse dynamics is equivalent to two scan operations on certain semigroups. We show that […]
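
As a rough illustration of why a propagation recurrence can be phrased as a scan, the sketch below evaluates a generic linear recurrence x_i = a_i * x_{i-1} + b_i by scanning over affine maps, which form a semigroup under composition. This is not the paper's formulation of Newton-Euler dynamics; the scalar recurrence and the names `compose`, `recurrence_by_scan`, and `recurrence_by_loop` are assumptions chosen for illustration only.

```python
from itertools import accumulate


def compose(f, g):
    """Compose two affine maps stored as (a, b), meaning x -> a*x + b.

    The result applies f first, then g. Composition is associative,
    which is the property a parallel scan relies on.
    """
    a1, b1 = f
    a2, b2 = g
    return (a2 * a1, a2 * b1 + b2)


def recurrence_by_scan(coeffs, x0):
    """Evaluate x_i = a_i * x_{i-1} + b_i for every i via an inclusive scan.

    itertools.accumulate runs sequentially here; because compose is
    associative, the same prefixes could be produced by a parallel scan.
    """
    prefixes = accumulate(coeffs, compose)
    return [a * x0 + b for a, b in prefixes]


def recurrence_by_loop(coeffs, x0):
    """Reference implementation: plain sequential propagation."""
    out, x = [], x0
    for a, b in coeffs:
        x = a * x + b
        out.append(x)
    return out
```

Both functions return the same sequence for the same inputs; the scan version simply exposes the associative structure that a parallel implementation would exploit.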

Sep, 17

### A parallel pattern for iterative stencil + reduce

We advocate the Loop-of-stencil-reduce pattern as a means of simplifying the implementation of data-parallel programs on heterogeneous multi-core platforms. Loop-of-stencil-reduce is general enough to subsume map, reduce, map-reduce, stencil, stencil-reduce, and, crucially, their usage in a loop in both data-parallel and streaming applications, or a combination of both. The pattern makes it possible to deploy […]
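
The shape of the pattern can be sketched in a few lines: apply a stencil to every element, reduce the result to a single value, and loop until that value satisfies a condition. The sketch below is a minimal sequential illustration, assuming a 1D three-point averaging stencil and a max-difference convergence test; the function names and parameters are hypothetical and are not taken from the paper or from any framework implementing the pattern.

```python
from functools import reduce


def stencil(grid):
    """Three-point average on interior cells; boundary cells stay fixed."""
    n = len(grid)
    return [
        grid[i] if i in (0, n - 1)
        else (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
        for i in range(n)
    ]


def loop_of_stencil_reduce(grid, tol=1e-6, max_iters=10_000):
    """Repeat stencil then reduce until the reduced value drops below tol.

    Returns the final grid and the number of iterations performed.
    """
    iterations = 0
    for iterations in range(1, max_iters + 1):
        new_grid = stencil(grid)  # stencil step (data-parallel)
        # reduce step: largest absolute change across all cells
        delta = reduce(max, (abs(a - b) for a, b in zip(new_grid, grid)), 0.0)
        grid = new_grid
        if delta < tol:  # loop condition evaluated on the reduced value
            break
    return grid, iterations
```

For example, `loop_of_stencil_reduce([0.0, 0.0, 0.0, 0.0, 1.0])` relaxes the interior cells towards a linear ramp between the two fixed boundary values. Both the stencil and the reduction are independent per element, which is what makes the pattern a candidate for data-parallel execution.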