Posts
Nov, 13
Executing Dynamic Data Rate Actor Networks on OpenCL Platforms
Heterogeneous computing platforms consisting of general purpose processors (GPPs) and graphics processing units (GPUs) have become commonplace in personal mobile devices and embedded systems. For years, programming these platforms was tedious, and simultaneous use of all available GPP and GPU resources required low-level programming to ensure efficient synchronization and data transfer between processors. […]
Nov, 13
Fractal Art Generation using GPUs
Fractal image generation algorithms exhibit extreme parallelizability. By using general-purpose graphics processing unit (GPU) programming to implement escape-time algorithms for Julia sets of functions, parallel methods generate visually attractive fractal images much faster than traditional methods. Vastly improved speeds are achieved with this method of computation, allowing real-time generation and display of images. A comparison […]
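The excerpt does not show the paper's implementation, so the following is only a minimal escape-time sketch for the quadratic Julia map z -> z^2 + c, written as a CUDA kernel with one thread per pixel. The kernel name, image size, iteration limit and choice of c are illustrative assumptions, not the authors' code.

#include <cuda_runtime.h>

// Escape-time iteration for the quadratic Julia map z -> z^2 + c.
// Each thread handles one pixel; the iteration count is written out
// and can later be mapped to a colour palette on the host.
__global__ void juliaEscapeTime(int *iters, int width, int height,
                                float cRe, float cIm, int maxIter)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Map the pixel to the complex plane, roughly [-2, 2] x [-2, 2].
    float zRe = 4.0f * x / width  - 2.0f;
    float zIm = 4.0f * y / height - 2.0f;

    int n = 0;
    while (n < maxIter && zRe * zRe + zIm * zIm < 4.0f) {
        float tmp = zRe * zRe - zIm * zIm + cRe;
        zIm = 2.0f * zRe * zIm + cIm;
        zRe = tmp;
        ++n;
    }
    iters[y * width + x] = n;
}

int main()
{
    const int W = 1024, H = 1024, maxIter = 256;
    int *dIters;
    cudaMalloc((void **)&dIters, W * H * sizeof(int));

    dim3 block(16, 16);
    dim3 grid((W + block.x - 1) / block.x, (H + block.y - 1) / block.y);
    // c = -0.8 + 0.156i is a commonly used, visually rich choice of constant.
    juliaEscapeTime<<<grid, block>>>(dIters, W, H, -0.8f, 0.156f, maxIter);
    cudaDeviceSynchronize();

    cudaFree(dIters);
    return 0;
}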
Nov, 10
Shuffle Reduction Based Sparse Matrix-Vector Multiplication on Kepler GPU
The GPU is well suited to accelerating compute-intensive applications in order to achieve higher throughput in High Performance Computing (HPC). Sparse Matrix-Vector Multiplication (SpMV) is a core algorithm of HPC, so the throughput of SpMV on the GPU may affect the throughput of the whole HPC platform. In this paper, we focus on the latency of the reduction routine […]
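The abstract refers to reducing the latency of the reduction routine; a common way to do this on Kepler-class GPUs is to replace shared-memory reductions with register shuffles. Below is a minimal CSR "vector" SpMV sketch, one warp per row, combining partial sums with __shfl_down_sync (the current form of Kepler's __shfl_down). This is a generic illustration of the shuffle-reduction idea under those assumptions, not the authors' kernel.

#include <cuda_runtime.h>

// CSR "vector" SpMV: one warp per row, partial sums combined with warp
// shuffles instead of shared memory.
__global__ void spmvCsrVector(int nRows, const int *rowPtr, const int *colIdx,
                              const float *vals, const float *x, float *y)
{
    const int warpLanes = 32;
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / warpLanes;
    int lane   = threadIdx.x & (warpLanes - 1);
    if (warpId >= nRows) return;  // uniform per warp, so the full mask below is safe

    // Each lane accumulates a strided slice of the row's nonzeros.
    float sum = 0.0f;
    for (int j = rowPtr[warpId] + lane; j < rowPtr[warpId + 1]; j += warpLanes)
        sum += vals[j] * x[colIdx[j]];

    // Warp-level tree reduction via register shuffles; no shared memory needed.
    for (int offset = warpLanes / 2; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0)
        y[warpId] = sum;
}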
Nov, 10
PipeCNN: An OpenCL-Based FPGA Accelerator for Large-Scale Convolution Neuron Networks
Convolutional neural networks (CNNs) have been widely employed in many applications such as image classification, video analysis and speech recognition. Because CNN computations are compute-intensive, they are mainly accelerated by GPUs, which have high power dissipation. Recently, studies have explored FPGAs as CNN accelerators because of their reconfigurability and energy-efficiency advantage over GPUs, especially when […]
Nov, 10
Using multiple GPUs to accelerate string searching for digital forensic analysis
String searching within a large corpus of data is an important component of digital forensic (DF) analysis techniques such as file carving. The continuing increase in capacity of consumer storage devices requires corresponding improvements to the performance of string searching techniques. As string searching is a trivially-parallelisable problem, GPGPU approaches are a natural fit – […]
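As a hedged illustration of why string searching maps naturally to GPGPU, the sketch below assigns one CUDA thread per candidate offset and matches a short signature held in constant memory. The pattern-length limit, kernel name and atomic match counter are assumptions made for the example; the paper's multi-GPU pipeline is not reproduced here.

#include <cuda_runtime.h>

// Hypothetical maximum signature length; a real file-carving search would
// stream the corpus in chunks and split it across multiple GPUs.
#define MAX_PATTERN 64
__constant__ char dPattern[MAX_PATTERN];   // filled via cudaMemcpyToSymbol on the host

// Each thread checks whether the pattern occurs at one offset of the corpus
// and atomically counts the matches it finds.
__global__ void naiveSearch(const char *corpus, long corpusLen,
                            int patternLen, unsigned int *matchCount)
{
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i + patternLen > corpusLen) return;

    for (int k = 0; k < patternLen; ++k)
        if (corpus[i + k] != dPattern[k]) return;

    atomicAdd(matchCount, 1u);
}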
Nov, 10
Optimization and parallelization of B-spline based orbital evaluations in QMC on multi/many-core shared memory processors
B-spline based orbital representations are widely used in Quantum Monte Carlo (QMC) simulations of solids, historically taking as much as 50% of the total run time. Random accesses to a large four-dimensional array make it challenging to efficiently utilize caches and wide vector units of modern CPUs. We present node-level optimizations of B-spline evaluations on […]
Nov, 10
Memory layout in GPU implementation of lattice Boltzmann method for sparse 3D geometries
We describe a high-performance implementation of the lattice Boltzmann method (LBM) for sparse 3D geometries on graphics processors (GPUs). The main contribution of this work is a data layout that allows us to minimise the number of redundant memory transactions during the propagation step of LBM. We show that by using a uniform mesh of small […]
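The paper's tile-based sparse layout is not reproduced in this excerpt; the sketch below only illustrates the structure-of-arrays indexing that such layouts aim to preserve, so that consecutive threads (consecutive fluid-node indices) issue coalesced memory transactions. The D3Q19 choice, names and placeholder update are illustrative assumptions.

#include <cuda_runtime.h>

// D3Q19 lattice: 19 distribution functions per fluid node.
#define Q 19

// Structure-of-arrays indexing: all nodes' values for one lattice direction
// are stored contiguously, so consecutive threads issue coalesced loads and
// stores. nNodes counts only the fluid nodes of the sparse geometry,
// not the full bounding box.
__device__ __forceinline__ long soaIndex(int dir, long node, long nNodes)
{
    return (long)dir * nNodes + node;
}

// Illustrative sweep over the sparse node list.
__global__ void touchDistributions(float *f, long nNodes)
{
    long node = (long)blockIdx.x * blockDim.x + threadIdx.x;
    if (node >= nNodes) return;

    for (int dir = 0; dir < Q; ++dir) {
        long idx = soaIndex(dir, node, nNodes);
        f[idx] *= 1.0f;  // placeholder for the actual collision/propagation update
    }
}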
Nov, 8
Balancing locality and concurrency: solving sparse triangular systems on GPUs
Many numerical optimisation problems rely on fast algorithms for solving sparse triangular systems of linear equations (STLs). To accelerate the solution of such equations, two types of approaches have been used: on GPUs, concurrency has been prioritised to the disadvantage of data locality, while on multi-core CPUs, data locality has been prioritised to the disadvantage […]
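One widely used way to expose concurrency in a sparse triangular solve on GPUs is level-set scheduling: rows are grouped into levels such that every row in a level depends only on rows from earlier levels, and each level is solved by one kernel launch. The sketch below illustrates that generic approach, not necessarily the locality/concurrency balance proposed in the paper; the names and the diagonal-last CSR convention are assumptions.

#include <cuda_runtime.h>

// Level-scheduled forward substitution for a lower-triangular CSR matrix.
// levelRows lists the row indices of one level; all of their off-diagonal
// entries reference rows from earlier levels, whose x values are final.
__global__ void solveLevel(const int *rowPtr, const int *colIdx,
                           const float *vals, const float *b, float *x,
                           const int *levelRows, int levelSize)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= levelSize) return;

    int row = levelRows[t];
    float sum = b[row];

    int j = rowPtr[row];
    for (; j < rowPtr[row + 1] - 1; ++j)
        sum -= vals[j] * x[colIdx[j]];

    // Last entry of the row is assumed to hold the diagonal.
    x[row] = sum / vals[j];
}

// Host side (sketch): one launch per level, in dependency order.
// for (int l = 0; l < nLevels; ++l) {
//     int size = levelPtr[l + 1] - levelPtr[l];
//     solveLevel<<<(size + 255) / 256, 256>>>(rowPtr, colIdx, vals, b, x,
//                                             levelRows + levelPtr[l], size);
// }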
Nov, 8
Tamp: A Library for Compact Deep Neural Networks with Structured Matrices
We introduce Tamp, an open source C++ library for reducing the space and time costs of deep neural network models. In particular, Tamp implements several recent works that use structured matrices to replace the unstructured matrices that are often bottlenecks in neural networks. Tamp is also designed to serve as a unified development platform with several […]
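As a rough illustration of how a structured matrix reduces cost, the sketch below uses a circulant matrix, which is defined by a single length-n vector, so an n x n layer stores n parameters instead of n^2. The direct O(n^2) kernel is for clarity only (an FFT-based product would be used in practice) and is not part of Tamp's API.

#include <cuda_runtime.h>

// A circulant matrix is fully determined by its first column c:
// element (i, j) equals c[(i - j) mod n]. One thread computes one
// output element of y = C * x.
__global__ void circulantMatVec(const float *c, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float sum = 0.0f;
    for (int j = 0; j < n; ++j) {
        int k = i - j;
        if (k < 0) k += n;          // (i - j) mod n
        sum += c[k] * x[j];
    }
    y[i] = sum;
}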
Nov, 8
Performance Portability of the Aeras Atmosphere Model to Next Generation Architectures using Kokkos
The subject of this report is the performance portability of the Aeras global atmosphere dynamical core (implemented within the Albany multi-physics code) to new and emerging architecture machines using the Kokkos library and programming model. We describe the process of refactoring the finite element assembly process for the 3D hydrostatic model in Aeras and highlight […]
Nov, 8
Accelerate Deep Learning Inference with MCTS in the game of Go on the Intel Xeon Phi
The performance of deep learning inference is a serious issue when it is combined with speed-sensitive Monte Carlo Tree Search (MCTS). Traditional hybrid CPU and graphics processing unit solutions are limited by frequent, heavy data transfers. This paper proposes a method that performs Deep Convolutional Neural Network prediction and MCTS execution simultaneously on the Intel Xeon Phi. This […]
Nov, 8
Vispark: GPU-Accelerated Distributed Visual Computing Using Spark
With the growing need for big-data processing in diverse application domains, MapReduce (e.g., Hadoop) has become one of the standard computing paradigms for large-scale computing on a cluster system. Despite its popularity, the current MapReduce framework suffers from inflexibility and inefficiency inherent in its programming model and system architecture. In order to address these problems, […]