## Posts

Apr, 11

### Vectorization of Hybrid Breadth First Search on the Intel Xeon Phi

The Breadth-First Search (BFS) algorithm is an important building block for graph analysis of large datasets. The BFS parallelisation has been shown to be challenging because of its inherent characteristics, including irregular memory access patterns, data dependencies and workload imbalance, that limit its scalability. We investigate the optimisation and vectorisation of the hybrid BFS (a […]

Apr, 7

### Design Exploration of AES Accelerators on FPGAs and GPUs

The embedded systems are increasingly becoming a key technological component of all kinds of complex technical systems and an exhaustive analysis of the state of the art of all current performance with respect to architectures, design methodologies, test and applications could be very interesting. The Advanced Encryption Standard (AES), based on the well-known algorithm Rijndael, […]

Apr, 7

### In-Datacenter Performance Analysis of a Tensor Processing Unit

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC-called a Tensor Processing Unit (TPU) – deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers […]

Apr, 7

### Cascaded Segmentation-Detection Networks for Word-Level Text Spotting

We introduce an algorithm for word-level text spotting that is able to accurately and reliably determine the bounding regions of individual words of text "in the wild". Our system is formed by the cascade of two convolutional neural networks. The first network is fully convolutional and is in charge of detecting areas containing text. This […]

Apr, 7

### Load Balancing for Constraint Solving with GPUs

Solving a complex Constraint Satisfaction Problem (CSP) is a computationally hard task which may require a considerable amount of time. Parallelism has been applied successfully to the job and there are already many applications capable of harnessing the parallel power of modern CPUs to speed up the solving process. Current Graphics Processing Units (GPUs), containing […]

Apr, 7

### Optimizing Communication by Compression for Multi-GPU Scalable Breadth-First Searches

The Breadth First Search (BFS) algorithm is the foundation and building block of many higher graph-based operations such as spanning trees, shortest paths and betweenness centrality. The importance of this algorithm increases each day due to it is a key requirement for many data structures which are becoming popular nowadays. When the BFS algorithm is […]

Apr, 5

### The Second International Workshop on GPU Computing and Applications (GCA), 2017

Built for massive parallelism, General Purpose computing on Graphic Processing Unit (GPGPU) has superseded high-performance CPU in several important tasks, including computer graphics, physics calculations, encryption/decryption and scientific computations. The goal of this workshop is to provide a forum to discuss and evaluate emerging techniques, platforms and applications capable of harvesting the power of current […]

Apr, 3

### Merge or Separate? Multi-job Scheduling for OpenCL Kernels on CPU/GPU Platforms

Computer systems are increasingly heterogeneous with nodes consisting of CPUs and GPU accelerators. As such systems become mainstream, they move away from specialized highperformance single application platforms to a more general setting with multiple, concurrent, application jobs. Determining how jobs should be dynamically best scheduled to heterogeneous devices is non-trivial. In certain cases, performance is […]

Apr, 3

### Evaluation of Libraries for Parallel Computing in Haskell – A Case Study with a Super-resolution Application

Haskell is a functional language featuring lazy evaluation and referential transparency. On one hand, Referential transparency is useful for parallel computing because the results do not depend on the evaluation order, but on the other hand, parallel computing requires an evaluation order that is different from that of lazy evaluation. There are some parallel programming […]

Apr, 3

### Sparse Matrix-Vector Multiplication on GPGPUs

The multiplication of a sparse matrix by a dense vector (SpMV) is a centerpiece of scientific computing applications: it is the essential kernel for the solution of sparse linear systems and sparse eigenvalue problems by iterative methods. The efficient implementation of the sparse matrixvector multiplication is therefore crucial and has been the subject of an […]

Apr, 3

### Chai: Collaborative Heterogeneous Applications for Integrated-architectures

Heterogeneous system architectures are evolving towards tighter integration among devices, with emerging features such as shared virtual memory, memory coherence, and systemwide atomics. Languages, device architectures, system specifications, and applications are rapidly adapting to the challenges and opportunities of tightly integrated heterogeneous platforms. Programming languages such as OpenCL 2.0, CUDA 8.0, and C++ AMP allow […]

Apr, 3

### FPGA-based Tsunami Simulation: Performance Comparison with GPUs, and Roofline Model for Scalability Analysis

MOST (Method Of Splitting Tsunami) is widely used to solve shallow water equations (SWEs) for simulation of tsunami. This paper presents high-performance and power-efficient computation of MOST for practical tsunami simulation with FPGA. The custom hardware for simulation is based on a stream computing architecture for deeply pipelining to increase performance with a limited bandwidth. […]