## Posts

Oct, 15

### Flexible FPGA design for FDTD using OpenCL

Compared to classical HDL designs, generating FPGA with high-level synthesis from an OpenCL specification promises easier exploration of different design alternatives and, through ready-to-use infrastructure and common abstractions for host and memory interfaces, easier portability between different FPGA families. In this work, we evaluate the extent of this promise. To this end, we present a […]

Oct, 15

### Toward Performance Portability for CPUs and GPUs Through Algorithmic Compositions

The diversity of microarchitecture designs in heterogeneous computing systems allows programs to achieve high performance and energy efficiency, but results in substantial software redevelopment cost for each type or generation of hardware. To mitigate this cost, a performance portable programming system is required. This work presents my solution to the performance portability problem. I argue […]

Oct, 15

### Synkhronos: a Multi-GPU Theano Extension for Data Parallelism

We present Synkhronos, an extension to Theano for multi-GPU computations leveraging data parallelism. Our framework provides automated execution and synchronization across devices, allowing users to continue to write serial programs without risk of race conditions. The NVIDIA Collective Communication Library is used for high-bandwidth inter-GPU communication. Further enhancements to the Theano function interface include input […]

Oct, 15

### SoAx: A generic C++ Structure of Arrays for handling Particles in HPC Codes

The numerical study of physical problems often require integrating the dynamics of a large number of particles evolving according to a given set of equations. Particles are characterized by the information they are carrying such as an identity, a position other. There are generally speaking two different possibilities for handling particles in high performance computing […]

Oct, 3

### Energy efficiency of finite difference algorithms on multicore CPUs, GPUs, and Intel Xeon Phi processors

In addition to hardware wall-time restrictions commonly seen in high-performance computing systems, it is likely that future systems will also be constrained by energy budgets. In the present work, finite difference algorithms of varying computational and memory intensity are evaluated with respect to both energy efficiency and runtime on an Intel Ivy Bridge CPU node, […]

Oct, 3

### FPGA implementation of a Convolutional Neural Network for "Wake up word" detection

The popularity of machine learning has increased dramatically in the last years and the possible applications varies from web search, speech recognition, object detection, etc. A big part of this development is due to the use of Convolutional Neural Networks (CNNs), where high performance Graphics Processing Units (GPUs) has been the most popular device. This […]

Oct, 3

### An Efficient Load Balancing Method for Tree Algorithms

Nowadays, multiprocessing is mainstream with exponentially increasing number of processors. Load balancing is, therefore, a critical operation for the efficient execution of parallel algorithms. In this paper we consider the fundamental class of tree-based algorithms that are notoriously irregular, and hard to load-balance with existing static techniques. We propose a hybrid load balancing method using […]

Oct, 3

### Computing Treewidth on the GPU

We present a parallel algorithm for computing the treewidth of a graph on a GPU. We implement this algorithm in OpenCL, and experimentally evaluate its performance. Our algorithm is based on an O*(2^n)-time algorithm that explores the elimination orderings of the graph using a Held-Karp like dynamic programming approach. We use Bloom filters to detect […]

Oct, 3

### Performance Evaluation of Container-based Virtualization for High Performance Computing Environments

Virtualization technologies have evolved along with the development of computational environments since virtualization offered needed features at that time such as isolation, accountability, resource allocation, resource fair sharing and so on. Novel processor technologies bring to commodity computers the possibility to emulate diverse environments where a wide range of computational scenarios can be run. Along […]

Sep, 28

### 6th International Workshop on OpenCL (IWOCL), 2018

The International Workshop on OpenCL (IWOCL) is the annual meeting of OpenCL users, researchers, developers and suppliers to share OpenCL best practise, and to promote the evolution and advancement of the OpenCL standard. The meeting is open to anyone who is interested in contributing to, and participating in the OpenCL community. Submissions related to any […]

Sep, 28

### OpenCL Actors – Adding Data Parallelism to Actor-based Programming with CAF

The actor model of computation has been designed for a seamless support of concurrency and distribution. However, it remains unspecific about data parallel program flows, while available processing power of modern many core hardware such as graphics processing units (GPUs) or coprocessors increases the relevance of data parallelism for general-purpose computation. In this work, we […]

Sep, 28

### Mixed Precision Solver Scalable to 16000 MPI Processes for Lattice Quantum Chromodynamics Simulations on the Oakforest-PACS System

Lattice Quantum Chromodynamics (Lattice QCD) is a quantum field theory on a finite discretized space-time box so as to numerically compute the dynamics of quarks and gluons to explore the nature of subatomic world. Solving the equation of motion of quarks (quark solver) is the most compute-intensive part of the lattice QCD simulations and is […]