Posts
Oct, 21
Wilson and Domainwall Kernels on Oakforest-PACS
We report the performance of Wilson and Domainwall Kernels on a new Intel Xeon Phi Knights Landing-based machine named Oakforest-PACS, which is co-hosted by the University of Tokyo and the University of Tsukuba and is currently the fastest in Japan. This machine uses Intel Omni-Path for the internode network. We compare performance with several types of implementation including […]
Oct, 21
Revisiting the Case of ARM SoCs in High-Performance Computing Clusters
Over the course of the past decade, the explosive popularity of embedded devices such as smartphones and tablets has given rise to ARM SoCs, whose characteristically low power consumption has made them ideal for these types of embedded devices. Recent maturation in the ARM SoC market, which has seen the advent of more powerful 64-bit […]
Oct, 21
Parallel Matching and Clustering Algorithms on GPUs
The main focus of this thesis is on developing efficient algorithms on GPUs for certain matching and clustering problems. Through extensive experiments we show that sparse and unstructured problems can benefit greatly from using GPUs as long as the algorithms are carefully designed. Even though none of the presented algorithms are fundamentally new, they still […]
Oct, 21
How to distribute most efficiently a computation intensive calculation on an Android device to external compute units with an Android API
Is transferring computation-intensive calculations to external compute units the next trend? This master’s thesis investigates whether it is worth the effort to transfer a matrix multiplication from an Android phone to a System-on-Chip (SoC), using Bluetooth or WebSocket as communication protocols. The SoC solution used in this work is an Intel Altera Cyclone V-based […]
Oct, 21
Computation of gray-level co-occurrence matrix based on CUDA and its optimization
In fields such as scientific research and industrial applications, optimizing computation time is a task of increasing importance. Because of its highly parallel architecture, the graphics processing unit is regarded as a powerful engine for application programs that demand fairly high computation capabilities. Based on this, an algorithm was introduced […]
Oct, 15
Accelerating Genomics Research with OpenCL and FPGAs
With the rapid decrease in gene sequencing costs due to the emergence of second-generation sequencing equipment, the availability of genome sequence data is increasing dramatically. The ability to correlate the variations among genomes is enabling advances in a wide range of medical research and personalized care. Because each human genome comprises more than three billion […]
Oct, 15
Flexible FPGA design for FDTD using OpenCL
Compared to classical HDL designs, generating FPGA designs with high-level synthesis from an OpenCL specification promises easier exploration of different design alternatives and, through ready-to-use infrastructure and common abstractions for host and memory interfaces, easier portability between different FPGA families. In this work, we evaluate the extent of this promise. To this end, we present a […]
Oct, 15
Toward Performance Portability for CPUs and GPUs Through Algorithmic Compositions
The diversity of microarchitecture designs in heterogeneous computing systems allows programs to achieve high performance and energy efficiency, but results in substantial software redevelopment cost for each type or generation of hardware. To mitigate this cost, a performance portable programming system is required. This work presents my solution to the performance portability problem. I argue […]
Oct, 15
Synkhronos: a Multi-GPU Theano Extension for Data Parallelism
We present Synkhronos, an extension to Theano for multi-GPU computations leveraging data parallelism. Our framework provides automated execution and synchronization across devices, allowing users to continue to write serial programs without risk of race conditions. The NVIDIA Collective Communication Library is used for high-bandwidth inter-GPU communication. Further enhancements to the Theano function interface include input […]
Oct, 15
SoAx: A generic C++ Structure of Arrays for handling Particles in HPC Codes
The numerical study of physical problems often requires integrating the dynamics of a large number of particles evolving according to a given set of equations. Particles are characterized by the information they carry, such as an identity, a position, or other properties. There are, generally speaking, two different possibilities for handling particles in high-performance computing […]
Oct, 3
Energy efficiency of finite difference algorithms on multicore CPUs, GPUs, and Intel Xeon Phi processors
In addition to hardware wall-time restrictions commonly seen in high-performance computing systems, it is likely that future systems will also be constrained by energy budgets. In the present work, finite difference algorithms of varying computational and memory intensity are evaluated with respect to both energy efficiency and runtime on an Intel Ivy Bridge CPU node, […]
Oct, 3
FPGA implementation of a Convolutional Neural Network for "Wake up word" detection
The popularity of machine learning has increased dramatically in recent years, and the possible applications range from web search to speech recognition and object detection. A big part of this development is due to the use of Convolutional Neural Networks (CNNs), where high-performance Graphics Processing Units (GPUs) have been the most popular devices. This […]