Posts
Oct, 3
An Efficient Load Balancing Method for Tree Algorithms
Nowadays, multiprocessing is mainstream, with an exponentially increasing number of processors. Load balancing is, therefore, a critical operation for the efficient execution of parallel algorithms. In this paper we consider the fundamental class of tree-based algorithms, which are notoriously irregular and hard to load-balance with existing static techniques. We propose a hybrid load balancing method using […]
Oct, 3
Computing Treewidth on the GPU
We present a parallel algorithm for computing the treewidth of a graph on a GPU. We implement this algorithm in OpenCL, and experimentally evaluate its performance. Our algorithm is based on an O*(2^n)-time algorithm that explores the elimination orderings of the graph using a Held-Karp-like dynamic programming approach. We use Bloom filters to detect […]
Oct, 3
Performance Evaluation of Container-based Virtualization for High Performance Computing Environments
Virtualization technologies have evolved along with the development of computational environments, since virtualization offered features needed at the time, such as isolation, accountability, resource allocation, fair resource sharing and so on. Novel processor technologies bring to commodity computers the possibility to emulate diverse environments where a wide range of computational scenarios can be run. Along […]
Sep, 28
6th International Workshop on OpenCL (IWOCL), 2018
The International Workshop on OpenCL (IWOCL) is the annual meeting of OpenCL users, researchers, developers and suppliers to share OpenCL best practice, and to promote the evolution and advancement of the OpenCL standard. The meeting is open to anyone who is interested in contributing to, and participating in, the OpenCL community. Submissions related to any […]
Sep, 28
OpenCL Actors – Adding Data Parallelism to Actor-based Programming with CAF
The actor model of computation has been designed for seamless support of concurrency and distribution. However, it remains unspecific about data-parallel program flows, while the available processing power of modern many-core hardware such as graphics processing units (GPUs) or coprocessors increases the relevance of data parallelism for general-purpose computation. In this work, we […]
Sep, 28
Mixed Precision Solver Scalable to 16000 MPI Processes for Lattice Quantum Chromodynamics Simulations on the Oakforest-PACS System
Lattice Quantum Chromodynamics (Lattice QCD) is a quantum field theory on a finite discretized space-time box, used to numerically compute the dynamics of quarks and gluons and to explore the nature of the subatomic world. Solving the equation of motion of quarks (quark solver) is the most compute-intensive part of lattice QCD simulations and is […]
Sep, 28
GALARIO: a GPU Accelerated Library for Analysing Radio Interferometer Observations
We present GALARIO, a computational library that exploits the power of modern graphical processing units (GPUs) to accelerate the analysis of observations from radio interferometers like ALMA or Jansky VLA. GALARIO speeds up the computation of synthetic visibilities from a generic 2D model image or a radial brightness profile (for axisymmetric sources). On a GPU, […]
Sep, 28
Modeling the Resource Requirements of Convolutional Neural Networks on Mobile Devices
Convolutional Neural Networks (CNNs) have revolutionized the research in computer vision, due to their ability to capture complex patterns, resulting in high inference accuracies. However, the increasingly complex nature of these neural networks means that they are particularly suited for server computers with powerful GPUs. We envision that deep learning applications will be eventually and […]
Sep, 28
Accelerating Electron Tomography Reconstruction Algorithm ICON Using the Intel Xeon Phi Coprocessor on Tianhe-2 Supercomputer
Electron tomography (ET) is an important method for studying three-dimensional cell ultrastructure. Combined with a sub-volume averaging approach, ET provides new possibilities for investigating in situ macromolecular complexes at sub-nanometer resolution. Because of the limited sampling angles, ET reconstruction usually suffers from the "missing wedge" problem. With a validation procedure, Iterative Compressed-sensing Optimized NUFFT reconstruction […]
Sep, 21
Asynchronous Task-Based Polar Decomposition on Single Node Manycore Architectures
This paper introduces the first asynchronous, task-based formulation of the polar decomposition and its corresponding implementation on manycore architectures. Based on a new formulation of the iterative QR dynamically-weighted Halley algorithm (QDWH) for the calculation of the polar decomposition, the proposed implementation replaces the original and hostile LU factorization for the condition number estimator by […]
Sep, 21
Accelerating Radio Astronomy with Auto-Tuning
The goal of this thesis is to show a way to improve the performance of different radio astronomy applications. To begin with, in this thesis we advocate the use of many-core accelerators, parallel processors with hundreds of computational cores, as execution platforms for widely used radio astronomy algorithms and platforms. However, we also show that […]
Sep, 21
IBM Deep Learning Service
Deep learning driven by large neural network models is overtaking traditional machine learning methods for understanding unstructured and perceptual data domains such as speech, text, and vision. At the same time, the "as-a-Service"-based business model on the cloud is fundamentally transforming the information technology industry. These two trends, deep learning and "as-a-service", are colliding to […]