## Posts

Sep, 28

### Mixed Precision Solver Scalable to 16000 MPI Processes for Lattice Quantum Chromodynamics Simulations on the Oakforest-PACS System

Lattice Quantum Chromodynamics (Lattice QCD) is a quantum field theory formulated on a finite, discretized space-time box, used to numerically compute the dynamics of quarks and gluons and to explore the nature of the subatomic world. Solving the equation of motion of quarks (quark solver) is the most compute-intensive part of lattice QCD simulations and is […]

Sep, 28

### GALARIO: a GPU Accelerated Library for Analysing Radio Interferometer Observations

We present GALARIO, a computational library that exploits the power of modern graphics processing units (GPUs) to accelerate the analysis of observations from radio interferometers such as ALMA or the Jansky VLA. GALARIO speeds up the computation of synthetic visibilities from a generic 2D model image or a radial brightness profile (for axisymmetric sources). On a GPU, […]

Sep, 28

### Modeling the Resource Requirements of Convolutional Neural Networks on Mobile Devices

Convolutional Neural Networks (CNNs) have revolutionized research in computer vision thanks to their ability to capture complex patterns, resulting in high inference accuracies. However, the increasingly complex nature of these neural networks means that they are particularly suited for server computers with powerful GPUs. We envision that deep learning applications will be eventually and […]

Sep, 28

### Accelerating Electron Tomography Reconstruction Algorithm ICON Using the Intel Xeon Phi Coprocessor on Tianhe-2 Supercomputer

Electron tomography (ET) is an important method for studying three-dimensional cell ultrastructure. Combined with a sub-volume averaging approach, ET provides new possibilities for investigating in situ macromolecular complexes at sub-nanometer resolution. Because of the limited sampling angles, ET reconstruction usually suffers from the 'missing wedge' problem. With a validation procedure, Iterative Compressed-sensing Optimized NUFFT reconstruction […]

Sep, 21

### Asynchronous Task-Based Polar Decomposition on Single Node Manycore Architectures

This paper introduces the first asynchronous, task-based formulation of the polar decomposition and its corresponding implementation on manycore architectures. Based on a new formulation of the iterative QR dynamically-weighted Halley algorithm (QDWH) for the calculation of the polar decomposition, the proposed implementation replaces the original and hostile LU factorization for the condition number estimator by […]

Sep, 21

### Accelerating Radio Astronomy with Auto-Tuning

The goal of this thesis is to show a way to improve the performance of different radio astronomy applications. To begin with, in this thesis we advocate the use of many-core accelerators, parallel processors with hundreds of computational cores, as execution platforms for widely used radio astronomy algorithms and platforms. However, we also show that […]

Sep, 21

### IBM Deep Learning Service

Deep learning driven by large neural network models is overtaking traditional machine learning methods for understanding unstructured and perceptual data domains such as speech, text, and vision. At the same time, the "as-a-Service"-based business model on the cloud is fundamentally transforming the information technology industry. These two trends, deep learning and "as-a-service", are colliding to […]

Sep, 21

### Automated Testing of Graphics Shader Compilers

We present an automated technique for finding defects in compilers for graphics shading languages. A key challenge in compiler testing is the lack of an oracle that classifies an output as correct or incorrect; this is particularly pertinent in graphics shader compilers where the output is a rendered image that is typically under-specified. Our method […]

Sep, 21

### Distributed Training Large-Scale Deep Architectures

Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this paper, we focus on employing the system approach to speed up large-scale training. Via lessons learned from our routine benchmarking effort, we first identify bottlenecks […]

Sep, 16

### Out-of-core Implementation for Accelerator Kernels on Heterogeneous Clouds

Cloud environments today increasingly feature hybrid nodes containing multicore CPUs and a diverse mix of accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors, and Field-Programmable Gate Arrays (FPGAs) to make it easier to migrate HPC workloads to them. While virtualization of accelerators in clouds is a leading research challenge, we address […]

Sep, 16

### Monte Carlo methods for massively parallel computers

Applications that require substantial computational resources today cannot avoid the use of heavily parallel machines. Embracing the opportunities of parallel computing and especially the possibilities provided by a new generation of massively parallel accelerator devices such as GPUs, Intel’s Xeon Phi or even FPGAs enables applications and studies that are inaccessible to serial programs. Here […]

Sep, 16

### Meta Networks for Neural Style Transfer

In this paper we propose a new method to obtain the specified network parameters through a single feed-forward propagation of the meta networks, and we explore its application to neural style transfer. Recent works on style transfer typically need to train image transformation networks for every new style, and the style is encoded in the network […]