## Posts

Nov, 7

### Lattice QCD on new chips: a community summary

I review the most recent developments in QCD codes on new architectures, with a focus on the performance achieved by the different coding strategies, as presented at the Lattice-2017 conference.

Nov, 7

### Acceleration of tensor-product operations for high-order finite element methods

This paper is devoted to GPU kernel optimization and performance analysis of three tensor-product operators arising in finite element methods. We provide a mathematical background to these operations and implementation details. Achieving close-to-the-peak performance for these operators requires extensive optimization because of the operators’ properties: low arithmetic intensity, tiered structure, and the need to store […]

Nov, 5

### Dynamic Load Balancing Strategies for Graph Applications on GPUs

Acceleration of graph applications on GPUs has attracted considerable interest due to the ubiquitous use of graph processing in various domains. The inherent irregularity in graph applications leads to several challenges for parallelization. A key challenge, which we address in this paper, is that of load imbalance. If the work-assignment to threads uses node-based graph partitioning, […]

Nov, 5

### A Dynamic Hash Table for the GPU

We design and implement a fully concurrent dynamic hash table for GPUs with performance comparable to that of state-of-the-art static hash tables. We propose a warp-cooperative work-sharing strategy that reduces branch divergence and provides an efficient alternative to the traditional way of per-thread (or per-warp) work assignment and processing. By using this […]

Nov, 5

### Data Coherence Analysis and Optimization for Heterogeneous Computing

Although heterogeneous computing has enabled impressive program speed-ups, knowledge about the architecture of the target device is still critical to reap full hardware benefits. Programming such architectures is complex and is usually done by means of specialized languages (e.g. CUDA, OpenCL). The cost of moving and keeping host/device data coherent may easily eliminate any performance […]

Nov, 5

### ChainerMN: Scalable Distributed Deep Learning Framework

One of the keys to deep learning's breakthroughs in various fields has been the use of high computing power, centered on GPUs. Harnessing further computing capacity through distributed processing is essential not only to make deep learning bigger and faster but also to tackle unsolved challenges. We present the […]

Nov, 5

### Deep and Shallow convections in Atmosphere Models on Intel Xeon Phi Coprocessor Systems

Deep and shallow convection calculations take up a significant share of the run time in atmosphere models. These calculations also present significant load imbalances due to varying cloud cover over different regions of the grid. In this work, we accelerate these calculations on Intel Xeon Phi Coprocessor Systems. By employing dynamic scheduling in OpenMP, we demonstrate large reductions in load imbalance […]

Oct, 31

### PCIeHLS: an OpenCL HLS framework

One of the goals of high-level synthesis (HLS) is to make designing hardware accelerators running on FPGAs accessible to developers with a software background (usually implying developers with little background in hardware design). While high-level synthesis generates accelerator kernels, it generally does not assist with integrating the generated kernels into a system. In […]

Oct, 31

### Automatic Scan Parallelization in OpenMP

Prefix Scan (or simply scan) is an operator that computes all the partial sums of a vector. A scan operation results in a vector where each element is the sum of the preceding elements in the original vector up to the corresponding position. Scan is a key operation in many relevant problems like sorting, lexical […]

Oct, 31

### An efficient GPU algorithm for tetrahedron-based Brillouin-zone integration

We report an efficient algorithm for calculating momentum-space integrals in solid state systems on modern graphics processing units (GPUs). We extend the tetrahedron method by Blöchl et al. to the more general case of the integration of a momentum as well as energy dependent quantity and implement the algorithm based on the CUDA programming framework. We […]

Oct, 31

### On Pre-Trained Image Features and Synthetic Images for Deep Learning

Deep Learning methods usually require huge amounts of training data to perform at their full potential, and often require expensive manual labeling. Using synthetic images is therefore very attractive to train object detectors, as the labeling comes for free, and several approaches have been proposed to combine synthetic and real images for training. In this […]

Oct, 31

### Performance optimizations for scalable CFD applications on hybrid CPU+MIC heterogeneous computing system with millions of cores

For computational fluid dynamics (CFD) applications with a large number of grid points/cells, parallel computing is a common and efficient strategy to reduce the computational time. How to achieve the best performance on modern supercomputer systems, especially those with heterogeneous computing resources such as hybrid CPU+GPU or CPU + Intel Xeon Phi (MIC) coprocessors, is […]