Posts
Nov, 7
Scalable Streaming Tools for Analyzing N-body Simulations: Finding Halos and Investigating Excursion Sets in One Pass
Cosmological N-body simulations play a vital role in studying how the Universe evolves. To compare with observations and draw scientific inferences, statistical analysis of large simulation datasets, e.g., finding halos or obtaining multi-point correlation functions, is crucial. However, traditional in-memory methods for these tasks do not scale to the prohibitively large datasets in modern […]
Nov, 7
Lattice QCD on new chips: a community summary
I review the most recent developments in QCD codes on new architectures, with a focus on the performance obtained by the different coding strategies presented at the Lattice-2017 conference.
Nov, 7
Acceleration of tensor-product operations for high-order finite element methods
This paper is devoted to GPU kernel optimization and performance analysis of three tensor-product operators arising in finite element methods. We provide a mathematical background to these operations and implementation details. Achieving close-to-the-peak performance for these operators requires extensive optimization because of the operators’ properties: low arithmetic intensity, tiered structure, and the need to store […]
Nov, 5
Dynamic Load Balancing Strategies for Graph Applications on GPUs
Acceleration of graph applications on GPUs has attracted considerable interest due to the ubiquitous use of graph processing in various domains. The inherent irregularity in graph applications leads to several challenges for parallelization. A key challenge, which we address in this paper, is that of load imbalance. If the work assignment to threads uses node-based graph partitioning, […]
Nov, 5
A Dynamic Hash Table for the GPU
We design and implement a fully concurrent dynamic hash table for GPUs with performance comparable to state-of-the-art static hash tables. We propose a warp-cooperative work-sharing strategy that reduces branch divergence and provides an efficient alternative to the traditional per-thread (or per-warp) work assignment and processing. By using this […]
Nov, 5
Data Coherence Analysis and Optimization for Heterogeneous Computing
Although heterogeneous computing has enabled impressive program speed-ups, knowledge about the architecture of the target device is still critical to reap full hardware benefits. Programming such architectures is complex and is usually done by means of specialized languages (e.g. CUDA, OpenCL). The cost of moving and keeping host/device data coherent may easily eliminate any performance […]
Nov, 5
ChainerMN: Scalable Distributed Deep Learning Framework
One of the keys to deep learning's breakthroughs in various fields has been the use of high computing power centered on GPUs. Harnessing further computing capacity through distributed processing is essential not only to make deep learning bigger and faster but also to tackle unsolved challenges. We present the […]
Nov, 5
Deep and Shallow convections in Atmosphere Models on Intel Xeon Phi Coprocessor Systems
Deep and shallow convection calculations take up significant time in atmosphere models. These calculations also present significant load imbalances due to varying cloud cover over different regions of the grid. In this work, we accelerate these calculations on Intel Xeon Phi Coprocessor Systems. By employing dynamic scheduling in OpenMP, we demonstrate large reductions in load imbalance […]
Oct, 31
PCIeHLS: an OpenCL HLS framework
One of the goals of high-level synthesis (HLS) is to make designing hardware accelerators running on FPGAs accessible to developers with a software background (usually implying developers with little background in hardware design). While high-level synthesis generates accelerator kernels, it generally does not assist with integrating the generated kernels into a system. In […]
Oct, 31
Automatic Scan Parallelization in OpenMP
Prefix Scan (or simply scan) is an operator that computes all the partial sums of a vector. A scan operation results in a vector where each element is the sum of the preceding elements in the original vector up to the corresponding position. Scan is a key operation in many relevant problems like sorting, lexical […]
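To make the scan definition above concrete, here is a minimal Python sketch. It assumes the inclusive reading of the description (element i of the result is the sum of input elements 0 through i) and addition as the operator. The function names `inclusive_scan` and `blocked_scan` are hypothetical, and the blocked variant is a generic textbook decomposition, not the parallelization method of the paper, whose details are not given in this excerpt.

```python
from typing import List, Sequence


def inclusive_scan(values: Sequence[float]) -> List[float]:
    """Sequential inclusive scan: out[i] = values[0] + ... + values[i]."""
    out = []
    running = 0
    for v in values:
        running += v
        out.append(running)
    return out


def blocked_scan(values: Sequence[float], num_blocks: int) -> List[float]:
    """Blocked scan in three phases (assumed decomposition, run serially here).

    Phases 1 and 3 touch each block independently, so they are the parts
    a parallel runtime could distribute across threads.
    """
    n = len(values)
    if n == 0:
        return []
    block = -(-n // num_blocks)  # ceiling division: elements per block
    chunks = [values[i:i + block] for i in range(0, n, block)]

    # Phase 1: scan every block on its own.
    local = [inclusive_scan(chunk) for chunk in chunks]

    # Phase 2: an exclusive scan over the block totals gives each block
    # the sum of everything that precedes it.
    totals = [scanned[-1] for scanned in local]
    offsets = [0] + inclusive_scan(totals)[:-1]

    # Phase 3: shift each block by its offset.
    return [x + off for scanned, off in zip(local, offsets) for x in scanned]


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6]
    print(inclusive_scan(data))
    print(blocked_scan(data, num_blocks=3))
```

Both calls should produce the same list; the blocked version only illustrates why a scan, despite its apparent loop-carried dependence, can be split into independent pieces.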
Oct, 31
An efficient GPU algorithm for tetrahedron-based Brillouin-zone integration
We report an efficient algorithm for calculating momentum-space integrals in solid state systems on modern graphics processing units (GPUs). We extend the tetrahedron method by Blöchl et al. to the more general case of the integration of a momentum as well as energy dependent quantity and implement the algorithm based on the CUDA programming framework. We […]
Oct, 31
On Pre-Trained Image Features and Synthetic Images for Deep Learning
Deep Learning methods usually require huge amounts of training data to perform at their full potential, and often require expensive manual labeling. Using synthetic images is therefore very attractive to train object detectors, as the labeling comes for free, and several approaches have been proposed to combine synthetic and real images for training. In this […]