Jun, 16
Electric potential and field calculation of charged BEM triangles and rectangles by Gaussian cubature
It is a widely held view that analytical integration is more accurate than numerical integration. In some special cases, however, numerical integration can be more advantageous. In our paper we show this benefit for the case of electric potential and field computation of charged triangles and rectangles applied in the boundary […]
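As a rough illustration of Gaussian cubature over a BEM triangle, the Python sketch below evaluates the potential V(r) = σ/(4πε₀) ∫_T dA/|r − r′| of a uniformly charged triangle. It is not the paper's rule set. It assumes a tensor-product Gauss-Legendre rule mapped onto the triangle with a collapsed (Duffy) transform, and the names `triangle_cubature` and `triangle_potential` are hypothetical. It only targets field points off the triangle's plane, where the integrand is smooth.

```python
import numpy as np

EPS0 = 8.8541878128e-12  # vacuum permittivity in F/m


def triangle_cubature(f, p0, p1, p2, n=8):
    """Integrate f over triangle (p0, p1, p2) with an n x n Gauss-Legendre rule.

    f takes an (N, 3) array of points and returns an (N,) array of values.
    Assumption: collapsed (Duffy) map from the unit square, not the paper's rule.
    """
    p0, p1, p2 = (np.asarray(p, dtype=float) for p in (p0, p1, p2))
    x, w = np.polynomial.legendre.leggauss(n)
    t, wt = 0.5 * (x + 1.0), 0.5 * w  # nodes and weights moved from [-1, 1] to [0, 1]
    u, v = (a.ravel() for a in np.meshgrid(t, t, indexing="ij"))
    wu, wv = (a.ravel() for a in np.meshgrid(wt, wt, indexing="ij"))
    # Collapsed map: (u, v) in [0, 1]^2 -> p0 + u*(p1 - p0) + u*v*(p2 - p1) covers the triangle
    pts = p0 + u[:, None] * (p1 - p0) + (u * v)[:, None] * (p2 - p1)
    twice_area = np.linalg.norm(np.cross(p1 - p0, p2 - p0))
    jacobian = twice_area * u  # |d(point)/du x d(point)/dv|
    return float(np.sum(wu * wv * jacobian * f(pts)))


def triangle_potential(r, p0, p1, p2, sigma, n=8):
    """Potential (V) at field point r of a triangle with uniform surface charge sigma (C/m^2)."""
    r = np.asarray(r, dtype=float)

    def inv_dist(pts):
        return 1.0 / np.linalg.norm(pts - r, axis=1)

    return sigma / (4.0 * np.pi * EPS0) * triangle_cubature(inv_dist, p0, p1, p2, n)


tri = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(triangle_potential((0.3, 0.3, 1.0), *tri, sigma=1e-9, n=12))
```

Raising `n` until successive results agree is a simple convergence check; field points close to the surface need a larger `n` because 1/|r − r′| becomes sharply peaked there.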
Jun, 16
NCAM: Near-Data Processing for Nearest Neighbor Search
At the core of many applications in natural language processing (NLP), vision, and robotics lies a form of the k-nearest neighbor (kNN) search algorithm. The kNN algorithm is primarily bottlenecked by data movement, which limits throughput and adds latency in these applications. Although well-bounded kNN approximations that improve its performance do exist, these […]
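The exact brute-force kNN search that such systems start from can be sketched as follows in Python with NumPy. `knn_bruteforce` is a hypothetical helper, not NCAM's interface. Every query reads the entire data set, which is the data movement the abstract identifies as the bottleneck.

```python
import numpy as np


def knn_bruteforce(data, queries, k):
    """Exact kNN by scanning all data points for each query (hypothetical helper).

    data: (n, d) array, queries: (q, d) array, 1 <= k <= n.
    Returns (indices, squared distances), each of shape (q, k), nearest first.
    """
    # ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2 for every (query, point) pair
    d2 = (np.sum(queries**2, axis=1)[:, None]
          - 2.0 * queries @ data.T
          + np.sum(data**2, axis=1)[None, :])
    d2 = np.maximum(d2, 0.0)  # clamp small negative values caused by rounding
    # argpartition selects the k smallest per row without a full sort
    idx = np.argpartition(d2, k - 1, axis=1)[:, :k]
    part = np.take_along_axis(d2, idx, axis=1)
    order = np.argsort(part, axis=1)  # sort only the k survivors
    return np.take_along_axis(idx, order, axis=1), np.take_along_axis(part, order, axis=1)
```

The full distance matrix makes the cost O(q · n · d) regardless of k, which is why approximate indexes or near-data processing are attractive for large n.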
Jun, 16
Splotch: porting and optimizing for the Xeon Phi
With the increasing size and complexity of data produced by large-scale numerical simulations, it is of primary importance for scientists to be able to exploit all available hardware in heterogeneous High Performance Computing environments for increased throughput and efficiency. We focus on the porting and optimization of Splotch, a scalable visualization algorithm, to utilize […]
Jun, 16
Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs
We perform a study of the factors affecting training time in multi-device deep learning systems. Given a specification of a convolutional neural network, we study how to minimize the time to train this model on a cluster of commodity CPUs and GPUs. Our first contribution focuses on the single-node setting, in which we show that […]
Jun, 16
Multi-Tenant Virtual GPUs for Optimising Performance of a Financial Risk Application
Graphics Processing Units (GPUs) are becoming popular accelerators in modern High-Performance Computing (HPC) clusters. Installing GPUs on every node of a cluster is inefficient, resulting in high costs and power consumption as well as underutilisation of the accelerators. The research reported in this paper aims at using a few physical GPUs by […]
Jun, 14
Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond
With the appearance of the heterogeneous platform OpenPower, many-core accelerator devices have been coupled with Power host processors for the first time. To utilize their full potential, it is worth investigating performance-portable algorithms that allow choosing the best-fitting hardware for each domain-specific compute task. Suiting even the high level of parallelism on modern GPGPUs, […]
Jun, 14
First Application of Lattice QCD to Pezy-SC Processor
The Pezy-SC processor is a novel architecture developed by Pezy Computing K. K. that has achieved large computational power with low electric power consumption. It works as an accelerator device, similar to GPGPUs. A programming environment that resembles OpenCL is provided. Using a hybrid parallel system "Suiren" installed at KEK, we port and tune a […]
Jun, 14
OpenCL-Based Erasure Coding on Heterogeneous Architectures
Erasure coding, Reed-Solomon coding in particular, is a key technique for dealing with failures in scale-out storage systems. However, due to its algorithmic complexity, the performance overhead of erasure coding can become a significant bottleneck in storage systems attempting to meet service level agreements (SLAs). Previous work has mainly leveraged SIMD (single-instruction multiple-data) instruction extensions […]
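To make the Reed-Solomon computation concrete, here is a minimal pure-Python sketch of systematic Reed-Solomon erasure coding over GF(2^8). It assumes the primitive polynomial 0x11D and a Cauchy matrix for the parity rows, which is one common construction but not necessarily the one used in the paper, and every function name is hypothetical. The inner byte loops in `_combine` are the kind of work that SIMD or OpenCL implementations accelerate.

```python
def _gf_tables(poly=0x11D):
    """Exp/log tables for GF(2^8) with generator 2 and the given primitive polynomial."""
    exp, log = [0] * 512, [0] * 256
    x = 1
    for i in range(255):
        exp[i], log[x] = x, i
        x <<= 1
        if x & 0x100:
            x ^= poly
    for i in range(255, 512):
        exp[i] = exp[i - 255]  # doubled table avoids a modulo in gf_mul
    return exp, log


EXP, LOG = _gf_tables()


def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def gf_inv(a):
    return EXP[255 - LOG[a]]  # a must be nonzero


def generator_matrix(k, m):
    """Identity rows (systematic data shards) on top of Cauchy rows 1/(x_i + y_j) for parity."""
    ident = [[int(i == j) for j in range(k)] for i in range(k)]
    cauchy = [[gf_inv((k + i) ^ j) for j in range(k)] for i in range(m)]
    return ident + cauchy


def _combine(rows, shards):
    """Multiply a GF(2^8) matrix by a list of equal-length byte shards."""
    length = len(shards[0])
    out = []
    for row in rows:
        acc = bytearray(length)
        for coef, shard in zip(row, shards):
            if coef:
                for b in range(length):
                    acc[b] ^= gf_mul(coef, shard[b])  # addition in GF(2^8) is XOR
        out.append(bytes(acc))
    return out


def rs_encode(data, m):
    """Return the k data shards followed by m parity shards (requires k + m <= 256)."""
    k = len(data)
    return list(data) + _combine(generator_matrix(k, m)[k:], data)


def gf_mat_inv(mat):
    """Gauss-Jordan inversion of a square matrix over GF(2^8)."""
    n = len(mat)
    a = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(mat)]
    for col in range(n):
        piv = next(r for r in range(col, n) if a[r][col])
        a[col], a[piv] = a[piv], a[col]
        scale = gf_inv(a[col][col])
        a[col] = [gf_mul(scale, v) for v in a[col]]
        for r in range(n):
            if r != col and a[r][col]:
                f = a[r][col]
                a[r] = [x ^ gf_mul(f, y) for x, y in zip(a[r], a[col])]
    return [row[n:] for row in a]


def rs_decode(shards, k, m):
    """Rebuild the k data shards from any k surviving shards (lost shards are None)."""
    avail = [i for i, s in enumerate(shards) if s is not None][:k]
    if len(avail) < k:
        raise ValueError("fewer than k shards survive")
    gen = generator_matrix(k, m)
    inv = gf_mat_inv([gen[i] for i in avail])
    return _combine(inv, [shards[i] for i in avail])
```

Any k of the k + m shards suffice to rebuild the data, because every k × k submatrix of the identity-plus-Cauchy generator matrix is invertible.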
Jun, 14
Processing Big Data in Main Memory and on GPU
Many large-scale systems were designed with the assumption that I/O is the bottleneck, but this assumption has been challenged in the past decade by new trends in hardware capabilities and workload demands. The computational power of CPU cores has not improved in proportion to the performance of disks and network interfaces in the past decade, but […]
Jun, 14
Multi-GPU Implementation of Machine Learning Algorithm using CUDA and OpenCL
Modern Graphics Processing Units (GPUs) have become very useful for complex and time-consuming computations. GPUs provide high-performance computation capabilities at a good price. This paper deals with multi-GPU OpenCL and CUDA implementations of the k-Nearest Neighbor (k-NN) algorithm. This work compares the performance of the OpenCL and CUDA implementations, where each of them is suitable for […]
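One common way to spread k-NN over several GPUs is to split the training set across devices, compute a local top-k on each, and merge the candidates on the host. The sketch below assumes that decomposition (the paper's actual partitioning is not described in this excerpt) and simulates the devices sequentially with NumPy; `knn_classify_partitioned` and `local_topk` are hypothetical names. Labels are assumed to be non-negative integers, and `num_devices` must not exceed the number of training points.

```python
import numpy as np


def local_topk(chunk, queries, k):
    """Chunk-local indices and squared distances of the k nearest chunk points per query."""
    d2 = ((queries[:, None, :] - chunk[None, :, :]) ** 2).sum(axis=2)
    k = min(k, len(chunk))
    idx = np.argpartition(d2, k - 1, axis=1)[:, :k]
    return idx, np.take_along_axis(d2, idx, axis=1)


def knn_classify_partitioned(train_x, train_y, queries, k, num_devices):
    """k-NN classification with the training set split into num_devices chunks.

    Each chunk stands in for one GPU (hypothetical decomposition); chunks run sequentially here.
    train_y must hold non-negative integer labels.
    """
    cand_idx, cand_d2 = [], []
    for ids in np.array_split(np.arange(len(train_x)), num_devices):
        li, ld = local_topk(train_x[ids], queries, k)  # per-"device" work
        cand_idx.append(ids[li])  # convert chunk-local indices to global ones
        cand_d2.append(ld)
    # Host-side merge: keep the k best of all per-device candidates
    all_idx = np.concatenate(cand_idx, axis=1)
    all_d2 = np.concatenate(cand_d2, axis=1)
    best = np.argsort(all_d2, axis=1)[:, :k]
    neighbors = np.take_along_axis(all_idx, best, axis=1)
    # Majority vote; argmax over bincount breaks ties toward the smallest label
    return np.array([np.bincount(row).argmax() for row in train_y[neighbors]])
```

The merge is exact: each of the global k nearest neighbors is necessarily among the k nearest of its own chunk, so the union of local candidates always contains the global answer.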