Posts
Sep, 7
From MPI to MPI+OpenACC: Conversion of a legacy FORTRAN PCG solver for the spherical Laplace equation
A real-world example of adding OpenACC to a legacy MPI FORTRAN Preconditioned Conjugate Gradient code is described, and timing results for multi-node multi-GPU runs are shown. The code is used to obtain three-dimensional spherical solutions to the Laplace equation. Its application is finding potential field solutions of the solar corona, a useful tool in space […]
Sep, 7
Multi-Tasking Scheduling for Heterogeneous Systems
Heterogeneous platforms play an increasingly important role in modern computer systems. They combine high performance with low power consumption. From mobiles to supercomputers, we see an increasing number of computer systems that are heterogeneous. The best-known heterogeneous systems, CPU+GPU platforms, have been widely used in recent years. As they become more mainstream, serving multiple […]
Sep, 7
Accelerating recurrent neural network training using sequence bucketing and multi-GPU data parallelization
An efficient algorithm for recurrent neural network training is presented. The approach increases the training speed for tasks where the length of the input sequence may vary significantly. The proposed approach is based on optimal batch bucketing by input sequence length and data parallelization on multiple graphics processing units. The baseline training performance without […]
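To illustrate the general idea of bucketing by sequence length, here is a minimal sketch. It is not the paper's algorithm: the helper name `bucket_batches`, the fixed `bucket_width`, and zero-padding are all assumptions made for this example. Sequences of similar length are grouped first, so each batch is padded only to its own longest member.

```python
from collections import defaultdict


def bucket_batches(sequences, bucket_width, batch_size):
    """Group sequences of similar length, then split each group into padded batches.

    Hypothetical helper for illustration; the paper's "optimal" bucketing
    strategy is not reproduced here.
    """
    buckets = defaultdict(list)
    for seq in sequences:
        # Sequences whose lengths fall in the same fixed-width range share a bucket.
        buckets[len(seq) // bucket_width].append(seq)

    batches = []
    for key in sorted(buckets):
        group = buckets[key]
        for start in range(0, len(group), batch_size):
            batch = group[start:start + batch_size]
            longest = max(len(seq) for seq in batch)
            # Pad with 0 only up to the longest sequence in this batch.
            batches.append([seq + [0] * (longest - len(seq)) for seq in batch])
    return batches


example = [[1], [1, 2], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6], [7]]
batches = bucket_batches(example, bucket_width=4, batch_size=2)
```

With these inputs the short sequences never share a batch with the long ones, so no short sequence is padded to length 6.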
Sep, 3
Benchmarking Harp-DAAL: High Performance Hadoop on KNL Clusters
Data analytics is undergoing a revolution in many scientific domains, demanding cost-effective parallel data analysis techniques. Traditional Java-based Big Data processing tools like Hadoop MapReduce are designed for commodity CPUs. In contrast, emerging manycore processors like Xeon Phi have an order of magnitude more computation power and memory bandwidth. To harness the computing capabilities, we […]
Sep, 3
Real-Time Rendering of Molecular Dynamics Simulation Data: A Tutorial
Achieving real-time molecular dynamics rendering is a challenge, especially when the rendering requires intensive computation involving a large simulation data-set. The task becomes even more challenging when the size of the data is too large to fit into random access memory (RAM) and the final imagery depends on the input and output (I/O) performance. The […]
Sep, 3
Integer sorting on multicores: some (experiments and) observations
There have been many proposals for sorting integers on multicores/GPUs that include radix-sort and its variants or other approaches that exploit specialized hardware features of a particular multicore architecture. Comparison-based algorithms have also been used, as have network-based algorithms, the primary example being Batcher’s bitonic sorting algorithm. Although the latter approach is theoretically […]
Sep, 3
Towards On-Chip Optical FFTs for Convolutional Neural Networks
Convolutional neural networks have become an essential element of spatial deep learning systems. In the prevailing architecture, the convolution operation is performed with Fast Fourier Transforms (FFT) electronically in GPUs. The parallelism of GPUs provides an efficiency advantage over CPUs; however, both approaches, being electronic, are bound by the speed and power limits of the interconnect […]
Sep, 3
Performance Analysis of Open Source Machine Learning Frameworks for Various Parameters in Single-Threaded and Multi-Threaded Modes
The basic features of some of the most versatile and popular open source frameworks for machine learning (TensorFlow, Deeplearning4j, and H2O) are considered and compared. A comparative analysis was performed and conclusions were drawn as to the advantages and disadvantages of these platforms. The performance tests for the de facto standard MNIST data set […]
Aug, 26
Accelerate Local Tone Mapping for High Dynamic Range Images Using OpenCL with GPU
Tone mapping is used to convert HDR (high dynamic range) images to low dynamic range. This paper describes an algorithm to display high dynamic range images. Although local tone-mapping operators reproduce detail and contrast better than global operators, local tone-mapping algorithms usually require a huge amount of […]
Aug, 26
Vulnerability Analysis and Attacks on Intel Xeon Phi Coprocessor
The Intel Xeon Phi coprocessor is a PCIe-based add-in card. Though it is prone to simple attacks, many high performance computing systems are constructed by combining CPUs and coprocessors. This paper describes two attacks that exploit vulnerabilities related to the boot process of the coprocessor and the ownership of the offload user. Proof-of-concept codes are […]
Aug, 26
Large Integer Arithmetic in GPU for Cryptography
Most computers nowadays support 32-bit or 64-bit data types in various programming languages, and these are sufficient for most use cases. However, cryptography requires a range and precision of more than 64 bits, which is computationally expensive on CPUs. In this report, we present our design and implementation of […]
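For readers unfamiliar with arithmetic beyond 64 bits, the sketch below shows one common way to represent a large integer: as an array of 32-bit "limbs" added with carry propagation. It is only an illustration and says nothing about the report's own design, which is not described in this excerpt. The names `to_limbs`, `from_limbs`, and `add_limbs` are made up for the example.

```python
LIMB_BITS = 32
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(value, count):
    """Split a non-negative integer into `count` little-endian 32-bit limbs."""
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]


def from_limbs(limbs):
    """Reassemble little-endian 32-bit limbs into a single integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def add_limbs(a, b):
    """Add two equal-length limb arrays; return (sum_limbs, carry_out)."""
    result = []
    carry = 0
    for x, y in zip(a, b):
        total = x + y + carry
        result.append(total & LIMB_MASK)  # keep the low 32 bits in this limb
        carry = total >> LIMB_BITS        # pass the overflow to the next limb
    return result, carry


# Two 128-bit operands, each stored as four 32-bit limbs.
a = 2**100 + 12345
b = 2**99 + (2**64 - 1)
sum_limbs, carry_out = add_limbs(to_limbs(a, 4), to_limbs(b, 4))
```

The carry chain makes each limb depend on the one before it, which is one reason this kind of arithmetic takes care to parallelize.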
Aug, 26
Dynamic Parallelism in GPU Optimized Barnes Hut Trees for Molecular Dynamics Simulations
Since the beginning of the modern computing era, high performance computing has been pushing the boundaries of the types of problems that can be solved in many different disciplines. One of the leading fields is computational biophysics where molecular dynamics (MD) simulations provide microscopic resolution details of how biomolecules move, fold, and assemble into intricate […]