Sep, 9

Data Layout Oriented Compilation Techniques in Vectorization for Multi-/Many-cores

Single instruction, multiple data (SIMD) architectures are widely adopted in both general-purpose processors and graphic processing units for exploiting data-level parallelism. It is tedious and error-prone for programmers to write high performance code to utilize SIMD execution units on both platforms. Therefore, users often rely on automatic code generation techniques in compilers. However, it is […]
Sep, 7

Generating Custom Code for Efficient Query Execution on Heterogeneous Processors

Processor manufacturers build increasingly specialized processors to mitigate the effects of the power wall to deliver improved performance. Currently, database engines are manually optimized for each processor: A costly and error prone process. In this paper, we propose concepts to enable the database engine to perform per-processor optimization automatically. Our core idea is to create […]
Sep, 7

FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC (High-Performance Computing)), a similarity search system for ultra-high dimensional datasets on a single machine, which does not require similarity computation. Our system is an auspicious illustration of the power of randomized algorithms carefully tailored for high-performance computing platforms. We leverage LSH style randomized […]
Sep, 7

From MPI to MPI+OpenACC: Conversion of a legacy FORTRAN PCG solver for the spherical Laplace equation

A real-world example of adding OpenACC to a legacy MPI FORTRAN Preconditioned Conjugate Gradient code is described, and timing results for multi-node multi-GPU runs are shown. The code is used to obtain three-dimensional spherical solutions to the Laplace equation. Its application is finding potential field solutions of the solar corona, a useful tool in space […]
Sep, 7

Multi-Tasking Scheduling for Heterogeneous Systems

Heterogeneous platforms play an increasingly important role in modern computer systems. They combine high performance with low power consumption. From mobiles to supercomputers, we see an increasing number of computer systems that are heterogeneous. The most well-known heterogeneous system, CPU+GPU platforms have been widely used in recent years. As they become more mainstream, serving multiple […]
Sep, 7

Accelerating recurrent neural network training using sequence bucketing and multi-GPU data parallelization

An efficient algorithm for recurrent neural network training is presented. The approach increases the training speed for tasks where a length of the input sequence may vary significantly. The proposed approach is based on the optimal batch bucketing by input sequence length and data parallelization on multiple graphical processing units. The baseline training performance without […]
Sep, 3

Benchmarking Harp-DAAL: High Performance Hadoop on KNL Clusters

Data analytics is undergoing a revolution in many scientific domains, demanding cost-effective parallel data analysis techniques. Traditional Java-based Big Data processing tools like Hadoop MapReduce are designed for commodity CPUs. In contrast, emerging manycore processors like Xeon Phi has an order of magnitude of computation power and memory bandwidth. To harness the computing capabilities, we […]
Sep, 3

Integer sorting on multicores: some (experiments and) observations

There have been many proposals for sorting integers on multicores/GPUs that include radix-sort and its variants or other approaches that exploit specialized hardware features of a particular multicore architecture. Comparison-based algorithms have also been used. Network-based algorithms have also been used with primary example Batcher’s bitonic sorting algorithm. Although such a latter approach is theoretically […]
Sep, 3

Real-Time Rendering of Molecular Dynamics Simulation Data: A Tutorial

Achieving real-time molecular dynamics rendering is a challenge, especially when the rendering requires intensive computation involving a large simulation data-set. The task becomes even more challenging when the size of the data is too large to fit into random access memory (RAM) and the final imagery depends on the input and output (I/O) performance. The […]
Sep, 3

Towards On-Chip Optical FFTs for Convolutional Neural Networks

Convolutional neural networks have become an essential element of spatial deep learning systems. In the prevailing architecture, the convolution operation is performed with Fast Fourier Transforms (FFT) electronically in GPUs. The parallelism of GPUs provides an efficiency over CPUs, however both approaches being electronic are bound by the speed and power limits of the interconnect […]
Sep, 3

Performance Analysis of Open Source Machine Learning Frameworks for Various Parameters in Single-Threaded and Multi-Threaded Modes

The basic features of some of the most versatile and popular open source frameworks for machine learning (TensorFlow, Deep Learning4j, and H2O) are considered and compared. Their comparative analysis was performed and conclusions were made as to the advantages and disadvantages of these platforms. The performance tests for the de facto standard MNIST data set […]
Aug, 26

Accelerate Local Tone Mapping for High Dynamic Range Images Using OpenCL with GPU

Tone mapping has been used to transfer HDR (high dynamic range) images to low dynamic range. This paper describes an algorithm to display high dynamic range images. Although local tone-mapping operator is better than global operator in reproducing images with better details and contrast, however, local tone mapping algorithm usually requires a huge amount of […]
Page 3 of 92912345...102030...Last »

* * *

* * *

HGPU group © 2010-2017 hgpu.org

All rights belong to the respective authors

Contact us: