Posts
Dec, 6
Using Data Compression for Increasing Efficiency of Data Transfer Between Main Memory and Intel Xeon Phi Coprocessor or NVidia GPU in Parallel DBMS
The need to transfer data through the PCI Express bus is considered one of the main bottlenecks in programming for manycore coprocessors and GPUs. This paper focuses on using data compression methods, such as RLE, Null Suppression, LZSS, and a combination of RLE and Null Suppression, to increase the efficiency of data transfer between main memory and the coprocessor. […]
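As a rough illustration of two of the methods named in this abstract, the following is a minimal sketch of run-length encoding (RLE) and null suppression applied to a column of non-negative integers. It is not the paper's implementation: the function names, the 4-byte column width, and the big-endian byte layout are all assumptions made for this example.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


def null_suppress(value, width=4):
    """Drop the leading zero bytes of a non-negative integer.

    Assumes a fixed column width in bytes and big-endian layout (both are
    illustrative choices). Returns (kept_byte_count, kept_bytes); at least
    one byte is always kept so that zero stays representable.
    """
    raw = value.to_bytes(width, "big")
    stripped = raw.lstrip(b"\x00") or b"\x00"
    return len(stripped), stripped


def null_restore(stripped, width=4):
    """Pad the suppressed bytes back to the fixed column width."""
    return int.from_bytes(stripped.rjust(width, b"\x00"), "big")


# A small column with long runs and small values, the case where both help.
column = [7, 7, 7, 7, 0, 0, 300, 300, 300]
encoded = rle_encode(column)

# Combining the two: suppress the nulls of each run's value after RLE.
combined = [(null_suppress(value), count) for value, count in encoded]
```

Both schemes only pay off when the data has long runs or many small values; on high-entropy columns the encoded form can be larger than the input, so a transfer path would typically choose per column whether to compress.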
Dec, 6
A Study of Parallel Sorting Algorithms Using CUDA and OpenMP
This thesis compares parallel languages by their time complexity, using sorting algorithms coded in CUDA and OpenMP. It evaluates which solution delivers parallelism at a sustainable cost in money and effort while achieving acceptable timing results when the parallel languages are compared with one another, as well as […]
Dec, 6
Parallel Implementation of Vortex Element Method on CPUs and GPUs
Implementations of the 2D vortex element method adapted to different types of parallel computers are considered. The developed MPI implementation provides close to linear acceleration for a small number of computational cores and approximately 40-fold acceleration on an 80-core cluster when solving a model problem. An OpenMP-based modification yields an additional 5% acceleration due to shared memory usage. Approximate […]
Dec, 6
CuMF: scale matrix factorization using just ONE machine with GPUs
Matrix factorization (MF) is widely used in recommendation systems. We present cuMF, a highly optimized matrix factorization tool with supreme performance on graphics processing units (GPUs) by fully utilizing the GPU compute power and minimizing the overhead of data movement. Firstly, we introduce a memory-optimized alternating least squares (ALS) method by reducing discontiguous memory access and […]
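For readers unfamiliar with alternating least squares, here is a minimal, CPU-only sketch of the general idea, restricted to a rank-1 factorization so that each update has a closed form. It is a textbook-style illustration and not cuMF's memory-optimized method; the function name, the regularization value, and the toy ratings are assumptions for this example.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iters=20):
    """Rank-1 ALS on observed (user, item, value) triples.

    Approximates value ~= u[user] * v[item] by alternately fixing one
    factor and solving the regularized least-squares problem for the other.
    """
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iters):
        # Fix v; each user's factor is an independent 1-D least-squares fit.
        for i in range(n_users):
            obs = [(j, r) for (a, j, r) in ratings if a == i]
            num = sum(r * v[j] for j, r in obs)
            den = reg + sum(v[j] ** 2 for j, _ in obs)
            u[i] = num / den
        # Fix u; each item's factor is solved the same way.
        for j in range(n_items):
            obs = [(i, r) for (i, b, r) in ratings if b == j]
            num = sum(r * u[i] for i, r in obs)
            den = reg + sum(u[i] ** 2 for i, _ in obs)
            v[j] = num / den
    return u, v


# Toy ratings generated from an exactly rank-1 matrix: [1, 2] x [1, 2, 3].
ratings = [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0),
           (1, 0, 2.0), (1, 1, 4.0), (1, 2, 6.0)]
u, v = als_rank1(ratings, n_users=2, n_items=3)
max_error = max(abs(u[i] * v[j] - r) for i, j, r in ratings)
```

Within one half-step, every user (or item) update is independent of the others, which is what makes the method attractive for parallel hardware.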
Dec, 6
Parallelization Methods of the Template Matching Method on Graphics Accelerators
Template matching is a classic technique used in image processing for object detection. It is based on multiple matrix-based calculations with no dependencies on partial results, so parallel solutions can be created. In this article, two GPU-implemented methods are presented and compared to the CPU-based sequential solution.
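To make the "no dependencies on partial results" point concrete, here is a minimal sequential sketch of template matching. The abstract does not say which similarity measure the article uses; sum of squared differences (SSD) is assumed here, and the function names and toy data are illustrative only.

```python
def ssd_at(image, template, top, left):
    """Sum of squared differences between the template and one image window."""
    total = 0
    for r, row in enumerate(template):
        for c, t in enumerate(row):
            diff = image[top + r][left + c] - t
            total += diff * diff
    return total


def match_template(image, template):
    """Return (best_score, top, left) for the window with the lowest SSD.

    Each window's score depends only on the inputs, never on another
    window's score, so every (top, left) pair could be computed by its
    own thread on a parallel device.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best = None
    for top in range(ih - th + 1):
        for left in range(iw - tw + 1):
            score = ssd_at(image, template, top, left)
            if best is None or score < best[0]:
                best = (score, top, left)
    return best


image = [
    [0, 0, 0, 0],
    [0, 9, 8, 0],
    [0, 7, 6, 0],
    [0, 0, 0, 0],
]
template = [[9, 8], [7, 6]]
result = match_template(image, template)
```

Only the final minimum search combines results across windows; in a parallel version that step would become a reduction.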
Dec, 4
The Genetic Convolutional Neural Network Model Based on Random Sample
In a convolutional neural network (CNN), the result of training is affected by the initial values of the weights, so the trained model does not necessarily produce the best feature representation. A genetic algorithm can help choose better features. However, there has been almost no literature on combining genetic […]
Dec, 4
An Accelerator based on the rho-VEX Processor: an Exploration using OpenCL
In recent years, the use of co-processors to accelerate specific tasks has become more common. To simplify the use of these accelerators in software, the OpenCL framework has been developed. This framework provides programs with a cross-platform interface for using accelerators. The rho-VEX processor is a run-time reconfigurable VLIW processor. It allows run-time switching of configurations, […]
Dec, 4
Optimizing CUDA Shared Memory Usage
CUDA shared memory is fast, on-chip storage. However, the bank conflict issue could cause a performance bottleneck. Current NVIDIA Tesla GPUs support memory bank accesses with configurable bit-widths. While this feature provides an efficient bank mapping scheme for 32-bit and 64-bit data types, it becomes trickier to solve the bank conflict problem through manual code […]
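The bank conflict issue can be reasoned about with a small host-side model. The sketch below assumes 32 banks in 4-byte mode, where successive 32-bit words map to successive banks, and a warp of 32 threads in which thread `t` reads word `t * stride`. The function name and this access pattern are assumptions for illustration; the model is not taken from the paper and ignores the 8-byte bank mode the abstract alludes to.

```python
from collections import defaultdict


def conflict_degree(stride_words, num_banks=32, threads=32):
    """Worst-case number of distinct words that fall in the same bank.

    A result of 1 means conflict-free; a result of n means the access is
    serialized into n passes under this simplified model. Threads reading
    the same word are counted once, since that case is a broadcast.
    """
    words_per_bank = defaultdict(set)
    for t in range(threads):
        word = t * stride_words
        words_per_bank[word % num_banks].add(word)
    return max(len(words) for words in words_per_bank.values())


# Degrees for a few common strides (in 32-bit words).
degrees = {stride: conflict_degree(stride) for stride in (0, 1, 2, 32, 33)}
```

Under this model a stride of 32 words is the worst case, while a stride of 33 is conflict-free, which is why padding a 32-column shared array by one extra column is a common manual fix.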
Dec, 4
Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations
The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naive mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations. In […]
Dec, 4
An Efficient Parallel Algorithm for Graph Isomorphism on GPU using CUDA
Modern Graphics Processing Units (GPUs) have high computation power and low cost. Recently, many applications in various fields have been accelerated on the GPU using CUDA. In this paper, we propose an efficient parallel algorithm for graph isomorphism which runs on the GPU using CUDA for matching large graphs. Parallelization of a sequential graph […]
Dec, 1
A General Framework for Constrained Bayesian Optimization using Information-based Search
We present an information-theoretic framework for solving global black-box optimization problems that also have black-box constraints. Of particular interest to us is to efficiently solve problems with decoupled constraints, in which subsets of the objective and constraint functions may be evaluated independently. For example, when the objective is evaluated on a CPU and the constraints […]
Dec, 1
Efficient Static and Dynamic Memory Management Techniques for Multi-GPU Systems
There are four trends in modern high-performance computing (HPC) that have led to an increased need for efficient memory management techniques for heterogeneous systems (such as one fitted with GPUs). First, the average size of datasets for HPC applications is rapidly increasing. Read-only input matrices that used to be on the order of megabytes or […]