Posts
Dec, 6
A Survey Of Architectural Techniques for Managing Process Variation
Process variation – deviation in parameters from their nominal specifications – threatens to slow down or even halt technological scaling, and mitigating it is key to continuing the benefits of chip miniaturization. In this paper, we present a survey of architectural techniques for managing process variation (PV) in modern processors. We also classify these techniques […]
Dec, 6
Using Data Compression for Increasing Efficiency of Data Transfer Between Main Memory and Intel Xeon Phi Coprocessor or NVidia GPU in Parallel DBMS
The need to transfer data over the PCI Express bus is considered one of the main bottlenecks in programming for manycore coprocessors and GPUs. This paper focuses on using data compression methods, such as RLE, Null Suppression, LZSS, and a combination of RLE and Null Suppression, to increase the efficiency of data transfer between main memory and the coprocessor. […]
Dec, 6
CuMF: scale matrix factorization using just ONE machine with GPUs
Matrix factorization (MF) is widely used in recommendation systems. We present cuMF, a highly optimized matrix factorization tool with supreme performance on graphics processing units (GPUs), achieved by fully utilizing the GPU compute power and minimizing the overhead of data movement. First, we introduce a memory-optimized alternating least squares (ALS) method that reduces discontiguous memory access and […]
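As background for the ALS method the abstract mentions, here is a toy rank-1 ALS sweep on a dense matrix: fix one factor, solve for the other in closed form, then swap. cuMF's real solver is rank-k, sparse, and GPU-resident; this is only a minimal sketch of the alternating idea, with hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One ALS sweep for the rank-1 model R ≈ u * v^T (dense, toy-sized).
// Fixing v, the least-squares solution is u_i = (Σ_j R_ij v_j) / (Σ_j v_j^2);
// then the same update is applied to v with u fixed.
void als_step(const std::vector<std::vector<double>>& R,
              std::vector<double>& u, std::vector<double>& v) {
    size_t n = R.size(), m = v.size();
    double vv = 0; for (double x : v) vv += x * x;
    for (size_t i = 0; i < n; ++i) {
        double num = 0;
        for (size_t j = 0; j < m; ++j) num += R[i][j] * v[j];
        u[i] = num / vv;
    }
    double uu = 0; for (double x : u) uu += x * x;
    for (size_t j = 0; j < m; ++j) {
        double num = 0;
        for (size_t i = 0; i < n; ++i) num += R[i][j] * u[i];
        v[j] = num / uu;
    }
}
```

For an exactly rank-1 matrix a single sweep already reconstructs it; on real rating matrices the sweeps are iterated, and the inner sums are exactly the memory-bound loops a GPU implementation must lay out contiguously.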
Dec, 6
Parallelization Methods of the Template Matching Method on Graphics Accelerators
Template matching is a classic technique used in image processing for object detection. It is based on multiple matrix-based calculations with no dependencies between partial results, so parallel solutions can be created. In this article, two GPU-implemented methods are presented and compared to a CPU-based sequential solution.
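A minimal CPU sketch of the underlying computation, assuming a sum-of-squared-differences (SSD) matching criterion (the article's excerpt does not name its exact metric): slide the template over the image and keep the position with the lowest SSD. Each window's score is independent of the others, which is exactly the property that makes the GPU versions possible.

```cpp
#include <limits>
#include <vector>

struct Match { int row, col; double ssd; };

// Exhaustive template matching by SSD over all valid window positions.
// Every (r, c) score is independent, so a GPU can compute them in parallel.
Match match_template(const std::vector<std::vector<double>>& img,
                     const std::vector<std::vector<double>>& tpl) {
    int H = img.size(), W = img[0].size();
    int h = tpl.size(), w = tpl[0].size();
    Match best{0, 0, std::numeric_limits<double>::max()};
    for (int r = 0; r + h <= H; ++r)
        for (int c = 0; c + w <= W; ++c) {
            double ssd = 0;
            for (int i = 0; i < h; ++i)
                for (int j = 0; j < w; ++j) {
                    double d = img[r + i][c + j] - tpl[i][j];
                    ssd += d * d;
                }
            if (ssd < best.ssd) best = {r, c, ssd};
        }
    return best;
}
```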
Dec, 6
A Study of Parallel Sorting Algorithms Using CUDA and OpenMP
This thesis compares the parallel languages by their computational complexity, in terms of time, using sorting algorithms coded in CUDA and OpenMP. The thesis evaluates whether parallelism can be achieved at a maintainable cost in money and other effort, while delivering acceptable timings when the parallel languages are compared with each other, as well as […]
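The excerpt does not show the thesis's own kernels, so as a stand-in here is the simplest possible CPU analogue of a parallel sort in the spirit of an OpenMP two-section version: sort each half in its own thread, then merge. This is a sketch of the pattern, not the thesis's code.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Two-way parallel sort: each half is sorted concurrently (the analogue of
// two OpenMP sections or two CUDA blocks), then the halves are merged.
std::vector<int> parallel_sort(std::vector<int> v) {
    auto mid = v.begin() + v.size() / 2;
    // The two threads touch disjoint ranges of v, so no synchronization
    // beyond the join is needed.
    std::thread left([&]{ std::sort(v.begin(), mid); });
    std::sort(mid, v.end());
    left.join();
    std::inplace_merge(v.begin(), mid, v.end());
    return v;
}
```

Generalizing to p threads (recursive halving plus a merge tree) is the usual next step, and timing that against a single `std::sort` is the kind of comparison the thesis performs with CUDA and OpenMP.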
Dec, 6
Parallel Implementation of Vortex Element Method on CPUs and GPUs
Implementations of the 2D vortex element method adapted to different types of parallel computers are considered. The developed MPI implementation provides close-to-linear speedup for a small number of computational cores and approximately 40x speedup on an 80-core cluster when solving a model problem. An OpenMP-based modification obtains an additional 5% speedup from shared-memory usage. Approximate […]
Dec, 4
The Genetic Convolutional Neural Network Model Based on Random Sample
In a convolutional neural network (CNN), the result of training is affected by the initial values of the weights, so the trained model does not necessarily express the best features. A genetic algorithm can help select better characteristics, but there has been almost no literature studying the combination of genetic […]
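For readers unfamiliar with the genetic-algorithm half of the combination, here is a toy GA (tournament selection, one-point crossover, bit-flip mutation) maximizing the number of 1-bits. In the paper's setting the genome would instead encode candidate weight initializations; this sketch only shows the GA mechanics, and all names are hypothetical.

```cpp
#include <algorithm>
#include <random>
#include <vector>

using Genome = std::vector<int>;

// Toy fitness: count of 1-bits (stand-in for validation accuracy of a CNN
// trained from the encoded initialization).
int fitness(const Genome& g) {
    int s = 0; for (int b : g) s += b; return s;
}

Genome evolve(std::vector<Genome> pop, int generations, std::mt19937& rng) {
    int len = pop[0].size();
    std::uniform_int_distribution<int> pick(0, pop.size() - 1);
    std::uniform_int_distribution<int> cut(1, len - 1);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    for (int gen = 0; gen < generations; ++gen) {
        std::vector<Genome> next;
        while (next.size() < pop.size()) {
            // Tournament selection: the fitter of two random individuals.
            const Genome& a = pop[pick(rng)], &b = pop[pick(rng)];
            const Genome& p1 = fitness(a) >= fitness(b) ? a : b;
            const Genome& c = pop[pick(rng)], &d = pop[pick(rng)];
            const Genome& p2 = fitness(c) >= fitness(d) ? c : d;
            // One-point crossover at a random cut.
            int x = cut(rng);
            Genome child(p1.begin(), p1.begin() + x);
            child.insert(child.end(), p2.begin() + x, p2.end());
            // Rare bit-flip mutation keeps diversity.
            for (int& bit : child)
                if (coin(rng) < 0.01) bit ^= 1;
            next.push_back(std::move(child));
        }
        pop = std::move(next);
    }
    return *std::max_element(pop.begin(), pop.end(),
        [](const Genome& l, const Genome& r){ return fitness(l) < fitness(r); });
}
```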
Dec, 4
An Accelerator based on the rho-VEX Processor: an Exploration using OpenCL
In recent years, the use of co-processors to accelerate specific tasks has become more common. To simplify the use of these accelerators in software, the OpenCL framework has been developed. This framework provides programs with a cross-platform interface for using accelerators. The rho-VEX processor is a run-time reconfigurable VLIW processor. It allows run-time switching of configurations, […]
Dec, 4
Optimizing CUDA Shared Memory Usage
CUDA shared memory is fast, on-chip storage. However, bank conflicts can cause a performance bottleneck. Current NVIDIA Tesla GPUs support memory bank accesses with configurable bit-widths. While this feature provides an efficient bank-mapping scheme for 32-bit and 64-bit data types, it becomes trickier to solve the bank conflict problem through manual code […]
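To see what a bank conflict is, here is a small host-side model, assuming the common 32-bank, 4-byte-word configuration: a 32-wide column access through an unpadded 32-word-stride tile lands every thread of a warp on the same bank, while padding the stride to 33 spreads them across all banks. The padding trick is standard CUDA practice; the model itself is only an illustration, not the paper's method.

```cpp
#include <algorithm>
#include <array>

// Shared memory is interleaved across NUM_BANKS banks of 4-byte words;
// consecutive words go to consecutive banks.
constexpr int NUM_BANKS = 32;

// Bank touched by element (row, col) of a tile with 'stride' words per row.
int bank_of(int row, int col, int stride) {
    return (row * stride + col) % NUM_BANKS;
}

// Worst-case number of threads hitting one bank when a 32-thread warp reads
// one column (thread t reads element (t, col)). 1 means conflict-free.
int max_conflict(int stride, int col) {
    std::array<int, NUM_BANKS> hits{};
    for (int t = 0; t < 32; ++t) ++hits[bank_of(t, col, stride)];
    int worst = 0;
    for (int h : hits) worst = std::max(worst, h);
    return worst;
}
```

With `stride = 32` the column access is a 32-way conflict (fully serialized); declaring the tile as `[32][33]` in a real kernel makes the same access conflict-free at the cost of one padding word per row.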
Dec, 4
Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations
The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naive mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations. In […]
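One widely used parallelization template for irregular nested loops (whether it is among the templates this work proposes is not visible in the excerpt) is flattening: turn "for each row i, for each j in row i" into a single flat index space via a prefix sum of the row lengths, so equally sized chunks of work can be handed to GPU threads. A CPU sketch, with hypothetical names:

```cpp
#include <cstddef>
#include <vector>

struct Pos { int row, elem; };

// Exclusive prefix sum of row lengths: off[i] is the flat index of the first
// element of row i, and off.back() is the total work-item count.
std::vector<int> offsets(const std::vector<int>& lens) {
    std::vector<int> off(lens.size() + 1, 0);
    for (size_t i = 0; i < lens.size(); ++i) off[i + 1] = off[i] + lens[i];
    return off;
}

// Map a flat work-item id back to (row, element-within-row).
Pos locate(const std::vector<int>& off, int id) {
    int row = 0;
    while (off[row + 1] <= id) ++row;  // a real kernel would binary-search
    return {row, id - off[row]};
}
```

The payoff is load balance: rows with wildly different lengths no longer map to wildly different amounts of per-thread work, which is precisely the resource underutilization the naive mapping suffers from.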
Dec, 4
An Efficient Parallel Algorithm for Graph Isomorphism on GPU using CUDA
Modern Graphics Processing Units (GPUs) offer high computational power at low cost. Recently, many applications in various fields have been accelerated on the GPU using CUDA. In this paper, we propose an efficient parallel algorithm for graph isomorphism that runs on the GPU using CUDA to match large graphs. Parallelization of a sequential graph […]
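For reference, the problem being parallelized can be stated as the brute-force check below: two graphs are isomorphic if some vertex permutation maps one adjacency matrix onto the other. Each candidate permutation is tested independently, which is what makes the search amenable to GPU-style parallelism; the paper's actual algorithm is certainly more refined than this exhaustive sketch.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Brute-force isomorphism test for small graphs given as 0/1 adjacency
// matrices: try every permutation p and check a[i][j] == b[p[i]][p[j]].
bool isomorphic(const std::vector<std::vector<int>>& a,
                const std::vector<std::vector<int>>& b) {
    int n = a.size();
    if ((int)b.size() != n) return false;
    std::vector<int> p(n);
    std::iota(p.begin(), p.end(), 0);   // start from the identity permutation
    do {
        bool ok = true;
        for (int i = 0; i < n && ok; ++i)
            for (int j = 0; j < n && ok; ++j)
                if (a[i][j] != b[p[i]][p[j]]) ok = false;
        if (ok) return true;
    } while (std::next_permutation(p.begin(), p.end()));
    return false;
}
```

The n! search space is exactly why sequential solvers prune aggressively and why mapping the independent permutation checks onto thousands of GPU threads is attractive for large graphs.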
Dec, 1
Programming in CUDA for Kepler and Maxwell Architecture
Since the first version of CUDA was launched, many improvements have been made in GPU computing. Every new CUDA version has included important novel features, bringing this architecture ever closer to a typical parallel high-performance language. This tutorial will present the GPU architecture and CUDA principles, trying to conceptualize the novel features included by […]