Posts
Dec, 6
Parallelization Methods of the Template Matching Method on Graphics Accelerators
Template matching is a classic image-processing technique for object detection. It relies on many matrix-based calculations with no dependencies on partial results, which makes it well suited to parallel implementations. In this article, two GPU implementations are presented and compared with a sequential CPU-based solution.
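To illustrate why the calculations have no dependencies on partial results, here is a minimal sketch of exhaustive template matching. It assumes the sum of absolute differences (SAD) as the similarity measure, which is an illustrative choice; the article's own measure is not stated in this excerpt. Each candidate offset is scored independently, so on a GPU each score could be assigned to its own thread.

```python
# Minimal sketch (assumption: SAD similarity, not necessarily the article's).
# Every candidate offset (y, x) is scored independently of all others,
# which is what allows one GPU thread per output position.

def sad_scores(image, template):
    """Return a 2-D list of SAD scores, one per valid template position."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    scores = []
    for y in range(ih - th + 1):          # on a GPU: one thread per (y, x)
        row = []
        for x in range(iw - tw + 1):
            s = 0
            for dy in range(th):
                for dx in range(tw):
                    s += abs(image[y + dy][x + dx] - template[dy][dx])
            row.append(s)
        scores.append(row)
    return scores


def best_match(image, template):
    """Return (y, x) of the lowest SAD score, i.e. the best match."""
    scores = sad_scores(image, template)
    positions = ((y, x) for y in range(len(scores)) for x in range(len(scores[0])))
    return min(positions, key=lambda p: scores[p[0]][p[1]])


if __name__ == "__main__":
    img = [
        [0, 0, 0, 0, 0],
        [0, 0, 9, 8, 0],
        [0, 0, 7, 6, 0],
        [0, 0, 0, 0, 0],
    ]
    print(best_match(img, [[9, 8], [7, 6]]))  # (1, 2)
```

The final `min` is a reduction; on a GPU it would typically be a separate parallel-reduction step after all scores are computed.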
Dec, 6
A Study of Parallel Sorting Algorithms Using CUDA and OpenMP
This thesis reviews parallel languages in terms of their time complexity, using sorting algorithms coded in CUDA and OpenMP. It evaluates parallel solutions that keep monetary and other costs manageable while achieving acceptable running times when the parallel languages are compared with each other, as well as […]
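The excerpt does not list which sorting algorithms the thesis uses. As an illustration only, here is a sketch of bitonic sort, a sorting network commonly chosen for GPUs because every compare-exchange within a pass is independent. The loop over `i` is written sequentially here; it is the part a CUDA kernel (one thread per `i`) or an OpenMP `parallel for` would distribute.

```python
def bitonic_sort(values):
    """Return a sorted copy of `values` using a bitonic sorting network.

    Assumes len(values) is a power of two, as the classic network does.
    """
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:                # k: size of the bitonic blocks being merged
        j = k // 2
        while j > 0:             # j: distance between compared elements
            # Every compare-exchange in this pass touches a disjoint pair,
            # so the loop over i could be one CUDA thread per i or an
            # OpenMP `parallel for`; a barrier is needed between passes.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (ascending and a[i] > a[partner]) or (
                        not ascending and a[i] < a[partner]
                    ):
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


print(bitonic_sort([7, 3, 6, 1, 8, 2, 5, 4]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The network performs O(n log² n) comparisons, more than a sequential O(n log n) sort, but its fixed, data-independent access pattern is what makes it easy to parallelize.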
Dec, 4
The Genetic Convolutional Neural Network Model Based on Random Sample
The training result of a convolutional neural network (CNN) is affected by the initial values of the weights, so the trained model does not necessarily express the best features. A genetic algorithm can help select better characteristics. However, there is almost no literature studying the combination of genetic […]
Dec, 4
An Accelerator based on the rho-VEX Processor: an Exploration using OpenCL
In recent years, co-processors have increasingly been used to accelerate specific tasks. The OpenCL framework was developed to simplify the use of these accelerators in software; it provides programs with a cross-platform interface for using accelerators. The rho-VEX processor is a run-time reconfigurable VLIW processor. It allows run-time switching of configurations, […]
Dec, 4
Optimizing CUDA Shared Memory Usage
CUDA shared memory is fast on-chip storage. However, bank conflicts can cause a performance bottleneck. Current NVIDIA Tesla GPUs support memory bank accesses with configurable bit-widths. While this feature provides an efficient bank mapping scheme for 32-bit and 64-bit data types, it becomes trickier to solve the bank conflict problem through manual code […]
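A small model helps show what a bank conflict is. The sketch below assumes 32 banks and the default 4-byte bank width, where consecutive 32-bit words map to consecutive banks; the configurable 8-byte mode (selected on Kepler with `cudaDeviceSetSharedMemConfig`) is not modeled. It computes the conflict degree when a warp reads one column of a shared-memory tile, and shows the common padding fix (tile width 33 instead of 32). The function names are hypothetical.

```python
NUM_BANKS = 32  # assumption: 32 banks, 4-byte (32-bit) bank width


def conflict_degree(word_addresses, num_banks=NUM_BANKS):
    """Max number of distinct 32-bit words a warp requests from one bank.

    1 means conflict-free; k means the access is serialized k ways.
    """
    counts = [0] * num_banks
    for w in set(word_addresses):      # repeated words are broadcast
        counts[w % num_banks] += 1
    return max(counts)


def column_access(tile_width, column=0, warp_size=32):
    """Word indices touched when thread t reads tile[t][column]."""
    return [t * tile_width + column for t in range(warp_size)]


print(conflict_degree(column_access(32)))  # 32: every thread hits one bank
print(conflict_degree(column_access(33)))  # 1: padding spreads the column
```

With width 32, element `[t][c]` always lands in bank `c`; with width 33 it lands in bank `(t + c) % 32`, so the 32 threads of a warp hit 32 different banks.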
Dec, 4
Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations
The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naive mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations. In […]
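One widely used way to avoid underutilization on irregular nested loops is to flatten them: an exclusive prefix sum over the inner-loop sizes gives each (outer, inner) iteration a flat index, and each flat index can become one GPU thread regardless of how unbalanced the inner loops are. This is a generic load-balancing technique shown for illustration, not necessarily one of the paper's templates; the helper names are hypothetical.

```python
from bisect import bisect_right


def flatten_offsets(inner_sizes):
    """Exclusive prefix sum: offsets[i] is the first flat index of outer i.

    offsets[-1] is the total number of inner iterations.
    """
    offsets = [0]
    for size in inner_sizes:
        offsets.append(offsets[-1] + size)
    return offsets


def flat_to_nested(flat_idx, offsets):
    """Map one flat work item back to its (outer, inner) iteration.

    On a GPU, each thread would run this binary search for its own index.
    """
    outer = bisect_right(offsets, flat_idx) - 1   # skips empty inner loops
    return outer, flat_idx - offsets[outer]


offs = flatten_offsets([2, 0, 3])   # outer iteration 1 has no inner work
print([flat_to_nested(f, offs) for f in range(offs[-1])])
# [(0, 0), (0, 1), (2, 0), (2, 1), (2, 2)]
```

The trade-off is one binary search per thread in exchange for perfectly balanced work, which usually pays off when inner-loop sizes vary widely.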
Dec, 4
An Efficient Parallel Algorithm for Graph Isomorphism on GPU using CUDA
Modern Graphics Processing Units (GPUs) offer high computational power at low cost, and many applications in various fields have recently been accelerated on the GPU using CUDA. In this paper, we propose an efficient parallel algorithm for graph isomorphism that runs on the GPU using CUDA to match large graphs. Parallelization of a sequential graph […]
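The excerpt does not describe the paper's algorithm. For orientation, here is a plain sequential backtracking matcher with degree-based candidate pruning, an illustrative baseline of the kind such work starts from. The candidate-filtering step is independent per vertex, which is the sort of work that maps naturally onto GPU threads; the search itself is left sequential. Graphs are assumed to be undirected adjacency dicts, and all names are hypothetical.

```python
def candidate_sets(g1, g2):
    """For each vertex u of g1, the vertices of g2 with the same degree.

    Each u is filtered independently of the others (a data-parallel step).
    """
    return {u: [v for v in g2 if len(g2[v]) == len(g1[u])] for u in g1}


def is_isomorphic(g1, g2):
    """Backtracking search for an edge-preserving bijection g1 -> g2.

    Graphs are undirected adjacency dicts: {vertex: set(neighbours)}.
    """
    if len(g1) != len(g2):
        return False
    if sum(len(n) for n in g1.values()) != sum(len(n) for n in g2.values()):
        return False
    cands = candidate_sets(g1, g2)
    order = sorted(g1, key=lambda u: len(cands[u]))  # most constrained first
    mapping, used = {}, set()

    def extend(i):
        if i == len(order):
            return True
        u = order[i]
        for v in cands[u]:
            if v in used:
                continue
            # Adjacency AND non-adjacency to every mapped vertex must match.
            if all((w in g1[u]) == (mapping[w] in g2[v]) for w in mapping):
                mapping[u] = v
                used.add(v)
                if extend(i + 1):
                    return True
                del mapping[u]
                used.discard(v)
        return False

    return extend(0)


c4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
c4_relabeled = {"a": {"c", "d"}, "b": {"c", "d"}, "c": {"a", "b"}, "d": {"a", "b"}}
print(is_isomorphic(c4, c4_relabeled))  # True
```

Worst-case running time is still exponential; pruning only shrinks the search tree, which is why large inputs motivate parallel approaches like the one the paper proposes.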
Dec, 1
Programming in CUDA for Kepler and Maxwell Architecture
Since the first version of CUDA was launched, GPU computing has improved considerably. Every new CUDA version has introduced important novel features, bringing this architecture ever closer to a typical high-performance parallel language. This tutorial presents the GPU architecture and CUDA principles, trying to conceptualize the novel features included by […]
Dec, 1
Auxiliary Image Regularization for Deep CNNs with Noisy Labels
Precisely labeled data sets with a sufficient number of samples are essential for training deep convolutional neural networks (CNNs). However, many available real-world data sets contain mislabeled samples, and label errors in the training data make learning a well-performing deep CNN model a daunting task. In this work, we consider […]
Dec, 1
A General Framework for Constrained Bayesian Optimization using Information-based Search
We present an information-theoretic framework for solving global black-box optimization problems that also have black-box constraints. We are particularly interested in efficiently solving problems with decoupled constraints, in which subsets of the objective and constraint functions may be evaluated independently. For example, when the objective is evaluated on a CPU and the constraints […]
Dec, 1
Efficient Static and Dynamic Memory Management Techniques for Multi-GPU Systems
There are four trends in modern high-performance computing (HPC) that have led to an increased need for efficient memory management techniques for heterogeneous systems (such as one fitted with GPUs). First, the average size of datasets for HPC applications is rapidly increasing. Read-only input matrices that used to be on the order of megabytes or […]
Dec, 1
Bridging OpenCL and CUDA: A Comparative Analysis and Translation
Heterogeneous systems are widening their user-base, and heterogeneous computing is becoming popular in supercomputing. Among others, OpenCL and CUDA are the most popular programming models for heterogeneous systems. Although OpenCL inherited many features from CUDA and they have almost the same platform model, they are not compatible with each other. In this paper, we present […]
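Because the two models share almost the same platform model, many kernel-side constructs correspond one-to-one. The sketch below lists well-known correspondences for dimension 0 and applies them as naive token substitution. It is illustrative only: a real translator must also add `__global` address-space qualifiers to pointer arguments, rewrite host-side API calls, and handle templates and other CUDA C++ features. The table and function name are hypothetical.

```python
import re

# Well-known CUDA -> OpenCL C kernel correspondences (dimension 0 only;
# .y/.z map to argument 1/2 in the same way).
CUDA_TO_OPENCL = {
    "__global__": "__kernel",
    "__shared__": "__local",
    "__syncthreads()": "barrier(CLK_LOCAL_MEM_FENCE)",
    "threadIdx.x": "get_local_id(0)",
    "blockIdx.x": "get_group_id(0)",
    "blockDim.x": "get_local_size(0)",
    "gridDim.x": "get_num_groups(0)",
}

# Longest keys first so a longer token is never shadowed by a shorter one.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(CUDA_TO_OPENCL, key=len, reverse=True))
)


def translate(cuda_src):
    """Naive token substitution of CUDA kernel built-ins into OpenCL C."""
    return _PATTERN.sub(lambda m: CUDA_TO_OPENCL[m.group(0)], cuda_src)


print(translate("int i = blockIdx.x * blockDim.x + threadIdx.x;"))
# int i = get_group_id(0) * get_local_size(0) + get_local_id(0);
```

When no global work offset is used, the translated expression equals OpenCL's `get_global_id(0)`, so a translator could emit that single call instead.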