Posts
Feb, 2
Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model
We present a library for parallel block-sparse matrix-matrix multiplication on distributed memory clusters. The library is based on the Chunks and Tasks programming model [Parallel Comput. 40, 328 (2014)]. Using this model as matrix library developers, we do not have to deal explicitly with the distribution of work and data or with communication between computational nodes […]
Feb, 2
Montblanc: GPU accelerated Radio Interferometer Measurement Equations in support of Bayesian Inference for Radio Observations
We present Montblanc, a GPU implementation of the Radio interferometer measurement equation (RIME) in support of the Bayesian inference for radio observations (BIRO) technique. BIRO uses Bayesian inference to select sky models that best match the visibilities observed by a radio interferometer. To accomplish this, BIRO evaluates the RIME multiple times, varying sky model parameters […]
Feb, 1
Optimized Data Transfers Based on the OpenCL Event Management Mechanism
In standard OpenCL programming, hosts such as CPUs are supposed to control their compute devices such as GPUs. Since compute devices are dedicated to kernel computation, only hosts can execute several kinds of data transfers such as inter-node communication and file access. These data transfers require one host to simultaneously play two or more roles […]
Feb, 1
In-Memory Data Analytics on Coupled CPU-GPU Architectures
In the big data era, in-memory data analytics is an effective means of achieving high-performance data processing and realizing the value of data in a timely manner. Effort in this direction has been devoted to various aspects, including in-memory algorithmic designs and system optimizations. In this paper, we propose to develop the next-generation in-memory […]
Feb, 1
Mascar: Speeding up GPU Warps by Reducing Memory Pitstops
With the prevalence of GPUs as throughput engines for data parallel workloads, the landscape of GPU computing is changing significantly. Non-graphics workloads with high memory intensity and irregular access patterns are frequently targeted for acceleration on GPUs. While GPUs provide large numbers of compute resources, the resources needed for memory intensive workloads are more scarce. […]
Feb, 1
Productive and Efficient Computational Science Through Domain-specific Abstractions
In an ideal world, scientific applications are computationally efficient, maintainable, and composable, and allow scientists to work very productively. We argue that these goals are achievable for a specific application field by choosing suitable domain-specific abstractions that encapsulate domain knowledge with a high degree of expressiveness. This thesis demonstrates the design and composition of domain-specific […]
Feb, 1
Performance Analysis and Optimization of a Distributed Processing Framework for Data Mining Accelerated with Graphics Processing Units
In this age, a huge amount of data is generated every day by human interactions with services. Discovering patterns in these data is very important for making business decisions. Due to the size of this data, processing it requires very intensive computation. Thus, many frameworks have been developed using Central Processing Units (CPU) […]
Jan, 30
On Vectorization of Deep Convolutional Neural Networks for Vision Tasks
We have recently witnessed many ground-breaking results in machine learning and computer vision, generated by using deep convolutional neural networks (CNN). While the success mainly stems from the large volume of training data and the deep network architectures, the vector processing hardware (e.g. GPU) undisputedly plays a vital role in modern CNN implementations to support […]
Jan, 30
OpenCL Implementation of LiDAR Data Processing
When designing a safety system, the faster the response time, the quicker the system can react to hazards. As commercial interest in autonomous and assisted vehicles grows, safety is the number one concern. If the system cannot react as fast as or faster than an average human, then the public will deem it […]
Jan, 30
Different Optimization Strategies and Performance Evaluation of Reduction on Multicore CUDA Architecture
The objective of this paper is to evaluate different optimization strategies on multicore GPU architecture. Here, for performance evaluation, we have used the parallel reduction algorithm. GPU on-chip shared memory is much faster than local and global memory: shared memory latency is roughly 100x lower than that of non-cached global memory (provided there are no bank […]
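The tree-based reduction the abstract refers to halves the number of active threads at each step, with each "thread" summing a pair of shared-memory elements. A minimal Python sketch of that halving pattern (an illustration of the common CUDA strategy, not the authors' code; the sequential-addressing stride is what avoids shared-memory bank conflicts):

```python
def tree_reduce(data):
    """Pairwise (tree) reduction over a power-of-two-sized array.

    Mirrors the CUDA shared-memory pattern: for stride = n/2, n/4, ..., 1,
    each active index i performs vals[i] += vals[i + stride], so the number
    of active "threads" halves every step and the result ends up in vals[0].
    """
    vals = list(data)
    n = len(vals)
    assert n and (n & (n - 1)) == 0, "sketch assumes a power-of-two size"
    stride = n // 2
    while stride > 0:
        for i in range(stride):          # one "thread" per active index
            vals[i] += vals[i + stride]  # sequential addressing, contiguous pairs
        stride //= 2
    return vals[0]

print(tree_reduce(range(8)))  # → 28, i.e. sum(0..7)
```

Sequential addressing (contiguous `i` and `i + stride`) is preferred over interleaved addressing in the CUDA version precisely because consecutive threads then touch consecutive shared-memory banks, avoiding the bank conflicts the abstract warns about.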
Jan, 30
Accelerate micromagnetic simulations with GPU programming in MATLAB
A finite-difference micromagnetic simulation code written in MATLAB is presented with Graphics Processing Unit (GPU) acceleration. The high performance of the GPU is demonstrated compared to a typical Central Processing Unit (CPU) based code. The GPU-to-CPU speed-up is shown to be greater than 30 for problems with larger sizes on […]
Jan, 30
Design Space Exploration of OpenCL Applications on Heterogeneous Parallel Platforms
Parallel programming is a skill software engineers can no longer do without, since multi- and many-core architectures have been widely adopted for general-purpose computing platforms. In 2006 Intel introduced the first multi-core processor on the consumer market and, at the same time, NVIDIA unveiled CUDA, a programming paradigm to exploit Graphics Processing Units (GPUs) […]