Posts
May, 23
ImageCL: An Image Processing Language for Performance Portability on Heterogeneous Systems
Modern computer systems typically combine multicore CPUs with accelerators like GPUs for improved performance and energy efficiency. However, these systems suffer from poor performance portability: code tuned for one device must be retuned to achieve high performance on another. Image processing is increasing in importance, with applications ranging from seismology and medicine to Photoshop. Based […]
May, 23
Deep Roots: Improving CNN Efficiency with Hierarchical Filter Groups
We propose a new method for training computationally efficient and compact convolutional neural networks (CNNs) using a novel sparse connection structure that resembles a tree root. Our sparse connection structure facilitates a significant reduction in computational cost and number of parameters of state-of-the-art deep CNNs without compromising accuracy. We validate our approach by using it […]
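The parameter savings that filter groups provide can be illustrated with a quick calculation. The following is only a sketch of the general grouped-convolution arithmetic, with hypothetical layer sizes, not figures from the paper:

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution layer with filter groups.

    With `groups` filter groups, each output channel connects to only
    c_in / groups input channels, dividing the weight count by `groups`.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    return c_out * (c_in // groups) * k * k

# Hypothetical layer: 256 -> 256 channels, 3x3 kernel
dense = conv_params(256, 256, 3)             # 589,824 weights
grouped = conv_params(256, 256, 3, groups=8) # 73,728 weights
print(dense, grouped, dense // grouped)      # 8x fewer parameters
```

The same division applies to the multiply-accumulate count, which is where the computational savings come from.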
May, 23
Graphics Supercomputing Applied to Brain Image Analysis with NiftyReg
Medical image processing in general and brain image processing in particular are computationally intensive tasks. Luckily, this burden can be alleviated by means of techniques such as GPU programming. In this article we study NiftyReg, a brain image processing library with a GPU implementation using CUDA, and analyse different possible ways of further optimising the […]
May, 23
A Practical Performance Model for Compute and Memory Bound GPU Kernels
Performance prediction of GPU kernels is generally a tedious procedure with unpredictable results. In this paper, we provide a practical model for estimating performance of CUDA kernels on GPU hardware in an automated manner. First, we propose the quadrant-split model, an alternative to the roofline visual performance model, which provides insight on the performance limiting […]
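The teaser does not detail the quadrant-split model itself, but the classic roofline model it offers an alternative to can be sketched in a few lines. This is the standard textbook bound, not the paper's model; the peak-rate and bandwidth numbers are hypothetical:

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Attainable performance under the classic roofline model.

    A kernel with arithmetic intensity below the ridge point
    (peak_gflops / bandwidth_gbs) is memory-bound; above it, compute-bound.
    """
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

# Hypothetical GPU: 5000 GFLOP/s peak, 500 GB/s memory bandwidth.
# The ridge point sits at 10 FLOP/byte.
print(roofline_gflops(5000, 500, 2))   # memory-bound: 1000 GFLOP/s
print(roofline_gflops(5000, 500, 40))  # compute-bound: 5000 GFLOP/s
```

Plotting this bound against measured kernel performance is what makes the roofline a *visual* model; the quadrant-split model proposed here refines which limiting factor a kernel falls under.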
May, 21
The Hitchhiker’s Guide to Cross-Platform OpenCL Application Development
One of the benefits of programming in OpenCL is platform portability. That is, an OpenCL program that follows the OpenCL specification should, in principle, execute reliably on any platform that supports OpenCL. To assess the current state of OpenCL portability, we provide an experience report examining two sets of open source benchmarks that we attempted […]
May, 21
Architecture-Adaptive Code Variant Tuning
Code variants represent alternative implementations of a computation, and are common in high-performance libraries and applications to facilitate selecting the most appropriate implementation for a specific execution context (target architecture and input dataset). Automating code variant selection typically relies on machine learning to construct a model during an offline learning phase that can be quickly […]
May, 21
GPU-based Pedestrian Detection for Autonomous Driving
Pedestrian detection has gained a lot of prominence during the last few years. Besides the fact that it is one of the hardest tasks within computer vision, it involves huge computational costs. Obtaining acceptable real-time performance, measured in frames per second (fps), for the most advanced algorithms is nowadays a hard challenge. In this work, […]
May, 21
Performance Evaluation of Parallel Count Sort using GPU Computing with CUDA
OBJECTIVE: Sorting is considered a very important application in many areas of computer science. Nowadays, parallelization of sorting algorithms using GPU computing on CUDA hardware is increasing rapidly. The objective behind using GPU computing is that users can achieve greater speedup of their algorithms. METHODS: In this paper, we have focused on count […]
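Count sort decomposes naturally into phases that a CUDA implementation can parallelize. A minimal Python sketch of those phases, purely illustrative and not the paper's implementation:

```python
from itertools import accumulate

def count_sort(data, max_key):
    """Counting sort for integer keys in [0, max_key], structured as the
    three phases a GPU version typically runs as separate kernels."""
    # Phase 1: histogram -- on a GPU, one thread per element using atomic adds.
    counts = [0] * (max_key + 1)
    for x in data:
        counts[x] += 1
    # Phase 2: exclusive prefix sum over the histogram -- a parallel scan.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Phase 3: scatter each element to its final position.
    out = [0] * len(data)
    for x in data:
        out[offsets[x]] = x
        offsets[x] += 1
    return out

print(count_sort([3, 1, 4, 1, 5, 9, 2, 6], 9))  # [1, 1, 2, 3, 4, 5, 6, 9]
```

Each phase does O(n) or O(max_key) independent work, which is why count sort maps well onto thousands of GPU threads.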
May, 21
Employing Directive Based Compression Solutions on Accelerators Global Memory under OpenACC
Programmers invest extensive development effort to optimize a GPU program to achieve peak performance. Achieving this requires efficient usage of global memory and avoiding memory bandwidth underutilization. The OpenACC programming model has been introduced to tackle the accelerators programming complexity. However, this model's coarse-grained control over a program can make the memory bandwidth utilization […]
May, 17
GPU-Accelerated Feature Tracking
The motivation of this research is to prove that GPUs can provide significant speedup of long-executing image processing algorithms by way of parallelization and massive data throughput. This thesis accelerates the well-known KLT feature tracking algorithm using OpenCL and an NVidia GeForce GTX 780 GPU. KLT is a fast, efficient and accurate feature tracker but […]
May, 17
DeepLearningKit – an GPU Optimized Deep Learning Framework for Apple’s iOS, OS X and tvOS developed in Metal and Swift
In this paper we present DeepLearningKit – an open source framework that supports using pretrained deep learning models (convolutional neural networks) for iOS, OS X and tvOS. DeepLearningKit is developed in Metal in order to utilize the GPU efficiently and Swift for integration with applications, e.g. iOS-based mobile apps on iPhone/iPad, tvOS-based apps for the […]
May, 17
A Foray into Efficient Mapping of Algorithms to Hardware Platforms on Heterogeneous Systems
Heterogeneous computing can potentially offer significant performance and performance per watt improvements over homogeneous computing, but the question "what is the ideal mapping of algorithms to architectures?" remains an open one. In the past couple of years new types of computing devices such as FPGAs have come into general computing use. In this work we […]

