Posts
May, 28
Theano-MPI: a Theano-based Distributed Training Framework
We develop a scalable and extendable training framework that can utilize GPUs across nodes in a cluster and accelerate the training of deep learning models based on data parallelism. Both synchronous and asynchronous training are implemented in our framework, where parameter exchange among GPUs is based on CUDA-aware MPI. In this report, we analyze the […]
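Synchronous data parallelism can be shown in a small single-process sketch. It assumes a toy 1-D linear model and replaces the MPI allreduce with a plain Python average. The function names (`shard`, `local_gradient`, `allreduce_mean`, `sync_step`) are hypothetical and do not come from Theano-MPI. The sketch shows that averaging per-worker gradients over equal-sized shards reproduces the full-batch gradient, which is what keeps synchronous workers in lockstep.

```python
# Single-process sketch of synchronous data-parallel SGD.
# Assumption: the MPI allreduce is simulated by averaging in Python.
# This is not Theano-MPI's code; all names here are hypothetical.

def shard(data, n_workers):
    """Split data into n_workers contiguous, equal-sized shards (drops any remainder)."""
    size = len(data) // n_workers
    return [data[i * size:(i + 1) * size] for i in range(n_workers)]

def local_gradient(w, batch):
    """Gradient of mean squared error for the 1-D linear model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def allreduce_mean(values):
    """Stand-in for an MPI allreduce (sum) followed by division by the world size."""
    return sum(values) / len(values)

def sync_step(w, data, n_workers, lr):
    """Every worker computes a local gradient, then all apply the same averaged update."""
    grads = [local_gradient(w, b) for b in shard(data, n_workers)]
    return w - lr * allreduce_mean(grads)

if __name__ == "__main__":
    data = [(x, 3.0 * x) for x in range(1, 9)]  # true weight is 3.0
    w = 0.0
    for _ in range(200):
        w = sync_step(w, data, n_workers=4, lr=0.01)
    print(round(w, 4))
```

On a real cluster the averaging step would become a collective operation over the workers' gradient or parameter buffers, for example through a CUDA-aware MPI implementation as the abstract describes. The arithmetic stays the same.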
May, 26
Faster GPU-based convolutional gridding via thread coarsening
Convolutional gridding is a processor-intensive step in interferometric imaging. While it is possible to use graphics processing units (GPUs) to accelerate this operation, existing methods use only a fraction of the available flops. We apply thread coarsening to improve the efficiency of an existing algorithm, and observe performance gains of up to 3.2x for single-polarization […]
May, 26
Learning a Metric Embedding for Face Recognition using the Multibatch Method
This work is motivated by the engineering task of achieving near state-of-the-art face recognition on a minimal computing budget running on an embedded system. Our main technical contribution centers on a novel training method, called Multibatch, for similarity learning, i.e., for the task of generating an invariant "face signature" through training pairs of "same" […]
May, 26
Implementing Deep Neural Networks for Financial Market Prediction on the Intel Xeon Phi
Deep neural networks (DNNs) are powerful types of artificial neural networks (ANNs) that use several hidden layers. They have recently gained considerable attention in the speech transcription and image recognition communities (Krizhevsky et al., 2012) for their superior predictive properties, including robustness to overfitting. However, their application to financial market prediction has not been previously […]
May, 26
PROJECTION Algorithm for Motif Finding on GPUs
Motif finding is an NP-complete problem in computational biology. Existing nondeterministic algorithms for motif finding do not guarantee the global optimality of results and are sensitive to initial parameters. To address this problem, the PROJECTION algorithm provides a good initial estimate that can be further refined using local optimization algorithms such as EM, […]
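For background, PROJECTION (Buhler and Tompa) seeds motif search with random projection. It picks k of the l positions in a motif-length window and hashes every l-mer in the input by the characters at those positions. Buckets that collect unusually many l-mers then serve as candidate starting points for EM. The sketch below is a sequential Python illustration of that bucketing step under those assumptions. It is not the GPU implementation from the paper, and `projection_buckets` and `enriched` are hypothetical names. A full run would repeat the projection with many random position sets.

```python
import random
from collections import defaultdict

def projection_buckets(sequences, l, k, seed=0):
    """Hash every l-mer by the characters at k randomly chosen positions."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(l), k))  # the random projection
    buckets = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p] for p in positions)
            buckets[key].append((seq_id, start))  # remember where each l-mer came from
    return positions, buckets

def enriched(buckets, threshold):
    """Buckets holding at least `threshold` l-mers; these would seed EM refinement."""
    return {key: hits for key, hits in buckets.items() if len(hits) >= threshold}

if __name__ == "__main__":
    seqs = ["AACGTTA", "GGACGTT", "ACGTTCC"]  # toy input with "ACGTT" planted in each
    positions, buckets = projection_buckets(seqs, l=5, k=3, seed=1)
    print(positions, enriched(buckets, threshold=3))
```

Because only k of the l positions are compared, mutated copies of the motif can still land in the same bucket. This is why projection tolerates substitutions that defeat exact-match counting.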
May, 26
Vulnerable GPU Memory Management: Towards Recovering Raw Data from GPU
In this paper, we show that the security threats arising from existing GPU memory management strategies have been overlooked. This opens a back door through which adversaries can freely break memory isolation: an adversary without any privileges on a computer can directly recover the raw memory data left behind by previous processes. More importantly, such attacks can […]
May, 23
Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks
Convolutional neural networks (CNNs) have achieved major breakthroughs in recent years. Their performance in computer vision has matched, and in some areas even surpassed, human capabilities. Deep neural networks can capture complex non-linear features; however, this ability comes at the cost of high computational and memory requirements. State-of-the-art networks require billions of arithmetic operations and […]
May, 23
ImageCL: An Image Processing Language for Performance Portability on Heterogeneous Systems
Modern computer systems typically combine multicore CPUs with accelerators like GPUs for improved performance and energy efficiency. However, these systems suffer from poor performance portability: code tuned for one device must be retuned to achieve high performance on another. Image processing is increasing in importance, with applications ranging from seismology and medicine to Photoshop. Based […]
May, 23
Deep Roots: Improving CNN Efficiency with Hierarchical Filter Groups
We propose a new method for training computationally efficient and compact convolutional neural networks (CNNs) using a novel sparse connection structure that resembles a tree root. Our sparse connection structure facilitates a significant reduction in computational cost and number of parameters of state-of-the-art deep CNNs without compromising accuracy. We validate our approach by using it […]
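The savings from filter groups come down to simple arithmetic. In a convolution split into g groups, each output channel sees only c_in / g input channels, so both the weight count and the multiply-accumulate count shrink by a factor of g. The helper below computes this for a generic grouped convolution, ignoring biases. It is not the specific root-like hierarchy proposed in the paper, whose details are cut off above. The layer sizes in the example are arbitrary.

```python
def conv_weights(c_in: int, c_out: int, k: int, groups: int = 1) -> int:
    """Weight count of a k x k convolution with `groups` filter groups (no bias)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    # Each of the c_out filters spans k * k * (c_in // groups) input values.
    return c_out * k * k * (c_in // groups)

def conv_macs(c_in: int, c_out: int, k: int, h_out: int, w_out: int, groups: int = 1) -> int:
    """Multiply-accumulates: every weight is applied once per output pixel."""
    return conv_weights(c_in, c_out, k, groups) * h_out * w_out

if __name__ == "__main__":
    # Arbitrary example layer: 256 -> 256 channels, 3x3 kernel, 14x14 output.
    for g in (1, 2, 4, 8):
        print(g, conv_weights(256, 256, 3, g), conv_macs(256, 256, 3, 14, 14, g))
```

The trade-off is that a grouped layer cannot mix information across groups on its own. Architectures built on filter groups therefore rely on other layers, such as 1x1 convolutions, to recombine channels.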
May, 23
Graphics Supercomputing Applied to Brain Image Analysis with NiftyReg
Medical image processing in general, and brain image processing in particular, are computationally intensive tasks. Fortunately, they can be made more widely accessible through techniques such as GPU programming. In this article we study NiftyReg, a brain image processing library with a GPU implementation using CUDA, and analyse different possible ways of further optimising the […]
May, 23
A Practical Performance Model for Compute and Memory Bound GPU Kernels
Performance prediction of GPU kernels is generally a tedious procedure with unpredictable results. In this paper, we provide a practical model for estimating the performance of CUDA kernels on GPU hardware in an automated manner. First, we propose the quadrant-split model, an alternative to the roofline visual performance model, which provides insight into the performance limiting […]
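For reference, the classic roofline model bounds a kernel's attainable throughput by min(peak compute, memory bandwidth × arithmetic intensity). Arithmetic intensity here means FLOPs per byte of DRAM traffic. The quadrant-split model above is positioned as an alternative to this model and is not reproduced here. The sketch below evaluates only the roofline bound. The peak and bandwidth figures are hypothetical round numbers, not measurements of any particular GPU.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Roofline bound: memory-bound below the ridge point, compute-bound above it."""
    # GB/s * FLOP/byte = GFLOP/s
    return min(peak_gflops, bandwidth_gbs * intensity)

def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_gflops / bandwidth_gbs

if __name__ == "__main__":
    PEAK = 10_000.0  # hypothetical peak compute, GFLOP/s
    BW = 500.0       # hypothetical DRAM bandwidth, GB/s
    # SAXPY in float32: 2 FLOPs per element; 12 bytes moved (read x, read y, write y).
    saxpy_ai = 2 / 12
    print(ridge_point(PEAK, BW))                  # FLOP/byte where the bound flattens
    print(attainable_gflops(PEAK, BW, saxpy_ai))  # memory-bound kernel
    print(attainable_gflops(PEAK, BW, 50.0))      # compute-bound kernel
```

A kernel whose intensity lies left of the ridge point cannot reach peak compute, however well its arithmetic is optimized. This is the kind of bottleneck insight such visual models aim to provide.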
May, 21
The Hitchhiker’s Guide to Cross-Platform OpenCL Application Development
One of the benefits of programming in OpenCL is platform portability. That is, an OpenCL program that follows the OpenCL specification should, in principle, execute reliably on any platform that supports OpenCL. To assess the current state of OpenCL portability, we provide an experience report examining two sets of open source benchmarks that we attempted […]