Posts
Feb, 17
Software-Defined FPGA Accelerator Design for Mobile Deep Learning Applications
Recently, the field of deep learning has received great attention from the scientific community and is used to provide improved solutions to many computer vision problems. Convolutional neural networks (CNNs) have been successfully used to tackle problems such as object recognition, object detection, semantic segmentation, and scene understanding. The rapid development of deep learning […]
Feb, 17
DeeperLab: Single-Shot Image Parser
We present a single-shot, bottom-up approach for whole image parsing. Whole image parsing, also known as Panoptic Segmentation, generalizes the tasks of semantic segmentation for ‘stuff’ classes and instance segmentation for ‘thing’ classes, assigning both semantic and instance labels to every pixel in an image. Recent approaches to whole image parsing typically employ separate standalone […]
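To make the task definition concrete, here is an illustrative sketch (not code from the paper; the label encoding is an assumed, Cityscapes-style convention) of the per-pixel output that whole image parsing produces: a semantic class for every pixel plus an instance id for 'thing' classes.

```cpp
// panoptic_labels.cpp -- illustrative per-pixel output of whole image parsing:
// each pixel gets a semantic class plus an instance id for "thing" classes.
// The encoding below (class_id * 1000 + instance_id) is one common convention,
// used here only as an example; it is not taken from the paper.
#include <vector>
#include <cstdint>
#include <cstddef>

struct PanopticPixel {
    std::uint16_t semantic_class;  // e.g. road, sky ("stuff") or car, person ("thing")
    std::uint16_t instance_id;     // 0 for "stuff", >0 to separate individual "things"
};

// Flatten the (class, instance) pair into a single id per pixel.
std::vector<std::uint32_t> encode(const std::vector<PanopticPixel>& pixels) {
    std::vector<std::uint32_t> ids(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i)
        ids[i] = static_cast<std::uint32_t>(pixels[i].semantic_class) * 1000u
               + pixels[i].instance_id;
    return ids;
}
```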
Feb, 17
GPU Accelerated Keccak (SHA3) Algorithm
Hash functions such as SHA-1 and MD5 are among the most important cryptographic primitives, especially in the field of information integrity. Considering that a growing number of methods have been proposed to break these hash algorithms, a competition for a new family of hash functions was held by the US National Institute of Standards and Technology. […]
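For orientation only (not from the paper), computing a SHA-3 digest on the CPU with OpenSSL's EVP interface looks like the snippet below; EVP_sha3_256 requires OpenSSL 1.1.1 or later, and it is this Keccak-based primitive that the paper accelerates on GPUs.

```cpp
// sha3_digest.cpp -- CPU reference for a SHA3-256 digest via OpenSSL's EVP API
// (requires OpenSSL >= 1.1.1); shown only to illustrate the primitive that the
// paper moves to the GPU.
#include <openssl/evp.h>
#include <cstdio>
#include <cstring>

int main() {
    const char* msg = "integrity check me";
    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int len = 0;

    EVP_MD_CTX* ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha3_256(), nullptr);   // select SHA3-256
    EVP_DigestUpdate(ctx, msg, std::strlen(msg));      // absorb the message
    EVP_DigestFinal_ex(ctx, digest, &len);             // squeeze the digest
    EVP_MD_CTX_free(ctx);

    for (unsigned int i = 0; i < len; ++i) std::printf("%02x", digest[i]);
    std::printf("\n");
    return 0;
}
```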
Feb, 10
Performance Evaluation of OpenMP’s Target Construct on GPUs: Exploring Compiler Optimizations
OpenMP is a directive-based shared memory parallel programming model and has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP’s high-level parallel abstractions with accelerator programming. This extension allows programmers to write GPU programs in standard C/C++ or Fortran languages, without exposing too many details of GPU […]
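As a minimal sketch of the construct being evaluated (illustrative only; the problem size and build flags are assumptions, and the combined directive shown is OpenMP 4.5 syntax), offloading a SAXPY loop to an accelerator looks like this:

```cpp
// saxpy_target.cpp -- minimal OpenMP target-offload sketch (illustrative only).
// Build flags vary by compiler, e.g.:
//   clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda saxpy_target.cpp
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;                  // problem size chosen arbitrarily
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    const float a = 3.0f;

    float* xp = x.data();
    float* yp = y.data();

    // Map x to the device, y to and from the device, and distribute the loop
    // across teams and threads on the accelerator.
    #pragma omp target teams distribute parallel for map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];

    std::printf("y[0] = %f\n", yp[0]);      // expect 5.0
    return 0;
}
```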
Feb, 10
AXC: A new format to perform the SpMV oriented to Intel Xeon Phi architecture in OpenCL
Emerging new architectures used in High Performance Computing require new research to adapt and optimise algorithms to them. As part of this effort, we propose the new AXC format to improve the performance of the SpMV product for the Intel Xeon Phi coprocessor. The performance of the OpenCL kernel, based on our new format, is compared with […]
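The excerpt does not describe the AXC layout itself; for context, the plain CSR sparse matrix-vector product below is the usual baseline that specialised SpMV formats aim to outperform on wide-SIMD hardware such as the Xeon Phi (a generic sketch, not the paper's kernel).

```cpp
// csr_spmv.cpp -- reference CSR SpMV, the usual baseline that specialised
// sparse formats (such as the AXC format proposed in the paper) are compared
// against. Irregular row lengths are what such formats typically restructure.
#include <vector>
#include <cstddef>

// y = A * x for a sparse matrix A stored in CSR form; y must be pre-sized.
void spmv_csr(const std::vector<std::size_t>& row_ptr,  // size rows + 1
              const std::vector<int>& col_idx,          // size nnz
              const std::vector<double>& values,        // size nnz
              const std::vector<double>& x,
              std::vector<double>& y) {
    const std::size_t rows = row_ptr.size() - 1;
    for (std::size_t i = 0; i < rows; ++i) {
        double sum = 0.0;
        for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += values[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```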
Feb, 10
Optimising Convolutional Neural Networks Inference on Low-Powered GPUs
In this paper we present effective optimisation techniques for accelerating convolutional neural network inference on low-powered heterogeneous devices with OpenCL. Using LeNet and VGG-16 as test networks, we implement a custom neural network system in OpenCL and optimise it to minimise their inference times. Our baseline system shows a speedup of 17x for LeNet. We […]
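For context (a naive sketch, not the paper's OpenCL code), the loop nest below is the direct 2-D convolution at the heart of CNN inference; it is this computation that tiling, vectorisation, and memory-layout optimisations target on low-powered GPUs.

```cpp
// conv_direct.cpp -- naive direct 2-D convolution (single channel, valid
// padding, stride 1), shown only to make the optimisation target concrete;
// shapes and naming are illustrative.
#include <vector>
#include <cstddef>

std::vector<float> conv2d(const std::vector<float>& in, int H, int W,
                          const std::vector<float>& kernel, int K) {
    const int OH = H - K + 1, OW = W - K + 1;
    std::vector<float> out(static_cast<std::size_t>(OH) * OW, 0.0f);
    for (int oy = 0; oy < OH; ++oy)
        for (int ox = 0; ox < OW; ++ox) {
            float acc = 0.0f;
            for (int ky = 0; ky < K; ++ky)          // slide the KxK window
                for (int kx = 0; kx < K; ++kx)
                    acc += in[(oy + ky) * W + (ox + kx)] * kernel[ky * K + kx];
            out[oy * OW + ox] = acc;
        }
    return out;
}
```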
Feb, 10
Heterogeneous Distributed Big Data Clustering on Sparse Grids
Clustering is an important task in data mining that has become more challenging due to the ever-increasing size of available datasets. To cope with these big data scenarios, a high-performance clustering approach is required. Sparse grid clustering is a density-based clustering method that uses a sparse grid density estimation as its central building block. The […]
Feb, 10
Hierarchical Matrix Operations on GPUs: Matrix-Vector Multiplication and Compression
Hierarchical matrices are space and time efficient representations of dense matrices that exploit the low rank structure of matrix blocks at different levels of granularity. The hierarchically low rank block partitioning produces representations that can be stored and operated on in near-linear complexity instead of the usual polynomial complexity of dense matrices. In this paper, […]
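To make the complexity argument concrete (a generic sketch, not the paper's data structure): a block stored in low-rank form as U·Vᵀ with rank k can be applied to a vector in O((m+n)k) operations by forming t = Vᵀx first, instead of the O(mn) a dense block would require.

```cpp
// lowrank_matvec.cpp -- applying a low-rank block U * V^T to a vector.
// This is the elementary operation behind hierarchical-matrix matvec;
// the row-major layout and naming here are illustrative assumptions.
#include <vector>
#include <cstddef>

// U is m x k, V is n x k (both row-major); computes y += U * (V^T * x).
void lowrank_matvec(const std::vector<double>& U, std::size_t m,
                    const std::vector<double>& V, std::size_t n,
                    std::size_t k,
                    const std::vector<double>& x, std::vector<double>& y) {
    std::vector<double> t(k, 0.0);
    for (std::size_t j = 0; j < n; ++j)      // t = V^T * x  -- O(n*k) work
        for (std::size_t r = 0; r < k; ++r)
            t[r] += V[j * k + r] * x[j];
    for (std::size_t i = 0; i < m; ++i)      // y += U * t   -- O(m*k) work
        for (std::size_t r = 0; r < k; ++r)
            y[i] += U[i * k + r] * t[r];
}
```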
Feb, 3
ThunderGBM: Fast GBDTs and Random Forests on GPUs
Gradient Boosting Decision Trees (GBDTs) and Random Forests (RFs) have been used in many real-world applications. They are often a standard recipe for building state-of-the-art solutions to machine learning and data mining problems. However, training and prediction are computationally very expensive for large, high-dimensional problems. This article presents an efficient and open source […]
Feb, 3
Swizzle Inventor: Data Movement Synthesis for GPU Kernels
Utilizing memory and register bandwidth in modern architectures may require irregular data placement and movement, such as shuffles and broadcasts. We develop Swizzle Inventor to help programmers implement swizzle algorithms, by writing programs that omit swizzles and delegating the creation of those swizzles to an automatic synthesizer. Our synthesis algorithm scales to real-world programs, allowing […]
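As a conceptual illustration only (plain C++ standing in for GPU warp-level primitives; the lane count and XOR pattern are assumptions), a 'swizzle' permutes which lane's register value each lane reads, and it is this kind of index arithmetic that the synthesizer generates.

```cpp
// xor_shuffle_sketch.cpp -- CPU-side emulation of an XOR "swizzle" across a
// group of 32 lanes, illustrating the kind of data-movement pattern being
// synthesised; this is not the tool's output and not a GPU intrinsic.
#include <array>
#include <cstdio>

int main() {
    constexpr int kLanes = 32;                 // assumed warp/sub-group width
    std::array<int, kLanes> reg;
    for (int lane = 0; lane < kLanes; ++lane) reg[lane] = lane * 10;

    // Each lane reads the value held by lane ^ mask -- a butterfly exchange.
    constexpr int mask = 4;
    std::array<int, kLanes> shuffled;
    for (int lane = 0; lane < kLanes; ++lane)
        shuffled[lane] = reg[lane ^ mask];

    std::printf("lane 0 now holds %d (read from lane %d)\n",
                shuffled[0], 0 ^ mask);
    return 0;
}
```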
Feb, 3
The OoO VLIW JIT Compiler for GPU Inference
Current trends in Machine Learning (ML) inference on hardware accelerated devices (e.g., GPUs, TPUs) point to alarmingly low utilization. As ML inference is increasingly time-bounded by tight latency SLOs, increasing data parallelism is not an option. The need for better efficiency motivates GPU multiplexing. Furthermore, existing GPU programming abstractions force programmers to micro-manage GPU resources […]
Feb, 3
High Performance Algorithms for Counting Collisions and Pairwise Interactions
The problem of counting collisions or interactions is common in areas such as computer graphics and scientific simulations. Since it is a major bottleneck in applications in these areas, a great deal of research has been done on this subject, mainly focused on techniques that allow calculations to be performed within pruned sets of objects. This paper […]
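As a baseline sketch (not the paper's algorithm), counting collisions among n spheres with an all-pairs test costs O(n²) distance checks; this is exactly the cost that pruning techniques such as spatial grids or bounding-volume hierarchies reduce.

```cpp
// count_collisions.cpp -- brute-force O(n^2) collision counting for spheres,
// shown only as the baseline that pruned/accelerated methods improve upon.
#include <vector>
#include <cstddef>

struct Sphere { float x, y, z, r; };

std::size_t count_collisions(const std::vector<Sphere>& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        for (std::size_t j = i + 1; j < s.size(); ++j) {
            const float dx = s[i].x - s[j].x;
            const float dy = s[i].y - s[j].y;
            const float dz = s[i].z - s[j].z;
            const float rs = s[i].r + s[j].r;
            if (dx * dx + dy * dy + dz * dz <= rs * rs)  // spheres overlap
                ++count;
        }
    return count;
}
```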