Posts
Feb, 10
Performance Evaluation of OpenMP’s Target Construct on GPUs: Exploring Compiler Optimizations
OpenMP is a directive-based shared memory parallel programming model and has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP’s high-level parallel abstractions with accelerator programming. This extension allows programmers to write GPU programs in standard C/C++ or Fortran languages, without exposing too many details of GPU […]
Feb, 10
AXC: A new format to perform the SpMV oriented to Intel Xeon Phi architecture in OpenCL
Emerging new architectures used in High Performance Computing require new research to adapt and optimise algorithms to them.As part of this effort, we propose the newAXC format to improve the performance of the SpMV product for the Intel Xeon Phi coprocessor. The performance of the OpenCL kernel, based on our new format, is compared with […]
Feb, 10
Optimising Convolutional Neural Networks Inference on Low-Powered GPUs
In this paper we present effective optimisation techniques for accelerating convolutional neural networks inference on low-powered heterogeneous devices with OpenCL. Using LeNet and VGG-16 as test networks, we implement a custom neural network system in OpenCL and optimise it to minimise their inference times. Our baseline system shows a speedup of 17x for LeNet. We […]
Feb, 10
Heterogeneous Distributed Big Data Clustering on Sparse Grids
Clustering is an important task in data mining that has become more challenging due to the ever-increasing size of available datasets. To cope with these big data scenarios, a high-performance clustering approach is required. Sparse grid clustering is a density-based clustering method that uses a sparse grid density estimation as its central building block. The […]
Feb, 10
Hierarchical Matrix Operations on GPUs: Matrix-Vector Multiplication and Compression
Hierarchical matrices are space and time efficient representations of dense matrices that exploit the low rank structure of matrix blocks at different levels of granularity. The hierarchically low rank block partitioning produces representations that can be stored and operated on in near-linear complexity instead of the usual polynomial complexity of dense matrices. In this paper, […]
Feb, 3
ThunderGBM: Fast GBDTs and Random Forests on GPUs
Gradient Boosting Decision Trees (GBDTs) and Random Forests (RFs) have been used in many real-world applications. They are often a standard recipe for building state-of-the-art solutions to machine learning and data mining problems. However, training and prediction are very expensive computationally for large and high dimensional problems. This article presents an efficient and open source […]
Feb, 3
Swizzle Inventor: Data Movement Synthesis for GPU Kernels
Utilizing memory and register bandwidth in modern architectures may require irregular data placement and movement, such as shuffles and broadcasts. We develop Swizzle Inventor to help programmers implement swizzle algorithms, by writing programs that omit swizzles and delegating the creation of those swizzles to an automatic synthesizer. Our synthesis algorithm scales to real-world programs, allowing […]
Feb, 3
The OoO VLIW JIT Compiler for GPU Inference
Current trends in Machine Learning (ML) inference on hardware accelerated devices (e.g., GPUs, TPUs) point to alarmingly low utilization. As ML inference is increasingly time-bounded by tight latency SLOs, increasing data parallelism is not an option. The need for better efficiency motivates GPU multiplexing. Furthermore, existing GPU programming abstractions force programmers to micro-manage GPU resources […]
Feb, 3
High Performance Algorithms for Counting Collisions and Pairwise Interactions
The problem of counting collisions or interactions is common in areas as computer graphics and scientific simulations. Since it is a major bottleneck in applications of these areas, a lot of research has been done on such subject, mainly focused on techniques that allow calculations to be performed within pruned sets of objects. This paper […]
Feb, 3
DeepSZ: A Novel Framework to Compress Deep Neural Networks by Using Error-Bounded Lossy Compression
DNNs have been quickly and broadly exploited to improve the data analysis quality in many complex science and engineering applications. Today’s DNNs are becoming deeper and wider because of increasing demand on the analysis quality and more and more complex applications to resolve. The wide and deep DNNs, however, require large amounts of resources, significantly […]
Jan, 27
Exploiting OpenMP & OpenACC to Accelerate a Molecular Docking Mini-App in Heterogeneous HPC Nodes
In drug discovery, molecular docking is the task in charge of estimating the position of a molecule when interacting with the docking site. This task is usually used to perform screening of a large library of molecules, in the early phase of the process. Given the amount of candidate molecules and the complexity of the […]
Jan, 27
Supporting mixed-datatype matrix multiplication within the BLIS framework
We approach the problem of implementing mixed-datatype support within the general matrix multiplication (GEMM) operation of the BLIS framework, whereby each matrix operand A, B, and C may be stored as single- or double-precision real or complex values. Another factor of complexity, whereby the computation is allowed to take place in a precision different from […]