Posts
May, 19
Automatic Virtualization of Accelerators
Applications are migrating en masse to the cloud, while accelerators such as GPUs, TPUs, and FPGAs proliferate in the wake of Moore’s Law. These technological trends are incompatible. Cloud applications run on virtual platforms, but traditional I/O virtualization techniques have not provided production-ready solutions for accelerators. As a result, cloud providers expose accelerators by using […]
May, 19
OpenDNN: An Open-source, cuDNN-like Deep Learning Primitive Library
Deep neural networks (DNNs) are a key enabler of today’s intelligent applications and services. cuDNN is the de-facto standard library of deep learning primitives, which makes it easy to develop sophisticated DNN models. However, cuDNN is proprietary software from NVIDIA, and thus does not allow the user to customize it based on her needs. […]
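For readers unfamiliar with what a "deep learning primitive" call looks like, the sketch below shows a cuDNN-style forward convolution set up through descriptors. The shapes, the fixed algorithm choice, and the minimal error handling are assumptions made for brevity; the code is illustrative and not taken from the paper.

```
// Minimal sketch of a cuDNN forward convolution (shapes are illustrative).
#include <cudnn.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    // Input: 1 image, 3 channels, 224x224; filter: 64 output channels, 3x3 kernels.
    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnCreateConvolutionDescriptor(&convDesc);

    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 3, 224, 224);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 64, 3, 3, 3);
    cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    // Let the library compute the output shape, then describe the output tensor.
    int n, c, h, w;
    cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc, &n, &c, &h, &w);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);

    float *x, *wts, *y;
    cudaMalloc(&x,   sizeof(float) * 1 * 3 * 224 * 224);
    cudaMalloc(&wts, sizeof(float) * 64 * 3 * 3 * 3);
    cudaMalloc(&y,   sizeof(float) * n * c * h * w);

    // Query the workspace the chosen algorithm needs, then invoke the primitive.
    cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;
    size_t wsBytes = 0;
    cudnnGetConvolutionForwardWorkspaceSize(handle, xDesc, wDesc, convDesc, yDesc, algo, &wsBytes);
    void* ws = nullptr;
    if (wsBytes > 0) cudaMalloc(&ws, wsBytes);

    const float alpha = 1.0f, beta = 0.0f;
    cudnnStatus_t st = cudnnConvolutionForward(handle, &alpha, xDesc, x, wDesc, wts,
                                               convDesc, algo, ws, wsBytes,
                                               &beta, yDesc, y);
    printf("convolution status: %s\n", cudnnGetErrorString(st));

    // Cleanup.
    if (ws) cudaFree(ws);
    cudaFree(x); cudaFree(wts); cudaFree(y);
    cudnnDestroyTensorDescriptor(xDesc);
    cudnnDestroyTensorDescriptor(yDesc);
    cudnnDestroyFilterDescriptor(wDesc);
    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroy(handle);
    return 0;
}
```

An open-source, cuDNN-like library such as OpenDNN would presumably expose the same descriptor-plus-primitive pattern while leaving the implementation open to user customization.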
May, 15
CUDA au Coq: A Framework for Machine-validating GPU Assembly Programs
A prototype framework for formal, machine-checked validation of GPU pseudo-assembly code algorithms using the Coq proof assistant is presented and discussed. The framework is the first to afford GPU programmers a reliable means of formally machine-validating high-assurance GPU computations without trusting any specific source-to-assembly compilation toolchain. A formal operational semantics for the PTX pseudo-assembly language […]
May, 15
A Unified Approach to Variable Renaming for Enhanced Vectorization
Despite the fact that compiler technologies for automatic vectorization have been under development for over four decades, there are still considerable gaps in the capabilities of modern compilers to perform automatic vectorization for SIMD units. One such gap can be found in the handling of loops with dependence cycles that involve memory-based anti (write-after-read) and […]
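To make the targeted dependence pattern concrete, here is an illustrative sketch (not code from the paper; renaming-via-copy is only one of several possible renaming transformations):

```
#include <vector>

// Illustrative only: a loop whose two statements form a cycle of anti
// (write-after-read) dependences.
//   S1 reads b[i] that S2 overwrites later in the same iteration (anti S1 -> S2), and
//   S2 reads a[i+1] that S1 overwrites on the next iteration (loop-carried anti S2 -> S1).
// The cycle prevents straightforward vectorization of the loop as written.
void original(float* a, float* b, int n) {
    for (int i = 0; i < n - 1; ++i) {
        a[i] = b[i] + 1.0f;      // S1
        b[i] = a[i + 1] * 2.0f;  // S2
    }
}

// Renaming the anti-dependent reads onto a copy of 'a' breaks the loop-carried part
// of the cycle: S2 now reads pre-loop values from a_old, so no statement reads a
// location that a later iteration writes, and the loop becomes vectorizable.
void renamed(float* a, float* b, int n) {
    std::vector<float> a_old(a, a + n);
    for (int i = 0; i < n - 1; ++i) {
        a[i] = b[i] + 1.0f;          // S1
        b[i] = a_old[i + 1] * 2.0f;  // S2 reads the renamed copy instead of a[i+1]
    }
}
```

The extra copy trades additional memory traffic for the removal of the loop-carried anti dependence, which is the kind of cost/benefit decision a unified renaming approach has to weigh.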
May, 15
An optimizing multi-platform source-to-source compiler framework for the NEURON MODeling Language
Domain-specific languages (DSLs) play an increasingly important role in the generation of high performing software. They allow the user to exploit specific knowledge encoded in the constructs for the generation of code adapted to a particular hardware architecture; at the same time, they make it easier to generate optimized code for a multitude of platforms […]
May, 12
Improving Resource Efficiency in Virtualized Datacenters
In recent years there has been an extraordinary growth of the Internet of Things (IoT) and its protocols. The increasing diffusion of electronic devices with identification, computing and communication capabilities is laying the groundwork for the emergence of a highly distributed service and networking environment. The above-mentioned situation implies that there is an increasing demand […]
May, 12
FPGA Implementation of Reduced Precision Convolutional Neural Networks
With improvements in processing systems, machine learning applications are finding widespread use in almost all sectors of technology. Image recognition is one application of machine learning that has become widely popular, with various architectures and systems aimed at improving recognition performance. With classification accuracy now approaching its saturation point, many researchers are now focusing on […]
May, 12
Arbitrarily large iterative tomographic reconstruction on multiple GPUs using the TIGRE toolbox
Tomographic image sizes keep increasing over time, and while the GPUs that compute the reconstruction are also growing in memory size, they are not doing so fast enough to reconstruct the largest datasets. This problem is often solved by reconstructing data on large clusters of GPUs with enough devices to fit the measured X-ray […]
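As a rough back-of-the-envelope illustration of the memory gap (the 2048^3 volume size is an assumption chosen for round numbers, not a figure from the paper): a single-precision volume of that size already needs 32 GiB, which the sketch below compares against the memory reported by the local GPU.

```
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Assumed volume size for illustration: 2048^3 single-precision voxels = 32 GiB.
    const size_t n = 2048;
    const size_t volumeBytes = n * n * n * sizeof(float);

    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);  // memory on the current GPU

    printf("volume needs  %.1f GiB\n", volumeBytes / (1024.0 * 1024.0 * 1024.0));
    printf("GPU 0 total   %.1f GiB (free %.1f GiB)\n",
           totalBytes / (1024.0 * 1024.0 * 1024.0),
           freeBytes  / (1024.0 * 1024.0 * 1024.0));

    // If the volume does not fit on one device, it has to be partitioned across GPUs.
    if (totalBytes > 0) {
        size_t gpusNeeded = (volumeBytes + totalBytes - 1) / totalBytes;
        printf("at least %zu GPU(s) of this size needed just to hold the volume\n", gpusNeeded);
    }
    return 0;
}
```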
May, 12
Predictable GPGPU Computing in DNN-Driven Autonomous Systems
Graphics processing units (GPUs) are being widely used as co-processors in many domains to accelerate general-purpose workloads that are data-parallel and computationally intensive, i.e., GPGPU. An emerging usage domain is adopting GPGPU to accelerate inherently computation-intensive Deep Neural Network (DNN) workloads in autonomous systems. Such autonomous systems are usually time-sensitive, especially for autonomous driving systems. […]
May, 12
Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs
General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall & skinny matrices, which are much taller than wide. Nvidia’s current CUBLAS implementation delivers only a fraction of the potential performance (as given by the roofline model) in this case. We describe the challenges […]
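For context, the sketch below shows how such a multiplication is typically issued through cuBLAS; the M = 2^20, K = N = 16 shape is an assumption chosen to illustrate "much taller than wide", and the call is the same generic GEMM entry point used for square matrices.

```
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    // Illustrative "tall & skinny" shape: M is huge, K and N are tiny.
    const int M = 1 << 20;  // 1,048,576 rows
    const int K = 16;
    const int N = 16;

    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * M * K);   // M x K, column-major
    cudaMalloc(&B, sizeof(float) * K * N);   // K x N
    cudaMalloc(&C, sizeof(float) * M * N);   // M x N
    cudaMemset(A, 0, sizeof(float) * M * K); // zero-fill only to keep the example
    cudaMemset(B, 0, sizeof(float) * K * N); // self-contained and well-defined
    cudaMemset(C, 0, sizeof(float) * M * N);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C = alpha * A * B + beta * C, issued through the same GEMM entry point as a
    // square multiplication; the abstract notes CUBLAS reaches only a fraction of
    // the roofline limit for shapes like this.
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                M, N, K,
                &alpha, A, M,   // lda = M
                        B, K,   // ldb = K
                &beta,  C, M);  // ldc = M
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```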
May, 8
FPGA-based acceleration of a particle simulation High Performance Computing application
This thesis studies the possibility of introducing FPGAs into the world of High Performance Computing (HPC) systems. Such systems are hybrid platforms that exploit the parallel computation of GPUs in order to reach very high performance. Nevertheless, GPU-based systems are power-hungry, with a power consumption so large that […]
May, 8
Characterizing and Detecting CUDA Program Bugs
While CUDA has become a major parallel computing platform and programming model for general-purpose GPU computing, CUDA-induced bug patterns have not yet been well explored. In this paper, we conduct the first empirical study to reveal important categories of CUDA program bug patterns based on 319 bugs identified within 5 popular CUDA projects in GitHub. […]
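As one concrete flavor of CUDA-specific defect (an illustrative, commonly discussed pitfall, not claimed to be one of the paper's identified categories): kernel launches are asynchronous and return no status, so a failed launch can pass silently unless the error state is queried explicitly, as sketched below.

```
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, sizeof(float) * n);

    // Buggy pattern: a kernel launch is asynchronous and has no return value, so an
    // invalid configuration (here, 2048 threads per block exceeds the 1024 limit)
    // silently does nothing if the error state is never queried.
    scale<<<(n + 2047) / 2048, 2048>>>(d, n, 2.0f);

    // Fix: check the launch error and synchronize to surface execution errors.
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t syncErr   = cudaDeviceSynchronize();
    if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n",
                cudaGetErrorString(launchErr != cudaSuccess ? launchErr : syncErr));
    }

    cudaFree(d);
    return 0;
}
```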