Posts
Dec, 13
Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning
Deep learning (DL) frameworks take advantage of GPUs to improve the speed of DL inference and training. Ideally, DL frameworks should be able to fully utilize the computation power of GPUs such that the running time depends on the amount of computation assigned to GPUs. Yet, we observe that in scheduling GPU tasks, existing DL […]
Dec, 13
Efficient code generation for hardware accelerators by refining partially specified implementation
Software-programmable hardware accelerators, such as Graphics Processing Units (GPUs), are specialized processors designed to perform specific tasks more efficiently than general-purpose processors. They trade off generality against specialized data paths and massive parallelism, providing raw processing power that is orders of magnitude higher than that of contemporary multicore CPUs. Unfortunately, finding an efficient […]
Dec, 6
Accelerate Scientific Deep Learning Models on Heterogeneous Computing Platform with FPGA
AI and deep learning are experiencing explosive growth in almost every domain involving analysis of big data. Deep learning using Deep Neural Networks (DNNs) has shown great promise for such scientific data analysis applications. However, traditional CPU-based sequential computing without special instructions can no longer meet the requirements of mission-critical applications, which are compute-intensive and […]
Dec, 6
Exploring FPGA Optimizations to Compute Sparse Numerical Linear Algebra Kernels
The solution of sparse triangular linear systems (sptrsv) is the bottleneck of many numerical methods. Thus, it is crucial to have efficient implementations of this kernel, at least for commonly used platforms. In this sense, Field-Programmable Gate Arrays (FPGAs) have evolved greatly in recent years, entering the HPC hardware ecosystem largely due to […]
Dec, 6
Toward Accurate Platform-Aware Performance Modeling for Deep Neural Networks
In this paper, we provide a fine-grain machine learning-based method, PerfNetV2, which improves the accuracy of our previous work for modeling the neural network performance on a variety of GPU accelerators. Given an application, the proposed method can be used to predict the inference time and training time of the convolutional neural networks used in […]
Dec, 6
High-Throughput Parallel Viterbi Decoder on GPU Tensor Cores
Many research works have implemented the Viterbi decoding algorithm on GPUs instead of FPGAs because this platform provides considerable flexibility in addition to great performance. The recently introduced Tensor cores in modern GPU architectures provide incredible computing capability. This paper proposes a novel parallel implementation of the Viterbi decoding algorithm based on Tensor […]
Dec, 6
Python Workflows on HPC Systems
The recent successes and widespread application of compute-intensive machine learning and data analytics methods have been boosting the usage of the Python programming language on HPC systems. While Python provides many advantages for the users, it has not been designed with a focus on multi-user environments or parallel programming – making it quite […]
Nov, 29
Evaluating the Performance and Portability of Contemporary SYCL Implementations
SYCL is a single-source programming model for heterogeneous systems; it promises improved maintainability, productivity, and opportunity for compiler optimization when compared to accelerator-specific programming models. Several implementations of the SYCL standard have been developed over the past few years, including several backends using contemporary accelerator languages, like OpenCL, CUDA, and HIP. These implementations vary […]
Nov, 29
Efficient Deep Neural Network Inference for Embedded Systems: A Mixture of Experts Approach
Deep neural networks (DNNs) have become one of the dominant machine learning approaches in recent years for many application domains. Unfortunately, DNNs are not well suited to addressing the challenges of embedded systems, where on-device inference on battery-powered, resource-constrained devices is often infeasible due to prohibitively long inferencing time and resource requirements. Furthermore, offloading computation […]
Nov, 29
HALO 1.0: A Hardware-agnostic Accelerator Orchestration Framework for Enabling Hardware-agnostic Programming with True Performance Portability for Heterogeneous HPC
Hardware-agnostic programming with high performance portability will be the bedrock for realizing the ubiquitous adoption of emerging accelerator technologies in future heterogeneous high-performance computing (HPC) systems, which is the key to achieving the next level of HPC performance on an expanding accelerator landscape. In this paper, we present HALO 1.0, an open-ended extensible multi-agent software […]
Nov, 29
BootCMatchG: An adaptive Algebraic MultiGrid linear solver for GPUs
Sparse solvers are one of the building blocks of any technology for reliable and high-performance scientific and engineering computing. In this paper we present a software package which implements an efficient multigrid sparse solver running on Graphics Processing Units. The package is a branch of a wider initiative of software development for sparse Linear Algebra computations on emergent HPC […]
Nov, 29
AZP: Automatic Specialization for Zero Values in Gaming Applications
Recent research has shown that dynamic zeros in shader programs of gaming applications can be effectively leveraged with a profile-guided, code-versioning transform. This transform duplicates code, specializes one path assuming certain key program operands, called versioning variables, are zero, and leaves the other path unspecialized. Dynamically, depending on the versioning variable’s value, either the specialized […]