Posts
Apr, 18
Efficient Large-Scale Language Model Training on GPU Clusters
Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these large models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on a single GPU or even on a multi-GPU server; and b) the number of compute operations required […]
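The memory limit in (a) is what motivates splitting a single model across devices. As a minimal illustration, here is a layer-wise (pipeline-style) partition across two GPUs, assuming PyTorch; the layer sizes and stage boundaries are arbitrary, not the configuration used in the paper.

```python
# Illustrative sketch: place consecutive layers on different GPUs so that
# no single device has to hold all of the model's parameters.
import torch
import torch.nn as nn

stage0 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:1")

x = torch.randn(8, 4096, device="cuda:0")
h = stage0(x)               # first stage runs on GPU 0
y = stage1(h.to("cuda:1"))  # activations move to GPU 1 for the second stage
```

Real training systems pipeline micro-batches through such stages and combine this with data and tensor parallelism to keep every GPU busy.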
Apr, 18
A Hybrid Parallelization Approach for Distributed and Scalable Deep Learning
Recently, Deep Neural Networks (DNNs) have achieved great success in handling medical and other complex classification tasks. However, as the size of a DNN model and of the available dataset grows, the training process becomes more complex and computationally intensive, and usually takes longer to complete. In this work, we propose a generic […]
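For contrast with the layer partitioning sketched above, the other common axis is data parallelism: each device holds a full replica of the model, processes its own shard of the data, and gradients are averaged across replicas; a hybrid approach combines both axes. Below is a minimal sketch of the gradient-averaging step, assuming PyTorch and an already-initialized process group (setup omitted). It is a generic illustration, not the method proposed in the paper.

```python
# Sketch of the data-parallel reduction step: every worker computes
# gradients on its own data shard, then all-reduces them so each
# replica applies the same averaged update.
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
```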
Apr, 11
Multiple-Tasks on Multiple-Devices (MTMD): Exploiting Concurrency in Heterogeneous Managed Runtimes
Modern commodity systems are equipped with a plethora of heterogeneous devices serving different purposes. Being able to exploit such heterogeneous hardware accelerators to their full potential is of paramount importance in the pursuit of higher performance and energy efficiency. Towards these objectives, the reduction of idle time of each device as well as the […]
Apr, 11
Progressive Semantic Segmentation
The objective of this work is to segment high-resolution images without exhausting GPU memory or losing the fine details in the output segmentation map. The memory constraint means that we must either downsample the big image or divide it into local patches for separate processing. However, the former approach would lose the fine […]
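For concreteness, the patch-based alternative can be sketched as tiling the image, segmenting each tile independently, and stitching the label maps back together. The segment_patch callable below is a hypothetical stand-in for any per-patch model; a real pipeline must also deal with tile borders and the global context that separate processing discards.

```python
# Sketch: segment a large image tile by tile to bound peak GPU memory.
import numpy as np

def segment_tiled(image: np.ndarray, tile: int, segment_patch) -> np.ndarray:
    """Apply segment_patch (patch -> per-pixel label map) to each tile."""
    h, w = image.shape[:2]
    out = np.zeros((h, w), dtype=np.int64)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch = image[y:y + tile, x:x + tile]
            out[y:y + tile, x:x + tile] = segment_patch(patch)
    return out
```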
Apr, 11
Performance Monitoring of Multi-FPGA Systems
Field-Programmable Gate Arrays (FPGAs) have been increasingly deployed in datacenters, and there has been a lot of focus on tools that support the development of FPGA applications. Among the most important tools are performance monitors, which provide visibility into the state of the hardware. As application platforms scale from one FPGA to many FPGAs, […]
Apr, 11
Large Scale GPU Based Simulations of Turbulent Bubbly Flow in a Square Duct
In this paper, we present the results of a numerical study of air-water turbulent bubbly flow in a periodic vertical square duct. The study is conducted using a novel numerical technique that leverages the Volume of Fluid method for interface capturing and the Sharp Surface Force method for accurate representation of the surface tension forces. A three-dimensional […]
Apr, 11
Efficient Video Compression via Content-Adaptive Super-Resolution
Video compression is a critical component of Internet video delivery. Recent work has shown that deep learning techniques can rival or outperform human-designed algorithms, but these methods are significantly less compute- and power-efficient than existing codecs. This paper presents a new approach that augments existing codecs with a small, content-adaptive super-resolution model that significantly boosts […]
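One way to read the idea: the sender encodes a downscaled stream with an existing codec, and the receiver restores resolution with a small super-resolution model. A schematic sketch follows, assuming NumPy; encode, decode, and sr_model are hypothetical placeholders, the naive subsampling stands in for a proper resampling filter, and none of this is the paper's actual pipeline.

```python
# Schematic: augment an existing codec with a super-resolution model.
import numpy as np

def downsample(frame: np.ndarray, factor: int) -> np.ndarray:
    return frame[::factor, ::factor]  # naive subsampling for illustration

def compress(frame, encode, factor=2):
    return encode(downsample(frame, factor))  # the codec sees a smaller frame

def reconstruct(bitstream, decode, sr_model):
    return sr_model(decode(bitstream))  # a small model restores the detail
```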
Apr, 5
An Investigation of Atomic Synchronization for Sort-Based Group-By Aggregation on GPUs
Using heterogeneous processing devices, like GPUs, to accelerate relational database operations is a well-known strategy. In this context, the group-by operation is highly interesting for two reasons. Firstly, it incurs large processing costs. Secondly, its results (i.e., aggregates) are usually small, which reduces data movement costs; compensating for data movement is a major challenge for heterogeneous computing. […]
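As a point of reference for what the aggregation computes, here is a minimal CPU-side sketch of a group-by sum, assuming NumPy. np.add.at performs scattered, unbuffered accumulation into one slot per key, which is the role atomic adds play in a GPU implementation.

```python
# Sketch of group-by sum: accumulate one aggregate per key.
import numpy as np

keys = np.array([2, 0, 1, 0, 2, 2])        # group id per row
vals = np.array([5., 1., 4., 2., 3., 6.])  # value per row

sums = np.zeros(keys.max() + 1)
np.add.at(sums, keys, vals)  # scattered accumulation, like atomicAdd per key
print(sums)                  # [ 3.  4. 14.]
```

A sort-based variant instead sorts the rows by key and reduces contiguous runs, which is the setting in which the paper investigates atomic synchronization.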
Apr, 5
Parallel Arbitrary-precision Integer Arithmetic
Arbitrary-precision integer arithmetic computations are driven by applications in solving systems of polynomial equations and public-key cryptography. Such computations arise when high precision is required (with large input values that fit into multiple machine words), or to avoid coefficient overflow due to intermediate expression swell. Meanwhile, the growing demand for faster computation alongside the recent […]
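To make the "multiple machine words" point concrete, here is a minimal sketch of multi-word addition with carry propagation, using 64-bit limbs stored least-significant first. It illustrates the representation only, not the parallel algorithms such work develops.

```python
# Sketch: add two arbitrary-precision integers stored as lists of 64-bit
# "limbs" (least-significant limb first), propagating the carry.
BASE = 1 << 64

def add_limbs(a: list[int], b: list[int]) -> list[int]:
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        s = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry
        out.append(s % BASE)
        carry = s // BASE
    if carry:
        out.append(carry)
    return out

print(add_limbs([BASE - 1], [1]))  # (2^64 - 1) + 1 = 2^64, i.e. limbs [0, 1]
```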
Apr, 5
Daisen: A Framework for Visualizing Detailed GPU Execution
Graphics Processing Units (GPUs) have been widely used to accelerate artificial intelligence, physics simulation, medical imaging, and information visualization applications. To improve GPU performance, GPU hardware designers need to identify performance issues by inspecting a huge volume of simulator-generated traces. Visualizing the execution traces can reduce the cognitive burden on users and facilitate making sense […]
Apr, 5
LS-CAT: A Large-Scale CUDA AutoTuning Dataset
The effectiveness of Machine Learning (ML) methods depends on access to large, suitable datasets. In this article, we present how we built the LS-CAT (Large-Scale CUDA AutoTuning) dataset, sourced from GitHub, for the purpose of training NLP-based ML models. Our dataset includes 19,683 CUDA kernels focused on linear algebra. In addition to the CUDA […]
Apr, 5
Energy-aware Task Scheduling with Deadline Constraint in DVFS-enabled Heterogeneous Clusters
Energy conservation in large data centers running high-performance computing workloads, such as deep learning with big data, is of critical significance: cutting electricity consumption by even a few percent translates into million-dollar savings. This work studies energy conservation on emerging CPU-GPU hybrid clusters through dynamic voltage and frequency scaling (DVFS). We aim at minimizing the […]
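The underlying trade-off can be illustrated with a toy model: under voltage scaling, dynamic power grows roughly as the cube of frequency (it scales with V^2 * f, and V scales with f), while runtime shrinks only as 1/f, so energy behaves like work * f^2 and the lowest deadline-feasible frequency wins. The sketch below uses that assumed power model, not the paper's formulation.

```python
# Toy DVFS sketch: pick the lowest frequency whose runtime still meets
# the deadline, assuming dynamic power P(f) ~ f^3.
def pick_frequency(work_cycles: float, deadline_s: float,
                   freqs_hz: list[float]) -> float | None:
    feasible = [f for f in freqs_hz if work_cycles / f <= deadline_s]
    if not feasible:
        return None  # no frequency setting can meet the deadline
    # Energy = P(f) * t ~ f^3 * (work / f) = work * f^2, so lower f wins.
    return min(feasible)

print(pick_frequency(2e9, 1.5, [1.0e9, 1.5e9, 2.0e9]))  # 1.5e9 (1.5 GHz)
```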