Posts
Jan, 23
NNP/MM: Fast molecular dynamics simulations with machine learning potentials and molecular mechanics
Parametric and non-parametric machine learning potentials have recently emerged as a way to improve the accuracy of bio-molecular simulations. Here, we present NNP/MM, a hybrid method integrating neural network potentials (NNPs) and molecular mechanics (MM). It allows part of a molecular system to be simulated with an NNP, while the rest is simulated with MM for efficiency. […]
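A minimal sketch of the partition the abstract describes (plain C++, not the paper's implementation): the NNP evaluates the energy of a selected region, while a classical force field covers the remaining atoms and the region-environment coupling. The energy functions here are placeholders, not real potentials.

    #include <functional>
    #include <vector>

    // Flattened xyz coordinates of a group of atoms.
    using Coords = std::vector<double>;

    // Hypothetical hybrid energy: NNP for the selected region, MM for the rest
    // and for the coupling between the two regions.
    double hybrid_energy(const Coords& nnp_region,
                         const Coords& mm_region,
                         const std::function<double(const Coords&)>& e_nnp,
                         const std::function<double(const Coords&, const Coords&)>& e_mm) {
        return e_nnp(nnp_region)            // intra-region energy from the neural network potential
             + e_mm(mm_region, nnp_region); // environment energy and region-environment coupling from MM
    }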
Jan, 23
Building a Performance Model for Deep Learning Recommendation Model Training on GPUs
We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that both the device active time (the sum of kernel runtimes) and the device idle time are important components of the overall device time. We therefore […]
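As a rough illustration of that decomposition (a hedged sketch, not the paper's model), the CUDA snippet below accumulates per-kernel runtimes with events and subtracts them from the overall device time to estimate idle time; the kernel and the loop are placeholders for real training operators.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel standing in for one training operator.
    __global__ void dummy_kernel(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 2.0f + 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        float* d_x = nullptr;
        cudaMalloc(&d_x, n * sizeof(float));

        cudaEvent_t start, stop, k_start, k_stop;
        cudaEventCreate(&start);   cudaEventCreate(&stop);
        cudaEventCreate(&k_start); cudaEventCreate(&k_stop);

        float active_ms = 0.0f;
        cudaEventRecord(start);
        for (int iter = 0; iter < 10; ++iter) {
            cudaEventRecord(k_start);
            dummy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
            cudaEventRecord(k_stop);
            cudaEventSynchronize(k_stop);
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, k_start, k_stop);
            active_ms += ms;  // device active time: sum of kernel runtimes
            // host-side work between launches would show up as device idle time
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float total_ms = 0.0f;
        cudaEventElapsedTime(&total_ms, start, stop);
        printf("device time %.3f ms = active %.3f ms + idle %.3f ms\n",
               total_ms, active_ms, total_ms - active_ms);

        cudaFree(d_x);
        return 0;
    }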
Jan, 16
Fancier: A Unified Framework for Java, C, and OpenCL Integration
Graphics Processing Units (GPUs) have evolved from very specialized designs geared towards computer graphics to accommodate general-purpose, highly parallel workloads. Harnessing the performance that these accelerators provide requires the use of specialized native programming interfaces, such as CUDA or OpenCL, or higher-level programming models like OpenMP or OpenACC. However, in managed programming languages, offloading execution into […]
Jan, 16
Research and Development of Porting SYCL on QNX Operating System for High Parallelism
As a standard C++ programming model, SYCL has gained popularity for incorporating various parallel computing frameworks. As hardware technologies develop, low-level computing devices are becoming increasingly varied, resulting in great hardware heterogeneity. Although many computing frameworks, such as OpenCL, OpenMP and CUDA, can benefit heterogeneous computing, they increase […]
Jan, 16
Dopia: Online Parallelism Management for Integrated CPU/GPU Architectures
Recent desktop and mobile processors often integrate the CPU and GPU onto the same die. The limited memory bandwidth of these integrated architectures can negatively affect the performance of data-parallel workloads when all computational resources are active. The combination of active CPU and GPU cores that achieves maximum performance depends on a workload’s characteristics, making manual […]
Jan, 16
A Compiler Framework for Optimizing Dynamic Parallelism on GPUs
Dynamic parallelism on GPUs allows GPU threads to dynamically launch other GPU threads. It is useful in applications with nested parallelism, particularly where the amount of nested parallelism is irregular and cannot be predicted beforehand. However, prior works have shown that dynamic parallelism may impose a high performance penalty when a large number of small […]
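For readers unfamiliar with the mechanism, the sketch below shows plain CUDA dynamic parallelism (a hedged example, not the paper's compiler framework): a parent kernel launches a small child grid only for rows that turn out to have nested work; the per-element processing is left as a stub.

    // Device-side launches need relocatable device code, e.g.: nvcc -arch=sm_70 -rdc=true dp_sketch.cu
    #include <cuda_runtime.h>

    __global__ void child(int row, int work) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < work) {
            // process element i of row `row` (stub)
        }
    }

    __global__ void parent(const int* work_per_row, int rows) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < rows && work_per_row[row] > 0) {
            int work = work_per_row[row];
            // device-side launch: one small child grid per non-empty row
            child<<<(work + 63) / 64, 64>>>(row, work);
        }
    }

    int main() {
        const int rows = 4;
        int h_work[rows] = {0, 128, 7, 1000};  // irregular nested work per row
        int* d_work = nullptr;
        cudaMalloc(&d_work, rows * sizeof(int));
        cudaMemcpy(d_work, h_work, rows * sizeof(int), cudaMemcpyHostToDevice);
        parent<<<1, rows>>>(d_work, rows);
        cudaDeviceSynchronize();
        cudaFree(d_work);
        return 0;
    }

Launching many such tiny child grids is exactly the case the abstract flags as a potential performance penalty.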
Jan, 16
Studying the Potential of Automatic Optimizations in the Intel FPGA SDK for OpenCL
High Level Synthesis (HLS) tools, like the Intel FPGA SDK for OpenCL, improve design productivity and enable efficient design space exploration guided by simple program directives (pragmas), but may sometimes miss important optimizations necessary for high performance. In this paper, we present a study of the tradeoffs in HLS optimizations, and the potential of a […]
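By way of illustration only (a hedged sketch; the paper targets OpenCL C kernels compiled for FPGAs, not CUDA), the directive-driven style the abstract refers to looks like this: a single pragma asks the toolchain to unroll a fixed-trip loop, trading resources for throughput.

    // Hypothetical kernel: the unroll directive is the kind of "simple program
    // directive" used to steer the generated code or hardware.
    __global__ void scale(const float* in, float* out, int n) {
        int base = (blockIdx.x * blockDim.x + threadIdx.x) * 8;
        #pragma unroll              // ask the compiler to fully unroll the 8-iteration loop
        for (int k = 0; k < 8; ++k) {
            int idx = base + k;
            if (idx < n) out[idx] = 2.0f * in[idx];
        }
    }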
Jan, 9
Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment
Deep learning has achieved tremendous success in various fields, but training deep neural networks (DNNs) is very compute-intensive, which has given rise to numerous deep learning frameworks that aim to offer better usability and higher performance to practitioners. TensorFlow and PyTorch are the two most popular frameworks. TensorFlow is more promising in industry contexts, […]
Jan, 9
Analysis of High Level implementations for Recursive Methods on GPUs
Higher-level DSLs have allowed for performant computation on GPUs while providing enough abstraction to spare the user significant deployment overhead. However, the SIMD/SIMT programming model can still encounter unexpected performance drops when CPU code is translated naively. One example of these performance drops is branch divergence, and this failure is […]
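A minimal CUDA illustration of the divergence problem mentioned above (hypothetical code, not taken from the paper): even and odd lanes of the same warp take different branches, so the hardware executes both paths with part of the warp masked off each time.

    #include <cuda_runtime.h>

    __global__ void divergent(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0) {
            out[i] = in[i] * in[i];   // path A: odd lanes of the warp sit idle
        } else {
            out[i] = in[i] + 1.0f;    // path B: even lanes of the warp sit idle
        }
    }

Recursive code translated naively from the CPU tends to produce exactly this kind of data-dependent branching.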
Jan, 9
Dynamic GPU Energy Optimization for Machine Learning Training Workloads
GPUs are widely used to accelerate the training of machine learning workloads. As modern machine learning models grow ever larger, they take longer to train, leading to higher GPU energy consumption. This paper presents GPOEO, an online GPU energy optimization framework for machine learning training workloads. GPOEO dynamically determines the optimal energy configuration […]
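The kind of online measurement such a framework depends on can be sketched with NVML (a hedged example, not GPOEO's actual code); it merely samples the GPU's current power draw, whereas an optimizer like GPOEO additionally searches for and applies an energy-optimal configuration.

    // Build (assuming the CUDA toolkit's NVML): nvcc power_sample.cu -lnvidia-ml -o power_sample
    #include <cstdio>
    #include <nvml.h>

    int main() {
        if (nvmlInit() != NVML_SUCCESS) return 1;
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
            unsigned int mw = 0;
            if (nvmlDeviceGetPowerUsage(dev, &mw) == NVML_SUCCESS)
                printf("current GPU power draw: %.1f W\n", mw / 1000.0);
            // An online optimizer would go further, e.g. locking clocks via
            // nvmlDeviceSetGpuLockedClocks (usually requires elevated privileges).
        }
        nvmlShutdown();
        return 0;
    }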
Jan, 9
Domain-Specific On-Device Object Detection Method
Object detection is an important task in computer vision, and various approaches have been proposed to detect a wide variety of objects using deep neural networks (DNNs). However, because DNNs are computation-intensive, it is difficult to deploy them on resource-constrained devices. Here, we propose an on-device object detection method using domain-specific models. In the proposed method, we define […]
Jan, 9
CFU Playground: Full-Stack Open-Source Framework for Tiny Machine Learning (tinyML) Acceleration on FPGAs
We present CFU Playground, a full-stack open-source framework that enables rapid and iterative design of machine learning (ML) accelerators for embedded ML systems. Our toolchain tightly integrates open-source software, RTL generators, and FPGA tools for synthesis, place, and route. This full-stack development framework lets engineers explore bespoke architectures that are customized and co-optimized […]