Studies on CUDA Offloading for Real-Time Simulation and Visualization
The University of Electro-Communications
The University of Electro-Communications, 2020
@phdthesis{martinez2020studies,
title={Studies on CUDA Offloading for Real-Time Simulation and Visualization},
author={Mart{\'i}nez-Noriega, Edgar Josafat},
school={The University of Electro-Communications},
year={2020}
}
The Graphics Processing Unit (GPU) is a co-processor designed to assist the Central Processing Unit (CPU) in rendering 3D graphics. The rapid development of these graphics chips, driven by the popularity of games and media design, helped the GPU evolve its now-ubiquitous parallel architecture. The programmability of these devices increased with the introduction of shaders, which allowed the GPU to be used for more than rendering pixels. This gave rise to a new paradigm, General-Purpose Computing on Graphics Processing Units (GPGPU). At present, supercomputers in the top ten are powered by GPUs in order to accelerate simulations of physical phenomena. Moreover, programming models such as Compute Unified Device Architecture (CUDA) and OpenCL have been proposed by major GPU manufacturers. Nevertheless, CUDA has proven to be the first choice of the developer community due to its extensive support and range of applications. On the other hand, post-PC devices such as smartphones and tablets have become essential in our daily lives. These mobile devices, equipped with touch screens and many sensors, provide new ways to visualize and interact with data. Interactive modelling in Molecular Dynamics (MD) simulation is one example where these devices can offer a better user experience. However, post-PC devices are designed for low power consumption, so their computational power is not enough to run such compute-intensive applications. Cloud computing is an approach that can compensate for the low computing power of mobile devices. By implementing a client-server scheme, cloud computing makes it possible to offload computationally intensive routines to massively parallel accelerators such as GPUs. To provide access to these hardware accelerators, GPU virtualization frameworks have been proposed: GVirtus, ShadowFax, DS-CUDA, GPUvm, MGP, vCUDA, and rCUDA.
These virtualization tools can drive a remote GPU to accelerate execution within applications while reducing code complexity. In this dissertation, we study and analyse the rendering performance, computational power, and power efficiency obtained when GPU virtualization tools are used to accelerate an MD simulation and visualization on a tablet device. We propose offloading the most computationally intensive routines to a remote GPU. Two cases are reported. In the first scenario, we used a low-power GPU from a notebook as the server in order to preserve the power efficiency of the whole system. We selected the DS-CUDA framework to enable remote offloading from an Android tablet. Only CUDA kernels were offloaded, since the DS-CUDA preprocessor can seamlessly wrap CUDA code without modification. Calculation speeds are reported for the MD simulation, comparing the GPU implementation with the CPU implementation inside the tablet device. However, to obtain higher calculation performance, the visualization speed needs to be decreased. The efficiency of the GPU can be improved by reducing how often a frame is updated for rendering. Nevertheless, this is not the optimal way to achieve real-time visualization of MD simulations. At the time the experiments were performed, ours was one of the first attempts to bring GPU virtualization to an Android device. In the second case, we propose a novel idea for reducing communication in the execution of real-time MD simulation and visualization on tablets: applying Dynamic Parallelism (DP) on the GPU. We switched from DS-CUDA to the rCUDA virtualization framework, since rCUDA is more up to date and offers better communication latency than DS-CUDA. We implemented DP in order to hide the latency of calling a GPU routine from the CPU in our MD simulation and visualization.
This technique allows our system to achieve better computational performance and more frames per second than a tablet powered by a CUDA-capable GPU. Moreover, our results confirm that keeping the GPU saturated with more MD simulation steps per frame helped reduce the latency seen from the client side. However, using more steps lowers the frame rate of the visualization. We found that 250 steps were optimal for our system, achieving a sufficient frame rate and better power efficiency when multiple clients were used. Our proposed system is capable of real-time MD simulation and visualization. With a time step of dt = 2 × 10^-15 s, we reach approximately 800 ns/day at a frame rate of 20 fps for a system of 2,744 particles. We were able to achieve interactive frame rates by tuning parameters while using a remote GPU from a tablet device. This is rather unconventional, since offloading introduces a communication bottleneck from the network. However, by applying DP we were able to compensate for this bottleneck in both computational and rendering speed. Lastly, we outline future research directions aimed at reducing the communication overhead between the rendering and computation processes when using a remote GPU. We propose applying software capabilities such as Graphics Interoperability and taking advantage of the hardware encoder/decoder modules for image processing. The main idea is to broadcast the final frame buffer over the network. Preliminary results showed poor performance. However, customizing the communication routines with buffering techniques could lead to better execution. This research path holds great promise, since the evolution of the GPU will be boosted by upcoming services such as game streaming.
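The throughput figure quoted in the abstract can be sanity-checked with a short calculation. The sketch below is not taken from the thesis: it assumes that the time step is expressed in seconds, and that the 250 steps per frame reported as optimal apply to the 20 fps measurement. The function name `simulated_ns_per_day` is hypothetical and used only for illustration.

```python
# Back-of-the-envelope check of the simulated-time throughput quoted in the abstract.
# Assumptions (not confirmed by the source): dt is in seconds, and the 250 steps
# per frame reported as optimal were used for the 20 fps measurement.

SECONDS_PER_DAY = 86_400


def simulated_ns_per_day(dt_seconds, steps_per_frame, frames_per_second):
    """Return the simulated time, in nanoseconds, advanced per wall-clock day."""
    # Each rendered frame advances the simulation by `steps_per_frame` steps.
    steps_per_wall_second = steps_per_frame * frames_per_second
    simulated_seconds_per_day = dt_seconds * steps_per_wall_second * SECONDS_PER_DAY
    return simulated_seconds_per_day * 1e9  # seconds -> nanoseconds


rate = simulated_ns_per_day(dt_seconds=2e-15, steps_per_frame=250, frames_per_second=20)
print(f"{rate:.0f} ns/day")
```

Under these assumptions the estimate comes out at roughly 864 ns/day, which is in line with the approximate figure of 800 given in the abstract.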
July 5, 2020 by hgpu