Posts
Oct, 23
Thwarting Piracy: Anti-debugging Using GPU-assisted Self-healing Codes
Software piracy is one of the major concerns in the IT sector. Pirates use debugger tools to reverse engineer the logic that verifies license keys or to bypass the verification process entirely. Anti-debugging techniques based on self-healing code are used to defeat piracy. However, anti-debugging methods can be defeated when the licensing protections are limited to […]
Oct, 23
Behavioral graph fraud detection in E-commerce
In the e-commerce industry, graph neural network methods are the new trend in transaction risk modeling. The power of graph algorithms lies in their capability to capture transaction-linking network information, which is very hard for other algorithms to capture. However, in most existing approaches, transaction or user connections are defined by hard link strategies on shared […]
Oct, 23
From Task-Based GPU Work Aggregation to Stellar Mergers: Turning Fine-Grained CPU Tasks into Portable GPU Kernels
Meeting both scalability and performance portability requirements is a challenge for any HPC application, especially for adaptively refined ones. In Octo-Tiger, an astrophysics application for the simulation of stellar mergers, we approach this with existing solutions: We employ HPX to obtain fine-grained tasks to easily distribute work and finely overlap communication and computation. For the […]
Oct, 16
Distributed, combined CPU and GPU profiling within HPX using APEX
Benchmarking and comparing the performance of a scientific simulation across hardware platforms is a complex task. When the simulation in question is built on an asynchronous, many-task (AMT) runtime that offloads work to GPUs, the task becomes even more complex. In this paper, we discuss the use of a uniquely suited performance measurement library, APEX, to capture […]
Oct, 16
Dataloader Parameter Tuner: An Automated Dataloader Parameter Tuner for Deep Learning Models
Deep learning has recently become one of the most compute- and data-intensive methods and is widely used in many research areas and businesses. One of the critical challenges of deep learning is that it has many adjustable parameters whose optimal values may need to be determined to achieve faster execution and high accuracy. The […]
Oct, 16
OpenMP Offloading in the Jetson Nano Platform
The NVIDIA Jetson Nano is a very popular system-on-module and developer kit that brings high-performance specs to a small, power-efficient embedded platform. Integrating a 128-core GPU and a quad-core CPU, it provides enough capability to support computationally demanding applications such as AI inference, deep learning and computer vision. While the Jetson Nano family supports […]
Oct, 16
PMT: Power Measurement Toolkit
Efficient use of energy is essential for today’s supercomputing systems, as energy cost is generally a major component of their operational cost. Research into "green computing" is needed to reduce the environmental impact of running these systems. As such, several scientific communities are evaluating the trade-off between time-to-solution and energy-to-solution. While the runtime of an […]
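Energy-to-solution is power integrated over the run time, so it can be estimated from periodic power readings. The sketch below is a minimal illustration, assuming power samples in watts with timestamps in seconds; the helper name `energy_joules` and the sample values are hypothetical and are not part of PMT's interface.

```python
def energy_joules(timestamps_s, power_w):
    """Approximate energy (J) from power samples (W) with the trapezoidal rule.

    Hypothetical helper for illustration only; not part of PMT.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        # Average power over the interval multiplied by its duration.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


# Hypothetical samples: a 3-second run drawing between 200 W and 250 W.
t = [0.0, 1.0, 2.0, 3.0]
p = [200.0, 250.0, 250.0, 200.0]
print(energy_joules(t, p))  # 225 + 250 + 225 = 700.0
```

Comparing this value across configurations, alongside wall-clock time, is one way to expose the time-to-solution versus energy-to-solution trade-off.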
Oct, 16
Bottleneck Analysis of Dynamic Graph Neural Network Inference on CPU and GPU
Dynamic graph neural networks (DGNNs) are becoming increasingly popular because of their widespread use in capturing dynamic features of the real world. A variety of dynamic graph neural networks designed from algorithmic perspectives have succeeded in incorporating temporal information into graph processing. Despite the promising algorithmic performance, deploying DGNNs on hardware presents additional challenges due […]
Oct, 9
Towards Performance Portable Programming for Distributed Heterogeneous Systems
Hardware heterogeneity is here to stay for high-performance computing. Large-scale systems are currently equipped with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware in the future. This shift in the computing ecosystem offers many opportunities for performance improvement; however, it also increases the complexity of programming for such architectures. […]
Oct, 9
Decompiling x86 Deep Neural Network Executables
Due to their widespread use on heterogeneous hardware devices, deep learning (DL) models are compiled into executables by DL compilers to fully leverage low-level hardware primitives. This approach allows DL computations to be undertaken at low cost across a variety of computing platforms, including CPUs, GPUs, and various hardware accelerators. We present BTD (Bin to […]
Oct, 9
Benchmarking optimization algorithms for auto-tuning GPU kernels
Recent years have witnessed phenomenal growth in the applications and capabilities of Graphics Processing Units (GPUs) due to their high parallel computation power at relatively low cost. However, writing a computationally efficient GPU program (kernel) is challenging, and generally only a few specific kernel configurations lead to significant increases in performance. Auto-tuning is the process of […]
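The idea of searching a kernel configuration space for the fastest setting can be illustrated with a minimal sketch. The tunable parameters (`block_size`, `tile`) and the synthetic `measure` cost model are hypothetical stand-ins for compiling and timing a real kernel. Exhaustive search and random search are shown only as simple baselines; they are not the specific optimization algorithms benchmarked in the paper.

```python
import itertools
import random


def measure(config):
    """Hypothetical stand-in for compiling and timing a kernel (ms).

    A real tuner would launch the kernel with these parameters and time it.
    """
    block_size, tile = config["block_size"], config["tile"]
    # Synthetic cost model with a single optimum at block_size=256, tile=4.
    return abs(block_size - 256) / 64 + abs(tile - 4) + 1.0


def brute_force_tune(space):
    """Evaluate every configuration in the search space; return the fastest."""
    names = list(space)
    best_cfg, best_time = None, float("inf")
    for values in itertools.product(*(space[n] for n in names)):
        cfg = dict(zip(names, values))
        t = measure(cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time


def random_search_tune(space, budget, seed=0):
    """Evaluate `budget` randomly sampled configurations; return the fastest."""
    rng = random.Random(seed)
    names = list(space)
    best_cfg, best_time = None, float("inf")
    for _ in range(budget):
        cfg = {n: rng.choice(space[n]) for n in names}
        t = measure(cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time


space = {"block_size": [64, 128, 256, 512, 1024], "tile": [1, 2, 4, 8]}
print(brute_force_tune(space))  # ({'block_size': 256, 'tile': 4}, 1.0)
```

Exhaustive search costs one measurement per combination (here 5 × 4 = 20), and that count grows multiplicatively with every added parameter. This is why practical auto-tuners rely on optimization strategies that work within a fixed evaluation budget.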
Oct, 9
Performance portability study of epistasis detection using SYCL on NVIDIA GPU
We describe the experience of converting a CUDA implementation of a high-order epistasis detection algorithm to SYCL. Our goal is for this work to be useful to application and compiler developers by providing a detailed description of migration paths between CUDA and SYCL. Evaluating the CUDA and SYCL applications on an NVIDIA V100 GPU, we find […]