Posts
Dec, 24
Accelerating Sparse Graph Neural Networks with Tensor Core Optimization
Graph neural networks (GNNs) have seen extensive application in domains such as social networks, bioinformatics, and recommendation systems. However, the irregularity and sparsity of graph data challenge traditional computing methods, which are insufficient to meet the performance demands of GNNs. Recent research has explored parallel acceleration using CUDA Cores and Tensor Cores, but significant challenges […]
Dec, 24
HPC-Coder-V2: Studying Code LLMs Across Low-Resource Parallel Languages
Large Language Model (LLM) based coding tools have been tremendously successful as software development assistants, yet they are often designed for general purpose programming tasks and perform poorly for more specialized domains such as high performance computing. Creating specialized models and tools for these domains is crucial towards gaining the benefits of LLMs in areas […]
Dec, 24
Reproducible Study and Performance Analysis of GPU Programming Paradigms: OpenACC vs. CUDA in Key Linear Algebra Computations
Scientific and engineering problems are frequently governed by partial differential equations; however, the analytical solutions of these equations are often impractical, thereby forcing the adoption of numerical methods. Basic Linear Algebra Subprograms (BLAS) operations constitute a fundamental component of these numerical approaches, incorporating essential tasks such as Level 1 operations (dot products and vector addition), […]
Dec, 15
Deep Learning Model Security: Threats and Defenses
Deep learning has transformed AI applications but faces critical security challenges, including adversarial attacks, data poisoning, model theft, and privacy leakage. This survey examines these vulnerabilities, detailing their mechanisms and impact on model integrity and confidentiality. Practical implementations, including adversarial examples, label flipping, and backdoor attacks, are explored alongside defenses such as adversarial training, differential […]
Dec, 15
Analyzing the Performance Portability of SYCL across CPUs, GPUs, and Hybrid Systems with Protein Database Search
The high-performance computing (HPC) landscape is undergoing rapid transformation, with an increasing emphasis on energy-efficient and heterogeneous computing environments. This comprehensive study extends our previous research on SYCL’s performance portability by evaluating its effectiveness across a broader spectrum of computing architectures, including CPUs, GPUs, and hybrid CPU-GPU configurations from NVIDIA, Intel, and AMD. Our analysis […]
Dec, 15
Towards Performance-Aware Allocation for Accelerated Machine Learning on GPU-SSD Systems
The exponential growth of data-intensive machine learning workloads has exposed significant limitations in conventional GPU-accelerated systems, especially when processing datasets exceeding GPU DRAM capacity. We propose MQMS, an augmented in-storage GPU architecture and simulator that is aware of internal SSD states and operations, enabling intelligent scheduling and address allocation to overcome performance bottlenecks caused by […]
Dec, 15
RTCUDB: Building Databases with RT Processors
A spectrum of new hardware has been studied to accelerate database systems in the past decade. Specifically, CUDA cores are known to benefit from the fast development of GPUs and make notable performance improvements. The state-of-the-art GPU-based implementation, i.e., Crystal, can achieve up to 61 times higher performance than CPU-based implementations. However, experiments show that […]
Dec, 15
Leveraging the potential of task-based programming with OpenMP task graphs
The task execution model is widely used in computer engineering, it helps developers to design, develop and understand software systems. OpenMP is the de-facto programming model to parallelize sequential algorithms on shared-memory machines. Coupled with the task parallelization, OpenMP is able to conveniently parallelize structured and non-structured applications, it also allows users to offload work […]
Dec, 8
LLOR: Automated Repair of OpenMP Programs
In this paper, we present a technique for repairing data race errors in parallel programs written in C/C++ and Fortran using the OpenMP API. Our technique can also remove barriers that are deemed unnecessary for correctness. We implement these ideas in our tool called LLOR, which takes a language-independent approach to provide appropriate placements of […]
Dec, 8
Automating Energy-Efficient GPU Kernel Generation: A Fast Search-Based Compilation Approach
Deep Neural Networks (DNNs) have revolutionized various fields, but their deployment on GPUs often leads to significant energy consumption. Unlike existing methods for reducing GPU energy consumption, which are either hardware-inflexible or limited by workload constraints, this paper addresses the problem at the GPU kernel level. We propose a novel search-based compilation method to generate […]
Dec, 8
Unified schemes for directive-based GPU offloading
GPU is the dominant accelerator device due to its high performance and energy efficiency. Directive-based GPU offloading using OpenACC or OpenMP target is a convenient way to port existing codes originally developed for multicore CPUs. Although OpenACC and OpenMP target provide similar features, both methods have pros and cons. OpenACC has better functions and an […]
Dec, 8
FortranX: Harnessing Code Generation, Portability, and Heterogeneity in Fortran
Due to its historical popularity, Fortran was used to implement many important scientific applications. The complexity of these applications along with the transition to modern high performance languages like C++ has made modernization and optimization challenging for these applications. Significant development time is incurred to understand and optimize key algorithms as well as leverage new […]