Posts
Feb, 10
Ilargi: a GPU Compatible Factorized ML Model Training Framework
Machine learning (ML) training over disparate data sources traditionally involves materialization, which can impose substantial time and space overhead due to data movement and replication. Factorized learning, which leverages direct computation on disparate sources through linear algebra (LA) rewriting, has emerged as a viable alternative for improving computational efficiency. However, the adaptation of factorized […]
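To make the factorized rewrite concrete: instead of materializing the join of two tables and then multiplying by a weight vector, the multiplication can be pushed down to the base tables. A minimal NumPy sketch of this classic identity (the table shapes, the indicator matrix K, and all variable names are illustrative assumptions, not Ilargi's actual API):

```python
import numpy as np

# Hypothetical normalized schema: entity table S joins to a
# dimension table R via a foreign key (one R row per S row).
S = np.random.rand(6, 3)            # 6 rows, 3 features
R = np.random.rand(2, 4)            # 2 rows, 4 features
fk = np.array([0, 1, 1, 0, 0, 1])   # foreign key of each S row into R

# Indicator matrix K maps R's rows onto S's rows: K[i, fk[i]] = 1.
K = np.zeros((6, 2))
K[np.arange(6), fk] = 1.0

w = np.random.rand(7)               # one weight per joined feature

# Materialized approach: build the join T = [S | K @ R], then multiply.
T = np.hstack([S, K @ R])
y_materialized = T @ w

# Factorized rewrite: push the multiplication into the base tables,
# never building the (potentially huge) intermediate T.
y_factorized = S @ w[:3] + K @ (R @ w[3:])

assert np.allclose(y_materialized, y_factorized)
```

The factorized form trades the join-sized intermediate for two small matrix products plus a cheap scatter through K, which is where the time and space savings come from.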
Feb, 10
Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs
Recent advancements in Large Language Models (LLMs) have led to increasingly diverse requests, accompanied by varying resource (compute and memory) demands to serve them. However, this in turn degrades the cost-efficiency of LLM serving, as common practice primarily relies on homogeneous GPU resources. In response to this problem, this work conducts a thorough study of […]
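Cost-efficiency in this setting usually boils down to dollars per token served, which makes GPU types directly comparable. A toy back-of-the-envelope sketch (all throughput and price figures are hypothetical placeholders, not results from the paper):

```python
# Hypothetical (made-up) per-GPU serving throughput and hourly price.
gpus = {
    "gpu_a": {"tokens_per_sec": 1800.0, "usd_per_hour": 2.50},
    "gpu_b": {"tokens_per_sec": 700.0,  "usd_per_hour": 0.80},
}

for name, g in gpus.items():
    tokens_per_hour = g["tokens_per_sec"] * 3600
    usd_per_million_tokens = g["usd_per_hour"] / tokens_per_hour * 1e6
    print(f"{name}: ${usd_per_million_tokens:.3f} per 1M tokens")
```

Under these made-up numbers the slower, cheaper GPU actually wins on cost per token, which is precisely the kind of effect that makes heterogeneous provisioning attractive.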
Feb, 10
Compiler Support for Speculation in Decoupled Access/Execute Architectures
Irregular codes are bottlenecked by memory and communication latency. Decoupled access/execute (DAE) is a common technique for tackling this problem. It relies on the compiler to separate memory address generation from the rest of the program; however, such a separation is not always possible due to control and data dependencies between the access and execute […]
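The access/execute split itself is easy to picture in software: an access slice generates addresses and streams loaded values through a queue to an execute slice that does the arithmetic. A minimal illustrative sketch of the pattern (real DAE targets do this with hardware queues and compiler-generated slices; this Python version is only an analogy):

```python
from queue import Queue
from threading import Thread

data = list(range(100))
idx = [3, 7, 7, 42, 99, 0]          # irregular, data-dependent indices

q = Queue(maxsize=8)                 # the decoupling queue

def access_slice():
    # Generates addresses and issues loads, running ahead of execute.
    for i in idx:
        q.put(data[i])
    q.put(None)                      # end-of-stream marker

def execute_slice():
    total = 0
    while (v := q.get()) is not None:
        total += v * v               # the actual computation
    print("result:", total)

Thread(target=access_slice).start()
execute_slice()
```

The dependencies the abstract mentions are exactly what breaks this picture: when the next address depends on a value the execute slice has not produced yet, the two slices can no longer run ahead of each other without speculation.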
Feb, 3
On the Partitioning of GPU Power among Multi-Instances
Efficient power management in cloud data centers is essential for reducing costs, enhancing performance, and minimizing environmental impact. GPUs, critical for tasks like machine learning (ML) and GenAI, are major contributors to power consumption. NVIDIA’s Multi-Instance GPU (MIG) technology improves GPU utilization by enabling isolated partitions with per-partition resource tracking, facilitating GPU sharing by multiple […]
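Because NVML reports power for the whole board rather than per MIG instance, any per-partition figure requires an attribution model. One naive baseline, sketched below, splits measured board power proportionally to each instance's SM share and utilization; the attribution rule and every number here are illustrative assumptions, not the paper's method:

```python
# Hypothetical measured board power and per-MIG-instance activity.
board_power_w = 310.0
idle_power_w = 60.0                  # assumed static/idle floor

# (sm_share, utilization) per instance -- made-up values.
instances = {"1g.10gb": (1/7, 0.9), "2g.20gb": (2/7, 0.4), "4g.40gb": (4/7, 0.7)}

# Weight dynamic power by sm_share * utilization; split the idle
# floor by SM share alone.
weights = {k: s * u for k, (s, u) in instances.items()}
total_w = sum(weights.values())

for name, (sm_share, _) in instances.items():
    dyn = (board_power_w - idle_power_w) * weights[name] / total_w
    static = idle_power_w * sm_share
    print(f"{name}: {dyn + static:.1f} W attributed")
```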
Feb, 3
Fully-Automated Code Generation for Efficient Computation of Sparse Matrix Permanents on GPUs
Registers are the fastest memory components within the GPU’s complex memory hierarchy, accessed by names rather than addresses. They are managed entirely by the compiler through a process called register allocation, during which the compiler attempts to cache predictable data from thread-local memory into thread-private registers. Computing the permanent of a sparse matrix poses a […]
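For background, the permanent is the determinant's sign-free cousin, and the standard exact algorithm for dense matrices is Ryser's formula, O(2^n · n^2) as written below; the sparsity the paper exploits is what makes per-matrix code generation worthwhile. A plain reference implementation of Ryser's formula (no GPU, register tricks, or code generation, just the mathematics):

```python
from itertools import combinations

def permanent_ryser(A):
    """Ryser's formula: perm(A) = (-1)^n * sum over column subsets S
    of (-1)^|S| * prod_i (sum of A[i][j] for j in S)."""
    n = len(A)
    total = 0.0
    for k in range(1, n + 1):
        for cols in combinations(range(n), k):
            prod = 1.0
            for row in A:
                prod *= sum(row[j] for j in cols)
            total += (-1) ** k * prod
    return (-1) ** n * total

# Permanent of [[a, b], [c, d]] is a*d + b*c.
print(permanent_ryser([[1.0, 2.0], [3.0, 4.0]]))   # 1*4 + 2*3 = 10.0
```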
Feb, 3
CPU-GPU co-execution through the exploitation of hybrid technologies via SYCL
The performance and energy efficiency offered by heterogeneous systems are highly useful for modern C++ applications, but the technological variety demands adequate portability and programmability. Initiatives such as Intel oneAPI facilitate the exploitation of Intel CPUs and GPUs, but not NVIDIA GPUs, which are present in systems of all kinds and are necessarily leveraged by […]
Feb, 3
Modernization and Optimization of MPI Codes
MPI has been the de facto standard for distributed-memory computing since its inception in 1994. While the MPI standard has evolved to include new technologies like RDMA, many applications still rely on the original set of MPI operations. This thesis initially investigates the current usage of MPI. We note that developers underutilize modern MPI […]
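One concrete example of an underused modern feature is one-sided (RDMA-style) communication, available since MPI-2 yet still rare next to plain Send/Recv. A minimal mpi4py sketch of a Put through an RMA window (assumes mpi4py, NumPy, and two ranks; the payload is a placeholder):

```python
# Run with: mpirun -n 2 python put_example.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank exposes a one-element buffer through an RMA window.
buf = np.zeros(1, dtype="d")
win = MPI.Win.Create(buf, comm=comm)

win.Fence()                          # open an RMA access epoch
if rank == 0:
    payload = np.array([42.0])
    win.Put(payload, 1)              # write directly into rank 1's buffer
win.Fence()                          # close the epoch

if rank == 1:
    print("rank 1 received:", buf[0])
win.Free()
```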
Feb, 3
Profiling Apple Silicon Performance for ML Training
Apple Silicon has attracted much attention for its performance and role in machine learning (ML) training. Unlike NVIDIA GPUs, which have traditionally dominated ML training, Apple Silicon features a significantly different memory architecture: it uses Unified Memory, which integrates CPU and GPU memory instead of separate CPU memory and GPU VRAM. However, it is […]
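In practice, ML training on Apple Silicon goes through PyTorch's MPS backend, where Unified Memory means "device" tensors live in the same physical memory as the CPU's. A minimal sketch of one training step on MPS (standard PyTorch API; the model and batch are toy placeholders):

```python
import torch

# Prefer Apple's Metal Performance Shaders backend when available.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

model = torch.nn.Linear(128, 10).to(device)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(64, 128, device=device)      # toy batch
y = torch.randint(0, 10, (64,), device=device)

loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
print("loss:", loss.item())
```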
Jan, 27
Column-Oriented Datalog on the GPU
Datalog is a logic programming language widely used in knowledge representation and reasoning (KRR), program analysis, and social media mining due to its expressiveness and high performance. Traditionally, Datalog engines use either row-oriented or column-oriented storage. Engines like VLog and Nemo favor column-oriented storage for efficiency on limited-resource machines, while row-oriented engines like Souffle use […]
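The storage layout mainly shapes how the engine's fixpoint loop touches data. A toy semi-naive transitive-closure evaluation over a column-wise relation (two parallel lists, one per attribute; an illustrative sketch, not any engine's actual layout or join algorithm):

```python
# edge(x, y) stored column-wise: one list per attribute.
edge_src = [0, 1, 2]
edge_dst = [1, 2, 3]

# path(x, y) :- edge(x, y).
# path(x, z) :- path(x, y), edge(y, z).
path = set(zip(edge_src, edge_dst))
delta = set(path)

while delta:
    new = set()
    for (x, y) in delta:
        # Join the delta of path against the edge columns on y == src.
        for i, s in enumerate(edge_src):
            if s == y and (x, edge_dst[i]) not in path:
                new.add((x, edge_dst[i]))
    path |= new                      # semi-naive: only new facts re-join
    delta = new

print(sorted(path))                  # (0,1),(0,2),(0,3),(1,2),(1,3),(2,3)
```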
Jan, 27
Good things come in small packages: Should we adopt Lite-GPUs in AI infrastructure?
To match the booming demand of generative AI workloads, GPU designers have so far been trying to pack more and more compute and memory into single complex and expensive packages. However, there is growing uncertainty about the scalability of individual GPUs, and thus of AI clusters, as state-of-the-art GPUs are already displaying packaging, yield, and cooling […]
Jan, 27
Exploring data flow design and vectorization with oneAPI for streaming applications on CPU+GPU
In recent times, oneAPI has emerged as a competitive framework for optimizing streaming applications on heterogeneous CPU+GPU architectures, since it provides portability and performance thanks to the SYCL programming language and efficient parallel libraries such as oneTBB. However, this approach opens up a wealth of implementation alternatives for this type of application: from how to design […]
Jan, 27
Adaptive Optimization Techniques for High-Performance Computing
The dataset sizes and computing needs of increasingly prevalent high-performance computing (HPC) applications have grown exponentially over the last decade. Moreover, modern computing architectures are evolving with different paradigms, and accelerators have become indispensable parts of computing. Consequently, the imperative for performance optimization for HPC applications and intelligent resource management for evolving architectures has become […]