Posts
Feb, 10
Optimizing the optimizer: increasing performance efficiency of modern compilers
A long-standing goal, which is increasingly important in the post-Moore era, is to augment system performance by building more intelligent compilers. One of our motivating hypotheses is that much of the capability needed to advance compiler optimization is already present: state-of-the-art compilers not only provide a large set of code transformations, but also (by-and-large) correctly […]
Feb, 10
Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs
Recent advancements in Large Language Models (LLMs) have led to increasingly diverse requests, accompanied by varying resource (compute and memory) demands to serve them. However, this in turn degrades the cost-efficiency of LLM serving, as common practices primarily rely on homogeneous GPU resources. In response to this problem, this work conducts a thorough study about […]
Feb, 10
Compiler Support for Speculation in Decoupled Access/Execute Architectures
Irregular codes are bottlenecked by memory and communication latency. Decoupled access/execute (DAE) is a common technique to tackle this problem. It relies on the compiler to separate memory address generation from the rest of the program; however, such a separation is not always possible due to control and data dependencies between the access and execute […]
Feb, 10
Towards autonomous resource management: Deep learning prediction of CPU-GPU load balancing
Demand for data centers has increased with recent advances in artificial intelligence. These data centers comprise thousands of servers with cooling systems that consume large amounts of energy. The servers usually contain several processing units that can cooperate to solve computational tasks. When making a proper partitioning of the entire […]
Feb, 10
Ilargi: a GPU Compatible Factorized ML Model Training Framework
Machine learning (ML) training over disparate data sources traditionally involves materialization, which can impose substantial time and space overhead due to data movement and replication. Factorized learning, which leverages direct computation on disparate sources through linear algebra (LA) rewriting, has emerged as a viable alternative to improve computational efficiency. However, the adaptation of factorized […]
Feb, 3
Modernization and Optimization of MPI Codes
MPI has become the de facto standard for distributed memory computing since its inception in 1994. While the MPI standard has evolved to include new technologies like RDMA, many applications still rely on the original set of MPI operations. This thesis initially investigates the current usage of MPI. We note that developers underutilize modern MPI […]
Feb, 3
On the Partitioning of GPU Power among Multi-Instances
Efficient power management in cloud data centers is essential for reducing costs, enhancing performance, and minimizing environmental impact. GPUs, critical for tasks like machine learning (ML) and GenAI, are major contributors to power consumption. NVIDIA’s Multi-Instance GPU (MIG) technology improves GPU utilization by enabling isolated partitions with per-partition resource tracking, facilitating GPU sharing by multiple […]
Feb, 3
Fully-Automated Code Generation for Efficient Computation of Sparse Matrix Permanents on GPUs
Registers are the fastest memory components within the GPU’s complex memory hierarchy, accessed by names rather than addresses. They are managed entirely by the compiler through a process called register allocation, during which the compiler attempts to cache predictable data from thread-local memory into thread-private registers. Computing the permanent of a sparse matrix poses a […]
Feb, 3
CPU-GPU co-execution through the exploitation of hybrid technologies via SYCL
The performance and energy efficiency offered by heterogeneous systems are highly useful for modern C++ applications, but the technological variety demands adequate portability and programmability. Initiatives such as Intel oneAPI facilitate the exploitation of Intel CPUs and GPUs, but not NVIDIA GPUs, which are present in systems of all kinds and are necessarily leveraged by […]
Feb, 3
Profiling Apple Silicon Performance for ML Training
Apple Silicon has attracted much attention for its performance and role in machine learning (ML) training. Compared with NVIDIA GPUs, which have traditionally dominated ML training, Apple Silicon differs significantly in memory architecture. It uses Unified Memory, which integrates CPU and GPU memory instead of keeping separate CPU memory and GPU VRAM. However, it is […]
Jan, 27
Column-Oriented Datalog on the GPU
Datalog is a logic programming language widely used in knowledge representation and reasoning (KRR), program analysis, and social media mining due to its expressiveness and high performance. Traditionally, Datalog engines use either row-oriented or column-oriented storage. Engines like VLog and Nemo favor column-oriented storage for efficiency on limited-resource machines, while row-oriented engines like Souffle use […]
Jan, 27
Good things come in small packages: Should we adopt Lite-GPUs in AI infrastructure?
To match the blooming demand of generative AI workloads, GPU designers have so far been trying to pack more and more compute and memory into single complex and expensive packages. However, there is growing uncertainty about the scalability of individual GPUs and thus AI clusters, as state-of-the-art GPUs are already displaying packaging, yield, and cooling […]