Posts
Nov, 10
LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators
Large Language Models (LLMs) have propelled groundbreaking advancements across several domains and are commonly used for text generation applications. However, the computational demands of these complex models pose significant challenges, requiring efficient hardware acceleration. Benchmarking the performance of LLMs across diverse hardware platforms is crucial to understanding their scalability and throughput characteristics. We introduce LLM-Inference-Bench, […]
Nov, 10
On a Simplified Approach to Achieve Parallel Performance and Portability Across CPU and GPU Architectures
This paper presents software advances to easily exploit computer architectures consisting of a multi-core CPU and CPU+GPU to accelerate diverse types of high-performance computing (HPC) applications using a single code implementation. The paper describes and demonstrates the performance of the open-source C++ matrix and array (MATAR) library that uniquely offers: (1) a straightforward syntax for […]
Nov, 10
Over-synchronization in GPU Programs
The performance of GPU (Graphics Processing Unit)-accelerated functions affects a large spectrum of modern software. Efficiently synchronizing across thousands of concurrent threads is critical to the performance of GPU programs. GPU vendors have introduced advanced programming constructs, e.g., scopes, for efficiently synchronizing within a chosen subset of threads. However, programmers must explicitly employ them, where […]
Nov, 10
Profile Util library: A quick and easy way to get MPI, OpenMP and GPU runtime information
We present profile_util, a quick and simple way of profiling codes. This is a MPI, OpenMP, and GPU enabled C++17 library. The GPU interface is compatible with both HIP and CUDA and is compatible with more than a single GPU per MPI process. It provides a means of logging MPI, OpenMP and GPU related information, […]
Nov, 10
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
Online LLM inference powers many exciting applications such as intelligent chatbots and autonomous agents. Modern LLM inference engines widely rely on request batching to improve inference throughput, aiming to make it cost-efficient when running on expensive GPU accelerators. However, the limited GPU memory has largely limited the batch size achieved in practice, leaving significant GPU […]
Nov, 3
LLload: An Easy-to-Use HPC Utilization Tool
The increasing use and cost of high performance computing (HPC) requires new easy-to-use tools to enable HPC users and HPC systems engineers to transparently understand the utilization of resources. The MIT Lincoln Laboratory Supercomputing Center (LLSC) has developed a simple command, LLload, to monitor and characterize HPC workloads. LLload plays an important role in identifying […]
Nov, 3
Scheduling Languages: A Past, Present, and Future Taxonomy
Scheduling languages express to a compiler a sequence of optimizations to apply. Compilers that support a scheduling language interface allow exploration of compiler optimizations, i.e., exploratory compilers. While scheduling languages have become a common feature of tools for expert users, the proliferation of these languages without unifying common features may be confusing to users. Moreover, […]
Nov, 3
Data-Driven Dynamic Autotuning: Optimizing Autotuning Overhead with Prior Tuning Data
Modern high performance computing applications often rely on heterogeneous hardware resources to achieve maximum performance. This approach presents obvious benefits, combining the processing power of multiple different processors and allowing them to be more specialized. However, since HPC applications typically need to be programmed in a hardware-aware manner to achieve maximum performance, this places more […]
Nov, 3
MambaCPU: Enhanced Correlation Mining with State Space Models for CPU Performance Prediction
Forecasting CPU performance, which involves estimating performance scores based on hardware characteristics during operation, is crucial for computational system design and resource management. This research field currently faces two primary challenges. First, the diversity of CPU products and the specialized nature of hardware characteristics make real-world data collection difficult. Second, existing approaches, whether reliant on […]
Nov, 3
Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs
Serving systems for Large Language Models (LLMs) improve throughput by processing several requests concurrently. However, multiplexing hardware resources between concurrent requests involves non-trivial scheduling decisions. Practical serving systems typically implement these decisions at two levels: First, a load balancer routes requests to different servers which each hold a replica of the LLM. Then, on each […]
Oct, 27
Mixed-precision finite element kernels and assembly: Rounding error analysis and hardware acceleration
In this paper we develop the first fine-grained rounding error analysis of finite element (FE) cell kernels and assembly. The theory includes mixed-precision implementations and accounts for hardware-acceleration via matrix multiplication units, thus providing theoretical guidance for designing reduced- and mixed-precision FE algorithms on CPUs and GPUs. Guided by this analysis, we introduce hardware-accelerated mixed-precision […]
Oct, 27
Fastrack: Fast IO for Secure ML using GPU TEEs
As cloud-based ML expands, ensuring data security during training and inference is critical. GPU-based Trusted Execution Environments (TEEs) offer secure, high-performance solutions, with CPU TEEs managing data movement and GPU TEEs handling authentication and computation. However, CPU-to-GPU communication overheads significantly hinder performance, as data must be encrypted, authenticated, decrypted, and verified, increasing costs by 12.69 […]