Posts

Nov, 10

LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators

Large Language Models (LLMs) have propelled groundbreaking advancements across several domains and are commonly used for text generation applications. However, the computational demands of these complex models pose significant challenges, requiring efficient hardware acceleration. Benchmarking the performance of LLMs across diverse hardware platforms is crucial to understanding their scalability and throughput characteristics. We introduce LLM-Inference-Bench, […]
Nov, 10

On a Simplified Approach to Achieve Parallel Performance and Portability Across CPU and GPU Architectures

This paper presents software advances to easily exploit computer architectures consisting of a multi-core CPU and CPU+GPU to accelerate diverse types of high-performance computing (HPC) applications using a single code implementation. The paper describes and demonstrates the performance of the open-source C++ matrix and array (MATAR) library that uniquely offers: (1) a straightforward syntax for […]
Nov, 10

Over-synchronization in GPU Programs

The performance of GPU (Graphics Processing Unit)-accelerated functions affects a large spectrum of modern software. Efficiently synchronizing across thousands of concurrent threads is critical to the performance of GPU programs. GPU vendors have introduced advanced programming constructs, e.g., scopes, for efficiently synchronizing within a chosen subset of threads. However, programmers must explicitly employ them, where […]
Nov, 10

NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference

Online LLM inference powers many exciting applications such as intelligent chatbots and autonomous agents. Modern LLM inference engines widely rely on request batching to improve inference throughput, aiming to make it cost-efficient when running on expensive GPU accelerators. However, limited GPU memory has largely constrained the batch size achieved in practice, leaving significant GPU […]
Nov, 10

Profile Util library: A quick and easy way to get MPI, OpenMP and GPU runtime information

We present profile_util, a quick and simple way of profiling codes. This is an MPI-, OpenMP-, and GPU-enabled C++17 library. The GPU interface is compatible with both HIP and CUDA and supports more than one GPU per MPI process. It provides a means of logging MPI, OpenMP and GPU related information, […]
Nov, 3

LLload: An Easy-to-Use HPC Utilization Tool

The increasing use and cost of high performance computing (HPC) requires new easy-to-use tools to enable HPC users and HPC systems engineers to transparently understand the utilization of resources. The MIT Lincoln Laboratory Supercomputing Center (LLSC) has developed a simple command, LLload, to monitor and characterize HPC workloads. LLload plays an important role in identifying […]
Nov, 3

Scheduling Languages: A Past, Present, and Future Taxonomy

Scheduling languages express to a compiler a sequence of optimizations to apply. Compilers that support a scheduling language interface allow exploration of compiler optimizations, i.e., exploratory compilers. While scheduling languages have become a common feature of tools for expert users, the proliferation of these languages without unifying common features may be confusing to users. Moreover, […]
Nov, 3

Data-Driven Dynamic Autotuning: Optimizing Autotuning Overhead with Prior Tuning Data

Modern high performance computing applications often rely on heterogeneous hardware resources to achieve maximum performance. This approach presents obvious benefits, combining the processing power of multiple different processors and allowing them to be more specialized. However, since HPC applications typically need to be programmed in a hardware-aware manner to achieve maximum performance, this places more […]
Nov, 3

Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs

Serving systems for Large Language Models (LLMs) improve throughput by processing several requests concurrently. However, multiplexing hardware resources between concurrent requests involves non-trivial scheduling decisions. Practical serving systems typically implement these decisions at two levels: First, a load balancer routes requests to different servers which each hold a replica of the LLM. Then, on each […]
Nov, 3

MambaCPU: Enhanced Correlation Mining with State Space Models for CPU Performance Prediction

Forecasting CPU performance, which involves estimating performance scores based on hardware characteristics during operation, is crucial for computational system design and resource management. This research field currently faces two primary challenges. First, the diversity of CPU products and the specialized nature of hardware characteristics make real-world data collection difficult. Second, existing approaches, whether reliant on […]
Oct, 27

Jailbreaking LLM-Controlled Robots

The recent introduction of large language models (LLMs) has revolutionized the field of robotics by enabling contextual reasoning and intuitive human-robot interaction in domains as varied as manipulation, locomotion, and self-driving vehicles. When viewed as a stand-alone technology, LLMs are known to be vulnerable to jailbreaking attacks, wherein malicious prompters elicit harmful text by bypassing […]
Oct, 27

Mixed-precision finite element kernels and assembly: Rounding error analysis and hardware acceleration

In this paper we develop the first fine-grained rounding error analysis of finite element (FE) cell kernels and assembly. The theory includes mixed-precision implementations and accounts for hardware-acceleration via matrix multiplication units, thus providing theoretical guidance for designing reduced- and mixed-precision FE algorithms on CPUs and GPUs. Guided by this analysis, we introduce hardware-accelerated mixed-precision […]

* * *


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hgpu.org