Posts
Feb, 18
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
We introduce QUICK, a group of novel optimized CUDA kernels for the efficient inference of quantized Large Language Models (LLMs). QUICK addresses the shared memory bank-conflict problem of state-of-the-art mixed-precision matrix multiplication kernels. Our method interleaves the quantized weight matrices of LLMs offline to skip the shared memory write-back after dequantization. We demonstrate […]
Feb, 12
Multi-line AI-assisted Code Authoring
CodeCompose is an AI-assisted code authoring tool powered by large language models (LLMs) that provides inline suggestions to tens of thousands of developers at Meta. In this paper, we present how we scaled the product from displaying single-line suggestions to multi-line suggestions. This evolution required us to overcome several unique challenges in improving the usability […]
Feb, 12
DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence
The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models […]
Feb, 12
Evaluating the Wide Area Classroom After 24,000 HPC Students
As of 2023 we have taught more than 24,000 students over the course of 106 events using the Wide Area Classroom, a novel distributed teaching platform. This has been a successful effort gauged by several important metrics. We describe both the technical and logistical structure of these events as well as the specific HPC curriculums […]
Feb, 12
Training DNN Models over Heterogeneous Clusters with Optimal Performance
Adjusting batch sizes and adaptively tuning other hyperparameters can significantly speed up deep neural network (DNN) training. Despite the ubiquity of heterogeneous clusters, existing adaptive DNN training techniques solely consider homogeneous environments. Optimizing distributed DNN training over heterogeneous clusters is technically challenging, and directly adapting existing techniques results in low utilization and poor performance. To […]
Feb, 12
Out of kernel tuning and optimizations for portable large-scale docking experiments on GPUs
Virtual screening is an early stage in the drug discovery process that selects the most promising candidates. In the urgent computing scenario, finding a solution in the shortest time frame is critical. Any improvement in the performance of a virtual screening application translates into an increase in the number of candidates evaluated, thereby raising the […]
Feb, 4
Gallatin: A General-Purpose GPU Memory Manager
Dynamic memory management is critical for efficiently porting modern data processing pipelines to GPUs. However, building a general-purpose dynamic memory manager on GPUs is challenging due to the massive parallelism and weak memory coherence. Existing state-of-the-art GPU memory managers, Ouroboros and Reg-Eff, employ traditional data structures such as arrays and linked lists to manage memory […]
Feb, 4
Deductive verification for SYCL
A heterogeneous computing system is a system composed of different types of computing units. SYCL is a software development framework for writing programs that target such systems. It uses the concept of kernels, where a kernel executes its code in parallel, and different kernels can be executed concurrently on multiple computing units. […]
Feb, 4
Towards a GPU-Parallelization of the neXtSIM-DG Dynamical Core
The cryosphere plays a significant role in Earth’s climate system. Therefore, an accurate simulation of sea ice is of great importance to improve climate projections. To enable higher resolution simulations, graphics processing units (GPUs) have become increasingly attractive as they offer higher floating point peak performance and better energy efficiency compared to CPUs. However, making […]
Feb, 4
High-order thread-safe lattice Boltzmann model for HPC turbulent flow simulations
We present a highly optimized thread-safe lattice Boltzmann model in which the non-equilibrium part of the distribution function is locally reconstructed via recursivity of Hermite polynomials. Such a procedure allows the explicit incorporation of non-equilibrium moments of the distribution up to the order supported by the lattice. Thus, the proposed approach increases accuracy and stability at […]
Feb, 4
LeftoverLocals: Listening to LLM Responses Through Leaked GPU Local Memory
This paper describes LeftoverLocals: a vulnerability that allows data recovery from GPU memory created by another process on Apple, Qualcomm, and AMD GPUs. LeftoverLocals impacts the security posture of GPU applications, with particular significance to LLMs and ML models that run on impacted GPUs. By recovering local memory, an optimized GPU memory region, we built […]
Jan, 28
Assessing the Impact of Compiler Optimizations on GPUs Reliability
Graphics Processing Unit (GPU) compilers have evolved to support general-purpose programming languages for multiple architectures. The NVIDIA CUDA Compiler (NVCC) has many compilation levels before generating the machine code and applies complex optimizations to improve performance. These optimizations modify how the software is mapped onto the underlying hardware; thus, as we show in this […]

