Posts
Aug, 3
GBOTuner: Autotuning of OpenMP Parallel Codes with Bayesian Optimization and Code Representation Transfer Learning
Empirical autotuning methods such as Bayesian optimization (BO) offer a powerful way to optimize the tuning parameters of parallel codes as black boxes. However, BO is expensive because it relies on empirical samples obtained from true evaluations of varying parameter configurations. In this thesis, we present GBOTuner, an autotuning framework for optimizing the […]
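GBOTuner's internals are beyond this excerpt, but the black-box setting it describes can be sketched generically: a BO loop proposes OpenMP runtime settings, runs the program, and feeds the measured runtime back as the objective. The binary name ./stencil, the parameter ranges, and the use of scikit-optimize below are illustrative assumptions, not part of GBOTuner.

```python
# Minimal sketch of black-box tuning of OpenMP runtime parameters with
# Bayesian optimization (scikit-optimize). The binary "./stencil" and the
# parameter ranges are placeholders, not GBOTuner's actual search space.
import os
import subprocess
import time

from skopt import gp_minimize
from skopt.space import Categorical, Integer

space = [
    Integer(1, 64, name="num_threads"),
    Categorical(["static", "dynamic", "guided"], name="schedule"),
]

def objective(params):
    num_threads, schedule = params
    env = dict(os.environ, OMP_NUM_THREADS=str(num_threads), OMP_SCHEDULE=schedule)
    start = time.perf_counter()
    subprocess.run(["./stencil"], env=env, check=True)  # true (empirical) evaluation
    return time.perf_counter() - start                  # BO minimizes wall-clock time

result = gp_minimize(objective, space, n_calls=20, random_state=0)
print("best configuration:", result.x, "runtime:", result.fun)
```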
Aug, 3
Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks
The demand for AI-generated GPU kernels is rapidly growing, driven by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to automate low-level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now […]
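For readers unfamiliar with the target language, the sketch below is an ordinary hand-written Triton vector-add kernel of the kind such agents are asked to generate and evaluate; it is a generic tutorial-style example, not output of Geak or part of its benchmarks.

```python
# A plain Triton vector-add kernel, illustrating what "Triton kernels" are.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # one program per block of elements
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```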
Aug, 3
OpenDwarfs 2025: Modernizing the OpenDwarfs Benchmark Suite for Heterogeneous Computing
As the era of heterogeneous computing evolves, benchmarking tools are vital for measuring performance across diverse architectures. We present OpenDwarfs 2025, a reengineered and modernized version of the OpenDwarfs benchmark suite, originally developed to evaluate the performance of heterogeneous systems using OpenCL. Our comprehensive reengineering process involved addressing compatibility issues with modern compilers, resolving bugs, […]
Aug, 3
Performance Portable Gradient Computations Using Source Transformation
Derivative computation is a key component of optimization, sensitivity analysis, uncertainty quantification, and nonlinear solvers. Automatic differentiation (AD) is a powerful technique for evaluating such derivatives and, in recent years, has been integrated into programming environments such as JAX, PyTorch, and TensorFlow to support the derivative computations needed to train machine learning models, resulting in […]
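The excerpt mentions AD as integrated into JAX, PyTorch, and TensorFlow; the minimal JAX example below shows the user-facing idea (the loss function is illustrative, not from the paper). Note that JAX obtains derivatives by tracing the program, whereas the title refers to a source-transformation approach.

```python
# Reverse-mode AD as exposed by JAX: differentiate a scalar loss w.r.t. its
# first argument. The linear model and data here are purely illustrative.
import jax
import jax.numpy as jnp

def loss(w, x, y):
    pred = x @ w                       # simple linear model
    return jnp.mean((pred - y) ** 2)   # mean squared error

grad_loss = jax.grad(loss)             # gradient w.r.t. w (the first argument)

w = jnp.ones(3)
x = jnp.arange(12.0).reshape(4, 3)
y = jnp.ones(4)
print(grad_loss(w, x, y))              # gradient array with shape (3,)
```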
Jul, 20
Kevin: Multi-Turn RL for Generating CUDA Kernels
Writing GPU kernels is a challenging task that is critical to the efficiency of AI systems. It is also highly iterative: domain experts write code and improve performance through execution feedback. Moreover, it offers verifiable rewards such as correctness and speedup, making it a natural environment for applying Reinforcement Learning (RL). To explicitly incorporate the iterative nature of this […]
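Kevin's exact reward shaping is not given in this excerpt; the sketch below shows one plausible form of the "verifiable rewards" it mentions: zero unless a candidate kernel matches a reference implementation, otherwise the measured speedup. The helper names and tolerance are assumptions.

```python
# Hedged sketch of a verifiable reward for a generated kernel: 0 for crashes
# or wrong results, otherwise speedup over the reference implementation.
import torch

def time_cuda(fn, inputs, iters=20):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*inputs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # milliseconds per call

def kernel_reward(candidate_fn, reference_fn, inputs, atol=1e-4):
    try:
        out = candidate_fn(*inputs)
    except Exception:
        return 0.0                            # crashes earn no reward
    if not torch.allclose(out, reference_fn(*inputs), atol=atol):
        return 0.0                            # incorrect results earn no reward
    return time_cuda(reference_fn, inputs) / time_cuda(candidate_fn, inputs)
```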
Jul, 20
Specx: a C++ task-based runtime system for heterogeneous distributed architectures
Parallelization is needed everywhere, from laptops and mobile phones to supercomputers. Among parallel programming models, task-based programming has demonstrated powerful potential and is widely used in high-performance scientific computing. Not only does it allow efficient parallelization across distributed heterogeneous computing nodes, but it also enables elegant source code structuring by describing hardware-independent algorithms. […]
Jul, 20
Pre-Training LLMs on a budget: A comparison of three optimizers
Optimizers play a decisive role in reducing pre-training times for LLMs and in achieving better-performing models. In this study, we compare three major variants: the de facto standard AdamW, the simpler Lion, developed through an evolutionary search, and the second-order optimizer Sophia. For better generalization, we train with two different base architectures and use a single- and […]
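As a hint of why Lion counts as "simpler", the sketch below writes out a single Lion parameter update (the sign of an interpolated momentum plus decoupled weight decay, with one momentum buffer as the only state) next to the stock AdamW constructor. The hyperparameter values are illustrative defaults, not the study's settings, and Sophia's second-order update is omitted.

```python
# One Lion update step, following the published update rule; contrast with
# AdamW, which keeps two running moments per parameter and rescales by them.
import torch

def lion_step(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.1):
    update = (beta1 * momentum + (1 - beta1) * grad).sign()     # sign of interpolated momentum
    param.add_(update + weight_decay * param, alpha=-lr)        # decoupled weight decay
    momentum.mul_(beta2).add_(grad, alpha=1 - beta2)            # single state buffer
    return param, momentum

# AdamW equivalent in PyTorch (illustrative hyperparameters):
# torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.95), weight_decay=0.1)
```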
Jul, 20
Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks
The rapid pace of scientific research creates a growing need for compute power, a need that GPUs partly address. This paper presents a microarchitectural analysis of the modern NVIDIA Blackwell architecture, studying GPU performance features with carefully designed microbenchmarks. We unveil key subsystems, including the memory hierarchy, SM execution pipelines, and the SM […]
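The paper's microbenchmarks are not reproduced in this excerpt; as a flavor of the genre, the toy below times a large device-to-device copy with CUDA events to estimate global-memory bandwidth. It is a generic illustration, not the authors' methodology.

```python
# Toy memory-bandwidth microbenchmark: time a 1 GiB device-to-device copy
# with CUDA events and report GB/s (counting read + write traffic).
import torch

def measure_copy_bandwidth(n_bytes=1 << 30, iters=10):
    src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
    dst = torch.empty_like(src)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        dst.copy_(src)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3
    return 2 * n_bytes * iters / seconds / 1e9

print(f"{measure_copy_bandwidth():.1f} GB/s")
```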
Jul, 20
Using Deep Reinforcement Learning for Automatic Code Optimization in the MLIR Compiler
This work focuses on the use of deep reinforcement learning (DRL) to automate code optimization within modern compiler infrastructures. Code optimization is a critical step in program transformation that aims to improve performance and reduce resource consumption while preserving correctness. Traditional approaches to code optimization rely on manual or heuristic-based methods, which are often time-consuming […]
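The excerpt does not describe the paper's state encoding or action space; the skeleton below only illustrates the general shape of such a setup as a Gymnasium-style environment in which actions choose transformation passes and the reward is relative speedup. The pass names, the 16-dimensional observation, and the apply_pass/benchmark helpers are hypothetical placeholders.

```python
# Skeleton of a Gymnasium-style environment for compiler pass ordering.
# `apply_pass` and `benchmark` are hypothetical callables supplied by the
# user; the zero observation stands in for an unspecified state encoding.
import gymnasium as gym
import numpy as np

class PassOrderingEnv(gym.Env):
    PASSES = ["loop-tile", "loop-unroll", "vectorize", "stop"]

    def __init__(self, program, benchmark, apply_pass):
        self.program, self.benchmark, self.apply_pass = program, benchmark, apply_pass
        self.action_space = gym.spaces.Discrete(len(self.PASSES))
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(16,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.current = self.program
        self.baseline = self.benchmark(self.current)
        return np.zeros(16, dtype=np.float32), {}

    def step(self, action):
        if self.PASSES[action] == "stop":
            return np.zeros(16, dtype=np.float32), 0.0, True, False, {}
        self.current = self.apply_pass(self.current, self.PASSES[action])
        reward = self.baseline / self.benchmark(self.current) - 1.0   # relative speedup
        return np.zeros(16, dtype=np.float32), reward, False, False, {}
```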
Jul, 13
Mutual-Supervised Learning for Sequential-to-Parallel Code Translation
The rise of GPU-based high-performance computing (HPC) has driven the widespread adoption of parallel programming models such as CUDA. Yet the inherent complexity of parallel programming creates demand for automated sequential-to-parallel translation approaches. However, data scarcity poses a significant challenge for machine learning-based sequential-to-parallel code translation. Although recent back-translation methods show promise, they still […]
Jul, 13
KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling
Autoscaling GPU inference workloads in Kubernetes remains challenging due to the reactive and threshold-based nature of default mechanisms such as the Horizontal Pod Autoscaler (HPA), which struggle under dynamic and bursty traffic patterns and lack integration with GPU-level metrics. We present KIS-S, a unified framework that combines KISim, a GPU-aware Kubernetes Inference Simulator, with KIScaler, […]
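For context on the "reactive and threshold-based" baseline, the default HPA scaling rule documented by Kubernetes reduces to a single ratio, sketched below; KIScaler's RL-based policy is beyond this excerpt.

```python
# The documented HPA rule: desiredReplicas =
#   ceil(currentReplicas * currentMetricValue / targetMetricValue)
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric):
    return math.ceil(current_replicas * current_metric / target_metric)

# e.g. 4 pods observing 90% average utilization against a 60% target -> 6 pods
print(hpa_desired_replicas(4, 90, 60))
```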
Jul, 13
Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
The NVIDIA Collective Communication Library (NCCL) is a critical software layer enabling high-performance collectives on large-scale GPU clusters. Despite being open source with a documented API, its internal design remains largely opaque. The orchestration of communication channels, selection of protocols, and handling of memory movement across devices and nodes are not well understood, making it […]
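As a reminder of the API surface beneath which NCCL makes these decisions, the snippet below issues a single all-reduce through torch.distributed's NCCL backend (assumed to be launched with torchrun so the rendezvous environment variables are set); the channel, algorithm, and protocol selection the paper analyzes happens inside this one call.

```python
# Minimal NCCL all-reduce via torch.distributed; run under torchrun. NCCL's
# internal ring/tree and LL/LL128/Simple choices happen beneath all_reduce.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
rank = dist.get_rank()

x = torch.ones(1 << 20, device="cuda") * rank
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f"rank {rank}: sum of ranks = {x[0].item():.0f}")

dist.destroy_process_group()
```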