Posts
Jun, 9
Fast and Practical Strassen’s Matrix Multiplication using FPGAs
Matrix multiplication is a cornerstone operation in a wide array of scientific fields, including machine learning and computer graphics. The standard algorithm for matrix multiplication has a complexity of O(n^3) for n×n matrices. Strassen’s algorithm improves this to O(n^2.807), but its practicality is limited for small to medium matrix sizes due to the large number […]
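To make the gap concrete, here is a minimal sketch of one level of Strassen’s recursion in NumPy, assuming square matrices of even size. It is the textbook scheme (seven quadrant products instead of eight), not the paper’s FPGA design:

```python
import numpy as np

# One level of Strassen's recursion: 7 quadrant products instead of
# the classical algorithm's 8. Assumes square matrices of even size.
def strassen_step(A, B):
    h = A.shape[0] // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7   # C11
    C[:h, h:] = M3 + M5             # C12
    C[h:, :h] = M2 + M4             # C21
    C[h:, h:] = M1 - M2 + M3 + M6   # C22
    return C
```

Recursing on the seven products down to a cutoff gives the O(n^log2(7)) ≈ O(n^2.807) bound; the extra additions and memory traffic are what hurt at small and medium sizes.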
Jun, 9
Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs
This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving on heterogeneous GPU clusters. A key idea behind Helix is to formulate LLM inference over heterogeneous GPUs and network connections as a max-flow problem on a directed, weighted graph, whose nodes represent GPU instances and edges capture both […]
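As a toy illustration of the max-flow framing, not Helix’s actual formulation, per-GPU compute limits can be folded into the graph with the standard node-splitting trick and solved with networkx; the cluster layout and capacity numbers below are hypothetical:

```python
import networkx as nx

G = nx.DiGraph()

# Hypothetical per-GPU throughput caps, modeled via node splitting:
# an internal edge gpu_in -> gpu_out bounds the node's throughput.
gpus = {"a100": 300, "l4": 120, "t4": 60}
for name, cap in gpus.items():
    G.add_edge(f"{name}_in", f"{name}_out", capacity=cap)

# Edge capacities model network bandwidth between instances.
G.add_edge("src", "a100_in", capacity=400)
G.add_edge("src", "l4_in", capacity=400)
G.add_edge("a100_out", "t4_in", capacity=100)
G.add_edge("l4_out", "t4_in", capacity=100)
G.add_edge("t4_out", "sink", capacity=400)

flow_value, _ = nx.maximum_flow(G, "src", "sink")
print(flow_value)  # throughput bound for this toy pipeline
```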
Jun, 2
Addressing Challenges in Utilizing GPUs for Accelerating Privacy-Preserving Computation
Cloud computing increasingly handles confidential data, as in private inference and private database queries. Two strategies are used for secure computation: (1) employing CPU Trusted Execution Environments (TEEs) like AMD SEV, Intel SGX, or ARM TrustZone, and (2) utilizing emerging cryptographic methods like Fully Homomorphic Encryption (FHE) with libraries such as HElib, Microsoft SEAL, and PALISADE. To […]
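The cryptographic route computes directly on ciphertexts. As a toy illustration of that property alone, here is a Paillier scheme in Python; it is additively homomorphic, not fully homomorphic, far simpler than what HElib, SEAL, or PALISADE implement, and its tiny hard-coded primes are insecure:

```python
import math, random

p, q = 293, 433                  # demo-sized primes, insecure
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)             # valid because g = n + 1

def encrypt(m):
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return pow(1 + n, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

c1, c2 = encrypt(20), encrypt(22)
# Multiplying ciphertexts adds the underlying plaintexts.
assert decrypt(c1 * c2 % n2) == 42
```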
Jun, 2
Machine learning enhanced code optimization for high-level synthesis (ML-ECOHS)
While Field-Programmable Gate Arrays (FPGAs) exist in many design configurations throughout the data center, cloud, and edge, the performance and flexibility the FPGA promises often go unrealized for lack of hardware design expertise, and most computation remains in fixed hardware such as CPUs, GPUs, and ASICs (e.g., tensor processors). Identifying programmability as […]
Jun, 2
Evaluation of computational and energy performance in matrix multiplication algorithms on CPU and GPU using MKL, cuBLAS and SYCL
Matrix multiplication is fundamental to the backpropagation algorithm used to train deep neural network models. Libraries like Intel’s MKL and NVIDIA’s cuBLAS implement new, optimized matrix multiplication techniques that increase performance and reduce computational cost. These techniques can also be implemented in CUDA and SYCL, and in functions using AVX2 and AVX-512 instructions, which have […]
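As a minimal sketch of the kind of measurement such an evaluation involves, the snippet below times a single-precision GEMM through NumPy, which dispatches to whatever optimized BLAS it was built against (e.g. MKL or OpenBLAS); the matrix size is an arbitrary choice:

```python
import time
import numpy as np

n = 4096
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)

A @ B                              # warm-up run
t0 = time.perf_counter()
C = A @ B
elapsed = time.perf_counter() - t0

# A classical n x n GEMM performs 2*n^3 floating-point operations.
print(f"{2 * n**3 / elapsed / 1e9:.1f} GFLOP/s")
```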
Jun, 2
An implementation of tensor product patch smoothers on GPU
We present a GPU implementation of vertex-patch smoothers for higher order finite element methods in two and three dimensions. Analysis shows that they are not memory bound with respect to GPU DRAM, but with respect to on-chip scratchpad memory. Multigrid operations are optimized through localization and reorganized local operations in on-chip memory, achieving minimal global […]
Jun, 2
A Survey of Cloud-Based GPU Threats and Their Impact on AI, HPC, and Cloud Computing
Graphics processing units (GPUs) are the hardware engines driving the AI revolution. Large language model (LLM)-powered generative AI (GenAI) became mainstream with the public release of OpenAI’s ChatGPT. AI usage has given rise to innovative AI-powered applications for businesses, productivity, image generation, video generation, data analysis, and social media, among others. Powering AI applications are […]
May, 26
Enabling full-speed random access to the entire memory on the A100 GPU
We describe some features of the A100 memory architecture. In particular, we give a technique to reverse-engineer some hardware layout information. Using this information, we show how to avoid TLB issues to obtain full-speed random HBM access to the entire memory, as long as we constrain any particular thread to a reduced access window of […]
May, 26
ArchesWeather: An efficient AI weather forecasting model at 1.5° resolution
One of the guiding principles for designing AI-based weather forecasting systems is to embed physical constraints as inductive priors in the neural network architecture. A popular prior is locality, where the atmospheric data is processed with local neural interactions, like 3D convolutions or 3D local attention windows as in Pangu-Weather. On the other hand, some […]
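As a minimal illustration of the locality prior, a 3D convolution ties each output to a small neighborhood of the (level, lat, lon) grid; the shape below is a hypothetical 1.5° grid, not ArchesWeather’s actual architecture:

```python
import torch
import torch.nn as nn

# Hypothetical input: batch, channels, pressure levels, lat, lon
x = torch.randn(1, 8, 13, 121, 240)   # 1.5 deg: 121 x 240 grid

# Locality prior: each output depends only on a 3x3x3 neighborhood.
local = nn.Conv3d(in_channels=8, out_channels=8,
                  kernel_size=3, padding=1)
print(local(x).shape)  # torch.Size([1, 8, 13, 121, 240])
```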
May, 26
GPU Implementations for Midsize Integer Addition and Multiplication
This paper explores practical aspects of using a high-level functional language for GPU-based arithmetic on “midsize” integers. By this we mean integers of up to about a quarter million bits, which is sufficient for most practical purposes. The goal is to understand whether it is possible to support efficient nested-parallel programs with a small, flexible […]
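The core representation in such work is limb-based: an integer becomes an array of machine words. Below is a minimal sequential sketch in Python (the paper itself targets a parallel functional language; on a GPU the carry chain becomes a parallel scan):

```python
BASE = 1 << 32   # 32-bit limbs, least-significant limb first

def bignum_add(a, b):
    # Schoolbook addition with sequential carry propagation;
    # assumes equal-length limb lists.
    out, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        out.append(s % BASE)
        carry = s // BASE
    if carry:
        out.append(carry)
    return out

# (2^64 - 1) + 1 == 2^64:
print(bignum_add([0xFFFFFFFF, 0xFFFFFFFF], [1, 0]))  # [0, 0, 1]
```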
May, 26
STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning
The relentless growth of modern Machine Learning models has spurred the adoption of sparsification techniques that simplify their architectures and reduce their computational demands. Network pruning has demonstrated success in maintaining original network accuracy while shedding significant portions of the original weights. However, leveraging this sparsity efficiently remains challenging due to computational irregularities, particularly in […]
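As a minimal illustration of the setting, here is magnitude pruning followed by a stock CSR kernel in SciPy, not the paper’s autotuned kernels; the shapes and 90% sparsity level are arbitrary:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)

# Magnitude pruning: zero out the 90% smallest weights.
W[np.abs(W) < np.quantile(np.abs(W), 0.9)] = 0.0

W_sparse = csr_matrix(W)   # compressed sparse row storage
x = rng.standard_normal((1024, 16)).astype(np.float32)

# Same math as the dense product, but the sparse kernel's speed
# depends on the irregular nonzero pattern -- what autotuning targets.
assert np.allclose(W_sparse @ x, W @ x, atol=1e-3)
```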
May, 26
Kernel-Centric Optimizations for Deep Neural Networks on GPGPU
Deep learning has achieved remarkable success across various domains, ranging from computer vision to healthcare. General-Purpose Graphics Processing Unit (GPGPU) is one of the major driving forces behind this revolution. GPGPUs offer massive parallel computational power, enabling the training and deployment of large-scale neural networks within practical time and resource constraints. Their programmability also enables […]