27159

Posts

Aug, 28

Fully Concurrent GPU Data Structures

Building efficient concurrent data structures that scale to the level of GPU parallelism is a challenging problem. Solutions designed for the CPU do not scale to the thousands of cores that modern GPUs offer. In this dissertation, we show how to efficiently build a lock-based B-Tree on the GPU; then, we show how to extend […]
Aug, 28

Long Code for Code Search

Thanks to the Transformer-based pretraining models, the performance of code search has been improved significantly. However, due to the restriction of multi-head self-attention and GPU memory, there is a limit on the input token length. The existing pretrained code models, such as GraphCodeBERT, CodeBERT, RoBERTa (code), take the first 256 tokens by default, which makes […]
Aug, 28

Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis

Graphics processing units (GPUs) are now considered the leading hardware to accelerate general-purpose workloads such as AI, data analytics, and HPC. Over the last decade, researchers have focused on demystifying and evaluating the microarchitecture features of various GPU architectures beyond what vendors reveal. This line of work is necessary to understand the hardware better and […]
Aug, 28

Exploring Thread Coarsening on FPGA

Over the past few years, there has been an increased interest in including FPGAs in data centers and high-performance computing clusters along with GPUs and other accelerators. As a result, it has become increasingly important to have a unified, high-level programming interface for CPUs, GPUs and FPGAs. This has led to the development of compiler […]
Aug, 21

BenchPress: A Deep Active Benchmark Generator

We develop BenchPress, the first ML benchmark generator for compilers that is steerable within feature space representations of source code. BenchPress synthesizes compiling functions by adding new code in any part of an empty or existing sequence by jointly observing its left and right context, achieving excellent compilation rate. BenchPress steers benchmark generation towards desired […]
Aug, 21

Performance analysis of matrix-free conjugate gradient kernels using SYCL

We examine the performance of matrix-free SYCL implementations of the conjugate gradient method for solving sparse linear systems of equations. Performance is tested on an NVIDIA A100-80GB device and a dual socket Intel Ice Lake CPU node using different SYCL implementations, and compared to CUDA BLAS (cuBLAS) implementations on the A100 GPU and MKL implementations […]
Aug, 21

Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training

Training deep neural networks (DNNs) is becoming more and more resource- and energy-intensive every year. Unfortunately, existing works primarily focus on optimizing DNN training for faster completion, often without considering the impact on energy efficiency. In this paper, we observe that common practices to improve training performance can often lead to inefficient energy usage. More […]
Aug, 21

Optimization of GPU workloads using natural language processing based on deep learning techniques

Setting program parameters is challenging due to the abstract relationship between hardware and software. Automatic optimization algorithms that are accurate are required to cope with the complexity and variety of current hardware and software. Autotuning has always relied on time-consuming trial and error approaches. Machine learning (ML) and Natural Language Processing (NLP) has flourished over […]
Aug, 21

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cut the memory needed for inference by half while retaining full precision performance. With our method, a 175B parameter 16/32-bit checkpoint can be loaded, […]
Aug, 7

A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters

Deep learning has been postulated as a solution for numerous problems in different branches of science. Given the resource-intensive nature of these models, they often need to be executed on specialized hardware such graphical processing units (GPUs) in a distributed manner. In the academic field, researchers get access to this kind of resources through High […]
Aug, 7

Deep Learning Approaches to Source Code Analysis for Optimization of Heterogeneous Systems: Recent Results, Challenges and Opportunities

To cope with the increasing complexity of digital systems programming, deep learning techniques have recently been proposed to enhance software deployment by analysing source code for different purposes, ranging from performance and energy improvement to debugging and security assessment. As embedded platforms for cyber-physical systems are characterised by increasing heterogeneity and parallelism, one of the […]
Aug, 7

Real-Time High-Performance Computing for Embedded Control Systems

Critical real-time systems include a wide spectrum of computer systems whose correct behavior is dictated not only by correct functionality but also by their timely execution with respect to predefined deadlines. The increasing demand for higher performance in these systems has led the industry to recently include embedded Graphics Processing Units (GPUs), mainly for machine […]

* * *

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hpgu.org