high performance computing on graphics processing units: hgpu.org

Posts

Jul, 30

Bandicoot: C++ Library for GPU Linear Algebra and Scientific Computing

This report provides an introduction to the Bandicoot C++ library for GPU linear algebra and scientific computing, detailing its user interface and performance characteristics as well as the technical details of its internal design. Bandicoot is the GPU-enabled counterpart to the well-known Armadillo C++ linear algebra library, aimed at allowing users to enable GPU computation […]

CUDA • OpenCL

Jul, 30

Efficiency without Tears: Securing Multilingual Programs with TRINITY

Despite the fact that most real-world programs are developed in multiple languages in the era of data science, existing security techniques are still limited to single-language programs. Worse yet, languages designed for high-performance computing often ignore the necessary security checking in foreign function interfaces (FFI) to pursue supreme execution efficiency. In consequence, security flaws and […]

OpenCL

Jul, 30

Fast Knowledge Graph Completion using Graphics Processing Units

Knowledge graphs can be used in many areas related to data semantics, such as question-answering systems and knowledge-based systems. However, currently constructed knowledge graphs need to be complemented with additional relations, a task called knowledge graph completion. To add new relations to the existing knowledge graph by using knowledge graph […]

CUDA

Jul, 30

A portable C++ library for memory and compute abstraction on multi-core CPUs and GPUs

We present a C++ library for transparent memory and compute abstraction across CPU and GPU architectures. Our library combines generic data structures like vectors, multi-dimensional arrays, maps, graphs, and sparse grids with basic generic algorithms like arbitrary-dimensional convolutions, copying, merging, sorting, prefix sum, reductions, neighbor search, and filtering. The memory layout of the data structures […]

CUDA • OpenCL

Jul, 24

Creating a Dataset Supporting Translation Between OpenMP Fortran and C++ Code

In this study, we present a novel dataset for training machine learning models translating between OpenMP Fortran and C++ code. To ensure reliability and applicability, the dataset is initially refined using a meticulous code similarity test. The effectiveness of our dataset is assessed using both quantitative (CodeBLEU) and qualitative (human evaluation) methods. We demonstrate how […]

CUDA

Jul, 24

ProtoX: A First Look

We present a first look at ProtoX, a code generation framework for stencil and pointwise operations that occur frequently in the numerical solution of partial differential equations. ProtoX has Proto as its library frontend and SPIRAL as its backend. Proto is a C++-based domain-specific library which optimizes the algorithms used to compute the […]

Jul, 24

qecGPT: decoding Quantum Error-correcting Codes with Generative Pre-trained Transformers

We propose a general framework for decoding quantum error-correcting codes with generative modeling. The model utilizes autoregressive neural networks, specifically Transformers, to learn the joint probability of logical operators and syndromes. Training is unsupervised, without the need for labeled training data, and is thus referred to as pre-training. After the pre-training, […]

Jul, 24

Maximizing Parallelism and GPU Utilization For Direct GPU Compilation Through Ensemble Execution

GPUs are renowned for their exceptional computational acceleration capabilities achieved through massive parallelism. However, utilizing GPUs for computation requires manual identification of code regions suitable for offloading, data transfer management, and synchronization. Recent advancements have capitalized on the LLVM/OpenMP portable target offloading interface, elevating GPU acceleration to new heights. This approach, known as the direct […]

CUDA

Jul, 24

eGPU: A 750 MHz Class Soft GPGPU for FPGA

This paper introduces the eGPU, a SIMT soft processor designed for FPGAs. Soft processors typically achieve modest operating frequencies, a fraction of the headline performance claimed by modern FPGA families, and obtain correspondingly modest performance results. We propose a GPGPU architecture structured specifically to take advantage of both the soft logic and embedded features of […]

Jul, 16

Towards Intelligent Runtime Framework for Distributed Heterogeneous Systems

Scientific applications strive for increased memory and computing performance, requiring massive amounts of data and time to produce results. Applications utilize large-scale, parallel computing platforms with advanced architectures to accommodate their needs. However, developing performance-portable applications for modern, heterogeneous platforms requires lots of effort and expertise in both the application and systems domains. This is […]

CUDA • OpenCL

Jul, 16

Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks

Building large AI fleets to support rapidly growing DL workloads is an active research topic for modern cloud providers. Generating accurate benchmarks plays an essential role in designing the fast-paced software and hardware solutions in this space. Two fundamental challenges in making this scalable are (i) workload representativeness and (ii) the ability to quickly […]

CUDA

Jul, 16

Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU

Many applications, such as autonomous driving and augmented reality, require concurrently running multiple deep neural networks (DNNs) that pose different levels of real-time performance requirements. However, coordinating multiple DNN tasks with varying levels of criticality on edge GPUs remains an area of limited study. Unlike server-level GPUs, edge GPUs are resource-limited and lack […]

CUDA

Recent source codes

CrossTL: Universal Programming Language & Translator

TBD-GPU

DG-SWEM - The Discontinuous Galerkin Shallow Water Equation Model

torchPDLP: Primal-Dual Linear Programming in PyTorch. In collaboration with AMD and IPAM

Benchmarks for Dissecting CPU-GPU Unified Physical Memory on AMD MI300A APUs

kvcached: Elastic KV cache for dynamic GPU sharing and efficient multi-LLM inference

Bandicoot: C++ library for GPU accelerated linear algebra

Luthier: Bridging Auto-Tuning and Vendor Libraries for Efficient Deep Learning Inference

Fused Kernel Library (FKL)

GPUHammer: Rowhammer Attacks on GPU Memories are Practical

Most viewed papers (last 30 days)