25517

Posts

Sep, 5

LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs

Exploiting data locality in GPGPUs is critical for efficiently using the smaller data caches and handling the memory bottleneck problem. This paper proposes a thread block-centric locality analysis, which identifies the locality among the thread blocks (TBs) in terms of a number of common data references. In LocalityGuru, we seek to employ a detailed just-in-time […]
Sep, 5

High Performance GPU Code Generation for Matrix-Matrix Multiplication using MLIR: Some Early Results

This report presents some early results on code generation targeting tensor cores on NVIDIA GPUs using the MLIR compiler infrastructure. The state-of-the-art in high-performance deep learning today is primarily driven by manually optimized highly tuned libraries. The approach to develop such libraries is often not modular or reusable to the same extent that compiler infrastructure […]
Aug, 29

A Survey of Big Data, High Performance Computing, and Machine Learning Benchmarks

In recent years, there has been a convergence of Big Data (BD), High Performance Computing (HPC), and Machine Learning (ML) systems. This convergence is due to the increasing complexity of long data analysis pipelines on separated software stacks. With the increasing complexity of data analytics pipelines comes a need to evaluate their systems, in order […]
Aug, 29

The Art of Balance: A RateupDB Experience of Building a CPU/GPU Hybrid Database Product

GPU-accelerated database systems have been studied for more than 10 years, ranging from prototyping development to industry products serving in multiple domains of data applications. Existing GPU database research solutions are often focused on specific aspects in parallel algorithms and system implementations for specific features, while industry product development generally concentrates on delivering a whole […]
Aug, 29

Performance Portability and Evaluation of Heterogeneous Components of SeisSol Targeted to Upcoming Intel HPC GPUs

For the first time in over 20 years, Intel is selling discrete graphics cards, including products for high-performance computing, scheduled for release in 2022. This thesis investigates programming models for the upcoming Intel GPUs and selects the Sycl standard, provided by oneAPI and hipSYCL, to port the heterogeneous components of SeisSol. The modules in question […]
Aug, 29

neoSYCL: a SYCL implementation for SX-Aurora TSUBASA

Recently, the high-performance computing world has moved to more heterogeneous architectures. Thus, it has become a standard practice to offload a part of application execution to dedicated accelerators. However, the disadvantage in productivity is still a problem in programming for accelerators. This paper proposes neoSYCL: a SYCL implementation for SX-Aurora TSUBASA, aiming to improve productivity […]
Aug, 29

GRIM: A General, Real-Time Deep Learning Inference Framework for Mobile Devices based on Fine-Grained Structured Weight Sparsity

It is appealing but challenging to achieve real-time deep neural network (DNN) inference on mobile devices because even the powerful modern mobile devices are considered as "resource-constrained" when executing large-scale DNNs. It necessitates the sparse model inference via weight pruning, i.e., DNN weight sparsity, and it is desirable to design a new DNN weight sparsity […]
Aug, 22

perf4sight: A toolflow to model CNN training performance on Edge GPUs

The increased memory and processing capabilities of today’s edge devices create opportunities for greater edge intelligence. In the domain of vision, the ability to adapt a Convolutional Neural Network’s (CNN) structure and parameters to the input data distribution leads to systems with lower memory footprint, latency and power consumption. However, due to the limited compute […]
Aug, 22

EXA2PRO: A Framework for High Development Productivity on Heterogeneous Computing Systems

Programming upcoming exascale computing systems is expected to be a major challenge. New programming models are required to improve programmability, by hiding the complexity of these systems from application developers. The EXA2PRO programming framework aims at improving developers productivity for applications that target heterogeneous computing systems. It is based on advanced programming models and abstractions […]
Aug, 22

Parallel time integration using Batched BLAS (Basic Linear Algebra Subprograms) routines

We present an approach for integrating the time evolution of quantum systems. We leverage the computation power of graphics processing units (GPUs) to perform the integration of all time steps in parallel. The performance boost is especially prominent for small to medium-sized quantum systems. The devised algorithm can largely be implemented using the recently-specified batched […]
Aug, 22

Performance comparison of CFD-DEM solver MFiX-Exa, on GPUs and CPUs

We present computational performance comparisons of gas-solid simulations performed on current CPU and GPU architectures using MFiX Exa, a CFD-DEM solver that leverages hybrid CPU+GPU parallelism. A representative fluidized bed simulation with varying particle numbers from 2 to 67 million is used to compare serial and parallel performance. A single GPU was observed to be […]
Aug, 22

Better GPU Hash Tables

We revisit the problem of building static hash tables on the GPU and design and build three bucketed hash tables that use different probing schemes. Our implementations are lock-free and offer efficient memory access patterns; thus, only the probing scheme is the factor affecting the performance of the hash table’s different operations. Our results show […]

* * *

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors

Contact us: