Posts
Jul, 3
Optimizing the Performance of Parallel and Concurrent Applications Based on Asynchronous Many-Task Runtimes
Nowadays, High-Performance Computing (HPC) scientific applications often face performance, scalability, portability, and efficiency challenges when running on heterogeneous supercomputers. For years, supercomputer architectures have been changing rapidly and growing more complex, and these challenges will become even harder as we enter the exascale era, where computers will exceed one quintillion calculations […]
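The abstract is cut off here, but the core idea of an asynchronous many-task (AMT) runtime is to decompose a computation into many small tasks with explicit dependencies and let a scheduler overlap independent work. A minimal sketch of that programming model using Python's concurrent.futures; real AMT runtimes such as HPX, Charm++, or Legion add distributed scheduling, work stealing, and locality control on top:

    # Tasks are launched eagerly; blocking happens only where a result is needed,
    # so independent tasks overlap. Illustrates the AMT programming model only.
    from concurrent.futures import ThreadPoolExecutor

    def stage_a(x):
        return x * 2          # independent task

    def stage_b(x):
        return x + 10         # independent task, can run concurrently with stage_a

    with ThreadPoolExecutor() as pool:
        fa = pool.submit(stage_a, 3)          # launched immediately
        fb = pool.submit(stage_b, 3)          # overlaps with stage_a
        result = fa.result() + fb.result()    # dependency: wait only here

    print(result)  # 19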
Jul, 3
TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s
This paper presents a novel nearest neighbor search algorithm that achieves peak performance on the TPU (Google Tensor Processing Unit), outperforming state-of-the-art GPU algorithms at a similar level of recall. The design of the proposed algorithm is motivated by an accurate accelerator performance model that takes into account both memory and instruction bottlenecks. Our algorithm comes with an […]
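The snippet is truncated, but the standard way such algorithms stay compute-bound on an accelerator is to reduce distance computation to a dense matrix multiply and avoid a full sort with a partial top-k. A hedged NumPy sketch of that pattern (the sizes and names are illustrative, not from the paper):

    # ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2; the per-query ||q||^2 term is
    # constant, so ranking needs only -2 q.x + ||x||^2, i.e. one big matmul.
    import numpy as np

    rng = np.random.default_rng(0)
    db = rng.standard_normal((10_000, 128)).astype(np.float32)   # database points
    queries = rng.standard_normal((32, 128)).astype(np.float32)  # query batch
    k = 10

    db_sq = (db * db).sum(axis=1)            # ||x||^2, precomputed once
    scores = -2.0 * queries @ db.T + db_sq   # (num_queries, num_db)
    # argpartition is an O(n) partial top-k, the same spirit as the approximate
    # top-k the paper uses to dodge the sorting bottleneck.
    idx = np.argpartition(scores, k, axis=1)[:, :k]
    print(idx.shape)  # (32, 10)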
Jun, 26
An experimental study of group-by and aggregation on CPU-GPU processors
Hash-based group-by and aggregation is a fundamental operator in database systems. Modern discrete GPUs (graphics processing units) have been explored as accelerators for it, but data transfer over the PCIe (Peripheral Component Interconnect Express) bus can erode the gains. On recent architectures, the GPU and the CPU (central processing unit) are built into the same […]
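For readers unfamiliar with the operator itself: hash-based group-by keeps one hash-table bucket per group key and accumulates the aggregate on each probe. A minimal single-threaded Python sketch of the baseline that GPU implementations parallelize:

    # One expected-O(1) hash-table probe and update per input row.
    from collections import defaultdict

    rows = [("us", 3), ("eu", 5), ("us", 7), ("ap", 1), ("eu", 2)]

    sums = defaultdict(int)      # hash table: group key -> running SUM aggregate
    for key, value in rows:
        sums[key] += value

    print(dict(sums))  # {'us': 10, 'eu': 7, 'ap': 1}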
Jun, 26
SnuHPL: high performance LINPACK for heterogeneous GPUs
These days, it is typical for a large-scale cluster system to contain several different kinds of GPUs. However, HPL (High-Performance LINPACK), the de facto standard LINPACK implementation for evaluating the performance of a cluster system, was originally designed to work only on homogeneous CPU-only systems. In this paper, we develop SnuHPL, an optimized HPL for clusters of […]
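The abstract is truncated before the method, but any HPL for heterogeneous GPUs has to divide the matrix unevenly across devices of different speeds. A toy sketch of that load-balancing flavor, assigning block-columns in proportion to measured per-GPU throughput; this is a guess at the general problem, not SnuHPL's actual scheme:

    def partition_blocks(num_blocks, gflops_per_gpu):
        # Ideal fractional share per GPU, rounded down; the remainder goes to
        # the fastest devices so every block is assigned exactly once.
        total = sum(gflops_per_gpu)
        shares = [int(num_blocks * g / total) for g in gflops_per_gpu]
        leftover = num_blocks - sum(shares)
        for i in sorted(range(len(shares)), key=lambda i: -gflops_per_gpu[i]):
            if leftover == 0:
                break
            shares[i] += 1
            leftover -= 1
        return shares

    # One fast GPU plus two slower ones (hypothetical TFLOP/s numbers):
    print(partition_blocks(100, [19.5, 7.8, 7.8]))  # [56, 22, 22]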
Jun, 26
tntorch: Tensor Network Learning with PyTorch
We present tntorch, a tensor learning framework that supports multiple decompositions (including Candecomp/Parafac, Tucker, and Tensor Train) under a unified interface. With our library, the user can learn and handle low-rank tensors with automatic differentiation, seamless GPU support, and the convenience of PyTorch’s API. Besides decomposition algorithms, tntorch implements differentiable tensor algebra, rank truncation, cross-approximation, […]
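To make the Tensor Train format concrete: it stores one small 3-D core per tensor mode and reconstructs the full tensor by contracting the cores in a chain. A plain-PyTorch sketch of the format itself (not tntorch's actual API):

    import torch

    # A rank-(1, 2, 2, 1) TT representation of a 4 x 5 x 6 tensor:
    cores = [
        torch.randn(1, 4, 2),   # core 1: (r0, n1, r1)
        torch.randn(2, 5, 2),   # core 2: (r1, n2, r2)
        torch.randn(2, 6, 1),   # core 3: (r2, n3, r3)
    ]

    def tt_full(cores):
        # Contract the chain of cores; the result has one mode per core.
        out = cores[0]
        for core in cores[1:]:
            out = torch.einsum("...a,abc->...bc", out, core)
        return out.squeeze(0).squeeze(-1)

    print(tt_full(cores).shape)  # torch.Size([4, 5, 6])
    # Storage: 8 + 20 + 12 = 40 numbers instead of 4 * 5 * 6 = 120.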
Jun, 26
Deep Learning Models on CPUs: A Methodology for Efficient Training
GPUs have been favored for training deep learning models due to their highly parallel architecture. As a result, most studies on training optimization focus on GPUs. There is often a trade-off, however, between cost and efficiency when choosing hardware for training. In particular, CPU servers can be beneficial if […]
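One concrete knob that any CPU-training methodology has to tune is the intra-op thread count, which PyTorch exposes directly. A small illustrative micro-benchmark; this is only one of many factors such a methodology would cover, and the right setting depends on core count and NUMA layout:

    import time
    import torch

    x = torch.randn(2048, 2048)
    y = torch.randn(2048, 2048)

    for threads in (1, 2, 4):
        torch.set_num_threads(threads)       # intra-op parallelism knob
        start = time.perf_counter()
        for _ in range(10):
            _ = x @ y                        # GEMM, the dominant training kernel
        print(f"{threads} thread(s): {time.perf_counter() - start:.3f}s")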
Jun, 26
Open-source FPGA-ML codesign for the MLPerf Tiny Benchmark
We present our development experience and recent results for the MLPerf Tiny Inference Benchmark on field-programmable gate array (FPGA) platforms. We use the open-source hls4ml and FINN workflows, which aim to democratize AI-hardware codesign of optimized neural networks on FPGAs. We present the design and implementation process for the keyword spotting, anomaly detection, and image […]
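For a sense of what the hls4ml side of this codesign flow looks like, here is the conversion entry point from hls4ml's documented tutorials; the toy model, output directory, and FPGA part number below are placeholders, not the benchmark's configuration:

    import hls4ml
    from tensorflow import keras

    # Toy stand-in for a trained keyword-spotting/anomaly-detection model.
    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(16,)),
        keras.layers.Dense(10, activation="softmax"),
    ])

    # Derive a precision/reuse configuration, then emit an HLS project.
    config = hls4ml.utils.config_from_keras_model(model, granularity="model")
    hls_model = hls4ml.converters.convert_from_keras_model(
        model,
        hls_config=config,
        output_dir="hls4ml_prj",         # placeholder path
        part="xcu250-figd2104-2L-e",     # placeholder FPGA part
    )
    hls_model.compile()  # builds the C-simulation library locally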
Jun, 19
MapReduce for Counting Word Frequencies with MPI and GPUs
In this project, the goal was to use the Julia programming language and parallelization to write a fast MapReduce algorithm for counting word frequencies across large numbers of documents. We first implement the word-frequency counter on a CPU using two processes with MPI. Then, we create another implementation, but on a GPU […]
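The map/reduce structure in question is simple to state: each worker maps a document to a local word histogram, and the histograms are then merged in a reduce step. A Python stand-in for that structure (the project itself uses Julia with MPI and a GPU):

    from collections import Counter
    from multiprocessing import Pool

    def map_count(document):
        return Counter(document.lower().split())   # local per-document counts

    documents = [
        "the quick brown fox",
        "the lazy dog",
        "the fox and the dog",
    ]

    if __name__ == "__main__":
        with Pool() as pool:
            partials = pool.map(map_count, documents)   # map phase, in parallel
        totals = sum(partials, Counter())               # reduce phase: merge
        print(totals.most_common(3))  # [('the', 4), ('fox', 2), ('dog', 2)]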
Jun, 19
PILC: Practical Image Lossless Compression with an End-to-end GPU Oriented Neural Framework
Generative-model-based lossless image compression algorithms have seen great success in improving compression ratios. However, most of them achieve throughput below 1 MB/s even with the most advanced AI accelerator chips, preventing their use in most real-world applications, which often require around 100 MB/s. In this paper, we propose PILC, an end-to-end […]
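The reason a generative model helps at all: an entropy coder can represent a symbol s in about -log2 p(s) bits, so sharper predictions mean shorter codes, and the engineering problem is making the model plus coder fast. A toy illustration of the code-length accounting only (the distribution and message are made up):

    import math

    model_probs = {"a": 0.7, "b": 0.2, "c": 0.1}   # toy predictive distribution
    message = "aaabac"

    bits = sum(-math.log2(model_probs[s]) for s in message)
    print(f"{bits:.2f} bits")   # ~7.70 bits, vs ~9.51 for a uniform code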
Jun, 19
Securing GPU via Region-based Bounds Checking
Graphics processing units (GPUs) have become essential general-purpose computing platforms for accelerating a wide range of workloads, such as deep learning, scientific, and high-performance computing (HPC) applications. However, recent memory-corruption attacks, such as buffer overflows, have exposed security vulnerabilities in GPUs. We demonstrate that out-of-bounds writes are reproducible on an NVIDIA GPU, which can enable […]
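The abstract cuts off before the defense, but the title describes it: validate every access against the bounds of the allocation region it falls in. A toy software model of region-based bounds checking; the paper's mechanism lives in GPU hardware and compiler support, so this only illustrates the concept:

    import bisect

    class RegionTable:
        def __init__(self):
            self.bases, self.sizes = [], []    # allocation regions, sorted by base

        def register(self, base, size):
            i = bisect.bisect(self.bases, base)
            self.bases.insert(i, base)
            self.sizes.insert(i, size)

        def check(self, addr):
            # Find the region with the largest base <= addr, then bounds-test.
            i = bisect.bisect(self.bases, addr) - 1
            return i >= 0 and addr < self.bases[i] + self.sizes[i]

    regions = RegionTable()
    regions.register(0x1000, 256)
    print(regions.check(0x10FF))  # True: last byte inside the region
    print(regions.check(0x1100))  # False: one past the end, out of bounds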
Jun, 19
CoDL: Efficient CPU-GPU Co-execution for Deep Learning Inference on Mobile Devices
Concurrent inference execution on heterogeneous processors is critical to improving the performance of increasingly heavy deep learning (DL) models. However, available inference frameworks can only use one processor at a time, or achieve little speedup from concurrent execution compared to using a single processor. This is due to the challenges of 1) reducing data-sharing overhead, […]
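The shape of CPU-GPU co-execution is easy to sketch even though the hard parts (choosing the split, hiding the data-sharing cost) are the paper's contribution: partition one operator's input, run both halves concurrently, and merge. A thread-based Python stand-in with a placeholder kernel, not CoDL's partitioner:

    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    def run_op(chunk, device):
        return np.tanh(chunk)   # placeholder for a DL operator on `device`

    batch = np.random.randn(64, 128).astype(np.float32)
    split = 40   # rows for the "GPU"; the ratio would come from latency profiling

    with ThreadPoolExecutor(max_workers=2) as pool:
        f_gpu = pool.submit(run_op, batch[:split], "gpu")
        f_cpu = pool.submit(run_op, batch[split:], "cpu")   # runs concurrently
        out = np.concatenate([f_gpu.result(), f_cpu.result()])

    print(out.shape)  # (64, 128)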
Jun, 19
CuPBoP: CUDA for Parallelized and Broad-range Processors
CUDA is one of the most popular choices for GPU programming, but it can only execute on NVIDIA GPUs. Executing CUDA on non-NVIDIA devices would not only benefit the hardware community but also enable data-parallel computation on heterogeneous systems. To make CUDA programs portable, some researchers have proposed source-to-source translators that translate CUDA to […]
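The essence of running CUDA code on hardware without CUDA support is re-expressing the SIMT model, a kernel body executed once per thread index, as ordinary loops. A toy Python emulation of a 1-D grid to show the correspondence (CuPBoP itself works at the compiler level, not like this):

    def launch(kernel, grid_dim, block_dim, *args):
        # What a CUDA launch kernel<<<grid_dim, block_dim>>>(...) means, serially.
        for block in range(grid_dim):
            for thread in range(block_dim):
                kernel(block, thread, block_dim, *args)

    def vec_add(block_idx, thread_idx, block_dim, a, b, out):
        i = block_idx * block_dim + thread_idx   # global thread id
        if i < len(out):                         # CUDA-style bounds guard
            out[i] = a[i] + b[i]

    a, b = list(range(8)), list(range(8))
    out = [0] * 8
    launch(vec_add, 2, 4, a, b, out)
    print(out)  # [0, 2, 4, 6, 8, 10, 12, 14]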