Posts

Oct, 3

Embedded Software Synthesis using Heterogeneous Dataflow Models

Dataflow process networks (DPNs) consist of statically defined process nodes with First-In-First-Out (FIFO) buffered point-to-point connections. DPNs are intrinsically data-driven, i.e., node actions are not synchronized with each other and may fire whenever sufficient input operands have arrived at a node. In this original form, DPNs are data-driven and therefore a suitable model of computation (MoC) […]
Oct, 3

Accelerating Encrypted Computing on Intel GPUs

Homomorphic Encryption (HE) is an emerging encryption scheme that allows computations to be performed directly on encrypted messages. This property enables promising applications such as privacy-preserving deep learning and cloud computing. Prior work has proposed architecture-aware optimizations on CPUs, GPUs and FPGAs to enable practical privacy-preserving applications. However, there is no systematic optimization […]
Sep, 26

Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing System

Taskflow aims to streamline the building of parallel and heterogeneous applications using a lightweight task graph-based approach. Taskflow introduces an expressive task graph programming model to assist developers in the implementation of parallel and heterogeneous decomposition strategies on a heterogeneous computing platform. Our programming model distinguishes itself as a very general class of task graph […]
Sep, 26

Small-Bench NLP: Benchmark for small single GPU trained models in Natural Language Processing

Recent progress in the Natural Language Processing domain has given us several State-of-the-Art (SOTA) pretrained models that can be fine-tuned for specific tasks. These large models, with billions of parameters and trained on numerous GPUs/TPUs over weeks, lead the benchmark leaderboards. In this paper, we discuss the need for a benchmark for cost and […]
Sep, 26

IgNet. A Super-precise Convolutional Neural Network

Convolutional neural networks (CNNs) are known to be an effective means to detect and analyze images. Their power is essentially based on the ability to extract features that images have in common. There exist, however, images involving unique, irregular features or details. One such case is a collection of unusual children's drawings reflecting the kids' imagination and individuality. These […]
Sep, 26

An Experimental Study of SYCL Task Graph Parallelism for Large-Scale Machine Learning Workloads

Task graph parallelism has emerged as an important tool for efficiently executing large machine learning workloads on GPUs. Users describe a GPU workload in a task dependency graph rather than aggregated GPU operations and dependencies, allowing the runtime to apply whole-graph scheduling optimizations that significantly improve performance. While the new CUDA graph execution model […]
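
As a rough CPU-only illustration of describing a workload as a task dependency graph, the sketch below groups tasks into waves that a runtime could launch concurrently. It uses Python's standard `graphlib` module and does not involve SYCL or CUDA graphs; the task names and the `schedule_waves` helper are assumptions made for this example.

```python
from graphlib import TopologicalSorter


def schedule_waves(deps):
    """Group tasks into waves; tasks in one wave have no pending dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Everything returned here could be dispatched in parallel.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical workload: each task maps to the set of tasks it depends on.
deps = {
    "load": set(),
    "conv": {"load"},
    "norm": {"load"},
    "pool": {"conv"},
    "merge": {"pool", "norm"},
}

waves = schedule_waves(deps)
print(waves)
```

Because the whole graph is known before execution starts, the scheduler can see that `conv` and `norm` are independent and place them in the same wave, which is the kind of whole-graph decision the abstract refers to.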
Sep, 26

CompilerGym: Robust, Performant Compiler Optimization Environments for AI Research

Interest in applying Artificial Intelligence (AI) techniques to compiler optimizations is increasing rapidly, but compiler research has a high entry barrier. Unlike in other domains, compiler and AI researchers do not have access to the datasets and frameworks that enable fast iteration and development of ideas, and getting started requires a significant engineering investment. What […]
Sep, 19

A readahead prefetcher for GPU file system layer

GPUs are broadly used in I/O-intensive big data applications. Prior works demonstrate the benefits of using a GPU-side file system layer, GPUfs, to improve GPU performance and programmability in such workloads. However, GPUfs fails to provide high performance for a common I/O pattern where a GPU is used to process a whole data set sequentially. […]
Sep, 19

Achieving near native runtime performance and cross-platform performance portability for random number generation through SYCL interoperability

High-performance computing (HPC) is a major driver accelerating scientific research and discovery, from quantum simulations to medical therapeutics. The growing number of new HPC systems coming online is being furnished with various hardware components, engineered by competing industry entities, each having its own architectures and platforms to be supported. While the increasing availability of these […]
Sep, 19

Measurement and Analysis of GPU-accelerated Applications with HPCToolkit

To address the challenge of performance analysis on the US DOE’s forthcoming exascale supercomputers, Rice University has been extending its HPCToolkit performance tools to support measurement and analysis of GPU-accelerated applications. To help developers understand the performance of accelerated applications as a whole, HPCToolkit’s measurement and analysis tools attribute metrics to calling contexts that span […]
Sep, 19

GPU Algorithms for Efficient Exascale Discretizations

In this paper we describe the research and development activities in the Center for Efficient Exascale Discretization within the US Exascale Computing Project, targeting state-of-the-art high-order finite-element algorithms for high-order applications on GPU-accelerated platforms. We discuss the GPU developments in several components of the CEED software stack, including the libCEED, MAGMA, MFEM, libParanumal, and Nek […]
Sep, 19

A Study of Mixed Precision Strategies for GMRES on GPUs

Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for mixed precision strategies […]

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors