
Posts

Jun, 20

StreamBrain: An HPC Framework for Brain-like Neural Networks on CPUs, GPUs and FPGAs

The modern deep learning method based on backpropagation has surged in popularity and has been used in multiple domains and application areas. At the same time, there are other — less-known — machine learning algorithms with a mature and solid theoretical foundation whose performance remains unexplored. One such example is the brain-like Bayesian Confidence Propagation […]
Jun, 20

Experience Report: Writing A Portable GPU Runtime with OpenMP 5.1

GPU runtimes are historically implemented in CUDA or other vendor-specific languages dedicated to GPU programming. In this work we show that OpenMP 5.1, with minor compiler extensions, is capable of replacing existing solutions without a performance penalty. The result is a performant and portable GPU runtime that can be compiled with LLVM/Clang to Nvidia […]
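
The abstract is cut off here, but the style of OpenMP device offloading it builds on can be shown with a minimal, hedged sketch: this is plain OpenMP 5.x target offload of a SAXPY loop in C++, not the paper's runtime, and the clang++ build line is only an assumed example.

    // Minimal OpenMP target-offload sketch (illustrative only; not the paper's GPU runtime).
    // Assumed build line: clang++ -O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda saxpy.cpp
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);
        const float a = 3.0f;
        float *xp = x.data(), *yp = y.data();

        // Map both arrays to the device, run the loop there, and copy y back.
        #pragma omp target teams distribute parallel for \
                map(to: xp[0:n]) map(tofrom: yp[0:n])
        for (int i = 0; i < n; ++i)
            yp[i] = a * xp[i] + yp[i];

        std::printf("y[0] = %f\n", y[0]);   // expected 5.0
        return 0;
    }

The same pragma-annotated loop also compiles as ordinary host code when offloading is disabled, which is the portability property the post's title refers to.
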
Jun, 6

DLIO: A Data-Centric Benchmark for Scientific Deep Learning Applications

Deep learning has been shown to be a successful method for various tasks, and its popularity has resulted in numerous open-source deep learning software tools. Deep learning has been applied to a broad spectrum of scientific domains such as cosmology, particle physics, computer vision, fusion, and astrophysics. Scientists have performed a great deal of work to optimize […]
Jun, 6

Data-Driven Analysis and Design of Vulkan Ray-Tracing Applications using Automatic Instrumentation

Modern graphics Application Programming Interfaces (APIs) provide first-class support for ray tracing. Hardware vendors implement drivers for the graphics API including a black-box compiler. The black-box compiler creates architecture-specific binaries that leverage ray-tracing hardware acceleration. Ray-tracing support in modern APIs allows all geometry and shaders to be specified for a single execution. Thus, ray tracing […]
Jun, 6

Optimization of Heterogeneous Systems with AI Planning Heuristics and Machine Learning: A Performance and Energy Aware Approach

Heterogeneous computing systems provide high performance and energy efficiency. However, to optimally utilize such systems, solutions that distribute the work across host CPUs and accelerating devices are needed. In this paper, we present a performance and energy aware approach that combines AI planning heuristics for parameter space exploration with a machine learning model for performance […]
Jun, 6

Exploiting co-execution with oneAPI: heterogeneity from a modern perspective

Efficiently programming heterogeneous systems is a major challenge due to the complexity of their architectures. Intel oneAPI, a new and powerful standards-based unified programming model built on top of SYCL, addresses these issues. In this paper, oneAPI is provided with co-execution strategies to run the same kernel across different devices, enabling the exploitation of static […]
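
The snippet is truncated, but the basic co-execution idea (the same kernel split across a CPU and a GPU) can be sketched with plain SYCL 2020. The static 50/50 split, in-order queues, and device selectors below are assumptions for illustration, not the paper's scheduling strategy, and the sketch presumes both a CPU and a GPU backend are available.

    // Minimal SYCL co-execution sketch (illustrative; not the paper's load balancer):
    // the same vector-add kernel runs on a CPU queue and a GPU queue, each on half the data.
    #include <sycl/sycl.hpp>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t n = 1 << 20, half = n / 2;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

        // In-order queues keep each copy/compute/copy chain ordered without explicit events.
        sycl::queue cpu_q{sycl::cpu_selector_v, sycl::property::queue::in_order{}};
        sycl::queue gpu_q{sycl::gpu_selector_v, sycl::property::queue::in_order{}};

        auto launch = [&](sycl::queue &q, size_t off, size_t cnt) {
            float *da = sycl::malloc_device<float>(cnt, q);
            float *db = sycl::malloc_device<float>(cnt, q);
            float *dc = sycl::malloc_device<float>(cnt, q);
            q.memcpy(da, a.data() + off, cnt * sizeof(float));
            q.memcpy(db, b.data() + off, cnt * sizeof(float));
            q.parallel_for(sycl::range<1>{cnt},
                           [=](sycl::id<1> i) { dc[i] = da[i] + db[i]; });
            q.memcpy(c.data() + off, dc, cnt * sizeof(float));
            // (device allocations are intentionally leaked for brevity in this sketch)
        };

        launch(cpu_q, 0, half);     // first half on the CPU device
        launch(gpu_q, half, half);  // second half on the GPU device
        cpu_q.wait();
        gpu_q.wait();

        std::printf("c[0] = %f, c[n-1] = %f\n", c[0], c[n - 1]);  // both expected 3.0
        return 0;
    }

A real co-execution scheduler would choose the split dynamically per device throughput; the fixed halves here only show the mechanics of launching one kernel on two queues.
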
Jun, 6

Early Experiences Migrating CUDA codes to oneAPI

The heterogeneous computing paradigm represents a real programming challenge due to the proliferation of devices with different hardware characteristics. Recently, Intel introduced oneAPI, a new programming environment that allows code developed in DPC++ to be run on different devices such as CPUs, GPUs, and FPGAs, among others. This paper presents our first experiences in porting two […]
May, 30

Using Workload Characterization to Guide High Performance Graph Processing

Graph analytics represent an important application domain widely used in many fields such as web graphs, social networks, and Bayesian networks. The sheer size of the graph data sets combined with the irregular nature of the underlying problem pose a significant challenge for performance, scalability, and power efficiency of graph processing. With the exponential growth […]
May, 30

kEDM: A Performance-portable Implementation of Empirical Dynamic Modeling using Kokkos

Empirical Dynamic Modeling (EDM) is a state-of-the-art non-linear time-series analysis framework. Despite its wide applicability, EDM was not scalable to large datasets due to its expensive computational cost. To overcome this obstacle, researchers have attempted and succeeded in accelerating EDM from both algorithmic and implementational aspects. In previous work, we developed a massively parallel implementation […]
May, 30

Sequence Parallelism: Making 4D Parallelism Possible

Within the Transformer, self-attention is the key module for learning powerful context-aware representations. However, self-attention suffers from quadratic memory requirements with respect to the sequence length, which limits the sequence lengths we can process on a GPU. In this work, we propose sequence parallelism, a memory-efficient parallelism method that helps break the input sequence length limitation and […]
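
To make the quadratic-memory observation concrete, here is a small back-of-the-envelope sketch in C++; the head count, fp16 storage, and batch size of 1 are illustrative assumptions, not the paper's configuration. It only sizes the attention score matrix, which alone grows quadratically with sequence length.

    // Back-of-the-envelope sketch of why self-attention memory grows quadratically:
    // the attention score matrix alone holds (heads x seq_len x seq_len) values per layer.
    #include <cstdio>

    int main() {
        const double heads = 16, bytes_per_value = 2;   // fp16, batch size 1; illustrative only
        for (long seq_len : {2048L, 8192L, 32768L}) {
            double gib = heads * double(seq_len) * double(seq_len) * bytes_per_value
                         / (1024.0 * 1024.0 * 1024.0);
            // 4x the sequence length -> 16x the score-matrix memory, per layer.
            std::printf("seq_len = %6ld  ->  attention scores ~ %8.1f GiB per layer\n",
                        seq_len, gib);
        }
        return 0;
    }

Splitting the sequence dimension across devices, as the post's title suggests, divides exactly this kind of activation memory among GPUs rather than replicating it.
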
May, 30

TENSILE: A Tensor granularity dynamic GPU memory scheduler method towards multiple dynamic workloads system

Recently, deep learning has been an area of intense research. However, as a compute-intensive task, deep learning relies heavily on the scale of GPU memory, which is usually expensive and scarce. Although extensive works have been proposed for dynamic GPU memory management, they are hard to be […]
May, 30

cuSZ(x): Optimizing Error-Bounded Lossy Compression for Scientific Data on GPUs

Error-bounded lossy compression is a critical technique for significantly reducing scientific data volumes. With ever-emerging heterogeneous HPC architectures, GPU-accelerated error-bounded compressors (such as cuSZ and cuZFP) have been developed. However, they suffer from either low performance or low compression ratios. To this end, we propose cuSZ(x) to target both high compression ratio and throughput. We […]
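
As a rough illustration of what "error-bounded" means here, the sketch below shows SZ-style linear-scaling quantization against an absolute error bound. It is a hedged toy example: it omits the prediction, encoding, and GPU-specific stages that cuSZ(x) itself is about, and only demonstrates the error guarantee such compressors provide.

    // Sketch of the "error-bounded" idea behind SZ-style compressors (not cuSZ(x)'s
    // actual pipeline): values are mapped to integer bins of width 2*eb, so the
    // reconstruction error never exceeds the user-specified bound eb.
    #include <cmath>
    #include <cstdio>

    int main() {
        const double eb = 1e-3;                          // absolute error bound
        const double data[] = {0.12345, -3.98765, 7.00042};

        for (double x : data) {
            long   bin  = std::lround(x / (2.0 * eb));   // quantize (this integer gets encoded)
            double xrec = bin * (2.0 * eb);              // decompress
            std::printf("x=%9.5f  rec=%9.5f  |err|=%.2e (bound %.2e)\n",
                        x, xrec, std::fabs(x - xrec), eb);
        }
        return 0;
    }

Because rounding to the nearest bin center can move a value by at most half a bin width, the absolute error is at most eb by construction.
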

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
