
Posts

Mar 14

hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices

Accessible machine learning algorithms, software, and diagnostic tools for energy-efficient devices and systems are extremely valuable across a broad range of application domains. In scientific domains, real-time near-sensor processing can drastically improve experimental design and accelerate scientific discoveries. To support domain scientists, we have developed hls4ml, an open-source software-hardware codesign workflow to interpret and translate […]
Mar 7

LazyTensor: combining eager execution with domain-specific compilers

Domain-specific optimizing compilers have demonstrated significant performance and portability benefits, but require programs to be represented in their specialized IRs. Existing frontends to these compilers suffer from the "language subset problem" where some host language features are unsupported in the subset of the user’s program that interacts with the domain-specific compiler. By contrast, define-by-run ML […]
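The core idea behind the LazyTensor approach can be illustrated with a minimal sketch (my own illustration, not the paper's implementation): tensor operations present an eager-looking interface but only record a trace, which is handed to an evaluator (in a real system, a domain-specific compiler such as XLA) when a concrete value is demanded.

```python
# Minimal lazy-tensor sketch: ops build an IR trace instead of executing;
# the trace is interpreted only on materialization. A real system would
# compile the recorded trace with a domain-specific compiler instead.

class LazyTensor:
    def __init__(self, op, inputs=(), value=None):
        self.op = op          # operation name, or "const" for leaves
        self.inputs = inputs  # upstream LazyTensor operands
        self.value = value    # concrete payload for leaf nodes

    @staticmethod
    def const(v):
        return LazyTensor("const", value=v)

    def __add__(self, other):
        return LazyTensor("add", (self, other))

    def __mul__(self, other):
        return LazyTensor("mul", (self, other))

    def materialize(self):
        """Walk the recorded trace and compute a concrete value."""
        if self.op == "const":
            return self.value
        a, b = (t.materialize() for t in self.inputs)
        return a + b if self.op == "add" else a * b

x = LazyTensor.const(3)
y = LazyTensor.const(4)
z = x * y + x           # no arithmetic has happened yet
print(z.materialize())  # 15
```

Because the full host language stays available until materialization, this style sidesteps the "language subset problem": only the traced tensor ops need to be expressible in the compiler's IR.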
Mar 7

Benchmarking Modern Edge Devices for AI Applications

AI (artificial intelligence) has grown at an overwhelming speed over the last decade, to the extent that it has become one of the mainstream tools driving advancements in science and technology. Meanwhile, the paradigm of edge computing has emerged as one of the foremost areas in which applications using AI technology are […]
Mar 7

Integrating Accelerators in Heterogeneous Systems

This work studies programmability enhancing abstractions in the context of accelerators and heterogeneous systems. Specifically, the focus is on adapting abstractions that have been successfully established to improve the programmability of CPUs. Specialized accelerators including GPUs, TPUs, and FPGAs promise to deliver orders of magnitude improvements in performance and energy efficiency. However, to exploit these […]
Mar 7

Scalable communication for high-order stencil computations using CUDA-aware MPI

Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, as arithmetic performance has been observed to increase at a faster rate than memory and network bandwidths, optimizing data movement has become critical for achieving strong scaling in many communication-heavy applications. This performance gap has been further accentuated […]
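A back-of-envelope sketch makes the data-movement pressure concrete (the numbers and model are my own illustration, not from the paper): for a cubic subdomain of n³ points and a stencil of radius r, each of the six halo faces holds r·n² points, so the communicated-to-computed ratio grows linearly with the stencil radius and shrinks only as 1/n.

```python
# Halo-exchange cost model for a high-order stencil on an n^3 subdomain.
# Higher-order stencils (larger radius r) widen the halo, inflating the
# volume of data that must cross the network every time step.

def halo_points(n, r):
    """Points exchanged across the six faces of an n^3 subdomain."""
    return 6 * r * n * n

def comm_to_compute_ratio(n, r):
    """Communicated points per interior point updated."""
    return halo_points(n, r) / n**3

# A radius-3 (6th-order) stencil on a 256^3 subdomain:
print(halo_points(256, 3))                       # 1179648
print(round(comm_to_compute_ratio(256, 3), 4))   # 0.0703
```

This is why techniques such as CUDA-aware MPI and communication/computation overlap, as studied in the paper, matter more as the stencil order rises.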
Mar 7

Neural Network Libraries: A Deep Learning Framework Designed from Engineers’ Perspectives

While there exists a plethora of deep learning tools and frameworks, the fast-growing complexity of the field brings new demands and challenges, such as more flexible network design, speedy computation in distributed settings, and compatibility between different tools. In this paper, we introduce Neural Network Libraries, a deep learning framework designed from the engineers' perspective, with […]
Feb 28

Accelerating AutoDock4 with GPUs and Gradient-Based Local Search

AutoDock4 is a widely used program for docking small molecules to macromolecular targets. It describes ligand–receptor interactions using a physics-inspired scoring function that has proven useful in a variety of drug discovery projects. However, compared to more recent software, AutoDock4 has longer execution times, limiting its applicability to large-scale dockings. To […]
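The appeal of gradient-based local search can be shown with a toy sketch (purely illustrative: the scoring function below is a hypothetical quadratic, not AutoDock's physics-based energy model, and the optimizer is plain gradient descent rather than the method used in the paper): near a minimum, following the gradient converges far faster than blind perturbation.

```python
# Gradient-descent local search on a toy "scoring function" with a
# known minimum at (1.0, -2.0). Real docking scores are rugged and
# high-dimensional; this only illustrates the mechanics.

def score(x, y):
    return (x - 1.0) ** 2 + 2.0 * (y + 2.0) ** 2

def grad(x, y):
    return 2.0 * (x - 1.0), 4.0 * (y + 2.0)

def local_search(x, y, lr=0.1, steps=200):
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

x, y = local_search(5.0, 5.0)
print(round(x, 3), round(y, 3))  # 1.0 -2.0
```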
Feb 28

Acceleration of Intrusion Detection in Encrypted Network Traffic Using Heterogeneous Hardware

More than 75% of Internet traffic is now encrypted, and this percentage is constantly increasing. The majority of communications are secured using common encryption protocols such as SSL/TLS and IPsec to ensure security and protect the privacy of Internet users. However, encryption can also be exploited to hide malicious activities camouflaged as normal network traffic. Traditionally, […]
Feb 28

Multi-GPU performance optimization of a computational fluid dynamics code using OpenACC

This article investigates the multi-GPU performance of a 3D buoyancy-driven cavity solver using MPI and OpenACC directives on multiple platforms. The article shows that decomposing the total problem along different dimensions significantly affects strong-scaling performance on the GPU. Without proper performance optimizations, it is shown that 1D domain decomposition scales poorly on multiple […]
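Why the decomposition dimension matters can be sketched with simple halo arithmetic (my own model, not the article's measured data): 1D slabs keep two full faces of the global grid per rank no matter how many ranks are added, while 3D blocks shrink every face as the rank count grows.

```python
# Per-rank halo surface for 1D (slab) vs 3D (block) decomposition of an
# N^3 grid over P ranks. Slabs exchange two full N*N faces regardless of
# P; cubic blocks of side N/p exchange six (N/p)^2 faces.

def slab_halo(N, P):
    """1D decomposition: each interior rank exchanges two N*N faces."""
    return 2 * N * N

def block_halo(N, P):
    """3D decomposition into P = p^3 cubes of side N/p."""
    p = round(P ** (1 / 3))
    return 6 * (N // p) ** 2

N, P = 512, 64
print(slab_halo(N, P))   # 524288 points per rank, independent of P
print(block_halo(N, P))  # 98304 points per rank, shrinking with P
```

With 64 ranks the slab halo is already over five times larger, and the gap widens as P grows, which is consistent with 1D decomposition scaling poorly.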
Feb 28

BASEMENT v3: a modular freeware for river process modelling over multiple computational backends

Modelling river physical processes is of critical importance for flood protection, river management and the restoration of riverine environments. Developments in algorithms and computational power have led to wider adoption of river simulation tools. However, the use of two-dimensional models can still be hindered by a complex setup process and high computational costs. Here […]
Feb 28

GPU-aware Communication with UCX in Parallel Programming Models: Charm++, MPI, and Python

As an increasing number of leadership-class systems embrace GPU accelerators in the race towards exascale, efficient communication of GPU data is becoming one of the most critical components of high-performance computing. For developers of parallel programming models, implementing support for GPU-aware communication using native APIs for GPUs such as CUDA can be a daunting task […]
Feb 23

Using hardware performance counters to speed up autotuning convergence on GPUs

Nowadays, GPU accelerators are commonly used to speed up general-purpose computing tasks on a variety of hardware. However, due to the diversity of GPU architectures and processed data, optimizing code for a particular type of hardware and specific data characteristics can be extremely challenging. The autotuning of performance-relevant source-code parameters allows for automatic optimization […]
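The shape of such an autotuner can be sketched in a few lines (an illustrative toy, not the paper's tuner: the "runtime" below is a synthetic cost model, whereas a real tuner measures actual kernel executions and, per this paper, uses hardware performance counters to steer and shorten the search).

```python
# Exhaustive search over a small tuning space with a synthetic cost
# model. Real spaces are far larger, which is why pruning the search
# with runtime measurements or counter data pays off.
import itertools

def cost(block_size, unroll):
    # Hypothetical cost: penalize small blocks (poor occupancy) and
    # heavy unrolling (register pressure); block_size is capped at 1024.
    return 1e6 / block_size + 50.0 * unroll + (0 if block_size <= 1024 else 1e9)

space = {"block_size": [32, 64, 128, 256, 512, 1024],
         "unroll": [1, 2, 4, 8]}

best = min(itertools.product(*space.values()),
           key=lambda cfg: cost(*cfg))
print(dict(zip(space, best)))  # {'block_size': 1024, 'unroll': 1}
```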

* * *


HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
