Posts

Dec, 4

Fast convolution kernels on Pascal GPU with high memory efficiency

The convolution computation is widely used in many fields, especially in CNNs. Because of the rapid growth of the training data in CNNs, GPUs have been used for the acceleration, and memory-efficient algorithms have drawn attention because of their high performance. In this paper, we propose two convolution kernels for single-channel convolution and multi-channel convolution respectively. […]
Dec, 4

Three Contributions to the Theory and Practice of Optimizing Compilers

The theory and practice of optimizing compilers gather techniques that, from input computer programs, aim at generating code making best use of modern computer hardware. On the theory side, this thesis contributes new results and algorithms in polyhedral geometry. On the practical side, this thesis contributes techniques for the tuning of parameters of programs targeting […]
Dec, 4

Efficient Incremental Text-to-Speech on GPUs

Incremental text-to-speech, also known as streaming TTS, has been increasingly applied to online speech applications that require ultra-low response latency to provide an optimal user experience. However, most of the existing speech synthesis pipelines deployed on GPU are still non-incremental, which uncovers limitations in high-concurrency scenarios, especially when the pipeline is built with end-to-end neural […]
Dec, 4

SkyFlow: Heterogeneous streaming for skyline computation using FlowGraph and SYCL

The skyline is an optimization operator widely used for multi-criteria decision making. It allows minimizing an n-dimensional dataset into its smallest subset. In this work we present SkyFlow, the first heterogeneous CPU+GPU graph-based engine for skyline computation on a stream of data queries. Two data flow approaches, Coarse-grained and Fine-grained, have been proposed for different […]
Dec, 4

Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism

Transformer models have achieved state-of-the-art performance in various application domains and have gradually become the foundation of advanced large deep learning (DL) models. However, how to train these models over multiple GPUs efficiently is still challenging due to a large number of parallelism choices. Existing DL systems either rely on manual efforts to make […]
Nov, 27

User-Driven Online Kernel Fusion for SYCL

Heterogeneous programming models are becoming increasingly popular to support the ever-evolving hardware architectures, especially for new and emerging specialized accelerators optimizing specific tasks. While such programs provide performance portability of the existing applications across various heterogeneous architectures to some extent, short-running device kernels can affect an application's performance due to overheads of data transfer, synchronization […]
Nov, 27

SciAI4Industry – Solving PDEs for industry-scale problems with deep learning

Solving partial differential equations with deep learning makes it possible to reduce simulation times by multiple orders of magnitude and unlock scientific methods that typically rely on large numbers of sequential simulations, such as optimization and uncertainty quantification. Two of the largest challenges of adopting scientific AI for industrial problem settings are that training datasets […]
Nov, 27

Design Space Exploration of Concurrency Mapping to FPGAs in Weather and Climate Applications with Xilinx SDSoC OpenCL, SDSoC C++ and Vivad

Recent years have seen increased interest from the HPC community in Field Programmable Gate Arrays (FPGAs) as an alternative/additional accelerator. This has been largely due to the slowdown in the transistor scaling and the difficulty of gaining performance improvement and energy efficiency from the current processing solutions. General (scientific) software programmers have shied away from […]
Nov, 27

A Hybrid Multi-GPU Implementation of Simplex Algorithm with CPU Collaboration

The simplex algorithm has been successfully used for many years in solving linear programming (LP) problems. Due to the intensive computations required (especially for the solution of large LP problems), parallel approaches have also extensively been studied. The computational power provided by the modern GPUs as well as the rapid development of multicore CPU systems […]
Nov, 27

Assessing Opportunities of SYCL and Intel oneAPI for Biological Sequence Alignment

Background and objectives. The field of computational biology has been growing over the years. The interest in researching and developing computational tools for the acquisition, storage, organization, analysis, and visualization of biological data generates the need to create new hardware architectures and new software tools that allow processing big data in acceptable times. In this sense, […]
Nov, 20

Training a Vision Transformer from scratch in less than 24 hours with 1 GPU

Transformers have become central to recent advances in computer vision. However, training a vision Transformer (ViT) model from scratch can be resource intensive and time consuming. In this paper, we aim to explore approaches to reduce the training costs of ViT models. We introduce some algorithmic improvements to enable training a ViT model from scratch […]
Nov, 20

Hardware Checkpointing and Productive Debugging Flows for FPGAs

As FPGAs become larger and more complex, productive debugging is becoming more challenging. In this work, we detail a new debugging flow based on hardware checkpointing that provides full visibility and controllability while maintaining reasonable execution speed. Hardware checkpointing is useful not only for debugging but also enables several other capabilities such as live migration, […]

* * *


HGPU group © 2010-2022 hgpu.org

All rights belong to the respective authors
