
Posts

Feb, 21

cuFINUFFT: a load-balanced GPU library for general-purpose nonuniform FFTs

Nonuniform fast Fourier transforms dominate the computational cost in many applications, including image reconstruction and signal processing. We thus present a general-purpose GPU-based CUDA library for type 1 (nonuniform to uniform) and type 2 (uniform to nonuniform) transforms in dimensions 2 and 3, in single or double precision. It achieves high performance for a given […]
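For reference, the type 1 and type 2 transforms the abstract refers to have the standard definitions below (written in 1D for clarity; the library handles dimensions 2 and 3, and its exact sign and normalization conventions may differ), with M nonuniform points x_j, strengths c_j, and N Fourier modes:

```latex
% Type 1 (nonuniform to uniform): sum point sources into Fourier modes
f_k = \sum_{j=1}^{M} c_j \, e^{\pm i k x_j}, \qquad k = -\tfrac{N}{2}, \dots, \tfrac{N}{2}-1
% Type 2 (uniform to nonuniform): evaluate a Fourier series at the points x_j
c_j = \sum_{k=-N/2}^{N/2-1} f_k \, e^{\pm i k x_j}, \qquad j = 1, \dots, M
```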
Feb, 21

A Survey of Machine Learning for Computer Architecture and Systems

Computer architecture and systems have long been optimized to enable efficient execution of machine learning (ML) algorithms or models. Now it is time to reconsider the relationship between ML and systems and to let ML transform the way computer architecture and systems are designed. This embraces a twofold meaning: the […]
Feb, 14

Serverless Computing Strategies on Cloud Platforms

With the development of Cloud Computing, the delivery of virtualized resources over the Internet has grown considerably in recent years. Functions as a Service (FaaS), one of the newest service models within Cloud Computing, allows the development and implementation of event-based applications that cover managed services in public and on-premises Clouds. Public Cloud Computing providers […]
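To make the FaaS model concrete, here is a minimal, hypothetical event handler in the handler(event, context) style popularized by AWS Lambda; the work surveys serverless strategies across platforms rather than prescribing this exact interface:

```python
import json

def handler(event, context):
    """Invoked by the platform on each event (HTTP request, queue
    message, file upload, ...); scaling and lifecycle are managed
    by the provider, not by the application."""
    name = event.get("name", "world")   # hypothetical event field
    return {"statusCode": 200,
            "body": json.dumps({"greeting": f"Hello, {name}"})}
```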
Feb, 14

Optimization of Data Assignment for Parallel Processing in a Hybrid Heterogeneous Environment Using Integer Linear Programming

In this paper we investigate a practical approach to applying integer linear programming to optimize the assignment of data to compute units in a multi-level heterogeneous environment with various compute devices, including CPUs, GPUs, and Intel Xeon Phis. The model considers an application that processes a large number of data chunks in parallel on various […]
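As a rough illustration of the kind of model involved (not the paper's actual formulation, which also covers communication costs and a multi-level device hierarchy), a minimal makespan-minimizing assignment ILP can be sketched with PuLP; the device throughputs below are invented:

```python
import pulp

chunks = range(100)                               # data chunks to process
devices = {"cpu": 1.0, "gpu": 8.0, "phi": 4.0}    # assumed relative throughput

prob = pulp.LpProblem("data_assignment", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (chunks, devices), cat="Binary")
makespan = pulp.LpVariable("makespan", lowBound=0)

for c in chunks:                                  # each chunk on exactly one device
    prob += pulp.lpSum(x[c][d] for d in devices) == 1
for d, speed in devices.items():                  # no device finishes after makespan
    prob += pulp.lpSum(x[c][d] for c in chunks) <= makespan * speed

prob += makespan                                  # objective: minimize the makespan
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({d: sum(int(x[c][d].value()) for c in chunks) for d in devices})
```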
Feb, 14

Searching CUDA code autotuning spaces with hardware performance counters: data from benchmarks running on various GPU architectures

We have developed several autotuning benchmarks in CUDA that take performance-relevant source-code parameters into account and reach near-peak performance on various GPU architectures. We used them during the development and evaluation of a novel tuning-space search method proposed in [1]. With our framework, the Kernel Tuning Toolkit, freely available on GitHub, we measured […]
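For readers unfamiliar with autotuning spaces, the sketch below shows the general shape of the problem: a cross product of source-code parameters pruned by hardware limits. The parameter names and the constraint are illustrative only and are not taken from the benchmarks or from Kernel Tuning Toolkit:

```python
from itertools import product

space = {
    "BLOCK_SIZE": [64, 128, 256, 512],   # threads per block
    "TILE":       [1, 2, 4, 8],          # work items per thread
    "UNROLL":     [1, 2, 4],             # inner-loop unroll factor
}

def feasible(cfg, max_work_per_block=1024):
    # prune configurations exceeding an assumed per-block resource limit
    return cfg["BLOCK_SIZE"] * cfg["TILE"] <= max_work_per_block

configs = [dict(zip(space, vals)) for vals in product(*space.values())]
configs = [c for c in configs if feasible(c)]
print(len(configs), "candidate configurations to search")
```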
Feb, 14

Sub-seasonal forecasting with a large ensemble of deep-learning weather prediction models

We present an ensemble prediction system using a Deep Learning Weather Prediction (DLWP) model that recursively predicts key atmospheric variables at six-hour time resolution. This model uses convolutional neural networks (CNNs) on a cubed-sphere grid to produce global forecasts. The approach is computationally efficient, requiring just three minutes on a single GPU to produce […]
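A minimal sketch of the recursive prediction loop described above, with `model` and the atmospheric-state tensor left abstract (both are assumptions, not the paper's code):

```python
def roll_out(model, state, n_steps):
    """Autoregressive forecast: each six-hour prediction becomes
    the input for the next step."""
    trajectory = []
    for _ in range(n_steps):
        state = model(state)        # advance the atmosphere by 6 hours
        trajectory.append(state)
    return trajectory

# a 4-week sub-seasonal forecast needs 4 * 7 * 4 = 112 six-hour steps
```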
Feb, 14

Transparent FPGA Acceleration with TensorFlow

Today, artificial neural networks are one of the major innovations driving the progress of machine learning. This has particularly affected the development of neural-network-accelerating hardware. However, since most of these architectures require specialized toolchains, developers face a certain amount of additional effort each time they want to make use of a […]
Feb, 7

Dependable Embedded Systems

This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems that have emerged particularly within the last five years. It presents the most prominent reliability concerns from today’s point of view and roughly recapitulates the progress in the community so far. Unlike other books that focus […]
Feb, 7

triSYCL for Xilinx FPGA

Khronos SYCL is a C++-based open-source specification that aims to increase the programmability of heterogeneous architectures. Several SYCL implementations exist, varying both in their conformance to the specification and in the range of hardware they target. Intel recently contributed the first open-source, feature-complete SYCL implementation to the LLVM compiler project. […]
Feb, 7

Computational Performance Predictions for Deep Neural Network Training: A Runtime-Based Approach

Deep learning researchers and practitioners usually leverage GPUs to help train their deep neural networks (DNNs) faster. However, choosing which GPU to use is challenging both because (i) there are many options, and (ii) users grapple with competing concerns: maximizing compute performance while minimizing costs. In this work, we present a new practical technique to […]
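As a crude illustration of the runtime-based idea (the paper's actual technique is considerably more careful, e.g. modeling individual operations), one could scale a measured iteration time by the relative peak throughput of two GPUs; the figures below are invented:

```python
def predict_iteration_ms(measured_ms, src_tflops, dst_tflops):
    """Naive compute-bound scaling from a profiled GPU to a target GPU."""
    return measured_ms * (src_tflops / dst_tflops)

# measured 180 ms/iteration on a 14 TFLOPS GPU; predict a 30 TFLOPS GPU
print(predict_iteration_ms(180.0, 14.0, 30.0))   # => 84.0 (rough estimate)
```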
Feb, 7

AutoFreeze: Automatically Freezing Model Blocks to Accelerate Fine-tuning

With the rapid adoption of machine learning (ML), a number of domains now rely on fine-tuning models pre-trained on a large corpus of data. However, our experiments show that even fine-tuning models like BERT can take many hours when using GPUs. While prior work proposes limiting the number of layers that are […]
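The mechanism behind freezing is simple to sketch in PyTorch: stop computing gradients for converged blocks. AutoFreeze's contribution is deciding which blocks to freeze and when, which the hypothetical helper below does not attempt:

```python
import torch.nn as nn

def freeze_prefix(blocks: nn.ModuleList, n_frozen: int) -> None:
    """Disable gradients for the first n_frozen encoder blocks; their
    backward passes are then skipped entirely."""
    for block in list(blocks)[:n_frozen]:
        for param in block.parameters():
            param.requires_grad = False

# e.g. with a Hugging Face BERT model (attribute names assumed):
#   freeze_prefix(model.bert.encoder.layer, n_frozen=6)
```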
Feb, 7

Why is FPGA-GPU Heterogeneity the Best Option for Embedded Deep Neural Networks?

Graphics Processing Units (GPUs) are currently the dominant programmable architecture for Deep Learning (DL) accelerators. The adoption of Field Programmable Gate Arrays (FPGAs) in DL accelerators is, however, gaining momentum. In this paper, we demonstrate that Direct Hardware Mapping (DHM) of a Convolutional Neural Network (CNN) on an embedded FPGA substantially outperforms a GPU implementation […]

* * *


HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
