18762

Posts

Feb, 24

An Empirically Guided Optimization Framework for FPGA OpenCL

FPGAs have been demonstrated to be capable of very high performance, especially power-performance, but generally at the cost of hand-tuned HDL code by FPGA experts. OpenCL is the leading industry effort in improving performance-programmability. But while it is recognized that optimizing OpenCL code using published best practices is critical to achieving good performance, even optimized […]
Feb, 24

A Package for Multi-Dimensional Monte Carlo Integration on Multi-GPUs

We have developed a Python package ZMCintegral for multi-dimensional Monte Carlo integration on multiple Graphics Processing Units(GPUs). The package employs a stratified sampling and heuristic tree search algorithm. We have built two versions of this package: one with Tensorflow and another with Numba, both support general user defined functions with a user-friendly interface. We have […]
Feb, 24

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

It is important to scale out deep neural network (DNN) training for reducing model training time. The high communication overhead is one of the major performance bottlenecks for distributed DNN training across multiple GPUs. Our investigations have shown that popular open-source DNN systems could only achieve 2.5 speedup ratio on 64 GPUs connected by 56 […]
Feb, 24

Worst-Case Execution Time Guarantees for Runtime-Reconfigurable Architectures

Real-time systems are ubiquitous in our everyday life, e.g., in safety-critical domains such as automotive, avionics or robotics. The correctness of a real-time system does not only depend on the correctness of its calculations, but also on the non-functional requirement of adhering to deadlines. Failing to meet a deadline may lead to severe malfunctions, therefore […]
Feb, 24

DNNVM: End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-based CNN Accelerators

The convolutional neural network (CNN) has become a state-of-the-art method for several artificial intelligence domains in recent years. The increasingly complex CNN models are both computation-bound and I/O-bound. FPGA-based accelerators driven by custom instruction set architecture (ISA) achieve a balance between generality and efficiency, but there is much on them left to be optimized. We […]
Feb, 17

TensorFlow.js: Machine Learning for the Web and Beyond

TensorFlow.js is a library for building and executing machine learning algorithms in JavaScript. TensorFlow.js models run in a web browser and in the Node.js environment. The library is part of the TensorFlow ecosystem, providing a set of APIs that are compatible with those in Python, allowing models to be ported between the Python and JavaScript […]
Feb, 17

Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications

GPU computing is becoming increasingly more popular with the proliferation of deep learning (DL) applications. However, unlike traditional resources such as CPU or the network, modern GPUs do not natively support fine-grained sharing primitives. Consequently, implementing common policies such as time sharing and preemption are expensive. Worse, when a DL application cannot completely use a […]
Feb, 17

Software-Defined FPGA Accelerator Design for Mobile Deep Learning Applications

Recently, the field of deep learning has received great attention by the scientific community and it is used to provide improved solutions to many computer vision problems. Convolutional neural networks (CNNs) have been successfully used to attack problems such as object recognition, object detection, semantic segmentation, and scene understanding. The rapid development of deep learning […]
Feb, 17

DeeperLab: Single-Shot Image Parser

We present a single-shot, bottom-up approach for whole image parsing. Whole image parsing, also known as Panoptic Segmentation, generalizes the tasks of semantic segmentation for ‘stuff’ classes and instance segmentation for ‘thing’ classes, assigning both semantic and instance labels to every pixel in an image. Recent approaches to whole image parsing typically employ separate standalone […]
Feb, 17

GPU Accelerated Keccak (SHA3) Algorithm

Hash functions like SHA-1 or MD5 are one of the most important cryptographic primitives, especially in the field of information integrity. Considering the fact that increasing methods have been proposed to break these hash algorithms, a competition for a new family of hash functions was held by the US National Institute of Standards and Technology. […]
Feb, 10

Performance Evaluation of OpenMP’s Target Construct on GPUs: Exploring Compiler Optimizations

OpenMP is a directive-based shared memory parallel programming model and has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP’s high-level parallel abstractions with accelerator programming. This extension allows programmers to write GPU programs in standard C/C++ or Fortran languages, without exposing too many details of GPU […]
Feb, 10

AXC: A new format to perform the SpMV oriented to Intel Xeon Phi architecture in OpenCL

Emerging new architectures used in High Performance Computing require new research to adapt and optimise algorithms to them.As part of this effort, we propose the newAXC format to improve the performance of the SpMV product for the Intel Xeon Phi coprocessor. The performance of the OpenCL kernel, based on our new format, is compared with […]

* * *

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors

Contact us: