
Posts

Jun, 2

Marian: Cost-effective High-Quality Neural Machine Translation in C++

This paper describes the submissions of the "Marian" team to the WNMT 2018 shared task. We investigate combinations of teacher-student training, low-precision matrix products, auto-tuning and other methods to optimize the Transformer model on GPU and CPU. By further integrating these methods with the new averaging attention networks, a recently introduced faster Transformer variant, we […]
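The averaging idea behind average attention networks is simple to state: each decoder position attends to all previous positions with equal weight, replacing quadratic self-attention with a running mean. A minimal numpy sketch of that cumulative average (the full model adds a feed-forward layer and gating on top, omitted here):

```python
import numpy as np

def average_attention(x):
    # x: (seq_len, d_model); position t gets the mean of x[0..t].
    csum = np.cumsum(x, axis=0)
    steps = np.arange(1, x.shape[0] + 1)[:, None]
    return csum / steps
```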
Jun, 2

FPGA-based Acceleration of FT Convolution for Pulsar Search Using OpenCL

The Square Kilometre Array (SKA) project will be the world's largest radio telescope array. With its large number of antennas, the volume of signals that must be processed is enormous. One important element of the SKA's Central Signal Processor package is pulsar search. This paper focuses on the FPGA-based acceleration of the Frequency-Domain Acceleration […]
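The frequency-domain part of such a search rests on the convolution theorem: convolving a long time series with a filter template becomes a pointwise product of spectra. A small numpy sketch of FT convolution (shapes and names are illustrative, not the SKA pipeline):

```python
import numpy as np

def ft_convolve(signal, template):
    # Linear convolution via FFT: pad to a power of two, multiply spectra.
    n = len(signal) + len(template) - 1
    size = 1 << (n - 1).bit_length()
    spec = np.fft.rfft(signal, size) * np.fft.rfft(template, size)
    return np.fft.irfft(spec, size)[:n]
```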
May, 26

OpenCL 2.2 API Specification

Modern processor architectures have embraced parallelism as an important pathway to increased performance. Facing technical challenges with higher clock speeds in a fixed power envelope, Central Processing Units (CPUs) now improve performance by adding multiple cores. Graphics Processing Units (GPUs) have also evolved from fixed-function rendering devices into programmable parallel processors. As today's computer […]
May, 26

Transformations of High-Level Synthesis Codes for High-Performance Computing

Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large-scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to […]
May, 26

Learning to Optimize Tensor Programs

We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high-dimensional convolution, are key enablers of effective deep learning systems. However, existing systems rely on manually optimized libraries such as cuDNN, where only a narrow range of server-class GPUs are […]
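The core loop of such a framework alternates between a learned cost model that ranks candidate implementations and real measurements on the target device. A toy sketch of that loop; `cost_model` and `measure` are hypothetical placeholders, not the paper's API:

```python
import random

def autotune(candidates, cost_model, measure, trials=32):
    history = []
    for _ in range(trials):
        # Rank candidates by predicted cost, but keep exploring occasionally.
        ranked = sorted(candidates, key=cost_model.predict)
        cfg = ranked[0] if random.random() < 0.9 else random.choice(candidates)
        history.append((cfg, measure(cfg)))  # real run on the target device
        cost_model.fit(history)              # refit on the measurements so far
    return min(history, key=lambda h: h[1])[0]
```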
May, 26

One machine, one minute, three billion tetrahedra

This paper presents a new scalable parallelization scheme to generate the 3D Delaunay triangulation of a given set of points. Our first contribution is an efficient serial implementation of the incremental Delaunay insertion algorithm. A simple dedicated data structure and a number of improvements in the insertion algorithm have made it possible to accelerate by a factor […]
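The serial building block being parallelized, incremental Delaunay insertion, is easy to try out: scipy exposes Qhull's incremental mode directly (point counts here are just illustrative):

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
tri = Delaunay(rng.random((1000, 3)), incremental=True)
for batch in np.array_split(rng.random((9000, 3)), 9):
    tri.add_points(batch)  # insert new points into the existing triangulation
tri.close()
print(len(tri.simplices), "tetrahedra")
```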
May, 26

Highly Efficient 8-bit Low Precision Inference of Convolutional Neural Networks with IntelCaffe

High-throughput and low-latency inference of deep neural networks is critical for the deployment of deep learning applications. This paper presents the efficient inference techniques of IntelCaffe, the first Intel-optimized deep learning framework to support efficient 8-bit low-precision inference and model optimization techniques for convolutional neural networks on Intel Xeon Scalable Processors. […]
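The arithmetic behind 8-bit inference is straightforward: scale fp32 tensors into int8, perform the matrix products with integer accumulation, then rescale once at the end. A generic symmetric-quantization sketch in numpy (not IntelCaffe's exact calibration scheme):

```python
import numpy as np

def quantize(x):
    # Symmetric linear quantization to signed 8 bits.
    scale = np.abs(x).max() / 127
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8), scale

w, w_scale = quantize(np.random.randn(64, 64))
a, a_scale = quantize(np.random.randn(64, 64))
# Accumulate in int32, then dequantize the result in one step.
y = (w.astype(np.int32) @ a.astype(np.int32)) * (w_scale * a_scale)
```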
May, 20

Glow: Graph Lowering Compiler Techniques for Neural Networks

This paper presents the design of Glow, a machine learning compiler for heterogeneous hardware. It is a pragmatic approach to compilation that enables the generation of highly optimized code for multiple targets. Glow lowers the traditional neural network dataflow graph into a two-phase strongly-typed intermediate representation. The high-level intermediate representation allows the optimizer to perform […]
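The lowering step is the part that can be mimicked in a few lines: a high-level node is expanded into simpler primitives that the optimizer and backends understand. A toy illustration in that spirit (node names are made up for the example; Glow's real IR differs):

```python
def lower(node):
    # Expand one high-level node into simpler primitive nodes.
    op, out, *ins = node
    if op == "FullyConnected":          # out = x @ W + b
        x, w, b = ins
        return [("MatMul", "tmp", x, w), ("BroadcastAdd", out, "tmp", b)]
    if op == "ReLU":                    # out = max(x, 0)
        (x,) = ins
        return [("Splat", "zero", 0.0), ("Max", out, x, "zero")]
    return [node]                       # already a primitive

graph = [("FullyConnected", "h", "x", "W", "b"), ("ReLU", "y", "h")]
lowered = [prim for node in graph for prim in lower(node)]
```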
May, 20

OpenCL Vector Swizzling Optimization under Global Value Numbering

Under the Heterogeneous System Architecture (HSA), devices work together to improve performance. The Open Computing Language (OpenCL) provides the functionality to perform parallel computing across different devices. An OpenCL vector bundles several elements of the same data type and operates on them simultaneously. However, optimizing vector swizzling becomes difficult when swizzling operations span several OpenCL vectors. In this paper, new […]
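Value numbering makes the redundancy explicit: two swizzle expressions over the same source with the same component mask receive the same number, so one can be reused. A toy model of that idea (not the paper's LLVM implementation):

```python
table = {}

def number(expr):
    # Hash-cons (op, operand-number, ...) tuples; equal expressions share a number.
    key = expr if isinstance(expr, str) else (expr[0],) + tuple(number(a) for a in expr[1:])
    return table.setdefault(key, len(table))

a = number(("swizzle", "v", "yx"))
b = number(("swizzle", "v", "yx"))  # redundant: same value number as `a`
c = number(("swizzle", "v", "xy"))  # different mask: fresh value number
assert a == b and a != c
```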
May, 20

Generic System Calls for GPUs

GPUs are becoming first-class compute citizens and increasingly support programmability-enhancing features such as shared virtual memory and hardware cache coherence. This enables them to run a wider variety of programs. However, a key aspect of general-purpose programming where GPUs still have room for improvement is the ability to invoke system calls. We explore how to […]
May, 20

Neural Multi-scale Image Compression

This study presents a new lossy image compression method that utilizes the multi-scale features of natural images. Our model consists of two networks: a multi-scale lossy autoencoder and a parallel multi-scale lossless coder. The multi-scale lossy autoencoder extracts multi-scale image features into quantized variables, and the parallel multi-scale lossless coder enables rapid and accurate lossless coding […]
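As a rough sketch of the "multi-scale features into quantized variables" idea, picture a coarse-to-fine pyramid whose levels are each quantized; in the actual model the transforms are learned, and plain 2x average pooling merely stands in here:

```python
import numpy as np

def multiscale_quantized(img, levels=3):
    feats, x = [], img.astype(np.float32)
    for _ in range(levels):
        feats.append(np.round(x * 8) / 8)  # quantize this scale's features
        h, w = x.shape[0] // 2, x.shape[1] // 2
        x = x[:h * 2, :w * 2].reshape(h, 2, w, 2).mean(axis=(1, 3))  # 2x downsample
    return feats
```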
May, 20

Deep Neural Machine Translation with Weakly-Recurrent Units

Recurrent neural networks (RNNs) have for years represented the state of the art in neural machine translation. Recently, new architectures have been proposed that can leverage parallel computation on GPUs better than classical RNNs. Faster training and inference, combined with different sequence-to-sequence modeling, also lead to performance improvements. While the new models completely depart from […]
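One way such architectures beat classical RNNs on GPUs is by computing all matrix products in parallel across time and leaving only cheap elementwise operations in the recurrence. A sketch in that spirit (an SRU-style unit; parameter names are illustrative, not the paper's exact formulation):

```python
import numpy as np

def weakly_recurrent(x, w, wf, bf):
    # All matmuls are computed for every timestep at once (GPU-friendly)...
    xt = x @ w
    f = 1 / (1 + np.exp(-(x @ wf + bf)))  # forget gate, also time-parallel
    # ...leaving only an elementwise scan as the sequential part.
    c, out = np.zeros(xt.shape[1]), []
    for t in range(len(x)):
        c = f[t] * c + (1 - f[t]) * xt[t]
        out.append(c)
    return np.stack(out)
```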
