
Posts

May 30

Massively Parallel GPU Memory Compaction

Memory fragmentation is a widely studied problem with dynamic memory allocators. It is well known that fragmentation can lead to premature out-of-memory errors and poor cache performance. With the recent emergence of dynamic memory allocators for SIMD accelerators, memory fragmentation is becoming an increasingly important problem on such architectures. Nevertheless, it has received little attention […]
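Such compaction typically builds on parallel stream compaction: a prefix sum decides each live object's new slot, and a scatter moves it there. Below is a minimal sketch using Thrust's copy_if, which performs that scan-and-scatter internally; the slot array and the liveness encoding are invented for illustration and are not the paper's allocator.

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <cstdio>
#include <vector>

struct IsLive {
    __host__ __device__ bool operator()(int slot) const {
        return slot != 0;  // assumption: 0 marks a freed (dead) slot
    }
};

int main() {
    // Fragmented "heap": live objects (non-zero) interleaved with holes.
    std::vector<int> h = {7, 0, 3, 0, 0, 9, 4, 0};
    thrust::device_vector<int> slots(h.begin(), h.end());
    thrust::device_vector<int> compacted(slots.size());

    // copy_if runs the scan-and-scatter of a compaction pass on the GPU.
    auto end = thrust::copy_if(thrust::device, slots.begin(), slots.end(),
                               compacted.begin(), IsLive());

    std::printf("live objects after compaction: %ld\n",
                (long)(end - compacted.begin()));
    return 0;
}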
May 26

Multi-GPU Rendering with Vulkan API

The Vulkan API provides a low-level interface to modern Graphics Processing Units (GPUs). In this thesis, we demonstrate how to use Vulkan to send commands explicitly to separate GPUs in order to implement platform- and vendor-independent multi-GPU rendering. We describe how to implement the sort-first and sort-last approaches to perform parallel rendering with Vulkan. We introduce […]
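A starting-point sketch (not code from the thesis): Vulkan exposes every GPU as a VkPhysicalDevice, and explicit multi-GPU rendering begins by enumerating them so each can be given its own logical device and command streams.

#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    VkApplicationInfo app{VK_STRUCTURE_TYPE_APPLICATION_INFO};
    app.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo info{VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO};
    info.pApplicationInfo = &app;

    VkInstance instance;
    if (vkCreateInstance(&info, nullptr, &instance) != VK_SUCCESS) return 1;

    // Every physical GPU is a candidate for sort-first (split the screen)
    // or sort-last (split the scene, then composite) rendering.
    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> gpus(count);
    vkEnumeratePhysicalDevices(instance, &count, gpus.data());

    for (VkPhysicalDevice gpu : gpus) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(gpu, &props);
        std::printf("GPU: %s\n", props.deviceName);
        // A renderer would now create one VkDevice and queue per GPU
        // and record command buffers for each explicitly.
    }
    vkDestroyInstance(instance, nullptr);
    return 0;
}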
May 26

Automatic generation of warp-level primitives and atomic instructions for fast and portable parallel reduction on GPUs

Since the advent of GPU computing, GPU hardware has evolved at a fast pace. Because application performance depends heavily on the latest hardware improvements, performance portability is extremely challenging for GPU application library developers. Portability becomes even more difficult when new low-level instructions are added to the ISA (e.g., warp shuffle instructions) or the microarchitectural […]
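For context, a hand-written version of the two instruction classes the paper generates automatically, warp shuffles for the intra-warp tree followed by one atomic per warp for the global combine, looks roughly like this (a generic sketch, not the paper's generated code):

#include <cstdio>

__inline__ __device__ float warpReduceSum(float v) {
    // Tree reduction within one 32-thread warp via register exchange.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;  // lane 0 ends up holding the warp's sum
}

__global__ void reduceSum(const float* in, float* out, int n) {
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        v += in[i];                    // grid-stride accumulation
    v = warpReduceSum(v);
    if ((threadIdx.x & 31) == 0)       // one atomic per warp, not per thread
        atomicAdd(out, v);
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;
    reduceSum<<<256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    std::printf("sum = %.0f (expected %d)\n", *out, n);
    cudaFree(in); cudaFree(out);
    return 0;
}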
May 26

Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels

Developing high performance embedded vision applications requires balancing run-time performance with energy constraints. Given the mix of hardware accelerators that exist for embedded computer vision (e.g. multi-core CPUs, GPUs, and FPGAs), and their associated vendor optimized vision libraries, it becomes a challenge for developers to navigate this fragmented solution space. To aid with determining which […]
May 26

Acceleration of Scientific Deep Learning Models on Heterogeneous Computing Platform with Intel FPGAs

AI and deep learning are experiencing explosive growth in almost every domain involving analysis of big data. Deep learning using Deep Neural Networks (DNNs) has shown great promise for such scientific data analysis applications. However, traditional CPU-based sequential computing can no longer meet the requirements of mission-critical applications, which are compute-intensive and require low latency […]
May 26

A Hybrid Framework for Fast and Accurate GPU Performance Estimation through Source-Level Analysis and Trace-Based Simulation

This paper proposes a hybrid framework for fast and accurate performance estimation of OpenCL kernels running on GPUs. The kernel's execution flow is statically analyzed, and an execution trace is generated via a loop-based bidirectional branch search. The trace is then dynamically simulated to perform a dummy execution of the kernel and obtain the […]
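To make the two phases concrete, here is a deliberately toy version of such a "dummy execution": a statically produced instruction trace is replayed against a per-opcode latency table. The opcodes, latencies, and trace are invented placeholders, not the paper's model.

#include <cstdio>
#include <map>
#include <string>
#include <vector>

int main() {
    // Hypothetical latency table; a real model is calibrated per GPU.
    std::map<std::string, long> latency{
        {"fadd", 4}, {"fmul", 4}, {"load", 300}, {"branch", 8}};

    // Stand-in for a trace produced by the static branch search.
    std::vector<std::string> trace{"load", "fmul", "fadd", "branch",
                                   "load", "fmul", "fadd"};

    long cycles = 0;
    for (const std::string& op : trace)
        cycles += latency.at(op);  // "execute" without real hardware
    std::printf("estimated cycles: %ld\n", cycles);
    return 0;
}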
May 23

Performance Analysis of Deep Learning Workloads on Leading-edge Systems

This work examines the performance of leading-edge systems designed for machine learning computing, including the NVIDIA DGX-2, Amazon Web Services (AWS) P3, IBM Power System Accelerated Compute Server AC922, and a consumer-grade Exxact TensorEX TS4 GPU server. Representative deep learning workloads from the fields of computer vision and natural language processing are the focus of […]
May 23

A Case Study: Exploiting Neural Machine Translation to Translate CUDA to OpenCL

The sequence-to-sequence (seq2seq) model for neural machine translation has significantly improved the accuracy of language translation. There have been new efforts to use this seq2seq model for programming language translation and program comparison. In this work, we present the detailed steps of using a seq2seq model to translate CUDA programs to OpenCL programs, which both […]
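The flavor of mapping such a model must learn is visible even in a vector-add kernel: the CUDA form below, with the OpenCL C form a correct translation should emit shown in the trailing comment. A textbook pairing, not an example taken from the paper.

// CUDA source side of the translation pair.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // CUDA thread index
    if (i < n) c[i] = a[i] + b[i];
}

/* OpenCL C target the model should produce:
__kernel void vecAdd(__global const float* a, __global const float* b,
                     __global float* c, int n) {
    int i = get_global_id(0);  // replaces blockIdx.x*blockDim.x+threadIdx.x
    if (i < n) c[i] = a[i] + b[i];
}
*/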
May 23

Instructions’ Latencies Characterization for NVIDIA GPGPUs

The last decade has seen a shift in the computer systems industry where heterogeneous computing has become prevalent. Nowadays, Graphics Processing Units (GPUs) are in a variety of systems from supercomputers to mobile phones and tablets. They are used not only for graphics operations but also as general-purpose hardware (GPGPUs) to boost the performance […]
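Latency characterization of this kind usually times a chain of dependent instructions with the GPU's own cycle counter. A minimal sketch for fused multiply-add latency; the chain length is arbitrary and the result is approximate, since counter overhead is not factored out:

#include <cstdio>

__global__ void fmaLatency(float* sink, long long* cycles) {
    float v = 1.000001f;
    long long start = clock64();      // per-SM cycle counter
    #pragma unroll
    for (int i = 0; i < 256; ++i)
        v = v * v + 0.5f;             // each FMA depends on the last
    long long stop = clock64();
    *sink = v;                        // keep the chain from being optimized away
    *cycles = (stop - start) / 256;   // rough cycles per dependent FMA
}

int main() {
    float* sink;
    long long* cycles;
    cudaMallocManaged(&sink, sizeof(float));
    cudaMallocManaged(&cycles, sizeof(long long));
    fmaLatency<<<1, 1>>>(sink, cycles);  // single thread: pure latency, no overlap
    cudaDeviceSynchronize();
    std::printf("~%lld cycles per dependent FMA\n", *cycles);
    return 0;
}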
May 19

Neural Query Language: A Knowledge Base Query Language for Tensorflow

Large knowledge bases (KBs) are useful for many AI tasks, but are difficult to integrate into modern gradient-based learning systems. Here we describe a framework for accessing a soft symbolic database using only differentiable operators. For example, this framework makes it easy to write neural models that adjust confidences associated with facts in a soft […]
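The core differentiable operator in such a framework is relation following: a soft set of entities, held as a weight vector, is multiplied by a relation's adjacency matrix. Below is a dense toy version with invented entities and an invented relation; it illustrates the idea only and is not the NQL API.

#include <cstdio>
#include <vector>

int main() {
    const int n = 3;  // toy entities: 0 = Paris, 1 = France, 2 = Berlin
    // Invented capital_of relation as an adjacency matrix: Paris -> France.
    std::vector<std::vector<float>> rel{{0, 1, 0}, {0, 0, 0}, {0, 0, 0}};
    std::vector<float> x{0.9f, 0.0f, 0.1f};  // soft set: mostly "Paris"

    // Relation following: y = x * rel, itself a differentiable operation.
    std::vector<float> y(n, 0.0f);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            y[j] += x[i] * rel[i][j];

    std::printf("confidence the answer is France: %.2f\n", y[1]);
    return 0;
}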
May 19

Optimizing the Linear Fascicle Evaluation Algorithm for Multi-Core and Many-Core Systems

Sparse matrix-vector multiplication (SpMV) operations are commonly used in various scientific applications. The performance of the SpMV operation often depends on exploiting regularity patterns in the matrix. Various representations have been proposed to minimize the memory bandwidth bottleneck arising from the irregular memory access pattern involved. Among recent representation techniques, tensor decomposition is a popular […]
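As a baseline for what such representations improve on, the scalar CSR kernel below assigns one thread per row; the gather on x through colIdx is exactly the irregular access that bounds memory bandwidth. A standard textbook kernel, not the paper's code.

// y = A*x with A in CSR form; one thread per matrix row.
__global__ void spmvCsr(int rows, const int* rowPtr, const int* colIdx,
                        const float* vals, const float* x, float* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float sum = 0.0f;
    for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
        sum += vals[k] * x[colIdx[k]];  // irregular, data-dependent gather
    y[row] = sum;
}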
May 19

Accelerating Deterministic and Stochastic Binarized Neural Networks on FPGAs Using OpenCL

Recent technological advances have greatly increased the available computing power, memory, and speed of modern Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Field-Programmable Gate Arrays (FPGAs). Consequently, the performance and complexity of Artificial Neural Networks (ANNs) are burgeoning. While GPU-accelerated Deep Neural Networks (DNNs) currently offer state-of-the-art performance, they consume large amounts […]
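The arithmetic that makes binarized networks cheap on FPGAs and GPUs alike: with weights and activations constrained to {-1, +1} and packed 32 per machine word, a 32-element dot product collapses to XNOR plus popcount. A device-side sketch, with CUDA's __popc standing in for the FPGA popcount:

// Dot product of 32 {-1,+1} values packed as bits (bit set = +1).
__device__ int binaryDot32(unsigned a, unsigned w) {
    unsigned agree = ~(a ^ w);    // XNOR: bit set where the signs match
    int matches = __popc(agree);  // count of matching positions
    return 2 * matches - 32;      // matches minus mismatches
}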

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
