18574

Posts

Oct, 28

Towards Efficient Large-Scale Graph Neural Network Computing

Recent deep learning models have moved beyond low-dimensional regular grids such as image, video, and speech, to high-dimensional graph-structured data, such as social networks, brain connections, and knowledge graphs. This evolution has led to large graph-based irregular and sparse models that go beyond what existing deep learning frameworks are designed for. Further, these models are […]
Oct, 28

Improving OpenCL Performance by Specializing Compiler Phase Selection and Ordering

Automatic compiler phase selection/ordering has traditionally been focused on CPUs and, to a lesser extent, FPGAs. We present experiments regarding compiler phase ordering specialization of OpenCL kernels targeting a GPU. We use iterative exploration to specialize LLVM phase orders on 15 OpenCL benchmarks to an NVIDIA GPU. We analyze the generated NVIDIA PTX code for […]
Oct, 21

Using Compiler Snippets to Exploit Parallelism on Heterogeneous Hardware: A Java Reduction Case Study

Parallel skeletons are essential structured design patterns for efficient heterogeneous and parallel programming. They allow programmers to express common algorithms in such a way that it is much easier to read, maintain, debug and implement for different parallel programming models and parallel architectures. Reductions are one of the most common parallel skeletons. Many programming frameworks […]
Oct, 21

Non-Uniform Domain Decomposition for Heterogeneous Accelerated Processing Units

The use of heterogeneous architectures has become indispensable in optimizing application performance. Nowadays, one of the most popular heterogeneous architectures is discrete CPU+GPU. Despite the high computational power present in such architectures, in many cases, memory data transfers between CPU and GPU are significant performance bottlenecks. As an attempt to mitigate performance costs involved in […]
Oct, 21

A Survey of FPGA-based Accelerators for Convolutional Neural Networks

Deep convolutional neural networks (CNNs) have recently shown very high accuracy in a wide range of cognitive tasks and due to this, they have received significant interest from the researchers. Given the high computational demands of CNNs, custom hardware accelerators are vital for boosting their performance. The high energy-efficiency, computing capabilities and reconfigurability of FPGA […]
Oct, 21

Exploiting Task Parallelism with OpenCL: A Case Study

While data parallelism aspects of OpenCL have been of primary interest due to the massively data parallel GPUs being on focus, OpenCL also provides powerful capabilities to describe task parallelism. In this article we study the task parallel concepts available in OpenCL and find out how well the different vendor-specific implementations can exploit task parallelism […]
Oct, 21

AI Benchmark: Running Deep Neural Networks on Android Smartphones

Over the last years, the computational power of mobile devices such as smartphones and tablets has grown dramatically, reaching the level of desktop computers available not long ago. While standard smartphone apps are no longer a problem for them, there is still a group of tasks that can easily challenge even high-end devices, namely running […]
Oct, 13

Resource Elastic Virtualization for FPGAs using OpenCL

FPGAs are rising in popularity for acceleration in all kinds of systems. However, even in cloud environments, FPGA devices are typically still used exclusively by one application only. To overcome this, and as an approach to manage FPGA resources with OS functionality, this paper introduces the concept of resource elastic virtualization which allows shrinking and […]
Oct, 13

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms – such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) – requires significant manual effort. We […]
Oct, 13

Towards Lattice Quantum Chromodynamics on FPGA devices

In this paper we describe a single-node, double precision FPGA implementation of the Conjugate Gradient algorithm in the context of Lattice Quantum Chromodynamics. As a benchmark of our proposal we invert numerically the Dirac-Wilson operator on a 4-dimensional grid on a Xilinx Zynq Ultrascale+ evaluation board. In our implementation we separate software/hardware parts in such […]
Oct, 13

Sparse Winograd Convolutional neural networks on small-scale systolic arrays

The reconfigurability, energy-efficiency, and massive parallelism on FPGAs make them one of the best choices for implementing efficient deep learning accelerators. However, state-of-art implementations seldom consider the balance between high throughput of computation power and the ability of the memory subsystem to support it. In this paper, we implement an accelerator on FPGA by combining […]
Oct, 13

Adaptive Partitioning for Iterated Sequences of Irregular OpenCL Kernels

OpenCL defines a common parallel programming language for all devices, although writing tasks adapted to the devices, managing communication and load-balancing issues are left to the programmer. We propose in this paper a static/dynamic approach for the execution of an iterated sequence of data-dependent kernels on a multi-device heterogeneous architecture. The method allows to automatically […]

Recent source codes

* * *

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us: