
Posts

Feb, 3

DeepSZ: A Novel Framework to Compress Deep Neural Networks by Using Error-Bounded Lossy Compression

DNNs have been quickly and broadly exploited to improve data analysis quality in many complex science and engineering applications. Today’s DNNs are becoming deeper and wider because of the increasing demand for analysis quality and the growing complexity of the applications they must address. Wide and deep DNNs, however, require large amounts of resources, significantly […]
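The core idea can be illustrated independently of the DeepSZ pipeline itself: an error-bounded lossy compressor guarantees that every reconstructed weight stays within a user-chosen absolute error bound of the original. Below is a minimal NumPy sketch using uniform quantization; the function names and the quantization scheme are illustrative assumptions, not DeepSZ’s actual algorithm.

    import numpy as np

    def compress(weights, error_bound):
        # Round each weight to the nearest multiple of 2*error_bound, so the
        # reconstruction error of every element is at most error_bound.
        step = 2.0 * error_bound
        codes = np.round(weights / step).astype(np.int32)  # small integers, highly compressible
        return codes, step

    def decompress(codes, step):
        return codes.astype(np.float32) * step

    w = np.random.randn(1 << 16).astype(np.float32)        # stand-in for a layer's weight tensor
    codes, step = compress(w, error_bound=1e-3)
    w_hat = decompress(codes, step)
    assert np.max(np.abs(w - w_hat)) <= 1e-3 + 1e-6        # the error bound holds element-wise

The integer codes would then be passed to an entropy coder; the point of the error bound is that analysis quality can be traded against compression ratio in a controlled, per-element way.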
Jan, 27

Exploiting OpenMP & OpenACC to Accelerate a Molecular Docking Mini-App in Heterogeneous HPC Nodes

In drug discovery, molecular docking is the task of estimating the position of a molecule when it interacts with the docking site. It is typically used to screen a large library of molecules in the early phase of the process. Given the number of candidate molecules and the complexity of the […]
Jan, 27

Supporting mixed-datatype matrix multiplication within the BLIS framework

We approach the problem of implementing mixed-datatype support within the general matrix multiplication (GEMM) operation of the BLIS framework, whereby each matrix operand A, B, and C may be stored as single- or double-precision real or complex values. Another factor of complexity, whereby the computation is allowed to take place in a precision different from […]
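As a rough sketch of the idea (not the BLIS implementation), a mixed-datatype GEMM casts the operands into a chosen computation precision, performs the multiply-accumulate there, and casts the result back to the storage type of C. The helper below and its comp_dtype parameter are illustrative assumptions written with NumPy.

    import numpy as np

    def mixed_gemm(alpha, A, B, beta, C, comp_dtype=np.complex128):
        # C := beta*C + alpha*A@B, where A, B, C may each be stored as single- or
        # double-precision real or complex values; the multiply-accumulate runs in
        # comp_dtype and the result is cast back to C's storage type.
        acc = alpha * (A.astype(comp_dtype) @ B.astype(comp_dtype)) \
              + beta * C.astype(comp_dtype)
        if np.issubdtype(C.dtype, np.floating):
            # Storing a complex accumulation into a real C keeps only the real part,
            # the "projection" case of mixed-domain GEMM.
            acc = acc.real
        C[...] = acc.astype(C.dtype)
        return C

    A = np.random.rand(4, 3).astype(np.float32)              # single-precision real
    B = np.random.rand(3, 5) + 1j * np.random.rand(3, 5)     # double-precision complex
    C = np.zeros((4, 5), dtype=np.float64)                   # double-precision real
    mixed_gemm(1.0, A, B, 0.0, C)

With three operands, two domains, and two precisions each, the number of type combinations grows quickly, which is why handling them within a single framework (rather than one routine per combination) is the interesting engineering problem here.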
Jan, 27

Efficient Implementation and Optimization of Geometric Multigrid Operations in the LIFT Framework

Geometric Multigrid (GMG) is an efficient method to solve partial differential equations. It consists of four operations (smooth, residual, restrict and prolongate), which are applied iteratively. All four operations are stencil computations and hence benefit from being executed on many-core architectures like GPUs. Programs executed on a GPU are usually written using low-level programming approaches […]
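A minimal 1-D Poisson sketch may help show how the four operations compose into the iterative cycle; the grid sizes, the weighted-Jacobi smoother, and the transfer stencils below are illustrative choices, not the LIFT implementation.

    import numpy as np

    def smooth(u, f, h, iters=3):
        for _ in range(iters):                        # weighted Jacobi for -u'' = f
            u[1:-1] = (1/3)*u[1:-1] + (2/3)*0.5*(u[:-2] + u[2:] + h*h*f[1:-1])
        return u

    def residual(u, f, h):
        r = np.zeros_like(u)
        r[1:-1] = f[1:-1] - (2*u[1:-1] - u[:-2] - u[2:]) / (h*h)
        return r

    def restrict(r):                                  # full weighting, fine -> coarse
        rc = np.zeros(r.size // 2 + 1)
        rc[1:-1] = 0.25*r[1:-2:2] + 0.5*r[2:-1:2] + 0.25*r[3::2]
        return rc

    def prolongate(e):                                # linear interpolation, coarse -> fine
        ef = np.zeros(2*e.size - 1)
        ef[::2] = e
        ef[1::2] = 0.5*(e[:-1] + e[1:])
        return ef

    def v_cycle(u, f, h, levels):
        if levels == 1:
            return smooth(u, f, h, iters=50)          # crude coarsest-grid solve
        u = smooth(u, f, h)                           # pre-smooth
        rc = restrict(residual(u, f, h))
        u += prolongate(v_cycle(np.zeros_like(rc), rc, 2*h, levels - 1))
        return smooth(u, f, h)                        # post-smooth

    N = 64
    x = np.linspace(0.0, 1.0, N + 1)
    f = np.pi**2 * np.sin(np.pi * x)                  # exact solution: sin(pi*x)
    u = np.zeros(N + 1)
    for _ in range(10):
        u = v_cycle(u, f, 1.0/N, levels=4)
    print(np.max(np.abs(u - np.sin(np.pi * x))))      # remaining error after 10 V-cycles

Each of the four functions is a stencil sweep over a regular grid, which is exactly the kind of data-parallel pattern that maps well onto GPUs.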
Jan, 27

AccUDNN: A GPU Memory Efficient Accelerator for Training Ultra-deep Deep Neural Networks

Ultra-deep neural networks (UDNNs) typically yield high-quality models, but their training is usually resource-intensive and time-consuming. The scarce DRAM capacity of modern GPUs is the primary bottleneck that limits the trainability and the training efficiency of UDNNs. In this paper, we present "AccUDNN", an accelerator that aims to make the utmost use of finite […]
Jan, 27

Direct N-body code on low-power embedded ARM GPUs

This work arises in the context of the ExaNeSt project, which aims at the design and development of an exascale-ready supercomputer with a low energy-consumption profile that is nonetheless able to support the most demanding scientific and technical applications. The ExaNeSt compute unit consists of densely packed low-power 64-bit ARM processors embedded within Xilinx FPGA SoCs. SoC boards are […]
Jan, 20

Exploring FPGA-specific Optimizations for Irregular OpenCL Applications

OpenCL is emerging as a high-level hardware description language to address the productivity challenges of developing applications on FPGAs. Unlike traditional hardware description languages (HDLs), OpenCL provides an abstract interface to facilitate high productivity, enabling end users to rapidly describe the required computations, including parallelism and data movement, to create custom hardware accelerators for their […]
Jan, 20

AutoPhase: Compiler Phase-Ordering for High Level Synthesis with Deep Reinforcement Learning

The performance of the code generated by a compiler depends on the order in which the optimization passes are applied. In the context of high-level synthesis, the quality of the generated circuit relates directly to the code generated by the front-end compiler. Unfortunately, choosing a good order, often referred to as the phase-ordering problem, is an NP-hard […]
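A toy illustration of why ordering matters: the same three passes applied in different orders can leave the program at different sizes, and the number of possible orders grows factorially with the number of passes, which is what motivates learning an ordering rather than searching exhaustively. The miniature IR and passes below are invented for illustration; they are not the LLVM/HLS passes used in the paper.

    from itertools import permutations

    def constant_fold(prog):
        # Replace "dst = x + y" with a constant when both operands are literals.
        out = []
        for dst, op, x, y in prog:
            if op == '+' and isinstance(x, int) and isinstance(y, int):
                out.append((dst, 'const', x + y, None))
            else:
                out.append((dst, op, x, y))
        return out

    def propagate_constants(prog):
        # Rewrite uses of variables known to hold a constant value.
        env, out = {}, []
        for dst, op, x, y in prog:
            x, y = env.get(x, x), env.get(y, y)
            if op == 'const':
                env[dst] = x
            out.append((dst, op, x, y))
        return out

    def dead_code_elim(prog, live={'result'}):
        # Drop definitions whose destination is never used later.
        needed, out = set(live), []
        for dst, op, x, y in reversed(prog):
            if dst in needed:
                out.append((dst, op, x, y))
                needed |= {v for v in (x, y) if isinstance(v, str)}
        return out[::-1]

    prog = [('a', 'const', 2, None),
            ('b', 'const', 3, None),
            ('t', '+', 'a', 'b'),
            ('result', '+', 't', 0)]

    passes = {'fold': constant_fold, 'prop': propagate_constants, 'dce': dead_code_elim}
    for order in permutations(passes):            # exhaustive search: |passes|! orders
        p = prog
        for name in order:
            p = passes[name](p)
        print(order, '->', len(p), 'instructions')

Running this shows that orders applying dead-code elimination before constant propagation keep all four instructions, while orders applying it afterwards shrink the program to two, a small-scale version of the search space a reinforcement-learning agent would navigate.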
Jan, 20

Automatic acceleration of Numpy applications on GPUs and multicore CPUs

Frameworks like Numpy are a popular choice for application developers in fields ranging from image processing to bioinformatics to machine learning. Numpy is often used both for prototyping and for deployment, since it provides efficient implementations of array operations. Such an approach requires every operation to be executed eagerly. The result of each […]
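The eager behaviour referred to here is easy to see directly: each NumPy operation runs immediately and materializes its full result before the next one starts, whereas a fusing or offloading backend could evaluate the whole expression in a single pass. The explicit loop below is written purely to illustrate that fused pass; it is not the paper’s framework.

    import numpy as np

    n = 100_000
    x = np.random.rand(n)
    y = np.random.rand(n)

    # Eager evaluation: three separate passes over memory, two temporary arrays.
    t1 = x * y                    # temporary 1 is fully materialized here
    t2 = t1 + 2.0                 # temporary 2 is fully materialized here
    z = np.sqrt(t2)               # only now is the final result produced

    # What a fusing/offloading backend could generate instead: one pass over the
    # data that never materializes t1 or t2 (a Python loop for illustration; a
    # real backend would emit a multicore CPU or GPU kernel).
    z_fused = np.empty(n)
    for i in range(n):
        z_fused[i] = (x[i] * y[i] + 2.0) ** 0.5

    assert np.allclose(z, z_fused)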
Jan, 20

A Survey of FPGA Based Deep Learning Accelerators: Challenges and Opportunities

With the rapid development of deep learning, neural networks and deep learning algorithms have been widely used in various fields, e.g., image, video and voice processing. However, neural network models are becoming larger and larger, as reflected in the number of model parameters to be computed. Although a wealth of existing efforts on GPU platforms currently […]
Jan, 20

Tango: A Deep Neural Network Benchmark Suite for Various Accelerators

Deep neural networks (DNNs) have been proving their effectiveness in various computing fields. To provide more efficient computing platforms for DNN applications, it is essential to have evaluation environments that include assorted benchmark workloads. Though a few DNN benchmark suites have been released recently, most of them require installing proprietary DNN libraries or resource-intensive […]
Jan, 13

Vulkan 1.1.97 – A Specification (with all registered Vulkan extensions)

This document, referred to as the "Vulkan Specification", describes the Vulkan Application Programming Interface (API). Vulkan is a C99 API designed for explicit control of low-level graphics and compute functionality. The Vulkan Specification is intended for use both by implementors of the API and by application developers seeking to make use of the API, forming a […]
