Posts

Apr, 26

Cpp-Taskflow: A General-purpose Parallel and Heterogeneous Task Programming System at Scale

The Cpp-Taskflow project addresses the long-standing question: How can we make it easier for developers to write parallel and heterogeneous programs with high performance and simultaneous high productivity? Cpp-Taskflow develops a simple and powerful task programming model to enable efficient implementations of heterogeneous decomposition strategies. Our programming model empowers users with both static and dynamic […]

Apr, 19

OpenCL-Darknet: implementation and optimization of OpenCL-based deep learning object detection framework

Object detection is a technology that deals with recognizing classes of objects and their location. It is used in many different areas, such as in face-detecting systems [16, 34, 37], surveillance tools [9], human-machine interfaces [17], and self-driving cars [18, 23, 25, 26, 30]. These days, deep learning object detection approaches have achieved significantly better […]

Apr, 19

Design Space Exploration of an OpenCL Based SAXPY Kernel Implementation on FPGAs

High-performance computing researchers are trying to find new options and tools to satisfy the performance criteria of a hardware design. The FPGA (Field Programmable Gate Array) is one of the accelerators widely used for power-efficient applications due to its reconfigurability and high performance. Traditionally, FPGAs are programmed using a Hardware Description Language (HDL). Using HDL, […]

Apr, 19

FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System

Tensor computation plays a paramount role in a broad range of domains, including machine learning, data analytics, and scientific computing. The wide adoption of tensor computation and its huge computation cost have led to high demand for flexible, portable, and high-performance library implementations on heterogeneous hardware accelerators such as GPUs and FPGAs. However, the current […]

Apr, 19

Deep-Edge: An Efficient Framework for Deep Learning Model Update on Heterogeneous Edge

Deep Learning (DL) model-based AI services are increasingly offered in a variety of predictive analytics services such as computer vision, natural language processing, and speech recognition. However, the quality of the DL models can degrade over time due to changes in the input data distribution, thereby requiring periodic model updates. Although cloud data-centers can meet the […]

Apr, 19

A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs

GPUs are playing an increasingly important role in general-purpose computing. Many algorithms require synchronizations at different levels of granularity in a single GPU. Additionally, the emergence of dense GPU nodes also calls for multi-GPU synchronization. Nvidia’s latest CUDA provides a variety of synchronization methods. Until now, there has been no full understanding of the characteristics of […]

Apr, 12

MNN: A Universal and Efficient Inference Engine

Deploying deep learning models on mobile devices has drawn more and more attention recently. However, designing an efficient on-device inference engine faces the great challenges of model compatibility, device diversity, and resource limitation. To deal with these challenges, we propose Mobile Neural Network (MNN), a universal and efficient inference engine tailored to mobile applications. […]

Apr, 12

Open Source Face Recognition API

Face recognition applications are widely used today for a variety of tasks, whether personal or professional. When looking for a service that provides face detection and classification, it is easy to find several solutions. This project describes another way to perform this task according to the desired needs […]

Apr, 12

Using Machine Learning to Estimate Utilization and Throughput for OpenCL-Based SpMV Implementation on an FPGA

Hardware designers use High-Level Synthesis (HLS) tools to reduce design time and design complexity. OpenCL is a framework that uses HLS tools and permits the programmer to write standardized C-like code for the host as well as for the hardware accelerators. Using OpenCL, a program can be written using different memory access […]

Apr, 12

Neural Architecture Search for Lightweight Non-Local Networks

Non-Local (NL) blocks have been widely studied in various vision tasks. However, embedding NL blocks in mobile neural networks has rarely been explored, mainly due to the following challenges: 1) NL blocks generally have a heavy computation cost, which makes them difficult to apply in applications where computational resources are limited, and […]

Apr, 12

LUDA: Boost LSM Key Value Store Compactions with GPUs

Log-Structured-Merge (LSM) tree-based key value stores face critical challenges in fully leveraging the dramatic performance improvements of the underlying storage devices, which make the compaction operations of LSM key value stores CPU-bound, and slow compactions significantly degrade key value store performance. To address this issue, we propose LUDA, an LSM key value store […]

Apr, 5

Deep Learning for Compilers

Constructing compilers is hard. Optimising compilers are multi-million dollar projects spanning years of development, yet remain unable to fully exploit the available performance, and are prone to bugs. The rapid transition to heterogeneous parallelism and diverse architectures has raised demand for aggressively-optimising compilers to an all-time high, leaving compiler developers struggling to keep up. […]

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
