22411

Posts

Aug, 2

Deep Learning Application in Plant Stress Imaging: A Review

Plant stress is one of major issues that cause significant economic loss for growers. The labor-intensive conventional methods for identifying the stressed plants constrain their applications. To address this issue, rapid methods are in urgent needs. Developments of advanced sensing and machine learning techniques trigger revolutions for precision agriculture based on deep learning and big […]
Aug, 2

Biomedical and Clinical English Model Packages in the Stanza Python NLP Library

We introduce biomedical and clinical English model packages for the Stanza Python NLP library. These packages offer accurate syntactic analysis and named entity recognition capabilities for biomedical and clinical text, by combining Stanza’s fully neural architecture with a wide variety of open datasets as well as large-scale unsupervised biomedical and clinical text data. We show […]
Aug, 2

Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM

We implemented and optimized matrix multiplications between dense and block-sparse matrices on CUDA. We leveraged TVM, a deep learning compiler, to explore the schedule space of the operation and generate efficient CUDA code. With the automatic parameter tuning in TVM, our cross-thread reduction based implementation achieved competitive or better performance compared with other state-of-the-art frameworks.
Jul, 26

Darknet on OpenCL: a multi-platform tool for object detection and classification

The article’s goal is to overview challenges and problems on the way from the state of the art CUDA accelerated neural networks code to multi-GPU code. For this purpose, the authors describe the journey of porting the existing in the GitHub, fully-featured CUDA accelerated Darknet engine to OpenCL. The article presents lessons learned and the […]
Jul, 26

EDSSA: An Encoder-Decoder Semantic Segmentation Networks Accelerator on OpenCL-Based FPGA Platform

Visual semantic segmentation, which is represented by the semantic segmentation network, has been widely used in many fields, such as intelligent robots, security, and autonomous driving. However, these Convolutional Neural Network (CNN)-based networks have high requirements for computing resources and programmability for hardware platforms. For embedded platforms and terminal devices in particular, Graphics Processing Unit […]
Jul, 26

Bit-level Parallelization of 3DES Encryption on GPU

Triple DES (3DES) is a standard fundamental encryption algorithm, used in several electronic payment applications and web browsers. In this paper, we propose a parallel implementation of 3DES on GPU. Since 3DES encrypts data with 64-bit blocks, our approach considers each 64-bit block a kernel block and assign a separate thread to process each bit. […]
Jul, 26

GPU coprocessors as a service for deep learning inference in high energy physics

In the next decade, the demands for computing in large scientific experiments are expected to grow tremendously. During the same time period, CPU performance increases will be limited. At the CERN Large Hadron Collider (LHC), these two issues will confront one another as the collider is upgraded for high luminosity running. Alternative processors such as […]
Jul, 26

UVMBench: A Comprehensive Benchmark Suite for Researching Unified Virtual Memory in GPUs

The recent introduction of Unified Virtual Memory (UVM) in GPUs offers a new programming model that allows GPUs and CPUs to share the same virtual memory space, shifts the complex memory management from programmers to GPU driver/ hardware, and enables kernel execution even when memory is oversubscribed. Meanwhile, UVM may also incur considerable performance overhead […]
Jul, 19

Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads

As specialized hardware accelerators such as GPUs become increasingly popular, developers are looking for ways to target these platforms with high-level APIs. One promising approach is kernel libraries such as PyTorch or cuML, which provide interfaces that mirror CPU-only counterparts such as NumPy or Scikit-Learn. Unfortunately, these libraries are hard to develop and to adopt […]
Jul, 19

Compyle: a Python package for parallel computing

Compyle allows users to execute a restricted subset of Python on a variety of HPC platforms. It is an embedded domain-specific language (eDSL) for parallel computing. It currently supports multi-core execution using Cython, and OpenCL and CUDA for GPU devices. Users write code in a restricted subset of Python that is automatically transpiled to high-performance […]
Jul, 19

Accelerating Deep Learning Inference with Cross-Layer Data Reuse on GPUs

Accelerating the deep learning inference is very important for real-time applications. In this paper, we propose a novel method to fuse the layers of convolutional neural networks (CNNs) on Graphics Processing Units (GPUs), which applies data reuse analysis and access optimization in different levels of the memory hierarchy. To achieve the balance between computation and […]
Jul, 19

A load balance multi-scheduling model for OpenCL kernel tasks in an integrated cluster

Nowadays, embedded systems are comprised of heterogeneous multi-core architectures, i.e., CPUs and GPUs. If the application is mapped to an appropriate processing core, then these architectures provide many performance benefits to applications. Typically, programmers map sequential applications to CPU and parallel applications to GPU. The task mapping becomes challenging because of the usage of evolving […]

* * *

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors

Contact us: