high performance computing on graphics processing units: hgpu.org

Posts

Jul, 26

Darknet on OpenCL: a multi-platform tool for object detection and classification

The article gives an overview of the challenges and problems on the way from state-of-the-art CUDA-accelerated neural network code to multi-GPU code. For this purpose, the authors describe the journey of porting the fully featured, CUDA-accelerated Darknet engine available on GitHub to OpenCL. The article presents lessons learned and the […]

CUDA • OpenCL

Jul, 26

EDSSA: An Encoder-Decoder Semantic Segmentation Networks Accelerator on OpenCL-Based FPGA Platform

Visual semantic segmentation, which is represented by the semantic segmentation network, has been widely used in many fields, such as intelligent robots, security, and autonomous driving. However, these Convolutional Neural Network (CNN)-based networks have high requirements for computing resources and programmability for hardware platforms. For embedded platforms and terminal devices in particular, Graphics Processing Unit […]

OpenCL

Jul, 26

Bit-level Parallelization of 3DES Encryption on GPU

Triple DES (3DES) is a standard fundamental encryption algorithm, used in several electronic payment applications and web browsers. In this paper, we propose a parallel implementation of 3DES on GPU. Since 3DES encrypts data with 64-bit blocks, our approach considers each 64-bit block a kernel block and assigns a separate thread to process each bit. […]

CUDA

Jul, 26

GPU coprocessors as a service for deep learning inference in high energy physics

In the next decade, the demands for computing in large scientific experiments are expected to grow tremendously. During the same time period, CPU performance increases will be limited. At the CERN Large Hadron Collider (LHC), these two issues will confront one another as the collider is upgraded for high luminosity running. Alternative processors such as […]

Jul, 26

UVMBench: A Comprehensive Benchmark Suite for Researching Unified Virtual Memory in GPUs

The recent introduction of Unified Virtual Memory (UVM) in GPUs offers a new programming model that allows GPUs and CPUs to share the same virtual memory space, shifts the complex memory management from programmers to the GPU driver/hardware, and enables kernel execution even when memory is oversubscribed. Meanwhile, UVM may also incur considerable performance overhead […]

CUDA

Jul, 19

Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads

As specialized hardware accelerators such as GPUs become increasingly popular, developers are looking for ways to target these platforms with high-level APIs. One promising approach is kernel libraries such as PyTorch or cuML, which provide interfaces that mirror CPU-only counterparts such as NumPy or Scikit-Learn. Unfortunately, these libraries are hard to develop and to adopt […]

CUDA

Jul, 19

Compyle: a Python package for parallel computing

Compyle allows users to execute a restricted subset of Python on a variety of HPC platforms. It is an embedded domain-specific language (eDSL) for parallel computing. It currently supports multi-core execution using Cython, and OpenCL and CUDA for GPU devices. Users write code in a restricted subset of Python that is automatically transpiled to high-performance […]

CUDA • OpenCL

Jul, 19

Accelerating Deep Learning Inference with Cross-Layer Data Reuse on GPUs

Accelerating deep learning inference is very important for real-time applications. In this paper, we propose a novel method to fuse the layers of convolutional neural networks (CNNs) on Graphics Processing Units (GPUs), which applies data reuse analysis and access optimization in different levels of the memory hierarchy. To achieve the balance between computation and […]

CUDA

Jul, 19

A load balance multi-scheduling model for OpenCL kernel tasks in an integrated cluster

Nowadays, embedded systems are comprised of heterogeneous multi-core architectures, i.e., CPUs and GPUs. If the application is mapped to an appropriate processing core, then these architectures provide many performance benefits to applications. Typically, programmers map sequential applications to CPU and parallel applications to GPU. The task mapping becomes challenging because of the usage of evolving […]

OpenCL

Jul, 19

Deep Graph Library Optimizations for Intel(R) x86 Architecture

The Deep Graph Library (DGL) was designed as a tool to enable structure learning from graphs, by supporting a core abstraction for graphs, including the popular Graph Neural Networks (GNN). DGL contains implementations of all core graph operations for both the CPU and GPU. In this paper, we focus specifically on CPU implementations and present […]

CUDA

Jul, 12

GPU-Accelerated Drug Discovery with Docking on the Summit Supercomputer: Porting, Optimization, and Application to COVID-19 Research

Protein-ligand docking is an in silico tool used to screen potential drug compounds for their ability to bind to a given protein receptor within a drug-discovery campaign. Experimental drug screening is expensive and time consuming, and it is desirable to carry out large scale docking calculations in a high-throughput manner to narrow the experimental search […]

CUDA • OpenCL

Jul, 12

Investigating Input Representations and Representation Models of Source Code for Machine Learning

Machine Learning methods are actively used to solve various tasks on source code, such as in Compilers to improve performance of executable code, or IDEs to boost developer productivity. While the use cases are manifold, most of these methods rely on manually-defined features that require substantial engineering efforts, while not necessarily being optimal. In this […]

OpenCL

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?
