high performance computing on graphics processing units: hgpu.org

Posts

Apr, 26

GEVO: GPU Code Optimization using Evolutionary Computation

GPUs are a key enabler of the revolution in machine learning and high performance computing, functioning as de facto co-processors to accelerate large-scale computation. As the programming stack and tool support have matured, GPUs have also become accessible to programmers, who may lack detailed knowledge of the underlying architecture and fail to fully leverage the […]

CUDA

Apr, 26

Evaluating FPGA Accelerator Performance with a Parameterized OpenCL Adaptation of the HPCChallenge Benchmark Suite

FPGAs have found increasing adoption in data center applications since a new generation of high-level tools have become available which noticeably reduce development time for FPGA accelerators and still provide high quality of results. There is however no high-level benchmark suite available which specifically enables a comparison of FPGA architectures, programming tools and libraries for […]

OpenCL

Apr, 26

Cpp-Taskflow: A General-purpose Parallel and Heterogeneous Task Programming System at Scale

The Cpp-Taskflow project addresses the long-standing question: How can we make it easier for developers to write parallel and heterogeneous programs with high performance and simultaneous high productivity? Cpp-Taskflow develops a simple and powerful task programming model to enable efficient implementations of heterogeneous decomposition strategies. Our programming model empowers users with both static and dynamic […]

CUDA

Apr, 19

OpenCL-Darknet: implementation and optimization of OpenCL-based deep learning object detection framework

Object detection is a technology that deals with recognizing classes of objects and their location. It is used in many different areas, such as in face-detecting systems [16, 34, 37], surveillance tools [9], human-machine interfaces [17], and self-driving cars [18, 23, 25, 26, 30]. These days, deep learning object detection approaches have achieved significantly better […]

CUDA

•

OpenCL

Apr, 19

Design Space Exploration of an OpenCL Based SAXPY Kernel Implementation on FPGAs

High-performance computing researchers are trying to find new options, tools to satisfy the performance criteria of a hardware design. FPGA (Field Programmable Gate Array) is one of the accelerators which is widely used for power-efficient applications due to its reconfigurability and high performance. Traditionally FPGA can be programmed using Hardware Description Language (HDL). Using HDL, […]

OpenCL

Apr, 19

FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System

Tensor computation plays a paramount role in a broad range of domains, including machine learning, data analytics, and scientific computing. The wide adoption of tensor computation and its huge computation cost has led to high demand for flexible, portable, and high-performance library implementation on heterogeneous hardware accelerators such as GPUs and FPGAs. However, the current […]

CUDA

•

OpenCL

Apr, 19

Deep-Edge: An Efficient Framework for Deep Learning Model Update on Heterogeneous Edge

Deep Learning (DL) model-based AI services are increasingly offered in a variety of predictive analytics services such as computer vision, natural language processing, speech recognition. However, the quality of the DL models can degrade over time due to changes in the input data distribution, thereby requiring periodic model updates. Although cloud data-centers can meet the […]

CUDA

Apr, 19

A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs

GPUs are playing an increasingly important role in general-purpose computing. Many algorithms require synchronizations at different levels of granularity in a single GPU. Additionally, the emergence of dense GPU nodes also calls for multi-GPU synchronization. Nvidia’s latest CUDA provides a variety of synchronization methods. Until now, there is no full understanding of the characteristics of […]

CUDA

Apr, 12

MNN: A Universal and Efficient Inference Engine

Deploying deep learning models on mobile devices draws more and more attention recently. However, designing an efficient inference engine on devices is under the great challenges of model compatibility, device diversity, and resource limitation. To deal with these challenges, we propose Mobile Neural Network (MNN), a universal and efficient inference engine tailored to mobile applications. […]

OpenCL

Apr, 12

Open Source Face Recognition API

Face recognition applications are widely used today for a variety of tasks, whether personal or professional. When looking for a service that provides face detection and classification, it is easy to find several solutions. In this project another way is described so that it is possible to perform this task according to the desired needs […]

Apr, 12

Using Machine Learning to Estimate Utilization and Throughput for OpenCL-Based SpMV Implementation on an FPGA

Hardware designers use High-Level Synthesis (HLS) tools in order to reduce the design time and design complexity. OpenCL is a framework that uses HLS tools and permits the programmer to write standardized C-like code for the host as well as for the hardware accelerators. Using OpenCL, a program can be written using different memory access […]

OpenCL

Apr, 12

Neural Architecture Search for Lightweight Non-Local Networks

Non-Local (NL) blocks have been widely studied in various vision tasks. However, it has been rarely explored to embed the NL blocks in mobile neural networks, mainly due to the following challenges: 1) NL blocks generally have heavy computation cost which makes it difficult to be applied in applications where computational resources are limited, and […]

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

* * *

high performance computing on graphics processing units: hgpu.org

Posts

GEVO: GPU Code Optimization using Evolutionary Computation

Evaluating FPGA Accelerator Performance with a Parameterized OpenCL Adaptation of the HPCChallenge Benchmark Suite

Cpp-Taskflow: A General-purpose Parallel and Heterogeneous Task Programming System at Scale

OpenCL-Darknet: implementation and optimization of OpenCL-based deep learning object detection framework

Design Space Exploration of an OpenCL Based SAXPY Kernel Implementation on FPGAs

FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System

Deep-Edge: An Efficient Framework for Deep Learning Model Update on Heterogeneous Edge

A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs

MNN: A Universal and Efficient Inference Engine

Open Source Face Recognition API

Using Machine Learning to Estimate Utilization and Throughput for OpenCL-Based SpMV Implementation on an FPGA

Neural Architecture Search for Lightweight Non-Local Networks

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)