Posts
May, 23
ImageCL: An Image Processing Language for Performance Portability on Heterogeneous Systems
Modern computer systems typically combine multicore CPUs with accelerators like GPUs for improved performance and energy efficiency. However, these systems suffer from poor performance portability: code tuned for one device must be retuned to achieve high performance on another. Image processing is increasing in importance, with applications ranging from seismology and medicine to Photoshop. Based […]
May, 23
Deep Roots: Improving CNN Efficiency with Hierarchical Filter Groups
We propose a new method for training computationally efficient and compact convolutional neural networks (CNNs) using a novel sparse connection structure that resembles a tree root. Our sparse connection structure facilitates a significant reduction in computational cost and number of parameters of state-of-the-art deep CNNs without compromising accuracy. We validate our approach by using it […]
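The parameter savings that filter groups provide can be illustrated with a quick calculation. The following is only a sketch of the general grouped-convolution arithmetic, with hypothetical layer sizes, not figures from the paper:

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution layer with filter groups.

    With `groups` filter groups, each output channel connects to only
    c_in / groups input channels, dividing the weight count by `groups`.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    return c_out * (c_in // groups) * k * k

# Hypothetical layer: 256 -> 256 channels, 3x3 kernel
dense = conv_params(256, 256, 3)             # 589,824 weights
grouped = conv_params(256, 256, 3, groups=8) # 73,728 weights
print(dense, grouped, dense // grouped)      # 8x fewer parameters
```

The same division applies to the multiply-accumulate count, which is where the computational savings come from.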
May, 23
Graphics Supercomputing Applied to Brain Image Analysis with NiftyReg
Medical image processing in general and brain image processing in particular are computationally intensive tasks. Luckily, this burden can be alleviated by means of techniques such as GPU programming. In this article we study NiftyReg, a brain image processing library with a GPU implementation using CUDA, and analyse different possible ways of further optimising the […]
May, 23
A Practical Performance Model for Compute and Memory Bound GPU Kernels
Performance prediction of GPU kernels is generally a tedious procedure with unpredictable results. In this paper, we provide a practical model for estimating performance of CUDA kernels on GPU hardware in an automated manner. First, we propose the quadrant-split model, an alternative to the roofline visual performance model, which provides insight on the performance limiting […]
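The teaser does not detail the quadrant-split model itself, but the classic roofline model it offers an alternative to can be sketched in a few lines. This is the standard textbook bound, not the paper's model; the peak-rate and bandwidth numbers are hypothetical:

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Attainable performance under the classic roofline model.

    A kernel with arithmetic intensity below the ridge point
    (peak_gflops / bandwidth_gbs) is memory-bound; above it, compute-bound.
    """
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

# Hypothetical GPU: 5000 GFLOP/s peak, 500 GB/s memory bandwidth.
# The ridge point sits at 10 FLOP/byte.
print(roofline_gflops(5000, 500, 2))   # memory-bound: 1000 GFLOP/s
print(roofline_gflops(5000, 500, 40))  # compute-bound: 5000 GFLOP/s
```

Plotting this bound against measured kernel performance is what makes the roofline a *visual* model; the quadrant-split model proposed here refines which limiting factor a kernel falls under.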
May, 21
The Hitchhiker’s Guide to Cross-Platform OpenCL Application Development
One of the benefits of programming in OpenCL is platform portability. That is, an OpenCL program that follows the OpenCL specification should, in principle, execute reliably on any platform that supports OpenCL. To assess the current state of OpenCL portability, we provide an experience report examining two sets of open source benchmarks that we attempted […]
May, 21
Architecture-Adaptive Code Variant Tuning
Code variants represent alternative implementations of a computation, and are common in high-performance libraries and applications to facilitate selecting the most appropriate implementation for a specific execution context (target architecture and input dataset). Automating code variant selection typically relies on machine learning to construct a model during an offline learning phase that can be quickly […]
May, 21
GPU-based Pedestrian Detection for Autonomous Driving
Pedestrian detection has gained a lot of prominence during the last few years. Besides the fact that it is one of the hardest tasks within computer vision, it involves huge computational costs. Obtaining acceptable real-time performance, measured in frames per second (fps), for the most advanced algorithms is nowadays a hard challenge. In this work, […]
May, 21
Performance Evaluation of Parallel Count Sort using GPU Computing with CUDA
OBJECTIVE: Sorting is considered a very important application in many areas of computer science. Nowadays, parallelization of sorting algorithms using GPU computing on CUDA hardware is increasing rapidly. The objective behind using GPU computing is that users can achieve greater speedup of their algorithms. METHODS: In this paper, we have focused on count […]
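Count sort decomposes naturally into phases that a CUDA implementation can parallelize. A minimal Python sketch of those phases, purely illustrative and not the paper's implementation:

```python
from itertools import accumulate

def count_sort(data, max_key):
    """Counting sort for integer keys in [0, max_key], structured as the
    three phases a GPU version typically runs as separate kernels."""
    # Phase 1: histogram -- on a GPU, one thread per element using atomic adds.
    counts = [0] * (max_key + 1)
    for x in data:
        counts[x] += 1
    # Phase 2: exclusive prefix sum over the histogram -- a parallel scan.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Phase 3: scatter each element to its final position.
    out = [0] * len(data)
    for x in data:
        out[offsets[x]] = x
        offsets[x] += 1
    return out

print(count_sort([3, 1, 4, 1, 5, 9, 2, 6], 9))  # [1, 1, 2, 3, 4, 5, 6, 9]
```

Each phase does O(n) or O(max_key) independent work, which is why count sort maps well onto thousands of GPU threads.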
May, 21
Employing Directive Based Compression Solutions on Accelerators Global Memory under OpenACC
Programmers invest extensive development effort to optimize a GPU program to achieve peak performance. Achieving this requires efficient usage of global memory and avoiding memory bandwidth underutilization. The OpenACC programming model has been introduced to tackle the accelerators programming complexity. However, this model's coarse-grained control over a program can make the memory bandwidth utilization […]
May, 17
GPU-Accelerated Feature Tracking
The motivation of this research is to prove that GPUs can provide significant speedup of long-executing image processing algorithms by way of parallelization and massive data throughput. This thesis accelerates the well-known KLT feature tracking algorithm using OpenCL and an NVidia GeForce GTX 780 GPU. KLT is a fast, efficient and accurate feature tracker but […]
May, 17
DeepLearningKit – an GPU Optimized Deep Learning Framework for Apple’s iOS, OS X and tvOS developed in Metal and Swift
In this paper we present DeepLearningKit – an open source framework that supports using pretrained deep learning models (convolutional neural networks) for iOS, OS X and tvOS. DeepLearningKit is developed in Metal in order to utilize the GPU efficiently and Swift for integration with applications, e.g. iOS-based mobile apps on iPhone/iPad, tvOS-based apps for the […]
May, 17
A Foray into Efficient Mapping of Algorithms to Hardware Platforms on Heterogeneous Systems
Heterogeneous computing can potentially offer significant performance and performance per watt improvements over homogeneous computing, but the question "what is the ideal mapping of algorithms to architectures?" remains an open one. In the past couple of years new types of computing devices such as FPGAs have come into general computing use. In this work we […]

