high performance computing on graphics processing units: hgpu.org

Posts

May, 21

Performance Evaluation of Parallel Count Sort using GPU Computing with CUDA

OBJECTIVE: Sorting is considered a very important application in many areas of computer science. Parallelization of sorting algorithms using GPU computing on CUDA hardware is increasing rapidly. The objective behind using GPU computing is to obtain greater speedup of the algorithms. METHODS: In this paper, we have focused on count […]

CUDA

May, 21

Employing Directive Based Compression Solutions on Accelerators Global Memory under OpenACC

Programmers invest extensive development effort to optimize a GPU program to achieve peak performance. Achieving this requires efficient use of global memory and avoiding memory bandwidth underutilization. The OpenACC programming model has been introduced to tackle the complexity of programming accelerators. However, this model's coarse-grained control over a program can make the memory bandwidth utilization […]

CUDA • OpenCL

May, 17

GPU-Accelerated Feature Tracking

The motivation of this research is to prove that GPUs can provide significant speedup of long-executing image processing algorithms by way of parallelization and massive data throughput. This thesis accelerates the well-known KLT feature tracking algorithm using OpenCL and an NVidia GeForce GTX 780 GPU. KLT is a fast, efficient and accurate feature tracker but […]

OpenCL

May, 17

DeepLearningKit – an GPU Optimized Deep Learning Framework for Apple’s iOS, OS X and tvOS developed in Metal and Swift

In this paper we present DeepLearningKit – an open source framework that supports using pretrained deep learning models (convolutional neural networks) for iOS, OS X and tvOS. DeepLearningKit is developed in Metal in order to utilize the GPU efficiently and Swift for integration with applications, e.g. iOS-based mobile apps on iPhone/iPad, tvOS-based apps for the […]

OpenCL

May, 17

A Foray into Efficient Mapping of Algorithms to Hardware Platforms on Heterogeneous Systems

Heterogeneous computing can potentially offer significant performance and performance per watt improvements over homogeneous computing, but the question "what is the ideal mapping of algorithms to architectures?" remains an open one. In the past couple of years new types of computing devices such as FPGAs have come into general computing use. In this work we […]

OpenCL

May, 17

Attention-based NMT Models as Feature Functions in Phrase-based SMT

This paper describes the AMU-UEDIN submissions to the WMT 2016 shared task on news translation. We explore methods of decode-time integration of attention-based neural translation models with phrase-based statistical machine translation. Efficient batch-algorithms for GPU-querying are proposed and implemented. For English-Russian, the phrase-based system cannot surpass state-of-the-art stand-alone neural models. For the Russian-English task, our […]

CUDA

May, 17

pyJac: analytical Jacobian generator for chemical kinetics

Accurate simulations of combustion phenomena require the use of detailed chemical kinetics in order to capture limit phenomena such as ignition and extinction as well as predict pollutant formation. However, the chemical kinetic models for hydrocarbon fuels of practical interest typically have large numbers of species and reactions and exhibit high levels of mathematical stiffness […]

CUDA

May, 11

Improving GPU Performance: Reducing Memory Conflicts and Latency

Over the last decade Graphics Processing Units (GPUs) have evolved from fixed function computer graphics processors to energy efficient and programmable general purpose compute accelerators. During this period the number of cores in a GPU increased from 128 to 3072, an increase of 24x. However, the peak compute performance only increased by 12x, and memory […]

CUDA • OpenCL

May, 11

LightNet: A Versatile, Standalone Matlab-based Environment for Deep Learning

LightNet is a lightweight, versatile and purely Matlab-based deep learning framework. The aim of the design is to provide an easy-to-understand, easy-to-use and efficient computational platform for deep learning research. The implemented framework supports major deep learning architectures such as Multilayer Perceptron Networks (MLP), Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). The framework […]

May, 11

An End-to-End System for Unconstrained Face Verification with Deep Convolutional Neural Networks

Over the last four years, methods based on Deep Convolutional Neural Networks (DCNNs) have shown impressive performance improvements for object detection and recognition problems. This has been made possible due to the availability of large annotated datasets, a better understanding of the non-linear mapping between input images and class labels as well as the affordability […]

CUDA

May, 11

Theano: A Python framework for fast computation of mathematical expressions

Theano is a Python library that allows users to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers – especially in the machine learning community – and has shown steady performance improvements. Theano is being actively and continuously developed […]

CUDA

May, 11

The GPU-based Parallel Ant Colony System

The Ant Colony System (ACS) is, next to Ant Colony Optimization (ACO) and the MAX-MIN Ant System (MMAS), one of the most efficient metaheuristic algorithms inspired by the behavior of ants. In this article we present three novel parallel versions of the ACS for the graphics processing units (GPUs). To the best of our knowledge, […]

CUDA

