high performance computing on graphics processing units: hgpu.org

Posts

Aug, 28

Throughput-Oriented Analytical Models for Performance Estimation on Programmable Hardware Accelerators

In this thesis, we have worked mainly on two topics in GPU performance analysis. First, we have developed an analytical method and a timing estimation tool (TEG) to predict the performance of CUDA applications on GT200-generation GPUs. TEG can predict GPU application performance at a cycle-approximate level. Second, we have developed an approach to estimate GPU […]

CUDA

Aug, 28

Audiovisual Voice Activity Detection and Localization of Simultaneous Speech Sources

Given the trend toward interfaces between humans and machines that allow increasingly simple ways of interaction, it is only natural that research effort is put into techniques that seek to simulate the most conventional means of communication humans use: speech. In the human auditory system, voice is automatically processed by the brain in […]

CUDA

Aug, 28

In-Situ Statistical Analysis of Autotune Simulation Data using Graphical Processing Units

Developing accurate building energy simulation models to assist energy efficiency at speed and scale is one of the research goals of the Whole-Building and Community Integration group, which is part of the Building Technologies Research and Integration Center (BTRIC) at Oak Ridge National Laboratory (ORNL). The aim of the Autotune project is to speed up […]

CUDA

Aug, 28

Dynamic Load Balancing on Massively Parallel Computer Architectures

This thesis reports on the use of dynamic load balancing methods on massively parallel computers in the context of multi-threaded computations. In particular, we investigate the applicability of a randomized work stealing algorithm to ray tracing and breadth-first search as representatives of real-world applications with dynamic work creation. For our considerations we made use of current massively […]

CUDA

Aug, 27

The development and expansion of HOOMD-blue through six years of GPU proliferation

HOOMD-blue is the first general-purpose MD code built from the ground up for GPU acceleration, and has been actively developed since March 2007. It supports a variety of force fields and integrators targeted at soft-matter simulations. As an open-source project, it has received useful feature additions from numerous developers, contributed back to the main code. High […]

CUDA

Aug, 27

Compilation techniques and language support to facilitate dependence-driven computation

As the demand increases for high performance and power efficiency in modern computer runtime systems and architectures, programmers are left with the daunting challenge of fully exploiting these systems for efficiency, high-level expressibility, and portability across different computing architectures. Emerging programming models such as the task-based runtime StarPU and many-core architectures such as GPUs force […]

CUDA

Aug, 27

Solutions for Optimizing the Monte Carlo Option Pricing Method’s Implementation Using the Compute Unified Device Architecture

Finance-related problems require more and more computation; finding efficient implementations of option pricing models on modern architectures has therefore become an important challenge. Although there are numerous implementations of the Monte Carlo method on central processing units, many of them are limited by the increased computational power they require. In this paper, […]

CUDA

Aug, 27

Multiple Time Scales Recurrent Neural Network for Complex Action Acquisition

This paper presents novel results of complex action learning experiments based on the use of extended multiple time-scales recurrent neural networks (MTRNN). The experiments were carried out with the iCub humanoid robot, as a model of the developmental learning of motor primitives as the basis of sensorimotor and linguistic compositionality. The model was implemented through […]

CUDA

Aug, 27

GPU-based simulation of the long-range Potts model via parallel tempering

We discuss the efficiency of parallelization on graphics processing units (GPUs) for the simulation of the one-dimensional Potts model with long-range interactions via parallel tempering. We investigate the behaviour of some thermodynamic properties, such as equilibrium energy and magnetization, critical temperatures, and the separation between the first- and second-order regimes. By […]

CUDA

Aug, 26

Aquila 2.0: Software Architecture for Cognitive Robotics

The modelling of the integration of various cognitive skills and modalities requires complex and computationally intensive algorithms running in parallel while controlling high-performance systems. The distribution of processing across many computers has certainly advanced our software ecosystem and opened up research to new possibilities. While this was an essential move, we are aspiring to augment […]

CUDA

Aug, 26

Fast Object Re-Detection and Localization in Video for Spatio-Temporal Fragment Creation

This paper presents a method for the detection and localization of instances of user-specified objects within a video or a collection of videos. The proposed method is based on the extraction and matching of SURF descriptors in video frames and further incorporates a number of improvements so as to enhance both the detection accuracy and […]

Aug, 26

Estimating the WCET of GPU-Accelerated Applications using Hybrid Analysis

The massive parallelism offered by Graphics Processing Units (GPUs) is now routinely exploited to accelerate computationally intensive tasks in a wide variety of application domains. Efficient GPU programming in languages such as CUDA and OpenCL requires careful application of hand optimisations to exploit parallelism and locality while minimising synchronisation. The effectiveness of such optimisations can […]

CUDA • OpenCL
