high performance computing on graphics processing units: hgpu.org

Posts

Dec, 12

Parallel Evaluation of a Spatial Traversability Cost Function on GPU for Efficient Path Planning

A parallel version of the traditional grid-based cost-to-go function generation algorithm used in robot path planning is introduced. The process takes advantage of the spatial layout of an occupancy grid by concurrently calculating the next wave front of grid cells, which are usually evaluated sequentially in traditional dynamic programming algorithms. The algorithm offers an order of […]

OpenGL
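
The wave-front idea in the abstract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the function name `cost_to_go`, the 4-connected neighbourhood, the unit step cost, and the `UNREACHED` marker are all hypothetical choices. The point it shows is that every cell in one wave front depends only on the previous front, so the inner loop over `frontier` is the part a GPU could evaluate concurrently.

```python
FREE, OBSTACLE = 0, 1
UNREACHED = -1  # hypothetical marker for cells the wave never reaches


def cost_to_go(grid, goal):
    """Level-by-level cost-to-go on an occupancy grid (4-connected, unit cost)."""
    rows, cols = len(grid), len(grid[0])
    goal_r, goal_c = goal
    if grid[goal_r][goal_c] == OBSTACLE:
        raise ValueError("goal cell is occupied")

    cost = [[UNREACHED] * cols for _ in range(rows)]
    cost[goal_r][goal_c] = 0
    frontier = [goal]
    step = 0

    while frontier:
        step += 1
        next_frontier = set()
        # Each frontier cell is processed independently of the others;
        # this loop is the candidate for one-thread-per-cell execution.
        for r, c in frontier:
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and grid[nr][nc] == FREE
                        and cost[nr][nc] == UNREACHED):
                    next_frontier.add((nr, nc))
        # Commit the whole wave front at once, then advance.
        for r, c in next_frontier:
            cost[r][c] = step
        frontier = list(next_frontier)

    return cost
```

Calling `cost_to_go(grid, (0, 0))` on a small grid returns a table in which each free cell holds its step distance to the goal, and a path can be recovered by repeatedly moving to a neighbour with a lower cost.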

Dec, 12

Accelerating non-linear image registration with GPUs

The alignment or registration of two images or volumetric datasets is frequently a requirement in modern image-processing applications, particularly within the context of medical imaging. Modern graphics-processing units (GPUs) are designed to perform simple 3D graphics-pipeline tasks on a massively parallel scale; this processing power can be harnessed for general computation via libraries such as […]

CUDA

Dec, 12

GPU Programming in a High Level Language: Compiling X10 to CUDA

GPU architectures have emerged as a viable way of considerably improving performance for appropriate applications. Program fragments (kernels) appropriate for GPU execution can be implemented in CUDA or OpenCL and glued into an application via an API. While there is plenty of evidence of performance improvements using this approach, there are many issues with productivity. […]

CUDA

•

OpenCL

Dec, 12

A fast and intuitive visual programming language (VPL) for constructing Computer Vision and Image processing systems on GPUs

In this work we present a novel GPU-based Visual Programming Language for Computer Vision and Image Processing systems. Many vision algorithms have been shown to perform better on GPUs. However, one of the current drawbacks is the need for considerable GPU programming expertise. We propose an abstraction over GPU implementation details by providing an […]

CUDA

Dec, 12

Theano: A CPU and GPU Math Compiler in Python

Theano is a compiler for mathematical expressions in Python that combines the convenience of NumPy’s syntax with the speed of optimized native machine language. The user composes mathematical expressions in a high-level description that mimics NumPy’s syntax and semantics, while being statically typed and functional (as opposed to imperative). These expressions allow Theano to provide […]

CUDA
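
To make the "compose an expression, then compile it" workflow from the Theano abstract above concrete, here is a toy sketch in plain Python. It deliberately does not use Theano's API: `Var`, `Const`, `Op`, `simplify`, and `compile_function` are hypothetical names invented for this illustration, and the "compilation" is only a small graph simplification followed by interpretation, not native code generation.

```python
class Expr:
    """Base class: operators build a graph instead of computing a value."""
    def __add__(self, other):
        return Op("add", self, wrap(other))

    def __mul__(self, other):
        return Op("mul", self, wrap(other))


class Var(Expr):
    def __init__(self, name):
        self.name = name


class Const(Expr):
    def __init__(self, value):
        self.value = value


class Op(Expr):
    def __init__(self, kind, left, right):
        self.kind, self.left, self.right = kind, left, right


def wrap(x):
    """Promote plain Python numbers to Const nodes."""
    return x if isinstance(x, Expr) else Const(x)


def simplify(node):
    """Fold constant subexpressions and drop multiplications by 1."""
    if not isinstance(node, Op):
        return node
    left, right = simplify(node.left), simplify(node.right)
    if isinstance(left, Const) and isinstance(right, Const):
        if node.kind == "add":
            return Const(left.value + right.value)
        return Const(left.value * right.value)
    if node.kind == "mul":
        for a, b in ((left, right), (right, left)):
            if isinstance(a, Const) and a.value == 1:
                return b
    return Op(node.kind, left, right)


def compile_function(inputs, output):
    """Return a callable that evaluates the simplified graph."""
    graph = simplify(output)

    def evaluate(node, env):
        if isinstance(node, Var):
            return env[node.name]
        if isinstance(node, Const):
            return node.value
        left, right = evaluate(node.left, env), evaluate(node.right, env)
        return left + right if node.kind == "add" else left * right

    def fn(*args):
        if len(args) != len(inputs):
            raise TypeError("wrong number of arguments")
        env = {var.name: value for var, value in zip(inputs, args)}
        return evaluate(graph, env)

    return fn


x, y = Var("x"), Var("y")
f = compile_function([x, y], x * 1 + y * (Const(2) + 3))
```

Because the whole expression is available as a graph before anything runs, a compiler in this style can rewrite it (here, `x * 1` becomes `x` and `2 + 3` becomes `5`) before choosing how and where to execute it.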

Dec, 12

A Common GPU n-Dimensional Array for Python and C

Currently there are multiple incompatible array/matrix/n-dimensional base object implementations for GPUs. This hinders the sharing of GPU code and causes duplicate development work. This paper proposes and presents a first version of a common GPU n-dimensional array (tensor) named GpuNdArray that works with both CUDA and OpenCL. It will be usable from Python, C and possibly […]

CUDA

•

OpenCL

Dec, 12

Bringing Parallel Performance to Python with Domain-Specific Selective Embedded Just-in-Time Specialization

Today’s productivity programmers, such as scientists who need to write code to do science, are typically forced to choose between productive and maintainable code with modest performance (e.g. Python plus native libraries such as SciPy [SciPy]) or complex, brittle, hardware-specific code that entangles application logic with performance concerns but runs two to three orders of […]

CUDA

Dec, 11

Self-Supervised Clustering for Codebook Construction: An Application to Object Localization

Approaches to object localization based on codebooks do not exploit the dependencies between appearance and geometric information present in training data. This work addresses the problem of computing a codebook tailored to the task of localization by applying regularization based on geometric information. We present a novel method, the Regularized Combined Partitional-Agglomerative clustering, which extends […]

CUDA

Dec, 11

Aquila: An Open-Source GPU-Accelerated Toolkit for Cognitive Robotics Research

This paper presents Aquila, a novel open-source software package developed as part of the iTalk and RobotDoC projects. This software provides many different tools and biologically inspired systems that are useful for cognitive robotics research. Aquila addresses the need for high-performance robot control by adopting the latest parallel processing paradigm based on the NVIDIA CUDA […]

CUDA

Dec, 11

Gyrokinetic Toroidal Simulations on Leading Multi- and Manycore HPC Systems

The gyrokinetic Particle-in-Cell (PIC) method is a critical computational tool enabling petascale fusion simulation research. In this work, we present novel multi- and manycore-centric optimizations to enhance performance of GTC, a PIC-based production code for studying plasma microturbulence in tokamak devices. Our optimizations encompass all six GTC sub-routines and include multi-level particle and grid decompositions […]

CUDA

Dec, 11

Accelerating Swarm Intelligence Algorithms with GPU-Computing

Swarm intelligence describes the ability of groups of social animals and insects to exhibit highly organized and complex problem-solving behaviors that allow the group as a whole to accomplish tasks which are beyond the capabilities of any individual. This phenomenon found in nature is the inspiration for swarm intelligence algorithms — systems that utilize the […]

CUDA

Dec, 11

Fast Face Detection Using Graphics Processor

Fast face detection is a key component of various computer vision applications. The Viola-Jones algorithm provides good, fast detection for low- and medium-resolution images. This paper proposes a new and fast approach to real-time face detection. The proposed method includes enhanced Haar-like features and uses SVM for training […]

CUDA
