high performance computing on graphics processing units: hgpu.org

Posts

Mar, 3

Adaptive Video Encoding Based on OpenCL Face Recognition

Video chatting is now a popular way of communicating. However, a poor network connection ruins the experience because faces become blurred. To solve this problem, the team offers a solution that preserves the clarity of faces under a limited transmission rate. In this project, the primary goal is to design a video encoder that reduces the size […]

OpenCL

Mar, 3

Adaptive Kinetic-Fluid Solvers for Heterogeneous Computing Architectures

This paper describes recent progress towards porting a Unified Flow Solver (UFS) to heterogeneous parallel computing. UFS is an adaptive kinetic-fluid simulation tool, which combines Adaptive Mesh Refinement (AMR) with automatic cell-by-cell selection of kinetic or fluid solvers based on continuum breakdown criteria. The main challenge of porting UFS to graphics processing units (GPUs) comes […]

CUDA

Mar, 3

Counting Triangles in Large Graphs on GPU

The clustering coefficient and the transitivity ratio are concepts often used in network analysis, which creates a need for fast, practical algorithms for counting triangles in large graphs. Previous research in this area focused on sequential algorithms, MapReduce parallelization, and fast approximations. In this paper we propose a parallel triangle counting algorithm for CUDA GPUs. […]

CUDA

Mar, 3

GPU Based Path Integral Control with Learned Dynamics

We present an algorithm which combines recent advances in model based path integral control with machine learning approaches to learning forward dynamics models. We take advantage of the parallel computing power of a GPU to quickly take a massive number of samples from a learned probabilistic dynamics model, which we use to approximate the path […]

CUDA

Mar, 2

Iris Matching Algorithm on Many-Core Platforms

Biometric matching has been widely adopted as a secure method for identification and verification. However, the computational demand of running this algorithm on a large data set poses a great challenge to the underlying hardware platform. Even though modern processors are equipped with more cores and memory capacity, the software algorithm still requires careful […]

CUDA

Mar, 2

Model-driven optimisation of memory hierarchy and multithreading on GPUs

Due to their potentially high peak performance and energy efficiency, GPUs are increasingly popular for scientific computations. However, the complexity of the architecture makes it difficult to write code that achieves high performance. Two of the most important factors in achieving high performance are the usage of the GPU memory hierarchy and the way in […]

CUDA

Mar, 2

Runtime Compilation of Array-Oriented Python Programs

The Python programming language has become a popular platform for data analysis and scientific computing. To mitigate the poor performance of Python’s standard interpreter, numerically intensive computations are typically offloaded to library functions written in high-performance compiled languages such as Fortran or C. When there is no efficient library implementation available for a particular algorithm, […]

CUDA

Mar, 2

Evaluating Performance Portability of OpenACC

Accelerator-based heterogeneous computing is gaining momentum in the High Performance Computing arena. However, the increased complexity of accelerator architectures demands more generic, high-level programming models. OpenACC is one such attempt to tackle the problem. While the abstraction provided by OpenACC offers productivity, it raises questions about its portability. This paper evaluates the performance portability obtained […]

CUDA

Mar, 2

MILJS: Brand New JavaScript Libraries for Matrix Calculation and Machine Learning

MILJS is a collection of state-of-the-art, platform-independent, scalable, fast JavaScript libraries for matrix calculation and machine learning. Our core library for matrix calculation is called Sushi, which exhibits far better performance than any other leading machine learning library written in JavaScript. In particular, our matrix multiplication is 177 times faster than the fastest JavaScript benchmark. […]

OpenCL

Feb, 27

Accelerating Deep Convolutional Neural Networks Using Specialized Hardware

Recent breakthroughs in the development of multi-layer convolutional neural networks have led to state-of-the-art improvements in the accuracy of non-trivial recognition tasks such as large-category image classification and automatic speech recognition [1]. These many-layered neural networks are large, complex, and require substantial computing resources to train and evaluate [2]. Unfortunately, these demands come at an […]

Feb, 27

Face Detection on CUDA

Face detection has applications in many fields today. However, a single-threaded CPU implementation of face detection consumes a lot of time and, despite various optimization techniques, performs poorly in real time. With the advent of General Purpose GPU (GPGPU) computing and growing support for parallel programming languages like CUDA, it has become possible […]

CUDA

Feb, 27

A Graph-Partition-Based Scheduling Policy for Heterogeneous Architectures

To improve system performance efficiently, a number of systems are equipped with multi-core and many-core processors (such as GPUs). Due to their discrete memory, these heterogeneous architectures comprise a distributed system within a computer. A data-flow programming model is attractive in this setting for its ease of expressing concurrency. Programmers only need to […]

CUDA

Recent source codes

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

Agentic Code Optimization via Compiler-LLM Cooperation

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Most viewed papers (last 30 days)