high performance computing on graphics processing units: hgpu.org

Posts

Dec, 29

Algorithms for manipulating large geometric data

This thesis deals with manipulating huge geometric data in the field of computer graphics. The proposed approach uses a data stream technique to allow processing gigantic datasets that by far exceed the size of the main memory. The amount of data is hierarchically reduced by clustering and replacing each cluster by a representative. The input […]

CUDA

Dec, 29

GPU-Based Acceleration on ACEnet for FDTD Method of Electromagnetic Field Analysis

Graphics Processing Unit (GPU) programming techniques have been applied to a range of scientific and engineering computations. In computational electromagnetics, use of GPU techniques has increased dramatically since the release of NVIDIA’s Compute Unified Device Architecture (CUDA), a powerful and simple-to-use programming environment that makes GPU computing easily accessible to developers not specialized in […]

CUDA

Dec, 29

Accelerating Computational Algorithms

Mathematicians and computational scientists are often limited in their ability to model complex phenomena by the time it takes to run simulations. This thesis will inform interested researchers on how the development of highly parallel computer graphics hardware and the compiler frameworks to exploit it are expanding the range of algorithms that can be explored […]

OpenCL

Dec, 29

Implementing Neural Networks Efficiently

Neural networks and machine learning algorithms in general require a flexible environment where new algorithm prototypes and experiments can be set up as quickly as possible with the best possible computational performance. To that end, we provide a new framework called Torch7, which is especially suited to achieving both of these competing goals. Torch7 is a […]

CUDA

Dec, 27

OpenCL Programming by Example

This book follows an example-driven, simplified, and practical approach to using OpenCL for general purpose GPU programming. If you are a beginner in parallel programming and would like to quickly accelerate your algorithms using OpenCL, this book is perfect for you! You will find the diverse topics and case studies in this book interesting and […]

OpenCL • OpenGL

Dec, 27

HUGO: Hierarchical mUlti-reference Genome cOmpression for aligned reads

BACKGROUND AND OBJECTIVE: Short-read sequencing is becoming the standard of practice for the study of structural variants associated with disease. However, with the growth of sequence data largely surpassing reasonable storage capability, the biomedical community is challenged with the management, transfer, archiving, and storage of sequence data. METHODS: We developed Hierarchical mUlti-reference Genome cOmpression (HUGO), […]

CUDA

Dec, 27

Finite Element Modelling of Prostate Deformation and Needle-Tissue Interactions

During brachytherapy and biopsy, significant prostate motion (including deformation) can occur, causing the target lesion to move during the procedures. One method to improve the accuracy of needle tip placement during these percutaneous procedures is to use a 3D Finite Element (FE) model to estimate the amount of needle deflection. This model is based on […]

CUDA

Dec, 27

BbmTTP: Beat-based Parallel Simulated Annealing Algorithm on GPGPUs for the Mirrored Traveling Tournament Problem

The problem of scheduling sports leagues has received considerable attention in recent years, especially since mathematically optimized schedules often have a large impact both economically and environmentally. The Mirrored Traveling Tournament Problem (mTTP) is an optimization problem that represents certain types of sports scheduling where the main objective is to minimize the total distance traveled […]

CUDA

Dec, 27

Multi-GPU Load Balancing for In-Situ Simulation and Visualization

Multiple-GPU systems have become widely available due to their support for massively parallel computing and their larger device memory for large-scale problems. Such systems are ideal for in-situ visualization applications, which require significant computational power for concurrent execution of simulation and visualization. While a pipelining-based parallel computing scheme overlaps the execution of simulation and rendering […]

Dec, 25

BIDMach: Large-scale Learning with Zero Memory Allocation

This paper describes recent work on the BIDMach toolkit for large-scale machine learning. BIDMach has demonstrated single-node performance that exceeds that of published cluster systems for many common machine-learning tasks. BIDMach makes full use of both CPU and GPU acceleration (through a sister library BIDMat), and requires only modest hardware (commodity GPUs). One of the […]

CUDA

Dec, 25

Building Multiclass Nonlinear Classifiers with GPUs

The adoption of multiclass classification strategies that train independent binary classifiers becomes challenging when the goal is to retrieve nonlinear models from large datasets and the process requires several passes through the data. In such a scenario, the combined use of a search and score algorithm and GPUs makes it possible to obtain binary classifiers in a reduced […]

CUDA

Dec, 25

Efficiency analysis of a physical problem: Different parallel computational approaches for a dynamical integrator evolution

A great challenge for scientists is to execute their computational applications efficiently. Nowadays, parallel programming has become key to achieving this goal. High-performance computing provides a solution to exploit parallel architectures in order to get optimal performance. Both the parallel programming model and the system architecture will maximize the benefits if both together are […]

CUDA

Recent source codes

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

Agentic Code Optimization via Compiler-LLM Cooperation

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits