high performance computing on graphics processing units: hgpu.org

Posts

Mar, 2

Parallel Implementation of Similarity Measures on GPU Architecture using CUDA

Image processing and pattern recognition algorithms take more time to execute on a single-core processor. Graphics Processing Units (GPUs) have become popular due to their speed, programmability, low cost and large number of built-in execution cores. Most researchers have started working to use GPUs as a processing unit with a single core […]

CUDA

Mar, 2

Ray Tracing Visualization Toolkit

The Ray Tracing Visualization Toolkit (rtVTK) is a collection of programming and visualization tools supporting visual analysis of ray-based rendering algorithms. rtVTK leverages layered visualization within the spatial domain of computation, enabling investigators to explore the computational elements of any ray-based renderer. Renderers utilize a library for recording and processing ray state, and a configurable […]

OpenCL • OpenGL

Mar, 2

Efficient Performance Evaluation of Memory Hierarchy for Highly Multithreaded Graphics Processors

With the emergence of highly multithreaded architectures, performance monitoring techniques face new challenges in efficiently locating sources of performance discrepancies in the program source code. For example, the state-of-the-art performance counters in highly multithreaded graphics processing units (GPUs) report only the overall occurrences of microarchitecture events at the end of program execution. Furthermore, even if […]

CUDA

Mar, 2

Parallel Hashing, Compression and Encryption with OpenCL under OS X

In this dissertation we examine the efficiency of GPUs with a limited number of stream processors (up to 32), located in desktops and laptops, in the execution of algorithms such as hashing (MD5, SHA1), encryption (Salsa20) and compression (LZ78). For the implementation part, the OpenCL framework was used under OS X. The graphics cards tested […]

OpenCL

Mar, 2

GPU Implementation of Split-Field Finite-Difference Time-Domain Method for Drude-Lorentz Dispersive Media

The split-field finite-difference time-domain (SF-FDTD) method can overcome the limitation of ordinary FDTD in analyzing periodic structures under oblique incidence. On the other hand, the huge run time of 3D SF-FDTD is a major practical burden in its use for the analysis and design of nanostructures, particularly those with dispersive media. Here, details of parallel implementation of 3D […]

CUDA

Mar, 1

Chestnut: A GPU Programming Language for Non-Experts

Graphics processing units (GPUs) are powerful devices capable of rapid parallel computation. GPU programming, however, can be quite difficult, limiting its use to experienced programmers and keeping it out of reach of a large number of potential users. We present Chestnut, a domain-specific GPU parallel programming language for parallel multi-dimensional grid applications. Chestnut is designed […]

CUDA

Mar, 1

Black-Box Side-Channel Attacks Highlight the Importance of Countermeasures: An Analysis of the Xilinx Virtex-4 and Virtex-5 Bitstream Encryption Mechanism

This paper presents a side-channel analysis of the bitstream encryption mechanism provided by Xilinx Virtex FPGAs. This work covers our results from analyzing the Virtex-4 and Virtex-5 family, showing that the encryption mechanism can be completely broken with moderate effort. The presented results provide an overview of a practical real-world analysis and should help practitioners to […]

CUDA

Mar, 1

Parallel Loopy Belief Propagation in Conditional Random Fields

Structured real-world data can be represented with graphs whose structure encodes independence assumptions within the data. Due to statistical advantages over generative graphical models, Conditional Random Fields (CRFs) are used in a wide range of classification tasks on structured data sets. CRFs can be learned from both fully and partially supervised data, and […]

CUDA

Mar, 1

Benchmarking Next Generation Hardware Platforms: An Experimental Approach

Heterogeneous multi-cores (platforms comprised of both general-purpose and accelerator cores) are becoming increasingly common. Further, with processor designs in which there are many cores on a chip, a recent trend is to include functional and performance asymmetries to balance their power usage vs. performance requirements. Coupled with this trend in CPUs is the development of high […]

CUDA

Mar, 1

GPU acceleration of the particle filter: the Metropolis resampler

We consider deployment of the particle filter on modern massively parallel hardware architectures, such as Graphics Processing Units (GPUs), with a focus on the resampling stage. While standard multinomial and stratified resamplers require a sum of importance weights computed collectively between threads, a Metropolis resampler favourably requires only pair-wise ratios between weights, computed independently by […]

CUDA
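The Metropolis resampler described in the abstract above can be illustrated with a small sequential sketch. This is not the authors' GPU code: the function name `metropolis_resample`, the `steps` parameter (the number of Metropolis iterations per particle) and the uniform proposal are assumptions made for illustration. The sketch only shows why no collective sum over the weights is needed: each particle runs its own short chain that compares pairs of weights.

```python
import random


def metropolis_resample(weights, steps, rng=None):
    """Return one ancestor index per particle (illustrative sketch).

    `weights` may be unnormalized, because only ratios w[j] / w[k] matter.
    `steps` is a hypothetical tuning parameter: more steps bring the result
    closer to a draw proportional to the weights.
    """
    rng = rng or random.Random()
    n = len(weights)
    ancestors = []
    # Each outer iteration is independent of the others, which is what
    # would let one GPU thread handle one particle.
    for i in range(n):
        k = i  # the chain starts at the particle's own index
        for _ in range(steps):
            j = rng.randrange(n)  # propose another particle uniformly
            u = rng.random()      # u lies in [0, 1)
            # Accept with probability min(1, w[j] / w[k]); written without
            # a division so that a zero weight w[k] is handled safely.
            if u * weights[k] < weights[j]:
                k = j
        ancestors.append(k)
    return ancestors
```

With weights such as `[0.0, 0.0, 1.0, 0.0]` and enough steps, every chain ends on index 2, since a zero-weight proposal is never accepted. With a finite number of steps the result is only approximately distributed according to the weights, so `steps` trades accuracy for speed.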

Feb, 29

A Fast and Efficient Simulation Framework for Modeling Heat Transport

Metropolitan centers can be affected by an urban heat island effect. Radiative heat build-up from pavement and buildings increases temperatures in the metropolitan area above the average temperatures normally found in the surrounding environment. One way to help reduce the heat island effect is to add parks, trees, or green roofs to these urban spaces. […]

CUDA

Feb, 29

A Restructuring Algorithm for CUDA

Graphics processing units (GPUs) are gaining ground in high-performance computing. CUDA (an extension to C) is the most widely used parallel programming framework for general-purpose GPU computations. However, the task of writing an optimized CUDA program is complex even for experts. We present a method for restructuring loops into optimized CUDA kernels based on a […]

CUDA

Recent source codes

DITRON: Distributed Compiler based on Triton for Parallel Systems

IntelliKit: Agent-first tooling for AMD hardware

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

Agentic Code Optimization via Compiler-LLM Cooperation

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Most viewed papers (last 30 days)