Posts
Nov, 22
Algorithm level power efficiency optimization for CPU-GPU processing element in data intensive SIMD/SPMD computing
Power efficiency must be investigated at every level of a High Performance Computing (HPC) system because of the increasing computation demands of scientific and engineering applications. Focusing on the critical design constraints at the software level, for programs that run on a parallel system composed of huge numbers of power-hungry components, we optimize HPC program design […]
Nov, 22
Higher-order CFD and Interface Tracking Methods on Highly-Parallel MPI and GPU systems
A computational investigation of the effects on parallel performance of higher-order accurate schemes was carried out on two different computational systems: a traditional CPU-based MPI cluster and a system of four Graphics Processing Units (GPUs) controlled by a single quad-core CPU. The investigation was based on the solution of the level set equations for […]
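The level set equation mentioned above can be illustrated with a minimal 1D sketch. This is an assumed simplification for illustration only: the post concerns higher-order schemes on 3D MPI/GPU systems, while the sketch below advances the simplest advection form of the equation, phi_t + u * phi_x = 0, with a first-order upwind scheme; the function name and parameters are hypothetical.

```python
def advect_level_set(phi, u, dx, dt, steps):
    """Advance phi_t + u * phi_x = 0 with first-order upwind differences.

    phi: list of level set values on a periodic 1D grid.
    The zero crossing of phi (the tracked interface) moves with speed u.
    """
    n = len(phi)
    for _ in range(steps):
        new = phi[:]
        for i in range(n):
            if u > 0:
                dphi = phi[i] - phi[i - 1]          # backward (upwind) difference
            else:
                dphi = phi[(i + 1) % n] - phi[i]    # forward difference
            new[i] = phi[i] - u * dt / dx * dphi
        phi = new
    return phi
```

With a signed-distance profile, the interior values shift by u * dt per step, i.e. the interface is transported at speed u; higher-order schemes such as those in the post replace the one-sided difference with wider, more accurate stencils.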
Nov, 22
Efficient simulation of agent-based models on multi-GPU and multi-core clusters
An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory […]
Nov, 22
GPU-accelerated elastic 3D image registration for intra-surgical applications
Local motion within intra-patient biomedical images can be compensated by using elastic image registration. The application of B-spline based elastic registration during interventional treatment is seriously hampered by its considerable computation time. The graphics processing unit (GPU) can be used to accelerate the calculation of such elastic registrations by using its parallel processing power, and […]
Nov, 22
GPU accelerated tensor contractions in the plaquette renormalization scheme
We use the graphics processing unit (GPU) to accelerate the tensor contractions, the most time-consuming operations in the variational method based on plaquette renormalized states. Using a frustrated Heisenberg J1-J2 model on a square lattice as an example, we implement the algorithm based on the Compute Unified Device Architecture (CUDA). For […]
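A tensor contraction of the kind the post accelerates reduces, after reshaping, to a matrix product over the shared index, which is how GPU libraries typically execute it. A minimal sketch under that assumption (the shapes, indices, and function name are illustrative, not taken from the paper):

```python
def contract(a, b):
    """Contract a[i][k] with b[k][j] over the shared index k.

    This is the matrix-product form that a pairwise tensor contraction
    reduces to after grouping free indices; on a GPU the same triple
    loop is dispatched to a batched GEMM kernel.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(inner):              # shared (contracted) index
            aik = a[i][k]
            for j in range(cols):
                out[i][j] += aik * b[k][j]
    return out
```

For example, `contract([[1, 2], [3, 4]], [[5, 6], [7, 8]])` yields `[[19.0, 22.0], [43.0, 50.0]]`.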
Nov, 22
GPU-accelerated phase-field simulation of dendritic solidification in a binary alloy
The phase-field simulation of dendritic solidification of a binary alloy has been accelerated by using a Graphics Processing Unit (GPU). To perform the phase-field simulation of the alloy solidification on the GPU, a program code was developed with the Compute Unified Device Architecture (CUDA). In this paper, the implementation technique of the phase-field model on the GPU is […]
Nov, 22
GPU-accelerated molecular modeling coming of age
Graphics processing units (GPUs) have traditionally been used in molecular modeling solely for visualization of molecular structures and animation of trajectories resulting from molecular dynamics simulations. Modern GPUs have evolved into fully programmable, massively parallel co-processors that can now be exploited to accelerate many scientific computations, typically providing about one order of magnitude speedup over […]
Nov, 22
Accelerating electrostatic surface potential calculation with multi-scale approximation on graphics processing units
Tools that compute and visualize biomolecular electrostatic surface potential have been used extensively for studying biomolecular function. However, determining the surface potential for large biomolecules on a typical desktop computer can take days or longer using currently available tools and methods. Two commonly used techniques to speed up these types of electrostatic computations are approximations based […]
Nov, 22
An Efficient Implementation of GPU Virtualization in High Performance Clusters
Current high performance clusters are equipped with high-bandwidth/low-latency networks, large numbers of processors and nodes, very fast storage systems, etc. However, due to economic and/or power-related constraints, it is generally not feasible to provide an accelerating co-processor, such as a graphics processor (GPU), per node. To overcome this, in this […]
Nov, 21
Megapixel Topology Optimization on a Graphics Processing Unit
We show how the computational power and programmability of modern graphics processing units (GPUs) can be used to efficiently solve large-scale pixel-based material distribution problems using a gradient-based optimality criterion method. To illustrate the principle, a so-called topology optimization problem that results in a constrained nonlinear programming problem with over 4 million decision variables is […]
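The gradient-based optimality criterion method named above updates every pixel density from its sensitivity in a closed form, which is what makes million-variable problems tractable on a GPU. A hedged sketch of one such update step follows; the function name, the move limit of 0.2, and the damping exponent of 0.5 are conventional textbook choices, not values from the paper:

```python
def oc_update(x, dc, dv, lmbda, move=0.2, eta=0.5):
    """One optimality-criteria density update.

    x:     current element densities in [0, 1]
    dc:    compliance sensitivities (negative for useful material)
    dv:    volume-constraint sensitivities
    lmbda: Lagrange multiplier for the volume constraint
    """
    new = []
    for xi, dci, dvi in zip(x, dc, dv):
        be = (-dci / (lmbda * dvi)) ** eta           # damped scaling factor
        cand = xi * be
        cand = max(xi - move, min(xi + move, cand))  # move limit
        cand = max(0.001, min(1.0, cand))            # density bounds
        new.append(cand)
    return new
```

At the optimum the scaling factor is 1 and densities stop changing; in practice the multiplier `lmbda` is found by bisection so the updated densities meet the volume constraint, and each element's update is independent, which maps naturally onto one GPU thread per pixel.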
Nov, 21
A high performance agent based modelling framework on graphics card hardware with CUDA
We present an efficient implementation of a high performance parallel framework for Agent Based Modelling (ABM), exploiting the parallel architecture of the Graphics Processing Unit (GPU). It provides a mapping between formal agent specifications, with C-based scripting, and optimised NVIDIA Compute Unified Device Architecture (CUDA) code. The mapping of agent data structures and agent […]
Nov, 21
Breaking ECC2K-130
Elliptic-curve cryptography is becoming the standard public-key primitive not only for mobile devices but also for high-security applications. Advantages are the higher cryptographic strength per bit in comparison with RSA and the higher speed in implementations. To improve understanding of the exact strength of the elliptic-curve discrete-logarithm problem, Certicom has published a series of challenges. […]