high performance computing on graphics processing units: hgpu.org

A parallel error diffusion implementation on a GPU

Yao Zhang, John Ludd Recker, Robert Ulichney, Giordano B. Beretta, Ingeborg Tastl, I-Jong Lin, and John D. Owens

University of California, Davis, One Shields Avenue, Davis, CA, USA

IS&T/SPIE Electronic Imaging 2011 / Parallel Processing for Imaging Applications, volume 7872, pages 78720K:1-9, 2011

@inproceedings{Zhang:2011:APE,
  author={Yao Zhang and John Ludd Recker and Robert Ulichney and Giordano B. Beretta and Ingeborg Tastl and I-Jong Lin and John D. Owens},
  title={A Parallel Error Diffusion Implementation on a {GPU}},
  booktitle={Proceedings of SPIE: IS&T/SPIE Electronic Imaging 2011 / Parallel Processing for Imaging Applications},
  year={2011},
  volume=7872,
  month=jan,
  pages={78720K:1--9},
  url={http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1049},
  doi={10.1117/12.872616},
  ucdcite={a58}
}

In this paper, we investigate the suitability of the GPU for a parallel implementation of pinwheel error diffusion. We demonstrate a high-performance GPU implementation by efficiently parallelizing and unrolling the image processing algorithm. Our GPU implementation achieves a 10x to 30x speedup over a two-threaded CPU error diffusion implementation with comparable image quality. We conduct experiments to study the tradeoff between performance and quality across image block sizes. We also present an assembly-level performance analysis to identify the performance bottlenecks.
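For readers unfamiliar with error diffusion, the sketch below shows the classic serial Floyd-Steinberg variant in Python. It is not the pinwheel algorithm from the paper and not the authors' GPU code; the function name `floyd_steinberg` and the `threshold` parameter are hypothetical, chosen only for illustration. It is included to show why the technique resists naive parallelization: each pixel's quantization depends on error pushed forward from pixels processed before it.

```python
def floyd_steinberg(image, threshold=128):
    """Halftone a grayscale image (list of rows, values 0-255) to 0/255.

    Illustrative serial baseline only; not the paper's pinwheel method.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    # Work on a float copy so diffused error can accumulate per pixel.
    work = [[float(v) for v in row] for row in image]
    out = [[0] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = 255 if old >= threshold else 0
            out[y][x] = new
            err = old - new
            # Classic Floyd-Steinberg weights: 7/16 right, 3/16 below-left,
            # 5/16 below, 1/16 below-right. Error that would fall outside
            # the image is simply dropped.
            if x + 1 < width:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < width:
                    work[y + 1][x + 1] += err * 1 / 16
    return out


if __name__ == "__main__":
    gray = [[64] * 16 for _ in range(16)]
    halftone = floyd_steinberg(gray)
    for row in halftone:
        print("".join("#" if v else "." for v in row))
```

Because pixel (x, y) cannot be quantized until its left and upper neighbours have pushed their error into it, the loop order above is a true data dependency. Block-based schemes like the one the abstract describes trade some of that dependency for parallelism, which is presumably why block size affects both speed and quality.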

Tags: Algorithms, CUDA, Image processing, nVidia, nVidia GeForce GTX 460

October 5, 2011 by hgpu
