high performance computing on graphics processing units: hgpu.org

Accelerating Double Precision Floating-point Hessenberg Reduction on FPGA and Multicore Architectures

Miaoqing Huang, Lingyuan Wang, Tarek El-Ghazawi

CSCE Department, University of Arkansas

Symposium on Application Accelerators in High Performance Computing, 2010

@inproceedings{huangaccelerating,
  title={Accelerating Double Precision Floating-point Hessenberg Reduction on FPGA and Multicore Architectures},
  author={Huang, M. and Wang, L. and El-Ghazawi, T.},
  booktitle={Application Accelerators in High Performance Computing, 2010 Symposium, Papers},
  year={2010}
}

Double precision floating-point performance is critical for hardware acceleration technologies to be adopted by domain scientists. In this work we use the Hessenberg reduction to demonstrate the potential of FPGAs and GPUs for obtaining satisfactory double precision floating-point performance. Currently, a 2.26 GHz Xeon (Nehalem) CPU outperforms a Xilinx Virtex-4 LX200 FPGA by a factor of 3.6. However, given higher clock frequencies, more hardware resources, and local memory banks, FPGAs have the potential to outperform multicore CPUs in the near future. On the GPU side, a GTX 480 (Fermi) achieves a 19.4x speedup over the Xeon CPU. Based on the current trend, GPUs will continue to widen their advantage over both FPGAs and CPUs in double precision floating-point performance.
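For readers unfamiliar with the kernel named in the abstract, the following is a minimal sketch of Hessenberg reduction using Householder reflections, written in plain Python. It is an illustration only and is not the authors' FPGA, OpenMP, or CUDA implementation; the function name `hessenberg` and the choice of the Householder variant are assumptions made for this example.

```python
import math

def hessenberg(a):
    """Return an upper Hessenberg matrix similar to the square matrix `a`.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(a)
    h = [[float(x) for x in row] for row in a]  # work on a double precision copy
    for k in range(n - 2):
        # Build the Householder vector that zeroes column k below the subdiagonal.
        x = [h[i][k] for i in range(k + 1, n)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue  # column is already in Hessenberg form
        alpha = -math.copysign(norm_x, x[0])  # sign chosen to avoid cancellation
        v = x[:]
        v[0] -= alpha
        norm_v = math.sqrt(sum(t * t for t in v))
        v = [t / norm_v for t in v]
        m = len(v)
        # Left update: H <- (I - 2 v v^T) H, acting on rows k+1..n-1.
        for j in range(n):
            dot = sum(v[i] * h[k + 1 + i][j] for i in range(m))
            for i in range(m):
                h[k + 1 + i][j] -= 2.0 * v[i] * dot
        # Right update: H <- H (I - 2 v v^T), acting on columns k+1..n-1.
        for i in range(n):
            dot = sum(h[i][k + 1 + j] * v[j] for j in range(m))
            for j in range(m):
                h[i][k + 1 + j] -= 2.0 * dot * v[j]
    return h
```

Because each step is an orthogonal similarity transform, the result has the same eigenvalues (and trace) as the input, with zeros below the first subdiagonal. The two update loops are dot products followed by rank-1 updates, which is the kind of regular, data-parallel work that accelerators are typically used for.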

Tags: Computer science, CUDA, Extended precision, FPGA, nVidia, nVidia GeForce GTX 480, OpenMP, Performance, Tesla C1060

February 18, 2011 by hgpu
