Floating Point Arithmetic for Transport Triggered Architectures

Timo Tapani Viitanen

Faculty of Computing and Electrical Engineering, Department of Computer Systems, Tampere University of Technology

Tampere University of Technology, 2012

@article{viitanen2012floating,
  title={Floating Point Arithmetic for Transport Triggered Architectures},
  author={Viitanen, T.T.},
  year={2012}
}

Many computational applications have high performance and energy-efficiency requirements that "off-the-shelf" general-purpose processors cannot meet. On the other hand, designing special-purpose hardware accelerators can be prohibitively expensive in terms of development time. One approach to the problem is to design an Application-Specific Instruction-set Processor (ASIP), which is programmable but tailored to the task at hand. The process of customizing an ASIP requires heavy automation to be cost-effective. The TTA-based Codesign Environment (TCE) is an ASIP design toolset based on the highly flexible Transport Triggered Architecture (TTA) processor model, which scales from simple low-power cores up to high-performance Very Long Instruction Word (VLIW) processors. Hardware-accelerated support for floating-point arithmetic is necessary for many applications in the fields of scientific computation and digital signal processing, which would especially benefit from the scalability and instruction-level parallelism of TTA. This thesis introduces a comprehensive suite of Register Transfer Level (RTL) implementations of floating-point units designed and implemented for the TCE project. The main design requirements were portability and performance on Field-Programmable Gate Array (FPGA) platforms, even at the cost of reduced standards compliance. The suite includes an option for half-precision arithmetic. In addition, this thesis proposes fast software floating-point division and square root algorithms based on special instructions. The implemented units were verified at the register transfer level using an automated test bench. When benchmarked on an Altera Stratix-II FPGA, the units exhibited performance close to that of the highly optimized units supplied by Altera, while retaining platform independence. On more recent FPGAs such as the Xilinx Virtex-6, finer-grained pipelining is required for maximum performance.
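The abstract does not say which division algorithm the thesis uses. As a rough illustration of the general idea of software division built on a cheap starting estimate, here is a minimal Python sketch of Newton-Raphson reciprocal refinement. It is not the thesis's method: the function names `reciprocal_seed` and `nr_divide`, the linear seed formula, and the iteration count are all assumptions made for this example, and the seed function only stands in for whatever a special instruction might provide.

```python
import math


def reciprocal_seed(d: float) -> float:
    """Hypothetical stand-in for a seed instruction (an assumption, not from
    the thesis): a linear estimate of 1/d with relative error at most 1/17."""
    m, e = math.frexp(d)              # d = m * 2**e with 0.5 <= |m| < 1
    sign = -1.0 if m < 0 else 1.0
    m = abs(m)
    x0 = 48.0 / 17.0 - (32.0 / 17.0) * m   # approximates 1/m on [0.5, 1)
    return sign * math.ldexp(x0, -e)       # 1/d = (1/m) * 2**(-e)


def nr_divide(n: float, d: float, iterations: int = 4) -> float:
    """Approximate n / d using only multiplies and subtracts after the seed.

    Each Newton-Raphson step x <- x * (2 - d * x) roughly squares the
    relative error of x as an estimate of 1/d. Infinities and NaNs are
    not handled in this sketch.
    """
    if d == 0.0:
        raise ZeroDivisionError("division by zero")
    x = reciprocal_seed(d)
    for _ in range(iterations):
        x = x * (2.0 - d * x)
    return n * x
```

With a seed error of at most 1/17, four iterations bring the reciprocal's error below double-precision resolution, so the quotient usually agrees with the `/` operator to within a few units in the last place. A hardware-oriented version would also need explicit rounding and special-case handling, which this sketch leaves out.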

Tags: Algorithms, Benchmarking, Computer science, FPGA, OpenCL, Thesis

December 30, 2012 by hgpu
