Top-Performance Tokenization and Small-Ruleset Regular Expression Matching: A Quantitative Performance Analysis and Optimization Study on the Cell/B.E. Processor

Daniele Paolo Scarpazza
Business Analytics and Math Department, IBM T.J.Watson Research Center, Yorktown Heights, NY 10598, USA
International Journal of Parallel Programming, Volume 39, Number 1, 3-32, 2010


   title={Top-Performance Tokenization and Small-Ruleset Regular Expression Matching},

   author={Scarpazza, D.P.},

   journal={International Journal of Parallel Programming},





Source Source   



In the last decade, the volume of unstructured data that Internet and enterprise applications create and consume has been growing at impressive rates. The tools we use to process these data are search engines, business analytics suites, natural-language processors and XML processors. These tools rely on tokenization, a form of regular expression matching aimed at extracting words and keywords in a character stream. The further growth of unstructured data-processing paradigms depends critically on the availability of high-performance tokenizers. Despite the impressive amount of parallelism that the multi-core revolution has made available (in terms of multiple threads and wider SIMD units), most applications employ tokenizers that do not exploit this parallelism. I present a technique to design tokenizers that exploit multiple threads and wide SIMD units to process multiple independent streams of data at a high throughput. The technique benefits indefinitely from any future scaling in the number of threads or SIMD width. I show the approach
No votes yet.
Please wait...

* * *

* * *

HGPU group © 2010-2021 hgpu.org

All rights belong to the respective authors

Contact us: