
Evaluation of streaming aggregation on parallel hardware architectures

Scott Schneider, Henrique Andrade, Bugra Gedik, Kun-Lung Wu, Dimitrios S. Nikolopoulos
Virginia Tech. Department of Computer Science, Blacksburg, VA, USA
Proceedings of the Fourth ACM International Conference on Distributed Event-Based Systems, DEBS ’10, 2010

@inproceedings{schneidert2010evaluation,
   title={Evaluation of streaming aggregation on parallel hardware architectures},
   author={Schneider, S. and Andrade, H. and Gedik, B. and Wu, K.L. and Nikolopoulos, D.S.},
   booktitle={Proceedings of the Fourth ACM International Conference on Distributed Event-Based Systems},
   pages={248--257},
   year={2010},
   organization={ACM}
}


We present a case study of parallelizing streaming aggregation on three different parallel hardware architectures. Aggregation is a performance-critical operation for data summarization in stream computing, and is commonly found in sense-and-respond applications. Currently available commodity parallel hardware shows promise as an accelerator for streaming aggregation. However, how streaming aggregation maps to the different parallel architectures is still an open question. Streaming aggregation is obviously data parallel, but in practice its performance relies more on efficient data movement than on computation, as we will demonstrate. Furthermore, we use workloads such as stock market data, which introduce unique data distribution problems. The three parallel architectures we use in our study are an Intel Core 2 Quad processor, an Nvidia GTX 285 GPU and the IBM PowerXCell 8i, an enhanced version of the Cell Broadband Engine architecture. Our implementations use OpenMP, CUDA and Cellgen (a compiler for OpenMP-like support on Cell), respectively. We find that the Cell’s programmable local storage and its low-latency, high-bandwidth access to main memory are best suited for parallelizing streaming aggregation. Future GPUs could overcome their latency and bandwidth limitations by being fully integrated into the system’s memory hierarchy. To attain good performance on existing parallel architectures, we find that developers must characterize their problem in terms of communication versus computation costs; memory access patterns, including whether their algorithms reuse data; and the granularity of data access.
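
As a rough illustration of what streaming aggregation over a stock market workload can look like, the sketch below groups ticks by symbol inside a count-based tumbling window and summarizes each group. It is not the authors' implementation: the `(symbol, price)` tuple layout, the window type, the chosen aggregates, and the names `aggregate_window` and `tumbling_aggregate` are all assumptions made for this example, and it runs sequentially rather than on any of the three architectures studied.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical tuple layout for this sketch: (symbol, price).
Tick = Tuple[str, float]
# Per-symbol summary: (count, min, max, mean).
Summary = Tuple[int, float, float, float]


def aggregate_window(window: List[Tick]) -> Dict[str, Summary]:
    """Group one window of ticks by symbol and summarize each group."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for symbol, price in window:
        groups[symbol].append(price)
    # Each group is independent of the others, which is the data
    # parallelism the abstract refers to.
    return {
        symbol: (len(prices), min(prices), max(prices), sum(prices) / len(prices))
        for symbol, prices in groups.items()
    }


def tumbling_aggregate(
    stream: Iterable[Tick], window_size: int
) -> Iterator[Dict[str, Summary]]:
    """Yield one summary per full, non-overlapping window of ticks."""
    window: List[Tick] = []
    for tick in stream:
        window.append(tick)
        if len(window) == window_size:
            yield aggregate_window(window)
            window = []
    # A trailing partial window is dropped in this sketch.
```

In this sketch the arithmetic per tuple is trivial, so most of the work is collecting and regrouping tuples, which is in line with the abstract's point that data movement matters more than computation. Group sizes also depend on how often each symbol trades, one plausible source of the data distribution problems the abstract mentions.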
