SparkJNI: A Reference Design for a Heterogeneous Apache Spark Framework

Tudor Alexandru Voicu

Delft University of Technology

Delft University of Technology, 2016

@article{voicu2016sparkjni,
  title={SparkJNI: A Reference Design for a Heterogeneous Apache Spark Framework},
  author={Voicu, TA},
  year={2016}
}

The digital era’s requirements pose many challenges related to deployment, implementation and efficient resource utilization in modern hybrid computing infrastructures. In light of recent improvements in computing units, the de facto structure of a high-performance computing cluster, ordinarily consisting of CPUs only, is being superseded by heterogeneous architectures (comprising GPUs, FPGAs and DSPs), which offer higher performance and lower power consumption. Big Data, a younger field with a much more aggressive development pace, is starting to exhibit the characteristic needs of its archetype, and the development community is targeting the integration of specialized processors here as well. The benefits do not come for free and can easily be overshadowed by challenges in implementation and deployment when development time and cost are considered. In this research, we analyze the state-of-the-art developments in the field of heterogeneous-accelerated Spark, the current Big Data standard, and we provide a reference design and implementation for a JNI-accelerated Spark framework. The design is validated by a set of benchmarked micro-kernels. The JNI-induced overhead is as low as 12% in access times and bandwidth, with speedups of up to 12x for compute-intensive algorithms in comparison to pure Java Spark implementations. Based on the promising benchmark results, the SparkJNI framework is implemented as an easy interface to native libraries and specialized accelerators. A cutting-edge DNA analysis algorithm (PairHMM) is integrated, targeting cluster deployments, with benchmark results for the DNA pipeline stage showing an overall speedup of ~2.7x over state-of-the-art developments. The results of the presented work, along with the SparkJNI framework, are publicly available for open-source usage and development, with our aim being a contribution to current and future Big Data Spark shift drivers.

Tags: big data, Computer science, FPGA, Heterogeneous systems, Hybrid computing, Java, OpenCL, Spark, Thesis

June 1, 2017 by hgpu
