Processing Big Data in Main Memory and on GPU

Meisam Fathi Salmi

The Ohio State University

The Ohio State University, 2016

Many large-scale systems were designed with the assumption that I/O is the bottleneck, but this assumption has been challenged in the past decade by new trends in hardware capabilities and workload demands. The computational power of CPU cores has not improved in proportion to the performance of disks and network interfaces in the past decade, while the demand for computational power in various workloads has grown disproportionately. GPUs outperform CPUs for various workloads, such as query processing and machine learning. When such workloads run on a single computer, data processing systems must use GPUs to stay competitive. But GPUs have never been studied for large-scale data analytics systems. To maximize GPU performance, core assumptions about the behavior of large-scale systems should be shaken and the whole system should be redesigned. In this report, we use Apache Spark as a case study of the performance benefits of using GPUs in a large-scale, distributed, in-memory data analytics system. Our system, Spark-GPU, exploits the massively parallel processing power of GPUs in a large-scale, in-memory system and accelerates crucial data analytics workloads. Spark-GPU minimizes memory management overhead, reduces extraneous garbage collection, minimizes internal and external data transfers, converts data into a GPU-friendly format, and provides batch processing. Spark-GPU detects GPU-friendly tasks based on predefined patterns in computation and automatically schedules them on the available GPUs in the cluster. We have evaluated Spark-GPU with a set of representative data analytics workloads to show its effectiveness. The results show that Spark-GPU significantly accelerates data mining and statistical analysis workloads, but provides limited speedup for traditional query processing workloads.
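To make the idea of a "GPU-friendly format" with batch processing more concrete, the following is a minimal sketch, not Spark-GPU's actual implementation. It assumes that "GPU-friendly" means a columnar layout (one contiguous typed array per field) and that records arrive as row tuples. The function name `rows_to_column_batches`, the two fields, and the batch size are all hypothetical.

```python
from array import array


def rows_to_column_batches(rows, batch_size):
    """Group (user_id, price) rows into batches of contiguous typed columns.

    Hypothetical helper for illustration only: each yielded batch is a pair
    of flat arrays, a layout that could be copied to a device in one
    transfer per column instead of one transfer per row.
    """
    ids = array("q")     # signed 64-bit integers
    prices = array("d")  # 64-bit floating point values
    for user_id, price in rows:
        ids.append(user_id)
        prices.append(price)
        if len(ids) == batch_size:
            yield ids, prices
            # Start fresh arrays so the yielded batch is not mutated.
            ids, prices = array("q"), array("d")
    if ids:
        # Flush the final batch, which may be smaller than batch_size.
        yield ids, prices


# Assumed sample data: five rows split into batches of two.
rows = [(1, 9.5), (2, 3.0), (3, 7.25), (4, 1.0), (5, 2.5)]
batches = list(rows_to_column_batches(rows, batch_size=2))
```

Batching in this way is one plausible route to fewer host-to-device transfers and fewer per-row objects for a garbage collector to track, which are the kinds of overhead the abstract says Spark-GPU reduces.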

Tags: big data, Computer science, CUDA, Hadoop, Java, OpenCL, Scala, Spark, Thesis

June 14, 2016 by hgpu
