Is Apache Spark ready for Petabyte Scale?
Slides courtesy of the Linux Foundation
Ashwin Shankar and Cheolsoo Park of Netflix gave an excellent presentation at the Apache Big Data Conference in Budapest this week on how Netflix is using Apache Spark at petabyte scale.
Very few companies operate at petabyte scale; Netflix is one of them.
Validation from Netflix that Apache Spark works at this scale is a great boon for the open source framework, which is gaining immense popularity and adoption in the big data community.
Netflix encountered many issues when running Spark on AWS. They filed numerous bug reports and contributed fixes for many of the problems they found.
Netflix has explored Spark on both Mesos and YARN. Their Spark-on-YARN deployment involved more than 1,000 nodes and over 100 TB of memory.
Spark on YARN exposed the following significant problems:
The number of executors requested from YARN could go negative. This was resolved via https://issues.apache.org/jira/browse/SPARK-6954 (courtesy of Cheolsoo)
Spark caused MapReduce jobs to get stuck. This was resolved in the YARN project: https://issues.apache.org/jira/browse/YARN-2730
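For context, SPARK-6954 concerned Spark's dynamic executor allocation on YARN. The sketch below shows how a Spark-on-YARN job with dynamic allocation might be launched; it is illustrative only, and the class name, jar, and executor limits are hypothetical, not Netflix's actual settings.

```shell
# Hypothetical spark-submit invocation for Spark on YARN with dynamic
# executor allocation (the feature involved in SPARK-6954).
# Dynamic allocation requires the external shuffle service on each node.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=500 \
  --class com.example.MyJob \
  my-job.jar
```

With dynamic allocation enabled, Spark grows and shrinks the executor pool based on pending tasks rather than holding a fixed-size allocation, which matters on large shared YARN clusters like the one described above.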
This is strong validation from Netflix of Apache Spark as a platform for mainstream big data processing.