Is Apache Spark ready for Petabyte Scale?

Slides courtesy of the Linux Foundation

Ashwin Shankar and Cheolsoo Park of Netflix Inc. gave an excellent presentation at the Apache Big Data Conference in Budapest this week on how Netflix uses Apache Spark at petabyte scale.

Very few companies operate at petabyte scale. Netflix is one of them.

Validation from Netflix of Apache Spark's scalability is a great boon for the open source framework, which is gaining immense popularity and adoption in the big data community.

Netflix encountered many issues when running Spark on AWS. Its engineers filed many bug reports and provided good solutions to the problems they found.

Netflix has explored Spark on both Mesos and YARN. Its Spark on YARN deployment involved 1000+ nodes and 100TB+ of memory.

Spark on YARN exposed the following significant problems:

This is good validation from Netflix of Apache Spark for mainstream big data processing.