Advanced Example: Hadoop MapReduce and Apache Spark


In this post, we solve an advanced problem using Hadoop MapReduce and then solve the same problem using Apache Spark.


Hadoop MapReduce Program
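A sketch of the MapReduce solution, assuming NBER-style input where each line has the form citing,cited; the class and job names here are illustrative, not the exact code from the original program. The mapper emits (cited, 1) for every citation, and the reducer sums the counts for each patent:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CitationCount {

    // Mapper: for each "citing,cited" line, emit (cited, 1).
    public static class CitationMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text cited = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            if (parts.length == 2) {
                cited.set(parts[1]);
                context.write(cited, ONE);
            }
        }
    }

    // Reducer: sum the counts for each cited patent.
    public static class CitationReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "citation count");
        job.setJarByClass(CitationCount.class);
        job.setMapperClass(CitationMapper.class);
        // The reducer doubles as a combiner to pre-aggregate counts map-side.
        job.setCombinerClass(CitationReducer.class);
        job.setReducerClass(CitationReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note how much of this is scaffolding: two nested classes, Writable wrapper types, and job wiring in main, all for a simple per-key count.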


Apache Spark-based solution
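The same computation in Spark, again as an illustrative sketch rather than the original post's exact code: read the file into an RDD, emit (cited, 1) pairs with mapToPair, and sum them per key with reduceByKey, using Java 8 lambdas throughout.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkCitationCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("citation count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each line is "citing,cited".
        JavaRDD<String> lines = sc.textFile(args[0]);

        // Emit (cited, 1) for every citation, then sum the ones per patent.
        JavaPairRDD<String, Integer> counts = lines
                .mapToPair(line -> new Tuple2<>(line.split(",")[1], 1))
                .reduceByKey((a, b) -> a + b);

        counts.saveAsTextFile(args[1]);
        sc.stop();
    }
}
```

The whole pipeline is two chained transformations; there are no mapper or reducer classes and no Writable types to manage.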


Differences between the two solutions

Boilerplate code

In the MapReduce code, you have to write a mapper class and a reducer class.

In Spark, there is far less boilerplate. If you use Java 8, lambda expressions cut the boilerplate further.

In Spark, there is no need for a mapper class or a reducer class. You can perform transformations such as map, flatMap, mapToPair, and reduceByKey directly on an RDD.
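To illustrate what the mapToPair/reduceByKey pipeline computes, here is the same per-key counting written in plain Java 8 streams on a local in-memory list (the sample data is made up; no cluster is needed):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LocalCitationCount {
    // Each line is "citing,cited"; count how often each patent is cited.
    static Map<String, Long> countCitations(List<String> lines) {
        return lines.stream()
                .map(line -> line.split(",")[1])       // key: the cited patent
                .collect(Collectors.groupingBy(        // group and count per key,
                        cited -> cited,                // like reduceByKey with (a, b) -> a + b
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "3858241,956203",
                "3858242,956203",
                "3858243,1324234");
        // Prints each patent with its citation count (map order unspecified).
        System.out.println(countCitations(lines));
    }
}
```

The grouping-and-counting step is the local analogue of reduceByKey; Spark performs the same aggregation, but partitioned across the cluster.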



Performance

The Spark-based solution is much faster than the MapReduce solution: around 45 seconds for 16 million records in a virtual machine sandbox.



If you write Scala, the Apache Spark code can be simplified further into a few lines of code. In this post, we have solved the patent citation problem using Java.