Advanced Example: Hadoop MapReduce and Apache Spark

Objective

Solve an advanced problem using Hadoop MapReduce, and demonstrate the same problem solved using Apache Spark.


Hadoop MapReduce Program

http://blog.hampisoftware.com/?p=20


Apache Spark-based solution

http://blog.hampisoftware.com/?p=41


Differences between the two solutions

Boilerplate code

In the MapReduce code, you have to write a Mapper class and a Reducer class.

In Spark, you do not have to write as much boilerplate code, and if you use Java 8, lambda expressions cut the boilerplate further.

In Spark, there is no need for a Mapper class or a Reducer class. You can perform transformations such as map, flatMap, mapToPair, and reduceByKey directly on an RDD.
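To illustrate how little boilerplate these lambda-based transformations need, here is a self-contained sketch in plain Java 8 Streams (no Spark dependency; the stream operations mirror the shape of Spark's map and reduceByKey, though not Spark's distributed execution model). The sample input lines are hypothetical "citing,cited" pairs in the spirit of the patent citation data set; they are not taken from the linked posts.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CitationCountSketch {

    // Count how many times each patent is cited, given "citing,cited" lines.
    // The map step parses out the cited patent; the grouping/counting step
    // plays the role of a reduceByKey on (cited, 1) pairs.
    static Map<String, Long> countCitations(List<String> lines) {
        return lines.stream()
                .map(line -> line.split(",")[1])      // analogous to map / mapToPair
                .collect(Collectors.groupingBy(
                        cited -> cited,
                        Collectors.counting()));      // analogous to reduceByKey
    }

    public static void main(String[] args) {
        // Hypothetical sample pairs for illustration only.
        List<String> lines = Arrays.asList(
                "3858241,956203",
                "3858242,956203",
                "3858243,1324234");
        System.out.println(countCitations(lines).get("956203")); // prints 2
    }
}
```

The whole job is two chained operations on a collection; the Spark version is the same pipeline on a JavaRDD, with the runtime handling partitioning and shuffling.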


Performance

The Spark-based solution is significantly faster than the MapReduce solution: around 45 seconds for 16 million records in a virtual machine sandbox.


Scala

If you write Scala code, the Apache Spark solution can be simplified further, into just a few lines of code. In this post, we have solved the patent citation problem using Java.