Advanced Example: Spark Action with Oozie

In this post, we will look at running a Spark job using Apache Oozie.

For background, you should read my earlier post on Patent Citation.

SparkAction on Oozie

As of Oozie 4.2, there is a dedicated action for running Spark jobs.

The workflow.xml looks as follows.

<workflow-app xmlns='uri:oozie:workflow:0.2' name='oozie-java-spark-wf'>
    <start to='java-spark' />

    <action name='java-spark'>
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${jobOutput}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
            </configuration>
            <master>local</master>
            <name>Spark Patent Citation</name>
            <class>spark.PatentCitation</class>
            <jar>${nameNode}/user/root/oozie/spark-patent/lib/patentcitation_spark.jar</jar>
            <spark-opts>--executor-memory 1G --num-executors 10</spark-opts>
            <arg>${nameNode}/user/captain/input/cite75_99.txt</arg>
            <arg>${nameNode}/user/captain/output</arg>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Spark Java PatentCitation failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

Oozie Workflow Properties

The job.properties is as follows:


nameNode=hdfs://sandbox:8020
jobTracker=sandbox:8050
master=local[*]
appRoot=spark-patent
jobOutput=/user/captain/output
oozie.wf.application.path=/user/root/oozie/spark-patent
oozie.use.system.libpath=true

Ensure that you save the workflow.xml in HDFS under the location

/user/root/oozie/spark-patent

The patentcitation_spark.jar goes into the lib directory under that location in HDFS, that is, /user/root/oozie/spark-patent/lib.
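
The upload can be done with the standard hdfs dfs commands. This is a minimal sketch that assumes workflow.xml and patentcitation_spark.jar sit in your current local directory (an assumption; adjust the local paths to wherever you keep them) and that your user has write access to /user/root in HDFS.

```shell
# HDFS application directory; matches oozie.wf.application.path in job.properties
APP_DIR=/user/root/oozie/spark-patent

# Create the application directory together with its lib subdirectory
hdfs dfs -mkdir -p "$APP_DIR/lib"

# Upload the workflow definition (-f overwrites an older copy).
# Assumption: the local file is in the current directory.
hdfs dfs -put -f workflow.xml "$APP_DIR/"

# Upload the Spark application jar into lib, where the <jar> element expects it
hdfs dfs -put -f patentcitation_spark.jar "$APP_DIR/lib/"

# List what was uploaded to confirm the layout
hdfs dfs -ls -R "$APP_DIR"
```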

How do you start the Oozie job?
oozie job -oozie http://localhost:11000/oozie -config job.properties -run

Note: job.properties must be in a local directory on the machine where you run the command, not on HDFS.
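
After submission, the oozie client prints a job ID that you can use to follow progress. Here is a minimal sketch, assuming the same server URL as in the submit command. The job ID below is a made-up placeholder, so substitute the one printed by the -run command.

```shell
# Same Oozie server URL as in the submit command. With OOZIE_URL exported,
# the -oozie option can be left out of later commands.
export OOZIE_URL=http://localhost:11000/oozie

# Hypothetical placeholder: replace with the ID printed by the -run command
JOB_ID=0000000-000000000000000-oozie-oozi-W

# Show the workflow status and the state of each action
oozie job -info "$JOB_ID"

# Fetch the Oozie log for the job, useful when the spark action fails
oozie job -log "$JOB_ID"
```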

The cite75_99.txt input file goes in the HDFS directory /user/captain/input. Upload it before you start the job.
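
Placing the input file could look like this. It is a sketch that assumes cite75_99.txt is in your current local directory and that your user may write to /user/captain in HDFS.

```shell
# HDFS input directory referenced by the first <arg> in workflow.xml
INPUT_DIR=/user/captain/input

# Create the directory if it does not exist yet
hdfs dfs -mkdir -p "$INPUT_DIR"

# Upload the citation data set (assumed to be in the current local directory)
hdfs dfs -put -f cite75_99.txt "$INPUT_DIR/"
```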

Once the Oozie job has finished, you will find the output in
/user/captain/output/part-00000
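
To take a quick look at the result without copying it out of HDFS, something like the following should work. It is a sketch that uses a glob for the part files, since the number of part files depends on how the job partitions its output.

```shell
# HDFS output directory; matches jobOutput in job.properties
OUT_DIR=/user/captain/output

# List the files the job wrote
hdfs dfs -ls "$OUT_DIR"

# Print the first 20 lines of the output; the glob is quoted so that
# HDFS expands it rather than the local shell
hdfs dfs -cat "$OUT_DIR/part-*" | head -n 20
```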

If you have any questions, do not hesitate to ask.