Advanced Example: Spark Action with Oozie

In this post, we will look at running a Spark job using Apache Oozie.

For background, you should read up on my earlier post on Patent Citation.

SparkAction on Oozie

As of Oozie 4.2, there is an action for running Spark jobs.

The workflow.xml looks as follows.

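<workflow-app xmlns='uri:oozie:workflow:0.2' name='oozie-java-spark-wf'>
   <start to='java-spark' />

   <action name='java-spark'>
      <spark xmlns="uri:oozie:spark-action:0.1">
         <!-- job-tracker and name-node elements go here (not shown) -->
         <prepare>
            <!-- Remove output left over from a previous run -->
            <delete path="${jobOutput}"/>
         </prepare>
         <!-- master element goes here (not shown) -->
         <name>Spark Patent Citation</name>
         <!-- class and jar elements go here (not shown) -->
         <spark-opts>--executor-memory 1G --num-executors 10</spark-opts>
         <!-- arg elements, if any, go here (not shown) -->
      </spark>
      <ok to="end"/>
      <error to="fail"/>
   </action>

   <kill name="fail">
      <message>Spark Java PatentCitation failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
   </kill>
   <end name="end"/>
</workflow-app>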

Oozie Workflow Properties

The properties file is as follows:
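
A minimal sketch of such a properties file, written from the shell. Only jobOutput is taken from the workflow above; the file name job.properties, the other property names (nameNode, jobTracker, master), the host and port values, and the <workflow-dir> and <output-dir> placeholders are assumptions that you will need to replace with your own cluster settings.

```shell
# Write a job.properties sketch into the current local directory.
# The quoted 'EOF' keeps ${...} literal so that Oozie, not the shell, resolves it.
cat > job.properties <<'EOF'
# Assumed NameNode and ResourceManager addresses (placeholders)
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
# Assumed Spark master setting
master=yarn-cluster
# Make the Oozie share lib (which carries the Spark action libraries) available
oozie.use.system.libpath=true
# HDFS directory holding workflow.xml and lib/ (placeholder path)
oozie.wf.application.path=${nameNode}/user/${user.name}/<workflow-dir>
# Used by <delete path="${jobOutput}"/> in workflow.xml (placeholder path)
jobOutput=${nameNode}/user/${user.name}/<output-dir>
EOF
```

The oozie.wf.application.path property tells Oozie where to find workflow.xml, and oozie.use.system.libpath is typically needed so the Spark action can load its libraries from the share lib.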


Ensure that you save workflow.xml in HDFS under the location


The patentcitation_spark.jar goes into the lib directory next to workflow.xml in HDFS.
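
The upload could be scripted roughly as follows. This is only a sketch: deploy_workflow is a hypothetical helper name, and the HDFS application directory is passed in as an argument because it depends on your own setup. It assumes workflow.xml and patentcitation_spark.jar are in the current local directory.

```shell
# Hypothetical helper: copy workflow.xml and the job jar into an
# HDFS application directory given as the first argument.
deploy_workflow() {
  # Create the application directory and its lib/ subdirectory
  hdfs dfs -mkdir -p "$1/lib"
  # -f overwrites copies left over from an earlier deployment
  hdfs dfs -put -f workflow.xml "$1/"
  hdfs dfs -put -f patentcitation_spark.jar "$1/lib/"
}
```

Call it with the same directory that your properties file names as the workflow application path, for example `deploy_workflow "$APP_DIR"` where APP_DIR is a variable you set yourself.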

How do you start the Oozie job?
oozie job -oozie http://localhost:11000/oozie -config <properties-file> -run

Note: the properties file should be in a local directory, not on HDFS.

The cite75_99.txt input file goes in the HDFS directory /user/captain/input.
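
Staging the input file might look like the sketch below. upload_input is a hypothetical helper name; its defaults are the file name and directory mentioned above, and it assumes cite75_99.txt is in the current local directory.

```shell
# Hypothetical helper: stage the citation data in HDFS before submitting the job.
upload_input() {
  input_file="${1:-cite75_99.txt}"
  input_dir="${2:-/user/captain/input}"
  # Create the input directory if it does not exist yet
  hdfs dfs -mkdir -p "$input_dir"
  # -f overwrites an earlier copy of the file
  hdfs dfs -put -f "$input_file" "$input_dir/"
}
```

Running `upload_input` with no arguments uses the defaults; afterwards, `hdfs dfs -ls /user/captain/input` should list the file.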

Once the Oozie job has finished, you will find the output in

If you have any questions, do not hesitate to ask.