Oozie Tutorial — Workflow Management

February 28, 2018

What is Oozie?

Oozie is a workflow management system designed to schedule and run Hadoop jobs in a distributed environment. It can chain multiple complex jobs to run sequentially or in parallel. It integrates well with Hadoop jobs such as MapReduce, Hive, and Pig, and it can also run SSH and shell commands.

Oozie Architecture

Oozie is a Java web application released under the Apache 2.0 license. When Oozie kicks off a task, it hands the task a unique HTTP callback URL to notify when the task completes. If the task fails to call back, Oozie polls it to check whether it has completed.

There are three major types of configurable items inside of Oozie:

Workflows: Directed Acyclic Graphs (DAGs) that define the sequence of actions to execute. Most of the details of the individual actions are defined here.
Coordinators: The mechanism for kicking off workflows based on time and data availability. Coordinators ensure that workflows start when they need to.
Property Files: Files that define variables for the workflows or coordinators. They are used in the initial kick-off of workflows and coordinators.

The best way to understand Oozie is to start using Oozie, so let’s jump in and create our own property file, Oozie workflow, and coordinator.

Oozie Example

This tutorial builds on the HBase tutorial, where we loaded some data. We are going to create a workflow and coordinator that run every 5 minutes, drop and recreate the HBase tables, and repopulate them with the Pig script we created there.

Create Shell Script to Drop HBase Tables

The first thing we need is a shell script that drops those HBase tables and recreates them empty.

Type the following to create and open the file:

vi dropHbaseTables.sh

Press i to enter insert mode so that you can edit the file. Copy the following code:

#!/bin/bash
# Disable and drop each table (HBase requires a table to be disabled before it can be dropped)
echo "disable 'movies'" | hbase shell -n
echo "drop 'movies'" | hbase shell -n
echo "disable 'ratings'" | hbase shell -n
echo "drop 'ratings'" | hbase shell -n
echo "disable 'users'" | hbase shell -n
echo "drop 'users'" | hbase shell -n
# Recreate the empty tables with their column families
echo "create 'users', 'userdata'" | hbase shell -n
echo "create 'ratings', 'ratingsdata'" | hbase shell -n
echo "create 'movies', 'moviedata'" | hbase shell -n

Press Esc, then type :wq and press Enter to save and exit the file.

Now we will put the script into HDFS. This tutorial uses the root directory.

hdfs dfs -put dropHbaseTables.sh /

Recall from the HDFS article that this command copies a local file into HDFS. The general form is: hdfs dfs -put <source file> <destination path>

The Pig script that we are going to run should already be in HDFS. If not, jump over to the HBase article and follow the directions there to get it into HDFS.
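If you want to confirm that the upload worked, a small wrapper like the one below can help. This is only a sketch: put_and_verify is a hypothetical helper name rather than a Hadoop command, and it assumes the hdfs client is on your PATH. It relies on two standard options, -put -f (overwrite if the file already exists) and -test -e (exit status 0 if the path exists).

```shell
# put_and_verify is a hypothetical helper name, not a Hadoop command.
# Usage: put_and_verify <local file> <HDFS directory>
put_and_verify() {
  pv_src="$1"
  pv_dir="$2"
  # -f overwrites an existing copy, so the upload can be repeated safely
  hdfs dfs -put -f "$pv_src" "$pv_dir" || return 1
  # -test -e exits with status 0 only when the path exists in HDFS
  hdfs dfs -test -e "${pv_dir%/}/$(basename "$pv_src")"
}
```

For this tutorial the call would be put_and_verify dropHbaseTables.sh / and a zero exit status means the file is in place.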

Setting up the Coordinator and Workflow

Let’s start by creating the properties file that will kick off the coordinator.

vi repopulateHbase.properties

Copy in the following:

nameNode=hdfs://quickstart.cloudera
jobTracker=yarnRM
oozie.use.system.libpath=true
oozie.coord.application.path=${nameNode}/repopulateHbaseCoordinator.xml

In this file, we've set up a couple of variables that we can use later and defined some Oozie properties. The most important is oozie.coord.application.path, which tells Oozie where the XML definition of the coordinator is located.
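A missing key in the properties file tends to surface only when the job is submitted, so it can be worth a quick check first. The sketch below is one way to do that in plain shell. The function name check_coord_properties is hypothetical, and the list of keys is simply the set this tutorial relies on, not a list that Oozie itself mandates.

```shell
# check_coord_properties is a hypothetical helper name.
# Usage: check_coord_properties <properties file>
check_coord_properties() {
  ccp_file="$1"
  # Keys this tutorial's coordinator and workflow depend on
  for ccp_key in nameNode jobTracker oozie.coord.application.path; do
    # Each key must start a line and be followed by "="
    if ! grep -q "^${ccp_key}=" "$ccp_file"; then
      echo "missing: ${ccp_key}"
      return 1
    fi
  done
  echo "ok"
}
```

Running check_coord_properties repopulateHbase.properties should print ok for the file shown above.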

Let’s create the coordinator and push it to HDFS next.

vi repopulateHbaseCoordinator.xml

Copy in the following:

<coordinator-app name="repopulateHbase" xmlns="uri:oozie:coordinator:0.2" frequency="${coord:minutes(5)}" start="2018-02-10T07:00Z" end="9999-12-31T07:00Z" timezone="America/Chicago">
  <controls>
    <execution>LAST_ONLY</execution>
  </controls>
  <action>
    <workflow>
      <app-path>hdfs://quickstart.cloudera/repopulateHbaseWorkflow.xml</app-path>
      <configuration>
        <property>
          <name>nameNode</name>
          <value>hdfs://quickstart.cloudera</value>
        </property>
        <property>
          <name>jobTracker</name>
          <value>yarnRM</value>
        </property>
        <property>
          <name>SCRIPTDIR</name>
          <value>hdfs://quickstart.cloudera/</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>

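In the top portion, we configure the coordinator to trigger every 5 minutes starting on February 10, 2018 at 07:00 UTC. We suggest changing the start to a time close to when you actually submit the job, since every 5-minute slot between the start time and now still gets an action (with LAST_ONLY, the outdated ones are marked SKIPPED). The end is set to a date far in the future, which effectively means forever. For our purposes you can instead set the end time close to the start time.

Next, the app-path element tells the coordinator where the workflow is located in HDFS, and the configuration block passes the nameNode, jobTracker, and SCRIPTDIR variables to the workflow. Let's put the coordinator file into HDFS.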
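The frequency and the start time together fix the nominal time of every action, so the action number for a given slot can be worked out with simple arithmetic. The sketch below does that in shell. It assumes GNU date (for the -d option) and that action numbers start at 1. coord_action_number is a hypothetical helper name, not part of Oozie.

```shell
# coord_action_number is a hypothetical helper; it requires GNU date (-d).
# Usage: coord_action_number <start> <nominal time> <frequency in minutes>
coord_action_number() {
  can_start=$(date -u -d "$1" +%s) || return 1
  can_nominal=$(date -u -d "$2" +%s) || return 1
  # Whole intervals elapsed since the start, plus 1 because numbering begins at 1
  echo $(( (can_nominal - can_start) / ($3 * 60) + 1 ))
}

coord_action_number "2018-02-10 07:00 UTC" "2018-02-12 03:15 UTC" 5   # prints 532
```

From 2018-02-10 07:00 to 2018-02-12 03:15 is 2655 minutes, or 531 intervals of 5 minutes, which gives action 532. That agrees with the action shown as RUNNING in the coordinator status listing later in this tutorial.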

hdfs dfs -put repopulateHbaseCoordinator.xml /

Now we can move on to the workflow document which will define the two actions we want to run in sequential order: the shell script and the Pig script.

vi repopulateHbaseWorkflow.xml

Copy in the following:

<?xml version="1.0"?>
<workflow-app name="repopulateHbase" xmlns="uri:oozie:workflow:0.5">
  <start to="dropScript" />
  <kill name="Kill">
    <message>Action failed because of [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <action name="dropScript">
    <shell xmlns="uri:oozie:shell-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>dropHbaseTables.sh</exec>
      <file>${SCRIPTDIR}dropHbaseTables.sh</file>
    </shell>
    <ok to="pigScript" />
    <error to="Kill" />
  </action>
  <action name="pigScript">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>${SCRIPTDIR}loadHbase.pig</script>
    </pig>
    <ok to="end" />
    <error to="Kill" />
  </action>
  <end name="end" />
</workflow-app>

Notice that each step we want to run is wrapped in an action node, and that each action names the node to go to on success (ok) and on failure (error). This is how we define what we want to do inside of Oozie. The variables we use here are the ones defined in the configuration block of the coordinator file. Once all of this is done we can push the workflow to HDFS.
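Since every ok and error transition must point at a node that exists, a mistyped node name is an easy mistake to make. The following sketch is a rough text-based check, not an XML validator: it collects every to="..." target in the file and reports any that no name="..." attribute declares. check_transitions is a hypothetical helper name, and the check assumes node names contain no spaces.

```shell
# check_transitions is a hypothetical helper; it does a plain-text check only.
# Usage: check_transitions <workflow xml file>
check_transitions() {
  ct_file="$1"
  ct_status=0
  # Every transition target appears as to="..."; list each one once
  for ct_target in $(grep -o ' to="[^"]*"' "$ct_file" | sed 's/.*to="//; s/"$//' | sort -u); do
    # A valid target must be declared somewhere as name="<target>"
    if ! grep -q "name=\"${ct_target}\"" "$ct_file"; then
      echo "no node named: ${ct_target}"
      ct_status=1
    fi
  done
  return "$ct_status"
}
```

Running check_transitions repopulateHbaseWorkflow.xml should print nothing and exit with status 0 for the workflow above, because dropScript, pigScript, Kill, and end are all declared.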

hdfs dfs -put repopulateHbaseWorkflow.xml /

Start the Oozie Coordinator

Now let’s kick off the coordinator using the following command:

oozie job -oozie http://localhost:11000/oozie -config repopulateHbase.properties -run
//Output
0000004-180211201027900-oozie-oozi-C

Let’s dissect this oozie command.

We add job because we are about to interact with a job.

-oozie specifies the URL of the Oozie server.

-config specifies the property file.

-run tells the command to submit and start the job described by that property file, which kicks off the coordinator.

Later we will use -info followed by an Oozie id to get the status of a job. The oozie command has many other options, but these are the ones you will use most often.

The -run command prints an id (0000004-180211201027900-oozie-oozi-C in our case). Keep it handy, because we will use it to check the status of our coordinator and workflow. Your ids will be different, so treat the ones shown here as examples. Let’s look at the status of the coordinator, replacing 0000004-180211201027900-oozie-oozi-C with your id from above:

oozie job -oozie http://localhost:11000/oozie -info 0000004-180211201027900-oozie-oozi-C
0000004-180211201027900-oozie-oozi-C@528   SKIPPED   -                                    -         2018-02-12 03:18 GMT 2018-02-12 02:55 GMT 
------------------------------------------------------------------------------------------------------------------------------------
0000004-180211201027900-oozie-oozi-C@529   SKIPPED   -                                    -         2018-02-12 03:18 GMT 2018-02-12 03:00 GMT 
------------------------------------------------------------------------------------------------------------------------------------
0000004-180211201027900-oozie-oozi-C@530   SKIPPED   -                                    -         2018-02-12 03:18 GMT 2018-02-12 03:05 GMT 
------------------------------------------------------------------------------------------------------------------------------------
0000004-180211201027900-oozie-oozi-C@531   SKIPPED   -                                    -         2018-02-12 03:18 GMT 2018-02-12 03:10 GMT 
------------------------------------------------------------------------------------------------------------------------------------
0000004-180211201027900-oozie-oozi-C@532   RUNNING   0000005-180211201027900-oozie-oozi-W -         2018-02-12 03:18 GMT 2018-02-12 03:15 GMT

You will see all of the actions that this coordinator has scheduled, one per 5-minute slot. Only one should be running at a time, and the one that is running should now have a workflow id (0000005-180211201027900-oozie-oozi-W in our case). With that id we can jump into the workflow and check its status.
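Rather than picking the workflow id out of the listing by eye, you could filter for it. The sketch below assumes the column layout shown in the listing above, where the second column is the action status and the third is the workflow id. running_workflow_id is a hypothetical helper name that reads the -info output on standard input.

```shell
# running_workflow_id is a hypothetical helper that filters coordinator -info output.
# Usage: oozie job -oozie <url> -info <coordinator id> | running_workflow_id
running_workflow_id() {
  # Column 2 is the action status, column 3 the workflow id ("-" when none was started)
  awk '$2 == "RUNNING" && $3 != "-" { print $3 }'
}
```

Fed the listing above, this prints only 0000005-180211201027900-oozie-oozi-W, which is the id used in the next command.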

oozie job -oozie http://localhost:11000/oozie -info 0000005-180211201027900-oozie-oozi-W
Job ID : 0000005-180211201027900-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : repopulateHbase
App Path      : hdfs://quickstart.cloudera/repopulateHbaseWorkflow.xml
Status        : RUNNING
Run           : 0
User          : root
Group         : -
Created       : 2018-02-12 03:18 GMT
Started       : 2018-02-12 03:18 GMT
Last Modified : 2018-02-12 03:18 GMT
Ended         : -
CoordAction ID: 0000004-180211201027900-oozie-oozi-C@532
Actions
------------------------------------------------------------------------------------------------------------------------------------
ID                                                                            Status    Ext ID                 Ext Status Err Code  
------------------------------------------------------------------------------------------------------------------------------------
0000005-180211201027900-oozie-oozi-W@dropScript                               PREP      -                      -          -         
------------------------------------------------------------------------------------------------------------------------------------
0000005-180211201027900-oozie-oozi-W@:start:                                  OK        -                      OK         -         
------------------------------------------------------------------------------------------------------------------------------------

This shows the status of the workflow and of each action inside it. You can keep running this command until the workflow finishes. Once it is complete, you can query HBase and see the data inside the tables.
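Re-running the command by hand works, but a small polling loop saves some typing. The sketch below is one possible approach, not an Oozie feature: wf_status and wait_for_workflow are hypothetical helper names, the OOZIE_URL variable defaults to the server URL used in this tutorial, and the parsing assumes the "Status        : ..." line format shown in the output above.

```shell
# Server URL used throughout this tutorial; override by setting OOZIE_URL first
OOZIE_URL="${OOZIE_URL:-http://localhost:11000/oozie}"

# wf_status and wait_for_workflow are hypothetical helper names.
# Print the value of the "Status" line from the workflow -info output.
wf_status() {
  oozie job -oozie "$OOZIE_URL" -info "$1" |
    awk -F':' '/^Status/ { gsub(/ /, "", $2); print $2; exit }'
}

# Usage: wait_for_workflow <workflow id> [seconds between checks]
wait_for_workflow() {
  wfw_id="$1"
  wfw_interval="${2:-30}"
  while :; do
    wfw_status=$(wf_status "$wfw_id")
    echo "status: $wfw_status"
    case "$wfw_status" in
      SUCCEEDED) return 0 ;;        # workflow finished normally
      KILLED|FAILED) return 1 ;;    # workflow ended in an error state
      "") return 2 ;;               # no status found, e.g. the server is unreachable
    esac
    sleep "$wfw_interval"
  done
}
```

For example, wait_for_workflow 0000005-180211201027900-oozie-oozi-W 10 checks every 10 seconds and returns once the workflow reaches a final state.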

Kill the Coordinator

Before we go, let’s kill the coordinator, because earlier we told it to run forever. That is not ideal for our single-node Docker cluster. Kill the coordinator with the following command:

oozie job -oozie http://localhost:11000/oozie -kill 0000004-180211201027900-oozie-oozi-C

Congratulations. You have successfully scheduled two separate Hadoop jobs to run sequentially using Oozie. You have the basics to create your own workflows.