Why We Need Workflow Orchestration

orchestrating workflows over heterogeneous networking infrastructures and windows workflow foundation orchestration and microsoft workflow orchestration
CecilGalot Profile Pic
CecilGalot,United Kingdom,Teacher
Published Date:06-08-2017
Your Website URL(Optional)
Orchestration Workflow orchestration, sometimes referred to as workflow automation or business process automation, refers to the tasks of scheduling, coordinating, and managing workflows. Workflows are sequences of data processing actions. A system capable of performing orchestration is called an orchestration framework or workflow automa‐ tion framework. Workflow orchestration is an important but often neglected part of application archi‐ tectures. It is especially important in Hadoop because many applications are devel‐ oped as MapReduce jobs, in which you are very limited in what you can do within a single job. However, even as the Hadoop toolset evolves and more flexible processing frameworks like Spark gain prominence, there are benefits to breaking down complex processing workflows into reusable components and using an external engine to han‐ dle the details involved in stitching them together. Why We NeedWorkflow Orchestration Developing end-to-end applications with Hadoop usually involves several steps to process the data. You may want to use Sqoop to retrieve data from a relational data‐ base and import it to Hadoop, then run a MapReduce job to validate some data con‐ straints and convert the data into a more suitable format. Then, you may execute a few Hive jobs to aggregate and analyze the data, or if the analysis is particularly involved, there may be additional MapReduce steps. Each of these jobs can be referred to as an action. These actions have to be scheduled, coordinated, and managed. 183For example, you may want to schedule an action: • At a particular time • After a periodic interval • When an event happens (e.g., when a file becomes available) You may want to coordinate an action: • To run when a previous action finishes successfully You may want to manage an action to: • Send email notifications if the action succeeded or failed • Record the time taken by the action or various steps of the action This series of actions, or a workflow, can be expressed as a directed acyclic graph (DAG) of actions. When the workflow executes, actions either run in parallel or depend on the results of previous actions. One of the benefits of workflow orchestration is that breaking down complex into simple reusable units is simply a good engineering practice. The workflow engine helps in defining the interfaces between the workflow components. The other bene‐ fits involve pushing concerns out of application logic by using features that are already a part of the orchestration system. Good workflow orchestration engines will support business requirements such as metadata and data lineage tracking (the ability to track how specific data points were modified through multiple steps of transformation and aggregation), integration between various software systems in the enterprise, data lifecycle management, and the ability to track and report on data quality. They will also support operational functionality such as managing a repository of workflow components, scheduling flexible workflows, managing dependencies, monitoring the status of the workflows from a centralized location, being able to restart failed workflows, being able to gen‐ erate reports, and being able to roll back workflows when issues have been detected. Let’s talk about what common choices are available to architects orchestrating a Hadoop-based application. The Limits of Scripting When first rolling out a Hadoop-based application, you might be tempted to use a scripting language like Bash, Perl, or Python to tie the different actions together. For short and noncritical workflows, this approach is fine and is usually easier to imple‐ ment than the alternatives. 184 Chapter 6: Orchestration This usually looks like the bash script shown in Example 6-1. Example 6-1.workflow.sh sqoop job -exec import_data if beeline -u "jdbc:hive2://host2:10000/db" -f add_partition.hql 2&1 grep FAIL then echo "Failed to add partition. Cannot continue." fi if beeline -u "jdbc:hive2://host2:10000/db" -f aggregate.hql 2&1 grep FAIL then echo "Failed to aggregate. Cannot continue." fi sqoop job -exec export_data And you will usually execute this script, say every day at 1 a.m., using a crontab entry like Example 6-2. Example 6-2. Crontab entry 0 1 /path/to/my/workflow.sh /var/log/workflow-`date +%y-%m-%d`.log As this script moves to production and your organization relies on its results, addi‐ tional requirements emerge. You want to handle errors more elegantly, be able to notify users and other monitoring systems of the workflow status, be able to track the execution time of the workflow as a whole and of individual actions, apply more sophisticated logic between different stages of the workflow, be able to rerun the workflow in whole or in part, and be able to reuse components among various work‐ flows. Attempting to maintain a home-grown automation script in the face of growing pro‐ duction requirements can be a frustrating exercise of reinventing the wheel. Since there are many tools that perform this part of the job, we recommend becoming familiar with one of them and using it for your orchestration needs. Be aware that converting an existing workflow script to run in a workflow manager is not always trivial. If the steps are relatively isolated, as in Example 6-1 where each step uses the files the previous step wrote in HDFS, then conversion is fairly easy. If, on the other hand, the script passes information between steps using standard output or the local filesystem, things can get a bit more complex. It’s important to take this into account both when choosing to start developing a workflow in a script versus a workflow manager and when planning the conversion. The Limits of Scripting 185The Enterprise Job Scheduler and Hadoop Many companies have a standard framework for workflow automation and schedul‐ ing software that they use across the entire organization. Popular choices include ControlM, UC4, Autosys, and ActiveBatch. Using the existing software to schedule Hadoop workflows is a reasonable choice. This allows you to reuse existing infra‐ structure and avoid the learning curve of an additional framework when it is not nec‐ essary. This approach also makes it relatively easy to integrate in the same workflow actions that involve multiple systems, not just Hadoop. In general, these systems work by installing an agent on each server where actions can be executed. In Hadoop clusters, this is usually the edge node (also called gateway nodes) where the client utilities and application JARs are deployed. When designing a workflow with these tools, the developer typically indicates the server where each action will execute and then indicates the command that the agent will execute on the server in question. In Example 6-1, those commands are sqoop and beeline, respec‐ tively. The agent will execute the command, wait for it to complete, and report the return status back to the workflow management server. When designing the work‐ flow, the developer can specify rules of how to handle success or failure of each action and of the workflow as a whole. The same enterprise job scheduler can also be used to schedule the workflow to run at specific times or periodic intervals. Because these enterprise workflow automation systems are not Hadoop specific, detailed explanations of how to use each of them are beyond the scope of this book. We will focus on frameworks that are part of the Hadoop ecosystem. Orchestration Frameworks in the Hadoop Ecosystem There are a few workflow engines in the Hadoop ecosystem. They are tightly integra‐ ted within the Hadoop ecosystem and have built-in support for it. As a result, many organizations that need to schedule Hadoop workflows and don’t have a standard automation solution choose one of these workflow engines for workflow automation and scheduling. A few of the more popular open source workflow engines for distributed systems include Apache Oozie, Azkaban, Luigi, and Chronos: • Oozie was developed by Yahoo in order to support its growing Hadoop clusters and the increasing number of jobs and workflows running on those clusters. • Azkaban was developed by LinkedIn with the goal of being a visual and easy way to manage workflows. 186 Chapter 6: Orchestration • Luigi is an open source Python package from Spotify that allows you to orches‐ trate long-running batch jobs and has built-in support for Hadoop. • Chronos is an open source, distributed, and fault-tolerant scheduler from Airbnb that runs on top of Mesos. It’s essentially a distributed system that’s meant to serve as a replacement for cron. In this chapter, we will focus on Oozie and show how to build workflows using it. We chose Oozie because it is included with every Hadoop distribution. Other orchestra‐ tion engines have similar capabilities, whereas the syntax and details are different. When choosing a workflow engine, consider the following: Ease of installation How easy is it to install the workflow engine? How easy is it to upgrade it as newer versions are released? Community involvement and uptake How fast does the community move to add support for new and promising projects in the ecosystem? As new projects get added to the ecosystem (e.g., Spark being a fairly popular recent addition), you don’t want the lack of support from your workflow engine preventing you from using a newer project in your workflows. User interface support Are you going to be crafting workflows as files or via UI? If files, how easy is it to create and update these files? If UI, how powerful and intuitive is the UI? Testing How do you test your workflows after you have written them? Logs Does the engine provide easy access to logs? Workflow management Does the engine provide the level of management you want? Does it track times taken by the workflow as a whole or in part? Does it allow the flexibility to con‐ trol your DAG of actions (e.g., being able to make decisions based on the output of a previous action). Error handling Does the engine allow you to rerun the job or parts of it in case of failure? Does it allow you to notify users? Orchestration Frameworks in the Hadoop Ecosystem 187Oozie Terminology Let’s review Oozie terminology before we go further: Workflow action A single unit of work that can be done by the orchestration engine (e.g., a Hive query, a MapReduce job, a Sqoop import) Workflow A control-dependency DAG of actions (or jobs) Coordinator Definition of data sets and schedules that trigger workflows Bundle Collection of coordinators Oozie Overview Oozie is arguably the most popular scalable, distributed workflow engine for Hadoop. Most, if not all, Hadoop distributions ship with it. Its main benefits are its deep inte‐ gration with the Hadoop ecosystem and its ability to scale to thousands of concurrent workflows. Oozie has a number of built-in actions for popular Hadoop ecosystem components like MapReduce, Hive, Sqoop, and distcp. This makes it easy to build workflows out of these building blocks. In addition, Oozie can execute any Java app and shell script. We’ll provide a brief overview of Oozie here and then provide considerations and rec‐ ommendations for its use. The main logical components of Oozie are: Aworkflow engine Executes a workflow. A workflow includes actions such as Sqoop, Hive, Pig, and Java. A scheduler (aka coordinator) Schedules workflows based on frequency or on existence of data sets in preset locations. REST API Includes APIs to execute, schedule, and monitor workflows. 188 Chapter 6: OrchestrationCommand-line client Makes REST API calls and allows users to execute, schedule, and monitor jobs from the command line. Bundles Represent a collection of coordinator applications that can be controlled together. Notifications Sends events to an external JMS queue when the job status changes (when a job starts, finishes, fails, moves to the next action, etc.). This also allows for simple integration with external applications and tools. SLA monitoring Tracks SLAs for jobs based on start time, end time, or duration. Oozie will notify you when a job misses or meets its SLA through a web dashboard, REST API, JMS queue, or email. Backend database Stores Oozie’s persistent information: coordinators, bundles, SLAs, and workflow history. The database can be MySQL, Postgres, Oracle, or MSSQL Server. Oozie functions in a client-server model. Figure 6-1 depicts Oozie’s client-server architecture. Figure 6-1. Oozie architecture When you execute a workflow, the sequence of events shown in Figure 6-2 takes place. Oozie Overview 189 Figure 6-2. Oozie sequence of events duringworkflow execution As you can see, the client connects to the Oozie server and submits the job configura‐ tion. This is a list of key-value pairs that defines important parameters for the job execution, but not the workflow itself. The workflow, a set of actions and the logic that connects them, is defined in a separate file calledworkflow.xml . The job configu‐ ration must include the location on HDFS of theworkflow.xml file. It can also contain other parameters, and very frequently it includes the URIs for the NameNode, Job‐ Tracker (if you are using MapReduce v1 MR1), or YARN Resource Manager (if you are using MapReduce v2 MR2). These parameters are later used in the workflow definition. When the Oozie server receives the job configuration from the client, it reads the workflow.xml file with the workflow definition from HDFS. The Oozie server then parses the workflow definition and launches the actions as described in the workflow file. Why Oozie Server Uses a Launcher The Oozie server doesn’t execute the action directly. For any action, the Oozie server launches a MapReduce job with only one map task (called a launcher job), which then launches the Pig or Hive action. This architectural decision makes the Oozie server lightweight and more scalable. If all the actions were launched, monitored, and managed by a single Oozie server, all client libraries would have to exist on this server, making it very heavyweight and likely a bottleneck because all the actions would be executed by the same node. How‐ 190 Chapter 6: Orchestration ever, a separate launcher task enables Oozie to use the existing distributed MapRe‐ duce framework to delegate the launching, monitoring, and managing of an action to the launcher task and hence to the node where this launcher task runs. This launcher picks up the client libraries from the Oozie sharedlib, which is a directory on HDFS that contains all the client libraries required for various Oozie actions. The exceptions to this discussion are filesystem, SSH, and email actions, which are executed by the Oozie server directly. For example, if you run a Sqoop action, the Oozie server will launch a single-mapper MapReduce job called sqoop-launcher; this mapper will run the Sqoop client, which will in turn launch its own MapReduce job called sqoop-action. This job will have as many mappers as Sqoop was configured to execute. In case of Java or shell actions, the launcher will execute the Java or shell application itself and will not generate addi‐ tional MapReduce jobs. OozieWorkflow As mentioned earlier, workflow definitions in Oozie are written in an XML file called workflow.xml . A workflow contains action nodes and control nodes. Action nodes are nodes responsible for running the actual action, whereas a control node controls the flow of the execution. Control nodes can start or end a workflow, make a decision, fork and join the execution, or kill the execution. Theworkflow.xml file is a represen‐ tation of a DAG of control and action nodes expressed as XML. A complete discus‐ sion of the schema of this XML file is beyond the scope of this book but is available in the Apache Oozie documentation. Oozie comes with a default web interface that uses ExtJS. However, the open source UI tool Hue (Hadoop user experience) comes with an Oozie application that is a more commonly used UI for building and designing Oozie workflows. Hue generates a workflow.xml file based on the workflow designed via the UI. Azkaban Azkaban is an open source orchestration engine from LinkedIn. As with Oozie, you can write workflows (simply called flows ) in Azkaban and schedule them. Oozie and Azkaban were developed with different goals in mind: Oozie was written to manage thousands of jobs on very large clusters and thus emphasizes scalability. Azkaban was developed with ease of use as its primary goal and emphasizes simplicity and visuali‐ zation. Azkaban ships as two services and a backend database; the services are Azkaban Web Server and Azkaban Executor. Each of these can run on different hosts. Azkaban web OozieWorkflow 191 server is more than just a web UI; it is the main controller of all flows scheduled and run by Azkaban. It’s responsible for their management, scheduling, and monitoring. The Azkaban executor is responsible for executing flows in Azkaban. Currently the web server is a single point of failure, while there is support for multiple executors— providing both high availability and scalability. Both the web server and the executor talk to the backend MySQL database, which stores collections of workflows, workflow schedules, SLAs, the status of executing workflows, and workflow history. Azkaban has built-in support for running only local UNIX commands and simple Java programs. However, you can easily install the Azkaban Job Types plug-in for ena‐ bling support for Hadoop, Hive, and Pig jobs. Azkaban Web Server includes a num‐ ber of other useful plug-ins: HDFS Browser for browsing contents of HDFS (similar to Hue’s HDFS browser), Security Manager for talking to a Hadoop cluster in a secure way, Job Summary for providing summaries of jobs run, Pig Visualizer for visualizing Pig jobs, and Reportal for creating, managing, and running reports. Figure 6-3 shows Azkaban’s architecture. Figure 6-3. Azkaban architecture Although Table 6-1 is not meant to provide an exhaustive comparison between Oozie and Azkaban, it lists some of their points of difference. Table 6-1. Comparison between Oozie and Azkaban Criteria Oozie Azkaban Workflow denition fi Uses XML, which needs to be Uses simple declarative .job files. Needs to be uploaded to HDFS. zipped up and uploaded via web interface. 192 Chapter 6: Orchestration Criteria Oozie Azkaban Scalability Spawns a single map-only task called All jobs are spawned via the single executor, Launcher that manages the launched which may cause scalability issues. job. Makes for a more scalable system but adds an extra level of indirection when debugging. Hadoop ecosystem Offers very good support. Has Supports Java MapReduce, Pig, Hive, and support support for running Java MapReduce, VoldemortBuildAndPush (for pushing data to Streaming MapReduce, Pig, Hive, Voldemort key-value store) jobs. Other jobs Sqoop, and Distcp actions. have to be implemented ascommand job types. Security Integrates with Kerberos. Supports Kerberos but has support only for the old MR1 API for MapReduce jobs on secure Hadoop cluster. Versioning Offers easy workflow versioning by Provides no clear way of versioning workflows HDFS symlinks. workflows since they are uploaded via the web UI. Integration outside Supports workflows on HDFS, S3, or a Does not require flows to be on HDFS; may be of Hadoop local filesystem (although local used completely outside of Hadoop. ecosystem filesystem is not recommended). Email, SSH, shell actions are supported. Offers API for writing custom actions. Workflow Supports workflow parameterization Requires flows to be parameterized via parameterization via variables and Expression variables only. Language (EL) functions (e.g., wf:user()). Workflow Supports time and data triggers. Supports only supports time triggers. scheduling Interaction with the Supports REST API, Java API, and CLI; Has a nice web interface, can issue cURL server has web interfaces (built-in and via commands to the web server trigger flows; Hue). no formal Java API or CLI. OozieWorkflow 193 Workflow Patterns Now that we have a good understanding of the architecture of popular orchestration engines, let’s go through some workflow patterns that are commonly found in the industry. Point-to-PointWorkflow This type of workflow is common when you are performing actions sequentially. For example, let’s say you wanted to perform some aggregation on a data set using Hive and if it succeeded, export it to a RDBMS using Sqoop. When implemented in Oozie, yourworkflow.xml file would look the following: workflow-app xmlns="uri:oozie:workflow:0.4" name="aggregate_and_load" global job-trackerjobTracker/job-tracker name-nodenameNode/name-node /global start to="aggregate" / action name="aggregate" hive xmlns="uri:oozie:hive-action:0.5" job-xmlhive-site.xml/job-xml scriptpopulate_agg_table.sql/script /hive ok to="sqoop-export" / error to="kill" / /action action name="sqoop_export" sqoop xmlns="uri:oozie:sqoop-action:0.4" argexport/arg argconnect/arg argjdbc:oracle:thin://orahost:1521/oracle/arg argusername/arg argscott/arg argpassword/arg argtiger/arg argtable/arg argmytable/arg argexport-dir/arg arg/etl/BI/clickstream/aggregate-preferences/output/arg /sqoop ok to="end" / error to="kill" / /action kill name="kill" message Workflow failed. Error message 194 Chapter 6: Orchestration wf:errorMessage(wf:lastErrorNode())/message /kill end name="end" / /workflow-app At a high level, this workflow looks like Figure 6-4. Figure 6-4. Point-to-pointworkflow Now, let’s dig into each of the XML elements in this workflow: • Theglobal element contains global configurations that will be used by all actions —for example, the URIs for JobTracker and NameNode. The job-tracker ele‐ ment is a little misleading. Oozie uses the same parameter for reading the Job‐ Tracker URI (when using MR1) and the YARN ResourceManager URI (when using MR2). • The start element points to the first action to be run in the workflow. Each action has a name, and then a set of parameters depending on the action type. • The first action element is for running the Hive action. It takes the location of the Hive configuration file and the location of the HQL script to execute. Remember, since the process issuing the Hive command is the launcher map task, which can run on any node of the Hadoop cluster, the Hive configuration file and the HQL scripts need to be present on HDFS so they can be accessed by the node that runs the Hive script. In our example, the populate_agg_table.sql script performs an aggregation via Hive and stores the aggregate results on HDFS at hdfs://nameservice1/etl/BI/clickstream/aggregate-preferences/output. • The sqoop_export action is responsible for running a Sqoop export job to export the aggregate results from HDFS to an RDBMS. It takes a list of arguments that correspond to the arguments you’d use when calling Sqoop from the command line. • In addition, each action contains directions on where to proceed if the action succeeds (either to the next action in the workflow or to the end), or if the action fails (to the “kill” node, which generates an appropriate error message based on Workflow Patterns 195 the last action that failed, but could also send a notification JMS message or an email). • Note that each action and the workflow itself has an XML schema with a version (e.g., xmlns="uri:oozie:sqoop-action:0.4"). This defines the elements avail‐ able in the XML definition. For example, the global element only exists in uri:oozie:workflow:0.4 and higher. If you use an older XML schema, you’ll need to define jobTracker and nameNode elements in each action. We recom‐ mend using the latest schema available in the version of Oozie you are using. Fan-OutWorkflow The fan-out workflow pattern is most commonly used when multiple actions in the workflow could run in parallel, but a later action requires all previous actions to be completed before it can be run. This is also called a fork-and-join pattern. In this example, we want to run some preliminary statistics first and subsequently run three different queries for doing some aggregations. These three queries—for prescriptions and refills, office visits, and lab results—can themselves run in parallel and only depend on the preliminary statistics to have been computed. When all of these three queries are done, we want to generate a summary report by running yet another Hive query. Here is the workflow definition in Oozie: workflow-app name="build_reports" xmlns="uri:oozie:workflow:0.4" global job-trackerjobTracker/job-tracker name-nodenameNode/name-node job-xmlhiveSiteXML/job-xml /global start to="preliminary_statistics" / action name="preliminary_statistics" hive xmlns="uri:oozie:hive-action:0.5" scriptscripts/stats.hql/script /hive ok to="fork_aggregates" / error to="kill" / /action fork name="fork_aggregates" path start="prescriptions_and_refills" / path start="office_visits" / path start="lab_results" / /fork 196 Chapter 6: Orchestration action name="prescriptions_and_refills" hive xmlns="uri:oozie:hive-action:0.5" scriptscripts/refills.hql/script /hive ok to="join_reports" / error to="kill" / /action action name="office_visits" hive xmlns="uri:oozie:hive-action:0.5" scriptscripts/visits.hql/script /hive ok to="join_reports" / error to="kill" / /action action name="lab_results" hive xmlns="uri:oozie:hive-action:0.5" scriptscripts/labs.hql/script /hive ok to="join_reports" / error to="kill" / /action join name="join_reports" to="summary_report" / action name="summary_report" hive xmlns="uri:oozie:hive-action:0.5" scriptscripts/summary_report.hql/script /hive ok to="end" / error to="kill" / /action kill name="kill" message Workflow failed. Error message wf:errorMessage(wf:lastErrorNode())/message /kill end name="end" / /workflow-app In this workflow, the first action, preliminary_statistics, computes the prelimi‐ nary statistics. When this action has finished successfully, the workflow control moves to the fork element, which will enable the next three actions (prescrip tions_and_refills, office_visits, and lab_results) to run in parallel. When each of those actions has completed, the control proceeds to the join element. The join element stalls the workflow until all the three preceding Hive actions have com‐ pleted. Note that the assumption here is that the action following the join depends on the results of all actions that are part of the fork, so if even one of these actions fails, the entire workflow will be killed. When the three forked actions have completed suc‐ Workflow Patterns 197 cessfully, the control proceeds to the last action, summary_report. The Workflow will finish when the last action completes. Note that since all workflows use the same Hive definitions, the job-xml is defined once, in theglobal section. Figure 6-5 shows the fan-out workflow. Figure 6-5. Fan-outworkflow Capture-and-DecideWorkflow The capture-and-decide workflow is commonly used when the next action needs to be chosen based on the result of a previous action. For example, say you have some Java code in the main class com.hadooparchitecturebook.DataValidationRunner that validates incoming data into Hadoop. Based on this validation, you want to pro‐ ceed to processing this data by running the main class com.hadooparchitecture book.ProcessDataRunner, if there are no errors. However, in case of errors, you want your error handling code to put information about the errors into a separate errors directory on HDFS by running the main class com.hadooparchitecturebook.Move OutputToErrorsAction and report that directory back to the user. Here is theworkflow.xml file for implementing such a workflow in Oozie: workflow-app name="validation" xmlns="uri:oozie:workflow:0.4" global job-trackerjobTracker/job-tracker name-nodenameNode/name-node /global start to="validate" / action name='validate' java 198 Chapter 6: Orchestration main-classcom.hadooparchitecturebook.DataValidationRunner /main-class arg-Dinput.base.dir=wf:conf('input.base.dir')/arg arg-Dvalidation.output.dir=wf:conf('input.base.dir')/dataset /arg capture-output / /java ok to="check_for_validation_errors" / error to="fail" / /action decision name='check_for_validation_errors' switch case to="validation_failure" (wf:actionData("validate")"errors" == "true") /case default to="process_data" / /switch /decision action name='process_data' java main-classcom.hadooparchitecturebook.ProcessDataRunner/main-class arg-Dinput.dir=wf:conf('input.base.dir')/dataset/arg /java ok to="end" / error to="fail" / /action action name="validation_failure" java main-classcom.hadooparchitecturebook.MoveOutputToErrorsAction /main-class argwf:conf('input.base.dir')/arg argwf:conf('errors.base.dir')/arg capture-output / /java ok to="validation_fail" / error to="fail" / /action kill name="validation_fail" messageInput validation failed. Please see error text in: wf:actionData("validation_failure")"errorDir" /message /kill kill name="fail" messageJava failed, error messagewf:errorMessage (wf:lastErrorNode())/message /kill Workflow Patterns 199 end name="end" / /workflow-app Figure 6-6 shows a high-level overview of this workflow. Figure 6-6. Capture-and-decideworkflow In this workflow, the first action, validate, runs a Java program to validate an input data set. Some additional parameters are being passed to the Java program based on the value of the input.base.dir parameter in the Oozie configuration. Note that we use the capture-output / element to capture the option of this action, which will then be used in the next decision node, check_for_validation_errors. Java, SSH, and shell actions all support capture of outputs. In all cases, the output has to be in Java property file format, which is basically lines of key-value pairs separated by new‐ lines. For shell and SSH actions, this should be written to standard output (stdout). For Java actions, you can collect the keys and values in a Properties object and write it to a file. The filename must be obtained from the oozie.action.output.proper ties system property within the program. 200 Chapter 6: Orchestration In the preceding example, we are checking for the value of the errors property from the output of the validate action. Here’s an example of how you’d write errors=false for the output to be captured in Java: File file = new File(System.getProperty("oozie.action.output.properties")); Properties props = new Properties(); props.setProperty("errors", "false"); OutputStream os = new FileOutputStream(file); props.store(os, ""); os.close(); The captured output is then referenced in the decision node via wf:action Data("_action_name_")"_key_" where _action_name_ is the name of the action and_key_ is the property name. The decision node then contains aswitch statement, which directs the control to the validation_failure node if the errors property was set totrue in the captured output of thevalidate action, or to theprocess_data data action otherwise. Theprocess_data action simply calls another Java program to process the data, pass‐ ing the location of the data set as a property to the program. In case of validation errors, the validation_failure action calls another Java program, passing two parameters to it. This time, however, the parameters are passed in as command-line arguments instead of properties (notice the lack of -D property=value syntax). This program is expected to put information about the validation errors in a direc‐ tory and pass the name of this directory back to the workflow using the capture- output element. The name of the errors directory is passed as the errorsDir property, which is read by the next action, validation_fail, which in turn reports the location of the directory back to the user. ParameterizingWorkflows Often, there is a requirement to run the same workflow on different objects. For example, if you have a workflow that uses Sqoop to import a table from a relational database and then uses Hive to run some validations on the data, you may want to use the same workflow on many different tables. In this case, you will want the table name to be a parameter that is passed to the workflow. Another common use for parameters is when you’re specifying directory names and dates. Oozie allows you to specify parameters as variables in the Oozie workflow or coordi‐ nator actions and then set values for these parameters when calling your workflows or coordinators. Apart from enabling you to quickly change values when you need to without having to edit code, this also allows you to run the same workflow on differ‐ ent clusters (usually dev, test, and production) without redeploying your workflows. Any parameters specified in Oozie workflows can be set in any of the following ways: ParameterizingWorkflows 201 • You can set the property values in your config-defaults.xml file, which Oozie uses to pick default values for various configuration properties. This file must be loca‐ ted in HDFS next to theworkflow.xml file for the job. • You can specify the values of these properties in the job.properties file or the Hadoop XML configuration file, which you pass using the -config command- line option when launching your workflow. However, this method has some risk since the user may accidentally make other changes when editing the file, and many times the same file is used for multiple executions by different users. • You can specify the values of these properties on the command line when calling the workflow or coordinator using the-D property=value syntax. • You can pass in the parameter when calling the workflow from the coordinator. • You can specify a list of mandatory parameters (also called formal parameters) in the workflow definition. The list can contain default values. When you submit a job, the submission will be rejected if values for those parameters are missing. You specify this list in the beginning of the workflow, before the global sec‐ tion: parameters property nameinputDir/name /property property nameoutputDir/name valueout-dir/value /property /parameters These parameters can then be accessed in workflows as jobTracker or wf:conf('jobTracker'). If the parameter name is not a Java property (i.e., con‐ tains more than just A-Za-z_0-9A-Za-z_), then it can only be accessed by the latter method,wf:conf('property.name'). In addition, we’ve already seen how many actions themselves pass arguments to the code they call within the workflow. Shell actions may pass the arguments to the shell command, Hive actions may take in the name of the Hive script as a parameter or additional arguments for parameterizing the Hive query, and Sqoop actions may take arguments to parameterize Sqoop import or export. The following example shows how we pass a parameter to a Hive action that is then used in the query: workflow-app name="cmd-param-demo" xmlns="uri:oozie:workflow:0.4" global job-trackerjobTracker/job-tracker name-nodenameNode/name-node /global 202 Chapter 6: Orchestration

Advise: Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.