Designing an Orchestration layer for data processing jobs

The Weekend Blogger · Sep 16, 2021

Introduction

Nowadays, most organizations have huge datasets available to them, and they are always trying to derive meaningful insights from them. A number of products are available in the market to define and execute such data processing jobs, Microsoft’s Azure Databricks for example. Assume you already have a huge volume of data spread across different datasets and a number of users who want to make use of it. How can you provide an interface for your users to create their own data products from existing data assets?

Recently, we came across such a problem statement: an interface to define data processing jobs and run them on a compute of the user’s choice. This blog is about the design we converged on to meet those requirements.

Requirements

From our discussions with the customer and target users, these are the requirements we identified for an MVP.

  1. Users shall upload a file, for example a Spark SQL script, that contains the data processing logic.
  2. Users shall specify the input data asset and the output the job produces.
  3. Users shall run these jobs On Demand, as and when needed, or configure them to run On Schedule or On Event. This is where some of the out-of-the-box solutions we evaluated didn’t meet the requirements: jobs need to be triggered on certain events, such as an update to the input data asset, and users also wanted the flexibility to trigger jobs on a schedule.
  4. Users shall be able to specify certain conditions that will be checked before running a job, for example that the input data must have been updated in the last ‘x’ hours. If the conditions are not met, retries shall happen as per the retry policy. A sketch of what such a job definition could look like follows this list.
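To make these requirements concrete, here is a minimal sketch of what such a job definition could look like. The field names and structure below are illustrative assumptions, not the actual schema we used.

```python
# Illustrative job definition; every field name here is hypothetical, not our actual schema.
job_definition = {
    "name": "daily-sales-aggregation",
    "logic_file": "sales_aggregation.sql",           # the uploaded Spark SQL script
    "input_assets": ["datalake/raw/sales"],          # input data asset(s)
    "output_asset": "datalake/curated/sales_daily",  # output the job produces
    "trigger": {
        "type": "on_event",                          # on_demand | on_schedule | on_event
        "event": "input_asset_updated",
        # "cron": "0 2 * * *",                       # used when type == "on_schedule"
    },
    "conditions": [
        {"type": "input_updated_within_hours", "hours": 24},
    ],
    "retry_policy": {"max_retries": 3, "delay_minutes": 15},
}
```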

As the discussions progressed, we identified a few more items critical to the vision we had in mind.

  1. The data processing workload was going to run on dedicated compute like Azure Databricks. What we wanted was a scalable and reliable orchestration solution, with minimal operationalization effort, for users to author jobs and manage job runs.
  2. Though the initial target for these jobs was Azure Databricks, the solution should be able to support other computes as well.
  3. The solution should be customizable for current and future needs. For instance, the evaluation of the conditions mentioned earlier could get complex and depend on events happening on other datasets.
  4. Ability to control concurrency: more often than not, multiple instances of the same job could result in inconsistency in the output, so the solution needed a way to control concurrent runs of the same job. The data ingestion pattern into the existing datasets suggested that the number of events could be significant and could result in concurrent runs if not controlled.

Solution Architecture

After several rounds of discussions, we came up with an architecture that could meet our requirements. Room for customization, modularity, extensibility, and lower development and operationalization effort were the guiding principles in our discussions.

Job management portal

A modern, single-page application that offers a GUI for the user to create new jobs, view and edit existing jobs, trigger on-demand jobs and view the status of past and current job runs.

Job management service

A RESTful service where the jobs are defined and stored. Jobs defined by a user, as well as the statuses of job runs, are stored in the database owned by this service. The service exposes CRUD operations for job metadata and job run statuses as RESTful APIs.
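As an illustration only, the API surface could look something like the sketch below. The framework (FastAPI here), routes and models are assumptions for the sake of a concrete example, not the service we actually built.

```python
# Hypothetical sketch of the job management service API, not the actual implementation.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
jobs: dict[str, dict] = {}       # in-memory stores standing in for the service's database
job_runs: dict[str, dict] = {}

class Job(BaseModel):
    name: str
    logic_file: str
    input_assets: list[str]
    output_asset: str
    trigger: dict
    conditions: list[dict] = []

@app.post("/jobs")
def create_job(job: Job):
    jobs[job.name] = job.dict()
    return {"id": job.name}

@app.get("/jobs/{job_id}")
def get_job(job_id: str):
    return jobs[job_id]

@app.put("/job-runs/{run_id}/status")
def update_run_status(run_id: str, status: str):
    job_runs.setdefault(run_id, {})["status"] = status
    return job_runs[run_id]
```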

Execution queue

For better decoupling, we decided to go with an asynchronous pattern to trigger the jobs. Irrespective of the type of job, On Demand, On Schedule or On Event, new job runs are submitted to this queue.
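The post doesn’t prescribe a particular queueing technology, so purely as an illustration, here is a minimal sketch of a producer that submits a job run message, assuming Azure Service Bus as the queue. The queue name, connection string variable and payload shape are assumptions.

```python
# Illustrative producer for the execution queue; assumes Azure Service Bus, which is
# only one possible choice. Queue name and payload shape are made up for this sketch.
import json, os, uuid
from azure.servicebus import ServiceBusClient, ServiceBusMessage

def submit_job_run(job_id: str, trigger_type: str) -> str:
    run_id = str(uuid.uuid4())
    payload = {"run_id": run_id, "job_id": job_id, "trigger": trigger_type}
    client = ServiceBusClient.from_connection_string(os.environ["SERVICEBUS_CONN_STR"])
    with client, client.get_queue_sender(queue_name="job-runs") as sender:
        sender.send_messages(ServiceBusMessage(json.dumps(payload)))
    return run_id
```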

Scheduler

A standalone service to trigger jobs based on the CRON expressions defined in each job. On creation of an ‘On Schedule’ job, the job management service creates a recurring schedule here, and this service submits job runs to the queue as scheduled.

Developing a scheduler from scratch would take considerable development and testing effort, so you could go for a readily available solution here. In our case, we went with Hangfire, an open-source solution.
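We used Hangfire, which is a .NET library, but to keep the samples in this post in one language, here is an analogous sketch in Python using APScheduler. It illustrates the idea of registering a recurring schedule that enqueues job runs; it is not what we deployed.

```python
# Illustrative CRON-based scheduler using APScheduler; we actually used Hangfire (.NET),
# so this is an analogy of the idea, not our implementation.
from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.triggers.cron import CronTrigger

def submit_job_run(job_id: str, trigger_type: str) -> None:
    # Stand-in for the queue producer sketched in the execution queue section.
    print(f"enqueue run for {job_id} ({trigger_type})")

scheduler = BlockingScheduler()

def register_schedule(job_id: str, cron_expression: str) -> None:
    """Called when an On Schedule job is created; enqueues a run on each CRON tick."""
    scheduler.add_job(
        submit_job_run,
        CronTrigger.from_crontab(cron_expression),
        args=[job_id, "on_schedule"],
        id=job_id,
        replace_existing=True,
    )

if __name__ == "__main__":
    register_schedule("daily-sales-aggregation", "0 2 * * *")  # run daily at 02:00
    scheduler.start()
```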

Events Consumer

A thin background worker dedicated to processing incoming events from the data lake. The sole responsibility of this process is to parse the events, identify the jobs to run and send them to the execution queue.
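The actual data lake event format isn’t covered here, so the following is only a rough sketch under an assumed event shape and an assumed in-memory lookup from input asset to jobs.

```python
# Rough sketch of the events consumer; the event shape and the job lookup structure
# are assumptions made for illustration.
import json
from typing import Callable

def handle_event(raw_event: str,
                 jobs_by_input_asset: dict[str, list[dict]],
                 enqueue: Callable[[str, str], None]) -> None:
    """Parse one data lake event, find the jobs it triggers, and enqueue their runs."""
    event = json.loads(raw_event)
    updated_asset = event.get("asset_path")              # e.g. "datalake/raw/sales"
    for job in jobs_by_input_asset.get(updated_asset, []):
        if job.get("trigger", {}).get("type") == "on_event":
            enqueue(job["name"], "on_event")              # hands off to the execution queue
```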

Job Processor

Another background worker that interacts with the compute, Azure Databricks for example, for the actual data processing. This is the most complex component in this design. Its responsibilities include: reading the job metadata from the job management service, checking for any running instances of the same job, checking conditions by interacting with the conditions check service, updating the job run status, starting job execution on the compute, waiting for the job to finish, and retrying jobs in case of condition check or other failures. A condensed sketch of this flow follows.
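To make that flow concrete, here is a condensed, hypothetical sketch of how the processor could handle one queued job run. Every helper it calls (the job management client, conditions client and compute client) is an assumed interface, not code from our solution.

```python
# Condensed, hypothetical sketch of the job processor handling one queued run.
# job_api, conditions_api and compute are assumed client interfaces, not real code.
import time

def process_run(run: dict, job_api, conditions_api, compute, max_retries: int = 3) -> None:
    job = job_api.get_job(run["job_id"])                        # read job metadata

    if job_api.has_running_instance(job["name"]):               # concurrency control
        job_api.update_run_status(run["run_id"], "skipped_concurrent")
        return

    for attempt in range(1, max_retries + 1):
        if not conditions_api.all_conditions_met(job["conditions"]):
            job_api.update_run_status(run["run_id"], f"waiting_on_conditions (attempt {attempt})")
            time.sleep(job.get("retry_delay_seconds", 900))     # retry as per retry policy
            continue

        job_api.update_run_status(run["run_id"], "running")
        compute_run_id = compute.start(job)                     # e.g. submit to Databricks
        while compute.is_running(compute_run_id):               # wait for the run to finish
            time.sleep(30)
        status = "succeeded" if compute.succeeded(compute_run_id) else "failed"
        job_api.update_run_status(run["run_id"], status)
        return

    job_api.update_run_status(run["run_id"], "failed_conditions_not_met")
```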

Conditions Check Service

A RESTful service dedicated to checking conditions. This service interacts with the relevant data sources to evaluate all the conditions defined for a job.
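As an example of what one such evaluation might look like, here is a minimal sketch of the “input updated within the last ‘x’ hours” condition. How the last-modified timestamp is obtained depends on the data source, so that part is left as an assumed helper.

```python
# Minimal sketch of one condition check; get_last_modified is an assumed helper
# that would query the underlying data source (e.g. the data lake) for the asset.
from datetime import datetime, timedelta, timezone
from typing import Callable

def input_updated_within_hours(asset_path: str, hours: int,
                               get_last_modified: Callable[[str], datetime]) -> bool:
    last_modified = get_last_modified(asset_path)        # expected as a UTC datetime
    return datetime.now(timezone.utc) - last_modified <= timedelta(hours=hours)
```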

Compute

This is where the actual data processing happens. The compute would be a third-party service, like Azure Databricks, that downloads the job logic, a Spark SQL file for example, and executes it to produce the output dataset.
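Since Azure Databricks was the initial target, the job processor could start and poll a run through the Databricks Jobs REST API roughly as below. The host, token, notebook path and cluster id are placeholders and the payload is simplified; treat this as an illustration of the interaction, not our actual request.

```python
# Simplified illustration of starting and polling a Databricks run via the Jobs REST API.
# Host, token and the task payload are placeholders; consult the Databricks docs for details.
import time
import requests

def run_on_databricks(host: str, token: str, notebook_path: str) -> str:
    headers = {"Authorization": f"Bearer {token}"}
    payload = {
        "run_name": "orchestrated-job-run",
        "tasks": [{
            "task_key": "main",
            "notebook_task": {"notebook_path": notebook_path},
            "existing_cluster_id": "<cluster-id>",    # placeholder
        }],
    }
    run_id = requests.post(f"{host}/api/2.1/jobs/runs/submit",
                           headers=headers, json=payload).json()["run_id"]

    while True:                                        # poll until the run terminates
        state = requests.get(f"{host}/api/2.1/jobs/runs/get",
                             headers=headers, params={"run_id": run_id}).json()["state"]
        if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            return state.get("result_state", "UNKNOWN")
        time.sleep(30)
```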

Conclusion

The solution proved stable and highly responsive after completing the first milestone of the implementation phase. This might not be an architecture that fits every need; in many cases, out-of-the-box solutions like Azure Data Factory could be sufficient. Our case had some special requirements that weren’t fully supported by the out-of-the-box solutions we evaluated, like the condition checks, concurrency control, schedules, triggers from existing data processing solutions and the flexibility to add more features.

The Weekend Blogger

Software Engineer with over 8 years of experience in designing, developing, improving, delivering and maintaining enterprise software solutions.