The ability to harness data and AI will play an increasingly pivotal role in an organization’s long-term fortunes. To more fully capitalize on their organization’s data, many data scientists, engineers, and analysts have grown increasingly reliant upon Databricks. While Databricks provides valuable capabilities for integrating, storing, processing, and governing data, it presents a challenge for workload automation: it creates yet another island of automation, one with limited built-in scheduling features.
As organizations expand their use of Databricks, the volume of notebook-based workflows grows with it. Running these workloads represents an increasingly time-consuming, labor-intensive effort.
Databricks features a basic, time-based scheduler that operators can use to automatically run jobs at specified times. The problem is that Databricks workflows typically have dependencies: a job may depend on upstream data arriving, on another job completing successfully, or on processes running outside Databricks altogether.
To coordinate these various processes and associated dependencies, administrators can only use forced time delays, that is, scheduling subsequent tasks to start at a time by which prior tasks should have completed.
If one task takes longer than the established forced time delay, a subsequent task will kick off anyway, operating on wrong or incomplete data. This dynamic is magnified in many environments, where pipeline activities are managed in phases: dozens of workloads need to complete in phase one before dozens of phase-two workloads can begin, and so on.
In these cases, a team may decide to set a forced time delay. For example, imagine the longest phase-one activities can take up to 10 hours to complete. A team could then add a buffer of two hours, and schedule all phase-two workloads to start a total of 12 hours after phase-one tasks were kicked off.
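As a concrete illustration, here is roughly what that forced-delay pattern looks like when expressed as Databricks job schedules (the Jobs API 2.0 uses quartz cron syntax; the times and settings below are hypothetical):

```python
# A hypothetical two-phase pipeline scheduled purely by wall-clock offset,
# expressed as Databricks job schedule objects (quartz cron syntax).

phase_one_schedule = {
    "quartz_cron_expression": "0 0 20 * * ?",  # phase one kicks off at 20:00
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED",
}

phase_two_schedule = {
    "quartz_cron_expression": "0 0 8 * * ?",   # phase two starts at 08:00,
    "timezone_id": "UTC",                      # 12 hours later, whether or
    "pause_status": "UNPAUSED",                # not phase one has finished
}
```

Nothing in either schedule expresses the actual dependency; the 12-hour gap is simply hard-coded.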
This approach exposes an organization in a few different ways: compute and staff time are wasted whenever phase-one tasks finish well before the buffer expires, downstream jobs still run on wrong or incomplete data whenever phase-one tasks overrun it, and end-to-end pipeline times balloon as buffers stack up across phases.
To avoid some of the challenges outlined above, some automation teams have sought to develop shell scripts that create automated workflows. However, these approaches require significant up-front investment, are difficult to support and run over time, and do not scale: inefficiency and costs continue to mount as the environment grows.
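To see why these hand-rolled approaches are hard to sustain, consider a minimal sketch of the polling logic such a script has to implement (shown here in Python rather than shell; the workspace host, token, and job IDs are placeholders). Everything beyond the happy path, such as retries, alerting, credential rotation, and parallel phases, has to be hand-built and hand-maintained.

```python
import time
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
HEADERS = {"Authorization": "Bearer <token>"}           # placeholder token

def run_and_wait(job_id: int) -> None:
    """Trigger a Databricks job and block until it reaches a terminal state."""
    resp = requests.post(f"{HOST}/api/2.0/jobs/run-now",
                         headers=HEADERS, json={"job_id": job_id})
    resp.raise_for_status()
    run_id = resp.json()["run_id"]

    while True:
        run = requests.get(f"{HOST}/api/2.0/jobs/runs/get",
                           headers=HEADERS, params={"run_id": run_id}).json()
        state = run["state"]
        if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            if state.get("result_state") != "SUCCESS":
                raise RuntimeError(f"run {run_id} failed: {state}")
            return
        time.sleep(60)  # poll once a minute

# Chaining phases means serializing calls like this for every dependency:
run_and_wait(101)   # phase one (placeholder job ID)
run_and_wait(202)   # phase two starts only after phase one succeeds
```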
It is important to underscore that all these disadvantages arise even when automation groups are only trying to manage automation within Databricks itself. The reality, however, is that many organizations run interrelated automation job streams that span a range of platforms and services. It is in these multi-platform scenarios that the lack of an enterprise workload automation solution becomes an even more significant vulnerability.
As long as automation has been around, the potential for costly, brittle islands of automation has also been around. As businesses expand their use of cloud-based solutions like Databricks, automation teams don’t want to add yet another siloed automation tool to their environment that they then have to maintain and support.
That’s why the use of enterprise automation continues to be so essential. Enterprise automation solutions, such as AutoSys and Automic from Broadcom, provide central management of automation workloads across a range of environments and platforms, along with the ability to adapt to the evolving requirements of cloud-driven workloads, including those driven by Databricks.
Automation by Broadcom offers a wide range of cloud integrations, all of which you can browse in our Automation Marketplace.
With this broad platform and service coverage, workload teams can efficiently manage complex, multi-phase automation deployments within Databricks—as well as complex pipelines that span platforms and services from a range of vendors, both cloud-based and on-premises. For example, an automation team could establish—and centrally manage—a multi-vendor extract, transform, load (ETL) workflow, with aggregation of data in a Google Cloud Platform instance, processing of data in Databricks, and analytics in Amazon QuickSight.
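As a rough sketch of what the handoffs in such a pipeline involve, the snippet below chains the three stages in Python: a BigQuery aggregation on Google Cloud, a Databricks job run, and an Amazon QuickSight dataset refresh. All project, job, table, and dataset identifiers are placeholders, and in practice AutoSys or Automic would model each stage as a separate job with explicit dependencies rather than one script.

```python
import boto3
import requests
from google.cloud import bigquery

# Stage 1: aggregate raw data in Google Cloud (BigQuery).
bq = bigquery.Client(project="my-gcp-project")             # placeholder project
bq.query("CREATE OR REPLACE TABLE staging.daily_agg AS "
         "SELECT user_id, COUNT(*) AS events "
         "FROM raw.events GROUP BY user_id").result()      # blocks until done

# Stage 2: process the aggregated data in Databricks.
resp = requests.post(
    "https://<workspace>.cloud.databricks.com/api/2.0/jobs/run-now",
    headers={"Authorization": "Bearer <token>"},           # placeholder token
    json={"job_id": 123},                                  # placeholder job ID
)
resp.raise_for_status()
# (a real pipeline would poll this run to completion before continuing)

# Stage 3: refresh the QuickSight dataset that powers the analytics.
qs = boto3.client("quicksight", region_name="us-east-1")
qs.create_ingestion(AwsAccountId="123456789012",           # placeholder account
                    DataSetId="sales-dataset",             # placeholder dataset
                    IngestionId="nightly-refresh-001")
```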
With AutoSys and Automic integrations for Databricks, developers and data science teams can fully leverage the power of Databricks in pursuing data science initiatives. At the same time, automation groups can continue to employ AutoSys and Automic as their central, unified platform for managing and orchestrating automation workloads across their application landscape.
AutoSys and Automic offer a rich set of capabilities that are invaluable for IT operations teams: any process dependency can be modeled, operational control is centralized, and teams get 360-degree visibility into all services running in production.
AutoSys and Automic can issue a range of commands to Databricks.
There are multiple ways operators can run jobs. They can submit jobs with “run now” and “run submit” payloads: “run now” triggers a job already defined in Databricks, while “run submit” submits a one-time run. In either case, when operators need to override default configurations, they can pass JSON payloads that do so.
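The two payload styles look roughly like this against the Databricks Jobs API 2.0 (the job ID, parameters, notebook path, and cluster specification are placeholders):

```python
# "Run now": trigger an existing Databricks job, optionally overriding
# its default parameters.
run_now_payload = {
    "job_id": 123,                                  # placeholder job ID
    "notebook_params": {"run_date": "2024-01-01"},  # parameter overrides
}

# "Run submit": submit a one-time run, supplying the full task and
# cluster configuration in the payload itself.
run_submit_payload = {
    "run_name": "nightly-etl",
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",        # placeholder version
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/Shared/etl/transform"},
}
```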
In this way, teams can automate the entire workflow in Databricks, including starting a cluster on demand, running and monitoring any number of jobs, and then stopping a cluster when jobs are complete.
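A minimal sketch of that cluster lifecycle, using the Databricks Clusters API alongside a job run (the endpoint, token, cluster ID, and job ID are placeholders):

```python
import requests

HOST = "https://<workspace>.cloud.databricks.com"   # placeholder endpoint
HEADERS = {"Authorization": "Bearer <token>"}       # placeholder token
CLUSTER_ID = "0123-456789-abcde"                    # placeholder cluster ID

# 1. Start the terminated cluster on demand.
requests.post(f"{HOST}/api/2.0/clusters/start",
              headers=HEADERS, json={"cluster_id": CLUSTER_ID}).raise_for_status()

# 2. Run jobs against it (run monitoring elided; see the polling sketch above).
requests.post(f"{HOST}/api/2.0/jobs/run-now",
              headers=HEADERS, json={"job_id": 123}).raise_for_status()

# 3. Terminate the cluster once the jobs complete, so it stops accruing cost.
requests.post(f"{HOST}/api/2.0/clusters/delete",
              headers=HEADERS, json={"cluster_id": CLUSTER_ID}).raise_for_status()
```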
The Databricks integrations use the public Databricks REST API 2.0. In AutoSys and Automic, operators establish a set of parameters, including the location of the Databricks endpoint, the tokens needed to authenticate, specific cluster information, monitoring criteria, the level of log detail, and more.
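The exact parameter names vary by product, but conceptually the connection definition covers something like the following (all field names here are illustrative, not the literal AutoSys or Automic configuration keys):

```python
# Illustrative connection/monitoring parameters for a Databricks integration.
# Field names are hypothetical; consult the AutoSys or Automic integration
# documentation for the actual configuration keys.
databricks_connection = {
    "endpoint_url": "https://<workspace>.cloud.databricks.com",
    "auth_token": "<personal-access-token>",   # used to authenticate API calls
    "cluster_id": "0123-456789-abcde",         # cluster to run jobs against
    "poll_interval_seconds": 60,               # how often to check run status
    "timeout_minutes": 120,                    # fail the job if it runs longer
    "log_level": "INFO",                       # level of log detail to capture
}
```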
Administrators can then easily track the progress of jobs and get notified immediately when processes complete successfully or encounter a failure. If a failure occurs, administrators can refer to the logs to identify the cause.
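For notebook runs, the Databricks REST API 2.0 also exposes a run’s output directly, which is one way an integration can surface the cause of a failure (the endpoint, token, and run ID are placeholders):

```python
import requests

HOST = "https://<workspace>.cloud.databricks.com"   # placeholder endpoint
HEADERS = {"Authorization": "Bearer <token>"}       # placeholder token

# Fetch the output of a completed run; for a failed run, the response
# includes an "error" field describing what went wrong.
out = requests.get(f"{HOST}/api/2.0/jobs/runs/get-output",
                   headers=HEADERS, params={"run_id": 456}).json()

if "error" in out:
    print("Run failed:", out["error"])
else:
    print("Notebook result:", out.get("notebook_output", {}).get("result"))
```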
By leveraging the Databricks integration, organizations can realize a number of benefits, particularly as their usage of Databricks and other cloud solutions continues to expand: dependency-based scheduling replaces forced time delays, clusters run only when work actually needs them, and operations teams gain centralized control and visibility across Databricks and every other platform they automate.
For today’s enterprises, extracting maximum value from data is an increasingly critical imperative. To achieve this objective, it is vital to establish seamless automated data pipelines and to have the ability to leverage a range of cloud platforms, including Databricks. With AutoSys and Automic, teams can leverage a unified platform for managing all automation workloads running in Databricks and all their other cloud-based services and on-premises platforms.
To learn more, read about Broadcom’s Automation Marketplace for Cloud Integrations, and be sure to read our AutoSys cloud integration primer on why AutoSys is so important for cloud automation.