
Video
Automic Automation Cloud Integration: Google Dataproc Agent Integration
Broadcom's Google Dataproc Automation Agent lets you easily execute, monitor, and manage Dataproc jobs with your existing enterprise workload automation, alongside your other cloud-native activities.

You instantly inherit the advanced capabilities of your enterprise solution, enabling you to deliver your digital transformation more quickly and successfully. This video explains the Automic Automation Google Dataproc agent integration and its benefits. It presents its components and demonstrates how to install, configure, and use it.
Video Transcript
Welcome to this video on the Automic Automation Google Dataproc integration solution. In this video, we will explain the Google Cloud Dataproc integration and what it brings to the Automic Automation user community.
Google Cloud Dataproc is a fully managed and highly scalable service for running Apache Hadoop, Apache Spark, Apache Flink, Presto, and 30-plus open-source tools and frameworks. The Automic Google Cloud Dataproc agent seamlessly integrates Google Cloud jobs with enterprise workload automation, enhancing efficiency and accelerating digital transformation by extending your existing enterprise automation to Dataproc. You retain end-to-end visibility and gain centralized command and control, alerting, SLA management, reporting, and auditing.
Integrating Automic Automation with Google Cloud Dataproc allows you to run Dataproc jobs in your workspace from Automic Automation. We'll provide some technical insights so that the integration components are clearly identified and the deployment sequence is understood. We'll focus on the configuration of the agent and the design of the two core object templates: connections and jobs. Finally, we'll run through a demo.
Automic Automation plays a central role in orchestrating operations across multiple environments, including the cloud. Automic Automation synchronizes these processes with other non-cloud operations. By integrating Google Cloud Dataproc, we can configure process automation centrally in Automic Automation and trigger, monitor, and supervise everything in one place. Google Cloud Dataproc processes can then be synchronized with all other environments routinely supported by Automic Automation.
Dataproc's role is reduced to executing the jobs. All other functions, especially those pertaining to automation, are delegated to Automic Automation. This means that we don't have to log in to the Google Cloud Dataproc environment and keep refreshing it ourselves. Automic Automation manages all the execution and monitoring aspects. Automic Automation lets us build configurations with intuitive interfaces like drag-and-drop workflows and supervise processes in simple dashboards designed natively for operations. Statuses are color-coded, and retrieving logs takes a basic right-click.
From an operations perspective, Automic Automation greatly simplifies the configuration and orchestration of Google Cloud Dataproc jobs. Externalizing operations to a tool with a high degree of third-party integration means we can synchronize all cloud workload with non-cloud workload using various agents and job object types. We can build sophisticated configurations involving multiple applications, database packages, system processes like backups and data consolidation, file transfers, web services, and other on-premises workload.
A conventional architecture involves two systems: the Automic Automation host and a dedicated system for the agent. The agent is configured with a simple INI file containing standard values: system, agent name, connection, and TLS settings. When we start the agent, it connects to the engine and adds two new objects to the repository: a connection object to store the Google Cloud Dataproc endpoint and login data, and a job template designed to trigger Dataproc jobs.
Let's assume we're automating four instances of Google Cloud Dataproc. We create a connection object in Automic Automation for each instance by duplicating the CONN template for each of them. Then we create the Dataproc jobs in Automic Automation for each corresponding process in Google Cloud Dataproc. The Automic Automation jobs reference the connection object of the target system. When we execute the jobs in Automic Automation, this triggers the corresponding process in Google Cloud Dataproc. We're able to retrieve the successive statuses, supervise the child processes in the cloud, and finally generate a job report. In Automic Automation, these jobs can be incorporated into workflows and integrated with other non-cloud processes.
The procedure to deploy the Google Cloud Dataproc integration is as follows: First, we download the integration package from the Marketplace. This package contains all the necessary elements. We unzip the package, which produces a directory containing the agent, the INI configuration files, and several other items like the start command. We use the appropriate INI file for our specific platform. The Google Cloud Dataproc agent is a standard Automic agent. It requires at least four values to be updated: the agent name, the Automic system, the JCP connection and TLS port, and finally the TLS certificate.
When the agent is configured, we start it. New object templates are deployed. We create a connection to every Google Cloud Dataproc instance we need to support. For this, we use the template CONN object, which we duplicate as many times as needed. The CONN object references the Google Cloud Dataproc endpoint. Finally, we use the Google Cloud Dataproc template job to create the jobs we need. We match these Automic Automation jobs to the Google Dataproc jobs, reference the connection object, and run them. We're able to supervise the jobs and their children, generate logs, and retrieve the statuses. The jobs can then be incorporated into non-cloud workflows.
We install, configure, and start an agent to deploy the Google Cloud Dataproc integration. The agent is included in the Google Cloud Dataproc package, which we download from the Marketplace. We unzip the package, which creates a directory structure (agents/dataproc/bin) that contains the agent files. Based on the platform, we rename the agent configuration file (UCXJCITX) and set a minimum of four values: the agent name, the AE system name, the host name and port of the Automation Engine's JCP, and finally the directory containing the TLS certificate. Finally, we start the agent by invoking the JAR file via the Java command. The agent connects to the AE and deploys the object templates needed to support the integration: the CONN (connection) object and a Google Cloud Dataproc job, which can be a start or stop cluster job, a submit job to cluster, or an instantiate workflow template job.
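For orientation, here is a minimal sketch of what such an agent INI could look like. The section and key names below are illustrative assumptions, not authoritative; the INI file shipped in the package is the reference for the exact names and defaults.

```ini
; Illustrative sketch only - section and key names are assumptions;
; consult the INI file shipped with the package for the exact ones.
[GLOBAL]
name=GCPDATAPROC01        ; agent name as it will appear in the AE
system=AUTOMIC            ; name of the Automic (AE) system

[TCP/IP]
connection=jcp-host:8443  ; host name and TLS port of the JCP

[TLS]
trustedCertFolder=./certificates/  ; directory holding the trusted TLS certificate
```

The agent would then be started with the Java command, for example `java -jar ucxjcitx.jar` (assuming the JAR name matches the configuration file).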
In our demo, we will create a connection object. Once we have established the connection to the Google Cloud Dataproc environment, we'll create Dataproc jobs. We'll create an instantiate workflow job first and run the integration. Then we'll create a submit to cluster job, and finally we'll execute and supervise the jobs. The integration also includes a third job type, which we will not cover in this demo as it is self-explanatory: the start or stop cluster job, which you can use to start or stop clusters.
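For context, this is roughly what a start or stop cluster request looks like when issued directly against Dataproc with Google's Python client (google-cloud-dataproc). The project, region, and cluster names are placeholders, and the Automic job issues the equivalent request for you; the sketch is purely illustrative.

```python
from google.cloud import dataproc_v1

# Placeholder values for illustration only.
project_id, region, cluster_name = "my-project", "us-central1", "wla-dataproc"

# Dataproc clients must target the cluster's regional endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Stopping a cluster returns a long-running operation we can wait on.
operation = client.stop_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": cluster_name}
)
operation.result()  # blocks until the cluster is stopped

# Starting it again is symmetrical.
operation = client.start_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": cluster_name}
)
operation.result()
```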
Let's explore the Google Cloud Dataproc console. In your Google Cloud console, type "Dataproc" in the search bar and open the service. The left-hand navigation breaks Dataproc into three core areas: clusters, jobs, and workflows. A cluster is simply a fleet of Google Compute Engine virtual machines preconfigured with Spark and Hadoop. In our demo project, we have two clusters: Replica WLA Demo and WLA Dataproc, the latter being the one we'll run jobs on. We'll use WLA Dataproc to submit work, then move over to Replica WLA Demo to demonstrate how easy it is to stop and restart a cluster when it's idle.
A job is a single task (Spark, Hive, PySpark, you name it) sent to a cluster. From this page, you can view every submitted job along with its ID, status, and runtime; submit new jobs or rerun past ones; and kick off quick tutorials, lessons, or a predefined workflow. Need to string multiple jobs together? Use a workflow template. A Dataproc workflow enables you to define a sequence of jobs, such as ETL, aggregation, and ML training, and run them end-to-end with a single click. Here you can monitor each step, whether it's an on-demand run (instantiate workflow) or an inline definition (instantiate workflow inline), ensuring your multi-stage pipelines stay consistent, repeatable, and easy to manage.
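To make the job concept concrete, the sketch below submits a single PySpark task to a cluster using Google's Python client (google-cloud-dataproc). The project, region, cluster, and script path are placeholders; in the integration, the Automic submit-to-cluster job performs this submission for you.

```python
from google.cloud import dataproc_v1

# Placeholder values for illustration only.
project_id, region, cluster_name = "my-project", "us-central1", "wla-dataproc"

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# A job pairs a placement (which cluster) with a single task definition (PySpark here).
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/etl_step.py"},  # placeholder path
}

# Submitting as an operation lets us wait for the job's terminal state.
operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
finished_job = operation.result()
print(finished_job.reference.job_id, finished_job.status.state.name)
```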
Let's start with the demo. Here you see our Google Dataproc environment, where we have already set up and executed several integrations. Let's move on to the Automic system. Here we create connection and job objects with specific inputs to connect to Google Cloud Dataproc. If we open a connection object, we must enter the endpoint, which is the URL of the Google Cloud Dataproc environment. Next, we specify an authentication type from the drop-down menu. Currently, the integration supports the service account key, which you can provide directly as JSON or via a file path. We select JSON and specify the service account key in JSON format. If you are using a proxy in your environment, you can specify the proxy host name, port, username, and password in the proxy section.
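The connection object captures roughly the same information you would pass to Google's Python client yourself: a regional endpoint and a service account key, supplied either as JSON or as a file path. A minimal sketch with placeholder values follows; it is not how the agent is implemented internally, just the equivalent direct setup.

```python
import json

from google.cloud import dataproc_v1
from google.oauth2 import service_account

region = "us-central1"  # placeholder

# Option 1: service account key provided directly as a JSON string (as in the demo).
# Placeholder only - a real key JSON also contains private_key, client_email, token_uri, etc.
key_json = '{"type": "service_account", "project_id": "my-project"}'
credentials = service_account.Credentials.from_service_account_info(json.loads(key_json))

# Option 2: service account key provided as a file path.
# credentials = service_account.Credentials.from_service_account_file("/path/to/key.json")

# The endpoint stored in the connection object corresponds to Dataproc's regional API endpoint.
client = dataproc_v1.JobControllerClient(
    credentials=credentials,
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"},
)
```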
Now that the connection object is defined, we can create a job: either a start or stop cluster job, a submit to cluster job, or an instantiate workflow job. Let's begin with the instantiate workflow job. We start by selecting the connection object we have just created. Following the connection input, the job requires the project ID and optionally the location details. The next field is operation type. From the drop-down list, we can select the appropriate type. There are two options available: instantiate and instantiate inline. Instantiate runs a workflow template and requires a template ID. Use this for predefined, reusable templates. Instantiate inline runs a workflow instantly without creating a template. Choose this for a quick one-off workflow. For this demo, we have selected instantiate. Next is the template ID. This is the ID of the Dataproc workflow template you want to execute.
The last parameter is "parameters," which is optional. It lets you pass arguments to the workflow when the job runs. These arguments are in JSON format and act like a set of keys and values. Think of them as settings you can adjust, like the input path, where the output should go, the cluster name, or any other setting your workflow needs at runtime. For this demo, we select none. Everything is configured now, so we save and execute the job.
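For reference, the inputs we just filled in (project, region/location, template ID, and optional parameters) map onto the underlying Dataproc call roughly as in this sketch with Google's Python client. All values are placeholders, and in the integration the Automic job makes the equivalent call for us.

```python
from google.cloud import dataproc_v1

# Placeholder values for illustration only.
project_id, region, template_id = "my-project", "us-central1", "my-workflow-template"

client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# "Instantiate" runs an existing, reusable template identified by its resource name.
name = f"projects/{project_id}/regions/{region}/workflowTemplates/{template_id}"

# Optional runtime parameters: simple key/value strings, like the JSON entered in the job form.
parameters = {"INPUT_PATH": "gs://my-bucket/in/", "OUTPUT_PATH": "gs://my-bucket/out/"}

operation = client.instantiate_workflow_template(
    request={"name": name, "parameters": parameters}
)
operation.result()  # blocks until every job in the workflow has finished
print("Workflow completed")

# "Instantiate inline" would instead call instantiate_inline_workflow_template()
# with a full template body and no pre-created template.
```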
Let's go to the “executions” view. It shows the list of executions in Automic Automation. As you can see, this is the job we created, and it has ended.
Let's have a look at the reports. The report captures the final response we received. The agent log lists all the connection details followed by the job inputs and execution logs. Finally, we see that the job was completed successfully.
Make a note of the job ID ending with 1745. We will use this ID for verification in the Google Cloud Dataproc instance.
Let's check the job status on the Google Cloud Dataproc instance. This is the job we triggered, and it has also completed successfully.
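If you prefer to verify the same execution programmatically instead of in the console, a quick check with Google's Python client might look like the following; project, region, and job ID are placeholders (use the ID noted from the Automic report).

```python
from google.cloud import dataproc_v1

# Placeholder values for illustration only.
project_id, region = "my-project", "us-central1"
job_id = "my-job-id-1745"  # placeholder; take the real ID from the Automic job report

client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = client.get_job(request={"project_id": project_id, "region": region, "job_id": job_id})
print(job.status.state.name)  # e.g. DONE when the job completed successfully
```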
That wraps up the demo of how Automic Automation can integrate with Google Cloud Dataproc to execute and monitor jobs. Thank you for watching this video.
Note: This transcript was generated with the assistance of an artificial intelligence language model. While we strive for accuracy and quality, please note that the transcription may not be entirely error-free. We recommend independently verifying the content and consulting with a product expert for specific advice or information. We do not assume any responsibility or liability for the use or interpretation of this content.