Video
Managing Cloud Storage with AutoSys Plug-in Extensions
See how you can manage storage objects for Amazon S3, Google Cloud Storage, and IBM Cloud Object Storage (COS) and integrate these into your data pipelines with AutoSys.
Video Transcript
For this demonstration I'm going to show the capabilities of our cloud storage plug-in extension, which can interact with Amazon S3, Google Cloud Storage, and IBM Cloud Object Storage.
We support the same set of operations across all of the cloud storage providers:
- We can copy data files between two buckets in cloud storage (a sketch of this follows the list)
- We can upload and download files to and from the agent machine
- We can also do file triggering, which follows very closely the existing file trigger: we can monitor a bucket for the arrival of a file
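As a concrete illustration of the first operation, here is a minimal Python sketch of a bucket-to-bucket copy using the AWS SDK (boto3). This is not the plug-in itself - the plug-in drives this kind of operation from a job definition - and the bucket and object names below are placeholders.

```python
import boto3

# Client for the S3 endpoint; credentials are resolved from the environment here.
s3 = boto3.client("s3")

# Server-side copy of an object from a source bucket to a destination bucket.
s3.copy_object(
    Bucket="destination-bucket",                                    # placeholder destination bucket
    Key="data/input.csv",                                           # object name in the destination
    CopySource={"Bucket": "source-bucket", "Key": "data/input.csv"},
)
```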
With this particular demonstration, I'm going to show you a scenario - typical of what you might see in a cloud-native pipeline - where data needs to be uploaded to the cloud and then manipulated using some kind of analytics or transforms to generate new data.
I didn't include the analytics, because the concept here is centered on storage: once the data is in storage, some kind of analytics can happen. You can use a variety of the other plug-in extensions we currently offer to manipulate data in the cloud, and once the data is manipulated, push it back out to storage.
In this particular scenario, I wanted to show that it doesn't matter whether you're on Amazon or Google - we can move the data even between the two cloud platforms. We can download it from Google Cloud Storage or Amazon S3 buckets, and we can upload it to another cloud as long as it offers S3-style object storage. Right now, we target IBM Cloud, Google Cloud, and Amazon (AWS).
In this example, I have a file hosted on the agent machine and I'm uploading it to S3 storage. If you look at the way the job is set up, it uses the plug-in extension and has some information about the endpoint for Amazon. It also uses a feature that's new in 12.1, security profiles: all of my access keys and access secrets are stored in AutoSys Secure.
The rest of the information describes how to connect: the buckets are in AP South 1, this is the file I want to upload (it's sitting on the agent machine), this is the name of the bucket, and this is the name the file will be stored under in the bucket. When this runs it will upload the file there, and once it's done the job will go to success.
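For reference, this is roughly what the upload step amounts to, expressed as a standalone boto3 sketch. The bucket name, object key, and file path are placeholders, and in the product the access key and secret come from the AutoSys Secure security profile rather than from code or the environment.

```python
import os
import boto3

# Connect to S3 in the AP South 1 region; credentials are read from the
# environment here only for illustration (the job uses a security profile).
s3 = boto3.client(
    "s3",
    region_name="ap-south-1",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# Upload a file from the agent machine to the bucket under the given object name.
s3.upload_file(
    Filename="/opt/data/source_data.csv",   # file on the agent machine (placeholder path)
    Bucket="demo-etl-bucket",               # target bucket (placeholder name)
    Key="incoming/source_data.csv",         # name the file is stored under in the bucket
)
```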
The next thing here shows the file-watching capability. This is really the same thing - you need an endpoint, a security profile, the access keys, the secrets, and the bucket name - and then this one has some information about the operation: a steady state of 20 seconds, meaning we wait until the file has not changed for 20 seconds.
We also have the ability to monitor for file size, as well as regex pattern matching. The idea here is that once you get the data up to cloud storage, you would put some extra jobs in here to do some ETL work - maybe use AWS Glue or some other kind of analytics to manipulate the data - then store it back in cloud storage, and then we can download it from the bucket.
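To make the watcher semantics concrete, here is a rough Python approximation of what steady-state monitoring with regex matching means: poll the bucket for an object whose key matches a pattern and report it as arrived only once its size has stayed the same for 20 seconds. The plug-in implements this inside the agent; the bucket, prefix, and pattern below are placeholders.

```python
import re
import time
import boto3

s3 = boto3.client("s3", region_name="ap-south-1")
pattern = re.compile(r"incoming/source_data.*\.csv$")   # regex the watcher matches against

def wait_for_steady_file(bucket, steady_seconds=20, poll_interval=5):
    """Return the key of a matching object once its size is unchanged for steady_seconds."""
    last_size, stable_since = None, None
    while True:
        resp = s3.list_objects_v2(Bucket=bucket, Prefix="incoming/")
        match = next((o for o in resp.get("Contents", []) if pattern.search(o["Key"])), None)
        if match is not None:
            if match["Size"] == last_size:
                if time.time() - stable_since >= steady_seconds:
                    return match["Key"]          # size unchanged long enough: file is steady
            else:
                last_size, stable_since = match["Size"], time.time()
        time.sleep(poll_interval)

print(wait_for_steady_file("demo-etl-bucket"))
```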
So here, we are taking the data out and saving it onto the local system, under a different name. The idea is that we can also span multiple clouds: now that we've done some processing in Amazon, maybe we want to process the data inside Google, so we want to transfer that data into Google Cloud Storage.
Here, you'll see that I'm taking the data I just processed with Amazon and uploading it to Google Cloud Storage. You'll see that “data1” is being uploaded right here: it takes the output from the previous job box I ran, uploads it to Google Cloud Storage, and then does the exact same thing on the Google side.
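The cross-cloud hand-off itself boils down to a download from S3 followed by an upload to Google Cloud Storage, sketched below with boto3 and the google-cloud-storage client. The bucket names, keys, and paths are placeholders.

```python
import boto3
from google.cloud import storage

# Download the processed object from the Amazon S3 bucket onto the agent machine.
s3 = boto3.client("s3", region_name="ap-south-1")
s3.download_file(Bucket="demo-etl-bucket", Key="processed/data1.csv", Filename="/opt/data/data1.csv")

# Upload the same file to a Google Cloud Storage bucket.
gcs = storage.Client()                                  # credentials resolved from the token file
gcs.bucket("demo-etl-gcs-bucket").blob("incoming/data1.csv").upload_from_filename("/opt/data/data1.csv")
```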
I have a file watcher on Google Cloud as well, monitoring for the arrival of the file with the same parameters as before on Amazon, except this one points to Google Cloud Storage. In Google, they don't use access keys and access secrets; they use something called a token file.
This is a token file, and we know how to generate the specially signed JWT tokens from it, because the file contains the certificates as well as the signing information.
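A minimal sketch of that, assuming the token file is a Google service-account key file: the client library reads the certificate and signing material from the JSON file and uses it to mint the signed tokens. The path is a placeholder.

```python
from google.cloud import storage

# Build a GCS client from the service-account key ("token") file on the agent machine.
gcs = storage.Client.from_service_account_json("/opt/secure/gcp-service-account.json")

# Quick sanity check that the credentials work.
print([b.name for b in gcs.list_buckets()])
```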
Once this is done, the idea is that you would process this data, maybe using Dataflow or Data Fusion, or maybe do something in BigQuery - there are a variety of things you can do at that point in Google to massage the data. Then we can download it again and maybe pass it on to something else.
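That final download step is the mirror image of the upload; for example (again with placeholder names):

```python
from google.cloud import storage

# Pull the processed object back out of Google Cloud Storage onto the agent machine.
gcs = storage.Client()
blob = gcs.bucket("demo-etl-gcs-bucket").blob("processed/data1_transformed.csv")
blob.download_to_filename("/opt/data/data1_transformed.csv")
```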
This shows you a little bit of the capabilities of the S3 plug-in extension, allowing us to manipulate data in cloud storage as well as do file monitoring and trigger off of files uploaded to cloud storage.
So here you see the first one: it uploaded the file, then it waits 20 seconds for a steady state on the file that was uploaded. This is just a guarantee - when you upload a file, it may not be fully available in S3 right away. By monitoring for steady state, once the file doesn't change for 20 seconds we know it's actually fully uploaded.
For these jobs we give you some details about what you can see while the job is running. If you go to the spool file, you can see what it actually did. Here you see the parameters you specified and that it uploaded the file. Now it's done doing the verify, and it's moving on to the Google part of the demo.
For all of these, you can get some detail in the spool file about what actually occurred during execution. This is for the file watcher - here you'll see it was monitoring for file completion, so it did trigger. Same thing for Google - you can see the exact same information about the details of the Google run.
Here it's showing the upload operation, and the file watcher is exactly the same. You'll see we detected that the file was uploaded, the job was marked as complete, and the job is also marked as success here. If any of these uploads or downloads fails, the job will be marked as failed, so we will know that it failed.
This is essentially the beginning of a typical cloud ETL pipeline, where you're running an ETL job and have to upload data to cloud storage and manipulate it there.