Develop an Amazon Redshift ETL serverless framework using RSQL, AWS Batch, and AWS Step Functions


Amazon Redshift RSQL is a command-line client for interacting with Amazon Redshift clusters and databases. You can connect to an Amazon Redshift cluster, describe database objects, query data, and view query results in various output formats. You can use enhanced control flow commands to replace existing extract, transform, load (ETL) and automation scripts.

This post explains how you can create a fully serverless and cost-effective Amazon Redshift ETL orchestration framework. To achieve this, you can use Amazon Redshift RSQL and AWS services such as AWS Batch and AWS Step Functions.

Overview of solution

When you’re migrating from existing data warehouses to Amazon Redshift, your existing ETL processes are implemented as proprietary scripts. These scripts contain SQL statements and complex business logic such as if-then-else control flow logic, error reporting, and error handling. You can convert all these features to Amazon Redshift RSQL, which you can use to replace existing ETL and other automation scripts. To learn more about Amazon Redshift RSQL features, examples, and use cases, see Accelerate your data warehouse migration to Amazon Redshift – Part 4.

AWS Schema Conversion Tool (AWS SCT) can convert proprietary scripts to Amazon Redshift RSQL. AWS SCT can automatically convert Teradata BTEQ scripts to Amazon Redshift RSQL. To learn more about how to use AWS SCT, see Converting Teradata BTEQ scripts to Amazon Redshift RSQL with AWS SCT.

The goal of the solution presented in this post is to run complex ETL jobs implemented in Amazon Redshift RSQL scripts in the AWS Cloud without having to manage any infrastructure. In addition to meeting functional requirements, this solution also provides full auditing and traceability of all ETL processes that you run.

The following diagram shows the final architecture.

The deployment is fully automated using AWS Cloud Development Kit (AWS CDK) and consists of the following stacks:

  • EcrRepositoryStack – Creates a private Amazon Elastic Container Registry (Amazon ECR) repository that hosts our Docker image with Amazon Redshift RSQL
  • RsqlDockerImageStack – Builds our Docker image asset and uploads it to the ECR repository
  • VpcStack – Creates a VPC with isolated subnets, creates an Amazon Simple Storage Service (Amazon S3) VPC endpoint gateway, as well as Amazon ECR, Amazon Redshift, and Amazon CloudWatch VPC endpoint interfaces
  • RedshiftStack – Creates an Amazon Redshift cluster, enables encryption, enforces encryption in transit, enables auditing, and deploys the Amazon Redshift cluster in isolated subnets
  • BatchStack – Creates a compute environment (using AWS Fargate), job queue, and job definition (using our Docker image with RSQL)
  • S3Stack – Creates data, scripts, and logging buckets; enables encryption at rest; enforces secure transfer; enables object versioning; and disables public access
  • SnsStack – Creates an Amazon Simple Notification Service (Amazon SNS) topic and email subscription (email is passed as a parameter)
  • StepFunctionsStack – Creates a state machine to orchestrate serverless RSQL ETL jobs
  • SampleDataDeploymentStack – Deploys sample RSQL ETL scripts and sample TPC benchmark datasets

Prerequisites

You should have the following prerequisites (inferred from the deployment steps below):

  • An AWS account
  • The AWS CLI configured with credentials for that account
  • Node.js and npm, used to install the AWS CDK project dependencies
  • Docker Engine running locally, because AWS CDK builds the RSQL Docker image on your machine
  • Git, to clone the sample repository

Deploy AWS CDK stacks

To deploy the serverless RSQL ETL framework solution, use the following code. Replace 123456789012 with your AWS account number, eu-west-1 with the AWS Region to which you want to deploy the solution, and your.email@example.com with the email address to which ETL success and failure notifications are sent.

git clone https://github.com/aws-samples/amazon-redshift-serverless-rsql-etl-framework
cd amazon-redshift-serverless-rsql-etl-framework
npm install
./cdk.sh 123456789012 eu-west-1 bootstrap
./cdk.sh 123456789012 eu-west-1 deploy --all --parameters SnsStack:EmailAddressSubscription=your.email@example.com

The whole process takes a few minutes. While AWS CDK creates all the stacks, you can continue reading this post.

Create the RSQL container image

AWS CDK creates an RSQL Docker image. This Docker image is the basic building block of our solution. All ETL processes run inside it. AWS CDK creates the Docker image locally using Docker Engine and then uploads it to the Amazon ECR repository.

The Docker image is based on an Amazon Linux 2 Docker image. It has the following tools installed: the AWS Command Line Interface (AWS CLI), unixODBC, the Amazon Redshift ODBC driver, and Amazon Redshift RSQL. It also contains the .odbc.ini file, which defines the etl profile that is used to connect to the Amazon Redshift cluster.
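The image itself is built from a Dockerfile in the repository. The following is only a sketch of what such a Dockerfile could look like, based on the tools listed above; the package URLs are placeholders, not the repository's actual contents:

```dockerfile
# Sketch only -- the real Dockerfile in the sample repository may differ.
FROM amazonlinux:2

# Tools named in this post: AWS CLI, unixODBC, the Amazon Redshift ODBC
# driver, and Amazon Redshift RSQL. gettext provides envsubst, which
# fetch_and_run.sh uses to expand placeholders.
RUN yum install -y awscli unixODBC gettext && \
    yum install -y <redshift-odbc-driver-rpm-url> <rsql-rpm-url> && \
    yum clean all

# ODBC profile that defines the etl data source for RSQL
COPY .odbc.ini /root/.odbc.ini
ENV ODBCINI=/root/.odbc.ini

# Entry point implementing the fetch and run pattern
COPY fetch_and_run.sh /usr/local/bin/fetch_and_run.sh
ENTRYPOINT ["/usr/local/bin/fetch_and_run.sh"]
```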

The following code example shows the .odbc.ini file. It defines an etl profile, which uses an AWS Identity and Access Management (IAM) role to get temporary cluster credentials to connect to Amazon Redshift. AWS CDK creates this role for us. Because of this, we don’t need to hard-code credentials in the Docker image. The Database, DbUser, and ClusterID parameters are set in AWS CDK. Also, AWS CDK replaces the Region parameter at runtime with the Region to which you deploy the stacks.

[ODBC]
Trace=no

[etl]
Driver=/opt/amazon/redshiftodbc/lib/64/libamazonredshiftodbc64.so
Database=demo
DbUser=etl
ClusterID=redshiftblogdemo
Region=eu-west-1
IAM=1

For more information about connecting to Amazon Redshift clusters with RSQL, see Connect to a cluster with Amazon Redshift RSQL.

Our Docker image implements the well-known fetch and run integration pattern. To learn more about this pattern, see Creating a Simple “Fetch & Run” AWS Batch Job. The Docker image fetches the ETL script from an external repository, and then runs it. AWS CDK passes the information about the ETL script to run to the Docker container at runtime as an AWS Batch job parameter. The job parameter is exposed to the container as an environment variable called BATCH_SCRIPT_LOCATION. Our job also expects two other environment variables: DATA_BUCKET_NAME, which is the name of the S3 data bucket, and COPY_IAM_ROLE_ARN, which is the Amazon Redshift IAM role used for the COPY command to load the data into Amazon Redshift. All environment variables are set automatically by AWS CDK. The fetch_and_run.sh script is the entry point of the Docker container. See the following code:

#!/bin/bash

# This script expects the following env variables to be set:
# BATCH_SCRIPT_LOCATION - full S3 path to the RSQL script to run
# DATA_BUCKET_NAME - S3 bucket name with the data
# COPY_IAM_ROLE_ARN - IAM role ARN that will be used to copy the data from S3 to Redshift

PATH="/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:/usr/local/sbin"

if [ -z "${BATCH_SCRIPT_LOCATION}" ] || [ -z "${DATA_BUCKET_NAME}" ] || [ -z "${COPY_IAM_ROLE_ARN}" ]; then
    echo "BATCH_SCRIPT_LOCATION/DATA_BUCKET_NAME/COPY_IAM_ROLE_ARN not set. No script to run."
    exit 1
fi

# download the script to a temp file
TEMP_SCRIPT_FILE=$(mktemp)
aws s3 cp "${BATCH_SCRIPT_LOCATION}" "${TEMP_SCRIPT_FILE}"

# execute the script
# envsubst will replace the ${DATA_BUCKET_NAME} and ${COPY_IAM_ROLE_ARN} placeholders with actual values
envsubst < "${TEMP_SCRIPT_FILE}" | rsql -D etl

exit $?

Create AWS Batch resources

Next, AWS CDK creates the AWS Batch compute environment, job queue, and job definition. As a fully managed service, AWS Batch helps you run batch computing workloads of any scale. AWS CDK creates a Fargate serverless compute environment for us. The compute environment deploys within the same VPC as the Amazon Redshift cluster, inside the isolated subnets. The job definition uses our Docker image with Amazon Redshift RSQL.

This step turns Amazon Redshift RSQL into a serverless service. You can build complex ETL workflows based on this generic job.

Create a Step Functions state machine

AWS CDK then moves on to the deployment of the Step Functions state machine. Step Functions lets you build complex workflows in a visual way directly in your browser. This service supports over 9,000 API actions from over 200 AWS services.

You can use Amazon States Language to create a state machine on the Step Functions console. The Amazon States Language is a JSON-based, structured language used to define your state machine. You can also build state machines programmatically using AWS CDK, as I have done for this post.
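For illustration, an AWS Batch task state for one RSQL job might look roughly like the following in Amazon States Language. This is a hand-written sketch, not CDK output; the queue, job definition, and script location are placeholders:

```json
{
  "StartAt": "LoadCustomerStaging",
  "States": {
    "LoadCustomerStaging": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobName": "load-customer-staging",
        "JobQueue": "<job-queue-arn>",
        "JobDefinition": "<rsql-job-definition-arn>",
        "ContainerOverrides": {
          "Environment": [
            {
              "Name": "BATCH_SCRIPT_LOCATION",
              "Value": "s3://<scripts-bucket>/customer_staging.sql"
            }
          ]
        }
      },
      "End": true
    }
  }
}
```

The .sync suffix tells Step Functions to wait until the AWS Batch job succeeds or fails before moving to the next state.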

After AWS CDK finishes, a new state machine is created in your account called ServerlessRSQLETLFramework. To run it, complete the following steps:

  1. Navigate to the Step Functions console.
  2. Choose the state machine to open its details page.
  3. Choose Edit, and then choose Workflow Studio New.
    The following screenshot shows our state machine.
  4. Choose Cancel to leave Workflow Studio, then choose Cancel again to leave the edit mode.
    You’ll be brought back to the details page.
  5. Choose Start execution.
    A dialog box appears. By default, the Name parameter is set to a random identifier, and the Input parameter is set to a sample JSON document.
  6. Delete the Input parameter and choose Start execution to start the state machine.

The Graph view on the details page updates in real time. The state machine starts with a parallel state with two branches. In the left branch, the first job loads customer data into the staging table, then the second job merges new and existing customer records. In the right branch, two smaller tables for regions and nations are loaded and then merged one after another. The parallel state waits until all branches are complete before moving to the vacuum-analyze state, which runs VACUUM and ANALYZE commands on Amazon Redshift. The sample state machine also implements the Amazon SNS Publish API action to send notifications about success or failure.
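The overall shape described above can be sketched in Amazon States Language as follows. The state names are illustrative, the branch contents and task parameters are elided, and the real machine generated by CDK may differ:

```json
{
  "StartAt": "CopyAndMerge",
  "States": {
    "CopyAndMerge": {
      "Type": "Parallel",
      "Branches": [
        { "StartAt": "CopyCustomer", "States": {} },
        { "StartAt": "CopyRegionAndNation", "States": {} }
      ],
      "Catch": [
        { "ErrorEquals": ["States.ALL"], "Next": "NotifyFailure" }
      ],
      "Next": "VacuumAnalyze"
    },
    "VacuumAnalyze": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Next": "NotifySuccess"
    },
    "NotifySuccess": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "End": true
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "End": true
    }
  }
}
```

The Catch on the parallel state is what routes any branch failure to the failure notification instead of the vacuum-analyze step.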

From the Graph view, you can check the status of each state by choosing it. Every state that uses an external resource has a link to it on the Details tab. In our example, next to every AWS Batch Job state, you can see a link to the AWS Batch Job details page. Here, you can view the status, runtime, parameters, IAM roles, a link to Amazon CloudWatch Logs with the logs produced by the ETL scripts, and more.

Clean up

To avoid ongoing charges for the resources that you created, delete them. AWS CDK deletes all resources except data resources such as S3 buckets and Amazon ECR repositories.

  1. First, delete all AWS CDK stacks. In the following code, provide your own AWS account and AWS Region:
    ./cdk.sh 123456789012 eu-west-1 destroy --all

  2. On the Amazon S3 console, empty and delete buckets with names starting with:
    1. s3stack-rsqletldemodata
    2. s3stack-rsqletldemoscripts
    3. s3stack-rsqletldemologging
  3. Finally, on the Amazon ECR console, delete repositories with names starting with:
    1. ecrrepositorystack-amazonlinuxrsql
    2. cdk-container-assets

Next steps

Here are some ideas for additional enhancements that you can add to the described solution.

You can break large complex state machines into smaller building blocks by creating self-contained state machines. In our example, you could create state machines for each pair of copy and merge jobs. You could create three such state machines: Copy and Merge Customer, Copy and Merge Region, and Copy and Merge Nation, and then call them from the main state machine. For complex workflows, a different team can work on each sub-state machine in parallel. Also, this pattern promotes reuse of existing components, best practices, and security mechanisms.
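Under that design, the main state machine can call a child state machine with the Step Functions synchronous startExecution integration. A single such task state might look roughly like this sketch, where the state machine ARN and the Next state name are placeholders:

```json
{
  "Type": "Task",
  "Resource": "arn:aws:states:::states:startExecution.sync:2",
  "Parameters": {
    "StateMachineArn": "arn:aws:states:<region>:<account>:stateMachine:CopyAndMergeCustomer"
  },
  "Next": "CopyAndMergeRegion"
}
```

The :2 variant returns the child execution's output as JSON, so the parent can act on results produced by the sub-state machine.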

You can use Amazon S3 Event Notifications or Amazon S3 EventBridge notifications to start a state machine automatically after you upload a file to an S3 bucket. To learn more about Amazon S3 integration with Amazon EventBridge, see Use Amazon S3 Event Notifications with Amazon EventBridge. This way you can achieve a fully event-driven serverless ETL orchestration framework.
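For example, with EventBridge notifications enabled on the data bucket, an EventBridge rule with an event pattern along these lines (the bucket name is a placeholder) could use the state machine as its target:

```json
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["<data-bucket-name>"]
    }
  }
}
```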

Summary

You can use Amazon Redshift RSQL, AWS Batch, and Step Functions to create modern, serverless, and cost-effective ETL workflows. There is no infrastructure to manage, and Amazon Redshift RSQL works as a serverless RSQL service. In this post, we demonstrated how to use this serverless RSQL service to build more complex ETL workflows with Step Functions.

Step Functions integrates natively with over 200 AWS services. This opens a new world of possibilities to AWS customers and partners, who can integrate their processes with other data, analytics, machine learning, and compute services such as Amazon S3, Amazon DynamoDB, AWS Glue, Amazon OpenSearch Service (successor to Amazon Elasticsearch Service), Amazon SageMaker, AWS Lambda, and more. An additional advantage of Step Functions and AWS Batch is that you have full traceability and auditing out of the box. Step Functions shows Graph or Event views together with a complete history for all state machine runs.

In this post, I used RSQL automation scripts as the building blocks of ETL workflows. Using RSQL is a common integration pattern that we see for customers migrating from Teradata BTEQ scripts. However, if you have simple ETL or ELT processes that can be written as plain SQL, you can invoke the Amazon Redshift Data API directly from Step Functions. To learn more about this integration pattern, see ETL orchestration using the Amazon Redshift Data API and AWS Step Functions with AWS SDK integration.


About the author

Lukasz is a Principal Software Dev Engineer working in the AWS DMA team. Lukasz helps customers move their workloads to AWS and specializes in migrating data warehouses and data lakes to AWS. In his free time, Lukasz enjoys learning new human languages.
