Data is a key enabler for your business. Many AWS customers have integrated their data across multiple data sources using AWS Glue, a serverless data integration service, in order to make data-driven business decisions. To grow the power of data at scale for the long term, it’s highly recommended to design an end-to-end development lifecycle for your data integration pipelines. The following are common asks from our customers:
- Is it possible to develop and test AWS Glue data integration jobs on my local laptop?
- Are there recommended approaches to provisioning components for data integration?
- How can we build a continuous integration and continuous delivery (CI/CD) pipeline for our data integration pipeline?
- What is the best practice to move from a pre-production environment to production?
To address these asks, this post defines the development lifecycle for data integration and demonstrates how software engineers and data engineers can design an end-to-end development lifecycle using AWS Glue, including development, testing, and CI/CD, using a sample baseline template.
End-to-end development lifecycle for a data integration pipeline
Today, it’s common to define not only data integration jobs but also all the data components in code. This means that you can rely on standard software best practices to build your data integration pipeline. The software development lifecycle on AWS defines the following six phases: Plan, Design, Implement, Test, Deploy, and Maintain.
In this section, we discuss each phase in the context of a data integration pipeline.
Plan
In the planning phase, developers collect requirements from stakeholders such as end-users to define a data requirement. This could include what the use cases are (for example, ad hoc queries, dashboards, or troubleshooting), how much data to process (for example, 1 TB per day), what kinds of data, how many different data sources to pull from, how much data latency to accept to make it queryable (for example, 15 minutes), and so on.
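To make the example figures above more concrete, the following back-of-the-envelope sketch converts them into an average throughput and a per-window volume. It assumes 1 TB means 10^12 bytes and that data arrives evenly through the day, which real workloads rarely do, so treat the output as a rough planning aid only.

```python
# Back-of-the-envelope sizing from the example figures in the planning phase.
# Assumptions: 1 TB is 10**12 bytes and the load is spread evenly over the day.
DAILY_VOLUME_BYTES = 1 * 10**12   # 1 TB per day
LATENCY_WINDOW_MINUTES = 15       # accepted data latency

SECONDS_PER_DAY = 24 * 60 * 60
windows_per_day = (24 * 60) // LATENCY_WINDOW_MINUTES

# Average sustained throughput and the volume each latency window must absorb.
avg_throughput_mb_s = DAILY_VOLUME_BYTES / SECONDS_PER_DAY / 10**6
data_per_window_gb = DAILY_VOLUME_BYTES / windows_per_day / 10**9

print(f"windows per day: {windows_per_day}")
print(f"average throughput: {avg_throughput_mb_s:.1f} MB/s")
print(f"data per window: {data_per_window_gb:.1f} GB")
```

Under these assumptions, each 15-minute run would need to process roughly 10 GB, which can inform the job sizing you settle on in the design phase.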
Design
In the design phase, you analyze requirements and identify the best solution to build the data integration pipeline. In AWS, you need to choose the right services to achieve the goal and come up with the architecture by integrating those services and defining dependencies between components. For example, you may choose AWS Glue jobs as a core component for loading data from different sources, including Amazon Simple Storage Service (Amazon S3), then integrating them and preprocessing and enriching data. Then you may want to chain multiple AWS Glue jobs and orchestrate them. Finally, you may want to use Amazon Athena and Amazon QuickSight to present the enriched data to end-users.
Implement
In the implementation phase, data engineers code the data integration pipeline. They analyze the requirements to identify coding tasks to achieve the final result. The code includes the following:
- AWS resource definition
- Data integration logic
When using AWS Glue, you can define the data integration logic in a job script, which can be written in Python or Scala. You can use your preferred IDE to implement the AWS resource definition using the AWS Cloud Development Kit (AWS CDK) or AWS CloudFormation, and also the business logic of AWS Glue job scripts for data integration. To learn more about how to implement your AWS Glue job scripts locally, refer to Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container.
Test
In the testing phase, you check the implementation for bugs. Quality analysis includes testing the code for errors and checking if it meets the requirements. Because many teams immediately test the code you write, the testing phase often runs parallel to the development phase. There are different types of testing:
- Unit testing
- Integration testing
- Performance testing
For unit testing, even for data integration, you can rely on a standard testing framework such as pytest or ScalaTest. To learn more about how to achieve unit testing locally, refer to Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container.
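As a minimal illustration of the unit testing style described above, the following sketch tests a small join helper with a pytest-style test function. The helper `join_by_key` and the sample records are hypothetical and written in plain Python so the example runs anywhere; a real AWS Glue job script would typically work on Spark DataFrames or DynamicFrames instead.

```python
# Hypothetical pure-Python transformation used only for illustration.
def join_by_key(left, right, key):
    """Inner-join two lists of dicts on a shared key."""
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)

    joined = []
    for row in left:
        for match in right_by_key.get(row[key], []):
            # Merge the two rows; values from the left row win on conflicts.
            joined.append({**match, **row})
    return joined


# pytest collects functions named test_* and reports a failed assert as a failure.
def test_join_by_key_keeps_only_matching_rows():
    persons = [{"person_id": 1, "name": "Ann"}, {"person_id": 2, "name": "Bob"}]
    memberships = [{"person_id": 1, "org": "Senate"}]

    result = join_by_key(persons, memberships, "person_id")

    assert result == [{"person_id": 1, "org": "Senate", "name": "Ann"}]
```

Keeping transformation logic in small functions like this one, separate from the code that reads and writes data, is one way to make job scripts easier to test locally.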
Deploy
When data engineers develop a data integration pipeline, they code and test on a different copy of the product than the one that the end-users have access to. The environment that end-users use is called production, whereas other copies are said to be in the development or the pre-production environment.
Having separate build and production environments ensures that you can continue to use the data integration pipeline even while it’s being changed or upgraded. The deployment phase includes several tasks to move the latest build copy to the production environment, such as packaging, environment configuration, and installation.
The following components are deployed by the AWS CDK or AWS CloudFormation:
- AWS resources
- Data integration job scripts for AWS Glue
AWS CodePipeline helps you build a mechanism to automate deployments among different environments, including development, pre-production, and production. When you commit your code to AWS CodeCommit, CodePipeline automatically provisions AWS resources based on the CloudFormation templates included in the commit and uploads script files included in the commit to Amazon S3.
Maintain
Even after you deploy your solution to a production environment, it’s not the end of your project. You need to monitor the data integration pipeline continuously and keep maintaining and improving it. More specifically, you also need to fix bugs, resolve customer issues, and manage software changes. In addition, you need to monitor the overall system performance, security, and user experience to identify new ways to improve the existing data integration pipeline.
Solution overview
Typically, you have multiple accounts to manage and provision resources for your data pipeline. In this post, we assume the following three accounts:
- Pipeline account – This hosts the end-to-end pipeline
- Dev account – This hosts the data integration pipeline in the development environment
- Prod account – This hosts the data integration pipeline in the production environment
If you want, you can use the same account and the same Region for all three.
To start applying this end-to-end development lifecycle model to your data platform easily and quickly, we prepared the baseline template aws-glue-cdk-baseline using the AWS CDK. The template is built on top of AWS CDK v2 and CDK Pipelines. It provisions two kinds of stacks:
- AWS Glue app stack – This provisions the data integration pipeline: one in the dev account and one in the prod account
- Pipeline stack – This provisions the Git repository and CI/CD pipeline in the pipeline account
The AWS Glue app stack provisions the data integration pipeline, including the following resources:
- AWS Glue jobs
- AWS Glue job scripts
The following diagram illustrates this architecture.
At the time of publishing of this post, the AWS CDK has two versions of the AWS Glue module: @aws-cdk/aws-glue and @aws-cdk/aws-glue-alpha, containing L1 constructs and L2 constructs, respectively. The sample AWS Glue app stack is defined using aws-glue-alpha, the L2 construct for AWS Glue, because it’s straightforward to define and manage AWS Glue resources. If you want to use the L1 construct, refer to Build, Test and Deploy ETL solutions using AWS Glue and AWS CDK based CI/CD pipelines.
The pipeline stack provisions the entire CI/CD pipeline, including the following resources:
The following diagram illustrates the pipeline workflow.
Every time the business requirement changes (such as adding data sources or changing data transformation logic), you make changes on the AWS Glue app stack and re-provision the stack to reflect your changes. You do this by committing your changes in the AWS CDK template to the CodeCommit repository; CodePipeline then reflects the changes on AWS resources using CloudFormation change sets.
In the following sections, we present the steps to set up the required environment and demonstrate the end-to-end development lifecycle.
Prerequisites
You need the following resources:
Initialize the project
To initialize the project, complete the following steps:
- Clone the baseline template to your workplace:
- Create a Python virtual environment specific to the project on the client machine:
We use a virtual environment in order to isolate the Python environment for this project and not install software globally.
- Activate the virtual environment according to your OS:
- On macOS and Linux, use the following command:
- On a Windows platform, use the following command:
After this step, the subsequent steps run within the bounds of the virtual environment on the client machine and interact with the AWS account as needed.
- Install the required dependencies described in requirements.txt to the virtual environment:
- Edit the configuration file default-config.yaml based on your environments (replace each account ID with your own):
- Run pytest to initialize the snapshot test files by running the following command:
Bootstrap your AWS environments
Run the following commands to bootstrap your AWS environments:
- In the pipeline account, replace PIPELINE-ACCOUNT-NUMBER, REGION, and PIPELINE-PROFILE with your own values:
- In the dev account, replace PIPELINE-ACCOUNT-NUMBER, DEV-ACCOUNT-NUMBER, REGION, and DEV-PROFILE with your own values:
- In the prod account, replace PIPELINE-ACCOUNT-NUMBER, PROD-ACCOUNT-NUMBER, REGION, and PROD-PROFILE with your own values:
When you use only one account for all environments, you can just run the cdk bootstrap command one time.
Deploy your AWS resources
Run the command using the pipeline account to deploy the resources defined in the AWS CDK baseline template:
This creates the pipeline stack in the pipeline account and the AWS Glue app stack in the development account.
When the cdk deploy command is completed, let’s verify the pipeline using the pipeline account.
On the CodePipeline console, navigate to GluePipeline. Then verify that GluePipeline has the following stages: Source, Build, UpdatePipeline, Assets, DeployDev, and DeployProd. Also verify that the stages Source, Build, UpdatePipeline, Assets, and DeployDev have succeeded, and DeployProd is pending. It can take about 15 minutes.
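If you prefer to script this check rather than use the console, the following sketch shows one possible approach. It assumes you have already fetched the pipeline state (for example, with the boto3 CodePipeline client's get_pipeline_state call) and only inspects the stageStates list from that response. The helper name `unfinished_stages` and the sample data are illustrative and not part of the baseline template.

```python
# Stage names come from the post. The dict shape mirrors the stageStates list
# in the CodePipeline GetPipelineState response (assumed to be fetched already).
EXPECTED_SUCCEEDED = ["Source", "Build", "UpdatePipeline", "Assets", "DeployDev"]


def unfinished_stages(stage_states, expected=EXPECTED_SUCCEEDED):
    """Return the expected stages whose latest execution has not succeeded."""
    status_by_stage = {
        stage["stageName"]: stage.get("latestExecution", {}).get("status")
        for stage in stage_states
    }
    return [name for name in expected if status_by_stage.get(name) != "Succeeded"]


# Hand-written sample data for illustration only.
sample_stage_states = [
    {"stageName": "Source", "latestExecution": {"status": "Succeeded"}},
    {"stageName": "Build", "latestExecution": {"status": "Succeeded"}},
    {"stageName": "UpdatePipeline", "latestExecution": {"status": "Succeeded"}},
    {"stageName": "Assets", "latestExecution": {"status": "Succeeded"}},
    {"stageName": "DeployDev", "latestExecution": {"status": "Succeeded"}},
    {"stageName": "DeployProd", "latestExecution": {"status": "InProgress"}},
]

# An empty list means every stage up to DeployDev has succeeded.
print(unfinished_stages(sample_stage_states))
```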
Now that the pipeline has been created successfully, you can also verify the AWS Glue app stack resource on the AWS CloudFormation console in the dev account.
At this step, the AWS Glue app stack is deployed only in the dev account. You can try to run the AWS Glue job ProcessLegislators to see how it works.
Configure your Git repository with CodeCommit
In an earlier step, you cloned the Git repository from GitHub. Although it’s possible to configure the AWS CDK template to work with GitHub, GitHub Enterprise, or Bitbucket, for this post, we use CodeCommit. If you prefer those third-party Git providers, configure the connections and edit pipeline_stack.py to define the variable source to use the target Git provider using CodePipelineSource.
Because you already ran the cdk deploy command, the CodeCommit repository has already been created with all the required code and related files. The first step is to set up access to CodeCommit. The next step is to clone the repository from CodeCommit to your local machine. Run the following commands:
In the next step, we make changes in this local copy of the CodeCommit repository.
End-to-end development lifecycle
Now that the environment has been successfully created, you’re ready to start developing a data integration pipeline using this baseline template. Let’s walk through the end-to-end development lifecycle.
When you want to define your own data integration pipeline, you need to add more AWS Glue jobs and implement job scripts. For this post, let’s assume the use case is to add a new AWS Glue job with a new job script that reads multiple S3 locations and joins them.
Implement and test in your local environment
First, implement and test the AWS Glue job and its job script in your local environment using Visual Studio Code.
Set up your development environment by following the steps in Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container. The following steps are required in the context of this post:
- Start Docker.
- Pull the Docker image that has the local development environment using the AWS Glue ETL library:
- Run the following command to define the AWS named profile name:
- Run the following command to make it available with the baseline template:
- Run the Docker container:
- Start Visual Studio Code.
- Choose Remote Explorer in the navigation pane, then choose the arrow icon of the workspace folder in the container public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01.
If the workspace folder is not shown, choose Open folder and select /home/glue_user/workspace.
Then you will see a view similar to the following screenshot.
Optionally, you can install AWS Toolkit for Visual Studio Code, and start Amazon CodeWhisperer to enable code recommendations powered by a machine learning model. For example, in aws_glue_cdk_baseline/job_scripts/process_legislators.py, you can put comments like “# Write a DataFrame in Parquet format to S3”, press the Enter key, then CodeWhisperer will recommend a code snippet similar to the following:
Now you install the required dependencies described in requirements.txt to the container environment.
- Run the following commands in the terminal in Visual Studio Code:
- Implement the code.
Now let’s make the required changes for a new AWS Glue job here.
- Edit the file aws_glue_cdk_baseline/glue_app_stack.py. Let’s add the following new code block after the existing job definition of ProcessLegislators in order to add the new AWS Glue job JoinLegislators:
Here, you added three job parameters for different S3 locations using the variable config. It’s the dictionary generated from default-config.yaml. In this baseline template, we use this central config file for managing parameters for all the Glue jobs in the structure <stage name>/jobs/<job name>/<parameter name>. In the following steps, you provide those locations through the AWS Glue job parameters.
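To illustrate the <stage name>/jobs/<job name>/<parameter name> structure, here is a minimal sketch of a lookup against such a dictionary. The dictionary is a hand-written stand-in for what would be loaded from default-config.yaml. The parameter name `inputLocationPersons`, the bucket names, and the helper `job_parameter` are hypothetical and may not match the actual template.

```python
# Hand-written stand-in for the dictionary generated from default-config.yaml.
# The parameter name and S3 paths are hypothetical examples.
config = {
    "dev": {
        "jobs": {
            "JoinLegislators": {
                "inputLocationPersons": "s3://example-dev-bucket/persons/",
            }
        }
    },
    "prod": {
        "jobs": {
            "JoinLegislators": {
                "inputLocationPersons": "s3://example-prod-bucket/persons/",
            }
        }
    },
}


def job_parameter(config, stage, job_name, parameter_name):
    """Look up <stage name>/jobs/<job name>/<parameter name> in the config."""
    try:
        return config[stage]["jobs"][job_name][parameter_name]
    except KeyError as missing:
        # Report the full path so a missing entry is easy to locate.
        raise KeyError(
            f"{stage}/jobs/{job_name}/{parameter_name} is not defined "
            f"(missing key: {missing})"
        ) from None


print(job_parameter(config, "dev", "JoinLegislators", "inputLocationPersons"))
```

Because every stage uses the same layout, the same job definition can pick up different S3 locations for dev and prod by changing only the stage name.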
- Create a new job script called aws_glue_cdk_baseline/job_scripts/join_legislators.py:
- Create a new unit test script for the new AWS Glue job called aws_glue_cdk_baseline/job_scripts/tests/test_join_legislators.py:
- In default-config.yaml, add the following under prod and dev:
- Add the following under "jobs" in the variable config in tests/unit/test_glue_app_stack.py, tests/unit/test_pipeline_stack.py, and tests/snapshot/test_snapshot_glue_app_stack.py (no need to replace S3 locations):
- Choose Run at the top right to run the individual job scripts.
If the Run button is not shown, install Python into the container through Extensions in the navigation pane.
- For local unit testing, run the following command in the terminal in Visual Studio Code:
Then you can verify that the newly added unit test passed successfully.
- Run pytest to initialize the snapshot test files by running the following command:
Deploy to the development environment
Complete the following steps to deploy the AWS Glue app stack to the development environment and run integration tests there:
- Set up access to CodeCommit.
- Commit and push your changes to the CodeCommit repo:
You can see that the pipeline is successfully triggered.
Integration test
There is nothing required for running the integration test for the newly added AWS Glue job. The integration test script integ_test_glue_app_stack.py runs all the jobs that have a specific tag, then verifies each job run’s state and duration. If you want to change the condition or the threshold, you can edit the assertions at the end of the integ_test_glue_job method.
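The following sketch shows the kind of state-and-duration check described above. It is not the code from integ_test_glue_app_stack.py. The function name `assert_job_run_ok` and the 600-second threshold are assumptions, and the input dictionary mirrors the JobRun structure returned by the AWS Glue GetJobRun API (JobRunState and ExecutionTime in seconds).

```python
# Hypothetical threshold; pick a value that suits your own jobs.
MAX_DURATION_SECONDS = 600


def assert_job_run_ok(job_run, max_duration_seconds=MAX_DURATION_SECONDS):
    """Fail if a finished job run did not succeed or ran longer than allowed."""
    state = job_run["JobRunState"]
    duration = job_run["ExecutionTime"]  # reported in seconds
    assert state == "SUCCEEDED", f"job run ended in state {state}"
    assert duration <= max_duration_seconds, (
        f"job run took {duration}s, over the {max_duration_seconds}s threshold"
    )


# Hand-written sample data: a successful two-minute run passes both checks.
assert_job_run_ok({"JobRunState": "SUCCEEDED", "ExecutionTime": 120})
```

Keeping the threshold as a parameter makes it easy to tighten or relax the duration check per job without touching the assertion logic.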
Deploy to the production environment
Complete the following steps to deploy the AWS Glue app stack to the production environment:
- On the CodePipeline console, navigate to GluePipeline.
- Choose Review under the DeployProd stage.
- Choose Approve.
Wait for the DeployProd stage to complete, then you can verify the AWS Glue app stack resource in the prod account.
Clean up
To clean up your resources, complete the following steps:
- Run the following command using the pipeline account:
- Delete the AWS Glue app stack in the dev account and prod account.
Conclusion
In this post, you learned how to define the development lifecycle for data integration and how software engineers and data engineers can design an end-to-end development lifecycle using AWS Glue, including development, testing, and CI/CD, through a sample AWS CDK template. You can get started building your own end-to-end development lifecycle for your workload using AWS Glue.
About the author
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.