End-to-end development lifecycle for data engineers to build a data integration pipeline using AWS Glue

July 27, 2023


Data is a key enabler for your business. Many AWS customers have integrated their data across multiple data sources using AWS Glue, a serverless data integration service, in order to make data-driven business decisions. To grow the power of data at scale for the long term, it’s highly recommended to design an end-to-end development lifecycle for your data integration pipelines. The following are common asks from our customers:

Is it possible to develop and test AWS Glue data integration jobs on my local laptop?
Are there recommended approaches to provisioning components for data integration?
How can we build a continuous integration and continuous delivery (CI/CD) pipeline for our data integration pipeline?
What is the best practice to move from a pre-production environment to production?

To address these asks, this post defines the development lifecycle for data integration and demonstrates how software engineers and data engineers can design an end-to-end development lifecycle using AWS Glue, including development, testing, and CI/CD, using a sample baseline template.

End-to-end development lifecycle for a data integration pipeline

Today, it’s common to define not only data integration jobs but also all the data components in code. This means that you can rely on standard software best practices to build your data integration pipeline. The software development lifecycle on AWS defines the following six phases: Plan, Design, Implement, Test, Deploy, and Maintain.

In this section, we discuss each phase in the context of a data integration pipeline.

Plan

In the planning phase, developers collect requirements from stakeholders such as end-users to define a data requirement. This could be what the use cases are (for example, ad hoc queries, dashboard, or troubleshooting), how much data to process (for example, 1 TB per day), what kinds of data, how many different data sources to pull from, how much data latency to accept to make it queryable (for example, 15 minutes), and so on.
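Figures like these can be turned into rough sizing inputs with a quick back-of-the-envelope calculation. The following is only an illustrative sketch: the function name is hypothetical, and it assumes decimal units (1 TB = 10^12 bytes) and data arriving evenly over the day.

```python
# Rough sizing sketch.
# Assumptions (not from the post): 1 TB = 10**12 bytes, and the load is
# spread evenly over 24 hours.

SECONDS_PER_DAY = 24 * 60 * 60


def sizing(bytes_per_day: int, latency_minutes: int) -> dict:
    """Return average throughput and the data volume arriving per latency window."""
    windows_per_day = (24 * 60) / latency_minutes
    return {
        "windows_per_day": windows_per_day,
        "avg_mb_per_second": bytes_per_day / SECONDS_PER_DAY / 10**6,
        "gb_per_window": bytes_per_day / windows_per_day / 10**9,
    }


if __name__ == "__main__":
    # 1 TB per day with a 15-minute latency target, as in the examples above.
    for key, value in sizing(bytes_per_day=10**12, latency_minutes=15).items():
        print(f"{key}: {value:.2f}")
```

Under these assumptions, 1 TB per day averages out to roughly 11.6 MB per second, or about 10.4 GB for each 15-minute window. Real workloads are usually bursty, so treat these numbers as a lower bound for peak capacity.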

Design

In the design phase, you analyze requirements and identify the best solution to build the data integration pipeline. In AWS, you need to choose the right services to achieve the goal and come up with the architecture by integrating these services and defining dependencies between components. For example, you may choose AWS Glue jobs as a core component for loading data from different sources, including Amazon Simple Storage Service (Amazon S3), then integrating them and preprocessing and enriching data. Then you may want to chain multiple AWS Glue jobs and orchestrate them. Finally, you may want to use Amazon Athena and Amazon QuickSight to present the enriched data to end-users.

Implement

In the implementation phase, data engineers code the data integration pipeline. They analyze the requirements to identify coding tasks to achieve the final result. The code includes the following:

AWS resource definition
Data integration logic

When using AWS Glue, you can define the data integration logic in a job script, which can be written in Python or Scala. You can use your preferred IDE to implement AWS resource definition using the AWS Cloud Development Kit (AWS CDK) or AWS CloudFormation, and also the business logic of AWS Glue job scripts for data integration. To learn more about how to implement your AWS Glue job scripts locally, refer to Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container.

Test

In the testing phase, you check the implementation for bugs. Quality analysis includes testing the code for errors and checking if it meets the requirements. Because many teams immediately test the code you write, the testing phase often runs parallel to the development phase. There are different types of testing:

Unit testing
Integration testing
Performance testing

For unit testing, even for data integration, you can rely on a standard testing framework such as pytest or ScalaTest. To learn more about how to achieve unit testing locally, refer to Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container.
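As a minimal illustration of what such a unit test can look like, the sketch below tests a small record-cleaning helper with a pytest-style test function. The helper `normalize_org` and its field names are hypothetical and are not part of the baseline template; the point is that logic kept free of Spark and AWS Glue objects can be tested on any machine.

```python
# Hypothetical transformation helper plus a pytest-style test.
# Neither name comes from the baseline template; they only illustrate the idea.


def normalize_org(record: dict) -> dict:
    """Drop unused fields and rename keys of one organization record."""
    dropped = {"other_names", "identifiers"}
    renamed = {"id": "org_id", "name": "org_name"}
    return {
        renamed.get(key, key): value
        for key, value in record.items()
        if key not in dropped
    }


def test_normalize_org():
    raw = {"id": "o1", "name": "Senate", "identifiers": [], "type": "chamber"}
    assert normalize_org(raw) == {
        "org_id": "o1",
        "org_name": "Senate",
        "type": "chamber",
    }
```

Saved in a file whose name starts with `test_`, this would be picked up by `python3 -m pytest` without any extra configuration.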

Deploy

When data engineers develop a data integration pipeline, they code and test on a different copy of the product than the one that the end-users have access to. The environment that end-users use is called production, whereas other copies are said to be in the development or the pre-production environment.

Having separate build and production environments ensures that you can continue to use the data integration pipeline even while it’s being changed or upgraded. The deployment phase includes several tasks to move the latest build copy to the production environment, such as packaging, environment configuration, and installation.

The following components are deployed by the AWS CDK or AWS CloudFormation:

AWS resources
Data integration job scripts for AWS Glue

AWS CodePipeline helps you build a mechanism to automate deployments among different environments, including development, pre-production, and production. When you commit your code to AWS CodeCommit, CodePipeline automatically provisions AWS resources based on the CloudFormation templates included in the commit and uploads script files included in the commit to Amazon S3.

Maintain

Even after you deploy your solution to a production environment, it’s not the end of your project. You need to monitor the data integration pipeline continuously and keep maintaining and improving it. More specifically, you also need to fix bugs, resolve customer issues, and manage software changes. In addition, you need to monitor the overall system performance, security, and user experience to identify new ways to improve the existing data integration pipeline.

Solution overview

Typically, you have multiple accounts to manage and provision resources for your data pipeline. In this post, we assume the following three accounts:

Pipeline account – This hosts the end-to-end pipeline
Dev account – This hosts the data integration pipeline in the development environment
Prod account – This hosts the data integration pipeline in the production environment

If you want, you can use the same account and the same Region for all three.

To start applying this end-to-end development lifecycle model to your data platform easily and quickly, we prepared the baseline template aws-glue-cdk-baseline using the AWS CDK. The template is built on top of AWS CDK v2 and CDK Pipelines. It provisions two kinds of stacks:

AWS Glue app stack – This provisions the data integration pipeline: one in the dev account and one in the prod account
Pipeline stack – This provisions the Git repository and CI/CD pipeline in the pipeline account

The AWS Glue app stack provisions the data integration pipeline, including the following resources:

AWS Glue jobs
AWS Glue job scripts

The following diagram illustrates this architecture.

At the time of publishing of this post, the AWS CDK has two versions of the AWS Glue module: @aws-cdk/aws-glue and @aws-cdk/aws-glue-alpha, containing L1 constructs and L2 constructs, respectively. The sample AWS Glue app stack is defined using aws-glue-alpha, the L2 construct for AWS Glue, because it’s straightforward to define and manage AWS Glue resources. If you want to use the L1 construct, refer to Build, Test and Deploy ETL solutions using AWS Glue and AWS CDK based CI/CD pipelines.

The pipeline stack provisions the entire CI/CD pipeline, including the following resources:

The following diagram illustrates the pipeline workflow.

Every time the business requirement changes (such as adding data sources or changing data transformation logic), you make changes on the AWS Glue app stack and re-provision the stack to reflect your changes. This is done by committing your changes in the AWS CDK template to the CodeCommit repository, then CodePipeline reflects the changes on AWS resources using CloudFormation change sets.

In the following sections, we present the steps to set up the required environment and demonstrate the end-to-end development lifecycle.

Prerequisites

You need the following resources:

Initialize the project

To initialize the project, complete the following steps:

Clone the baseline template to your workplace:

$ git clone git@github.com:aws-samples/aws-glue-cdk-baseline.git
$ cd aws-glue-cdk-baseline

Create a Python virtual environment specific to the project on the client machine:

We use a virtual environment in order to isolate the Python environment for this project and not install software globally.

Activate the virtual environment according to your OS:
- On MacOS and Linux, use the following command:
```
$ source .venv/bin/activate
```
- On a Windows platform, use the following command:
```
% .venv\Scripts\activate.bat
```

After this step, the subsequent steps run within the bounds of the virtual environment on the client machine and interact with the AWS account as needed.

Install the required dependencies described in requirements.txt to the virtual environment:
```
$ pip install -r requirements.txt
$ pip install -r requirements-dev.txt
```

Edit the configuration file default-config.yaml based on your environments (replace each account ID with your own):

pipelineAccount:
  awsAccountId: 123456789101
  awsRegion: us-east-1

devAccount:
  awsAccountId: 123456789102
  awsRegion: us-east-1

prodAccount:
  awsAccountId: 123456789103
  awsRegion: us-east-1

Run pytest to initialize the snapshot test files by running the following command:
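Before deploying, it can help to sanity-check the account settings. The sketch below is not part of the baseline template: it assumes the configuration has already been loaded into a dictionary with the keys shown above, and the helper name `account_id` is hypothetical.

```python
import re

# Assumed shape: the dictionary a YAML loader would produce from the
# account sections of default-config.yaml shown above.
config = {
    "pipelineAccount": {"awsAccountId": 123456789101, "awsRegion": "us-east-1"},
    "devAccount": {"awsAccountId": 123456789102, "awsRegion": "us-east-1"},
    "prodAccount": {"awsAccountId": 123456789103, "awsRegion": "us-east-1"},
}


def account_id(section: dict) -> str:
    """Return the account ID as a 12-digit string, or raise ValueError."""
    value = str(section["awsAccountId"])
    if not re.fullmatch(r"\d{12}", value):
        raise ValueError(f"not a 12-digit AWS account ID: {value!r}")
    return value


if __name__ == "__main__":
    for name, section in config.items():
        print(name, account_id(section), section["awsRegion"])
```

One design note: an unquoted numeric value in YAML is typically loaded as an integer, so an account ID that starts with zero may lose its leading digit. A check like this would surface that early, and quoting the ID in the YAML file avoids the problem.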
```
$ python3 -m pytest --snapshot-update
```

Bootstrap your AWS environments

Run the following commands to bootstrap your AWS environments:

In the pipeline account, replace PIPELINE-ACCOUNT-NUMBER, REGION, and PIPELINE-PROFILE with your own values:

$ cdk bootstrap aws://<PIPELINE-ACCOUNT-NUMBER>/<REGION> --profile <PIPELINE-PROFILE> \
--cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess

In the dev account, replace PIPELINE-ACCOUNT-NUMBER, DEV-ACCOUNT-NUMBER, REGION, and DEV-PROFILE with your own values:

$ cdk bootstrap aws://<DEV-ACCOUNT-NUMBER>/<REGION> --profile <DEV-PROFILE> \
--cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
--trust <PIPELINE-ACCOUNT-NUMBER>

In the prod account, replace PIPELINE-ACCOUNT-NUMBER, PROD-ACCOUNT-NUMBER, REGION, and PROD-PROFILE with your own values:

$ cdk bootstrap aws://<PROD-ACCOUNT-NUMBER>/<REGION> --profile <PROD-PROFILE> \
--cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
--trust <PIPELINE-ACCOUNT-NUMBER>

When you use only one account for all environments, you can just run the cdk bootstrap command one time.

Deploy your AWS resources

Run the command using the pipeline account to deploy the resources defined in the AWS CDK baseline template:

$ cdk deploy --profile <PIPELINE-PROFILE>

This creates the pipeline stack in the pipeline account and the AWS Glue app stack in the development account.

When the cdk deploy command is completed, let’s verify the pipeline using the pipeline account.

On the CodePipeline console, navigate to GluePipeline. Then verify that GluePipeline has the following stages: Source, Build, UpdatePipeline, Assets, DeployDev, and DeployProd. Also verify that the stages Source, Build, UpdatePipeline, Assets, and DeployDev have succeeded, and DeployProd is pending. It can take about 15 minutes.

Now that the pipeline has been created successfully, you can also verify the AWS Glue app stack resource on the AWS CloudFormation console in the dev account.

At this step, the AWS Glue app stack is deployed only in the dev account. You can try to run the AWS Glue job ProcessLegislators to see how it works.

Configure your Git repository with CodeCommit

In an earlier step, you cloned the Git repository from GitHub. Although it’s possible to configure the AWS CDK template to work with GitHub, GitHub Enterprise, or Bitbucket, for this post, we use CodeCommit. If you prefer those third-party Git providers, configure the connections and edit pipeline_stack.py to define the variable source to use the target Git provider using CodePipelineSource.

Because you already ran the cdk deploy command, the CodeCommit repository has already been created with all the required code and related files. The first step is to set up access to CodeCommit. The next step is to clone the repository from the CodeCommit repository to your local machine. Run the following commands:

$ mkdir aws-glue-cdk-baseline-codecommit
$ cd aws-glue-cdk-baseline-codecommit
$ git clone ssh://git-codecommit.us-east-1.amazonaws.com/v1/repos/aws-glue-cdk-baseline

In the next step, we make changes in this local copy of the CodeCommit repository.

End-to-end development lifecycle

Now that the environment has been successfully created, you’re ready to start developing a data integration pipeline using this baseline template. Let’s walk through the end-to-end development lifecycle.

When you want to define your own data integration pipeline, you need to add more AWS Glue jobs and implement job scripts. For this post, let’s assume the use case is to add a new AWS Glue job with a new job script to read multiple S3 locations and join them.

Implement and test in your local environment

First, implement and test the AWS Glue job and its job script in your local environment using Visual Studio Code.

Set up your development environment by following the steps in Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container. The following steps are required in the context of this post:

Start Docker.
Pull the Docker image that has the local development environment using the AWS Glue ETL library:
```
$ docker pull public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01
```
Run the following command to define the AWS named profile name:
```
$ PROFILE_NAME="<DEV-PROFILE>"
```
Run the following command to make it available with the baseline template:
```
$ cd aws-glue-cdk-baseline/
$ WORKSPACE_LOCATION=$(pwd)
```

Run the Docker container:

$ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true \
--rm -p 4040:4040 -p 18080:18080 \
--name glue_pyspark public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01 pyspark

Start Visual Studio Code.
Choose Remote Explorer in the navigation pane, then choose the arrow icon of the workspace folder in the container public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01.

If the workspace folder is not shown, choose Open folder and select /home/glue_user/workspace.

Then you will see a view similar to the following screenshot.

Optionally, you can install AWS Toolkit for Visual Studio Code, and start Amazon CodeWhisperer to enable code recommendations powered by a machine learning model. For example, in aws_glue_cdk_baseline/job_scripts/process_legislators.py, you can put comments like “# Write a DataFrame in Parquet format to S3”, press the Enter key, then CodeWhisperer will recommend a code snippet similar to the following:

Now you install the required dependencies described in requirements.txt to the container environment.

Run the following commands in the terminal in Visual Studio Code:

$ pip install -r requirements.txt
$ pip install -r requirements-dev.txt

Implement the code.

Now let’s make the required changes for a new AWS Glue job here.

Edit the file aws_glue_cdk_baseline/glue_app_stack.py. Let’s add the following new code block after the existing job definition of ProcessLegislators in order to add the new AWS Glue job JoinLegislators:

        self.new_glue_job = glue.Job(self, "JoinLegislators",
            executable=glue.JobExecutable.python_etl(
                glue_version=glue.GlueVersion.V4_0,
                python_version=glue.PythonVersion.THREE,
                script=glue.Code.from_asset(
                    path.join(path.dirname(__file__), "job_scripts/join_legislators.py")
                )
            ),
            description="a new example PySpark job",
            default_arguments={
                "--input_path_orgs": config[stage]['jobs']['JoinLegislators']['inputLocationOrgs'],
                "--input_path_persons": config[stage]['jobs']['JoinLegislators']['inputLocationPersons'],
                "--input_path_memberships": config[stage]['jobs']['JoinLegislators']['inputLocationMemberships']
            },
            tags={
                "environment": self.environment,
                "artifact_id": self.artifact_id,
                "stack_id": self.stack_id,
                "stack_name": self.stack_name
            }
        )

Here, you added three job parameters for different S3 locations using the variable config. It’s the dictionary generated from default-config.yaml. In this baseline template, we use this central config file for managing parameters for all the Glue jobs in the structure <stage name>/jobs/<job name>/<parameter name>. In the following steps, you provide these locations through the AWS Glue job parameters.

Create a new job script called aws_glue_cdk_baseline/job_scripts/join_legislators.py:
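The lookup by stage, job name, and parameter name can be sketched in plain Python as follows. This is only an illustration of the config layout described above: the helper `job_arguments`, the `ARGUMENT_KEYS` mapping, the stage name `dev`, and the placeholder S3 paths are assumptions, not code from the baseline template.

```python
# Hypothetical excerpt of the dictionary generated from default-config.yaml,
# following the <stage name>/jobs/<job name>/<parameter name> layout.
config = {
    "dev": {
        "jobs": {
            "JoinLegislators": {
                "inputLocationOrgs": "s3://path_to_data_orgs",
                "inputLocationPersons": "s3://path_to_data_persons",
                "inputLocationMemberships": "s3://path_to_data_memberships",
            }
        }
    }
}

# Maps each Glue job argument to the config key that supplies its value.
ARGUMENT_KEYS = {
    "--input_path_orgs": "inputLocationOrgs",
    "--input_path_persons": "inputLocationPersons",
    "--input_path_memberships": "inputLocationMemberships",
}


def job_arguments(config: dict, stage: str, job_name: str, argument_keys: dict) -> dict:
    """Build a default_arguments dictionary for one job in one stage."""
    params = config[stage]["jobs"][job_name]
    missing = sorted(key for key in argument_keys.values() if key not in params)
    if missing:
        raise KeyError(f"{stage}/jobs/{job_name} is missing: {', '.join(missing)}")
    return {arg: params[key] for arg, key in argument_keys.items()}


if __name__ == "__main__":
    print(job_arguments(config, "dev", "JoinLegislators", ARGUMENT_KEYS))
```

Failing fast on a missing key at synthesis time is a deliberate choice in this sketch: a typo in the config file then shows up before deployment rather than as a failed job run.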

aws_glue_cdk_baseline/job_scripts/join_legislators.py:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions


class JoinLegislators:
    def __init__(self):
        params = []
        if '--JOB_NAME' in sys.argv:
            params.append('JOB_NAME')
            params.append('input_path_orgs')
            params.append('input_path_persons')
            params.append('input_path_memberships')
        args = getResolvedOptions(sys.argv, params)

        self.context = GlueContext(SparkContext.getOrCreate())
        self.job = Job(self.context)

        if 'JOB_NAME' in args:
            jobname = args['JOB_NAME']
            self.input_path_orgs = args['input_path_orgs']
            self.input_path_persons = args['input_path_persons']
            self.input_path_memberships = args['input_path_memberships']
        else:
            jobname = "test"
            self.input_path_orgs = "s3://awsglue-datasets/examples/us-legislators/all/organizations.json"
            self.input_path_persons = "s3://awsglue-datasets/examples/us-legislators/all/persons.json"
            self.input_path_memberships = "s3://awsglue-datasets/examples/us-legislators/all/memberships.json"
        self.job.init(jobname, args)
    
    def run(self):
        dyf = join_legislators(self.context, self.input_path_orgs, self.input_path_persons, self.input_path_memberships)
        df = dyf.toDF()
        df.printSchema()
        df.show()
        print(df.count())

def read_dynamic_frame_from_json(glue_context, path):
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={
            'paths': [path],
            'recurse': True
        },
        format="json"
    )

def join_legislators(glue_context, path_orgs, path_persons, path_memberships):
    orgs = read_dynamic_frame_from_json(glue_context, path_orgs)
    persons = read_dynamic_frame_from_json(glue_context, path_persons)
    memberships = read_dynamic_frame_from_json(glue_context, path_memberships)
    orgs = orgs.drop_fields(['other_names', 'identifiers']).rename_field('id', 'org_id').rename_field('name', 'org_name')
    dynamicframe_joined = Join.apply(orgs, Join.apply(persons, memberships, 'id', 'person_id'), 'org_id', 'organization_id').drop_fields(['person_id', 'org_id'])
    return dynamicframe_joined

if __name__ == '__main__':
    JoinLegislators().run()
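To make the join order in join_legislators easier to follow, here is a plain-Python sketch of the same two-step inner join on a few hand-made records. It does not use AWS Glue or Spark; the helper `inner_join` and the sample rows are invented for illustration.

```python
def inner_join(left, right, left_key, right_key):
    """Inner-join two lists of dicts where left[left_key] == right[right_key]."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in index.get(left_row[left_key], [])
    ]


# Invented sample records; orgs already has 'id' and 'name' renamed.
orgs = [{"org_id": "o1", "org_name": "Senate"}]
persons = [{"id": "p1", "name": "Ada"}, {"id": "p2", "name": "Bob"}]
memberships = [{"person_id": "p1", "organization_id": "o1"}]

# Step 1: persons joined to memberships on id == person_id.
# Step 2: orgs joined to that result on org_id == organization_id.
history = inner_join(persons, memberships, "id", "person_id")
joined = inner_join(orgs, history, "org_id", "organization_id")

# Drop the redundant join keys, as the job script does.
result = [
    {k: v for k, v in row.items() if k not in ("person_id", "org_id")}
    for row in joined
]

if __name__ == "__main__":
    print(result)
```

Because both joins are inner joins, a person without a membership (Bob in this sample) does not appear in the result.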

Create a new unit test script for the new AWS Glue job called aws_glue_cdk_baseline/job_scripts/tests/test_join_legislators.py:

import pytest
import sys
import join_legislators
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

@pytest.fixture(scope="module", autouse=True)
def glue_context():
    sys.argv.append('--JOB_NAME')
    sys.argv.append('test_count')

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    context = GlueContext(SparkContext.getOrCreate())
    job = Job(context)
    job.init(args['JOB_NAME'], args)

    yield(context)

def test_counts(glue_context):
    dyf = join_legislators.join_legislators(glue_context, 
        "s3://awsglue-datasets/examples/us-legislators/all/organizations.json",
        "s3://awsglue-datasets/examples/us-legislators/all/persons.json", 
        "s3://awsglue-datasets/examples/us-legislators/all/memberships.json")
    assert dyf.toDF().count() == 10439

In default-config.yaml, add the following under prod and dev:

    JoinLegislators:
      inputLocationOrgs: "s3://awsglue-datasets/examples/us-legislators/all/organizations.json"
      inputLocationPersons: "s3://awsglue-datasets/examples/us-legislators/all/persons.json"
      inputLocationMemberships: "s3://awsglue-datasets/examples/us-legislators/all/memberships.json"

Add the following under "jobs" in the variable config in tests/unit/test_glue_app_stack.py, tests/unit/test_pipeline_stack.py, and tests/snapshot/test_snapshot_glue_app_stack.py (no need to replace S3 locations):

,
            "JoinLegislators": {
                "inputLocationOrgs": "s3://path_to_data_orgs",
                "inputLocationPersons": "s3://path_to_data_persons",
                "inputLocationMemberships": "s3://path_to_data_memberships"
            }

Choose Run at the top right to run the individual job scripts.

If the Run button is not shown, install Python into the container through Extensions in the navigation pane.

For local unit testing, run the following command in the terminal in Visual Studio Code:
```
$ cd aws_glue_cdk_baseline/job_scripts/
$ python3 -m pytest
```

Then you can verify that the newly added unit test passed successfully.

Run pytest to initialize the snapshot test files by running the following command:
```
$ cd ../../
$ python3 -m pytest --snapshot-update
```

Deploy to the development environment

Complete the following steps to deploy the AWS Glue app stack to the development environment and run integration tests there:

Set up access to CodeCommit.

Commit and push your changes to the CodeCommit repo:

$ git add .
$ git commit -m "Add the second Glue job"
$ git push

You can see that the pipeline is successfully triggered.

Integration test

There is nothing required for running the integration test for the newly added AWS Glue job. The integration test script integ_test_glue_app_stack.py runs all the jobs that include a specific tag, then verifies the state and duration of each run. If you want to change the condition or the threshold, you can edit the assertions at the end of the integ_test_glue_job method.
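The kind of check described here, one on the final state and one on the duration, can be sketched as below. This is not the code in integ_test_glue_app_stack.py: the function name and the 600-second threshold are assumptions, and the dictionary keys follow the `JobRun` structure returned by the AWS Glue `GetJobRun` API (`JobRunState`, and `ExecutionTime` in seconds).

```python
# Hypothetical state-and-duration check for one AWS Glue job run.
# The threshold below is an assumed value, not the one used by the template.


def job_run_problems(job_run: dict, max_seconds: int = 600) -> list:
    """Return human-readable problems; an empty list means the run passed."""
    problems = []
    state = job_run.get("JobRunState")
    if state != "SUCCEEDED":
        problems.append(f"unexpected state: {state}")
    duration = job_run.get("ExecutionTime", 0)
    if duration > max_seconds:
        problems.append(f"ran for {duration}s, threshold is {max_seconds}s")
    return problems


if __name__ == "__main__":
    sample = {"JobRunState": "SUCCEEDED", "ExecutionTime": 95}
    print(job_run_problems(sample))
```

Returning a list instead of asserting on the first failure lets a test report every problem with a run at once; a test can then simply assert that the list is empty.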

Deploy to the production environment

Complete the following steps to deploy the AWS Glue app stack to the production environment:

On the CodePipeline console, navigate to GluePipeline.
Choose Review under the DeployProd stage.
Choose Approve.

Wait for the DeployProd stage to complete, then you can verify the AWS Glue app stack resource in the prod account.

Clean up

To clean up your resources, complete the following steps:

Run the following command using the pipeline account:
```
$ cdk destroy --profile <PIPELINE-PROFILE>
```
Delete the AWS Glue app stack in the dev account and prod account.

Conclusion

In this post, you learned how to define the development lifecycle for data integration and how software engineers and data engineers can design an end-to-end development lifecycle using AWS Glue, including development, testing, and CI/CD, through a sample AWS CDK template. You can get started building your own end-to-end development lifecycle for your workload using AWS Glue.

About the author

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.
