In modern data architectures, datasets are combined across an organization using a variety of purpose-built services to unlock insights. As a result, data governance becomes a key component for data consumers and producers to know that their data-driven decisions are based on trusted and accurate datasets. One aspect of data governance is data lineage, which captures the flow of data as it goes through various systems and allows consumers to understand how a dataset was derived.
To capture data lineage consistently across various analytical services, you need to use a common lineage model and a robust job orchestration that is able to tie together diverse data flows. One possible solution is the open-source OpenLineage project. It provides a technology-agnostic metadata model for capturing data lineage and integrates with widely used tools. For job orchestration, it integrates with Apache Airflow, which you can run on AWS conveniently through the managed service Amazon Managed Workflows for Apache Airflow (Amazon MWAA). OpenLineage provides a plugin for Apache Airflow that extracts data lineage from Directed Acyclic Graphs (DAGs).
In this post, we show how to get started with data lineage on AWS using OpenLineage. We provide a step-by-step configuration guide for the openlineage-airflow plugin on Amazon MWAA. Additionally, we share an AWS Cloud Development Kit (AWS CDK) project that deploys a pre-configured demo environment for evaluating and experiencing OpenLineage first-hand.
OpenLineage on Apache Airflow
In the following example, Airflow turns OLTP data into a star schema on Amazon Redshift Serverless.
After staging and preparing source data from Amazon Simple Storage Service (Amazon S3), fact and dimension tables are eventually created. For this, Airflow orchestrates the execution of SQL statements that create and populate tables on Redshift Serverless.
The openlineage-airflow plugin collects metadata about the creation of datasets and the dependencies between them. This allows us to move from a jobs-centric view of Airflow to a datasets-centric view, improving the observability of workflows.
The following screenshot shows parts of the captured lineage for the previous example. It's displayed in Marquez, an open-source metadata service for the collection and visualization of data lineage with support for the OpenLineage standard. In Marquez, you can analyze the upstream datasets and transformations that eventually create the user dimension table on the right.
The example in this post is based on SQL and Amazon Redshift. OpenLineage also supports other transformation engines and data stores such as Apache Spark and dbt.
Solution overview
The following diagram shows the AWS setup required to capture data lineage using OpenLineage.
The workflow includes the following components:
- The openlineage-airflow plugin is configured on Airflow as a lineage backend. Metadata about the DAG runs is passed by the Airflow core to the plugin, which converts it into OpenLineage format and sends it to an external metadata store. In our demo setup, we use Marquez as the metadata store.
- The openlineage-airflow plugin receives its configuration from environment variables. To populate these variables on Amazon MWAA, a custom Airflow plugin is used. First, the plugin reads source values from AWS Secrets Manager. Then, it creates environment variables.
- Secrets Manager is configured as a secrets backend. Typically, this kind of configuration is stored in Airflow's native metadata database. However, this approach has limitations. For instance, with multiple Airflow environments, you need to track and store credentials across all of them, and updating credentials requires you to update every environment. With a secrets backend, you can centralize configuration.
- For demo purposes, we collect data lineage from a data pipeline that creates a star schema in Redshift Serverless.
In the following sections, we walk you through the steps for the end-to-end configuration.
Install the openlineage-airflow plugin
Specify the following dependency in the requirements.txt file of the Amazon MWAA environment. Note that the latest Airflow version currently available on Amazon MWAA is 2.4.3; for this post, use the matching version 0.19.2 of the plugin:
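Based on the version constraint above, the entry looks like this:

```
openlineage-airflow==0.19.2
```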
For more details on installing Python dependencies on Amazon MWAA, refer to Installing Python dependencies.
For Airflow < 2.3, configure the plugin's lineage backend through the following configuration overrides on the Amazon MWAA environment, and load it directly at Airflow start by disabling lazy loading of plugins:
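A minimal sketch of these overrides, with the lineage backend class path taken from the OpenLineage documentation:

```
core.lazy_load_plugins: False
lineage.backend: openlineage.lineage_backend.OpenLineageBackend
```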
For more information on configuration overrides, refer to Configuration options overview.
Configure the Secrets Manager backend with Amazon MWAA
Using Secrets Manager as a secrets backend for Amazon MWAA is straightforward. First, grant the Amazon MWAA execution role read permission to Secrets Manager. You can use the following policy template as a starting point:
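A minimal sketch of such a policy; it assumes your secrets use the airflow/ name prefix, and the Region and account ID placeholders need to be replaced with your own values:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret"
            ],
            "Resource": "arn:aws:secretsmanager:<region>:<account-id>:secret:airflow/*"
        }
    ]
}
```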
Second, configure Secrets Manager as a backend in Amazon MWAA through the following configuration overrides:
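Assuming the airflow/connections and airflow/variables prefixes used later in this post, the overrides are:

```
secrets.backend: airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
secrets.backend_kwargs: {"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}
```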
For more information on configuring a secrets backend in Amazon MWAA, refer to Configuring an Apache Airflow connection using an AWS Secrets Manager secret and Move your Apache Airflow connections and variables to AWS Secrets Manager.
Deploy a custom envvar plugin to Amazon MWAA
Apache Airflow has a built-in plugin manager through which it can be extended with custom functionality. In our case, this functionality is to populate OpenLineage-specific environment variables based on values in Secrets Manager. Natively, Amazon MWAA allows environment variables with the prefix AIRFLOW__, but the openlineage-airflow plugin expects the prefix OPENLINEAGE__.
The following Python code is used in the plugin. We assume the file is called envvar_plugin.py:
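A minimal sketch of such a plugin, assuming the Marquez URL is exposed as the Airflow variable OPENLINEAGE_URL (which the secrets backend configured earlier resolves from the secret airflow/variables/OPENLINEAGE_URL):

```python
import os

from airflow.models import Variable
from airflow.plugins_manager import AirflowPlugin

# Read the Marquez endpoint from the Airflow variable OPENLINEAGE_URL and
# expose it under the prefix the openlineage-airflow plugin expects.
os.environ["OPENLINEAGE_URL"] = Variable.get("OPENLINEAGE_URL", default_var="")


class EnvVarPlugin(AirflowPlugin):
    # Registering an (otherwise empty) plugin makes Airflow import this module
    # at startup, which runs the environment variable assignment above.
    name = "env_var_plugin"
```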
Amazon MWAA has a mechanism to install a plugin through a zip archive. You zip your code, upload the archive to an S3 bucket, and pass the URL of the file to Amazon MWAA:
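For example, assuming the file name from the previous step:

```
zip plugins.zip envvar_plugin.py
```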
Upload plugins.zip to an S3 bucket and configure the URL in Amazon MWAA. The following screenshot shows the configuration via the Amazon MWAA console.
For more information on installing custom plugins on Amazon MWAA, refer to Creating a custom plugin that generates runtime environment variables.
Configure connectivity between the openlineage-airflow plugin and Marquez
As a last step, store the URL to Marquez in Secrets Manager. For this, create a secret called airflow/variables/OPENLINEAGE_URL with the value <protocol>://<hostname/ip>:<port> (for example, https://marquez.mysite.com:5000).
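If you manage secrets via the AWS CLI, a command along these lines creates the secret (using the example value from above):

```
aws secretsmanager create-secret \
    --name airflow/variables/OPENLINEAGE_URL \
    --secret-string "https://marquez.mysite.com:5000"
```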
If you need to spin up Marquez on AWS, you have several hosting options, including running it on Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Compute Cloud (Amazon EC2). Refer to Running Marquez on AWS, or check out our infrastructure template in the next section to deploy Marquez on AWS.
Deploy with an AWS CDK-based solution template
If you want to set up a demo infrastructure for all of the above in a single step, you can use the following template based on the AWS CDK.
The template has the following prerequisites:
Complete the following steps to deploy the template (a consolidated command listing follows the list):
- Clone the GitHub repository and install the Python dependencies. Bootstrap the AWS CDK if required.
- Update the value of the variable EXTERNAL_IP in constants.py to your outbound IP for connecting to the internet. This configures the security groups so that you can access Marquez while other clients are blocked. constants.py is found in the root folder of the cloned repository.
- Deploy the VPC_S3 stack to provision a new VPC dedicated to this solution, as well as the security groups that are used by the different components.
It creates a new S3 bucket and uploads the source raw data based on the TICKIT sample database. This serves as the landing area from the OLTP database. We then need to parse the metadata of these files through an AWS Glue crawler, which facilitates the native integration between Amazon Redshift and the S3 data lake.
- Deploy the lineage stack to create an EC2 instance that hosts Marquez.
Access the Marquez web UI through https://{ec2.public_dns_name}:3000/. This URL is also available as part of the AWS CDK outputs for the lineage stack.
- Deploy the Amazon Redshift stack to create a Redshift Serverless endpoint.
- Deploy the Amazon MWAA stack to create an Amazon MWAA environment.
You can access the Amazon MWAA UI through the URL provided in the AWS CDK output.
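For reference, the deployment steps above map to a command sequence along the following lines. This is a sketch: the repository URL and the stack names are assumptions for illustration, so use the stack names defined in the repository's CDK app:

```
# Assumed repository location; the project directory name matches the cleanup step below.
git clone https://github.com/aws-samples/aws-mwaa-openlineage.git
cd aws-mwaa-openlineage
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cdk bootstrap            # only needed once per account and Region

cdk deploy vpc-s3        # hypothetical stack name: VPC, security groups, S3 landing bucket
cdk deploy marquez       # hypothetical stack name: EC2 instance hosting Marquez (lineage stack)
cdk deploy redshift      # hypothetical stack name: Redshift Serverless endpoint
cdk deploy mwaa          # hypothetical stack name: Amazon MWAA environment
```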
Test a sample data pipeline
On Amazon MWAA, you can see an example data pipeline deployed that consists of two DAGs. It builds a star schema on top of the TICKIT sample database. One DAG is responsible for loading data from the S3 data lake into an Amazon Redshift staging layer; the second DAG loads data from the staging layer into the dimensional model.
Open the Amazon MWAA UI through the URL obtained in the deployment steps and trigger the following DAGs: rs_source_to_staging and rs_staging_to_dm. As part of the run, the lineage metadata is sent to Marquez.
After the DAGs have run, open the Marquez URL obtained in the deployment steps. In Marquez, you can find the lineage metadata for the computed star schema and related data assets on Amazon Redshift.
Clean up
Delete the AWS CDK stacks to avoid ongoing charges for the resources that you created. Run the following command in the aws-mwaa-openlineage project directory so that all resources are undeployed:
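Assuming the same virtual environment as during deployment:

```
cdk destroy --all
```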
Summary
In this post, we showed you how to automate data lineage with OpenLineage on Amazon MWAA. As part of this, we covered how to install and configure the openlineage-airflow plugin on Amazon MWAA. Additionally, we provided a ready-to-use infrastructure template for a complete demo environment.
We encourage you to explore what else can be achieved with OpenLineage. A job orchestrator like Apache Airflow is only one piece of a data platform, and not all possible data lineage can be captured on it. We recommend exploring OpenLineage's integrations with other platforms like Apache Spark or dbt. For more information, refer to Integrations.
Additionally, we recommend you visit the AWS Big Data Blog for other useful blog posts on Amazon MWAA and data governance on AWS.
About the Authors
Stephen Said is a Senior Solutions Architect and works with Digital Native Businesses. His areas of interest are data analytics, data platforms, and cloud-native software engineering.
Vishwanatha Nayak is a Senior Solutions Architect at AWS. He works with large enterprise customers, helping them design and build secure, cost-effective, and reliable modern data platforms using the AWS Cloud. He is passionate about technology and likes sharing knowledge through blog posts and Twitch sessions.
Paul Villena is an Analytics Solutions Architect with expertise in building modern data and analytics solutions to drive business value. He works with customers to help them harness the power of the cloud. His areas of interest are infrastructure as code, serverless technologies, and coding in Python.