
Automated data governance with AWS Glue Data Quality, sensitive data detection, and AWS Lake Formation


Data governance is the process of ensuring the integrity, availability, usability, and security of an organization's data. Due to the volume, velocity, and variety of data being ingested into data lakes, it can get challenging to develop and maintain policies and procedures to ensure data governance at scale for your data lake. Data confidentiality and data quality are the two essential themes for data governance. Data confidentiality refers to the protection and control of sensitive and private information to prevent unauthorized access, especially when dealing with personally identifiable information (PII). Data quality focuses on maintaining accurate, reliable, and consistent data across the organization. Poor data quality can lead to erroneous decisions, inefficient operations, and compromised business performance.

Companies need to ensure data confidentiality is maintained throughout the data pipeline and that high-quality data is available to consumers in a timely manner. A lot of this effort is manual, where data owners and data stewards define and apply the policies statically up front for each dataset in the lake. This gets tedious and delays data adoption across the enterprise.

In this post, we showcase how to use AWS Glue with AWS Glue Data Quality, sensitive data detection transforms, and AWS Lake Formation tag-based access control to automate data governance.

Solution overview

Let's consider a fictional company, OkTank. OkTank has multiple ingestion pipelines that populate multiple tables in the data lake. OkTank wants to ensure the data lake is governed with data quality rules and access policies in place at all times.

Multiple personas consume data from the data lake, such as business leaders, data scientists, data analysts, and data engineers. For each set of users, a different level of governance is required. For example, business leaders need top-quality and highly accurate data, data scientists cannot see PII data and need data within an acceptable quality range for their model training, and data engineers can see all data except PII.

Currently, these requirements are hard-coded and managed manually for each set of users. OkTank wants to scale this and is looking for ways to control governance in an automated way. Primarily, they are looking for the following features:

  • When new data and tables get added to the data lake, the governance policies (data quality checks and access controls) are automatically applied to them. Unless the data is certified for consumption, it shouldn't be accessible to end-users. For example, they want to ensure basic data quality checks are applied on all new tables and access to the data is granted based on the data quality score.
  • Due to changes in source data, the data profile of existing data lake tables may drift. It's required to ensure the governance is met as defined. For example, the system should automatically mark columns as sensitive if sensitive data is detected in a column that was earlier marked as public and was available publicly to users. The system should then hide the column from unauthorized users accordingly.

For the purpose of this post, the following governance policies are defined:

  • No PII data should exist in tables or columns tagged as public.
  • If a column has any PII data, the column should be marked as sensitive. The table should then also be marked sensitive.
  • The following data quality rules should be applied to all tables:
    • All tables should have a minimum set of columns: data_key, data_load_date, and data_location.
    • data_key is a key column and should meet the key requirements of being unique and complete.
    • data_location should match locations defined in a separate reference (base) table.
    • The data_load_date column should be complete.
  • User access to tables is managed as per the following table.
User | Can Access Sensitive Tables | Can Access Sensitive Columns | Minimum Data Quality Threshold Needed to Consume Data
Category 1 | Yes | Yes | 100%
Category 2 | Yes | No | 50%
Category 3 | No | No | 0%

In this post, we use the AWS Glue Data Quality and sensitive data detection features. We also use Lake Formation tag-based access control to manage access at scale.

The following diagram illustrates the solution architecture.

The governance requirements highlighted in the previous table are translated to the following Lake Formation LF-Tags.

IAM User | LF-Tag: tbl_class | LF-Tag: col_class | LF-Tag: dq_tag
Category 1 | sensitive, public | sensitive, public | DQ100
Category 2 | sensitive, public | public | DQ100, DQ90, DQ50_80, DQ80_90
Category 3 | public | public | DQ90, DQ100, DQ_LT_50, DQ50_80, DQ80_90
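The CloudFormation stack used later in this post creates these LF-Tags for you. For reference only, a minimal boto3 sketch of how tags like these could be bootstrapped, assuming Lake Formation admin credentials (the keys and values mirror the table above):

import boto3

# Sketch only: the CloudFormation stack in this post already creates these tags.
lf = boto3.client("lakeformation")

lf_tags = {
    "tbl_class": ["sensitive", "public"],
    "col_class": ["sensitive", "public"],
    "dq_tag": ["DQ100", "DQ90", "DQ80_90", "DQ50_80", "DQ_LT_50"],
}

for key, values in lf_tags.items():
    # create_lf_tag fails if the tag key already exists in the account.
    lf.create_lf_tag(TagKey=key, TagValues=values)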

This post uses AWS Step Functions to orchestrate the governance jobs, but you can use any other orchestration tool of choice. To simulate data ingestion, we manually place the files in an Amazon Simple Storage Service (Amazon S3) bucket. In this post, we trigger the Step Functions state machine manually for ease of understanding. In practice, you can integrate or invoke the jobs as part of a data ingestion pipeline, via event triggers like an AWS Glue crawler or Amazon S3 events, or schedule them as needed.

In this post, we use an AWS Glue database named oktank_autogov_temp and a target table named customer on which we apply the governance rules. We use AWS CloudFormation to provision the resources. AWS CloudFormation lets you model, provision, and manage AWS and third-party resources by treating infrastructure as code.

Prerequisites

Complete the following prerequisite steps:

  1. Identify an AWS Region in which you want to create the resources and ensure you use the same Region throughout the setup and verifications.
  2. Have a Lake Formation administrator role to run the CloudFormation template and grant permissions.

Sign in to the Lake Formation console and add yourself as a Lake Formation data lake administrator if you aren't already an admin. If you are setting up Lake Formation for the first time in your Region, you can do this in the pop-up window that appears when you connect to the Lake Formation console and select the desired Region.

Otherwise, you can add data lake administrators by choosing Administrative roles and tasks in the navigation pane on the Lake Formation console and choosing Add administrators. Then select Data lake administrator, identify your users and roles, and choose Confirm.

Deploy the CloudFormation stack

Run the provided CloudFormation stack to create the solution resources.

You need to provide a unique bucket name and specify passwords for the three users reflecting the three different user personas (Category 1, Category 2, and Category 3) that we use for this post.

The stack provisions an S3 bucket to store the dummy data, AWS Glue scripts, results of sensitive data detection, and Amazon Athena query results in their respective folders.

The stack copies the AWS Glue scripts into the scripts folder and creates two AWS Glue jobs, Data-Quality-PII-Checker_Job and LF-Tag-Handler_Job, pointing to the corresponding scripts.

The AWS Glue job Data-Quality-PII-Checker_Job applies the data quality rules and publishes the results. It also checks for sensitive data in the columns. In this post, we check for the PERSON_NAME and EMAIL data types. If any columns with sensitive data are detected, it persists the sensitive data detection results to the S3 bucket.
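The exact script ships with the stack; as an illustration, a sensitive data detection step in a Glue job typically looks like the following sketch (database and table names follow this post; the output column name is an assumption):

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglueml.transforms import EntityDetector

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the target table from the Data Catalog.
customer_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="oktank_autogov_temp", table_name="customer"
)

# Scan for the two entity types this post checks; detection results are
# appended to each row under the "DetectedEntities" column.
detected_dyf = EntityDetector().detect(
    customer_dyf, ["PERSON_NAME", "EMAIL"], "DetectedEntities"
)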

AWS Glue Data Quality uses Data Quality Definition Language (DQDL) to author the data quality rules.

The data quality requirements as defined earlier in this post are written as the following DQDL in the script:

Rules = [
    ReferentialIntegrity "data_location" "reference.data_location" = 1.0,
    IsPrimaryKey "data_key",
    ColumnExists "data_load_date",
    IsComplete "data_load_date"
]
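Inside the job, this ruleset is evaluated against the customer table, with the base table supplied as the additional data source that the ReferentialIntegrity rule references by the alias reference. A hedged sketch of that step follows; the evaluation context name DataQuality_BasicChecks matches the result set referenced later in this post, but other details may differ from the stack's actual script:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality

glue_context = GlueContext(SparkContext.getOrCreate())

customer_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="oktank_autogov_temp", table_name="customer"
)
base_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="oktank_autogov_temp", table_name="base"
)

ruleset = """Rules = [
    ReferentialIntegrity "data_location" "reference.data_location" = 1.0,
    IsPrimaryKey "data_key",
    ColumnExists "data_load_date",
    IsComplete "data_load_date"
]"""

# Evaluate the rules and publish the results so they show up on the
# job's Data quality tab.
dq_results = EvaluateDataQuality().process_rows(
    frame=customer_dyf,
    additional_data_sources={"reference": base_dyf},
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "DataQuality_BasicChecks",
        "enableDataQualityResultsPublishing": True,
    },
)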

The following screenshot shows a sample result from the job after it runs. You can see this after you trigger the Step Functions workflow in subsequent steps. To check the results, on the AWS Glue console, choose ETL jobs and choose the job called Data-Quality-PII-Checker_Job. Then navigate to the Data quality tab to view the results.

The AWS Glue job LF-Tag-Handler_Job fetches the data quality metrics published by Data-Quality-PII-Checker_Job. It checks the status of the DataQuality_PIIColumns result. It gets the list of sensitive column names from the sensitive data detection file created by the Data-Quality-PII-Checker_Job and tags those columns as sensitive. The rest of the columns are tagged as public. It also tags the table as sensitive if sensitive columns are detected. The table is marked as public if no sensitive columns are detected.
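Conceptually, the tagging step boils down to a few Lake Formation API calls. A simplified sketch follows; the hard-coded column names are the ones detected later in this post, whereas the actual job reads the list from the detection results file in S3:

import boto3

lf = boto3.client("lakeformation")

# In the real job, this list is read from the sensitive data detection
# results that Data-Quality-PII-Checker_Job wrote to S3.
sensitive_columns = ["customer_email", "customer_name"]

# Mark the detected columns as sensitive.
lf.add_lf_tags_to_resource(
    Resource={
        "TableWithColumns": {
            "DatabaseName": "oktank_autogov_temp",
            "Name": "customer",
            "ColumnNames": sensitive_columns,
        }
    },
    LFTags=[{"TagKey": "col_class", "TagValues": ["sensitive"]}],
)

# Any sensitive column makes the table itself sensitive.
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "oktank_autogov_temp", "Name": "customer"}},
    LFTags=[{"TagKey": "tbl_class", "TagValues": ["sensitive"]}],
)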

The job also checks the data quality score for the DataQuality_BasicChecks result set. It maps the data quality score to tags as shown in the following table and applies the corresponding tag on the table.

Data Quality Score | Data Quality Tag
100% | DQ100
90-100% | DQ90
80-90% | DQ80_90
50-80% | DQ50_80
Less than 50% | DQ_LT_50
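Expressed in code, this mapping is a simple threshold ladder. The boundary handling below is an assumption, because the post doesn't state which bucket an exact boundary score (for example, exactly 90%) falls into:

def dq_score_to_tag(score: float) -> str:
    """Map a data quality score (0.0-1.0) to a dq_tag value per the table above."""
    if score >= 1.0:
        return "DQ100"
    if score >= 0.9:
        return "DQ90"
    if score >= 0.8:
        return "DQ80_90"
    if score >= 0.5:
        return "DQ50_80"
    return "DQ_LT_50"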

The CloudFormation stack copies some mock data to the data folder and registers this location under AWS Lake Formation Data lake locations so Lake Formation can govern access to the location using the service-linked role for Lake Formation.

The customer subfolder contains the initial customer dataset for the table customer. The base subfolder contains the base dataset, which we use to check referential integrity as part of the data quality checks. The column data_location in the customer table should match locations defined in this base table.

The stack also copies some additional mock data to the bucket under the data-v1 folder. We use this data to simulate data quality issues.

It also creates the following resources:

  • An AWS Glue database called oktank_autogov_temp and two tables under the database:
    • customer – This is our target table on which we will be governing access based on data quality rules and PII checks.
    • base – This is the base table that holds the reference data. One of the data quality rules checks that the customer data always adheres to locations present in the base table.
  • AWS Identity and Access Management (IAM) users and roles:
    • DataLakeUser_Category1 – The data lake user corresponding to the Category 1 user. This user should be able to access sensitive data but needs 100% accurate data.
    • DataLakeUser_Category2 – The data lake user corresponding to the Category 2 user. This user should not be able to access sensitive columns in the table. It needs more than 50% accurate data.
    • DataLakeUser_Category3 – The data lake user corresponding to the Category 3 user. This user should not be able to access tables containing sensitive data. Data quality can be 0%.
    • GlueServiceDQRole – The role for the data quality and sensitive data detection job.
    • GlueServiceLFTaggerRole – The role for the LF-Tags handler job for applying the tags to the table.
    • StepFunctionRole – The Step Functions role for triggering the AWS Glue jobs.
  • Lake Formation LF-Tag keys and values:
    • tbl_class – sensitive, public
    • dq_tag – DQ100, DQ90, DQ80_90, DQ50_80, DQ_LT_50
    • col_class – sensitive, public
  • A Step Functions state machine named AutoGovMachine that you use to trigger the runs for the AWS Glue jobs to check data quality and update the LF-Tags.
  • Athena workgroups named auto_gov_blog_workgroup_temporary_user1, auto_gov_blog_workgroup_temporary_user2, and auto_gov_blog_workgroup_temporary_user3. These workgroups point to different Athena query result locations for each user. Each user is granted access to the corresponding query result location only, which ensures a specific user doesn't access the query results of other users. You should switch to the specific workgroup to run queries in Athena as part of the test for the specific user.

The CloudFormation stack generates the following outputs. Take note of the values of the IAM users to use in subsequent steps.

Grant permissions

After you launch the CloudFormation stack, complete the following steps:

  1. On the Lake Formation console, under Permissions, choose Data lake permissions in the navigation pane.
  2. Search for the database oktank_autogov_temp and table customer.
  3. If IAMAllowedPrincipals access is present, select it and choose Revoke.
  4. Choose Revoke again to revoke the permissions.

Category 1 users can access all data except when the data quality score of the table is below 100%. Therefore, we grant the user the necessary permissions.

  1. Under Permissions in the navigation pane, choose Data lake permissions.
  2. Search for database oktank_autogov_temp and table customer.
  3. Choose Grant.
  4. Select IAM users and roles and choose the value for UserCategory1 from your CloudFormation stack output.
  5. Under LF-Tags or catalog resources, choose Add LF-Tag key-value pair.
  6. Add the following key-value pairs:
    • For the col_class key, add the values public and sensitive.
    • For the tbl_class key, add the values public and sensitive.
    • For the dq_tag key, add the value DQ100.
  7. For Table permissions, select Select.
  8. Choose Grant.

Category 2 users can't access sensitive columns. They can access tables with a data quality score above 50%.

  1. Repeat the preceding steps to grant the appropriate permissions in Lake Formation to UserCategory2:
    • For the col_class key, add the value public.
    • For the tbl_class key, add the values public and sensitive.
    • For the dq_tag key, add the values DQ50_80, DQ80_90, DQ90, and DQ100.
  2. For Table permissions, select Select.
  3. Choose Grant.

Category 3 users can't access tables that contain any sensitive columns. Such tables are marked as sensitive by the system. They can access tables with any data quality score.

  1. Repeat the preceding steps to grant the appropriate permissions in Lake Formation to UserCategory3:
    • For the col_class key, add the value public.
    • For the tbl_class key, add the value public.
    • For the dq_tag key, add the values DQ_LT_50, DQ50_80, DQ80_90, DQ90, and DQ100.
  2. For Table permissions, select Select.
  3. Choose Grant.

You can verify the LF-Tag permissions assigned in Lake Formation by navigating to the Data lake permissions page and searching for the resource type LF-Tag expression.
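If you prefer to script these grants instead of clicking through the console, the boto3 equivalent of the Category 2 grant might look like the following sketch (the principal ARN and account ID are placeholders; take the real user ARN from your CloudFormation outputs):

import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on tables whose LF-Tags match this expression.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/DataLakeUser_Category2"
    },
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [
                {"TagKey": "col_class", "TagValues": ["public"]},
                {"TagKey": "tbl_class", "TagValues": ["public", "sensitive"]},
                {"TagKey": "dq_tag", "TagValues": ["DQ50_80", "DQ80_90", "DQ90", "DQ100"]},
            ],
        }
    },
    Permissions=["SELECT"],
)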

Test the solution

Now we can test the workflow. We test three different use cases in this post. You will notice how the permissions on the tables change based on the values of the LF-Tags applied to the customer table and its columns. We use Athena to query the tables.

Use case 1

In this first use case, a new table was created on the lake and new data was ingested to the table. The data file cust_feedback_v0.csv was copied to the data/customer location in the S3 bucket. This simulates new data ingestion on a new table called customer.

Lake Formation doesn't allow any users to access this table currently. To test this scenario, complete the following steps:

  1. Sign in to the Athena console with the UserCategory1 user.
  2. Switch the workgroup to auto_gov_blog_workgroup_temporary_user1 in the Athena query editor.
  3. Choose Acknowledge to accept the workgroup settings.
  4. Run the following query in the query editor:

select * from "oktank_autogov_temp"."customer" limit 10

  5. On the Step Functions console, run the AutoGovMachine state machine.
  6. In the Input – optional section, use the following JSON and replace the BucketName value with the bucket name you used for the CloudFormation stack earlier (for this post, we use auto-gov-blog):

{
  "Comment": "Auto Governance with AWS Glue and AWS LakeFormation",
  "BucketName": "<Replace with your bucket name>"
}
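In an automated pipeline, you could start the same execution programmatically instead of through the console; a sketch with boto3 (the state machine ARN is a placeholder for the one in your account and Region):

import json

import boto3

sfn = boto3.client("stepfunctions")

# Start the AutoGovMachine run with the same input as the console example.
sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:111122223333:stateMachine:AutoGovMachine",
    input=json.dumps(
        {
            "Comment": "Auto Governance with AWS Glue and AWS LakeFormation",
            "BucketName": "auto-gov-blog",  # replace with your bucket name
        }
    ),
)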

The state machine triggers the AWS Glue jobs to check data quality on the table and apply the corresponding LF-Tags.

  7. You can check the LF-Tags applied on the table and the columns. To do so, when the state machine is complete, sign in to Lake Formation with the admin role used earlier to grant permissions.
  8. Navigate to the table customer under the oktank_autogov_temp database and choose Edit LF-Tags to validate the tags applied on the table.

You can also validate that the columns customer_email and customer_name are tagged as sensitive for the col_class LF-Tag.

  9. To check this, choose Edit Schema for the customer table.
  10. Select the two columns and choose Edit LF-Tags.

You can check the tags on these columns.

The rest of the columns are tagged as public.
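You can also read the assigned tags programmatically instead of through the console; a sketch using the Lake Formation API:

import boto3

lf = boto3.client("lakeformation")

# Fetch table-level and column-level LF-Tag assignments for the customer table.
response = lf.get_resource_lf_tags(
    Resource={"Table": {"DatabaseName": "oktank_autogov_temp", "Name": "customer"}},
    ShowAssignedLFTags=True,
)

print(response.get("LFTagsOnTable"))    # e.g. tbl_class and dq_tag values
print(response.get("LFTagsOnColumns"))  # col_class per column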

  11. Sign in to the Athena console with UserCategory1 and run the same query again:

select * from "oktank_autogov_temp"."customer" limit 10

This time, the user is able to see the data. This is because the LF-Tag permissions we applied earlier are in effect.

  12. Sign in as the UserCategory2 user to verify permissions.
  13. Switch to workgroup auto_gov_blog_workgroup_temporary_user2 in Athena.

This user can access the table but can only see public columns. Therefore, the user shouldn't be able to see the customer_email and customer_name columns, because these columns contain sensitive data as identified by the system.

  14. Run the same query again:

select * from "oktank_autogov_temp"."customer" limit 10

  15. Sign in to Athena and verify the permissions for DataLakeUser_Category3.
  16. Switch to workgroup auto_gov_blog_workgroup_temporary_user3 in Athena.

This user can't access the table because the table is marked as sensitive due to the presence of sensitive data columns in the table.

  17. Run the same query again:

select * from "oktank_autogov_temp"."customer" limit 10

Use case 2

Let's ingest some new data into the table.

  1. Sign in to the Amazon S3 console with the admin role used earlier to grant permissions.
  2. Copy the file cust_feedback_v1.csv from the data-v1 folder in the S3 bucket to the data/customer folder in the S3 bucket using the default options (or script the copy as shown after this list).
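If you'd rather script this copy than use the console, a boto3 equivalent might look like the following (the bucket name is a placeholder for yours):

import boto3

s3 = boto3.client("s3")
bucket = "auto-gov-blog"  # replace with your bucket name

# Copy the new data file into the customer data location to simulate ingestion.
s3.copy_object(
    Bucket=bucket,
    Key="data/customer/cust_feedback_v1.csv",
    CopySource={"Bucket": bucket, "Key": "data-v1/cust_feedback_v1.csv"},
)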

This new data file has data quality issues because the column data_location breaks referential integrity with the base table. The data also introduces sensitive values in the column comment1. This column was earlier marked as public because it didn't contain any sensitive data.

The following screenshot shows what the customer folder should look like now.

  3. Run the AutoGovMachine state machine again and use the same JSON as the StartExecution input you used earlier:

{
  "Comment": "Auto Governance with AWS Glue and AWS LakeFormation",
  "BucketName": "<Replace with your bucket name>"
}

The job classifies the column comment1 as sensitive on the customer table. It also updates the dq_tag value on the table because the data quality has changed due to the broken referential integrity check.

You can verify the new tag values via the Lake Formation console as described earlier. The dq_tag value was DQ100. The value is changed to DQ50_80, reflecting the data quality score for the table.

Also, the value of the col_class tag for the comment1 column was earlier public. The value is now changed to sensitive because sensitive data is detected in this column.

Category 2 users shouldn't be able to access sensitive columns in the table.

  4. Sign in with UserCategory2 to Athena and rerun the earlier query:

select * from "oktank_autogov_temp"."customer" limit 10

The column comment1 is no longer available to UserCategory2, as expected. The access permissions are handled automatically.

Also, because the data quality score drops below 100%, this new dataset is no longer available to the Category 1 user. This user should have access to data only when the score is 100%, as per our defined rules.

  5. Sign in with UserCategory1 to Athena and rerun the earlier query:

select * from "oktank_autogov_temp"."customer" limit 10

You will see that the user is not able to access the table now. The access permissions are handled automatically.

Use case 3

Let's fix the invalid data and remove the data quality issue.

  1. Delete the cust_feedback_v1.csv file from the data/customer Amazon S3 location.
  2. Copy the file cust_feedback_v1_fixed.csv from the data-v1 folder in the S3 bucket to the data/customer S3 location. This data file fixes the data quality issues.
  3. Rerun the AutoGovMachine state machine.

When the state machine is complete, the data quality score goes up to 100% again and the tag on the table gets updated accordingly. You can verify the new tag via the Lake Formation console as shown earlier.

The Category 1 user can access the table again.

Clean up

To avoid incurring further charges, delete the CloudFormation stack to delete the resources provisioned as part of this post.

Conclusion

This post covered the AWS Glue Data Quality and sensitive data detection features and Lake Formation LF-Tag based access control. We explored how you can combine these features and use them to build a scalable, automated data governance capability for your data lake. We saw how user permissions changed when data was initially ingested to the table and when data drift was observed as part of subsequent ingestions.



About the Author

Shoukat Ghouse is a Senior Big Data Specialist Solutions Architect at AWS. He helps customers around the world build robust, efficient, and scalable data platforms on AWS leveraging AWS analytics services like AWS Glue, AWS Lake Formation, Amazon Athena, and Amazon EMR.


