
Introducing runtime roles for Amazon EMR steps: Use IAM roles and AWS Lake Formation for access control with Amazon EMR


You can use the Amazon EMR Steps API to submit Apache Hive, Apache Spark, and other types of applications to an EMR cluster. You can invoke the Steps API using Apache Airflow, AWS Step Functions, the AWS Command Line Interface (AWS CLI), all the AWS SDKs, and the AWS Management Console. Jobs submitted with the Steps API use the Amazon Elastic Compute Cloud (Amazon EC2) instance profile to access AWS resources such as Amazon Simple Storage Service (Amazon S3) buckets, AWS Glue tables, and Amazon DynamoDB tables from the cluster.

Previously, if a step needed access to a specific S3 bucket and another step needed access to a specific DynamoDB table, the AWS Identity and Access Management (IAM) policy attached to the instance profile had to allow access to both the S3 bucket and the DynamoDB table. This meant that the IAM policies you assigned to the instance profile had to contain a union of all the permissions for every step that ran on an EMR cluster.

We're happy to introduce runtime roles for EMR steps. A runtime role is an IAM role that you associate with an EMR step, and jobs use this role to access AWS resources. With runtime roles for EMR steps, you can now specify different IAM roles for the Spark and the Hive jobs, thereby scoping down access at a job level. This allows you to simplify access controls on a single EMR cluster that is shared between multiple tenants, wherein each tenant can be easily isolated using IAM roles.

The ability to specify an IAM role with a job is also available on Amazon EMR on EKS and Amazon EMR Serverless. You can also use AWS Lake Formation to apply table- and column-level permissions for Apache Hive and Apache Spark jobs that are submitted with EMR steps. For more information, refer to Configure runtime roles for Amazon EMR steps.

In this post, we dive deeper into runtime roles for EMR steps, helping you understand how the various pieces work together and how each step is isolated on an EMR cluster.

Solution overview

In this post, we walk through the following:

  1. Create an EMR cluster enabled to use the new role-based access control with EMR steps.
  2. Create two IAM roles with different permissions in terms of the Amazon S3 data and Lake Formation tables they can access.
  3. Allow the IAM principal submitting the EMR steps to use these two IAM roles.
  4. See how EMR steps running the same code and trying to access the same data have different permissions based on the runtime role specified at submission time.
  5. See how to monitor and control actions using source identity propagation.

Set up the EMR cluster security configuration

Amazon EMR security configurations simplify applying consistent security, authorization, and authentication options across your clusters. You can create a security configuration on the Amazon EMR console or via the AWS CLI or AWS SDK. When you attach a security configuration to a cluster, Amazon EMR applies the settings in the security configuration to the cluster. You can attach a security configuration to multiple clusters at creation time, but can't apply one to a running cluster.

To enable runtime roles for EMR steps, we have to create a security configuration as shown in the following code and enable the runtime roles property (configured via EnableApplicationScopedIAMRole). In addition to the runtime roles, we're enabling propagation of the source identity (configured via PropagateSourceIdentity) and support for Lake Formation (configured via LakeFormationConfiguration). The source identity is a mechanism to monitor and control actions taken with assumed roles. Enabling Propagate source identity allows you to audit actions performed using the runtime role. Lake Formation is an AWS service to securely manage a data lake, which includes defining and enforcing central access control policies for your data lake.

Create a file called step-runtime-roles-sec-cfg.json with the following content:

{
    "AuthorizationConfiguration": {
        "IAMConfiguration": {
            "EnableApplicationScopedIAMRole": true,
            "ApplicationScopedIAMRoleConfiguration": 
                {
                    "PropagateSourceIdentity": true
                }
        },
        "LakeFormationConfiguration": {
            "AuthorizedSessionTagValue": "Amazon EMR"
        }
    }
}

Create the Amazon EMR security configuration:

aws emr create-security-configuration \
--name 'iamconfig-with-iam-lf' \
--security-configuration file://step-runtime-roles-sec-cfg.json

You can also do the same via the Amazon EMR console:

  1. On the Amazon EMR console, choose Security configurations in the navigation pane.
  2. Choose Create.
  3. For Security configuration name, enter a name.
  4. For Security configuration setup options, select Choose custom settings.
  5. For IAM role for applications, select Runtime role.
  6. Select Propagate source identity to audit actions performed using the runtime role.
  7. For Fine-grained access control, select AWS Lake Formation.
  8. Complete the security configuration.

The security configuration appears in your security configuration list. You can also see that the authorization mechanism listed here is the runtime role instead of the instance profile.
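
To double-check the result from the AWS CLI, you can describe the security configuration and inspect the returned JSON; a quick check, assuming the configuration name used earlier:

aws emr describe-security-configuration \
--name 'iamconfig-with-iam-lf'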

Launch the cluster

Now we launch an EMR cluster and specify the security configuration we created. For more information, refer to Specify a security configuration for a cluster.

The following code provides the AWS CLI command for launching an EMR cluster with the appropriate security configuration. Note that this cluster is launched on the default VPC and public subnet with the default IAM roles. In addition, the cluster is launched with one primary and one core instance of the specified instance type. For more details on how to customize the launch parameters, refer to create-cluster.

If the default EMR roles EMR_EC2_DefaultRole and EMR_DefaultRole don't exist in IAM in your account (this is the first time you're launching an EMR cluster with them), use the following command to create them before launching the cluster:

aws emr create-default-roles

Create the cluster with the following code:

#Replace with your Key Pair
KEYPAIR=<MY_KEYPAIR>
INSTANCE_TYPE="r4.4xlarge"
#Replace with your Security Configuration Name
SECURITY_CONFIG="iamconfig-with-iam-lf"
#Replace with your S3 log URI
LOG_URI="s3://mybucket/logs/"

aws emr create-cluster \
--name "iam-passthrough-cluster" \
--release-label emr-6.7.0 \
--use-default-roles \
--security-configuration $SECURITY_CONFIG \
--ec2-attributes KeyName=$KEYPAIR \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=$INSTANCE_TYPE InstanceGroupType=CORE,InstanceCount=1,InstanceType=$INSTANCE_TYPE \
--applications Name=Spark Name=Hadoop Name=Hive \
--log-uri $LOG_URI
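
If you prefer to wait for cluster provisioning from the command line instead of the console, the following is a minimal sketch (assuming the cluster ID returned by create-cluster):

#Replace with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX

#Block until the cluster reaches the RUNNING or WAITING state
aws emr wait cluster-running --cluster-id $CLUSTER_ID

#Print the current cluster state
aws emr describe-cluster --cluster-id $CLUSTER_ID \
--query 'Cluster.Status.State' --output text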

When the cluster is fully provisioned (Waiting state), let's try to run a step on it with runtime roles for EMR steps enabled:

#Replace with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX
aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps '[{
            "Type": "CUSTOM_JAR",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Name": "Spark Example",
            "Args": [
              "spark-submit",
              "--class",
              "org.apache.spark.examples.SparkPi",
              "/usr/lib/spark/examples/jars/spark-examples.jar",
              "5"
            ]
        }]'

After launching the command, we receive the following as output:

An error occurred (ValidationException) when calling the AddJobFlowSteps operation: Runtime roles are required for this cluster. Please specify the role using the ExecutionRoleArn parameter.

The step failed, asking us to provide a runtime role. In the next section, we set up two IAM roles with different permissions and use them as the runtime roles for EMR steps.

Set up IAM roles as runtime roles

Any IAM role that you want to use as a runtime role for EMR steps must have a trust policy that allows the EMR cluster's EC2 instance profile to assume it. In our setup, we're using the default IAM role EMR_EC2_DefaultRole as the instance profile role. In addition, we create two IAM roles called test-emr-demo1 and test-emr-demo2 that we use as runtime roles for EMR steps.

The following code is the trust policy for both of the IAM roles, which lets the EMR cluster's EC2 instance profile role, EMR_EC2_DefaultRole, assume these roles and set the source identity and LakeFormationAuthorizedCaller tag on the role sessions. The TagSession permission is needed so that Amazon EMR can authorize to Lake Formation. The SetSourceIdentity statement is needed for the propagate source identity feature.

Create a file called trust-policy.json with the following content (replace 123456789012 with your AWS account ID):

{
    "Model": "2012-10-17",
    "Assertion": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:role/EMR_EC2_DefaultRole"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:role/EMR_EC2_DefaultRole"
            },
            "Action": "sts:SetSourceIdentity"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:role/EMR_EC2_DefaultRole"
            },
            "Action": "sts:TagSession",
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/LakeFormationAuthorizedCaller": "Amazon EMR"
                }
            }
        }
    ]
}

Use that policy to create the two IAM roles, test-emr-demo1 and test-emr-demo2:

aws iam create-role \
--role-name test-emr-demo1 \
--assume-role-policy-document file://trust-policy.json

aws iam create-role \
--role-name test-emr-demo2 \
--assume-role-policy-document file://trust-policy.json

Set up permissions for the principal submitting the EMR steps with runtime roles

The IAM principal submitting the EMR steps needs permissions to invoke the AddJobFlowSteps API. In addition, you can use the condition key elasticmapreduce:ExecutionRoleArn to control access to specific IAM roles. For example, the following policy allows the IAM principal to only use the IAM roles test-emr-demo1 and test-emr-demo2 as the runtime roles for EMR steps.

  1. Create the job-submitter-policy.json file with the following content (replace 123456789012 with your AWS account ID):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AddStepsWithSpecificExecRoleArn",
                "Effect": "Allow",
                "Action": [
                    "elasticmapreduce:AddJobFlowSteps"
                ],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "elasticmapreduce:ExecutionRoleArn": [
                            "arn:aws:iam::123456789012:role/test-emr-demo1",
                            "arn:aws:iam::123456789012:role/test-emr-demo2"
                        ]
                    }
                }
            },
            {
                "Sid": "EMRDescribeCluster",
                "Effect": "Allow",
                "Action": [
                    "elasticmapreduce:DescribeCluster"
                ],
                "Resource": "*"
            }
        ]
    }

  2. Create the IAM policy with the following code:
    aws iam create-policy \
    --policy-name emr-runtime-roles-submitter-policy \
    --policy-document file://job-submitter-policy.json

  3. Assign this policy to the IAM principal (IAM user or IAM role) you're going to use to submit the EMR steps (replace 123456789012 with your AWS account ID and replace john with the IAM user you use to submit your EMR steps):
    aws iam attach-user-policy \
    --user-name john \
    --policy-arn "arn:aws:iam::123456789012:policy/emr-runtime-roles-submitter-policy"

IAM user john can now submit steps using arn:aws:iam::123456789012:role/test-emr-demo1 and arn:aws:iam::123456789012:role/test-emr-demo2 as the step runtime roles.
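
Optionally, you can sanity-check these permissions with the IAM policy simulator before submitting any real steps. The following is a sketch, assuming the user john and the role ARNs above; it should print "allowed" for the permitted role ARNs and "implicitDeny" for any other value:

aws iam simulate-principal-policy \
--policy-source-arn arn:aws:iam::123456789012:user/john \
--action-names elasticmapreduce:AddJobFlowSteps \
--context-entries "ContextKeyName=elasticmapreduce:ExecutionRoleArn,ContextKeyValues=arn:aws:iam::123456789012:role/test-emr-demo1,ContextKeyType=string" \
--query 'EvaluationResults[0].EvalDecision'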

Use runtime roles with EMR steps

We now prepare our setup to show runtime roles for EMR steps in action.

Set up Amazon S3

To set up your Amazon S3 data, complete the following steps:

  1. Create a CSV file called test.csv with the following content:
    1,a,1a
    2,b,2b

  2. Upload the file to Amazon S3 in three different locations:
    #Replace this with your bucket name
    BUCKET_NAME="emr-steps-roles-new-us-east-1"
    
    aws s3 cp test.csv s3://${BUCKET_NAME}/demo1/
    aws s3 cp test.csv s3://${BUCKET_NAME}/demo2/
    aws s3 cp test.csv s3://${BUCKET_NAME}/nondemo/

    For our initial test, we use a PySpark application called test.py with the following contents:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("my app").enableHiveSupport().getOrCreate()
    
    #Replace this with your bucket name
    BUCKET_NAME="emr-steps-roles-new-us-east-1"
    
    try:
      spark.read.csv("s3://" + BUCKET_NAME + "/demo1/test.csv").show()
      print("Accessed demo1")
    except:
      print("Could not access demo1")
    
    try:
      spark.read.csv("s3://" + BUCKET_NAME + "/demo2/test.csv").show()
      print("Accessed demo2")
    except:
      print("Could not access demo2")
    
    try:
      spark.read.csv("s3://" + BUCKET_NAME + "/nondemo/test.csv").show()
      print("Accessed nondemo")
    except:
      print("Could not access nondemo")
    spark.stop()

    In the script, we're trying to access the CSV file present under three different prefixes in the test bucket.

  3. Upload the Spark application to the same S3 bucket where we placed the test.csv file, but in a different location:
    #Replace this with your bucket name
    BUCKET_NAME="emr-steps-roles-new-us-east-1"
    aws s3 cp test.py s3://${BUCKET_NAME}/scripts/

Set up runtime role permissions

To show how runtime roles for EMR steps work, we assign the roles we created different IAM permissions to access Amazon S3. The following table summarizes the grants we provide to each role (emr-steps-roles-new-us-east-1 is the bucket you configured in the previous section).

S3 locations                                  test-emr-demo1    test-emr-demo2
s3://emr-steps-roles-new-us-east-1/*          No Access         No Access
s3://emr-steps-roles-new-us-east-1/demo1/*    Full Access       No Access
s3://emr-steps-roles-new-us-east-1/demo2/*    No Access         Full Access
s3://emr-steps-roles-new-us-east-1/scripts/*  Read Access       Read Access

  1. Create the file demo1-policy.json with the following content (replace emr-steps-roles-new-us-east-1 with your bucket name):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:*"
                ],
                "Resource": [
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/demo1",
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/demo1/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:Get*"
                ],
                "Resource": [
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/scripts",
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/scripts/*"
                ]
            }
        ]
    }

  2. Create the file demo2-policy.json with the following content (replace emr-steps-roles-new-us-east-1 with your bucket name):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:*"
                ],
                "Resource": [
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/demo2",
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/demo2/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:Get*"
                ],
                "Resource": [
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/scripts",
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/scripts/*"
                ]
            }
        ]
    }

  3. Create our IAM policies:
    aws iam create-policy \
    --policy-name test-emr-demo1-policy \
    --policy-document file://demo1-policy.json
    
    aws iam create-policy \
    --policy-name test-emr-demo2-policy \
    --policy-document file://demo2-policy.json

  4. Assign each role its related policy (replace 123456789012 with your AWS account ID):
    aws iam attach-role-policy \
    --role-name test-emr-demo1 \
    --policy-arn "arn:aws:iam::123456789012:policy/test-emr-demo1-policy"
    
    aws iam attach-role-policy \
    --role-name test-emr-demo2 \
    --policy-arn "arn:aws:iam::123456789012:policy/test-emr-demo2-policy"

    To use runtime roles with Amazon EMR steps, we need to add the following policy to our EMR cluster's EC2 instance profile (in this example EMR_EC2_DefaultRole). With this policy, the underlying EC2 instances for the EMR cluster can assume the runtime role and apply a tag to that runtime role.

  5. Create the file runtime-roles-policy.json with the following content (replace 123456789012 with your AWS account ID):
    {
        "Version": "2012-10-17",
        "Statement": [{
                "Sid": "AllowRuntimeRoleUsage",
                "Effect": "Allow",
                "Action": [
                    "sts:AssumeRole",
                    "sts:TagSession",
                    "sts:SetSourceIdentity"
                ],
                "Resource": [
                    "arn:aws:iam::123456789012:role/test-emr-demo1",
                    "arn:aws:iam::123456789012:role/test-emr-demo2"
                ]
            }
        ]
    }

  6. Create the IAM policy:
    aws iam create-policy \
    --policy-name emr-runtime-roles-policy \
    --policy-document file://runtime-roles-policy.json

  7. Assign the created policy to the EMR cluster's EC2 instance profile, in this example EMR_EC2_DefaultRole:
    aws iam attach-role-policy \
    --role-name EMR_EC2_DefaultRole \
    --policy-arn "arn:aws:iam::123456789012:policy/emr-runtime-roles-policy"

Test permissions with runtime roles

We're now ready to perform our first test. We run the test.py script, previously uploaded to Amazon S3, two times as Spark steps: first using the test-emr-demo1 role and then using the test-emr-demo2 role as the runtime roles.

To run an EMR step specifying a runtime role, you need the latest version of the AWS CLI. For more details about updating the AWS CLI, refer to Installing or updating the latest version of the AWS CLI.

Let's submit a step specifying test-emr-demo1 as the runtime role:

#Replace with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX
#Replace with your AWS Account ID
ACCOUNT_ID=123456789012
#Replace with your Bucket name
BUCKET_NAME=emr-steps-roles-new-us-east-1

aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps '[{
            "Type": "CUSTOM_JAR",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Name": "Spark Example",
            "Args": [
              "spark-submit",
              "s3://'"${BUCKET_NAME}"'/scripts/test.py"
            ]
        }]' \
--execution-role-arn arn:aws:iam::${ACCOUNT_ID}:role/test-emr-demo1

This command returns an EMR step ID. To check our step output logs, we can proceed in two different ways:

  • From the Amazon EMR console – On the Steps tab, choose the View logs link related to the specific step ID and select stdout.
  • From Amazon S3 – While launching our cluster, we configured an S3 location for logging. We can find our step logs under $(LOG_URI)/steps/<stepID>/stdout.gz.

The logs may take a couple of minutes to populate after the step is marked as Completed.
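
As an alternative to the console, you can poll the step and pull the log directly with the AWS CLI; a sketch, assuming the step ID returned by add-steps and the LOG_URI configured at cluster launch:

#Replace with the step ID returned by add-steps
STEP_ID=s-XXXXXXXXXXXXX

#Print the step state (PENDING, RUNNING, COMPLETED, ...)
aws emr describe-step --cluster-id $CLUSTER_ID --step-id $STEP_ID \
--query 'Step.Status.State' --output text

#Stream the stdout log once the step is COMPLETED
#(depending on your setup, the prefix may also include the cluster ID)
aws s3 cp ${LOG_URI}steps/${STEP_ID}/stdout.gz - | gunzip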

The following is the output of the EMR step with test-emr-demo1 as the runtime role:

+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|  1|  a| 1a|
|  2|  b| 2b|
+---+---+---+

Accessed demo1
Could not access demo2
Could not access nondemo

As we can see, only the demo1 folder was accessible to our application.

Diving deeper into the step stderr logs, we can see that the related YARN application application_1656350436159_0017 was launched with the user 6GC64F33KUW4Q2JY6LKR7UAHWETKKXYL. We can confirm this by connecting to the EMR primary instance using SSH and using the YARN CLI:

[hadoop@ip-172-31-63-203]$ yarn application -status application_1656350436159_0017
...
Application-Id : application_1656350436159_0017
Application-Name : my app
Application-Type : SPARK
User : 6GC64F33KUW4Q2JY6LKR7UAHWETKKXYL
Queue : default
Application Priority : 0
...

Please note that in your case, the YARN application ID and the user will be different.

Now we submit the same script again as a new EMR step, but this time with the role test-emr-demo2 as the runtime role:

#Replace with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX
#Replace with your AWS Account ID
ACCOUNT_ID=123456789012
#Replace with your Bucket name
BUCKET_NAME=emr-steps-roles-new-us-east-1

aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps '[{
            "Type": "CUSTOM_JAR",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Name": "Spark Example",
            "Args": [
              "spark-submit",
              "s3://'"${BUCKET_NAME}"'/scripts/test.py"
            ]
        }]' \
--execution-role-arn arn:aws:iam::${ACCOUNT_ID}:role/test-emr-demo2

The following is the output of the EMR step with test-emr-demo2 as the runtime role:

Could not access demo1
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|  1|  a| 1a|
|  2|  b| 2b|
+---+---+---+

Accessed demo2
Could not access nondemo

As we can see, only the demo2 folder was accessible to our application.

Diving deeper into the step stderr logs, we can see that the related YARN application application_1656350436159_0018 was launched with a different user, 7T2ORHE6Z4Q7PHLN725C2CVWILZWYOLE. We can confirm this by using the YARN CLI:

[hadoop@ip-172-31-63-203]$ yarn application -status application_1656350436159_0018
...
Application-Id : application_1656350436159_0018
Application-Name : my app
Application-Type : SPARK
User : 7T2ORHE6Z4Q7PHLN725C2CVWILZWYOLE
Queue : default
Application Priority : 0
...

Each step was only able to access the CSV file that was allowed by the runtime role, so the first step could only access s3://emr-steps-roles-new-us-east-1/demo1/test.csv and the second step could only access s3://emr-steps-roles-new-us-east-1/demo2/test.csv. In addition, we observed that Amazon EMR created a unique user for each step, and used this user to run the jobs. Please note that both roles need at least read access to the S3 location where the step scripts are located (for example, s3://emr-steps-roles-demo-bucket/scripts/test.py).

Now that we have seen how runtime roles for EMR steps work, let's look at how we can use Lake Formation to apply fine-grained access controls with EMR steps.

Use Lake Formation-based access control with EMR steps

You can use Lake Formation to apply table- and column-level permissions with Apache Spark and Apache Hive jobs submitted as EMR steps. First, the data lake admin in Lake Formation must register Amazon EMR as the AuthorizedSessionTagValue to enforce Lake Formation permissions on EMR. Lake Formation uses this session tag to authorize callers and provide access to the data lake. The Amazon EMR value is referenced inside the step-runtime-roles-sec-cfg.json file we used earlier when we created the EMR security configuration, and inside the trust-policy.json file we used to create the two runtime roles test-emr-demo1 and test-emr-demo2.

We can do this on the Lake Formation console in the External data filtering section (replace 123456789012 with your AWS account ID).

On the IAM runtime roles' trust policy, we already have the sts:TagSession permission with the condition "aws:RequestTag/LakeFormationAuthorizedCaller": "Amazon EMR". So we're ready to proceed.
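
If you prefer the AWS CLI over the console for this registration, you can edit the data lake settings directly. The following is a sketch, not a drop-in command: put-data-lake-settings replaces the entire settings object, so retrieve the current settings first, merge in the external data filtering fields, and write the result back.

#Fetch the current settings (note: the output wraps them in a "DataLakeSettings" key;
#put-data-lake-settings expects only the inner object)
aws lakeformation get-data-lake-settings > settings.json

#Merge fields like the following into the settings object
#(123456789012 is the account allowed to perform external data filtering):
#    "AllowExternalDataFiltering": true,
#    "ExternalDataFilteringAllowList": [{"DataLakePrincipalIdentifier": "123456789012"}],
#    "AuthorizedSessionTagValueList": ["Amazon EMR"]

#Write the merged settings back
aws lakeformation put-data-lake-settings \
--data-lake-settings file://settings.json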

To demonstrate how Lake Formation works with EMR steps, we create one database named entities with two tables named users and products, and we assign in Lake Formation the grants summarized in the following table.

IAM Role          users (table in the entities DB)      products (table in the entities DB)
test-emr-demo1    Full Read Access                      No Access
test-emr-demo2    Read Access on columns: uid, state    Full Read Access

Prepare the Amazon S3 files

We first prepare our Amazon S3 files.

  1. Create the users.csv file with the following content:
    00005678,john,pike,england,london,Hidden Road 78
    00009039,paolo,rossi,italy,milan,Via degli Alberi 56A
    00009057,july,finn,germany,berlin,Green Road 90

  2. Create the products.csv file with the following content:
    P0000789,Bike2000,Sport
    P0000567,CoverToCover,Smartphone
    P0005677,Whiteboard X786,Home

  3. Upload these files to Amazon S3 in two different locations:
    #Replace this with your bucket name
    BUCKET_NAME="emr-steps-roles-new-us-east-1"
    
    aws s3 cp users.csv s3://${BUCKET_NAME}/entities-database/users/
    aws s3 cp products.csv s3://${BUCKET_NAME}/entities-database/products/

Prepare the database and tables

We can create our entities database by using the AWS Glue APIs.

  1. Create the entities-db.json file with the following content (replace emr-steps-roles-new-us-east-1 with your bucket name):
    {
        "DatabaseInput": {
            "Name": "entities",
            "LocationUri": "s3://emr-steps-roles-new-us-east-1/entities-database/",
            "CreateTableDefaultPermissions": []
        }
    }

  2. With a Lake Formation admin user, run the following command to create our database:
    aws glue create-database \
    --cli-input-json file://entities-db.json

    We also use the AWS Glue APIs to create the tables users and products.

  3. Create the users-table.json file with the following content (replace emr-steps-roles-new-us-east-1 with your bucket name):
    {
        "TableInput": {
            "Identify": "customers",
            "StorageDescriptor": {
                "Columns": [{
                        "Name": "uid",
                        "Type": "string"
                    },
                    {
                        "Name": "name",
                        "Type": "string"
                    },
                    {
                        "Name": "surname",
                        "Type": "string"
                    },
                    {
                        "Name": "state",
                        "Type": "string"
                    },
                    {
                        "Name": "city",
                        "Type": "string"
                    },
                    {
                        "Name": "address",
                        "Type": "string"
                    }
                ],
                "Location": "s3://emr-steps-roles-new-us-east-1/entities-database/customers/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "Compressed": false,
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    "Parameters": {
                        "area.delim": ",",
                        "serialization.format": ","
                    }
                }
            },
            "TableType": "EXTERNAL_TABLE",
            "Parameters": {
                "EXTERNAL": "TRUE"
            }
        }
    }

  4. Create the products-table.json file with the following content (replace emr-steps-roles-new-us-east-1 with your bucket name):
    {
        "TableInput": {
            "Identify": "merchandise",
            "StorageDescriptor": {
                "Columns": [{
                        "Name": "product_id",
                        "Type": "string"
                    },
                    {
                        "Name": "name",
                        "Type": "string"
                    },
                    {
                        "Name": "category",
                        "Type": "string"
                    }
                ],
                "Location": "s3://emr-steps-roles-new-us-east-1/entities-database/merchandise/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "Compressed": false,
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    "Parameters": {
                        "area.delim": ",",
                        "serialization.format": ","
                    }
                }
            },
            "TableType": "EXTERNAL_TABLE",
            "Parameters": {
                "EXTERNAL": "TRUE"
            }
        }
    }

  5. With a Lake Formation admin user, create our tables with the following commands:
    aws glue create-table \
        --database-name entities \
        --cli-input-json file://users-table.json
        
    aws glue create-table \
        --database-name entities \
        --cli-input-json file://products-table.json

Set up the Lake Formation data lake locations

To access our tables' data in Amazon S3, Lake Formation needs read/write access to it. To achieve that, we have to register the Amazon S3 locations where our data resides and specify for them which IAM role to obtain credentials from.

Let's create our IAM role for the data access.

  1. Create a file called trust-policy-data-access-role.json with the following content:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {
                    "Service": "lakeformation.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

  2. Use the policy to create the IAM role emr-demo-lf-data-access-role:
    aws iam create-role \
    --role-name emr-demo-lf-data-access-role \
    --assume-role-policy-document file://trust-policy-data-access-role.json

  3. Create the file data-access-role-policy.json with the following content (replace emr-steps-roles-new-us-east-1 with your bucket name):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:*"
                ],
                "Resource": [
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/entities-database",
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/entities-database/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1"
                ]
            }
        ]
    }

  4. Create our IAM policy:
    aws iam create-policy \
    --policy-name data-access-role-policy \
    --policy-document file://data-access-role-policy.json

  5. Assign the created policy to our emr-demo-lf-data-access-role (replace 123456789012 with your AWS account ID):
    aws iam attach-role-policy \
    --role-name emr-demo-lf-data-access-role \
    --policy-arn "arn:aws:iam::123456789012:policy/data-access-role-policy"

    We can now register our data location in Lake Formation.

  6. On the Lake Formation console, choose Data lake locations in the navigation pane.
  7. Here we can register our S3 location containing the data for our two tables and choose the created emr-demo-lf-data-access-role IAM role, which has read/write access to that location.

For more details about adding an Amazon S3 location to your data lake and configuring your IAM data access roles, refer to Adding an Amazon S3 location to your data lake.
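
The console registration also has a CLI equivalent; a minimal sketch, assuming the bucket and the data access role from this walkthrough (replace 123456789012 with your AWS account ID):

aws lakeformation register-resource \
--resource-arn arn:aws:s3:::emr-steps-roles-new-us-east-1/entities-database \
--role-arn arn:aws:iam::123456789012:role/emr-demo-lf-data-access-role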

Enforce Lake Formation permissions

To be sure we're using Lake Formation permissions, we should confirm that we don't have any grants set up for the principal IAMAllowedPrincipals. The IAMAllowedPrincipals group includes any IAM users and roles that are allowed access to your Data Catalog resources by your IAM policies, and it's used to maintain backward compatibility with AWS Glue.

To confirm Lake Formation permissions are enforced, navigate to the Lake Formation console and choose Data lake permissions in the navigation pane. Filter permissions by "Database":"entities" and remove all the permissions given to the principal IAMAllowedPrincipals.

For more details on IAMAllowedPrincipals and backward compatibility with AWS Glue, refer to Changing the default security settings for your data lake.
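
You can also remove these grants from the CLI. The following sketch assumes IAMAllowedPrincipals holds the default ALL (Super) grant on the entities database; repeat with a Table resource for any table-level grants you see listed:

aws lakeformation revoke-permissions \
--principal DataLakePrincipalIdentifier=IAM_ALLOWED_PRINCIPALS \
--resource '{"Database": {"Name": "entities"}}' \
--permissions ALL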

Configure AWS Glue and Lake Formation grants for IAM runtime roles

To allow our IAM runtime roles to properly interact with Lake Formation, we should provide them the lakeformation:GetDataAccess and glue:Get* grants.

Lake Formation permissions control access to Data Catalog resources, Amazon S3 locations, and the underlying data at those locations. IAM permissions control access to the Lake Formation and AWS Glue APIs and resources. Therefore, although you might have the Lake Formation permission to access a table in the Data Catalog (SELECT), your operation fails if you don't have the IAM permission on the glue:Get* API.

For more details about Lake Formation access control, refer to Lake Formation access control overview.

  1. Create the emr-runtime-roles-lake-formation-policy.json file with the following content:
    {
        "Version": "2012-10-17",
        "Statement": {
            "Sid": "LakeFormationManagedAccess",
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess",
                "glue:Get*",
                "glue:Create*",
                "glue:Update*"
            ],
            "Resource": "*"
        }
    }

  2. Create the related IAM policy:
    aws iam create-policy \
    --policy-name emr-runtime-roles-lake-formation-policy \
    --policy-document file://emr-runtime-roles-lake-formation-policy.json

  3. Assign this policy to both IAM runtime roles (replace 123456789012 with your AWS account ID):
    aws iam attach-role-policy \
    --role-name test-emr-demo1 \
    --policy-arn "arn:aws:iam::123456789012:policy/emr-runtime-roles-lake-formation-policy"
    
    aws iam attach-role-policy \
    --role-name test-emr-demo2 \
    --policy-arn "arn:aws:iam::123456789012:policy/emr-runtime-roles-lake-formation-policy"

Set up Lake Formation permissions

We now set up the permissions in Lake Formation for the two runtime roles.

  1. Create the file users-grants-test-emr-demo1.json with the following content to grant SELECT access to all columns in the entities.users table to test-emr-demo1:
    {
        "Principal": {
            "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/test-emr-demo1"
        },
        "Resource": {
            "Table": {
                "DatabaseName": "entities",
                "Name": "users"
            }
        },
        "Permissions": [
            "SELECT"
        ]
    }

  2. Create the file users-grants-test-emr-demo2.json with the following content to grant SELECT access to the uid and state columns in the entities.users table to test-emr-demo2:
    {
        "Principal": {
            "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/test-emr-demo2"
        },
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": "entities",
                "Name": "users",
                "ColumnNames": ["uid", "state"]
            }
        },
        "Permissions": [
            "SELECT"
        ]
    }

  3. Create the file products-grants-test-emr-demo2.json with the following content to grant SELECT access to all columns in the entities.products table to test-emr-demo2:
    {
        "Principal": {
            "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/test-emr-demo2"
        },
        "Resource": {
            "Table": {
                "DatabaseName": "entities",
                "Name": "products"
            }
        },
        "Permissions": [
            "SELECT"
        ]
    }

  4. Let's set up our permissions in Lake Formation:
    aws lakeformation grant-permissions \
    --cli-input-json file://users-grants-test-emr-demo1.json
    
    aws lakeformation grant-permissions \
    --cli-input-json file://users-grants-test-emr-demo2.json
    
    aws lakeformation grant-permissions \
    --cli-input-json file://products-grants-test-emr-demo2.json

  5. Check the permissions we defined on the Lake Formation console on the Data lake permissions page by filtering by "Database":"entities".
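
The same check works from the CLI; for example, the following lists the grants on the users table:

aws lakeformation list-permissions \
--resource '{"Table": {"DatabaseName": "entities", "Name": "users"}}'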

Test Lake Formation permissions with runtime roles

For our test, we use a PySpark application called test-lake-formation.py with the following content:


from pyspark.sql import SparkSession
 
spark = SparkSession.builder.appName("Pyspark - TEST IAM RBAC with LF").enableHiveSupport().getOrCreate()

try:
    print("== select * from entities.users limit 3 ==\n")
    spark.sql("select * from entities.users limit 3").show()
except Exception as e:
    print(e)

try:
    print("== select * from entities.products limit 3 ==\n")
    spark.sql("select * from entities.products limit 3").show()
except Exception as e:
    print(e)

spark.stop()

In the script, we're trying to access the tables users and products. Let's upload our Spark application to the same S3 bucket that we used earlier:

#Replace this with your bucket name
BUCKET_NAME="emr-steps-roles-new-us-east-1"

aws s3 cp test-lake-formation.py s3://${BUCKET_NAME}/scripts/

We're now ready to perform our test. We run the test-lake-formation.py script first using the test-emr-demo1 role and then using the test-emr-demo2 role as the runtime roles.

Let's submit a step specifying test-emr-demo1 as the runtime role:

#Replace with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX
#Replace with your AWS Account ID
ACCOUNT_ID=123456789012
#Replace with your Bucket name
BUCKET_NAME=emr-steps-roles-new-us-east-1

aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps '[{
            "Type": "CUSTOM_JAR",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Name": "Spark Lake Formation Example",
            "Args": [
              "spark-submit",
              "s3://'"${BUCKET_NAME}"'/scripts/test-lake-formation.py"
            ]
        }]' \
--execution-role-arn arn:aws:iam::${ACCOUNT_ID}:role/test-emr-demo1

The following is the output of the EMR step with test-emr-demo1 as the runtime role:

== select * from entities.users limit 3 ==

+--------+-----+-------+-------+------+--------------------+
|     uid| name|surname|  state|  city|             address|
+--------+-----+-------+-------+------+--------------------+
|00005678| john|   pike|england|london|      Hidden Road 78|
|00009039|paolo|  rossi|  italy| milan|Via degli Alberi 56A|
|00009057| july|   finn|germany|berlin|       Green Road 90|
+--------+-----+-------+-------+------+--------------------+

== select * from entities.products limit 3 ==

Insufficient Lake Formation permission(s) on products (...)

As we can see, our application was only able to access the users table.

Submit the same script again as a new EMR step, but this time with the role test-emr-demo2 as the runtime role:

#Replace with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX
#Replace with your AWS Account ID
ACCOUNT_ID=123456789012
#Replace with your Bucket name
BUCKET_NAME=emr-steps-roles-new-us-east-1

aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps '[{
            "Type": "CUSTOM_JAR",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Name": "Spark Lake Formation Example",
            "Args": [
              "spark-submit",
              "s3://'"${BUCKET_NAME}"'/scripts/test-lake-formation.py"
            ]
        }]' \
--execution-role-arn arn:aws:iam::${ACCOUNT_ID}:role/test-emr-demo2

The following is the output of the EMR step with test-emr-demo2 as the runtime role:

== select * from entities.users limit 3 ==

+--------+-------+
|     uid|  state|
+--------+-------+
|00005678|england|
|00009039|  italy|
|00009057|germany|
+--------+-------+

== select * from entities.products limit 3 ==

+----------+---------------+----------+
|product_id|           name|  category|
+----------+---------------+----------+
|  P0000789|       Bike2000|     Sport|
|  P0000567|   CoverToCover|Smartphone|
|  P0005677|Whiteboard X786|      Home|
+----------+---------------+----------+

As we can see, our application was able to access a subset of columns for the users table and all the columns for the products table.

We can conclude that the permissions when accessing the Data Catalog are enforced based on the runtime role used with the EMR step.

Audit using the source identity

The source identity is a mechanism to monitor and control actions taken with assumed roles. The Propagate source identity feature similarly allows you to monitor and control actions taken using runtime roles by the jobs submitted with EMR steps.

We already configured EMR_EC2_DefaultRole with "sts:SetSourceIdentity" on our two runtime roles. Also, both runtime roles allow EMR_EC2_DefaultRole to set the source identity in their trust policy. So we're ready to proceed.

We now see the Propagate source identity feature in action with a simple example.

Configure the IAM role that is assumed to submit the EMR steps

We configure the IAM role job-submitter-1, which is assumed specifying the source identity and which is used to submit the EMR steps. In this example, we allow the IAM user paul to assume this role and set the source identity. Please note you can use any IAM principal here.

  1. Create a file called trust-policy-2.json with the following content (replace 123456789012 with your AWS account ID):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::123456789012:user/paul"
                },
                "Action": "sts:AssumeRole"
            },
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::123456789012:user/paul"
                },
                "Action": "sts:SetSourceIdentity"
            }
        ]
    }

  2. Use it as the trust policy to create the IAM role job-submitter-1:
    aws iam create-role \
    --role-name job-submitter-1 \
    --assume-role-policy-document file://trust-policy-2.json

    We now use the same emr-runtime-roles-submitter-policy policy we defined before to allow the role to submit EMR steps using the test-emr-demo1 and test-emr-demo2 runtime roles.

  3. Assign this policy to the IAM role job-submitter-1 (replace 123456789012 with your AWS account ID):
    aws iam attach-role-policy \
    --role-name job-submitter-1 \
    --policy-arn "arn:aws:iam::123456789012:policy/emr-runtime-roles-submitter-policy"

Test the source identity with AWS CloudTrail

To show how propagation of the source identity works with Amazon EMR, we generate a role session with the source identity test-ad-user.

With the IAM user paul (or with the IAM principal you configured), we first perform the impersonation (replace 123456789012 with your AWS account ID):

aws sts assume-role \
--role-arn arn:aws:iam::123456789012:role/job-submitter-1 \
--role-session-name demotest \
--source-identity test-ad-user

The following code is the output received:

{
"Credentials": {
    "SecretAccessKey": "<SECRET_ACCESS_KEY>",
    "SessionToken": "<SESSION_TOKEN>",
    "Expiration": "<EXPIRATION_TIME>",
    "AccessKeyId": "<ACCESS_KEY_ID>"
},
"AssumedRoleUser": {
    "AssumedRoleId": "AROAUVT2HQ3......:demotest",
    "Arn": "arn:aws:sts::123456789012:assumed-role/test-emr-role/demotest"
},
"SourceIdentity": "test-ad-user"
}

We use the temporary AWS security credentials of the role session to submit an EMR step along with the runtime role test-emr-demo1:

export AWS_ACCESS_KEY_ID="<ACCESS_KEY_ID>"
export AWS_SECRET_ACCESS_KEY="<SECRET_ACCESS_KEY>"
export AWS_SESSION_TOKEN="<SESSION_TOKEN>"

#Replace with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX
#Replace with your AWS Account ID
ACCOUNT_ID=123456789012
#Replace with your Bucket name
BUCKET_NAME=emr-steps-roles-new-us-east-1

aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps '[{
            "Type": "CUSTOM_JAR",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Name": "Spark Lake Formation Example",
            "Args": [
              "spark-submit",
              "s3://'"${BUCKET_NAME}"'/scripts/test-lake-formation.py"
            ]
        }]' \
--execution-role-arn arn:aws:iam::${ACCOUNT_ID}:role/test-emr-demo1

In a few minutes, we can see events appearing in the AWS CloudTrail log file. We can see all the AWS APIs that the jobs invoked using the runtime role. In the following snippet, we can see that the step performed the sts:AssumeRole and lakeformation:GetDataAccess actions. It's worth noting how the source identity test-ad-user has been preserved in the events.
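
To find these events without opening the console, you can query the recent CloudTrail event history; a sketch (lookup-events covers only the last 90 days of management events):

#Look up recent GetDataAccess calls and extract any propagated source identities
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=EventName,AttributeValue=GetDataAccess \
--max-results 50 \
--query 'Events[].CloudTrailEvent' --output text | grep -o '"sourceIdentity":"[^"]*"'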

Clean up

You can now delete the EMR cluster you created.

  1. On the Amazon EMR console, choose Clusters in the navigation pane.
  2. Select the cluster iam-passthrough-cluster, then choose Terminate.
  3. Choose Terminate again to confirm.

Alternatively, you can delete the cluster by using the AWS CLI with the following command (replace the EMR cluster ID with the one returned by the previously run aws emr create-cluster command):

aws emr terminate-clusters --cluster-ids j-3KVXXXXXXX7UG
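
If you also want to remove the IAM resources created for this walkthrough, detach each policy before deleting it and its role; a sketch for one role/policy pair (replace 123456789012 with your AWS account ID, and repeat for the other roles and policies, remembering that a role must have all its policies detached before deletion):

aws iam detach-role-policy \
--role-name test-emr-demo1 \
--policy-arn "arn:aws:iam::123456789012:policy/test-emr-demo1-policy"

aws iam delete-policy \
--policy-arn "arn:aws:iam::123456789012:policy/test-emr-demo1-policy"

aws iam delete-role --role-name test-emr-demo1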

Conclusion

In this post, we discussed how you can control data access on Amazon EMR on EC2 clusters by using runtime roles with EMR steps. We discussed how the feature works, how you can use Lake Formation to apply fine-grained access controls, and how to monitor and control actions using a source identity. To learn more about this feature, refer to Configure runtime roles for Amazon EMR steps.


About the authors

Stefano Sandona is an Analytics Specialist Solution Architect with AWS. He loves data, distributed systems, and security. He helps customers all over the world architect their data platforms. He has a strong focus on Amazon EMR and all the security aspects around it.

Sharad Kala is a senior engineer at AWS working with the EMR team. He focuses on the security aspects of the applications running on EMR. He has a keen interest in working on and learning about distributed systems.


