Amazon EMR is happy to announce integration with Amazon Simple Storage Service (Amazon S3) Access Grants, which simplifies Amazon S3 permission management and lets you enforce granular access at scale. With this integration, you can scale job-based Amazon S3 access for Apache Spark jobs across all Amazon EMR deployment options and enforce granular Amazon S3 access for a better security posture.
In this post, we’ll walk through several different scenarios of how to use Amazon S3 Access Grants. Before we get started on walking through the Amazon EMR and Amazon S3 Access Grants integration, we’ll set up and configure S3 Access Grants. Then we’ll use the AWS CloudFormation template below to create an Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) cluster, an EMR Serverless application, and two different job roles.
After the setup, we’ll run several scenarios of how you can use Amazon EMR with S3 Access Grants. First, we’ll run a batch job on EMR on Amazon EC2 to import CSV files and convert them to Parquet. Second, we’ll use Amazon EMR Studio with an interactive EMR Serverless application to analyze the data. Finally, we’ll show how to set up cross-account access for Amazon S3 Access Grants. Many customers use different accounts across their organization, and even outside their organization, to share data. Amazon S3 Access Grants makes it easy to grant cross-account access to your data, even when filtering by different prefixes.
Beyond this post, you can learn more about Amazon S3 Access Grants from Scaling data access with Amazon S3 Access Grants.
Prerequisites
Before you launch the AWS CloudFormation stack, make sure you have the following:
- An AWS account that provides access to AWS services
- The latest version of the AWS Command Line Interface (AWS CLI)
- An AWS Identity and Access Management (AWS IAM) user with an access key and secret key to configure the AWS CLI, and permissions to create an IAM role, IAM policies, and stacks in AWS CloudFormation
- A second AWS account if you wish to test the cross-account functionality
Walkthrough
Create resources with AWS CloudFormation
In order to use Amazon S3 Access Grants, you’ll need a cluster with Amazon EMR 6.15.0 or later. For more information, see the documentation for using Amazon S3 Access Grants with an Amazon EMR cluster, an Amazon EMR on EKS cluster, and an Amazon EMR Serverless application. For the purposes of this post, we’ll assume that you have two different types of data access users in your organization: analytics engineers with read and write access to the data in the bucket, and business analysts with read-only access. We’ll utilize two different AWS IAM roles, but you can also connect your own identity provider directly to IAM Identity Center if you like.
Here’s the architecture for this first portion. The AWS CloudFormation stack creates the following AWS resources:
- A Virtual Private Cloud (VPC) stack with private and public subnets to use with EMR Studio, route tables, and a Network Address Translation (NAT) gateway.
- An Amazon S3 bucket for EMR artifacts like log files, Spark code, and Jupyter notebooks.
- An Amazon S3 bucket with sample data to use with S3 Access Grants.
- An Amazon EMR cluster configured to use runtime roles and S3 Access Grants.
- An Amazon EMR Serverless application configured to use S3 Access Grants.
- An Amazon EMR Studio where users can log in and create workspace notebooks with the EMR Serverless application.
- Two AWS IAM roles we’ll use for our EMR job runs: one for Amazon EC2 with write access and another for Serverless with read access.
- One AWS IAM role to be used by S3 Access Grants to access bucket data (that is, the role to use when registering a location with S3 Access Grants; S3 Access Grants uses this role to create temporary credentials).
To get started, complete the following steps:
- Choose Launch Stack:
- Accept the defaults and select I acknowledge that this template may create IAM resources.
The AWS CloudFormation stack takes roughly 10–15 minutes to complete. Once the stack is finished, go to the Outputs tab, where you will find the information needed for the following steps.
Create Amazon S3 Access Grants resources
First, we’re going to create the Amazon S3 Access Grants resources in our account. We create an S3 Access Grants instance; an S3 Access Grants location that refers to the data bucket created by the AWS CloudFormation stack and is accessible only by our data bucket AWS IAM role; and grants with different levels of access for our reader and writer roles.
To create the required S3 Access Grants resources, use the following AWS CLI commands as an administrative user, replacing any of the fields between the arrows with the output from your CloudFormation stack.
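The first command creates the S3 Access Grants instance itself (one per account per Region). A minimal sketch, where <ACCOUNT_ID> stands in for your 12-digit AWS account ID:

```bash
# Create the S3 Access Grants instance for this account and Region.
aws s3control create-access-grants-instance \
    --account-id <ACCOUNT_ID>
```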
Next, we create a new S3 Access Grants location. What’s a location? Amazon S3 Access Grants works by vending AWS IAM credentials with access scoped to a specific S3 prefix. An S3 Access Grants location is associated with an AWS IAM role from which these temporary sessions are created.
In our case, we’re going to scope the location to the bucket created with our AWS CloudFormation stack and give access to the data bucket role created by the stack. Go to the Outputs tab to find the values to replace in the following code snippet:
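A sketch of the location command; the <DATA_BUCKET> and <DATA_BUCKET_ROLE_ARN> placeholders are assumptions standing in for the corresponding values in your CloudFormation Outputs tab:

```bash
# Register the data bucket as an S3 Access Grants location, associated
# with the IAM role that S3 Access Grants assumes to vend credentials.
aws s3control create-access-grants-location \
    --account-id <ACCOUNT_ID> \
    --location-scope "s3://<DATA_BUCKET>/" \
    --iam-role-arn <DATA_BUCKET_ROLE_ARN>
```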
Note the AccessGrantsLocationId value in the response. We’ll need that for the next steps, where we’ll walk through creating the required S3 Access Grants to limit read and write access to your bucket.
- For the read/write user, use s3control create-access-grant to allow READWRITE access to the "output/*" prefix.
- For the read user, use s3control create-access-grant again to allow only READ access to the same prefix.
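Here’s a sketch of both commands; the <DATA_WRITER_ROLE_ARN> and <DATA_READER_ROLE_ARN> placeholders are assumptions standing in for the role ARNs from your CloudFormation Outputs tab:

```bash
# READWRITE grant on the output/* prefix for the analytics engineer (writer) role.
aws s3control create-access-grant \
    --account-id <ACCOUNT_ID> \
    --access-grants-location-id <AccessGrantsLocationId> \
    --access-grants-location-configuration S3SubPrefix="output/*" \
    --permission READWRITE \
    --grantee GranteeType=IAM,GranteeIdentifier=<DATA_WRITER_ROLE_ARN>

# READ-only grant on the same prefix for the business analyst (reader) role.
aws s3control create-access-grant \
    --account-id <ACCOUNT_ID> \
    --access-grants-location-id <AccessGrantsLocationId> \
    --access-grants-location-configuration S3SubPrefix="output/*" \
    --permission READ \
    --grantee GranteeType=IAM,GranteeIdentifier=<DATA_READER_ROLE_ARN>
```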
Demo Scenario 1: Amazon EMR on EC2 Spark job to generate Parquet files
Now that we’ve got our Amazon EMR environments set up and have granted access to our roles via S3 Access Grants, it’s important to note that the two AWS IAM roles for our EMR cluster and EMR Serverless application have an IAM policy that only allows access to our EMR artifacts bucket. They have no IAM access to our S3 data bucket and instead use S3 Access Grants to fetch short-lived credentials scoped to the bucket and prefix. Specifically, the roles are granted the s3:GetDataAccess and s3:GetAccessGrantsInstanceForPrefix permissions to request access via the specific S3 Access Grants instance created in our Region. This allows you to easily manage your S3 access in one place, in a highly scoped and granular fashion that enhances your security posture. By combining S3 Access Grants with job roles on EMR on Amazon Elastic Kubernetes Service (Amazon EKS) and EMR Serverless, as well as runtime roles for Amazon EMR steps beginning with EMR 6.7.0, you can easily manage access control for individual jobs or queries. S3 Access Grants are available on EMR 6.15.0 and later. Let’s first run a Spark job on EMR on EC2 as our analytics engineer to convert some sample data into Parquet.
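For reference, the relevant statement in the job roles’ policy looks roughly like the following sketch; the stack defines the actual policy, and <REGION> and <ACCOUNT_ID> are placeholders for the Region and account where the Access Grants instance lives:

```json
{
    "Effect": "Allow",
    "Action": [
        "s3:GetDataAccess",
        "s3:GetAccessGrantsInstanceForPrefix"
    ],
    "Resource": "arn:aws:s3:<REGION>:<ACCOUNT_ID>:access-grants/default"
}
```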
For this, use the sample code provided in converter.py. Download the file and copy it to the EMR_ARTIFACTS_BUCKET created by the AWS CloudFormation stack. We’ll submit our job with the ReadWrite AWS IAM role. Note that for the EMR cluster, we configured S3 Access Grants to fall back to the IAM role if access isn’t provided by S3 Access Grants. The DATA_WRITER_ROLE has read access to the EMR artifacts bucket through an IAM policy, so it can read our script. As before, replace all of the values between the <> symbols with those from the Outputs tab of your CloudFormation stack.
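A hedged sketch of the step submission follows; the s3://<EMR_ARTIFACTS_BUCKET>/code/converter.py upload path, and the assumption that converter.py takes the output location as its argument, are illustrative:

```bash
# Submit the Spark step with the ReadWrite runtime role (EMR 6.15.0+).
aws emr add-steps \
    --cluster-id <CLUSTER_ID> \
    --execution-role-arn <DATA_WRITER_ROLE_ARN> \
    --steps '[{
      "Type": "CUSTOM_JAR",
      "Jar": "command-runner.jar",
      "Name": "csv-to-parquet",
      "ActionOnFailure": "CONTINUE",
      "Args": ["spark-submit", "--deploy-mode", "cluster",
               "s3://<EMR_ARTIFACTS_BUCKET>/code/converter.py",
               "s3://<DATA_BUCKET>/output/weather-data/"]
    }]'
```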
Once the job finishes, we should see some Parquet files in s3://<DATA_BUCKET>/output/weather-data/. You can see the status of the job in the Steps tab of the EMR console.
Demo Scenario 2: EMR Studio with an interactive EMR Serverless application to analyze data
Now let’s go ahead and log in to EMR Studio and connect to your EMR Serverless application with the ReadOnly runtime role to analyze the data from scenario 1. First, we need to enable the interactive endpoint on your Serverless application.
- Select the EMRStudioURL in the Outputs tab of your AWS CloudFormation stack.
- Select Applications under the Serverless section on the left-hand side.
- Select the EMRBlog application, then the Actions dropdown, and Configure.
- Expand the Interactive endpoint section and make sure that Enable interactive endpoint is checked.
- Scroll down and click Configure application to save your changes.
- Back on the Applications page, select the EMRBlog application, then the Start application button.
Next, create a new workspace in our Studio.
- Choose Workspaces on the left-hand side, then the Create workspace button.
- Enter a Workspace name, leave the remaining defaults, and choose Create Workspace.
- After creating the workspace, it should launch in a new tab in a few seconds.
Now connect your Workspace to your EMR Serverless application.
- Select the EMR Compute button on the left-hand side.
- Choose EMR Serverless as the compute type.
- Choose the EMRBlog application and the runtime role that starts with EMRBlog.
- Choose Attach. The window will refresh, and you can open a new PySpark notebook and follow along below. To execute the code yourself, download the AccessGrantsReadOnly.ipynb notebook and upload it into your workspace using the Upload Files button in the file browser.
Let’s do a quick read of the data, followed by a simple count(*).
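In a PySpark notebook cell, that might look like the following sketch (the spark session is provided by the notebook kernel, and the path matches the output written in scenario 1):

```python
# Read the Parquet files written by the EMR on EC2 job in scenario 1.
df = spark.read.parquet("s3://<DATA_BUCKET>/output/weather-data/")

# Run a simple count(*) over the dataset.
df.createOrReplaceTempView("weather_data")
spark.sql("SELECT COUNT(*) FROM weather_data").show()
```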
You can also see that if we try to write data into the output location, we get an Amazon S3 error.
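For example, a write like the following sketch (the write-test sub-prefix is hypothetical) fails with an access denied error, because the ReadOnly role’s grant only allows READ on output/*:

```python
# This write is expected to fail: the ReadOnly role has no WRITE grant,
# so the underlying GetDataAccess call returns an access denied error.
df.limit(10).write.parquet("s3://<DATA_BUCKET>/output/write-test/")
```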
While you can also grant similar access via AWS IAM policies, Amazon S3 Access Grants can be useful for situations where your organization has outgrown managing access via IAM, wants to map S3 Access Grants to IAM Identity Center principals or roles, or has previously used EMR File System (EMRFS) role mappings. S3 Access Grants credentials are also temporary, providing more secure access to your data. In addition, as shown below, cross-account access also benefits from the simplicity of S3 Access Grants.
Demo Scenario 3: Cross-account access
One of the other more common access patterns is accessing data across accounts. This pattern has become increasingly common with the emergence of data mesh, where data producers and consumers are decentralized across different AWS accounts.
Previously, cross-account access required setting up complex cross-account assume role actions and custom credentials providers when configuring your Spark job. With S3 Access Grants, we only need to do the following:
- Create an Amazon EMR job role and cluster in a second data consumer account
- The data producer account grants access to the data consumer account with a new instance resource policy
- The data producer account creates an access grant for the data consumer job role
And that’s it! If you have a second account handy, go ahead and deploy this AWS CloudFormation stack in the data consumer account to create a new EMR Serverless application and job role. If not, just follow along below. The AWS CloudFormation stack should finish creating in under a minute. Next, let’s go ahead and grant our data consumer access to the S3 Access Grants instance in our data producer account.
- Replace <DATA_PRODUCER_ACCOUNT_ID> and <DATA_CONSUMER_ACCOUNT_ID> with the relevant 12-digit AWS account IDs.
- You may also need to change the Region in the command and policy.
- Then grant READ access to the output folder to our EMR Serverless job role in the data consumer account, as sketched after this list.
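Here’s a sketch of both steps with the AWS CLI, run as an administrative user in the data producer account. The policy document is an assumption modeled on the actions S3 Access Grants needs cross-account; adjust us-east-1 to your Region:

```bash
# 1. Allow the data consumer account to request access through the data
#    producer's S3 Access Grants instance.
aws s3control put-access-grants-instance-resource-policy \
    --account-id <DATA_PRODUCER_ACCOUNT_ID> \
    --policy '{
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "<DATA_CONSUMER_ACCOUNT_ID>"},
        "Action": ["s3:ListAccessGrants",
                   "s3:ListAccessGrantsLocations",
                   "s3:GetDataAccess"],
        "Resource": "arn:aws:s3:us-east-1:<DATA_PRODUCER_ACCOUNT_ID>:access-grants/default"
      }]
    }'

# 2. Grant READ on the output/* prefix to the consumer's EMR Serverless job role.
aws s3control create-access-grant \
    --account-id <DATA_PRODUCER_ACCOUNT_ID> \
    --access-grants-location-id <AccessGrantsLocationId> \
    --access-grants-location-configuration S3SubPrefix="output/*" \
    --permission READ \
    --grantee GranteeType=IAM,GranteeIdentifier=<DATA_CONSUMER_JOB_ROLE>
```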
Now that we’ve done that, we can read data in the data consumer account from the bucket in the data producer account. We’ll just run a simple COUNT(*) again. Replace <APPLICATION_ID>, <DATA_CONSUMER_JOB_ROLE>, and <DATA_CONSUMER_LOG_BUCKET> with the values from the Outputs tab of the AWS CloudFormation stack created in your second account, and replace <DATA_PRODUCER_BUCKET> with the bucket from your first account.
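A sketch of the submission, assuming a small hypothetical count.py PySpark script (uploaded to the consumer’s bucket) that reads the given path and prints a row count:

```bash
# Run the COUNT(*) job from the data consumer account against the
# data producer's bucket.
aws emr-serverless start-job-run \
    --application-id <APPLICATION_ID> \
    --execution-role-arn <DATA_CONSUMER_JOB_ROLE> \
    --job-driver '{
      "sparkSubmit": {
        "entryPoint": "s3://<DATA_CONSUMER_LOG_BUCKET>/code/count.py",
        "entryPointArguments": ["s3://<DATA_PRODUCER_BUCKET>/output/weather-data/"]
      }
    }' \
    --configuration-overrides '{
      "monitoringConfiguration": {
        "s3MonitoringConfiguration": {"logUri": "s3://<DATA_CONSUMER_LOG_BUCKET>/logs/"}
      }
    }'
```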
Wait for the job to reach a completed state, and then fetch the stdout log from your bucket, replacing <APPLICATION_ID> and <JOB_RUN_ID> from the job above, as well as <DATA_CONSUMER_LOG_BUCKET>.
If you’re on a Unix-based machine and have gunzip installed, you can use the following command as your administrative user. Note that this command uses only AWS IAM role policies, not Amazon S3 Access Grants.
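Assuming the default EMR Serverless log layout under the logUri configured above, something like the following prints the driver output:

```bash
# Fetch and decompress the Spark driver stdout for the job run.
aws s3 cp "s3://<DATA_CONSUMER_LOG_BUCKET>/logs/applications/<APPLICATION_ID>/jobs/<JOB_RUN_ID>/SPARK_DRIVER/stdout.gz" - | gunzip
```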
Otherwise, you can use the get-dashboard-for-job-run command and open the resulting URL in your browser to view the driver stdout logs in the Executors tab of the Spark UI.
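For example:

```bash
# Generate a pre-signed URL for the job run's Spark UI dashboard.
aws emr-serverless get-dashboard-for-job-run \
    --application-id <APPLICATION_ID> \
    --job-run-id <JOB_RUN_ID>
```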
Cleaning up
To avoid incurring future costs for the example resources in your AWS accounts, be sure to take the following steps:
- You must manually delete the Amazon EMR Studio workspace created in the first part of the post
- Empty the Amazon S3 buckets created by the AWS CloudFormation stacks
- Make sure you delete the Amazon S3 Access Grants, resource policies, and S3 Access Grants location created in the steps above using the delete-access-grant, delete-access-grants-instance-resource-policy, delete-access-grants-location, and delete-access-grants-instance commands, as sketched after this list
- Delete the AWS CloudFormation stacks created in each account
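A sketch of the S3 Access Grants cleanup, repeating delete-access-grant once per grant you created; the grant and location IDs come from the earlier create calls:

```bash
# Delete grants first, then the resource policy, the location, and the instance.
aws s3control delete-access-grant \
    --account-id <ACCOUNT_ID> --access-grant-id <AccessGrantId>
aws s3control delete-access-grants-instance-resource-policy \
    --account-id <DATA_PRODUCER_ACCOUNT_ID>
aws s3control delete-access-grants-location \
    --account-id <ACCOUNT_ID> --access-grants-location-id <AccessGrantsLocationId>
aws s3control delete-access-grants-instance \
    --account-id <ACCOUNT_ID>
```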
Comparison to AWS IAM Role Mapping
In 2018, EMR introduced EMRFS role mapping as a way to provide storage-level authorization by configuring EMRFS with multiple IAM roles. While effective, role mapping required managing users or groups locally on your EMR cluster, in addition to maintaining the mappings between those identities and their corresponding IAM roles. Combined with runtime roles on EMR on EC2 and job roles for EMR on EKS and EMR Serverless, it’s now easier to grant access to your data on S3 directly to the relevant principal on a per-job basis.
Conclusion
In this post, we showed you how to set up and use Amazon S3 Access Grants with Amazon EMR to easily manage data access for your Amazon EMR workloads. With S3 Access Grants and EMR, you can easily configure access to data on S3 for IAM identities, or by using your corporate directory in IAM Identity Center as your identity source. S3 Access Grants is supported across EMR on EC2, EMR on EKS, and EMR Serverless starting with EMR release 6.15.0.
To learn more, see the S3 Access Grants and EMR documentation, and feel free to ask any questions in the comments!
About the author
Damon Cortesi is a Principal Developer Advocate with Amazon Web Services. He builds tools and content to help make the lives of data engineers easier. When not hard at work, he still builds data pipelines and splits logs in his spare time.