
Migrate data from Azure Blob Storage to Amazon S3 using AWS Glue


Today, we're pleased to announce new AWS Glue connectors for Azure Blob Storage and Azure Data Lake Storage that help you move data bi-directionally between Azure Blob Storage, Azure Data Lake Storage, and Amazon Simple Storage Service (Amazon S3).

We've seen demand for designing applications that keep data portable across cloud environments and that can derive insights from multiple data sources. Two data sources you can now quickly integrate with are Azure Blob Storage, a managed service for storing both unstructured and structured data, and Azure Data Lake Storage, a data lake for analytics workloads. With these connectors, you can bring the data from Azure Blob Storage and Azure Data Lake Storage separately to Amazon S3.

In this post, we use Azure Blob Storage as an example and demonstrate how the new connector works, introduce the connector's functions, and walk you through the key steps to set it up. We cover the prerequisites, show how to subscribe to this connector in AWS Marketplace, and describe how to create and run AWS Glue for Apache Spark jobs with it. Regarding the Azure Data Lake Storage Gen2 connector, we highlight any major differences in this post.

AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue natively integrates with various data stores such as MySQL, PostgreSQL, MongoDB, and Apache Kafka, along with AWS data stores such as Amazon S3, Amazon Redshift, Amazon Relational Database Service (Amazon RDS), and Amazon DynamoDB. AWS Glue Marketplace connectors allow you to discover and integrate additional data sources, such as software as a service (SaaS) applications and your custom data sources. With just a few clicks, you can search for and select connectors from AWS Marketplace and begin your data preparation workflow in minutes.

How the connectors work

In this section, we discuss how the new connectors work.

Azure Blob Storage connector

This connector relies on the Spark DataSource API and calls Hadoop's FileSystem interface. The latter provides libraries for reading and writing various distributed and traditional storage systems. This connector also includes the hadoop-azure module, which lets you run Apache Hadoop or Apache Spark jobs directly against data in Azure Blob Storage. AWS Glue loads the library from the Amazon Elastic Container Registry (Amazon ECR) repository during initialization (as a connector), reads the connection credentials using AWS Secrets Manager, and reads data source configurations from input parameters. When AWS Glue has internet access, the Spark job in AWS Glue can read from and write to Azure Blob Storage.

We support the following two methods for authentication: the authentication key for Shared Key, and shared access signature (SAS) tokens:

# Method 1: Shared Key
spark.conf.set("fs.azure.account.key.youraccountname.blob.core.windows.net", "*Your account key")

df = spark.read.format("csv").option("header","true").load("wasbs://yourblob@youraccountname.blob.core.windows.net/loadingtest-input/100mb")

df.write.format("csv").option("compression","snappy").mode("overwrite").save("wasbs://<container_name>@<account_name>.blob.core.windows.net/output-CSV/20210831/")

# Method 2: Shared Access Signature (SAS) tokens
spark.conf.set("fs.azure.sas.yourblob.youraccountname.blob.core.windows.net", "Your SAS token*")

df = spark.read.format("csv").option("header","true").load("wasbs://yourblob@youraccountname.blob.core.windows.net/loadingtest-input/100mb")

df.write.format("csv").option("compression","snappy").mode("overwrite").save("wasbs://<container_name>@<account_name>.blob.core.windows.net/output-CSV/20210831/")

Azure Data Lake Storage Gen2 connector

The usage of the Azure Data Lake Storage Gen2 connector is much the same as the Azure Blob Storage connector. It uses the same library as the Azure Blob Storage connector, and relies on the Spark DataSource API, Hadoop's FileSystem interface, and the Azure Blob Storage connector for Hadoop.

As of this writing, we only support the Shared Key authentication method:

# Method: Shared Key
spark.conf.set("fs.azure.account.key.youraccountname.dfs.core.windows.net", "*Your account key")

# Read file from ADLS example
df = spark.read.format("csv").option("header","true").load("abfss://<container_name>@<account_name>.dfs.core.windows.net/input-csv/covid/")

# Write file to ADLS example
df.write.format("parquet").option("compression","snappy").partitionBy("state").mode("overwrite").save("abfss://<container_name>@<account_name>.dfs.core.windows.net/output-glue3/csv-partitioned/")

Solution overview

The following architecture diagram shows how AWS Glue connects to Azure Blob Storage for data ingestion.

In the following sections, we show you how to create a new secret for Azure Blob Storage in Secrets Manager, subscribe to the AWS Glue connector, and move data from Azure Blob Storage to Amazon S3.

Prerequisites

You need the following prerequisites:

  • A storage account in Microsoft Azure and your data path in Azure Blob Storage. Prepare the storage account credentials in advance. For instructions, refer to Create a storage account shared key.
  • A Secrets Manager secret to store a Shared Key secret, using one of the supported authentication methods.
  • An AWS Identity and Access Management (IAM) role for the AWS Glue job with the following policies (a boto3 sketch of attaching them follows this list):
    • AWSGlueServiceRole, which allows the AWS Glue service role access to related services.
    • AmazonEC2ContainerRegistryReadOnly, which provides read-only access to Amazon EC2 Container Registry repositories. This policy is for using AWS Marketplace's connector libraries.
    • A Secrets Manager policy, which provides read access to the secret in Secrets Manager.
    • An S3 bucket policy for the S3 bucket where you load the ETL (extract, transform, and load) data from Azure Blob Storage.
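
The following is a minimal boto3 sketch of attaching these policies to a Glue job role. The role name, Region, account ID, and bucket name are placeholders, and the inline policy statements are our own illustrative assumptions; scope them to your own resources.

# Minimal sketch (boto3): attach the policies listed above to a Glue job role.
# The role name, secret ARN, and bucket name below are placeholders, not values from this post.
import json
import boto3

iam = boto3.client("iam")
role_name = "MyGlueAzureBlobJobRole"  # hypothetical role name

# Managed policies: Glue service access and read-only Amazon ECR access for the
# AWS Marketplace connector image.
for policy_arn in (
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
):
    iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)

# Inline policy: read the Azure credentials secret and read/write the target S3 bucket.
iam.put_role_policy(
    RoleName=role_name,
    PolicyName="AzureBlobConnectorAccess",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "secretsmanager:GetSecretValue",
                "Resource": "arn:aws:secretsmanager:<region>:<account-id>:secret:azureblobstorage_credentials-*",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": ["arn:aws:s3:::<your-bucket>", "arn:aws:s3:::<your-bucket>/*"],
            },
        ],
    }),
)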

Create a new secret for Azure Blob Storage in Secrets Manager

Complete the following steps to create a secret in Secrets Manager to store the Azure Blob Storage connection strings using the Shared Key authentication method (a boto3 alternative is sketched after these steps):

  1. On the Secrets Manager console, choose Secrets in the navigation pane.
  2. Choose Store a new secret.
  3. For Secret type, select Other type of secret.
  4. Replace the values for accountName, accountKey, and container with your own values.
  5. Leave the rest of the options at their defaults.
  6. Choose Next.
  7. Provide a name for the secret, such as azureblobstorage_credentials.
  8. Follow the rest of the steps to store the secret.
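
If you prefer to script this step, the following is a minimal boto3 sketch that creates an equivalent secret; the placeholder values are your own Azure storage account details.

# Minimal sketch: create the same secret with boto3 instead of the console.
# Replace the placeholder values with your own Azure storage account details.
import json
import boto3

boto3.client("secretsmanager").create_secret(
    Name="azureblobstorage_credentials",
    SecretString=json.dumps({
        "accountName": "<your_account_name>",
        "accountKey": "<your_account_key>",
        "container": "<your_container_name>",
    }),
)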

Subscribe to the AWS Glue connector for Azure Blob Storage

To subscribe to the connector, complete the following steps:

  1. Navigate to the Azure Blob Storage Connector for AWS Glue on AWS Marketplace.
  2. On the product page for the connector, use the tabs to view information about the connector, then choose Continue to Subscribe.
  3. Review the pricing terms and the seller's End User License Agreement, then choose Accept Terms.
  4. Continue to the next step by choosing Continue to Configuration.
  5. On the Configure this software page, choose the fulfillment options and the version of the connector to use.

We have provided two options for the Azure Blob Storage Connector: AWS Glue 3.0 and AWS Glue 4.0. In this example, we focus on AWS Glue 4.0. Choose Continue to Launch.

  6. On the Launch this software page, choose Usage instructions to review the usage instructions provided by AWS.
  7. When you're ready to continue, choose Activate the Glue connector from AWS Glue Studio.

The console displays the Create marketplace connection page in AWS Glue Studio.

Move data from Azure Blob Storage to Amazon S3

To move your data to Amazon S3, you must configure the custom connection and then set up an AWS Glue job.

Create a custom connection in AWS Glue

An AWS Glue connection stores connection information for a particular data store, including login credentials, URI strings, virtual private cloud (VPC) information, and more. Complete the following steps to create your connection:

  1. On the AWS Glue console, choose Connectors in the navigation pane.
  2. Choose Create connection.
  3. For Connector, choose Azure Blob Storage Connector for AWS Glue.
  4. For Name, enter a name for the connection (for example, AzureBlobStorageConnection).
  5. Enter an optional description.
  6. For AWS secret, enter the secret you created (azureblobstorage_credentials).
  7. Choose Create connection and activate connector.

The connector and connection information is now visible on the Connectors page.

Create an AWS Glue job and configure connection options

Complete the following steps (a sketch of the job script that these options produce follows the list):

  1. On the AWS Glue console, choose Connectors in the navigation pane.
  2. Choose the connection you created (AzureBlobStorageConnection).
  3. Choose Create job.
  4. For Name, enter Azure Blob Storage Connector for AWS Glue. This name should be unique among all the nodes for this job.
  5. For Connection, choose the connection you created (AzureBlobStorageConnection).
  6. For Key, enter path, and for Value, enter your Azure Blob Storage URI. For example, when we created our new secret, we already set a container value for Azure Blob Storage, so here we only enter the file path /input_data/.
  7. Enter another key-value pair. For Key, enter fileFormat. For Value, enter csv, because our sample data is in this format.
  8. Optionally, if the CSV file contains a header line, enter another key-value pair. For Key, enter header. For Value, enter true.
  9. To preview your data, choose the Data preview tab, then choose Start data preview session and choose the IAM role defined in the prerequisites.
  10. Choose Confirm and wait for the results to display.
  11. Select S3 as the target location.
  12. Choose Browse S3 to see the S3 buckets that you have access to, and choose one as the target destination for the data output.
  13. For the other options, use the default values.
  14. On the Job details tab, for IAM Role, choose the IAM role defined in the prerequisites.
  15. For Glue version, choose your AWS Glue version.
  16. Continue to create your ETL job. For instructions, refer to Creating ETL jobs with AWS Glue Studio.
  17. Choose Run to run your job.
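
For reference, the job script produced by these options is roughly shaped like the following sketch. This is an approximation rather than the exact code AWS Glue Studio generates: it assumes the Marketplace connector is addressed through the marketplace.spark connection type with the connectionName, path, fileFormat, and header options entered above, and it uses a placeholder S3 output path.

# Approximate sketch of the AWS Glue Studio job script (not the exact generated output).
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: Azure Blob Storage through the Marketplace connector and the connection
# created earlier (AzureBlobStorageConnection), using the options set in the console.
azure_blob_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "connectionName": "AzureBlobStorageConnection",
        "path": "/input_data/",
        "fileFormat": "csv",
        "header": "true",
    },
    transformation_ctx="azure_blob_dyf",
)

# Target: the S3 bucket chosen as the target location (placeholder path).
glueContext.write_dynamic_frame.from_options(
    frame=azure_blob_dyf,
    connection_type="s3",
    format="csv",
    connection_options={"path": "s3://<your-output-bucket>/output/"},
    transformation_ctx="s3_sink",
)

job.commit()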

When the job is complete, you can navigate to the Run details page on the AWS Glue console and check the logs in Amazon CloudWatch.

The data is ingested into Amazon S3, as shown in the following screenshot. We are now able to import data from Azure Blob Storage to Amazon S3.

Scaling considerations

In this example, we use the default AWS Glue capacity of 10 DPUs (Data Processing Units). A DPU is a standardized unit of processing capacity that consists of 4 vCPUs of compute capacity and 16 GB of memory. To scale your AWS Glue job, you can increase the number of DPUs and also take advantage of Auto Scaling. With Auto Scaling enabled, AWS Glue automatically adds and removes workers from the cluster depending on the workload. After you choose the maximum number of workers, AWS Glue adapts the size of the resources to the workload.
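
As a rough illustration, the following boto3 sketch raises the worker count for an existing job and enables Auto Scaling through the --enable-auto-scaling job parameter. The job name, role, and script location are placeholders, and treating this parameter as the way to switch on Auto Scaling outside the console is our assumption; confirm against the current AWS Glue documentation.

# Minimal sketch (boto3): scale an existing Glue job and enable Auto Scaling.
# UpdateJob overwrites the job definition, so the sketch restates Role and Command;
# all values below are placeholders.
import boto3

glue = boto3.client("glue")
glue.update_job(
    JobName="<your_job_name>",
    JobUpdate={
        "Role": "<your_glue_job_role>",
        "Command": {"Name": "glueetl", "ScriptLocation": "s3://<your-bucket>/scripts/<your_job>.py"},
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",        # 1 DPU per worker
        "NumberOfWorkers": 20,       # maximum workers Auto Scaling can scale up to
        "DefaultArguments": {"--enable-auto-scaling": "true"},  # assumption: Auto Scaling flag
    },
)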

Clean up

To clean up your resources, complete the following steps:

  1. Remove the AWS Glue job, the connection, and the secret in Secrets Manager with the following commands:
    aws glue delete-job --job-name <your_job_name>

    aws glue delete-connection --connection-name <your_connection_name>

    aws secretsmanager delete-secret --secret-id <your_secretsmanager_id>

  2. If you're not going to use this connector, you can cancel the subscription to the Azure Blob Storage connector:
    1. On the AWS Marketplace console, go to the Manage subscriptions page.
    2. Select the subscription for the product that you want to cancel.
    3. On the Actions menu, choose Cancel subscription.
    4. Read the information provided and select the acknowledgement check box.
    5. Choose Yes, cancel subscription.
  3. Delete the data in the S3 bucket that you used in the previous steps.

Conclusion

In this post, we showed how to use AWS Glue and the new connector to ingest data from Azure Blob Storage to Amazon S3. This connector provides access to Azure Blob Storage, facilitating cloud ETL processes for operational reporting, backup and disaster recovery, data governance, and more.

We welcome any feedback or questions in the comments section.

Appendix

When you need SAS token authentication for Azure Data Lake Storage Gen2, you can use the Azure SAS Token Provider for Hadoop. To do that, add the JAR file to your S3 bucket and configure your AWS Glue job to set the S3 location in the job parameter --extra-jars (in AWS Glue Studio, Dependent JARs path). Then save the SAS token in Secrets Manager and set the value of spark.hadoop.fs.azure.sas.fixed.token.<azure storage account>.dfs.core.windows.net in SparkConf using script mode at runtime. Learn more in the README.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.hadoop.fs.azure.sas.fixed.token.<azure storage account>.dfs.core.windows.net", sas_token) \
    .getOrCreate()
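
The sas_token variable above has to come from somewhere. A minimal sketch of reading it from Secrets Manager, assuming a hypothetical secret named azure_adls_sas_token whose SecretString is the raw token, looks like this:

# Minimal sketch: read the SAS token from Secrets Manager before building the session.
# The secret name azure_adls_sas_token is a hypothetical example, not from this post.
import boto3

sas_token = boto3.client("secretsmanager").get_secret_value(
    SecretId="azure_adls_sas_token"
)["SecretString"]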

About the authors

Qiushuang Feng is a Solutions Architect at AWS, responsible for enterprise customers' technical architecture design, consulting, and design optimization on AWS Cloud services. Before joining AWS, Qiushuang worked at IT companies such as IBM and Oracle, and accumulated rich practical experience in development and analytics.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is passionate about architecting fast-growing data environments, diving deep into distributed big data software like Apache Spark, building reusable software artifacts for data lakes, and sharing knowledge in AWS Big Data blog posts.

Shengjie Luo is a Big Data Architect on the Amazon Cloud Technology professional services team. They are responsible for solution consulting, architecture, and delivery of AWS-based data warehouses and data lakes. They are skilled in serverless computing, data migration, cloud data integration, data warehouse planning, and data service architecture design and implementation.

Greg Huang is a Senior Solutions Architect at AWS with expertise in technical architecture design and consulting for the China G1000 team. He is dedicated to deploying and operating enterprise-level applications on AWS Cloud services. He has nearly 20 years of experience in large-scale enterprise application development and implementation, has worked in the cloud computing field for many years, and has extensive experience helping various types of enterprises migrate to the cloud. Prior to joining AWS, he worked for well-known IT enterprises such as Baidu and Oracle.

Maciej Torbus is a Principal Customer Solutions Manager within Strategic Accounts at Amazon Web Services. With extensive experience in large-scale migrations, he focuses on helping customers move their applications and systems to highly reliable and scalable architectures in AWS. Outside of work, he enjoys sailing, traveling, and restoring vintage mechanical watches.


