
Introducing the Cloud Shuffle Storage Plugin for Apache Spark


AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. In AWS Glue, you can use Apache Spark, an open-source, distributed processing system, for your data integration tasks and big data workloads.

Apache Spark uses in-memory caching and optimized query execution for fast analytic queries against your datasets, which are split into multiple Spark partitions on different nodes so that large amounts of data can be processed in parallel. In Apache Spark, shuffling happens when data needs to be redistributed across the cluster. During a shuffle, data is written to local disk and transferred across the network. The shuffle operation is often constrained by the available local disk capacity, or by data skew, which can cause straggling executors. Spark often throws a No space left on device or MetadataFetchFailedException error when there isn't enough disk space left on the executor and there is no way to recover. Such Spark jobs typically can't succeed without adding additional compute and attached storage, in which the compute is often idle and results in additional cost.

In 2021, we launched Amazon S3 shuffle for AWS Glue 2.0 with Spark 2.4. This feature disaggregated Spark compute and shuffle storage by using Amazon Simple Storage Service (Amazon S3) to store Spark shuffle files. Using Amazon S3 for Spark shuffle storage enabled you to run data-intensive workloads more reliably. After the launch, we continued investing in this area and collected customer feedback.

Today, we're pleased to release the Cloud Shuffle Storage Plugin for Apache Spark. It supports the latest Apache Spark 3.x distribution, so you can take advantage of the plugin in AWS Glue or any other Spark environment. It's now also natively available in AWS Glue Spark jobs on AWS Glue 3.0 and the latest AWS Glue version 4.0 without requiring any extra setup or external libraries. Like the Amazon S3 shuffle for AWS Glue 2.0, the Cloud Shuffle Storage Plugin helps you solve constrained disk space errors during shuffle in serverless Spark environments.

We're also excited to announce the release of software binaries for the Cloud Shuffle Storage Plugin for Apache Spark under the Apache 2.0 license. You can download the binaries and run them in any Spark environment. The new plugin is open-cloud, comes with out-of-the-box support for Amazon S3, and can be easily configured to use other forms of cloud storage such as Google Cloud Storage and Microsoft Azure Blob Storage.

Understanding a shuffle operation in Apache Spark

In Apache Spark, there are two types of transformations:

  • Narrow transformation – This includes map, filter, union, and mapPartition, where each input partition contributes to only one output partition.
  • Wide transformation – This includes join, groupByKey, reduceByKey, and repartition, where each input partition contributes to many output partitions. Spark SQL queries including JOIN, ORDER BY, and GROUP BY require wide transformations.

A wide transformation triggers a shuffle, which occurs whenever data is reorganized into new partitions with each key assigned to one of them. During a shuffle phase, all Spark map tasks write shuffle data to local disk that is then transferred across the network and fetched by Spark reduce tasks. The volume of data shuffled is visible in the Spark UI. When shuffle writes take up more space than the locally available disk capacity, it causes a No space left on device error.
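
To make the difference concrete, the following is a minimal PySpark sketch (the dataset path and column names are hypothetical) in which the filter is a narrow transformation and the groupBy is a wide transformation that triggers a shuffle:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# Hypothetical sales dataset with one row per transaction
sales = spark.read.parquet("s3://<your-bucket>/sales/")

# Narrow transformation: filter works partition by partition, no shuffle
recent = sales.filter(F.col("sale_date") >= "2000-08-23")

# Wide transformation: groupBy redistributes rows by store_id across the cluster,
# so Spark writes shuffle files before computing the aggregation
totals = recent.groupBy("store_id").agg(F.sum("net_profit").alias("profit"))

# The physical plan shows an Exchange (shuffle) before the final aggregation
totals.explain()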

To illustrate one of the typical scenarios, let's use the query q80.sql from the standard TPC-DS 3 TB dataset as an example. This query attempts to calculate the total sales, returns, and eventual profit realized during a specific time frame. It involves multiple wide transformations (shuffles) caused by a left outer join and a group by.

Let's run the following query as an AWS Glue 3.0 job with 10 G.1X workers, where a total of 640 GB of local disk space is available:

with ssr as
 (select  s_store_id as store_id,
          sum(ss_ext_sales_price) as sales,
          sum(coalesce(sr_return_amt, 0)) as returns,
          sum(ss_net_profit - coalesce(sr_net_loss, 0)) as profit
  from store_sales left outer join store_returns on
         (ss_item_sk = sr_item_sk and ss_ticket_number = sr_ticket_number),
     date_dim, store, item, promotion
 where ss_sold_date_sk = d_date_sk
       and d_date between cast('2000-08-23' as date)
                  and (cast('2000-08-23' as date) + interval '30' day)
       and ss_store_sk = s_store_sk
       and ss_item_sk = i_item_sk
       and i_current_price > 50
       and ss_promo_sk = p_promo_sk
       and p_channel_tv = 'N'
 group by s_store_id),
 csr as
 (select  cp_catalog_page_id as catalog_page_id,
          sum(cs_ext_sales_price) as sales,
          sum(coalesce(cr_return_amount, 0)) as returns,
          sum(cs_net_profit - coalesce(cr_net_loss, 0)) as profit
  from catalog_sales left outer join catalog_returns on
         (cs_item_sk = cr_item_sk and cs_order_number = cr_order_number),
     date_dim, catalog_page, item, promotion
 where cs_sold_date_sk = d_date_sk
       and d_date between cast('2000-08-23' as date)
                  and (cast('2000-08-23' as date) + interval '30' day)
        and cs_catalog_page_sk = cp_catalog_page_sk
       and cs_item_sk = i_item_sk
       and i_current_price > 50
       and cs_promo_sk = p_promo_sk
       and p_channel_tv = 'N'
 group by cp_catalog_page_id),
 wsr as
 (select  web_site_id,
          sum(ws_ext_sales_price) as sales,
          sum(coalesce(wr_return_amt, 0)) as returns,
          sum(ws_net_profit - coalesce(wr_net_loss, 0)) as profit
  from web_sales left outer join web_returns on
         (ws_item_sk = wr_item_sk and ws_order_number = wr_order_number),
     date_dim, web_site, item, promotion
 where ws_sold_date_sk = d_date_sk
       and d_date between cast('2000-08-23' as date)
                  and (cast('2000-08-23' as date) + interval '30' day)
        and ws_web_site_sk = web_site_sk
       and ws_item_sk = i_item_sk
       and i_current_price > 50
       and ws_promo_sk = p_promo_sk
       and p_channel_tv = 'N'
 group by web_site_id)
 select channel, id, sum(sales) as sales, sum(returns) as returns, sum(profit) as profit
 from (select
        'store channel' as channel, concat('store', store_id) as id, sales, returns, profit
      from ssr
      union all
      select
        'catalog channel' as channel, concat('catalog_page', catalog_page_id) as id,
        sales, returns, profit
      from csr
      union all
      select
        'web channel' as channel, concat('web_site', web_site_id) as id, sales, returns, profit
      from  wsr) x
 group by rollup (channel, id)
 order by channel, id

The following screenshot shows the Executor tab in the Spark UI.
Spark UI Executor Tab

The following screenshot shows the status of Spark jobs included in the AWS Glue job run.
Spark UI Jobs
In the failed Spark job (job ID=7), we can see the failed Spark stage in the Spark UI.
Spark UI Failed stage
There were 167.8 GiB of shuffle writes during the stage, and 14 tasks failed due to the error java.io.IOException: No space left on device because the host 172.34.97.212 ran out of local disk.
Spark UI Tasks

Cloud Shuffle Storage for Apache Spark

Cloud Shuffle Storage for Apache Spark allows you to store Spark shuffle files on Amazon S3 or other cloud storage services. This gives complete elasticity to Spark jobs, thereby allowing you to run your most data-intensive workloads reliably. The following figure illustrates how Spark map tasks write the shuffle files to the Cloud Shuffle Storage. Reducer tasks treat the shuffle blocks as remote blocks and read them from the same shuffle storage.

This architecture enables your serverless Spark jobs to use Amazon S3 without the overhead of running, operating, and maintaining additional storage or compute nodes.
Chopper diagram
The following AWS Glue job parameters enable and tune Spark to use S3 buckets for storing shuffle data. You can also enable at-rest encryption when writing shuffle data to Amazon S3 by using security configuration settings.

  • --write-shuffle-files-to-s3 = TRUE – The main flag, which tells Spark to use S3 buckets for writing and reading shuffle data.
  • --conf = spark.shuffle.storage.path=s3://<shuffle-bucket> – Optional; specifies the S3 bucket where the plugin writes the shuffle files. By default, the plugin uses --TempDir/shuffle-data.
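
For example, the following is a minimal boto3 sketch (the job name and bucket are placeholders) of passing these parameters when starting a run of an existing AWS Glue job:

import boto3

glue = boto3.client("glue")

# Start a run of an existing AWS Glue 3.0/4.0 job with cloud shuffle storage enabled.
# The job name and bucket below are placeholders for your own resources.
response = glue.start_job_run(
    JobName="my-glue-spark-job",
    Arguments={
        "--write-shuffle-files-to-s3": "true",
        "--conf": "spark.shuffle.storage.path=s3://my-shuffle-bucket/shuffle-data/",
    },
)
print(response["JobRunId"])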

The shuffle files are written to this location, creating files such as the following:

s3://<shuffle-storage-path>/<Spark application ID>/[0-9]/<Shuffle ID>/shuffle_<Shuffle ID>_<Mapper ID>_0.data

With the Cloud Shuffle Storage Plugin enabled and using the same AWS Glue job setup, the TPC-DS query now succeeded without any job or stage failures.
Spark UI Jobs with Chopper plugin

Software binaries for the Cloud Shuffle Storage Plugin

You can now also download and use the plugin in your own Spark environments and with other cloud storage services. The plugin binaries are available for use under the Apache 2.0 license.

Bundle the plugin with your Spark applications

You can bundle the plugin with your Spark applications by adding it as a dependency in your Maven pom.xml as you develop your Spark applications, as shown in the following code. For more details on the plugin and Spark versions, refer to Plugin versions.

<repositories>
   ...
    <repository>
        <id>aws-glue-etl-artifacts</id>
        <url>https://aws-glue-etl-artifacts.s3.amazonaws.com/release/</url>
    </repository>
</repositories>
...
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>chopper-plugin</artifactId>
    <version>3.1-amzn-LATEST</version>
</dependency>

Alternatively, you can download the binaries from the AWS Glue Maven artifacts directly and include them in your Spark application as follows:

#!/bin/bash
sudo wget -v https://aws-glue-etl-artifacts.s3.amazonaws.com/release/com/amazonaws/chopper-plugin/3.1-amzn-LATEST/chopper-plugin-3.1-amzn-LATEST.jar -P /usr/lib/spark/jars

Submit the Spark application by including the JAR files on the classpath and specifying the two Spark configs for the plugin:

spark-submit --deploy-mode cluster \
  --conf spark.shuffle.sort.io.plugin.class=com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin \
  --conf spark.shuffle.storage.path=s3://<s3 bucket>/<shuffle-dir> \
  --class <your class> <your application jar>

The following Spark parameters enable and configure Spark to use an external storage URI such as Amazon S3 for storing shuffle files; the URI protocol determines which storage system to use.

  • spark.shuffle.storage.path = s3://<shuffle-storage-path> – Specifies the URI where the shuffle files are stored; it must be a valid Hadoop FileSystem and configured as needed.
  • spark.shuffle.sort.io.plugin.class = com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin – The entry class of the plugin.
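
If you build the SparkSession yourself instead of passing these configs to spark-submit, the same two parameters can be set programmatically. The following is a minimal PySpark sketch (the bucket and prefix are placeholders), assuming the plugin JAR is already on the driver and executor classpath:

from pyspark.sql import SparkSession

# Minimal sketch: configure the Cloud Shuffle Storage Plugin when building the session.
# Assumes the chopper-plugin JAR is already on the Spark classpath.
spark = (
    SparkSession.builder
    .appName("cloud-shuffle-storage-example")
    .config("spark.shuffle.sort.io.plugin.class",
            "com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin")
    .config("spark.shuffle.storage.path", "s3://my-shuffle-bucket/shuffle-dir/")
    .getOrCreate()
)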

Other cloud storage integration

This plugin comes with out-of-the-box support for Amazon S3 and can also be configured to use other forms of cloud storage such as Google Cloud Storage and Microsoft Azure Blob Storage. To enable other Hadoop FileSystem compatible cloud storage services, you simply add a storage URI with the corresponding service scheme, such as gs:// for Google Cloud Storage instead of s3:// for Amazon S3, add the FileSystem JAR files for the service, and set the appropriate authentication configurations.
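
As a rough illustration, the following PySpark sketch points the plugin at a Google Cloud Storage bucket instead of Amazon S3. The bucket name is a placeholder, and the sketch assumes you have separately added a Hadoop-compatible GCS connector JAR to the classpath and set its authentication properties as documented for that connector:

from pyspark.sql import SparkSession

# Hypothetical sketch: store shuffle files on Google Cloud Storage.
# Assumes a Hadoop-compatible GCS connector is on the classpath and its
# authentication configuration has been set separately.
spark = (
    SparkSession.builder
    .appName("cloud-shuffle-on-gcs")
    .config("spark.shuffle.sort.io.plugin.class",
            "com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin")
    .config("spark.shuffle.storage.path", "gs://my-shuffle-bucket/shuffle-dir/")
    .getOrCreate()
)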

For more information about how to integrate the plugin with Google Cloud Storage and Microsoft Azure Blob Storage, refer to Using AWS Glue Cloud Shuffle Plugin for Apache Spark with Other Cloud Storage Services.

Best practices and considerations

Note the following considerations:

  • This feature replaces local shuffle storage with Amazon S3. You can use it to address common failures with cost/performance benefits for your serverless analytics jobs and pipelines. We recommend enabling this feature when you want to ensure reliable runs of your data-intensive workloads that create a large amount of shuffle data, or when you're getting a No space left on device error. You can also use this plugin if your job encounters fetch failures (org.apache.spark.shuffle.MetadataFetchFailedException) or if your data is skewed.
  • We recommend setting S3 bucket lifecycle policies on the shuffle bucket (spark.shuffle.storage.path) in order to clean up old shuffle data automatically; see the sketch after this list.
  • The shuffle data on Amazon S3 is encrypted by default. You can also encrypt the data with your own AWS Key Management Service (AWS KMS) keys.
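
As an illustration of the lifecycle policy recommendation above, the following boto3 sketch (the bucket name, prefix, and 7-day retention are placeholders) expires objects under the shuffle prefix automatically:

import boto3

s3 = boto3.client("s3")

# Expire old shuffle data automatically. The bucket, prefix, and 7-day retention
# below are placeholders; choose values that comfortably exceed your job runtimes.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-shuffle-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-spark-shuffle-data",
                "Filter": {"Prefix": "shuffle-data/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)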

Conclusion

This post introduced the new Cloud Shuffle Storage Plugin for Apache Spark and described its benefits for independently scaling storage in your Spark jobs without adding additional workers. With this plugin, you can expect jobs that process terabytes of data to run much more reliably.

The plugin is available in AWS Glue 3.0 and 4.0 Spark jobs in all AWS Glue supported Regions. We're also releasing the plugin's software binaries under the Apache 2.0 license, so you can use the plugin in AWS Glue or other Spark environments. We look forward to hearing your feedback.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts that help customers build data lakes on the cloud.

Rajendra Gujja is a Senior Software Development Engineer on the AWS Glue team. He is passionate about distributed computing and everything and anything about the data.

Chuhan Liu is a Software Development Engineer on the AWS Glue team.

Gonzalo Herreros is a Senior Big Data Architect on the AWS Glue team.

Mohit Saxena is a Senior Software Development Manager on the AWS Glue team. His team focuses on building distributed systems that enable customers with data integration and connectivity to a variety of sources, efficiently manage data lakes on Amazon S3, and optimize Apache Spark for fault tolerance with ETL workloads.



