
Amazon EMR on EKS widens the performance gap: Run Apache Spark workloads 5.37 times faster and at 4.3 times lower cost


Amazon EMR on EKS provides a deployment option for Amazon EMR that allows organizations to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS). With EMR on EKS, Spark applications run on the Amazon EMR runtime for Apache Spark. This performance-optimized runtime offered by Amazon EMR makes your Spark jobs run fast and cost-effectively. Additionally, you can run other types of business applications, such as web applications and machine learning (ML) TensorFlow workloads, on the same EKS cluster. EMR on EKS simplifies your infrastructure management, maximizes resource utilization, and reduces your cost.

We have been continually improving Spark performance in each Amazon EMR release to further shorten job runtime and optimize users' spending on their Amazon EMR big data workloads. As of the Amazon EMR 6.5 release in January 2022, the optimized Spark runtime was 3.5 times faster than OSS Spark v3.1.2 with up to 61% lower costs. Amazon EMR 6.10 is now 1.59 times faster than Amazon EMR 6.5, which has resulted in 5.37 times better performance than OSS Spark v3.3.1 with 76.8% cost savings.

In this post, we describe the benchmark setup and results on top of the EMR on EKS environment. We also share a Spark benchmark solution that suits all Amazon EMR deployment options, so you can replicate the process in your environment for your own performance test cases. The solution uses the TPC-DS dataset and unmodified data schema and table relationships, but derives queries from TPC-DS to support the SparkSQL test cases. It is not comparable to other published TPC-DS benchmark results.

Benchmark setup

To compare with the EMR on EKS 6.5 test result detailed in the post Amazon EMR on Amazon EKS provides up to 61% lower costs and up to 68% performance improvement for Spark workloads, this benchmark for the latest release (Amazon EMR 6.10) uses the same approach: a TPC-DS benchmark framework and the same size of TPC-DS input dataset from an Amazon Simple Storage Service (Amazon S3) location. For the source data, we chose the 3 TB scale factor, which contains 17.7 billion records, approximately 924 GB of compressed data in Parquet file format. The setup instructions and technical details can be found in the aws-samples repository.

In summary, the complete performance test job includes 104 SparkSQL queries and was completed in approximately 24 minutes (1,397.55 seconds) with an estimated running cost of $5.08 USD. The input data and test result outputs were both stored on Amazon S3.

The job has been configured with the following parameters, which match the previous Amazon EMR 6.5 test (a job submission sketch that applies these settings follows the list):

  • EMR release – EMR 6.10.0
  • Hardware:
    • Compute – 6 x c5d.9xlarge instances, 216 vCPU, 432 GiB memory in total
    • Storage – 6 x 900 GB NVMe SSD built-in storage
    • Amazon EBS root volume – 6 x 20 GB gp2
  • Spark configuration:
    • Driver pod – 1 instance among the other 7 executors on a shared Amazon Elastic Compute Cloud (Amazon EC2) node:
      • spark.driver.cores=4
      • spark.driver.memory=5g
      • spark.kubernetes.driver.limit.cores=4.1
    • Executor pod – 47 instances distributed over 6 EC2 nodes
      • spark.executor.cores=4
      • spark.executor.memory=6g
      • spark.executor.memoryOverhead=2G
      • spark.kubernetes.executor.limit.cores=4.3
  • Metadata store – We use Spark's in-memory data catalog to store metadata for TPC-DS databases and tables—spark.sql.catalogImplementation is set to the default value in-memory. The fact tables are partitioned by the date column, which consists of partitions ranging from 200–2,100. No statistics are pre-calculated for these tables.
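The following is a minimal sketch of submitting such a job through the EMR on EKS StartJobRun API with boto3, applying the Spark settings listed above. The virtual cluster ID, execution role, S3 paths, and entry point script are placeholders rather than values from the benchmark repository, so substitute your own:

    # Minimal sketch: submit a Spark job to EMR on EKS with the benchmark's Spark settings.
    # Virtual cluster ID, role ARN, S3 paths, and entry point are placeholders.
    import boto3

    emr = boto3.client("emr-containers", region_name="us-east-1")

    spark_submit_parameters = (
        "--conf spark.driver.cores=4 "
        "--conf spark.driver.memory=5g "
        "--conf spark.kubernetes.driver.limit.cores=4.1 "
        "--conf spark.executor.instances=47 "
        "--conf spark.executor.cores=4 "
        "--conf spark.executor.memory=6g "
        "--conf spark.executor.memoryOverhead=2G "
        "--conf spark.kubernetes.executor.limit.cores=4.3 "
        "--conf spark.sql.catalogImplementation=in-memory"
    )

    response = emr.start_job_run(
        virtualClusterId="<your-virtual-cluster-id>",        # placeholder
        name="tpcds-3tb-benchmark",
        executionRoleArn="<your-job-execution-role-arn>",    # placeholder
        releaseLabel="emr-6.10.0-latest",
        jobDriver={
            "sparkSubmitJobDriver": {
                "entryPoint": "s3://<your-bucket>/benchmark-entrypoint.py",   # placeholder
                "entryPointArguments": [
                    "s3://<your-bucket>/TPCDS-TEST-3T-partitioned",           # input data
                    "s3://<your-bucket>/results/",                            # output location
                ],
                "sparkSubmitParameters": spark_submit_parameters,
            }
        },
    )
    print(response["id"])  # job run ID for tracking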

Results

A single test session consists of 104 Spark SQL queries that were run sequentially. We ran each Spark runtime session (EMR runtime for Apache Spark, OSS Apache Spark) three times. The Spark benchmark job produces a CSV file to Amazon S3 that summarizes the median, minimum, and maximum runtime for each individual query.

The way we calculate the final benchmark results (geomean and the total job runtime) is based on arithmetic means. We take the mean of the median, minimum, and maximum values per query using the AVERAGE() formula, for example AVERAGE(F2:H2). Then we take a geometric mean of the average column I with the formula GEOMEAN(I2:I105), and SUM(I2:I105) for the total runtime.
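The same aggregation can be reproduced with a short Python sketch, assuming a results CSV with one row per query and columns named median, min, and max (the file name and column names here are illustrative):

    # Sketch of the aggregation described above: per-query arithmetic mean of
    # median/min/max, then the geometric mean and total runtime across queries.
    import csv
    from statistics import geometric_mean

    per_query_avg = []
    with open("benchmark_results.csv", newline="") as f:   # hypothetical file name
        for row in csv.DictReader(f):
            med, mn, mx = (float(row[c]) for c in ("median", "min", "max"))
            per_query_avg.append((med + mn + mx) / 3.0)    # AVERAGE(F2:H2) per query

    total_runtime = sum(per_query_avg)                     # SUM(I2:I105)
    geomean = geometric_mean(per_query_avg)                # GEOMEAN(I2:I105)

    print(f"Total runtime: {total_runtime:.2f} s, geometric mean: {geomean:.2f} s")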

Previously, we observed that EMR on EKS 6.5 was 3.5 times faster than OSS Spark on EKS and cost 2.6 times less. From this benchmark, we found that the gap has widened: EMR on EKS 6.10 now provides a 5.37 times performance improvement on average, and up to 11.61 times improved performance for individual queries, over OSS Spark 3.3.1 on Amazon EKS. From the running cost perspective, we see a significant reduction of 4.3 times.

The following graph shows the performance improvement of Amazon EMR 6.10 compared to OSS Spark 3.3.1 at the individual query level. The X-axis shows the name of each query, and the Y-axis shows the total runtime in seconds on a logarithmic scale. The most significant performance gains for eight queries (q14a, q14b, q23b, q24a, q24b, q4, q67, q72) were more than 10 times faster runtimes.

Job cost estimation

The cost estimate doesn't account for Amazon S3 storage or PUT and GET requests. The Amazon EMR on EKS uplift calculation is based on the hourly billing information provided by AWS Cost Explorer. A short calculation that reproduces the EC2, EKS, and EBS figures follows the table.

  • c5d.9xlarge hourly price – $1.728
  • Number of EC2 instances – 6
  • Amazon EBS storage per GB-month – $0.10
  • Amazon EBS gp2 root volume – 20 GB
  • Job runtime (hours):
    • OSS Spark 3.3.1 – 2.09
    • EMR on EKS 6.5.0 – 0.68
    • EMR on EKS 6.10.0 – 0.39
Cost component            OSS Spark 3.3.1 on EKS    EMR on EKS 6.5.0    EMR on EKS 6.10.0
Amazon EC2                $21.67                    $7.05               $4.04
EMR on EKS uplift         –                         $1.57               $0.99
Amazon EKS                $0.21                     $0.07               $0.04
Amazon EBS root volume    $0.03                     $0.01               $0.01
Total                     $21.88                    $8.70               $5.08
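As a sanity check, the Amazon EC2, Amazon EKS, and Amazon EBS rows of this table can be reproduced from the figures listed above with a few lines of arithmetic. The EMR on EKS uplift row comes from AWS Cost Explorer, so it is not recomputed here, and the $0.10 per hour EKS cluster fee is an assumption based on the standard published price:

    # Sketch reproducing the EC2, EKS, and EBS root-volume rows of the cost table.
    EC2_HOURLY = 1.728          # c5d.9xlarge on-demand price, USD/hour
    NUM_INSTANCES = 6
    EBS_GB_MONTH = 0.10         # gp2, USD per GB-month
    EBS_ROOT_GB = 20
    EKS_CLUSTER_HOURLY = 0.10   # assumed standard EKS cluster fee, USD/hour
    HOURS_PER_MONTH = 720

    runtimes = {                # job runtime in hours
        "OSS Spark 3.3.1 on EKS": 2.09,
        "EMR on EKS 6.5.0": 0.68,
        "EMR on EKS 6.10.0": 0.39,
    }

    for name, hours in runtimes.items():
        ec2 = EC2_HOURLY * NUM_INSTANCES * hours
        eks = EKS_CLUSTER_HOURLY * hours
        ebs = EBS_GB_MONTH * EBS_ROOT_GB * NUM_INSTANCES * hours / HOURS_PER_MONTH
        print(f"{name}: EC2=${ec2:.2f}, EKS=${eks:.2f}, EBS root=${ebs:.2f}")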

Performance improvements

Although we improve Amazon EMR's performance with each release, Amazon EMR 6.10 contained many performance optimizations, making it 5.37 times faster than OSS Spark v3.3.1 and 1.59 times faster than our first release of 2022, Amazon EMR 6.5. This additional performance boost was achieved through a number of optimizations, including the following (a plan-inspection sketch follows the list):

  • Improvements to join performance, such as the following:
    • Shuffle hash joins (SHJ) are more CPU and I/O efficient than shuffle sort-merge joins (SMJ) when the costs of building and probing the hash table, including the availability of memory, are less than the cost of sorting and performing the merge join. However, SHJs have drawbacks, such as the risk of out-of-memory errors due to their inability to spill to disk, which prevents them from being used aggressively across Spark in place of SMJs by default. We have optimized our use of SHJs so that they can be applied to more queries by default than in OSS Spark.
    • For some query shapes, we have eliminated redundant joins and enabled the use of more performant join types.
  • We have reduced the amount of data shuffled before joins and the potential for data explosions after joins by selectively pushing down aggregates through joins.
  • Bloom filters can improve performance by reducing the amount of data shuffled before the join. However, there are cases where Bloom filters are not beneficial and can even regress performance. For example, the Bloom filter introduces a dependency between stages that reduces query parallelism, but may end up filtering out relatively little data. Our improvements allow Bloom filters to be safely applied to more query plans than OSS Spark.
  • Aggregates with high-precision decimals are computationally intensive in OSS Spark. We optimized high-precision decimal computations to increase their performance.
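These optimizations are internal to the EMR runtime, but you can observe which join strategy the planner chose for any query with standard Spark tooling. The following sketch is plain PySpark rather than anything EMR-specific, and assumes the TPC-DS tables it references (store_sales, item) are already registered in the catalog:

    # Minimal sketch: inspect the physical plan to see the chosen join strategy.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-plan-check").getOrCreate()

    query = """
        SELECT ss.ss_item_sk, SUM(ss.ss_net_paid) AS total_paid
        FROM store_sales ss
        JOIN item i ON ss.ss_item_sk = i.i_item_sk
        GROUP BY ss.ss_item_sk
    """

    # 'formatted' mode prints the physical plan; look for SortMergeJoin,
    # ShuffledHashJoin, or BroadcastHashJoin nodes to see which strategy was picked.
    spark.sql(query).explain(mode="formatted")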

Summary

With version 6.10, Amazon EMR has further enhanced the EMR runtime for Apache Spark in comparison to our previous benchmark tests for Amazon EMR version 6.5. When running EMR workloads with the equivalent Apache Spark version 3.3.1, we observed 1.59 times better performance with 41.6% cheaper costs than Amazon EMR 6.5.

With our TPC-DS benchmark setup, we observed a significant performance increase of 5.37 times and a cost reduction of 4.3 times using EMR on EKS compared to OSS Spark.

To learn more and get started with EMR on EKS, try out the EMR on EKS Workshop and visit the EMR on EKS Best Practices Guide page.


About the Authors

Melody Yang is a Senior Big Data Solutions Architect for Amazon EMR at AWS. She is an experienced analytics leader working with AWS customers to provide best practice guidance and technical advice in order to support their success in data transformation. Her areas of interest are open-source frameworks and automation, data engineering, and DataOps.

Ashok Chintalapati is a software development engineer for Amazon EMR at Amazon Web Services.


