
Run fault-tolerant and cost-optimized Spark clusters using Amazon EMR on EKS and Amazon EC2 Spot Instances


Amazon EMR on EKS is a deployment option in Amazon EMR that allows you to run Spark jobs on Amazon Elastic Kubernetes Service (Amazon EKS). Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances save you up to 90% over On-Demand Instances, and are a great way to cost optimize the Spark workloads running on Amazon EMR on EKS. Because Spot is an interruptible service, if we can move or reuse the intermediate shuffle files, the overall stability and SLA of the job improves. The latest versions of Amazon EMR on EKS have integrated Spark features to enable this capability.

In this post, we discuss these two features, node decommissioning and Persistent Volume Claim (PVC) reuse, and their impact on increasing the fault tolerance of Spark jobs on Amazon EMR on EKS when cost optimizing with EC2 Spot Instances.

Amazon EMR on EKS and Spot

EC2 Spot Instances are spare EC2 capacity offered at a steep discount of up to 90% over On-Demand prices. Spot Instances are a great choice for stateless and flexible workloads. The caveat with this discount and spare capacity is that Amazon EC2 can interrupt an instance with a proactive or reactive (2-minute) warning when it needs the capacity back. You can provision compute capacity in an EKS cluster using Spot Instances through a managed or self-managed node group to cost optimize your workloads.

Amazon EMR on EKS uses Amazon EKS to run jobs with the EMR runtime for Apache Spark, which can be cost optimized by running the Spark executors on Spot. It provides up to 61% lower costs and up to 68% performance improvement for Spark workloads on Amazon EKS. The Spark application launches a driver and executors to run the computation. Spark is a semi-fault-tolerant framework that is resilient to executor loss due to an interruption and therefore can run on EC2 Spot. On the other hand, when the driver is interrupted, the job fails. Hence, we recommend running drivers on On-Demand Instances. Some of the best practices for running Spark on Amazon EKS are applicable with Amazon EMR on EKS.
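One way to enforce this driver/executor placement is with capacity-type node selectors. The following is a minimal sketch, assuming Spark 3.3 or later (Amazon EMR 6.8 and later) for the per-role selector properties, and EKS managed node groups, which label nodes with eks.amazonaws.com/capacityType (Karpenter uses its own karpenter.sh/capacity-type label instead):

"spark.kubernetes.driver.node.selector.eks.amazonaws.com/capacityType": "ON_DEMAND"
"spark.kubernetes.executor.node.selector.eks.amazonaws.com/capacityType": "SPOT"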

EC2 Spot Instances also help with cost optimization by improving the overall throughput of the job. This can be achieved by auto scaling the cluster using Cluster Autoscaler (for managed node groups) or Karpenter.

Though Spark executors are resilient to Spot interruptions, the shuffle files and RDD data are lost when the executor gets killed. The lost shuffle files must be recomputed, which increases the overall runtime of the job. Apache Spark has introduced two features (in versions 3.1 and 3.2) that address this issue. Amazon EMR on EKS released features such as node decommissioning (version 6.3) and PVC reuse (version 6.8) to simplify recovery and reuse of shuffle files, which increases the overall resiliency of your application.

Node decommissioning

The node decommissioning feature works by preventing the scheduling of new jobs on the nodes that are to be decommissioned. It also moves any shuffle files or cached data present on those nodes to other executors (peers). If there are no other available executors, the shuffle files and cache are moved to a remote fallback storage.


Fig 1: Node Decommissioning

Let’s look at the decommission steps in more detail.

If one of the nodes running executors is interrupted, the executor starts the process of decommissioning and sends the message to the driver:

21/05/05 17:41:41 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 7 decommissioned message
21/05/05 17:41:41 DEBUG TaskSetManager: Valid locality levels for TaskSet 2.0: NO_PREF, ANY
21/05/05 17:41:41 INFO KubernetesClusterSchedulerBackend: Decommission executors: 7
21/05/05 17:41:41 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_2.0, runningTasks: 10
21/05/05 17:41:41 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(7, 192.168.82.107, 39007, None)) as being decommissioning.

21/05/05 20:22:17 INFO CoarseGrainedExecutorBackend: Decommission executor 1.
21/05/05 20:22:17 INFO CoarseGrainedExecutorBackend: Will exit when finished decommissioning
21/05/05 20:22:17 INFO BlockManager: Starting block manager decommissioning process...
21/05/05 20:22:17 DEBUG FileSystem: Looking for FS supporting s3a

The executor looks for RDD or shuffle files and tries to replicate or migrate those files. It first tries to find a peer executor. If successful, it moves the files to the peer executor:

22/06/07 20:41:38 INFO ShuffleStatus: Updating map output for 46 to BlockManagerId(4, 192.168.13.235, 34737, None)
22/06/07 20:41:38 DEBUG BlockManagerMasterEndpoint: Received shuffle data block update for 0 46, ignore.
22/06/07 20:41:38 DEBUG BlockManagerMasterEndpoint: Received shuffle index block update for 0 46, updating.

However, if it isn’t able to find a peer executor, it tries to move the files to a fallback storage, if available.
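The fallback location is an S3 path supplied through Spark configuration (shown in the configuration block later in this post). As a sketch, if you also want Spark to delete the fallback data when the application shuts down, the optional cleanUp flag (an addition here, not part of the configuration shown later; it defaults to false) can be set:

"spark.storage.decommission.fallbackStorage.path": "s3://<<bucket>>"
"spark.storage.decommission.fallbackStorage.cleanUp": "true"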


Fig 2: Fallback Storage

The executor is then decommissioned. When a new executor comes up, the shuffle files are reused:

22/06/07 20:42:50 INFO BasicExecutorFeatureStep: Adding decommission script to lifecycle
22/06/07 20:42:50 DEBUG ExecutorPodsAllocator: Requested executor with id 19 from Kubernetes.
22/06/07 20:42:50 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-bfd0a5813fd1b80f-exec-19, action ADDED
22/06/07 20:42:50 DEBUG BlockManagerMasterEndpoint: Received shuffle index block update for 0 52, updating.
22/06/07 20:42:50 INFO ShuffleStatus: Recover 52 BlockManagerId(fallback, remote, 7337, None)

The key advantage of this process is that it enables the migration of blocks and shuffle files, thereby reducing recomputation, which adds to the overall resiliency of the system and reduces runtime. This process can be triggered by a Spot interruption signal (SIGTERM) and by node draining. Node draining may happen due to high-priority task scheduling or independently.

When you use Amazon EMR on EKS with managed node groups or Karpenter, Spot interruption handling is automated: Amazon EKS gracefully drains and rebalances the Spot nodes to minimize application disruption when a Spot node is at elevated risk of interruption. With managed node groups or Karpenter, the decommission is triggered when the nodes are drained; because this is proactive, it gives you more time (at least 2 minutes) to move the files. In the case of self-managed node groups, we recommend installing the AWS Node Termination Handler to handle the interruption; the decommission is triggered when the reactive (2-minute) notification is received. We recommend using Karpenter with Spot Instances because it offers faster node scheduling with early pod binding and binpacking to optimize resource utilization.

The following code enables this configuration; more details are available on GitHub:

"spark.decommission.enabled": "true"
"spark.storage.decommission.rddBlocks.enabled": "true"
"spark.storage.decommission.shuffleBlocks.enabled" : "true"
"spark.storage.decommission.enabled": "true"
"spark.storage.decommission.fallbackStorage.path": "s3://<<bucket>>"

PVC reuse

Apache Spark enabled dynamic PVCs in version 3.1, which is useful with dynamic allocation because we don’t have to pre-create the claims or volumes for the executors and delete them after completion. PVCs enable true decoupling of data and processing when we’re running Spark jobs on Kubernetes, because we can also use them as local storage for spilling in-process data. The latest version of Amazon EMR, 6.8, has integrated the PVC reuse feature of Spark, whereby if an executor is terminated due to an EC2 Spot interruption or any other reason (for example, a JVM failure), the PVC is not deleted but persisted and reattached to another executor. If there are shuffle files in that volume, they are reused.
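Dynamic PVCs are typically combined with dynamic allocation, which on Kubernetes relies on shuffle tracking because there is no external shuffle service. A minimal sketch (the executor counts are illustrative assumptions):

"spark.dynamicAllocation.enabled": "true"
"spark.dynamicAllocation.shuffleTracking.enabled": "true"
"spark.dynamicAllocation.minExecutors": "2"
"spark.dynamicAllocation.maxExecutors": "10"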

As with node decommissioning, this reduces the overall runtime because we don’t have to recompute the shuffle files. We also save the time required to request a new volume for an executor, and shuffle files can be reused without moving the files around.

The following diagram illustrates this workflow.


Fig 3: PVC Reuse

Let’s look at the steps in more detail.

If one or more of the nodes running executors are interrupted, the underlying pods get terminated and the driver gets the update. Note that the driver is the owner of the PVCs of the executors, so the PVCs are not terminated with the pods. See the following code:

22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-3, action DELETED
22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-6, action MODIFIED
22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-6, action DELETED
22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-3, action MODIFIED

The ExecutorPodsAllocator tries to allocate new executor pods to replace the ones terminated due to the interruption. During the allocation, it figures out how many of the existing PVCs have files and can be reused:

22/06/15 23:25:23 INFO ExecutorPodsAllocator: Found 2 reusable PVCs from 10 PVCs

The ExecutorPodsAllocator requests a pod, and when it launches it, the PVC is reused. In the following example, the PVC from executor 6 is reused for the new executor pod 11:

22/06/15 23:25:23 DEBUG ExecutorPodsAllocator: Requested executor with id 11 from Kubernetes.
22/06/15 23:25:24 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-11, action ADDED
22/06/15 23:25:24 INFO KubernetesClientUtils: Spark configuration files loaded from Some(/usr/lib/spark/conf) : log4j.properties,spark-env.sh,hive-site.xml,metrics.properties
22/06/15 23:25:24 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/06/15 23:25:24 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-11, action MODIFIED
22/06/15 23:25:24 INFO ExecutorPodsAllocator: Reuse PersistentVolumeClaim amazon-reviews-word-count-9ee82b8169a75183-exec-6-pvc-0

The shuffle files, if present in the PVC, are reused.

The key advantage of this technique is that it allows us to reuse pre-computed shuffle files in their original location, thereby reducing the overall job runtime.

This works for both static and dynamic PVCs. Amazon EKS offers three different storage options, all of which can be encrypted: Amazon Elastic Block Store (Amazon EBS), Amazon Elastic File System (Amazon EFS), and Amazon FSx for Lustre. We recommend using dynamic PVCs with Amazon EBS, because with static PVCs you would need to create multiple PVCs.

The following code enables this configuration; more details are available on GitHub:

"spark.kubernetes.driver.ownPersistentVolumeClaim": "true"
"spark.kubernetes.driver.reusePersistentVolumeClaim": "true"

For this to work, we need to enable PVC with Amazon EKS and specify the details in the Spark runtime configuration. For instructions, refer to How do I use persistent storage in Amazon EKS? The following code contains the Spark configuration details for using a PVC as local storage; other details are available on GitHub:

"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly": "false"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.choices.claimName": "OnDemand"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.choices.storageClass": "spark-sc"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.choices.sizeLimit": "10Gi"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/var/information/spill"

Conclusion

With Amazon EMR on EKS (6.9) and the features discussed in this post, you can further reduce the overall runtime for Spark jobs when running with Spot Instances. This also improves the overall resiliency and flexibility of the job while cost optimizing the workload on EC2 Spot.

Try out the EMR on EKS workshop for improved performance when running Spark workloads on Kubernetes and for cost optimization using EC2 Spot Instances.


About the Author

Kinnar Kumar Sen is a Sr. Solutions Architect at Amazon Web Services (AWS) focusing on Flexible Compute. As a part of the EC2 Flexible Compute team, he works with customers to guide them to the most elastic and efficient compute options suitable for their workloads running on AWS. Kinnar has more than 15 years of industry experience working in research, consultancy, engineering, and architecture.


