Amazon CloudWatch metrics for Amazon OpenSearch Service storage and shard skew health

August 21, 2023

Amazon OpenSearch Service is a managed service that makes it simple to deploy, operate, and scale OpenSearch clusters in AWS to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open source, distributed search and analytics suite.

When working with OpenSearch Service, shard strategy is important. Shards distribute your workload across the data nodes of your cluster. When creating an index, you tell OpenSearch Service how many primary shards to create and how many replicas to create of each shard. The primary shards are independent partitions of the full dataset. OpenSearch Service automatically distributes your data across the primary shards in an index. Our recommendation is to use two replicas for your index. For example, if you set your index’s shard count to three primary shards and two replicas, you’ll have a total of nine shards. Properly configured indexes can help boost overall domain performance, whereas a misconfigured index will lead to storage and performance skew.
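The shard arithmetic in the example above can be sketched in a few lines of Python. The function name is hypothetical and is used here for illustration only; each primary shard contributes itself plus its replicas to the total.

```python
def total_shards(primary_shards: int, replicas_per_primary: int) -> int:
    """Return the total shard count for an index (primaries plus all replicas)."""
    if primary_shards < 1 or replicas_per_primary < 0:
        raise ValueError("need at least one primary and a non-negative replica count")
    # Each primary is stored once, plus one copy per configured replica.
    return primary_shards * (1 + replicas_per_primary)


# Three primaries with two replicas each, as in the example above.
print(total_shards(3, 2))
```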

OpenSearch Service distributes the shards in your indexes to the data nodes in your domain, ensuring that no primary shard and its replicas are placed on the same node. The data for the shards is stored in the node’s storage. If your indexes (and therefore their shards) are very different sizes, the storage used on the data nodes in the domain will be unequal, or skewed. Storage skew leads to uneven memory and CPU utilization, intermittent and uneven latency, and uneven queueing and rejecting of requests. Therefore, it’s important to configure and maintain indexes such that shards can be distributed evenly across the data nodes of your cluster.

In this post, we explore how to deploy Amazon CloudWatch metrics using an AWS CloudFormation template to monitor an OpenSearch Service domain’s storage and shard skew. This solution uses an AWS Lambda function to extract storage and shard distribution metadata from your OpenSearch Service domain, calculates the level of skew, and then pushes this information to CloudWatch metrics so that you can easily monitor, alert, and respond.

Solution overview

The solution and associated resources are available for you to deploy into your own AWS account as a CloudFormation template. The template deploys the following resources:

An AWS Identity and Access Management (IAM) role for the Lambda function called OpensearchSkewMetricsLambdaRole. This permits write access to CloudWatch metrics and access to the CloudWatch log group and OpenSearch APIs.
An AWS Lambda function called Opensearch-SkewMetricsPublisher-py.
An Amazon CloudWatch log group for the Lambda function called /aws/lambda/Opensearch-skewmetrics-publisher-py.
An Amazon EventBridge rule for the Lambda function called EventRuleForOSSkew.
The following CloudWatch metrics for the Lambda function:
- aws_/<region-name>/<MetricIdentifier>/_storagemetric
- aws_/<region-name>/<MetricIdentifier>/_shardmetric
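As a rough illustration of how per-node skew values could be shaped for CloudWatch, the following Python sketch builds a namespace in the pattern listed above and a MetricData list for the PutMetricData API. The post does not show the Lambda function's code, so the helper names, the metric name "Skew", and the sample Region, identifier, and node IDs are all assumptions; only the namespace pattern and the NodeID dimension come from this post.

```python
from typing import Dict, List


def build_namespace(region: str, metric_identifier: str, suffix: str) -> str:
    """Build a custom namespace following the pattern listed in this post."""
    return f"aws_/{region}/{metric_identifier}/{suffix}"


def build_metric_data(skew_by_node: Dict[str, float], metric_name: str = "Skew") -> List[dict]:
    """Create one MetricData entry per node, keyed by a NodeID dimension.

    The metric name "Skew" is a hypothetical placeholder.
    """
    return [
        {
            "MetricName": metric_name,
            "Dimensions": [{"Name": "NodeID", "Value": node_id}],
            "Value": float(skew),
            "Unit": "Percent",
        }
        for node_id, skew in skew_by_node.items()
    ]


# Placeholder Region, identifier, and node IDs for illustration.
namespace = build_namespace("us-east-1", "my-domain", "_storagemetric")
metric_data = build_metric_data({"node-a": 20.0, "node-b": -20.0})

# With boto3 installed and credentials configured, publishing would look like:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=metric_data)
print(namespace)
print(metric_data)
```

Keeping the payload construction separate from the publish call makes the shaping logic easy to check locally without AWS credentials.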

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account.
An OpenSearch Service domain.
This post requires you to add a Lambda role to the OpenSearch Service domain’s security configuration access policy. If your domain is using fine-grained access control, then you need to follow the steps as described in the section Mapping roles to users to enable access for the newly deployed Lambda execution role to the domain after deploying the CloudFormation template.

Deploy the CloudFormation template

To deploy the CloudFormation template, complete the following steps:

Log in to your AWS account.
Select the Region where you’re running your OpenSearch Service domain.
To launch your CloudFormation stack, choose Launch Stack.
For Stack name, enter a name for the stack (maximum length 30 characters).
For MetricIdentifier, enter a unique identifier that can help you identify the custom CloudWatch metrics for your domain.
For OpensearchDomainURL, enter the domain endpoint that you are monitoring.
Choose Next.
Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Create stack.
Wait for the stack creation to complete.
On the Lambda console, choose Functions in the navigation pane.
Choose the Lambda function called Opensearch-SkewMetricsPublisher-py-<stackname>.
In the Code section, choose Test.
Keep the default values for the test event and run a quick test.

Make sure to grant the Lambda execution role permission in the OpenSearch Service domain’s resource-based policy, if you are using one. If fine-grained access control is enabled on the domain, then follow the steps in Mapping roles to users (as mentioned in the prerequisites) to allow the Lambda function to read from the domain with read-only access.
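For a domain that uses a resource-based policy, a read-only grant for the Lambda execution role might look like the statement built below. This is a sketch under assumptions: the account ID, Region, domain name, and exact role ARN are placeholders (the deployed role name may differ in your stack), and the helper name is hypothetical. The es:ESHttpGet action limits the role to HTTP GET requests against the domain.

```python
import json


def read_only_statement(role_arn: str, domain_arn: str) -> dict:
    """Build an access policy statement allowing HTTP GET requests only."""
    return {
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": "es:ESHttpGet",
        # The trailing /* covers the API paths under the domain.
        "Resource": f"{domain_arn}/*",
    }


# Placeholder ARNs for illustration; substitute your own values.
statement = read_only_statement(
    "arn:aws:iam::111122223333:role/OpensearchSkewMetricsLambdaRole",
    "arn:aws:es:us-east-1:111122223333:domain/my-domain",
)
print(json.dumps(statement, indent=2))
```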

The Lambda function that sends OpenSearch domain metrics to CloudWatch is set to a default frequency of 1 day. You can change this configuration to monitor the domain at the required granularity by updating the event schedule for the rule deployed by the CloudFormation stack on the EventBridge console. Note that if the frequency is set to 1 minute, this will trigger the Lambda function every minute and will increase the Lambda cost.
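To make the cost trade-off concrete, the following Python sketch formats an EventBridge rate expression and estimates how many Lambda invocations a given frequency produces over 30 days. The helper names are hypothetical; the expression format follows the EventBridge rate syntax, which uses a singular unit when the value is 1.

```python
MINUTES_PER_UNIT = {"minute": 1, "hour": 60, "day": 1440}


def rate_expression(value: int, unit: str) -> str:
    """Format a schedule such as rate(1 day) or rate(5 minutes)."""
    if unit not in MINUTES_PER_UNIT:
        raise ValueError("unit must be 'minute', 'hour', or 'day'")
    if value < 1:
        raise ValueError("value must be a positive integer")
    # A value of 1 takes the singular unit; larger values take the plural.
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


def invocations_per_30_days(value: int, unit: str) -> int:
    """Estimate how many times the rule fires in a 30-day period."""
    return (30 * 1440) // (value * MINUTES_PER_UNIT[unit])


for value, unit in [(1, "day"), (1, "hour"), (1, "minute")]:
    print(rate_expression(value, unit), invocations_per_30_days(value, unit))
```

Moving from the daily default to a one-minute schedule raises the invocation count from 30 to 43,200 over the same period.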

This solution uses the _cat/allocation API, which provides the number of data nodes in the domain along with each data node’s number of shards and storage usage attributes. For further details on domain storage and shard skew, refer to Node shard and storage skew. The Lambda function processes and sorts each data node’s storage and shard skew from the average value. Any data node’s skew above 10% from the average is generally considered to be significantly skewed. This will start to impact CPU, network, and disk bandwidth usage because the nodes with the highest storage utilization tend to be the resource-strained nodes, whereas nodes with less than 10% utilization represent underutilized capacity.
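The skew calculation described above can be sketched as follows. This is not the Lambda function's actual code, which the post does not show. It assumes rows shaped like the JSON output of _cat/allocation (requested with format=json and bytes=b, so values arrive as strings), treats deviation in either direction beyond the threshold as skewed, and uses made-up node names and values.

```python
from statistics import mean
from typing import List, Optional, Tuple


def compute_skew(rows: List[dict], field: str, threshold_pct: float = 10.0) -> List[Tuple[str, float, bool]]:
    """Return (node, skew_pct, is_skewed) tuples sorted from highest to lowest skew.

    skew_pct is the signed percentage deviation from the average across nodes.
    """
    values = {}
    for row in rows:
        raw: Optional[str] = row.get(field)
        # Skip the synthetic UNASSIGNED row and rows with no value for the field.
        if row.get("node") == "UNASSIGNED" or raw is None:
            continue
        values[row["node"]] = float(raw)
    if not values:
        return []
    average = mean(values.values())
    if average == 0:
        return [(node, 0.0, False) for node in values]
    results = []
    for node, value in values.items():
        skew = (value - average) / average * 100
        # Assumption: flag deviation above or below the average.
        results.append((node, round(skew, 2), abs(skew) > threshold_pct))
    results.sort(key=lambda item: item[1], reverse=True)
    return results


# Made-up sample rows for illustration.
sample_rows = [
    {"node": "node-b", "shards": "10", "disk.used": "100"},
    {"node": "node-a", "shards": "12", "disk.used": "120"},
    {"node": "node-c", "shards": "8", "disk.used": "80"},
    {"node": "UNASSIGNED", "shards": "2", "disk.used": None},
]
print(compute_skew(sample_rows, "disk.used"))
print(compute_skew(sample_rows, "shards"))
```

The same function handles both metrics: pass "disk.used" for storage skew and "shards" for shard skew.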

Refer to Demystifying Elasticsearch shard allocation for details related to shard size and shard count strategy. In general, we recommend keeping shard sizes between 10–30 GB for workloads where search latency is a key performance objective and 30–50 GB for write-heavy workloads. For shard count, we recommend maintaining index shard counts that are divisible by the data node count. For additional details, refer to Sizing Amazon OpenSearch Service domains and Shard strategy.
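One way to apply these guidelines is sketched below: start from the index size and a target shard size, then round the primary shard count up to the next multiple of the data node count. The function name and sample numbers are assumptions for illustration, and rounding up can push the resulting shard size below the target range, so treat the output as a starting point rather than a rule.

```python
import math


def suggest_primary_shards(index_size_gb: float, target_shard_gb: float, data_nodes: int) -> int:
    """Suggest a primary shard count that is a multiple of the data node count."""
    if index_size_gb <= 0 or target_shard_gb <= 0 or data_nodes < 1:
        raise ValueError("sizes must be positive and data_nodes at least 1")
    # Minimum number of shards so that no shard exceeds the target size.
    needed = max(1, math.ceil(index_size_gb / target_shard_gb))
    # Round up to the next multiple of the node count for even distribution.
    return math.ceil(needed / data_nodes) * data_nodes


# Hypothetical example: a 400 GB index, 30 GB target shards, 6 data nodes.
count = suggest_primary_shards(400, 30, 6)
print(count, round(400 / count, 1))
```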

View skew metrics in CloudWatch

After you run this solution in your account, it will create two CloudWatch metrics for monitoring. To access these CloudWatch metrics, use the following steps:

On the CloudWatch console, under Metrics in the navigation pane, choose All metrics.
Choose Browse and select Custom namespaces. You should see two custom metrics ending with _storageworkspace and _shardworkspace, respectively.
Choose either of the custom metrics and then select NodeID.
In the list of node IDs, select all the nodes displayed in the list, and the graph will be plotted automatically.

You can hover the mouse over the plotted lines to see the node skew information.

The following screenshots show examples of how the CloudWatch metrics will appear on the console.

The storage skew metrics will be similar to the following screenshot. Storage skew metrics show the domain storage skew. When you hover over the graph, it shows the node list with available nodes in the domain. This list is sorted by the storage size (largest to smallest). The Lambda function will periodically post the latest storage skew results.

The shard skew metrics will be similar to the following screenshot. Shard skew metrics show the domain shard skew. When you hover over the graph, it shows the node list with available nodes in the domain. This list is sorted by the shard size (largest to smallest). The Lambda function will periodically post the latest shard skew results.

Storage skew occurs when one or more nodes within the domain have significantly more storage than other nodes. The CloudWatch metric will show higher deviation of storage utilization for these nodes vs. other nodes. Similarly, shard skew occurs when one or more nodes have significantly more shards than other nodes. The CloudWatch metric will show higher deviation for these nodes vs. other nodes in the domain. When domain storage or shard skew is detected, you can raise a support case to work with the AWS team for remediation actions. See How do I rebalance the uneven shard distribution in my Amazon OpenSearch Service cluster for information on how to take remediation actions to configure your domain shard strategy for optimal performance.

Costs

The cost associated with using this solution is minimal, around a few cents per month, because it generates CloudWatch metrics. The solution also runs Lambda code, and in this case the Lambda functions make API calls. For pricing details, refer to Amazon CloudWatch Pricing and AWS Lambda Pricing.

Clean up

If you decide that you no longer want to keep the Lambda function and associated resources, you can navigate to the AWS CloudFormation console, choose the stack, and choose Delete.

If you want to add the CloudWatch skew monitor metrics mechanism back in at any point, you can create the stack again from the CloudFormation template.

Conclusion

You can use this solution to get a better understanding of your OpenSearch Service domain’s storage and shard skew to improve its performance and possibly lower the cost of operating your domain. See Use Elasticsearch’s _rollover API For efficient storage distribution for more details related to shard allocation and efficient storage distribution strategy.

About the authors

Nikhil Agarwal is a Sr. Technical Manager with Amazon Web Services. He is passionate about helping customers achieve operational excellence in their cloud journey and working actively on technical solutions. He is also an AI/ML enthusiast and deep dives into customers’ ML-specific use cases. Outside of work, he enjoys traveling with family and exploring different gadgets.

Karthik Chemudupati is a Principal Technical Account Manager (TAM) with AWS, focused on helping customers achieve cost optimization and operational excellence. He has more than 19 years of IT experience in software engineering, cloud operations and automations. Karthik joined AWS in 2016 as a TAM and worked with more than a dozen Enterprise Customers across US-West. Outside of work, he enjoys spending time with his family.

Gene Alpert is a Senior Analytics Specialist with AWS Enterprise Support. He has been focused on our Amazon OpenSearch Service customers and ecosystem for the past three years. Gene joined AWS in 2017. Outside of work he enjoys mountain biking, traveling, and playing Population:One in VR.