How do I resolve the "failed to obtain in-memory shard lock" exception in Amazon OpenSearch Service?


My Amazon OpenSearch Service cluster turned yellow with the "failed to obtain in-memory shard lock" error message for hot and warm node indices.

Short description

If your shard doesn't obtain an in-memory lock within the set thresholds for OpenSearch Service shard allocation, then you receive the following error:

"failed_allocation_attempts" : 5,

     "details" : "failed shard on node []: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[][5]: obtaining shard lock timed out after 5000ms]; ",

.

.

"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[], failed_attempts[5], delayed=false, details[failed shard on node [lga-THKoSXykhSDbghN57A]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[evelog-zdn-2020.04.28][5]: obtaining shard lock timed out after 5000ms]; ], allocation_status[no_attempt]]]"

In OpenSearch Service, each shard allocation attempt must obtain the shard lock within the time limit (5000 ms), and a failed allocation is retried up to the maximum number of retries (5). After the fifth failed attempt, the shard stays unassigned. To resolve the error message for indices in hot nodes, use the following troubleshooting approaches.

Note: It isn't a best practice to update the replica count for OpenSearch Service clusters with heavy workloads.

Resolution

Troubleshoot yellow cluster status

An OpenSearch Service cluster can enter the yellow state because of a node or network failure. If nodes in your cluster fail because of an internal hardware issue, then new nodes replace the failed nodes. OpenSearch Service automatically detects the replacement. However, the replica shards from the faulty nodes can't be assigned until the previously used resources are freed. During this time, the leader node makes five attempts to allocate the replica shards. If all five attempts fail, then your cluster enters red or yellow health status.

Note: It's a best practice to run the cluster allocation explain API (on the Elasticsearch website) to diagnose unassigned shards. To identify which indices are causing your cluster to enter yellow status, run the following query:

GET /_cat/indices?v&health=yellow

Then, use the following query to identify the root cause of your cluster's unassigned shards:

GET _cluster/allocation/explain
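
To confirm that an unassigned shard is affected by this specific error, you can inspect the allocation explain response in a script. The following Python sketch assumes that the response is already fetched and parsed into a dictionary. The helper name is_shard_lock_failure and the index name my-index are hypothetical, and the unassigned_info field layout is an assumption based on the error output shown earlier.

```python
LOCK_ERROR = "failed to obtain in-memory shard lock"


def is_shard_lock_failure(explain):
    """Return True if an allocation explain response reports the shard lock error."""
    info = explain.get("unassigned_info", {})
    details = info.get("details", "")
    return info.get("reason") == "ALLOCATION_FAILED" and LOCK_ERROR in details


# Sample response trimmed to the fields the check reads (index name is hypothetical).
sample = {
    "index": "my-index",
    "shard": 5,
    "primary": False,
    "current_state": "unassigned",
    "unassigned_info": {
        "reason": "ALLOCATION_FAILED",
        "failed_allocation_attempts": 5,
        "details": (
            "failed shard on node []: failed to create shard, failure "
            "IOException[failed to obtain in-memory shard lock]; nested: "
            "ShardLockObtainFailedException[[][5]: obtaining shard lock "
            "timed out after 5000ms]; "
        ),
    },
}

print(is_shard_lock_failure(sample))
```

A response without an unassigned_info block, or with a different failure reason, makes the check return False. In that case, a different troubleshooting path likely applies.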

Note: The cluster reroute API isn't recognized by OpenSearch Service. For more information about supported API operations, see Notable API differences.

Increase the maximum retry setting

To return your OpenSearch Service cluster to the green state, increase the maximum number of retries for each yellow index:

PUT /<yellow-index-name>/_settings
{
     "index.allocation.max_retries": 10
}

When you run this API call, the leader node retries shard allocation for the specified index on your cluster.

Note: When you increase the maximum retry setting, shards aren't always automatically assigned. You might have to manually assign the shards.
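
If you manage several yellow indices, you can build the same settings request in a script. The following Python sketch only constructs the request and doesn't send it. The endpoint URL and index name are hypothetical placeholders, the helper name build_max_retries_request is illustrative, and authentication (for example, request signing or basic authentication) is assumed to be handled separately.

```python
import json
import urllib.parse
import urllib.request


def build_max_retries_request(endpoint, index, max_retries=10):
    """Build (but don't send) a PUT /<index>/_settings request that raises max_retries."""
    body = json.dumps({"index.allocation.max_retries": max_retries}).encode("utf-8")
    # Quote the index name so that special characters are safe in the URL path.
    url = f"{endpoint.rstrip('/')}/{urllib.parse.quote(index, safe='')}/_settings"
    return urllib.request.Request(
        url=url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical domain endpoint and index name, for illustration only.
req = build_max_retries_request(
    "https://search-example.us-east-1.es.amazonaws.com", "my-yellow-index"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```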

Update the replica count

Important: Don't use this approach if your OpenSearch Service cluster load is high. When you remove all replicas from an index, the index must rely only on primary shards. If a node goes down, then your cluster might enter red cluster status because the primary shards are left unassigned.

To change your replica count, perform the following steps:

1.    Remove all replicas so that the replica count of the affected index becomes 0:

PUT /<yellow-index-name>/_settings
{
     "index": {
          "number_of_replicas": 0
     }
}

2.    Change the replica count back to the desired count:

PUT /<yellow-index-name>/_settings
{
     "index": {
          "number_of_replicas": 1
     }
}
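
The two steps above can be sketched as an ordered list of requests. The following Python example only generates the method, path, and body for each call. The helper name replica_reset_steps and the index name are hypothetical. The sketch doesn't send the requests or check that the first step finishes before the second step runs.

```python
import json


def replica_reset_steps(index, desired_replicas=1):
    """Return the (method, path, body) calls that drop replicas to 0 and then restore them."""
    path = f"/{index}/_settings"

    def body(count):
        return json.dumps({"index": {"number_of_replicas": count}})

    return [
        ("PUT", path, body(0)),                 # Step 1: remove all replicas
        ("PUT", path, body(desired_replicas)),  # Step 2: restore the desired count
    ]


# Hypothetical index name, for illustration only.
steps = replica_reset_steps("my-yellow-index", desired_replicas=1)
for method, path, payload in steps:
    print(method, path, payload)
```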

If your indices are located in warm nodes, see the following troubleshooting steps.

Wait for warm indices to be automatically assigned and return to green status

A yellow cluster status for indices located in warm nodes will return to green automatically when the domain has sufficient resources. The data in warm indices is backed by Amazon Simple Storage Service (Amazon S3), so there is no risk of data loss when a warm index is in yellow or red status.

In the following scenarios, contact AWS Support to manually re-route the unassigned warm indices:

  • The yellow cluster status doesn't return to green after several hours.
  • You see the "failed to obtain in-memory shard lock" error for warm indices, and the yellow cluster status doesn't return to green after several hours.
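
To track whether the cluster returns to green within your chosen window, you can poll the cluster status from a script. The following Python sketch is a minimal example. The helper name wait_for_green is hypothetical, and fetch_status is assumed to be a function that you supply, for example one that reads the status field from GET /_cluster/health. The demo at the end uses a stand-in function instead of a real cluster.

```python
import time


def wait_for_green(fetch_status, timeout_s, interval_s=60.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns "green" or timeout_s elapses.

    Returns True if the status turned green, or False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if fetch_status() == "green":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Demo with a stand-in status source; a real fetch_status would query the domain.
fake_statuses = iter(["yellow", "yellow", "green"])
turned_green = wait_for_green(lambda: next(fake_statuses), timeout_s=10, interval_s=0)
print(turned_green)
```

If the function returns False after the window that you chose, then the conditions in the preceding list apply.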

Related information

Why is my Amazon OpenSearch Service cluster in red or yellow status?

Why did my Amazon OpenSearch Service node crash?

How do I troubleshoot high JVM memory pressure on my Amazon OpenSearch Service cluster?

AWS OFFICIAL
Updated 10 months ago