How do I resolve the "Courier fetch: n of m shards failed" error in OpenSearch Dashboards on Amazon OpenSearch Service?

When I try to load a dashboard in OpenSearch Dashboards on my Amazon OpenSearch Service domain, it returns a Courier fetch error. How do I resolve this?

Short description

When you load a dashboard in OpenSearch Dashboards, a search request is sent to the OpenSearch Service domain. The search request is routed to a cluster node that acts as the coordinating node for the request. The "Courier fetch: n of m shards failed" error occurs when the coordinating node fails to complete the fetch phase of the search request. There are two types of issues that commonly cause this error:

  • Persistent issues: Mapping conflicts or unassigned shards. If several indices in your index pattern contain a field with the same name but different mapping types, you might get a Courier fetch error. If your cluster is in red cluster status, at least one primary shard is unassigned. Because OpenSearch Service can't fetch documents from unassigned shards, a cluster in red status throws a Courier fetch error. If the value of "n" in the Courier fetch error message is the same each time you receive the error, then the cause is likely a persistent issue. Check the application error logs for troubleshooting suggestions.
    Note: Persistent issues can't be resolved by retrying or provisioning more cluster resources.
  • Transient issues: Thread pool rejections, search timeouts, and tripped field data circuit breakers. These issues occur when the cluster doesn't have enough compute resources. A transient issue is the likely cause when you receive the error message intermittently, with a different value of "n" each time. You can also monitor Amazon CloudWatch metrics such as CPUUtilization, JVMMemoryPressure, and ThreadpoolSearchRejected to determine whether a transient issue is causing the Courier fetch error.
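
The "same n or different n" heuristic above can be applied by collecting the error messages over several dashboard loads and comparing the failed-shard counts. The following is a minimal sketch, not a feature of OpenSearch Dashboards. The function name classify_courier_errors and the sample messages are hypothetical, and the result is only a hint that the error logs should confirm.

```python
import re

# Matches messages such as "Courier fetch: 3 of 10 shards failed".
COURIER_RE = re.compile(r"Courier fetch: (\d+) of (\d+) shards failed")


def classify_courier_errors(messages):
    """Guess whether repeated Courier fetch errors look persistent or transient.

    Heuristic from the article: the same "n" every time suggests a persistent
    issue; a varying "n" suggests a transient one.
    """
    failed_counts = []
    for message in messages:
        match = COURIER_RE.search(message)
        if match:
            failed_counts.append(int(match.group(1)))

    if len(failed_counts) < 2:
        return "unknown"  # not enough samples to compare
    if len(set(failed_counts)) == 1:
        return "likely persistent"
    return "likely transient"


# Hypothetical messages copied from several dashboard loads.
same_n = ["Courier fetch: 2 of 15 shards failed"] * 3
varying_n = [
    "Courier fetch: 1 of 15 shards failed",
    "Courier fetch: 4 of 15 shards failed",
    "Courier fetch: 2 of 15 shards failed",
]

print(classify_courier_errors(same_n))
print(classify_courier_errors(varying_n))
```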

Resolution

Enable application error logs for the domain. The logs can help you identify the root cause and solution for both transient and persistent issues. For more information, see Viewing OpenSearch Service error logs.

Persistent issues

The following example shows a log entry for a Courier fetch error caused by a persistent issue:

[2019-07-01T12:54:02,791][DEBUG][o.e.a.s.TransportSearchAction] [ip-xx-xx-xx-xxx] [1909731] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [ip-xx-xx-xx-xx][xx.xx.xx.xx:9300][indices:data/read/search[phase/fetch/id]]
Caused by: java.lang.IllegalArgumentException: Fielddata is disabled on text fields by default. 
Set fielddata=true on [request_departure_date] in order to load fielddata in memory by uninverting the inverted index.
Note that this can however use significant memory. Alternatively use a keyword field instead.

In this example, the issue is caused by the request_departure_date field, which is mapped as a text field. The log entry shows that you can resolve this issue either by setting fielddata=true in the field's mapping, which can use significant memory, or by using a keyword field instead.
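
As an illustration of the two options named in the log entry, the following sketch builds the request bodies that could be sent with PUT <index>/_mapping. It assumes that request_departure_date is currently mapped as text and that the domain uses the typeless mapping API (Elasticsearch 7.x or OpenSearch). The helper names and the "keyword" sub-field name are hypothetical choices, and the index name is left for you to supply.

```python
import json


def fielddata_mapping_body(field):
    """Mapping update that enables fielddata on an existing text field.

    As the log entry warns, this can use significant heap memory.
    """
    return {"properties": {field: {"type": "text", "fielddata": True}}}


def keyword_subfield_mapping_body(field, subfield="keyword"):
    """Mapping update that adds a keyword multi-field to an existing text field.

    Aggregations and sorts would then target "<field>.<subfield>".
    """
    return {
        "properties": {
            field: {
                "type": "text",
                "fields": {subfield: {"type": "keyword"}},
            }
        }
    }


# Print both candidate bodies for the field named in the log entry.
print(json.dumps(fielddata_mapping_body("request_departure_date"), indent=2))
print(json.dumps(keyword_subfield_mapping_body("request_departure_date"), indent=2))
```

With the second option, documents that were indexed before the mapping change don't have values in the new sub-field until they are reindexed or updated.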

Transient issues

Most transient issues can be resolved by either provisioning more compute resources or reducing the resource utilization for your queries.

Provisioning more compute resources

Reducing the resource utilization for your queries

  • Confirm that you're following best practices for shard and cluster architecture. A poorly designed cluster can't use all available resources. Some nodes might get overloaded while other nodes sit idle. OpenSearch Service can't fetch documents from overloaded nodes.
  • Reduce the scope of your query. For example, if you query by time frame, reduce the date range or filter the results by configuring the index pattern in OpenSearch Dashboards.
  • Avoid running select * queries on large indices. Instead, use filters to query a part of the index and search as few fields as possible. For more information, see Tune for search speed and Query and filter context on the Elasticsearch website.
  • Reindex and reduce the number of shards. The more shards you have in your cluster, the more likely you are to get a Courier fetch error. Because each shard has its own resource allocation and overheads, a large number of shards places excessive strain on your cluster. For more information, see Why is my OpenSearch Service domain stuck in the "Processing" state?
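
To illustrate the advice above about narrowing the date range and searching as few fields as possible, the following sketch builds a search request body that uses a range clause in filter context and returns only selected fields. The function name build_scoped_query and the field names (@timestamp, status, request_id) are hypothetical; substitute the fields from your own indices.

```python
import json


def build_scoped_query(time_field, start, end, fields, size=100):
    """Build a search body limited to a time window and a few returned fields."""
    return {
        "size": size,              # cap the number of hits returned
        "_source": list(fields),   # return only the listed fields
        "query": {
            "bool": {
                # Filter context: clauses here are not scored.
                "filter": [
                    {"range": {time_field: {"gte": start, "lte": end}}}
                ]
            }
        },
    }


# Hypothetical example: the last 15 minutes, two fields per hit.
body = build_scoped_query("@timestamp", "now-15m", "now", ["status", "request_id"])
print(json.dumps(body, indent=2))
```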

The following example shows a log entry for a Courier fetch error caused by a transient issue:

Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@26fdeb6f on QueueResizingEsThreadPoolExecutor
[name = __PATH__ queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 2.9ms, adjustment amount = 50,
org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@1968ac53[Running, pool size = 2, active threads = 2, queued tasks = 1015, completed tasks = 96587627]]

In this example, the issue is caused by search thread pool queue rejections: both threads in the pool are active, and the number of queued tasks (1015) exceeds the queue capacity (1000). To resolve this issue, scale up your domain by choosing a larger instance type. For more information, see Thread pools on the Elasticsearch website.
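
When many such entries appear in the error logs, it can help to extract the numbers from the rejection message. The following is a rough sketch that assumes the message format shown in the example above. The function names, the SAMPLE excerpt, and the saturation check are illustrative assumptions, not an official log parser.

```python
import re

# Patterns for the fields of interest. The lookbehinds skip the
# "min queue capacity" and "max queue capacity" values.
PATTERNS = {
    "queue_capacity": r"(?<!min )(?<!max )queue capacity = (\d+)",
    "pool_size": r"pool size = (\d+)",
    "active_threads": r"active threads = (\d+)",
    "queued_tasks": r"queued tasks = (\d+)",
}

# Excerpt of the rejection message from the log entry above.
SAMPLE = (
    "queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, "
    "frame size = 2000, targeted response rate = 1s, "
    "[Running, pool size = 2, active threads = 2, queued tasks = 1015, "
    "completed tasks = 96587627]"
)


def parse_rejection(log_text):
    """Return the numeric fields found in a thread pool rejection message."""
    stats = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, log_text)
        if match:
            stats[name] = int(match.group(1))
    return stats


def is_saturated(stats):
    """True when every thread is busy and the queue is at or over capacity."""
    return (
        stats["active_threads"] >= stats["pool_size"]
        and stats["queued_tasks"] >= stats["queue_capacity"]
    )


stats = parse_rejection(SAMPLE)
print(stats)
print(is_saturated(stats))
```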


Related information

Best practices for Amazon OpenSearch Service

Troubleshooting Amazon OpenSearch Service

AWS OFFICIAL
Updated 3 years ago