8 OpenSearch Problems That Will Eventually Kill Your Cluster

Most OpenSearch failures don't happen suddenly. They build quietly over weeks. Here are the 8 issues most teams discover too late — and how to catch them before they hurt you.

April 1, 2025 · 9 min read

Most OpenSearch cluster failures don't come out of nowhere. There's no sudden catastrophic event — just a slow accumulation of problems that nobody noticed until something broke. The heap creeps up. Shards go unassigned and stay that way. Snapshots stop working and nobody checks. Then one day a node falls over and the cluster can't recover on its own.

Here are the 8 issues that silently degrade clusters and eventually cause outages — and how to spot them before they hurt you.

1. JVM Heap Pressure Above 85%

OpenSearch runs on the JVM, and the JVM manages memory with a garbage collector. When heap usage climbs above roughly 75%, the GC starts working harder to free memory. Above 85%, it can trigger stop-the-world GC pauses — moments where the JVM freezes all threads to collect garbage. To your users, this looks like a slow or unresponsive cluster.

Above 90%, OpenSearch activates circuit breakers that start rejecting requests outright. Above 95%, you risk an OutOfMemoryError that crashes the JVM entirely.

The problem is that heap usage climbs gradually. A cluster running at 70% today might be at 88% in three weeks if you've added indices or changed query patterns. By the time queries start failing, the pressure has been building for weeks.

What to do: Monitor heap % per node. Set an alert at 80%. Above 85%, either add nodes, increase heap allocation (up to 32 GB — do not go beyond this due to compressed OOPs), or reduce fielddata usage.

GET /_nodes/stats/jvm
# Look at: nodes.*.jvm.mem.heap_used_percent
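The check itself can be a small pure function over the parsed stats response. A minimal sketch in Python — the function name and the sample payload are illustrative (trimmed to the fields the check reads), not real cluster output:

```python
# Sketch: flag nodes whose heap usage crosses the alert threshold.
# `stats` mirrors the shape of the GET /_nodes/stats/jvm response.

def nodes_over_heap_threshold(stats, threshold=80):
    """Return [(node_name, heap_pct)] for nodes at or above `threshold`."""
    flagged = []
    for node in stats.get("nodes", {}).values():
        pct = node["jvm"]["mem"]["heap_used_percent"]
        if pct >= threshold:
            flagged.append((node["name"], pct))
    return flagged

# Illustrative sample response, trimmed to the relevant fields
sample = {
    "nodes": {
        "abc123": {"name": "data-1", "jvm": {"mem": {"heap_used_percent": 72}}},
        "def456": {"name": "data-2", "jvm": {"mem": {"heap_used_percent": 88}}},
    }
}

print(nodes_over_heap_threshold(sample))  # → [('data-2', 88)]
```

Wire this into whatever fetches the stats on a schedule; the point is that the alert fires on a trend-crossing threshold, not on the eventual OutOfMemoryError.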

2. Unassigned Shards — Left Alone Too Long

When a node leaves the cluster or a shard fails to allocate, its shards become unassigned. OpenSearch will try to reassign them automatically in many cases — but not all.

If a node was removed intentionally (decommissioned, scaled down, rebooted for too long), its primary shards won't reassign until you either bring the node back or explicitly tell OpenSearch to allocate the shard elsewhere. In the meantime, those shards are unavailable. If a primary shard is unassigned, reads and writes to that shard will fail.

The real danger: teams often see the yellow status, note that the cluster is still responding, and move on. Those unassigned shards stay unassigned for days. If a second node fails during that window, you now have primary shards with no replicas — potential data loss.

What to do: Never ignore unassigned shards for more than a few minutes. Run GET /_cluster/allocation/explain to understand why a shard is unassigned, then fix the root cause. Don't just force-allocate without understanding why it happened.
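For the detection side, a sketch that scans the plain-text GET /_cat/shards output for UNASSIGNED rows (the helper name and sample lines are illustrative):

```python
# Sketch: list unassigned shards from GET /_cat/shards output.
# Default column order: index shard prirep state docs store ip node.

def unassigned_shards(cat_shards_text):
    """Return [(index, shard, prirep)] for shards in UNASSIGNED state."""
    rows = []
    for line in cat_shards_text.strip().splitlines():
        fields = line.split()
        # Unassigned rows have no docs/store/ip/node columns
        if len(fields) >= 4 and fields[3] == "UNASSIGNED":
            rows.append((fields[0], fields[1], fields[2]))
    return rows

sample = """\
logs-2025.03.30 0 p STARTED    12345 1.2gb 10.0.0.1 data-1
logs-2025.03.30 0 r UNASSIGNED
logs-2025.03.31 0 p UNASSIGNED
"""

print(unassigned_shards(sample))
```

Any result where `prirep` is `p` is the dangerous case: an unassigned primary means that shard is not serving reads or writes at all.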

3. No Recent Snapshot

This one is embarrassingly common. The snapshot repository was set up, a few test snapshots ran, and then something changed — a node IP, a storage credential, an S3 bucket policy — and snapshots started failing silently. Nobody noticed because the cluster was healthy. Then a corruption event or accidental index deletion happened, and there was no usable backup.

Snapshots in OpenSearch are incremental, cheap to run once you have the first one, and they're the only reliable recovery option for data loss events. Running them isn't enough — you need to verify they're completing successfully.

What to do: Configure a daily snapshot policy via ISM or a cron job. Check GET /_snapshot/_all/_all?ignore_unavailable=true regularly and alert if no SUCCESS state snapshot exists within the last 24 hours.
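The freshness check boils down to "does any SUCCESS snapshot have a completion time inside the window". A sketch, assuming you've already parsed the `snapshots` array out of the snapshot-list response (timestamps and snapshot names below are made up):

```python
# Sketch: verify a SUCCESS snapshot completed within the last 24 hours.
# `end_time_in_millis` is the snapshot completion time in epoch millis.
import time

def has_recent_snapshot(snapshots, max_age_hours=24, now_ms=None):
    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    cutoff = now_ms - max_age_hours * 3600 * 1000
    return any(
        s.get("state") == "SUCCESS" and s.get("end_time_in_millis", 0) >= cutoff
        for s in snapshots
    )

now = 1_743_500_000_000  # fixed "now" so the example is deterministic
sample = [
    {"snapshot": "daily-2025-03-30", "state": "SUCCESS",
     "end_time_in_millis": now - 30 * 3600 * 1000},  # 30h old — stale
    {"snapshot": "daily-2025-03-31", "state": "FAILED",
     "end_time_in_millis": now - 2 * 3600 * 1000},   # recent but failed
]

print(has_recent_snapshot(sample, now_ms=now))  # → False: time to alert
```

Note that the failed-but-recent snapshot doesn't count — that's exactly the silent-failure mode this check exists to catch.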

4. Cluster RED With No Alerting

A cluster goes RED when one or more primary shards are unassigned. This means those shards are not serving reads or writes. Data may be unavailable. Indexing to affected indices will fail.

The thing about RED status: it can happen within minutes of a node failure, and if you don't have an alert, you might not know for hours — until a user reports that search results stopped updating, or an engineering team notices that writes are returning errors.

Many teams rely on passive monitoring: they'll notice something is wrong when the application breaks. That's too late. You want to know the cluster is RED before any user does.

What to do: Set up an active health check on GET /_cluster/health?wait_for_status=yellow&timeout=5s and alert immediately on RED. At minimum, poll this endpoint every 60 seconds.
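The decision logic is trivial, which is exactly why it should be automated rather than left to a human glancing at a dashboard. A sketch — the polling loop and paging integration are assumed to live elsewhere:

```python
# Sketch: map a GET /_cluster/health response to an alert level.

def health_alert(health):
    """Return 'page', 'warn', or None for a parsed cluster health response."""
    status = health.get("status")
    if status == "red":
        return "page"   # primary shards unassigned — wake someone up now
    if status == "yellow":
        return "warn"   # replicas missing — investigate, don't ignore
    return None         # green: nothing to do

print(health_alert({"status": "red"}))    # → page
print(health_alert({"status": "green"}))  # → None
```

Run it from whatever scheduler you already have (cron, your monitoring agent, a sidecar) on the 60-second cadence suggested above.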

5. Anonymous Access Enabled

OpenSearch ships with the security plugin enabled by default in modern versions, but this wasn't always the case — and plenty of clusters were set up during or before the transition. Some teams deliberately disable security for internal clusters, reasoning that the cluster is only accessible from within the VPC.

The problem: "only accessible from within the VPC" is not as strong a guarantee as it sounds. VPCs get misconfigured. Security groups get loosened. Applications running inside the same network get compromised. Once an attacker has access to your OpenSearch endpoint without authentication, they can read all your data, delete indices, and potentially pivot further into your infrastructure via the REST API.

Anonymous access is also a compliance issue. Any regulation that requires audit logging (GDPR, SOC 2, HIPAA) requires that you know who accessed what. Anonymous access makes that impossible.

What to do: Check GET /_plugins/_security/api/securityconfig. Ensure anonymous_auth_enabled is false. Enable TLS on both HTTP and transport layers.
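As a sketch of the audit check, assuming the securityconfig response nests the flag under `config.dynamic.http` (verify the path against your cluster's actual response before relying on it):

```python
# Sketch: detect anonymous access in the security config.
# Assumed response shape: config.dynamic.http.anonymous_auth_enabled

def anonymous_auth_enabled(securityconfig):
    """Return True if anonymous authentication is enabled."""
    http = (securityconfig.get("config", {})
                          .get("dynamic", {})
                          .get("http", {}))
    return bool(http.get("anonymous_auth_enabled", False))

sample = {"config": {"dynamic": {"http": {"anonymous_auth_enabled": True}}}}
print(anonymous_auth_enabled(sample))  # → True: fail the audit
```

Treat a `True` here the same way you'd treat a RED cluster: an alert, not a backlog ticket.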

6. ISM Policy Failures — Silent and Cumulative

Index State Management (ISM) is how most OpenSearch operators automate index lifecycle — rolling over time-series indices, moving cold data to cheaper storage, deleting old indices. When ISM works, it's invisible. When it breaks, indices accumulate indefinitely and storage fills up.

ISM failures are rarely dramatic. An index gets stuck in a state. The policy stops executing. No error is surfaced in the cluster health status — the cluster stays green while your disk usage climbs by 20 GB per day. Weeks later you hit the disk watermark, the cluster flips indices to read-only, and all indexing stops.

Common causes: rollover alias not configured on the index, a condition that can never be met (e.g. min_doc_count: 1000000 on an index that only ever gets 500 docs), or a state machine that requires a transition that was never defined.

What to do: Check GET /_plugins/_ism/explain/* regularly. Look for indices where info.cause is non-empty — that's an ISM error. Fix the root cause (usually the alias or the condition), then retry the policy: POST /_plugins/_ism/retry/<index>.
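That check is easy to automate. A sketch over the parsed explain response, following the `info.cause` convention described above (the sample payload and cause string are illustrative):

```python
# Sketch: surface stuck indices from GET /_plugins/_ism/explain/* output.
# An index whose info.cause is non-empty has a failed ISM step.

def stuck_ism_indices(explain):
    """Return [(index, cause)] for indices with an ISM error."""
    stuck = []
    for index, state in explain.items():
        if not isinstance(state, dict):
            continue  # skip summary fields like total_managed_indices
        cause = state.get("info", {}).get("cause")
        if cause:
            stuck.append((index, cause))
    return stuck

sample = {
    "logs-000042": {"info": {"cause": "Missing rollover_alias index setting"}},
    "logs-000041": {"info": {"message": "Successfully rolled over index"}},
    "total_managed_indices": 2,
}

print(stuck_ism_indices(sample))  # → [('logs-000042', 'Missing rollover_alias index setting')]
```

Each hit maps directly to the remediation above: fix the alias or condition, then retry the policy for that index.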

7. Single-Node Cluster — No Redundancy

A cluster running on a single node has no fault tolerance. If that node goes down for any reason — host restart, hardware failure, OOM, or a bad deployment — your cluster is completely unavailable. Every primary shard becomes unassigned. Every index becomes inaccessible.

This is obviously unacceptable for production, but single-node clusters often start as "temporary" setups that outlive their original purpose. A staging environment that becomes load-bearing. A proof-of-concept that got promoted to production. A cost-cutting measure that nobody revisited.

Even two nodes is significantly better than one — you can lose a node without losing the cluster. Three nodes is the minimum for a resilient cluster with a proper quorum for master elections.

What to do: Check number_of_nodes in GET /_cluster/health. For any data that matters, run at least 3 nodes, with at least 3 cluster-manager-eligible nodes so elections keep a proper quorum. (The legacy discovery.zen.minimum_master_nodes setting from Elasticsearch 6.x is ignored in OpenSearch — the quorum is managed automatically; you only set cluster.initial_cluster_manager_nodes when bootstrapping a brand-new cluster.)
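A sketch of the redundancy check, graded rather than binary (the severity labels are my own convention):

```python
# Sketch: grade fault tolerance from the cluster health node count.

def redundancy_check(health, min_nodes=3):
    """Return 'critical', 'warning', or 'ok' based on number_of_nodes."""
    n = health.get("number_of_nodes", 0)
    if n < 2:
        return "critical"  # single node: any failure is a full outage
    if n < min_nodes:
        return "warning"   # survives one loss, but no quorum margin
    return "ok"

print(redundancy_check({"number_of_nodes": 1}))  # → critical
```

The "warning" tier exists because two nodes is a real improvement over one, as noted above — but it shouldn't read as done.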

8. Too Many Small Shards (Over-Sharding)

A common pattern when getting started with OpenSearch: create indices with 5 primary shards by default, add a daily rollover, and end up with hundreds of indices each with 5 shards — thousands of shards total for a dataset that could fit comfortably in 10.

Each shard is a Lucene instance. Each Lucene instance has overhead: file handles, heap memory for segment metadata, thread pool participation. At scale, thousands of tiny shards consume more heap than the data itself. The rule of thumb: shards should be 10–50 GB each. Below 1 GB per shard, the overhead cost exceeds the value of the shard.

Over-sharding also makes recovery slower. When a node fails, OpenSearch must recover every shard that was on that node. A thousand 100 MB shards take longer to recover than ten 10 GB shards, even though the total data is the same.

What to do: Audit your shards with GET /_cat/shards?v&s=store:desc. Identify indices with many tiny shards and consolidate with ISM rollover policies that trigger on size rather than time, or use the Shrink API on historical indices.
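A sketch of the audit, assuming you fetch the shard list as JSON with byte-denominated sizes (GET /_cat/shards?format=json&bytes=b — both parameters are standard cat-API options) and want to flag indices averaging under 1 GB per primary:

```python
# Sketch: flag indices whose average primary-shard size is under 1 GB.
# `shards` mirrors GET /_cat/shards?format=json&bytes=b — store is bytes.

def oversharded_indices(shards, min_avg_bytes=1 << 30):
    """Return indices with multiple primaries averaging below the floor."""
    sizes = {}
    for s in shards:
        if s.get("prirep") != "p" or s.get("store") is None:
            continue  # only count allocated primary shards
        sizes.setdefault(s["index"], []).append(int(s["store"]))
    return sorted(
        idx for idx, sz in sizes.items()
        if len(sz) > 1 and sum(sz) / len(sz) < min_avg_bytes
    )

sample = [
    {"index": "logs-a", "prirep": "p", "store": str(100 * 1024 * 1024)},
    {"index": "logs-a", "prirep": "p", "store": str(120 * 1024 * 1024)},
    {"index": "logs-a", "prirep": "r", "store": str(120 * 1024 * 1024)},
    {"index": "big",    "prirep": "p", "store": str(20 * (1 << 30))},
]

print(oversharded_indices(sample))  # → ['logs-a']
```

Single-primary indices are excluded deliberately: a lone small shard is cheap, whereas many small shards per index is the pattern worth shrinking or rolling over by size.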

How to Catch All of This Automatically

Manually checking all 8 of these issues across every cluster you manage isn't realistic. Most teams check reactively — after something breaks. By then, the damage is done.

OpenSearch Doctor runs these checks automatically every 6 hours and alerts you the moment a threshold is crossed. Heap above 85%: you get notified. Unassigned shards: alerted within minutes. No recent snapshot: flagged. ISM errors: surfaced. All 8 issues above are covered by the agent's diagnostic checks.

The agent runs on your own server, connects to your cluster locally, and never reads your documents or credentials. It's free for 1 cluster with no credit card required.

OpenSearch Doctor detects all of this automatically

A lightweight agent runs on your server, checks 50+ things, and tells you exactly what's wrong and how to fix it. Free for 1 cluster, no credit card.

Get started free →