How to Monitor a Self-Managed OpenSearch Cluster in 2026
You're not on a managed service. There's no built-in dashboard. The _cat APIs are cryptic. Here's a practical guide to what you should actually monitor and the tools available to do it.
If you're running OpenSearch on your own infrastructure — on EC2, bare metal, on-premise, or a VPS — you already know the tooling gap. AWS OpenSearch Service gives you CloudWatch dashboards. Elastic Cloud gives you built-in monitoring. But self-managed OpenSearch gives you the _cat APIs, a JSON response, and a blank stare.
This guide covers what you should actually monitor, what tools are available in 2026, and how to build a monitoring setup that tells you about problems before your users do.
What You Actually Need to Monitor
Before picking a tool, you need to know what matters. Self-managed OpenSearch clusters fail in predictable ways. These are the metrics that give you early warning:
Cluster health — your first indicator
GET /_cluster/health returns a single status: green, yellow, or red. Green means all shards are assigned and healthy. Yellow means all primaries are assigned but some replicas aren't. Red means at least one primary shard is unassigned — data may be unavailable.
This is the first thing you should check and the first thing you should alert on. A cluster that goes red is an emergency. A cluster that stays yellow for more than a few minutes needs investigation.
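As a minimal sketch, that check can be scripted with curl. The host and the admin credentials below are placeholders for your own cluster:

```shell
check_health() {
  # stdin: JSON like {"status":"yellow"}; prints just the status word
  sed -n 's/.*"status":"\([a-z]*\)".*/\1/p'
}

# Placeholder host and credentials; adjust for your cluster.
status=$(curl -s --max-time 5 -u admin:your-password \
  'http://localhost:9200/_cluster/health?filter_path=status' | check_health)

case "$status" in
  green)  echo "OK: all shards assigned" ;;
  yellow) echo "WARN: some replicas unassigned; investigate if it persists" ;;
  red)    echo "CRIT: at least one primary shard unassigned" ;;
  *)      echo "CRIT: cluster did not respond" ;;
esac
```

Wire this into cron and your paging tool of choice; the red and no-response branches are the ones worth paging on.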
JVM heap usage — your stability indicator
Check nodes.*.jvm.mem.heap_used_percent from GET /_nodes/stats/jvm. Below 75%: healthy. 75–85%: watch it. Above 85%: act now. Above 90%: circuit breakers will start rejecting requests.
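The same thresholds can be applied via the plain-text _cat API, which is easier to parse in shell than the JSON stats. A sketch, with placeholder host and credentials:

```shell
classify_heap() {
  # stdin: "name heap.percent" rows from _cat/nodes
  awk '{
    level = "ok"
    if ($2 > 75) level = "watch"
    if ($2 > 85) level = "act now"
    printf "%s heap=%s%% [%s]\n", $1, $2, level
  }'
}

# Placeholder host and credentials; adjust for your cluster.
curl -s --max-time 5 -u admin:your-password \
  'http://localhost:9200/_cat/nodes?h=name,heap.percent' | classify_heap
```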
Disk usage — your time bomb
OpenSearch has disk watermarks built in. When disk usage crosses 85%, it stops allocating new shards to that node. At 90%, it starts moving existing shards off the node. At 95%, it marks every index with a shard on that node read-only — writes fail (deletes are still allowed).
Check nodes.*.fs.total.available_in_bytes and total_in_bytes from GET /_nodes/stats/fs. Alert at 80% disk usage — you want time to act before OpenSearch acts for you.
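A minimal shell version of that 80% alert, using the plain-text _cat/allocation API (host and credentials are placeholders):

```shell
disk_alerts() {
  # stdin: "node disk.percent" rows from _cat/allocation; the UNASSIGNED
  # row has no disk figures, so skip rows with an empty second column
  awk '$2 != "" && $2 + 0 >= 80 { printf "ALERT %s disk at %s%%\n", $1, $2 }'
}

curl -s --max-time 5 -u admin:your-password \
  'http://localhost:9200/_cat/allocation?h=node,disk.percent' | disk_alerts
```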
Unassigned shards — your data availability indicator
unassigned_shards from GET /_cluster/health tells you how many shards have no home. Zero is good. Anything above zero warrants investigation. Any unassigned primary shards (the same condition that turns the cluster red) mean data is currently unavailable.
Indexing and search throughput — your performance baseline
From GET /_nodes/stats/indices, track:
- indices.indexing.index_total — total documents indexed (use rate of change)
- indices.search.query_total — total search queries
- indices.search.query_time_in_millis / query_total — average search latency
- thread_pool.write.rejected — indexing requests being dropped (should be 0)
- thread_pool.search.rejected — search requests being dropped (should be 0)
Thread pool rejections are a critical signal. A non-zero rejection count means OpenSearch is actively discarding work because it's overloaded.
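Rejections are easiest to spot via the plain-text _cat/thread_pool API. A sketch (placeholder host and credentials) that flags any non-zero rejected count:

```shell
rejections() {
  # stdin: "node_name name rejected" rows from _cat/thread_pool
  awk '$3 + 0 > 0 { printf "ALERT %s %s pool rejected %s requests\n", $1, $2, $3 }'
}

curl -s --max-time 5 -u admin:your-password \
  'http://localhost:9200/_cat/thread_pool/write,search?h=node_name,name,rejected' | rejections
```

Note that rejected counters are cumulative since node start, so for alerting you want the delta between runs, not the absolute value.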
Snapshot recency — your backup status
List your repositories with GET /_snapshot/_all, then check GET /_snapshot/<repository>/_all and verify that at least one snapshot in each repository has state SUCCESS within the last 24 hours. No recent successful snapshot is a silent risk that only becomes visible when you need it.
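A recency check can be scripted against the plain-text _cat/snapshots API. In this sketch, "my-backups", the host, and the credentials are placeholders:

```shell
stale_snapshots() {
  # $1: current epoch seconds
  # stdin: "id status end_epoch" rows from _cat/snapshots
  awk -v now="$1" '
    $2 == "SUCCESS" && $3 + 0 > latest { latest = $3 + 0 }
    END {
      if (latest == 0)               print "ALERT: no successful snapshot found"
      else if (now - latest > 86400) print "ALERT: newest successful snapshot is older than 24h"
      else                           print "OK: recent successful snapshot exists"
    }'
}

# "my-backups" is a placeholder repository name.
curl -s --max-time 5 -u admin:your-password \
  'http://localhost:9200/_cat/snapshots/my-backups?h=id,status,end_epoch' |
  stale_snapshots "$(date +%s)"
```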
Your Options for Monitoring
There are four realistic approaches for self-managed OpenSearch. Here's an honest assessment of each.
Option 1: DIY with Prometheus + Grafana
The most flexible option. The prometheus-exporter plugin (a community plugin; install the build that matches your OpenSearch version) exposes metrics at GET /_prometheus/metrics. Prometheus scrapes it. Grafana visualises it. You build dashboards.
Pros: Full control, unlimited metrics, no external dependency, cost scales with your existing Prometheus infrastructure.
Cons: Significant setup time. You need to build dashboards from scratch or import community ones (which are never quite right). You still need to write alerting rules. You're monitoring metrics — raw numbers — not findings. "Heap is at 87%" tells you something is wrong, but not what to do about it.
Best for: Teams with an existing Prometheus stack who want to add OpenSearch to their existing monitoring setup.
Option 2: OpenSearch Dashboards with Stack Monitoring
OpenSearch Dashboards (the open-source Kibana fork) includes a monitoring section that pulls data from the .monitoring-* indices. It shows node metrics, index stats, and shard information in a pre-built UI.
Pros: Already included if you're running OpenSearch Dashboards. No additional infrastructure.
Cons: Monitoring data must be stored in the same cluster you're monitoring (a risk: if the cluster goes down, you lose monitoring too). The UI is basic. Alerting requires additional configuration. No actionable findings — just raw metrics.
Best for: Teams who already use OpenSearch Dashboards and want basic visibility with minimal setup.
Option 3: Generic APM / Observability Platforms
Tools like Datadog, New Relic, or Dynatrace have OpenSearch integrations. They collect metrics, display dashboards, and can fire alerts.
Pros: Polished dashboards. Integrates with your existing APM stack if you're already paying for one.
Cons: Expensive. Datadog with infrastructure + APM can run $30–100+/month per host. The OpenSearch integration is generic — it collects standard metrics but doesn't provide OpenSearch-specific diagnostics (ISM policy failures, shard allocation reasons, security misconfigurations). You're paying for a general-purpose tool and using 5% of its OpenSearch-specific capabilities.
Best for: Large engineering teams with existing APM contracts who want OpenSearch baked into their existing observability platform and have budget to match.
Option 4: Purpose-Built OpenSearch Diagnostics
Tools built specifically for OpenSearch — like OpenSearch Doctor — take a different approach. Instead of collecting raw metrics and letting you figure out what they mean, they run structured diagnostic checks and surface actionable findings.
The difference: "Heap usage: 87%" vs "Node es-data-01 JVM heap is at 87% — GC pressure is likely causing latency spikes. Consider adding a data node or increasing heap allocation up to 32 GB."
Pros: Actionable findings, not raw metrics. OpenSearch-specific checks (ISM, security config, snapshot health, shard allocation reasons). Faster time-to-value — no dashboard building. Alerting included.
Cons: Less raw data than Prometheus. Not suitable if you need to correlate OpenSearch metrics with application traces in the same tool.
Best for: Teams running OpenSearch as a primary workload who want diagnostic intelligence, not just metric collection.
Building a Minimal Monitoring Setup (DIY)
If you want to roll your own, here's the minimum viable setup using bash and Prometheus.
Step 1: Enable the Prometheus exporter plugin
```shell
# Check if it's already installed
GET /_cat/plugins?v&s=component

# If not, install it (requires node restart); the short name works only if
# your distribution ships the plugin — otherwise pass the release zip URL
# for your exact OpenSearch version
bin/opensearch-plugin install prometheus-exporter
```

Step 2: Scrape metrics in Prometheus
```yaml
# prometheus.yml
scrape_configs:
  - job_name: opensearch
    static_configs:
      - targets: ['localhost:9200']
    metrics_path: /_prometheus/metrics
    basic_auth:
      username: admin
      password: your-password
```

Step 3: Minimal alerting rules
```yaml
# alerts.yml
groups:
  - name: opensearch
    rules:
      - alert: OpenSearchClusterRed
        expr: opensearch_cluster_status{color="red"} == 1
        for: 1m
        labels:
          severity: critical
      - alert: OpenSearchHeapHigh
        expr: opensearch_jvm_mem_heap_used_percent > 85
        for: 5m
        labels:
          severity: warning
      - alert: OpenSearchUnassignedShards
        expr: opensearch_cluster_shards_unassigned > 0
        for: 5m
        labels:
          severity: warning
      - alert: OpenSearchDiskHigh
        expr: (1 - opensearch_fs_total_available_in_bytes / opensearch_fs_total_total_in_bytes) * 100 > 80
        for: 10m
        labels:
          severity: warning
```

The Gap Raw Metrics Don't Fill
Even with Prometheus and Grafana running perfectly, there are things raw metrics can't tell you:
- Why a shard is unassigned (node left vs disk watermark vs max retry exceeded — each has a different fix)
- Whether your ISM policies are silently failing
- Whether anonymous access is enabled on your cluster
- Whether your snapshot repository is actually working
- Whether your index templates have conflicting priorities
- Which specific indices are read-only and why
These require calling specific OpenSearch APIs and interpreting the results — which is exactly what a diagnostic tool does, and what raw metric collection doesn't.
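As one example, the unassigned-shard question maps to the _cluster/allocation/explain API; called with no request body, it explains the first unassigned shard it finds. A hedged sketch (placeholder host and credentials):

```shell
explain_reason() {
  # stdin: JSON from GET /_cluster/allocation/explain
  # prints the all-caps reason code, e.g. NODE_LEFT or ALLOCATION_FAILED
  sed -n 's/.*"reason":"\([A-Z_]*\)".*/\1/p'
}

reason=$(curl -s --max-time 5 -u admin:your-password \
  'http://localhost:9200/_cluster/allocation/explain' | explain_reason)

case "$reason" in
  NODE_LEFT)         echo "a node holding this shard left the cluster; it may recover on rejoin" ;;
  ALLOCATION_FAILED) echo "retries exhausted; fix the cause, then POST /_cluster/reroute?retry_failed=true" ;;
  "")                echo "no unassigned shards reported (or the cluster did not respond)" ;;
  *)                 echo "reason: $reason; read the deciders in the full response" ;;
esac
```

Disk-watermark problems show up in the decider explanations of the full response rather than in the reason code, which is why a diagnostic tool reads the whole payload instead of one field.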
The best monitoring setup for most self-managed OpenSearch teams is a combination: Prometheus for time-series metrics and graphs, and a diagnostic tool for actionable findings and OpenSearch-specific checks. The two are complementary rather than competing.
Try it free
OpenSearch Doctor detects all of this automatically
A lightweight agent runs on your server, checks 50+ things, and tells you exactly what's wrong and how to fix it. Free for 1 cluster, no credit card.
Get started free →