
How to Monitor a Self-Managed OpenSearch Cluster in 2026

You're not on a managed service. There's no built-in dashboard. The _cat APIs are cryptic. Here's a practical guide to what you should actually monitor and the tools available to do it.

April 3, 2025 · 8 min read

If you're running OpenSearch on your own infrastructure — on EC2, bare metal, on-premise, or a VPS — you already know the tooling gap. AWS OpenSearch Service gives you CloudWatch dashboards. Elastic Cloud gives you built-in monitoring. But self-managed OpenSearch gives you the _cat APIs, a JSON response, and a blank stare.

This guide covers what you should actually monitor, what tools are available in 2026, and how to build a monitoring setup that tells you about problems before your users do.

What You Actually Need to Monitor

Before picking a tool, you need to know what matters. Self-managed OpenSearch clusters fail in predictable ways. These are the metrics that give you early warning:

Cluster health — your first indicator

GET /_cluster/health returns a single status: green, yellow, or red. Green means all shards are assigned and healthy. Yellow means all primaries are assigned but some replicas aren't. Red means at least one primary shard is unassigned — data may be unavailable.

This is the first thing you should check and the first thing you should alert on. A cluster that goes red is an emergency. A cluster that stays yellow for more than a few minutes needs investigation.
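If you want to script this check, here's a minimal sketch. The field names follow the _cluster/health response; the function name and severity labels are my own, not from any library:

```python
# Classify a GET /_cluster/health response into an alert level.
# Thresholds follow the green/yellow/red semantics described above.
def health_alert(health: dict) -> str:
    """Map cluster status to an alert severity."""
    status = health.get("status")
    if status == "red":
        return "critical"   # at least one primary shard is unassigned
    if status == "yellow":
        return "warning"    # replicas unassigned; investigate if it persists
    return "ok"

# Example (truncated) response body from GET /_cluster/health
sample = {"cluster_name": "demo", "status": "yellow", "unassigned_shards": 2}
print(health_alert(sample))  # warning
```

Pair this with a `for: 1m` style debounce in your alerting layer so a brief yellow during a rolling restart doesn't page anyone.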

JVM heap usage — your stability indicator

Check nodes.*.jvm.mem.heap_used_percent from GET /_nodes/stats/jvm. Below 75%: healthy. 75–85%: watch it. Above 85%: act now. Above 90%: circuit breakers will start rejecting requests.
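The same thresholds as a check over a _nodes/stats/jvm response. The response shape mirrors the API; the node name in the sample and the alert labels are illustrative:

```python
# Flag nodes whose heap usage crosses the 75% / 85% thresholds above.
def heap_alerts(nodes_stats: dict, warn: int = 75, high: int = 85):
    """Yield (node_name, heap_pct, level) for nodes above the warn threshold."""
    for node in nodes_stats.get("nodes", {}).values():
        pct = node["jvm"]["mem"]["heap_used_percent"]
        if pct > high:
            yield node["name"], pct, "act-now"
        elif pct > warn:
            yield node["name"], pct, "watch"

sample = {"nodes": {"abc": {"name": "es-data-01",
                            "jvm": {"mem": {"heap_used_percent": 87}}}}}
print(list(heap_alerts(sample)))  # [('es-data-01', 87, 'act-now')]
```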

Disk usage — your time bomb

OpenSearch has disk watermarks built in. When disk usage on a node crosses 85% (the low watermark), it stops allocating new shards to that node. At 90% (the high watermark), it starts moving existing shards off the node. At 95% (the flood stage), it applies a read-only block (read_only_allow_delete) to every index with a shard on that node, and writes to those indices fail until disk space is freed.

Check nodes.*.fs.total.available_in_bytes and total_in_bytes from GET /_nodes/stats/fs. Alert at 80% disk usage — you want time to act before OpenSearch acts for you.

Unassigned shards — your data availability indicator

unassigned_shards from GET /_cluster/health tells you how many shards have no home. Zero is good; anything above zero warrants investigation. If any of the unassigned shards are primaries, data is currently unavailable (and the cluster status will be red).

Indexing and search throughput — your performance baseline

From GET /_nodes/stats/indices, track:

  • indices.indexing.index_total — total documents indexed (use rate of change)
  • indices.search.query_total — total search queries
  • indices.search.query_time_in_millis / query_total — average search latency (compute both as deltas between samples; the lifetime totals smooth out recent spikes)
  • thread_pool.write.rejected — indexing requests being dropped (should be 0)
  • thread_pool.search.rejected — search requests being dropped (should be 0)

Thread pool rejections are a critical signal. A non-zero rejection count means OpenSearch is actively discarding work because it's overloaded.
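These counters only make sense as rates, so the sketch below diffs two polls of GET /_nodes/stats taken a fixed interval apart. The field paths follow the API; the function and sample numbers are illustrative:

```python
# Turn two samples of the cumulative counters into per-second rates and a
# windowed average query latency.
def throughput(prev: dict, curr: dict, interval_s: float) -> dict:
    """Derive rates and average latency from two counter samples."""
    d_index = curr["index_total"] - prev["index_total"]
    d_query = curr["query_total"] - prev["query_total"]
    d_millis = curr["query_time_in_millis"] - prev["query_time_in_millis"]
    return {
        "index_rate": d_index / interval_s,             # docs indexed per second
        "search_rate": d_query / interval_s,            # queries per second
        "avg_query_ms": d_millis / d_query if d_query else 0.0,
    }

prev = {"index_total": 1000, "query_total": 200, "query_time_in_millis": 4000}
curr = {"index_total": 1600, "query_total": 260, "query_time_in_millis": 4900}
print(throughput(prev, curr, 60))
# {'index_rate': 10.0, 'search_rate': 1.0, 'avg_query_ms': 15.0}
```

Apply the same delta logic to thread_pool.write.rejected and thread_pool.search.rejected: any increase between samples means work was dropped during that window.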

Snapshot recency — your backup status

Check GET /_snapshot/_all/_all and verify that at least one snapshot in each repository has state SUCCESS within the last 24 hours. No recent successful snapshot is a silent risk that only becomes visible when you need it.

Your Options for Monitoring

There are four realistic approaches for self-managed OpenSearch. Here's an honest assessment of each.

Option 1: DIY with Prometheus + Grafana

The most flexible option. The prometheus-exporter plugin (a community-maintained plugin for OpenSearch) exposes metrics at GET /_prometheus/metrics. Prometheus scrapes it. Grafana visualises it. You build dashboards.

Pros: Full control, unlimited metrics, no external dependency, cost scales with your existing Prometheus infrastructure.

Cons: Significant setup time. You need to build dashboards from scratch or import community ones (which are never quite right). You still need to write alerting rules. You're monitoring metrics — raw numbers — not findings. "Heap is at 87%" tells you something is wrong, but not what to do about it.

Best for: Teams with an existing Prometheus stack who want to add OpenSearch to their existing monitoring setup.

Option 2: OpenSearch Dashboards with Stack Monitoring

OpenSearch Dashboards (the open-source Kibana fork) includes a monitoring section that pulls data from the .monitoring-* indices. It shows node metrics, index stats, and shard information in a pre-built UI.

Pros: Already included if you're running OpenSearch Dashboards. No additional infrastructure.

Cons: Monitoring data must be stored in the same cluster you're monitoring (a risk: if the cluster goes down, you lose monitoring too). The UI is basic. Alerting requires additional configuration. No actionable findings — just raw metrics.

Best for: Teams who already use OpenSearch Dashboards and want basic visibility with minimal setup.

Option 3: Generic APM / Observability Platforms

Tools like Datadog, New Relic, or Dynatrace have OpenSearch integrations. They collect metrics, display dashboards, and can fire alerts.

Pros: Polished dashboards. Integrates with your existing APM stack if you're already paying for one.

Cons: Expensive. Datadog with infrastructure + APM can run $30–100+/month per host. The OpenSearch integration is generic — it collects standard metrics but doesn't provide OpenSearch-specific diagnostics (ISM policy failures, shard allocation reasons, security misconfigurations). You're paying for a general-purpose tool and using only a small slice of it for OpenSearch.

Best for: Large engineering teams with existing APM contracts who want OpenSearch baked into their existing observability platform and have budget to match.

Option 4: Purpose-Built OpenSearch Diagnostics

Tools built specifically for OpenSearch — like OpenSearch Doctor — take a different approach. Instead of collecting raw metrics and letting you figure out what they mean, they run structured diagnostic checks and surface actionable findings.

The difference: "Heap usage: 87%" vs "Node es-data-01 JVM heap is at 87% — GC pressure is likely causing latency spikes. Consider adding a data node or increasing heap allocation up to 32 GB."
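The distinction is easy to show in code. Below, the same heap number is rendered once as a raw metric and once as a finding; the node name and remediation text are examples in the spirit of the quote above, not actual OpenSearch Doctor output:

```python
# Render a heap metric either as a bare number or as an actionable finding.
def heap_finding(node: str, heap_pct: int) -> str:
    if heap_pct <= 85:
        return f"{node} heap at {heap_pct}% - ok"
    return (f"Node {node} JVM heap is at {heap_pct}% - GC pressure is likely "
            f"causing latency spikes. Consider adding a data node or "
            f"increasing heap allocation up to 32 GB.")

print(f"Heap usage: 87%")                 # the raw metric
print(heap_finding("es-data-01", 87))     # the finding
```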

Pros: Actionable findings, not raw metrics. OpenSearch-specific checks (ISM, security config, snapshot health, shard allocation reasons). Faster time-to-value — no dashboard building. Alerting included.

Cons: Less raw data than Prometheus. Not suitable if you need to correlate OpenSearch metrics with application traces in the same tool.

Best for: Teams running OpenSearch as a primary workload who want diagnostic intelligence, not just metric collection.

Building a Minimal Monitoring Setup (DIY)

If you want to roll your own, here's the minimum viable setup using bash and Prometheus.

Step 1: Enable the Prometheus exporter plugin

# Check if it's already installed
GET /_cat/plugins?v&s=component

# If not, install it (requires a node restart; pick the plugin release
# that matches your OpenSearch version)
bin/opensearch-plugin install prometheus-exporter

Step 2: Scrape metrics in Prometheus

# prometheus.yml
scrape_configs:
  - job_name: opensearch
    static_configs:
      - targets: ['localhost:9200']
    metrics_path: /_prometheus/metrics
    basic_auth:
      username: admin
      password: your-password

Step 3: Minimal alerting rules

# alerts.yml
groups:
  - name: opensearch
    rules:
      - alert: OpenSearchClusterRed
        expr: opensearch_cluster_status{color="red"} == 1
        for: 1m
        labels:
          severity: critical

      - alert: OpenSearchHeapHigh
        expr: opensearch_jvm_mem_heap_used_percent > 85
        for: 5m
        labels:
          severity: warning

      - alert: OpenSearchUnassignedShards
        expr: opensearch_cluster_shards_unassigned > 0
        for: 5m
        labels:
          severity: warning

      - alert: OpenSearchDiskHigh
        expr: (1 - opensearch_fs_total_available_in_bytes / opensearch_fs_total_total_in_bytes) * 100 > 80
        for: 10m
        labels:
          severity: warning

The Gap Raw Metrics Don't Fill

Even with Prometheus and Grafana running perfectly, there are things raw metrics can't tell you:

  • Why a shard is unassigned (node left vs disk watermark vs max retry exceeded — each has a different fix)
  • Whether your ISM policies are silently failing
  • Whether anonymous access is enabled on your cluster
  • Whether your snapshot repository is actually working
  • Whether your index templates have conflicting priorities
  • Which specific indices are read-only and why

These require calling specific OpenSearch APIs and interpreting the results — which is exactly what a diagnostic tool does, and what raw metric collection doesn't.
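As a small example of that interpretation step, here's a sketch that maps the unassigned-reason code from GET /_cluster/allocation/explain to a suggested next step. The reason codes (NODE_LEFT, ALLOCATION_FAILED, and so on) are real values the API returns; the suggested fixes are illustrative shorthand, not exhaustive runbooks:

```python
# Map unassigned-shard reason codes to a suggested next step.
FIXES = {
    "NODE_LEFT": "wait for the node to rejoin, or restore from a snapshot",
    "ALLOCATION_FAILED": "check logs, then POST /_cluster/reroute?retry_failed=true",
    "INDEX_CREATED": "usually transient; investigate only if it persists",
    "REPLICA_ADDED": "usually transient; check disk watermarks if it sticks",
}

def explain_unassigned(explain: dict) -> str:
    """Turn a /_cluster/allocation/explain response into a one-line finding."""
    reason = explain.get("unassigned_info", {}).get("reason", "UNKNOWN")
    fix = FIXES.get(reason, "read the per-node deciders in the explain output")
    return f"{explain.get('index')}/{explain.get('shard')}: {reason} - {fix}"

sample = {"index": "logs-2026.04", "shard": 0, "primary": True,
          "unassigned_info": {"reason": "NODE_LEFT"}}
print(explain_unassigned(sample))
# logs-2026.04/0: NODE_LEFT - wait for the node to rejoin, or restore from a snapshot
```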

The best monitoring setup for most self-managed OpenSearch teams is a combination: Prometheus for time-series metrics and graphs, and a diagnostic tool for actionable findings and OpenSearch-specific checks. The two are complementary rather than competing.

Try it free

OpenSearch Doctor detects all of this automatically

A lightweight agent runs on your server, checks 50+ things, and tells you exactly what's wrong and how to fix it. Free for 1 cluster, no credit card.

Get started free →