
How to Monitor a Self-Managed OpenSearch Cluster in 2026

You're not on a managed service. There's no built-in dashboard. The _cat APIs are cryptic. Here's a practical guide to what you should actually monitor and the tools available to do it.

April 3, 2025 · 8 min read

If you're running OpenSearch on your own infrastructure — on EC2, bare metal, on-premise, or a VPS — you already know the tooling gap. AWS OpenSearch Service gives you CloudWatch dashboards. Elastic Cloud gives you built-in monitoring. But self-managed OpenSearch gives you the _cat APIs, a JSON response, and a blank stare.

This guide covers what you should actually monitor, what tools are available in 2026, and how to build a monitoring setup that tells you about problems before your users do.

What You Actually Need to Monitor

Before picking a tool, you need to know what matters. Self-managed OpenSearch clusters fail in predictable ways. These are the metrics that give you early warning:

Cluster health — your first indicator

GET /_cluster/health returns a single status: green, yellow, or red. Green means all shards are assigned and healthy. Yellow means all primaries are assigned but some replicas aren't. Red means at least one primary shard is unassigned — data may be unavailable.

This is the first thing you should check and the first thing you should alert on. A cluster that goes red is an emergency. A cluster that stays yellow for more than a few minutes needs investigation.
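If you want to script this check, here's a minimal sketch. The field names follow the _cluster/health response; the function name and severity labels are my own, not from any library:

```python
# Classify a GET /_cluster/health response into an alert level.
# Thresholds follow the green/yellow/red semantics described above.
def health_alert(health: dict) -> str:
    """Map cluster status to an alert severity."""
    status = health.get("status")
    if status == "red":
        return "critical"   # at least one primary shard is unassigned
    if status == "yellow":
        return "warning"    # replicas unassigned; investigate if it persists
    return "ok"

# Example (truncated) response body from GET /_cluster/health
sample = {"cluster_name": "demo", "status": "yellow", "unassigned_shards": 2}
print(health_alert(sample))  # warning
```

Pair this with a `for: 1m` style debounce in your alerting layer so a brief yellow during a rolling restart doesn't page anyone.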

JVM heap usage — your stability indicator

Check nodes.*.jvm.mem.heap_used_percent from GET /_nodes/stats/jvm. Below 75%: healthy. 75–85%: watch it. Above 85%: act now. Above 90%: circuit breakers will start rejecting requests.
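The same thresholds as a check over a _nodes/stats/jvm response. The response shape mirrors the API; the node name in the sample and the alert labels are illustrative:

```python
# Flag nodes whose heap usage crosses the 75% / 85% thresholds above.
def heap_alerts(nodes_stats: dict, warn: int = 75, high: int = 85):
    """Yield (node_name, heap_pct, level) for nodes above the warn threshold."""
    for node in nodes_stats.get("nodes", {}).values():
        pct = node["jvm"]["mem"]["heap_used_percent"]
        if pct > high:
            yield node["name"], pct, "act-now"
        elif pct > warn:
            yield node["name"], pct, "watch"

sample = {"nodes": {"abc": {"name": "es-data-01",
                            "jvm": {"mem": {"heap_used_percent": 87}}}}}
print(list(heap_alerts(sample)))  # [('es-data-01', 87, 'act-now')]
```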

Disk usage — your time bomb

OpenSearch has disk watermarks built in. When disk usage on a node crosses 85% (the low watermark), it stops allocating new shards to that node. At 90% (the high watermark), it starts moving existing shards off the node. At 95% (the flood stage), it applies a read-only block (read_only_allow_delete) to every index with a shard on that node, and writes to those indices fail until disk space is freed.

Check nodes.*.fs.total.available_in_bytes and total_in_bytes from GET /_nodes/stats/fs. Alert at 80% disk usage — you want time to act before OpenSearch acts for you.

Unassigned shards — your data availability indicator

unassigned_shards from GET /_cluster/health tells you how many shards have no home. Zero is good; anything above zero warrants investigation. If any of the unassigned shards are primaries, data is currently unavailable (and the cluster status will be red).

Indexing and search throughput — your performance baseline

From GET /_nodes/stats/indices, track:

  • indices.indexing.index_total — total documents indexed (use rate of change)
  • indices.search.query_total — total search queries
  • indices.search.query_time_in_millis / query_total — average search latency (compute both as deltas between samples; the lifetime totals smooth out recent spikes)
  • thread_pool.write.rejected — indexing requests being dropped (should be 0)
  • thread_pool.search.rejected — search requests being dropped (should be 0)

Thread pool rejections are a critical signal. A non-zero rejection count means OpenSearch is actively discarding work because it's overloaded.
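These counters only make sense as rates, so the sketch below diffs two polls of GET /_nodes/stats taken a fixed interval apart. The field paths follow the API; the function and sample numbers are illustrative:

```python
# Turn two samples of the cumulative counters into per-second rates and a
# windowed average query latency.
def throughput(prev: dict, curr: dict, interval_s: float) -> dict:
    """Derive rates and average latency from two counter samples."""
    d_index = curr["index_total"] - prev["index_total"]
    d_query = curr["query_total"] - prev["query_total"]
    d_millis = curr["query_time_in_millis"] - prev["query_time_in_millis"]
    return {
        "index_rate": d_index / interval_s,             # docs indexed per second
        "search_rate": d_query / interval_s,            # queries per second
        "avg_query_ms": d_millis / d_query if d_query else 0.0,
    }

prev = {"index_total": 1000, "query_total": 200, "query_time_in_millis": 4000}
curr = {"index_total": 1600, "query_total": 260, "query_time_in_millis": 4900}
print(throughput(prev, curr, 60))
# {'index_rate': 10.0, 'search_rate': 1.0, 'avg_query_ms': 15.0}
```

Apply the same delta logic to thread_pool.write.rejected and thread_pool.search.rejected: any increase between samples means work was dropped during that window.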

Snapshot recency — your backup status

Check GET /_snapshot/_all/_all and verify that at least one snapshot in each repository has state SUCCESS within the last 24 hours. No recent successful snapshot is a silent risk that only becomes visible when you need it.

Your Options for Monitoring

There are four realistic approaches for self-managed OpenSearch. Here's an honest assessment of each.

Option 1: DIY with Prometheus + Grafana

The most flexible option. The prometheus-exporter plugin (a community-maintained plugin for OpenSearch) exposes metrics at GET /_prometheus/metrics. Prometheus scrapes it. Grafana visualises it. You build dashboards.

Pros: Full control, unlimited metrics, no external dependency, cost scales with your existing Prometheus infrastructure.

Cons: Significant setup time. You need to build dashboards from scratch or import community ones (which are never quite right). You still need to write alerting rules. You're monitoring metrics — raw numbers — not findings. "Heap is at 87%" tells you something is wrong, but not what to do about it.

Best for: Teams with an existing Prometheus stack who want to add OpenSearch to their existing monitoring setup.

Option 2: OpenSearch Dashboards with Stack Monitoring

OpenSearch Dashboards (the open-source Kibana fork) includes a monitoring section that pulls data from the .monitoring-* indices. It shows node metrics, index stats, and shard information in a pre-built UI.

Pros: Already included if you're running OpenSearch Dashboards. No additional infrastructure.

Cons: Monitoring data must be stored in the same cluster you're monitoring (a risk: if the cluster goes down, you lose monitoring too). The UI is basic. Alerting requires additional configuration. No actionable findings — just raw metrics.

Best for: Teams who already use OpenSearch Dashboards and want basic visibility with minimal setup.

Option 3: Generic APM / Observability Platforms

Tools like Datadog, New Relic, or Dynatrace have OpenSearch integrations. They collect metrics, display dashboards, and can fire alerts.

Pros: Polished dashboards. Integrates with your existing APM stack if you're already paying for one.

Cons: Expensive. Datadog with infrastructure + APM can run $30–100+/month per host. The OpenSearch integration is generic — it collects standard metrics but doesn't provide OpenSearch-specific diagnostics (ISM policy failures, shard allocation reasons, security misconfigurations). You're paying for a general-purpose tool and using only a small slice of it for OpenSearch.

Best for: Large engineering teams with existing APM contracts who want OpenSearch baked into their existing observability platform and have budget to match.

Option 4: Purpose-Built OpenSearch Diagnostics

Tools built specifically for OpenSearch — like OpenSearch Doctor — take a different approach. Instead of collecting raw metrics and letting you figure out what they mean, they run structured diagnostic checks and surface actionable findings.

The difference: "Heap usage: 87%" vs "Node es-data-01 JVM heap is at 87% — GC pressure is likely causing latency spikes. Consider adding a data node or increasing heap allocation up to 32 GB."
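The distinction is easy to show in code. Below, the same heap number is rendered once as a raw metric and once as a finding; the node name and remediation text are examples in the spirit of the quote above, not actual OpenSearch Doctor output:

```python
# Render a heap metric either as a bare number or as an actionable finding.
def heap_finding(node: str, heap_pct: int) -> str:
    if heap_pct <= 85:
        return f"{node} heap at {heap_pct}% - ok"
    return (f"Node {node} JVM heap is at {heap_pct}% - GC pressure is likely "
            f"causing latency spikes. Consider adding a data node or "
            f"increasing heap allocation up to 32 GB.")

print(f"Heap usage: 87%")                 # the raw metric
print(heap_finding("es-data-01", 87))     # the finding
```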

Pros: Actionable findings, not raw metrics. OpenSearch-specific checks (ISM, security config, snapshot health, shard allocation reasons). Faster time-to-value — no dashboard building. Alerting included.

Cons: Less raw data than Prometheus. Not suitable if you need to correlate OpenSearch metrics with application traces in the same tool.

Best for: Teams running OpenSearch as a primary workload who want diagnostic intelligence, not just metric collection.

Building a Minimal Monitoring Setup (DIY)

If you want to roll your own, here's the minimum viable setup using bash and Prometheus.

Step 1: Enable the Prometheus exporter plugin

# Check if it's already installed
GET /_cat/plugins?v&s=component

# If not, install it (requires a node restart; pick the plugin release
# that matches your OpenSearch version)
bin/opensearch-plugin install prometheus-exporter

Step 2: Scrape metrics in Prometheus

# prometheus.yml
scrape_configs:
  - job_name: opensearch
    static_configs:
      - targets: ['localhost:9200']
    metrics_path: /_prometheus/metrics
    basic_auth:
      username: admin
      password: your-password

Step 3: Minimal alerting rules

# alerts.yml
groups:
  - name: opensearch
    rules:
      - alert: OpenSearchClusterRed
        expr: opensearch_cluster_status{color="red"} == 1
        for: 1m
        labels:
          severity: critical

      - alert: OpenSearchHeapHigh
        expr: opensearch_jvm_mem_heap_used_percent > 85
        for: 5m
        labels:
          severity: warning

      - alert: OpenSearchUnassignedShards
        expr: opensearch_cluster_shards_unassigned > 0
        for: 5m
        labels:
          severity: warning

      - alert: OpenSearchDiskHigh
        expr: (1 - opensearch_fs_total_available_in_bytes / opensearch_fs_total_total_in_bytes) * 100 > 80
        for: 10m
        labels:
          severity: warning

The Gap Raw Metrics Don't Fill

Even with Prometheus and Grafana running perfectly, there are things raw metrics can't tell you:

  • Why a shard is unassigned (node left vs disk watermark vs max retry exceeded — each has a different fix)
  • Whether your ISM policies are silently failing
  • Whether anonymous access is enabled on your cluster
  • Whether your snapshot repository is actually working
  • Whether your index templates have conflicting priorities
  • Which specific indices are read-only and why

These require calling specific OpenSearch APIs and interpreting the results — which is exactly what a diagnostic tool does, and what raw metric collection doesn't.
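As a small example of that interpretation step, here's a sketch that maps the unassigned-reason code from GET /_cluster/allocation/explain to a suggested next step. The reason codes (NODE_LEFT, ALLOCATION_FAILED, and so on) are real values the API returns; the suggested fixes are illustrative shorthand, not exhaustive runbooks:

```python
# Map unassigned-shard reason codes to a suggested next step.
FIXES = {
    "NODE_LEFT": "wait for the node to rejoin, or restore from a snapshot",
    "ALLOCATION_FAILED": "check logs, then POST /_cluster/reroute?retry_failed=true",
    "INDEX_CREATED": "usually transient; investigate only if it persists",
    "REPLICA_ADDED": "usually transient; check disk watermarks if it sticks",
}

def explain_unassigned(explain: dict) -> str:
    """Turn a /_cluster/allocation/explain response into a one-line finding."""
    reason = explain.get("unassigned_info", {}).get("reason", "UNKNOWN")
    fix = FIXES.get(reason, "read the per-node deciders in the explain output")
    return f"{explain.get('index')}/{explain.get('shard')}: {reason} - {fix}"

sample = {"index": "logs-2026.04", "shard": 0, "primary": True,
          "unassigned_info": {"reason": "NODE_LEFT"}}
print(explain_unassigned(sample))
# logs-2026.04/0: NODE_LEFT - wait for the node to rejoin, or restore from a snapshot
```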

The best monitoring setup for most self-managed OpenSearch teams is a combination: Prometheus for time-series metrics and graphs, and a diagnostic tool for actionable findings and OpenSearch-specific checks. The two are complementary rather than competing.

Try it free

OpenSearch Doctor detects all of this automatically

A lightweight agent runs on your server, checks 50+ things, and tells you exactly what's wrong and how to fix it. Free for 1 cluster, no credit card.

Get started free →