OpenSearch JVM Heap Usage: When to Worry and What to Do
JVM heap pressure is quietly responsible for more OpenSearch incidents than any other single factor. It's not the most dramatic failure mode — the cluster doesn't crash immediately. Instead, performance degrades gradually, queries get slower, and then one day a node runs out of memory and the cluster starts shedding shards. By then, the heap has been climbing for weeks.
This guide explains what heap usage actually means in OpenSearch, which thresholds matter, why the 32 GB limit exists, and what levers you have when heap climbs too high.
What JVM Heap Is (and Isn't)
OpenSearch runs on the Java Virtual Machine. The JVM manages memory in a region called the heap — this is where all Java objects live: query results being assembled, field data cached for aggregations, segment metadata, filter caches, and so on.
The heap is distinct from OS memory. A node might have 64 GB of RAM, but OpenSearch is only allowed to use a fraction of that for its heap (the rest goes to the OS page cache for Lucene segment files, which OpenSearch relies on heavily). Heap is explicitly sized via the -Xms and -Xmx JVM flags, or via the OPENSEARCH_JAVA_OPTS environment variable.
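As a minimal sketch, the heap can be sized through the environment variable instead of editing jvm.options — the values here assume a 16 GB host and are illustrative:

```shell
# Heap flags via OPENSEARCH_JAVA_OPTS (read by the startup scripts).
# 8g assumes a 16 GB host: half of RAM, leaving the rest for the OS page cache.
export OPENSEARCH_JAVA_OPTS="-Xms8g -Xmx8g"
echo "$OPENSEARCH_JAVA_OPTS"
```

Setting both -Xms and -Xmx to the same value avoids heap resizing at runtime.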
The critical point: the heap size is fixed at startup. It doesn't grow dynamically. When you're at 90% heap usage, there's no slack — the JVM is working hard to free memory through garbage collection, and if it can't free enough, requests start failing.
The Thresholds That Matter
Heap usage isn't binary. It has a progression of consequences:
Below 75% — Healthy
The garbage collector runs in the background and keeps up easily. Latency is stable. This is where you want to be.
75–85% — Watch It
GC is working harder. You may see occasional pauses of 100–500ms as the collector clears objects. In most clusters, this is manageable but is a signal to investigate what's consuming heap and address it before it gets worse.
85–90% — Act Now
GC pause durations increase significantly. Stop-the-world pauses — where the JVM freezes all application threads to collect garbage — become frequent. To users, this looks like intermittent slow queries. Indexing throughput drops. Thread pools back up.
Above 90% — Circuit Breakers Activate
OpenSearch has circuit breakers that protect the cluster from OOM by refusing certain requests when heap is critically high. The parent circuit breaker trips at 95% by default and rejects any requests that would allocate additional heap. You'll see responses with status 429 and a message like Data too large, data for field exceeds limit.
Above 95% — OOM Risk
The JVM may throw an OutOfMemoryError, which typically crashes the OpenSearch process. The node goes down, its shards become unassigned, and if you had no replicas, you now have a red cluster.
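The bands above can be condensed into a small helper for dashboards or scripts — a sketch, with illustrative node names and percentages:

```python
def heap_status(heap_percent: float) -> str:
    """Map a node's heap_used_percent onto the bands described above."""
    if heap_percent < 75:
        return "healthy"
    if heap_percent < 85:
        return "watch: investigate heap consumers"
    if heap_percent < 90:
        return "act now: GC pauses likely"
    if heap_percent < 95:
        return "critical: circuit breakers may reject requests"
    return "OOM risk: node may crash"

# Example: classify nodes from a _cat/nodes-style snapshot (numbers illustrative)
nodes = {"data-1": 62.0, "data-2": 88.5, "data-3": 96.2}
for name, pct in nodes.items():
    print(f"{name}: {pct}% -> {heap_status(pct)}")
```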
How to Check Heap Usage
# Heap usage per node — quickest overview
GET /_cat/nodes?v&h=name,heap.current,heap.max,heap.percent
# Detailed stats per node
GET /_nodes/stats/jvm
# Look at: nodes.*.jvm.mem.heap_used_percent
# And: nodes.*.jvm.gc.collectors.old.collection_time_in_millis

The old GC collector time is especially important. If it's growing rapidly, you're seeing major GC events — the expensive stop-the-world kind. Track this as a rate over time, not just an absolute value.
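Turning the counter into a rate is a simple calculation over two samples of that stats field — a sketch with illustrative numbers:

```python
def old_gc_time_fraction(prev_ms: int, curr_ms: int, interval_s: float) -> float:
    """Fraction of wall-clock time spent in old-generation GC between two
    samples of nodes.*.jvm.gc.collectors.old.collection_time_in_millis."""
    return (curr_ms - prev_ms) / (interval_s * 1000.0)

# Two samples taken 60 s apart: the counter grew by 18,000 ms
frac = old_gc_time_fraction(prev_ms=120_000, curr_ms=138_000, interval_s=60)
print(f"old GC fraction: {frac:.0%}")  # 18 s of GC inside a 60 s window
if frac > 0.10:
    print("warning: >10% of time in old GC — heap pressure")
```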
The 32 GB Ceiling (and Why It Exists)
If you search for OpenSearch heap guidance, you'll see a consistent recommendation: never set heap above 32 GB. This seems counterintuitive — more heap should mean more breathing room, right?
The reason is compressed ordinary object pointers (compressed OOPs). On 64-bit JVMs, every object reference is normally an 8-byte pointer. With compressed OOPs, the JVM compresses these to 4-byte pointers, which roughly halves the memory overhead of object references throughout the heap. This is a significant optimization — a 30 GB heap with compressed OOPs can often hold more live objects than a 34 GB heap without them.
Compressed OOPs are enabled automatically when heap is below approximately 32 GB (the exact threshold depends on the JVM version and OS page size, but 30.5 GB is the safe upper bound). Above that threshold, the JVM switches to uncompressed pointers, and heap efficiency drops dramatically — you often end up with worse performance at 34 GB than you had at 30 GB.
Practical rule: set heap to the lesser of half the available RAM or 30.5 GB. If you need more capacity, add nodes — don't increase heap beyond 32 GB.
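That rule is a one-liner — a sketch, with the 30.5 GB compressed-OOPs ceiling hard-coded:

```python
def recommended_heap_gb(ram_gb: float) -> float:
    """Lesser of half the RAM or 30.5 GB (the compressed-OOPs-safe ceiling)."""
    return min(ram_gb / 2, 30.5)

for ram in (16, 64, 128):
    print(f"{ram} GB RAM -> -Xms{recommended_heap_gb(ram)}g -Xmx{recommended_heap_gb(ram)}g")
```

Note how the 128 GB host gets the same heap as the 64 GB host — the extra RAM still helps, but via the OS page cache, not the heap.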
# In jvm.options or OPENSEARCH_JAVA_OPTS
# For a node with 64GB RAM:
-Xms30g
-Xmx30g
# Always set Xms = Xmx to avoid heap resizing at runtime
# Never exceed ~30.5g

What Actually Consumes Heap
When heap is high, the question is: what's in there? The main consumers in a typical OpenSearch cluster:
Field data cache
When you sort or aggregate on a text field (rather than a keyword field), OpenSearch loads the full field values into heap as an uninverted index — called fielddata. This can be enormous. A text field with 50 million documents can consume several gigabytes.
The fix: use keyword fields for sorting and aggregation, not text fields. If you need both search and aggregation, use a multi-field mapping with a keyword subfield.
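A multi-field mapping of that kind might look like this — the index and field names are illustrative:

```
PUT /logs-example
{
  "mappings": {
    "properties": {
      "service": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}
```

Full-text queries go against service; sorts and aggregations go against service.keyword, which uses doc values on disk instead of fielddata in heap.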
Shard request cache and query cache
OpenSearch caches the results of frequently run queries. These caches are bounded, but they're held in heap. Check their sizes:
GET /_nodes/stats/indices/query_cache,request_cache
# Look at: nodes.*.indices.query_cache.memory_size_in_bytes
# And: nodes.*.indices.request_cache.memory_size_in_bytes

Segment metadata
Every Lucene segment on a node has metadata held in heap: term dictionaries, stored field indexes, doc value metadata. This is roughly proportional to the number of segments — which is why having too many small segments (or too many indices) can cause heap pressure even when the actual data volume is modest.
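To see how segment counts stack up per index, the cat segments API works — note that on recent Lucene versions much of this metadata has moved off-heap, so the size.memory column may read low or zero there:

```
GET /_cat/segments?v&h=index,segment,docs.count,size,size.memory&s=size.memory:desc
```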
Aggregation buffers
Deep or high-cardinality aggregations (like a terms aggregation on a field with millions of unique values) can allocate large buffers temporarily during query execution. These don't show up in the cache stats but do spike heap usage.
What to Do When Heap Is High
There's no single fix — it depends on which consumer is responsible. Work through these in order:
1. Check for fielddata abuse
GET /_nodes/stats/indices/fielddata
# nodes.*.indices.fielddata.memory_size_in_bytes
# See which indices are responsible
GET /_cat/fielddata?v&s=size:desc

If fielddata is large, identify which fields are driving it and switch them to keyword type, or set fielddata: false on the text field and migrate aggregations to the keyword subfield.
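For immediate relief while you fix the mappings, the fielddata cache can be cleared — this only buys time, since it refills on the next offending aggregation (index name illustrative):

```
# Clear fielddata for one index
POST /my-index/_cache/clear?fielddata=true
# Or across all indices
POST /_cache/clear?fielddata=true
```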
2. Force-merge old indices
Historical indices that are no longer being written to accumulate segments over time. Merging them into fewer, larger segments reduces the per-segment metadata overhead in heap.
# Force-merge read-only historical indices to 1 segment
# Run during quiet periods — this is CPU- and I/O-intensive
POST /my-old-index/_forcemerge?max_num_segments=1

3. Reduce shard count
Too many shards means too many Lucene instances, which means more per-segment heap overhead. If you have thousands of shards under 1 GB each, you're paying a disproportionate heap cost. Consolidate small indices or increase the rollover size threshold in your ISM policies.
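In an ISM policy, the rollover threshold lives in the hot state's actions. A fragment might look like this — thresholds are illustrative, and the exact values should match your ingest rate:

```
# Fragment of an ISM policy's hot state: roll over at 30 GB or 7 days,
# whichever comes first, instead of producing many tiny indices
"actions": [
  {
    "rollover": {
      "min_size": "30gb",
      "min_index_age": "7d"
    }
  }
]
```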
4. Increase heap (carefully)
If you're well below 30 GB, increasing heap allocation is a valid lever. Increase -Xms and -Xmx together, restart the node (rolling restart so the cluster stays up), and monitor the result.
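A common rolling-restart pattern uses the standard allocation setting so the cluster doesn't shuffle shards while each node is down — the restart step itself depends on your service manager:

```
# Before restarting a node: keep primaries allocated, pause replica rebalancing
PUT /_cluster/settings
{
  "persistent": { "cluster.routing.allocation.enable": "primaries" }
}

# ...restart the node with the new -Xms/-Xmx, wait for it to rejoin...

# Re-enable full allocation and wait for green before the next node
PUT /_cluster/settings
{
  "persistent": { "cluster.routing.allocation.enable": null }
}
```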
5. Add data nodes
If you're already near 30 GB heap per node, the only option is horizontal scaling. More nodes = fewer shards per node = less heap pressure per node. This is the correct long-term solution for data volume growth.
Alerting on Heap
Don't wait until heap is critical. Alert early:
# Prometheus alerting rule
- alert: OpenSearchHeapHigh
  expr: opensearch_jvm_mem_heap_used_percent > 80
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "OpenSearch heap usage above 80% on {{ $labels.node }}"

- alert: OpenSearchHeapCritical
  expr: opensearch_jvm_mem_heap_used_percent > 90
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "OpenSearch heap critical ({{ $value }}%) on {{ $labels.node }}"

# Also alert on GC duration — a better signal than raw heap %
- alert: OpenSearchGCPressure
  expr: rate(opensearch_jvm_gc_collection_time_seconds_total{gc="old"}[5m]) > 0.3
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "OpenSearch old GC consuming >30% of time on {{ $labels.node }}"

The GC Log: Your Debugging Companion
When heap is high, OpenSearch's GC logs are invaluable. By default, OpenSearch enables GC logging to logs/gc.log. Look for lines containing Pause Full (stop-the-world events) — their frequency and duration tell you how stressed the GC is.
# Tail the GC log on the affected node
tail -f /var/log/opensearch/gc.log | grep "Pause Full"
# A healthy cluster: rare or no Pause Full events
# A stressed cluster: Pause Full every few seconds, each lasting 1-10+ seconds

Frequent stop-the-world pauses of more than a second mean the GC is failing to keep up. This will manifest as query timeouts and indexing backpressure before it progresses to OOM.