Optimizing Segment Size for Historical Nodes

Engineers see it first as unpredictable query latency and periodic OutOfMemoryError crashes on Historical nodes during segment load — the symptom of segments whose on-disk footprint has drifted outside the operational band a Historical can cache efficiently. Historical nodes are Apache Druid's primary query execution layer, and they memory-map every segment they serve, so segment size directly dictates heap allocation, garbage-collection behaviour, and Broker routing efficiency. When footprints stray outside the 500–750 MB sweet spot, resource contention compounds fast. This page applies the encoding and partitioning theory from its parent, columnar storage formats in Druid, to the concrete problem of keeping published segments loadable — the sizing formulas there govern the row targets you set here.

Failure Modes & Diagnostics

Two size failures dominate, and they present in opposite directions. Oversized segments (>1.5 GB) trigger JVM heap exhaustion during the load phase, causing prolonged stop-the-world GC pauses and OutOfMemoryError: Java heap space during handoff. Undersized segments (<200 MB) fragment the columnar storage layout, inflating metadata in the relational metadata store and degrading the Broker-to-Historical discovery path documented under query routing and segment discovery. Isolate the condition before it cascades into query timeouts or node evictions:

# Inspect segment size distribution per datasource via the Coordinator API
curl -s "http://<coordinator-host>:8081/druid/coordinator/v1/datasources/<datasource>/segments?full=true" | \
  jq '.[] | {id: .identifier, size_mb: (.size / 1048576 | floor)}'

# Watch Historical JVM heap / GC pressure while segments load
jstat -gcutil <historical_pid> 1000 5

# Count segments currently loaded on a Historical node (queried through the Coordinator; server name is host:port)
curl -s "http://<coordinator-host>:8081/druid/coordinator/v1/servers/<historical-host>:8083/segments" | jq 'length'

A sustained FGCT climb in jstat output during a load window confirms heap pressure from oversized segments; a datasource whose jq distribution is dominated by sub-200 MB entries confirms fragmentation. The temporal driver behind both is often the segment granularity settings — a granularity too coarse for event velocity produces giant segments, too fine produces a swarm of tiny ones.
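The same triage can be scripted rather than read off the jq output. The sketch below is a minimal illustration, assuming the Coordinator response has already been parsed into a list of dicts carrying a `size` field in bytes; the function name `classify_segments` and the sample records are hypothetical, and the thresholds simply mirror the >1.5 GB and <200 MB bands described above.

```python
from collections import Counter

MB = 1_048_576  # bytes per MB, matching the jq conversion used above


def classify_segments(segments, undersized_mb=200, oversized_mb=1536):
    """Count segments per size band; only the `size` field (bytes) is read."""
    counts = Counter()
    for seg in segments:
        size_mb = seg["size"] / MB
        if size_mb < undersized_mb:
            counts["undersized"] += 1
        elif size_mb > oversized_mb:
            counts["oversized"] += 1
        else:
            counts["ok"] += 1
    return dict(counts)


# Hypothetical sample shaped like the Coordinator `?full=true` response
sample = [
    {"identifier": "seg-a", "size": 120 * MB},
    {"identifier": "seg-b", "size": 640 * MB},
    {"identifier": "seg-c", "size": 1800 * MB},
]
print(classify_segments(sample))
# → {'undersized': 1, 'ok': 1, 'oversized': 1}
```

A datasource dominated by the `undersized` count points at fragmentation, while any `oversized` entries are candidates for the heap-pressure check with jstat.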

Target Spec & Validated JSON

Deterministic partitioning requires calibrating maxRowsPerSegment against empirical row sizes and compression ratios. The heuristic below gives a starting row target, but it must be validated post-ingestion because dictionary encoding and bitmap compression can shift the real footprint 30–60% from a raw estimate:

$$ \text{targetRows} \approx \frac{\text{targetMB} \times 1{,}048{,}576}{\text{avgRowBytes}} $$
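As a worked instance of the heuristic, the sketch below evaluates it for a 625 MB target (the midpoint of the 500–750 MB band) and an assumed average of 130 bytes per row on disk. Both numbers are illustrative assumptions, not measurements, and `target_rows` is a hypothetical helper name.

```python
def target_rows(target_mb, avg_row_bytes):
    """Starting row target: targetMB * 1,048,576 / avgRowBytes, rounded down."""
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    return int(target_mb * 1_048_576 / avg_row_bytes)


# Assumed inputs: 625 MB target, 130 bytes per row
print(target_rows(625, 130))
# → 5041230
```

Under those assumptions the result lands close to 5,000,000 rows, which is only a first estimate until the published segment sizes are measured.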

For parallel batch ingestion, the tuningConfig enforces strict boundaries so partitioning cannot run away. This spec is copy-ready against a recent stable Druid release and includes every required top-level key:

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "events_raw",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "granularitySpec": { "type": "uniform", "segmentGranularity": "DAY" }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "s3", "prefixes": ["s3://analytics-bucket/raw/"] },
      "inputFormat": { "type": "parquet" }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic",
        "maxRowsPerSegment": 5000000,
        "maxTotalRows": 20000000
      },
      "maxRowsInMemory": 1000000,
      "forceGuaranteedRollup": false,
      "logParseExceptions": true,
      "maxParseExceptions": 0
    }
  }
}

After ingestion, audit segment sizes via the Coordinator API. If the average compressed size deviates from the 500–750 MB target, adjust maxRowsPerSegment incrementally, and never change it mid-pipeline without a corresponding automated compaction scheduling pass to rewrite the already-published intervals. The dynamic partitioning here optimizes for ingestion throughput; when you need size determinism over speed, switch to a hashed or range spec with targetRowsPerSegment, which also requires setting forceGuaranteedRollup to true.
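One way to make the incremental adjustment concrete is to scale the current row limit by the ratio of target size to observed size, with a cap on how far a single pass may move it. The sketch below assumes segment size grows roughly linearly with row count, which is an approximation; `recalibrate_max_rows`, the 625 MB default, and the 25% step cap are all hypothetical choices for illustration.

```python
def recalibrate_max_rows(current_max_rows, observed_avg_mb,
                         target_mb=625, max_step=0.25):
    """Scale maxRowsPerSegment toward the target size, limited to
    +/- max_step per pass so each adjustment stays incremental."""
    if observed_avg_mb <= 0:
        raise ValueError("observed_avg_mb must be positive")
    ratio = target_mb / observed_avg_mb
    # Clamp the correction so one pass never changes the limit by more than max_step
    ratio = max(1 - max_step, min(1 + max_step, ratio))
    return int(current_max_rows * ratio)


# Segments averaged 900 MB at 5,000,000 rows: the raw ratio (~0.69) is clamped to 0.75
print(recalibrate_max_rows(5_000_000, 900))
# → 3750000
```

Re-measure after each pass; because dictionary encoding and bitmap compression do not scale linearly, the second measurement is what confirms whether the new limit is right.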

Python Automation Script

Pipelines must automate compaction to correct drift without manual intervention. The orchestrator below submits a compaction task, applies exponential backoff for API rate limits and transient Overlord errors, and polls to a terminal state. It uses only the standard library plus requests, and slots directly into a CI/CD gate that blocks a deploy when Historical segment sizes exceed threshold.

import time
import logging
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("druid_compaction")


class DruidCompactionOrchestrator:
    def __init__(self, overlord_url, datasource, interval, auth=None):
        self.base_url = overlord_url.rstrip("/")
        self.datasource = datasource
        self.interval = interval
        self.session = requests.Session()
        self.session.auth = auth
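        # Transport-level retries with backoff. urllib3's default allowed_methods
        # excludes POST, so status-based retries apply to the polling GETs only;
        # task submission is not replayed on 429/5xx, which avoids duplicate tasks.
        retry = Retry(total=3, backoff_factor=1.5,
                      status_forcelist=[429, 500, 502, 503, 504])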
        self.session.mount("http://", HTTPAdapter(max_retries=retry))
        self.session.mount("https://", HTTPAdapter(max_retries=retry))

    def submit_compaction(self):
        payload = {
            "type": "compact",
            "dataSource": self.datasource,
            "ioConfig": {
                "type": "compact",
                "inputSpec": {"type": "interval", "interval": self.interval},
            },
            "tuningConfig": {
                "type": "index_parallel",
                "partitionsSpec": {
                    "type": "dynamic",
                    "maxRowsPerSegment": 5000000,
                    "maxTotalRows": 20000000,
                },
                "maxRowsInMemory": 1000000,
            },
        }
        resp = self.session.post(
            f"{self.base_url}/druid/indexer/v1/task", json=payload, timeout=30
        )
        resp.raise_for_status()
        task_id = resp.json()["task"]
        logger.info("Compaction task submitted: %s", task_id)
        return task_id

    def poll_until_terminal(self, task_id, base_delay=5, max_delay=60, max_wait=3600):
        start = time.time()
        delay = base_delay
        while time.time() - start < max_wait:
            resp = self.session.get(
                f"{self.base_url}/druid/indexer/v1/task/{task_id}/status", timeout=10
            )
            resp.raise_for_status()
            status = resp.json().get("status", {}).get("status")
            if status in ("SUCCESS", "FAILED", "INTERRUPTED"):
                logger.info("Task %s terminal state: %s", task_id, status)
                return status
            logger.debug("Task %s pending (%s); sleeping %ss", task_id, status, delay)
            time.sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
        raise TimeoutError(f"Compaction task {task_id} exceeded {max_wait}s")


# Usage
# orch = DruidCompactionOrchestrator("http://overlord:8090", "events_raw",
#                                    "2024-01-01/2024-01-02")
# if orch.poll_until_terminal(orch.submit_compaction()) != "SUCCESS":
#     raise SystemExit("Compaction failed; segment sizes not corrected")

Because the usage pattern above exits non-zero on any terminal state other than SUCCESS, a failed compaction never silently passes the gate and leaves mis-sized segments on a Historical node. The compaction spec can be templated across datasources and environments and composed into a broader recovery routine.

Verification Steps

Confirm the fix after compaction completes. First re-check the size distribution — every published segment for the interval should now fall inside the target band:

curl -s "http://<coordinator-host>:8081/druid/coordinator/v1/datasources/events_raw/segments?full=true" \
  | jq '[.[] | (.size / 1048576 | floor)] | {min: min, max: max, count: length}'

Expected output shows the spread contained within roughly 500–750 MB:

{
  "min": 512,
  "max": 731,
  "count": 24
}

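If an interval's older, oversized segments are still being served, mark them unused so the Coordinator stops loading them. Segments overshadowed by a successful compaction are normally retired by the Coordinator on its own; when you do retire segments manually, pass their IDs, because an interval payload marks every segment in that interval unused, including the freshly compacted replacements:

# Retire specific stale segments by ID so the replacements take over
curl -X POST "http://<coordinator-host>:8081/druid/coordinator/v1/datasources/events_raw/markUnused" \
  -H "Content-Type: application/json" \
  -d '{"segmentIds": ["<stale-segment-id>"]}'

# Inspect the Coordinator load queue (read-only) to confirm loads and drops are draining
curl -s "http://<coordinator-host>:8081/druid/coordinator/v1/loadqueue?simple" | jq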

Finally, confirm the Historical is healthy under the new set — a stable jstat heap profile with no full-GC storm during the load window means the segments now cache cleanly:

jstat -gcutil <historical_pid> 1000 5
# FGC column should stay flat (no repeated full collections) as segments load

Enforce these as validation gates in the pipeline: reject ingestion when more than 15% of segments exceed 1.2 GB, keep Historical druid.processing.buffer.sizeBytes and druid.server.http.maxQueuedBytes sized to survive concurrent loads, and halt downstream Broker routing updates if a compaction pass produces segments outside the target band. Reference the Druid compaction documentation for maxRowsPerSegment overrides and Coordinator balancing intervals.
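The first gate, rejecting when more than 15% of segments exceed 1.2 GB, can be expressed as a small predicate over the segment sizes returned by the Coordinator. This is a minimal sketch: `gate_passes` is a hypothetical name, sizes are assumed to be in bytes, 1.2 GB is interpreted as 1.2 × 1024³ bytes, and an empty list is treated as passing.

```python
MB = 1_048_576
LIMIT_BYTES = int(1.2 * 1024**3)  # 1.2 GB, interpreted in binary units (assumption)


def gate_passes(sizes_bytes, limit_bytes=LIMIT_BYTES, max_fraction=0.15):
    """Return False when more than max_fraction of segments exceed limit_bytes."""
    if not sizes_bytes:
        return True  # nothing published; treated as passing in this sketch
    oversized = sum(1 for size in sizes_bytes if size > limit_bytes)
    return oversized / len(sizes_bytes) <= max_fraction


# Hypothetical distributions: 2 of 20 oversized (10%) vs 4 of 20 oversized (20%)
print(gate_passes([600 * MB] * 18 + [1500 * MB] * 2))
# → True
print(gate_passes([600 * MB] * 16 + [1500 * MB] * 4))
# → False
```

In a pipeline, a False result would translate into a non-zero exit code so the deploy step is blocked.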

Related

Columnar Storage Formats in Druid: parent reference covering the encoding, indexing, and compression internals whose sizing formulas this page applies.
Reducing Historical Node Storage Costs: the cost angle on segment sprawl, retention desync, and deep-storage footprint.
How Druid Segments Map to Time Intervals: how segmentGranularity sets the row count that ultimately drives segment size.
