Reducing Historical Node Storage Costs in Apache Druid

Engineers open this page when the deep-storage bill and Historical node fleet keep growing faster than ingested data volume: object-store usage climbs, sys.segments returns hundreds of thousands of rows, and Coordinator run-loop times drift upward even though query traffic is flat. The root cause is almost always the same three-way compounding — segment proliferation, misaligned rollup thresholds, and desynchronized retention — each of which is a symptom of segment size drift. This page is a focused runbook under segment size optimization strategies: it gives you the diagnostic one-liners to locate the waste, a validated compact task spec to correct it, an idempotent Python orchestrator to run it safely, and the exact verification commands that confirm storage was reclaimed.

Failure Modes & Diagnostics

Storage bloat manifests through three deterministic failure modes that directly inflate deep-storage I/O and Coordinator memory footprints. Each has a shell one-liner that isolates it against the Coordinator or Overlord REST API.

Segment sprawl and metadata overhead: Ingestion pipelines that write to deep storage without partition alignment generate sub-50 MB segments. Each segment costs the Coordinator a metadata entry regardless of how little data it holds, so thousands of tiny files bloat the Coordinator metadata cache and degrade query planning. Because columnar compression couples bytes and row counts only loosely, filter on segment size in bytes rather than on row counts. Count the offenders directly:
```
curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/events_stream/segments?full=true" \
  | jq '[.[] | select(.size < 52428800)] | length'
```
Or aggregate across datasources through the SQL API to find who owns the sprawl:
```
curl -s -H 'Content-Type: application/json' -XPOST http://broker:8082/druid/v2/sql \
  -d '{"query":"SELECT datasource, COUNT(*) c FROM sys.segments WHERE is_active = 1 GROUP BY 1 HAVING COUNT(*) > 10000 ORDER BY 2 DESC"}' | jq .
```
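
To decide which interval to compact first, the same segment listing can be grouped by day. The following is a minimal sketch, not part of the original runbook: it assumes each entry returned by the `?full=true` endpoint carries a `size` field (as the jq filter above relies on) and an `interval` field of the form `start/end`. The function name and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

SMALL_SEGMENT_BYTES = 50 * 1024 * 1024  # 52428800, same threshold as the jq filter


def undersized_by_day(segments: Iterable[dict]) -> Dict[str, int]:
    """Count segments below the threshold, keyed by the date they start on."""
    counts: Counter = Counter()
    for seg in segments:
        if seg["size"] < SMALL_SEGMENT_BYTES:
            start = seg["interval"].split("/")[0]
            counts[start[:10]] += 1  # "YYYY-MM-DD" prefix of the ISO start
    # Worst days first, so they can be queued for compaction first.
    return dict(counts.most_common())


# Hypothetical sample standing in for the parsed JSON response.
sample = [
    {"interval": "2024-01-01T00:00:00.000Z/2024-01-01T01:00:00.000Z", "size": 4_000_000},
    {"interval": "2024-01-01T01:00:00.000Z/2024-01-01T02:00:00.000Z", "size": 7_500_000},
    {"interval": "2024-01-02T00:00:00.000Z/2024-01-02T01:00:00.000Z", "size": 9_000_000},
    {"interval": "2024-01-02T01:00:00.000Z/2024-01-02T02:00:00.000Z", "size": 400_000_000},
]
print(undersized_by_day(sample))  # {'2024-01-01': 2, '2024-01-02': 1}
```

Feeding it the output of the first curl command (without the jq stage) gives a per-day ranking that maps directly onto DAY-granular compaction intervals.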
Compaction lock contention: Overlapping intervals or concurrent compact tasks targeting identical partition boundaries trigger SegmentLock timeouts, leaving intermediate segments stranded in deep storage without ever serving queries. Enumerate running compaction tasks; two or more on the same datasource are the candidates to inspect for shared intervals:
```
curl -s "http://overlord:8090/druid/indexer/v1/tasks?type=compact&state=running" \
  | jq -r '.[] | "\(.id)\t\(.dataSource)"'
```
The fix is interval-aware task queuing that serializes writes on the same time chunk — the deterministic-ID pattern in the Python script below enforces exactly that. For the row and concurrency knobs that decide how aggressively merges fire, see compaction threshold tuning.
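
The overlap check behind interval-aware queuing can be sketched as follows. This is an illustration under stated assumptions rather than a Druid API: the caller is assumed to track the intervals of its own in-flight compact tasks per datasource, intervals are assumed to use explicit `start/end` timestamps in one consistent format (ISO period syntax such as `P1D` is not handled), and the helper names are hypothetical.

```python
from datetime import datetime
from typing import Iterable, Tuple

Interval = Tuple[datetime, datetime]


def parse_interval(text: str) -> Interval:
    """Parse 'start/end' with explicit ISO 8601 timestamps."""
    start_s, end_s = text.split("/")
    start = datetime.fromisoformat(start_s.replace("Z", "+00:00"))
    end = datetime.fromisoformat(end_s.replace("Z", "+00:00"))
    if start >= end:
        raise ValueError(f"empty or inverted interval: {text}")
    return start, end


def overlaps(a: Interval, b: Interval) -> bool:
    """Half-open [start, end) test; adjacent intervals do not overlap."""
    return a[0] < b[1] and b[0] < a[1]


def safe_to_submit(candidate: str, in_flight: Iterable[str]) -> bool:
    """True if the candidate shares no time with any in-flight interval."""
    cand = parse_interval(candidate)
    return not any(overlaps(cand, parse_interval(i)) for i in in_flight)


running = ["2024-01-01T00:00:00.000Z/2024-01-02T00:00:00.000Z"]
print(safe_to_submit("2024-01-02T00:00:00.000Z/2024-01-03T00:00:00.000Z", running))  # True
print(safe_to_submit("2024-01-01T12:00:00.000Z/2024-01-03T00:00:00.000Z", running))  # False
```

A scheduler would hold back any candidate for which `safe_to_submit` returns False until the conflicting task reaches a terminal state.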
Retention desync: Coordinator drop rules applied after ingestion cause temporary storage spikes until the next duty cycle, and any mismatch between ingestion segmentGranularity and the dropByPeriod boundary leaves orphaned intervals that bypass expiration entirely. Fetch the active rules, then compare their periods against the intervals the datasource actually holds:
```
curl -s "http://coordinator:8081/druid/coordinator/v1/rules/events_stream" | jq .
```
Aligning those windows is the province of TTL mapping and data expiration; this page assumes the rules exist and focuses on the compaction and recovery half of the cost equation.

Target Spec & Validated JSON

Compaction must be idempotent, bounded, and aligned to the underlying object-storage read block size. Target compressed segments in the 256 MB–512 MB band for cold tiers and 512 MB–1 GB for hot tiers: below that range multipart-upload HTTP overhead dominates, above it Historical heap fragmentation appears during segment load. Because Druid enforces the boundary in rows (the byte-based targetCompactionSizeBytes was removed in Druid 0.21), derive the row target from measured average row size:

$$\text{maxRowsPerSegment} \approx \frac{\text{targetMB} \times 1048576}{\text{avgRowBytes}}$$
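
As a worked instance of the formula, the sketch below derives the row target from a measured average row size. It assumes the average is taken as total segment bytes divided by total rows (for example `SUM(size) / SUM(num_rows)` over `sys.segments`); the function names and the 40 GiB / 800 million row measurement are hypothetical.

```python
def avg_row_bytes(total_size_bytes: int, total_rows: int) -> float:
    """Average stored bytes per row across the measured segments."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return total_size_bytes / total_rows


def max_rows_per_segment(target_mb: float, row_bytes: float) -> int:
    """Row target for a segment of roughly target_mb mebibytes."""
    if target_mb <= 0 or row_bytes <= 0:
        raise ValueError("target_mb and row_bytes must be positive")
    return round(target_mb * 1_048_576 / row_bytes)


# Hypothetical measurement: 40 GiB stored across 800 million rows.
measured = avg_row_bytes(40 * 1024**3, 800_000_000)
print(round(measured, 2))                    # 53.69
print(max_rows_per_segment(512, measured))   # 10000000
```

With about 53.7 bytes per row, a 512 MB target works out to 10 million rows per segment.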

The following compact task solves the sprawl case for one day of events_stream, rewriting undersized fragments into DAY-granular segments sized by row count:

```
{
  "type": "compact",
  "dataSource": "events_stream",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2024-01-01T00:00:00.000Z/2024-01-02T00:00:00.000Z"
    }
  },
  "granularitySpec": { "segmentGranularity": "DAY" },
  "tuningConfig": {
    "type": "index_parallel",
    "partitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 10000000,
      "maxTotalRows": 20000000
    }
  }
}
```

Field notes:

maxRowsPerSegment (in partitionsSpec) is the effective output-size control. Keep it at or below ~15 M rows for typical OLAP workloads so index merging does not spike Historical heap; feed the value from the formula above rather than guessing in bytes.
segmentGranularity (in granularitySpec) must match the ingestion granularity governed by the segment granularity settings — a mismatch produces cross-day boundary splits that quietly bypass the compaction window. The full DataSourceCompactionConfig grammar for scheduling this automatically is covered in configuring Druid native compaction rules.
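
The two field notes can be checked mechanically before a spec is submitted. The following is a hedged sketch of such a pre-flight check, not a Druid feature: the function name, the default row cap of 15 M (taken from the note above), and the rule that DAY granularity requires midnight-aligned bounds are assumptions made for illustration. Timestamps are assumed to carry `Z` or an explicit offset, as in the spec above.

```python
from datetime import datetime, time
from typing import List


def _parse(ts: str) -> datetime:
    # Assumes "Z" or an explicit UTC offset, as in the example spec.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def validate_compact_spec(spec: dict, row_cap: int = 15_000_000) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems: List[str] = []
    if spec.get("type") != "compact":
        problems.append("type must be 'compact'")

    interval = spec.get("ioConfig", {}).get("inputSpec", {}).get("interval", "")
    try:
        start_s, end_s = interval.split("/")
        start, end = _parse(start_s), _parse(end_s)
    except ValueError:
        problems.append(f"unparseable interval: {interval!r}")
    else:
        if start >= end:
            problems.append("interval start must precede its end")
        granularity = spec.get("granularitySpec", {}).get("segmentGranularity")
        if granularity == "DAY" and (start.time() != time(0) or end.time() != time(0)):
            problems.append("DAY granularity needs midnight-aligned interval bounds")

    rows = spec.get("tuningConfig", {}).get("partitionsSpec", {}).get("maxRowsPerSegment")
    if not isinstance(rows, int) or not 0 < rows <= row_cap:
        problems.append(f"maxRowsPerSegment must be an integer in (0, {row_cap}]")
    return problems


spec = {
    "type": "compact",
    "dataSource": "events_stream",
    "ioConfig": {"type": "compact", "inputSpec": {
        "type": "interval",
        "interval": "2024-01-01T00:00:00.000Z/2024-01-02T00:00:00.000Z"}},
    "granularitySpec": {"segmentGranularity": "DAY"},
    "tuningConfig": {"type": "index_parallel", "partitionsSpec": {
        "type": "dynamic", "maxRowsPerSegment": 10_000_000}},
}
print(validate_compact_spec(spec))  # []
```

Run in a pipeline, a non-empty result would block the submission before the task ever reaches the Overlord.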

Python Automation Script

Automated compaction needs stateful orchestration to prevent duplicate submission and to survive transient Overlord failures. This script derives a deterministic task ID from the datasource and interval, so a retried submission reuses the same ID instead of forking a duplicate. It retries failed submissions with exponential backoff and then polls the task status at a fixed interval. It uses only the standard library plus requests.

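```
import hashlib
import time

import requests


class DruidCompactionOrchestrator:
    def __init__(self, overlord_url: str, timeout: int = 30):
        self.overlord_url = overlord_url.rstrip("/")
        self.timeout = timeout
        self.session = requests.Session()
        self.session.headers.update({"Content-Type": "application/json"})

    def _task_id(self, datasource: str, interval: str) -> str:
        """Deterministic ID so retries are idempotent, not duplicative."""
        # MD5 serves only as a stable fingerprint here, not as a security control.
        digest = hashlib.md5(f"{datasource}_{interval}".encode()).hexdigest()
        return f"compact_{digest[:12]}"

    def _task_exists(self, task_id: str) -> bool:
        """True if the Overlord already reports a status for this task ID."""
        url = f"{self.overlord_url}/druid/indexer/v1/task/{task_id}/status"
        return self.session.get(url, timeout=self.timeout).status_code == 200

    def submit(self, spec: dict, max_attempts: int = 5) -> str:
        spec = dict(spec)  # shallow copy so the caller's spec is not mutated
        spec["id"] = self._task_id(
            spec["dataSource"], spec["ioConfig"]["inputSpec"]["interval"]
        )
        url = f"{self.overlord_url}/druid/indexer/v1/task"
        delay = 1.0
        for attempt in range(max_attempts):
            try:
                resp = self.session.post(url, json=spec, timeout=self.timeout)
                # A client error may mean an earlier attempt already registered
                # this ID; if the task is known, treat the submission as done.
                if 400 <= resp.status_code < 500 and self._task_exists(spec["id"]):
                    return spec["id"]
                resp.raise_for_status()
                return spec["id"]
            except requests.RequestException:
                if attempt == max_attempts - 1:
                    raise
                time.sleep(delay)
                delay = min(delay * 2, 30.0)  # exponential backoff, capped
        raise ValueError("max_attempts must be at least 1")

    def wait_for_completion(
        self, task_id: str, poll_interval: int = 15, max_wait: float = 3600.0
    ) -> str:
        """Block until the task reaches a terminal state; return final status."""
        url = f"{self.overlord_url}/druid/indexer/v1/task/{task_id}/status"
        deadline = time.monotonic() + max_wait  # avoid polling forever
        while time.monotonic() < deadline:
            resp = self.session.get(url, timeout=self.timeout)
            resp.raise_for_status()
            status = resp.json()["status"]["status"]
            if status in ("SUCCESS", "FAILED"):
                return status
            time.sleep(poll_interval)
        raise TimeoutError(f"task {task_id} not finished after {max_wait} seconds")
```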

Because the ID is a hash of (datasource, interval), a network partition mid-submission is safe: a re-executed submission carries the same ID, so the Overlord either accepts it once or rejects it as already registered, and a second task for the same interval is never forked. The caller can then poll the existing task under that ID. This is the same serialization that prevents the lock contention in failure mode 2. For how this fits a full scheduled pipeline rather than a one-off run, see automated compaction task scheduling.

Verification Steps

After the compact task reports SUCCESS, confirm the storage win rather than assuming it. First, re-count undersized segments for the datasource — the number should collapse toward zero:

```
curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/events_stream/segments?full=true" \
  | jq '[.[] | select(.size < 52428800)] | length'
```

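Then confirm that the datasource now holds a handful of large segments instead of thousands of small ones. The query covers every active segment of the datasource; add a filter on the segment start and end columns to scope it to the rewritten interval: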

```
curl -s -H 'Content-Type: application/json' -XPOST http://broker:8082/druid/v2/sql \
  -d '{"query":"SELECT COUNT(*) segs, AVG(size) avg_bytes FROM sys.segments WHERE datasource='"'"'events_stream'"'"' AND is_active = 1"}' | jq .
```
Example response:
```
[
  { "segs": 6, "avg_bytes": 486539264 }
]
```

If a compaction failed mid-flight, intermediate segments can linger in deep storage while unregistered in the metadata store. Reclaim that space rather than paying for it:

Identify orphaned intervals — query sys.segments for is_published = 0 AND is_realtime = 0, then reconcile against deep-storage prefixes (aws s3 ls s3://druid-deep-storage/events_stream/...).
Reclaim physical bytes — marking a segment unused via POST /druid/coordinator/v1/datasources/events_stream/markUnused only flips metadata state; it does not delete data. Follow it with a kill task submitted to the Overlord for the affected unused interval to physically remove the files from deep storage.
Re-verify — always re-run the sys.segments count above before and after a kill task to prove data was reclaimed and no still-queried interval was touched.
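
The two-step reclaim can be laid out as an explicit request plan before anything is sent. This is a minimal sketch under stated assumptions: it only builds the requests and sends nothing, the host names are the ones used in the commands on this page, the markUnused body is assumed to accept an `interval` field, and `reclaim_plan` is a hypothetical helper. A kill task deletes data permanently, so the interval should be confirmed unused first.

```python
import json
from typing import List, Tuple

COORDINATOR = "http://coordinator:8081"  # host names as used on this page
OVERLORD = "http://overlord:8090"

Step = Tuple[str, str, dict]  # (HTTP method, URL, JSON body)


def reclaim_plan(datasource: str, interval: str) -> List[Step]:
    """Ordered requests: flip metadata state first, then delete the files."""
    mark_unused: Step = (
        "POST",
        f"{COORDINATOR}/druid/coordinator/v1/datasources/{datasource}/markUnused",
        {"interval": interval},
    )
    kill: Step = (
        "POST",
        f"{OVERLORD}/druid/indexer/v1/task",
        {"type": "kill", "dataSource": datasource, "interval": interval},
    )
    return [mark_unused, kill]


plan = reclaim_plan(
    "events_stream", "2024-01-01T00:00:00.000Z/2024-01-02T00:00:00.000Z"
)
for method, url, body in plan:
    print(method, url, json.dumps(body))
```

Printing the plan first gives a reviewable record of exactly which interval will be deleted before the kill task is submitted.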

Storage optimization in Druid is a continuous feedback loop between ingestion throughput, compaction cadence, and retention enforcement. Deterministic task IDs, row-target compaction validated at the CI layer, and disciplined orphaned-segment cleanup keep Historical node costs predictable without sacrificing query performance.

Related

Segment size optimization strategies: the parent guide to the sizing math and drift detection this runbook applies.
Optimizing segment size for Historical nodes: the heap and GC angle on the same size band, from the Historical loading side.
Configuring Druid native compaction rules: the exact compact task and DataSourceCompactionConfig grammar to schedule these rewrites.
TTL mapping and data expiration: the retention half of the cost equation, covering load/drop rules and kill-task cadence.
