Druid Segment Metadata Storage: Diagnosing Metadata/Deep-Storage Drift

When a query silently misses recent data or the Coordinator thrashes reloading segments that never change, the root cause is almost always drift between Apache Druid's relational metadata store and the immutable objects sitting in deep storage. The metadata layer is the authoritative registry that binds physical segment files to the query plane: it records which segments exist, their version, their load status, and whether they are used. Unlike a warehouse with a centralized, lock-heavy catalog, Druid distributes segment ownership across Coordinators, Historicals, and Brokers, and leans on PostgreSQL or MySQL to hold the single source of truth. When the metadata store and deep storage disagree, engineers see query gaps, Coordinator memory pressure, or orphaned files quietly accruing storage cost. This page shows how to detect that divergence and reconcile it safely. It sits under query routing and segment discovery, where the metadata registry is what the Broker's timeline is ultimately built from.

The druid_segments table is keyed by segment id and holds columns for dataSource, start, end, version, created_date, used, and a serialized payload carrying the full segment descriptor. The Coordinator refreshes its snapshot of this table every druid.manager.segments.pollDuration (default PT1M) and runs its load and drop duties every druid.coordinator.period (default PT60S), while Historicals report availability over ZooKeeper or HTTP announcements rather than writing back to the table. Synchronization hinges on an atomic publish step during handoff: an ingestion task transactionally inserts segment rows and marks them used, and only after that commit succeeds does the Coordinator schedule Historical loads. This ordering prevents partial visibility, but it opens a narrow window for divergence whenever the metadata store is slow, connection-starved, or partitioned from a task that already wrote its files to deep storage.
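The all-or-nothing nature of the publish step can be illustrated with a small sketch. It uses an in-memory SQLite database as a stand-in for the PostgreSQL or MySQL store, a cut-down three-column table, and a hypothetical publish_segments helper; none of this is Druid's own code, only a demonstration of why a failed commit leaves no partially visible batch.

```python
import sqlite3


def publish_segments(conn, rows):
    """Insert every (id, dataSource) row in one transaction; True if committed."""
    try:
        with conn:  # commits on success, rolls back if any insert raises
            conn.executemany(
                "INSERT INTO druid_segments (id, dataSource, used) VALUES (?, ?, 1)",
                rows,
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Stand-in for the metadata store (assumption: simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE druid_segments (id TEXT PRIMARY KEY, dataSource TEXT, used INTEGER)"
)

ok = publish_segments(conn, [("seg_a", "events_stream"), ("seg_b", "events_stream")])

# seg_c is new, but seg_a already exists, so the whole second batch rolls back.
failed = publish_segments(conn, [("seg_c", "events_stream"), ("seg_a", "events_stream")])

visible = [row[0] for row in conn.execute("SELECT id FROM druid_segments ORDER BY id")]
```

After the second call, seg_c is not visible even though its insert ran before the failure. A task in that position may already have written its files to deep storage, which is the orphan case described in the next section.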

Failure Modes & Diagnostics

Four divergence patterns account for nearly every metadata incident. Each has a distinct signature you can confirm from the Coordinator, Overlord, and the metadata store directly.

Stale Coordinator state. A metadata-store connection drop or network partition leaves the Coordinator computing decisions from an outdated segment map. Historicals keep serving, but the Coordinator issues redundant loads or prematurely queues drops. Confirm by comparing the Coordinator's view against the store:
```
# What the Coordinator currently believes about each datasource
curl -s "http://<coordinator-host>:8081/druid/coordinator/v1/metadata/datasources" \
  | jq -r '.[]'
```
Version collision during handoff. Overlapping ingestion tasks that publish the same time range produce competing versions; Historicals reject downloads on checksum mismatch. Cross-reference the segment intervals reported by the Coordinator with the versions recorded in the store, and look for two used=true rows over the same [start,end).
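Metadata store bloat. Unpruned druid_audit and druid_pendingSegments rows inflate Coordinator query latency and can push duty-cycle balancing into OutOfMemoryError. Run VACUUM (ANALYZE) on PostgreSQL or OPTIMIZE TABLE on MySQL against the bloated tables, and enable the Coordinator's cleanup duties: druid.coordinator.kill.audit.on with druid.coordinator.kill.audit.durationToRetain for the audit log, and druid.coordinator.kill.pendingSegments.on for pending segments. Note that druid.audit.manager.maxPayloadSizeBytes only caps the size of each stored audit payload; it does not prune rows.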
Orphaned deep-storage objects. A task that wrote its files but failed the metadata commit leaves segments in S3/HDFS that consume storage yet are invisible to the query plane. These are the highest-value target for automated reconciliation, because nothing in the running cluster will ever surface them.
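The version-collision check described above can be sketched as a pure function over rows already fetched from druid_segments. The function name and the row shape (dicts with id, dataSource, start, end, version, used) are assumptions for illustration, and start/end are assumed to be uniformly formatted UTC ISO-8601 strings so that string order matches time order. Overlapping used versions can also be a short-lived, legitimate state while a newer version overshadows an older one, so treat the output as candidates to inspect rather than confirmed faults.

```python
from collections import defaultdict


def find_version_collisions(rows):
    """Return (datasource, id_a, id_b) for used segments that overlap in time
    but carry different versions."""
    by_datasource = defaultdict(list)
    for row in rows:
        if row["used"]:
            by_datasource[row["dataSource"]].append(row)

    collisions = []
    for datasource, segs in by_datasource.items():
        segs.sort(key=lambda s: (s["start"], s["end"], s["version"]))
        for i, a in enumerate(segs):
            for b in segs[i + 1:]:
                if b["start"] >= a["end"]:
                    break  # sorted by start, so nothing later can overlap a
                if a["version"] != b["version"]:
                    collisions.append((datasource, a["id"], b["id"]))
    return collisions


def _row(seg_id, start, end, version, used=True):
    # Helper for the example only.
    return {"id": seg_id, "dataSource": "events_stream", "start": start,
            "end": end, "version": version, "used": used}


sample = [
    _row("a", "2024-01-01T00:00:00.000Z", "2024-01-01T01:00:00.000Z", "v1"),
    _row("b", "2024-01-01T00:00:00.000Z", "2024-01-01T01:00:00.000Z", "v2"),
    _row("c", "2024-01-01T01:00:00.000Z", "2024-01-01T02:00:00.000Z", "v1"),
    _row("d", "2024-01-01T00:00:00.000Z", "2024-01-01T01:00:00.000Z", "v3", used=False),
]
collisions = find_version_collisions(sample)
```

Only segments a and b are reported: c starts exactly where a ends (intervals are half-open), and d is ignored because it is not used.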

The single most useful diagnostic bypasses Coordinator caching and reads the store directly, exposing segments that were published but flagged unused — superseded by a newer version, or stuck mid-handoff:

SELECT id, "dataSource", used, version, created_date
FROM druid_segments
WHERE used = false
  AND created_date < NOW() - INTERVAL '2 hours'
ORDER BY created_date DESC;

Correlate any suspicious rows against terminal Overlord task state to distinguish a genuine orphan from a task still in flight:

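# The state filter accepts pending, waiting, running, or complete;
# failed tasks are completed tasks whose statusCode is FAILED.
curl -s "http://<overlord-host>:8090/druid/indexer/v1/tasks?state=complete" \
  | jq -r '.[] | select(.statusCode == "FAILED") | .id'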

Target Spec & Validated JSON

Most drift is preventable at ingestion time by making the task wait for confirmed segment availability instead of returning as soon as it publishes. Set a bounded awaitSegmentAvailabilityTimeoutMillis so the task blocks until the Coordinator reports the published segments as loaded, and keep partitioning deterministic so re-runs cannot mint colliding versions. The following minimal index_parallel spec fragment encodes those guarantees:

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "events_stream",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE"
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "forceGuaranteedRollup": true,
      "partitionsSpec": {
        "type": "hashed",
        "numShards": 4
      },
      "awaitSegmentAvailabilityTimeoutMillis": 600000
    }
  }
}

forceGuaranteedRollup with a hashed partitionsSpec produces stable, non-overlapping shard versions, which is what prevents the version-collision failure mode above; the choice between hashed and dynamic partitioning has real trade-offs covered under segment size optimization strategies. The awaitSegmentAvailabilityTimeoutMillis of 600000 (ten minutes) makes the task wait for the Coordinator to confirm that the newly published segments are available for query, instead of returning while they are still invisible. When the timeout expires the task still exits as successful, so check the segmentAvailabilityConfirmed field in the task's ingestion report and alert on that rather than on task status alone. Aligning segmentGranularity here with your retention boundaries also keeps automated compaction scheduling from later fighting the ingestion grain.
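A pre-submission check for these settings might look like the sketch below. The function name check_handoff_guarantees and the wording of its messages are hypothetical; it inspects only the four fields discussed here and is not a substitute for Druid's own spec validation.

```python
def check_handoff_guarantees(spec):
    """Return a list of problems; an empty list means the four checks passed."""
    inner = spec.get("spec", {})
    io_config = inner.get("ioConfig", {})
    tuning = inner.get("tuningConfig", {})
    problems = []

    if io_config.get("appendToExisting", False):
        problems.append("ioConfig.appendToExisting should be false")
    if not tuning.get("forceGuaranteedRollup", False):
        problems.append("tuningConfig.forceGuaranteedRollup should be true")
    if tuning.get("partitionsSpec", {}).get("type") != "hashed":
        problems.append("tuningConfig.partitionsSpec.type should be 'hashed'")

    timeout = tuning.get("awaitSegmentAvailabilityTimeoutMillis", 0)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, int) or timeout <= 0:
        problems.append("awaitSegmentAvailabilityTimeoutMillis should be a positive integer")
    return problems


good_spec = {
    "type": "index_parallel",
    "spec": {
        "ioConfig": {"type": "index_parallel", "appendToExisting": False},
        "tuningConfig": {
            "type": "index_parallel",
            "forceGuaranteedRollup": True,
            "partitionsSpec": {"type": "hashed", "numShards": 4},
            "awaitSegmentAvailabilityTimeoutMillis": 600000,
        },
    },
}
bad_spec = {"type": "index_parallel", "spec": {"tuningConfig": {"type": "index_parallel"}}}

good_problems = check_handoff_guarantees(good_spec)
bad_problems = check_handoff_guarantees(bad_spec)
```

Running a check like this in CI before submitting to the Overlord keeps a spec that silently drops one of the settings from reaching the cluster.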

Python Automation Script

Manual reconciliation does not scale past a handful of datasources. The reconciler below queries the metadata store directly for long-unused segments, confirms with the Coordinator that each is genuinely unknown to the query plane, and marks true orphans for cleanup. It uses only the standard library plus requests and psycopg2, and wraps every Coordinator call in exponential backoff so a duty-cycle stall does not produce false orphans:

import os
import time
import logging
from contextlib import closing
from urllib.parse import quote

import psycopg2
import requests

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("druid_metadata_reconciler")

DRUID_COORDINATOR = os.getenv("DRUID_COORDINATOR", "http://localhost:8081")
METADATA_DB_URI = os.getenv("DRUID_METADATA_DB_URI",
                            "postgresql://druid:password@localhost:5432/druid")
ORPHAN_RETENTION_HOURS = int(os.getenv("ORPHAN_RETENTION_HOURS", "4"))
MAX_RETRIES = 5


def get_unpublished_segments(conn, hours: int) -> list[dict]:
    """Segments stuck in used=false past the grace window."""
    query = """
        SELECT id, "dataSource", version, created_date
        FROM druid_segments
        WHERE used = false
          AND created_date < NOW() - (%s || ' hours')::interval
        ORDER BY created_date ASC;
    """
    with conn.cursor() as cur:
        cur.execute(query, (str(hours),))
        cols = [d[0] for d in cur.description]
        return [dict(zip(cols, row)) for row in cur.fetchall()]


def coordinator_knows_segment(datasource: str, segment_id: str) -> bool:
    """True if the Coordinator has a live record of the segment.

    Retries transient failures with exponential backoff so a stalled
    duty cycle is never mistaken for an orphan.
    """
    # The per-segment metadata endpoint is scoped by datasource. Segment ids
    # contain ':' from ISO timestamps, so URL-encode both path components.
    url = (f"{DRUID_COORDINATOR}/druid/coordinator/v1/metadata/datasources/"
           f"{quote(datasource, safe='')}/segments/{quote(segment_id, safe='')}")
    delay = 1.0
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            resp = requests.get(url, timeout=5)
            if resp.status_code == 200:
                return True
            if resp.status_code == 404:
                # 404 means the segment is unused or unknown to the Coordinator.
                return False
            logger.warning("Coordinator returned %s for %s", resp.status_code, segment_id)
        except requests.RequestException as exc:
            logger.warning("Attempt %d/%d failed for %s: %s",
                           attempt, MAX_RETRIES, segment_id, exc)
        if attempt < MAX_RETRIES:  # no point sleeping after the final attempt
            time.sleep(delay)
            delay = min(delay * 2, 30.0)
    # Unresolved after retries: treat as live to stay safe, never delete on doubt.
    return True


def reconcile_orphaned_segments() -> list[dict]:
    """Idempotent scan; returns confirmed orphans without deleting them."""
    orphans: list[dict] = []
    try:
        # psycopg2's own context manager only ends the transaction;
        # closing() makes sure the connection itself is released.
        with closing(psycopg2.connect(METADATA_DB_URI)) as conn:
            candidates = get_unpublished_segments(conn, ORPHAN_RETENTION_HOURS)
            if not candidates:
                logger.info("No unpublished segments past the grace window.")
                return orphans

            logger.info("Validating %d candidate(s) against the Coordinator...", len(candidates))
            for seg in candidates:
                if coordinator_knows_segment(seg["dataSource"], seg["id"]):
                    logger.info("Segment %s is pending handoff; skipping.", seg["id"])
                    continue
                logger.warning("Orphan confirmed: %s (datasource=%s, created=%s).",
                               seg["id"], seg["dataSource"], seg["created_date"])
                orphans.append(seg)
                # Production cleanup happens in two ordered steps AFTER this scan:
                #   1. Delete the segment object from deep storage (S3/HDFS SDK).
                #   2. DELETE FROM druid_segments WHERE id = %s  -- only after step 1 confirms.
    except psycopg2.OperationalError as exc:
        logger.error("Metadata store connection failed: %s", exc)
        raise
    return orphans


if __name__ == "__main__":
    found = reconcile_orphaned_segments()
    logger.info("Scan complete: %d confirmed orphan(s).", len(found))

The safety invariant is that an unresolved Coordinator check returns True, so the reconciler never deletes a segment it could not positively confirm as orphaned. When wiring this into Airflow or a CI job, pool connections through PgBouncer to avoid saturating the metadata store during peak ingestion — the same discipline applied to asynchronous task execution patterns across the orchestration layer.
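The two ordered cleanup steps named in the script's comments could be wired up roughly as follows. This is a sketch under assumptions: cleanup_orphans, delete_object, and delete_row are hypothetical names, the two callables stand in for your deep-storage SDK call and the DELETE statement, and dry_run defaults to True so nothing is removed unless explicitly requested.

```python
from typing import Callable


def cleanup_orphans(
    orphans: list[dict],
    delete_object: Callable[[dict], bool],
    delete_row: Callable[[str], None],
    dry_run: bool = True,
) -> list[str]:
    """Apply the two ordered cleanup steps; return ids whose rows were removed."""
    removed: list[str] = []
    for seg in orphans:
        if dry_run:
            continue  # report-only mode: touch nothing
        # Step 1: remove the deep-storage object; the callable must confirm success.
        if not delete_object(seg):
            continue  # unconfirmed delete: keep the row so the segment stays traceable
        # Step 2: only now drop the metadata row.
        delete_row(seg["id"])
        removed.append(seg["id"])
    return removed


# Example with in-memory fakes in place of S3/HDFS and the metadata store.
deleted_rows: list[str] = []
example_orphans = [{"id": "seg_a"}, {"id": "seg_b"}]

dry = cleanup_orphans(example_orphans, lambda seg: True, deleted_rows.append)
rows_after_dry_run = list(deleted_rows)

real = cleanup_orphans(
    example_orphans,
    lambda seg: seg["id"] == "seg_a",  # pretend the seg_b object delete failed
    deleted_rows.append,
    dry_run=False,
)
```

Passing the storage and database operations in as callables keeps the ordering logic testable without a live cluster, and a row is never dropped while its deep-storage object might still exist.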

Verification Steps

After a reconciliation run, confirm that the metadata store and the query plane agree. First, the direct-store query from above should return zero long-unused rows:

psql "$DRUID_METADATA_DB_URI" -tAc \
  "SELECT count(*) FROM druid_segments
   WHERE used = false AND created_date < NOW() - INTERVAL '4 hours';"

Expected output, with no lingering orphans past the grace window:

0

Next, confirm the Coordinator reports the expected active segment count for the datasource and that none are unavailable:

curl -s "http://<coordinator-host>:8081/druid/coordinator/v1/datasources/events_stream" \
  | jq '{segments: .segments.count, availability: .tiers}'

Expected shape — a non-zero segment count with every segment loaded on a tier:

{
  "segments": 1440,
  "availability": {
    "_default_tier": { "segmentCount": 1440, "replicationFactor": 2 }
  }
}

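Finally, verify from the Broker's perspective that its segment inventory is initialized, which signals that routing has built its timeline from the current segment set:

curl -s "http://<broker-host>:8082/druid/broker/v1/loadstatus" | jq '.'

A response of {"inventoryInitialized": true} confirms the Broker knows about the cluster's segments. That flag is all this endpoint reports; for the number of segments still waiting to load per datasource, query the Coordinator at /druid/coordinator/v1/loadstatus?simple and expect 0 for events_stream. Together with the two checks above, this indicates the metadata registry, deep storage, and the query plane are aligned.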
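For a scheduled job, the three checks can be folded into one pass/fail helper. The sketch below assumes the inputs have already been fetched: the count from the psql query, the raw (un-filtered) JSON from the Coordinator datasource endpoint, and the JSON from the Broker loadstatus endpoint. The name cluster_is_aligned is hypothetical.

```python
def cluster_is_aligned(stale_unused_count, datasource_summary, broker_status):
    """Combine the three verification checks into a single boolean."""
    no_stale_rows = stale_unused_count == 0
    has_segments = datasource_summary.get("segments", {}).get("count", 0) > 0
    broker_ready = broker_status.get("inventoryInitialized") is True
    return no_stale_rows and has_segments and broker_ready


healthy = cluster_is_aligned(
    0,
    {"segments": {"count": 1440}, "tiers": {"_default_tier": {"segmentCount": 1440}}},
    {"inventoryInitialized": True},
)
lagging = cluster_is_aligned(
    3,
    {"segments": {"count": 1440}},
    {"inventoryInitialized": True},
)
```

Returning a single boolean makes the result easy to use as a task exit condition, though a real job would likely log which of the three checks failed.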

Related

Query routing and segment discovery: how the Broker turns the metadata registry into a queryable timeline and scatters sub-queries.
Reducing Historical node storage costs: recovering stranded and orphaned segments to reclaim deep storage.
Schema validation for Druid specs: catching malformed ingestion specs before they can publish colliding versions.

For deeper reference on the store's role in retention and cleanup, see the Apache Druid data management documentation.
