Debugging Apache Druid Supervisor Task Failures

When a Kafka or Kinesis supervisor in Apache Druid flips to UNHEALTHY_SUPERVISOR or UNHEALTHY_TASKS, or its child tasks pile up in the FAILED state, ingestion stalls silently: the Overlord still lists the supervisor, but no new segments reach deep storage and query results quietly go stale. These failures rarely come from transient network blips; they usually signal a structural mismatch among the ingestion spec, the Peon JVM heap on the MiddleManager, and the Kafka consumer configuration. This page is the failure-diagnosis companion to the async task execution patterns that submit and poll these tasks. It shows how to extract deterministic failure context, patch the offending spec, and drive an idempotent recovery loop that never double-publishes a segment.

Failure Modes & Diagnostics

A supervisor is a long-running Overlord process that spawns and supervises short-lived indexing tasks. Diagnosis means correlating two layers: the supervisor-level status payload from the Overlord, and the task-level logs emitted by the MiddleManager Peon that actually ran the work. Three failure vectors dominate production incidents: schema-validation rejection (ParseException), intermediate-persist heap exhaustion (OutOfMemoryError), and consumer rebalance storms (max.poll.interval.ms breaches).

Start at the supervisor. The recentErrors array in the status payload is the fastest signal of why tasks are dying:

# 1. Supervisor runtime state + the most recent errors
curl -s "http://<overlord-host>:8090/druid/indexer/v1/supervisor/<supervisor-id>/status" \
  | jq '{state: .payload.state, detailedState: .payload.detailedState,
         errors: [.payload.recentErrors[]? | {timestamp, exceptionClass, message, streamException}]}'

If detailedState reads UNHEALTHY_TASKS or UNABLE_TO_CONNECT_TO_STREAM, drill into an individual task log. The supervisor lists its active and recently completed tasks; pull the log for the newest one and grep for the three canonical exception classes:

# 2. Resolve the newest task id, then scan its log for the dominant exception classes
TASK_ID=$(curl -s "http://<overlord-host>:8090/druid/indexer/v1/supervisor/<supervisor-id>/status" \
  | jq -r '.payload.activeTasks[0].id // .payload.publishingTasks[0].id')
curl -s "http://<overlord-host>:8090/druid/indexer/v1/task/${TASK_ID}/log" \
  | grep -E 'ParseException|OutOfMemoryError|max\.poll\.interval\.ms|Marking the coordinator dead'

When the log points at heap pressure, confirm it on the worker rather than guessing. jstat against the Peon JVM shows whether the old generation is pinned near 100% and how much time is lost to full GC:

# 3. Confirm heap exhaustion on the Peon: sustained O near 100% + rising FGCT = OOM territory
PEON_PID=$(pgrep -f "${TASK_ID}")
jstat -gcutil "${PEON_PID}" 1000 5   # columns: S0 S1 E O M CCS YGC YGCT FGC FGCT GCT
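
The same reading can be checked in code when jstat output is captured by a script. The sketch below parses `jstat -gcutil` text and flags sustained old-generation pressure. The function names `parse_gcutil` and `heap_exhausted` and the 95% threshold are illustrative assumptions, not Druid or JDK defaults; tune the threshold to your own heap sizing.

```python
def parse_gcutil(output: str) -> list[dict]:
    """Parse `jstat -gcutil` text into one dict per sample, keyed by column name."""
    lines = [ln.split() for ln in output.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    samples = []
    for row in rows:
        # jstat prints "-" for columns that do not apply; treat those as 0.0.
        samples.append(
            {col: (0.0 if val == "-" else float(val)) for col, val in zip(header, row)}
        )
    return samples


def heap_exhausted(samples: list[dict], old_gen_pct: float = 95.0) -> bool:
    """True when old gen (O) stays above the threshold and full GCs (FGC) keep rising."""
    if len(samples) < 2:
        return False  # a single sample cannot show a sustained trend
    pinned = all(s["O"] >= old_gen_pct for s in samples)
    full_gcs_rising = samples[-1]["FGC"] > samples[0]["FGC"]
    return pinned and full_gcs_rising
```

Feed it the captured stdout of the jstat command above; a True result corresponds to the "O near 100% and FGCT rising" pattern, while a single high sample is deliberately not treated as conclusive.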

Map each symptom to its structural root cause and remediation:

ParseException: an upstream producer added or renamed a field and the dimensionsSpec/timestampSpec was never updated. The fix is a spec change, not a restart; treat schema drift as a first-class event, as covered in handling schema evolution in Druid ingestion.
OutOfMemoryError during intermediatePersist: maxRowsInMemory (or maxBytesInMemory) is too large for the Peon -Xmx. Each task buffers rows on-heap until a persist flushes them to disk; lower the threshold or raise the Peon heap in druid.indexer.runner.javaOptsArray on the MiddleManager.
max.poll.interval.ms breach / rebalance storm: processing a poll batch takes longer than the consumer's poll interval, so Kafka evicts the consumer and triggers a rebalance, which fails the task. Either raise max.poll.interval.ms in consumerProperties or shrink maxRowsInMemory so each persist cycle returns to poll() sooner.
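
The symptom-to-remediation mapping above can be captured as a small lookup so that a fetched task log produces a triage summary instead of raw grep hits. This is a minimal sketch: the names `FAILURE_PATTERNS` and `classify_log` and the mode labels are hypothetical, and the patterns are the same ones used in the grep command earlier.

```python
import re

# (pattern, failure mode label, remediation summarised from the list above)
FAILURE_PATTERNS = [
    (re.compile(r"ParseException"), "schema_drift",
     "update dimensionsSpec/timestampSpec and resubmit the spec"),
    (re.compile(r"OutOfMemoryError"), "heap_exhaustion",
     "lower maxRowsInMemory/maxBytesInMemory or raise the Peon heap"),
    (re.compile(r"max\.poll\.interval\.ms|Marking the coordinator dead"),
     "rebalance_storm",
     "raise max.poll.interval.ms or shrink maxRowsInMemory"),
]


def classify_log(log_text: str) -> dict:
    """Count matching log lines per failure mode and attach the remediation."""
    findings = {}
    for line in log_text.splitlines():
        for pattern, mode, remediation in FAILURE_PATTERNS:
            if pattern.search(line):
                entry = findings.setdefault(
                    mode, {"count": 0, "remediation": remediation}
                )
                entry["count"] += 1
                break  # attribute each line to the first matching mode only
    return findings
```

Pass it the body returned by the task log endpoint. An empty result means none of the three dominant modes appear in the log, so the cause lies elsewhere.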

Target Spec & Validated JSON

A hardened Kafka supervisor spec removes the two most common self-inflicted failures at the source: it bounds on-heap row buffering so persists never exhaust the Peon, and it gives the consumer enough headroom to survive a slow batch without a rebalance. The block below is complete and copy-ready: both required top-level keys (type, spec) are present, and spec carries dataSchema, ioConfig, and tuningConfig.

{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "timestampSpec": { "column": "ts", "format": "millis" },
      "dimensionsSpec": {
        "dimensions": ["user_id", "event_type", "region"]
      },
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": false
      }
    },
    "ioConfig": {
      "type": "kafka",
      "topic": "events",
      "consumerProperties": {
        "bootstrap.servers": "kafka-1:9092,kafka-2:9092",
        "max.poll.interval.ms": "600000",
        "max.poll.records": "5000"
      },
      "taskCount": 2,
      "replicas": 1,
      "taskDuration": "PT1H",
      "useEarliestOffset": false
    },
    "tuningConfig": {
      "type": "kafka",
      "maxRowsInMemory": 75000,
      "maxBytesInMemory": 0,
      "intermediatePersistPeriod": "PT10M",
      "maxRowsPerSegment": 5000000,
      "resetOffsetAutomatically": false
    }
  }
}

The load-bearing fields for failure prevention are maxRowsInMemory (kept at 75k so each persist stays well inside a 1–2 GB Peon heap), max.poll.interval.ms (raised to 10 minutes to absorb slow persist cycles), and resetOffsetAutomatically: false — automatic offset reset silently skips data on an OffsetOutOfRange, so recovery must be an explicit, audited action rather than a side effect. These sizing choices connect directly to broader segment size optimization strategies; maxRowsPerSegment here caps the streaming segment before handoff and compaction consolidates it later.
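
A pre-submit check can guard these load-bearing fields against regressions in review. The sketch below is an assumption-laden illustration: `lint_supervisor_spec` is a hypothetical helper, and its default thresholds simply mirror the values chosen in the spec above rather than any Druid-enforced limit.

```python
def lint_supervisor_spec(
    spec: dict,
    max_rows_ceiling: int = 75_000,        # assumed ceiling, taken from the spec above
    min_poll_interval_ms: int = 600_000,   # assumed floor, taken from the spec above
) -> list[str]:
    """Return human-readable warnings for the failure-prevention fields."""
    warnings = []
    inner = spec.get("spec", {})
    tuning = inner.get("tuningConfig", {})
    consumer = inner.get("ioConfig", {}).get("consumerProperties", {})

    if tuning.get("resetOffsetAutomatically", False):
        warnings.append("resetOffsetAutomatically is true: data can be skipped silently")

    rows = tuning.get("maxRowsInMemory")
    if rows is None:
        warnings.append("maxRowsInMemory is not set explicitly")
    elif rows > max_rows_ceiling:
        warnings.append(f"maxRowsInMemory={rows} exceeds ceiling {max_rows_ceiling}")

    poll = consumer.get("max.poll.interval.ms")
    if poll is None:
        warnings.append("max.poll.interval.ms is not set explicitly")
    elif int(poll) < min_poll_interval_ms:  # consumer properties are often strings
        warnings.append(f"max.poll.interval.ms={poll} is below {min_poll_interval_ms}")

    return warnings
```

An empty list means the spec matches the hardening choices described here. Run it on the parsed JSON before submitting, and treat any warning as a prompt for review rather than a hard failure.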

Python Automation Script

Diagnosis and recovery should be codified, not clicked through a console under incident pressure. The script below uses only the standard library plus requests. It validates the spec locally, submits it with a hand-rolled exponential-backoff retry, and, on a terminal supervisor failure, terminates the supervisor without touching its committed offsets, so that a resubmitted spec resumes from the offsets last published with segments rather than from earliest/latest. This mirrors the submit-and-poll contract described in the parent async task execution patterns and the local validation gate detailed under schema validation for Druid specs.

import json
import time
import requests

OVERLORD = "http://<overlord-host>:8090"
REQUIRED_TOP_LEVEL = {"type", "spec"}
REQUIRED_SPEC_KEYS = {"dataSchema", "ioConfig", "tuningConfig"}


def validate_spec(path: str) -> dict:
    """Fail fast on structural gaps before the Overlord ever sees the spec."""
    with open(path, "r", encoding="utf-8") as fh:
        spec = json.load(fh)
    missing_top = REQUIRED_TOP_LEVEL - spec.keys()
    missing_spec = REQUIRED_SPEC_KEYS - spec.get("spec", {}).keys()
    if missing_top or missing_spec:
        raise ValueError(f"invalid spec: missing {missing_top | missing_spec}")
    return spec


def submit_supervisor(path: str, attempts: int = 4) -> dict:
    """POST the supervisor spec with exponential backoff on transient errors."""
    spec = validate_spec(path)
    for attempt in range(attempts):
        try:
            resp = requests.post(
                f"{OVERLORD}/druid/indexer/v1/supervisor",
                json=spec,
                timeout=15,
            )
            resp.raise_for_status()
            return resp.json()
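        except requests.RequestException as exc:
            # A 4xx response means the Overlord rejected the spec; retrying cannot help.
            status = getattr(exc.response, "status_code", None)
            client_error = status is not None and 400 <= status < 500
            if client_error or attempt == attempts - 1:
                raise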
            backoff = 2 ** attempt
            print(f"submit failed ({exc}); retrying in {backoff}s")
            time.sleep(backoff)


def recover_supervisor(supervisor_id: str) -> None:
    """Idempotent recovery: terminate the supervisor, keeping committed offsets."""
    resp = requests.post(
        f"{OVERLORD}/druid/indexer/v1/supervisor/{supervisor_id}/terminate",
        timeout=15,
    )
    # 404 means the supervisor is already gone, so a repeated call is a no-op.
    if resp.status_code != 404:
        resp.raise_for_status()
    # Deliberately no /reset call here: it clears the stored offsets, and the next
    # run would start from earliest/latest (per useEarliestOffset). Offsets committed
    # with published segments stay in the metadata store, so resubmitting the
    # patched spec with submit_supervisor() resumes from them.


if __name__ == "__main__":
    result = submit_supervisor("kafka_supervisor.json")
    print("supervisor id:", result["id"])

Verification Steps

After resubmission, confirm the supervisor actually reached a healthy steady state rather than just accepting the POST. A RUNNING state alone is insufficient — check that detailedState is RUNNING and that lag is draining, not growing.

# Confirm the supervisor is healthy and consumer lag is bounded
curl -s "http://<overlord-host>:8090/druid/indexer/v1/supervisor/events/status" \
  | jq '{state: .payload.state, detailedState: .payload.detailedState,
         aggregateLag: .payload.aggregateLag}'

A recovered supervisor returns a payload like the following — state and detailedState both RUNNING, with aggregateLag trending toward zero across successive polls:

{
  "state": "RUNNING",
  "detailedState": "RUNNING",
  "aggregateLag": 4210
}
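
"Trending toward zero across successive polls" can be made explicit with a small check over collected aggregateLag values. The sketch below is illustrative only: `lag_is_draining` and `is_recovered` are hypothetical helpers, and the `tolerance` and `floor` parameters are assumptions to tune against your own traffic.

```python
def lag_is_draining(samples: list[int], tolerance: int = 0, floor: int = 0) -> bool:
    """True when lag is at or below `floor`, or falls across the samples.

    `tolerance` allows small upward jitter between consecutive polls.
    """
    if not samples:
        return False
    if samples[-1] <= floor:
        return True  # already caught up; nothing left to drain
    if len(samples) < 2:
        return False  # one sample cannot show a trend
    no_spikes = all(b <= a + tolerance for a, b in zip(samples, samples[1:]))
    return no_spikes and samples[-1] < samples[0]


def is_recovered(payload: dict, lag_samples: list[int]) -> bool:
    """Combine the state checks above with the lag trend."""
    return (
        payload.get("state") == "RUNNING"
        and payload.get("detailedState") == "RUNNING"
        and lag_is_draining(lag_samples)
    )
```

Collect one aggregateLag value per status poll and pass the list together with the jq-filtered object from the latest poll; a supervisor that reports RUNNING while lag grows is reported as not recovered.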

Finally, confirm that fresh segments are being published and handed off to Historicals, which is the true end of the ingestion contract and where query routing and segment discovery begins serving the new data:

# Segments served for the datasource, plus the newest interval to confirm freshness
curl -s "http://<coordinator-host>:8081/druid/coordinator/v1/datasources/events/segments?full=true" \
  | jq '{segments: length, newestInterval: (map(.interval) | max)}'

Related

Async Task Execution Patterns for Druid Ingestion: the submit-poll-complete control loop these supervisors run inside, including handoff confirmation.
Handling Schema Evolution in Druid Ingestion: prevent the ParseException failures that suspend supervisors after an upstream field change.
Kafka to Druid Real-Time Pipeline Setup: the consumer configuration and deterministic handoff settings referenced in the target spec above.
