Debugging Druid Supervisor Task Failures: Operational Reference

Supervisor task failures in Apache Druid rarely originate from transient network instability. When a supervisor transitions to TASK_FAILED, KILLED, or FAILED states, it typically signals structural misalignment between ingestion specifications, JVM resource boundaries, or retention synchronization. For OLAP data engineers and platform developers, resolving these states requires deterministic diagnostics, idempotent recovery patterns, and strict pipeline-level gating to prevent spec drift from propagating into production clusters.

Failure Mode Taxonomy & API Diagnostics

Isolate the root cause by correlating the Overlord supervisor payload with underlying middleManager task logs. Production environments consistently exhibit three dominant failure vectors: schema validation rejection, intermediate persist exhaustion, and retention rule conflicts.

Execute the following diagnostic sequence against the Overlord and Coordinator APIs to extract deterministic failure context:

# 1. Retrieve supervisor runtime status and extract the last exception payload
curl -s "http://<overlord-host>:8090/druid/indexer/v1/supervisor/<supervisor-id>/status" \
  -H "Content-Type: application/json" | jq '.payload.lastException'

# 2. Extract the exact failing task ID and fetch raw logs for pattern matching
TASK_ID=$(curl -s "http://<overlord-host>:8090/druid/indexer/v1/supervisor/<supervisor-id>/status" | jq -r '.payload.runningTasks[0].id')
curl -s "http://<overlord-host>:8090/druid/indexer/v1/task/${TASK_ID}/log" | grep -E 'ParseException|OutOfMemoryError|SegmentAlreadyExists|KafkaConsumerTimeout'

# 3. Verify segment metadata alignment against the active retention policy
curl -s "http://<coordinator-host>:8081/druid/coordinator/v1/metadata/datasources/<datasource>/segments" | jq '.[] | select(.isAvailable == false) | {id: .id, status: .status}'

When ParseException surfaces, the failure originates from dimensionsSpec or metricsSpec mismatches, often triggered by upstream schema evolution without corresponding spec updates. OutOfMemoryError during the intermediatePersist phase indicates that maxRowsInMemory or intermediatePersistPeriod values exceed the middleManager JVM heap allocation, requiring immediate tuning of ingestion resource parameters. Retention conflicts (SegmentAlreadyExists or premature dropRule execution) demand explicit offset resets and rule synchronization across the metadata store.

Python Orchestration & CI/CD Gating

Pipeline builders must enforce strict spec validation before submission to prevent cascading supervisor crashes. Integrating schema validation into CI/CD workflows ensures that dataSchema contracts remain immutable across deployments. The orchestration layer should implement exponential backoff, idempotent submission, and automatic rollback on validation rejection.

import requests
import json
from tenacity import retry, stop_after_attempt, wait_exponential
from jsonschema import validate, ValidationError

OVERLORD_URL = "http://<overlord-host>:8090"
SUPERVISOR_SPEC_PATH = "kafka_supervisor.json"
SCHEMA_PATH = "druid_supervisor_schema.json"

# Load and validate spec against strict JSON schema before submission
def validate_spec(spec_path: str, schema_path: str) -> dict:
    with open(spec_path, "r") as f:
        spec = json.load(f)
    with open(schema_path, "r") as f:
        schema = json.load(f)
    try:
        validate(instance=spec, schema=schema)
        return spec
    except ValidationError as e:
        raise RuntimeError(f"Spec validation failed at {e.json_path}: {e.message}")

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def submit_supervisor_spec(spec_path: str) -> dict:
    spec = validate_spec(spec_path, SCHEMA_PATH)
    response = requests.post(
        f"{OVERLORD_URL}/druid/indexer/v1/supervisor",
        json=spec,
        headers={"Content-Type": "application/json"},
        timeout=15
    )
    response.raise_for_status()
    return response.json()

def rollback_failed_supervisor(supervisor_id: str) -> bool:
    """Idempotent termination and cleanup for failed supervisors."""
    resp = requests.post(f"{OVERLORD_URL}/druid/indexer/v1/supervisor/{supervisor_id}/terminate")
    return resp.status_code == 200

By decoupling validation from execution, teams can safely implement Async Task Execution Patterns that queue ingestion requests without blocking CI/CD runners. This approach is critical when synchronizing batch backfills with streaming pipelines, as it prevents resource contention during peak ingestion windows.

Deterministic Recovery & Spec Rollback

When a supervisor enters a terminal failure state, manual intervention introduces operational risk. Production recovery should follow a deterministic sequence:

  1. Terminate & Drain: Issue a supervisor termination to halt task spawning and allow active middleManagers to complete pending persist operations gracefully.
  2. Offset Reset: For Kafka-based ingestion, reset the consumer group offset to the last successfully committed timestamp. Avoid hard resets unless data loss is explicitly acceptable.
  3. Spec Patch & Resubmit: Apply a targeted patch to the ingestion spec (e.g., reducing maxRowsInMemory, correcting timestampSpec, or adjusting ioConfig consumer properties) and resubmit via the orchestration layer.
  4. Automated Rollback Trigger: If the resubmitted supervisor fails within the first 10 minutes, trigger an automated rollback to the previous known-good spec version stored in your configuration repository.

Cross-cluster ingestion orchestration requires strict metadata synchronization. When migrating ingestion workloads between environments, ensure that segment metadata, retention rules, and compaction configurations are replicated before supervisor activation. Implementing Automated Ingestion Pipeline Orchestration with declarative state management prevents configuration drift and guarantees reproducible recovery paths.

For teams managing high-throughput OLAP workloads, embedding these diagnostic and recovery patterns directly into platform tooling reduces mean-time-to-recovery (MTTR) and enforces consistent ingestion contracts across development, staging, and production environments.

Back to Automated Ingestion Pipeline Orchestration