Automating Druid Ingestion Specs with Python: A Validation-First Builder

Hand-edited Druid ingestion JSON is a common source of pipeline drift: a stale dimensionsSpec, a maxRowsPerSegment that no longer matches the data, or a missing tuningConfig key that only fails after the Overlord has already claimed a worker slot. The fix is to treat the spec as a versioned, executable artifact: assembled in code, validated against a schema before any network call, submitted with a deterministic ID, and confirmed by polling the Overlord to a terminal state and then checking the Coordinator for handoff. This page is the validation-first companion to dynamic ingestion spec generation. Where the parent explains how a generator resolves catalog metadata and telemetry into a spec, this page wires that spec through a jsonschema gate, a resilient requests session, and a runnable CI/CD-ready submitter.

Failure Modes & Diagnostics

Automated submitters fail in a handful of recognizable ways. Each is diagnosable from the Overlord and Coordinator REST APIs with curl and jq.

1. Malformed spec accepted locally, rejected by the Overlord. A missing required key (tuningConfig, or spec.dataSchema) slips past a naive builder and the task returns FAILED on submit. Pull the report and read the parser message:

curl -s http://overlord:8090/druid/indexer/v1/task/$TASK_ID/reports \
  | jq '.ingestionStatsAndErrors.payload.errorMsg'

The remedy is structural: validate every assembled spec against a jsonschema contract before the POST, so this class of failure never consumes a worker slot. This is the same guarantee that schema validation for Druid specs enforces at the contract level.

2. Duplicate ingestion from a retried submit. A network partition drops the Overlord's response, the client retries, and the same interval is ingested twice. Detect it by checking for more than one live version on an interval:

curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/events_web/segments?full=true" \
  | jq 'group_by(.interval)[] | {interval: .[0].interval, versions: (map(.version) | unique)}'

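Deterministic, content-hashed task IDs make the retry idempotent: the Overlord rejects a resubmitted ID instead of ingesting the interval a second time, so the client only has to treat that rejection as evidence that the first submit landed.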
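The script later on this page hashes the datasource, interval, and a spec version string. If the ID should instead follow the full content of the spec, one possible sketch is to hash its canonical JSON. The helper name content_hashed_task_id and the idx prefix are assumptions for illustration, not part of Druid:

```python
import hashlib
import json
from typing import Any, Dict


def content_hashed_task_id(spec: Dict[str, Any], prefix: str = "idx") -> str:
    """Derive a task ID from the spec's canonical JSON (hypothetical helper)."""
    # Drop any existing "id" so the hash depends only on what will be ingested.
    body = {k: v for k, v in spec.items() if k != "id"}
    # sort_keys plus fixed separators give one byte string per logical spec,
    # regardless of the key order the generator happened to emit.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    datasource = spec["spec"]["dataSchema"]["dataSource"]
    return f"{prefix}_{datasource}_{hashlib.sha256(canonical).hexdigest()[:12]}"
```

Two specs that differ only in key order map to the same ID, while any change to the interval, dimensions, or tuning produces a new one. That is stricter than a version string: every edit to the spec becomes a distinct task.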

3. Submitter overwhelms the Overlord during a leader election. Tight retry loops hammer the Overlord while it is unavailable. Confirm free worker capacity before tuning submit rate:

curl -s http://overlord:8090/druid/indexer/v1/workers \
  | jq '[.[] | {host: .worker.host, used: .currCapacityUsed, cap: .worker.capacity}]'

Capped exponential backoff (start small, double, ceiling at ~30 s) protects the Druid cluster while still recovering from transient faults — the same non-blocking discipline covered in async task execution patterns.
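The schedule described above can be sketched as a small generator. The 2-second start and five attempts mirror the script on this page; the function name backoff_delays is an assumption for illustration:

```python
from typing import Iterator


def backoff_delays(start: float = 2.0, factor: float = 2.0,
                   cap: float = 30.0, attempts: int = 5) -> Iterator[float]:
    """Yield one sleep duration per retry, doubling up to a fixed ceiling."""
    delay = start
    for _ in range(attempts):
        yield min(delay, cap)  # never sleep longer than the ceiling
        delay *= factor


print(list(backoff_delays()))  # [2.0, 4.0, 8.0, 16.0, 30.0]
```

Keeping the schedule separate from the submit loop makes the ceiling easy to test and to tune per environment.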

4. Task hangs in RUNNING and the poller blocks forever. A stuck sub-task never reaches a terminal state. A bounded poll with a hard max_wait ceiling raises TimeoutError so the pipeline fails loud instead of hanging a CI runner. Inspect the live state directly:

curl -s http://overlord:8090/druid/indexer/v1/task/$TASK_ID/status \
  | jq '.status.status, .status.duration'
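A bounded poll of this kind can be sketched independently of the HTTP layer. Here fetch_state is a hypothetical callable that returns the task's current state string; the clock and sleep parameters are injectable only so the loop can be tested without waiting:

```python
import time
from typing import Callable

TERMINAL_STATES = ("SUCCESS", "FAILED")


def poll_until_terminal(
    fetch_state: Callable[[], str],
    max_wait: float = 3600.0,
    poll_interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_state until it returns a terminal state or max_wait elapses."""
    deadline = clock() + max_wait
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        # Give up before sleeping if the next poll would land past the deadline.
        if clock() + poll_interval > deadline:
            raise TimeoutError(f"no terminal state within {max_wait}s (last: {state})")
        sleep(poll_interval)
```

In a real submitter, fetch_state would wrap the status request shown above and return .status.status from the response.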

Target Spec & Validated JSON

The builder emits a complete index_parallel batch spec with all required top-level keys: type and spec, and within spec the dataSchema, ioConfig, and tuningConfig objects. The field names follow Druid's native batch format; check them against the release you run before copying:

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "events_web",
      "timestampSpec": { "column": "event_ts", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": ["country", "device", { "name": "page_id", "type": "long" }]
      },
      "metricsSpec": [
        { "type": "count", "name": "rows" },
        { "type": "longSum", "name": "bytes", "fieldName": "resp_bytes" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": true,
        "intervals": ["2026-07-04T00:00:00Z/2026-07-05T00:00:00Z"]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "s3", "prefixes": ["s3://lake/events/web/2026-07-04/"] },
      "inputFormat": { "type": "json" },
      "appendToExisting": false,
      "dropExisting": true
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsInMemory": 1000000,
      "maxNumConcurrentSubTasks": 4,
      "forceGuaranteedRollup": true,
      "partitionsSpec": { "type": "range", "partitionDimensions": ["country"], "targetRowsPerSegment": 5000000 }
    }
  }
}

Two fields carry most of the operational risk. An explicit granularitySpec.intervals combined with ioConfig.dropExisting: true makes the task atomically replace the window on handoff. That is the correct primitive for deterministic backfills, and the reason a bounded interval is non-negotiable. The segmentGranularity here maps directly onto the segment granularity settings that govern partition boundaries, while targetRowsPerSegment is a row-count proxy for segment bytes that keeps segments in the roughly 300–700 MB band checked in the verification steps. Because rollup: true collapses rows within each queryGranularity bucket, the row target must reflect the post-rollup count, and the byte estimate behind it must account for the dictionary encoding of the columnar storage formats that back each segment.
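Since the bounded interval is what makes the replace safe, a builder can check it before the spec is assembled. The following is a minimal sketch that assumes intervals are written as two ISO-8601 timestamps separated by a slash, as in the spec above. It does not handle period forms such as a start plus P1D, and parse_bounded_interval is a hypothetical helper name:

```python
from datetime import datetime
from typing import Tuple


def parse_bounded_interval(interval: str) -> Tuple[datetime, datetime]:
    """Parse 'start/end' ISO-8601 timestamps and require start < end."""
    start_raw, sep, end_raw = interval.partition("/")
    if not sep or not start_raw or not end_raw:
        raise ValueError(f"interval must be 'start/end', got {interval!r}")
    # datetime.fromisoformat only accepts a trailing 'Z' from Python 3.11,
    # so normalise it to an explicit UTC offset first.
    start = datetime.fromisoformat(start_raw.replace("Z", "+00:00"))
    end = datetime.fromisoformat(end_raw.replace("Z", "+00:00"))
    if end <= start:
        raise ValueError(f"interval end must be after start, got {interval!r}")
    return start, end
```

Rejecting an empty or reversed window locally keeps a dropExisting task from ever being submitted against an interval the author did not intend.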

Python Automation Script

The submitter below uses only the Python standard library plus requests and jsonschema. It validates the assembled spec against a minimal structural contract, opens a pooled session with transport-level retries, submits with a deterministic ID and capped application-level backoff, then polls the Overlord to a terminal state with a hard timeout. Validation is deliberately the first line of the submit path — a malformed spec is rejected locally, never after it has claimed an Overlord slot.

import time
import hashlib
import requests
from typing import Dict, Any
from jsonschema import validate
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Minimal index_parallel contract: required top-level keys must all be present.
SPEC_SCHEMA = {
    "type": "object",
    "required": ["type", "spec"],
    "properties": {
        "type": {"const": "index_parallel"},
        "spec": {
            "type": "object",
            "required": ["dataSchema", "ioConfig", "tuningConfig"],
            "properties": {
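                "dataSchema": {
                    "type": "object",
                    "required": ["dataSource", "granularitySpec"],
                    "properties": {
                        # submit_and_track reads intervals[0], so require it here.
                        "granularitySpec": {
                            "type": "object",
                            "required": ["intervals"],
                            "properties": {"intervals": {"type": "array", "minItems": 1}},
                        },
                    },
                },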
                "ioConfig": {"type": "object", "required": ["type"]},
                "tuningConfig": {"type": "object", "required": ["type"]},
            },
        },
    },
}


def build_session() -> requests.Session:
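    """Pooled session with transport-level retry on transient 5xx responses.

    urllib3's Retry only retries idempotent methods on a bad status by
    default, so status_forcelist covers the GET polls; the POST is retried
    by the application-level loop in submit_and_track.
    """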
    session = requests.Session()
    session.headers.update({"Content-Type": "application/json"})
    retry = Retry(total=3, backoff_factor=1, status_forcelist=[502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def deterministic_task_id(datasource: str, interval: str, spec_version: str) -> str:
    """Content-hashed ID so an identical resubmission is idempotent at the Overlord."""
    key = f"{datasource}:{interval}:{spec_version}".encode()
    return f"idx_{datasource}_{hashlib.sha1(key).hexdigest()[:12]}"


def submit_and_track(
    overlord_url: str,
    spec: Dict[str, Any],
    spec_version: str = "v1",
    poll_interval: int = 5,
    max_wait: int = 3600,
) -> Dict[str, Any]:
    # 1. Hard validation gate — fails locally, before any network call.
    validate(instance=spec, schema=SPEC_SCHEMA)

    datasource = spec["spec"]["dataSchema"]["dataSource"]
    interval = spec["spec"]["dataSchema"]["granularitySpec"]["intervals"][0]
    spec["id"] = deterministic_task_id(datasource, interval, spec_version)

    session = build_session()

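    # 2. Submit with capped application-level exponential backoff.
    task_url = f"{overlord_url}/druid/indexer/v1/task"
    task_id = spec["id"]  # the Overlord keeps the ID we supply
    delay = 2.0
    for attempt in range(5):
        try:
            resp = session.post(task_url, json=spec, timeout=30)
        except requests.RequestException:
            resp = None  # connection error or timeout: outcome unknown, retry
        if resp is not None and resp.ok:
            break
        if resp is not None and resp.status_code < 500:
            # 4xx: either the spec was rejected, or an earlier attempt already
            # registered this ID and its response was lost. Ask the Overlord.
            probe = session.get(f"{task_url}/{task_id}/status", timeout=15)
            if probe.ok:
                break
            resp.raise_for_status()  # genuine client error: retrying cannot help
        if attempt == 4:
            raise RuntimeError(f"Overlord did not accept task {task_id} after 5 attempts")
        time.sleep(delay)
        delay = min(delay * 2, 30.0)  # ceiling protects the Overlord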

    # 3. Poll to a terminal state with a hard ceiling on elapsed time.
    start = time.monotonic()  # monotonic: unaffected by system clock adjustments
    while time.monotonic() - start < max_wait:
        status = session.get(
            f"{overlord_url}/druid/indexer/v1/task/{task_id}/status", timeout=15
        )
        status.raise_for_status()
        state = status.json().get("status", {}).get("status")
        if state in ("SUCCESS", "FAILED"):
            return status.json()
        time.sleep(poll_interval)

    raise TimeoutError(f"Task {task_id} exceeded max wait of {max_wait}s")

Wire this into a pipeline runner as a CI/CD gate: run validate() on every generated spec in the merge check so a broken descriptor never reaches main, and gate deployment on a dry-run submit against a staging Overlord. The three production guarantees — a pre-flight schema gate, a deterministic ID, and capped backoff — are what turn a fragile curl | jq script into a repeatable job. Fragmented output from fine-grained streaming intervals is reconciled downstream by automated compaction scheduling rather than by re-ingesting.
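The merge-check half of that gate could look roughly like the sketch below. It takes the validator as a parameter so it stays independent of the schema library; in this page's setup that callable would wrap validate(instance=..., schema=SPEC_SCHEMA). The function name check_spec_files and the idea of one JSON file per generated spec are assumptions for illustration:

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List, Tuple


def check_spec_files(
    paths: Iterable[Path],
    validate_spec: Callable[[Dict[str, Any]], None],
) -> List[Tuple[Path, str]]:
    """Run validate_spec on each JSON file; return (path, reason) for failures."""
    failures = []
    for path in paths:
        try:
            validate_spec(json.loads(path.read_text(encoding="utf-8")))
        except Exception as exc:  # report every failure, do not stop at the first
            failures.append((path, f"{type(exc).__name__}: {exc}"))
    return failures
```

A CI step would call this over the generated spec files and exit non-zero when the returned list is not empty, so one run reports every broken spec rather than only the first.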

Verification Steps

Confirm a submitted task actually produced loaded, correctly-sized segments before signalling downstream success.

1. Confirm the task reached SUCCESS:

curl -s http://overlord:8090/druid/indexer/v1/task/$TASK_ID/status | jq '.status.status'

"SUCCESS"

2. Confirm the Coordinator loaded the interval's segments and check their size band:

curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/events_web/segments?full=true" \
  | jq '[.[] | {mb: (.size/1048576 | floor), rows: .num_rows}] | sort_by(.mb)'

[
  { "mb": 486, "rows": 4980112 },
  { "mb": 502, "rows": 5013380 }
]

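Segments landing in the ~300–700 MB band confirm the sizing was correct. A cloud of sub-50 MB segments means the row target was wrong for the measured rollup ratio. If rows comes back null, the Coordinator response on your Druid version does not carry num_rows; read it from the sys.segments system table through the SQL API instead.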
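When the sizes miss the band, the row target can be re-derived from the measured bytes per row. This is a rough sketch that assumes a 500 MB goal (the midpoint of the band above) and that the observed segments are representative of the post-rollup data; retarget_rows_per_segment is a hypothetical helper name:

```python
from typing import Iterable, Tuple

MB = 1024 * 1024


def retarget_rows_per_segment(
    observed: Iterable[Tuple[int, int]],  # (size_bytes, num_rows) per segment
    target_bytes: int = 500 * MB,         # assumed midpoint of the 300-700 MB band
) -> int:
    """Estimate a row target from the measured post-rollup bytes per row."""
    total_bytes = total_rows = 0
    for size_bytes, num_rows in observed:
        total_bytes += size_bytes
        total_rows += num_rows
    if total_rows == 0:
        raise ValueError("need at least one non-empty segment to measure")
    bytes_per_row = total_bytes / total_rows
    return round(target_bytes / bytes_per_row)
```

Fed the two segments from the sample output, (486 * MB, 4980112) and (502 * MB, 5013380), it returns about 5.06 million rows, which agrees with the 5,000,000 already configured in the spec.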

3. Confirm exactly one live version per interval (no duplicate ingestion):

curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/events_web/segments?full=true" \
  | jq 'group_by(.interval)[] | {interval: .[0].interval, versions: (map(.version) | unique | length)}'

{ "interval": "2026-07-04T00:00:00.000Z/2026-07-04T01:00:00.000Z", "versions": 1 }

A versions count of 1 on every interval proves the dropExisting replace superseded cleanly and no retry double-ingested.
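The same check can run inside the pipeline on the parsed Coordinator response instead of through jq. This sketch assumes each segment is a dict with the interval and version keys used in the query above; intervals_with_multiple_versions is a hypothetical helper name:

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def intervals_with_multiple_versions(
    segments: Iterable[Dict[str, Any]],
) -> Dict[str, List[str]]:
    """Return {interval: sorted versions} for intervals with more than one version."""
    versions = defaultdict(set)
    for seg in segments:
        versions[seg["interval"]].add(seg["version"])
    return {iv: sorted(vs) for iv, vs in versions.items() if len(vs) > 1}
```

An empty result means every interval has exactly one live version; anything else names the intervals to investigate.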

Related

Handling schema evolution in Druid ingestion: keep the validated schema in this builder in step with a source that gains and drops columns.
Debugging Druid supervisor task failures: triage the streaming-side tasks once a submitted spec is in flight.
Setting up a Kafka to Druid real-time pipeline: apply the same submit-and-verify discipline to a supervisor spec instead of a batch task.

Up one level: Dynamic ingestion spec generation covers how the spec this script submits is assembled from catalog metadata and cluster telemetry.
