Automated Ingestion Pipeline Orchestration for Apache Druid

Production-grade Apache Druid deployments demand deterministic, idempotent ingestion pipelines that align precisely with the platform's immutable segment lifecycle, so that every byte written to deep storage becomes queryable exactly once, on schedule, and without manual intervention. Manual JSON specification submission introduces operational friction, schema drift vulnerabilities, and unpredictable scheduling latency. Automated orchestration replaces ad-hoc task submission with version-controlled, programmatically generated workflows that interface directly with the Druid Overlord and Coordinator. For OLAP data engineers, analytics platform developers, and DevOps teams, the operational objective is clear: minimize ingestion latency, guarantee segment integrity, and enforce strict schema governance across heterogeneous batch and streaming workloads.

The orchestrator drives the write path and polls the Overlord asynchronously; a task reporting SUCCESS only becomes queryable once the Coordinator loads its segments across the handoff boundary.

Orchestration Flow

The end-to-end orchestration loop validates a generated spec, submits it to the Overlord, polls asynchronously, and gates downstream work on a successful handoff.

The orchestrator treats each ingestion as a state machine, gating downstream work on segment handoff rather than on the task's SUCCESS alone.

Core Concept & Internal Mechanics

An ingestion pipeline in Druid is not a single component but a coordinated hand-off between three control-plane services and the orchestration layer that drives them. The Overlord accepts task submissions on POST /druid/indexer/v1/task, persists them to the task metadata store, and assigns them to worker capacity. For native batch and streaming ingestion, work executes on MiddleManager or Indexer processes, which read source data, build columnar segments in memory, spill to intermediate persists, and finally publish immutable segments to deep storage while recording an entry in the segment metadata table. The Coordinator then reconciles that metadata against load rules and instructs Historical nodes to pull the new segments, at which point the Broker begins routing queries to them. The moment a segment transitions from "published in metadata" to "loaded and announced by a Historical" is the handoff — the single most important synchronization point an orchestrator must observe, because a task reporting SUCCESS does not by itself guarantee the data is queryable.

Understanding this boundary is what separates a robust pipeline from a fragile one. The orchestrator's job is to treat every ingestion as a state machine — WAITING → PENDING → RUNNING → SUCCESS | FAILED on the task side, and published → loaded → available on the segment side — and to reconcile both before signalling completion to downstream consumers. Because Druid segments are immutable and time-partitioned, the pipeline never mutates data in place: a correction or a late-arriving batch produces new segments that either append to or atomically replace an interval. This immutability contract is the same one that governs the Druid segment architecture and lifecycle, and the orchestration layer exists to enforce it programmatically rather than by convention.
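The two state machines above can be reconciled in a few lines. The following is a minimal sketch, not a Druid API: the enum names and the `pipeline_verdict` helper are hypothetical, and the state labels are taken directly from the paragraph above.

```python
from enum import Enum
from typing import Optional


class TaskState(str, Enum):
    WAITING = "WAITING"
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"


class SegmentState(str, Enum):
    PUBLISHED = "published"
    LOADED = "loaded"
    AVAILABLE = "available"


def pipeline_verdict(task: TaskState, segment: Optional[SegmentState]) -> str:
    """Combine the task side and the segment side into one downstream signal."""
    if task is TaskState.FAILED:
        return "failed"
    if task is not TaskState.SUCCESS:
        return "in_progress"
    # The task succeeded, but completion still depends on the segment side.
    if segment is SegmentState.AVAILABLE:
        return "complete"
    return "awaiting_handoff"
```

Downstream consumers would be notified only on `"complete"`; a `SUCCESS` task whose segments are still `published` or `loaded` stays in `"awaiting_handoff"`.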

Three internal data structures drive orchestrator design. First, the task ID: a caller-supplied or Overlord-generated string that must be deterministic if you want idempotent resubmission. Second, the segment identifier (dataSource_intervalStart_intervalEnd_version_partitionNum), which encodes both ends of the interval and a monotonically increasing version so that a replacement supersedes its predecessor without a race. Third, the ioConfig.appendToExisting / dropExisting flags, which decide whether a new task extends an interval or shadows it. A pipeline that ignores these structures will eventually produce overlapping segments, duplicate rows, or a Coordinator stuck reconciling conflicting versions. The subsystems that own these concerns are covered in depth by dynamic ingestion spec generation for the build phase, and by async task execution patterns for the run phase.
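To make the segment identifier concrete, here is a hedged parsing sketch. It assumes the identifier renders as dataSource, interval start, interval end, and version joined by underscores, with millisecond ISO-8601 timestamps, a timestamp-shaped version, and a partition suffix that is omitted for partition 0. `SegmentId`, `parse_segment_id`, and `supersedes` are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumed timestamp shape inside a segment ID, e.g. 2026-07-04T00:00:00.000Z
_TS = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z"
_SEGMENT_ID = re.compile(
    rf"^(?P<datasource>.+)_(?P<start>{_TS})_(?P<end>{_TS})_(?P<version>{_TS})"
    rf"(?:_(?P<partition>\d+))?$"
)


class SegmentId(NamedTuple):
    datasource: str
    start: str
    end: str
    version: str
    partition: int


def parse_segment_id(raw: str) -> Optional[SegmentId]:
    """Split a segment ID into its parts; datasource names may contain underscores."""
    m = _SEGMENT_ID.match(raw)
    if m is None:
        return None
    return SegmentId(
        m["datasource"], m["start"], m["end"], m["version"], int(m["partition"] or 0)
    )


def supersedes(new: SegmentId, old: SegmentId) -> bool:
    """True when `new` covers the same time chunk with a later version string."""
    same_chunk = (new.datasource, new.start, new.end) == (old.datasource, old.start, old.end)
    # Fixed-width ISO timestamps order correctly under plain string comparison.
    return same_chunk and new.version > old.version
```

Anchoring the pattern on the three timestamps, rather than splitting on underscores, keeps datasource names such as events_web intact.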

Configuration Reference

Every automated pipeline ultimately emits a JSON ingestion spec. The spec's top-level shape is fixed: a type and a spec object containing dataSchema, ioConfig, and tuningConfig. Below is an annotated native batch spec (index_parallel) that an orchestrator would template. Runtime resolution of dataSource, dimensions, metrics, and partitioning is the subject of dynamic ingestion spec generation; here every tunable field is documented inline so the contract is unambiguous.

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "events_web",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": ["country", "device", "page_id"]
      },
      "metricsSpec": [
        { "type": "count", "name": "rows" },
        { "type": "longSum", "name": "bytes", "fieldName": "resp_bytes" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": true,
        "intervals": ["2026-07-04T00:00:00Z/2026-07-05T00:00:00Z"]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "prefixes": ["s3://lake/events/web/2026-07-04/"]
      },
      "inputFormat": { "type": "json" },
      "appendToExisting": false,
      "dropExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsInMemory": 1000000,
      "maxRowsPerSegment": 5000000,
      "maxNumConcurrentSubTasks": 4,
      "intermediatePersistPeriod": "PT10M",
      "partitionsSpec": {
        "type": "range",
        "partitionDimensions": ["country"],
        "targetRowsPerSegment": 5000000
      },
      "forceGuaranteedRollup": true
    }
  }
}

Field-by-field, the load-bearing knobs are:

granularitySpec.segmentGranularity — the physical partition width. It fixes how many segments an interval produces and, together with intervals, defines the replace/append boundary. It maps directly onto the segment granularity settings that govern partition boundaries.
granularitySpec.queryGranularity and rollup — the pre-aggregation contract. With rollup: true, rows sharing all dimensions within a queryGranularity bucket collapse into one, trading raw fidelity for a smaller footprint on the columnar storage formats that back each segment.
ioConfig.appendToExisting — true adds partitions to an interval; false (with matching intervals) makes the task the authoritative producer for that window.
ioConfig.dropExisting — when true and combined with an explicit intervals, the task atomically replaces every existing segment in that interval on handoff, which is the safe primitive for backfills.
tuningConfig.maxRowsInMemory and intermediatePersistPeriod — control heap pressure and spill cadence on the worker; the orchestrator should tune these from cluster telemetry rather than hardcoding.
tuningConfig.partitionsSpec — range/single_dim for query-locality on a high-cardinality dimension, or hashed for even distribution. The choice is a genuine trade-off worth its own decision page; see the sizing section below.
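The interaction between the append and replace flags listed above lends itself to a pre-flight check. The sketch below is illustrative only: `check_write_mode` is a hypothetical helper, and the three rules it encodes are the ones described in this section plus the documented restriction that forceGuaranteedRollup cannot be combined with appendToExisting.

```python
from typing import List


def check_write_mode(spec: dict) -> List[str]:
    """Return problems with the append/replace flags; an empty list means OK."""
    inner = spec.get("spec", {})
    io = inner.get("ioConfig", {})
    tuning = inner.get("tuningConfig", {})
    intervals = (
        inner.get("dataSchema", {}).get("granularitySpec", {}).get("intervals") or []
    )
    append = bool(io.get("appendToExisting", False))
    drop = bool(io.get("dropExisting", False))

    problems: List[str] = []
    if drop and append:
        problems.append("dropExisting has no effect while appendToExisting is true")
    if drop and not intervals:
        problems.append("dropExisting requires explicit granularitySpec.intervals")
    if append and tuning.get("forceGuaranteedRollup", False):
        problems.append("forceGuaranteedRollup cannot be combined with appendToExisting")
    return problems
```

Running this against the annotated spec above returns an empty list; flipping appendToExisting to true in that spec would flag the forceGuaranteedRollup conflict before the task reaches the Overlord.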

The API endpoints an orchestrator calls have been stable across recent Druid releases: submit with POST /druid/indexer/v1/task, poll status with GET /druid/indexer/v1/task/{taskId}/status, fetch the full report with GET /druid/indexer/v1/task/{taskId}/reports, and cancel with POST /druid/indexer/v1/task/{taskId}/shutdown. Streaming ingestion uses the supervisor family: POST /druid/indexer/v1/supervisor to create or update, and GET /druid/indexer/v1/supervisor/{id}/status to inspect lag and partition offsets. Coordinator-side, GET /druid/coordinator/v1/loadstatus and GET /druid/coordinator/v1/datasources/{ds}/segments confirm that handoff actually completed. Full field semantics are maintained in the official Apache Druid ingestion guide.

Operational Sizing & Constraints

Sizing an ingestion pipeline is a calibration problem with a handful of well-defined relationships. The first is the target segment size. Druid performs best when segments land in the 300–700 MB range (roughly 5 million rows for typical event data); undersized segments inflate metadata and query-planning overhead, while oversized segments strain Historical JVM heaps. Given a target size in bytes and an average row width, the row target is:

$$ \text{targetRows} \approx \frac{\text{targetMB} \times 1048576}{\text{avgRowBytes}} $$

For 500 MB segments over rows averaging \(\approx 105\) bytes on disk after compression, that yields \(\approx 5 \times 10^{6}\) rows per segment, which is the targetRowsPerSegment value in the spec above. The second relationship governs how many segments an interval produces, which drives Coordinator and Historical load:

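$$ \text{segmentsPerInterval} \approx \text{timeChunksPerInterval} \times \left\lceil \frac{\text{rowsPerTimeChunk}}{\text{targetRows}} \right\rceil $$

Here timeChunksPerInterval is the number of segmentGranularity buckets the interval spans (24 for one day at HOUR granularity) and rowsPerTimeChunk is the post-rollup row count in each bucket. The number of partition dimensions changes how rows are distributed across segments, not how many segments are produced.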

The third is worker capacity. A MiddleManager runs at most druid.worker.capacity task slots; with maxNumConcurrentSubTasks set to \(k\) per parallel task, the effective concurrency ceiling is \(k \times \text{numRunningSupervisorTasks}\), and exceeding total slot capacity queues tasks in PENDING. The orchestrator must therefore treat slots as a finite pool: submit rate should back off as GET /druid/indexer/v1/workers reports declining available capacity. Finally, streaming pipelines size taskCount and taskDuration against Kafka partition count (at most one reading task per partition per replica) so that consumer lag stays bounded:

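$$ \text{minReadingTasks} \approx \left\lceil \frac{\text{kafkaPartitions}}{\text{targetPartitionsPerTask}} \right\rceil \times \text{replicas} $$

The ceiling term is the value to set as taskCount in the supervisor spec; Druid multiplies it by replicas itself, so the product is the number of worker slots the reading tasks occupy.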

These constraints interact: shrinking segmentGranularity to reduce per-segment rows multiplies segment count and Coordinator work, which is precisely why automated compaction scheduling exists to reconcile fragmented streaming output back toward the target size after the fact.
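The sizing arithmetic in this section can be sketched as three small helpers. Function names are hypothetical, and the shape of the /druid/indexer/v1/workers payload used by `free_slots` (a list of objects with `worker.capacity` and `currCapacityUsed`) is an assumption that should be checked against the cluster's actual response.

```python
import math
from typing import Any, Dict, List


def target_rows(target_mb: float, avg_row_bytes: float) -> int:
    """Rows per segment for a target segment size and an average row width."""
    return round(target_mb * 1_048_576 / avg_row_bytes)


def min_reading_tasks(kafka_partitions: int, partitions_per_task: int, replicas: int) -> int:
    """Worker slots needed by reading tasks: per-replica task count times replicas."""
    return math.ceil(kafka_partitions / partitions_per_task) * replicas


def free_slots(workers: List[Dict[str, Any]]) -> int:
    """Sum unused task slots across workers (payload shape is an assumption)."""
    return sum(
        max(w["worker"]["capacity"] - w["currCapacityUsed"], 0) for w in workers
    )
```

For example, `target_rows(500, 105)` gives 4993219, which is the roughly five million rows quoted above, and `min_reading_tasks(12, 4, 2)` gives 6. An orchestrator could delay submissions whenever `free_slots` falls below the number of sub-tasks the next parallel task would request.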

Pipeline Orchestration Patterns

The orchestration layer is deliberately thin: build a spec, validate it, submit it, poll to terminal state with exponential backoff, then confirm handoff. The reference implementation below uses only the Python standard library plus requests. It is the canonical pattern that both async task execution patterns and the Python-first walkthrough in automating Druid ingestion specs with Python build upon.

import hashlib
import time

import requests

OVERLORD = "http://overlord:8090"
COORDINATOR = "http://coordinator:8081"
TERMINAL = {"SUCCESS", "FAILED"}

def deterministic_task_id(datasource: str, interval: str, spec_version: str) -> str:
    """Content-hashed ID so a resubmission of the same work is idempotent."""
    key = f"{datasource}:{interval}:{spec_version}".encode()
    return f"idx_{datasource}_{hashlib.sha1(key).hexdigest()[:12]}"

def submit_task(spec: dict) -> str:
    """Submit a task spec and return the task ID the Overlord accepted."""
    r = requests.post(f"{OVERLORD}/druid/indexer/v1/task", json=spec, timeout=30)
    r.raise_for_status()
    return r.json()["task"]

def poll_until_terminal(task_id: str, max_wait: int = 3600) -> str:
    """Poll task status with capped exponential backoff."""
    deadline = time.monotonic() + max_wait
    delay = 2.0
    while time.monotonic() < deadline:
        r = requests.get(
            f"{OVERLORD}/druid/indexer/v1/task/{task_id}/status", timeout=15
        )
        r.raise_for_status()
        state = r.json()["status"]["status"]
        if state in TERMINAL:
            return state
        time.sleep(delay)
        delay = min(delay * 2, 30.0)  # cap backoff at 30s
    raise TimeoutError(f"{task_id} did not reach a terminal state in {max_wait}s")

def confirm_handoff(datasource: str, interval: str, tries: int = 20) -> bool:
    """A SUCCESS task is not queryable until the Coordinator loads its segments."""
    # Segment IDs carry millisecond timestamps (2026-07-04T00:00:00.000Z), so
    # compare against the interval start without its trailing "Z".
    start = interval.split("/")[0].rstrip("Z")
    delay = 2.0
    for _ in range(tries):
        r = requests.get(
            f"{COORDINATOR}/druid/coordinator/v1/datasources/{datasource}/segments",
            timeout=15,
        )
        r.raise_for_status()
        # The response is a list of segment ID strings for served segments.
        if any(seg.startswith(f"{datasource}_{start}") for seg in r.json()):
            return True
        time.sleep(delay)
        delay = min(delay * 2, 30.0)
    return False

def run_ingestion(spec: dict, datasource: str, interval: str, spec_version: str) -> None:
    # Attach the deterministic ID so a retried run resubmits the same task ID.
    task_id = deterministic_task_id(datasource, interval, spec_version)
    submitted_id = submit_task({**spec, "id": task_id})
    state = poll_until_terminal(submitted_id)
    if state != "SUCCESS":
        raise RuntimeError(f"ingestion {submitted_id} failed: {state}")
    if not confirm_handoff(datasource, interval):
        raise RuntimeError(f"{submitted_id} succeeded but segments never loaded")

Three patterns make this production-ready. Idempotent submission: deriving the task ID from a content hash means a retried pipeline run submits the same ID, and the Overlord rejects the duplicate rather than double-ingesting. Capped exponential backoff: polling starts at two seconds and doubles to a thirty-second ceiling, keeping the Overlord's status endpoint from being hammered while still reacting quickly to fast tasks. Handoff confirmation: the pipeline never signals success on the task state alone — it queries the Coordinator until the segment for the interval is actually loaded. For streaming sources, the same skeleton wraps supervisor management instead of one-shot tasks, and the reconciliation of batch backfills against live streams is handled by batch vs streaming ingestion sync, which prevents overlapping segments across the two paths. Validation is deliberately factored out into a pre-flight gate rather than inlined here, so that a malformed spec is rejected before it ever consumes an Overlord slot — the contract enforced by schema validation for Druid specs.
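The capped exponential backoff described above can be isolated into a generator, which makes the schedule easy to inspect and reuse across pollers. This is a minimal sketch; `backoff_delays` is a hypothetical helper, with defaults matching the two-second start and thirty-second ceiling used in this section.

```python
import itertools
from typing import Iterator


def backoff_delays(
    initial: float = 2.0, cap: float = 30.0, factor: float = 2.0
) -> Iterator[float]:
    """Yield sleep intervals that grow geometrically and then hold at the cap."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# First six sleeps of the default schedule.
schedule = list(itertools.islice(backoff_delays(), 6))
print(schedule)  # [2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

A poller can iterate this generator instead of tracking `delay` by hand, which keeps the task poller and the handoff poller on the same schedule.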

Failure Modes & Diagnostics

Ingestion pipelines fail in a small number of recognizable ways. Each below pairs the symptom an engineer observes with its root cause and remediation.

Symptom: task sits in PENDING indefinitely and never reaches RUNNING. Root cause: no free worker slots — every MiddleManager is at druid.worker.capacity. Remediation: inspect GET /druid/indexer/v1/workers for available capacity, raise worker capacity or add MiddleManagers, and have the orchestrator throttle its submit rate when free slots approach zero.
Symptom: task returns FAILED immediately after submission with a parse error in the report. Root cause: a malformed spec — missing tuningConfig, an invalid timestampSpec.format, or a granularitySpec.intervals value the metadata store rejects. Remediation: move validation upstream so the spec is rejected pre-flight; the diagnostics live in schema validation for Druid specs. Pull the full failure report with GET /druid/indexer/v1/task/{taskId}/reports.
Symptom: task reports SUCCESS but queries return no rows for the interval. Root cause: handoff has not completed — segments are published in metadata but the Coordinator has not yet instructed a Historical to load them (or a load rule excludes the interval). Remediation: poll GET /druid/coordinator/v1/loadstatus; confirm a load rule covers the interval; never signal downstream success on task state alone.
Symptom: duplicate rows appear after a pipeline retry. Root cause: non-deterministic task IDs caused the same work to ingest twice with appendToExisting: true. Remediation: use content-hashed task IDs, and prefer dropExisting: true with an explicit intervals for backfills so the replacement is atomic.
Symptom: streaming ingestion lag grows without bound and segments never publish. Root cause: taskCount is too low for the partition count and event rate, so each reading task is overloaded, or taskDuration is so long that tasks rarely reach the publishing phase, which starts only once taskDuration elapses. Remediation: raise taskCount toward partition count, shorten taskDuration, and inspect supervisor status; deeper triage lives in debugging Druid supervisor task failures.
Symptom: worker JVM OOMs mid-task, killing the ingestion. Root cause: maxRowsInMemory too high relative to heap, or a high-cardinality dimension exploding the in-memory dictionary. Remediation: lower maxRowsInMemory, shorten intermediatePersistPeriod to spill sooner, and reconsider whether the offending column should be a dimension at all.
Symptom: thousands of tiny segments accumulate and query planning slows. Root cause: streaming ingestion at fine segmentGranularity publishes many undersized segments. Remediation: apply automated compaction scheduling to merge them back toward the target size, tuned via compaction threshold tuning.

A fast triage one-liner for any stuck task is to fetch its live report and extract the row statistics (including the unparseable-row counter) and the error message:

curl -s http://overlord:8090/druid/indexer/v1/task/$TASK_ID/reports \
  | jq '.ingestionStatsAndErrors.payload.rowStats, .ingestionStatsAndErrors.payload.errorMsg'
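Inside the orchestrator, the same triage can run on the parsed report. The sketch below is hedged: it assumes the report nests row statistics under ingestionStatsAndErrors.payload.rowStats as in the jq filter above, and that counters may sit under a totals key or directly under per-phase keys. `count_key` and `triage` are hypothetical helpers.

```python
from typing import Any, Dict


def count_key(node: Any, key: str) -> int:
    """Sum every integer stored under `key` anywhere in a nested structure."""
    if isinstance(node, dict):
        total = 0
        for k, v in node.items():
            if k == key and isinstance(v, int) and not isinstance(v, bool):
                total += v
            else:
                total += count_key(v, key)
        return total
    if isinstance(node, list):
        return sum(count_key(item, key) for item in node)
    return 0


def triage(report: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the unparseable-row count and error message from a task report."""
    payload = report.get("ingestionStatsAndErrors", {}).get("payload", {})
    row_stats = payload.get("rowStats", {})
    # Prefer cumulative totals when present; ignore float moving averages.
    stats = row_stats.get("totals", row_stats)
    return {
        "unparseable": count_key(stats, "unparseable"),
        "error": payload.get("errorMsg"),
    }
```

Summing recursively avoids hardcoding phase names, which differ between task types.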

Security & Access Control Boundaries

An orchestrator holds the keys to every datasource it can write, so its access boundary must be scoped deliberately. Druid's authorization model gates the ingestion APIs behind DATASOURCE write permissions and the task APIs behind STATE/CONFIG permissions; a pipeline service account should hold WRITE on exactly the datasources it owns and nothing else. When Druid is deployed with the basic-security extension or an LDAP/OIDC integration, the orchestrator authenticates as a dedicated principal, and its credentials are injected from a secrets manager at runtime — never embedded in a spec or committed to the repository that stores the templates.

Multi-tenant clusters raise two additional concerns. First, datasource isolation: each tenant's pipeline should be unable to submit tasks that write to, or drop, another tenant's datasource, which is enforced by per-datasource ACLs rather than by convention in the orchestrator. These map onto the same segment access security boundaries that govern who may query or retire a segment once it is loaded. Second, input-source trust: an inputSource pointing at S3, HDFS, or a Kafka topic carries the credentials Druid uses to read it, so those credentials must be tenant-scoped and least-privilege — a compromised spec should not be able to exfiltrate another tenant's bucket. The orchestration layer should validate that a submitted spec's inputSource prefixes fall within the tenant's allowed namespace before forwarding it to the Overlord, treating the pre-flight validator as a policy-enforcement point and not merely a syntax check. Finally, the POST /druid/indexer/v1/task/{taskId}/shutdown endpoint is a privileged operation: automated rollback must authenticate as an operator principal, and every cancellation should be logged with the triggering condition for auditability.
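The namespace check described above can be sketched as a fail-closed predicate. It assumes an S3-style inputSource that lists its locations under prefixes or uris, as in the annotated spec earlier; `input_source_allowed` and the tenant namespace strings are hypothetical.

```python
from typing import Iterable


def input_source_allowed(spec: dict, allowed_namespaces: Iterable[str]) -> bool:
    """True only if every input location sits under one of the tenant's namespaces."""
    source = spec.get("spec", {}).get("ioConfig", {}).get("inputSource", {})
    locations = list(source.get("prefixes", [])) + list(source.get("uris", []))
    if not locations:
        # Fail closed: a spec with nothing recognizable to check is not forwarded.
        return False
    # Normalize to a trailing slash so "s3://lake/a" does not admit "s3://lake/ab/".
    roots = [ns.rstrip("/") + "/" for ns in allowed_namespaces]
    return all(any(loc.startswith(root) for root in roots) for loc in locations)
```

Specs that address objects some other way (for example an explicit object list) would need their own branch; this sketch rejects them rather than guessing.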

Monitoring & Alerting Hooks

A pipeline you cannot observe is a pipeline you cannot operate. Druid emits metrics through its emitter subsystem, and the Prometheus emitter exposes the ones an orchestrator's dashboards depend on. Track ingestion throughput with druid_ingest_events_processed and its sibling druid_ingest_events_unparseable — a rising unparseable rate is the earliest signal of schema drift, long before a task fails outright. Watch druid_ingest_events_thrownAway for rows falling outside the ingestion intervals, which usually means a timestamp or timezone misconfiguration. For streaming pipelines, druid_ingest_kafka_lag (or druid_ingest_kinesis_lag) is the primary SLA metric: alert when lag exceeds the number of events the Druid cluster ingests in one taskDuration, because past that point the pipeline is falling permanently behind.

On the task side, druid_task_run_time and the task success/failure counters feed a simple burn-rate alert: page when the failure ratio over a five-minute window crosses a threshold (for example, more than 5% of submitted tasks failing), which is also the natural trigger for automated rollback. Coordinator-side, druid_coordinator_segment_unavailable_count and druid_coordinator_segment_load_queue_size reveal handoff stalls — segments published but not yet loaded — and should alert when either stays non-zero for longer than one Coordinator run period. A useful Grafana layout puts three rows on one panel set: an ingestion row (processed vs. unparseable vs. thrown-away, stacked), a lag row (per-supervisor Kafka lag with the SLA threshold drawn as a static line), and a handoff row (unavailable-segment count with the load-queue size overlaid). Export these thresholds into the orchestrator itself so that the same numbers driving the dashboards also drive the pipeline's own back-pressure and retry decisions, closing the loop between observability and control.
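The two alert conditions above reduce to simple predicates that the orchestrator can evaluate with the same thresholds as the dashboards. This is a sketch with hypothetical function names; the counts and rates are assumed to come from whatever metrics store backs the dashboards.

```python
def failure_ratio_breached(succeeded: int, failed: int, threshold: float = 0.05) -> bool:
    """True when the share of failed tasks in the window exceeds the threshold."""
    total = succeeded + failed
    if total == 0:
        return False  # no tasks finished in the window, nothing to alert on
    return failed / total > threshold


def lag_breached(lag_events: float, ingest_rate_per_sec: float, task_duration_sec: float) -> bool:
    """True when lag exceeds what the cluster ingests in one taskDuration."""
    return lag_events > ingest_rate_per_sec * task_duration_sec
```

With a 5% threshold, 6 failures out of 100 tasks in the window would trip the first check, while exactly 5 out of 100 would not, since the comparison is strict.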

Related

Dynamic ingestion spec generation: build validated specs at runtime from catalog metadata and cluster telemetry instead of static JSON files.
Schema validation for Druid specs: the pre-flight contract that rejects malformed specs before they consume an Overlord slot.
Async task execution patterns: non-blocking submission, status polling, and backoff for hundreds of concurrent ingestion jobs.
Batch vs streaming ingestion sync: coordinate historical backfills with live Kafka or Kinesis streams without overlapping segments.
Segment compaction, retention & storage optimization: reconcile fragmented ingestion output back toward the target segment size and enforce data expiration.

Up one level: Apache Druid segment architecture & lifecycle fundamentals grounds the immutable-segment contract that every pipeline on this page is built to honor.
