Async Task Execution Patterns for Apache Druid Ingestion

Production OLAP platforms cannot afford ingestion pipelines that block a control thread for the minutes or hours that an index_parallel task, or the indexing tasks spawned by a Kafka supervisor, may run. Apache Druid's task model is asynchronous by design: the Overlord accepts a spec, returns a task ID immediately, and the work executes in a Peon process launched by a MiddleManager (or inside an Indexer process) while the caller polls for a terminal state. Turning that primitive into a reliable, restart-safe control loop is the subject of this page. Within the broader discipline of automated ingestion pipeline orchestration, async execution is the layer responsible for ensuring that every submitted spec produces durable, queryable segments exactly once, even across network partitions, Overlord leader elections, and orchestrator restarts.

Mechanics & Internals

Druid separates task submission from task execution. A POST to /druid/indexer/v1/task with a serialized ingestion spec is a lightweight, near-instant operation: the Overlord persists the task into its metadata store (the druid_tasks table), queues it for assignment to a worker slot, and returns a JSON body containing a single task field that holds the assigned task ID. The heavy work (reading source rows, building the columnar index, and publishing segments to deep storage) happens asynchronously on a Peon process. The caller learns about progress only by polling.

Every task walks a finite state machine. The Overlord reports a coarse RUNNING, SUCCESS, or FAILED status through /druid/indexer/v1/task/{taskId}/status. While that coarse status reads RUNNING, the task may still be WAITING (accepted but not yet runnable, typically because it is blocked on a lock for its interval) or PENDING (ready to run but queued for a free worker slot) before a Peon actually starts it. After SUCCESS comes a handoff phase in which the Coordinator loads the freshly published segments onto Historicals before they become queryable. An orchestrator that treats SUCCESS as "data is queryable" without confirming handoff will race the Coordinator: the segment exists in deep storage and the metadata store but is not yet served by any Historical. This handoff boundary is where async execution intersects with query routing and segment discovery, which governs when a Broker begins including a segment in query fan-out.

The orchestrator never blocks on ingestion: it submits a task, receives an ID, and polls the Overlord until the task reaches a terminal state.

The API surface that matters

Five endpoints cover the entire async lifecycle. Memorize their contracts:

POST /druid/indexer/v1/task — submit a spec; returns {"task": "<taskId>"}.
GET /druid/indexer/v1/task/{taskId}/status — coarse state; returns {"status": {"status": "RUNNING|SUCCESS|FAILED", "statusCode": ..., "duration": ...}}.
GET /druid/indexer/v1/task/{taskId}/reports — post-mortem row counts, unparseable events, and the ingestionState; the authoritative source for why a task failed.
POST /druid/indexer/v1/task/{taskId}/shutdown — cooperatively cancel a runaway or superseded task.
GET /druid/indexer/v1/tasks?state=running — the reconciliation endpoint; the ground truth an orchestrator diffs its local state against after a restart.

Idempotency is a spec property, not a client retry flag

Because a network timeout can hide whether the Overlord received a submission, retries are inevitable, and duplicate submissions of the same interval produce overlapping segments that inflate storage and fragment queries. Druid does not deduplicate tasks for you at the HTTP layer. The durable defense is a deterministic task ID: derive id in the spec from a hash of (dataSource, interval, specVersion) so that a resubmission collides with the in-flight task rather than spawning a second one. The Overlord rejects a POST whose id matches a task it already tracks, giving you idempotency for free. Because that check typically also covers recently completed tasks whose records are still in the metadata store, a deliberate re-run after a failure generally needs a new specVersion rather than the same ID. This is the same determinism discipline enforced upstream during dynamic ingestion spec generation, where the spec, and therefore the ID, is computed from the data contract rather than hand-authored.

Validated Configuration Spec

An async-friendly index_parallel spec pins three things the orchestrator relies on: a deterministic id, explicit interval boundaries (so retries are idempotent and never scan unbounded time), and a maxNumConcurrentSubTasks sized to real worker capacity. Every top-level key required by Druid (type, spec, and the nested ioConfig and tuningConfig) is present; because JSON carries no comments, the fields that matter are explained in the notes after the spec.

{
  "type": "index_parallel",
  "id": "events__2026-07-04T00_00_00Z__v3__a1b2c3",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": ["country", "device", "campaign_id"]
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "longSum", "name": "revenue", "fieldName": "revenue" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": true,
        "intervals": ["2026-07-04T00:00:00Z/2026-07-04T01:00:00Z"]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "prefixes": ["s3://analytics-lake/events/2026/07/04/00/"]
      },
      "inputFormat": { "type": "json" },
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumConcurrentSubTasks": 4,
      "maxRowsPerSegment": 5000000,
      "maxRowsInMemory": 1000000,
      "partitionsSpec": {
        "type": "hashed",
        "numShards": null,
        "partitionDimensions": ["country"]
      },
      "forceGuaranteedRollup": true
    }
  }
}

Field notes an orchestrator author must not skip:

id — the deterministic hash described above. Omit it and Druid auto-generates a random ID, destroying idempotency.
granularitySpec.intervals — bounding the interval is mandatory when forceGuaranteedRollup is true and is the single most effective guard against a retry re-ingesting the wrong window. The chosen segmentGranularity should agree with the segment granularity settings already governing this datasource, or you will produce mismatched partition boundaries that later force extra compaction.
maxNumConcurrentSubTasks — the parallelism ceiling; oversizing it starves other tenants of MiddleManager slots and pushes the Overlord scheduler into backlog. See the sizing formula below.
appendToExisting: false with forceGuaranteedRollup: true — replaces the interval atomically, which is what makes a re-run safe rather than additive.
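The field notes above can be turned into a pre-submit gate. The following is a minimal sketch, not part of Druid: preflight_errors is a hypothetical helper name, and it checks only the four properties discussed on this page, assuming the spec has the same shape as the example.

```python
from typing import Any


def preflight_errors(task: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the spec passes."""
    errors: list[str] = []

    # Without an explicit id, Druid generates a random one and retries
    # can no longer collide with the original submission.
    if not task.get("id"):
        errors.append("missing id")

    spec = task.get("spec", {})
    granularity = spec.get("dataSchema", {}).get("granularitySpec", {})
    io_config = spec.get("ioConfig", {})
    tuning = spec.get("tuningConfig", {})

    # A bounded interval keeps a retry from scanning the wrong window.
    if not granularity.get("intervals"):
        errors.append("granularitySpec.intervals is empty or absent")

    # Replace-in-place semantics: guaranteed rollup cannot append.
    if tuning.get("forceGuaranteedRollup") and io_config.get("appendToExisting"):
        errors.append("forceGuaranteedRollup requires appendToExisting=false")

    subtasks = tuning.get("maxNumConcurrentSubTasks")
    if not isinstance(subtasks, int) or subtasks < 1:
        errors.append("maxNumConcurrentSubTasks must be a positive integer")

    return errors
```

An orchestrator would call this just before the POST and refuse to submit when the list is non-empty; it complements full schema validation rather than replacing it.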

For Kafka streaming, the analogous submission is a supervisor spec POSTed to /druid/indexer/v1/supervisor; the same idempotency and interval-alignment concerns manifest at the boundary between continuous and discrete loads, covered in batch versus streaming ingestion synchronization.

Sizing Heuristics & Formulas

Async orchestration introduces two numbers you must compute rather than guess: how many subtasks to run concurrently, and how long to wait before a poll gives up.

Concurrent subtask capacity is bounded by the total worker slots the MiddleManager tier exposes, discounted by a headroom factor so streaming supervisors and ad-hoc tasks are not starved:

$$ \text{maxNumConcurrentSubTasks} \approx \left\lfloor h \times \frac{\text{middleManagers} \times \text{workerCapacity}}{\text{concurrentPipelines}} \right\rfloor $$

where workerCapacity is druid.worker.capacity per MiddleManager, concurrentPipelines is how many independent ingestion jobs may run at once, and $h \approx 0.7$ reserves 30% headroom. The floor is taken last so the result is a whole number of subtasks. For an 8-MiddleManager tier at capacity 4, running up to 4 pipelines: $\lfloor 0.7 \times 8 \times 4 / 4 \rfloor = \lfloor 5.6 \rfloor = 5$ subtasks per job. Note that each index_parallel task consumes one slot for its supervisor plus one per subtask, so the effective demand is maxNumConcurrentSubTasks + 1 slots per job: 6 here, or 24 of the tier's 32 slots when all four pipelines are active.
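The capacity arithmetic can be checked with a few lines of Python. This is only a sketch of the heuristic described here; the function names are illustrative, and the result is rounded down to a whole number of subtasks.

```python
import math


def max_concurrent_subtasks(
    middle_managers: int,
    worker_capacity: int,
    concurrent_pipelines: int,
    headroom: float = 0.7,
) -> int:
    """Per-job subtask ceiling after reserving headroom for other work."""
    total_slots = middle_managers * worker_capacity
    return math.floor(headroom * total_slots / concurrent_pipelines)


def slots_demanded(subtasks: int, concurrent_pipelines: int) -> int:
    """Each index_parallel job needs one supervisor slot plus its subtasks."""
    return (subtasks + 1) * concurrent_pipelines


per_job = max_concurrent_subtasks(8, 4, 4)   # 5
in_use = slots_demanded(per_job, 4)          # 24 of the tier's 32 slots
```

Computing the total demand alongside the ceiling makes it easy to confirm that the supervisor slots have not eaten the headroom the factor was meant to reserve.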

The poll timeout should track the expected task duration, itself a function of input volume and per-subtask throughput:

$$ \text{expectedSeconds} \approx \frac{\text{inputRows}}{\text{maxNumConcurrentSubTasks} \times \text{rowsPerSecPerSubTask}} $$

Set the orchestrator's hard timeout to roughly $2 \times \text{expectedSeconds}$ to absorb GC pauses and handoff latency, and cap the exponential backoff interval well below it — a poll interval that grows to five minutes on a task that finishes in ninety seconds wastes throughput, while polling every 200 ms hammers the Overlord. A backoff that starts at 1 s, doubles, and caps at 30 s covers both short batch tasks and multi-hour reindexes.
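As a rough illustration of the timeout and backoff rules, the sketch below computes an expected duration, the 2x hard timeout, and the un-jittered poll schedule. The input figures (36 million rows, 20,000 rows per second per subtask) are made-up numbers chosen for easy arithmetic, not measurements.

```python
def expected_seconds(input_rows: int, subtasks: int, rows_per_sec_per_subtask: float) -> float:
    """Expected task duration from input volume and per-subtask throughput."""
    return input_rows / (subtasks * rows_per_sec_per_subtask)


def hard_timeout(expected_s: float, factor: float = 2.0) -> float:
    """Deadline with slack for GC pauses and handoff latency."""
    return factor * expected_s


def backoff_schedule(start: float = 1.0, cap: float = 30.0, steps: int = 8) -> list[float]:
    """Un-jittered poll delays: start, double each time, never exceed cap."""
    delays: list[float] = []
    delay = start
    for _ in range(steps):
        delays.append(delay)
        delay = min(delay * 2, cap)
    return delays


eta = expected_seconds(36_000_000, 5, 20_000)   # 360.0 seconds
deadline = hard_timeout(eta)                    # 720.0 seconds
schedule = backoff_schedule()                   # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0, 30.0]
```

With these numbers the schedule reaches its 30 s cap after about half a minute of polling, well inside the 720 s deadline, so the cap rather than the doubling determines how quickly completion is noticed.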

Python Orchestration Snippet

The following controller uses only the standard library plus requests. It submits a spec with a deterministic ID, polls with capped exponential backoff and jitter, treats a resubmission of an in-flight task as success (idempotency), and returns the final report. It is safe to call repeatedly for the same interval.

import hashlib
import json
import random
import time
from typing import Any

import requests

OVERLORD = "http://overlord.druid.internal:8090"
TERMINAL = {"SUCCESS", "FAILED"}


def deterministic_task_id(spec: dict[str, Any], spec_version: str) -> str:
    ds = spec["spec"]["dataSchema"]["dataSource"]
    interval = spec["spec"]["dataSchema"]["granularitySpec"]["intervals"][0]
    digest = hashlib.sha1(f"{ds}|{interval}|{spec_version}".encode()).hexdigest()[:6]
    safe_interval = interval.replace(":", "_").replace("/", "__")
    return f"{ds}__{safe_interval}__{spec_version}__{digest}"


def submit(spec: dict[str, Any], session: requests.Session) -> str:
    resp = session.post(
        f"{OVERLORD}/druid/indexer/v1/task",
        json=spec,
        timeout=30,
    )
    # 400 with an "already exists" body means a prior submit won the race:
    # treat as idempotent success and reuse the known id.
    if resp.status_code == 400 and "already exists" in resp.text:
        return spec["id"]
    resp.raise_for_status()
    return resp.json()["task"]


def poll(task_id: str, session: requests.Session, deadline_s: float) -> str:
    interval, cap = 1.0, 30.0
    start = time.monotonic()
    while True:
        if time.monotonic() - start > deadline_s:
            raise TimeoutError(f"{task_id} did not reach terminal state in time")
        resp = session.get(
            f"{OVERLORD}/druid/indexer/v1/task/{task_id}/status",
            timeout=15,
        )
        resp.raise_for_status()
        state = resp.json()["status"]["status"]
        if state in TERMINAL:
            return state
        # capped exponential backoff with equal jitter: sleep somewhere
        # between interval/2 and interval, so a wait never exceeds the cap
        time.sleep(interval / 2 + random.uniform(0, interval / 2))
        interval = min(interval * 2, cap)


def report(task_id: str, session: requests.Session) -> dict[str, Any]:
    resp = session.get(
        f"{OVERLORD}/druid/indexer/v1/task/{task_id}/reports",
        timeout=15,
    )
    resp.raise_for_status()
    return resp.json()


def run_ingestion(spec: dict[str, Any], spec_version: str, deadline_s: float = 3600) -> dict[str, Any]:
    spec = dict(spec)
    spec["id"] = deterministic_task_id(spec, spec_version)
    with requests.Session() as session:
        task_id = submit(spec, session)
        final = poll(task_id, session, deadline_s)
        rpt = report(task_id, session)
        if final == "FAILED":
            raise RuntimeError(
                f"{task_id} FAILED: "
                f"{json.dumps(rpt.get('ingestionStatsAndErrors', {}), default=str)}"
            )
        return rpt


if __name__ == "__main__":
    with open("events_spec.json") as fh:
        base_spec = json.load(fh)
    result = run_ingestion(base_spec, spec_version="v3")
    stats = result["ingestionStatsAndErrors"]["payload"]["rowStats"]
    print(json.dumps(stats, indent=2))

Two production refinements worth layering on: correlate every log line with the task_id so a distributed trace can follow one job across Overlord, Peon, and Coordinator; and, for asyncio-based orchestrators, replace the blocking poll with an await asyncio.sleep(...) loop over aiohttp so a single event loop can shepherd hundreds of concurrent tasks without a thread per task.
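The asyncio variant might look like the sketch below. To keep it self-contained it does not import aiohttp; instead it takes a fetch_status coroutine as a parameter, a hypothetical hook that a real orchestrator would implement as an aiohttp GET against the task status endpoint. The names poll_async and poll_many are illustrative.

```python
import asyncio
import random
from typing import Awaitable, Callable

TERMINAL = {"SUCCESS", "FAILED"}
FetchStatus = Callable[[str], Awaitable[str]]


async def poll_async(
    task_id: str,
    fetch_status: FetchStatus,
    deadline_s: float,
    start: float = 1.0,
    cap: float = 30.0,
) -> str:
    """Poll one task until it is terminal, yielding to the loop between polls."""
    loop = asyncio.get_running_loop()
    give_up_at = loop.time() + deadline_s
    interval = start
    while True:
        state = await fetch_status(task_id)
        if state in TERMINAL:
            return state
        if loop.time() >= give_up_at:
            raise TimeoutError(f"{task_id} did not reach terminal state in time")
        # Non-blocking sleep: other tasks are polled while this one waits.
        await asyncio.sleep(random.uniform(interval / 2, interval))
        interval = min(interval * 2, cap)


async def poll_many(
    task_ids: list[str],
    fetch_status: FetchStatus,
    deadline_s: float,
    start: float = 1.0,
) -> dict[str, str]:
    """Shepherd many tasks concurrently on a single event loop."""
    states = await asyncio.gather(
        *(poll_async(t, fetch_status, deadline_s, start=start) for t in task_ids)
    )
    return dict(zip(task_ids, states))
```

Injecting the fetch coroutine also makes the loop easy to unit test with a fake that returns RUNNING a few times before a terminal state.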

Failure Modes & Diagnostics

Async execution surfaces partial failures that synchronous code never sees. Diagnose them from the shell before touching the orchestrator.

1. Task stuck in PENDING: no free worker slot. Symptom: /status returns RUNNING for the index_parallel supervisor task but its subtasks never start. Confirm slot exhaustion:

curl -s "$OVERLORD/druid/indexer/v1/workers" \
  | jq '[.[] | {host: .worker.host, used: .currCapacityUsed, cap: .worker.capacity}]'

If currCapacityUsed equals worker.capacity across the tier, either lower maxNumConcurrentSubTasks or add MiddleManager capacity.

2. Orphaned tasks after an orchestrator restart. Symptom: local state lost, but tasks still run in Druid. Reconcile against the ground truth before submitting anything new:

curl -s "$OVERLORD/druid/indexer/v1/tasks?state=running" \
  | jq -r '.[] | select(.dataSource=="events") | .id'

Adopt any returned IDs into the local state machine instead of resubmitting — this is the reconciliation step that prevents duplicate segments.
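The reconciliation diff itself is small enough to sketch. The function below assumes the running-task list has already been fetched and parsed into dictionaries carrying at least the id and dataSource fields used in the jq filter above; the name reconcile and the three result buckets are illustrative choices.

```python
from typing import Any, Iterable


def reconcile(
    local_ids: Iterable[str],
    running_tasks: list[dict[str, Any]],
    datasource: str,
) -> dict[str, list[str]]:
    """Diff the orchestrator's view against what Druid says is running."""
    remote = {t["id"] for t in running_tasks if t.get("dataSource") == datasource}
    local = set(local_ids)
    return {
        # Running in Druid but unknown locally: adopt, do not resubmit.
        "adopt": sorted(remote - local),
        # Known locally but no longer running: fetch status, likely terminal.
        "recheck": sorted(local - remote),
        # Both sides agree: keep polling as before.
        "tracked": sorted(remote & local),
    }
```

Only after the adopt and recheck buckets are drained should the orchestrator consider submitting anything new for the affected intervals.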

3. SUCCESS but the data is not queryable. Symptom: task terminal, Broker returns no rows for the interval. The Coordinator has not finished handoff. Check pending loads:

curl -s "$COORDINATOR/druid/coordinator/v1/loadqueue?simple" | jq '.'
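An orchestrator can fold the same check into its control loop. The sketch below assumes the ?simple response is a JSON object keyed by Historical server, each value carrying a segmentsToLoad count; verify that shape against your Druid version. The rule of requiring several consecutive empty samples is a heuristic added here, not a Druid guarantee, since an empty queue alone does not prove that the Coordinator has already discovered the new segments.

```python
def pending_segment_loads(load_queue: dict[str, dict[str, int]]) -> int:
    """Total segments still queued for loading across all Historicals."""
    return sum(server.get("segmentsToLoad", 0) for server in load_queue.values())


def handoff_settled(samples: list[int], quiet_polls: int = 2) -> bool:
    """True once the most recent quiet_polls samples all show an empty queue."""
    if len(samples) < quiet_polls:
        return False
    return all(count == 0 for count in samples[-quiet_polls:])
```

A caller would append pending_segment_loads(...) to a list on each poll and proceed to query-side checks only when handoff_settled returns True.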

4. Task FAILED — is it transient or structural? Always pull the report; do not guess from the coarse status:

curl -s "$OVERLORD/druid/indexer/v1/task/$TASK_ID/reports" \
  | jq '.ingestionStatsAndErrors.payload | {state: .ingestionState, unparseable: .unparseableEvents, error: .errorMsg}'

An errorMsg mentioning S3 throttling or a socket timeout is transient — resubmit with backoff. A schema-parse or dimension-cast error is structural — resubmitting will fail identically. Deep supervisor and Peon diagnostics, including JVM heap forensics with jstat and Kafka rebalance storms, are covered in debugging Druid supervisor task failures. Structural spec rejections are best prevented upstream through schema validation for Druid specs rather than caught at the Peon.
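The transient-versus-structural decision can be encoded as a small classifier over the report. This is a sketch under stated assumptions: the marker substrings are illustrative guesses to be tuned against the error messages your cluster actually produces, and anything unrecognized is treated as structural so the orchestrator does not retry blindly.

```python
from typing import Any

# Illustrative substrings only; extend from real errorMsg values you observe.
TRANSIENT_MARKERS = ("slowdown", "throttl", "timed out", "timeout", "connection reset")


def classify_failure(report: dict[str, Any]) -> str:
    """Return 'transient', 'structural', or 'unknown' for a failed task report."""
    payload = report.get("ingestionStatsAndErrors", {}).get("payload", {})
    message = (payload.get("errorMsg") or "").lower()
    if not message:
        return "unknown"
    if any(marker in message for marker in TRANSIENT_MARKERS):
        return "transient"
    return "structural"
```

Only the transient bucket should feed an automatic resubmission with backoff; structural and unknown failures are better routed to a human or to spec validation.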

Automation Checklist

Wire these gates into the orchestration pipeline so async correctness is enforced rather than hoped for:

Related

Automated Ingestion Pipeline Orchestration — the parent guide covering the full generate → validate → submit → monitor control plane.
Dynamic Ingestion Spec Generation — how the deterministic specs these async tasks submit are computed at runtime.
Batch vs Streaming Ingestion Sync — aligning discrete async batch tasks with continuous Kafka supervisors at segment boundaries.
Debugging Druid Supervisor Task Failures — deep diagnostics for the structural failures this page classifies but does not fully dissect.
Query Routing and Segment Discovery — what happens after handoff, when a Broker begins serving the segments an async task published.

For the authoritative endpoint contracts and payload schemas, consult the official Druid Task API documentation.
