Setting Up a Kafka to Druid Real-Time Ingestion Pipeline

The symptom that brings engineers here is a real-time datasource that ingests but never stabilizes: tasks loop between PENDING and RUNNING, consumer lag climbs during compaction windows, and the segment count for the current hour balloons into the thousands before handoff. Real-time ingestion through the Kafka Indexing Service only behaves deterministically when the supervisor spec, the consumer offset contract, and the handoff timeouts are tuned together — a mismatch in any one of them surfaces as stalled handoff or rebalance storms rather than a clean error. This page is the streaming-edge companion to synchronizing batch and streaming ingestion: it owns the offsets, consumer properties, and watermark tuning that keep the live edge healthy so a batch reconciliation job can safely follow behind it.

Failure Modes & Diagnostics

Real-time pipelines degrade predictably, and each failure has a one-line diagnostic against the Overlord or Coordinator REST API. Isolate the state before touching the spec.

Stalled handoff (PENDING → RUNNING loop). A task publishes a segment but the Coordinator never acknowledges the load within handoffConditionTimeout, so the Overlord keeps the task alive and the loop repeats. The usual root cause is metadata-store lock contention from a concurrent compaction task, or an unreachable deep-storage backend. Confirm which segments are stuck unpublished:

curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/events_realtime/segments?full" \
  | jq -r '.[] | select(.interval | startswith("2026-07-04T09")) | "\(.version)\t\(.size)"' \
  | sort | uniq -c

Many distinct version strings for one hour with tiny size values means the stream is publishing but not handing off; check the load queue with curl -s "http://coordinator:8081/druid/coordinator/v1/loadqueue?simple" | jq '.' — a persistently non-empty queue for the datasource confirms handoff is the bottleneck, not ingestion.
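
The same version-spread check can be scripted once the segments payload has been fetched. The sketch below assumes each entry exposes interval, version, and size keys, as the jq filter above does; the function name version_spread and the 10 MB "tiny" threshold are illustrative choices, not Druid defaults.

```python
from collections import Counter


def version_spread(segments, hour_prefix, tiny_bytes=10_000_000):
    """Summarize versions and tiny segments for intervals starting with hour_prefix.

    `segments` is the already-parsed JSON list from the Coordinator segments
    endpoint; the key names mirror the jq filter and are an assumption here.
    """
    in_hour = [s for s in segments if s.get("interval", "").startswith(hour_prefix)]
    versions = Counter(s.get("version") for s in in_hour)
    tiny = sum(1 for s in in_hour if s.get("size", 0) < tiny_bytes)
    return {"segments": len(in_hour), "distinct_versions": len(versions), "tiny": tiny}


# Hand-written sample data, not real cluster output.
sample = [
    {"interval": "2026-07-04T09:00:00.000Z/2026-07-04T10:00:00.000Z", "version": "v1", "size": 1200},
    {"interval": "2026-07-04T09:00:00.000Z/2026-07-04T10:00:00.000Z", "version": "v2", "size": 900},
    {"interval": "2026-07-04T08:00:00.000Z/2026-07-04T09:00:00.000Z", "version": "v1", "size": 500_000_000},
]
print(version_spread(sample, "2026-07-04T09"))
```

A high distinct_versions count together with a tiny count close to segments is the signature described above.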

Offset drift and rebalance storms. When processing between polls takes longer than the consumer's max.poll.interval.ms (a client-side setting, not a broker one), Kafka evicts the consumer, triggers a group rebalance, and the reassigned task re-reads from the last committed offset, so lag never drains. Read the supervisor's own lag report rather than guessing:

curl -s "http://overlord:8090/druid/indexer/v1/supervisor/events_realtime_supervisor/status" \
  | jq '{state: .payload.state, lag: .payload.aggregateLag, offsets: .payload.latestOffsets}'

A monotonically rising aggregateLag across polls means the read side cannot keep up: lower maxRowsInMemory so persist cycles finish faster, or reduce max.poll.records in consumerProperties so a single poll cannot overrun the interval.
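
To make "monotonically rising" concrete, a few successive aggregateLag readings can be compared in order. This is a minimal sketch: lag_is_rising and the three-sample minimum are hypothetical, and the readings are assumed to have been collected from the status endpoint at a fixed polling interval.

```python
def lag_is_rising(samples, min_samples=3):
    """Return True when every reading is strictly greater than the one before it.

    Fewer than min_samples readings is treated as "not enough evidence".
    """
    if len(samples) < min_samples:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


print(lag_is_rising([1200, 4800, 9100]))   # steadily climbing
print(lag_is_rising([1200, 900, 950]))     # dipped once, so not monotonic
```

A single dip resets the verdict, which keeps a brief burst from being mistaken for a read side that cannot keep up.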

Schema mutation rejection. Druid enforces dimensionsSpec and metricsSpec typing at supervisor initialization, and unhandled parse exceptions accumulate until maxParseExceptions is breached and the task terminates. Pull the recent parse errors directly:

curl -s "http://overlord:8090/druid/indexer/v1/supervisor/events_realtime_supervisor/status" \
  | jq -r '.payload.recentErrors[]? | .message' | sort | uniq -c | sort -rn

A changed upstream field type is not a runtime fix — it requires the suspend/patch/resume cycle handled by the schema evolution workflow, not a bumped exception budget.

Target Spec & Validated JSON

The supervisor specification dictates partition assignment, segment boundaries, and handoff thresholds. The block below is a minimal but complete Kafka Indexing Service spec, with type at the top level and dataSchema, ioConfig, and tuningConfig nested under spec, tuned to prevent the three failures above. Its segmentGranularity is the single field that must stay identical to any batch backfill over the same datasource, because both paths inherit the same segment granularity settings.

{
  "type": "kafka",
  "id": "events_realtime_supervisor",
  "spec": {
    "dataSchema": {
      "dataSource": "events_realtime",
      "timestampSpec": { "column": "event_ts", "format": "auto" },
      "dimensionsSpec": {
        "dimensions": [
          { "name": "user_id", "type": "string" },
          { "name": "region", "type": "string" },
          { "name": "device_type", "type": "string" }
        ]
      },
      "metricsSpec": [
        { "type": "count", "name": "event_count" },
        { "type": "longSum", "name": "latency_ms", "fieldName": "latency" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": true
      }
    },
    "ioConfig": {
      "type": "kafka",
      "topic": "events.prod.v2",
      "consumerProperties": {
        "bootstrap.servers": "kafka-broker-01:9092,kafka-broker-02:9092",
        "group.id": "druid-events-realtime",
        "enable.auto.commit": "false",
        "max.poll.records": "5000"
      },
      "useEarliestOffset": true,
      "inputFormat": { "type": "json", "flattenSpec": { "useFieldDiscovery": true } },
      "taskCount": 2,
      "replicas": 2,
      "taskDuration": "PT1H",
      "lateMessageRejectionPeriod": "PT6H",
      "completionTimeout": "PT15M"
    },
    "tuningConfig": {
      "type": "kafka",
      "maxRowsPerSegment": 5000000,
      "maxRowsInMemory": 250000,
      "intermediatePersistPeriod": "PT5M",
      "handoffConditionTimeout": 600000,
      "logParseExceptions": true,
      "maxParseExceptions": 10,
      "maxSavedParseExceptions": 100
    }
  }
}

The load-bearing fields:

ioConfig.consumerProperties.enable.auto.commit is false — Druid owns the offset lifecycle and commits to its own metadata store on publish; auto-commit would let Kafka advance offsets independently of a successful handoff and silently drop data on a task restart.
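ioConfig.useEarliestOffset applies only when the supervisor has no stored offsets for the datasource, in which case it starts from the topic's oldest retained offset; on an already-running supervisor it is ignored in favour of the stored offsets, so it is safe to leave true at bootstrap. It applies again after a reset clears those stored offsets, at which point true means re-reading from the oldest retained offset.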
ioConfig.lateMessageRejectionPeriod (PT6H) is the watermark: events timestamped more than six hours before the task was created never enter a real-time segment, and that trailing region is exactly what a batch reconciliation job owns.
ioConfig.completionTimeout should be at least 1.5× the consumer's max.poll.interval.ms so a long compaction window does not trip a consumer-group rebalance mid-publish.
tuningConfig.maxRowsPerSegment (5 M) aligns with Historical disk I/O; pushing well past it forces heavier downstream compaction, while far below it produces the tiny-segment sprawl that the segment size optimization strategies exist to correct.
tuningConfig.handoffConditionTimeout (ten minutes; Druid documents the field as a count of milliseconds) is how long a task waits for its published segments to be handed off; values below five minutes cause premature task termination during metadata-store latency spikes and directly produce the PENDING → RUNNING loop above.
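
The watermark split between the stream and the batch job can be illustrated with plain datetime arithmetic. This is a sketch under stated assumptions: owner_of is a hypothetical helper, and the cutoff is measured from a reference time supplied by the caller (for example the task's creation time).

```python
from datetime import datetime, timedelta, timezone

LATE_REJECTION = timedelta(hours=6)  # mirrors lateMessageRejectionPeriod PT6H


def owner_of(event_ts, reference):
    """Return 'stream' if the event is inside the watermark, else 'batch'."""
    cutoff = reference - LATE_REJECTION
    return "stream" if event_ts >= cutoff else "batch"


reference = datetime(2026, 7, 4, 12, 0, tzinfo=timezone.utc)
print(owner_of(datetime(2026, 7, 4, 7, 0, tzinfo=timezone.utc), reference))   # inside the six hours
print(owner_of(datetime(2026, 7, 4, 5, 59, tzinfo=timezone.utc), reference))  # older than the cutoff
```

Anything the helper labels batch is a row the real-time path would reject, which is why the reconciliation job has to cover that region.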

Retention is deliberately absent here: load/drop rules are Coordinator-side policy, not a supervisor concern.
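
Several of the constraints above can be checked mechanically before a spec is submitted. The sketch below is illustrative only: lint_spec and parse_period are hypothetical helpers, the period parser handles just the PT..H..M..S forms used on this page, and the 300000 ms default for max.poll.interval.ms is the Kafka consumer default, which should be replaced with the value actually configured.

```python
import re
from datetime import timedelta

_PERIOD = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_period(text):
    """Parse a simple ISO 8601 time period such as PT1H or PT15M (no date part)."""
    m = _PERIOD.match(text)
    if not m or not any(m.groups()):
        raise ValueError(f"unsupported period: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def lint_spec(supervisor, batch_granularity, max_poll_interval_ms=300_000):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    spec = supervisor.get("spec", {})
    problems = [f"missing spec.{k}" for k in ("dataSchema", "ioConfig", "tuningConfig") if k not in spec]
    io = spec.get("ioConfig", {})
    props = io.get("consumerProperties", {})
    if str(props.get("enable.auto.commit", "")).lower() != "false":
        problems.append("enable.auto.commit should be false")
    gran = spec.get("dataSchema", {}).get("granularitySpec", {}).get("segmentGranularity")
    if gran != batch_granularity:
        problems.append(f"segmentGranularity {gran} != batch {batch_granularity}")
    if "completionTimeout" in io:
        floor = timedelta(milliseconds=1.5 * max_poll_interval_ms)
        if parse_period(io["completionTimeout"]) < floor:
            problems.append("completionTimeout below 1.5x max.poll.interval.ms")
    return problems


# Trimmed-down stand-in for the full spec, keeping only the fields the checks read.
example = {
    "type": "kafka",
    "spec": {
        "dataSchema": {"granularitySpec": {"segmentGranularity": "HOUR"}},
        "ioConfig": {
            "consumerProperties": {"enable.auto.commit": "false"},
            "completionTimeout": "PT15M",
        },
        "tuningConfig": {},
    },
}
print(lint_spec(example, batch_granularity="HOUR"))
```

Running the same function with the batch job's granularity as the second argument is a cheap guard against the two paths drifting apart.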

Python Automation Script

Recovery must be idempotent, because a half-applied supervisor change is worse than none. The controller below performs a health check, a graceful suspend/patch/resume, and a bounded reset for stuck tasks. It uses only the standard library plus requests with session-level retries and capped exponential backoff — the same submit/poll contract the async task execution patterns establish for the parent pillar.

import logging
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

OVERLORD = "https://druid-overlord.internal:8090"
SUPERVISOR_ID = "events_realtime_supervisor"
TERMINAL_HEALTHY = "RUNNING"


class DruidSupervisorOrchestrator:
    def __init__(self) -> None:
        self.session = requests.Session()
        retry = Retry(
            total=4,
            backoff_factor=2,               # urllib3 sleeps roughly 0s, 4s, 8s, 16s between transport retries
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["GET", "POST"],
        )
        self.session.mount("https://", HTTPAdapter(max_retries=retry))
        self.session.headers.update({"Content-Type": "application/json"})

    def _post(self, path: str, **kw) -> requests.Response:
        r = self.session.post(f"{OVERLORD}{path}", timeout=30, **kw)
        r.raise_for_status()
        return r

    def state(self) -> str:
        r = self.session.get(
            f"{OVERLORD}/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/status", timeout=15
        )
        r.raise_for_status()
        return r.json().get("payload", {}).get("state", "UNKNOWN")

    def suspend_patch_resume(self, new_spec: dict) -> None:
        """Idempotent spec change: active tasks finish their segment, then resume."""
        self._post(f"/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/suspend")
        logging.info("suspended; waiting for in-flight segments to publish")
        try:
            # POST to the base endpoint replaces the spec in place (same id).
            self._post("/druid/indexer/v1/supervisor", json=new_spec)
            logging.info("spec replaced")
        finally:
            self._post(f"/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/resume")
            logging.info("resumed; live edge handed back to the stream")

    def recover_if_stuck(self) -> bool:
        """Reset only when the supervisor is not in its healthy running state."""
        current = self.state()
        if current == TERMINAL_HEALTHY:
            return False
        logging.warning("supervisor in %s; issuing reset", current)
        # reset clears the stored offsets; the supervisor then restarts from the
        # earliest or latest offset according to useEarliestOffset.
        self._post(f"/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/reset")
        return True

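Two properties are worth noting. The resume runs in a finally block, so a failed patch never leaves the stream permanently suspended. And recover_if_stuck issues a reset only when the supervisor is not RUNNING, because reset is not a gentle operation: it clears the offsets held in the metadata store, after which the supervisor restarts from the earliest or latest offset according to useEarliestOffset. With useEarliestOffset set to true that means re-consuming the retained topic, and with false it can skip messages, so reserve reset for supervisors that cannot recover any other way.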
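
After a resume or reset, the supervisor state can be polled with capped exponential backoff instead of being assumed. The helper below is a sketch: wait_until_running is a hypothetical name, the delays are arbitrary defaults, and the state getter and sleep function are injected so the loop can be exercised without a cluster.

```python
import time


def wait_until_running(get_state, attempts=6, base_delay=2.0, max_delay=30.0, sleep=time.sleep):
    """Poll get_state() until it returns 'RUNNING' or the attempts run out.

    The delay doubles after each miss and is capped at max_delay seconds.
    """
    delay = base_delay
    for attempt in range(attempts):
        if get_state() == "RUNNING":
            return True
        if attempt < attempts - 1:      # no point sleeping after the final check
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return False


# Dry run with canned states and a recording sleep, so nothing actually waits.
states = iter(["PENDING", "PENDING", "RUNNING"])
waits = []
print(wait_until_running(lambda: next(states), sleep=waits.append), waits)
```

Against a live cluster the getter would be the orchestrator's state method, for example wait_until_running(DruidSupervisorOrchestrator().state).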

Verification Steps

After applying the spec or a recovery, confirm the pipeline reached steady state — do not trust the task submission alone. Poll the supervisor until its state is RUNNING with draining lag:

curl -s "http://overlord:8090/druid/indexer/v1/supervisor/events_realtime_supervisor/status" \
  | jq '{state: .payload.state, lag: .payload.aggregateLag, tasks: (.payload.activeTasks | length)}'

Expected, for a healthy two-task supervisor once lag has drained:

{
  "state": "RUNNING",
  "lag": 0,
  "tasks": 2
}

Then confirm segments are actually handing off to Historicals rather than piling up as real-time-only. A clean interval shows one authoritative version and a segment count near ceil(rowsPerHour / maxRowsPerSegment):

curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/events_realtime/segments?full" \
  | jq -r '[.[] | select(.interval | startswith("2026-07-04T08"))] | length'
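
The expected count from the formula above is a one-line calculation, and comparing it with the observed count gives a rough sprawl factor. The names expected_segments and sprawl_factor are hypothetical, and the 12 million rows per hour figure is made up for illustration.

```python
import math


def expected_segments(rows_per_hour, max_rows_per_segment=5_000_000):
    """ceil(rowsPerHour / maxRowsPerSegment), the baseline for a clean hour."""
    return math.ceil(rows_per_hour / max_rows_per_segment)


def sprawl_factor(observed, rows_per_hour, max_rows_per_segment=5_000_000):
    """How many times larger the observed segment count is than the baseline."""
    baseline = max(expected_segments(rows_per_hour, max_rows_per_segment), 1)
    return observed / baseline


print(expected_segments(12_000_000))   # ceil(2.4) = 3
print(sprawl_factor(900, 12_000_000))  # 900 observed against a baseline of 3
```

The baseline ignores taskCount and replicas, so a small multiple of it is normal; a factor in the hundreds is the sprawl case.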

A count in the hundreds or thousands for a closed hour means fine-grained real-time output that never coalesced — route it to automated compaction scheduling rather than widening segmentGranularity. Finally, verify the load queue has drained so the newly published segments are queryable:

curl -s "http://coordinator:8081/druid/coordinator/v1/loadstatus?full" \
  | jq '.events_realtime // "fully loaded"'

"fully loaded"

For authoritative field definitions, consult the official Apache Druid Kafka ingestion documentation and the Kafka consumer configuration guide.

Related

Debugging Druid supervisor task failures — the failure taxonomy and recovery sequence when a supervisor reaches SUSPENDED, FAILED, or KILLED.
Automating Druid ingestion specs with Python — generate this supervisor spec from a template instead of hand-editing JSON.
Handling schema evolution in Druid ingestion — the suspend/patch/resume cycle for a changed dimensionsSpec or metricsSpec.

Up one level: synchronizing batch and streaming ingestion is the parent guide that reconciles this live streaming edge with nightly index_parallel backfills.
