Synchronizing Batch and Streaming Ingestion in Apache Druid

High-velocity OLAP platforms rarely ingest through a single path: a Kafka supervisor keeps the current interval fresh while nightly index_parallel backfills correct late data, re-key dimensions, or apply an improved rollup. Both paths publish into the same metadata store, so when their interval boundaries or versions collide the result is overlapping segments, duplicated rows, and query-time fragmentation. This page is the reconciliation contract that sits inside automated ingestion pipeline orchestration: how to align a streaming watermark with a batch interval, when to use atomic replacement instead of append, and how to drive the whole handshake from a deterministic Python controller so the two ingestion models never fight over the same time partition.

Two Paths, One Metadata Layer

Streaming (Kafka supervisor) and batch (index_parallel) ingestion produce segments through different execution models, but both publish into the same metadata store so Historicals serve a unified view.

Mechanics & Internals

A Druid segment is identified by datasource_interval_version_partitionNum. The interval and version fields are what make batch and streaming coexistence tractable, and what make it dangerous when mismanaged. A Kafka supervisor's indexing tasks continuously read partitions, build real-time segments in the peon's memory, and publish them to deep storage when taskDuration elapses, or earlier when a row limit such as maxRowsPerSegment forces an intermediate handoff. Streamed segments appended to an interval share that interval's version and differ in partitionNum. A batch index_parallel task over the same interval mints segments with its own, newer version; because Druid resolves reads by taking the highest version that fully covers an interval, the newer of the two wins for any interval it completely spans. The failure case is a partial cover: a batch task that writes segments for only part of an hour the stream already owns leaves both versions live, and the Broker fans out to both, double-counting rows.

The clean primitive that avoids this is dropExisting. When a batch task carries "dropExisting": true together with "appendToExisting": false and an explicit granularitySpec.intervals, its publish atomically replaces every pre-existing segment fully contained in that interval: time chunks with new data are overshadowed by the batch version, and time chunks where the batch input is empty receive tombstone segments, so nothing the stream published there stays visible. Segments that only partly overlap the interval are not dropped, which is one more reason to align intervals on segmentGranularity boundaries. This is the mechanism that lets a backfill safely overwrite a window the supervisor previously wrote. The complementary primitive is appendToExisting, which adds partitions alongside existing segments in the same interval and version lineage; it is correct only when the batch data is genuinely additive and can never duplicate a streamed row.
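Because both primitives act on segments addressed by interval and version, it can help to pull those fields out of a segment ID when auditing a datasource. The following is a minimal sketch, not a Druid API: parse_segment_id and the SegmentId tuple are hypothetical helpers, and the regex assumes versions are printed as millisecond-precision UTC timestamps and that partition 0 is written without a trailing number.

```python
import re
from typing import NamedTuple

# Timestamp layout assumed for interval bounds and versions,
# e.g. 2026-07-03T00:00:00.000Z
_TS = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z"
_SEGMENT_ID = re.compile(
    rf"^(?P<ds>.+)_(?P<start>{_TS})_(?P<end>{_TS})_(?P<version>{_TS})(?:_(?P<part>\d+))?$"
)


class SegmentId(NamedTuple):
    datasource: str
    start: str
    end: str
    version: str
    partition_num: int


def parse_segment_id(segment_id: str) -> SegmentId:
    """Split datasource_start_end_version[_partitionNum] into its fields."""
    m = _SEGMENT_ID.match(segment_id)
    if m is None:
        raise ValueError(f"unrecognized segment id: {segment_id!r}")
    # A missing partition suffix is treated as partition 0 (assumption).
    return SegmentId(m["ds"], m["start"], m["end"], m["version"], int(m["part"] or 0))
```

Since the timestamps are fixed-width, comparing two version strings lexicographically orders them chronologically, which is enough to tell which of two covers is newer.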

The second internal concept is the watermark. The supervisor's lateMessageRejectionPeriod defines the trailing edge below which incoming events are dropped as too-late; anything older than now - lateMessageRejectionPeriod will never enter a real-time segment. That boundary is exactly where a batch reconciliation job takes over: it ingests the delta interval below the watermark that the stream refuses. The orchestrator's job is to make the batch interval's upper bound line up with, but never cross above, the live watermark — because a batch task that reaches into an interval the supervisor is still actively writing races the stream's own publishes. Concurrent append-and-replace on one interval is only safe under Druid's concurrent-append-and-replace locking (taskLockType set to APPEND / REPLACE); without it, the orchestrator must serialize the two paths by briefly suspending the supervisor.
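The watermark arithmetic above can be sketched in a few lines. This is an illustration under stated assumptions rather than anything Druid provides: batch_upper_bound is a hypothetical helper, the one-granule safety margin is a choice of this sketch, and the granularity is assumed to be a fixed-length period such as an hour.

```python
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def batch_upper_bound(
    now: datetime,
    late_rejection: timedelta,
    segment_granularity: timedelta = timedelta(hours=1),
) -> datetime:
    """Latest granule-aligned instant a backfill may use as its exclusive end."""
    watermark = now - late_rejection          # stream rejects events older than this
    safe = watermark - segment_granularity    # one-granule safety margin (assumption)
    granules = (safe - _EPOCH) // segment_granularity  # floor to a granule boundary
    return _EPOCH + granules * segment_granularity
```

With now at 2026-07-04T01:25Z and a six-hour rejection period, the watermark is 19:25Z on the previous day; subtracting one hour and flooring gives 18:00Z, an upper bound that stays clear of any hour the supervisor might still touch.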

Druid exposes every state transition the controller needs over REST. Supervisor lifecycle runs through POST /druid/indexer/v1/supervisor (create/update), POST /druid/indexer/v1/supervisor/{id}/suspend, POST /druid/indexer/v1/supervisor/{id}/resume, and GET /druid/indexer/v1/supervisor/{id}/status (which reports per-partition lag and the current offsets). Batch tasks use the same POST /druid/indexer/v1/task and GET /druid/indexer/v1/task/{id}/status endpoints described for async task execution patterns, and the Coordinator's GET /druid/coordinator/v1/datasources/{ds}/segments confirms which version is actually loaded. The interval boundaries themselves are governed by the segment granularity settings shared by both paths — the single most important thing to keep identical across streaming and batch specs.

Validated Configuration Spec

Two specs must agree for the paths to reconcile cleanly: the Kafka supervisor that owns the live edge and the index_parallel backfill that owns the reconciled tail. The load-bearing rule is that dataSchema — dataSource, dimensionsSpec, metricsSpec, and especially granularitySpec — is byte-for-byte identical between them, so a replaced interval keeps the same rollup and partition semantics as the interval beside it. Templating both from one source of truth is the job of dynamic ingestion spec generation.
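One way to enforce that rule in a pipeline is to compare the two dataSchema blocks before any submission. The sketch below is a hypothetical pre-flight check, not part of Druid; it assumes both specs nest dataSchema under a top-level "spec" key and that granularitySpec.intervals, which only the batch spec carries, should be ignored in the comparison.

```python
import copy
import json


def comparable_schema(spec: dict) -> str:
    """Canonical JSON of dataSchema with the batch-only intervals removed."""
    schema = copy.deepcopy(spec["spec"]["dataSchema"])  # never mutate the caller's spec
    schema.get("granularitySpec", {}).pop("intervals", None)
    return json.dumps(schema, sort_keys=True, separators=(",", ":"))


def schemas_match(stream_spec: dict, batch_spec: dict) -> bool:
    """True when both ingestion paths would roll up and partition identically."""
    return comparable_schema(stream_spec) == comparable_schema(batch_spec)
```

Serializing with sorted keys makes the comparison independent of key order, so two specs rendered by different templates still compare equal when their contents agree.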

The Kafka supervisor spec, with every reconciliation-relevant field annotated:

{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "events_web",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["country", "device", "page_id"] },
      "metricsSpec": [
        { "type": "count", "name": "rows" },
        { "type": "longSum", "name": "bytes", "fieldName": "resp_bytes" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": true
      }
    },
    "ioConfig": {
      "type": "kafka",
      "topic": "events_web",
      "consumerProperties": { "bootstrap.servers": "kafka:9092" },
      "taskCount": 3,
      "replicas": 1,
      "taskDuration": "PT1H",
      "useEarliestOffset": false,
      "lateMessageRejectionPeriod": "PT6H",
      "completionTimeout": "PT30M"
    },
    "tuningConfig": {
      "type": "kafka",
      "maxRowsInMemory": 1000000,
      "maxRowsPerSegment": 5000000,
      "intermediatePersistPeriod": "PT10M"
    }
  }
}

ioConfig.lateMessageRejectionPeriod: PT6H means the stream refuses events older than six hours; that six-hour trailing edge is the exact region a batch reconciliation job is responsible for.
ioConfig.taskDuration: how long a reading task runs before its segments publish and hand off. Shorter durations reduce the window in which a segment is real-time-only, tightening how promptly a batch job can safely follow behind it.
ioConfig.useEarliestOffset: consulted only when the supervisor has no stored offsets for the datasource (a first run or after a reset), where false starts from the latest offset. Once offsets are stored, tasks resume from them, so a reconciliation flow should never reset a live supervisor with this set to true unless re-consuming the retained topic is intended.
dataSchema.granularitySpec.segmentGranularity: must match the batch spec below so replaced intervals align on the same hour boundaries.

The index_parallel reconciliation backfill for a closed interval, using atomic replacement:

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "events_web",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["country", "device", "page_id"] },
      "metricsSpec": [
        { "type": "count", "name": "rows" },
        { "type": "longSum", "name": "bytes", "fieldName": "resp_bytes" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": true,
        "intervals": ["2026-07-03T00:00:00Z/2026-07-03T18:00:00Z"]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "prefixes": ["s3://lake/events/web/2026-07-03/"]
      },
      "inputFormat": { "type": "json" },
      "appendToExisting": false,
      "dropExisting": true
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsInMemory": 1000000,
      "maxRowsPerSegment": 5000000,
      "maxNumConcurrentSubTasks": 4,
      "forceGuaranteedRollup": true,
      "partitionsSpec": {
        "type": "hashed",
        "numShards": null,
        "targetRowsPerSegment": 5000000
      }
    }
  }
}

granularitySpec.intervals: the explicit closed window the batch job authoritatively owns. Its upper bound (18:00Z) sits below the live watermark so it never overlaps an interval the supervisor is still writing.
ioConfig.dropExisting (true): combined with the explicit intervals and appendToExisting set to false, the publish replaces every pre-existing segment fully contained in that window, including everything the stream published there; hours where the batch input is empty get tombstone segments. The batch version becomes the sole, complete cover, which is what makes the replacement atomic and duplicate-free.
ioConfig.appendToExisting (false): mutually exclusive with the drop semantics above; set true only for genuinely additive backfills where no streamed row can recur.
tuningConfig.forceGuaranteedRollup (true) with a hashed partitionsSpec: guarantees perfect rollup over the reconciled interval, matching the aggregation the stream applied to its neighbours.
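The replacement preconditions listed above lend themselves to a guard that runs before submission. A minimal sketch, assuming the spec layout shown in this section; replacement_spec_problems is a hypothetical name and the messages are illustrative.

```python
from typing import List


def replacement_spec_problems(spec: dict) -> List[str]:
    """Return reasons a backfill spec would not act as an atomic replacement."""
    inner = spec.get("spec", {})
    io_config = inner.get("ioConfig", {})
    granularity = inner.get("dataSchema", {}).get("granularitySpec", {})

    problems = []
    if io_config.get("dropExisting") is not True:
        problems.append("ioConfig.dropExisting must be true")
    if io_config.get("appendToExisting", False):
        problems.append("ioConfig.appendToExisting must be false")
    if not granularity.get("intervals"):
        problems.append("granularitySpec.intervals must be set explicitly")
    return problems
```

An empty list means the three conditions hold; a controller could refuse to suspend the supervisor at all when the list is non-empty, so a misconfigured spec never interrupts the stream.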

Sizing Heuristics & Formulas

Four numeric relationships govern a healthy batch/streaming split. The first fixes the reconciliation window: the batch job owns everything from the topic's oldest retained offset up to the live watermark, so the delta interval it must cover is bounded by lateMessageRejectionPeriod:

$$ \text{reconcileInterval} \approx \text{lateMessageRejectionPeriod} + \text{safetyMargin} $$

where a one-segmentGranularity safetyMargin keeps the batch upper bound strictly below any interval the supervisor might still touch. The second sizes streaming task parallelism against Kafka so lag never grows into the reconciliation window; one reading task per partition per replica is the ceiling:

$$ \text{minTaskCount} \approx \left\lceil \frac{\text{kafkaPartitions}}{\text{targetPartitionsPerTask}} \right\rceil \times \text{replicas} $$
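As a worked instance of the formula above, the sketch below separates the per-replica-set count from the total. It assumes Druid's convention that ioConfig.taskCount counts reading tasks within one replica set, so the product with replicas is the total number of reading tasks; reading_tasks and its parameter names are hypothetical.

```python
import math
from typing import Tuple


def reading_tasks(
    kafka_partitions: int, target_partitions_per_task: int, replicas: int = 1
) -> Tuple[int, int]:
    """Return (taskCount per replica set, total reading tasks)."""
    task_count = math.ceil(kafka_partitions / target_partitions_per_task)
    # More tasks than partitions would leave some tasks with nothing to read.
    task_count = min(task_count, kafka_partitions)
    return task_count, task_count * replicas
```

For a hypothetical 12-partition topic at four partitions per task, this gives a taskCount of 3; raising replicas to 2 keeps taskCount at 3 and doubles the total to 6.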

The third keeps both paths landing on the same target segment size — the property the segment size optimization strategies depend on — so a replaced interval is indistinguishable in shape from its streamed neighbours:

$$ \text{targetRows} \approx \frac{\text{targetMB} \times 1048576}{\text{avgRowBytes}} $$

For 500 MB segments over rows averaging \( \approx 105 \) bytes on disk after compression, that yields \( \approx 5 \times 10^{6} \) rows per segment, the 5,000,000-row figure both specs above use (maxRowsPerSegment on the stream, targetRowsPerSegment on the batch). Where the two paths cannot match size, because streaming publishes at fine granularity and produces many undersized segments before the interval closes, the gap is closed after the fact by automated compaction scheduling, tuned via compaction threshold tuning, rather than by widening the streaming segmentGranularity. Finally, the safe cadence for reconciliation is bounded below by handoff latency: a batch job for interval \( I \) must not start until every streaming segment in \( I \) has completed handoff, so

$$ \text{reconcileDelay} \gtrsim \text{taskDuration} + \text{handoffLatency} $$

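which, for an hourly taskDuration plus roughly 30 minutes of handoff latency, puts the handoff floor near 90 minutes behind real time. This is a floor only: the batch upper bound must also stay below the live watermark, so with lateMessageRejectionPeriod at PT6H the watermark, not handoff, is the binding constraint.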
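The segment-size and handoff formulas above are simple enough to check numerically. The sketch below is illustrative: target_rows and reconcile_delay are hypothetical helpers, and the 30-minute handoff latency used in the usage note is an assumed figure, not a measured one.

```python
from datetime import timedelta


def target_rows(target_mb: float, avg_row_bytes: float) -> int:
    """Rows per segment for a target size in MiB and an average on-disk row size."""
    return round(target_mb * 1_048_576 / avg_row_bytes)


def reconcile_delay(task_duration: timedelta, handoff_latency: timedelta) -> timedelta:
    """Lower bound on how far behind real time a backfill should start."""
    return task_duration + handoff_latency
```

target_rows(500, 105) comes to 4,993,219, within a fraction of a percent of the 5,000,000 used in the specs, and reconcile_delay(timedelta(hours=1), timedelta(minutes=30)) gives the 90-minute floor.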

Python Orchestration Snippet

The reconciliation controller below serializes the two paths for datasources not using concurrent-append-and-replace locking: it suspends the supervisor, submits the dropExisting backfill, polls to terminal state with capped exponential backoff, confirms the replaced version actually loaded, then resumes the stream. It uses only the standard library plus requests, the same contract the parent pillar's orchestrator establishes.

import time
import hashlib
import requests

OVERLORD = "http://overlord:8090"
COORDINATOR = "http://coordinator:8081"
TERMINAL = {"SUCCESS", "FAILED"}

def deterministic_task_id(datasource: str, interval: str) -> str:
    """Content-hashed ID so a retried reconciliation is idempotent."""
    key = f"reconcile:{datasource}:{interval}".encode()
    return f"idx_recon_{datasource}_{hashlib.sha1(key).hexdigest()[:12]}"

def supervisor(action: str, sup_id: str) -> None:
    r = requests.post(
        f"{OVERLORD}/druid/indexer/v1/supervisor/{sup_id}/{action}", timeout=30
    )
    r.raise_for_status()

def submit_backfill(spec: dict) -> str:
    r = requests.post(f"{OVERLORD}/druid/indexer/v1/task", json=spec, timeout=30)
    r.raise_for_status()
    return r.json()["task"]

def poll_until_terminal(task_id: str, max_wait: int = 3600) -> str:
    deadline = time.monotonic() + max_wait
    delay = 2.0
    while time.monotonic() < deadline:
        r = requests.get(
            f"{OVERLORD}/druid/indexer/v1/task/{task_id}/status", timeout=15
        )
        r.raise_for_status()
        state = r.json()["status"]["status"]
        if state in TERMINAL:
            return state
        time.sleep(delay)
        delay = min(delay * 2, 30.0)  # cap backoff at 30s
    raise TimeoutError(f"{task_id} did not terminate in {max_wait}s")

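def confirm_replacement(datasource: str, interval_start: str, tries: int = 20) -> bool:
    """The batch SUCCESS is meaningless until the replaced interval is served.

    Matches on datasource and interval start only; it does not prove which
    version is loaded, so pair it with a version audit where that matters.
    """
    # Segment IDs print timestamps with milliseconds (2026-07-03T00:00:00.000Z),
    # so drop the trailing "Z" to let "2026-07-03T00:00:00Z" match as a prefix.
    needle = f"{datasource}_{interval_start.rstrip('Z')}"
    delay = 2.0
    for _ in range(tries):
        r = requests.get(
            f"{COORDINATOR}/druid/coordinator/v1/datasources/{datasource}/segments",
            timeout=15,
        )
        r.raise_for_status()
        if any(seg.startswith(needle) for seg in r.json()):
            return True
        time.sleep(delay)
        delay = min(delay * 2, 30.0)  # cap backoff at 30s
    return False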

def reconcile(spec: dict, sup_id: str, datasource: str, interval: str) -> None:
    """Serialize stream + batch on one interval: suspend, replace, verify, resume."""
    # Pin the content-hashed ID so a retried run resubmits the same task ID
    # and the Overlord rejects the duplicate instead of replacing twice.
    spec = {**spec, "id": deterministic_task_id(datasource, interval)}
    supervisor("suspend", sup_id)
    try:
        task_id = submit_backfill(spec)  # spec carries dropExisting + explicit intervals
        if poll_until_terminal(task_id) != "SUCCESS":
            raise RuntimeError(f"reconciliation {task_id} failed")
        if not confirm_replacement(datasource, interval.split("/")[0]):
            raise RuntimeError(f"{task_id} succeeded but replacement never loaded")
    finally:
        supervisor("resume", sup_id)  # always hand the live edge back to the stream

Three properties make this safe. Idempotent submission — the content-hashed task ID means a retried reconciliation resubmits the same ID, which the Overlord rejects rather than double-replacing. Guaranteed resume — the finally block returns the live edge to the supervisor even if the batch job throws, so a failed reconciliation never leaves the stream permanently suspended. Replacement confirmation — the controller never signals success on the batch task state alone; it polls the Coordinator until the replaced interval is actually loaded, the same handoff discipline the parent pillar enforces. For datasources that do use concurrent-append-and-replace locking, drop the suspend/resume wrapper and submit the backfill with taskLockType: REPLACE so it interleaves with the live stream. The end-to-end wiring of the streaming side — offsets, consumer properties, watermark tuning — is covered in Kafka to Druid real-time pipeline setup.
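For the concurrent-locking variant mentioned above, the batch spec needs the lock type in its task context. A minimal sketch, assuming the context lives in a top-level "context" object on the task JSON; with_replace_lock is a hypothetical helper, and the streaming side would need the matching APPEND setting on the supervisor, which is not shown here.

```python
import copy


def with_replace_lock(spec: dict) -> dict:
    """Return a copy of a batch task spec that requests a REPLACE lock."""
    out = copy.deepcopy(spec)  # leave the caller's template untouched
    out.setdefault("context", {})["taskLockType"] = "REPLACE"
    return out
```

Returning a copy keeps the shared spec template reusable for datasources that still rely on the suspend/resume path.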

Failure Modes & Diagnostics

When batch and streaming reconciliation goes wrong, the symptoms are specific and the diagnostics are curl-and-jq one-liners against the Coordinator and Overlord.

Duplicated rows after a backfill. The batch task ran with appendToExisting: true (or without an explicit intervals), so its segments coexist with the streamed version instead of replacing it. List every live segment for the suspect interval and check for two versions covering the same window:

curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/events_web/segments?full" \
  | jq -r '.[] | select(.interval | startswith("2026-07-03")) | "\(.interval)\t\(.version)\t\(.size)"' \
  | sort

Two distinct version strings over the same interval confirm the overlap; re-run the backfill with dropExisting: true.
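The same check can run inside the controller as a post-reconciliation audit. The sketch below assumes the input is the parsed JSON list from the segments?full call, where each entry exposes "interval" and "version" keys as the jq filter above reads them; intervals_with_multiple_versions is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def intervals_with_multiple_versions(segments: Iterable[dict]) -> Dict[str, List[str]]:
    """Map each interval served under more than one version to its versions."""
    versions = defaultdict(set)
    for seg in segments:
        versions[seg["interval"]].add(seg["version"])
    return {
        interval: sorted(found)
        for interval, found in versions.items()
        if len(found) > 1
    }
```

An empty result corresponds to the "one version per interval" condition; any entry names an interval worth re-running with dropExisting: true.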

Batch task races the live watermark. The reconciliation intervals upper bound reached into an hour the supervisor was still writing, so the batch and streaming publishes interleaved. Compare the batch interval against the supervisor's current offsets and lag:

curl -s "http://overlord:8090/druid/indexer/v1/supervisor/events_web/status" \
  | jq '{state: .payload.state, latestOffsets: .payload.latestOffsets, lag: .payload.aggregateLag}'

If aggregateLag implies the stream is still behind the batch upper bound, lower the reconciliation upper bound below now - lateMessageRejectionPeriod.

Replacement task SUCCESS but queries unchanged. Handoff of the tombstones/replacement has not completed — the new version is published in metadata but not yet loaded. Watch the load queue and unavailable count:

curl -s "http://coordinator:8081/druid/coordinator/v1/loadstatus?full" \
  | jq '.events_web // "loaded"'
curl -s "http://coordinator:8081/druid/coordinator/v1/loadqueue?simple" | jq '.'

A non-empty load queue for the datasource means handoff is in flight; wait one Coordinator period before treating the reconciliation as complete.

Supervisor stuck suspended after a failed run. An earlier reconciliation threw before the resume — verify and recover:

curl -s "http://overlord:8090/druid/indexer/v1/supervisor/events_web/status" \
  | jq -r '.payload.state'
curl -s -X POST "http://overlord:8090/druid/indexer/v1/supervisor/events_web/resume"

Thousands of tiny streaming segments before the interval closes. Fine segmentGranularity combined with a high taskCount and a short taskDuration publishes many undersized segments, because each task publishes its own partitions for every interval it touched; count them per interval before deciding to compact:

curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/events_web/segments?full" \
  | jq -r '.[] | .interval' | sort | uniq -c | sort -rn | head

Any interval with far more segments than ceil(rowsPerInterval / targetRows) is a compaction candidate — route it to automated compaction scheduling rather than reshaping the supervisor.
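That threshold can be applied programmatically to the same segment listing. This is a sketch under assumptions: compaction_candidates is a hypothetical helper, rows_per_interval is supplied by the caller (for example from a count query), and the slack factor of 2 is an arbitrary stand-in for "far more".

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple


def compaction_candidates(
    segments: Iterable[dict],
    rows_per_interval: Dict[str, int],
    target_rows: int,
    slack: float = 2.0,
) -> Dict[str, Tuple[int, int]]:
    """Map interval -> (actual segments, expected segments) where actual is excessive."""
    counts = Counter(seg["interval"] for seg in segments)
    candidates = {}
    for interval, actual in counts.items():
        rows = rows_per_interval.get(interval, 0)
        expected = max(1, math.ceil(rows / target_rows))
        if actual > slack * expected:
            candidates[interval] = (actual, expected)
    return candidates
```

An hour holding 6,000,000 rows at a 5,000,000-row target should need two segments, so 40 segments there would be flagged while two would not.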

Automation Checklist

Streaming and batch specs share byte-identical dataSchema (dimensionsSpec, metricsSpec, granularitySpec) — template both from one source so rollup and partitioning never diverge.
The upper bound of the reconciliation intervals is strictly below now - lateMessageRejectionPeriod, less a further one-segmentGranularity safety margin.
Every reconciliation backfill sets dropExisting: true with an explicit intervals (never bare appendToExisting over a streamed window).
Batch task IDs are content-hashed so a retried run is idempotent and the Overlord rejects duplicates.
The controller confirms the replaced version is loaded via the Coordinator before signalling success — never on batch task state alone.
Supervisor resume runs in a finally block (or concurrent-append-and-replace locking is enabled) so a failed backfill never leaves the stream suspended.
Post-reconciliation audit: one version per interval, segment sizes within the target band, load queue drained.
Fine-grained streaming output is routed to compaction, not fixed by widening segmentGranularity.
Alerts fire when supervisor aggregateLag exceeds roughly one taskDuration's worth of messages and when the Coordinator's segment/unavailable/count metric (segment_unavailable_count in some exporters) stays non-zero past one Coordinator period.

Related

Async task execution patterns: the non-blocking submit/poll/backoff loop the reconciliation controller builds on.
Dynamic ingestion spec generation: template the streaming and batch specs from one source so their dataSchema never drifts.
Kafka to Druid real-time pipeline setup: offsets, consumer properties, and watermark tuning for the streaming edge.
Automated compaction scheduling: reconcile fine-grained streaming output back toward the target segment size after handoff.
Query routing and segment discovery: how the Broker resolves versions so a dropExisting replacement serves cleanly.

Up one level: automated ingestion pipeline orchestration is the parent that defines the deterministic, idempotent submission contract every path on this page honors.
