Dynamic Ingestion Spec Generation for Apache Druid

Dynamic ingestion spec generation replaces hand-maintained, per-datasource JSON files with a runtime function that computes the ingestion descriptor from live metadata, cluster topology, and volume telemetry. When a new interval arrives, the generator resolves the schema, partitioning strategy, and tuning parameters at execution time rather than reading them from a static template that has silently drifted out of sync with the source. Operationally this matters because a stale dimensionsSpec or a hardcoded targetRowsPerSegment is a frequent cause of oversized segments, unparseable-row spikes, and backfills that overlap live data. Within automated ingestion pipeline orchestration, spec generation is the build phase that feeds every downstream step: it is the code path that turns a data contract into an executable Druid task.

Mechanics & Internals

A generated spec must resolve three independent axes before it is well-formed: the data-source topology (which columns are dimensions, which are metrics, what the timestamp column and format are), the segment granularity and partitioning (how the interval is sliced into physical segments), and the tuning envelope (memory, concurrency, and persist cadence on the worker). The generator sources the first axis from a metadata catalog — AWS Glue, the Apache Hive Metastore, or an internal schema registry — and reads the second and third from the Druid cluster's own telemetry rather than from constants baked into a file.

The output is always the same fixed JSON shape that the Overlord accepts on POST /druid/indexer/v1/task: a top-level type and a spec object containing dataSchema, ioConfig, and tuningConfig. What changes from run to run is the content of those objects. Resolving dimensionsSpec and metricsSpec from the catalog at submission time means the pipeline absorbs schema evolution automatically: a new column whose type and role are already mapped becomes a new dimension without a human editing a template. The sibling page on schema validation for Druid specs guards the same problem from the opposite direction by rejecting incompatible changes before they reach the Overlord.

Three internal decisions drive the generator's design. First, granularity resolution: the granularitySpec.segmentGranularity fixes how many segments an interval produces, so the generator derives it from the query-access pattern and expected daily volume rather than defaulting to DAY everywhere. It maps directly onto the segment granularity settings that govern partition boundaries. Second, partition-strategy selection: the generator chooses a range/single_dim partitionsSpec when a high-cardinality dimension carries query locality, and hashed when the goal is even distribution — a decision made from column cardinality, not guesswork. Third, rollup and encoding awareness: because rollup: true collapses rows within a queryGranularity bucket, the row-count telemetry the generator uses to size segments must be measured post-rollup, and the byte estimate must account for the dictionary encoding of the columnar storage formats that back each segment.

Catalog resolution is where most generators earn or lose their reliability. A metadata catalog reports source types in its own vocabulary — Glue emits Hive-style types like string, bigint, double, timestamp; a Parquet footer carries logical types; an Avro schema carries unions and nullability. The generator must translate each into a Druid dimension or metric decision, and the translation is not one-to-one. A bigint measure column becomes a longSum metric, not a long dimension; a timestamp column becomes the timestampSpec, not a dimension at all; a high-cardinality string like user_id should never become a dimension because its dictionary would dominate segment size, so the generator routes it to a thetaSketch or HLLSketch aggregator instead. Encoding these rules as an explicit type-mapping table — keyed on catalog type plus a per-column role annotation — is what keeps the emitted dimensionsSpec and metricsSpec correct as the source schema evolves. When the catalog reports a genuinely new column with no mapping rule, the safe default is to reject the run and route the column to the validation gate for a human decision rather than silently guessing a role.
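
The type-mapping table described above might be sketched as follows. This is a minimal illustration, not the generator's actual code: the Column tuple, the role vocabulary ("dimension", "measure", "distinct", "timestamp"), the rule set, and the "iso" timestamp format are all assumptions made for the example.

```python
from typing import NamedTuple, Optional


class Column(NamedTuple):
    name: str
    catalog_type: str  # Hive-style type as reported by the catalog
    role: str          # hypothetical per-column role annotation


# (catalog type, role) -> (kind, Druid type). Illustrative, not exhaustive.
TYPE_RULES = {
    ("string", "dimension"): ("dimension", "string"),
    ("bigint", "dimension"): ("dimension", "long"),
    ("double", "dimension"): ("dimension", "double"),
    ("bigint", "measure"): ("metric", "longSum"),
    ("double", "measure"): ("metric", "doubleSum"),
    ("string", "distinct"): ("metric", "thetaSketch"),
    ("timestamp", "timestamp"): ("timestamp", None),
}


class UnmappedColumn(Exception):
    """Raised when the catalog reports a column with no mapping rule."""


def resolve_schema(columns):
    """Return (timestampSpec, dimensions, metrics) for the given columns."""
    timestamp: Optional[dict] = None
    dimensions, metrics = [], []
    for col in columns:
        rule = TYPE_RULES.get((col.catalog_type, col.role))
        if rule is None:
            # Safe default: reject the run instead of guessing a role.
            raise UnmappedColumn(f"{col.name}: {col.catalog_type}/{col.role}")
        kind, druid_type = rule
        if kind == "timestamp":
            timestamp = {"column": col.name, "format": "iso"}  # format assumed
        elif kind == "dimension":
            # Strings are listed by name; typed dimensions are emitted as objects.
            dimensions.append(
                col.name if druid_type == "string"
                else {"name": col.name, "type": druid_type}
            )
        else:
            metrics.append(
                {"type": druid_type, "name": col.name, "fieldName": col.name}
            )
    return timestamp, dimensions, metrics
```

Keying on the pair rather than on the catalog type alone is what lets a bigint column be a long dimension in one role and a longSum metric in another.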

The generator never mutates published data. Druid segments are immutable and versioned (datasource_interval_version_partitionNum), so a correction produces a new version that atomically supersedes its predecessor on handoff. The build phase therefore only ever emits a spec whose intervals, appendToExisting, and dropExisting flags express intent precisely — extend an interval, or replace it — and leaves the run-phase reconciliation to async task execution patterns.
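
One way to keep that intent explicit is to derive the two flags from a single named intent instead of setting them independently. The sketch below assumes a hypothetical intent string ("replace" or "append"); the function name and error messages are illustrative.

```python
def io_flags(intent: str, intervals: list) -> dict:
    """Map a build intent onto the ioConfig flags.

    `intent` is a hypothetical enum-like string: "replace" or "append".
    """
    if intent == "replace":
        # Replacement is only well-defined over an explicit window.
        if not intervals:
            raise ValueError("replace requires explicit granularitySpec.intervals")
        return {"appendToExisting": False, "dropExisting": True}
    if intent == "append":
        return {"appendToExisting": True, "dropExisting": False}
    raise ValueError(f"unknown intent: {intent!r}")
```

Because the flags are never set independently, a spec cannot ask to append and drop in the same run.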

Validated Configuration Spec

Below is a complete index_parallel batch spec as the generator would emit it after resolving the catalog and telemetry; every load-bearing field is documented in the list that follows the spec. It includes all required top-level keys (type, spec, and within it dataSchema, ioConfig, tuningConfig), so it can serve as a starting template once field names are checked against the ingestion reference for your Druid release. The thetaSketch aggregator requires the druid-datasketches extension to be loaded.

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "events_web",
      "timestampSpec": { "column": "event_ts", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": [
          "country",
          "device",
          { "name": "page_id", "type": "long" }
        ]
      },
      "metricsSpec": [
        { "type": "count", "name": "rows" },
        { "type": "longSum", "name": "bytes", "fieldName": "resp_bytes" },
        { "type": "thetaSketch", "name": "users", "fieldName": "user_id" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": true,
        "intervals": ["2026-07-04T00:00:00Z/2026-07-05T00:00:00Z"]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "prefixes": ["s3://lake/events/web/2026-07-04/"]
      },
      "inputFormat": { "type": "json" },
      "appendToExisting": false,
      "dropExisting": true
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsInMemory": 1000000,
      "maxNumConcurrentSubTasks": 4,
      "intermediatePersistPeriod": "PT10M",
      "forceGuaranteedRollup": true,
      "partitionsSpec": {
        "type": "range",
        "partitionDimensions": ["country"],
        "targetRowsPerSegment": 5000000
      }
    }
  }
}

The fields the generator computes rather than hardcodes are:

dataSchema.dimensionsSpec.dimensions — resolved from the catalog. String dimensions can be listed by name; typed dimensions (long, float, double) are emitted as objects so Druid does not misinfer them as strings.
dataSchema.metricsSpec — the aggregators. count and longSum come from the metric contract; a thetaSketch is emitted when the catalog marks a column for approximate distinct counting so raw high-cardinality IDs never become dimensions.
granularitySpec.segmentGranularity — derived from expected volume so the interval yields segments near the target size. intervals bounds the task to exactly the window being (re)built, which is what makes dropExisting safe.
granularitySpec.queryGranularity and rollup — the pre-aggregation contract; with rollup: true, rows sharing all dimensions within a queryGranularity bucket collapse into one.
ioConfig.dropExisting: true — combined with an explicit intervals, this makes the task atomically replace every existing segment in the window on handoff, the correct primitive for deterministic backfills.
tuningConfig.partitionsSpec — range on country here for query-locality; the generator swaps to { "type": "hashed", "numShards": null } when no dimension offers useful locality. forceGuaranteedRollup requires a range or hashed (non-dynamic) partitionsSpec and a bounded intervals, both of which the generator guarantees.
tuningConfig.maxRowsInMemory and intermediatePersistPeriod — sized from worker heap telemetry, not fixed constants, so the spill cadence tracks available memory. Full field semantics are maintained in the official Apache Druid ingestion spec reference.

Sizing Heuristics & Formulas

The generator's job is to land segments in Druid's sweet spot of roughly 300–700 MB (about 5 million rows for typical event data). Undersized segments inflate metadata and query-planning overhead; oversized segments strain Historical JVM heaps. Given a target size in megabytes and the average post-compression row width measured from segment telemetry, the row target is:

$$ \text{targetRows} \approx \frac{\text{targetMB} \times 1048576}{\text{avgRowBytes}} $$

For 500 MB segments over rows averaging \( \approx 105 \) bytes on disk, that yields \( \approx 5 \times 10^{6} \) rows, which is the targetRowsPerSegment written into the spec above. Because rollup collapses rows, the generator must feed this formula the post-rollup row estimate, which it derives from the observed rollup ratio \( \rho \) (output rows over input rows) of prior runs:

$$ \text{rowsPerInterval} \approx \rho \times \text{rawEventsPerInterval} $$

The number of segments an interval produces then drives Coordinator and Historical load. Treat the result as a lower bound: Druid partitions each segmentGranularity time chunk separately, so an interval that spans many chunks yields at least one segment per non-empty chunk:

$$ \text{segmentsPerInterval} \approx \left\lceil \frac{\text{rowsPerInterval}}{\text{targetRows}} \right\rceil $$
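
As a worked example, the three formulas can be chained in a few lines. The 60 million raw events and the rollup ratio of 0.25 are hypothetical inputs chosen for illustration; the 500 MB target and 105-byte row width come from the text above.

```python
import math


def target_rows(target_mb: int, avg_row_bytes: int) -> int:
    """targetRows = targetMB * 1048576 / avgRowBytes (floored)."""
    return (target_mb * 1_048_576) // avg_row_bytes


def rows_per_interval(raw_events: int, rollup_ratio: float) -> int:
    """Post-rollup rows = rho * raw events, rounded up."""
    return math.ceil(rollup_ratio * raw_events)


def segments_per_interval(rows: int, rows_target: int) -> int:
    """Segments needed to hold `rows` at `rows_target` rows each."""
    return math.ceil(rows / rows_target)


rows_target = target_rows(500, 105)             # 4_993_219, about 5e6
rows = rows_per_interval(60_000_000, 0.25)      # 15_000_000 post-rollup rows
print(segments_per_interval(rows, rows_target))  # 4
```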

Partition-strategy selection is itself a numeric decision. The generator picks range/single_dim when the candidate dimension's cardinality is high enough to distribute rows evenly yet low enough that most queries filter on it; a rough guard is to prefer hashed distribution once the target segment count exceeds the distinct values available to range-partition on:

$$ \text{useRange} \iff \text{distinct}(\text{dim}) \gtrsim \text{segmentsPerInterval} $$
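
The rough guard in the preceding paragraph (prefer hashed once the target segment count exceeds the distinct values available) could be encoded as below. The function and parameter names are illustrative, and a production generator would likely also weigh key skew, which this sketch ignores.

```python
from typing import Optional


def choose_partitions_spec(part_dim: Optional[str], distinct_values: int,
                           segments: int, rows_target: int) -> dict:
    """Pick range when the candidate dimension has enough distinct values
    to give each target segment its own key range; otherwise hash."""
    if part_dim is not None and distinct_values >= segments:
        return {
            "type": "range",
            "partitionDimensions": [part_dim],
            "targetRowsPerSegment": rows_target,
        }
    return {"type": "hashed", "targetRowsPerSegment": rows_target}
```

For example, a country dimension with about 200 distinct values and 4 target segments selects range, while a 3-value device dimension against 10 target segments falls back to hashed.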

Finally, submission concurrency is bounded by worker capacity. A MiddleManager runs at most druid.worker.capacity task slots. With maxNumConcurrentSubTasks set to \( k \), each index_parallel task can occupy up to \( k + 1 \) slots (one for the supervisor task plus \( k \) sub-tasks), so cluster-wide demand is roughly \( (k + 1) \times \text{numRunningTasks} \); demand beyond the total slot count queues tasks in PENDING. The generator therefore reads current free capacity and caps \( k \) so a single generated task never starves concurrent pipelines. When streaming fragments the output below target, a natural consequence of fine segmentGranularity, the reconciliation back toward these sizes is handled by automated compaction scheduling rather than by re-ingesting.
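
A sketch of that cap, operating on the already-parsed JSON from the Overlord's /druid/indexer/v1/workers endpoint (the worker.capacity and currCapacityUsed fields are the ones used in the diagnostics later on this page). The reserve parameter, and the assumption that the supervisor task itself takes one slot, are choices made for this illustration.

```python
def free_slots(workers: list) -> int:
    """Sum unused task slots across all reported workers."""
    return sum(
        max(w["worker"]["capacity"] - w["currCapacityUsed"], 0)
        for w in workers
    )


def cap_subtasks(requested_k: int, workers: list, reserve: int = 1) -> int:
    """Cap maxNumConcurrentSubTasks to the slots actually available.

    One slot is set aside for the supervisor task (assumed), and `reserve`
    slots (hypothetical headroom) are left for other pipelines. Never
    returns less than 1, so the task can still make progress.
    """
    available = free_slots(workers) - 1 - reserve
    return max(1, min(requested_k, available))
```

With two workers of capacity 4, one using 1 slot and the other full, three slots are free; a requested k of 4 is cut to 1 with the default reserve, or to 2 with reserve=0.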

Python Orchestration Snippet

The reference generator below uses only the Python standard library plus requests. It queries cluster telemetry to size the segment, resolves partitioning, assembles the spec, and submits it to the Overlord with capped exponential backoff. The deeper, validation-first walkthrough — including jsonschema gating and CI/CD integration — lives in the child guide, automating Druid ingestion specs with Python.

import time
import math
import hashlib
import requests

COORDINATOR = "http://coordinator:8081"
OVERLORD = "http://overlord:8090"


def avg_row_bytes(datasource: str, default: int = 105) -> int:
    """Estimate average on-disk row size from current segment telemetry.

    Assumes each segment entry carries "size" and "num_rows". If "num_rows"
    is absent from the response, the row total is zero and `default` is used.
    """
    r = requests.get(
        f"{COORDINATOR}/druid/coordinator/v1/datasources/{datasource}/segments",
        params={"full": "true"},
        timeout=15,
    )
    r.raise_for_status()
    segs = r.json()
    # `or 0` also covers fields that are present but null.
    total_bytes = sum(s.get("size") or 0 for s in segs)
    total_rows = sum(s.get("num_rows") or 0 for s in segs)
    return math.ceil(total_bytes / total_rows) if total_rows else default


def target_rows(datasource: str, target_mb: int = 500) -> int:
    """targetRows = targetMB * 1048576 / avgRowBytes, clamped to a sane band."""
    rows = (target_mb * 1048576) // avg_row_bytes(datasource)
    return max(1_000_000, min(rows, 10_000_000))


def build_spec(datasource: str, interval: str, dimensions: list, metrics: list,
               s3_prefix: str, part_dim: str | None) -> dict:
    """Assemble a validated index_parallel spec from resolved inputs."""
    if part_dim:
        partitions = {"type": "range", "partitionDimensions": [part_dim],
                      "targetRowsPerSegment": target_rows(datasource)}
    else:
        partitions = {"type": "hashed", "targetRowsPerSegment": target_rows(datasource)}
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "event_ts", "format": "iso"},
                "dimensionsSpec": {"dimensions": dimensions},
                "metricsSpec": metrics,
                "granularitySpec": {
                    "type": "uniform", "segmentGranularity": "HOUR",
                    "queryGranularity": "MINUTE", "rollup": True,
                    "intervals": [interval],
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {"type": "s3", "prefixes": [s3_prefix]},
                "inputFormat": {"type": "json"},
                "appendToExisting": False, "dropExisting": True,
            },
            "tuningConfig": {
                "type": "index_parallel", "maxRowsInMemory": 1_000_000,
                "maxNumConcurrentSubTasks": 4, "forceGuaranteedRollup": True,
                "partitionsSpec": partitions,
            },
        },
    }


def deterministic_task_id(datasource: str, interval: str, spec_version: str) -> str:
    """Content-hashed ID so a resubmission of identical work is idempotent."""
    key = f"{datasource}:{interval}:{spec_version}".encode()
    return f"idx_{datasource}_{hashlib.sha1(key).hexdigest()[:12]}"


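def submit_with_backoff(spec: dict, task_id: str | None = None,
                        retries: int = 5) -> str:
    """POST the spec to the Overlord, retrying transient failures with backoff.

    If task_id is given (for example from deterministic_task_id), it is sent
    as the task's top-level "id" so a resubmission reuses the same ID.
    """
    payload = {**spec, "id": task_id} if task_id else spec
    delay = 2.0
    for attempt in range(retries):
        try:
            r = requests.post(f"{OVERLORD}/druid/indexer/v1/task",
                              json=payload, timeout=30)
            r.raise_for_status()
            return r.json()["task"]
        except requests.RequestException as exc:
            status = getattr(exc.response, "status_code", None)
            # A 4xx (for example a malformed spec or an already-registered
            # task ID) will not succeed on retry, so surface it immediately.
            if status is not None and 400 <= status < 500:
                raise
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay = min(delay * 2, 30.0)  # cap backoff at 30s
    raise RuntimeError("unreachable")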

Three patterns make this production-safe. Telemetry-driven sizing: target_rows reads current segment bytes-per-row from the Coordinator so the spec tracks real data, not a stale constant. Deterministic IDs: when the content-hashed ID from deterministic_task_id is sent as the task's top-level id, a retried run submits the same task ID, and the Overlord rejects the duplicate rather than double-ingesting. Capped backoff: submission retries start at two seconds and double to a thirty-second ceiling, protecting the Overlord during leader elections or transient network partitions. Validation is factored out deliberately: a malformed spec must be rejected pre-flight, not after it consumes an Overlord slot.

Failure Modes & Diagnostics

Spec-generation failures surface in a few recognizable ways. Diagnose each from the Coordinator and Overlord REST APIs.

1. Generated spec rejected with a parse error. The task returns FAILED immediately. Pull the full report and read the error message:

curl -s http://overlord:8090/druid/indexer/v1/task/$TASK_ID/reports \
  | jq '.ingestionStatsAndErrors.payload.errorMsg'

A common cause is a catalog column emitted as a bare name, which Druid then treats as a string dimension even when the source values are numeric. Pin the type in dimensionsSpec ({"name": "page_id", "type": "long"}).

2. Segments land far off target size. Inspect the produced segment distribution and compare against the target:

curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/events_web/segments?full=true" \
  | jq '[.[] | {rows: .num_rows, mb: (.size/1048576 | floor)}] | sort_by(.mb)'

If most segments are tiny, the rollup ratio used to size targetRowsPerSegment was wrong; re-measure \( \rho \) from post-rollup telemetry. If a few are huge, the partitionDimensions choice is skewed; switch to hashed.

3. forceGuaranteedRollup task refuses to start. The Overlord rejects the spec because guaranteed rollup requires a non-dynamic partitionsSpec and a bounded intervals. Confirm both are present:

echo "$GENERATED_SPEC" | jq '.spec.tuningConfig.partitionsSpec.type,
  .spec.dataSchema.granularitySpec.intervals'
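
The same two conditions can be checked inside the generator before submission. This is a minimal pre-flight sketch; the function name and the messages are illustrative, and it covers only the two requirements named above.

```python
NON_DYNAMIC_TYPES = {"hashed", "range", "single_dim"}


def guaranteed_rollup_problems(task: dict) -> list:
    """Return human-readable problems; an empty list means the check passed."""
    spec = task.get("spec", {})
    tuning = spec.get("tuningConfig", {})
    if not tuning.get("forceGuaranteedRollup"):
        return []  # nothing to enforce when guaranteed rollup is off

    problems = []
    ptype = tuning.get("partitionsSpec", {}).get("type")
    if ptype not in NON_DYNAMIC_TYPES:
        problems.append(f"partitionsSpec.type is {ptype!r}, expected non-dynamic")

    granularity = spec.get("dataSchema", {}).get("granularitySpec", {})
    if not granularity.get("intervals"):
        problems.append("granularitySpec.intervals is missing or empty")
    return problems
```

Returning a list rather than raising on the first problem lets the validation gate report every issue in a single pass.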

4. Backfill overlaps live data. Duplicate rows appear for a reprocessed interval. The generator emitted appendToExisting: true instead of dropExisting: true with an explicit intervals. Verify the flags and the version that actually loaded:

curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/events_web/segments?full=true" \
  | jq 'group_by(.interval)[] | {interval: .[0].interval, versions: (map(.version) | unique)}'

More than one live version for the same interval means a replace did not supersede cleanly — re-run with dropExisting: true.
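
For an automated post-handoff audit, the same grouping can be done in Python over the parsed segment list. The sketch assumes each entry has interval and version keys, as the jq filter above does; the function name is illustrative.

```python
from collections import defaultdict


def intervals_with_multiple_versions(segments: list) -> dict:
    """Map each interval that carries more than one version to its versions."""
    versions = defaultdict(set)
    for seg in segments:
        versions[seg["interval"]].add(seg["version"])
    return {
        interval: sorted(found)
        for interval, found in versions.items()
        if len(found) > 1
    }
```

An empty result means every interval in the response carries a single version.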

5. Tasks pile up in PENDING. The generator submits faster than worker slots free up. Check capacity before throttling the submit rate:

curl -s http://overlord:8090/druid/indexer/v1/workers \
  | jq '[.[] | {host: .worker.host, used: .currCapacityUsed, cap: .worker.capacity}]'

Streaming-side triage of stuck supervisors extends this — see debugging Druid supervisor task failures.

Automation Checklist

Wire these gates into the generator so every emitted spec is validated before submission and audited after handoff.

Related

Automating Druid ingestion specs with Python — the validation-first builder walkthrough with jsonschema gating and CI/CD integration.
Schema validation for Druid specs — the pre-flight contract that rejects malformed or incompatible specs before they reach the Overlord.
Async task execution patterns — non-blocking submission, status polling, and backoff once a generated spec is in flight.
Batch vs streaming ingestion sync — harmonize generated batch backfills with live Kafka or Kinesis streams without overlapping segments.
Segment size optimization strategies — keep the sizes this generator targets stable across the segment lifecycle.

Up one level: Automated ingestion pipeline orchestration frames how spec generation, validation, and async execution fit together into one deterministic pipeline.
