Schema Validation for Apache Druid Ingestion Specs

An ingestion spec that reaches the Druid Overlord malformed is an expensive failure: the task consumes a Peon slot, runs partway, and either dies with an opaque parse exception or — worse — silently drops rows and publishes a segment whose metricsSpec no longer matches its neighbours. Schema validation is the pre-flight gate that catches these defects while they are still cheap JSON on a CI runner, before any cluster capacity is spent. This page is the validation contract inside automated ingestion pipeline orchestration: the two layers every generated spec must clear — a structural JSON Schema contract and a set of Druid-specific semantic rules — and how to drive that gate from a deterministic Python orchestrator that refuses to call POST /druid/indexer/v1/task until the payload is proven correct.

Mechanics & Internals

A Druid ingestion spec is a deeply nested JSON document with four load-bearing top-level regions: the outer type (for example index_parallel or kafka), and inside spec the dataSchema, ioConfig, and tuningConfig. Validation has to reason about all four together, because the fields that break ingestion are almost never syntactically invalid JSON — they are structurally valid values that violate a Druid runtime constraint the JSON parser has no way to know about. That splits the problem into two layers that must run in order.

Structural contract enforcement is what a formal JSON Schema does well: it asserts that required keys exist, that segmentGranularity is one of Druid's accepted enum values, that maxRowsPerSegment is an integer rather than a string, and that metricsSpec is an array of objects each carrying a type and name. This layer is pure shape checking against a versioned schema aligned with the Apache Druid ingestion reference, and it catches the large class of defects that come from hand-edited or badly templated JSON — a missing timestampSpec, a typo'd granularitySpec, a null where an object belongs.

Semantic validation enforces the constraints JSON Schema structurally cannot express, because they are relationships between fields rather than properties of a single field. Three matter most in practice:

Timestamp/inputFormat alignment. The timestampSpec.format (auto, iso, posix, millis, or a Joda pattern) must match the values the inputFormat actually emits for that column. A spec that declares "format": "iso" over a column carrying epoch millis is structurally perfect and operationally broken: every row fails to parse, and the rows are dropped quietly unless maxParseExceptions is tight enough to fail the task.
Aggregator column existence and type compatibility. Druid performs no implicit casting during rollup. Every metricsSpec aggregator with a fieldName must reference a column the input genuinely produces, and the aggregator type must be compatible with that column's type. A longSum or doubleSum over a STRING column, or a thetaSketch over a field that was never emitted, goes wrong only at the Peon, after the task has already started: as a task failure or as a metric that is quietly null or zero.
Rollup consistency. When granularitySpec.rollup is true, the set of dimensionsSpec.dimensions and the set of aggregated columns must be disjoint and complete: a column that is neither a declared dimension nor an aggregator input silently vanishes from the rolled-up segment, changing query results in a way no error ever reports.
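The timestamp alignment rule can be approximated before submission by sampling a few raw values from the timestamp column and checking them against the declared format. The sketch below is a heuristic under stated assumptions: the function names and regular expressions are hypothetical, the patterns are deliberately loose, and Joda patterns are passed through unchecked.

```python
import re

# Loose, illustrative patterns; they are not Druid's actual parsers.
_ISO = re.compile(
    r"^\d{4}-\d{2}-\d{2}"
    r"([T ]\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:?\d{2})?)?$"
)
_DIGITS = re.compile(r"^-?\d+$")
_EPOCH_FORMATS = {"posix", "millis", "micro", "nano"}


def sample_matches_format(value, fmt: str) -> bool:
    """Return True if one sampled value is plausible for the declared format."""
    text = str(value).strip()
    if fmt == "auto":
        # auto is treated here as "ISO or epoch digits".
        return bool(_ISO.match(text) or _DIGITS.match(text))
    if fmt == "iso":
        return bool(_ISO.match(text))
    if fmt in _EPOCH_FORMATS:
        return bool(_DIGITS.match(text))
    return True  # assumption: custom Joda patterns are not checked in this sketch


def mismatched_samples(samples, fmt: str) -> list:
    """Collect the sampled values that do not fit the declared format."""
    return [s for s in samples if not sample_matches_format(s, fmt)]
```

A non-empty result from mismatched_samples on even a handful of rows is a strong hint that the declared format and the emitted column disagree, which is cheaper to learn on a CI runner than from a task report.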

The reason this gate belongs after templating and before submission is that the spec a pipeline actually sends is rarely the one an engineer wrote — it is the interpolated output of dynamic ingestion spec generation, with environment overrides, computed partition boundaries, and metadata-catalog column lists injected at runtime. Validation must run on the fully resolved payload, because that is the only artifact whose correctness the Overlord will judge. Druid exposes exactly one relevant hook here: POST /druid/indexer/v1/task returns 400 with a diagnostic body for a spec it can reject cheaply, but it accepts many semantically wrong specs and fails them later at the Peon — which is precisely why the orchestrator cannot rely on the Overlord as its validator and must own the contract itself. The interval and rollup semantics the aggregators depend on are the same segment granularity settings that govern partition boundaries, so a validator that understands granularity is validating the segment shape, not just the JSON.

Validated Configuration Spec

Validation needs two artifacts: the JSON Schema contract the structural layer checks against, and a complete, correct ingestion spec that passes both layers and serves as the golden reference. Below is a trimmed but faithful JSON Schema for the dataSchema core, annotated with what each constraint buys:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["type", "spec"],
  "properties": {
    "type": { "enum": ["index_parallel", "kafka", "kinesis", "index_hadoop"] },
    "spec": {
      "type": "object",
      "required": ["dataSchema", "ioConfig", "tuningConfig"],
      "properties": {
        "dataSchema": {
          "type": "object",
          "required": ["dataSource", "timestampSpec", "dimensionsSpec", "granularitySpec"],
          "properties": {
            "dataSource": { "type": "string", "minLength": 1 },
            "timestampSpec": {
              "type": "object",
              "required": ["column", "format"],
              "properties": {
                "column": { "type": "string" },
                "format": { "enum": ["auto", "iso", "posix", "millis", "micro", "nano"] }
              }
            },
            "metricsSpec": {
              "type": "array",
              "items": {
                "type": "object",
                "required": ["type", "name"],
                "properties": {
                  "type": {
                    "enum": ["count", "longSum", "doubleSum", "floatSum",
                             "longMin", "longMax", "thetaSketch", "HLLSketchBuild"]
                  },
                  "name": { "type": "string" },
                  "fieldName": { "type": "string" }
                }
              }
            },
            "granularitySpec": {
              "type": "object",
              "required": ["segmentGranularity"],
              "properties": {
                "segmentGranularity": {
                  "enum": ["MINUTE", "HOUR", "DAY", "WEEK", "MONTH", "YEAR"]
                },
                "rollup": { "type": "boolean" }
              }
            }
          }
        }
      }
    }
  }
}

type enum: rejects a spec whose ingestion type is misspelled or unsupported before the Overlord has to; the four values cover native batch, streaming, and Hadoop paths.
timestampSpec.required and format enum: guarantees a timestamp column is declared and that its format is one of Druid's named formats; this is the structural half of the timestamp/inputFormat alignment the semantic layer completes. As written, the enum rejects custom Joda patterns, so a pipeline that relies on them has to widen it.
metricsSpec.items.required: every aggregator must carry type and name; fieldName is optional here because count takes none, which is exactly the case the semantic layer must special-case rather than the schema.
granularitySpec.segmentGranularity enum: pins partition-boundary granularity to accepted values so a templated "segmentGranularity": "hourly" (not a Druid granularity name) is caught structurally.

A complete index_parallel spec that passes both layers — the golden reference a validator compares interpolated output against — includes every required top-level key:

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "events_web",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": ["country", "device", "page_id"]
      },
      "metricsSpec": [
        { "type": "count", "name": "rows" },
        { "type": "longSum", "name": "bytes", "fieldName": "resp_bytes" },
        { "type": "thetaSketch", "name": "uniq_users", "fieldName": "user_id" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": true
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "s3", "prefixes": ["s3://lake/events/web/2026-07-03/"] },
      "inputFormat": { "type": "json" },
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsInMemory": 1000000,
      "maxRowsPerSegment": 5000000,
      "maxNumConcurrentSubTasks": 4,
      "maxParseExceptions": 0,
      "partitionsSpec": { "type": "hashed", "targetRowsPerSegment": 5000000 }
    }
  }
}

metricsSpec[*].fieldName — bytes and uniq_users reference resp_bytes and user_id; the semantic layer must confirm both columns exist in the source and that longSum/thetaSketch are type-compatible with them, while count carries no fieldName and is skipped.
tuningConfig.maxParseExceptions: 0 — makes the timestamp/inputFormat contract strict: any unparseable row fails the task rather than being silently tolerated, turning a validation miss into a loud failure instead of a quiet data loss.
tuningConfig.partitionsSpec.targetRowsPerSegment — must be consistent with the sizing target below so validated specs produce segments the segment size optimization strategies expect.

Sizing Heuristics & Formulas

Two of the semantic checks are numeric, so a validator that only asserts presence misses them. The first bounds the parse-exception tolerance. maxParseExceptions should never be an open-ended number — it is a fraction of expected volume, and a validator can reject any spec whose tolerance exceeds a policy ceiling:

$$ \text{maxParseExceptions} \approx \lceil \text{expectedRows} \times \text{tolerableErrorRate} \rceil $$

For a batch of \( \approx 5 \times 10^{7} \) rows at a tolerated error rate of \( 10^{-4} \), that caps acceptable exceptions near 5000; a spec asking for maxParseExceptions: 100000 over the same batch is masking a real timestamp or type defect and should fail validation, not ingestion.
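The ceiling above is simple enough to enforce in code. The following is a minimal sketch, assuming hypothetical helper names (parse_exception_ceiling, check_parse_budget) and assuming that an unset maxParseExceptions should be rejected because it leaves the tolerance open-ended. The rate is converted through Fraction so the ceiling is not disturbed by floating-point rounding.

```python
import math
from fractions import Fraction


def parse_exception_ceiling(expected_rows: int, tolerable_error_rate) -> int:
    """ceil(expectedRows * tolerableErrorRate), computed exactly."""
    rate = Fraction(str(tolerable_error_rate))
    return math.ceil(expected_rows * rate)


def check_parse_budget(spec: dict, expected_rows: int, tolerable_error_rate) -> None:
    """Raise ValueError if the spec's tolerance is unset or above the policy ceiling."""
    declared = spec["spec"]["tuningConfig"].get("maxParseExceptions")
    if declared is None:
        # Assumption: policy treats an unset tolerance as open-ended.
        raise ValueError("maxParseExceptions is not set")
    ceiling = parse_exception_ceiling(expected_rows, tolerable_error_rate)
    if declared > ceiling:
        raise ValueError(
            f"maxParseExceptions={declared} exceeds policy ceiling {ceiling}"
        )
```

With the figures from the text, parse_exception_ceiling(50_000_000, 1e-4) gives 5000, so a spec declaring 100000 is rejected while the golden spec's 0 passes.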

The second ties the aggregator/partition contract to segment shape. The targetRowsPerSegment a validated spec declares must land segments in the target size band, which is the same relationship every segment size optimization strategy rests on:

$$ \text{targetRows} \approx \frac{\text{targetMB} \times 1048576}{\text{avgRowBytes}} $$

For 500 MB segments over rows averaging \( \approx 105 \) bytes on disk after columnar compression, that yields \( \approx 5 \times 10^{6} \) rows, the targetRowsPerSegment value in the golden spec above. A validator can therefore flag any spec whose declared target deviates from the computed band by more than a tolerance factor, catching a mis-templated partition size before it produces thousands of undersized segments. Finally, rollup only pays off when the aggregation actually collapses rows; the expected reduction is

$$ \text{rollupRatio} \approx \frac{\text{rawRows}}{\text{distinctDimensionTuples}} $$

so a spec that declares rollup: true while the dimension cardinality approaches the raw row count is structurally valid but semantically pointless — a warning-level validation finding that steers the author toward dropping rollup or coarsening queryGranularity rather than paying its ingestion cost for no gain.
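Both the segment-size band and the rollup ratio reduce to a few lines of arithmetic. This is a sketch rather than a policy: the function names are hypothetical, and the tolerance factor of 1.25 and the minimum useful ratio of 1.1 are placeholder thresholds a team would choose for itself.

```python
def target_rows(target_mb: float, avg_row_bytes: float) -> int:
    """targetRows ~= targetMB * 1048576 / avgRowBytes, rounded to a whole row count."""
    return round(target_mb * 1048576 / avg_row_bytes)


def within_band(declared_rows: int, computed_rows: int, tolerance: float = 1.25) -> bool:
    """True if the declared target is within a multiplicative tolerance of the computed one."""
    ratio = declared_rows / computed_rows
    return 1 / tolerance <= ratio <= tolerance


def rollup_ratio(raw_rows: int, distinct_dimension_tuples: int) -> float:
    """rollupRatio ~= rawRows / distinctDimensionTuples."""
    if distinct_dimension_tuples <= 0:
        raise ValueError("distinct_dimension_tuples must be positive")
    return raw_rows / distinct_dimension_tuples


def rollup_is_marginal(raw_rows: int, distinct_dimension_tuples: int,
                       min_ratio: float = 1.1) -> bool:
    """Warning-level finding: rollup collapses too few rows to be worth its cost."""
    return rollup_ratio(raw_rows, distinct_dimension_tuples) < min_ratio
```

For the worked example, target_rows(500, 105) comes to 4993219, so the golden spec's 5000000 sits comfortably inside the band, while a mis-templated 50000 would not.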

Python Orchestration Snippet

The validator below runs both layers and only then submits, polls to a terminal state with capped exponential backoff, and returns the outcome — the same submit/poll discipline the parent orchestration framework and async task execution patterns establish. It uses only the standard library plus requests and the jsonschema reference validator; the semantic layer is hand-written because its rules are relational.

import json
import time
import hashlib
import requests
from jsonschema import Draft202012Validator

OVERLORD = "http://overlord:8090"
TERMINAL = {"SUCCESS", "FAILED"}

# Aggregator type -> the source column types Druid will accept for it.
NUMERIC = {"LONG", "FLOAT", "DOUBLE"}
AGG_INPUT_TYPES = {
    "longSum": NUMERIC, "doubleSum": NUMERIC, "floatSum": NUMERIC,
    "longMin": NUMERIC, "longMax": NUMERIC,
    "thetaSketch": {"STRING", "LONG"}, "HLLSketchBuild": {"STRING", "LONG"},
}


class SpecInvalid(Exception):
    """Raised on any structural or semantic validation failure."""


def validate_structure(spec: dict, contract: dict) -> None:
    """Layer 1: strict JSON Schema contract. Collects every violation, not just the first."""
    errors = sorted(Draft202012Validator(contract).iter_errors(spec), key=lambda e: e.path)
    if errors:
        joined = "; ".join(f"{list(e.path)}: {e.message}" for e in errors)
        raise SpecInvalid(f"structural: {joined}")


def validate_semantics(spec: dict, source_columns: dict) -> None:
    """Layer 2: Druid runtime constraints. source_columns maps column name -> type."""
    ds = spec["spec"]["dataSchema"]

    # Timestamp column must exist in the source.
    ts_col = ds["timestampSpec"]["column"]
    if ts_col not in source_columns:
        raise SpecInvalid(f"semantic: timestamp column '{ts_col}' absent from source")

    dimensions = {
        d["name"] if isinstance(d, dict) else d
        for d in ds.get("dimensionsSpec", {}).get("dimensions", [])
    }

    for metric in ds.get("metricsSpec", []):
        field = metric.get("fieldName")
        if metric["type"] == "count":
            continue  # count takes no fieldName
        if field is None:
            raise SpecInvalid(f"semantic: aggregator '{metric['name']}' needs a fieldName")
        if field not in source_columns:
            raise SpecInvalid(
                f"semantic: metric '{metric['name']}' references missing column '{field}'"
            )
        allowed = AGG_INPUT_TYPES.get(metric["type"], set())
        col_type = source_columns[field].upper()
        if allowed and col_type not in allowed:
            raise SpecInvalid(
                f"semantic: {metric['type']} '{metric['name']}' cannot aggregate "
                f"{col_type} column '{field}'"
            )
        if field in dimensions:
            raise SpecInvalid(
                f"semantic: '{field}' is both a dimension and an aggregator input"
            )


def deterministic_task_id(spec: dict) -> str:
    """Content-hashed ID so a retried submit is idempotent and the Overlord dedupes it."""
    payload = json.dumps(spec, sort_keys=True).encode()
    ds = spec["spec"]["dataSchema"]["dataSource"]
    return f"idx_{ds}_{hashlib.sha1(payload).hexdigest()[:12]}"


def submit(spec: dict) -> str:
    r = requests.post(f"{OVERLORD}/druid/indexer/v1/task", json=spec, timeout=30)
    r.raise_for_status()
    return r.json()["task"]


def poll_until_terminal(task_id: str, max_wait: int = 3600) -> str:
    deadline = time.monotonic() + max_wait
    delay = 2.0
    while time.monotonic() < deadline:
        r = requests.get(f"{OVERLORD}/druid/indexer/v1/task/{task_id}/status", timeout=15)
        r.raise_for_status()
        state = r.json()["status"]["status"]
        if state in TERMINAL:
            return state
        time.sleep(delay)
        delay = min(delay * 2, 30.0)  # cap backoff at 30s
    raise TimeoutError(f"{task_id} did not terminate in {max_wait}s")


def validate_and_submit(spec: dict, contract: dict, source_columns: dict) -> str:
    """Gate first, then submit. Nothing reaches the Overlord until both layers pass."""
    validate_structure(spec, contract)
    validate_semantics(spec, source_columns)
    # Pin the task ID to the content hash so a resubmit of the identical spec
    # collides with the existing task instead of starting a second ingestion.
    payload = {**spec, "id": deterministic_task_id(spec)}
    task_id = submit(payload)
    return poll_until_terminal(task_id)

Three properties make this a real gate rather than a formality. Fail-closed ordering: validate_and_submit calls both validators before submit, so a structural or semantic defect raises SpecInvalid and the Overlord is never contacted. Exhaustive structural errors: iter_errors collects every JSON Schema violation in one pass, so a templating bug that broke five fields surfaces all five instead of one per run. Idempotent submission: validate_and_submit sets the content-hashed deterministic_task_id as the task's id, so a retried submit of the identical spec collides with the task ID already on record and the Overlord rejects it rather than double-ingesting. The source_columns map is the contract with the upstream catalog; where columns change over time, that map is exactly what handling schema evolution in Druid ingestion keeps current so the semantic layer validates against reality.
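validate_semantics rejects a column that is both a dimension and an aggregator input, which is the disjoint half of the rollup rule. The completeness half, finding source columns that are neither, could be sketched as below. The helper name unaccounted_columns is hypothetical, and the check only makes sense when dimensionsSpec.dimensions is listed explicitly rather than discovered.

```python
def unaccounted_columns(data_schema: dict, source_columns: dict) -> set:
    """Source columns that are not the timestamp, a dimension, or an aggregator input."""
    dimensions = {
        d["name"] if isinstance(d, dict) else d
        for d in data_schema.get("dimensionsSpec", {}).get("dimensions", [])
    }
    aggregator_inputs = {
        m["fieldName"] for m in data_schema.get("metricsSpec", []) if "fieldName" in m
    }
    timestamp_column = data_schema["timestampSpec"]["column"]
    return set(source_columns) - dimensions - aggregator_inputs - {timestamp_column}
```

Whether a non-empty result is an error or a warning is a policy choice: some columns are dropped on purpose, so an allow-list of intentionally ignored columns is a natural extension.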

Failure Modes & Diagnostics

When a spec slips past a weak validator, the failure shows up at the Overlord or Peon, and the diagnostics are curl-and-jq one-liners.

Task rejected at submission with 400. The Overlord's own structural check caught something the local schema missed, such as an unknown field or a malformed ioConfig. Capture the status code and the rejection body from a single request, so that a spec that turns out to be valid is not submitted twice:

curl -s -o response.json -w "%{http_code}\n" -X POST \
  "http://overlord:8090/druid/indexer/v1/task" \
  -H "Content-Type: application/json" --data @spec.json
jq '.error' response.json

A non-null .error names the offending field; fold that assertion into the JSON Schema contract so the next run catches it locally.

Task accepted but rows dropped as unparseable. The timestamp/inputFormat alignment was wrong and maxParseExceptions absorbed the damage. Pull the task's row stats and unparseable-event sample:

curl -s "http://overlord:8090/druid/indexer/v1/task/idx_events_web_ab12cd34ef56/reports" \
  | jq '.ingestionStatsAndErrors.payload.rowStats, .ingestionStatsAndErrors.payload.unparseableEvents'

A high processedWithError or unparseable count against a low processed confirms a format mismatch; correct timestampSpec.format and set maxParseExceptions: 0 to make the next occurrence fail loudly.
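The same judgement can be automated once the report JSON is in hand. The sketch below assumes the row counters sit under a per-phase key such as "buildSegments" inside rowStats; that layout is an assumption that may differ across Druid versions and task types, and the helper name and the sample numbers are invented for illustration.

```python
def unparseable_fraction(report: dict, phase: str = "buildSegments") -> float:
    """Share of rows that failed to parse or were processed with errors."""
    # Assumption: rowStats is keyed by ingestion phase in this report layout.
    stats = report["ingestionStatsAndErrors"]["payload"]["rowStats"][phase]
    bad = stats.get("unparseable", 0) + stats.get("processedWithError", 0)
    total = bad + stats.get("processed", 0) + stats.get("thrownAway", 0)
    return bad / total if total else 0.0


# Made-up report fragment standing in for the /reports response body.
sample_report = {
    "ingestionStatsAndErrors": {
        "payload": {
            "rowStats": {
                "buildSegments": {
                    "processed": 120,
                    "processedWithError": 0,
                    "thrownAway": 0,
                    "unparseable": 49880,
                }
            }
        }
    }
}
fraction = unparseable_fraction(sample_report)
```

A fraction close to 1 points at a format mismatch rather than a few dirty rows, which is the distinction the diagnosis above depends on.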

Aggregator produced null or wrong metrics. A metricsSpec aggregator referenced a column that does not exist or was type-incompatible, and rollup silently zeroed it. Diff the segment's actual columns against the spec:

curl -s -X POST "http://broker:8082/druid/v2/sql" -H "Content-Type: application/json" \
  --data '{"query":"SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = '\''events_web'\''"}' \
  | jq -r '.[] | "\(.COLUMN_NAME)\t\(.DATA_TYPE)"'

A metric column missing from the result, or carrying an unexpected type, is the fingerprint of a semantic-validation gap — add the column-existence and type-compatibility check to validate_semantics.

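Rollup declared but segment row count barely shrank. Cardinality defeated the aggregation. The Coordinator's segment metadata reports size in bytes rather than rows, so sum the published segment sizes for the interval as a rough proxy and compare that total with the raw input volume: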

curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/events_web/segments?full" \
  | jq -r '[.[] | select(.interval | startswith("2026-07-03")) | .size] | add'

A total near the raw byte volume means rollupRatio is close to 1 — surface it as a validation warning steering the author to coarsen queryGranularity or drop rollup rather than pay its cost.

Automation Checklist

Related

Dynamic ingestion spec generation — produces the interpolated payload this gate validates; the two are always paired render-then-validate.
Automating Druid ingestion specs with Python — the builder pattern that emits specs designed to pass both validation layers.
Handling schema evolution in Druid ingestion — keeps the source-column contract current so the semantic layer validates against reality.
Async task execution patterns — the non-blocking submit/poll/backoff loop the validator hands a proven spec to.
Batch vs streaming ingestion sync — why a validated, byte-identical dataSchema across both paths is what keeps reconciled intervals duplicate-free.

Up one level: automated ingestion pipeline orchestration is the parent that defines the deterministic, idempotent submission contract this validation gate enforces.
