Handling Schema Evolution in Druid Ingestion

Apache Druid stores data in immutable segments, so schema changes cannot be applied with traditional DDL operations such as ALTER TABLE. Structural changes take effect only when data is ingested or reindexed, which means task definitions, segment metadata, and query planner expectations have to change together. For OLAP data engineers, analytics platform developers, and DevOps teams, managing this evolution requires deterministic pipeline orchestration, explicit type mapping, and rigorous backward compatibility controls.

Diagnostic Signatures & Failure Modes

Production environments typically encounter four deterministic failure signatures when upstream schemas drift. Recognizing these patterns early prevents segment corruption and query degradation:

  • Type Coercion Violations (column_type_mismatch): When source data transitions from STRING to LONG (or vice versa), ingestion can fail with java.lang.ClassCastException during segment creation or compaction. Resolution requires an explicit type on every entry in dimensionsSpec, and an explicit aggregator type on every entry in metricsSpec, so that values are either cast predictably or rejected.
  • Implicit Column Suppression (silent_drop_on_missing_dimension): New upstream fields absent from the ingestion spec are silently discarded unless useSchemaDiscovery is explicitly enabled. This creates silent data loss and downstream query failures when analysts reference missing columns. Implementing strict schema contracts prevents unannounced drift.
  • Rollup Aggregation Inconsistency (rollup_aggregation_mismatch): Modifying a metric from count to doubleSum without reprocessing historical segments breaks aggregation continuity. Queries that span segments built under the old and new metric definitions yield skewed results. Versioned datasource naming or explicit segment reindexing is required to maintain mathematical consistency.
  • Temporal Parsing Failures (timestamp_format_shift): Changing timestampSpec.format without a matching change in the upstream timestamp layout (or the reverse) triggers ParseException when rows are read. Affected tasks transition to FAILED with INVALID_TIMESTAMP error codes. Centralized timestamp normalization upstream eliminates this class of ingestion failure.
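The first two signatures can often be caught before a task is ever submitted by comparing sample records against the expected schema. The following is a minimal sketch, not a Druid feature: detect_drift, its report keys, and the Python-side type table are hypothetical names introduced for illustration, and the schema map is assumed to use the same "string" / "long" / "float" / "double" / "metric" labels as the spec builder in the next section.

```python
from typing import Any, Dict, List

# Hypothetical mapping from schema-map labels to acceptable Python types
# for values decoded from JSON. "metric" is assumed to be numeric.
_PY_TYPES = {
    "string": (str,),
    "long": (int,),
    "float": (int, float),
    "double": (int, float),
    "metric": (int, float),
}


def detect_drift(
    record: Dict[str, Any],
    schema_map: Dict[str, str],
    timestamp_column: str = "event_time",
) -> Dict[str, List[str]]:
    """Compare one decoded record against the expected schema map."""
    report: Dict[str, List[str]] = {
        "unknown_fields": [],
        "missing_fields": [],
        "type_mismatches": [],
    }
    # Fields the spec does not know about would be dropped at ingestion.
    for field in record:
        if field != timestamp_column and field not in schema_map:
            report["unknown_fields"].append(field)
    for field, label in schema_map.items():
        value = record.get(field)
        if value is None:
            report["missing_fields"].append(field)
            continue
        expected = _PY_TYPES.get(label)
        # bool is a subclass of int in Python, so reject it explicitly.
        if expected and (isinstance(value, bool) or not isinstance(value, expected)):
            report["type_mismatches"].append(field)
    return report
```

Running this over a sample of upstream records and failing the pipeline on any non-empty list is one way to turn silent drops into explicit errors.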

Dynamic Spec Generation & Python Orchestration

To mitigate schema drift, production pipelines must generate ingestion specifications programmatically. The following pattern leverages the Druid Overlord REST API to construct evolution-aware task payloads, applying explicit type mapping and safe fallbacks:

import logging
from typing import Dict, Optional

import requests
from requests.exceptions import RequestException

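# Adjust host and port for your cluster: a standalone Overlord listens on 8090
# by default, while 8081 applies when the Coordinator also runs the Overlord.
DRUID_OVERLORD = "http://druid-overlord:8081"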
HEADERS = {"Content-Type": "application/json"}
logger = logging.getLogger(__name__)

def build_evolution_aware_spec(
    datasource: str,
    input_source: Dict,
    schema_map: Dict[str, str],
    timestamp_column: str = "event_time",
    is_append: bool = True
) -> Dict:
    """Generates a Druid ingestion spec with explicit type mapping and safe defaults."""
    # Entries with a native column type become dimensions;
    # entries labelled "metric" are handled separately below.
    dimensions = [
        {"name": k, "type": v}
        for k, v in schema_map.items()
        if v in ("string", "long", "float", "double")
    ]
    metrics = [
        {"type": "doubleSum", "name": k, "fieldName": k} 
        for k, v in schema_map.items() 
        if v == "metric"
    ]

    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": timestamp_column, "format": "iso"},
                "dimensionsSpec": {"dimensions": dimensions},
                "metricsSpec": metrics,
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "DAY",
                    "queryGranularity": "HOUR",
                    "rollup": True
                }
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": input_source,
                "inputFormat": {"type": "json", "flattenSpec": {"useFieldDiscovery": False}},
                "appendToExisting": is_append
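            },
            "tuningConfig": {
                "type": "index_parallel",
                # Perfect rollup requires hashed or range partitioning and
                # cannot be combined with appendToExisting.
                "partitionsSpec": {"type": "dynamic" if is_append else "hashed"},
                "forceGuaranteedRollup": not is_append
            }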
        }
    }

def submit_ingestion_task(spec: Dict, timeout: int = 30) -> Optional[str]:
    """Submits the spec to the Overlord and returns the task ID."""
    try:
        response = requests.post(
            f"{DRUID_OVERLORD}/druid/indexer/v1/task",
            headers=HEADERS,
            json=spec,
            timeout=timeout
        )
        response.raise_for_status()
        task_id = response.json().get("task")
        logger.info("Successfully submitted task: %s", task_id)
        return task_id
    except RequestException as e:
        logger.error("Ingestion submission failed: %s", str(e))
        return None

Validation, Sync, and Recovery Patterns

Before task submission, ingestion payloads must undergo structural verification so that malformed JSON or unsupported type combinations never reach the Overlord. Integrating Schema Validation for Druid Specs into CI/CD pipelines ensures that dynamic mappings align with Druid’s ingestion contract. This pre-flight validation layer catches dimension type mismatches and missing timestampSpec definitions before they trigger runtime task failures.
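A pre-flight check of this kind can be sketched in plain Python. The function name validate_spec, the error strings, and the set of allowed dimension types are assumptions made for this example (Druid itself accepts more types than the four listed); the append/rollup rule reflects the documented constraint that forceGuaranteedRollup cannot be used together with appendToExisting.

```python
from typing import Dict, List

# Types this pipeline's contract allows for dimensions (an assumption;
# Druid supports additional types).
ALLOWED_DIMENSION_TYPES = {"string", "long", "float", "double"}


def validate_spec(task: Dict) -> List[str]:
    """Return the problems found in a task payload; an empty list means it passed."""
    errors: List[str] = []
    spec = task.get("spec", {})
    data_schema = spec.get("dataSchema", {})

    if not data_schema.get("dataSource"):
        errors.append("dataSchema.dataSource is missing")
    if not data_schema.get("timestampSpec", {}).get("column"):
        errors.append("dataSchema.timestampSpec.column is missing")

    seen = set()
    for dim in data_schema.get("dimensionsSpec", {}).get("dimensions", []):
        # A bare string entry is shorthand for a string-typed dimension.
        if isinstance(dim, str):
            name, dim_type = dim, "string"
        else:
            name, dim_type = dim.get("name"), dim.get("type", "string")
        if dim_type not in ALLOWED_DIMENSION_TYPES:
            errors.append(f"dimension {name!r} has unsupported type {dim_type!r}")
        if name in seen:
            errors.append(f"dimension {name!r} is declared twice")
        seen.add(name)

    for metric in data_schema.get("metricsSpec", []):
        if metric.get("name") in seen:
            errors.append(f"metric {metric.get('name')!r} collides with a dimension")

    appending = spec.get("ioConfig", {}).get("appendToExisting", False)
    guaranteed = spec.get("tuningConfig", {}).get("forceGuaranteedRollup", False)
    if appending and guaranteed:
        errors.append("forceGuaranteedRollup cannot be combined with appendToExisting")
    return errors
```

In a CI/CD job, a non-empty return value would fail the build and print the list, so the spec never reaches the Overlord.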

For environments managing concurrent workloads, synchronizing batch and streaming ingestion requires careful partition alignment. When streaming pipelines introduce schema drift, batch backfills must declare the same dimensions and types in dimensionsSpec to maintain segment parity. Implementing Automated Ingestion Pipeline Orchestration enables stateful task tracking, allowing operators to pause, retry, or reroute failed tasks based on the status each task reports.
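Stateful tracking starts with polling the Overlord for a task's status. The sketch below assumes the task status endpoint (GET /druid/indexer/v1/task/{taskId}/status) returns a JSON object whose "status" entry carries a "statusCode" field; wait_for_task, http_fetch_status, and the "TIMEOUT" return value are hypothetical names for this example. The fetch function is injected so that the polling loop can be exercised without a cluster, and the standard library is used for HTTP to keep the sketch dependency-free.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Dict

TERMINAL_STATES = {"SUCCESS", "FAILED"}


def http_fetch_status(task_id: str, overlord: str) -> Dict:
    """Fetch the raw status payload for one task from the Overlord."""
    quoted = urllib.parse.quote(task_id, safe="")
    url = f"{overlord}/druid/indexer/v1/task/{quoted}/status"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def wait_for_task(
    task_id: str,
    fetch_status: Callable[[str], Dict],
    poll_seconds: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the task reaches a terminal state or the poll budget runs out."""
    for _ in range(max_polls):
        status = fetch_status(task_id).get("status", {})
        # Fall back to the older "status" key if "statusCode" is absent.
        state = status.get("statusCode") or status.get("status")
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    return "TIMEOUT"
```

An orchestrator could branch on the returned state: resubmit on "FAILED" with a corrected spec, or alert on "TIMEOUT" instead of retrying blindly.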

When schema evolution introduces irreversible aggregation skew, automated rollback mechanisms become essential. Production systems should version datasource names (e.g., events_v2) rather than mutating existing segments. If a task fails due to timestamp_format_shift or rollup_aggregation_mismatch, the orchestrator should trigger a compensating workflow that reverts to the last known stable spec. Cross-cluster ingestion orchestration further isolates experimental schema changes, allowing validation in staging clusters before promotion to production.
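The versioning and rollback ideas can be sketched with two small helpers. Both next_datasource_version and SpecHistory are hypothetical names for this example, and the "_vN" suffix convention is assumed from the events_v2 example above; a production orchestrator would persist the history rather than keep it in memory.

```python
import copy
import re
from typing import Dict, List, Optional


def next_datasource_version(name: str) -> str:
    """Return the next versioned datasource name, e.g. events_v2 -> events_v3."""
    match = re.fullmatch(r"(.+)_v(\d+)", name)
    if match is None:
        # Unversioned names are treated as version 1.
        return f"{name}_v2"
    return f"{match.group(1)}_v{int(match.group(2)) + 1}"


class SpecHistory:
    """Keeps copies of specs whose tasks finished successfully."""

    def __init__(self) -> None:
        self._stable: List[Dict] = []

    def mark_stable(self, spec: Dict) -> None:
        # Store a deep copy so later mutations do not alter the history.
        self._stable.append(copy.deepcopy(spec))

    def last_stable(self) -> Optional[Dict]:
        """Return a copy of the most recent stable spec, or None if there is none."""
        return copy.deepcopy(self._stable[-1]) if self._stable else None
```

A compensating workflow would call last_stable() after a failed task and resubmit that spec, while new schema versions are written to the name returned by next_datasource_version.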

Checking specs against the official Apache Druid Ingestion Documentation keeps them aligned with current parser behavior and Overlord API contracts. Validating ingestion payloads against a standardized JSON Schema further reduces runtime parsing errors and enforces strict type boundaries at the pipeline edge.

Schema evolution in Druid demands explicit coordination between ingestion definitions, segment metadata, and query execution. By enforcing programmatic spec generation, pre-flight validation, and deterministic rollback paths, engineering teams can maintain query consistency and segment integrity across continuous upstream changes.
