How Druid Segments Map to Time Intervals: Operational Reference

Druid’s query routing and storage efficiency depend on deterministic, UTC-aligned time partitioning. Every segment is bound to a fixed interval determined by segmentGranularity and the __time values of its rows. This mapping is applied deterministically during ingestion and forms the operational backbone of Apache Druid Segment Architecture & Lifecycle Fundamentals. Misalignment at the ingestion boundary cascades into query routing failures, retention policy violations, and inefficient compaction cycles.

Deterministic Interval Mapping

Segments are identified using the dataSource_start_end_version convention (with a partition number suffix when a time chunk holds more than one segment), where start and end are ISO-8601 timestamps aligned to the configured segmentGranularity. Druid normalizes all incoming timestamps to UTC prior to chunk assignment. A DAY granularity yields half-open intervals that run from midnight UTC to the following midnight UTC, for example [2024-01-01T00:00:00.000Z/2024-01-02T00:00:00.000Z), while HOUR creates 24 discrete hourly chunks per day. The boundary enforcement is absolute: the indexing task deterministically buckets each row into the segment whose interval contains its __time value, and rows falling outside the task's declared intervals are dropped rather than split.
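The bucketing rule can be illustrated with a minimal Python sketch. It is not Druid's implementation; the helper names bucket and format_interval are hypothetical, and only DAY and HOUR are handled.

```python
from datetime import datetime, timedelta, timezone

# Chunk widths for the two granularities discussed above.
STEP = {"DAY": timedelta(days=1), "HOUR": timedelta(hours=1)}

def bucket(ts: datetime, granularity: str) -> tuple[datetime, datetime]:
    """Return the half-open [start, end) UTC interval that contains ts."""
    if granularity not in STEP:
        raise ValueError(f"unsupported granularity: {granularity}")
    if ts.tzinfo is None:
        raise ValueError("timestamp must carry an explicit offset")
    utc = ts.astimezone(timezone.utc)
    # Truncate to the hour, then to midnight for DAY.
    start = utc.replace(minute=0, second=0, microsecond=0)
    if granularity == "DAY":
        start = start.replace(hour=0)
    return start, start + STEP[granularity]

def format_interval(start: datetime, end: datetime) -> str:
    """Render an interval in the start/end ISO-8601 style used in segment metadata."""
    fmt = "%Y-%m-%dT%H:%M:%S.000Z"
    return f"{start.strftime(fmt)}/{end.strftime(fmt)}"

# 23:30 at UTC-05:00 is already 04:30 on the next UTC day.
event = datetime.fromisoformat("2024-01-01T23:30:00-05:00")
print(format_interval(*bucket(event, "DAY")))
# 2024-01-02T00:00:00.000Z/2024-01-03T00:00:00.000Z
print(format_interval(*bucket(event, "HOUR")))
# 2024-01-02T04:00:00.000Z/2024-01-02T05:00:00.000Z
```

The example shows why a local-evening event lands in the next day's chunk once its offset is applied.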

The Coordinator manages handoff and loading from each segment’s declared interval in the metadata store; it does not re-scan row-level __time values, so interval integrity has to be guaranteed at ingestion time. Pipeline engineers must guarantee that source timestamps are pre-normalized to UTC and that timestampSpec parsing matches the exact format of the incoming payload.

Production Ingestion Blueprint

The following index_parallel specification enforces strict interval mapping and caps segment size to maintain predictable query routing:

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "s3", "prefixes": ["s3://analytics-events/raw/"] },
      "inputFormat": { "type": "json", "flattenSpec": { "useFieldDiscovery": false } },
      "appendToExisting": false
    },
    "dataSchema": {
      "dataSource": "user_events",
      "timestampSpec": { "column": "event_ts", "format": "iso", "missingValue": null },
      "dimensionsSpec": { "dimensions": ["user_id", "event_type"] },
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "rollup": true,
        "intervals": ["2024-01-01T00:00:00Z/2024-01-08T00:00:00Z"]
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000
    }
  }
}

Key operational notes:

  • intervals in granularitySpec explicitly scopes the task to prevent accidental cross-day ingestion; rows outside this window are dropped.
  • Segments are never merged across segmentGranularity boundaries, so each daily chunk remains independently addressable for retention and compaction.
  • maxRowsPerSegment caps the per-segment row count, keeping segment files within the target size range for Historical nodes.
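As a pre-submission check for the intervals scoping described in the first note, an orchestrator could verify that the declared window is midnight-aligned and list the daily chunks it covers. This is a sketch under the assumption of DAY granularity; daily_chunks and parse_utc are hypothetical helper names, not part of any Druid client.

```python
from datetime import datetime, time, timedelta, timezone

def parse_utc(value: str) -> datetime:
    """Parse an ISO-8601 bound; a value without an offset is read as UTC."""
    # Older Python versions reject a trailing "Z", so spell the offset out.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

def daily_chunks(interval: str) -> list[str]:
    """Split a start/end interval into the DAY chunks it covers."""
    start_raw, end_raw = interval.split("/")
    start, end = parse_utc(start_raw), parse_utc(end_raw)
    for bound in (start, end):
        if bound.time() != time.min:
            raise ValueError(f"bound is not midnight UTC: {bound.isoformat()}")
    if end <= start:
        raise ValueError("interval end must be after its start")
    chunks = []
    cursor = start
    while cursor < end:
        nxt = cursor + timedelta(days=1)
        chunks.append(f"{cursor.isoformat()}/{nxt.isoformat()}")
        cursor = nxt
    return chunks

chunks = daily_chunks("2024-01-01T00:00:00Z/2024-01-08T00:00:00Z")
print(len(chunks))  # 7
```

The seven chunks correspond to the seven daily time chunks the blueprint above is expected to produce.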

Diagnostic Orchestration & Python Tooling

Timezone drift and DST fallbacks are the primary causes of orphaned or overlapping segments. Druid does not guess a source timezone: it honors an explicit offset in an ISO string, but treats ISO strings without an offset and epoch milliseconds as UTC unless they are explicitly transformed. Pipeline orchestration must validate interval alignment before task submission.

The following Python diagnostic routine queries the Coordinator metadata API, parses segment intervals, and flags boundary anomalies:

import requests
from datetime import time, timezone
from dateutil.parser import isoparse

COORDINATOR_URL = "http://coordinator:8081/druid/coordinator/v1/metadata/segments"

def audit_segment_intervals(datasource: str) -> list[dict]:
    """Flag segments whose interval bounds are not aligned to DAY granularity."""
    params = {"datasources": datasource}
    resp = requests.get(COORDINATOR_URL, params=params, timeout=15)
    resp.raise_for_status()

    anomalies = []
    for seg in resp.json():
        start, end = seg["interval"].split("/")
        # A DAY-aligned bound is exactly midnight UTC, down to the microsecond.
        for name, raw in (("start", start), ("end", end)):
            if isoparse(raw).astimezone(timezone.utc).time() != time.min:
                anomalies.append({
                    # The segment ID is expected under "identifier"; fall back to "id".
                    "segment_id": seg.get("identifier", seg.get("id")),
                    "interval": seg["interval"],
                    "issue": f"non-aligned_{name}_boundary",
                    "size_bytes": seg.get("size", 0)
                })

    return anomalies

# Usage in CI/CD or Airflow DAG
if __name__ == "__main__":
    flags = audit_segment_intervals("user_events")
    if flags:
        raise RuntimeError(f"Interval misalignment detected: {flags}")

For robust timestamp normalization in Python ingestion pipelines, always leverage the built-in datetime module with explicit UTC anchoring, as documented in the Python datetime library reference. Avoid implicit local timezone assumptions in pandas or pyarrow transformations prior to Druid handoff.
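A minimal sketch of that explicit UTC anchoring, using only the standard library, might look like the following. The function name to_utc_iso is hypothetical, and rejecting offset-free input is a design choice of this sketch rather than a Druid requirement.

```python
from datetime import datetime, timezone

def to_utc_iso(raw: str) -> str:
    """Convert an offset-bearing ISO-8601 string to a UTC string with millisecond precision."""
    parsed = datetime.fromisoformat(raw)
    if parsed.tzinfo is None:
        # Refuse to guess: a naive value could be local time or UTC.
        raise ValueError(f"timestamp has no offset: {raw!r}")
    utc = parsed.astimezone(timezone.utc)
    # %f yields microseconds; trim to milliseconds and append the UTC designator.
    return utc.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"

print(to_utc_iso("2024-01-01T23:30:00-05:00"))  # 2024-01-02T04:30:00.000Z
```

Failing fast on naive timestamps keeps ambiguous values out of the ingestion payload instead of letting them be silently interpreted as UTC.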

Failure Modes & Recovery Patterns

1. Timezone Drift & DST Gaps

Source systems emitting local timestamps without explicit UTC conversion cause segments to split across midnight boundaries or duplicate during DST fallback. When __time values cross into adjacent intervals, Druid creates micro-segments that degrade broker routing tables and increase memory pressure.
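The DST fallback case can be reproduced in a few lines. The sketch below assumes a US Eastern source, where 01:30 occurred twice on 2024-11-03, and uses fixed offsets so it does not depend on a timezone database.

```python
from datetime import datetime, timedelta, timezone

# The same wall-clock reading, once before and once after the fallback.
before_fallback = datetime.fromisoformat("2024-11-03T01:30:00-04:00")  # EDT
after_fallback = datetime.fromisoformat("2024-11-03T01:30:00-05:00")   # EST

# If the source drops the offset, the two events become indistinguishable.
collapsed = before_fallback.replace(tzinfo=None) == after_fallback.replace(tzinfo=None)

# With the offset preserved, they are distinct UTC instants one hour apart.
utc_before = before_fallback.astimezone(timezone.utc)
utc_after = after_fallback.astimezone(timezone.utc)
gap = utc_after - utc_before

print(collapsed)             # True
print(gap)                   # 1:00:00
print(utc_before.isoformat())  # 2024-11-03T05:30:00+00:00
```

Once the offset is lost upstream there is no way to recover which of the two instants was meant, which is why conversion has to happen at the source.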

Recovery Pattern:

  1. Halt ingestion for the affected datasource.
  2. Run a targeted compaction task with dropExisting: true to merge fragmented intervals.
  3. Re-index with a pre-normalized timestampSpec and explicit intervals scoping.

Compaction spec for interval consolidation:

{
  "type": "compact",
  "dataSource": "user_events",
  "ioConfig": {
    "type": "compact",
    "inputSpec": { "type": "interval", "interval": "2024-01-01T00:00:00Z/2024-01-02T00:00:00Z" },
    "dropExisting": true
  },
  "tuningConfig": {
    "type": "index_parallel",
    "maxRowsPerSegment": 5000000
  }
}
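An orchestrator could generate and submit the same compaction spec programmatically. The sketch below mirrors the JSON above; build_compaction_task and submit_task are hypothetical helper names, and the Overlord address http://overlord:8090 is an assumption to replace with your own.

```python
import json
import urllib.request

def build_compaction_task(datasource: str, interval: str, max_rows: int = 5_000_000) -> dict:
    """Build a compaction task payload matching the spec shown above."""
    return {
        "type": "compact",
        "dataSource": datasource,
        "ioConfig": {
            "type": "compact",
            "inputSpec": {"type": "interval", "interval": interval},
            "dropExisting": True,
        },
        "tuningConfig": {"type": "index_parallel", "maxRowsPerSegment": max_rows},
    }

def submit_task(task: dict, overlord_url: str = "http://overlord:8090") -> str:
    """POST the task to the Overlord task endpoint and return the task ID."""
    request = urllib.request.Request(
        f"{overlord_url}/druid/indexer/v1/task",
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=15) as response:
        return json.load(response)["task"]

task = build_compaction_task("user_events", "2024-01-01T00:00:00Z/2024-01-02T00:00:00Z")
print(json.dumps(task, indent=2))
```

Keeping payload construction separate from submission lets the spec be reviewed or unit-tested without a running cluster.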

2. Overlapping Segments & Query Routing Degradation

When multiple ingestion tasks target the same interval concurrently, Druid may produce overlapping segments. A segment set with a newer version overshadows older versions for the same interval, but segments appended under the same version are all served, so the Broker routes queries to every one of them. The result can be duplicate aggregation or an OutOfMemoryError during merge phases.

Diagnostic Command (lists intervals that hold segments from more than one version; several partitions of a single version are normal and are not flagged):

curl -s "http://coordinator:8081/druid/coordinator/v1/metadata/segments?datasources=user_events" \
  | jq '[.[] | select(.dataSource == "user_events")] | group_by(.interval) | map(select((map(.version) | unique | length) > 1))'
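The jq filter above groups segments by identical interval strings, so it cannot see intervals that overlap only partially, such as a chunk that starts at noon. A complementary check, sketched here in Python, compares every pair of distinct intervals. The function name find_partial_overlaps is hypothetical, and the input is assumed to be the parsed JSON list returned by the metadata endpoint.

```python
from datetime import datetime
from itertools import combinations

def _parse(bound: str) -> datetime:
    # Spell out the UTC offset so older Python versions can parse the value.
    return datetime.fromisoformat(bound.replace("Z", "+00:00"))

def find_partial_overlaps(segments: list[dict]) -> list[tuple[str, str]]:
    """Return pairs of distinct intervals whose half-open ranges intersect."""
    intervals = sorted({seg["interval"] for seg in segments})
    parsed = {iv: tuple(_parse(part) for part in iv.split("/")) for iv in intervals}
    overlaps = []
    for a, b in combinations(intervals, 2):
        (a_start, a_end), (b_start, b_end) = parsed[a], parsed[b]
        # Half-open ranges intersect only if each starts before the other ends.
        if a_start < b_end and b_start < a_end:
            overlaps.append((a, b))
    return overlaps

sample = [
    {"interval": "2024-01-01T00:00:00.000Z/2024-01-02T00:00:00.000Z"},
    {"interval": "2024-01-02T00:00:00.000Z/2024-01-03T00:00:00.000Z"},
    {"interval": "2024-01-01T12:00:00.000Z/2024-01-02T12:00:00.000Z"},
]
for pair in find_partial_overlaps(sample):
    print(pair)
```

In the sample, the two aligned daily intervals only touch at midnight and are not reported, while the noon-to-noon interval is reported against both.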

Recovery Pattern:

  • Use the Overlord API to kill redundant tasks: POST /druid/indexer/v1/task/{taskId}/shutdown
  • Trigger a compaction with dropExisting: true in the ioConfig to collapse overlaps.
  • Implement idempotent task submission in your orchestrator by checking /druid/indexer/v1/pendingTasks and /druid/indexer/v1/runningTasks before dispatch.
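The idempotency check in the last step can be sketched as follows. The Overlord address is an assumption, the helper names are hypothetical, and the sketch assumes each task entry exposes dataSource and type fields; verify those against the responses from your Druid version.

```python
import json
import urllib.request

OVERLORD_URL = "http://overlord:8090"  # assumption: replace with your Overlord address

def fetch_tasks(endpoint: str) -> list[dict]:
    """Fetch a task list such as pendingTasks or runningTasks from the Overlord."""
    url = f"{OVERLORD_URL}/druid/indexer/v1/{endpoint}"
    with urllib.request.urlopen(url, timeout=15) as response:
        return json.load(response)

def has_in_flight_task(tasks: list[dict], datasource: str, task_type: str) -> bool:
    """True if any listed task matches the datasource and task type."""
    return any(
        task.get("dataSource") == datasource and task.get("type") == task_type
        for task in tasks
    )

def safe_to_dispatch(datasource: str, task_type: str) -> bool:
    """Check pending and running tasks before submitting a new one."""
    in_flight = fetch_tasks("pendingTasks") + fetch_tasks("runningTasks")
    return not has_in_flight_task(in_flight, datasource, task_type)
```

Matching on datasource and task type is deliberately coarse. An orchestrator that encodes the target interval in its task IDs could compare those instead to allow non-overlapping intervals to run in parallel.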

For comprehensive interval validation and ingestion tuning, consult the official Apache Druid Ingestion Documentation. Understanding how segmentGranularity dictates chunk boundaries is critical when designing retention policies, as detailed in Understanding Druid Segment Granularity.
