How Druid Segments Map to Time Intervals

Engineers hit this problem when a batch job reports SUCCESS but the datasource holds fewer rows than the source, or when a supposedly single-day load produces overlapping segments straddling midnight. The root cause is almost always the timestamp-to-interval mapping: every Druid segment is bound to an immutable, UTC-aligned interval derived from the segmentGranularity setting and each row's __time value, and any row whose timestamp falls outside the task's declared window is dropped rather than routed to a neighbouring segment. This page works through the exact bucketing arithmetic and the diagnostics, spec, and automation that keep it deterministic; it sits under Understanding Druid Segment Granularity, which frames the broader partitioning decision.

The mapping rule is simple but unforgiving. A segment id follows the datasource_interval_version_partitionNumber convention, where the interval is a pair of ISO-8601 timestamps aligned to segmentGranularity. Druid stores __time as UTC epoch milliseconds and reads any timestamp without an explicit offset as UTC, then buckets each row into the single interval that contains its __time: DAY produces [YYYY-MM-DDT00:00:00.000Z, next-day T00:00:00.000Z) chunks with an exclusive end, and HOUR produces 24 chunks per calendar day. Because bucketing happens at ingestion time, a published segment only holds rows whose __time lies inside its interval; nothing downstream moves a row to a different chunk. The precise encoding of the rows inside each chunk is covered separately in columnar storage formats in Druid.
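The bucketing arithmetic described above can be sketched in a few lines of Python. This is an illustration of the rule, not Druid's own implementation: bucket_interval and format_utc are hypothetical helper names, only DAY and HOUR are handled, and offset-less inputs are assumed to already be UTC.

```python
from datetime import datetime, timedelta, timezone


def bucket_interval(ts: datetime, granularity: str = "DAY") -> tuple[datetime, datetime]:
    """Return the half-open [start, end) UTC interval that contains ts."""
    if ts.tzinfo is None:
        # Assumption for this sketch: a timestamp without an offset is UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    ts = ts.astimezone(timezone.utc)
    if granularity == "DAY":
        start = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        return start, start + timedelta(days=1)
    if granularity == "HOUR":
        start = ts.replace(minute=0, second=0, microsecond=0)
        return start, start + timedelta(hours=1)
    raise ValueError(f"unsupported granularity: {granularity}")


def format_utc(dt: datetime) -> str:
    """Render a UTC datetime with millisecond precision and a Z suffix."""
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}Z"


# A row one millisecond before midnight still belongs to the 1 January chunk.
start, end = bucket_interval(datetime(2024, 1, 1, 23, 59, 59, 999000, tzinfo=timezone.utc))
print(f"{format_utc(start)}/{format_utc(end)}")
# 2024-01-01T00:00:00.000Z/2024-01-02T00:00:00.000Z
```

The end bound is exclusive, so a row stamped exactly 2024-01-02T00:00:00.000Z falls into the next day's chunk.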

Failure Modes & Diagnostics

Interval-mapping bugs surface as dropped rows, overlapping segments, or off-by-one-day chunks, never as an explicit "mapping" error. Each shell one-liner below isolates the actual cause through the Coordinator REST API.

Non-aligned interval boundaries (timezone drift / DST). Symptom: DAY-granularity segments whose interval ends at something other than midnight UTC, or duplicate chunks around a DST fallback. Source systems that emit local timestamps without explicit UTC conversion push __time across midnight, so Druid mints micro-segments that bloat the Broker's routing table. List the intervals actually carrying segments and eyeball the boundaries:

# Distinct intervals for the datasource — every DAY-granularity end must be T00:00:00.000Z
curl -s "http://<coordinator-host>:8081/druid/coordinator/v1/datasources/user_events/intervals" | jq '.'

Root cause: raw ISO strings or epoch millis carrying an implicit local offset. Remediation: normalize timestamps to UTC upstream, then re-index the affected interval with a pre-normalized timestampSpec.
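A minimal sketch of the upstream normalization step, assuming the source emits ISO strings whose local offset is known to the pipeline. The function name to_utc_iso is hypothetical, and the fixed -05:00 offset in the example is only illustrative; passing zoneinfo.ZoneInfo("America/New_York") instead would make the conversion follow DST rules.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def to_utc_iso(local_str: str, source_tz: tzinfo) -> str:
    """Convert an ISO timestamp to a UTC string with an explicit Z suffix.

    If local_str carries no offset it is interpreted in source_tz;
    an explicit offset in the string always wins.
    """
    parsed = datetime.fromisoformat(local_str)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=source_tz)
    utc = parsed.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


# Hypothetical source zone: a fixed UTC-5 offset.
source = timezone(timedelta(hours=-5))

# 20:30 local on 1 January is already 2 January in UTC, so the row
# belongs to the next DAY chunk once it is normalized.
print(to_utc_iso("2024-01-01T20:30:00", source))
# 2024-01-02T01:30:00.000Z
```

Writing the normalized value into the column that timestampSpec reads keeps the offset decision in one place instead of leaving it implicit in the payload.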

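Overlapping segments over the same interval. Symptom: duplicate aggregation or OutOfMemoryError during a Broker merge, caused by concurrent tasks targeting one interval. Group segments by interval and flag any interval holding more than one segment. A large day legitimately splits into several partitions of one version, so treat each hit as a candidate to inspect rather than proof of overlap: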

# Intervals with more than one segment id (potential overlap)
curl -s "http://<coordinator-host>:8081/druid/coordinator/v1/datasources/user_events/segments?full=true" \
  | jq 'group_by(.interval) | map(select(length > 1) | {interval: .[0].interval, count: length})'

Root cause: two ingestion tasks writing the same window without idempotent dispatch. Remediation: kill redundant tasks via the Overlord, then run a forced compaction pass over the interval to collapse the overlap.
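To triage the intervals flagged by the grouping query, it can help to separate intervals that merely hold several partitions from those that hold more than one version. The sketch below assumes each segment object exposes interval and version fields, as in the full=true response; classify_intervals is a hypothetical helper and the sample records are made up for illustration.

```python
from collections import defaultdict


def classify_intervals(segments: list[dict]) -> dict[str, str]:
    """Label every interval that holds more than one segment.

    "multiple_partitions": one version split into several shards (usually benign).
    "multiple_versions":   more than one version present (worth investigating).
    """
    by_interval = defaultdict(list)
    for seg in segments:
        by_interval[seg["interval"]].append(seg)

    report = {}
    for interval, segs in by_interval.items():
        if len(segs) < 2:
            continue
        versions = {s["version"] for s in segs}
        report[interval] = "multiple_versions" if len(versions) > 1 else "multiple_partitions"
    return report


# Made-up sample records shaped like the fields this sketch relies on.
DAY1 = "2024-01-01T00:00:00.000Z/2024-01-02T00:00:00.000Z"
DAY2 = "2024-01-02T00:00:00.000Z/2024-01-03T00:00:00.000Z"
DAY3 = "2024-01-03T00:00:00.000Z/2024-01-04T00:00:00.000Z"
sample = [
    {"interval": DAY1, "version": "2024-01-09T10:00:00.000Z"},
    {"interval": DAY1, "version": "2024-01-09T10:00:00.000Z"},
    {"interval": DAY2, "version": "2024-01-09T10:00:00.000Z"},
    {"interval": DAY2, "version": "2024-01-09T11:30:00.000Z"},
    {"interval": DAY3, "version": "2024-01-09T10:00:00.000Z"},
]
print(classify_intervals(sample))
```

A single version does not rule out duplicates from two appending tasks, so row counts for a "multiple_partitions" interval are still worth comparing against the source.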

Silently dropped rows from interval misalignment. Symptom: a SUCCESS task with fewer rows than the source on a backfill. Rows whose __time falls outside granularitySpec.intervals are discarded, not clamped. Compare the declared window against where segments actually landed:

# Confirm the covered intervals match the intended backfill range
# (?simple returns a map keyed by interval, so keys lists the interval strings)
curl -s "http://<coordinator-host>:8081/druid/coordinator/v1/datasources/user_events/intervals?simple" \
  | jq 'keys'

Root cause: an intervals window narrower than the data, or a timezone assumption shifting events across a UTC edge. Remediation: widen granularitySpec.intervals to cover the full range and re-run. How the resulting segment set is then discovered and dispatched is detailed in query routing and segment discovery.
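Before re-running a backfill, the drop rule can be checked against a sample of source timestamps. The following is a sketch under stated assumptions: rows_outside and its helpers are hypothetical names, timestamps are ISO-8601 strings, offset-less values are taken as UTC, and each interval is half-open with an exclusive end.

```python
from datetime import datetime, timezone


def parse_utc(value: str) -> datetime:
    """Parse an ISO-8601 string into an aware UTC datetime."""
    dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        # Assumption for this sketch: offset-less values are UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def parse_interval(interval: str) -> tuple[datetime, datetime]:
    """Split a start/end interval string into two UTC datetimes."""
    start, end = interval.split("/", 1)
    return parse_utc(start), parse_utc(end)


def rows_outside(timestamps: list[str], intervals: list[str]) -> list[str]:
    """Return the timestamps that fall in none of the [start, end) windows."""
    windows = [parse_interval(i) for i in intervals]
    return [
        ts for ts in timestamps
        if not any(start <= parse_utc(ts) < end for start, end in windows)
    ]


declared = ["2024-01-01T00:00:00Z/2024-01-08T00:00:00Z"]
sample = [
    "2024-01-07T23:59:59Z",       # inside the window
    "2024-01-08T00:00:00Z",       # outside: the end bound is exclusive
    "2023-12-31T23:00:00Z",       # outside: before the start
    "2024-01-08T01:00:00+02:00",  # inside: 23:00 UTC on 7 January
]
print(rows_outside(sample, declared))
# ['2024-01-08T00:00:00Z', '2023-12-31T23:00:00Z']
```

Any non-empty result means the declared window is narrower than the data and those rows would be discarded by the task.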

Target Spec & Validated JSON

The index_parallel spec below enforces strict interval mapping: UTC-anchored timestampSpec parsing, an explicit intervals window that scopes the task, and a row cap that keeps each chunk inside the target size band. It is copy-ready against a recent stable Druid release.

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "user_events",
      "timestampSpec": { "column": "event_ts", "format": "iso", "missingValue": null },
      "dimensionsSpec": { "dimensions": ["user_id", "event_type"] },
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "rollup": true,
        "intervals": ["2024-01-01T00:00:00Z/2024-01-08T00:00:00Z"]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "s3", "prefixes": ["s3://analytics-events/raw/"] },
      "inputFormat": { "type": "json" },
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000
    }
  }
}

The three fields that govern interval mapping:

timestampSpec.format: iso parses ISO-8601 strings, honoring an explicit offset and treating offset-less values as UTC. A mismatched format, or a local time with no offset, shifts every row across boundaries without raising an error. Match this to the exact source payload.
granularitySpec.segmentGranularity: DAY locks each chunk to [T00:00:00.000Z, next-day T00:00:00.000Z). Chunks never merge across this boundary, so each day stays independently addressable for retention and compaction.
granularitySpec.intervals: the explicit window the task will build. Rows outside it are dropped, so on backfills this must span the full source range.
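A small pre-flight lint can confirm that the declared window lines up with the DAY boundary before the spec is submitted. This is a hypothetical helper, not part of Druid: unaligned_day_intervals only understands DAY granularity and assumes the spec has the same nesting as the JSON above.

```python
from datetime import datetime, timezone


def _parse_utc(value: str) -> datetime:
    """Parse an ISO-8601 bound into an aware UTC datetime."""
    dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: offset-less means UTC
    return dt.astimezone(timezone.utc)


def unaligned_day_intervals(spec: dict) -> list[str]:
    """Return declared intervals whose start or end is not midnight UTC."""
    gran = spec["spec"]["dataSchema"]["granularitySpec"]
    if gran.get("segmentGranularity") != "DAY":
        raise ValueError("this sketch only checks DAY granularity")
    bad = []
    for interval in gran.get("intervals", []):
        start, end = (_parse_utc(part) for part in interval.split("/", 1))
        if any(
            (dt.hour, dt.minute, dt.second, dt.microsecond) != (0, 0, 0, 0)
            for dt in (start, end)
        ):
            bad.append(interval)
    return bad


# Trimmed-down spec with one aligned and one deliberately misaligned window.
example_spec = {
    "spec": {
        "dataSchema": {
            "granularitySpec": {
                "segmentGranularity": "DAY",
                "intervals": [
                    "2024-01-01T00:00:00Z/2024-01-08T00:00:00Z",
                    "2024-01-08T05:00:00Z/2024-01-09T00:00:00Z",
                ],
            }
        }
    }
}
print(unaligned_day_intervals(example_spec))
# ['2024-01-08T05:00:00Z/2024-01-09T00:00:00Z']
```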

Python Automation Script

Fold interval validation into the pipeline so misalignment is caught at ingestion time rather than in production. The routine below queries the Coordinator metadata API with exponential backoff, parses each segment interval, and flags any DAY-granularity end boundary that is not midnight UTC — the tell-tale of timezone drift. It uses only the standard library plus requests.

import time
import requests
from datetime import datetime, timezone

COORDINATOR_URL = "http://coordinator:8081"


def _get_with_backoff(session, url, params, base_delay=2, max_delay=30, retries=5):
    """GET with capped exponential backoff on transient failures."""
    delay = base_delay
    for attempt in range(retries):
        try:
            resp = session.get(url, params=params, timeout=15)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay = min(delay * 2, max_delay)


def audit_segment_intervals(datasource: str) -> list[dict]:
    """Return DAY-granularity segments whose interval end is not midnight UTC."""
    session = requests.Session()
    segments = _get_with_backoff(
        session,
        f"{COORDINATOR_URL}/druid/coordinator/v1/datasources/{datasource}/segments",
        params={"full": "true"},
    )

    anomalies = []
    for seg in segments:
        interval = seg.get("interval", "")
        if "/" not in interval:
            continue
        _, end_str = interval.split("/", 1)
        end_dt = datetime.fromisoformat(end_str.replace("Z", "+00:00")).astimezone(timezone.utc)
        if (end_dt.hour, end_dt.minute, end_dt.second, end_dt.microsecond) != (0, 0, 0, 0):
            anomalies.append({
                "segment_id": seg.get("identifier", seg.get("id")),
                "interval": interval,
                "issue": "non_aligned_end_boundary",
                "size_bytes": seg.get("size", 0),
            })
    return anomalies


# Usage inside a CI/CD gate or Airflow DAG
if __name__ == "__main__":
    flags = audit_segment_intervals("user_events")
    if flags:
        raise SystemExit(f"Interval misalignment detected: {flags}")

Because the audit exits non-zero when a boundary is off, it drops straight into an ingestion gate that blocks a bad backfill before it reaches query nodes.

Verification Steps

After re-normalizing timestamps and re-running the task, confirm the mapping is clean. First check that every interval end is midnight UTC:

# ?simple returns a map keyed by interval, so keys[] prints one interval per line
curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/user_events/intervals?simple" \
  | jq 'keys[]'

Expected output — one entry per contiguous UTC day, all ending at T00:00:00.000Z:

"2024-01-01T00:00:00.000Z/2024-01-02T00:00:00.000Z"
"2024-01-02T00:00:00.000Z/2024-01-03T00:00:00.000Z"
"2024-01-03T00:00:00.000Z/2024-01-04T00:00:00.000Z"

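Then confirm no interval holds more than one segment id. A day that exceeds maxRowsPerSegment legitimately splits into several partitions, so a non-zero count calls for inspection rather than proving residual overlap: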

curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/user_events/segments?full=true" \
  | jq 'group_by(.interval) | map(select(length > 1)) | length'

Expected output is 0. A clean run plus the Python audit exiting 0 confirms every __time value now maps to exactly one correctly bounded interval. For deeper validation of ingestion specs, consult the official Apache Druid ingestion documentation.
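The "one entry per contiguous UTC day" expectation can also be checked mechanically. The sketch below is a hypothetical helper that takes the interval strings printed by the first verification command and reports any uncovered span between consecutive intervals; it says nothing about whether a gap is legitimate, for example a day with no source data.

```python
from datetime import datetime, timezone


def _parse_utc(value: str) -> datetime:
    """Parse an ISO-8601 bound into an aware UTC datetime."""
    dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: offset-less means UTC
    return dt.astimezone(timezone.utc)


def find_gaps(intervals: list[str]) -> list[str]:
    """Return start/end strings for spans not covered by any interval."""
    parsed = sorted(
        tuple(_parse_utc(part) for part in interval.split("/", 1))
        for interval in intervals
    )
    gaps = []
    for (_, prev_end), (next_start, _) in zip(parsed, parsed[1:]):
        if next_start > prev_end:
            gaps.append(f"{prev_end.isoformat()}/{next_start.isoformat()}")
    return gaps


covered = [
    "2024-01-01T00:00:00.000Z/2024-01-02T00:00:00.000Z",
    "2024-01-02T00:00:00.000Z/2024-01-03T00:00:00.000Z",
    "2024-01-04T00:00:00.000Z/2024-01-05T00:00:00.000Z",  # 3 January is missing
]
print(find_gaps(covered))
# ['2024-01-03T00:00:00+00:00/2024-01-04T00:00:00+00:00']
```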

Related

Understanding Druid Segment Granularity: the parent guide to segmentGranularity, sizing formulas, and how the time-chunk boundary is chosen.
Query Routing and Segment Discovery: how the mapped segment set is pruned and dispatched to Historical nodes at query time.
Columnar Storage Formats in Druid: how the rows inside each mapped interval are encoded, indexed, and compressed.
