How Druid Segments Map to Time Intervals: Operational Reference
Druid’s query routing and storage efficiency depend on deterministic, UTC-aligned time partitioning. Every segment is strictly bound to an immutable interval defined by segmentGranularity and the __time column. This mapping is mathematically enforced during ingestion and validated by the Coordinator, forming the operational backbone of Apache Druid Segment Architecture & Lifecycle Fundamentals. Misalignment at the ingestion boundary cascades into query routing failures, retention policy violations, and inefficient compaction cycles.
Deterministic Interval Mapping
Segment identifiers follow the dataSource_intervalStart_intervalEnd_version convention (with a partition number suffix where a time chunk holds more than one segment), where the interval start and end are ISO-8601 timestamps aligned to the configured segmentGranularity. Druid normalizes all incoming timestamps to UTC prior to chunk assignment. A DAY granularity yields half-open intervals that run from midnight UTC to the following midnight, for example 2024-01-01T00:00:00.000Z/2024-01-02T00:00:00.000Z with the end exclusive, while HOUR creates 24 discrete chunks per day. The boundary enforcement is absolute: the indexing task deterministically buckets each row into the segment whose interval contains its __time value, and rows falling outside the task's declared intervals are dropped rather than split.
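The bucketing rule described above can be sketched in a few lines of Python. This is an illustration only, not Druid's implementation: segment_interval and format_interval are hypothetical helper names, and only DAY and HOUR are handled.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helpers for illustration; not part of any Druid client library.
GRANULARITY_STEP = {"HOUR": timedelta(hours=1), "DAY": timedelta(days=1)}


def segment_interval(ts: datetime, granularity: str = "DAY") -> tuple[datetime, datetime]:
    """Return the half-open [start, end) UTC interval that contains ts."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    utc = ts.astimezone(timezone.utc)
    if granularity == "DAY":
        start = utc.replace(hour=0, minute=0, second=0, microsecond=0)
    elif granularity == "HOUR":
        start = utc.replace(minute=0, second=0, microsecond=0)
    else:
        raise ValueError(f"unsupported granularity: {granularity}")
    return start, start + GRANULARITY_STEP[granularity]


def format_interval(start: datetime, end: datetime) -> str:
    """Render an interval in the start/end form used in segment metadata."""
    fmt = "%Y-%m-%dT%H:%M:%S.000Z"
    return f"{start.strftime(fmt)}/{end.strftime(fmt)}"


# A late-evening event at UTC-5 belongs to the next UTC day.
event = datetime(2024, 1, 3, 23, 59, 59, tzinfo=timezone(timedelta(hours=-5)))
print(format_interval(*segment_interval(event, "DAY")))
# 2024-01-04T00:00:00.000Z/2024-01-05T00:00:00.000Z
```

The example shows why local timestamps near midnight are risky: the same event lands in a different daily chunk once it is expressed in UTC.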
The Coordinator validates interval integrity during handoff. If a segment’s __time values fall outside the declared interval, the segment is marked unused and quarantined. Pipeline engineers must guarantee that source timestamps are pre-normalized to UTC and that timestampSpec parsing matches the exact format of the incoming payload.
Production Ingestion Blueprint
The following index_parallel specification enforces strict interval mapping and caps segment size to maintain predictable query routing:
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "s3", "prefixes": ["s3://analytics-events/raw/"] },
      "inputFormat": { "type": "json", "flattenSpec": { "useFieldDiscovery": false } },
      "appendToExisting": false
    },
    "dataSchema": {
      "dataSource": "user_events",
      "timestampSpec": { "column": "event_ts", "format": "iso", "missingValue": null },
      "dimensionsSpec": { "dimensions": ["user_id", "event_type"] },
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "rollup": true,
        "intervals": ["2024-01-01T00:00:00Z/2024-01-08T00:00:00Z"]
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000
    }
  }
}
Key operational notes:
- intervals in granularitySpec explicitly scopes the task to prevent accidental cross-day ingestion; rows outside this window are dropped.
- Segments are never merged across segmentGranularity boundaries, so each daily chunk remains independently addressable for retention and compaction.
- maxRowsPerSegment caps the per-segment row count, keeping segment files within the target size range for Historical nodes.
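Because rows outside the declared intervals are dropped, it can help to check a spec's intervals before submitting it. The following is a minimal sketch under stated assumptions: misaligned_intervals and parse_utc are hypothetical helpers, segmentGranularity is assumed to be given as a plain string, and only DAY is checked.

```python
from datetime import datetime, timezone


def parse_utc(text: str) -> datetime:
    """Parse an ISO-8601 string; a missing offset is treated as UTC."""
    dt = datetime.fromisoformat(text.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def is_utc_midnight(dt: datetime) -> bool:
    return (dt.hour, dt.minute, dt.second, dt.microsecond) == (0, 0, 0, 0)


def misaligned_intervals(spec: dict) -> list[str]:
    """Return declared intervals that do not start and end on a UTC day boundary."""
    granularity_spec = spec["spec"]["dataSchema"]["granularitySpec"]
    if str(granularity_spec["segmentGranularity"]).upper() != "DAY":
        raise ValueError("this sketch only checks DAY granularity")
    bad = []
    for interval in granularity_spec.get("intervals", []):
        start_text, end_text = interval.split("/")
        start, end = parse_utc(start_text), parse_utc(end_text)
        if end <= start or not (is_utc_midnight(start) and is_utc_midnight(end)):
            bad.append(interval)
    return bad
```

Run against the specification above, the function returns an empty list; an interval such as 2024-01-01T05:00:00Z/2024-01-08T00:00:00Z would be reported.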
Diagnostic Orchestration & Python Tooling
Timezone drift and DST fallbacks are the primary causes of orphaned or overlapping segments. Druid does not infer a source timezone: ISO strings that carry no offset, and epoch milliseconds, are interpreted as UTC unless explicitly transformed. Pipeline orchestration must validate interval alignment before task submission.
The following Python diagnostic routine queries the Coordinator metadata API, parses segment intervals, and flags boundary anomalies:
import requests
from dateutil.parser import isoparse

COORDINATOR_URL = "http://coordinator:8081/druid/coordinator/v1/metadata/segments"


def audit_segment_intervals(datasource: str) -> list[dict]:
    """Return segments whose interval bounds are not aligned to UTC midnight (DAY)."""
    params = {"datasources": datasource}
    resp = requests.get(COORDINATOR_URL, params=params, timeout=15)
    resp.raise_for_status()
    anomalies = []
    for seg in resp.json():
        start, end = seg["interval"].split("/")
        # A DAY segment must start and end at exactly 00:00:00.000 UTC.
        for label, dt in (("start", isoparse(start)), ("end", isoparse(end))):
            if (dt.hour, dt.minute, dt.second, dt.microsecond) != (0, 0, 0, 0):
                anomalies.append({
                    # The segment ID is usually serialized as "identifier"; fall back to "id".
                    "segment_id": seg.get("identifier", seg.get("id")),
                    "interval": seg["interval"],
                    "issue": f"non-aligned_{label}_boundary",
                    "size_bytes": seg.get("size", 0),
                })
    return anomalies


# Usage in CI/CD or Airflow DAG
if __name__ == "__main__":
    flags = audit_segment_intervals("user_events")
    if flags:
        raise RuntimeError(f"Interval misalignment detected: {flags}")
For robust timestamp normalization in Python ingestion pipelines, always leverage the built-in datetime module with explicit UTC anchoring, as documented in the Python datetime library reference. Avoid implicit local timezone assumptions in pandas or pyarrow transformations prior to Druid handoff.
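As an illustration of explicit UTC anchoring, the sketch below converts a naive local timestamp into a UTC ISO-8601 string using only the standard library. The helper name to_utc_iso is hypothetical, and the caller is assumed to know the source system's timezone.

```python
from datetime import datetime, timezone, tzinfo


def to_utc_iso(local_text: str, source_tz: tzinfo, fold: int = 0) -> str:
    """Convert a naive local timestamp string to a UTC ISO-8601 string.

    fold selects which of the two candidate instants to use when the local
    time is ambiguous (for example during a DST fall-back hour).
    """
    naive = datetime.fromisoformat(local_text)
    if naive.tzinfo is not None:
        raise ValueError("expected a naive local timestamp")
    aware = naive.replace(tzinfo=source_tz, fold=fold)
    utc = aware.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")
```

Passing zoneinfo.ZoneInfo("America/New_York") as source_tz makes the DST hazard concrete: the local time 2024-11-03T01:30:00 occurs twice, so fold=0 and fold=1 produce UTC values one hour apart. A pipeline that ignores this silently collapses or duplicates that hour of events.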
Failure Modes & Recovery Patterns
1. Timezone Drift & DST Gaps
Source systems emitting local timestamps without explicit UTC conversion cause segments to split across midnight boundaries or duplicate during DST fallback. When __time values cross into adjacent intervals, Druid creates micro-segments that degrade broker routing tables and increase memory pressure.
Recovery Pattern:
- Halt ingestion for the affected datasource.
- Run a targeted compaction task with dropExisting: true to merge fragmented intervals.
- Re-index with a pre-normalized timestampSpec and explicit intervals scoping.
Compaction spec for interval consolidation:
{
  "type": "compact",
  "dataSource": "user_events",
  "ioConfig": {
    "type": "compact",
    "inputSpec": { "type": "interval", "interval": "2024-01-01T00:00:00Z/2024-01-02T00:00:00Z" },
    "dropExisting": true
  },
  "tuningConfig": {
    "type": "index_parallel",
    "maxRowsPerSegment": 5000000
  }
}
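When several days are fragmented, one compaction task per daily interval keeps each run small and retryable. The sketch below generates payloads shaped like the spec above and posts them to the Overlord task endpoint. The host and port in OVERLORD_TASK_URL are assumptions, and daily_compaction_specs and submit_task are hypothetical helper names.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib import request

# Assumed Overlord address; adjust for your cluster.
OVERLORD_TASK_URL = "http://overlord:8090/druid/indexer/v1/task"


def daily_compaction_specs(datasource: str, first_day: datetime, days: int,
                           max_rows: int = 5_000_000) -> list[dict]:
    """Build one compact task payload per UTC day, starting at first_day."""
    if first_day.utcoffset() != timedelta(0) or first_day.time() != datetime.min.time():
        raise ValueError("first_day must be a timezone-aware UTC midnight")
    specs = []
    for offset in range(days):
        start = first_day + timedelta(days=offset)
        end = start + timedelta(days=1)
        interval = f"{start:%Y-%m-%dT%H:%M:%SZ}/{end:%Y-%m-%dT%H:%M:%SZ}"
        specs.append({
            "type": "compact",
            "dataSource": datasource,
            "ioConfig": {
                "type": "compact",
                "inputSpec": {"type": "interval", "interval": interval},
                "dropExisting": True,
            },
            "tuningConfig": {"type": "index_parallel", "maxRowsPerSegment": max_rows},
        })
    return specs


def submit_task(spec: dict, url: str = OVERLORD_TASK_URL) -> str:
    """POST a task payload to the Overlord and return the task ID it reports."""
    req = request.Request(url, data=json.dumps(spec).encode("utf-8"),
                          headers={"Content-Type": "application/json"}, method="POST")
    with request.urlopen(req, timeout=15) as resp:
        return json.load(resp)["task"]
```

A caller would iterate over daily_compaction_specs("user_events", datetime(2024, 1, 1, tzinfo=timezone.utc), 7) and pass each payload to submit_task, ideally waiting for one task to finish before sending the next.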
2. Overlapping Segments & Query Routing Degradation
When multiple ingestion tasks target the same interval concurrently, Druid may produce overlapping segments. The Broker’s segment discovery layer will route queries to all matching segments, causing duplicate aggregation or OutOfMemoryError during merge phases.
Diagnostic Command:
# Multiple partitions in one interval are normal; flag intervals that hold more than one version.
curl -s "http://coordinator:8081/druid/coordinator/v1/metadata/segments?datasources=user_events" \
  | jq '[.[] | select(.dataSource == "user_events")] | group_by(.interval) | map(select((map(.version) | unique | length) > 1))'
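Grouping by identical interval strings does not catch segments whose intervals differ but still intersect, such as an HOUR segment sitting inside a DAY segment. A minimal sketch of that check follows, assuming each metadata entry carries an "interval" field in start/end form; find_overlaps is a hypothetical helper. It compares intervals only, so whether a reported pair is harmful still depends on segment versions.

```python
from datetime import datetime, timezone


def parse_interval(interval: str) -> tuple[datetime, datetime]:
    """Split a start/end interval string into two UTC datetimes."""
    def parse(text: str) -> datetime:
        return datetime.fromisoformat(text.replace("Z", "+00:00")).astimezone(timezone.utc)
    start_text, end_text = interval.split("/")
    return parse(start_text), parse(end_text)


def find_overlaps(segments: list[dict]) -> list[tuple[str, str]]:
    """Return pairs of distinct intervals that intersect in time."""
    # Deduplicate first: several partitions sharing one interval are expected.
    intervals = sorted({seg["interval"] for seg in segments}, key=parse_interval)
    overlaps = []
    for i, first in enumerate(intervals):
        _, first_end = parse_interval(first)
        for second in intervals[i + 1:]:
            second_start, _ = parse_interval(second)
            if second_start >= first_end:
                break  # sorted by start, so no later interval can intersect first
            overlaps.append((first, second))
    return overlaps
```

Intervals are treated as half-open, so two adjacent daily chunks that share a midnight boundary are not reported.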
Recovery Pattern:
- Use the Overlord API to kill redundant tasks: POST /druid/indexer/v1/task/{taskId}/shutdown
- Trigger a forced compaction with dropExisting: true to collapse overlaps.
- Implement idempotent task submission in your orchestrator by checking /druid/indexer/v1/pendingTasks before dispatch.
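The idempotency check in the last step can be sketched as follows. The Overlord address is an assumption, fetch_pending_tasks and should_submit are hypothetical helper names, and the task entries are assumed to expose "dataSource" and "type" fields.

```python
import json
from urllib import request

# Assumed Overlord address; adjust for your cluster.
OVERLORD_URL = "http://overlord:8090"


def fetch_pending_tasks(base_url: str = OVERLORD_URL) -> list[dict]:
    """Fetch the Overlord's list of pending tasks."""
    with request.urlopen(f"{base_url}/druid/indexer/v1/pendingTasks", timeout=15) as resp:
        return json.load(resp)


def should_submit(datasource: str, task_type: str, pending: list[dict]) -> bool:
    """Return True when no pending task of the same type targets the datasource."""
    return not any(
        task.get("dataSource") == datasource and task.get("type") == task_type
        for task in pending
    )


# Example with a hand-written pending list instead of a live Overlord call.
pending = [{"id": "compact_user_events_abc", "type": "compact", "dataSource": "user_events"}]
print(should_submit("user_events", "compact", pending))         # False
print(should_submit("user_events", "index_parallel", pending))  # True
```

Keeping the decision in a pure function makes it easy to unit test in an orchestrator. Pending tasks are only part of the picture, so a stricter guard would apply the same check to the Overlord's running task list as well.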
For comprehensive interval validation and ingestion tuning, consult the official Apache Druid Ingestion Documentation. Understanding how segmentGranularity dictates chunk boundaries is critical when designing retention policies, as detailed in Understanding Druid Segment Granularity.