Apache Druid Segment Architecture & Lifecycle Fundamentals: Production Orchestration Guide

Apache Druid's sub-second analytical performance rests on a deterministic, immutable segment model. For OLAP data engineers, analytics platform developers, and DevOps teams, the segment is the single unit of storage, replication, query parallelism, and cost that every production decision ultimately routes through. This guide dissects the operational mechanics of Druid segments, from their columnar internals and metadata state machine to automated lifecycle transitions, orchestration patterns, and the monitoring hooks that keep query latency predictable at scale.

Architecture at a Glance

The diagram below traces a segment's path through the Druid cluster: from an ingestion task writing to deep storage and publishing to the metadata store, through Coordinator-driven assignment and balancing onto Historical nodes, to query-time scatter/gather at the Broker.

Figure: data path. Ingestion writes to deep storage and metadata; the Coordinator assigns segments to Historicals; the Broker fans queries out and gathers results.

Every arrow in that path is a place where synchronization can drift: a task can succeed but its segment never load, a Coordinator can rebalance faster than Historicals can pull bytes, or a Broker can hold a stale timeline that points at a segment already dropped. The sections that follow treat each stage as an operational surface with its own tunables, sizing math, failure signatures, and metrics.

Core Concept & Internal Mechanics

A Druid segment is a self-contained, time-bound, immutable data unit identified by a four-part coordinate: datasource, time interval, version (an ISO-8601 timestamp assigned at publish), and partition number. Because segments are immutable, Druid never mutates data in place — a re-ingest or compaction of the same interval produces a new version, and the Coordinator atomically switches queries to the higher version once its replicas are loaded. This versioning is what makes ingestion idempotent and rollbacks trivial: the old version remains in deep storage, marked unused, until a kill task purges it.
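
As a rough illustration of how versioning hides older data, the sketch below keeps only the highest version for each datasource and interval. It is a simplification under stated assumptions: the `SegmentId` class and `visible_segments` function are hypothetical names, intervals are assumed to match exactly (Druid's real timeline also resolves partial overlaps), and versions are assumed to be ISO-8601 strings in one format so that string comparison orders them correctly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentId:
    """Hypothetical model of the four-part segment coordinate."""
    datasource: str
    interval: str   # "start/end" in ISO-8601
    version: str    # ISO-8601 timestamp assigned at publish
    partition: int


def visible_segments(segments):
    """Keep only segments of the highest version per (datasource, interval)."""
    newest = {}
    for seg in segments:
        key = (seg.datasource, seg.interval)
        # Same-format ISO-8601 timestamps sort lexicographically.
        if key not in newest or seg.version > newest[key]:
            newest[key] = seg.version
    return [s for s in segments if s.version == newest[(s.datasource, s.interval)]]


hour = "2024-01-01T00:00:00.000Z/2024-01-01T01:00:00.000Z"
segments = [
    SegmentId("events", hour, "2024-01-01T02:00:00.000Z", 0),
    SegmentId("events", hour, "2024-01-01T02:00:00.000Z", 1),
    SegmentId("events", hour, "2024-01-02T09:00:00.000Z", 0),  # re-ingest
]
visible = visible_segments(segments)
```

Here the two partitions of the original version drop out of view once the re-ingested version exists, which mirrors why a re-ingest never needs to mutate data in place.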

Internally, each segment stores dimensions and metrics in a columnar layout where every column is encoded and compressed independently. String dimensions carry a sorted dictionary that maps each distinct value to an integer id, a column of those ids, and a bitmap index — one Roaring Bitmap per distinct value — that records exactly which rows contain it. Filters resolve to fast bitmap set operations (AND/OR/NOT) instead of row scans, which is the mechanical reason a WHERE country = 'US' AND device = 'mobile' predicate stays sub-second across billions of rows. The dictionary, encoding, and compression choices are governed by the indexSpec, and their trade-offs are unpacked in depth in columnar storage formats in Druid.
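
The dictionary-plus-bitmap structure described above can be sketched in a few lines. This is a toy model, not Druid's implementation: plain Python integers stand in for Roaring Bitmaps, and `build_index` and `rows_of` are hypothetical helper names.

```python
def build_index(values):
    """Dictionary-encode a string column and build one bitmap per distinct value."""
    dictionary = sorted(set(values))                 # sorted distinct values
    ids = {v: i for i, v in enumerate(dictionary)}   # value -> integer id
    encoded = [ids[v] for v in values]               # the column of ids
    bitmaps = {v: 0 for v in dictionary}
    for row, v in enumerate(values):
        bitmaps[v] |= 1 << row                       # set bit `row` for this value
    return dictionary, encoded, bitmaps


def rows_of(bitmap):
    """List the row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


country = ["US", "DE", "US", "FR", "US"]
device = ["mobile", "mobile", "desktop", "mobile", "mobile"]

_, _, country_bm = build_index(country)
_, _, device_bm = build_index(device)

# WHERE country = 'US' AND device = 'mobile' becomes a single bitwise AND.
matching = rows_of(country_bm["US"] & device_bm["mobile"])
print(matching)  # [0, 4]
```

No row is inspected while evaluating the filter; only the two precomputed bitmaps are combined.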

Three structural properties drive nearly every operational decision downstream:

Time partitioning first, then secondary partitioning. A segment always covers exactly one segmentGranularity bucket (an HOUR, a DAY, and so on). Within that bucket, rows may be further split by a partitionsSpec (dynamic, hashed, or range). The primary time partition is what lets the Broker prune whole intervals before touching a single column; how granularity choice trades metadata overhead against query parallelism is covered in understanding Druid segment granularity.
Rollup at ingestion. When rollup is enabled, rows sharing identical dimension values within a queryGranularity bucket are pre-aggregated, trading raw-row fidelity for dramatic size reduction. The rollup ratio directly determines how many source events land in each segment.
Metadata/data separation. The segment bytes live in deep storage (S3, HDFS, GCS, Azure); the segment descriptor — its coordinate, size, dimensions, and used/unused flag — lives in the metadata store (typically PostgreSQL or MySQL). The Coordinator and Broker reason entirely over metadata; Historicals are the only processes that touch bytes. The internals of that descriptor table are detailed in the segment metadata storage deep dive.
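
To make the rollup property concrete, here is a minimal sketch that pre-aggregates rows sharing a minute bucket and identical dimension values. It assumes MINUTE truncation and two metrics, an event count and a click sum; the `rollup` function and the sample rows are hypothetical, and Druid performs the equivalent work inside the indexing task rather than in Python.

```python
from collections import defaultdict
from datetime import datetime


def rollup(rows, dimensions):
    """Pre-aggregate rows that share a minute bucket and identical dimension values."""
    agg = defaultdict(lambda: {"events": 0, "clicks": 0})
    for row in rows:
        # Truncate the timestamp to the minute before grouping.
        bucket = datetime.fromisoformat(row["ts"]).replace(second=0, microsecond=0)
        key = (bucket.isoformat(), *(row[d] for d in dimensions))
        agg[key]["events"] += 1
        agg[key]["clicks"] += row["clicks"]
    return dict(agg)


rows = [
    {"ts": "2024-01-01T00:00:05", "country": "US", "clicks": 1},
    {"ts": "2024-01-01T00:00:40", "country": "US", "clicks": 2},
    {"ts": "2024-01-01T00:01:10", "country": "US", "clicks": 4},
]
rolled = rollup(rows, ["country"])
ratio = len(rows) / len(rolled)  # source events per stored row
```

Three input events collapse to two stored rows here, a rollup ratio of 1.5; real ratios depend entirely on dimension cardinality and the chosen truncation.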

Figure: a WHERE country = 'US' filter resolves to the US bitmap alone, a set operation rather than a row scan.

The segment lifecycle state machine

A segment is never simply "there." It occupies a metadata state that the Coordinator continuously reconciles against configured load and retention rules:

Published — written to deep storage and its descriptor committed to the metadata store by the indexing task at handoff. At this instant the data is durable but not yet queryable.
Available / Used — the Coordinator has assigned the segment to one or more Historicals (per the replication factor of its tier), those Historicals have pulled the bytes and announced the segment, and the Broker has added it to the query timeline.
Unused — a drop/retention rule or an explicit markUnused call has retired the segment from serving. The bytes remain in deep storage; the descriptor stays in metadata with used = false, so the action is fully reversible via markUsed.
Removed — a kill task has permanently deleted both the deep-storage bytes and the metadata descriptor for unused segments in a target interval. This is the only irreversible transition.

Figure: every transition except the final kill is reversible, which is why this state machine doubles as a cost-control substrate.

The transitions between Used, Unused, and Removed are the domain of the segment compaction, retention & storage optimization workflows, which treat this state machine as the substrate for cost control.
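
The four states and their transitions can be modelled as a small lookup table. The sketch below is an illustrative model only: the `State` enum, the `transition` function, and the action label `load` are hypothetical names, while `markUnused`, `markUsed`, and `kill` follow the terms used in this guide.

```python
from enum import Enum


class State(Enum):
    PUBLISHED = "published"
    USED = "used"
    UNUSED = "unused"
    REMOVED = "removed"


# (current state, action) -> next state. Anything absent is illegal.
TRANSITIONS = {
    (State.PUBLISHED, "load"): State.USED,
    (State.USED, "markUnused"): State.UNUSED,
    (State.UNUSED, "markUsed"): State.USED,     # the reversible path
    (State.UNUSED, "kill"): State.REMOVED,      # the only irreversible step
}


def transition(state, action):
    """Return the next lifecycle state or raise on an illegal move."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"illegal transition: {action} from {state.name}") from None


state = transition(State.PUBLISHED, "load")
state = transition(state, "markUnused")
```

Encoding the table explicitly makes one property easy to enforce in an orchestrator: a segment cannot be killed while it is still used, and nothing leads out of the removed state.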

Configuration Reference

Segment shape is decided almost entirely at ingestion time, inside the granularitySpec, partitionsSpec, and tuningConfig blocks of the ingestion spec. The annotated batch spec below shows every field that materially affects the segments it produces. Tasks are submitted to the Overlord at POST /druid/indexer/v1/task and their status polled at GET /druid/indexer/v1/task/{taskId}/status.

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": ["country", "device", "campaign_id"]
      },
      "metricsSpec": [
        { "type": "count", "name": "events" },
        { "type": "longSum", "name": "clicks", "fieldName": "clicks" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": true
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "s3", "prefixes": ["s3://bucket/events/"] },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000,
      "maxNumConcurrentSubTasks": 4,
      "partitionsSpec": {
        "type": "hashed",
        "targetRowsPerSegment": 5000000,
        "partitionDimensions": ["country"]
      },
      "indexSpec": {
        "bitmap": { "type": "roaring" },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs"
      },
      "forceGuaranteedRollup": true
    }
  }
}

Field-by-field, the ones that govern segment shape:

segmentGranularity: the time span each segment covers. It is the primary lever for segment count: finer granularity multiplies segment and metadata volume, while coarser granularity reduces Coordinator load but coarsens retention and pruning resolution.
queryGranularity: the timestamp truncation applied to rows before rollup. MINUTE collapses all events within a minute; NONE preserves millisecond timestamps, so rows roll up only when their timestamps and dimension values match exactly.
rollup / forceGuaranteedRollup: enable pre-aggregation. forceGuaranteedRollup runs a two-phase shuffle so rollup is perfect across the whole interval (no duplicate dimension-tuples split across partitions), at the cost of an extra pass. It requires a hashed or range partitionsSpec.
partitionsSpec.type: dynamic (fast, size-driven, no secondary clustering), hashed (even distribution + guaranteed rollup, keyed on partitionDimensions), or range (co-locates a range of a dimension's values in the same segment for pruning). The hashed-versus-range choice is a trade-off that merits its own analysis.
targetRowsPerSegment / maxRowsPerSegment: the row-count target and upper bound per segment, and the primary knob for hitting the size band discussed below. Their exact semantics vary by partitionsSpec type, so treat the bound as approximate rather than a strict ceiling.
indexSpec.bitmap: roaring (default, best for high cardinality and fast AND/OR) vs concise.
indexSpec.dimensionCompression / metricCompression: lz4 (default, fast) or zstd (denser, more CPU). Codec choice interacts with storage tier and scan CPU budget.
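
The idea behind hashed partitioning, that rows with the same partitionDimensions values always land in the same partition, can be sketched as follows. This is an illustration only: `partition_for` is a hypothetical function, and SHA-256 is a stand-in chosen for determinism, not the hash Druid uses internally.

```python
import hashlib


def partition_for(row, partition_dimensions, num_partitions):
    """Map a row to a partition from the hash of its partition-dimension values."""
    key = "\x00".join(str(row[d]) for d in partition_dimensions)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


a = {"country": "US", "device": "mobile"}
b = {"country": "US", "device": "desktop"}

# Both rows share country, so keying on ["country"] co-locates them.
same_partition = partition_for(a, ["country"], 4) == partition_for(b, ["country"], 4)
```

Because every row with a given key goes to one partition, identical dimension-tuples cannot be split across segments, which is the property guaranteed rollup depends on.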

For streaming ingestion the equivalent knobs live under a supervisor spec's tuningConfig (maxRowsPerSegment, intermediatePersistPeriod) and the handoff is continuous rather than task-terminal; the reconciliation of batch and streaming segment production is covered under automated ingestion pipeline orchestration.

Operational Sizing & Constraints

Segment size is a first-order performance property, not a cosmetic one. The Historical query engine loads and scans columnar data in memory-mapped chunks; oversized segments materialize excessive column vectors onto the JVM heap and trigger long garbage-collection pauses, while undersized segments fragment the timeline and inflate per-segment scheduling and metadata overhead. The widely used target band is roughly 300–700 MB, or about 5 million rows per segment:

$$\text{targetRows} = \text{targetMB} \times 1048576 \div \text{avgRowBytes}$$

So for a 600 MB target where measured post-rollup rows average 120 bytes:

$$\text{targetRows} \approx 600 \times 1048576 \div 120 \approx 5.24 \times 10^{6}$$

Always derive avgRowBytes from a real sample rather than a guess: ingest a representative slice, read back the actual segment size and numRows from the metadata, and divide. The full treatment of this calibration for Historical memory pressure lives in optimizing segment size for Historical nodes.
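
The calibration above reduces to two small calculations. The sketch below is a direct transcription of the formula; the function names are hypothetical, and the sample segment size and row count are made-up inputs chosen to reproduce the 120-byte figure used in the worked example.

```python
def avg_row_bytes(segment_bytes, num_rows):
    """Average stored bytes per row, measured from a real sample segment."""
    return segment_bytes / num_rows


def target_rows(target_mb, row_bytes):
    """Rows per segment needed to hit a target size in MB (1 MB = 1048576 bytes)."""
    return int(target_mb * 1048576 / row_bytes)


# Hypothetical sample: a 480,000,000-byte segment holding 4,000,000 rows.
row_bytes = avg_row_bytes(480_000_000, 4_000_000)
print(row_bytes)                    # 120.0
print(target_rows(600, row_bytes))  # 5242880
```

The result is a natural starting value for targetRowsPerSegment; re-measure after any schema or rollup change, since both shift the bytes-per-row figure.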

Two more constraints bound the design space:

Historical capacity headroom. Each Historical advertises druid.server.maxSize and must hold its assigned segments in druid.segmentCache.locations disk plus enough page-cache RAM to memory-map the hot working set. Plan cluster storage as $\text{totalDeepStorage} \times \text{replicationFactor}$ across the serving tiers, and keep utilization below roughly 85% so the Coordinator retains room to rebalance.
Segment count vs Coordinator cost. Every used segment is a row the Coordinator loads, a timeline entry the Broker holds, and a unit the balancing algorithm considers each run. Millions of tiny segments degrade the control plane long before they exhaust disk — which is precisely why automated compaction scheduling exists: to merge size-drifted segments back into the target band.
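
A back-of-the-envelope version of the capacity rule above is sketched here, assuming a single serving tier and the 85% utilization ceiling. The function names and the example figures (10 TB of deep storage, replication factor 2, 4 TB of maxSize per Historical) are hypothetical.

```python
import math


def required_capacity_bytes(deep_storage_bytes, replication_factor, max_utilization=0.85):
    """Serving capacity needed so replicated segments stay under the utilization ceiling."""
    return deep_storage_bytes * replication_factor / max_utilization


def historicals_needed(deep_storage_bytes, replication_factor, max_size_per_node,
                       max_utilization=0.85):
    """Number of Historicals, each advertising max_size_per_node bytes."""
    needed = required_capacity_bytes(deep_storage_bytes, replication_factor, max_utilization)
    return math.ceil(needed / max_size_per_node)


TB = 10**12
print(historicals_needed(10 * TB, 2, 4 * TB))  # 6
```

Five nodes would hold the 20 TB of replicated data, but only at 100% utilization; the sixth is what leaves the Coordinator room to rebalance.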

Pipeline Orchestration Patterns

Production ingestion cannot be fire-and-forget. A robust orchestrator submits the spec, polls to a terminal state with bounded exponential backoff, and — crucially — waits for the segment to become queryable, not merely for the task to report SUCCESS. Handoff (bytes persisted + descriptor committed) and availability (Historicals loaded + Broker announced) are distinct events, and gating downstream work on the wrong one is the most common cause of "the job passed but the data isn't there" incidents.

The stdlib + requests orchestrator below submits a batch task, polls status with exponential backoff, then confirms availability against the Coordinator before returning:

import time
import requests

OVERLORD = "http://druid-overlord:8090"
COORDINATOR = "http://druid-coordinator:8081"


def submit_task(spec: dict) -> str:
    r = requests.post(f"{OVERLORD}/druid/indexer/v1/task", json=spec, timeout=30)
    r.raise_for_status()
    return r.json()["task"]


def poll_until_terminal(task_id: str, max_wait: float = 3600.0) -> str:
    delay, waited = 2.0, 0.0
    while waited < max_wait:
        r = requests.get(
            f"{OVERLORD}/druid/indexer/v1/task/{task_id}/status", timeout=30
        )
        r.raise_for_status()
        status = r.json()["status"]["status"]  # RUNNING | SUCCESS | FAILED
        if status in ("SUCCESS", "FAILED"):
            return status
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, 60.0)  # exponential backoff, capped
    raise TimeoutError(f"{task_id} did not finish within {max_wait}s")


def wait_for_availability(datasource: str, interval: str, max_wait: float = 600.0) -> bool:
    """Confirm at least one segment for the interval is loaded and used."""
    delay, waited = 2.0, 0.0
    while waited < max_wait:
        r = requests.get(
            f"{COORDINATOR}/druid/coordinator/v1/datasources/{datasource}"
            f"/intervals/{interval.replace('/', '_')}/serverview",
            timeout=30,
        )
        if r.ok and r.json():
            return True
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, 30.0)
    return False


def run(spec: dict, datasource: str, interval: str) -> None:
    task_id = submit_task(spec)
    if poll_until_terminal(task_id) != "SUCCESS":
        raise RuntimeError(f"ingestion task {task_id} FAILED")
    if not wait_for_availability(datasource, interval):
        raise RuntimeError(f"{task_id} succeeded but segments never became queryable")
    print(f"{task_id}: ingested and queryable")

The same backoff-and-verify shape generalizes to every lifecycle action: submitting a compaction task, issuing markUnused, or launching a kill. For the spec-generation side of this pipeline — templating these JSON bodies programmatically instead of hand-editing them — see dynamic ingestion spec generation, and for the async supervision of long-running streaming tasks see async task execution patterns.
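
One way to reuse that shape across lifecycle actions is to factor the loop into a single helper that takes the verification step as a callable. This is a sketch under assumptions: `poll_with_backoff` and its parameters are hypothetical names, and the `sleep` argument is injectable purely so the loop can be exercised without real waiting.

```python
import time


def poll_with_backoff(check, max_wait=600.0, initial=2.0, cap=60.0, sleep=time.sleep):
    """Call check() until it returns a truthy value or the backoff budget is spent."""
    delay, waited = initial, 0.0
    while True:
        result = check()
        if result:
            return result
        if waited >= max_wait:
            raise TimeoutError(f"condition not met within {max_wait}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, cap)  # exponential backoff, capped


# Dry run with a fake check and a recording sleep instead of real waiting.
answers = iter([None, None, "ok"])
delays = []
outcome = poll_with_backoff(lambda: next(answers), sleep=delays.append)
print(outcome, delays)  # ok [2.0, 4.0]
```

In practice `check` would wrap whichever verification applies to the action, for example a request confirming that a compaction task reached a terminal state or that an interval no longer has used segments.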

Failure Modes & Diagnostics

Each failure below is stated as symptom → root cause → remediation, in rough order of how often it bites production clusters.

Task reports SUCCESS but queries return no rows. → Handoff completed but the Coordinator has not yet assigned/loaded the segment, or the target tier is at capacity. → Poll the Coordinator serverview (as wait_for_availability does) before signalling downstream success; check GET /druid/coordinator/v1/loadstatus and confirm the destination tier has headroom under druid.server.maxSize.
Historical node OOM / long GC pauses under scan load. → Segments far above the target band forcing large column vectors onto the heap. → Run compaction to bring segments into the 300–700 MB band; verify with GET /druid/coordinator/v1/datasources/{ds}/segments?full that the size reported for each segment is within range; raise processing buffer/heap only as a stopgap.
Broker query timeouts or partial results after ingestion. → Stale Broker timeline pointing at segments that moved, dropped, or never announced. → Inspect the Broker's view; confirm discovery propagation and that no Historicals are mid-restart. The propagation and tier-affinity mechanics live in query routing and segment discovery.
Coordinator run time climbing, rebalancing never settling ("rebalancing storm"). → Too many tiny segments from over-fine segmentGranularity or dynamic partitioning without compaction. → Coarsen granularity going forward and enable auto-compaction to consolidate historical intervals; watch segment count per datasource.
Deep storage growing without bound. → Old segment versions and unused segments accumulate because no kill task ever purges them. → Confirm druid.coordinator.kill.on=true and schedule kill tasks for intervals past retention; the reversible markUnused step must be followed by an actual purge to reclaim bytes.
Duplicate or double-counted rows after a re-ingest. → Rollup was not guaranteed, so identical dimension-tuples landed in separate partitions, or an older overlapping version is still used. → Use forceGuaranteedRollup with a hashed/range partitionsSpec; verify only the newest version is used for the interval via the metadata segments endpoint.
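
Several of these diagnoses start from the same question: which segments sit outside the target size band? The sketch below filters a list of segment descriptors by size. It assumes each descriptor is a dict carrying `identifier` and `size` (in bytes) fields, as in the full segment listing from the Coordinator; the `out_of_band` function and the sample data are hypothetical.

```python
MB = 1048576


def out_of_band(segments, low_mb=300, high_mb=700):
    """Return (identifier, size in MB) for segments outside the target size band."""
    flagged = []
    for seg in segments:
        size_mb = seg["size"] / MB
        if size_mb < low_mb or size_mb > high_mb:
            flagged.append((seg["identifier"], round(size_mb, 1)))
    return flagged


# Hypothetical descriptors; real ones carry many more fields.
sample = [
    {"identifier": "events_hour00", "size": 50 * MB},
    {"identifier": "events_hour01", "size": 500 * MB},
    {"identifier": "events_hour02", "size": 900 * MB},
]
print(out_of_band(sample))  # [('events_hour00', 50.0), ('events_hour02', 900.0)]
```

A growing list of undersized entries points toward the rebalancing-storm failure, while oversized entries point toward the Historical memory-pressure failure; both are candidates for compaction.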

Security & Access Control Boundaries

The segment lifecycle crosses three trust boundaries, and each must be locked independently — securing one while leaving another open exposes the data.

Query / metadata plane. Druid's authorization extensions (e.g. the basic-security or LDAP extensions) enforce resource:action checks at the datasource level, so a principal authorized to READ datasource events cannot read finance. These checks gate the Broker, Router, and Overlord APIs.
Deep storage plane. The raw segment files in object storage are Druid-opaque: anyone with the underlying S3/HDFS/GCS credentials can read the columnar bytes directly, bypassing every datasource ACL. Deep-storage buckets must therefore carry their own least-privilege IAM policy, scoped so only the indexing and Historical service identities can read/write the segment prefix, with encryption-at-rest enabled.
Metadata plane. The metadata store holds every segment descriptor and, depending on extensions, credentials. Its database ACLs must restrict connections to Druid service accounts, because write access there lets an attacker flip used flags or forge segment versions.

In multi-tenant deployments these boundaries also interact with retention: a tenant's right-to-be-forgotten request is only satisfied once the corresponding segments are marked unused and killed from deep storage. The full hardening model across all three planes is detailed in security boundaries for segment access, and the retention-rule syntax that drives per-tenant expiration in configuring segment retention policies.

Monitoring & Alerting Hooks

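Druid emits metrics that, when scraped into Prometheus (via the prometheus-emitter extension or a StatsD bridge), give direct visibility into every stage of the lifecycle. The table maps the highest-value signals to their meaning and a starting alert threshold. Exported metric names depend on the emitter's namespace and metric mapping, so check the names below against what your cluster actually exposes.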

Metric	What it tells you	Suggested alert
`druid_coordinator_segment_unavailable_count`	Segments that should be loaded but aren't	`> 0` for 5 min
`druid_coordinator_segment_under_replicated_count`	Segments below their tier's replication factor	`> 0` for 10 min
`druid_segment_count` (per datasource)	Segment-count growth / drift toward tiny segments	Sudden slope change
`druid_coordinator_time` (duty run duration)	Control-plane pressure from segment volume	`> 90s` sustained
`druid_historical_segment_used_percent`	Historical cache fill vs `maxSize`	`> 85%`
`druid_query_time` (p95/p99 by datasource)	Query latency SLA	p95 above SLA
`druid_ingest_handoff_time`	Time from persist to queryable	Rising trend
JVM `jvm_gc_pause` on Historicals	GC pressure from oversized segments	Frequent long pauses

Practical panel hints for a Grafana lifecycle dashboard: one row for availability (segment_unavailable_count + under_replicated_count as stat panels that go red on nonzero), one row for shape (segment count and average segment size per datasource as time series — the earliest warning of drift), one row for capacity (segment_used_percent per Historical, heatmap), and one row for latency (query_time p95/p99 overlaid on ingest_handoff_time). Alert on segment_unavailable_count > 0 as a page-worthy signal — it is the clearest indicator that ingestion succeeded but the data never reached queries. Wire these alerts to the same orchestrator that runs ingestion so a failed availability check and a firing alert tell one coherent story.

Related

Columnar storage formats in Druid — how dictionary encoding, bitmap indexes, and LZ4/ZSTD codecs shape segment size and scan speed.
Understanding Druid segment granularity — choosing segmentGranularity to balance metadata overhead against query parallelism.
Query routing and segment discovery — how the Broker builds its timeline and routes scatter/gather queries to the right Historicals.
Security boundaries for segment access — hardening the query, deep-storage, and metadata planes for multi-tenant datasources.
Segment compaction, retention & storage optimization — the sibling guide to consolidating, retaining, and purging segments across their lifecycle.
Automated ingestion pipeline orchestration — the sibling guide to submitting, validating, and supervising the tasks that produce segments.
