Apache Druid Segment Lifecycle: Compaction, Retention & Storage Optimization

Apache Druid's query performance, storage footprint, and cluster stability are fundamentally governed by how segments are created, consolidated, retained, and purged — making segment lifecycle management a core architectural discipline for OLAP data engineers, analytics platform developers, and DevOps teams running high-throughput ingestion at production scale.

Druid segments are immutable, columnar, time-partitioned units persisted in deep storage and loaded into memory-mapped page cache on Historical nodes. Mastering their lifecycle — the transition from freshly published fragments, through compaction into query-optimal partitions, to rule-driven retention and permanent removal — is what separates a Druid cluster that delivers predictable sub-second latency at bounded cost from one that drifts into heap exhaustion, metadata bloat, and runaway deep-storage bills. This guide treats the lifecycle as a single control surface and drills into each stage: the immutability contract, compaction internals, retention governance, storage sizing math, orchestration patterns, failure diagnostics, access control, and monitoring.

Segment Lifecycle at a Glance

Every segment moves through a small set of metadata states — from publication and loading, through optional compaction, to retirement and permanent removal. The Coordinator and Overlord drive these transitions; the used flag in the metadata store is the single source of truth for whether a segment should be loaded and served.

This page is the top of the compaction, retention, and storage area of the site. For the upstream concepts that define what a segment is before it enters this lifecycle — time chunking, columnar encoding, and query routing — start with the segment architecture and lifecycle fundamentals reference.

Core Concept & Internal Mechanics

Every Druid segment represents a discrete, time-bound slice of data: segmentGranularity sets the time chunk it covers, queryGranularity sets how finely timestamps inside it are truncated, and the partitioning dimensions decide how rows within a chunk are split across partitions. The way those boundaries are chosen is covered in depth under segment granularity fundamentals; everything downstream of ingestion (compaction, retention, and sizing) operates on the segments those settings produce.

Once a segment is published to the metadata store and deep storage, it becomes strictly immutable. Schema adjustments, rollup rule updates, or data corrections cannot modify an existing segment in place. Instead, the system must generate replacement segments through compaction or full re-indexing. This immutability model guarantees deterministic query execution and eliminates write-amplification during peak analytical workloads, but it demands rigorous lifecycle controls to prevent metadata bloat, Historical JVM heap exhaustion, and deep-storage sprawl.

The identifier and the used flag

A segment is uniquely identified by the tuple of datasource, time interval, version (an ISO-8601 timestamp assigned at publish time), and partition number — for example wikipedia_2024-01-01T00:00:00.000Z_2024-01-02T00:00:00.000Z_2024-01-05T09:12:03.456Z_3. The version string is what makes atomic overwrite possible: when compaction or re-indexing produces a newer version covering the same interval, the Coordinator loads the new version, then marks the older version's segments unused. Queries always resolve to the highest version available for an interval, so readers never observe a partial swap.
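As a rough illustration of the identifier layout described above, the following Python sketch splits a segment ID into its four parts and picks the version a query would resolve to. The names parse_segment_id and visible_version are hypothetical helpers, not part of any Druid client. The sketch assumes timestamps in the UTC millisecond form shown in the example, and it treats a missing partition suffix as partition 0.

```python
import re
from typing import NamedTuple, Optional

# Assumption: timestamps use the UTC millisecond form shown in the example ID.
_TS = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z"
_SEGMENT_ID = re.compile(
    rf"^(?P<datasource>.+)_(?P<start>{_TS})_(?P<end>{_TS})_(?P<version>{_TS})"
    r"(?:_(?P<partition>\d+))?$"
)


class SegmentId(NamedTuple):
    datasource: str
    start: str
    end: str
    version: str
    partition: int


def parse_segment_id(segment_id: str) -> Optional[SegmentId]:
    """Split an ID into datasource, interval, version, and partition number."""
    m = _SEGMENT_ID.match(segment_id)
    if m is None:
        return None
    # Assumption of this sketch: no partition suffix means partition 0.
    partition = int(m["partition"] or 0)
    return SegmentId(m["datasource"], m["start"], m["end"], m["version"], partition)


def visible_version(segment_ids):
    """Highest version among segment IDs that cover the same interval."""
    versions = [p.version for p in map(parse_segment_id, segment_ids) if p]
    # Same-format ISO-8601 UTC strings sort chronologically as plain strings.
    return max(versions) if versions else None
```

Anchoring the pattern on the three trailing timestamps lets datasource names that themselves contain underscores parse correctly.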

The metadata store tracks each segment row with a boolean used column. A segment stays used while it should be loaded and served; it flips to unused when a drop rule or a manual markUnused call retires it. Unused segments still occupy deep storage and a metadata row until a kill task permanently removes both. The division of labor across processes is strict:

The Overlord orchestrates ingestion, compaction, and kill tasks, assigning them to middle-manager worker slots.
The Coordinator evaluates load/drop rules, decides which used segments load onto which Historicals, and balances the Druid cluster.
Historical nodes memory-map and serve assigned segments from local druid.segmentCache.locations.
The Broker performs scatter/gather across Historicals; the mechanics of how it discovers which node holds which segment are detailed in query routing and segment discovery.

In production environments, segment boundaries should be treated as versioned infrastructure artifacts, tracked alongside ingestion specifications and pipeline commit histories. A segment's internal layout — dictionary-encoded dimensions, compressed metric columns, and bitmap indexes — is what compaction rewrites and what sizing targets are measured against; that on-disk format is described under columnar storage formats in Druid.

Why compaction exists

Streaming ingestion (Kafka/Kinesis) and small batch tasks routinely emit many undersized segments per time chunk — one per task per partition per intermediate handoff. A single day may land as hundreds of 5–30 MB fragments. Each fragment carries fixed metadata overhead, adds a row to the Broker's query plan, and forces the scan engine to open more column files. Compaction is Druid's mechanism for merging those fragments into a small number of query-optimal partitions, re-applying rollup, and reclaiming the overhead — without ever mutating a live segment.

Configuration Reference

The lifecycle is driven by three families of configuration: the auto-compaction spec (submitted to the Coordinator), Coordinator retention rules (per datasource), and the kill/cleanup settings. The subsections below annotate each. For a complete, copy-ready native compaction spec with every field documented, see configuring Druid native compaction rules.

Auto-compaction spec

Auto-compaction is configured per datasource by sending a spec to POST /druid/coordinator/v1/config/compaction. The Coordinator's compaction duty then generates and submits the underlying tasks automatically.

{
  "dataSource": "wikipedia",
  "taskPriority": 25,
  "inputSegmentSizeBytes": 100000000000,
  "skipOffsetFromLatest": "P1D",
  "tuningConfig": {
    "type": "index_parallel",
    "partitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 5000000,
      "maxTotalRows": 20000000
    },
    "maxNumConcurrentSubTasks": 4,
    "maxRowsInMemory": 1000000
  },
  "granularitySpec": {
    "segmentGranularity": "DAY",
    "queryGranularity": "HOUR",
    "rollup": true
  },
  "ioConfig": {
    "dropExisting": false
  }
}

Field-by-field:

taskPriority — priority of generated compaction tasks relative to ingestion. Keep it below streaming supervisor priority so real-time handoff always wins a worker slot.
inputSegmentSizeBytes — hard ceiling on the total bytes of input segments a single compaction task will pull. Guards against a single task trying to rewrite an unbounded interval; 100000000000 (~100 GB) is a common guardrail.
skipOffsetFromLatest — an ISO-8601 period at the head of the timeline that compaction leaves alone (P1D skips the most recent day). This prevents compaction from fighting active streaming ingestion over the newest, still-mutating intervals.
partitionsSpec.maxRowsPerSegment — the per-segment row ceiling for output. This is the primary lever for landing segments in the target size band; the byte-based targetCompactionSizeBytes was removed in Druid 0.21, so size is controlled indirectly through row count.
maxNumConcurrentSubTasks — parallelism within one compaction task; tune against available worker slots and Historical heap.
granularitySpec.segmentGranularity — compaction may re-partition to a coarser granularity (e.g. HOUR → DAY) to consolidate sparse intervals.
ioConfig.dropExisting — when true, tombstones intervals in the compaction window that no longer have input data, forcing them out of the timeline. Use with care alongside retention rules.
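To make the effect of skipOffsetFromLatest concrete, here is a minimal Python sketch that filters a list of time chunks down to the ones compaction would be allowed to touch. It assumes the offset is measured back from the end of the latest interval in the datasource, and it uses a timedelta in place of the ISO-8601 period string. The function name compaction_candidates is hypothetical, and the real Coordinator duty applies further checks that are not modeled here.

```python
from datetime import datetime, timedelta, timezone


def compaction_candidates(intervals, skip_offset):
    """Return intervals that end at or before (latest end - skip_offset).

    intervals: list of (start, end) timezone-aware datetimes, one per time chunk.
    skip_offset: timedelta standing in for the ISO-8601 period (P1D -> 1 day).
    """
    if not intervals:
        return []
    latest_end = max(end for _, end in intervals)
    cutoff = latest_end - skip_offset
    return sorted((start, end) for start, end in intervals if end <= cutoff)


utc = timezone.utc
# Four daily chunks: Jan 1, Jan 2, Jan 3, Jan 4 (2024).
days = [
    (datetime(2024, 1, d, tzinfo=utc), datetime(2024, 1, d + 1, tzinfo=utc))
    for d in range(1, 5)
]
eligible = compaction_candidates(days, timedelta(days=1))
```

With a one-day offset, the newest chunk (Jan 4) is left alone and the three older chunks remain eligible.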

The trade-off surface here — count versus parallelism, memory versus throughput — is deep enough to warrant its own treatment in compaction threshold tuning, and the scheduling of when these tasks fire is covered under automated compaction task scheduling.

Retention rules

Retention is enforced at the datasource level through an ordered list of Coordinator load and drop rules, submitted to POST /druid/coordinator/v1/rules/{dataSource}. Rules are evaluated top-to-bottom; the first matching rule wins for a given interval.

[
  { "type": "loadByPeriod", "period": "P7D", "includeFuture": true,
    "tieredReplicants": { "hot": 2 } },
  { "type": "loadByPeriod", "period": "P90D",
    "tieredReplicants": { "cold": 1 } },
  { "type": "dropForever" }
]

This ladder keeps the most recent 7 days on the hot tier with 2 replicas, the next 83 days on a single-replica cold tier, and drops everything older. dropForever as the terminal rule is what actually flips aging segments to unused. Timezone offsets, late-arriving data, and compaction-induced interval shifts all interact with these windows — the full retention rule grammar and cleanup workflow is detailed under TTL mapping and data expiration, and the access-control angle on who may change these rules is covered in configuring segment retention policies.
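The first-match-wins evaluation can be sketched in a few lines of Python. This is a simplified model rather than the Coordinator's implementation: a "days" key replaces the ISO-8601 period string so no period parser is needed, a rule matches when its window overlaps the segment interval, and a missing includeFuture is treated as true (an assumption to verify against your Druid version). The name first_matching_rule is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# The ladder from the text; "days" stands in for the "period" string.
RULES = [
    {"type": "loadByPeriod", "days": 7, "includeFuture": True,
     "tieredReplicants": {"hot": 2}},
    {"type": "loadByPeriod", "days": 90, "tieredReplicants": {"cold": 1}},
    {"type": "dropForever"},
]

_FAR_FUTURE = datetime.max.replace(tzinfo=timezone.utc)


def first_matching_rule(start, end, now, rules=RULES):
    """Return the first rule whose window overlaps the interval [start, end)."""
    for rule in rules:
        if rule["type"] == "dropForever":
            return rule  # terminal rule: matches every interval
        window_start = now - timedelta(days=rule["days"])
        # includeFuture extends the window past "now"; otherwise it stops at "now".
        window_end = _FAR_FUTURE if rule.get("includeFuture", True) else now
        if start < window_end and end > window_start:
            return rule
    return None


utc = timezone.utc
now = datetime(2024, 6, 1, tzinfo=utc)
recent = first_matching_rule(datetime(2024, 5, 30, tzinfo=utc), datetime(2024, 5, 31, tzinfo=utc), now)
older = first_matching_rule(datetime(2024, 4, 1, tzinfo=utc), datetime(2024, 4, 2, tzinfo=utc), now)
ancient = first_matching_rule(datetime(2023, 1, 1, tzinfo=utc), datetime(2023, 1, 2, tzinfo=utc), now)
```

Because evaluation stops at the first match, the two-day-old interval lands on the hot tier even though the 90-day rule would also match it.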

Kill (permanent removal)

Drop rules only mark segments unused and unload them from Historicals; the segment files in deep storage and their rows in the metadata store persist until a kill task removes them. Enable the Coordinator's automatic kill duty in the runtime properties:

druid.coordinator.kill.on=true
druid.coordinator.kill.period=PT1H
druid.coordinator.kill.durationToRetain=P30D
druid.coordinator.kill.maxSegments=1000

kill.on — master switch for automatic kill of unused segments.
kill.period — how often the Coordinator issues kill tasks.
durationToRetain — a safety window: unused segments whose interval ended more recently than this are not killed, giving operators a recovery window to markUsed after an accidental drop.
maxSegments — caps how many segments a single kill task removes, bounding metadata-store and deep-storage load.
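The interaction of the used flag, durationToRetain, and maxSegments can be illustrated with a small Python sketch. It is a simplified model, not the Coordinator's kill duty: segments are plain dicts with hypothetical "id", "used", and "end" keys, and eligibility is judged on the interval end alone.

```python
from datetime import datetime, timedelta, timezone


def kill_eligible(segments, now, duration_to_retain, max_segments):
    """Pick unused segments whose interval ended before the retain window."""
    cutoff = now - duration_to_retain
    candidates = [s for s in segments if not s["used"] and s["end"] <= cutoff]
    candidates.sort(key=lambda s: s["end"])  # oldest first
    return candidates[:max_segments]  # cap the batch, like maxSegments


utc = timezone.utc
now = datetime(2024, 6, 1, tzinfo=utc)
segments = [
    {"id": "a", "used": False, "end": datetime(2024, 1, 1, tzinfo=utc)},   # old and unused
    {"id": "b", "used": False, "end": datetime(2024, 5, 20, tzinfo=utc)},  # inside the 30-day window
    {"id": "c", "used": True, "end": datetime(2023, 1, 1, tzinfo=utc)},    # still used
]
to_kill = kill_eligible(segments, now, timedelta(days=30), max_segments=1000)
```

Only segment "a" qualifies: "b" is protected by the retain window and "c" is still marked used.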

Operational Sizing & Constraints

Storage optimization is a continuous calibration between physical segment size, column cardinality, and query access patterns. The columnar format compresses low-cardinality dimensions and numeric metrics well, but ratios vary with data distribution and encoding. Oversized segments materialize excessive column vectors into JVM heap and trigger long GC pauses; undersized segments multiply metadata overhead and fragment scans across too many threads.

The target band

The widely used target is 300–700 MB per segment file, or approximately 5 million rows per segment as a starting point. To convert a byte target into a row ceiling, measure the average compressed bytes per row for the datasource and solve:

$$ \text{maxRowsPerSegment} \approx \frac{\text{targetBytes}}{\text{avgCompressedBytesPerRow}} $$

For a datasource whose segments compress to roughly 180 bytes/row and a 700 MB target:

$$ \text{maxRowsPerSegment} \approx \frac{700 \times 1{,}048{,}576}{180} \approx 4{,}078{,}000 $$

Rounding to maxRowsPerSegment: 4000000 lands output segments near the top of the band without overshooting. The detailed calibration workflow — measuring real bytes/row, accounting for rollup ratio, and iterating — lives under segment size optimization strategies and the Historical-node-specific view under optimizing segment size for Historical nodes.
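The same arithmetic as a small Python helper, useful when the calculation is repeated per datasource. The function name and the round_to parameter are illustrative choices, not Druid settings. The helper rounds down so the output stays under the byte target.

```python
def max_rows_per_segment(target_bytes, avg_bytes_per_row, round_to=100_000):
    """Return (exact row ceiling, ceiling rounded down to a multiple of round_to)."""
    if avg_bytes_per_row <= 0:
        raise ValueError("avg_bytes_per_row must be positive")
    exact = int(target_bytes // avg_bytes_per_row)
    rounded = (exact // round_to) * round_to
    return exact, rounded


# 700 MB target (700 * 1,048,576 bytes) at 180 compressed bytes per row.
exact_rows, rounded_rows = max_rows_per_segment(700 * 1_048_576, 180)
```

Here exact_rows is 4077795 and rounded_rows is 4000000, matching the worked example.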

Historical capacity and replication

Total loaded bytes a Druid cluster must hold is a function of the retention window, ingest rate, rollup, and replication factor:

$$ \text{loadedBytes} \approx \text{retentionDays} \times \text{dailyCompressedBytes} \times \text{replicationFactor} $$

Historical page-cache efficiency depends on keeping the hot working set within the RAM left free for the operating system. A common heuristic is to give each Historical disk capacity under druid.segmentCache.locations of 2–3× the segment bytes it is expected to be assigned, and to size free memory so the OS page cache can hold the hot segments. Under-provisioning memory forces cold reads from disk on every scan; over-provisioning disk wastes SSD. Tiering older intervals to a cold tier with a single replica ("cold": 1 in tieredReplicants, as in the retention ladder above) is the primary lever for bounding cost, and is expanded under reducing Historical node storage costs.
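Applied per tier, the loadedBytes formula becomes a sum over the retention ladder. The Python sketch below assumes a constant daily compressed volume and uses the 7-day hot and 83-day cold split from the retention example. The 50 GB/day figure is purely illustrative, and the function name is hypothetical.

```python
def loaded_bytes_by_tier(daily_compressed_bytes, tiers):
    """tiers: mapping of tier name -> (days retained on that tier, replicas)."""
    return {
        name: days * daily_compressed_bytes * replicas
        for name, (days, replicas) in tiers.items()
    }


GB = 10**9
ladder = {"hot": (7, 2), "cold": (83, 1)}
per_tier = loaded_bytes_by_tier(50 * GB, ladder)  # illustrative 50 GB/day
total = sum(per_tier.values())
```

At 50 GB/day the hot tier holds 700 GB and the cold tier 4,150 GB. Keeping all 90 days at two replicas would instead need 9,000 GB, which shows how much the single-replica cold tier saves.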

Rollup leverage

When rollup: true, the effective row count after ingestion is the number of distinct dimension combinations per queryGranularity bucket, not raw event count. The rollup ratio

$$ \text{rollupRatio} = \frac{\text{rawRows}}{\text{rolledUpRows}} $$

directly multiplies storage efficiency: a rollup ratio of about 8× means a segment holds 8× the raw events for the same byte budget. Coarsening queryGranularity (e.g. MINUTE → HOUR) raises the ratio at the cost of temporal resolution. This is a lifecycle decision, since it can only be changed for existing data by re-compacting with a new granularitySpec.
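A toy simulation makes the granularity effect visible. The Python sketch below groups events by a truncated timestamp plus their dimension tuple, which approximates what rollup does, and reports rawRows / rolledUpRows. The strftime patterns are stand-ins for queryGranularity, and the event data is invented for illustration.

```python
from datetime import datetime


def rollup_ratio(events, bucket_format):
    """events: iterable of (datetime, dims tuple); returns rawRows / rolledUpRows."""
    raw_rows = 0
    rolled_up_keys = set()
    for ts, dims in events:
        raw_rows += 1
        # One rolled-up row per distinct (time bucket, dimension combination).
        rolled_up_keys.add((ts.strftime(bucket_format), dims))
    return raw_rows / len(rolled_up_keys) if rolled_up_keys else 0.0


# Six events in one hour, ten minutes apart, all with the same dimensions.
events = [(datetime(2024, 1, 1, 10, m), ("en", "edit")) for m in range(0, 60, 10)]
minute_ratio = rollup_ratio(events, "%Y-%m-%dT%H:%M")  # MINUTE-like buckets
hour_ratio = rollup_ratio(events, "%Y-%m-%dT%H")       # HOUR-like buckets
```

At minute resolution nothing merges (ratio 1.0); at hour resolution all six events collapse into one row (ratio 6.0).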

Pipeline Orchestration Patterns

Modern ingestion pipelines treat segment management as a programmable infrastructure layer. DevOps teams embed lifecycle controls into CI/CD workflows using Python orchestrators — Apache Airflow, Prefect, or Dagster — to drive compaction config, retention updates, and cluster health checks. These same patterns are the connective tissue between this area and the broader automated ingestion pipeline orchestration work; idempotency and backoff are the shared discipline.

The core pattern: verify state before acting, submit idempotently, and poll with exponential backoff. The snippet below (stdlib + requests only) reads the current compaction config, checks whether a datasource already has a config, and posts one only if it is missing or drifted.

import time
import requests

COORD = "http://coordinator:8081"


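def get_compaction_config(datasource, session):
    # Fetch all auto-compaction configs and pick out the one for this datasource.
    r = session.get(f"{COORD}/druid/coordinator/v1/config/compaction", timeout=30)
    r.raise_for_status()
    configs = {c["dataSource"]: c for c in r.json().get("compactionConfigs", [])}
    return configs.get(datasource)


def has_drifted(desired, current):
    # Compare only the keys the caller set, so fields the caller left unset
    # in the stored config do not count as drift.
    if isinstance(desired, dict):
        return not isinstance(current, dict) or any(
            has_drifted(v, current.get(k)) for k, v in desired.items()
        )
    return desired != current


def upsert_compaction_config(desired, session):
    current = get_compaction_config(desired["dataSource"], session)
    if current is not None and not has_drifted(desired, current):
        return "unchanged"
    # Re-posting an identical spec overwrites the stored config with the same values.
    r = session.post(
        f"{COORD}/druid/coordinator/v1/config/compaction",
        json=desired, timeout=30,
    )
    r.raise_for_status()
    return "applied"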


def poll_compaction_status(datasource, session, attempts=8):
    delay = 1.0
    for _ in range(attempts):
        r = session.get(
            f"{COORD}/druid/coordinator/v1/compaction/status",
            params={"dataSource": datasource}, timeout=30,
        )
        r.raise_for_status()
        rows = r.json().get("latestStatus", [])
        if rows and rows[0].get("bytesAwaitingCompaction", 0) == 0:
            return "converged"
        time.sleep(delay)
        delay = min(delay * 2, 60)  # exponential backoff, capped at 60s
    return "pending"


if __name__ == "__main__":
    spec = {
        "dataSource": "wikipedia",
        "skipOffsetFromLatest": "P1D",
        "tuningConfig": {
            "type": "index_parallel",
            "partitionsSpec": {"type": "dynamic", "maxRowsPerSegment": 4000000},
            "maxNumConcurrentSubTasks": 4,
        },
    }
    with requests.Session() as s:
        print(upsert_compaction_config(spec, s))
        print(poll_compaction_status("wikipedia", s))

The same shape applies to retention updates (GET/POST /druid/coordinator/v1/rules/{ds}) and kill submission (POST /druid/indexer/v1/task with a kill payload). Wrap every mutating call so it validates deep-storage availability, monitors task completion, and alerts on Historical memory saturation. By treating lifecycle operations as first-class pipeline stages, teams achieve predictable storage decay and consistent query performance rather than reactive firefighting.
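For the kill path, a hedged sketch of building and submitting the payload might look like the following. The payload keys (type, dataSource, interval) follow the kill task described above. The Overlord address is a placeholder, the helper names are hypothetical, and session is expected to be a requests.Session supplied by the caller. Interval bounds are compared as same-format ISO-8601 strings, which is an assumption of the sketch.

```python
OVERLORD = "http://overlord:8090"  # placeholder address; adjust for your cluster


def build_kill_task(datasource, start, end):
    """Build a kill task payload for unused segments inside [start, end)."""
    # Same-format ISO-8601 strings compare chronologically as plain strings.
    if start >= end:
        raise ValueError("interval start must be before end")
    return {"type": "kill", "dataSource": datasource, "interval": f"{start}/{end}"}


def submit_kill_task(payload, session, overlord=OVERLORD):
    """POST the payload to the Overlord and return the task id it reports."""
    r = session.post(f"{overlord}/druid/indexer/v1/task", json=payload, timeout=30)
    r.raise_for_status()
    return r.json()["task"]


payload = build_kill_task("wikipedia", "2023-01-01", "2023-02-01")
```

Keeping payload construction separate from submission lets a pipeline log and review the exact interval before anything irreversible is sent.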

Failure Modes & Diagnostics

Each scenario below follows symptom → root-cause → remediation.

Segment count climbs but bytesAwaitingCompaction never reaches zero. Symptom: GET /druid/coordinator/v1/compaction/status reports a large, non-decreasing backlog. Root cause: compaction tasks are starved of worker slots by higher-priority ingestion, or maxNumConcurrentSubTasks is too low for the fragment volume. Remediation: raise taskPriority toward (but below) streaming priority, add dedicated middle-manager capacity or a compaction-only tier, and increase maxNumConcurrentSubTasks.
Historical GC pauses spike after a compaction run. Symptom: long JVM pause times and query timeouts following consolidation. Root cause: over-compaction produced monolithic segments that exceed the target band, forcing large column vectors into heap. Remediation: lower maxRowsPerSegment back into the sizing band; verify actual output size via the metadata query below and re-compact if segments overshot.
Query planning latency grows on a busy datasource. Symptom: Broker spends increasing time in query planning; many tiny segments per interval. Root cause: under-compaction — streaming handoff created fragments faster than compaction merged them, or skipOffsetFromLatest is too large and never lets recent intervals compact. Remediation: shrink skipOffsetFromLatest, confirm the compaction duty is enabled, and inspect fragment counts per interval.
Deep-storage bill grows even though queries only touch recent data. Symptom: object-storage usage rises steadily; old intervals are unqueried. Root cause: drop rules unloaded segments from Historicals but no kill task ran, so unused rows and their deep-storage objects persist. Remediation: enable druid.coordinator.kill.on=true with a sane durationToRetain, or submit explicit kill tasks; confirm removal via the metadata store.
A recent interval suddenly returns empty results. Symptom: queries over an interval that had data return nothing. Root cause: a misconfigured dropByPeriod/dropBeforeByPeriod window (often a timezone or period-math error) marked still-needed segments unused, or a compaction with dropExisting: true tombstoned an interval with no input. Remediation: markUsed the affected segments if within durationToRetain, then correct the rule ladder; audit dropExisting usage.
Compaction task fails with an out-of-memory or "not enough available slots" error. Symptom: Overlord shows failed compaction tasks. Root cause: maxRowsInMemory too high for the worker heap, inputSegmentSizeBytes allowed a single task to pull an oversized interval, or maxNumConcurrentSubTasks asks for more subtask slots than the workers have free. Remediation: lower maxRowsInMemory, tighten inputSegmentSizeBytes, ensure worker -Xmx matches the tuning config, and reduce maxNumConcurrentSubTasks or add worker capacity when slots are the limit.

Diagnostic one-liners against the Coordinator (segment count and size per datasource):

# Without ?full the response carries a summary object with count and size
curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/wikipedia" \
  | jq '{segments: .segments.count, bytes: .segments.size}'

# Fragment count per interval: spot under-compacted time chunks
curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/wikipedia/intervals?simple" \
  | jq 'to_entries | map({interval: .key, count: .value.count}) | sort_by(-.count) | .[0:5]'
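The same check can run inside a pipeline instead of a shell. The Python sketch below assumes a map of interval to an object with count and size fields, the shape the Coordinator's intervals endpoint returns when queried with the simple flag. The thresholds and the function name are illustrative, not Druid defaults.

```python
def under_compacted(interval_stats, max_fragments=10, min_avg_bytes=300 * 1024 * 1024):
    """Flag intervals with many segments whose average size is below the band.

    interval_stats: {interval: {"count": int, "size": int}}.
    Returns (interval, count, average bytes) tuples, worst fragment count first.
    """
    flagged = []
    for interval, stats in interval_stats.items():
        count, size = stats["count"], stats["size"]
        if count > max_fragments and size / count < min_avg_bytes:
            flagged.append((interval, count, size // count))
    return sorted(flagged, key=lambda row: -row[1])


sample = {
    "2024-01-01T00:00:00.000Z/2024-01-02T00:00:00.000Z": {"count": 200, "size": 2_000_000_000},
    "2024-01-02T00:00:00.000Z/2024-01-03T00:00:00.000Z": {"count": 3, "size": 1_200_000_000},
}
report = under_compacted(sample)
```

The first interval is flagged (200 fragments averaging 10 MB each); the second, with three segments of roughly 400 MB, is left alone.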

Security & Access Control Boundaries

Lifecycle operations are among the most destructive actions in a Druid cluster — a kill task permanently deletes data, and a retention-rule change can silently unload months of history. Access to them must be gated. With the druid-basic-security extension (or an LDAP/OIDC authenticator), Druid enforces authorization through a resource/action model:

Datasource permissions. READ on a datasource governs querying; WRITE governs ingestion, compaction config, retention rules, and kill submission for that datasource. Grant WRITE narrowly — pipeline service accounts should hold WRITE only on the datasources they own.
CONFIG and STATE resources. Beyond datasources, Druid defines CONFIG and STATE resource types, each with READ and WRITE actions. WRITE on CONFIG gates cluster-wide Coordinator and Overlord settings, including the global kill duty. Reserve it for platform administrators, never pipeline runners.
Roles and multi-tenancy. Map roles to datasource prefixes so tenant A cannot compact, drop, or kill tenant B's data. In shared clusters, per-tenant service accounts scoped to a datasource namespace are the enforcement boundary; the broader model is covered under security boundaries for segment access.
Transport security. Enable TLS on Coordinator/Overlord REST endpoints (druid.enableTLSPort=true) so retention and compaction API calls carrying credentials are not sent in the clear. Store service-account credentials in a secrets manager, not in Airflow/Dagster DAG source.
Audit. Druid records config changes in the metadata audit table. Ship these to your SIEM so every retention-rule and compaction-config mutation is attributable — essential for compliance-driven retention windows.
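A prefix-scoped tenant role can be sketched as a list of permission objects. The resource/action shape below follows the model described above, and it assumes the authorizer interprets the resource name as a regular expression, as the druid-basic-security authorizer does per its documentation (verify this for your version). The allows function is only a local approximation for testing a role definition, not Druid's authorizer, and both helper names are hypothetical.

```python
import re


def tenant_permissions(prefix, actions=("READ", "WRITE")):
    """Build DATASOURCE permissions covering every datasource with this prefix."""
    pattern = re.escape(prefix) + ".*"
    return [
        {"resource": {"name": pattern, "type": "DATASOURCE"}, "action": action}
        for action in actions
    ]


def allows(permissions, datasource, action):
    """Local approximation: does any permission match this datasource and action?"""
    return any(
        p["action"] == action
        and p["resource"]["type"] == "DATASOURCE"
        and re.fullmatch(p["resource"]["name"], datasource) is not None
        for p in permissions
    )


tenant_a = tenant_permissions("tenant_a_")
```

With this role, allows(tenant_a, "tenant_a_clicks", "WRITE") is true while allows(tenant_a, "tenant_b_clicks", "WRITE") is false, which is the isolation the multi-tenancy bullet calls for.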

Monitoring & Alerting Hooks

Emit Druid metrics via the prometheus-emitter (or statsd-emitter) extension and alert on lifecycle health, not just node uptime. Key Druid metric names and what they signal:

segment/count — segments loaded per datasource/tier. A steady climb without a matching bytes increase indicates under-compaction. Prometheus: druid_segment_count.
segment/size — total loaded bytes per datasource; the input to capacity planning. Prometheus: druid_segment_size.
segment/unavailable/count — segments the Coordinator wants loaded but no Historical is serving. Alert if > 0 for more than a few minutes: it means query gaps.
segment/underReplicated/count — segments below their target replication. Alert on sustained non-zero to catch tier capacity shortfalls before a node loss causes an outage.
compact/segmentAnalyzer/fetchAndProcessMillis and interval/compacted/count — compaction throughput and coverage; a stalled interval/compacted/count flags a broken compaction duty.
segment/waitCompact/bytes (a.k.a. bytesAwaitingCompaction) — the compaction backlog. This is the single best leading indicator: alert when it exceeds a datasource-specific threshold (e.g. > 50 GB) or trends upward for over an hour.
task/run/time{taskType="compact"} and task/failed/count — compaction task duration and failures; page on repeated failures.

Suggested alert thresholds: segment/unavailable/count > 0 (warning at 5 min, critical at 15 min); segment/waitCompact/bytes above the per-datasource budget (warning); kill task age exceeding kill.period (the cleanup duty has stalled). A Grafana panel set for this area should include: a stacked time series of segment/size by tier, a single-stat for total segment/count, a segment/waitCompact/bytes trend line with the alert threshold marked, and a table of top datasources by fragment count. Pair Druid's own jvm/gc/pause metric on Historicals with segment size so an oversized-segment regression from over-compaction is visible immediately.
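The backlog rule ("above budget, or trending upward for over an hour") can be expressed as a small classifier over sampled values of the backlog metric. This Python sketch is independent of any particular alerting system. The sample format, the return labels, and the function name are assumptions made for illustration.

```python
def backlog_alert(samples, budget_bytes, trend_window_s=3600):
    """Classify a compaction backlog series.

    samples: list of (epoch_seconds, bytes_awaiting) tuples, oldest first.
    Returns "over_budget", "rising", or "ok".
    """
    if not samples:
        return "ok"
    latest_ts, latest_bytes = samples[-1]
    if latest_bytes > budget_bytes:
        return "over_budget"
    window_start = latest_ts - trend_window_s
    window = [value for ts, value in samples if ts >= window_start]
    # Only call it a trend when the history covers the whole window.
    has_full_window = samples[0][0] <= window_start
    rising = len(window) >= 2 and all(a < b for a, b in zip(window, window[1:]))
    return "rising" if has_full_window and rising else "ok"


GB = 10**9
# One sample every 15 minutes over an hour, growing steadily.
series = [(0, 10 * GB), (900, 12 * GB), (1800, 15 * GB), (2700, 19 * GB), (3600, 24 * GB)]
status = backlog_alert(series, budget_bytes=50 * GB)
```

For this series the result is "rising": the backlog is still under the 50 GB budget but has grown at every sample across the full hour.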

Related

Automated Compaction Task Scheduling — externalize compaction timing to Airflow/Dagster/CronJobs so consolidation runs in off-peak windows.
Compaction Threshold Tuning — calibrate the row, size, and concurrency thresholds that decide when and how aggressively compaction fires.
Segment Size Optimization Strategies — measure real bytes/row and land segments in the 300–700 MB band.
TTL Mapping and Data Expiration — retention rule grammar, kill-task chaining, and compliant storage decay.
Segment Architecture & Lifecycle Fundamentals — the upstream reference on segment internals, granularity, and query routing that this area builds on.
