TTL Mapping and Data Expiration in Apache Druid

Apache Druid has no row-level DELETE; a time-to-live policy is instead expressed as a set of declarative retention rules the Coordinator evaluates against each datasource timeline, and physical reclamation happens as a separate kill task that erases dropped segments from deep storage and the metadata store. Getting TTL right is what keeps Historical nodes from paging in years of cold data, keeps the metadata store's segment table from bloating, and keeps the cloud object-storage bill proportional to the data you actually query. This page sits under Segment Compaction, Retention & Storage Optimization and focuses specifically on the expiration layer — how a retention rule ladder decides what stays queryable, how dropped-but-not-deleted segments become reclaimable, and how to drive the whole lifecycle deterministically from an external scheduler.

Mechanics & Internals

Druid expiration is a two-stage process, and conflating the stages is the single most common source of "I dropped the data but my S3 bill didn't move" tickets.

Stage 1 — retention rules unload segments (Coordinator). Every coordination cycle the Coordinator runs a duty that walks each datasource's used-segment timeline and applies that datasource's ordered rule list to every interval. Rules are evaluated top to bottom, and the first matching rule wins for a given interval — exactly like a firewall ACL. A load rule marks the segment to be present on Historical tiers (with a replication count per tier); a drop rule marks it to be unloaded from Historicals. Dropping does not delete anything: the segment row in the metadata store still has used = 1, and the segment file still sits in deep storage. It is simply no longer served, so it no longer consumes Historical memory-mapped pages or JVM heap for its column metadata.

The rule types that express a TTL are period- and interval-scoped:

loadByPeriod — keep segments whose interval falls within a trailing ISO-8601 period (e.g. P90D) loaded. includeFuture (default true) also matches segments dated ahead of now, which matters for streaming datasources whose newest time chunks are slightly in the future.
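dropByPeriod — unload segments whose interval lies entirely inside the trailing period, i.e. the most recent data rather than the old tail. It is not the mirror image of loadByPeriod, so it is rarely the rule a TTL needs.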
dropBeforeByPeriod — unload everything before now − period, regardless of what later rules say; useful as an explicit floor.
loadByInterval / dropByInterval — pin an absolute ISO-8601 interval rather than a rolling window, for legal holds or one-off backfills.
loadForever / dropForever — the terminal catch-all. Every rule ladder should end in one so no interval is left unmatched (an unmatched interval falls back to Druid's default cluster rules, which is rarely what you intend).

A TTL is therefore a ladder: a loadByPeriod at the top that defines the hot/queryable window, optional interval loads for holds, and a dropForever at the bottom that sweeps everything else off the Historicals. Because the Coordinator applies rules asynchronously on its own cycle, a freshly POSTed rule set is eventually — not immediately — consistent; an orchestrator must poll for propagation before it assumes a segment is unloaded.
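
The first-match-wins behaviour of a ladder can be illustrated with a small simulation. This is a simplified sketch, not Druid's implementation: the `days` key and the tuple-based `interval` are hypothetical stand-ins for ISO-8601 strings, and the overlap test for loadByPeriod is an assumption about how a period rule matches a segment interval.

```python
from datetime import datetime, timedelta, timezone


def first_match(ladder, seg_start, seg_end, now):
    """Return the type of the first rule that matches the segment interval.

    Simplified model: periods use a hypothetical integer "days" key and
    intervals are (start, end) datetime tuples.
    """
    for rule in ladder:
        kind = rule["type"]
        if kind in ("loadForever", "dropForever"):
            return kind  # catch-all: matches every interval
        if kind == "loadByPeriod":
            window_start = now - timedelta(days=rule["days"])
            overlaps_window = seg_end > window_start
            # Without includeFuture the segment must also start before now.
            time_ok = rule.get("includeFuture", True) or seg_start < now
            if overlaps_window and time_ok:
                return kind
        elif kind == "loadByInterval":
            hold_start, hold_end = rule["interval"]
            if seg_start < hold_end and seg_end > hold_start:
                return kind
    return None  # nothing matched: the cluster default rules would apply


def utc(y, m, d):
    return datetime(y, m, d, tzinfo=timezone.utc)


now = utc(2026, 6, 1)
ladder = [
    {"type": "loadByPeriod", "days": 90, "includeFuture": True},
    {"type": "loadByInterval", "interval": (utc(2026, 1, 1), utc(2026, 1, 8))},
    {"type": "dropForever"},
]
for start, end in [(utc(2026, 5, 1), utc(2026, 5, 2)),
                   (utc(2026, 1, 3), utc(2026, 1, 4)),
                   (utc(2025, 6, 1), utc(2025, 6, 2))]:
    print(start.date(), first_match(ladder, start, end, now))
```

Running a candidate ladder through a model like this before POSTing it is a cheap way to confirm which rule each time chunk will hit.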

Stage 2 — the kill task reclaims deep storage (Overlord). To actually free bytes you submit a kill task to the Overlord. A kill task targets a datasource and an interval, finds segments in that interval that are marked unused (used = 0), deletes their files from deep storage, and removes their rows from the metadata store. The critical wrinkle: a drop rule unloads a segment but does not mark it unused. Marking-unused is what makes a segment eligible for kill, and it happens in one of two ways:

Automatically, when druid.coordinator.kill.on = true — the Coordinator's KillUnusedSegments duty periodically marks segments outside the retention rules as unused and, on a configurable cadence, issues kill tasks itself.
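Explicitly, via POST /druid/coordinator/v1/datasources/{ds}/markUnused, after which you submit the kill task yourself. (The older DELETE /druid/coordinator/v1/datasources/{ds}/intervals/{interval} endpoint is deprecated and submits a kill task for the interval rather than only marking segments unused, so it is not a drop-in substitute.)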

The control surface an orchestrator programs against is a small set of REST endpoints:

POST /druid/coordinator/v1/rules/{dataSource} — replace the full ordered rule list for a datasource (the body is a JSON array of rule objects).
GET /druid/coordinator/v1/rules/{dataSource} — read back the current rules.
GET /druid/coordinator/v1/rules/{dataSource}/history?count=N — audit the last N rule changes (the Coordinator keeps a rule-change audit log).
POST /druid/coordinator/v1/datasources/{dataSource}/markUnused — mark segments in an interval unused so they become kill-eligible.
POST /druid/indexer/v1/task — submit the kill task itself to the Overlord.
GET /druid/indexer/v1/task/{taskId}/status — poll kill-task completion.

Two internal details govern most behaviour. First, retention rules are matched against the segment's data interval, not its ingestion time — late-arriving data written into an old time chunk is subject to that chunk's rule, so it can be dropped almost immediately after ingestion if it lands behind the retention window. Second, kill is irreversible and metadata-destroying: once the row is gone from the metadata store, the segment cannot be re-loaded without re-ingesting from source. That asymmetry is why the safe pattern is drop early, kill late, leaving a reconsideration window between unload and physical delete.

The segments a TTL policy governs are the same immutable, columnar objects produced by ingestion and rewritten by automated compaction scheduling — the time-chunk boundaries that determine which segments a period rule can match are set by the segment granularity settings established at ingestion, so a DAY-granularity datasource can only express TTL at day resolution.

Validated Configuration Spec

A retention policy is a JSON array sent to POST /druid/coordinator/v1/rules/{dataSource}. The array replaces the datasource's rules wholesale (there is no partial update), so an orchestrator must always send the complete, ordered ladder. The block below is a copy-ready 90-day hot window with a 7-day audit hold and a terminal sweep; each field is described in the list that follows it. The rule grammar matches current stable Druid releases.

[
  {
    "type": "loadByPeriod",
    "period": "P90D",
    "includeFuture": true,
    "tieredReplicants": {
      "_default_tier": 2
    }
  },
  {
    "type": "loadByInterval",
    "interval": "2026-01-01T00:00:00.000Z/2026-01-08T00:00:00.000Z",
    "tieredReplicants": {
      "_default_tier": 1
    }
  },
  {
    "type": "dropForever"
  }
]

type — the rule kind; evaluated top-to-bottom, first match wins per interval.
period (loadByPeriod) — trailing ISO-8601 window kept queryable on Historicals. P90D = last 90 days relative to the Coordinator's clock, recomputed every cycle.
includeFuture — when true, also matches intervals dated after now; keep true for streaming datasources so in-flight future chunks are not accidentally dropped.
tieredReplicants — map of tier name → replica count for loaded segments. Use it to keep the hot window at 2 replicas but demote the audit hold to 1 replica, halving its Historical footprint.
loadByInterval.interval — absolute ISO-8601 interval for a legal/audit hold that must survive independent of the rolling window; placed above dropForever so it wins.
dropForever — terminal catch-all; unloads every interval not matched above. Every ladder must end here (or in loadForever) so nothing falls through to cluster defaults.
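
Because a POST replaces the whole ladder, a cheap pre-flight check can catch structural mistakes before they reach the Coordinator. The following is a minimal sketch under stated assumptions: `validate_ladder` is a hypothetical helper, and it checks only the conventions described in the list above (terminal rule, required fields, sane replica counts), not everything Druid itself validates.

```python
TERMINAL = {"loadForever", "dropForever"}


def validate_ladder(ladder):
    """Return a list of problems found; an empty list means the ladder passed."""
    if not ladder:
        return ["ladder is empty"]
    problems = []
    # The ladder should end in a catch-all so nothing falls to cluster defaults.
    if ladder[-1].get("type") not in TERMINAL:
        problems.append("last rule must be loadForever or dropForever")
    for i, rule in enumerate(ladder):
        kind = rule.get("type", "")
        if kind.endswith("ByPeriod") and not rule.get("period"):
            problems.append(f"rule {i} ({kind}) is missing 'period'")
        if kind.endswith("ByInterval") and not rule.get("interval"):
            problems.append(f"rule {i} ({kind}) is missing 'interval'")
        # Replica counts, when present, must be non-negative integers.
        for tier, count in rule.get("tieredReplicants", {}).items():
            if not isinstance(count, int) or count < 0:
                problems.append(f"rule {i} ({kind}) has invalid replica count for {tier}")
    return problems


good = [
    {"type": "loadByPeriod", "period": "P90D", "includeFuture": True,
     "tieredReplicants": {"_default_tier": 2}},
    {"type": "dropForever"},
]
print(validate_ladder(good))
print(validate_ladder([{"type": "loadByPeriod"}]))
```

Returning a list of problems rather than raising on the first one lets a CI job report every issue in a single run.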

For a datasource where you want Druid to handle physical reclamation automatically rather than driving kill tasks yourself, enable the Coordinator kill duty in coordinator/runtime.properties:

# Enable automatic marking-unused + kill of segments outside retention rules
druid.coordinator.kill.on=true
# How often the kill duty runs (ISO-8601). Keep >> coordinator.period.
druid.coordinator.kill.period=PT1H
# Never kill data younger than this, regardless of rules — a safety floor.
druid.coordinator.kill.durationToRetain=P7D
# Max unused segments each kill task may delete (per task, not tasks in flight).
druid.coordinator.kill.maxSegments=100

druid.coordinator.kill.on — master switch; when false (a common conservative default) drop rules unload data but nothing is ever deleted from deep storage until you kill manually.
druid.coordinator.kill.durationToRetain — a hard floor that overrides the rules: even if a rule drops an interval, the kill duty will not delete anything newer than now − durationToRetain. This is your safety margin against a bad rule push.
druid.coordinator.kill.maxSegments — caps how many unused segments a single kill task deletes, which bounds Overlord and deep-storage load when a large backlog of unused segments has built up.

If instead you submit kill tasks explicitly, the task payload sent to POST /druid/indexer/v1/task is minimal:

{
  "type": "kill",
  "dataSource": "clickstream",
  "interval": "2025-01-01T00:00:00.000Z/2025-04-01T00:00:00.000Z"
}

Only segments already marked unused within that interval are deleted; loaded/used segments in the same interval are untouched, so an over-broad interval is safe with respect to live data (though wasteful of task time).
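
Rather than hard-coding the kill interval, an orchestrator can derive it from the retention window. This is a sketch under stated assumptions: `expired_interval` and `kill_payload` are hypothetical helpers, the end of the interval is aligned to UTC midnight (which suits DAY granularity only), and the 1970 floor relies on the point above that an over-broad interval touches only unused segments.

```python
from datetime import datetime, timedelta, timezone


def expired_interval(now, retention_days, safety_days=0,
                     floor="1970-01-01T00:00:00.000Z"):
    """Build an ISO-8601 interval covering everything older than the window.

    The end is UTC midnight of today minus (retention_days + safety_days).
    """
    midnight = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0)
    end = midnight - timedelta(days=retention_days + safety_days)
    return f"{floor}/{end.strftime('%Y-%m-%dT%H:%M:%S')}.000Z"


def kill_payload(datasource, interval):
    """Return the minimal kill task body for the Overlord."""
    return {"type": "kill", "dataSource": datasource, "interval": interval}


now = datetime(2026, 6, 1, 12, 34, tzinfo=timezone.utc)
interval = expired_interval(now, retention_days=90, safety_days=7)
print(kill_payload("clickstream", interval))
```

The `safety_days` argument is the drop-early, kill-late gap expressed in code: segments leave the Historicals at 90 days but are only deleted once they are 97 days old.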

Sizing Heuristics & Formulas

TTL sizing is where retention windows meet Historical capacity. The loaded footprint of a datasource under a period rule is the ingest rate times the window times replication:

$$\text{loadedBytes} \approx \text{ingestBytesPerDay} \times \text{retentionDays} \times \text{replicas}$$

For a datasource ingesting 120 GB/day of compressed segments under a P90D window at 2 replicas:

$$\text{loadedBytes} \approx 120 \times 90 \times 2 = 21{,}600\ \text{GB} \approx 21.6\ \text{TB}$$

That figure — not raw ingest — is what must fit inside aggregate Historical memory-map plus disk across the tier. Shrinking the window or demoting older intervals to 1 replica via tieredReplicants is the primary lever; the byte accounting for the disk tier itself is covered under reducing Historical node storage costs.

The retention period must be an integer multiple of the segment granularity or the boundary cuts through a time chunk and the whole chunk stays loaded until fully past the window:

$$\text{retentionPeriod} = k \times \text{segmentGranularity},\quad k \in \mathbb{Z}^{+}$$

A P90D window on DAY granularity is clean (k = 90); a P90D window on MONTH granularity is not. Every month chunk that overlaps the window stays loaded until it is fully past it, so the loaded span swings between roughly 90 and 120 days depending on the calendar. Size the window in whole granularity units to make the loaded set predictable.

For the deletion side, the reconsideration gap between drop and kill sets how much reclaimable-but-not-yet-reclaimed data you carry. If kill runs every killPeriod and refuses to delete data younger than durationToRetain, the standing volume of unused-but-present segments is roughly:

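$$\text{reclaimBacklogBytes} \approx \text{ingestBytesPerDay} \times \max\left(0,\ \text{durationToRetainDays} - \text{retentionDays}\right)$$

Here durationToRetainDays is durationToRetain expressed in days. That difference is the safety buffer you pay for in deep-storage cost; when durationToRetain is shorter than the retention window (P7D against P90D, for example) the term is zero and dropped data becomes kill-eligible on the next kill cycle. Set durationToRetain just wide enough to survive a bad rule push and a human noticing (commonly 2× the on-call response window), not arbitrarily large.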
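
The sizing formulas in this section are simple enough to keep as a small calculator next to the retention config. This is a minimal sketch: the function names are hypothetical, all inputs are in GB and whole days, and the backlog is clamped at zero as an assumption of this sketch, since a negative backlog has no physical meaning.

```python
def loaded_gb(ingest_gb_per_day, retention_days, replicas):
    """Loaded footprint on Historicals: ingest rate x window x replication."""
    return ingest_gb_per_day * retention_days * replicas


def is_aligned(retention_days, granularity_days):
    """True when the window is a whole multiple of the segment granularity."""
    return retention_days % granularity_days == 0


def reclaim_backlog_gb(ingest_gb_per_day, duration_to_retain_days, retention_days):
    """Deep-storage volume that is dropped but not yet kill-eligible."""
    return ingest_gb_per_day * max(0, duration_to_retain_days - retention_days)


print(loaded_gb(120, 90, 2))            # hot window at 2 replicas
print(is_aligned(90, 1), is_aligned(90, 7))
print(reclaim_backlog_gb(120, 120, 90))  # 30-day gap between drop and kill
print(reclaim_backlog_gb(120, 7, 90))    # floor shorter than the window
```

Note that the backlog uses the raw ingest rate without the replica multiplier, because deep storage holds one copy of each segment regardless of how many Historicals served it.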

Python Orchestration Snippet

A production TTL job is idempotent: it reads the current rules, applies the desired ladder only if it differs, waits for the Coordinator to converge, then marks-unused and kills the expired tail — each Overlord/Coordinator call wrapped in exponential backoff because both APIs are asynchronous and occasionally 503 during coordination cycles. This uses only the standard library plus requests.

import time
import logging
import requests

logger = logging.getLogger("druid.ttl")

COORDINATOR = "http://druid-coordinator:8081"
OVERLORD = "http://druid-overlord:8090"


def _request(method: str, url: str, *, retries: int = 5, **kwargs) -> requests.Response:
    """HTTP with exponential backoff on 5xx / connection errors; 4xx fails fast."""
    delay = 1.0
    for attempt in range(1, retries + 1):
        try:
            resp = requests.request(method, url, timeout=30, **kwargs)
        except (requests.ConnectionError, requests.Timeout) as exc:
            # Transient transport failure: log and fall through to the retry path.
            logger.warning("%s %s failed: %s (attempt %d)", method, url, exc, attempt)
        else:
            if resp.status_code < 500:
                # A 4xx means the request itself is wrong; raise now instead of retrying.
                resp.raise_for_status()
                return resp
            logger.warning("%s %s -> %s (attempt %d)", method, url, resp.status_code, attempt)
        if attempt == retries:
            raise RuntimeError(f"{method} {url} failed after {retries} attempts")
        time.sleep(delay)
        delay = min(delay * 2, 30.0)
    raise ValueError("retries must be at least 1")


def apply_rules_if_changed(datasource: str, desired: list) -> bool:
    """Replace retention rules only when they differ from what's live (idempotent)."""
    current = _request("GET", f"{COORDINATOR}/druid/coordinator/v1/rules/{datasource}").json()
    if current == desired:
        logger.info("Rules for %s already match desired ladder; no-op", datasource)
        return False
    _request(
        "POST",
        f"{COORDINATOR}/druid/coordinator/v1/rules/{datasource}",
        json=desired,
        headers={"Content-Type": "application/json"},
    )
    logger.info("Applied %d retention rules to %s", len(desired), datasource)
    return True


def mark_unused(datasource: str, interval: str) -> None:
    """Make segments in an interval eligible for kill."""
    _request(
        "POST",
        f"{COORDINATOR}/druid/coordinator/v1/datasources/{datasource}/markUnused",
        json={"interval": interval},
        headers={"Content-Type": "application/json"},
    )
    logger.info("Marked %s %s unused", datasource, interval)


def kill_interval(datasource: str, interval: str, poll_seconds: int = 10) -> str:
    """Submit a kill task and block until it succeeds; return the task id."""
    resp = _request(
        "POST",
        f"{OVERLORD}/druid/indexer/v1/task",
        json={"type": "kill", "dataSource": datasource, "interval": interval},
        headers={"Content-Type": "application/json"},
    )
    task_id = resp.json()["task"]
    logger.info("Kill task %s submitted for %s %s", task_id, datasource, interval)
    while True:
        status = _request(
            "GET", f"{OVERLORD}/druid/indexer/v1/task/{task_id}/status"
        ).json()["status"]["statusCode"]
        if status == "SUCCESS":
            logger.info("Kill task %s complete", task_id)
            return task_id
        if status == "FAILED":
            raise RuntimeError(f"Kill task {task_id} FAILED")
        time.sleep(poll_seconds)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    ladder = [
        {"type": "loadByPeriod", "period": "P90D", "includeFuture": True,
         "tieredReplicants": {"_default_tier": 2}},
        {"type": "dropForever"},
    ]
    apply_rules_if_changed("clickstream", ladder)
    expired = "2025-01-01T00:00:00.000Z/2025-04-01T00:00:00.000Z"
    mark_unused("clickstream", expired)
    kill_interval("clickstream", expired)

The ordering is deliberate: rules are applied first so the Coordinator stops serving the tail, then the tail is marked unused and killed. Reversing it — killing before the drop rule is live — races the Coordinator and can delete a segment it is simultaneously trying to load.
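
Between applying the rules and marking the tail unused, an orchestrator needs some way to wait for the Coordinator to act on the new ladder. One possible approach, sketched here, is to poll the Coordinator's load queue summary (GET /druid/coordinator/v1/loadqueue?simple) until no loads or drops are pending. The response shape is an assumption: a map keyed by Historical, each value carrying integer `segmentsToLoad` and `segmentsToDrop` fields. The function names are hypothetical, and the sketch uses only the standard library so it can be tested with a fake fetcher.

```python
import json
import time
import urllib.request


def fetch_load_queue(coordinator):
    """Read the per-Historical load queue summary (assumed response shape)."""
    url = f"{coordinator}/druid/coordinator/v1/loadqueue?simple"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def pending_moves(queue):
    """Total segments still waiting to be loaded or dropped across servers."""
    return sum(s.get("segmentsToLoad", 0) + s.get("segmentsToDrop", 0)
               for s in queue.values())


def wait_for_convergence(fetch, *, stable_polls=3, poll_seconds=10.0,
                         timeout_seconds=900.0, sleep=time.sleep):
    """Return True once the queue is empty for stable_polls polls in a row."""
    deadline = time.monotonic() + timeout_seconds
    quiet = 0
    while time.monotonic() < deadline:
        # Reset the streak whenever any load or drop is still pending.
        quiet = quiet + 1 if pending_moves(fetch()) == 0 else 0
        if quiet >= stable_polls:
            return True
        sleep(poll_seconds)
    return False


# Usage (needs a live Coordinator, so it is left commented out):
# ok = wait_for_convergence(lambda: fetch_load_queue("http://druid-coordinator:8081"))
```

Requiring several consecutive empty polls guards against the window right after a POST, when the queue is empty only because the Coordinator has not yet run its next cycle.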

Failure Modes & Diagnostics

1. Dropped data still costs money. Symptom: a dropForever rule is live, the interval no longer appears in queries, but deep-storage bytes and the metadata segments row count are unchanged. Root cause: drop only unloads; nothing was marked unused or killed. Diagnose the count of used-but-unloaded vs unused segments:

# How many segments does the metadata store still consider used for this datasource?
curl -s "http://druid-coordinator:8081/druid/coordinator/v1/metadata/datasources/clickstream/segments" \
  | jq 'length'

# Is the kill duty even enabled on the Coordinator?
curl -s "http://druid-coordinator:8081/status/properties" \
  | jq '{killOn: ."druid.coordinator.kill.on", retain: ."druid.coordinator.kill.durationToRetain"}'

Remediation: either enable druid.coordinator.kill.on, or explicitly markUnused + submit a kill task for the expired interval.

2. Rules didn't take effect. Symptom: you POSTed a ladder but old data is still loaded minutes later. Root cause: eventual consistency (the Coordinator applies on its own cycle) or a malformed rule that was silently rejected. Read the rules back and check the audit history:

# Confirm the live ladder is what you sent (POST replaces the whole array)
curl -s "http://druid-coordinator:8081/druid/coordinator/v1/rules/clickstream" | jq '.'

# Who changed the rules last, and to what?
curl -s "http://druid-coordinator:8081/druid/coordinator/v1/rules/clickstream/history?count=5" \
  | jq '.[] | {auditTime, comment, payload}'
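
When the live ladder and the intended one disagree, a positional diff shows exactly which rung differs. This is a small sketch with a hypothetical helper name; it compares rule objects by plain equality, so any default fields the Coordinator adds to its response would show up as differences.

```python
def diff_ladders(live, desired):
    """Return (index, live_rule, desired_rule) for every position that differs."""
    diffs = []
    for i in range(max(len(live), len(desired))):
        a = live[i] if i < len(live) else None
        b = desired[i] if i < len(desired) else None
        if a != b:
            diffs.append((i, a, b))
    return diffs


live = [{"type": "loadForever"}]
desired = [
    {"type": "loadByPeriod", "period": "P90D", "includeFuture": True,
     "tieredReplicants": {"_default_tier": 2}},
    {"type": "dropForever"},
]
for index, have, want in diff_ladders(live, desired):
    print(index, have, "->", want)
```

Position matters here because rule order decides which rule wins, so two ladders with the same rules in a different order are reported as different.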

3. First-match-wins masks a drop. Symptom: a dropByPeriod never fires because a broad loadForever sits above it. Root cause: rule ordering — the first matching rule wins, so a catch-all load shadows every drop below it. Diagnose by reading the ladder top-to-bottom and confirming no load* rule matches the interval you expect to drop; move dropForever to the bottom and any absolute holds above it.
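
The shadowing check can be automated. The sketch below, with a hypothetical helper name, flags every rule that sits beneath the first loadForever or dropForever, since a catch-all matches every interval and nothing below it can ever be reached.

```python
def shadowed_rules(ladder):
    """Return (index, type) for each rule placed below the first catch-all."""
    for i, rule in enumerate(ladder):
        if rule.get("type") in ("loadForever", "dropForever"):
            # Everything after the first catch-all is unreachable.
            return [(j, r.get("type")) for j, r in enumerate(ladder) if j > i]
    return []


broken = [
    {"type": "loadForever"},
    {"type": "dropByPeriod", "period": "P90D"},
]
healthy = [
    {"type": "loadByPeriod", "period": "P90D"},
    {"type": "dropForever"},
]
print(shadowed_rules(broken))
print(shadowed_rules(healthy))
```

This catches only shadowing by catch-all rules; a broad loadByInterval that covers a narrower rule below it would need an interval-overlap check on top.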

4. Kill task fails or hangs. Symptom: kill task sits in RUNNING or flips to FAILED. Root cause: deep-storage permission error (the Overlord/MiddleManager IAM role lacks s3:DeleteObject), or a lock conflict with a compaction/ingestion task on the same interval. Inspect the task log:

# Tail the kill task log for the deep-storage delete error
curl -s "http://druid-overlord:8090/druid/indexer/v1/task/kill_clickstream_.../log" | tail -50

# Any active locks on the interval you're killing?
curl -s "http://druid-overlord:8090/druid/indexer/v1/lockedIntervals" \
  -H 'Content-Type: application/json' \
  -d '{"clickstream": 0}' | jq '.'

5. Coordinator GC pressure from oversized loaded set. Symptom: retention window widened, then Coordinator/Historical old-gen GC climbs and rule evaluation lags. Root cause: the loaded byte estimate (see the sizing formula) exceeded tier capacity. Diagnose GC on the Coordinator and check per-tier load:

jstat -gcutil "$(pgrep -f 'org.apache.druid.cli.Main server coordinator')" 5s 4

# Segments loaded per tier vs capacity
curl -s "http://druid-coordinator:8081/druid/coordinator/v1/tiers" | jq '.'

Remediation: shrink the loadByPeriod window, demote older intervals to fewer tieredReplicants, or add Historical capacity.

6. Late data dropped before it's queried. Symptom: backfilled events for an old date never appear. Root cause: rules match on data interval, so late data landing behind the dropForever boundary is unloaded almost immediately. Remediation: widen the retention window to cover expected backfill latency, or gate the kill duty with a larger durationToRetain.

Automation Checklist

Related

Segment Compaction, Retention & Storage Optimization — the parent guide covering how retention, compaction, and sizing policies together govern the segment lifecycle.
Automated Compaction Task Scheduling — coordinate expiration with the compaction duty so intervals bound for deletion aren't recompacted first.
Configuring Druid Native Compaction Rules — the compact-task grammar and locking semantics that share time-chunk locks with kill tasks.
Segment Size Optimization Strategies — keep segment sizes in the query-optimal band so TTL boundaries align with efficient block sizes.
Reducing Historical Node Storage Costs — the tier-and-replica accounting that a retention window directly drives.
Understanding Druid Segment Granularity — the time-chunk boundaries that set the resolution at which any TTL period rule can match.

For authoritative rule and kill-task semantics, see the official Apache Druid data management reference.
