Automated Compaction Task Scheduling in Apache Druid

Automated compaction scheduling turns Druid's segment lifecycle from a reactive maintenance chore into a deterministic, self-healing background process: the Coordinator continuously inspects each datasource's timeline, finds intervals whose segments drift away from the target size, and submits compact tasks to reconsolidate them without human intervention. For OLAP data engineers and Python pipeline builders the goal is to align that consolidation with ingestion cadence, query windows, and storage budgets so segments stay in the query-optimal band automatically. This page sits under Segment Compaction, Retention & Storage Optimization and focuses specifically on the timing and orchestration layer — how the compaction duty decides what to run and when, and how to steer that decision from an external scheduler.

Mechanics & Internals

Druid ships two ways to run compaction, and automated scheduling almost always means the first one.

Native auto-compaction (Coordinator duty). The Coordinator runs a periodic duty, CompactSegments. By default it runs with the indexing duties every druid.coordinator.period.indexingPeriod (default PT1800S) rather than on every druid.coordinator.period cycle (default PT60S); a custom duty group declared through druid.coordinator.dutyGroups can give it a shorter period. On each pass it reads the compaction config for every datasource, walks that datasource's used-segment timeline newest interval first, and decides whether each time chunk needs compaction by comparing the segments it finds against the target state you declared (target row/size, segmentGranularity, indexSpec, partitioning). Intervals that already match are skipped; intervals that don't are queued as compact tasks and dispatched to the Overlord, bounded by the available task slots. Because the search is newest-first and honours skipOffsetFromLatest, recent, still-mutating data is left alone while the tail of the timeline is quietly kept in shape.

Manual compact tasks. You can also POST a one-off compact task straight to the Overlord at POST /druid/indexer/v1/task. That bypasses the duty entirely and is what you reach for in backfills or targeted reprocessing. The task grammar and its locking semantics are covered in depth under Configuring Druid Native Compaction Rules; this page assumes you drive the duty and only fall back to manual submission for exceptions.

The control surface for the duty is a small set of Coordinator REST endpoints. A Python scheduler treats these as the API it programs against:

POST /druid/coordinator/v1/config/compaction — upsert the auto-compaction config for one datasource (the payload is a DataSourceCompactionConfig).
GET /druid/coordinator/v1/config/compaction — read back the full cluster-wide compaction config.
DELETE /druid/coordinator/v1/config/compaction/{dataSource} — remove a datasource from auto-compaction.
POST /druid/coordinator/v1/config/compaction/taskslots — set the global cap and ratio of Coordinator task slots compaction may consume.
GET /druid/coordinator/v1/compaction/status — per-datasource compaction status (scheduling state and, where computed, waiting/compacted/skipped byte counts).
GET /druid/coordinator/v1/compaction/progress?dataSource={ds} — bytes still awaiting compaction for one datasource.

Two internal details drive most scheduling behaviour. First, the newest-first iterator means a datasource that has fallen far behind will compact its recent intervals before its old ones — so a compaction backlog manifests as stale old intervals, not stale new ones. Second, the compact task holds a time-chunk lock over the interval it rewrites; if an ingestion task is writing the same interval, one of them waits. That interaction is exactly why skipOffsetFromLatest exists, and why it must be sized against real ingestion lag rather than guessed.
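The newest-first search and the skip window can be modelled in a few lines. This is a simplified sketch, assuming fixed-length time chunks and a skip window measured back from the end of the latest chunk. The function and parameter names are illustrative, not Druid's, and the already_compacted set merely stands in for the duty's comparison against the declared target state.

```python
from datetime import datetime, timedelta, timezone


def compaction_candidates(chunk_starts, chunk_len, skip_offset, already_compacted=frozenset()):
    """Return time-chunk start times eligible for compaction, newest first.

    Illustrative model only: a chunk is eligible when it ends at or before
    (end of latest chunk - skip_offset) and is not already in the target state.
    """
    if not chunk_starts:
        return []
    latest_end = max(chunk_starts) + chunk_len
    cutoff = latest_end - skip_offset
    eligible = [
        start
        for start in chunk_starts
        if start + chunk_len <= cutoff and start not in already_compacted
    ]
    # Newest interval first, mirroring the order described above.
    return sorted(eligible, reverse=True)


if __name__ == "__main__":
    day = timedelta(days=1)
    starts = [datetime(2024, 1, d, tzinfo=timezone.utc) for d in range(1, 6)]
    for start in compaction_candidates(starts, day, skip_offset=day):
        print(start.date())
```

With five daily chunks and a one-day skip window, the newest chunk is protected and the other four come back in reverse chronological order, which is why a limited slot budget leaves the oldest chunks waiting longest.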

The segments the duty produces obey the same immutability and columnar-encoding contract as any freshly ingested segment — compaction is a read-old-write-new operation, never an in-place edit. The upstream rules that define what a segment is (time chunking, dictionary encoding, partition boundaries) are covered under segment architecture and lifecycle fundamentals and the segment granularity settings that govern those boundaries.

Validated Configuration Spec

The unit of automated scheduling is a DataSourceCompactionConfig sent to POST /druid/coordinator/v1/config/compaction. The Coordinator merges it into the global compaction config and the duty picks it up on its next run. Every field is explained in the field-by-field list that follows the block; the block is copy-ready against a current stable Druid (it matches the partitionsSpec-based sizing model; the byte-based targetCompactionSizeBytes was removed in Druid 0.21).

{
  "dataSource": "analytics_events",
  "taskPriority": 25,
  "skipOffsetFromLatest": "P1D",
  "tuningConfig": {
    "partitionsSpec": {
      "type": "range",
      "partitionDimensions": ["tenant_id"],
      "targetRowsPerSegment": 5000000
    },
    "maxNumConcurrentSubTasks": 4,
    "indexSpec": {
      "dimensionCompression": "lz4",
      "metricCompression": "lz4",
      "longEncoding": "longs"
    }
  },
  "granularitySpec": {
    "segmentGranularity": "DAY",
    "queryGranularity": "HOUR",
    "rollup": true
  },
  "ioConfig": {
    "dropExisting": false
  },
  "taskContext": {
    "forceTimeChunkLock": true
  }
}

Field-by-field:

dataSource — the datasource this config governs. One config object per datasource; there is no wildcard.
taskPriority — Overlord priority assigned to the compact tasks the duty spawns. Keep it below your ingestion task priority (ingestion typically 50+) so compaction never preempts live loads.
skipOffsetFromLatest — an ISO-8601 duration measured back from the latest segment. Intervals inside this window are never auto-compacted, protecting still-mutating recent data from lock contention with ingestion. This is the single most consequential scheduling knob; size it above your worst-case ingestion watermark lag.
tuningConfig.partitionsSpec — declares the target segment shape. type: range (or hashed) with targetRowsPerSegment is the modern way to control output size; partitionDimensions set secondary partitioning that improves pruning for those filters. Use dynamic with maxRowsPerSegment if you don't need clustered partitioning.
tuningConfig.maxNumConcurrentSubTasks — parallelism within a single compact task's index_parallel engine. Multiply this by concurrent compact tasks to size total MiddleManager/Indexer worker demand.
tuningConfig.indexSpec — carries forward compression/encoding for the rewritten segments; changing dimensionCompression here (e.g. lz4 → zstd) lets compaction double as a re-encoding pass. How those codecs affect footprint is detailed under columnar storage formats.
granularitySpec.segmentGranularity — the target time-chunk size. Setting this coarser than ingestion (e.g. ingest HOUR, compact to DAY) is the classic pattern for collapsing many small hourly segments into fewer daily ones.
granularitySpec.queryGranularity / rollup — re-assert rollup during compaction; leave rollup: true only if the source data was rolled up, otherwise you silently change query semantics.
ioConfig.dropExisting — when true, compaction drops (marks unused) any existing segments in the compacted interval that the new output doesn't replace. Leave false unless you specifically want tombstone-based interval replacement.
taskContext.forceTimeChunkLock — forces an exclusive time-chunk lock rather than segment locks, preventing concurrent writers from interleaving on the same interval.

Task-slot allocation is a separate cluster-wide config and is easy to forget:

{
  "compactionTaskSlotRatio": 0.1,
  "maxCompactionTaskSlots": 10
}

These two values are set through query parameters rather than a request body: POST /druid/coordinator/v1/config/compaction/taskslots?ratio=0.1&max=10. The JSON above is how they appear when the compaction config is read back. The effective slot budget is $\min(\lfloor \text{ratio} \times \text{totalTaskSlots} \rfloor,\ \text{max})$. With the default ratio of 0.1, a cluster with fewer than ten task slots computes a budget of zero, the most common reason a correctly configured datasource never compacts.
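The budget formula is easy to check before applying a ratio. A minimal sketch, assuming the total task-slot count is already known from the cluster's worker capacity; the function name is illustrative.

```python
import math


def effective_compaction_slots(ratio: float, total_task_slots: int, max_slots: int) -> int:
    """Compute min(floor(ratio * totalTaskSlots), max) as stated in the text."""
    return min(math.floor(ratio * total_task_slots), max_slots)


if __name__ == "__main__":
    # An 8-slot cluster with the default ratio of 0.1 computes a budget of 0.
    print(effective_compaction_slots(0.1, 8, 10))
    # Raising the ratio to 0.5 on the same cluster gives 4 slots.
    print(effective_compaction_slots(0.5, 8, 10))
```

A CI step can fail the deploy whenever this returns 0 for the cluster it targets.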

Sizing Heuristics & Formulas

Three numbers decide whether automated scheduling keeps up: the row target, the skip offset, and the task-slot budget.

Row target from a byte target. Because output size is controlled by rows, translate your desired compressed segment size into a row count using the measured average bytes-per-row:

$$\text{targetRowsPerSegment} \approx \frac{\text{targetBytes}}{\text{avgRowBytes}}$$

For a 700 MB target on data that measures 140 bytes/row compressed:

$$\text{targetRowsPerSegment} \approx \frac{700 \times 1048576}{140} \approx 5.24 \times 10^{6}$$

Round to 5000000. Measure avgRowBytes from real segments (see the diagnostics below) rather than guessing — it is the term that most often makes segments land outside the band. The full method for landing in the 500 MB–1 GB range lives in Segment Size Optimization Strategies.
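The same arithmetic as a small helper, so the row target can be recomputed whenever the measured row size changes. This is a sketch; the function name and the round_to parameter are illustrative choices, not part of any Druid API.

```python
def target_rows_per_segment(target_bytes: int, avg_row_bytes: float, round_to: int = 1) -> int:
    """Translate a byte target into a row target: targetBytes / avgRowBytes.

    round_to snaps the result to a multiple (for example 1_000_000) so the
    value stays stable across small changes in the measured row size.
    """
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    if round_to <= 0:
        raise ValueError("round_to must be positive")
    rows = target_bytes / avg_row_bytes
    return max(round_to, int(round(rows / round_to)) * round_to)


if __name__ == "__main__":
    exact = target_rows_per_segment(700 * 1048576, 140)
    rounded = target_rows_per_segment(700 * 1048576, 140, round_to=1_000_000)
    print(exact, rounded)
```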

Skip offset from ingestion lag. The skip window must cover the interval that ingestion might still touch, plus headroom:

$$\text{skipOffsetFromLatest} \geq \text{segmentGranularity} + \text{maxIngestionLag} + \text{safetyMargin}$$

For DAY segments fed by a stream whose worst-case handoff lag is about 2 h, the floor is P1D plus those 2 h plus margin, for example P1DT4H with a 2 h margin. A bare P1D is the common default, but it satisfies the inequality only when lag and margin are negligible; aggressive offsets such as PT0S are what produce TaskLock conflicts.
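The inequality can be evaluated from monitoring data and rendered as an ISO-8601 duration for the config. A minimal sketch using only the standard library; both function names are illustrative, and the formatter covers days, hours, minutes, and whole seconds only.

```python
from datetime import timedelta


def min_skip_offset(segment_granularity: timedelta,
                    max_ingestion_lag: timedelta,
                    safety_margin: timedelta) -> timedelta:
    """Lower bound: segmentGranularity + maxIngestionLag + safetyMargin."""
    return segment_granularity + max_ingestion_lag + safety_margin


def to_iso8601_duration(delta: timedelta) -> str:
    """Format a non-negative timedelta as an ISO-8601 duration (microseconds dropped)."""
    if delta < timedelta(0):
        raise ValueError("duration must be non-negative")
    total_seconds = delta.days * 86400 + delta.seconds
    days, rem = divmod(total_seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    date_part = f"{days}D" if days else ""
    time_part = "".join(
        f"{value}{unit}"
        for value, unit in ((hours, "H"), (minutes, "M"), (seconds, "S"))
        if value
    )
    if not date_part and not time_part:
        return "PT0S"
    return "P" + date_part + ("T" + time_part if time_part else "")


if __name__ == "__main__":
    floor = min_skip_offset(timedelta(days=1), timedelta(hours=2), timedelta(hours=2))
    print(to_iso8601_duration(floor))
```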

Throughput check — can the duty keep up? Compaction keeps pace only if it can rewrite bytes at least as fast as ingestion produces compactible bytes:

$$T_{\text{drain}} \approx \frac{\text{bytesAwaitingCompaction}}{\text{slots} \times \text{throughputPerSlot}}$$

If bytesAwaitingCompaction (segment/waitCompact/bytes) trends upward across cycles, either maxCompactionTaskSlots is too low or skipOffsetFromLatest is so large the backlog can never be reached. The interplay of these thresholds with parallelism is developed further in Compaction Threshold Tuning.
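The drain estimate is a single division, but wrapping it makes the zero-slot case explicit. A sketch under the assumption that per-slot throughput has been measured from completed compact tasks; the names are illustrative.

```python
def drain_time_seconds(bytes_awaiting: int, slots: int, throughput_per_slot: float) -> float:
    """Estimate T_drain = bytesAwaitingCompaction / (slots * throughputPerSlot).

    throughput_per_slot is in bytes per second. With no slots or no
    throughput the backlog never drains, so infinity is returned.
    """
    if slots <= 0 or throughput_per_slot <= 0:
        return float("inf")
    return bytes_awaiting / (slots * throughput_per_slot)


if __name__ == "__main__":
    # 100 GiB backlog, 4 slots, 50 MiB/s per slot.
    seconds = drain_time_seconds(100 * 1024**3, 4, 50 * 1024**2)
    print(f"{seconds:.0f} s")
```

Comparing this estimate with the interval between ingestion bursts shows whether the backlog can clear before new compactible bytes arrive.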

Python Orchestration Snippet

Even with the native duty doing the work, teams manage the config declaratively: version the DataSourceCompactionConfig in Git, apply it from CI/CD, and poll status so the pipeline can assert convergence. This orchestrator uses only the standard library plus requests, with a reusable session, exponential backoff on retryable status codes (POST is included because the config upsert is idempotent), and a bounded convergence poll. It fits directly inside an Airflow/Dagster task or the async submission patterns described under asynchronous task execution.

import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

COORDINATOR = "https://druid-coordinator.internal:8081"


def _session(retries: int = 4) -> requests.Session:
    s = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=1.5,  # sleeps roughly 0s, 3s, 6s, 12s between retries
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET", "POST", "DELETE"],
    )
    adapter = HTTPAdapter(max_retries=retry)
    s.mount("http://", adapter)
    s.mount("https://", adapter)
    return s


def apply_compaction_config(config: dict, session: requests.Session) -> None:
    """Upsert the auto-compaction config for one datasource."""
    resp = session.post(
        f"{COORDINATOR}/druid/coordinator/v1/config/compaction",
        json=config,
        headers={"Content-Type": "application/json"},
        timeout=(5, 30),
    )
    resp.raise_for_status()


def set_task_slots(ratio: float, maximum: int, session: requests.Session) -> None:
    """Ensure the duty actually has slots to run in (default ratio can be 0 slots)."""
    resp = session.post(
        f"{COORDINATOR}/druid/coordinator/v1/config/compaction/taskslots",
        params={"ratio": ratio, "max": maximum},
        timeout=(5, 30),
    )
    resp.raise_for_status()


def await_convergence(
    datasource: str,
    session: requests.Session,
    target_bytes: int = 0,
    deadline_s: int = 3600,
    poll_s: int = 60,
) -> bool:
    """Poll until bytes-awaiting-compaction drops to target, or the deadline passes."""
    started = time.monotonic()
    while time.monotonic() - started < deadline_s:
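        resp = session.get(
            f"{COORDINATOR}/druid/coordinator/v1/compaction/progress",
            params={"dataSource": datasource},
            timeout=(5, 30),
        )
        if resp.status_code == 404:
            # Assumed behaviour: no progress is reported until the duty has
            # evaluated this datasource at least once, so keep polling.
            time.sleep(poll_s)
            continue
        resp.raise_for_status()
        # A missing field means "unknown", not "empty backlog".
        waiting = resp.json().get("remainingSegmentSize")
        if waiting is not None and int(waiting) <= target_bytes:
            return True
        time.sleep(poll_s)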
    return False


if __name__ == "__main__":
    sess = _session()
    set_task_slots(ratio=0.15, maximum=10, session=sess)
    apply_compaction_config(
        {
            "dataSource": "analytics_events",
            "taskPriority": 25,
            "skipOffsetFromLatest": "P1D",
            "tuningConfig": {
                "partitionsSpec": {
                    "type": "range",
                    "partitionDimensions": ["tenant_id"],
                    "targetRowsPerSegment": 5000000,
                },
                "maxNumConcurrentSubTasks": 4,
            },
            "granularitySpec": {"segmentGranularity": "DAY", "rollup": True},
        },
        session=sess,
    )
    converged = await_convergence("analytics_events", sess, deadline_s=7200)
    print("compaction backlog cleared" if converged else "still draining at deadline")

Because apply_compaction_config is an idempotent upsert keyed on dataSource, re-running the pipeline is safe — Druid stores exactly one config per datasource. Keep the config JSON under schema validation before it is applied; the same schema validation approach used for ingestion specs catches malformed partitionsSpec blocks before they reach the Coordinator.
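A lightweight pre-apply check can run before the POST. This sketch is not a full JSON Schema; it assumes an ingestion priority of 50 as the ceiling and checks only three things (ISO-8601 skip offset, known partitionsSpec type, priority below ingestion). The function name, the regular expression, and the error strings are all illustrative.

```python
import re

# Accepts durations such as P1D, PT0S, P1DT4H; rejects "P", "PT", "1 day".
ISO_DURATION = re.compile(
    r"^P(?!$)(\d+Y)?(\d+M)?(\d+W)?(\d+D)?(T(?=\d)(\d+H)?(\d+M)?(\d+(\.\d+)?S)?)?$"
)
KNOWN_PARTITION_TYPES = {"dynamic", "hashed", "range", "single_dim"}


def validate_compaction_config(config: dict, ingestion_priority: int = 50) -> list:
    """Return a list of human-readable problems; an empty list means it passed."""
    errors = []
    if not config.get("dataSource"):
        errors.append("dataSource is required")

    offset = config.get("skipOffsetFromLatest")
    if offset is not None and not ISO_DURATION.match(str(offset)):
        errors.append(f"skipOffsetFromLatest is not an ISO-8601 duration: {offset!r}")

    priority = config.get("taskPriority")
    if priority is not None and priority >= ingestion_priority:
        errors.append(f"taskPriority {priority} is not below ingestion priority {ingestion_priority}")

    spec = config.get("tuningConfig", {}).get("partitionsSpec")
    if spec is not None:
        if spec.get("type") not in KNOWN_PARTITION_TYPES:
            errors.append(f"unknown partitionsSpec type: {spec.get('type')!r}")
        elif spec["type"] == "range" and not spec.get("partitionDimensions"):
            errors.append("range partitioning needs partitionDimensions")
    return errors


if __name__ == "__main__":
    sample = {
        "dataSource": "analytics_events",
        "taskPriority": 25,
        "skipOffsetFromLatest": "P1D",
        "tuningConfig": {"partitionsSpec": {"type": "range", "partitionDimensions": ["tenant_id"]}},
    }
    print(validate_compaction_config(sample) or "ok")
```

Calling it ahead of apply_compaction_config and aborting on a non-empty result keeps malformed configs out of the Coordinator.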

Failure Modes & Diagnostics

Auto-compaction fails quietly — the API returns 200, the duty runs, and yet segments never converge. Diagnose from the shell against the live Coordinator/Overlord.

Is the datasource even scheduled, and how big is its backlog?

curl -s http://druid-coordinator:8081/druid/coordinator/v1/compaction/status \
  | jq '.latestStatus[] | select(.dataSource=="analytics_events")
        | {dataSource, scheduleStatus, bytesAwaitingCompaction, bytesCompacted, bytesSkipped}'

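A large, non-decreasing bytesAwaitingCompaction across successive polls means the duty is not keeping up, so check slots next. scheduleStatus reports whether auto-compaction is enabled for the datasource (RUNNING or NOT_ENABLED); NOT_ENABLED means no config is in effect for it, so re-apply the config before looking further.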
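The "non-decreasing across successive polls" test can be automated once the readings are collected. A minimal sketch, assuming the caller stores one bytesAwaitingCompaction value per poll; the function name and the three-sample default are illustrative.

```python
def backlog_not_draining(samples, min_samples: int = 3) -> bool:
    """True when the last min_samples readings never decrease and the latest is non-zero."""
    if len(samples) < min_samples:
        return False  # not enough evidence yet
    recent = samples[-min_samples:]
    never_decreases = all(later >= earlier for earlier, later in zip(recent, recent[1:]))
    return never_decreases and recent[-1] > 0


if __name__ == "__main__":
    print(backlog_not_draining([10_000, 10_000, 12_000]))  # flat then growing
    print(backlog_not_draining([12_000, 10_000, 8_000]))   # draining
```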

Does the duty have any slots? With the default ratio: 0.1 a small cluster rounds down to zero:

curl -s http://druid-coordinator:8081/druid/coordinator/v1/config/compaction \
  | jq '{ratio: .compactionTaskSlotRatio, max: .maxCompactionTaskSlots}'

If the effective budget is 0, raise it via the taskslots endpoint and re-check on the next cycle.

Are compact tasks failing on the Overlord?

curl -s "http://druid-overlord:8090/druid/indexer/v1/tasks?type=compact&state=complete" \
  | jq '[.[] | select(.statusCode=="FAILED")] | .[0:5] | .[] | {id, dataSource, statusCode}'

Pull a failed task's log to see the real cause (usually a lock conflict or OOM):

curl -s "http://druid-overlord:8090/druid/indexer/v1/task/<TASK_ID>/log" | tail -n 40

TaskLock contention with ingestion — the log shows Cannot acquire lock on the interval. Root cause: skipOffsetFromLatest is smaller than real ingestion lag, so compaction and ingestion fight over the same recent time chunk. Remedy: raise the offset to cover the lag (see the sizing formula above).

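Measure real bytes-per-row to fix output sizes that keep landing wrong. The Coordinator's segment metadata carries size but not row counts, so read both from the sys.segments system table over SQL (druid-router:8888 is an assumed Router address):

curl -s -X POST http://druid-router:8888/druid/v2/sql \
  -H 'Content-Type: application/json' --data-binary @- <<'EOF' | jq '.[0]'
{"query": "SELECT SUM(\"size\") * 1.0 / SUM(num_rows) AS avg_row_bytes, SUM(\"size\") / 1048576.0 AS total_mb FROM sys.segments WHERE datasource = 'analytics_events' AND is_published = 1 AND is_overshadowed = 0 AND num_rows > 0"}
EOF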

Feed avg_row_bytes back into the targetRowsPerSegment formula. If output segments oscillate and the duty re-compacts the same interval every cycle (an infinite compaction loop), the row target is set below what the data's row size can satisfy — raise it until output clears the ~256 MB floor.

Coordinator heap pressure from an unbounded backlog:

jstat -gcutil $(pgrep -f coordinator) 5s 4

Sustained old-gen occupancy above ~85% with rising GC time means the compaction planning set is too large; cap maxCompactionTaskSlots per cluster and stagger datasources rather than scheduling every one at once. Repeated compact failures and a stalled interval/compacted/count metric should page an operator.

Watch, too, for compaction that runs against segments about to be dropped: if a datasource's retention window is shorter than its compaction cadence, the duty burns compute rewriting data that a kill task then purges. Align the two by coordinating with TTL mapping and data expiration so skipOffsetFromLatest and the drop rules don't overlap wastefully.

Automation Checklist

Wire these into the pipeline that manages a datasource's compaction config so scheduling stays correct as workloads shift:

Pre-apply validation — the DataSourceCompactionConfig passes JSON schema validation (valid partitionsSpec type, ISO-8601 skipOffsetFromLatest, taskPriority below ingestion priority) before any POST.
Slot budget asserted — the effective compaction slot count (min(⌊ratio × totalSlots⌋, max)) is greater than zero after applying config; CI fails if it rounds to 0.
Skip offset covers lag — skipOffsetFromLatest ≥ segmentGranularity + observed max ingestion watermark lag + margin, re-derived from monitoring, not hard-coded.
Row target from measured bytes — targetRowsPerSegment recomputed from live avg_row_bytes, targeting 500 MB–1 GB compressed output.
Post-apply convergence — pipeline polls /compaction/progress and asserts bytesAwaitingCompaction is trending down within a deadline; alerts if it plateaus.
Failure watch — alerting on failed compact tasks, a stalled interval/compacted/count, and Coordinator old-gen GC above threshold.
Retention coordination — datasource compaction cadence checked against its drop/kill rules so segments bound for deletion are not recompacted.
Off-peak biasing — where query SLAs are tight, maxCompactionTaskSlots scheduled higher in off-peak windows and lower during peak query hours.
Config in version control — the compaction config lives in Git and is applied idempotently from CI/CD, never edited ad hoc in the console.
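The off-peak biasing item can be reduced to a lookup that a scheduled job evaluates before calling the taskslots endpoint. A sketch only; the peak-hour set and both slot caps are assumptions to be replaced with values from your own query SLAs.

```python
def max_slots_for_hour(hour_utc: int, peak_hours: frozenset, peak_max: int, off_peak_max: int) -> int:
    """Pick the maxCompactionTaskSlots value for the given hour of day (0-23)."""
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be between 0 and 23")
    return peak_max if hour_utc in peak_hours else off_peak_max


if __name__ == "__main__":
    # Hypothetical peak window: 08:00 to 19:59 UTC.
    peak = frozenset(range(8, 20))
    for hour in (3, 10, 23):
        print(hour, max_slots_for_hour(hour, peak, peak_max=2, off_peak_max=10))
```

The result can be passed as the maximum argument of set_task_slots from the orchestration snippet, run hourly from the same scheduler that applies the config.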

Related

Configuring Druid Native Compaction Rules — the compact task and DataSourceCompactionConfig grammar, locking semantics, and diagnostic signatures in full.
Compaction Threshold Tuning — calibrate the row, size, and concurrency thresholds that decide how aggressively the duty fires.
Segment Size Optimization Strategies — measure real bytes/row and land compacted segments in the 500 MB–1 GB band.
TTL Mapping and Data Expiration — retention rule grammar and kill-task chaining that compaction cadence must respect.
Asynchronous Task Execution Patterns — submit-and-poll orchestration primitives for driving these APIs from a scheduler.

