Compaction Threshold Tuning in Apache Druid

Effective segment lifecycle management in Apache Druid depends on precisely calibrated compaction thresholds that balance query latency, storage footprint, and cluster resource utilization. As ingestion velocity climbs, static thresholds drift out of alignment: too aggressive and the Coordinator floods worker slots with redundant compact tasks; too conservative and undersized fragments accumulate until Historical heap and Broker planning time degrade. This page sits under Segment Compaction, Retention & Storage Optimization and drills into the specific thresholds — the row ceiling, the input-size guardrail, and the concurrency budget — that decide how the automated compaction scheduling duty behaves once it fires, plus the Python patterns to keep those thresholds derived from live telemetry rather than guessed.

Mechanics & Internals

Druid's Coordinator runs a periodic compaction duty that compares each datasource's on-disk segment layout against its DataSourceCompactionConfig. When a time interval's segments fall outside the configured targets, the duty submits a compact task to the Overlord, which reads the existing segments and writes replacement segments that honour the new thresholds. The thresholds themselves live in two places: the top-level compaction config that governs whether and how much to compact, and the nested tuningConfig that governs how each task partitions and merges its input.

The decision boundary is byte-driven. The Coordinator tracks the bytes awaiting compaction per datasource and exposes the figure as bytesAwaitingCompaction in GET /druid/coordinator/v1/compaction/status and as remainingSegmentSize in GET /druid/coordinator/v1/compaction/progress. An interval becomes a candidate when its total used-segment bytes, or its segment count for a given time chunk, indicate that the layout no longer matches the target partitionsSpec. Because Druid segments are immutable, "tuning a threshold down" never edits an existing segment in place; it schedules a rewrite that produces fresh segments and marks the originals unused.
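
As a rough sketch of how an orchestrator might read that backlog figure, the helper below pulls one datasource's entry out of a /compaction/status response. It assumes the payload has the shape {"latestStatus": [{"dataSource": ..., "bytesAwaitingCompaction": ...}]}; the function name and the sample payload are illustrative only, so check the shape against your Druid version before relying on it.

```python
from typing import Optional


def awaiting_bytes(status_payload: dict, datasource: str) -> Optional[int]:
    """Return bytesAwaitingCompaction for one datasource, or None if absent.

    Assumes the /compaction/status shape {"latestStatus": [{...}, ...]}.
    """
    for entry in status_payload.get("latestStatus", []):
        if entry.get("dataSource") == datasource:
            return entry.get("bytesAwaitingCompaction")
    return None


# Hypothetical payload, trimmed to the two fields used here.
sample = {
    "latestStatus": [
        {"dataSource": "clickstream_events", "bytesAwaitingCompaction": 3_221_225_472},
        {"dataSource": "other_ds", "bytesAwaitingCompaction": 0},
    ]
}
print(awaiting_bytes(sample, "clickstream_events"))  # 3221225472
```

Returning None for a datasource that is missing from the payload keeps "auto-compaction not configured" distinct from "backlog is zero".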

The thresholds that actually move behaviour are:

maxRowsPerSegment (dynamic partitioning) — the hard row ceiling for an output segment. Aligning it with query scan patterns keeps vectorized execution efficient and limits how many column vectors a single scan materializes into heap. Since the byte-based targetCompactionSizeBytes was removed in Druid 0.21, row count is the primary lever for landing output in the target size band.
targetRowsPerSegment (hashed or range partitionsSpec) — the desired, not maximum, row count per output segment. Druid treats it as the center of the distribution and allows some spread; set it so compressed output lands in the 500 MB–1 GB range.
inputSegmentSizeBytes — a guardrail on the total bytes a single compaction task will pull in as input. It caps per-task memory pressure during the merge phase and prevents one task from trying to rewrite a pathologically large interval in a single pass.
maxNumConcurrentSubTasks — parallelism within one index_parallel compaction task. It fans work across worker slots; set it against available middle-manager slots and JVM heap, not against total cluster cores.
skipOffsetFromLatest — the recency window the duty leaves untouched so it does not fight active ingestion for locks. Not a size threshold, but it gates which intervals are even eligible.

These interact with upstream ingestion choices. The segment granularity settings fix the time-chunk boundaries every threshold operates inside, and columnar storage formats determine the compressed bytes-per-row that maps a row threshold to a size outcome. Tuning thresholds without measuring real bytes-per-row is guesswork — a high-cardinality string dimension and a low-cardinality enum with the same row count produce wildly different segment sizes.

Validated Configuration Spec

Thresholds are set on the datasource compaction config that the Coordinator polls, applied idempotently via POST /druid/coordinator/v1/config/compaction. The block below is copy-ready; the fields that drive threshold behaviour are explained one by one after it.

{
  "dataSource": "clickstream_events",
  "taskPriority": 25,
  "inputSegmentSizeBytes": 1073741824,
  "skipOffsetFromLatest": "P1D",
  "granularitySpec": {
    "segmentGranularity": "DAY",
    "queryGranularity": "HOUR",
    "rollup": true
  },
  "tuningConfig": {
    "type": "index_parallel",
    "partitionsSpec": {
      "type": "hashed",
      "targetRowsPerSegment": 5000000
    },
    "maxNumConcurrentSubTasks": 4,
    "maxRowsInMemory": 1000000,
    "maxRowsPerSegment": 5000000
  },
  "taskContext": {
    "priority": 25
  }
}

Field by field:

taskPriority — kept below the ingestion task priority so compaction never starves live ingestion for worker slots. A value around 25 sits under the default streaming priority of 75.
inputSegmentSizeBytes — 1073741824 (1 GiB) caps how much input a single task pulls. Lower it when tasks OOM; raise it only when merge memory headroom is proven.
skipOffsetFromLatest — P1D leaves the most recent day untouched so the duty and streaming ingestion do not contend for interval locks. It must cover the maximum ingestion watermark lag, not just one granularity bucket.
granularitySpec.segmentGranularity — compaction can re-chunk time here; keeping it at DAY preserves alignment with the retention rules governed by TTL mapping and data expiration.
partitionsSpec.type / targetRowsPerSegment — hashed gives even, deterministic distribution without a secondary partition key; 5000000 targets the ~700 MB compressed band for this datasource's measured row size. Switch to range when queries filter heavily on one dimension.
maxNumConcurrentSubTasks — 4 fans the parallel index across four worker slots. Match it to free middle-manager capacity so a compaction burst does not evict streaming supervisors.
maxRowsInMemory — the per-task in-heap row buffer before a spill to disk; 1000000 is a safe default that keeps merge-phase heap bounded.
maxRowsPerSegment — the absolute ceiling that backstops targetRowsPerSegment so no single output segment blows past the size band.

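The full grammar of the compact task and DataSourceCompactionConfig, including locking semantics, is covered under configuring Druid native compaction rules. The Druid documentation marks the top-level tuningConfig maxRowsPerSegment as deprecated in favour of partitionsSpec, so confirm on your version that it is honoured alongside a hashed spec before relying on it as a backstop.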
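
One way to keep that POST idempotent from a scheduler is to compare the desired config with what the Coordinator already holds and skip the call when nothing changed. The sketch below shows only the comparison step. The name config_differs and both sample dicts are hypothetical, and it assumes the Coordinator may echo back extra defaulted fields that the desired config never set.

```python
def config_differs(current: dict, desired: dict) -> bool:
    """True if any key in `desired` is missing from or different in `current`.

    Nested dicts are compared recursively. Keys that exist only in `current`
    (for example, server-side defaults) are ignored.
    """
    for key, want in desired.items():
        have = current.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            if config_differs(have, want):
                return True
        elif have != want:
            return True
    return False


desired = {
    "dataSource": "clickstream_events",
    "skipOffsetFromLatest": "P1D",
    "tuningConfig": {
        "partitionsSpec": {"type": "hashed", "targetRowsPerSegment": 5_000_000}
    },
}
current = {
    "dataSource": "clickstream_events",
    "skipOffsetFromLatest": "P1D",
    "taskPriority": 25,  # present only on the server side in this example
    "tuningConfig": {
        "partitionsSpec": {"type": "hashed", "targetRowsPerSegment": 4_000_000}
    },
}
print(config_differs(current, desired))  # True, because the row target changed
```

A one-directional comparison is a deliberate choice here: it avoids re-posting the config every cycle just because the stored copy carries defaults the caller did not specify.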

Sizing Heuristics & Formulas

The core relationship converts a target segment size into a row threshold. Given a target compressed size in MB and the datasource's measured average compressed bytes per row:

$$\text{targetRowsPerSegment} \approx \frac{\text{targetMB} \times 1048576}{\text{avgRowBytes}}$$

For a datasource landing on ~700 MB with a measured 140 bytes/row:

$$\text{targetRowsPerSegment} \approx \frac{700 \times 1048576}{140} \approx 5.24 \times 10^{6}$$

which rounds down to the 5000000 used in the spec above. The measured avgRowBytes must come from live segment metadata, because rollup ratio and dictionary-encoding overhead on high-cardinality dimensions dominate it; two datasources with the same row count can differ in segment size by an order of magnitude. Deriving it empirically is the subject of segment size optimization strategies.
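
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch of the formula only; the function names and the round-down step of one million rows are illustrative choices, not anything Druid prescribes.

```python
import math

BYTES_PER_MB = 1_048_576


def target_rows_per_segment(target_mb: float, avg_row_bytes: float) -> int:
    """Rows per segment that land near target_mb at the measured row size."""
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    return math.floor(target_mb * BYTES_PER_MB / avg_row_bytes)


def round_down_to(rows: int, step: int = 1_000_000) -> int:
    """Round a row target down to a tidy multiple of `step` (hypothetical helper)."""
    return (rows // step) * step


rows = target_rows_per_segment(700, 140)
print(rows)                 # 5242880
print(round_down_to(rows))  # 5000000
```

Rounding down rather than to the nearest step keeps the output on the smaller side of the size band, which is the safer direction when the row-size estimate is noisy.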

To budget concurrency, size the per-task input against the merge-phase heap available to a worker. Let bytesPerSubTaskBudget be the input bytes one sub-task can merge within the heapPerSlotMB that a worker slot has available for merge buffers. Assuming the merge cost scales with the input pulled per sub-task:

$$\text{maxNumConcurrentSubTasks} \approx \min\left(\text{freeWorkerSlots},\ \left\lfloor \frac{\text{inputSegmentSizeBytes}}{\text{bytesPerSubTaskBudget}} \right\rfloor\right)$$
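
A small sketch of that budget, under the assumption that bytesPerSubTaskBudget has already been measured for the worker heap in use. The clamp to a minimum of 1 is an addition of this example, made so the result is always a usable sub-task count; the sample numbers are hypothetical.

```python
def max_concurrent_subtasks(
    free_worker_slots: int,
    input_segment_size_bytes: int,
    bytes_per_subtask_budget: int,
) -> int:
    """min(free slots, floor(input bytes / per-sub-task budget)), clamped to >= 1."""
    if bytes_per_subtask_budget <= 0:
        raise ValueError("bytes_per_subtask_budget must be positive")
    by_input = input_segment_size_bytes // bytes_per_subtask_budget
    return max(1, min(free_worker_slots, by_input))


GIB = 1_073_741_824
MIB = 1_048_576

# 1 GiB of input at 256 MiB per sub-task allows 4, even with 6 slots free.
print(max_concurrent_subtasks(6, GIB, 256 * MIB))  # 4
# With only 2 slots free, the slot count becomes the binding limit.
print(max_concurrent_subtasks(2, GIB, 256 * MIB))  # 2
```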

The drain time for a backlog — useful for deciding whether current thresholds keep pace with ingestion — is the outstanding bytes divided by aggregate throughput:

$$T_{\text{drain}} \approx \frac{\text{bytesAwaitingCompaction}}{\text{slots} \times \text{throughputPerSlot}}$$

If $T_{\text{drain}}$ exceeds the interval at which new data arrives, thresholds are too conservative or the slot budget is too small, and the backlog grows unboundedly.
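
The drain estimate is simple enough to wire into a health check. The sketch below assumes throughput is measured in bytes per second per slot; the function names and the sample figures (a 3 GiB backlog, 4 slots, 20 MiB/s per slot) are hypothetical.

```python
def drain_seconds(bytes_awaiting: int, slots: int, throughput_per_slot: float) -> float:
    """Estimated seconds to clear the backlog at the given per-slot throughput."""
    if slots <= 0 or throughput_per_slot <= 0:
        raise ValueError("slots and throughput_per_slot must be positive")
    return bytes_awaiting / (slots * throughput_per_slot)


def keeps_pace(
    bytes_awaiting: int,
    slots: int,
    throughput_per_slot: float,
    arrival_interval_s: float,
) -> bool:
    """True when the backlog drains within one data-arrival interval."""
    return drain_seconds(bytes_awaiting, slots, throughput_per_slot) <= arrival_interval_s


MIB = 1_048_576
backlog = 3 * 1024 * MIB  # 3 GiB awaiting compaction

print(round(drain_seconds(backlog, 4, 20 * MIB), 1))  # 38.4
print(keeps_pace(backlog, 4, 20 * MIB, arrival_interval_s=3600))  # True
```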

Python Orchestration Snippet

Treat compaction thresholds as runtime variables computed from segment telemetry, not as static configuration. The orchestrator below reads current segment sizes and row counts, derives a row threshold from the measured bytes per row, submits a compact task to the Overlord, then polls to completion with exponential backoff. It uses only the standard library plus requests.

import math
import time
import requests

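DRUID_OVERLORD = "http://overlord:8090"
DRUID_BROKER = "http://broker:8082"  # assumed host; serves SQL over sys.segments
SESSION = requests.Session()

# The Coordinator's segment listing carries "size" but no row count, so the
# row total is read from the sys.segments system table instead.
SEGMENT_STATS_SQL = """
SELECT SUM("size") AS total_bytes, SUM(num_rows) AS total_rows
FROM sys.segments
WHERE datasource = ? AND is_published = 1 AND is_overshadowed = 0
"""


def derive_thresholds(datasource: str, target_mb: int = 700) -> dict:
    """Compute a row threshold from live segment telemetry."""
    resp = SESSION.post(
        f"{DRUID_BROKER}/druid/v2/sql",
        json={
            "query": SEGMENT_STATS_SQL,
            "parameters": [{"type": "VARCHAR", "value": datasource}],
        },
        timeout=15,
    )
    resp.raise_for_status()
    stats = resp.json()[0]  # one aggregate row

    total_bytes = stats["total_bytes"] or 0
    total_rows = stats["total_rows"] or 0
    if total_rows <= 0:
        # Refuse to guess: without row counts the average is meaningless.
        raise ValueError(f"no row counts available for {datasource}")
    avg_row_bytes = total_bytes / total_rows

    target_rows = math.floor((target_mb * 1_048_576) / max(avg_row_bytes, 1))
    return {
        "targetRowsPerSegment": max(target_rows, 1_000_000),  # 1M-row floor
        "maxRowsPerSegment": max(target_rows, 1_000_000),
        "inputSegmentSizeBytes": 1_073_741_824,  # 1 GiB per-task input cap
        "maxNumConcurrentSubTasks": 4,
    }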


def submit_compaction(datasource: str, interval: str, tuning: dict) -> str:
    """Submit a dynamically tuned compact task; return the task id."""
    payload = {
        "type": "compact",
        "dataSource": datasource,
        "ioConfig": {
            "type": "compact",
            "inputSpec": {"type": "interval", "interval": interval},
        },
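        "tuningConfig": {
            "type": "index_parallel",
            "partitionsSpec": {
                "type": "hashed",
                "targetRowsPerSegment": tuning["targetRowsPerSegment"],
            },
            # Hashed partitioning is a perfect-rollup mode; the parallel
            # indexer accepts it only with forceGuaranteedRollup enabled.
            "forceGuaranteedRollup": True,
            "maxRowsPerSegment": tuning["maxRowsPerSegment"],
            "maxNumConcurrentSubTasks": tuning["maxNumConcurrentSubTasks"],
        },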
    }
    resp = SESSION.post(
        f"{DRUID_OVERLORD}/druid/indexer/v1/task",
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["task"]


def poll_task(task_id: str, deadline_s: int = 3600) -> str:
    """Poll task status with exponential backoff until terminal or deadline."""
    delay = 2.0
    # Monotonic deadline so time spent inside HTTP requests counts too.
    deadline = time.monotonic() + deadline_s
    while time.monotonic() < deadline:
        resp = SESSION.get(
            f"{DRUID_OVERLORD}/druid/indexer/v1/task/{task_id}/status",
            timeout=15,
        )
        resp.raise_for_status()
        state = resp.json()["status"]["status"]
        if state in ("SUCCESS", "FAILED"):
            return state
        time.sleep(delay)
        delay = min(delay * 2, 60.0)  # cap backoff at 60s
    raise TimeoutError(f"{task_id} did not finish within {deadline_s}s")


if __name__ == "__main__":
    ds, window = "clickstream_events", "2026-06-01/2026-06-02"
    thresholds = derive_thresholds(ds)
    tid = submit_compaction(ds, window, thresholds)
    print(tid, poll_task(tid))

The submit-and-poll primitive here is the same pattern used throughout the asynchronous task execution patterns reference; wire the retry escalation from the next section around poll_task so a FAILED state degrades thresholds before resubmitting.

Failure Modes & Diagnostics

Compaction tasks are memory-intensive and fail predictably near heap limits or when thresholds fight the data's shape. Diagnose against the Coordinator and Overlord REST APIs before changing config.

Backlog not draining — thresholds too conservative or too few slots. Check outstanding bytes:

curl -s "http://coordinator:8081/druid/coordinator/v1/compaction/status?dataSource=clickstream_events" \
  | jq '.latestStatus[] | {dataSource, bytesAwaitingCompaction}'

A figure that holds steady or climbs across cycles means $T_{\text{drain}}$ exceeds the ingestion interval; raise maxNumConcurrentSubTasks or the cluster's compaction slot budget (compactionTaskSlotRatio and maxCompactionTaskSlots).

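Measure real bytes-per-row before touching targetRowsPerSegment, so the threshold is derived rather than guessed. Row counts are published in the sys.segments system table rather than in the Coordinator's segment listing, so query them over SQL (the Broker host below is an assumption):

curl -s -X POST "http://broker:8082/druid/v2/sql" -H "Content-Type: application/json" \
  -d '{"query": "SELECT SUM(\"size\") AS b, SUM(num_rows) AS r FROM sys.segments WHERE datasource = ? AND is_published = 1 AND is_overshadowed = 0", "parameters": [{"type": "VARCHAR", "value": "clickstream_events"}]}' \
  | jq '.[0] | {avg_row_bytes: (.b / .r), total_mb: (.b / 1048576)}'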

Task OOM / CONTAINER_EXITED — the merge phase exceeded worker heap. Confirm which task and inspect GC pressure on the worker JVM:

curl -s "http://overlord:8090/druid/indexer/v1/tasks?state=complete" \
  | jq '.[] | select(.statusCode=="FAILED") | {id, errorMsg}' | head
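# Tasks run in peon JVMs forked by the MiddleManager; the pattern assumes the default peon command line.
jstat -gcutil "$(pgrep -f 'internal peon' | head -n 1)" 5s 4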

Sustained old-gen occupancy above ~85% with rising GC time confirms heap starvation — halve maxNumConcurrentSubTasks, then lower inputSegmentSizeBytes to shrink per-task input.

Infinite re-compaction — the same interval is rewritten every cycle. This happens when targetRowsPerSegment is set below what the data's row size can satisfy, so output never clears the ~256 MB floor and the duty keeps re-flagging it. Raise the row target until output stabilizes above the floor.

A robust orchestrator encodes these as an ordered fallback rather than relying on default retries:

Halve maxNumConcurrentSubTasks and retry with the identical input interval.
Reduce inputSegmentSizeBytes to isolate smaller input batches and lower per-task merge memory.
Escalate to a manual review queue after three consecutive failures with CONTAINER_EXITED or OUT_OF_MEMORY.
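
The ordered fallback above can be sketched as a pure function that maps the current tuning and the consecutive-failure count to the next attempt. The name degrade, the halving step for inputSegmentSizeBytes, and the 128 MiB floor are assumptions of this example; the dict keys match those returned by derive_thresholds.

```python
from typing import Optional

MIN_INPUT_BYTES = 128 * 1024 * 1024  # hypothetical floor for per-task input
MAX_FAILURES = 3


def degrade(tuning: dict, consecutive_failures: int) -> Optional[dict]:
    """Return the tuning for the next retry, or None to escalate to review.

    Failure 1: halve maxNumConcurrentSubTasks (never below 1).
    Failure 2: halve inputSegmentSizeBytes (never below MIN_INPUT_BYTES).
    Failure 3 or more: stop retrying.
    """
    if consecutive_failures >= MAX_FAILURES:
        return None
    nxt = dict(tuning)  # leave the caller's dict untouched
    if consecutive_failures <= 1:
        nxt["maxNumConcurrentSubTasks"] = max(1, tuning["maxNumConcurrentSubTasks"] // 2)
    else:
        nxt["inputSegmentSizeBytes"] = max(
            MIN_INPUT_BYTES, tuning["inputSegmentSizeBytes"] // 2
        )
    return nxt


start = {"maxNumConcurrentSubTasks": 4, "inputSegmentSizeBytes": 1_073_741_824}
first = degrade(start, 1)
second = degrade(first, 2)
print(first["maxNumConcurrentSubTasks"])  # 2
print(second["inputSegmentSizeBytes"])    # 536870912
print(degrade(second, 3))                 # None
```

Keeping the policy free of I/O makes it easy to unit-test separately from the submit-and-poll loop that calls it.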

Automation Checklist

Wire these gates into the pipeline that manages a datasource's compaction thresholds so calibration stays correct as workloads shift:

Related

Automated Compaction Task Scheduling — how the Coordinator duty decides when to fire the tasks these thresholds shape, with slot-budget and skip-offset math.
Configuring Druid Native Compaction Rules — the full compact task and DataSourceCompactionConfig grammar and locking semantics.
Segment Size Optimization Strategies — measure real bytes-per-row and land compacted output in the 500 MB–1 GB band.
TTL Mapping and Data Expiration — retention and kill-task cadence that threshold tuning must avoid rewriting against.
Asynchronous Task Execution Patterns — the submit-and-poll orchestration primitives used to drive these APIs from a scheduler.
