Segment Size Optimization Strategies in Apache Druid

Apache Druid's query execution model is architected around segment-level parallelism: the Broker fans a query out across Historical processes, and each process scans its assigned segments as independent units of work. When segment sizes drift outside the target range, query latency rises, Historical heap pressure escalates, and background maintenance consumes disproportionate cluster resources. Segment sizing is therefore not a static ingestion parameter but a dynamic property that must be governed continuously — a discipline that sits at the center of segment compaction, retention and storage optimization and depends on the same measured inputs (row byte width, rollup ratio, partition boundaries) that drive every other lifecycle decision.

Mechanics & Internals

A Druid segment is an immutable, columnar, time-partitioned file. Each segment holds a contiguous slice of one time chunk for one datasource, and its internal layout — dictionary-encoded dimensions, compressed metric columns, and a bitmap index per dimension value — is what makes scan cost roughly proportional to the number of rows touched. The way columnar storage formats encode those columns means the size on disk and the row count of a segment are only loosely coupled: two segments with identical row counts can differ several-fold in bytes depending on cardinality and compression codec.

Segment size is bounded at write time by two levers that operate at different stages:

maxRowsPerSegment — a hard row ceiling applied during ingestion and during compaction. When a partition reaches this count, Druid closes it and opens a new one. It is a count, not a byte target, which is why byte-level tuning always routes back through row-size measurement.
partitionsSpec (dynamic, hashed, or range) — governs how rows are distributed across partitions within a time chunk. dynamic splits purely by maxRowsPerSegment; hashed and range also key on dimension values so that pruning-relevant data lands together. The segment granularity settings set the outer time boundary of each chunk, and the partitionsSpec then subdivides that chunk.
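The effect of the row ceiling under dynamic partitioning can be sketched with a deliberately idealized model: a single writer that fills each partition up to maxRowsPerSegment and leaves the remainder in the last one. The function name is hypothetical, and real parallel ingestion generally produces a less even split, so treat this as intuition rather than a description of Druid's implementation.

```python
from typing import List


def dynamic_partition_rows(rows_in_chunk: int, max_rows_per_segment: int) -> List[int]:
    """Idealized split of one time chunk: full partitions first, remainder last."""
    if max_rows_per_segment <= 0:
        raise ValueError("max_rows_per_segment must be positive")
    full, remainder = divmod(rows_in_chunk, max_rows_per_segment)
    sizes = [max_rows_per_segment] * full
    if remainder:
        sizes.append(remainder)  # the trailing, usually undersized, partition
    return sizes


print(dynamic_partition_rows(12_000_000, 5_000_000))
# → [5000000, 5000000, 2000000]
```

The trailing partition is where undersized segments come from when a chunk's row count is not a multiple of the ceiling.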

On the read path, the Broker consults its in-memory timeline of announced segments, maintained through the query routing and segment discovery layer, to build a per-Historical scatter plan. Each Historical memory-maps its segment files into the OS page cache and materializes column vectors on demand. Two failure geometries bracket the healthy range:

Oversized segments force the vectorized engine to allocate large column buffers into JVM heap, lengthening young-gen collections and, past a threshold, spilling group-by state to disk. A single fat segment also becomes an indivisible unit of work that no amount of Historical parallelism can subdivide, so it caps the query's tail latency.
Undersized segments multiply metadata rows in the Coordinator's catalog and the sys.segments table, inflate per-segment scan fixed costs (open, seek, decompress headers), and raise Coordinator load-balancing and scheduling overhead. A datasource with hundreds of thousands of tiny segments will show Coordinator run-loop times climbing even while data volume is modest.

The target that reconciles both is roughly 300–700 MB per segment (frequently expressed as ~500 MB–1 GB in mixed-compression terms, or about 5 million rows as a rule-of-thumb starting point). The goal is to make each segment large enough to amortize fixed scan costs but small enough that a single Historical thread can scan it without heap distress. Because that band is defined in bytes but enforced in rows, every concrete configuration begins by measuring avgRowBytes.

Validated Configuration Spec

Two specs govern segment size: the ingestion tuningConfig that sets the initial ceiling, and the compaction spec that corrects drift after the fact. Both must agree on the row target so that compaction does not fight ingestion.

The ingestion tuningConfig for a native batch or streaming supervisor sets the upstream ceiling. With a dynamic partitioning strategy, maxRowsPerSegment alone determines segment boundaries:

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "events_stream",
      "timestampSpec": { "column": "__time", "format": "millis" },
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": true
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "druid", "dataSource": "events_stream", "interval": "2026-01-01/2026-02-01" }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000,
      "maxRowsInMemory": 1000000,
      "partitionsSpec": {
        "type": "dynamic",
        "maxRowsPerSegment": 5000000
      }
    }
  }
}

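Of the top-level keys, type selects the task and spec wraps the three sub-configs. Druid requires dataSchema and ioConfig; tuningConfig is optional in general but must be present here because it carries the row ceiling. maxRowsPerSegment is that ceiling; recent Druid releases mark the top-level field as deprecated in favor of the copy inside partitionsSpec, so the nested value is the one to keep in sync. maxRowsInMemory bounds the in-heap buffer before an intermediate persist and is unrelated to final segment size. Keeping rollup: true with a queryGranularity coarser than the raw event timestamps (MINUTE here, which must stay at or finer than the HOUR segmentGranularity) is the single largest lever on row count, and therefore on how many rows equal a 500 MB segment.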

For workloads that benefit from data-locality pruning, switch to range partitioning so that high-cardinality filter dimensions cluster within partitions:

{
  "tuningConfig": {
    "type": "index_parallel",
    "forceGuaranteedRollup": true,
    "partitionsSpec": {
      "type": "range",
      "partitionDimensions": ["tenant_id", "event_type"],
      "targetRowsPerSegment": 5000000
    }
  }
}

Here targetRowsPerSegment (a soft target that Druid balances around, not a hard ceiling) replaces maxRowsPerSegment, and forceGuaranteedRollup is mandatory for range and hashed specs because perfect rollup requires a two-pass shuffle.

The correction mechanism is auto-compaction, configured once per datasource on the Coordinator via POST /druid/coordinator/v1/config/compaction. This is a persistent DataSourceCompactionConfig, not a one-off task — the Coordinator's compaction duty continuously scans for segments that violate the target and submits compact tasks itself:

{
  "dataSource": "events_stream",
  "skipOffsetFromLatest": "P1D",
  "tuningConfig": {
    "maxRowsPerSegment": 5000000,
    "partitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 5000000
    },
    "maxNumConcurrentSubTasks": 4
  },
  "taskPriority": 25,
  "taskContext": { "priority": 25 }
}

skipOffsetFromLatest protects the most recent interval from compaction while it is still receiving late data — compacting a hot interval wastes work and can race the ingestion supervisor. taskPriority is set below ingestion priority so compaction never preempts fresh data. maxNumConcurrentSubTasks bounds the parallelism of a single compaction task, which combined with the Coordinator's maxCompactionTaskSlots caps total I/O the correction pass can consume. This configured target must match the ingestion ceiling; when the two disagree, segments oscillate — ingestion writes at one size, compaction rewrites at another, and the datasource churns indefinitely. Threshold selection for when compaction fires (as opposed to the output size) belongs to compaction threshold tuning, and the scheduling cadence and slot budgeting belong to automated compaction task scheduling.
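A quick guard against the oscillation described above is to compare the row target in both specs before deploying either. The sketch below is a minimal illustration: the helper names are hypothetical, and it assumes the target lives in partitionsSpec under maxRowsPerSegment or targetRowsPerSegment, with the top-level maxRowsPerSegment as a fallback.

```python
from typing import Optional


def row_target(tuning_config: dict) -> Optional[int]:
    """Extract the effective row target from a tuningConfig-shaped dict."""
    spec = tuning_config.get("partitionsSpec") or {}
    for key in ("maxRowsPerSegment", "targetRowsPerSegment"):
        if spec.get(key) is not None:
            return int(spec[key])
    legacy = tuning_config.get("maxRowsPerSegment")  # top-level fallback
    return int(legacy) if legacy is not None else None


def targets_agree(ingestion_tuning: dict, compaction_tuning: dict) -> bool:
    """True only when both specs state a row target and the values match."""
    a, b = row_target(ingestion_tuning), row_target(compaction_tuning)
    return a is not None and a == b


ingestion = {"partitionsSpec": {"type": "dynamic", "maxRowsPerSegment": 5_000_000}}
compaction = {"partitionsSpec": {"type": "dynamic", "maxRowsPerSegment": 3_000_000}}
print(targets_agree(ingestion, compaction))
# → False
```

Running a check like this in CI against the stored specs catches a mismatch before the datasource starts churning.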

Sizing Heuristics & Formulas

The whole strategy reduces to one conversion: a byte target must become a row ceiling, because Druid enforces rows. Given a target segment size in bytes and the measured average compressed bytes per row for a datasource:

$$\text{targetRowsPerSegment} \approx \frac{\text{targetBytes}}{\text{avgRowBytes}}$$

For a 700 MB target on a datasource whose segments average 140 bytes per row after rollup and compression:

$$\text{targetRowsPerSegment} \approx \frac{700 \times 1048576}{140} \approx 5.24 \times 10^{6}$$

avgRowBytes is not a guess: it is derived from live segments. Divide total segment bytes by total rows across a representative interval. The sys.segments table exposes both size and num_rows; the Coordinator segments endpoint reliably reports size, but whether its payload carries a row count depends on the Druid version, so verify it before relying on it. The number moves with cardinality and codec, so recompute it after any schema change, any shift in queryGranularity, or a codec swap. A dictionary-heavy high-cardinality dimension inflates bytes per row; a coarser queryGranularity collapses rows, which changes the ratio as well as the row count, so re-measure rather than assume the direction.
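The byte-to-row conversion is small enough to keep as a helper. This sketch reproduces the 700 MB / 140 bytes-per-row example; the function name and the MIB constant are illustrative, not part of any Druid API.

```python
MIB = 1048576  # bytes per mebibyte, matching the 700 x 1048576 arithmetic above


def target_rows_per_segment(target_bytes: int, avg_row_bytes: float) -> int:
    """Convert a byte target into a row ceiling using measured bytes per row."""
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be measured and positive")
    return int(target_bytes / avg_row_bytes)


print(target_rows_per_segment(700 * MIB, 140))
# → 5242880
```

Raising on a non-positive width is a deliberate choice here: a missing measurement should stop the pipeline rather than silently produce a ceiling.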

Rollup compresses row count by a ratio that directly scales the target. If ingestion rolls up raw events at ratio $\rho$ (raw rows in per rolled-up row out), then the raw event volume a single segment represents is:

$$\text{rawEventsPerSegment} \approx \rho \times \text{targetRowsPerSegment}$$

This matters when reasoning backward from throughput: a supervisor ingesting a known raw event rate lands segments in the target band only if segmentGranularity and $\rho$ together produce a per-chunk row count near the ceiling. If an HOUR chunk at the current rollup produces far fewer than targetRowsPerSegment, either coarsen segmentGranularity (to DAY) at ingestion or accept that compaction must consolidate several chunks. Auto-compaction works one time chunk at a time, so that consolidation requires setting a coarser segmentGranularity in the compaction config's granularitySpec.
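The backward reasoning from throughput can be sketched as follows. The event rate, the rollup ratio, and the helper names are made-up illustrations, and only HOUR and DAY are modeled.

```python
GRANULARITY_SECONDS = {"HOUR": 3600, "DAY": 86400}


def rolled_up_rows_per_chunk(raw_events_per_second: float, rollup_ratio: float,
                             segment_granularity: str) -> float:
    """Rows landing in one time chunk after rollup (rho = raw rows per stored row)."""
    seconds = GRANULARITY_SECONDS[segment_granularity]
    return raw_events_per_second * seconds / rollup_ratio


def pick_granularity(raw_events_per_second: float, rollup_ratio: float,
                     target_rows: int) -> str:
    """Finest modeled granularity whose chunk reaches the row target, else DAY."""
    for granularity in ("HOUR", "DAY"):
        rows = rolled_up_rows_per_chunk(raw_events_per_second, rollup_ratio, granularity)
        if rows >= target_rows:
            return granularity
    return "DAY"


# Hypothetical workload: 2,000 raw events/s rolled up 10:1.
print(rolled_up_rows_per_chunk(2000, 10, "HOUR"))  # → 720000.0
print(pick_granularity(2000, 10, 5_000_000))       # → DAY
```

At this hypothetical rate an HOUR chunk holds well under the 5 million row starting point, which is the situation where coarsening to DAY pays off.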

Finally, the count of segments a Historical must hold for an interval, which drives its page-cache and metadata footprint:

$$\text{segmentsPerChunk} \approx \left\lceil \frac{\text{rowsPerChunk}}{\text{targetRowsPerSegment}} \right\rceil$$

Minimizing this count without exceeding the byte ceiling is the optimization. The related trade-off of local SSD cache sizing versus object-storage tiering — how many of those segments stay resident versus cold — is worked in reducing Historical node storage costs.
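The segment-count formula is a ceiling division. A minimal sketch, with a hypothetical function name and illustrative row counts:

```python
import math


def segments_per_chunk(rows_per_chunk: int, target_rows_per_segment: int) -> int:
    """Number of segments a time chunk splits into at the given row target."""
    if target_rows_per_segment <= 0:
        raise ValueError("target_rows_per_segment must be positive")
    return math.ceil(rows_per_chunk / target_rows_per_segment)


# Illustrative chunk of 17.28 million rows against a 5,242,880-row target.
print(segments_per_chunk(17_280_000, 5_242_880))
# → 4
```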

Python Orchestration Snippet

Size drift is detected by querying the Coordinator for a datasource's segment inventory, computing the byte and row distribution, and submitting a compaction task only when segments fall outside the band. The snippet below uses only the standard library plus requests, computes avg_row_bytes from live segments, derives the row target from the formula above, and submits with exponential backoff on transient failures.

import time
import requests

COORDINATOR = "http://coordinator:8081"
OVERLORD = "http://overlord:8090"
TARGET_BYTES = 700 * 1048576          # 700 MB target
UNDERSIZE_BYTES = 100 * 1048576       # flag segments under 100 MB
OVERSIZE_BYTES = 1200 * 1048576       # flag segments over 1.2 GB


def get(session, url, **kw):
    """GET with exponential backoff on 5xx / connection errors; 4xx fails fast."""
    delay = 1.0
    for attempt in range(5):
        try:
            r = session.get(url, timeout=30, **kw)
        except requests.RequestException:
            # Connection errors and timeouts are transient: retry until the last attempt.
            if attempt == 4:
                raise
        else:
            if r.status_code < 500:
                r.raise_for_status()  # a 4xx is a caller error, so raise without retrying
                return r.json()
        time.sleep(delay)
        delay *= 2
    raise RuntimeError(f"GET failed after retries: {url}")


def segment_stats(session, datasource):
    """Return (avg_row_bytes, offending_intervals) for a datasource."""
    segs = get(
        session,
        f"{COORDINATOR}/druid/coordinator/v1/datasources/{datasource}/segments",
        params={"full": "true"},
    )
    total_bytes = total_rows = 0
    offending = set()
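    for s in segs:
        size = s.get("size", 0)
        # Row counts are not guaranteed in this payload; treat a missing count as unknown.
        rows = s.get("num_rows") or s.get("numRows") or 0
        if rows:
            # Only segments with a known row count contribute to the average,
            # so segments without one cannot inflate bytes-per-row.
            total_bytes += size
            total_rows += rows
        if size < UNDERSIZE_BYTES or size > OVERSIZE_BYTES:
            offending.add(s["interval"])
    avg_row_bytes = (total_bytes / total_rows) if total_rows else 0
    return avg_row_bytes, sorted(offending)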


def submit_compaction(session, datasource, interval, avg_row_bytes):
    """Submit one compact task covering a single offending interval."""
    # Fall back to the 5M-row rule of thumb when no row width could be measured.
    target_rows = int(TARGET_BYTES / avg_row_bytes) if avg_row_bytes else 5_000_000
    task = {
        "type": "compact",
        "dataSource": datasource,
        # inputSpec must name what to compact: an "interval" or an explicit "segments" list.
        "ioConfig": {"type": "compact",
                     "inputSpec": {"type": "interval", "interval": interval}},
        "tuningConfig": {
            "type": "index_parallel",
            "maxRowsPerSegment": target_rows,
            "partitionsSpec": {"type": "dynamic", "maxRowsPerSegment": target_rows},
            "maxNumConcurrentSubTasks": 4,
        },
        "context": {"priority": 25},
    }
    delay = 1.0
    for attempt in range(5):
        try:
            r = session.post(f"{OVERLORD}/druid/indexer/v1/task", json=task, timeout=30)
        except requests.RequestException:
            if attempt == 4:
                raise
        else:
            if r.status_code < 500:
                r.raise_for_status()
                return r.json()["task"]
        time.sleep(delay)
        delay *= 2
    raise RuntimeError("compaction submit failed after retries")


def poll_task(session, task_id, deadline_s=3600):
    """Poll Overlord until the task leaves RUNNING, or the deadline passes."""
    start = time.time()
    delay = 5.0
    while time.time() - start < deadline_s:
        status = get(session, f"{OVERLORD}/druid/indexer/v1/task/{task_id}/status")
        state = status["status"]["status"]
        if state != "RUNNING":
            return state           # SUCCESS or FAILED
        time.sleep(delay)
        delay = min(delay * 1.5, 60)
    raise TimeoutError(f"{task_id} did not complete within {deadline_s}s")


if __name__ == "__main__":
    ds = "events_stream"
    with requests.Session() as sess:
        avg, bad = segment_stats(sess, ds)
        print(f"avg_row_bytes={avg:.1f} offending_intervals={len(bad)}")
        # Compact only the intervals that fell outside the band, one task at a time.
        for interval in bad:
            tid = submit_compaction(sess, ds, interval, avg)
            print(f"submitted {tid} for {interval} -> {poll_task(sess, tid)}")

Two backoff patterns appear on purpose: a bounded geometric retry for idempotent GETs and the submit, and a capped growing interval for the long poll so it does not hammer the Overlord while a multi-minute task runs. Because the row target is recomputed from live avg_row_bytes on every run, the pipeline self-corrects as cardinality drifts — it never trusts a hard-coded ceiling. Coordinating this against expiration rules so a datasource bound for deletion is not recompacted is handled through TTL mapping and data expiration.

Failure Modes & Diagnostics

Diagnose size problems directly against the Coordinator and Historical REST APIs and the JVM, without waiting for a dashboard.

Find undersized and oversized segments. Pull the full segment list and bucket by size with jq:

curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/events_stream/segments?full=true" \
  | jq '[.[] | (.size/1048576) as $mb
              | if $mb < 100 then "undersized" elif $mb > 1200 then "oversized" else "in_band" end]
        | group_by(.)
        | map({bucket: .[0], count: length})'

Compute the datasource-wide average row width — the input to the sizing formula — straight from the metadata via the SQL API:

curl -s -XPOST http://broker:8082/druid/v2/sql \
  -H 'Content-Type: application/json' \
  -d '{"query":"SELECT datasource, SUM(\"size\")/SUM(num_rows) AS avg_row_bytes, COUNT(*) AS segs FROM sys.segments WHERE is_active = 1 GROUP BY datasource ORDER BY segs DESC"}' \
  | jq '.'

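A datasource showing tens of thousands of segments whose average segment size (total bytes divided by segs) sits far below the target band is fragmented; one with a handful of very large segments and heap alarms is oversized. avg_row_bytes by itself does not indicate fragmentation; it is the input to the row-ceiling formula.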
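To see where fragmentation sits rather than only that it exists, the same segment list can be summarized per interval. The sketch below assumes each segment is a dict with "interval" and "size" keys, as the orchestration snippet reads them; the function names, the 100 MB floor, and the sample data are illustrative.

```python
from collections import defaultdict

MIB = 1048576


def interval_summary(segments):
    """Group segments by interval and report count, total MB, and mean MB."""
    grouped = defaultdict(list)
    for seg in segments:
        grouped[seg["interval"]].append(seg.get("size", 0))
    return {
        interval: {
            "segments": len(sizes),
            "total_mb": sum(sizes) / MIB,
            "mean_mb": sum(sizes) / len(sizes) / MIB,
        }
        for interval, sizes in grouped.items()
    }


def fragmented_intervals(segments, min_segments=2, mean_mb_floor=100.0):
    """Intervals holding several segments whose mean size is below the floor."""
    return sorted(
        interval
        for interval, stats in interval_summary(segments).items()
        if stats["segments"] >= min_segments and stats["mean_mb"] < mean_mb_floor
    )


# Synthetic sample, not real cluster output.
sample = [
    {"interval": "2026-01-01T00/2026-01-01T01", "size": 20 * MIB},
    {"interval": "2026-01-01T00/2026-01-01T01", "size": 20 * MIB},
    {"interval": "2026-01-01T00/2026-01-01T01", "size": 20 * MIB},
    {"interval": "2026-01-01T01/2026-01-01T02", "size": 600 * MIB},
]
print(fragmented_intervals(sample))
# → ['2026-01-01T00/2026-01-01T01']
```

An interval with a single small segment is not flagged, because there is nothing to merge within that chunk.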

Confirm the symptom is heap, not I/O, on the Historical hosting the fat segments. Rising old-gen occupancy and long GC times under scan load point at oversized column vectors:

jstat -gcutil $(pgrep -f 'org.apache.druid.cli.Main server historical') 5s 6

Full-GC count climbing during queries over one interval is the oversized-segment signature. Cross-check that the Historical has finished loading its segment cache, so the pressure is not coming from a node that is still loading:

curl -s http://historical:8083/druid/historical/v1/loadstatus | jq '.'

Watch a correction converge. After submitting compaction, the Coordinator exposes remaining work; a plateau means the task is starved of slots or stuck:

curl -s "http://coordinator:8081/druid/coordinator/v1/compaction/progress?dataSource=events_stream" | jq '.'
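Plateau detection over successive progress samples can be sketched as below. It assumes the caller has already extracted a remaining-bytes figure from each response (the remainingSegmentSize field name is an assumption to verify against your Druid version); the function name and window size are illustrative.

```python
from typing import Sequence


def is_plateaued(remaining_bytes: Sequence[int], window: int = 3) -> bool:
    """True when work remains and the last `window` samples show no net decrease."""
    if window < 2 or len(remaining_bytes) < window:
        return False  # not enough history to judge
    recent = remaining_bytes[-window:]
    return recent[-1] > 0 and recent[-1] >= recent[0]


print(is_plateaued([900, 800, 700]))       # → False
print(is_plateaued([900, 700, 700, 700]))  # → True
print(is_plateaued([300, 100, 0]))         # → False
```

Sampling at an interval longer than a typical compaction task keeps a slow but healthy pass from being reported as stuck.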

The dominant failure modes, each with its signature and fix:

Fragmentation from dynamic partitioning at low volume — thousands of sub-50 MB segments because per-chunk row count never approaches maxRowsPerSegment. Root cause: segmentGranularity too fine for the ingest rate. Remediation: coarsen granularity or let interval-merging compaction consolidate chunks.
Oversized segments from an inflated row ceiling — heap pressure and tail latency. Root cause: maxRowsPerSegment set without accounting for a high avgRowBytes. Remediation: recompute the row ceiling from measured bytes and recompact.
Ingestion/compaction oscillation — segment count and size never settle. Root cause: ingestion tuningConfig and the compaction DataSourceCompactionConfig specify different targets. Remediation: pin both to the same derived row target.
Compaction starvation — drift detected but never corrected. Root cause: maxCompactionTaskSlots too low or skipOffsetFromLatest covering the intervals that need work. Remediation: raise slot budget in off-peak windows and verify the skip offset only shields genuinely hot intervals.

Automation Checklist

Segment compaction, retention & storage optimization — the parent reference covering the full segment lifecycle this sizing work sits inside.
Automated compaction task scheduling — how the Coordinator's compaction duty schedules and paces the tasks that enforce your size target.
Compaction threshold tuning — calibrating when compaction fires so size correction runs without starving query threads.
TTL mapping and data expiration — aligning size optimization with retention so segments bound for deletion are never recompacted.
Reducing Historical node storage costs — translating segment size targets into local SSD cache and object-storage tiering decisions.
Understanding Druid segment granularity — how the time-chunk boundary upstream of partitionsSpec sets per-segment row counts.
