Query Routing and Segment Discovery in Apache Druid

Apache Druid answers a time-bounded query by resolving it against a metadata-driven timeline and scattering sub-queries to the exact Historical nodes that hold the matching segments. For OLAP data engineers and platform teams, the Broker's segment-discovery subsystem is the hinge between ingestion and query latency: if the routing table is stale, partial, or misbalanced, queries either miss data or fan out to the wrong tier. This page details how discovery propagates, how to configure it, how to size the routing table, and how to automate its validation. It sits under Apache Druid Segment Architecture & Lifecycle Fundamentals, where routing resilience is one facet of the broader segment lifecycle.

Query Routing at a Glance

The Broker matches a query's interval against its in-memory segment timeline, scatters sub-queries to the Historicals holding each matching segment (preferring the hot tier when a replica exists there), then merges the partial results. The Coordinator assigns segments to Historicals out of band, and the Historicals' announcements populate that timeline.

Mechanics & Internals

Segment discovery originates in the Coordinator's reconciliation loop, which continuously aligns the immutable segment registry with the underlying relational metadata store. When an ingestion task completes, the Overlord transactionally commits segment manifests to the metadata database and marks them used; only then does the Coordinator schedule Historical loads. The full state machine behind that commit — the druid_segments table, versioning, and handoff atomicity — is covered in the Druid segment metadata storage deep dive.

Brokers do not read the metadata store directly for routing. Instead, each Broker maintains a TimelineServerView — an in-memory structure that maps each datasource's time intervals to the set of active Historical (and real-time) servers currently announcing a matching segment. Historicals announce segment availability through one of two discovery transports:

ZooKeeper announcements (legacy default): Historicals write ephemeral znodes under druid.zk.paths.announcementsPath; Brokers set watches and rebuild timeline entries on each watch event.
HTTP-based segment discovery (druid.serverview.type=http): Brokers poll each server's segment list over HTTP, removing ZooKeeper from the query-discovery path entirely. This is the recommended transport for large clusters because it avoids ZooKeeper watch storms.

The timeline itself is versioned. For any interval, the Broker keeps a VersionedIntervalTimeline that resolves overlapping segments by version string, so a query always targets the highest version available for that interval and never observes a partial compaction swap. When the Broker builds a query plan it (1) prunes segments whose interval falls outside the query's time filter, (2) applies tier affinity to prefer the lowest-latency Historical pool holding a replica, and (3) partitions the surviving segment set into per-server sub-queries for scatter/gather. The precision of step 1 depends directly on how finely the data is partitioned in time, which the segment granularity settings govern.
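The three planning steps above can be sketched with a toy model. This is an illustration only, not Druid's VersionedIntervalTimeline: the Segment class, the plan function, integer hour timestamps, and the (server, priority) pairs are all assumptions made for the example. It only handles segments whose intervals match exactly, whereas the real timeline also resolves partially overlapping intervals.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: int       # interval start in epoch hours, inclusive (assumed unit)
    end: int         # interval end in epoch hours, exclusive
    version: str     # version strings are compared lexicographically
    partition: int   # partition number within the interval
    servers: tuple   # hypothetical (server_name, priority) pairs announcing it


def plan(segments, q_start, q_end):
    """Return {server: [segments]} for a query over [q_start, q_end)."""
    # Step 1: prune segments whose interval does not overlap the query.
    live = [s for s in segments if s.start < q_end and s.end > q_start]

    # Version resolution: per interval, keep every partition of the top version.
    by_interval = defaultdict(list)
    for s in live:
        by_interval[(s.start, s.end)].append(s)
    visible = []
    for group in by_interval.values():
        top = max(s.version for s in group)
        visible.extend(s for s in group if s.version == top)

    # Steps 2 and 3: choose the highest-priority replica, then group by server.
    sub_queries = defaultdict(list)
    for s in sorted(visible, key=lambda s: (s.start, s.partition)):
        server, _priority = max(s.servers, key=lambda pair: pair[1])
        sub_queries[server].append(s)
    return dict(sub_queries)
```

Calling plan with two versions of the same hour returns only the newer one, which is the property that hides a half-finished compaction swap from queries.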

Discovery is deterministic but not instantaneous. The reconciliation cadence is set by druid.coordinator.period (default PT60S), and there is a bounded window between metadata commit, Coordinator assignment, Historical load, availability announcement, and Broker timeline refresh. Automation that dispatches queries immediately after ingestion must account for this window rather than assume synchronous visibility.

Validated Configuration Spec

Routing behavior is controlled on three node roles: the Broker (timeline hydration and dispatch), the Historical (tier membership and announcement), and the Coordinator (assignment and balancing). The blocks below are copy-ready runtime.properties fragments for Druid's latest stable release, with every field documented inline.

Broker runtime.properties:

# --- Segment discovery transport ---
# http avoids ZooKeeper watch storms; poll each data server for its segment list.
druid.serverview.type=http
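# Connection pool size per downstream data server.
druid.broker.http.numConnections=20
# Per-query queued-bytes ceiling before the Broker applies backpressure to data servers.
# (Comments sit on their own lines: a .properties parser treats a trailing "#" as part of the value.)
druid.broker.http.maxQueuedBytes=64MiB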

# --- Timeline hydration on startup ---
# Block query serving until the timeline is populated, so queries never hit a
# partially-hydrated routing table after a restart.
druid.broker.segment.awaitInitializationOnStart=true

# --- Tier-aware routing ---
# When a segment is replicated across tiers, prefer servers in this priority order.
# Entries are integer priorities and must match the druid.server.priority values
# configured on the Historicals (10 for the hot tier here; 0 is assumed for cold).
druid.broker.select.tier=custom
druid.broker.select.tier.custom.priorities=[10,0]

# --- Result caching (Broker-side) ---
druid.broker.cache.useResultLevelCache=true
druid.broker.cache.populateResultLevelCache=true

Historical runtime.properties (repeat per tier, changing the tier name):

# Tier membership: load rules place segments on a tier by this name.
druid.server.tier=hot
# Priority of this tier; higher wins when the same segment is available in
# multiple tiers. The Broker's custom priority list refers to these values.
druid.server.priority=10
# Advertised capacity used by the Coordinator to balance segment placement.
druid.server.maxSize=300g
druid.segmentCache.locations=[{"path":"/mnt/druid/segment-cache","maxSize":"300g"}]

Coordinator dynamic config (submitted as JSON to the API, not a properties file). Apply it live via POST /druid/coordinator/v1/config:

{
  "millisToWaitBeforeDeleting": 900000,
  "maxSegmentsToMove": 100,
  "replicantLifetime": 15,
  "replicationThrottleLimit": 500,
  "balancerComputeThreads": 2,
  "smartSegmentLoading": true
}

millisToWaitBeforeDeleting: how long the Coordinator must have been the leader before it starts marking overshadowed segments as unused in the metadata store; guards against premature cleanup right after a leadership change.
maxSegmentsToMove: per-cycle rebalance ceiling; higher values converge faster but add load-queue churn that Brokers must re-observe.
replicationThrottleLimit: caps replicas created per cycle so a tier rebuild does not saturate the load queue.
smartSegmentLoading: lets the Coordinator compute load-queue sizes, maxSegmentsToMove, and replicationThrottleLimit itself; while it is true the explicit values above are ignored, so set it to false if you are pinning limits.
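A minimal sketch of submitting that JSON from automation, using only the standard library. The Coordinator address, the function names, and the header values are assumptions for illustration; the X-Druid-Author and X-Druid-Comment headers are optional audit fields. The request is built separately from being sent so it can be inspected or logged first.

```python
import json
import urllib.request

COORDINATOR = "http://coordinator:8081"  # assumed address, as elsewhere on this page


def build_dynamic_config_request(config: dict) -> urllib.request.Request:
    """Build (but do not send) the POST that applies Coordinator dynamic config."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url=f"{COORDINATOR}/druid/coordinator/v1/config",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Optional audit headers; the values here are placeholders.
            "X-Druid-Author": "platform-automation",
            "X-Druid-Comment": "tune rebalance ceiling",
        },
    )


def apply_dynamic_config(config: dict) -> int:
    """Send the request and return the HTTP status code."""
    request = build_dynamic_config_request(config)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return resp.status
```

Keeping the payload in version control and passing it through build_dynamic_config_request makes the applied config reviewable before apply_dynamic_config touches the cluster.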

Sizing Heuristics & Formulas

The Broker's routing table cost scales with the number of live segments it must track, not with the volume of data behind them. Every used segment contributes one timeline entry plus one discovery watch (ZooKeeper transport) or one poll payload row (HTTP transport). The total segment count for a datasource is:

$$ \text{segmentCount} \approx \frac{\text{retentionDays} \times 86400}{\text{segmentGranularitySeconds}} \times \text{partitionsPerInterval} \times \text{replicationFactor} $$

An hourly segment granularity over 90 days of retention with 4 partitions per hour and 2 replicas yields $90 \times 24 \times 4 \times 2 = 17{,}280$ tracked entries for one datasource, which is manageable. The same retention at MINUTE granularity explodes to $90 \times 1440 \times 4 \times 2 = 1{,}036{,}800$ entries, inflating timeline memory and, on the ZooKeeper transport, triggering watch storms.
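The estimate is easy to encode as a pre-flight check. The sketch below is a direct transcription of the segmentCount formula; the function names are hypothetical, and the 20% tolerance in exceeds_budget mirrors the alert threshold suggested in the automation checklist rather than any Druid setting.

```python
def segment_count(retention_days: int, granularity_seconds: int,
                  partitions_per_interval: int, replication_factor: int) -> int:
    """Approximate tracked entries for one datasource."""
    intervals = retention_days * 86400 // granularity_seconds
    return intervals * partitions_per_interval * replication_factor


def exceeds_budget(tracked: int, expected: int, tolerance: float = 0.20) -> bool:
    """True when the tracked count is more than `tolerance` above the estimate."""
    return tracked > expected * (1 + tolerance)


hourly = segment_count(90, 3600, 4, 2)   # 17280
minutely = segment_count(90, 60, 4, 2)   # 1036800
```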

To keep the routing table lean, target a segment count per datasource that keeps timeline rebuild cheap while preserving pruning precision. A practical target rows-per-segment starting point is:

$$ \text{targetRows} \approx \frac{\text{targetSegmentMB} \times 1048576}{\text{avgCompressedBytesPerRow}} $$

For a 700 MB target and segments compressing to $\approx 180$ bytes/row, $\text{targetRows} \approx \frac{700 \times 1048576}{180} \approx 4{,}078{,}000$ rows. Feeding that back into the granularity choice keeps per-interval partition counts low, which is exactly what keeps timeline pruning fast. When historical intervals drift below the target band, the fix is not a granularity change but consolidation; see segment size optimization strategies and the broader automated compaction and retention workflows that keep segment counts from ballooning over a datasource's lifetime.
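The same arithmetic as a small helper. target_rows transcribes the formula above; partitions_needed is an added assumption showing one way to feed the result back into a partition count, given an estimated row volume per time interval that you would measure yourself.

```python
import math


def target_rows(target_segment_mb: int, avg_compressed_bytes_per_row: int) -> int:
    """Rows per segment that land near the target segment size."""
    return target_segment_mb * 1048576 // avg_compressed_bytes_per_row


def partitions_needed(rows_per_interval: int, rows_per_segment: int) -> int:
    """Hypothetical helper: partitions required to hold one interval's rows."""
    return max(1, math.ceil(rows_per_interval / rows_per_segment))


rows = target_rows(700, 180)                     # 4077795, i.e. about 4,078,000
partitions = partitions_needed(10_000_000, rows)  # 3 for a 10M-row interval
```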

Timeline memory pressure is also sensitive to how wide each segment's column metadata is, because dictionary and index headers are loaded when a segment is queried; the columnar storage formats in Druid determine that per-segment overhead.

Python Orchestration Snippet

Because discovery is asynchronous, a pipeline that submits ingestion and then queries must wait for segments to become queryable on the Broker — not merely committed to the metadata store. The orchestrator below submits a task, polls the Overlord for terminal status, then confirms segment availability through the Coordinator's load-status endpoint with exponential backoff. It uses only the standard library plus requests.

import time
import requests

OVERLORD = "http://overlord:8090"
COORDINATOR = "http://coordinator:8081"


def submit_task(spec: dict) -> str:
    """Submit an ingestion task and return its task id."""
    resp = requests.post(
        f"{OVERLORD}/druid/indexer/v1/task",
        json=spec,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["task"]


def poll_task(task_id: str, max_attempts: int = 20) -> str:
    """Poll task status with exponential backoff; return the terminal state."""
    delay = 2.0
    for attempt in range(max_attempts):
        resp = requests.get(
            f"{OVERLORD}/druid/indexer/v1/task/{task_id}/status",
            timeout=15,
        )
        resp.raise_for_status()
        status = resp.json()["status"]["status"]
        if status in ("SUCCESS", "FAILED"):
            return status
        time.sleep(delay)
        delay = min(delay * 2, 60.0)  # cap backoff at 60s
    raise TimeoutError(f"task {task_id} did not reach a terminal state")


def await_segment_availability(datasource: str, max_attempts: int = 15) -> None:
    """Block until the Coordinator reports 100% of segments loaded and queryable."""
    delay = 2.0
    for attempt in range(max_attempts):
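        resp = requests.get(
            # The per-datasource endpoint is the one that accepts forceMetadataRefresh.
            f"{COORDINATOR}/druid/coordinator/v1/datasources/{datasource}/loadstatus",
            # Forces a metadata-store re-poll so freshly committed segments are counted.
            # With no interval parameter the check covers only a recent default window.
            params={"forceMetadataRefresh": "true"},
            timeout=15,
        )
        resp.raise_for_status()
        # A non-200 success (e.g. 204 No Content) means no used segments were found yet.
        pct = resp.json().get(datasource, 0.0) if resp.status_code == 200 else 0.0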
        if pct >= 100.0:
            return
        time.sleep(delay)
        delay = min(delay * 2, 60.0)
    raise TimeoutError(f"{datasource} not fully queryable within backoff budget")


def run(spec: dict, datasource: str) -> None:
    task_id = submit_task(spec)
    if poll_task(task_id) != "SUCCESS":
        raise RuntimeError(f"ingestion task {task_id} failed")
    await_segment_availability(datasource)
    print(f"{datasource}: segments loaded and routable")

The two-stage wait is the important pattern: poll_task confirms the metadata commit, and await_segment_availability confirms that Historicals have loaded the new segments, after which the Broker's timeline picks them up on its next discovery sync. Skipping the second stage is the most common cause of "the query returned no rows right after ingestion succeeded." This pattern generalizes across every ingestion path documented under automated ingestion pipeline orchestration.
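Before relying on the defaults, it helps to know how long each stage will actually wait. The hypothetical helper below sums the sleep schedule used by both polling loops (start at 2 s, double, cap at 60 s) so the budget can be compared with the Coordinator period. It ignores request latency, so treat the result as a lower bound.

```python
def backoff_budget(max_attempts: int, initial: float = 2.0, cap: float = 60.0) -> float:
    """Total seconds slept across max_attempts failed polls."""
    total, delay = 0.0, initial
    for _ in range(max_attempts):
        total += delay
        delay = min(delay * 2, cap)
    return total


availability_wait = backoff_budget(15)  # 662.0 s, about eleven PT60S cycles
task_wait = backoff_budget(20)          # 962.0 s
```

If ingestion tasks routinely run longer than the task_wait figure, raise max_attempts in poll_task rather than shortening the cap.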

Failure Modes & Diagnostics

1. Queries miss freshly-ingested data. Symptom: a task reports SUCCESS but queries return no rows for the new interval. Root cause: the Broker timeline has not yet refreshed, or the segments are committed but not yet loaded onto a Historical. Diagnose:

# Coordinator load status for one datasource (100.0 == fully loaded)
curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/my_datasource/loadstatus?forceMetadataRefresh=true" | jq .

# How many segments the Broker's timeline can route to for the interval
curl -s "http://broker:8082/druid/v2/datasources/my_datasource/candidates?intervals=2026-07-01/2026-07-05" | jq 'length'

2. Routing table bloat / ZooKeeper watch storms. Symptom: Broker heap climbs, GC pauses lengthen, ZooKeeper request latency spikes after fine-grained ingestion. Root cause: over-partitioned segments inflate timeline entries and watches. Diagnose:

# Count live segments the Broker is tracking (proxy for timeline size)
curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/my_datasource/segments" | jq 'length'

# Broker heap occupancy over five samples
jstat -gcutil "$(pgrep -f 'org.apache.druid.cli.Main.*broker')" 1000 5

Remediation: switch to druid.serverview.type=http, coarsen segment granularity, and enable compaction so historical intervals consolidate into the target size band.

3. Tier misrouting / cold-tier scans on hot queries. Symptom: low-latency dashboards intermittently hit slow disks. Root cause: replicas exist only on the cold tier for some intervals, or the tier priority list is wrong. Diagnose:

# Per-tier segment counts for the datasource
curl -s "http://coordinator:8081/druid/coordinator/v1/tiers" | jq .
curl -s "http://coordinator:8081/druid/coordinator/v1/datasources/my_datasource" \
  | jq '.tiers'

Remediation: fix druid.broker.select.tier.custom.priorities, verify each Historical's druid.server.tier, and adjust load rules so hot intervals carry a hot-tier replica.

4. Stale timeline after a Broker restart. Symptom: the first queries after a rolling restart return partial results. Root cause: the Broker began serving before its timeline finished hydrating. Remediation: set druid.broker.segment.awaitInitializationOnStart=true and gate the load balancer's health check on GET /status/health returning true, or on GET /druid/broker/v1/readiness returning 200 once the Broker has synced its segment view.
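A restart gate along those lines can be sketched as follows. The function names, the attempt count, and the delay are assumptions. The probe is passed in as a callable so the wait loop can be exercised without a live Broker.

```python
import time
import urllib.request


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True when the endpoint answers 200 with the body 'true'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200 and resp.read().strip() == b"true"
    except OSError:  # URLError and socket timeouts both derive from OSError
        return False


def wait_until_ready(probe, attempts: int = 30, delay: float = 2.0,
                     sleep=time.sleep) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False
```

In a rolling restart this might be invoked as wait_until_ready(lambda: http_probe("http://broker:8082/status/health")) before moving on to the next Broker, where the host and port are the same placeholders used in the curl examples.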

Automation Checklist

Validate segmentGranularity and partitionsSpec against the target rows-per-segment formula before submitting ingestion, so the routing table never bloats at the source.
After every ingestion run, poll GET /druid/coordinator/v1/loadstatus until the datasource reads 100.0 before marking the pipeline green.
Alert when tracked segment count for any datasource exceeds its expected segmentCount by more than 20% — a signal to trigger compaction.
Confirm the discovery transport is http on clusters above ~50k live segments; audit ZooKeeper request latency if still on the ZooKeeper transport.
Verify hot intervals carry a hot-tier replica by reconciling load rules against GET /druid/coordinator/v1/datasources/{ds}?full=true on a schedule.
Gate rolling restarts on awaitInitializationOnStart=true plus a health-check probe so Brokers never serve on a half-built timeline.
Version-control runtime.properties and Coordinator dynamic config; diff live config against the repo to catch drift.
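The drift check in the last item can be approximated with a small parser and a dictionary diff. This is a simplified sketch: parse_properties handles only key=value lines and full-line comments, not the full Java properties syntax, and both function names are hypothetical. One way to obtain the live side is each process's GET /status/properties endpoint, though how you collect it depends on your tooling.

```python
def parse_properties(text: str) -> dict:
    """Parse key=value lines, skipping blanks and full-line comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def diff_config(expected: dict, live: dict) -> dict:
    """Map each drifting key to an (expected, live) pair; None means absent."""
    drift = {}
    for key in sorted(expected.keys() | live.keys()):
        want, have = expected.get(key), live.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift
```

An empty result from diff_config means the repository and the running process agree on every key that either side defines.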

For authoritative reference on Broker internals and the discovery transports, consult the Apache Druid Broker design documentation, and for ephemeral-node semantics on the ZooKeeper transport see the Apache ZooKeeper Programmer's Guide.

Related

Druid segment metadata storage deep dive: the metadata state machine and handoff commit that feeds the Broker timeline.
Understanding Druid segment granularity: how time partitioning sets the cardinality of the routing table.
Columnar storage formats in Druid: the per-segment column overhead that timeline hydration pays for.
Security boundaries for segment access: datasource authorization applied on the same Broker query path.
Segment compaction, retention & storage optimization: consolidation workflows that keep segment counts, and therefore routing tables, bounded.
