Columnar Storage Formats in Druid: Encoding, Indexing & Compression Internals

Apache Druid's sub-second analytical throughput rests on a columnar storage layout where every segment is a self-contained block of independently encoded, indexed, and compressed columns. For OLAP data engineers and pipeline builders, the encoding decisions Druid makes at ingestion time are not opaque internals — they are directly configurable through the indexSpec, and they govern segment size, query latency, and Historical-node heap pressure. Treating columnar output as a deterministic, programmable pipeline artifact rather than a fixed side effect is what separates a stable cluster from one that drifts. This page sits under Apache Druid Segment Architecture & Lifecycle Fundamentals, which frames how segments are built, distributed, and retired; here we focus specifically on the on-disk column formats and how to control them.

Mechanics & Internals

A Druid segment is a directory whose principal artifacts are a per-column set of encoded value stores, an inverted index, and a compact metadata descriptor (meta.smoosh / version.bin plus the smoosh value files). Unlike a row store, Druid never materializes a full row on disk; a query touches only the columns it references, and each column carries its own encoding independent of its neighbours. This physical isolation is what enables aggressive predicate pushdown and vectorized scans, but Druid only reaches those speeds when the upstream data contract matches the type each encoder expects.

Figure: four independently encoded column families in one segment: __time and metrics as delta/bit-packed numerics, string dimensions as dictionary + id column + Roaring bitmaps, plus the smoosh metadata block; every data column's blocks are then LZ4- or ZSTD-compressed per tier.

Dictionary encoding for string dimensions. Every string (and multi-value string) dimension is stored as an ordered dictionary of distinct values plus a column of integer IDs that reference it. The dictionary is sorted, which makes equality and range filters cheap and makes GROUP BY operate over dense integer codes rather than raw strings. The cost is cardinality-sensitive: a high-cardinality column (UUIDs, request IDs, raw URLs) produces a large dictionary that inflates segment metadata and raises the per-column memory Historicals must map. Enforcing cardinality budgets before ingestion — and pushing genuinely unique fields into non-indexed columns — is the first lever for keeping segments lean.
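The dictionary-plus-ids layout described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not Druid's on-disk format; the function name and sample values are hypothetical.

```python
def dictionary_encode(values):
    """Return (sorted dictionary, id column) for a list of strings."""
    dictionary = sorted(set(values))                      # ordered distinct values
    id_of = {value: idx for idx, value in enumerate(dictionary)}
    ids = [id_of[value] for value in values]              # one integer id per row
    return dictionary, ids


rows = ["US", "DE", "US", "FR", "DE", "US"]
dictionary, ids = dictionary_encode(rows)
print(dictionary)  # ['DE', 'FR', 'US']
print(ids)         # [2, 0, 2, 1, 0, 2]
```

Because the dictionary is sorted, an equality filter is a binary search for one id, and a GROUP BY can bucket rows by integer id before ever touching the strings.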

Roaring bitmap indexes. For each distinct value of a dictionary-encoded dimension, Druid maintains a bitmap marking the rows that contain it, stored as Roaring bitmaps. Filter evaluation (WHERE, IN, boolean combinations) becomes a sequence of bitmap AND/OR/NOT operations, which is dramatically faster than a linear scan. Bitmaps are compact for low-to-moderate cardinality but offer little benefit on near-unique columns, where the index size approaches the data size. The bitmap block in indexSpec selects the implementation (roaring is the default and recommended; concise remains for legacy compatibility).
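To make the bitmap algebra concrete, here is a small sketch that uses plain Python integers as uncompressed bitsets. That is an assumption made for brevity: real Roaring bitmaps store the same row sets in compressed containers, and the helper names below are hypothetical.

```python
def build_bitmap_index(ids, cardinality):
    """One bitmap per dictionary id; bit r is set when row r holds that id."""
    bitmaps = [0] * cardinality
    for row, value_id in enumerate(ids):
        bitmaps[value_id] |= 1 << row
    return bitmaps


def matching_rows(bitmap):
    """Decode a bitmap back into a sorted list of row numbers."""
    return [row for row in range(bitmap.bit_length()) if bitmap >> row & 1]


# country ids refer to dictionary ['DE', 'FR', 'US']; device ids to ['mobile', 'web']
country_ids = [2, 0, 2, 1, 0, 2]
device_ids = [0, 1, 1, 0, 0, 1]
country_index = build_bitmap_index(country_ids, 3)
device_index = build_bitmap_index(device_ids, 2)

# WHERE country IN ('US', 'FR') AND device = 'mobile'
selected = (country_index[2] | country_index[1]) & device_index[0]
print(matching_rows(selected))  # [0, 3]
```

The filter never inspects a row value: it ORs two bitmaps, ANDs a third, and only then decodes the surviving row numbers.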

Numeric column encoding. Long columns, whether metrics or numeric dimensions, are governed by the longEncoding field: longs stores every value as a fixed 8 bytes, while auto picks between an offset encoding (each value stored relative to the column minimum) and a lookup table of distinct values, then packs the result at a variable bit width. Bounded or slowly varying sequences (epoch timestamps within one segment, incrementing counters) shrink considerably under offset plus bit-packing, and low-cardinality longs do well under the table form. Float and double columns are stored at fixed width and rely on block compression alone. auto typically wins on skewed or bounded ranges at a small encode-time cost.
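The gain from storing values relative to a base and bit-packing them can be estimated with a short sketch. It illustrates the arithmetic only; Druid's actual block layout and supported bit widths differ, and the function name is hypothetical.

```python
def offset_encoding_plan(values):
    """Bits per value when storing (value - min) instead of raw 64-bit longs."""
    base = min(values)
    span = max(values) - base
    bits = max(1, span.bit_length())          # width needed for the largest offset
    return {
        "base": base,
        "bits_per_value": bits,
        "packed_bytes": (bits * len(values) + 7) // 8,
        "raw_bytes": 8 * len(values),
    }


# 1000 epoch-millisecond timestamps, one second apart
timestamps = [1_700_000_000_000 + 1000 * i for i in range(1000)]
plan = offset_encoding_plan(timestamps)
print(plan["bits_per_value"], plan["packed_bytes"], plan["raw_bytes"])  # 20 2500 8000
```

The raw values need 41 bits each, but their offsets from the minimum fit in 20, which is why bounded ranges are where auto tends to pay off.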

Compression codecs. After encoding, column value blocks are compressed. lz4 is the default and the right choice for hot and warm tiers because decompression is cheap relative to scan cost. zstd yields markedly better ratios for cold and archival tiers where CPU during the occasional scan is acceptable in exchange for lower storage cost. Codecs are set per column family via dimensionCompression and metricCompression inside the indexSpec, and for long columns they apply on top of whatever longEncoding produced. These choices belong to the same decision surface as LZ4 vs ZSTD codec selection by storage tier: the codec you pick should track the tier the segment will live on.
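One way to keep codec and tier aligned is to derive the indexSpec from the tier name instead of hand-editing it. The sketch below assumes the four tier labels used in this section; the function name and the mapping are illustrative, not a Druid API.

```python
# Assumed tier labels, taken from the prose above; adjust to your cluster's tiers.
TIER_CODEC = {"hot": "lz4", "warm": "lz4", "cold": "zstd", "archival": "zstd"}


def index_spec_for_tier(tier):
    """Build an indexSpec dict whose compression codecs follow the storage tier."""
    try:
        codec = TIER_CODEC[tier]
    except KeyError:
        raise ValueError(f"unknown tier {tier!r}; expected one of {sorted(TIER_CODEC)}")
    return {
        "bitmap": {"type": "roaring"},
        "dimensionCompression": codec,
        "metricCompression": codec,
        "longEncoding": "auto",
    }


print(index_spec_for_tier("cold"))
```

The returned dict can be dropped into tuningConfig.indexSpec for ingestion or compaction, so a tier change becomes a one-argument change.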

Temporal layout multiplies every one of these effects. The segmentGranularity in granularitySpec determines how many rows land in each segment, and therefore dictionary density, bitmap length, and codec block counts. The trade-offs there are covered in depth under segment granularity settings; the short version is that granularity should be derived from query patterns and expected event velocity, not from ingestion volume alone. Once finalized, columnar layout also shapes how Brokers prune and dispatch scans, as detailed in query routing and segment discovery.

Validated Configuration Spec

The indexSpec lives inside tuningConfig and is the single place where encoding, indexing, and compression are declared. The batch (index_parallel) spec below is copy-ready against a recent stable Druid release; every field that touches columnar output is annotated.

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "events_columnar",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": [
          { "type": "string", "name": "country", "createBitmapIndex": true },
          { "type": "string", "name": "device", "createBitmapIndex": true },
          { "type": "string", "name": "request_id", "createBitmapIndex": false },
          { "type": "long", "name": "status_code" }
        ]
      },
      "metricsSpec": [
        { "type": "count", "name": "events" },
        { "type": "longSum", "name": "bytes_sent", "fieldName": "bytes" },
        { "type": "doubleSum", "name": "latency_ms", "fieldName": "latency" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "rollup": true
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "s3", "prefixes": ["s3://analytics-bucket/raw/"] },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsInMemory": 1000000,
      "forceGuaranteedRollup": true,
      "partitionsSpec": {
        "type": "hashed",
        "targetRowsPerSegment": 5000000
      },
      "indexSpec": {
        "bitmap": { "type": "roaring" },
        "dimensionCompression": "lz4",
        "stringDictionaryEncoding": { "type": "frontCoded", "bucketSize": 4 },
        "metricCompression": "lz4",
        "longEncoding": "auto"
      },
      "indexSpecForIntermediatePersists": {
        "bitmap": { "type": "roaring" },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "auto"
      }
    }
  }
}

Field-by-field, the columnar-relevant keys are:

dimensionsSpec.dimensions[].createBitmapIndex: set false on near-unique columns such as request_id so Druid stores the values without paying for a bitmap that would rival the column in size. Filterable, low-cardinality columns keep true.
indexSpec.bitmap.type: roaring for all new datasources. Only fall back to concise when reading legacy segments that require it.
indexSpec.dimensionCompression / metricCompression: lz4 for hot/warm, zstd for cold/archival. Keep these aligned with the destination tier and re-apply them during compaction rather than mutating them mid-pipeline.
indexSpec.stringDictionaryEncoding: frontCoded shrinks dictionaries for columns with shared prefixes (paths, hostnames, hierarchical labels) versus the default utf8; bucketSize of 4 is a safe starting point.
indexSpec.longEncoding: auto lets Druid pick table or offset packing per column; use longs only when you have measured that auto loses on your distribution.
indexSpecForIntermediatePersists: controls encoding of the intermediate segments that are persisted to disk during ingestion and merged at the end of the task. Keeping it light (lz4) reduces persist-phase CPU while the final indexSpec governs the published segment.
granularitySpec.rollup: with rollup: true, pre-aggregation collapses rows before encoding, which shrinks dictionaries and bitmaps and is often the single biggest lever on segment size.
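The field rules above lend themselves to a pre-submission check. The following is a sketch under stated assumptions: the function name, the allowed-codec policy, and the 100,000 cardinality budget are hypothetical, and the cardinalities argument is whatever profile of the source data you already have.

```python
ALLOWED_CODECS = {"lz4", "zstd"}  # assumed policy for this page's tiers, not Druid's full codec list


def validate_columnar_spec(task_spec, cardinalities=None, max_indexed_cardinality=100_000):
    """Return a list of human-readable problems; an empty list means the spec passes."""
    problems = []
    index_spec = task_spec["spec"]["tuningConfig"].get("indexSpec", {})
    if index_spec.get("bitmap", {}).get("type") != "roaring":
        problems.append("indexSpec.bitmap.type must be roaring")
    for key in ("dimensionCompression", "metricCompression"):
        if index_spec.get(key) not in ALLOWED_CODECS:
            problems.append(f"indexSpec.{key} must be one of {sorted(ALLOWED_CODECS)}")
    for dim in task_spec["spec"]["dataSchema"]["dimensionsSpec"]["dimensions"]:
        if isinstance(dim, str):                      # shorthand form: bare column name
            dim = {"name": dim, "type": "string"}
        is_string = dim.get("type", "string") == "string"
        indexed = dim.get("createBitmapIndex", True)  # string dimensions are indexed by default
        observed = (cardinalities or {}).get(dim["name"], 0)
        if is_string and indexed and observed > max_indexed_cardinality:
            problems.append(f"{dim['name']}: cardinality {observed} exceeds budget for an indexed column")
    return problems


example = {"spec": {
    "dataSchema": {"dimensionsSpec": {"dimensions": [
        {"type": "string", "name": "country"},
        {"type": "string", "name": "request_id"},
    ]}},
    "tuningConfig": {"indexSpec": {"bitmap": {"type": "concise"},
                                   "dimensionCompression": "lz4",
                                   "metricCompression": "lz4"}},
}}
for problem in validate_columnar_spec(example, {"country": 200, "request_id": 5_000_000}):
    print(problem)
```

Run against the example, the check reports the concise bitmap and the indexed request_id column; both are the kind of drift that is cheaper to catch before a task is submitted.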

Sizing Heuristics & Formulas

Columnar footprint is predictable enough to plan capacity against, provided you calibrate on real post-compression row sizes rather than raw estimates. The target row count for a segment follows directly from your desired on-disk size and the average compressed bytes per row:

$$ \text{targetRowsPerSegment} \approx \frac{\text{targetBytes}}{\text{avgCompressedBytesPerRow}} $$

For a 700 MB target and a measured 180 compressed bytes per row:

$$ \text{targetRowsPerSegment} \approx \frac{700 \times 1{,}048{,}576}{180} \approx 4{,}078{,}000 $$
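The same calculation as a small helper, useful when the measured bytes per row change after a schema edit. The function name is hypothetical; the inputs are the two quantities from the formula.

```python
def target_rows_per_segment(target_mb, avg_compressed_bytes_per_row):
    """Rows that fit in a segment of target_mb, given measured compressed bytes per row."""
    if avg_compressed_bytes_per_row <= 0:
        raise ValueError("avg_compressed_bytes_per_row must be positive")
    return int(target_mb * 1_048_576 / avg_compressed_bytes_per_row)


print(target_rows_per_segment(700, 180))  # 4077795
```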

Dictionary and bitmap overhead do not scale linearly with row count — they scale with cardinality. A useful upper-bound estimate for a single string dimension's on-disk contribution is:

$$ \text{columnBytes} \approx \underbrace{C \times \bar{L}}_{\text{dictionary}} + \underbrace{N \times \lceil \log_2 C \rceil / 8}_{\text{id column}} + \underbrace{C \times b}_{\text{bitmaps}} $$

where $C$ is distinct values, $\bar{L}$ average value length, $N$ total rows, and $b$ the average compressed bitmap bytes per value. The middle term shows why cardinality drives id-column width, and the first and third terms show why a high $C$ inflates both the dictionary and the index set — the practical reason to set createBitmapIndex: false on near-unique fields.
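A rough calculator for the three terms makes the cardinality effect easy to see. This is a sketch of the upper-bound estimate only: the function name is hypothetical, and the per-value bitmap sizes passed in below are placeholder assumptions rather than measurements.

```python
def estimate_string_column_bytes(cardinality, avg_value_len, rows, avg_bitmap_bytes):
    """Upper-bound on-disk estimate for one string dimension, split by term."""
    dictionary = cardinality * avg_value_len
    # (C - 1).bit_length() equals ceil(log2(C)) for C >= 1; keep at least 1 bit per id
    id_bits = max(1, (cardinality - 1).bit_length())
    id_column = (rows * id_bits + 7) // 8
    bitmaps = cardinality * avg_bitmap_bytes
    return {
        "dictionary": dictionary,
        "id_column": id_column,
        "bitmaps": bitmaps,
        "total": dictionary + id_column + bitmaps,
    }


# Low-cardinality dimension vs a near-unique one over the same 5M rows
country = estimate_string_column_bytes(200, 2, 5_000_000, 4096)
request_id = estimate_string_column_bytes(5_000_000, 36, 5_000_000, 16)
print(country)
print(request_id)
```

With these assumed inputs the id column dominates for country, while for request_id the dictionary and bitmap terms dwarf it, which is the case the createBitmapIndex: false setting is meant to address.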

Rollup effectiveness, when enabled, is worth tracking as an explicit ratio because it directly discounts every term above:

$$ \text{rollupRatio} = \frac{\text{rawRows}}{\text{rolledUpRows}} $$

A rollup ratio near 1.0 means pre-aggregation is buying you nothing and the queryGranularity or dimension set should be revisited. These sizing relationships feed straight into partition tuning; the applied form for Historical nodes is worked through in optimizing segment size for Historical nodes. Aim to keep published segments in the 500–750 MB range: larger segments strain Historical heap and lengthen load times, smaller ones multiply metadata and Broker coordination cost.
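The ratio is simple enough to compute inline in a pipeline check. In the sketch below the function names and the 1.2 alert threshold are hypothetical; one common way to obtain the two inputs is to compare the sum of an ingestion-time count metric (events in the spec above) with COUNT(*) over the same interval.

```python
def rollup_ratio(raw_rows, rolled_up_rows):
    """rawRows / rolledUpRows; 1.0 means rollup collapsed nothing."""
    if rolled_up_rows <= 0:
        raise ValueError("rolled_up_rows must be positive")
    return raw_rows / rolled_up_rows


def rollup_is_effective(raw_rows, rolled_up_rows, min_ratio=1.2):
    """True when pre-aggregation clears an assumed minimum ratio."""
    return rollup_ratio(raw_rows, rolled_up_rows) >= min_ratio


print(rollup_ratio(12_000_000, 3_000_000))        # 4.0
print(rollup_is_effective(1_000_000, 990_000))    # False
```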

Python Orchestration Snippet

Encoding choices only stay correct if a pipeline enforces them. The orchestrator below submits an ingestion task with a validated indexSpec, polls to a terminal state with exponential backoff, and then audits the resulting segment sizes so drift is caught at ingestion time rather than in production. It uses only the standard library plus requests.

import time
import logging
import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("columnar_ingest")


class DruidColumnarPipeline:
    def __init__(self, overlord_url, coordinator_url, auth=None):
        self.overlord = overlord_url.rstrip("/")
        self.coordinator = coordinator_url.rstrip("/")
        self.session = requests.Session()
        self.session.auth = auth

    def submit(self, ingestion_spec):
        resp = self.session.post(
            f"{self.overlord}/druid/indexer/v1/task",
            json=ingestion_spec,
            timeout=30,
        )
        resp.raise_for_status()
        task_id = resp.json()["task"]
        logger.info("Submitted ingestion task %s", task_id)
        return task_id

    def poll_until_terminal(self, task_id, base_delay=5, max_delay=60, max_wait=3600):
        start = time.time()
        delay = base_delay
        while time.time() - start < max_wait:
            resp = self.session.get(
                f"{self.overlord}/druid/indexer/v1/task/{task_id}/status",
                timeout=10,
            )
            resp.raise_for_status()
            status = resp.json().get("status", {}).get("status")
            if status in ("SUCCESS", "FAILED", "INTERRUPTED"):
                logger.info("Task %s terminal state: %s", task_id, status)
                return status
            logger.debug("Task %s pending (%s); sleeping %ss", task_id, status, delay)
            time.sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
        raise TimeoutError(f"Task {task_id} exceeded {max_wait}s")

    def audit_segment_sizes(self, datasource, low_mb=500, high_mb=750):
        resp = self.session.get(
            f"{self.coordinator}/druid/coordinator/v1/datasources/"
            f"{datasource}/segments?full=true",
            timeout=30,
        )
        resp.raise_for_status()
        sizes_mb = [s["size"] / 1048576 for s in resp.json()]
        if not sizes_mb:
            raise RuntimeError(f"No segments found for {datasource}")
        out_of_band = [round(m, 1) for m in sizes_mb if m < low_mb or m > high_mb]
        avg = sum(sizes_mb) / len(sizes_mb)
        logger.info(
            "%s: %d segments, avg %.1f MB, %d outside %d-%d MB band",
            datasource, len(sizes_mb), avg, len(out_of_band), low_mb, high_mb,
        )
        return {"count": len(sizes_mb), "avg_mb": avg, "out_of_band": out_of_band}


# Usage
# pipe = DruidColumnarPipeline("http://overlord:8090", "http://coordinator:8081")
# task_id = pipe.submit(ingestion_spec)          # ingestion_spec = the JSON above
# if pipe.poll_until_terminal(task_id) == "SUCCESS":
#     report = pipe.audit_segment_sizes("events_columnar")
#     if report["out_of_band"]:
#         raise SystemExit(f"Segment drift detected: {report['out_of_band']}")

Because the audit fails loudly when segments fall outside the target band, this pattern slots directly into a CI/CD gate: a spec that produces mis-sized columns never reaches a Historical node. For the broader templating of these specs across environments, see dynamic ingestion spec generation.

Failure Modes & Diagnostics

Columnar problems surface as heap pressure, slow filters, or bloated metadata long before they read as "encoding" issues. The shell workflows below isolate the actual cause against the Coordinator, Overlord, and Historical REST APIs.

Dictionary bloat from high-cardinality dimensions. Symptom: growing Historical heap and slow GROUP BY. Confirm by inspecting per-segment dimension cardinality through segment metadata:

# Per-column cardinality for a segment interval via a native segmentMetadata query
curl -s -X POST "http://<broker-host>:8082/druid/v2/" \
  -H 'Content-Type: application/json' \
  -d '{"queryType": "segmentMetadata", "dataSource": "events_columnar",
       "intervals": ["2026-07-01T00:00:00.000Z/2026-07-02T00:00:00.000Z"],
       "analysisTypes": ["cardinality"], "merge": true}' \
  | jq '.[0].columns | to_entries
        | map({column: .key, cardinality: .value.cardinality})
        | sort_by(.cardinality) | reverse | .[:20]'

Root cause is a filterable-by-accident unique column carrying a dictionary and bitmaps. Remediation: set createBitmapIndex: false (or drop the dimension), then compact the affected interval.

Oversized or undersized segments. Symptom: OOM during load, or metadata store bloat with many tiny segments. Diagnose the size distribution and Historical heap directly:

# Segment size distribution for a datasource
curl -s "http://<coordinator-host>:8081/druid/coordinator/v1/datasources/events_columnar/segments?full=true" \
  | jq '.[] | {id: .identifier, size_mb: (.size / 1048576 | floor)}'

# Historical JVM GC / heap pressure while segments load
jstat -gcutil <historical_pid> 1000 5

Root cause is a targetRowsPerSegment calibrated against raw rather than compressed row size. Remediation: recompute with the formula above and run a compaction pass rather than editing the value mid-pipeline.

Codec mismatch against tier. Symptom: cold-tier storage cost higher than expected, or hot-tier scans burning CPU on decompression. Segment metadata records an indexSpec only under lastCompactionState, so for segments written by compaction inspect that field and confirm it matches the tier the segment now lives on; for segments that came straight from ingestion, check the submitted task spec instead:

# Full payloads for segments overlapping an interval; lastCompactionState (and its
# indexSpec) is null unless the segment was written by a compaction that stored its state
curl -s -X POST "http://<coordinator-host>:8081/druid/coordinator/v1/metadata/datasources/events_columnar/segments?full" \
  -H 'Content-Type: application/json' -d '["2026-07-01T00:00:00.000Z/2026-07-02T00:00:00.000Z"]' \
  | jq '.[0] | {interval: .interval, indexSpec: .lastCompactionState.indexSpec, size_mb: (.size/1048576|floor)}'

Remediation: re-run compaction with the tier-appropriate dimensionCompression/metricCompression, coordinated with the retention rules that move data between tiers under automated compaction and retention optimization.

Missing segment metadata after ingestion. Symptom: partial query results despite a SUCCESS task. Confirm registration and expected replication:

curl -s "http://<coordinator-host>:8081/druid/coordinator/v1/metadata/datasources/events_columnar" | jq

Root cause is usually inconsistent dataSource naming or an interval never handed off. Remediation is post-ingestion verification wired into the pipeline, as covered in schema validation for Druid specs.

Automation Checklist

Enforce per-dimension cardinality budgets in pre-ingestion validation and reject specs where a createBitmapIndex: true column exceeds the threshold.
Pin indexSpec.bitmap.type to roaring and assert it in CI so no spec silently ships concise.
Derive dimensionCompression/metricCompression from the destination storage tier (lz4 hot/warm, zstd cold/archival) and re-apply on every compaction.
Compute targetRowsPerSegment from measured compressed bytes per row, not raw size, and re-calibrate after any schema change.
Run the post-ingestion size audit and fail the gate when more than a defined fraction of segments fall outside the 500–750 MB band.
Track rollupRatio per datasource and alert when it collapses toward 1.0, signalling wasted pre-aggregation.
Verify segment metadata registration and replication via the Coordinator API before marking a pipeline run healthy.
Export segment/size and Historical heap metrics to Prometheus and alert on average-size drift and sustained GC pressure.

Related

Apache Druid Segment Architecture & Lifecycle Fundamentals — the parent overview of how segments are built, distributed, and retired.
Optimizing Segment Size for Historical Nodes — applies these sizing formulas to Historical heap and load behaviour.
Understanding Druid Segment Granularity — how temporal partitioning drives dictionary density and bitmap length.
Query Routing and Segment Discovery — how columnar pruning shapes Broker dispatch and scan cost.
Automated Compaction, Retention & Storage Optimization — where codec selection and re-encoding are enforced over the segment lifecycle.
