Apache Iceberg Architecture

Apache Iceberg represents every table as a tree of metadata files sitting in object storage alongside the actual data. Understanding that tree is the key to understanding how Iceberg delivers ACID guarantees, fast queries, and multi-engine access without a central coordinator.

The Metadata Tree

```mermaid
graph TD
    CAT["Catalog: table name to current metadata.json location"]
    META["Table Metadata JSON: schema, partition spec, sort order, snapshot list, current snapshot ID"]
    ML["Manifest List (Avro): one record per manifest, includes partition-level min/max summaries"]
    MF1["Manifest File (Avro): one record per data file, includes column-level min/max and null counts"]
    MF2["Manifest File (Avro): one record per data file, includes column-level min/max and null counts"]
    DF1["data-00001.parquet"]
    DF2["data-00002.parquet"]
    DF3["data-00003.parquet"]
    DEL["delete-00001.parquet (positional or equality deletes)"]
    CAT --> META
    META --> ML
    ML --> MF1
    ML --> MF2
    MF1 --> DF1
    MF1 --> DF2
    MF2 --> DF3
    MF2 --> DEL
```

Table Metadata JSON

Every Iceberg table has a current metadata JSON file recording the full schema (with column IDs, not just names), all partition specs, all sort orders, and a list of all snapshots. The catalog holds a pointer to the current version. When you run ALTER TABLE ... ADD COLUMN, Iceberg writes a new metadata JSON with the updated schema and swaps the catalog pointer. No data files are rewritten.
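To make this concrete, here is an abridged, illustrative sketch of a metadata JSON file. The field names follow the Iceberg table spec (format version 2); the values, paths, and IDs are invented for this example:

```json
{
  "format-version": 2,
  "current-schema-id": 0,
  "schemas": [
    {
      "schema-id": 0,
      "fields": [
        {"id": 1, "name": "order_id", "type": "long", "required": true},
        {"id": 2, "name": "order_date", "type": "date", "required": true}
      ]
    }
  ],
  "partition-specs": [
    {
      "spec-id": 0,
      "fields": [
        {"source-id": 2, "field-id": 1000, "transform": "day", "name": "order_date_day"}
      ]
    }
  ],
  "current-snapshot-id": 3051729675574597004,
  "snapshots": [
    {
      "snapshot-id": 3051729675574597004,
      "timestamp-ms": 1735689600000,
      "manifest-list": "s3://bucket/warehouse/orders/metadata/snap-3051729675574597004.avro"
    }
  ]
}
```

An `ALTER TABLE ... ADD COLUMN` produces a new file like this with one more entry under `fields` and a new `schema-id`; the data files are untouched.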

Snapshots

A snapshot represents the table state at a specific committed transaction. Each snapshot has a unique ID, a parent snapshot ID, a timestamp, a summary of what changed, and a pointer to the manifest list. Snapshots are immutable. Once written, they never change. Time travel and consistent reads are direct consequences of this.

Manifest List

An Avro file in which each record describes one manifest file and includes a summary of that manifest's partition value ranges. This summary enables manifest-level partition pruning: the query planner reads only the manifest list entries to determine which manifests can be skipped entirely, before opening any manifest file.
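The pruning logic amounts to a range check per entry. A simplified sketch, with invented field names standing in for the partition summaries:

```python
# Each manifest-list entry carries min/max bounds for the partition
# values of the files inside that manifest. The planner drops whole
# manifests whose bounds cannot overlap the query predicate.
manifest_list = [
    {"path": "m1.avro", "date_min": "2025-11-01", "date_max": "2025-12-31"},
    {"path": "m2.avro", "date_min": "2026-01-01", "date_max": "2026-02-28"},
    {"path": "m3.avro", "date_min": "2026-03-01", "date_max": "2026-04-30"},
]

def prune_manifests(entries: list[dict], date_lo: str) -> list[dict]:
    """Keep only manifests whose range can contain date >= date_lo."""
    return [e for e in entries if e["date_max"] >= date_lo]

kept = prune_manifests(manifest_list, "2026-01-01")
# m1's entire range predates the filter, so m1.avro is never opened.
```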

Manifest Files

Each manifest file is Avro. Every record describes one data file or delete file and includes the file's path, format, partition values, record count, file size, and per-column statistics (min value, max value, null count). These statistics enable file-level data skipping: if a file's max order_date is earlier than the query's filter, the engine skips that file without opening it.
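The same idea applies one level down, now using the per-column statistics. A hedged sketch (structures invented, not the Iceberg planner itself) showing a file skipped via its max value, and a second skip via null counts:

```python
# Per-file stats as a manifest record might summarize them (invented).
files = [
    {"path": "data-00001.parquet", "order_date_max": "2025-12-31",
     "null_count": 0, "record_count": 1000},
    {"path": "data-00002.parquet", "order_date_max": "2026-01-20",
     "null_count": 12, "record_count": 800},
]

def file_may_match(stats: dict, lo: str) -> bool:
    # If the file's max order_date is earlier than the filter's lower
    # bound, no row inside can match: skip without opening the file.
    if stats["order_date_max"] < lo:
        return False
    # If every value in the column is null, the predicate can never hold.
    if stats["null_count"] == stats["record_count"]:
        return False
    return True

kept = [f["path"] for f in files if file_may_match(f, "2026-01-01")]
```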

The Read Path: How Query Planning Works

```mermaid
flowchart TD
    Q["Query: SELECT ... WHERE region='AMER' AND date >= '2026-01-01'"]
    Q --> C["1. Catalog lookup: get current metadata.json path"]
    C --> M["2. Read metadata.json: get current snapshot ID"]
    M --> ML["3. Read manifest list: evaluate partition summaries. Skip manifests outside date/region range"]
    ML --> MF["4. Read qualifying manifests: evaluate file-level stats. Skip files outside date range"]
    MF --> BP["5. Apply bloom filters if present"]
    BP --> READ["6. Read qualifying Parquet files"]
    READ --> RESULT["Return result rows"]
```

Steps 1 through 5 are metadata operations — no Parquet data is read. A query that filters on a well-partitioned date column may skip 99% of the data files. This is how Iceberg achieves warehouse-level query performance on raw object storage.

The Write Path: Committing a Snapshot

```mermaid
sequenceDiagram
    participant W as Writer
    participant Cat as Catalog
    participant S3 as Object Storage
    W->>Cat: Load table (get metadata.json version N)
    Cat-->>W: metadata.json path, schema, partition spec
    W->>S3: Write Parquet data files (parallel)
    W->>S3: Write manifest file (lists new data files + their stats)
    W->>S3: Write manifest list (new snapshot referencing manifests)
    W->>S3: Write metadata.json version N+1
    W->>Cat: Commit: swap table pointer from N to N+1
    Cat-->>W: 200 OK (or 409 Conflict if another writer committed first)
    Note over W,Cat: On 409: W re-reads current state and retries
```

Concurrency

Iceberg uses optimistic concurrency control. Writers do not take locks. Two appends to different partitions: no conflict, both succeed. Two writers overwriting the same partition: conflict, one retries. A compaction job and an append job: usually compatible, both succeed.
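The commit step is essentially a compare-and-swap on the catalog's metadata pointer. A minimal sketch of that retry loop, with an invented in-process `Catalog` standing in for a real catalog service:

```python
import threading

class Catalog:
    """Toy stand-in for a catalog: one atomically swapped pointer."""
    def __init__(self) -> None:
        self.version = 0                    # current metadata.json version
        self._lock = threading.Lock()       # stands in for the catalog's atomic swap

    def commit(self, expected: int, new: int) -> bool:
        """Swap the pointer only if nobody committed since `expected`."""
        with self._lock:
            if self.version != expected:
                return False                # 409 Conflict: another writer won
            self.version = new
            return True

def write_with_retry(cat: Catalog, max_retries: int = 5) -> int:
    for _ in range(max_retries):
        base = cat.version                  # 1. load current table state
        # 2. write data files, manifests, manifest list, metadata N+1 ...
        if cat.commit(base, base + 1):      # 3. attempt the pointer swap
            return base + 1
        # 4. on conflict: fall through, re-read state, retry on the new base
    raise RuntimeError("commit failed after retries")

cat = Catalog()
write_with_retry(cat)
write_with_retry(cat)   # cat.version is now 2
```

Note that a real retry re-validates the new snapshot against the writer's intent (an append usually still applies cleanly; an overwrite of a changed partition may not), which is why the three scenarios above resolve differently.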

Schema Evolution via Column IDs

Every column has a permanent numeric field ID. Iceberg writes these IDs into the Parquet file metadata and resolves columns by ID, not by name, at read time. This means renaming a column, adding a column, or reordering columns requires no data rewrites: old files still work because Iceberg maps their field IDs to the current schema's names when it reads them.
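A tiny sketch of that read-time resolution, with invented structures: the file's columns are keyed by field ID, so a rename only changes the ID-to-name mapping, never the file.

```python
# Columns of a Parquet file written under the old schema, keyed by field ID.
file_columns = {1: [100, 101], 2: ["2026-01-01", "2026-01-02"]}

# Current schema after renames/reorders: field ID -> current column name.
current_schema = {1: "order_id", 2: "order_date"}

def project(file_cols: dict, schema: dict) -> dict:
    """Resolve stored field IDs to the schema's current column names.
    IDs absent from the schema (dropped columns) are simply ignored."""
    return {schema[fid]: vals for fid, vals in file_cols.items() if fid in schema}

rows = project(file_columns, current_schema)
```

Renaming `order_date` to `ship_date` would change only `current_schema[2]`; the same old file projects cleanly under the new name.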

Snapshot Maintenance

Old snapshots keep references to data files alive, preventing those files from being garbage collected. The standard maintenance operations are: expire snapshots (drop snapshots past a retention window and delete files only they reference), remove orphan files (delete files in the table location that no snapshot references), compaction (rewrite many small data files into fewer larger ones), and rewrite manifests (reorganize manifests so partition pruning stays effective).
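The reachability rule behind snapshot expiration can be sketched in a few lines (structures invented): a data file becomes deletable only when no retained snapshot references it.

```python
# snapshot_id -> set of data files that snapshot references (invented).
snapshots = {
    1: {"f1.parquet", "f2.parquet"},
    2: {"f2.parquet", "f3.parquet"},
    3: {"f3.parquet", "f4.parquet"},
}

def expire(snaps: dict, keep_ids: set) -> set:
    """Return the data files that become deletable once only `keep_ids`
    are retained: everything not reachable from a kept snapshot."""
    all_files = set().union(*snaps.values())
    live = set().union(*(snaps[i] for i in keep_ids))
    return all_files - live

deletable = expire(snapshots, keep_ids={3})
# Keeping only snapshot 3 frees f1 and f2; f3 and f4 stay live.
```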

Go Deeper

Alex Merced has authored three hands-on books covering Apache Iceberg, the Agentic Lakehouse, and modern data architecture. Pick up a copy to master the full ecosystem.