Smarter Data Layout — Sorting and Clustering Iceberg Tables

Published: at 09:00 AM

So far in this series, we’ve focused on optimizing file sizes to reduce metadata and scan overhead. But how data is laid out within those files can be just as important as the size of the files themselves.

In this post, we’ll explore clustering techniques in Apache Iceberg, including sort order and Z-ordering, and how these techniques improve query performance by reducing the amount of data that needs to be read.

Why Clustering Matters

Imagine a query that filters on customer_id. If your data is randomly distributed, matching rows can sit in any file, so every file needs to be scanned. But if the data is sorted or clustered, each file covers a narrow range of values, and the engine can use per-file column statistics to skip entire files or row groups, reducing I/O and speeding up execution.
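
To make the skipping argument concrete, here is a minimal, self-contained sketch in plain Java. It does not use the Iceberg API. The class FileSkippingSketch and its methods are hypothetical names for illustration, and each "file" is reduced to a min/max pair for customer_id, which is roughly the kind of column bound an engine can check before opening a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FileSkippingSketch {
    // Split values into "files" of rowsPerFile rows and record each file's
    // {min, max} for customer_id. This is a stand-in for per-file column stats.
    public static List<long[]> writeFiles(long[] customerIds, int rowsPerFile) {
        List<long[]> files = new ArrayList<>();
        for (int start = 0; start < customerIds.length; start += rowsPerFile) {
            int end = Math.min(start + rowsPerFile, customerIds.length);
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (int i = start; i < end; i++) {
                min = Math.min(min, customerIds[i]);
                max = Math.max(max, customerIds[i]);
            }
            files.add(new long[] {min, max});
        }
        return files;
    }

    // A file has to be read only if the target lies inside its [min, max] range.
    public static long filesToScan(List<long[]> files, long target) {
        return files.stream()
            .filter(f -> f[0] <= target && target <= f[1])
            .count();
    }

    public static void main(String[] args) {
        long[] unsorted = {7, 1, 12, 4, 9, 2, 11, 5, 8, 3, 10, 6};
        long[] sorted = unsorted.clone();
        Arrays.sort(sorted);

        // Same rows, same file size; only the layout differs.
        System.out.println("unsorted: " + filesToScan(writeFiles(unsorted, 4), 5)); // 3
        System.out.println("sorted:   " + filesToScan(writeFiles(sorted, 4), 5));   // 1
    }
}
```

With the unsorted layout, every file's range contains customer_id 5, so all three files must be read. After sorting, only one file's range contains it.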

Clustering benefits:

Sorting in Iceberg

Iceberg supports sort order evolution, which lets you define how data should be physically sorted as it’s written or rewritten.

You can define sort orders during write or compaction:

import org.apache.iceberg.Table;

// Set the table's sort order: customer_id ascending, then order_date descending.
// Assumes `table` is an already-loaded org.apache.iceberg.Table.
table.replaceSortOrder()
  .asc("customer_id")
  .desc("order_date")
  .commit();
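
As a rough illustration of what that ordering means for the rows themselves, the sketch below applies the same two-key ordering (customer_id ascending, then order_date descending) to an in-memory list. It is plain Java, not Iceberg code. The SortOrderSketch class and the row representation (a two-element string array holding the ID and an ISO-8601 date) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortOrderSketch {
    // Each row is {customerId, orderDate}; ISO-8601 dates compare correctly as strings.
    public static List<String[]> applySortOrder(List<String[]> rows) {
        List<String[]> sorted = new ArrayList<>(rows);
        sorted.sort(
            // customer_id ascending
            Comparator.comparingLong((String[] r) -> Long.parseLong(r[0]))
                // order_date descending within each customer
                .thenComparing((String[] r) -> r[1], Comparator.reverseOrder()));
        return sorted;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[] {"2", "2024-01-05"},
            new String[] {"1", "2024-01-01"},
            new String[] {"1", "2024-03-01"});

        for (String[] r : applySortOrder(rows)) {
            System.out.println(r[0] + " " + r[1]);
        }
    }
}
```

Rows for the same customer end up adjacent, newest first, which is the property that lets a filter on customer_id touch a narrow slice of the data.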

Use Cases for Sorting

Z-order Clustering

Z-ordering is a multi-dimensional clustering technique that co-locates related values across multiple columns. It’s ideal for exploratory queries that filter on different combinations of columns.

Example:

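// Z-order is applied when rewriting files (Spark action), not as a table sort order.
SparkActions.get(spark)  // org.apache.iceberg.spark.actions.SparkActions
  .rewriteDataFiles(table)
  .zOrder("customer_id", "product_id", "region")
  .execute();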

Z-ordering works by interleaving bits from multiple columns to keep related rows close together. This increases the chance that queries filtering on any subset of these columns can benefit from data skipping.
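
The bit interleaving can be sketched in a few lines of plain Java. This is a conceptual illustration for two small non-negative integers, not Iceberg's implementation, which has to handle arbitrary column types. The ZOrderSketch class and the 16-bit limit are assumptions chosen to keep the example short.

```java
public class ZOrderSketch {
    // Interleave the low 16 bits of x and y into one Z-value:
    // bit i of x lands at position 2i, bit i of y at position 2i + 1.
    public static long interleave(int x, int y) {
        long z = 0;
        for (int i = 0; i < 16; i++) {
            z |= ((long) (x >> i) & 1L) << (2 * i);
            z |= ((long) (y >> i) & 1L) << (2 * i + 1);
        }
        return z;
    }

    public static void main(String[] args) {
        // Print the Z-value of every cell in a 4x4 grid (x across, y down).
        for (int y = 0; y < 4; y++) {
            for (int x = 0; x < 4; x++) {
                System.out.printf("%3d", interleave(x, y));
            }
            System.out.println();
        }
    }
}
```

Sorting rows by this Z-value visits the grid in 2x2 blocks (0, 1, 2, 3 fill the top-left block, then 4 through 7 the block to its right, and so on). Rows that are close in both columns therefore tend to land in the same file, and neither column dominates the ordering the way the leading key of a plain sort does.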

Note: Z-ordering is supported by Iceberg through integrations like Dremio’s Iceberg Auto-Clustering and Spark jobs using RewriteDataFiles.

Choosing Between Sort and Z-order

| Use Case | Best Technique |
|---|---|
| Filtering on one key column | Simple Sort |
| Range queries on timestamps | Sort on time |
| Multi-column filtering | Z-order |
| Joins on a key column | Sort on join key |
| Complex OLAP-style filters | Z-order |

When to Apply Clustering

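Clustering is typically applied at write time or during compaction. For example, a compaction job can rewrite existing files with an explicit sort: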

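import org.apache.iceberg.SortOrder;
import org.apache.iceberg.spark.actions.SparkActions;

// Rewrite existing data files, sorting rows by region and then event_time.
SparkActions.get(spark)
  .rewriteDataFiles(table)
  .sort(SortOrder.builderFor(table.schema()).asc("region").asc("event_time").build())
  .execute();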

Make sure the sort order aligns with your most frequent query patterns.

Tradeoffs

While clustering helps query performance, it comes with tradeoffs:

Summary

Smart data layout is essential for fast queries in Apache Iceberg. By leveraging sorting and Z-order clustering:

In the next post, we’ll look at another silent performance killer: metadata bloat, and how to clean it up using snapshot expiration and manifest rewriting.