The Basics of Compaction — Bin Packing Your Data for Efficiency

Published at 09:00 AM

In the first post of this series, we explored how Apache Iceberg tables degrade when left unoptimized. Now it’s time to look at the most foundational optimization technique: compaction.

Compaction is the process of merging small files into larger ones to reduce file system overhead and improve query performance. In Iceberg, this usually takes the form of bin packing — grouping smaller files together so they align with an optimal size target.

Why Bin Packing Matters

Query engines like Dremio, Trino, and Spark operate more efficiently when reading a small number of large files instead of a large number of tiny files. Every file adds cost: an entry in table metadata to track, a file open and seek during scans, and extra work for the planner to schedule.

By merging many small files into fewer large files, compaction directly addresses these overheads: less metadata to track, fewer file operations per query, and faster scan planning.
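To make the overhead concrete, here is a back-of-the-envelope count; the partition size and file sizes below are hypothetical example values, not numbers from the original post:

```python
# Hypothetical numbers: a 10 GB partition stored as tiny files vs. compacted files.
MB = 1024 * 1024
partition_bytes = 10 * 1024 * MB   # 10 GB of data in one partition
small_file = 1 * MB                # 1 MB per file before compaction
target_file = 128 * MB             # 128 MB per file after compaction

small_count = partition_bytes // small_file    # files to open, plan, and track
packed_count = partition_bytes // target_file  # files after compaction

print(f"before: {small_count} files, after: {packed_count} files")
# before: 10240 files, after: 80 files
```

Every query against that partition now opens 80 files instead of 10,240, and the table metadata tracks 80 entries instead of 10,240.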

How Standard Compaction Works

A typical Iceberg compaction job involves:

  1. Scanning the table to identify small files below a certain threshold.
  2. Reading and coalescing records from multiple small files within a partition.
  3. Writing out new files targeting an optimal size (commonly 128MB–512MB per file).
  4. Creating a new snapshot that references the new files and drops the older ones.
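Step 2 is where the "bin packing" name comes from. The sketch below shows the idea with a greedy first-fit-decreasing pass; it is an illustration only, not Iceberg's actual planning code, and the file sizes are made-up example values:

```python
# Illustrative first-fit-decreasing bin packing; not Iceberg's internal planner.
def bin_pack(file_sizes, target_bytes):
    """Group file sizes into bins whose totals stay at or under target_bytes."""
    bins = []  # each bin is a list of file sizes to be rewritten as one file
    for size in sorted(file_sizes, reverse=True):
        for b in bins:
            if sum(b) + size <= target_bytes:
                b.append(size)  # fits in an existing bin
                break
        else:
            bins.append([size])  # start a new bin
    return bins

MB = 1024 * 1024
groups = bin_pack([90 * MB, 60 * MB, 40 * MB, 30 * MB], target_bytes=128 * MB)
print([[s // MB for s in g] for g in groups])  # [[90, 30], [60, 40]]
```

Each resulting bin becomes one rewrite task that reads its member files and writes a single output file near the size target.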

This process can be orchestrated using Spark actions or stored procedures, engine-managed commands (for example, Dremio's OPTIMIZE TABLE or Trino's ALTER TABLE ... EXECUTE optimize), or scheduled jobs in an external orchestrator.

Example: Spark Action

import org.apache.iceberg.spark.actions.SparkActions

// Bin-pack small files in `table`, writing new files near the target size.
SparkActions.get(spark)
  .rewriteDataFiles(table)
  .option("target-file-size-bytes", "134217728") // 128 MB
  .execute()

This will identify and bin-pack small files across partitions, replacing them with larger files.

Tips for Running Compaction

When Should You Run It?

That depends on your ingestion frequency, how many small files each write produces, and how latency-sensitive your queries are.

In many cases, a daily or hourly schedule works well. Some platforms support event-driven compaction based on file count or size thresholds.
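An event-driven trigger can be sketched as a simple threshold check. The function and cutoff values below are hypothetical examples, not part of any Iceberg API:

```python
MB = 1024 * 1024

# Hypothetical trigger for event-driven compaction: fire once a partition
# accumulates too many files below a "small file" cutoff. Thresholds are examples.
def should_compact(file_sizes, small_cutoff=32 * MB, max_small_files=10):
    small = sum(1 for s in file_sizes if s < small_cutoff)
    return small >= max_small_files

sizes = [5 * MB] * 12 + [200 * MB]  # twelve 5 MB files plus one large file
print(should_compact(sizes))  # True: 12 small files exceed the threshold of 10
```

A scheduler or table-maintenance service could run a check like this per partition and submit a compaction job only where it pays off.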

Tradeoffs

While compaction boosts performance, it also consumes compute and I/O to rewrite existing data, creates new snapshots (so superseded files occupy storage until old snapshots are expired), and can conflict with concurrent writes to the same partitions.

That’s why timing and scope matter—a theme we’ll return to later in this series.

Up Next

Now that you understand standard compaction, the next challenge is applying it without interrupting streaming workloads. In Part 3, we’ll explore techniques to make compaction faster, safer, and more incremental for real-time pipelines.