Avoiding Metadata Bloat with Snapshot Expiration and Rewriting Manifests

Published: at 09:00 AM

As your Apache Iceberg tables evolve through continuous writes, schema changes, and compaction jobs, they generate a growing amount of metadata. Metadata is a powerful feature in Iceberg, enabling time travel and auditability, but unchecked metadata growth can slow query planning and drive up storage costs.

In this post, we’ll explore how to expire old snapshots and rewrite manifests to keep your Iceberg tables lean, responsive, and cost-efficient.

What Causes Metadata Bloat?

Iceberg tracks table state through a series of snapshots. Each snapshot references a manifest list, which in turn references the manifest files that describe individual data files.

Bloat occurs when:

Expiring Snapshots

You can safely remove older snapshots using Iceberg’s built-in expiration functionality. This deletes metadata for snapshots that are no longer needed for time travel, rollback, or audit purposes.

Example in Spark:

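import java.util.concurrent.TimeUnit;
import org.apache.iceberg.spark.actions.SparkActions;

// `spark` is the active SparkSession; `table` is a loaded org.apache.iceberg.Table
SparkActions.get(spark)
  .expireSnapshots(table)
  .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7)) // expire snapshots older than 7 days
  .retainLast(2) // always keep the 2 most recent snapshots
  .execute();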

This keeps recent snapshots while cleaning up older ones, removing their metadata along with any data files that no remaining snapshot references (this requires garbage collection to be enabled for the table).
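To make the interaction between the age cutoff and the retain-last count concrete, here is a minimal, self-contained sketch of the retention rule. It is a simplified model, not Iceberg's implementation: the `Snap` class and the `retainedIds` helper are hypothetical, and the model assumes a linear snapshot history with no branches or tags.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class SnapshotRetentionSketch {

    // Hypothetical stand-in for an Iceberg snapshot: an id plus its commit time.
    static final class Snap {
        final long id;
        final long committedAtMillis;

        Snap(long id, long committedAtMillis) {
            this.id = id;
            this.committedAtMillis = committedAtMillis;
        }
    }

    // Simplified rule (assumes a linear history): a snapshot survives if it is
    // among the `retainLast` newest, or if it was committed at or after the cutoff.
    // Returns the surviving snapshot ids, newest first.
    static List<Long> retainedIds(List<Snap> snapshots, long cutoffMillis, int retainLast) {
        List<Snap> newestFirst = new ArrayList<>(snapshots);
        newestFirst.sort(Comparator.comparingLong((Snap s) -> s.committedAtMillis).reversed());

        List<Long> kept = new ArrayList<>();
        for (int i = 0; i < newestFirst.size(); i++) {
            Snap s = newestFirst.get(i);
            if (i < retainLast || s.committedAtMillis >= cutoffMillis) {
                kept.add(s.id);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long day = TimeUnit.DAYS.toMillis(1);
        long now = System.currentTimeMillis();
        List<Snap> history = List.of(
            new Snap(1, now - 30 * day),
            new Snap(2, now - 10 * day),
            new Snap(3, now - 3 * day),
            new Snap(4, now - day));

        // Same policy as the Spark example: 7-day cutoff, keep at least 2.
        System.out.println(retainedIds(history, now - 7 * day, 2)); // prints [4, 3]
    }
}
```

The two conditions are joined with OR, so the retain-last count acts as a floor: even if every snapshot is older than the cutoff, the newest ones still survive.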

Guidelines:

Rewriting Manifests

Over time, manifest files can become inefficient:

Example in Spark:

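import org.apache.iceberg.spark.actions.SparkActions;

// `spark` is the active SparkSession; `table` is a loaded org.apache.iceberg.Table
SparkActions.get(spark)
  .rewriteManifests(table)
  .execute();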

This reduces the number of manifest files, clusters their entries by partition, and can improve query planning times.
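Before running a rewrite, it can help to estimate how much the manifest count might shrink. The sketch below is a rough heuristic under stated assumptions, not how Iceberg plans the rewrite: the class and method names are hypothetical, the 8 MB target mirrors the documented default of the `commit.manifest.target-size-bytes` table property, and the manifest lengths would come from your own table (for example, the `length` column of the `manifests` metadata table).

```java
import java.util.Collections;
import java.util.List;

public class ManifestRewriteEstimate {

    // Assumed target size. 8 MB matches the documented default of the
    // commit.manifest.target-size-bytes table property; verify it for your table.
    static final long TARGET_BYTES = 8L * 1024 * 1024;

    // Counts manifests smaller than the given threshold.
    static long countSmall(List<Long> manifestLengths, long thresholdBytes) {
        return manifestLengths.stream().filter(len -> len < thresholdBytes).count();
    }

    // Rough lower bound on the manifest count if all entries were repacked
    // into manifests of the target size.
    static long estimateAfterRewrite(List<Long> manifestLengths, long targetBytes) {
        long total = manifestLengths.stream().mapToLong(Long::longValue).sum();
        return (total + targetBytes - 1) / targetBytes; // ceiling division
    }

    public static void main(String[] args) {
        // Hypothetical table: 40 manifests of 512 KB each (20 MB in total).
        List<Long> lengths = Collections.nCopies(40, 512L * 1024);
        System.out.println("small manifests: " + countSmall(lengths, 1024L * 1024));
        System.out.println("estimated after rewrite: " + estimateAfterRewrite(lengths, TARGET_BYTES));
    }
}
```

In this made-up case, 40 small manifests could in principle be packed into about 3. The real result depends on how the rewrite groups entries by partition, so treat the number as a signal for whether a rewrite is worth scheduling rather than as a prediction.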

When Should You Perform Metadata Cleanup?

Bonus: Use Metadata Tables to Inspect Bloat

Iceberg’s metadata tables help you inspect how much bloat has built up.

Example:

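-- File counts live in the summary map of the snapshots metadata table
SELECT snapshot_id,
       summary['added-data-files'] AS added_data_files,
       summary['total-data-files'] AS total_data_files
FROM my_table.snapshots
ORDER BY committed_at DESC;

SELECT COUNT(*) AS manifest_count FROM my_table.manifests;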

These insights can help you determine when cleanup is needed.

Tradeoffs and Cautions

Summary

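Metadata is a powerful part of Iceberg’s architecture, but without routine maintenance, it can weigh down your table performance. By expiring old snapshots and rewriting manifests, you can keep your tables lean, responsive, and cost-efficient.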

In the next post, we’ll explore how to design the ideal cadence for compaction and snapshot expiration so your optimizations are timely and cost-effective.