
Apache Iceberg vs Delta Lake in 2026: Which Should You Use?

The table format war is one of the most debated topics in data engineering right now. Databricks acquired Tabular (the company founded by Iceberg's original creators) for over $1 billion. AWS launched S3 Tables with built-in Iceberg support. Snowflake made Iceberg Tables GA. Senior data engineering interviews in 2026 routinely ask about this, and "I have heard of both" is not an answer.

What Is a Table Format and Why Does It Matter?

A table format sits between your raw files (Parquet, ORC) and your query engine (Spark, Trino, Flink). It tracks which files belong to a table, handles schema evolution, manages transactions, and enables time travel. Without a table format, you have a data swamp. With one, you have a data lakehouse.

Before Iceberg and Delta Lake, engineers had to manually track file locations, had no ACID guarantees, and schema changes were terrifying. Both formats solve these problems — but in meaningfully different ways.

How Delta Lake Works

Delta Lake stores a transaction log in a _delta_log directory alongside your Parquet files. Every commit adds a new numbered JSON file to this log. To read the current table state, the engine replays the log from the last checkpoint. This is simple, reliable, and deeply integrated with Apache Spark, which is why it dominates Databricks environments.

# Write a Delta table (df is an existing Spark DataFrame, spark an active session)
df.write.format("delta").save("/data/sales")

# Read with time travel — go back to version 5
df = spark.read.format("delta").option("versionAsOf", 5).load("/data/sales")

# Upsert with MERGE (updates is a DataFrame of incoming changes)
from delta.tables import DeltaTable
dt = DeltaTable.forPath(spark, "/data/sales")
dt.alias("t").merge(
    updates.alias("u"),
    "t.id = u.id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
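
You can watch this log grow with DESCRIBE HISTORY, which lists every version the table has accumulated. A minimal sketch against the table above, assuming Delta's SQL extensions are enabled:

# Each row is one commit in _delta_log: version, timestamp, operation
spark.sql("DESCRIBE HISTORY delta.`/data/sales`").show()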

Delta Lake's biggest strength is its Spark integration — features like Z-ORDER clustering, OPTIMIZE, and VACUUM are first-class operations. Its limitation: it was built for Spark. Other engines (Trino, Flink, Athena) have connectors but they are not as seamless.
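
As a quick illustration of those maintenance operations, here is a minimal sketch against the same path (clustering on id is an assumption carried over from the MERGE example):

# Compact small files and co-locate rows by id (OSS Delta 2.0+)
spark.sql("OPTIMIZE delta.`/data/sales` ZORDER BY (id)")
# Delete unreferenced files older than 7 days (168 hours)
spark.sql("VACUUM delta.`/data/sales` RETAIN 168 HOURS")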

How Apache Iceberg Works

Iceberg uses a hierarchical metadata structure. There is a table metadata JSON file that points to manifest lists, which point to manifest files, which contain data file references with column-level statistics. This multi-layer design lets queries prune irrelevant files before scanning any data — critical for tables with billions of files.

# Write an Iceberg table via Spark
df.writeTo("prod.silver.sales").using("iceberg").createOrReplace()

# Time travel by snapshot ID
df = spark.read.option("snapshot-id", "8270633134997068581").table("prod.silver.sales")

# Time travel by timestamp (the option takes epoch milliseconds, not a date string)
df = spark.read.option("as-of-timestamp", "1767225600000").table("prod.silver.sales")  # 2026-01-01 00:00:00 UTC

# Schema evolution — add a column without rewriting data
spark.sql("ALTER TABLE prod.silver.sales ADD COLUMN discount DOUBLE")

Iceberg's killer feature is hidden partitioning. With Delta Lake and Hive, you write queries using partition columns directly. With Iceberg, you write queries using raw column values and Iceberg automatically routes to the correct partition. This means partition evolution — changing how a table is partitioned — requires no data rewrite and no query changes.
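
Here is what hidden partitioning looks like in practice. A minimal sketch with a hypothetical events table: the partition is a transform of the timestamp column, not a separate column users must know about.

# Partition by day, derived from event_ts at write time
spark.sql("""
    CREATE TABLE prod.silver.events (id BIGINT, event_ts TIMESTAMP, payload STRING)
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")
# Filter on the raw column; Iceberg prunes to the matching day partitions
spark.sql("SELECT * FROM prod.silver.events WHERE event_ts >= TIMESTAMP '2026-01-01'")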

The 5 Differences That Actually Matter

1. Engine support: Iceberg was designed engine-agnostic from day one. It has production-grade support in Spark, Trino, Flink, Dremio, Athena, BigQuery, and Snowflake. Delta Lake delivers its best experience inside Spark and Databricks — connectors exist for other engines but are a tier below.

2. Hidden partitioning: Iceberg hides partition complexity from users. You query on event_date and Iceberg finds the right files. With Delta Lake you write WHERE partition_date = '2026-01-01', leaking the partition scheme into every query. When you change partitioning in Delta Lake, you may need to update queries. With Iceberg, you do not.

3. Metadata at scale: At billions of files, Iceberg's columnar manifest files dramatically outperform Delta Lake's JSON transaction log, which must be replayed linearly. Delta Lake mitigates this with periodic checkpointing, but Iceberg's architecture scales more naturally.

4. MERGE performance: Delta Lake's MERGE is highly optimised for Spark and is the go-to for CDC (Change Data Capture) pipelines in Databricks. Iceberg's row-level deletes via delete files are more engine-agnostic but can require periodic compaction to maintain read performance; see the sketch after this list.

5. Ecosystem momentum: By every measure, Iceberg is winning new adoptions in 2026. The $1B+ Databricks acquisition of Tabular, AWS S3 Tables built on Iceberg, Snowflake GA Iceberg Tables, and Google BigQuery managed Iceberg — the cloud providers are standardising on Iceberg.
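
The sketch referenced in difference 4: Iceberg's MERGE runs as plain Spark SQL, and compaction is a stored procedure. The table and id column are carried over from earlier examples; updates is assumed to be a registered temp view of incoming changes.

# Engine-agnostic upsert via Spark SQL
spark.sql("""
    MERGE INTO prod.silver.sales t
    USING updates u ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
# Compact delete files and small data files to restore read performance
spark.sql("CALL prod.system.rewrite_data_files(table => 'silver.sales')")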

Which Should You Use?

Use Delta Lake if: you are all-in on Databricks, your pipelines are Spark-only, and you want the tightest integration with Databricks features (Unity Catalog, DLT, Photon). Delta Lake inside Databricks is genuinely excellent.

Use Iceberg if: you are building a new engine-agnostic platform, you use multiple query engines (Spark for ETL, Trino for ad-hoc, Flink for streaming), or you are on AWS/GCP/Snowflake. Iceberg's ecosystem support is now broad enough that it is the safer long-term bet for most organisations.

How This Shows Up in Interviews

Three questions that appear regularly in 2026 data engineering interviews:

"Your team uses Spark, Trino, and Flink against the same data lake. Which table format would you choose and why?" — The answer is Iceberg, because Delta Lake's Trino and Flink support is weaker. Explain the engine-agnostic architecture.

"You need to change how a 10TB table is partitioned without downtime. How do you do it in Iceberg?" — Iceberg partition evolution. No data rewrite, no downtime. ALTER TABLE ... SET PARTITION SPEC.

"Explain the difference between a snapshot and a version in Iceberg/Delta Lake." — In Iceberg, every write creates a new snapshot with a snapshot ID. Time travel uses snapshot IDs or timestamps. In Delta Lake, writes create versions — sequential integers in the transaction log.

The best way to get comfortable with these concepts is to practice writing real PySpark code that reads, writes, and queries table formats — not just memorising definitions. DataCodingHub has hands-on PySpark questions that build exactly this intuition.

Practice PySpark & data lake questions