
Medallion Architecture: Bronze, Silver and Gold Explained

The medallion architecture (also called the multi-hop architecture) has become the de facto standard for organising data in a modern data lakehouse. If you are interviewing for a senior data engineering role in 2026, you must be able to design and discuss this pattern fluently.

What Is the Medallion Architecture?

The medallion architecture organises data into three progressively refined layers:

• Bronze — Raw data exactly as it arrived from source systems. Immutable, full fidelity, no transformations. Think of it as your audit log.
• Silver — Cleaned, validated, deduplicated, and lightly enriched data. Business entities take shape here.
• Gold — Aggregated, business-ready data optimised for specific use cases: BI dashboards, ML feature stores, reporting APIs.

Each layer adds trust and loses raw detail. Bronze is for data engineers; Gold is for business stakeholders.
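Before looking at the Spark code, it helps to fix a naming convention for where each layer lives. A minimal sketch in plain Python, assuming a hypothetical lake rooted at /mnt/lake (the root, domain names, and layout are illustrative, not a standard):

```python
# Hypothetical path convention for the three layers.
# The /mnt/lake root and the layer/domain/table layout are illustrative only.
LAYERS = ("bronze", "silver", "gold")

def table_path(layer: str, domain: str, table: str, root: str = "/mnt/lake") -> str:
    """Build a Delta table path like /mnt/lake/bronze/sales/orders."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{root}/{layer}/{domain}/{table}"

print(table_path("bronze", "sales", "orders"))       # /mnt/lake/bronze/sales/orders
print(table_path("gold", "sales", "daily_revenue"))  # /mnt/lake/gold/sales/daily_revenue
```

Keeping the layer name in the path (or the schema name) makes lineage obvious at a glance and keeps the Bronze/Silver/Gold boundaries explicit in every job.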

Bronze Layer — Raw Ingestion

# Bronze: land raw data exactly as received
# Add metadata columns but never modify source data

from pyspark.sql import functions as F

def ingest_to_bronze(source_df, table_path, batch_id):
    bronze_df = (
        source_df
        .withColumn("_ingested_at", F.current_timestamp())
        .withColumn("_ingestion_date", F.current_date())  # partition column, added below
        .withColumn("_source", F.lit("orders_api"))
        .withColumn("_batch_id", F.lit(batch_id))
    )

    (
        bronze_df.write
        .format("delta")
        .mode("append")
        .partitionBy("_ingestion_date")
        .save(table_path)
    )

Silver Layer — Clean and Validate

# Silver: clean, deduplicate, enforce schema

def bronze_to_silver(bronze_path, silver_path):
    df = spark.read.format("delta").load(bronze_path)

    # 1. Cast types properly
    df = (
        df.withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
          .withColumn("revenue", F.col("revenue").cast("double"))
    )

    # 2. Drop nulls on critical keys
    df = df.dropna(subset=["order_id", "customer_id"])

    # 3. Deduplicate (keep latest record per order_id)
    from pyspark.sql.window import Window
    w = Window.partitionBy("order_id").orderBy(F.desc("_ingested_at"))
    df = df.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")

    # 4. Write as Delta (upsert to handle re-runs idempotently)
    from delta.tables import DeltaTable
    if DeltaTable.isDeltaTable(spark, silver_path):
        (
            DeltaTable.forPath(spark, silver_path).alias("t")
            .merge(df.alias("s"), "t.order_id = s.order_id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )
    else:
        df.write.format("delta").save(silver_path)
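The window-based deduplication above keeps only the most recently ingested record per order_id. The same semantics can be sketched without Spark, using plain Python dicts (field names mirror the example above):

```python
def keep_latest(records):
    """Keep the record with the max _ingested_at per order_id, mirroring
    row_number() over (partition by order_id order by _ingested_at desc)."""
    latest = {}
    for r in records:
        key = r["order_id"]
        if key not in latest or r["_ingested_at"] > latest[key]["_ingested_at"]:
            latest[key] = r
    return list(latest.values())

rows = [
    {"order_id": 1, "_ingested_at": "2026-01-01T10:00", "revenue": 10.0},
    {"order_id": 1, "_ingested_at": "2026-01-02T10:00", "revenue": 12.0},  # later re-ingest wins
    {"order_id": 2, "_ingested_at": "2026-01-01T10:00", "revenue": 5.0},
]
print(keep_latest(rows))  # order 1 keeps revenue 12.0, order 2 keeps 5.0
```

This is exactly why re-running the Bronze→Silver job is safe: duplicate ingests collapse to the latest version of each order before the merge.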

Gold Layer — Business Aggregations

# Gold: aggregate for specific business use cases

def silver_to_gold_daily_revenue(silver_path, gold_path):
    df = spark.read.format("delta").load(silver_path)

    gold_df = df.groupBy("order_date", "region", "product_category").agg(
                    F.sum("revenue").alias("total_revenue"),
                    F.countDistinct("customer_id").alias("unique_customers"),
                    F.count("order_id").alias("order_count"),
                    F.avg("revenue").alias("avg_order_value")
                )

    (
        gold_df.write
        .format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .partitionBy("order_date")
        .save(gold_path)
    )
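The Gold aggregation can be sanity-checked with a tiny in-memory version of the same grouping. A Spark-free sketch (field names match the Silver schema above; this is for illustration, not a replacement for the job):

```python
from collections import defaultdict

def daily_revenue(rows):
    """Group by (order_date, region, product_category) and compute the same
    four metrics as the Spark job: total, unique customers, count, average."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["order_date"], r["region"], r["product_category"])].append(r)
    out = {}
    for key, rs in groups.items():
        total = sum(r["revenue"] for r in rs)
        out[key] = {
            "total_revenue": total,
            "unique_customers": len({r["customer_id"] for r in rs}),
            "order_count": len(rs),
            "avg_order_value": total / len(rs),
        }
    return out

rows = [
    {"order_date": "2026-01-01", "region": "EU", "product_category": "books",
     "customer_id": 1, "order_id": 1, "revenue": 10.0},
    {"order_date": "2026-01-01", "region": "EU", "product_category": "books",
     "customer_id": 1, "order_id": 2, "revenue": 20.0},
]
print(daily_revenue(rows))
```

Checking a handful of rows by hand like this is also a quick way to validate the Spark job's output during development.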

Why This Pattern Wins Interviews

When an interviewer asks "How would you design a data platform?", answering with the medallion architecture immediately signals senior-level thinking. Key points to mention:

• Bronze enables full replay — if Silver logic has a bug, re-run from Bronze without re-ingesting from source
• Each layer has distinct SLAs — for example, Bronze in 5 min, Silver in 30 min, Gold in 1 hour
• Data quality checks live at the Bronze→Silver boundary
• Gold tables are purpose-built — one Gold table per use case, not a one-size-fits-all table
• Delta Lake makes all of this reliable with ACID transactions
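As an illustration of the Bronze→Silver quality gate mentioned above, here is a Spark-free sketch of row-level checks. The rule names and the specific rules are made up for the example; in practice these would map to the dropna and cast steps in the Silver job:

```python
def validate_row(row):
    """Return the list of failed rule names for one order row.
    Rules are illustrative only, not a standard rule set."""
    failures = []
    if row.get("order_id") is None:
        failures.append("order_id_not_null")
    if row.get("customer_id") is None:
        failures.append("customer_id_not_null")
    revenue = row.get("revenue")
    if revenue is not None and revenue < 0:
        failures.append("revenue_non_negative")
    return failures

good = {"order_id": 1, "customer_id": 7, "revenue": 19.99}
bad = {"order_id": None, "customer_id": 7, "revenue": -5.0}
print(validate_row(good))  # []
print(validate_row(bad))   # two failed rules
```

Rows that fail can be routed to a quarantine table rather than silently dropped, which keeps the Silver layer both clean and auditable.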

Practice data engineering problems on DataCodingHub