Databricks Interview Questions Cheat Sheet 2026
Databricks interviews are different from generic data engineering interviews. You are expected to know the Databricks-specific ecosystem deeply — Delta Lake internals, the Lakehouse paradigm, Unity Catalog governance, the Photon engine, MLflow, and Delta Live Tables. This cheat sheet covers every topic area that appears repeatedly in Databricks engineer and data engineer interviews at tech companies in 2026.
Q1. What is the Lakehouse architecture and how does Databricks implement it?
The Lakehouse combines the low-cost, schema-flexible storage of a data lake with the ACID transactions, schema enforcement, and query performance of a data warehouse — all on a single platform. Databricks implements this through Delta Lake: an open-source storage layer that sits on top of cloud object storage (S3, ADLS, GCS) and adds a transaction log, enabling ACID compliance on Parquet files.
The key insight is that raw data lands in object storage (like S3) in Parquet format. Delta Lake wraps those files with a _delta_log/ directory containing JSON commit files that record every write, update, and delete. This transaction log is what enables time travel, schema enforcement, concurrent reads/writes, and audit history — features that were previously only available in proprietary data warehouses.
# Writing a Delta table
df.write.format("delta").mode("overwrite").save("/mnt/datalake/silver/orders")
# Reading back
df = spark.read.format("delta").load("/mnt/datalake/silver/orders")
# Using SQL syntax
spark.sql("CREATE TABLE orders USING DELTA LOCATION '/mnt/datalake/silver/orders'")
Q2. What is the Delta Lake transaction log and how does it work?
The Delta Lake transaction log (_delta_log/) is the single source of truth for a Delta table. Every operation — INSERT, UPDATE, DELETE, MERGE, schema change — is recorded as a new JSON file in this directory. When you read a Delta table, Spark first reads the transaction log to determine which Parquet files are part of the current version of the table, then reads only those files.
Every 10 commits, Delta Lake creates a Parquet checkpoint file that consolidates the JSON logs for faster reads. This means reading metadata for a table with 1000 commits only requires reading the latest checkpoint plus the JSON files since then — not all 1000 JSON files. The transaction log is also what makes concurrent writes safe: writers use optimistic concurrency control, checking for conflicts at commit time.
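As a toy illustration of that checkpoint arithmetic (a simplified sketch, not Delta's actual implementation — the file naming follows the real zero-padded convention, but compaction details are omitted), the set of log files needed to reconstruct a given table version looks like this:

```python
def log_files_to_read(current_version: int, checkpoint_interval: int = 10):
    """Which _delta_log files are needed to reconstruct `current_version`?

    A checkpoint is written every `checkpoint_interval` commits; a reader
    starts from the latest checkpoint at or before the current version,
    then applies only the JSON commits after it.
    """
    latest_checkpoint = (current_version // checkpoint_interval) * checkpoint_interval
    if latest_checkpoint > 0:
        files = [f"{latest_checkpoint:020d}.checkpoint.parquet"]
        start = latest_checkpoint + 1
    else:
        files = []
        start = 0  # no checkpoint yet: replay every JSON commit from version 0
    files += [f"{v:020d}.json" for v in range(start, current_version + 1)]
    return files

# At version 1000 the reader needs the version-1000 checkpoint only,
# not all 1000 JSON commit files; at version 25 it needs the version-20
# checkpoint plus the five JSON commits 21..25.
```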
# Inspect the transaction log
from delta.tables import DeltaTable
dt = DeltaTable.forPath(spark, "/mnt/datalake/silver/orders")
# View history
dt.history().select("version", "timestamp", "operation", "operationParameters").show()
# Check table details
dt.detail().select("numFiles", "sizeInBytes", "partitionColumns").show()
Q3. Explain Delta Lake time travel — how does it work and when would you use it?
Time travel lets you query any previous version of a Delta table. Since the transaction log records every commit and the old Parquet files are never deleted until VACUUM runs, you can query the exact state of a table at any point in its history. You reference a version by number or by timestamp.
Real use cases: (1) Audit — reproduce exactly what data was visible at the time a report was generated. (2) Recovery — if a bad MERGE or DELETE ran, read the version before the mistake and overwrite the current table. (3) Reproducibility — ML training pipelines can reference a specific table version so models are reproducible even as the underlying data changes. (4) Debugging — compare two versions to understand what changed between pipeline runs.
# Query by version number
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("/mnt/datalake/silver/orders")
# Query by timestamp
df_yesterday = (spark.read.format("delta")
    .option("timestampAsOf", "2026-04-16T00:00:00")
    .load("/mnt/datalake/silver/orders"))
# Restore a table to a previous version
from delta.tables import DeltaTable
dt = DeltaTable.forPath(spark, "/mnt/datalake/silver/orders")
dt.restoreToVersion(5)
# SQL syntax
spark.sql("SELECT * FROM orders VERSION AS OF 5")
spark.sql("SELECT * FROM orders TIMESTAMP AS OF '2026-04-16'")
Q4. What is MERGE INTO and how does it enable CDC (Change Data Capture)?
MERGE INTO is the most powerful Delta Lake operation. It lets you upsert data — insert new rows, update existing rows, and optionally delete rows — in a single atomic operation. This is the standard pattern for applying CDC (Change Data Capture) updates from a source system to a Delta Lake target.
A CDC pipeline typically reads change events (inserts, updates, deletes) from Kafka or a database CDC log. MERGE INTO applies these changes to the Delta table. The operation is ACID — if the job fails halfway, the table is not left in a partial state. Under the hood, Delta rewrites only the affected files, not the entire table.
from delta.tables import DeltaTable
target = DeltaTable.forPath(spark, "/mnt/datalake/silver/customers")
# Apply CDC updates: new rows insert, changed rows update, deleted rows delete
target.alias("target").merge(
    source=cdc_df.alias("source"),
    condition="target.customer_id = source.customer_id"
).whenMatchedUpdate(
    condition="source.op_type = 'U'",
    set={
        "name": "source.name",
        "email": "source.email",
        "updated_at": "source.updated_at"
    }
).whenMatchedDelete(
    condition="source.op_type = 'D'"
).whenNotMatchedInsert(
    condition="source.op_type = 'I'",
    values={
        "customer_id": "source.customer_id",
        "name": "source.name",
        "email": "source.email",
        "updated_at": "source.updated_at"
    }
).execute()
Q5. What is OPTIMIZE and Z-ordering? When should you use them?
OPTIMIZE is a Delta Lake command that compacts small files into larger, more efficient Parquet files. When you have a streaming pipeline writing micro-batches every few minutes, you end up with thousands of tiny files — each requiring a separate S3 API call to read. OPTIMIZE merges these into files of ~1GB each, dramatically reducing the number of file reads and improving query speed.
Z-ordering is a data clustering technique that co-locates related data within the same Parquet files. When you Z-ORDER BY a column (e.g., customer_id), Delta reorganises data so that rows with similar customer_id values are stored in the same files. Queries filtering on customer_id can then skip entire files — this is called data skipping. Z-ordering is most effective on high-cardinality columns that appear frequently in WHERE clauses, JOIN conditions, or GROUP BY.
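The data-skipping effect can be illustrated with a toy model (the file names and statistics below are hypothetical, not the real Delta stats format): each file carries min/max values per column, and a filter prunes any file whose range cannot contain the predicate value.

```python
# Toy model of Delta data skipping: prune files using per-file min/max stats.
# File names and ranges are made up for illustration.
file_stats = {
    "part-000.parquet": {"customer_id": (1, 1000)},
    "part-001.parquet": {"customer_id": (1001, 2000)},
    "part-002.parquet": {"customer_id": (2001, 3000)},
}

def files_to_scan(stats, column, value):
    """Keep only files whose [min, max] range for `column` could contain `value`."""
    return [f for f, cols in stats.items()
            if cols[column][0] <= value <= cols[column][1]]

# WHERE customer_id = 1500 scans 1 of 3 files. Z-ordering is what makes these
# per-file ranges narrow and non-overlapping, so the pruning actually bites.
```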
-- Run OPTIMIZE with Z-ordering
OPTIMIZE orders ZORDER BY (customer_id, order_date);
-- Python equivalent
from delta.tables import DeltaTable
dt = DeltaTable.forPath(spark, "/mnt/datalake/silver/orders")
dt.optimize().executeZOrderBy("customer_id", "order_date")
-- VACUUM: remove old files no longer needed (keep 7 days for time travel)
VACUUM orders RETAIN 168 HOURS;
-- Inspect file counts and sizes after OPTIMIZE
DESCRIBE DETAIL orders;
Q6. What is Auto Loader and why is it preferred over plain Spark readStream?
Auto Loader (cloudFiles source) is Databricks's managed incremental ingestion tool for cloud object storage. Unlike a plain Spark readStream with the binaryFile or csv source, Auto Loader uses cloud-native file notification services (S3 Event Notifications, Azure Event Grid, GCS Pub/Sub) to detect new files the instant they arrive — rather than listing the entire directory on every micro-batch. This makes it dramatically more efficient for large directories with millions of files.
Auto Loader also handles schema inference and evolution automatically. It tracks which files have already been processed using a RocksDB-based checkpoint, so even if a job restarts or a cluster fails, it picks up exactly where it left off without reprocessing or missing files. It can ingest JSON, CSV, Parquet, Avro, ORC, text, and binary file formats.
# Auto Loader ingesting JSON from S3 into a Delta table
df = (spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")
.option("cloudFiles.inferColumnTypes", "true")
.load("s3://raw-bucket/orders/")
)
(df.writeStream
.format("delta")
.option("checkpointLocation", "/mnt/checkpoints/orders")
.option("mergeSchema", "true")
.trigger(availableNow=True) # process all available files then stop
.toTable("bronze.orders")
)
Q7. What is Unity Catalog and how does it differ from the legacy Hive metastore?
Unity Catalog is Databricks's unified governance layer for all data and AI assets. It replaces the per-workspace Hive metastore with a centralised, account-level catalog that spans all workspaces. The key improvements: (1) Fine-grained access control at the column and row level — you can grant SELECT on specific columns, or apply row filters based on the user's group. (2) Data lineage tracking — automatically records which queries read and wrote to which tables. (3) Audit logs — every access is logged. (4) Works across Delta tables, external tables, ML models, feature tables, and notebooks.
The three-level namespace in Unity Catalog is: catalog.schema.table (e.g., prod.silver.orders). The legacy Hive metastore only had schema.table. This extra level lets organisations create separate catalogs per environment (dev, staging, prod), per business unit, or per data domain — all governed from a single control plane.
-- Create a catalog and schema in Unity Catalog
CREATE CATALOG IF NOT EXISTS prod;
CREATE SCHEMA IF NOT EXISTS prod.silver;
-- Grant access
GRANT USE CATALOG ON CATALOG prod TO `data-engineers`;
GRANT SELECT ON TABLE prod.silver.orders TO `analysts`;
-- Column masking (hide PII for non-privileged users)
CREATE OR REPLACE FUNCTION mask_email(email STRING)
RETURNS STRING
RETURN IF(is_member('pii-access'), email, regexp_replace(email, '(.).+@', '$1***@'));
ALTER TABLE prod.silver.customers
ALTER COLUMN email SET MASK mask_email;
Q8. What is the Photon engine? When does it help most?
Photon is Databricks's native vectorised query engine, written in C++, that replaces the standard Spark JVM-based executor for SQL and DataFrame operations. Instead of processing one row at a time, Photon processes data in columnar batches (vectors), which maps directly to CPU SIMD instructions. The result is 2–8x faster query execution for SQL-heavy workloads — aggregations, joins, sorts, and scans are particularly accelerated.
Photon helps most on: (1) BI and dashboarding queries that run frequently and need sub-second latency. (2) Large aggregations and window functions. (3) Wide tables with many columns (columnar execution skips non-referenced columns entirely). It helps least on: Python UDFs (which still cross the JVM/Python boundary), very small datasets where query compilation overhead dominates, and RDD-based workloads. Photon is available on Databricks Runtime 9.1 LTS and above and is enabled by default on Databricks SQL warehouses — you do not need to change your code, only the cluster configuration.
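Since there is no code change to make, a useful sanity check is whether Photon actually ran: on a Photon-enabled cluster, Photon operators appear in the physical plan with a Photon prefix (e.g. PhotonScan, PhotonGroupingAgg). A small helper to scan an EXPLAIN string for them — the sample plan text below is a made-up example, not real Databricks output:

```python
import re

def photon_operators(plan_text: str):
    """Return the distinct Photon operator names found in an EXPLAIN output string."""
    return sorted(set(re.findall(r"\bPhoton[A-Za-z]+", plan_text)))

# On Databricks you would feed this the output of EXPLAIN FORMATTED for your query.
sample_plan = "== Physical Plan ==\nPhotonGroupingAgg\n  PhotonScan parquet prod.silver.orders"
print(photon_operators(sample_plan))  # -> ['PhotonGroupingAgg', 'PhotonScan']
```

If the list comes back empty on a Photon cluster, the query likely fell back to the JVM engine (Python UDFs and unsupported expressions are the usual culprits).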
Q9. What are Delta Live Tables (DLT) and how do they simplify pipeline development?
Delta Live Tables is a declarative framework for building reliable, maintainable ETL pipelines in Databricks. Instead of writing imperative Spark code that manually manages dependencies, retries, and checkpoint state, you declare each table as a Python function decorated with @dlt.table or @dlt.view. Databricks infers the dependency graph between tables, manages execution order, handles retries automatically, and tracks data quality via expectations.
DLT pipelines run in two modes: Development (the cluster stays warm between updates and automatic retries are disabled — good for fast iteration) and Production (clusters are released when idle and failed updates retry automatically — good for reliability and cost). The big win for interviews: DLT automatically handles exactly-once semantics for streaming pipelines, manages schema evolution, and gives you a live pipeline graph UI showing data quality metrics and row counts at each stage.
import dlt
from pyspark.sql import functions as F
# Bronze: raw ingestion
@dlt.table(comment="Raw orders from Auto Loader")
def bronze_orders():
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://raw-bucket/orders/")
    )
# Silver: clean and validate — with data quality expectations
@dlt.table(comment="Validated orders")
@dlt.expect_or_drop("valid_amount", "order_amount > 0")
@dlt.expect_or_drop("non_null_id", "order_id IS NOT NULL")
def silver_orders():
    return (dlt.read_stream("bronze_orders")
        .withColumn("order_date", F.to_date("order_timestamp"))
        .select("order_id", "customer_id", "order_amount", "order_date")
    )
# Gold: aggregation
@dlt.table(comment="Daily revenue per customer")
def gold_daily_revenue():
    return (dlt.read("silver_orders")
        .groupBy("customer_id", "order_date")
        .agg(F.sum("order_amount").alias("daily_revenue"))
    )
Q10. What is MLflow and how does it integrate with Databricks?
MLflow is an open-source platform for managing the complete ML lifecycle. Databricks manages a hosted MLflow instance as part of every workspace. The four core components: (1) Tracking — logs parameters, metrics, and artefacts (model files, plots) for every training run. (2) Model Registry — versioned store for models with lifecycle stages (Staging, Production, Archived). (3) Projects — reproducible packaging of ML code. (4) Models — standard model format that lets you serve the same model via REST API, batch inference, or streaming.
In a Databricks interview, the expected workflow is: train a model with MLflow autologging (logs all hyperparameters automatically), register the best run to the Model Registry, promote it to Production after validation, then serve it via a Model Serving endpoint or apply it to a Delta table using mlflow.pyfunc.spark_udf() for batch scoring.
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
mlflow.set_experiment("/Users/me/churn-prediction")
mlflow.autolog()  # enable before training so params and metrics are captured
with mlflow.start_run(run_name="rf_v3"):
    model = RandomForestClassifier(n_estimators=200, max_depth=8)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    mlflow.log_metric("test_accuracy", acc)
    # Register to Model Registry
    mlflow.sklearn.log_model(
        model,
        artifact_path="rf_model",
        registered_model_name="churn-predictor"
    )
# Batch inference on a Delta table using the registered model
import mlflow.pyfunc
predict_udf = mlflow.pyfunc.spark_udf(spark, "models:/churn-predictor/Production")
predictions = customers_df.withColumn("churn_prob", predict_udf(*feature_cols))
predictions.write.format("delta").mode("overwrite").saveAsTable("gold.churn_predictions")
Q11. Explain Structured Streaming checkpoints and exactly-once semantics in Databricks.
Structured Streaming achieves exactly-once semantics through a combination of offset tracking and idempotent writes. The checkpoint directory stores two things: (1) offsets — the source position (Kafka offset, file path, etc.) of the last successfully processed micro-batch. (2) commit log — confirmation that the last batch was written to the sink. On restart, the engine reads the checkpoint to decide where to resume; the last in-flight micro-batch may be replayed, so no data is lost — and duplicates are avoided only because the sink deduplicates by batch ID.
For exactly-once end-to-end (not just at-least-once), the sink must also support idempotent writes. Delta Lake sinks are idempotent — each micro-batch has a unique batch ID, and Delta's transaction log prevents duplicate writes even if a batch is retried. Kafka sinks require enabling idempotent producers. HTTP/JDBC sinks typically provide at-least-once semantics only.
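The batch-ID mechanism can be modelled in a few lines (a toy model for intuition, not Delta's actual transaction-log code): the sink records each committed batch ID, so a batch replayed after a restart becomes a no-op instead of a duplicate write.

```python
class IdempotentSink:
    """Toy model of an idempotent streaming sink keyed by micro-batch ID."""

    def __init__(self):
        self.committed = set()   # batch IDs already written (the "transaction log")
        self.rows = []

    def write_batch(self, batch_id, rows):
        if batch_id in self.committed:
            return False         # retry of a committed batch: skipped, no duplicates
        self.rows.extend(rows)   # in reality: data files + log entry commit atomically
        self.committed.add(batch_id)
        return True

sink = IdempotentSink()
sink.write_batch(0, ["a", "b"])
sink.write_batch(0, ["a", "b"])  # replay of batch 0 after a simulated restart
print(sink.rows)  # -> ['a', 'b']  (exactly once, despite the retry)
```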
# Structured Streaming with Delta sink — exactly-once
import pyspark.sql.functions as F
(spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "broker:9092")
.option("subscribe", "orders-topic")
.option("startingOffsets", "latest")
.load()
.select(F.from_json(F.col("value").cast("string"), schema).alias("data"))
.select("data.*")
.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/mnt/checkpoints/kafka-orders")
.trigger(processingTime="30 seconds")
.toTable("silver.orders")
)
Q12. What is the Change Data Feed (CDF) in Delta Lake?
Change Data Feed (also called Change Data Capture at the storage layer) is a Delta Lake feature that records row-level changes — inserts, updates (before and after values), and deletes — as they happen to a Delta table. When enabled, Delta writes a separate _change_data/ directory alongside the regular Parquet files. Downstream consumers can read only the changes since their last checkpoint, rather than re-scanning the entire table.
This is particularly powerful for propagating changes through a medallion architecture: a Silver table with CDF enabled can feed a Gold aggregation table incrementally. Instead of the Gold job re-reading all Silver data on every run, it reads only the rows that changed since the last run — dramatically reducing compute cost. It is also the foundation for building low-latency CDC pipelines from Delta into downstream systems like Elasticsearch or a feature store.
-- Enable CDF on a table
ALTER TABLE silver.orders SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
-- Or at creation time
CREATE TABLE silver.orders (
order_id STRING,
customer_id STRING,
amount DOUBLE
) TBLPROPERTIES (delta.enableChangeDataFeed = true);
# Read changes since version 10
changes = (spark.read
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 10)
    .table("silver.orders")
)
# _change_type column = "insert" | "update_preimage" | "update_postimage" | "delete"
changes.filter("_change_type IN ('insert', 'update_postimage')").show()
Q13. How do you handle data skew in Databricks / PySpark?
Data skew happens when one or a few join or groupBy keys contain a disproportionate number of rows — causing some Spark tasks to process 100x more data than others. The job runs as fast as the slowest task. Common symptoms: most tasks complete in seconds, but 1-2 tasks run for 10+ minutes; Spark UI shows a handful of tasks using far more memory than the rest.
Solutions: (1) Salting — add a random suffix (0–N) to the skewed key before the join, then join on the salted key after replicating the small side with all N suffixes. (2) AQE skew join optimisation — Databricks Runtime 7.0+ with Adaptive Query Execution (spark.sql.adaptive.enabled=true) automatically detects and splits skewed partitions at runtime. (3) Broadcast the smaller side if it fits in memory. (4) Pre-aggregate before joining to reduce the volume of skewed rows.
# Manual salting to handle skew on "customer_id"
import pyspark.sql.functions as F
SALT_BUCKETS = 20
# Add salt to the large side
orders_salted = orders.withColumn(
"salted_key",
F.concat(F.col("customer_id"), F.lit("_"), (F.rand() * SALT_BUCKETS).cast("int").cast("string"))
)
# Replicate the small side with all salt values
customers_exploded = (customers
    .withColumn("salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)])))
    .withColumn("salted_key", F.concat(F.col("customer_id"), F.lit("_"), F.col("salt").cast("string")))
)
# Join on salted key
result = orders_salted.join(customers_exploded, on="salted_key", how="inner")
# AQE (preferred — no code changes needed)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
Q14. What is Adaptive Query Execution (AQE) and what problems does it solve?
AQE (enabled by default in Databricks Runtime 7.0+) is a Spark feature that re-optimises a query plan at runtime using actual data statistics collected during execution — rather than relying solely on pre-execution estimates. This solves three classic Spark performance problems:
(1) Skew join: AQE detects partitions that are larger than the median and automatically splits them into smaller sub-partitions, so no single task is overwhelmed. (2) Coalescing small shuffle partitions: After a shuffle, AQE merges small consecutive partitions into larger ones, reducing the number of tasks in subsequent stages. (3) Dynamic switching from sort-merge join to broadcast join: if one side of a join turns out to be small after filtering, AQE switches to a broadcast join at runtime, avoiding an expensive shuffle.
# AQE configuration in Databricks (mostly defaults — shown for interview awareness)
spark.conf.set("spark.sql.adaptive.enabled", "true") # default ON in DBR 7+
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true") # merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true") # auto-handle skew
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true") # reduce network I/O
# Check if AQE was used in a query plan
df.explain("formatted") # look for "AdaptiveSparkPlan" and "AQE"-annotated nodes
Q15. How do you optimise a slow Databricks job? Walk through your checklist.
This is the most common senior-level Databricks interview question. A structured answer demonstrates experience. Here is the production checklist:
1. Check the Spark UI: look at the Stages tab for the task duration distribution. A highly skewed distribution (most tasks fast, a few very slow) = data skew. Check the SQL tab for expensive exchanges (shuffles) and missing predicate pushdown.
2. Count the files: use DESCRIBE DETAIL on the Delta table. Thousands of tiny files = small file problem → run OPTIMIZE.
3. Check the query plan with df.explain("formatted"): look for FileScan with a large number of files and no partition filters, SortMergeJoin where BroadcastHashJoin would be faster, and unnecessary full table scans.
4. Tune partitioning: spark.sql.shuffle.partitions defaults to 200. For large datasets use 2–4x the number of cores; for small datasets reduce it significantly. AQE handles this automatically if enabled.
5. Cache strategically: if the same large DataFrame is used in multiple actions, cache it. But caching consumes memory — unpersist it when done.
6. Upgrade to a cluster with Photon enabled for SQL-heavy workloads.
# Diagnostic commands for a slow Databricks job
# 1. Check table file health
spark.sql("DESCRIBE DETAIL prod.silver.orders").select("numFiles", "sizeInBytes").show()
# 2. Check query plan for issues
df.explain("formatted")
# 3. Tune shuffle partitions if not using AQE
spark.conf.set("spark.sql.shuffle.partitions", "400") # tune to dataset size
# 4. Force broadcast for a known-small table
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(dim_df), "id")
# 5. Cache a reused DataFrame
df_silver = spark.table("silver.orders").filter("status = 'complete'").cache()
df_silver.count() # trigger caching
# 6. Run OPTIMIZE after heavy writes
spark.sql("OPTIMIZE prod.silver.orders ZORDER BY (customer_id)")