Data Engineering Interview Prep

Data Engineering
Interview Questions 2026

15+ must-know data engineering questions covering pipelines, data warehouses, Kafka, Airflow, dbt, and system design — asked at Amazon, Meta, Google and Databricks.

Data Engineering Interview Questions

Covers the full breadth of data engineering — from fundamentals like OLTP vs OLAP to advanced topics like system design and fault-tolerant pipeline architecture.

Q1

What is the difference between a data warehouse, data lake, and data lakehouse?

- Data warehouse: structured data, schema-on-write, optimised for BI queries, relatively expensive storage (Snowflake, Redshift, BigQuery).
- Data lake: raw data in any format, schema-on-read, cheap storage (S3, GCS), but weaker governance and query performance.
- Data lakehouse: combines both, pairing cheap lake storage with warehouse-grade ACID transactions, schema enforcement, and fast queries (Databricks Lakehouse, Delta Lake, Apache Iceberg, Apache Hudi).

The lakehouse is the current industry standard for modern data platforms.
Q2

What is a data pipeline and what are its main components?

A data pipeline moves data from source systems to destinations, applying transformations along the way. Main components:

- Ingestion: extract from sources (APIs, databases, event streams)
- Storage: raw/staging layer (S3, GCS, ADLS)
- Transformation: clean, enrich, aggregate (Spark, dbt, SQL)
- Serving: optimised for consumers (data warehouse, feature store)
- Orchestration: schedule and monitor (Airflow, Prefect, Dagster)
- Observability: data quality checks, lineage, alerting
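A toy sketch of these stages as composed Python functions (the function names are illustrative, not a specific framework's API):

```python
def extract():
    # Ingestion: pretend these rows came from an API or a source database.
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "3.0"}]

def transform(rows):
    # Transformation: cast types; a real job would also clean and enrich.
    return [{**r, "amount": float(r["amount"])} for r in rows]

def load(rows):
    # Serving: return the rows here; a real pipeline would write them
    # to a warehouse table or feature store.
    return rows

def run_pipeline():
    # Orchestration in miniature: stages run in dependency order.
    return load(transform(extract()))

print(run_pipeline())
```

A real orchestrator adds scheduling, retries, and monitoring around each stage; the data flow itself is just this composition.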
Q3

Explain the difference between batch processing and stream processing.

- Batch processing: processes data in large chunks at scheduled intervals (hourly, daily). Simple, cost-efficient, high throughput. Tools: Spark, dbt, SQL. Suitable when a latency of hours is acceptable.
- Stream processing: processes data continuously as it arrives, typically with sub-second to second latency. Tools: Kafka Streams, Apache Flink, Spark Structured Streaming. Suitable for real-time dashboards, fraud detection, and recommendation engines.

Most modern architectures use both: streaming for low-latency use cases and batch for historical reprocessing and aggregations.
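The distinction can be shown in plain Python: the same total computed once over a completed chunk versus updated incrementally per event (a toy model, not a streaming framework):

```python
events = [4, 7, 1, 9]

# Batch: process the whole chunk at once at a scheduled interval.
def batch_total(chunk):
    return sum(chunk)

# Streaming: maintain running state and update it per event as it arrives.
class RunningTotal:
    def __init__(self):
        self.total = 0

    def on_event(self, value):
        self.total += value
        return self.total  # result is available immediately: low latency

stream = RunningTotal()
for e in events:
    stream.on_event(e)

assert batch_total(events) == stream.total == 21
```

Both produce the same answer; they differ in when the answer becomes available and what state must be maintained.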
Q4

What is Apache Kafka and how is it used in data pipelines?

Kafka is a distributed event streaming platform — a highly scalable, fault-tolerant message queue. Producers publish events to topics; consumers read from topics at their own pace. Kafka retains messages for a configurable period (default 7 days), allowing replays. In pipelines: services publish events (user clicks, transactions) → Kafka buffers them → Spark Structured Streaming or Flink consumes → writes to data lake or warehouse. Kafka decouples producers from consumers and handles traffic spikes without data loss.
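A toy in-memory model of these semantics (not the real Kafka client API): the topic is an append-only log, and each consumer tracks its own offset, which is what enables consumers to read at their own pace and to replay:

```python
class Topic:
    """Append-only log standing in for a Kafka topic partition."""
    def __init__(self):
        self.log = []  # retained messages (Kafka: configurable retention)

    def produce(self, msg):
        self.log.append(msg)

class Consumer:
    """Each consumer owns its offset, independent of other consumers."""
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0

    def poll(self):
        msgs = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return msgs

    def seek(self, offset):
        self.offset = offset  # replay: rewind to an earlier offset

orders = Topic()
orders.produce({"order_id": 1})
orders.produce({"order_id": 2})

c = Consumer(orders)
first_read = c.poll()   # reads both messages
c.seek(0)               # rewind
replayed = c.poll()     # reads the same messages again
assert first_read == replayed
```

Real Kafka adds partitioning, consumer groups, and durable offset commits on top of this basic log-plus-offset model.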
Q5

What is the medallion architecture (Bronze/Silver/Gold)?

A layered data organisation pattern popularised by Databricks:

- Bronze: raw ingested data, immutable, full fidelity, no transformations. Schema-on-read.
- Silver: cleaned, validated, deduplicated, enriched. Business entities take shape.
- Gold: aggregated, business-level tables ready for BI and ML consumption.

Benefits: lineage is clear, you can replay from Bronze if bugs are found in Silver/Gold, and each layer enforces progressively stricter data quality. Most modern lakehouse architectures use this pattern.
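The three layers can be sketched on plain Python data (the column names and values are made up):

```python
bronze = [  # raw, full fidelity: duplicates and bad rows included
    {"order_id": 1, "amount": "10.0", "country": "DE"},
    {"order_id": 1, "amount": "10.0", "country": "DE"},  # duplicate
    {"order_id": 2, "amount": None,   "country": "FR"},  # invalid amount
    {"order_id": 3, "amount": "5.5",  "country": "DE"},
]

# Silver: validate, cast, and deduplicate on the business key.
seen = set()
silver = []
for r in bronze:
    if r["amount"] is None or r["order_id"] in seen:
        continue
    seen.add(r["order_id"])
    silver.append({**r, "amount": float(r["amount"])})

# Gold: business-level aggregate ready for BI (revenue by country).
gold = {}
for r in silver:
    gold[r["country"]] = gold.get(r["country"], 0.0) + r["amount"]

assert gold == {"DE": 15.5}
```

If a cleaning bug is found later, Silver and Gold can be rebuilt from the untouched Bronze records.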
Q6

What is dbt (data build tool) and how does it fit in the modern data stack?

dbt is a transformation tool that lets data engineers and analysts write modular SQL SELECT statements that dbt compiles and runs against your data warehouse. It handles: dependency management between models, testing (not null, unique, referential integrity), documentation, and version control for SQL. dbt fits in the T of ELT — data is first loaded into the warehouse, then transformed with dbt. It does not move data; it transforms data already in the warehouse. Stack: Fivetran/Airbyte (EL) → Snowflake/BigQuery (storage) → dbt (T) → Looker/Tableau (BI).
Q7

What is idempotency and why is it critical in data pipelines?

An idempotent operation produces the same result whether run once or many times. In data pipelines, failures and retries are inevitable; a non-idempotent pipeline run twice creates duplicate data or corrupted state. Ways to achieve idempotency:

- Use INSERT OVERWRITE / MERGE instead of plain INSERT
- Partition writes by date, so overwriting a partition is safe to rerun
- Use deduplication keys in upserts
- Track pipeline run state with a watermark table
- In Kafka consumers, use exactly-once semantics or idempotent consumers

Interviewers consider a solid grasp of idempotency a must-have for senior DE roles.
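A minimal illustration of why partition overwrite is rerun-safe while a blind append is not (a plain Python dict stands in for a warehouse table):

```python
table = {}  # partition key (date string) -> list of rows

def append_load(date, rows):
    """NOT idempotent: every rerun appends the rows again."""
    table.setdefault(date, []).extend(rows)

def overwrite_partition(date, rows):
    """Idempotent: a rerun replaces the partition with the same content."""
    table[date] = list(rows)

rows = [{"id": 1}, {"id": 2}]

overwrite_partition("2026-01-01", rows)
overwrite_partition("2026-01-01", rows)  # retry after a failure: harmless
assert len(table["2026-01-01"]) == 2     # no duplicates

append_load("2026-01-02", rows)
append_load("2026-01-02", rows)          # the same retry duplicates data
assert len(table["2026-01-02"]) == 4
```

The same reasoning applies to INSERT OVERWRITE versus INSERT at warehouse scale.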
Q8

Explain SCD Type 1, Type 2, and Type 3.

Slowly Changing Dimensions define how dimension data changes over time:

- SCD Type 1: overwrite the old value; no history kept. Simple. Used when history is not needed (e.g., fixing a typo in a name).
- SCD Type 2: insert a new row for each change, keeping old rows with effective_from/effective_to dates and an is_current flag. Full history preserved. The most common approach in data warehouses.
- SCD Type 3: add a new column (e.g., previous_value). Limited history. Rarely used.

In PySpark, SCD Type 2 is typically implemented with MERGE (Delta Lake) or custom insert-plus-expire logic.
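A minimal SCD Type 2 upsert on in-memory rows, assuming string dates and a single tracked attribute (an illustrative sketch, not Delta Lake MERGE):

```python
def scd2_upsert(dim, key, new_value, as_of):
    """Expire the current row for `key` (if its value changed) and insert a new one."""
    for row in dim:
        if row["key"] == key and row["is_current"]:
            if row["value"] == new_value:
                return dim                 # no change: nothing to do
            row["is_current"] = False
            row["effective_to"] = as_of    # expire the old version
    dim.append({"key": key, "value": new_value,
                "effective_from": as_of, "effective_to": None,
                "is_current": True})
    return dim

dim = []
scd2_upsert(dim, "cust_1", "Berlin", "2026-01-01")
scd2_upsert(dim, "cust_1", "Munich", "2026-03-01")  # address change

current = [r for r in dim if r["is_current"]]
assert len(dim) == 2                       # full history preserved
assert current[0]["value"] == "Munich"     # one current row per key
```

Note how a Type 1 overwrite would have kept only one row and lost the Berlin record entirely.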
Q9

What is data partitioning and why does it matter for query performance?

Partitioning physically divides data into folders based on a column value (e.g., date, country). Query engines skip irrelevant partitions entirely (partition pruning), drastically reducing the data scanned. Example: three years of data partitioned by day yields roughly 1,000 partitions, so a query for one day scans about 1/1000th of the data instead of performing a full table scan. Best practices:

- Partition on commonly filtered columns (date is the most common)
- Avoid high-cardinality columns as partition keys (they create too many small files)
- Combine partitioning with Z-ordering (Delta Lake) for multi-column filtering
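Partition pruning can be illustrated with a made-up directory layout: the engine only lists files under the partition matching the filter instead of scanning everything:

```python
# Hypothetical lake layout: one directory per date partition.
files = {
    "sales/dt=2026-01-01/": ["part-0.parquet", "part-1.parquet"],
    "sales/dt=2026-01-02/": ["part-0.parquet"],
    "sales/dt=2026-01-03/": ["part-0.parquet", "part-1.parquet"],
}

def files_to_scan(partition_filter=None):
    if partition_filter is None:
        # No filter on the partition column: full table scan.
        return [f for parts in files.values() for f in parts]
    # Filter on the partition column: only one directory is listed.
    return files[f"sales/dt={partition_filter}/"]

assert len(files_to_scan()) == 5
assert len(files_to_scan("2026-01-02")) == 1  # 1 of 5 files scanned
```

Filtering on a non-partition column gets no such benefit, which is why the partition key should match common query filters.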
Q10

What is the small files problem in a data lake and how do you fix it?

When many small files accumulate (from frequent micro-batch writes or poor partitioning), the per-file overhead adds up: listing, opening, and metadata operations become slow, and the driver/namenode gets overwhelmed. Fixes:

- Coalesce or repartition before writing to control the output file count
- Use the Delta Lake / Iceberg OPTIMIZE command to compact small files
- Use Hudi's inline clustering
- Increase write batch sizes
- Tune spark.sql.files.maxPartitionBytes to control read-time file merging

This is a common cause of slow Spark jobs on production data lakes.
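Compaction can be sketched as bin-packing small files into target-sized outputs, similar in spirit to what OPTIMIZE or a coalesce-before-write achieves (the file sizes in MB are made up):

```python
small_files = [4, 12, 8, 60, 2, 90, 30, 16]  # sizes in MB
TARGET_MB = 128                              # desired output file size

compacted, current = [], []
for size in sorted(small_files, reverse=True):  # pack largest first
    if current and sum(current) + size > TARGET_MB:
        compacted.append(current)  # close the current output file
        current = []
    current.append(size)
if current:
    compacted.append(current)

# 8 small files become 3 larger ones; total bytes are unchanged.
print(len(small_files), "->", len(compacted), "files")
```

Fewer, larger files mean fewer list/open/metadata operations per query, which is exactly the overhead the small files problem creates.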
Q11

What is Apache Airflow and how do you structure a DAG?

Airflow is an orchestration platform that schedules and monitors data pipelines as DAGs (Directed Acyclic Graphs). Each node is a task (operator); edges define execution order and dependencies.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# extract_fn, transform_fn, and load_fn are your pipeline callables.
with DAG(
    "etl_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2026, 1, 1),
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_fn)
    transform = PythonOperator(task_id="transform", python_callable=transform_fn)
    load = PythonOperator(task_id="load", python_callable=load_fn)

    extract >> transform >> load
```

Best practices: keep tasks atomic and idempotent, use XCom sparingly, set retries and timeouts, and use sensors for data-dependent triggers.
Q12

What is data lineage and why does it matter?

Data lineage tracks the full journey of data: where it came from, how it was transformed, and where it went. It answers "why does this number look wrong?" by tracing back through every transformation. Why it matters:

- Debugging: trace incorrect values back to the source
- Impact analysis: know which downstream tables break if you change a source
- Compliance: GDPR right-to-erasure requires knowing everywhere PII was copied
- Trust: data consumers trust data more when they can see its origin

Tools: OpenLineage (open standard), Marquez, DataHub, Apache Atlas, and dbt docs (model-level lineage).
Q13

How do you design a data pipeline for high availability and fault tolerance?

Key principles:

1. Idempotency: safe to rerun on failure
2. Checkpointing: Spark Structured Streaming and Flink checkpoint state so restarts resume from the last good position
3. Dead-letter queues: failed records go to a separate store for investigation rather than being lost
4. Retries with backoff: transient failures (network issues, API rate limits) are retried automatically
5. Alerting: SLA breaches and failures trigger PagerDuty/Slack alerts immediately
6. Separation of compute and storage: stateless workers restart without data loss
7. Multi-region / multi-AZ deployment for critical pipelines
8. Schema registry: prevents schema changes from silently breaking downstream consumers
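Retries with backoff and dead-letter queues can be sketched together in plain Python (a toy sketch, not a production framework; the handler names are made up):

```python
import time

dead_letter_queue = []

def process_with_retries(record, handler, max_attempts=3, base_delay=0.01):
    """Retry transient failures with exponential backoff; park persistent
    failures in a dead-letter queue instead of losing them."""
    for attempt in range(max_attempts):
        try:
            return handler(record)
        except Exception:
            time.sleep(base_delay * 2 ** attempt)  # 0.01s, 0.02s, 0.04s, ...
    dead_letter_queue.append(record)  # for later investigation / replay
    return None

attempts = {"n": 0}
def flaky_handler(record):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient network error")
    return f"ok:{record}"

assert process_with_retries("rec-1", flaky_handler) == "ok:rec-1"  # recovers
assert process_with_retries("rec-2", lambda r: 1 / 0) is None      # parked
assert dead_letter_queue == ["rec-2"]
```

Orchestrators and stream processors provide these mechanisms out of the box (Airflow task retries, Kafka dead-letter topics); the logic is the same.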
Q14

What is the difference between OLTP and OLAP systems?

OLTP (Online Transaction Processing) — optimised for high-volume, low-latency reads and writes of individual records. Row-oriented storage. Examples: PostgreSQL, MySQL, Aurora. Used for applications (e.g., user sign-ups, order processing). OLAP (Online Analytical Processing) — optimised for complex aggregations over large volumes of data. Columnar storage. Examples: Snowflake, BigQuery, Redshift, ClickHouse. Used for analytics and reporting. In data engineering: data flows OLTP → ETL → OLAP. Never run heavy analytical queries on production OLTP databases.
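The row versus columnar distinction can be shown with plain Python structures: an analytical aggregate over one column only needs that column's contiguous values in a columnar layout (a toy model of the storage formats, not a real engine):

```python
# Row-oriented layout (OLTP): each record is stored together,
# which is ideal for reading or writing one record at a time.
row_store = [
    {"id": 1, "amount": 10.0, "country": "DE"},
    {"id": 2, "amount": 5.0,  "country": "FR"},
]

# Columnar layout (OLAP): one contiguous sequence per column,
# so an aggregate touches only the column it needs.
column_store = {
    "id": [1, 2],
    "amount": [10.0, 5.0],
    "country": ["DE", "FR"],
}

# Same answer, but the columnar scan never reads id or country.
assert sum(r["amount"] for r in row_store) == sum(column_store["amount"]) == 15.0
```

At warehouse scale this difference (plus per-column compression) is why columnar engines dominate analytics.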
Q15

Describe a data engineering system design for a real-time dashboard.

Example: a real-time sales dashboard updated every 30 seconds.

- Ingestion: point-of-sale systems publish to Kafka topics (orders_created, payments_processed)
- Stream processing: Spark Structured Streaming or Flink consumes from Kafka, enriches with product/customer data, aggregates per minute, and writes to Delta Lake (Bronze → Silver)
- Serving layer: a Gold table in Snowflake/BigQuery is updated every 30s via MERGE and queried by Looker/Grafana
- Alerting: Flink detects anomalies (e.g., a sudden revenue drop) and triggers PagerDuty

Trade-offs to discuss: exactly-once vs at-least-once delivery, micro-batch vs true streaming, freshness vs cost.
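The core of the stream-processing step, per-minute tumbling-window aggregation, can be sketched in plain Python (timestamps are epoch seconds and the data is made up):

```python
from collections import defaultdict

events = [
    {"ts": 60,  "amount": 10.0},
    {"ts": 75,  "amount": 5.0},
    {"ts": 130, "amount": 2.5},
]

# Revenue per 1-minute tumbling window, keyed by window start time.
windows = defaultdict(float)
for e in events:
    window_start = e["ts"] // 60 * 60   # floor to the minute boundary
    windows[window_start] += e["amount"]  # incrementally updated state

assert dict(windows) == {60: 15.0, 120: 2.5}
```

A real stream processor adds watermarks for late events and checkpointing of this window state, which feeds directly into the trade-off discussion above.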

Put your knowledge into practice

Reading is only half the prep. Practice writing real PySpark, Python and SQL code against data engineering problems — with instant feedback.