Data Engineering Interview Cheat Sheet 2026
Data engineering interviews in 2026 have evolved significantly. Companies now test a broader range of skills — not just SQL and Python, but distributed systems, real-time processing, data modelling, and infrastructure. This cheat sheet covers everything you need to study, organised by topic and frequency.
The 5 Pillars of a Data Engineering Interview
1. SQL (always tested, medium-high difficulty).
2. Python/PySpark coding (tested at 90% of companies).
3. System design (tested at senior level, increasing at mid-level).
4. Data modelling (star schema, normalisation, slowly changing dimensions).
5. Behavioural / situational (every company).
SQL — Must-Know Topics
Window functions (ROW_NUMBER, RANK, LAG, LEAD, SUM OVER), CTEs and recursive CTEs, subqueries vs joins — when each is better, GROUP BY with HAVING, CASE WHEN for conditional aggregation, NULL handling (COALESCE, NULLIF, IS NULL), self-joins, and date/time functions (DATE_TRUNC, DATEDIFF, EXTRACT).
-- Classic interview pattern: find users active on 3+ consecutive days
WITH daily_logins AS (
    SELECT user_id, DATE(login_time) AS login_date
    FROM logins
    GROUP BY 1, 2
),
with_gaps AS (
    SELECT
        user_id, login_date,
        login_date - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS grp
    FROM daily_logins
)
SELECT user_id, COUNT(*) AS streak_length
FROM with_gaps
GROUP BY user_id, grp
HAVING COUNT(*) >= 3;

PySpark — Must-Know Topics
DataFrame API (filter, select, withColumn, join, groupBy, agg), Window functions with the Window class, UDFs and when NOT to use them (vectorised UDFs with pandas_udf are preferred), reading/writing Parquet and Delta format, broadcast joins, explain() and query plan analysis, repartition vs coalesce, caching strategy.
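To build intuition for why broadcast joins matter, here is a minimal plain-Python sketch of what a broadcast (map-side hash) join does conceptually. In Spark you would write `df.join(broadcast(dim_df), "product_id")`; the function and sample data below are illustrative, not Spark API.

```python
def broadcast_join(large_rows, small_rows, key):
    # Build an in-memory hash map from the small (dimension) table once.
    # Spark ships this map to every executor, so the large table is
    # joined in a single pass with no shuffle.
    lookup = {row[key]: row for row in small_rows}
    joined = []
    for row in large_rows:
        match = lookup.get(row[key])
        if match is not None:        # inner-join semantics: drop misses
            joined.append({**row, **match})
    return joined

sales = [
    {"product_id": 1, "amount": 10.0},
    {"product_id": 2, "amount": 5.5},
    {"product_id": 9, "amount": 3.0},   # no matching dimension row
]
products = [
    {"product_id": 1, "name": "widget"},
    {"product_id": 2, "name": "gadget"},
]

result = broadcast_join(sales, products, "product_id")
```

The design point interviewers look for: broadcasting is only safe when the small table fits comfortably in executor memory; otherwise you fall back to a shuffle join.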
Python — Must-Know Topics
List/dict/set comprehensions, generators (memory-efficient iteration), decorators (especially for retry logic and timing), context managers, error handling patterns, Pandas for data manipulation, datetime handling, JSON parsing, writing modular testable functions, and type hints.
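The decorators-for-retry-logic item above is a very common live-coding ask. A minimal sketch (the `retry` decorator and `flaky_fetch` names are illustrative):

```python
import time
from functools import wraps

def retry(times=3, delay=0.0, exceptions=(Exception,)):
    """Retry a flaky function up to `times` attempts before re-raising."""
    def decorator(fn):
        @wraps(fn)                       # preserve name/docstring of fn
        def wrapper(*args, **kwargs):
            last_err = None
            for _ in range(times):
                try:
                    return fn(*args, **kwargs)
                except exceptions as err:
                    last_err = err
                    time.sleep(delay)    # back off before the next try
            raise last_err
        return wrapper
    return decorator

calls = {"n": 0}

@retry(times=3)
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = flaky_fetch()   # succeeds on the third attempt
```

Mentioning exponential backoff and catching only specific exception types (not bare `Exception`) is an easy way to score extra points here.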
System Design — Common Questions
These questions are increasingly common even at mid-level: "Design a pipeline to process 1 billion events per day", "How would you build a real-time dashboard showing user activity?", "Design a data warehouse for an e-commerce company", "How would you handle late-arriving data in a streaming pipeline?"
For any system design question, cover: data sources and ingestion, storage layer (what format, what partitioning), processing layer (batch vs stream), serving layer (how data is consumed), failure handling and monitoring, and scalability considerations.
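For the late-arriving-data question specifically, it helps to be able to sketch watermark semantics. The plain-Python routine below is a conceptual sketch, not a streaming framework (in Spark Structured Streaming this is `withWatermark`): events more than `watermark` seconds behind the latest event time seen are routed to a late-events sink instead of the main aggregation.

```python
def route_events(events, watermark):
    """Split (event_time, payload) pairs into on-time and late events.

    An event is late if its timestamp is more than `watermark` seconds
    behind the maximum event time observed so far.
    """
    max_event_time = float("-inf")
    on_time, late = [], []
    for ts, payload in events:
        max_event_time = max(max_event_time, ts)
        if ts < max_event_time - watermark:
            late.append((ts, payload))   # reprocess/backfill separately
        else:
            on_time.append((ts, payload))
    return on_time, late

# Event times arrive out of order; with a 10-second watermark,
# the event at t=90 is too old once t=105 has been seen.
events = [(100, "a"), (105, "b"), (90, "c"), (104, "d")]
on_time, late = route_events(events, watermark=10)
```

In the interview, say what happens to the late stream: drop it, write it to a dead-letter table, or trigger a bounded backfill of the affected partitions.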
Top 20 Questions You Will Likely Face
1. Find the Nth highest salary.
2. Month-over-month growth rate.
3. Deduplicate a DataFrame, keep latest.
4. Calculate 7-day rolling average.
5. Find users who did event A then event B.
6. Identify the top 3 products per category.
7. Calculate retention rate (Day 1, Day 7, Day 30).
8. Handle a skewed join.
9. Write an idempotent ETL pipeline.
10. Explain the difference between repartition and coalesce.
11. Consecutive days streak.
12. Median salary per department.
13. First and last event per session.
14. Calculate cumulative distribution.
15. Handle slowly changing dimensions.
16. Design a star schema for sales data.
17. Explain partitioning in Spark.
18. When to use broadcast join.
19. What causes data skew and how to fix it.
20. Difference between Parquet and ORC.
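As a warm-up for question 4, here is a trailing 7-day rolling average in plain Python (the function name and sample data are illustrative; in SQL this is `AVG(...) OVER (ORDER BY d ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)`):

```python
from collections import deque

def rolling_average(values, window=7):
    """Trailing moving average. The first window-1 positions average only
    the values seen so far -- a clarification worth asking in the interview."""
    buf = deque(maxlen=window)   # deque evicts the oldest value itself
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out

daily_sales = [10, 20, 30, 40, 50, 60, 70, 80]
avgs = rolling_average(daily_sales, window=7)
# avgs[6] averages days 1-7; avgs[7] averages days 2-8
```

Clarifying how to treat the partial window at the start (average what you have, emit NULL, or skip) is exactly the kind of edge-case discussion interviewers want to hear.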
Last-Minute Study Plan
Week before: Solve 2-3 window function problems daily. Day before: Review SQL NULL handling and PySpark join types. Morning of: Skim your notes on optimisation (broadcast joins, skew handling). In the interview: Clarify the schema first, write the simplest working solution, then optimise. Always mention edge cases.