Practice Certifications Tutorials Blog Pricing

SQL Interview Prep

Top SQL Interview Questions
for Data Engineers 2026

Q: What is a window function and how does PARTITION BY work?

A window function performs a calculation across a set of rows related to the current row (the window), without collapsing rows the way GROUP BY does. PARTITION BY divides the result set into groups; the function resets for each partition. ORDER BY within the window defines row ordering for rank/lag/lead functions. SELECT name, dept, salary, SUM(salary) OVER (PARTITION BY dept) as dept_total, salary * 100.0 / SUM(salary) OVER (PARTITION BY dept) as pct_of_dept FROM employees

Q: What is the difference between DELETE, TRUNCATE, and DROP?

• DELETE — removes rows one by one, can have WHERE clause, is logged, can be rolled back, slower • TRUNCATE — removes all rows at once, not row-by-row, minimal logging, faster, cannot have WHERE clause, resets identity counters • DROP — removes the entire table structure and data from the database permanently In data engineering contexts, TRUNCATE is used to reset staging tables before a load.

Q: How do you calculate a running total in SQL?

Use SUM() as a window function with ORDER BY and a frame specification: SELECT date, amount, SUM(amount) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as running_total FROM transactions ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is the default frame for running totals. You can also use a 7-day rolling sum: SUM(amount) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)

15+ essential SQL questions asked in data engineering interviews — covering window functions, CTEs, joins, aggregations and query optimisation.

15+ questions

Detailed answers

With code examples

Free to read

SQL Interview Questions

Questions are ordered from foundational to advanced. Each answer includes practical examples relevant to data engineering roles.

What is the difference between WHERE and HAVING in SQL?

WHERE filters rows before aggregation — it operates on individual rows. HAVING filters after aggregation — it operates on grouped results. You cannot use aggregate functions in WHERE. SELECT dept, COUNT(*) as cnt FROM employees WHERE salary > 50000 -- filters rows first GROUP BY dept HAVING COUNT(*) > 5 -- filters groups after

Explain ROW_NUMBER(), RANK(), and DENSE_RANK() — what is the difference?

All three assign a number to each row within a window partition. • ROW_NUMBER() — always unique, no gaps: 1, 2, 3, 4 • RANK() — ties get the same rank, gaps follow: 1, 2, 2, 4 • DENSE_RANK() — ties get the same rank, no gaps: 1, 2, 2, 3 Example: find the top earner per department SELECT * FROM ( SELECT *, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) rk FROM employees ) t WHERE rk = 1

What is a CTE (Common Table Expression) and when should you use it?

A CTE is a named temporary result set defined with WITH. It improves readability and lets you reference the same subquery multiple times without repeating code. Recursive CTEs can traverse hierarchical data. WITH high_earners AS ( SELECT dept, AVG(salary) avg_sal FROM employees GROUP BY dept ) SELECT e.name, e.dept, h.avg_sal FROM employees e JOIN high_earners h ON e.dept = h.dept WHERE e.salary > h.avg_sal

What is the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN?

• INNER JOIN — returns only rows that match in both tables • LEFT JOIN — all rows from left table, matched rows from right (NULLs where no match) • RIGHT JOIN — all rows from right table, matched rows from left (NULLs where no match) • FULL OUTER JOIN — all rows from both tables, NULLs where no match on either side In data engineering interviews, interviewers often test whether you know when to use LEFT JOIN to retain all source records during pipeline transformations.

How do you find duplicate rows in a SQL table?

Use GROUP BY with HAVING COUNT > 1: SELECT email, COUNT(*) as cnt FROM users GROUP BY email HAVING COUNT(*) > 1 To find full duplicate rows (all columns identical), use: SELECT *, COUNT(*) OVER (PARTITION BY col1, col2, col3) as cnt FROM table_name Having the window function approach is useful when you need to keep one copy and delete others.

What is the difference between UNION and UNION ALL?

UNION removes duplicate rows (performs a DISTINCT operation internally). UNION ALL keeps all rows including duplicates. UNION ALL is faster because it skips the deduplication step. In ETL pipelines, always use UNION ALL unless you explicitly need deduplication, as UNION adds unnecessary overhead.

Write a SQL query to find the second highest salary.

Multiple approaches: -- Using subquery SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees) -- Using window function (preferred) SELECT salary FROM ( SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) as rk FROM employees ) t WHERE rk = 2 The window function approach generalises easily to Nth highest.

What is a window function and how does PARTITION BY work?

A window function performs a calculation across a set of rows related to the current row (the window), without collapsing rows the way GROUP BY does. PARTITION BY divides the result set into groups; the function resets for each partition. ORDER BY within the window defines row ordering for rank/lag/lead functions. SELECT name, dept, salary, SUM(salary) OVER (PARTITION BY dept) as dept_total, salary * 100.0 / SUM(salary) OVER (PARTITION BY dept) as pct_of_dept FROM employees

What is the difference between DELETE, TRUNCATE, and DROP?

• DELETE — removes rows one by one, can have WHERE clause, is logged, can be rolled back, slower • TRUNCATE — removes all rows at once, not row-by-row, minimal logging, faster, cannot have WHERE clause, resets identity counters • DROP — removes the entire table structure and data from the database permanently In data engineering contexts, TRUNCATE is used to reset staging tables before a load.

Q10

How do you calculate a running total in SQL?

Use SUM() as a window function with ORDER BY and a frame specification: SELECT date, amount, SUM(amount) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as running_total FROM transactions ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is the default frame for running totals. You can also use a 7-day rolling sum: SUM(amount) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)

Q11

What is a self join and when would you use it?

A self join joins a table to itself. Common use case: finding manager-employee relationships or comparing rows within the same table. -- Find employees who earn more than their manager SELECT e.name, e.salary, m.name as manager, m.salary as mgr_salary FROM employees e JOIN employees m ON e.manager_id = m.id WHERE e.salary > m.salary

Q12

Explain the difference between correlated and non-correlated subqueries.

A non-correlated subquery executes once independently of the outer query. A correlated subquery references columns from the outer query and executes once per row of the outer query — much slower. -- Correlated (runs once per employee row) SELECT name FROM employees e WHERE salary > (SELECT AVG(salary) FROM employees WHERE dept = e.dept) In practice, rewrite correlated subqueries using window functions or CTEs for better performance.

Q13

What is the LAG() and LEAD() window function used for?

LAG() accesses a value from a previous row; LEAD() from a subsequent row. Both are used for period-over-period comparisons. -- Calculate day-over-day revenue change SELECT date, revenue, LAG(revenue, 1) OVER (ORDER BY date) as prev_day_revenue, revenue - LAG(revenue, 1) OVER (ORDER BY date) as daily_change FROM daily_revenue Interviewers often ask you to find consecutive login days or detect gaps in time series data using LAG/LEAD.

Q14

How do you pivot rows to columns in SQL?

Use conditional aggregation with CASE WHEN: SELECT user_id, SUM(CASE WHEN product = "A" THEN revenue ELSE 0 END) as product_A, SUM(CASE WHEN product = "B" THEN revenue ELSE 0 END) as product_B, SUM(CASE WHEN product = "C" THEN revenue ELSE 0 END) as product_C FROM sales GROUP BY user_id In databases that support PIVOT syntax (SQL Server, Oracle, Spark SQL), you can use PIVOT directly.

Q15

What are indexes and how do they affect query performance?

An index is a data structure (typically a B-tree) that speeds up data retrieval at the cost of additional storage and write overhead. B-tree indexes are efficient for range queries and equality lookups. Composite indexes benefit queries that filter on multiple columns — the leading column matters most (leftmost prefix rule). In data warehouses, columnar storage and partition pruning often replace traditional row-level indexes.

Practice SQL in a real execution environment

Run your SQL queries against real datasets — no setup needed. Instant feedback on whether your output matches.

Practice SQL Questions →View Pricing

Top SQL Interview Questionsfor Data Engineers 2026