Python Interview Prep

Python Interview Questions for Data Engineers 2026

14+ Python questions asked in data engineering interviews — covering ETL pipelines, pandas, generators, multiprocessing and production best practices.

Python Interview Questions

Focused on data engineering use cases — not generic Python trivia. Every answer relates to real pipeline and data processing scenarios.

Q1

What is the difference between a list, tuple, and set in Python?

List — ordered, mutable, allows duplicates. Tuple — ordered, immutable, allows duplicates; faster than list for fixed data. Set — unordered, mutable, no duplicates; O(1) lookup. In data engineering, use tuples for fixed config/constants, sets for deduplication, and lists for sequences you need to modify.
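To make the distinctions concrete, a small sketch with made-up event data (names and values are illustrative, not from the article):

```python
# Tuples for fixed records: ordered, immutable.
events = [("u1", "click"), ("u2", "view"), ("u1", "click")]

user_ids = [user for user, _ in events]   # list: ordered, allows duplicates
unique_users = set(user_ids)              # set: deduplication, O(1) membership

print(user_ids)              # ['u1', 'u2', 'u1']
print(sorted(unique_users))  # ['u1', 'u2']
print("u2" in unique_users)  # True (constant-time lookup)
```

The set drops the duplicate "u1" automatically, which is exactly the deduplication use case mentioned above.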
Q2

Explain generators in Python. When would you use them in data engineering?

A generator is a function that yields values one at a time instead of returning a list all at once. It produces values lazily — only when requested — using minimal memory:

    def read_large_file(path):
        with open(path) as f:
            for line in f:
                yield line.strip()

In data engineering, generators are essential for processing large files or database result sets that don't fit in memory. Python's csv.reader, file iteration, and many database cursors use generator patterns internally.
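Generators also compose into streaming pipelines, where each stage processes one record at a time. A minimal sketch with made-up data (the list of strings stands in for a large file):

```python
def parse(lines):
    # Turn raw CSV-ish lines into lists of fields, one at a time.
    for line in lines:
        yield line.strip().split(",")

def valid(rows):
    # Keep only rows with a non-empty id field.
    for row in rows:
        if row[0]:
            yield row

raw = ["1,alice\n", ",bob\n", "3,carol\n"]   # illustrative data
rows = list(valid(parse(raw)))
print(rows)   # [['1', 'alice'], ['3', 'carol']]
```

Because each stage is lazy, no intermediate list of the full dataset is ever built.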
Q3

What is the difference between a shallow copy and a deep copy in Python?

Shallow copy (copy.copy or list[:]) creates a new object but does not copy nested objects — they are references to the same underlying data. Deep copy (copy.deepcopy) recursively copies all nested objects. In data pipelines, using shallow copy when you expect independence between the original and copy leads to subtle mutation bugs.
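A minimal sketch of the mutation bug described above, using an illustrative config dict:

```python
import copy

config = {"columns": ["id", "price"], "sep": ","}

shallow = copy.copy(config)
shallow["columns"].append("quantity")   # mutates the shared nested list
print(config["columns"])                # ['id', 'price', 'quantity']: original changed too!

deep = copy.deepcopy(config)
deep["columns"].append("discount")
print(config["columns"])                # unchanged by the deep copy's mutation
```

The shallow copy shares the inner list with the original, so mutating it through either name changes both.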
Q4

How does Python's GIL affect data engineering workloads?

The Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time. This means multi-threading does not achieve true parallelism for CPU-bound tasks. For data engineering: use multiprocessing for CPU-bound work (data transformation, parsing), use threading for I/O-bound work (API calls, file reads), or use async/await for highly concurrent I/O. PySpark avoids the GIL entirely by running computation on the JVM.
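A minimal sketch of the I/O-bound threading case, using time.sleep to stand in for network waits (the URLs are hypothetical):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Simulated I/O wait; a real API call would release the GIL here too.
    time.sleep(0.1)
    return f"payload from {url}"

urls = [f"https://api.example.com/page/{i}" for i in range(8)]  # hypothetical endpoints

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, urls))
elapsed = time.perf_counter() - start

print(len(results))   # 8
print(elapsed < 0.5)  # True: the waits overlap instead of running sequentially
```

Eight sequential calls would take about 0.8 s; with threads the waits overlap, so the total stays near 0.1 s.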
Q5

How do you build an ETL pipeline in Python? Walk through the pattern.

Standard ETL pattern:

    import pandas as pd

    # Extract
    def extract(source_path):
        return pd.read_csv(source_path)

    # Transform
    def transform(df):
        df = df.dropna(subset=["id"])
        df["created_at"] = pd.to_datetime(df["created_at"])
        df["revenue"] = df["price"] * df["quantity"]
        return df

    # Load
    def load(df, conn_str, table):
        df.to_sql(table, conn_str, if_exists="append", index=False)

Production pipelines add: error handling with retries, schema validation (Great Expectations or Pydantic), logging, idempotency (upserts instead of inserts), and orchestration (Airflow, Prefect).
Q6

What is the difference between pandas merge(), join(), and concat()?

• merge() — SQL-style join on one or more key columns; most flexible
• join() — shorthand for merge on index; less common
• concat() — stacks DataFrames vertically (axis=0) or side by side (axis=1), aligning on the index

    # merge example
    result = orders.merge(customers, on="customer_id", how="left")

    # concat example
    all_months = pd.concat([jan_df, feb_df, mar_df], ignore_index=True)

In data engineering interviews, you will be asked to distinguish when to use merge vs concat.
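A runnable miniature of both calls (the frame contents are made up for illustration):

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [10, 20, 5]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Bo"]})

# merge: SQL-style left join on a key column
result = orders.merge(customers, on="customer_id", how="left")
print(len(result))   # 3: one row per order, with the customer name attached

# concat: stack two month-frames vertically
jan = pd.DataFrame({"customer_id": [1], "amount": [10]})
feb = pd.DataFrame({"customer_id": [2], "amount": [20]})
both = pd.concat([jan, feb], ignore_index=True)
print(len(both))     # 2
```

Rule of thumb: merge when combining different entities by key, concat when stacking same-shaped partitions of one entity.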
Q7

How do you handle large datasets in pandas that don't fit in memory?

Options in order of complexity:

1. Chunking — pd.read_csv(file, chunksize=100000) — process the file in chunks
2. Dtypes — specify dtypes upfront to reduce memory (float32 vs float64, category for low-cardinality strings)
3. usecols — read only the needed columns
4. Dask — drop-in pandas replacement for larger-than-memory DataFrames
5. PySpark — when data is truly big (GB–TB scale)

Interviewers expect you to know the chunking pattern and dtype optimisation at minimum.
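The chunking pattern from option 1 can be sketched as follows; the in-memory CSV stands in for a file too large to load at once:

```python
import io
import pandas as pd

# Illustrative data: 1000 rows of id,revenue pairs.
csv_data = "id,revenue\n" + "\n".join(f"{i},{i * 1.5}" for i in range(1, 1001))

total = 0.0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=250, dtype={"id": "int32"}):
    # Each chunk is an ordinary DataFrame of at most 250 rows.
    total += chunk["revenue"].sum()

print(round(total, 2))   # 750750.0, i.e. 1.5 * (1 + 2 + ... + 1000)
```

Only one chunk is in memory at a time, so peak memory is bounded by chunksize rather than file size.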
Q8

Explain list comprehension vs map() vs a for loop — which is fastest?

List comprehension is generally the fastest and most Pythonic way to build lists. map() can be slightly faster when paired with a built-in function (e.g. map(str, values)), since it avoids Python-level call overhead; with a lambda, as below, that advantage disappears. A for loop with append is slowest for list building.

    # All equivalent:
    squares = [x**2 for x in range(1000)]            # fastest, readable
    squares = list(map(lambda x: x**2, range(1000)))
    squares = []
    for x in range(1000):
        squares.append(x**2)                         # slowest

In data engineering, prefer list/dict comprehensions for clarity. For large data, vectorised pandas/numpy operations beat all three.
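A quick sanity check that all approaches, including the vectorised one, produce identical results (requires numpy; the size is illustrative):

```python
import numpy as np

n = 1000
a = [x**2 for x in range(n)]               # list comprehension
b = list(map(lambda x: x**2, range(n)))    # map with a lambda
c = []
for x in range(n):                         # explicit loop with append
    c.append(x**2)
d = (np.arange(n) ** 2).tolist()           # vectorised: the work happens in C

print(a == b == c == d)   # True: identical results, very different speeds
```

For pipeline-scale arrays, the numpy version typically wins by an order of magnitude because the loop runs in compiled code.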
Q9

What are decorators in Python? Give a practical data engineering example.

A decorator wraps a function to add behaviour without modifying its code:

    import functools
    import time

    def retry(max_attempts=3):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(max_attempts):
                    try:
                        return func(*args, **kwargs)
                    except Exception:
                        if attempt == max_attempts - 1:
                            raise
                        time.sleep(2 ** attempt)  # exponential backoff
            return wrapper
        return decorator

    @retry(max_attempts=3)
    def fetch_from_api(url):
        ...

Retry, logging, and timing decorators are common in production ETL code.
Q10

What is the difference between multiprocessing and threading in Python?

threading — multiple threads share the same memory space and the GIL. Good for I/O-bound tasks (API calls, DB queries) where threads release the GIL during waits.

multiprocessing — spawns separate processes, each with its own Python interpreter and memory. True parallelism for CPU-bound tasks (data transformation, compression).

    from concurrent.futures import ProcessPoolExecutor

    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(transform_chunk, chunks))

In data engineering pipelines, use ProcessPoolExecutor for CPU-bound batch processing of partitioned data.
Q11

How do you validate data quality in a Python pipeline?

Three levels of validation:

1. Schema validation — check column names, dtypes, nullability (Pydantic, Pandera, Great Expectations)
2. Statistical validation — check value ranges, distributions, unique counts
3. Business rule validation — e.g., revenue must be positive, dates must be in range

Example with Pandera:

    import pandera as pa

    schema = pa.DataFrameSchema({
        "id": pa.Column(int, nullable=False),
        "revenue": pa.Column(float, pa.Check.greater_than(0)),
    })
    schema.validate(df)  # raises SchemaError if invalid
Q12

What is a context manager and how does it help in data engineering?

A context manager (the with statement) guarantees setup and teardown regardless of exceptions. Critical in data engineering for:

• File handles — with open(path) as f: always closes the file
• Database connections — with engine.connect() as conn: always closes the connection
• Temporary resources — cleaning up temp files or locking resources

Implement one with __enter__ / __exit__ or the @contextlib.contextmanager decorator. Without context managers, resource leaks in long-running pipelines cause connection pool exhaustion.
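A minimal sketch using @contextlib.contextmanager and SQLite from the standard library (db_connection is an illustrative helper, not a library API):

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def db_connection(path):
    # Setup runs before the yield; the finally block always runs,
    # even if the body of the with-statement raises.
    conn = sqlite3.connect(path)
    try:
        yield conn
    finally:
        conn.close()

with db_connection(":memory:") as conn:
    conn.execute("CREATE TABLE t (id INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

print(count)   # 1, and the connection is closed whether or not an error occurred
```

The same shape works for temp-file cleanup or lock acquisition: put the teardown in the finally block.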
Q13

How do you connect to and query a database in Python?

Using SQLAlchemy (a database-agnostic toolkit, with an optional ORM layer) or psycopg2 (raw PostgreSQL):

    # SQLAlchemy — works across databases
    from sqlalchemy import create_engine
    import pandas as pd

    engine = create_engine("postgresql://user:pass@host/dbname")
    df = pd.read_sql("SELECT * FROM orders WHERE date > '2026-01-01'", engine)

    # Write back
    df.to_sql("orders_summary", engine, if_exists="replace", index=False)

Always use parameterised queries (never f-strings) to prevent SQL injection. Use connection pooling for high-throughput pipelines.
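Parameterised queries can be demonstrated with the standard library's sqlite3 driver; note that placeholder syntax varies by driver (? for sqlite3, %s for psycopg2). The table and values are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "eu"), (2, "us"), (3, "eu")])

# Parameterised: the driver escapes the value, so never build SQL with f-strings.
region = "eu"
rows = conn.execute("SELECT id FROM orders WHERE region = ?", (region,)).fetchall()
print(rows)   # [(1,), (3,)]
conn.close()
```

Had region come from user input containing quotes or SQL fragments, the placeholder would still treat it as a plain value.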
Q14

Explain error handling best practices in production ETL pipelines.

Key patterns:

1. Catch specific exceptions, not bare except:

    try:
        load_data(df)
    except (ConnectionError, TimeoutError) as e:
        logger.error(f"Load failed: {e}")
        raise

2. Use retries with exponential backoff for transient failures
3. Dead-letter queues — send failed records to a separate table/queue for investigation
4. Idempotency — design loads so re-running does not create duplicates (use upserts)
5. Alerting — send failures to PagerDuty/Slack via webhook
6. Structured logging — log JSON with run_id, source, record_count for traceability
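A minimal sketch of pattern 6, structured logging, using only the standard library (field names and values are illustrative):

```python
import json
import logging
import sys
import uuid

logger = logging.getLogger("etl")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def log_event(event, **fields):
    # One JSON object per line: easy to parse, filter, and correlate by run_id.
    logger.info(json.dumps({"event": event, **fields}))

run_id = str(uuid.uuid4())
log_event("load_started", run_id=run_id, source="orders.csv", record_count=1000)
log_event("load_finished", run_id=run_id, source="orders.csv", record_count=998)
```

Because every line shares the same run_id, a log aggregator can reconstruct the full history of a single pipeline run.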

Practice Python data engineering challenges

Run real Python code in your browser against data engineering problems — no setup, instant results.