
Python & Pandas Interview Guide for Data Engineers

Python interviews for data engineering roles are different from general software engineering Python interviews. The focus is on data processing patterns, pandas efficiency, ETL design, and understanding when Python alone is not enough (and PySpark is needed).

Python Fundamentals That Actually Matter

Interviewers do not ask about metaclasses or abstract base classes. They ask about things you use daily in data pipelines.

import pandas as pd

# Generators — critical for large file processing
def stream_csv(path, chunk_size=10_000):
    """Process a 50GB CSV without loading it all into memory."""
    for chunk in pd.read_csv(path, chunksize=chunk_size):
        yield process_chunk(chunk)  # process_chunk is whatever transformation you need

# Context managers — ensure resource cleanup
from contextlib import contextmanager
from sqlalchemy import create_engine

@contextmanager
def db_connection(conn_str):
    conn = create_engine(conn_str).connect()
    try:
        yield conn
    finally:
        conn.close()  # Always runs, even on exception
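To see the generator pattern work end to end, here is a self-contained sketch: the CSV is simulated with io.StringIO, and the per-chunk "processing" is just a sum (both are illustrative stand-ins for the versions above):

```python
import io
import pandas as pd

def stream_sums(buffer, chunk_size=2):
    """Yield one aggregated result per chunk instead of loading all rows at once."""
    for chunk in pd.read_csv(buffer, chunksize=chunk_size):
        yield chunk["value"].sum()  # memory stays flat: only one chunk is live at a time

csv_data = io.StringIO("value\n1\n2\n3\n4\n5\n")
partial_sums = list(stream_sums(csv_data))
print(partial_sums)       # three chunks of sizes 2, 2, 1 → [3, 7, 5]
print(sum(partial_sums))  # 15 — same answer as a full read, without holding all rows
```

The same shape scales to the 50GB case: replace the StringIO buffer with a file path and the per-chunk sum with your real transformation.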

Pandas Memory Optimisation

A pandas DataFrame can use 5-10x more memory than necessary if you use default dtypes. Interviewers love asking about this because it is a common real-world problem.

import pandas as pd

# Before: 800MB
df = pd.read_csv("large_file.csv")

# After: ~120MB — specify dtypes explicitly
# After: ~120MB — specify dtypes explicitly
df = pd.read_csv("large_file.csv", dtype={
    "user_id": "int32",          # int64 default → int32 saves 50%
    "country": "category",        # object default → category saves ~90% for low-cardinality
    "revenue": "float32",         # float64 default → float32 saves 50%
    "is_active": "bool",          # object default → bool saves ~85%
}, parse_dates=["created_at"],
   usecols=["user_id", "country", "revenue", "is_active", "created_at"])
# Note: usecols must include every column referenced in dtype and parse_dates —
# parse_dates raises if its column was filtered out.

print(df.memory_usage(deep=True).sum() / 1e6, "MB")
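The same savings apply to a DataFrame that is already loaded. A quick self-contained sketch (column names and values are illustrative) using `pd.to_numeric` with `downcast` and the `category` dtype:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": range(100_000),                      # int64 by default
    "country": ["US", "DE", "IN", "BR"] * 25_000,   # object by default
    "revenue": [9.99] * 100_000,                    # float64 by default
})

before = df.memory_usage(deep=True).sum()

# Downcast numerics to the smallest dtype that fits the data
df["user_id"] = pd.to_numeric(df["user_id"], downcast="integer")  # → int32 here
df["revenue"] = pd.to_numeric(df["revenue"], downcast="float")    # → float32
# Low-cardinality strings → category
df["country"] = df["country"].astype("category")

after = df.memory_usage(deep=True).sum()
print(f"{before / 1e6:.1f}MB -> {after / 1e6:.1f}MB")
```

`downcast="integer"` inspects the actual values, so it is safer than hardcoding a dtype when you are not sure of the value range.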

ETL Pattern Interview Question

A very common interview format: "Write a Python function that reads data from a source, transforms it, and loads it to a destination. Make it production-ready." Here is what "production-ready" means to interviewers:

import logging
import functools
import time

import pandas as pd

logger = logging.getLogger(__name__)

def retry(max_attempts=3, delay=2):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_attempts:
                        logger.error(f"{func.__name__} failed after {max_attempts} attempts: {e}")
                        raise
                    logger.warning(f"Attempt {attempt} failed, retrying in {delay**attempt}s")
                    time.sleep(delay ** attempt)
        return wrapper
    return decorator

@retry(max_attempts=3)
def extract(source_url: str) -> pd.DataFrame:
    logger.info(f"Extracting from {source_url}")
    return pd.read_json(source_url)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(subset=["id", "timestamp"])
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df["revenue"] = (df["price"] * df["quantity"]).round(2)
    return df[df["revenue"] > 0]

@retry(max_attempts=3)
def load(df: pd.DataFrame, engine, table: str) -> None:
    df.to_sql(table, engine, if_exists="append", index=False, method="multi")
    logger.info(f"Loaded {len(df)} rows to {table}")
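It is worth demonstrating in the interview that the decorator actually retries. Here is a self-contained check — the compact `retry` copy (with `delay=0` so it runs instantly), the flaky function, and the call counter are all test scaffolding, not part of the pipeline:

```python
import functools
import time

def retry(max_attempts=3, delay=0):  # delay=0 here so the demo runs instantly
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise
                    time.sleep(delay ** attempt if delay else 0)
        return wrapper
    return decorator

calls = {"n": 0}

@retry(max_attempts=3)
def flaky_extract():
    """Fails twice with a transient error, then succeeds — like a real flaky API."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "payload"

result = flaky_extract()
print(result, "after", calls["n"], "attempts")  # payload after 3 attempts
```

A fourth failure would exhaust `max_attempts` and re-raise, which is exactly the behavior you want: retries mask transient errors but never swallow persistent ones.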

When Python Is Not Enough

Interviewers often ask: "When would you use PySpark instead of pandas?" Know this answer cold:

• Data > available RAM (pandas needs everything in memory; Spark processes in partitions)
• Distributed computation across a cluster is needed
• Data is already in a data lake (S3, GCS) and you need partition pruning
• You need ACID transactions or time travel (Delta Lake)
• Streaming data processing (Spark Structured Streaming)

The rule of thumb: use pandas for < 10GB on a single machine; use PySpark for larger data or when you need the Spark ecosystem.
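The size heuristic above can be sketched as a simple dispatcher. The ~10GB threshold and the engine names are illustrative, not canonical — real decisions also depend on cluster availability and data layout:

```python
import os

PANDAS_LIMIT_BYTES = 10 * 1024**3  # ~10GB rule of thumb

def choose_engine(path: str, available_ram_bytes: int) -> str:
    """Pick a processing strategy from file size alone (a rough heuristic)."""
    size = os.path.getsize(path)
    if size < min(PANDAS_LIMIT_BYTES, available_ram_bytes // 2):
        return "pandas"           # fits comfortably in memory
    if size < available_ram_bytes:
        return "pandas-chunked"   # stream with chunksize, as in the generator example
    return "pyspark"              # larger than RAM: distribute across partitions
```

Note the middle tier: chunked pandas covers a lot of ground before Spark is genuinely needed, and saying so in an interview signals that you reach for distributed tools deliberately, not by default.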
