In-depth guides for PySpark, Python, and SQL — with the reasoning behind each concept, not just the code. Built for people preparing for data engineering interviews.
15 tutorials · 3 sections · beginner to advanced
Understand what Apache Spark is under the hood and why PySpark has become the industry standard for large-scale data processing. We cover the Spark architecture — drivers, executors, and the DAG scheduler — so you build an accurate mental model before writing a single line of code.
The SparkSession is your single entry point to all Spark functionality. Learn how to create one, configure application names, set memory and core settings, and understand the difference between local mode and cluster mode, including how the spark.master setting (for example, local[4]) controls parallelism in local mode.
RDDs were Spark's original abstraction, but DataFrames are what you'll use 99% of the time in modern data engineering. This tutorial explains the key differences, when each is appropriate, and why the Catalyst Optimizer makes DataFrames dramatically faster for structured data.
Real-world data engineering means moving data in and out of Spark constantly. Learn how to read multiple file formats using spark.read, understand schema inference vs explicit schemas, handle malformed records, and write output with partitioning and compression — including why Parquet should be your default format.
Master the building blocks of DataFrame manipulation: selecting columns with select() and col(), renaming with alias() and withColumnRenamed(), creating new derived columns with withColumn(), and casting data types. You will also learn how to avoid the common pitfall of column ambiguity when working with multiple DataFrames.
Learn every way to filter data in PySpark — boolean expressions with col(), SQL-style string conditions, isin(), isNull() / isNotNull(), and combining conditions with & and | operators. We also cover the subtle gotcha of Python's and/or vs Spark's &/| that trips up many beginners.
New foundations tutorials are added every week
Aggregations are at the heart of almost every data pipeline. Learn groupBy() with count(), sum(), avg(), min(), max(), and collect_list(). We go beyond the basics to cover multiple aggregations in a single pass using agg(), renaming aggregated columns, and filtering groups with the equivalent of SQL's HAVING clause.
Joins are one of the most performance-critical operations in Spark. This tutorial covers all join types (inner, left, right, full outer, semi, anti), how to avoid duplicate column names after a join, broadcast joins for small-large table scenarios, and the shuffle join vs broadcast join tradeoff with real performance numbers.
Window functions let you compute running totals, rankings, and comparisons across rows without collapsing the DataFrame. We cover ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, and cumulative aggregations using Window.partitionBy() and Window.orderBy(). Includes real interview-style examples with step-by-step explanations.
When built-in functions are not enough, UDFs let you apply arbitrary Python logic to a DataFrame column. Learn how to register scalar UDFs with @udf, why UDFs are opaque to the Catalyst Optimizer and therefore limit what it can optimize, how to declare return types and type hints, and when to use Pandas UDFs (vectorized UDFs) instead, which can be 10–100× faster.
Null handling is a constant challenge in real data pipelines. This tutorial covers dropna(), fillna(), replace(), and how to use coalesce() for null-safe column expressions. We also explain how nulls affect aggregations (most aggregate functions silently skip them), comparisons, and joins. These behaviors cause subtle bugs if you are not aware of them.
Learn the most commonly tested PySpark built-in functions for strings and dates. String functions covered: lower(), upper(), trim(), split(), regexp_replace(), concat_ws(). Date functions: to_date(), to_timestamp(), date_diff(), date_add(), date_format(), and extracting year/month/day components. Packed with practical examples.
New core concepts tutorials are added every week
Partitioning is the single biggest lever for Spark performance. Understand how data is physically distributed across executors, what triggers a shuffle (groupBy, join, repartition), how to use repartition() vs coalesce(), and why the default spark.sql.shuffle.partitions of 200 is often wrong for your workload.
Learn when to use cache() and persist(), the difference between MEMORY_ONLY, MEMORY_AND_DISK, and DISK_ONLY storage levels, and how to inspect cached DataFrames in the Spark UI. We also cover the common mistake of caching too eagerly and how that can hurt performance rather than help it.
The Spark UI and explain() output are your debugging superpowers. This tutorial teaches you to read physical and logical query plans, identify expensive stages like sort-merge joins and full table scans, understand the Jobs/Stages/Tasks hierarchy, and use the Spark UI to pinpoint data skew and executor bottlenecks.
New performance & advanced tutorials are added every week
Practice with real PySpark questions: write actual code, get instant pass/fail feedback, and build the muscle memory that gets you hired.