📖 Learn Data Engineering

Tutorials that go beyond the syntax

In-depth guides for PySpark, Python, and SQL — with the reasoning behind each concept, not just the code. Built for people preparing for data engineering interviews.

PySpark Tutorials

15 tutorials · 3 sections · beginner to advanced

🌱 Foundations · 6
⚙️ Core Concepts · 6
🚀 Performance & Advanced · 3
🌱 Foundations
Start here — no prior Spark experience needed
#01

What is PySpark & Why Data Engineers Use It

Beginner · 📖 10 min

Understand what Apache Spark is under the hood and why PySpark has become the industry standard for large-scale data processing. We cover the Spark architecture — drivers, executors, and the DAG scheduler — so you build an accurate mental model before writing a single line of code.

Spark Architecture · Driver & Executors · DAG Scheduler · When to Use Spark
#02

Setting Up a SparkSession

Beginner · 📖 8 min

The SparkSession is your single entry point to all Spark functionality. Learn how to create one, set the application name, configure memory and cores, and understand the difference between local mode and cluster mode, including how the spark.master setting decides where Spark runs and, in local mode (for example local[4]), how many worker threads it uses.

SparkSession · spark.master · Config Options · Local vs Cluster
#03

DataFrames vs RDDs — What You Actually Need

Beginner · 📖 12 min

RDDs were Spark's original abstraction, but DataFrames are what you'll use almost all of the time in modern data engineering. This tutorial explains the key differences, when each is appropriate, and why the Catalyst Optimizer typically lets DataFrame code run much faster than equivalent RDD code on structured data.

RDD · DataFrame · Catalyst Optimizer · Performance
#04

Reading & Writing Data: CSV, JSON, and Parquet

Beginner · 📖 15 min

Real-world data engineering means moving data in and out of Spark constantly. Learn how to read multiple file formats using spark.read, understand schema inference vs explicit schemas, handle malformed records, and write output with partitioning and compression — including why Parquet should be your default format.

spark.read · Parquet · CSV/JSON · Schema Inference · Partitioned Writes
#05

Selecting, Renaming & Casting Columns

Beginner · 📖 12 min

Master the building blocks of DataFrame manipulation: selecting columns with select() and col(), renaming with alias() and withColumnRenamed(), creating new derived columns with withColumn(), and casting data types. You will also learn how to avoid the common pitfall of column ambiguity when working with multiple DataFrames.

select() · withColumn() · cast() · col() · alias()
#06

Filtering Rows with where() and filter()

Beginner · 📖 10 min

Learn every way to filter data in PySpark — boolean expressions with col(), SQL-style string conditions, isin(), isNull() / isNotNull(), and combining conditions with & and | operators. We also cover the subtle gotcha of Python's and/or vs Spark's &/| that trips up many beginners.

filter() · where() · isin() · isNull() · Boolean Logic
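
The and/or gotcha named in tutorial #06 comes from how Python dispatches the two kinds of operators: `and` and `or` first convert the left operand to a plain bool, while `&` and `|` call methods that the object itself defines. The sketch below does not use PySpark. `Expr` is a hypothetical stand-in class, written only for this illustration, that mimics the way a Spark Column builds an expression from `&` and `|` and refuses to be converted to a bool.

```python
class Expr:
    """Hypothetical stand-in for a Spark Column (an assumption, not PySpark code).

    It records an expression as text instead of computing a value.
    """

    def __init__(self, text):
        self.text = text

    def __gt__(self, other):
        return Expr(f"({self.text} > {other!r})")

    def __eq__(self, other):
        return Expr(f"({self.text} = {other!r})")

    def __and__(self, other):
        # Called for `a & b`: both sides stay expressions.
        return Expr(f"({self.text} AND {other.text})")

    def __or__(self, other):
        # Called for `a | b`.
        return Expr(f"({self.text} OR {other.text})")

    def __bool__(self):
        # Called for `a and b` / `a or b`: an unevaluated expression has no
        # single truth value, so the conversion is refused.
        raise ValueError("cannot convert an expression to bool; use & or |")


age = Expr("age")
country = Expr("country")

# Works: & builds one combined expression.
combined = (age > 30) & (country == "US")
print(combined.text)

# Fails: `and` asks for bool(age > 30) before looking at the right-hand side.
try:
    (age > 30) and (country == "US")
    and_error = None
except ValueError as exc:
    and_error = str(exc)
print(and_error)
```

In real PySpark the working form has the same shape, for example `df.filter((col("age") > 30) & (col("country") == "US"))`. The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.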
✨ Updated Weekly

New foundations tutorials are added every week

⚙️ Core Concepts
The patterns you will use every single day on the job
#07

Aggregations & GroupBy

Intermediate · 📖 14 min

Aggregations are at the heart of almost every data pipeline. Learn groupBy() with count(), sum(), avg(), min(), max(), and collect_list(). We go beyond the basics to cover multiple aggregations in a single pass using agg(), renaming aggregated columns, and filtering groups with the equivalent of SQL's HAVING clause.

groupBy() · agg() · sum/avg/count · pivot() · HAVING equivalent
#08

Joining DataFrames

Intermediate · 📖 18 min

Joins are one of the most performance-critical operations in Spark. This tutorial covers all join types (inner, left, right, full outer, semi, anti), how to avoid duplicate column names after a join, broadcast joins for joining a small table to a large one, and the shuffle join vs broadcast join tradeoff with real performance numbers.

inner/left/full join · semi/anti join · broadcast() · shuffle · duplicate columns
#09

Window Functions

Intermediate · 📖 20 min

Window functions let you compute running totals, rankings, and comparisons across rows without collapsing the DataFrame. We cover ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, and cumulative aggregations using Window.partitionBy() and Window.orderBy(). Includes real interview-style examples with step-by-step explanations.

Window.partitionBy() · ROW_NUMBER · LAG/LEAD · RANK · Running Totals
#10

User-Defined Functions (UDFs)

Intermediate · 📖 16 min

When built-in functions are not enough, UDFs let you apply any Python logic to a DataFrame column. Learn to define scalar UDFs with @udf and declare their returnType, understand why Python UDFs are opaque to the Catalyst Optimizer and add serialization overhead between the JVM and Python, how Python type hints are used to declare Pandas UDFs, and when to use Pandas UDFs (vectorized UDFs) instead, which can be 10–100× faster.

@udf · returnType · Pandas UDF · Performance Cost · Type Hints
#11

Handling Null Values

Intermediate · 📖 12 min

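Null handling is a constant challenge in real data pipelines. This tutorial covers dropna(), fillna(), replace(), and how to use coalesce() for null-safe column expressions. We also explain how nulls affect aggregations (functions such as sum(), avg(), and count() on a column silently skip them, while count("*") still counts the row), comparisons (comparing anything to null yields null, not true or false), and joins (null keys never match under ordinary equality). These behaviors cause subtle bugs if you are not aware of them.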

dropna() · fillna() · coalesce() · isNull() · Null in Aggregations
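
The null behaviors listed in tutorial #11 follow standard SQL semantics, so they can be observed without a Spark cluster. The sketch below is an illustration only: it uses Python's built-in sqlite3 module as a stand-in, on the assumption that Spark SQL treats nulls the same way in these three cases. The `orders` table and its rows are made up for the example.

```python
import sqlite3

# In-memory database with a hypothetical table; two values are NULL on purpose.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 10.0), ("a", None), ("b", 30.0), (None, 5.0)],
)

# Aggregations: COUNT(*) counts rows, COUNT(amount) and AVG(amount) skip NULLs.
count_all, count_amount, avg_amount = conn.execute(
    "SELECT COUNT(*), COUNT(amount), AVG(amount) FROM orders"
).fetchone()
print(count_all, count_amount, avg_amount)

# Comparisons: NULL = NULL is NULL (Python None), not true.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]
print(null_equals_null)

# Joins: the row whose customer is NULL matches nothing, not even itself.
self_join_rows = conn.execute(
    "SELECT COUNT(*) FROM orders l JOIN orders r ON l.customer = r.customer"
).fetchone()[0]
print(self_join_rows)

conn.close()
```

The average is taken over three amounts rather than four rows, and the self-join returns five rows (four from customer a, one from customer b) with none from the null key. In PySpark, a join that should treat two nulls as equal can use Column.eqNullSafe() in the join condition.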
#12

String & Date Functions

Intermediate · 📖 14 min

Learn the most commonly tested PySpark built-in functions for strings and dates. String functions covered: lower(), upper(), trim(), split(), regexp_replace(), concat_ws(). Date functions: to_date(), to_timestamp(), datediff() (also available as date_diff() in newer Spark releases), date_add(), date_format(), and extracting year/month/day components. Packed with practical examples.

regexp_replace() · to_date() · date_diff() · split() · concat_ws()
✨ Updated Weekly

New core concepts tutorials are added every week

Ready to test your knowledge?

Practice with real PySpark questions — write actual code, get instant pass/fail feedback, and build the muscle memory that gets you hired.