📖 Learn Data Engineering

Tutorials that go beyond the syntax

In-depth guides for PySpark, Python, and SQL — with the reasoning behind each concept, not just the code. Built for data engineering interviews.

31 tutorials · 7.8h of content · 3 learning tracks · 100% free

🌱

Foundations

Start here — no prior Spark experience needed

1

What is PySpark & Why Data Engineers Use It

Understand what Apache Spark is under the hood and why PySpark has become the industry standard for large-scale data processing. We cover the Spark architecture — drivers, executors, and the DAG scheduler — so you build an accurate mental model before writing a single line of code.

Spark Architecture · Driver & Executors · DAG Scheduler · When to Use Spark

2

Setting Up a SparkSession

The SparkSession is your single entry point to all Spark functionality. Learn how to create one, configure application names, set memory and core settings, and understand the difference between local mode and cluster mode, including how the spark.master setting (for example local[4]) controls parallelism in local mode.

SparkSession · spark.master · Config Options · Local vs Cluster

3

DataFrames vs RDDs — What You Actually Need

RDDs were Spark's original abstraction, but DataFrames are what you'll use 99% of the time in modern data engineering. This tutorial explains the key differences, when each is appropriate, and why the Catalyst Optimizer makes DataFrames dramatically faster for structured data.

RDD · DataFrame · Catalyst Optimizer · Performance

4

Reading & Writing Data: CSV, JSON, and Parquet

Real-world data engineering means moving data in and out of Spark constantly. Learn how to read multiple file formats using spark.read, understand schema inference vs explicit schemas, handle malformed records, and write output with partitioning and compression — including why Parquet should be your default format.

spark.read · Parquet · CSV/JSON · Schema Inference

5

Selecting, Renaming & Casting Columns

Master the building blocks of DataFrame manipulation: selecting columns with select() and col(), renaming with alias() and withColumnRenamed(), creating new derived columns with withColumn(), and casting data types. You will also learn how to avoid the common pitfall of column ambiguity when working with multiple DataFrames.

select() · withColumn() · cast() · col()

6

Filtering Rows with where() and filter()

Learn every way to filter data in PySpark — boolean expressions with col(), SQL-style string conditions, isin(), isNull() / isNotNull(), and combining conditions with & and | operators. We also cover the subtle gotcha of Python's and/or vs Spark's &/| that trips up many beginners.

filter() · where() · isin() · isNull()
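The and/or gotcha comes down to which Python hook each operator calls. The sketch below uses a toy stand-in class (ToyColumn is hypothetical and not part of PySpark) so it runs without a Spark installation. It assumes, as PySpark's Column does, that & and | are overloaded to build expressions while conversion to bool raises an error.

```python
class ToyColumn:
    """Hypothetical stand-in for a Spark Column; illustration only."""

    def __init__(self, expr: str):
        self.expr = expr

    def __and__(self, other: "ToyColumn") -> "ToyColumn":
        # `a & b` calls __and__, so the class can build a combined expression.
        return ToyColumn(f"({self.expr} AND {other.expr})")

    def __or__(self, other: "ToyColumn") -> "ToyColumn":
        # `a | b` calls __or__ in the same way.
        return ToyColumn(f"({self.expr} OR {other.expr})")

    def __bool__(self) -> bool:
        # `a and b` first asks Python for bool(a); a lazy column expression
        # has no single truth value, so the conversion is rejected.
        raise ValueError("Cannot convert column into bool; use '&' or '|'")


adult = ToyColumn("age >= 18")
active = ToyColumn("status = 'active'")

# Works: builds a new expression object.
combined = adult & active
print(combined.expr)

# Fails: Python's `and` cannot be overloaded and forces a bool conversion.
try:
    adult and active
except ValueError as err:
    print(f"and/or failed: {err}")
```

In real PySpark code the same rule means writing conditions such as (col("age") >= 18) & (col("status") == "active"), with parentheses around each comparison because & binds more tightly than >= and ==.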

✨

Updated weekly with new foundations tutorials

⚙️

Core Concepts

The patterns you will use every single day on the job

7

Aggregations & GroupBy

Intermediate · 14 min

Aggregations are at the heart of almost every data pipeline. Learn groupBy() with count(), sum(), avg(), min(), max(), and collect_list(). We go beyond the basics to cover multiple aggregations in a single pass using agg(), renaming aggregated columns, and filtering groups with the equivalent of SQL's HAVING clause.

groupBy() · agg() · sum/avg/count · pivot()

8

Joining DataFrames

Intermediate · 18 min

Joins are one of the most performance-critical operations in Spark. This tutorial covers all join types (inner, left, right, full outer, semi, anti), how to avoid duplicate column names after a join, broadcast joins for small-large table scenarios, and the shuffle join vs broadcast join tradeoff with real performance numbers.

inner/left/full join · semi/anti join · broadcast() · shuffle

9

Window Functions

Intermediate · 20 min

Window functions let you compute running totals, rankings, and comparisons across rows without collapsing the DataFrame. We cover ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, and cumulative aggregations using Window.partitionBy() and Window.orderBy(). Includes real interview-style examples with step-by-step explanations.

Window.partitionBy() · ROW_NUMBER · LAG/LEAD · RANK

10

User-Defined Functions (UDFs)

Intermediate · 16 min

When built-in functions are not enough, UDFs let you apply any Python logic to a DataFrame column. Learn to register scalar UDFs with @udf and declare their return types, understand why Python UDFs are opaque to the Catalyst Optimizer (it cannot optimize through them), and know when to use Pandas UDFs (vectorized UDFs) instead, which can be 10–100× faster.

@udf · returnType · Pandas UDF · Performance Cost

11

Handling Null Values

Intermediate · 12 min

Null handling is a constant challenge in real data pipelines. This tutorial covers dropna(), fillna(), replace(), and how to use coalesce() for null-safe column expressions. We also explain how nulls affect aggregations (functions such as sum() and avg() silently skip them), comparisons, and joins. These behaviors cause subtle bugs if you are not aware of them.

dropna() · fillna() · coalesce() · isNull()
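Spark SQL follows standard SQL null semantics for aggregates and comparisons. As a rough illustration that runs without a cluster, the sketch below uses Python's built-in sqlite3 module as a stand-in engine; the orders table and its values are made up for the example, and the same three behaviors are what you would expect from the equivalent Spark SQL queries.

```python
import sqlite3

# In-memory database with one nullable column (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, None), (3, 30.0)],
)

# AVG and COUNT(column) skip the NULL row; COUNT(*) still counts it.
avg_amount, non_null_rows, all_rows = conn.execute(
    "SELECT AVG(amount), COUNT(amount), COUNT(*) FROM orders"
).fetchone()

# Comparing NULL with NULL yields NULL (unknown), not true.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

# An inner join on a NULL key therefore matches nothing.
null_key_matches = conn.execute(
    "SELECT COUNT(*) FROM orders a JOIN orders b ON a.amount = b.amount "
    "WHERE a.id = 2"
).fetchone()[0]

print(avg_amount, non_null_rows, all_rows)
print(null_equals_null, null_key_matches)
conn.close()
```

The average is taken over the two non-null amounts only, which is why a column with many nulls can report a misleadingly healthy mean.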

12

String & Date Functions

Intermediate · 14 min

Learn the most commonly tested PySpark built-in functions for strings and dates. String functions covered: lower(), upper(), trim(), split(), regexp_replace(), concat_ws(). Date functions: to_date(), to_timestamp(), datediff() (also available as date_diff() from Spark 3.5), date_add(), date_format(), and extracting year/month/day components. Packed with practical examples.

regexp_replace() · to_date() · date_diff() · split()

✨

Updated weekly with new core concepts tutorials

🚀

Performance & Advanced

What separates good data engineers from great ones

13

Partitioning & the Shuffle Explained

Partitioning is the single biggest lever for Spark performance. Understand how data is physically distributed across executors, what triggers a shuffle (groupBy, join, repartition), how to use repartition() vs coalesce(), and why the default spark.sql.shuffle.partitions of 200 is often wrong for your workload.

repartition() · coalesce() · shuffle.partitions · Data Skew
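To make the idea of a shuffle concrete, here is a toy model of hash partitioning in plain Python. It is a sketch under stated assumptions, not Spark's implementation: Spark hashes keys with Murmur3, while this example uses zlib.crc32 only because it is deterministic and in the standard library. The toy_partition helper and the sample keys are hypothetical.

```python
import zlib
from collections import Counter


def toy_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index by hashing it (illustration only)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical grouping keys: one "hot" key dominates the data set.
keys = ["US"] * 90 + ["DE"] * 5 + ["FR"] * 5
num_partitions = 4

# Every row with the same key lands in the same partition, which is what
# groupBy and join need, and also what makes a hot key overload one task.
partition_sizes = Counter(toy_partition(k, num_partitions) for k in keys)

for index in range(num_partitions):
    print(f"partition {index}: {partition_sizes.get(index, 0)} rows")
```

Raising num_partitions does not help here, because all 90 "US" rows still hash to one partition. That is the essence of data skew, and it is why tuning spark.sql.shuffle.partitions alone does not fix it.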

14

Caching & Persistence Strategies

Learn when to use cache() and persist(), the difference between MEMORY_ONLY, MEMORY_AND_DISK, and DISK_ONLY storage levels, and how to inspect cached DataFrames in the Spark UI. We also cover the common mistake of caching too eagerly and how that can hurt performance rather than help it.

cache() · persist() · StorageLevel · Spark UI

15

Reading the Spark UI & Query Plans

The Spark UI and explain() output are your debugging superpower. This tutorial teaches you to read physical and logical query plans, identify expensive stages like sort-merge joins and full table scans, understand the Jobs/Stages/Tasks hierarchy, and use the Spark UI to pinpoint data skew and executor bottlenecks.

explain() · Physical Plan · DAG Visualization · Spark UI

✨

Updated weekly with new performance & advanced tutorials

Ready to test your knowledge?

Practice with real PySpark questions — write actual code, get instant pass/fail feedback, and build the muscle memory that gets you hired.
