Understanding Spark Partitioning for Performance
Partitioning is the concept that separates candidates who understand Spark from those who just know the syntax. Every Spark interview at a senior level will include at least one question about partitions — whether it is "why is my job slow?" or "what is the difference between repartition and coalesce?" This guide covers everything.
What is a Partition?
A partition is a logical chunk of a distributed dataset that resides on a single executor, and each partition is processed by exactly one task. If you have 200 partitions and 20 executor cores, Spark runs 20 tasks in parallel and works through the partitions in waves. More partitions mean more parallelism (up to the number of cores); fewer partitions mean less parallelism, but also less scheduling overhead.
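To make the parallelism arithmetic concrete, here is a small sketch in plain Python (no Spark needed); the helper name is this guide's illustration, not a Spark API:

```python
import math

def scheduling_waves(num_partitions: int, executor_cores: int) -> int:
    """Number of 'waves' of tasks needed to process all partitions when
    each core runs one task at a time (illustrative helper, not Spark API)."""
    return math.ceil(num_partitions / executor_cores)

# 200 partitions on 20 cores: 20 tasks run at once, 10 waves total
print(scheduling_waves(200, 20))  # -> 10
# 21 partitions on 20 cores: one straggler partition forces a second, mostly idle wave
print(scheduling_waves(21, 20))   # -> 2
```

This is why partition counts that are a multiple of the core count tend to use the cluster evenly.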
How Partitions Are Created
When you read from storage, Spark creates partitions based on file size and format. The target partition size is controlled by spark.sql.files.maxPartitionBytes, which defaults to 128MB, so a 1.28GB Parquet file becomes roughly 10 partitions. After a shuffle (groupBy, join, distinct), Spark creates spark.sql.shuffle.partitions partitions — the default is 200, regardless of data size, and that default is almost always wrong for real workloads.
repartition() vs coalesce() — The Classic Interview Question
This question comes up in almost every senior Spark interview. Both methods change the number of partitions, but they work differently and suit different use cases.
# repartition(n) — full shuffle, evenly distributes data, can increase OR decrease partitions
df = df.repartition(50) # redistribute into 50 even partitions
df = df.repartition(50, "country") # partition BY country column (for joins/aggregations)
# coalesce(n) — no shuffle, only merges existing partitions, can ONLY decrease partitions
df = df.coalesce(10) # merge down to 10 partitions: fast, no shuffle, but resulting partitions can be uneven in size
# Rule of thumb:
# repartition → when you need even distribution or need to partition by a column
# coalesce → when you are writing fewer output files at the end of a pipeline

Choosing the Right Partition Count
The guideline: aim for partition sizes between 100MB and 200MB, and have 2-4x more partitions than cores for good CPU utilisation. In practice: (total data size in MB) / 128 is a good starting point. For shuffle operations, set spark.conf.set("spark.sql.shuffle.partitions", str(cores * 4)) to override the 200 default.
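That sizing rule can be captured in a small helper (plain Python; the function and its defaults are this guide's illustration, not a Spark built-in):

```python
def suggested_partitions(total_size_mb: float, total_cores: int,
                         target_mb: int = 128) -> int:
    """Suggest a partition count: data size divided by target partition size,
    but at least 2x the core count so every core stays busy.
    Illustrative heuristic, not a Spark API."""
    by_size = max(1, round(total_size_mb / target_mb))
    by_cores = total_cores * 2
    return max(by_size, by_cores)

# 10 GB on a 16-core cluster: the size rule wins -> 80 partitions
print(suggested_partitions(10 * 1024, 16))  # -> 80
# 500 MB on the same cluster: the core-count floor wins -> 32 partitions
print(suggested_partitions(500, 16))        # -> 32
```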
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
# Override default shuffle partitions for your cluster size
spark.conf.set("spark.sql.shuffle.partitions", "100")
# Check current partition count
print(df.rdd.getNumPartitions())
# Check partition sizes (useful for diagnosing skew)
df.groupBy(F.spark_partition_id()).count().show()

Data Skew — The Silent Performance Killer
Data skew happens when some partitions are much larger than others — usually because the key you are grouping or joining on has a highly non-uniform distribution. The symptom: 99 tasks complete in 2 minutes, 1 task takes 45 minutes. The entire job stalls waiting for that one slow task.
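The straggler effect is easy to quantify: a stage finishes only when its slowest task does, so one oversized partition sets the wall-clock time for the whole stage. A toy calculation in plain Python:

```python
# 100 tasks: 99 take ~2 minutes each, one skewed task takes 45 minutes
task_minutes = [2] * 99 + [45]

# With enough cores, all tasks start together; the stage ends with the slowest
wall_clock = max(task_minutes)                  # 45 minutes
ideal = sum(task_minutes) / len(task_minutes)   # ~2.4 minutes if work were even
print(wall_clock, round(ideal, 1))              # -> 45 2.4
```

The gap between the two numbers is the cost of skew, and it is why fixing one hot key can cut a job from 45 minutes to under 3.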
# Detect skew: check row distribution across partitions after a groupBy
df.groupBy(F.spark_partition_id().alias("partition_id")) \
    .count() \
    .orderBy(F.desc("count")) \
    .show(20)
# Fix skew with salting (for join skew on a heavy key)
import random
# Add a random salt to the skewed key on the large table
df_large = df_large.withColumn(
    "salted_key",
    F.concat(F.col("skewed_key"), F.lit("_"), (F.rand() * 10).cast("int").cast("string"))
)
# Explode the small table to match all salt values
# Explode the small table so each row matches every salt value
df_small = df_small.withColumn("salt", F.explode(F.array([F.lit(i) for i in range(10)])))
df_small = df_small.withColumn(
    "salted_key",
    F.concat(F.col("key"), F.lit("_"), F.col("salt").cast("string"))
)
result = df_large.join(df_small, on="salted_key")

Partition Pruning — Reading Less Data
When your data is partitioned on disk (e.g., Parquet files partitioned by date), Spark can skip reading entire partitions if your filter matches the partition key. This is called partition pruning and can reduce read times from hours to minutes.
# Writing with partitionBy — stores data in date=2026-04-01/ subdirectories
df.write.partitionBy("date").parquet("s3://my-bucket/events/")
# Reading with filter — Spark only reads the 2026-04-01 partition
df = spark.read.parquet("s3://my-bucket/events/") \
    .filter(F.col("date") == "2026-04-01")
# Run df.explain() to confirm PartitionFilters is in the plan
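Conceptually, pruning over a hive-style layout is just path filtering: Spark matches the filter against the date=... directory names and never opens files in non-matching directories. A simplified plain-Python sketch of that idea (illustrative only, not Spark's actual implementation):

```python
paths = [
    "s3://my-bucket/events/date=2026-03-31/part-000.parquet",
    "s3://my-bucket/events/date=2026-04-01/part-000.parquet",
    "s3://my-bucket/events/date=2026-04-01/part-001.parquet",
    "s3://my-bucket/events/date=2026-04-02/part-000.parquet",
]

def prune(paths, key, value):
    """Keep only files under the matching partition directory
    (a sketch of what pruning achieves, not Spark internals)."""
    return [p for p in paths if f"{key}={value}/" in p]

# Only the two 2026-04-01 files survive; the other dates are never read
print(prune(paths, "date", "2026-04-01"))
```

Because the skipped files are never listed or opened, the saving scales with how much of the table falls outside the filter.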