Handling Missing Data in PySpark DataFrames
Handling missing values is an essential part of data cleaning, ETL pipelines, machine learning prep, and large-scale analytics.
PySpark's df.na property (a DataFrameNaFunctions object) provides the drop, fill, and imputation operations for missing values, and column predicates like isNull make nulls easy to detect.
Below is a complete guide using a sample Royalty & Sales dataset.
Load Dataset
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('missing-data').getOrCreate()
# header=True uses the first CSV row as column names; inferSchema=True detects numeric types
df = spark.read.csv('/path/to/royalty.csv', header=True, inferSchema=True)
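If you don't have the CSV handy, here is a minimal sketch that builds the same sample DataFrame in memory; the column types (strings plus integers) are assumed to match the outputs shown below:
# None marks a missing value; Spark infers string/long types from the non-null entries
data = [
    ('John Smith', 'Deep Learning', 5000, 800),
    ('Jane Doe', 'AI Ethics', None, 600),
    (None, 'Quantum Computing', 3500, 550),
    ('Alex Lee', 'Big Data', 4500, None),
    ('Anna Ray', None, None, None),
]
df = spark.createDataFrame(data, ['Author', 'Book', 'Sales', 'Royalty'])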
Show Raw Data
df.show()
Output:
+------------+------------------+------+--------+
| Author | Book | Sales| Royalty|
+------------+------------------+------+--------+
| John Smith | Deep Learning | 5000 | 800 |
| Jane Doe | AI Ethics | null | 600 |
| null | Quantum Computing| 3500 | 550 |
| Alex Lee | Big Data | 4500 | null |
| Anna Ray | null | null | null |
+------------+------------------+------+--------+
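Before dropping or filling anything, it helps to know where the nulls are. A standard idiom that counts the nulls in each column:
from pyspark.sql import functions as F
# count() only counts non-null values, and when() yields null where the condition fails,
# so this counts exactly the null cells per column
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()
For this dataset it reports Author=1, Book=1, Sales=2, Royalty=2.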
1. Drop Rows With Any Null Values
df.na.drop().show()
Output:
+-----------+--------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+--------------+-----+--------+
| John Smith|Deep Learning | 5000| 800 |
+-----------+--------------+-----+--------+
✔️ What This Does
Removes any row that contains at least one null value.
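Note that df.dropna() and df.fillna() are DataFrame-level aliases for df.na.drop() and df.na.fill(), so either spelling works:
# Identical to df.na.drop().show()
df.dropna().show()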
2. Keep Rows Only if ≥ 2 Non-Null Values
df.na.drop(thresh=2).show()
Output:
+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| Jane Doe |AI Ethics |null | 600 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | null |
+-----------+------------------+-----+--------+
✔️ What This Does
Keeps rows that have at least 2 non-null values. Anna Ray's row has only one (the Author), so it is dropped. Note that when thresh is set, it overrides the how parameter; a combined example follows below.
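thresh counts non-null values only among the columns in subset when both are given. For example, to keep rows where at least one of the two numeric fields is present (a sketch; only Anna Ray's row fails this test):
# Keep rows with >= 1 non-null value among Sales and Royalty
df.na.drop(thresh=1, subset=['Sales', 'Royalty']).show()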
3. Drop Row Only If Every Column Is Null
df.na.drop(how='all').show()
Output:
+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 |800 |
| Jane Doe |AI Ethics |null |600 |
| null |Quantum Computing |3500 |550 |
| Alex Lee |Big Data |4500 |null |
| Anna Ray |null |null |null |
+-----------+------------------+-----+--------+
✔️ What This Does
Removes a row only if every one of its columns is null. No row in this dataset is entirely null (Anna Ray's row still has an Author), so nothing is dropped here; the sketch below shows the rule actually firing.
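To see how='all' remove something, append a fully null row first. A minimal sketch; the union exists only for demonstration:
all_null_row = spark.createDataFrame([(None, None, None, None)], schema=df.schema)
# Only the appended all-null row is removed; the original five rows survive
df.union(all_null_row).na.drop(how='all').show()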
4. Drop Rows If Any Column Is Null
df.na.drop(how='any').show()
Output is the same as in Example 1 (only the John Smith row remains).
✔️ Equivalent to df.na.drop().
5. Drop Rows Where a Specific Column (Sales) Is Null
df.na.drop(subset=['Sales']).show()
Output:
+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | null |
+-----------+------------------+-----+--------+
✔️ What This Does
Keeps only rows where Sales is not null.
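subset also accepts multiple columns. With the default how='any', a row is dropped if any of the listed columns is null:
# Drop rows where Sales or Royalty (or both) is null
df.na.drop(subset=['Sales', 'Royalty']).show()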
6. Fill All Nulls With 0
df.na.fill(0).show()
Output:
+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| Jane Doe |AI Ethics | 0 | 600 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | 0 |
| Anna Ray |null | 0 | 0 |
+-----------+------------------+-----+--------+
✔️ What This Does
Replaces nulls in the numeric columns (Sales, Royalty) with 0. String columns such as Author and Book are left untouched because 0 does not match their type, which is why the null Author and Book values remain in the output above.
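To fill columns of different types in one pass, pass a dict mapping column names to fill values (when a dict is given, the subset argument is ignored):
# Per-column fill values; each value must match its column's type
df.na.fill({'Sales': 0, 'Royalty': 0, 'Author': 'unknown', 'Book': 'unknown'}).show()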
7. Fill All Nulls With "unknown"
df.na.fill('unknown').show()
Output:
+-----------+------------------+-----+--------+
| Author    | Book             |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning     | 5000|  800   |
| Jane Doe  |AI Ethics         | null|  600   |
| unknown   |Quantum Computing | 3500|  550   |
| Alex Lee  |Big Data          | 4500| null   |
| Anna Ray  |unknown           | null| null   |
+-----------+------------------+-----+--------+
✔️ What This Does
Fills nulls only in the string columns (Author, Book). Sales and Royalty are numeric, so a string fill value skips them and their nulls remain; this is the mirror image of na.fill(0).
8. Fill Nulls Only in the Author Column
df.na.fill('unknown', subset=['Author']).show()
Output:
+-----------+------------------+------+--------+
| Author | Book |Sales |Royalty |
+-----------+------------------+------+--------+
| John Smith|Deep Learning | 5000 | 800 |
| Jane Doe |AI Ethics | null | 600 |
| unknown |Quantum Computing | 3500 | 550 |
| Alex Lee |Big Data | 4500 | null |
| Anna Ray |null | null | null |
+-----------+------------------+------+--------+
✔️ What This Does
Only fills missing author names.
9. Fill Missing Sales Values With the Mean
from pyspark.sql.functions import mean
# Mean of the non-null Sales values: (5000 + 3500 + 4500) / 3 ≈ 4333.33
mean_sales = df.select(mean(df['Sales'])).collect()[0][0]
# Cast to int so the fill value matches the integer Sales column
df.na.fill(int(mean_sales), subset=['Sales']).show()
Output:
+-----------+------------------+------+--------+
| Author    | Book             |Sales |Royalty |
+-----------+------------------+------+--------+
| John Smith|Deep Learning     | 5000 | 800    |
| Jane Doe  |AI Ethics         | 4333 | 600    |
| null      |Quantum Computing | 3500 | 550    |
| Alex Lee  |Big Data          | 4500 | null   |
| Anna Ray  |null              | 4333 | null   |
+-----------+------------------+------+--------+
✔️ What This Does
Replaces missing Sales values with the column mean: (5000 + 3500 + 4500) / 3 ≈ 4333.33, truncated to 4333 to match the integer Sales column.
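For ML pipelines, pyspark.ml.feature.Imputer performs the same mean (or median) imputation without the manual collect step. A minimal sketch; Sales_filled is a name chosen here, and the cast covers older Spark versions that require DoubleType/FloatType inputs:
from pyspark.ml.feature import Imputer
from pyspark.sql.functions import col
# Cast Sales to double so Imputer accepts it on older Spark versions
df_double = df.withColumn('Sales', col('Sales').cast('double'))
imputer = Imputer(inputCols=['Sales'], outputCols=['Sales_filled'], strategy='mean')
imputer.fit(df_double).transform(df_double).show()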
🟦 1-Minute Summary — Handling Missing Data in PySpark
| Code Example | Meaning / Use Case |
|---|---|
| df.na.drop() | Drop rows with any null values |
| df.na.drop(thresh=2) | Keep rows with ≥ 2 non-null values |
| df.na.drop(how='all') | Drop only rows where all columns are null |
| df.na.drop(subset=['Sales']) | Drop rows with nulls in specific column(s) |
| df.na.fill(0) | Replace nulls in numeric columns with 0 |
| df.na.fill('unknown') | Replace nulls in string columns with "unknown" |
| df.na.fill('unknown', subset=['Author']) | Replace nulls only in the Author column |
| mean(df['Sales']) | Compute the mean for imputation |
| df.na.fill(int(mean_sales), subset=['Sales']) | Fill Sales nulls using mean-based imputation |
Next, we’ll cover Working with Dates and Timestamps in PySpark DataFrames (Full Guide)