Handling Missing Data in PySpark DataFrames

Handling missing values is an essential part of data cleaning, ETL pipelines, machine learning prep, and large-scale analytics.
PySpark provides powerful DataFrame.na functions to detect, drop, fill, and impute missing values.

Below is a complete guide using a sample Royalty & Sales dataset.


Load Dataset

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('missing-data').getOrCreate()

df = spark.read.csv('/path/to/royalty.csv', header=True, inferSchema=True)
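
If you don't have the CSV on disk, you can build the same sample data inline instead (a hypothetical inline construction; the column names and values match the output shown below):

data = [
    ('John Smith', 'Deep Learning', 5000, 800),
    ('Jane Doe', 'AI Ethics', None, 600),
    (None, 'Quantum Computing', 3500, 550),
    ('Alex Lee', 'Big Data', 4500, None),
    ('Anna Ray', None, None, None),
]

# Spark infers the column types (string, string, int, int) from the data
df = spark.createDataFrame(data, schema=['Author', 'Book', 'Sales', 'Royalty'])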

Show Raw Data

df.show()

Output:

+------------+------------------+------+--------+
| Author | Book | Sales| Royalty|
+------------+------------------+------+--------+
| John Smith | Deep Learning | 5000 | 800 |
| Jane Doe | AI Ethics | null | 600 |
| null | Quantum Computing| 3500 | 550 |
| Alex Lee | Big Data | 4500 | null |
| Anna Ray | null | null | null |
+------------+------------------+------+--------+
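
Since fill() behaves differently depending on column types (more on that below), it is worth confirming what inferSchema produced. Assuming the CSV holds exactly the data shown above, printSchema() should report:

df.printSchema()

Output:

root
 |-- Author: string (nullable = true)
 |-- Book: string (nullable = true)
 |-- Sales: integer (nullable = true)
 |-- Royalty: integer (nullable = true)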

1. Drop Rows With Any Null Values

df.na.drop().show()

Output:

+-----------+--------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+--------------+-----+--------+
| John Smith|Deep Learning | 5000| 800 |
+-----------+--------------+-----+--------+

✔️ What This Does

Removes any row that contains at least one null value.
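
Before dropping anything, it helps to quantify what's missing. A quick sketch that counts the nulls in each column using only standard pyspark.sql functions:

from pyspark.sql.functions import col, count, when

# count() ignores nulls, and when() without otherwise() yields null when the
# condition is false, so this counts the rows where each column is null
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()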


2. Keep Rows Only if ≥ 2 Non-Null Values

df.na.drop(thresh=2).show()

Output:

+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| Jane Doe |AI Ethics |null | 600 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | null |
+-----------+------------------+-----+--------+

✔️ What This Does

Keeps rows that have at least 2 non-null values.
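
thresh can also be combined with subset, in which case only the listed columns are counted toward the threshold (and note that a given thresh overrides how). A sketch:

# Keep rows with at least 1 non-null value among Sales and Royalty;
# on this dataset, only Anna Ray's row (null in both) would be dropped
df.na.drop(thresh=1, subset=['Sales', 'Royalty']).show()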


3. Drop Row Only If Every Column Is Null

df.na.drop(how='all').show()

Output:

+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 |800 |
| Jane Doe |AI Ethics |null |600 |
| null |Quantum Computing |3500 |550 |
| Alex Lee |Big Data |4500 |null |
| Anna Ray |null |null |null |
+-----------+------------------+-----+--------+

✔️ What This Does

Removes rows only if every column has null → none in this dataset.


4. Drop Rows If Any Column Is Null

df.na.drop(how='any').show()

Output is the same as in Example 1 (only the John Smith row remains).

✔️ Equivalent to df.na.drop().


5. Drop Rows Where a Specific Column (Sales) Is Null

df.na.drop(subset=['Sales']).show()

Output:

+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | null |
+-----------+------------------+-----+--------+

✔️ What This Does

Keeps only rows where Sales is not null.
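
subset also accepts multiple columns; with the default how='any', a row is dropped if any of the listed columns is null. A sketch:

# Drop rows where either Sales or Royalty is null;
# here only the John Smith and Quantum Computing rows survive
df.na.drop(subset=['Sales', 'Royalty']).show()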


6. Fill Numeric Nulls With 0

df.na.fill(0).show()

Output:

+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| Jane Doe |AI Ethics | 0 | 600 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | 0 |
| Anna Ray |null | 0 | 0 |
+-----------+------------------+-----+--------+

✔️ What This Does

Replaces nulls in the numeric columns (Sales, Royalty) with 0. Because the fill value is an integer, string columns are ignored, which is why Author and Book still show null.
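
Since fill() only touches columns matching the type of the fill value, covering both numeric and string columns takes two chained calls (or the dict form shown after Section 8):

# Fill numeric nulls with 0, then string nulls with 'unknown'
df.na.fill(0).na.fill('unknown').show()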


7. Fill String Nulls With "unknown"

df.na.fill('unknown').show()

Output:

+-----------+------------------+-----+--------+
| Author    | Book             |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith| Deep Learning    | 5000| 800    |
| Jane Doe  | AI Ethics        | null| 600    |
| unknown   | Quantum Computing| 3500| 550    |
| Alex Lee  | Big Data         | 4500| null   |
| Anna Ray  | unknown          | null| null   |
+-----------+------------------+-----+--------+

✔️ What This Does

Fills missing values in the string columns (Author, Book) with "unknown". The numeric Sales and Royalty columns are left untouched, because a string fill value only applies to string columns.


8. Fill Nulls Only in the Author Column

df.na.fill('unknown', subset=['Author']).show()

Output:

+-----------+------------------+------+--------+
| Author | Book |Sales |Royalty |
+-----------+------------------+------+--------+
| John Smith|Deep Learning | 5000 | 800 |
| Jane Doe |AI Ethics | null | 600 |
| unknown |Quantum Computing | 3500 | 550 |
| Alex Lee |Big Data | 4500 | null |
| Anna Ray |null | null | null |
+-----------+------------------+------+--------+

✔️ What This Does

Only fills missing author names.
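
fill() also accepts a dict mapping column names to per-column replacement values, which handles mixed types in a single call:

# Per-column defaults: strings get 'unknown', numbers get 0
df.na.fill({'Author': 'unknown', 'Book': 'unknown', 'Sales': 0, 'Royalty': 0}).show()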


9. Fill Missing Sales Values With the Mean

from pyspark.sql.functions import mean

# Average of the non-null Sales values: (5000 + 3500 + 4500) / 3 ≈ 4333.33
mean_val = df.select(mean(df['Sales'])).collect()
mean_sales = mean_val[0][0]

df.na.fill(mean_sales, subset=['Sales']).show()

Output:

+-----------+------------------+------+--------+
| Author    | Book             |Sales |Royalty |
+-----------+------------------+------+--------+
| John Smith|Deep Learning     | 5000 | 800    |
| Jane Doe  |AI Ethics         | 4333 | 600    |
| null      |Quantum Computing | 3500 | 550    |
| Alex Lee  |Big Data          | 4500 | null   |
| Anna Ray  |null              | 4333 | null   |
+-----------+------------------+------+--------+

✔️ What This Does

Replaces missing Sales values with the column average. The mean of the non-null Sales values (5000, 3500, 4500) is ≈ 4333.33; when filled into the integer Sales column, it is cast to 4333.
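
For mean or median imputation inside an ML pipeline, pyspark.ml.feature.Imputer does the same job without a manual collect(). A minimal sketch, assuming Spark MLlib is available; Imputer only accepts float/double input columns, so the integer Sales column is cast first:

from pyspark.ml.feature import Imputer
from pyspark.sql.functions import col

# Imputer requires float/double inputs, so cast the integer column first
df_num = df.withColumn('Sales', col('Sales').cast('double'))

# strategy can be 'mean', 'median', or 'mode'
imputer = Imputer(inputCols=['Sales'], outputCols=['Sales_filled'], strategy='mean')
imputer.fit(df_num).transform(df_num).show()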


🟦 1-Minute Summary — Handling Missing Data in PySpark

Code Example                               | Meaning / Use Case
-------------------------------------------|---------------------------------------------
df.na.drop()                               | Drop rows with any null values
df.na.drop(thresh=2)                       | Keep rows with ≥ 2 non-null values
df.na.drop(how='all')                      | Drop only rows where all columns are null
df.na.drop(subset=['Sales'])               | Drop rows with nulls in specific column(s)
df.na.fill(0)                              | Replace nulls in numeric columns with 0
df.na.fill('unknown')                      | Replace nulls in string columns with "unknown"
df.na.fill('unknown', subset=['Author'])   | Replace nulls only in the Author column
mean(df['Sales'])                          | Compute the mean for imputation
df.na.fill(mean_sales, subset=['Sales'])   | Fill Sales nulls using mean-based imputation

Next, we’ll cover Working with Dates and Timestamps in PySpark DataFrames (Full Guide).
