Handling Missing Data in PySpark DataFrames
Handling missing values is an essential part of data cleaning, ETL pipelines, machine learning prep, and large-scale analytics.
PySpark's df.na property (a DataFrameNaFunctions object) provides the drop, fill, and imputation operations for missing values, and column predicates like isNull make nulls easy to detect.
Below is a complete guide using a sample Royalty & Sales dataset.
Load Dataset
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('missing-data').getOrCreate()
# header=True uses the first CSV row as column names; inferSchema=True detects numeric types
df = spark.read.csv('/path/to/royalty.csv', header=True, inferSchema=True)
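If you don't have the CSV handy, here is a minimal sketch that builds the same sample DataFrame in memory; the column types (strings plus integers) are assumed to match the outputs shown below:
# None marks a missing value; Spark infers string/long types from the non-null entries
data = [
    ('John Smith', 'Deep Learning', 5000, 800),
    ('Jane Doe', 'AI Ethics', None, 600),
    (None, 'Quantum Computing', 3500, 550),
    ('Alex Lee', 'Big Data', 4500, None),
    ('Anna Ray', None, None, None),
]
df = spark.createDataFrame(data, ['Author', 'Book', 'Sales', 'Royalty'])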
Show Raw Data
df.show()
Output:
+------------+------------------+------+--------+
| Author | Book | Sales| Royalty|
+------------+------------------+------+--------+
| John Smith | Deep Learning | 5000 | 800 |
| Jane Doe | AI Ethics | null | 600 |
| null | Quantum Computing| 3500 | 550 |
| Alex Lee | Big Data | 4500 | null |
| Anna Ray | null | null | null |
+------------+------------------+------+--------+
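Before dropping or filling anything, it helps to know where the nulls are. A standard idiom that counts the nulls in each column:
from pyspark.sql import functions as F
# count() only counts non-null values, and when() yields null where the condition fails,
# so this counts exactly the null cells per column
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()
For this dataset it reports Author=1, Book=1, Sales=2, Royalty=2.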
1. Drop Rows With Any Null Values
df.na.drop().show()
Output:
+-----------+--------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+--------------+-----+--------+
| John Smith|Deep Learning | 5000| 800 |
+-----------+--------------+-----+--------+
✔️ What This Does
Removes any row that contains at least one null value.
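Note that df.dropna() and df.fillna() are DataFrame-level aliases for df.na.drop() and df.na.fill(), so either spelling works:
# Identical to df.na.drop().show()
df.dropna().show()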
2. Keep Rows Only if ≥ 2 Non-Null Values
df.na.drop(thresh=2).show()
Output:
+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| Jane Doe |AI Ethics |null | 600 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | null |
+-----------+------------------+-----+--------+
✔️ What This Does
Keeps rows that have at least 2 non-null values. Anna Ray's row has only one (the Author), so it is dropped. Note that when thresh is set, it overrides the how parameter; a combined example follows below.
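thresh counts non-null values only among the columns in subset when both are given. For example, to keep rows where at least one of the two numeric fields is present (a sketch; only Anna Ray's row fails this test):
# Keep rows with >= 1 non-null value among Sales and Royalty
df.na.drop(thresh=1, subset=['Sales', 'Royalty']).show()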
3. Drop Row Only If Every Column Is Null
df.na.drop(how='all').show()
Output:
+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 |800 |
| Jane Doe |AI Ethics |null |600 |
| null |Quantum Computing |3500 |550 |
| Alex Lee |Big Data |4500 |null |
| Anna Ray |null |null |null |
+-----------+------------------+-----+--------+
✔️ What This Does
Removes a row only if every one of its columns is null. No row in this dataset is entirely null (Anna Ray's row still has an Author), so nothing is dropped here; the sketch below shows the rule actually firing.
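To see how='all' remove something, append a fully null row first. A minimal sketch; the union exists only for demonstration:
all_null_row = spark.createDataFrame([(None, None, None, None)], schema=df.schema)
# Only the appended all-null row is removed; the original five rows survive
df.union(all_null_row).na.drop(how='all').show()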
4. Drop Rows If Any Column Is Null
df.na.drop(how='any').show()
Output is the same as in Example 1 (only the John Smith row remains).
✔️ Equivalent to df.na.drop().
5. Drop Rows Where a Specific Column (Sales) Is Null
df.na.drop(subset=['Sales']).show()
Output:
+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | null |
+-----------+------------------+-----+--------+
✔️ What This Does
Keeps only rows where Sales is not null.
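subset also accepts multiple columns. With the default how='any', a row is dropped if any of the listed columns is null:
# Drop rows where Sales or Royalty (or both) is null
df.na.drop(subset=['Sales', 'Royalty']).show()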
6. Fill All Nulls With 0
df.na.fill(0).show()
Output:
+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| Jane Doe |AI Ethics | 0 | 600 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | 0 |
| Anna Ray |null | 0 | 0 |
+-----------+------------------+-----+--------+
✔️ What This Does
Replaces nulls in the numeric columns (Sales, Royalty) with 0. String columns such as Author and Book are left untouched because 0 does not match their type, which is why the null Author and Book values remain in the output above.
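To fill columns of different types in one pass, pass a dict mapping column names to fill values (when a dict is given, the subset argument is ignored):
# Per-column fill values; each value must match its column's type
df.na.fill({'Sales': 0, 'Royalty': 0, 'Author': 'unknown', 'Book': 'unknown'}).show()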
7. Fill All Nulls With "unknown"
df.na.fill('unknown').show()
Output:
+-----------+------------------+-----+--------+
| Author    | Book             |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning     | 5000|  800   |
| Jane Doe  |AI Ethics         | null|  600   |
| unknown   |Quantum Computing | 3500|  550   |
| Alex Lee  |Big Data          | 4500| null   |
| Anna Ray  |unknown           | null| null   |
+-----------+------------------+-----+--------+
✔️ What This Does
Fills nulls only in the string columns (Author, Book). Sales and Royalty are numeric, so a string fill value skips them and their nulls remain; this is the mirror image of na.fill(0).
8. Fill Nulls Only in the Author Column
df.na.fill('unknown', subset=['Author']).show()
Output:
+-----------+------------------+------+--------+
| Author | Book |Sales |Royalty |
+-----------+------------------+------+--------+
| John Smith|Deep Learning | 5000 | 800 |
| Jane Doe |AI Ethics | null | 600 |
| unknown |Quantum Computing | 3500 | 550 |
| Alex Lee |Big Data | 4500 | null |
| Anna Ray |null | null | null |
+-----------+------------------+------+--------+
✔️ What This Does
Only fills missing author names.
9. Fill Missing Sales Values With the Mean
from pyspark.sql.functions import mean
# Mean of the non-null Sales values: (5000 + 3500 + 4500) / 3 ≈ 4333.33
mean_sales = df.select(mean(df['Sales'])).collect()[0][0]
# Cast to int so the fill value matches the integer Sales column
df.na.fill(int(mean_sales), subset=['Sales']).show()
Output:
+-----------+------------------+------+--------+
| Author    | Book             |Sales |Royalty |
+-----------+------------------+------+--------+
| John Smith|Deep Learning     | 5000 | 800    |
| Jane Doe  |AI Ethics         | 4333 | 600    |
| null      |Quantum Computing | 3500 | 550    |
| Alex Lee  |Big Data          | 4500 | null   |
| Anna Ray  |null              | 4333 | null   |
+-----------+------------------+------+--------+
✔️ What This Does
Replaces missing Sales values with the column mean: (5000 + 3500 + 4500) / 3 ≈ 4333.33, truncated to 4333 to match the integer Sales column.
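For ML pipelines, pyspark.ml.feature.Imputer performs the same mean (or median) imputation without the manual collect step. A minimal sketch; Sales_filled is a name chosen here, and the cast covers older Spark versions that require DoubleType/FloatType inputs:
from pyspark.ml.feature import Imputer
from pyspark.sql.functions import col
# Cast Sales to double so Imputer accepts it on older Spark versions
df_double = df.withColumn('Sales', col('Sales').cast('double'))
imputer = Imputer(inputCols=['Sales'], outputCols=['Sales_filled'], strategy='mean')
imputer.fit(df_double).transform(df_double).show()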
🟦 1-Minute Summary — Handling Missing Data in PySpark
| Code Example | Meaning / Use Case |
|---|---|
| df.na.drop() | Drop rows with any null values |
| df.na.drop(thresh=2) | Keep rows with ≥ 2 non-null values |
| df.na.drop(how='all') | Drop only rows where all columns are null |
| df.na.drop(subset=['Sales']) | Drop rows with nulls in specific column(s) |
| df.na.fill(0) | Replace nulls in numeric columns with 0 |
| df.na.fill('unknown') | Replace nulls in string columns with "unknown" |
| df.na.fill('unknown', subset=['Author']) | Replace nulls only in the Author column |
| mean(df['Sales']) | Compute the mean for imputation |
| df.na.fill(int(mean_sales), subset=['Sales']) | Fill Sales nulls using mean-based imputation |
Next, we’ll cover Working with Dates and Timestamps in PySpark DataFrames (Full Guide)