Handling Missing Data

Load Dataset

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('missing-data').getOrCreate()

df = spark.read.csv('/path/to/royalty.csv', header=True, inferSchema=True)

Show Raw Data

df.show()

Result

+----------+-----------------+-----+-------+
|    Author|             Book|Sales|Royalty|
+----------+-----------------+-----+-------+
|John Smith|    Deep Learning| 5000|    800|
|  Jane Doe|        AI Ethics| null|    600|
|      null|Quantum Computing| 3500|    550|
|  Alex Lee|         Big Data| 4500|   null|
|  Anna Ray|             null| null|   null|
+----------+-----------------+-----+-------+

1. Drop rows with any null values

df.na.drop().show()

Result

+-----------+--------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+--------------+-----+--------+
| John Smith|Deep Learning | 5000| 800 |
+-----------+--------------+-----+--------+

✅ Drops all rows that have even one null.

2. Drop rows with fewer than 2 non-null values (thresh)

df.na.drop(thresh=2).show()

Result

+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| Jane Doe |AI Ethics |null | 600 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | null |
+-----------+------------------+-----+--------+

✅ Keeps rows with at least 2 non-null values.

3. Drop rows only when all values are null

df.na.drop(how='all').show()

Result

+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| Jane Doe |AI Ethics |null | 600 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | null |
| Anna Ray |null |null | null |
+-----------+------------------+-----+--------+

✅ Only drops rows fully empty (none in this case).

4. Drop rows if any column is null

df.na.drop(how='any').show()

Result

+-----------+--------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+--------------+-----+--------+
| John Smith|Deep Learning | 5000| 800 |
+-----------+--------------+-----+--------+

✅ Same as first — drops all rows with any missing value.

5. Drop rows where a specific column (Sales) is null

df.na.drop(subset=['Sales']).show()

Result

+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | null |
+-----------+------------------+-----+--------+

✅ Keeps only rows where Sales is not null.

6. Fill numeric nulls with 0

df.na.fill(0).show()

Result

+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| Jane Doe |AI Ethics | 0 | 600 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | 0 |
| Anna Ray |null | 0 | 0 |
+-----------+------------------+-----+--------+

✅ Replaces nulls in numeric columns (Sales, Royalty) with 0; string columns such as Author and Book are left unchanged.

7. Fill string nulls with 'unknown'

df.na.fill('unknown').show()

Result

+----------+-----------------+-----+-------+
|    Author|             Book|Sales|Royalty|
+----------+-----------------+-----+-------+
|John Smith|    Deep Learning| 5000|    800|
|  Jane Doe|        AI Ethics| null|    600|
|   unknown|Quantum Computing| 3500|    550|
|  Alex Lee|         Big Data| 4500|   null|
|  Anna Ray|          unknown| null|   null|
+----------+-----------------+-----+-------+

✅ Fills nulls in string columns (Author, Book) with "unknown". A string fill value does not apply to numeric columns, so the nulls in Sales and Royalty remain.

8. Fill null only in the Author column

df.na.fill('unknown', subset=['Author']).show()

Result

+-----------+------------------+------+--------+
| Author | Book |Sales |Royalty |
+-----------+------------------+------+--------+
| John Smith|Deep Learning | 5000 | 800 |
| Jane Doe |AI Ethics |null | 600 |
| unknown |Quantum Computing | 3500 | 550 |
| Alex Lee |Big Data | 4500 | null |
| Anna Ray |null |null | null |
+-----------+------------------+------+--------+

✅ Only fills nulls in the Author column.

9. Fill Sales nulls with the mean value

from pyspark.sql.functions import mean

mean_val = df.select(mean(df['Sales'])).collect()
mean_sales = mean_val[0][0]

df.na.fill(mean_sales, subset=['Sales']).show()

Result

+----------+-----------------+-----+-------+
|    Author|             Book|Sales|Royalty|
+----------+-----------------+-----+-------+
|John Smith|    Deep Learning| 5000|    800|
|  Jane Doe|        AI Ethics| 4333|    600|
|      null|Quantum Computing| 3500|    550|
|  Alex Lee|         Big Data| 4500|   null|
|  Anna Ray|             null| 4333|   null|
+----------+-----------------+-----+-------+

✅ Fills missing Sales with the mean of the non-null values: (5000 + 3500 + 4500) / 3 ≈ 4333.33, truncated to 4333 when cast back to the integer column.

One-line Meaning for Each Method/Parameter

Method / Parameter - Meaning

df.na.drop() - Drops all rows with at least one null value (default behavior).
thresh=2 - Keeps rows with at least 2 non-null values; drops the rest.
how='all' - Drops a row only if all of its columns are null.
how='any' - Drops a row if any single column is null (the default).
subset=['Sales'] - Limits the operation (drop or fill) to the Sales column only.
df.na.fill(0) - Replaces missing values in numeric columns with 0; string columns are unaffected.
df.na.fill('unknown') - Replaces missing values in string columns with 'unknown'; numeric columns are unaffected.
df.na.fill(..., subset=['Author']) - Replaces nulls only in the Author column, leaving others unchanged.
mean(df['Sales']) - Calculates the average of all non-null values in Sales.
df.na.fill(mean_sales, subset=['Sales']) - Fills missing Sales values with the previously calculated mean.

🔑 1-Minute Summary — Handling Missing Data in PySpark (Royalties Example)

Code - What it Does

df.na.drop().show() - Drops all rows with any null values.
df.na.drop(thresh=2).show() - Keeps only rows with at least 2 non-null values.
df.na.drop(how='all').show() - Drops rows only if all values are null.
df.na.drop(how='any').show() - Drops rows if any column is null.
df.na.drop(subset=['Sales']).show() - Drops only rows where Sales is null.
df.na.fill(0).show() - Replaces nulls in numeric columns with 0.
df.na.fill('unknown').show() - Replaces nulls in string columns with "unknown".
df.na.fill('unknown', subset=['Author']).show() - Replaces nulls only in the Author column with "unknown".
df.select(mean(df['Sales'])).collect() - Calculates the mean of the non-null Sales values.
df.na.fill(mean_sales, subset=['Sales']).show() - Fills nulls in Sales with the computed average.