Handling Missing Data

Load Dataset

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('missing-data').getOrCreate()

df = spark.read.csv('/path/to/royalty.csv', header=True, inferSchema=True)

Show Raw Data

df.show()

Result

+----------+-----------------+-----+-------+
|    Author|             Book|Sales|Royalty|
+----------+-----------------+-----+-------+
|John Smith|    Deep Learning| 5000|    800|
|  Jane Doe|        AI Ethics| null|    600|
|      null|Quantum Computing| 3500|    550|
|  Alex Lee|         Big Data| 4500|   null|
|  Anna Ray|             null| null|   null|
+----------+-----------------+-----+-------+

1. Drop rows with any null values

df.na.drop().show()

Result

+-----------+--------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+--------------+-----+--------+
| John Smith|Deep Learning | 5000| 800 |
+-----------+--------------+-----+--------+

✅ Drops all rows that have even one null.

2. Drop rows with fewer than 2 non-null values (thresh)

df.na.drop(thresh=2).show()

Result

+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| Jane Doe |AI Ethics |null | 600 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | null |
+-----------+------------------+-----+--------+

✅ Keeps rows with at least 2 non-null values.

3. Drop rows only when all values are null

df.na.drop(how='all').show()

Result

+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| Jane Doe |AI Ethics |null | 600 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | null |
| Anna Ray |null |null | null |
+-----------+------------------+-----+--------+

✅ Only drops rows fully empty (none in this case).

4. Drop rows if any column is null

df.na.drop(how='any').show()

Result

+-----------+--------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+--------------+-----+--------+
| John Smith|Deep Learning | 5000| 800 |
+-----------+--------------+-----+--------+

✅ Same as first — drops all rows with any missing value.

5. Drop rows where a specific column (Sales) is null

df.na.drop(subset=['Sales']).show()

Result

+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | null |
+-----------+------------------+-----+--------+

✅ Keeps only rows where Sales is not null.

6. Fill numeric nulls with 0

df.na.fill(0).show()

Result

+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| Jane Doe |AI Ethics | 0 | 600 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | 0 |
| Anna Ray |null | 0 | 0 |
+-----------+------------------+-----+--------+

✅ Replaces nulls in numeric columns (Sales, Royalty) with 0; string columns such as Author and Book are left unchanged.

7. Fill string nulls with 'unknown'

df.na.fill('unknown').show()

Result

+----------+-----------------+-----+-------+
|    Author|             Book|Sales|Royalty|
+----------+-----------------+-----+-------+
|John Smith|    Deep Learning| 5000|    800|
|  Jane Doe|        AI Ethics| null|    600|
|   unknown|Quantum Computing| 3500|    550|
|  Alex Lee|         Big Data| 4500|   null|
|  Anna Ray|          unknown| null|   null|
+----------+-----------------+-----+-------+

✅ Fills nulls in string columns (Author, Book) with "unknown". A string fill value does not apply to numeric columns, so the nulls in Sales and Royalty remain.

8. Fill null only in the Author column

df.na.fill('unknown', subset=['Author']).show()

Result

+-----------+------------------+------+--------+
| Author | Book |Sales |Royalty |
+-----------+------------------+------+--------+
| John Smith|Deep Learning | 5000 | 800 |
| Jane Doe |AI Ethics |null | 600 |
| unknown |Quantum Computing | 3500 | 550 |
| Alex Lee |Big Data | 4500 | null |
| Anna Ray |null |null | null |
+-----------+------------------+------+--------+

✅ Only fills nulls in the Author column.

9. Fill Sales nulls with the mean value

from pyspark.sql.functions import mean

mean_val = df.select(mean(df['Sales'])).collect()
mean_sales = mean_val[0][0]

df.na.fill(mean_sales, subset=['Sales']).show()

Result

+----------+-----------------+-----+-------+
|    Author|             Book|Sales|Royalty|
+----------+-----------------+-----+-------+
|John Smith|    Deep Learning| 5000|    800|
|  Jane Doe|        AI Ethics| 4333|    600|
|      null|Quantum Computing| 3500|    550|
|  Alex Lee|         Big Data| 4500|   null|
|  Anna Ray|             null| 4333|   null|
+----------+-----------------+-----+-------+

✅ Fills missing Sales with the mean of the non-null values: (5000 + 3500 + 4500) / 3 ≈ 4333.33, truncated to 4333 when cast back to the integer column.

One-line Meaning for Each Method/Parameter

Method / Parameter - Meaning

df.na.drop() - Drops all rows with at least one null value (default behavior).
thresh=2 - Keeps rows with at least 2 non-null values; drops the rest.
how='all' - Drops a row only if all of its columns are null.
how='any' - Drops a row if any single column is null (the default).
subset=['Sales'] - Limits the operation (drop or fill) to the Sales column only.
df.na.fill(0) - Replaces missing values in numeric columns with 0; string columns are unaffected.
df.na.fill('unknown') - Replaces missing values in string columns with 'unknown'; numeric columns are unaffected.
df.na.fill(..., subset=['Author']) - Replaces nulls only in the Author column, leaving others unchanged.
mean(df['Sales']) - Calculates the average of all non-null values in Sales.
df.na.fill(mean_sales, subset=['Sales']) - Fills missing Sales values with the previously calculated mean.

🔑 1-Minute Summary — Handling Missing Data in PySpark (Royalties Example)

Code - What it Does

df.na.drop().show() - Drops all rows with any null values.
df.na.drop(thresh=2).show() - Keeps only rows with at least 2 non-null values.
df.na.drop(how='all').show() - Drops rows only if all values are null.
df.na.drop(how='any').show() - Drops rows if any column is null.
df.na.drop(subset=['Sales']).show() - Drops only rows where Sales is null.
df.na.fill(0).show() - Replaces nulls in numeric columns with 0.
df.na.fill('unknown').show() - Replaces nulls in string columns with "unknown".
df.na.fill('unknown', subset=['Author']).show() - Replaces nulls only in the Author column with "unknown".
df.select(mean(df['Sales'])).collect() - Calculates the mean of the non-null Sales values.
df.na.fill(mean_sales, subset=['Sales']).show() - Fills nulls in Sales with the computed average.