Handling Missing Data
Load Dataset
from pyspark.sql import SparkSession
# Start (or reuse) a Spark session and load the royalties CSV,
# reading the header row and letting Spark infer the column types
spark = SparkSession.builder.appName('missing-data').getOrCreate()
df = spark.read.csv('/path/to/royalty.csv', header=True, inferSchema=True)
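If the CSV file is not at hand, an equivalent DataFrame can be built in memory for experimenting. This is a sketch that reuses the spark session created above; the column names and sample values simply mirror the table shown below.
data = [
    ('John Smith', 'Deep Learning', 5000, 800),
    ('Jane Doe', 'AI Ethics', None, 600),
    (None, 'Quantum Computing', 3500, 550),
    ('Alex Lee', 'Big Data', 4500, None),
    ('Anna Ray', None, None, None),
]
# Nulls appear as Python None; Spark infers integer types for Sales and Royalty
df = spark.createDataFrame(data, ['Author', 'Book', 'Sales', 'Royalty'])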
Show Raw Data
df.show()
Result
+-----------+------------------+-----+--------+
| Author    | Book             |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning     |5000 | 800    |
| Jane Doe  |AI Ethics         |null | 600    |
| null      |Quantum Computing |3500 | 550    |
| Alex Lee  |Big Data          |4500 | null   |
| Anna Ray  |null              |null | null   |
+-----------+------------------+-----+--------+
1. Drop rows with any null values
df.na.drop().show()
Result
+-----------+--------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+--------------+-----+--------+
| John Smith|Deep Learning | 5000| 800 |
+-----------+--------------+-----+--------+
✅ Drops all rows that have even one null.
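Since df.na.drop() returns a new DataFrame, a quick way to gauge how aggressive a strategy is before committing to it is to compare row counts. A small sketch; df.dropna() is an equivalent spelling of the same call.
print(df.count())              # 5 rows in the raw data
print(df.na.drop().count())    # 1 row survives the strict drop
print(df.dropna().count())     # same result via the DataFrame-level alias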
2. Drop rows with fewer than 2 non-null values (thresh=2)
df.na.drop(thresh=2).show()
Result
+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| Jane Doe |AI Ethics |null | 600 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | null |
+-----------+------------------+-----+--------+
✅ Keeps rows with at least 2 non-null values.
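thresh can also be combined with subset, in which case only the listed columns count toward the threshold. A sketch, which here should drop only the Anna Ray row (the single row with no data in either Sales or Royalty):
# Keep rows that have at least 1 non-null value among Sales and Royalty
df.na.drop(thresh=1, subset=['Sales', 'Royalty']).show()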
3. Drop rows only when all values are null
df.na.drop(how='all').show()
Result
+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| Jane Doe |AI Ethics |null | 600 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | null |
| Anna Ray |null |null | null |
+-----------+------------------+-----+--------+
✅ Only drops rows that are entirely null (none in this case).
4. Drop rows if any column is null
df.na.drop(how='any').show()
Result
+-----------+--------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+--------------+-----+--------+
| John Smith|Deep Learning | 5000| 800 |
+-----------+--------------+-----+--------+
✅ Same as first — drops all rows with any missing value.
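A quick sanity check (sketch) that the default really is how='any':
# Both calls drop the same rows, so the remaining counts match
assert df.na.drop().count() == df.na.drop(how='any').count()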
5. Drop rows where a specific column (Sales) is null
df.na.drop(subset=['Sales']).show()
Result
+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | null |
+-----------+------------------+-----+--------+
✅ Keeps only rows where Sales is not null.
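subset accepts several columns as well; with the default how='any', a row is dropped if any of the listed columns is null. A sketch that should keep only the rows where both Sales and Royalty are present:
df.na.drop(subset=['Sales', 'Royalty']).show()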
6. Fill numeric nulls with 0
df.na.fill(0).show()
Result
+-----------+------------------+-----+--------+
| Author | Book |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning |5000 | 800 |
| Jane Doe |AI Ethics | 0 | 600 |
| null |Quantum Computing |3500 | 550 |
| Alex Lee |Big Data |4500 | 0 |
| Anna Ray |null | 0 | 0 |
+-----------+------------------+-----+--------+
✅ Replaces nulls in the numeric columns (Sales, Royalty) with 0; the string columns (Author, Book) keep their nulls, as the output shows.
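To confirm which columns were actually touched, a common idiom is to count the remaining nulls per column; after fill(0) the sketch below should report nulls only for Author and Book.
from pyspark.sql.functions import col, count, when
# Count the nulls left in every column of the filled DataFrame
filled = df.na.fill(0)
filled.select([count(when(col(c).isNull(), c)).alias(c) for c in filled.columns]).show()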
7. Fill string nulls with 'unknown'
df.na.fill('unknown').show()
Result
+-----------+------------------+-----+--------+
| Author    | Book             |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning     |5000 | 800    |
| Jane Doe  |AI Ethics         |null | 600    |
| unknown   |Quantum Computing |3500 | 550    |
| Alex Lee  |Big Data          |4500 | null   |
| Anna Ray  |unknown           |null | null   |
+-----------+------------------+-----+--------+
✅ Fills nulls in the string columns (Author, Book) with "unknown"; a string fill value does not affect the numeric columns (Sales, Royalty).
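To fill string and numeric columns in one pass, fill also accepts a dict mapping column names to fill values; the placeholder values below are just illustrative.
# Per-column fill values: strings get a label, numbers get 0
df.na.fill({'Author': 'unknown', 'Book': 'unknown', 'Sales': 0, 'Royalty': 0}).show()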
8. Fill nulls only in the Author column
df.na.fill('unknown', subset=['Author']).show()
Result
+-----------+------------------+------+--------+
| Author | Book |Sales |Royalty |
+-----------+------------------+------+--------+
| John Smith|Deep Learning | 5000 | 800 |
| Jane Doe |AI Ethics |null | 600 |
| unknown |Quantum Computing | 3500 | 550 |
| Alex Lee |Big Data | 4500 | null |
| Anna Ray |null |null | null |
+-----------+------------------+------+--------+
✅ Only fills nulls in the Author column.
9. Fill Sales nulls with the mean value
from pyspark.sql.functions import mean
# Compute the mean of the non-null Sales values, then use it as the fill value
mean_val = df.select(mean(df['Sales'])).collect()
mean_sales = mean_val[0][0]
df.na.fill(mean_sales, subset=['Sales']).show()
Result
+-----------+------------------+-----+--------+
| Author    | Book             |Sales|Royalty |
+-----------+------------------+-----+--------+
| John Smith|Deep Learning     |5000 | 800    |
| Jane Doe  |AI Ethics         |4333 | 600    |
| null      |Quantum Computing |3500 | 550    |
| Alex Lee  |Big Data          |4500 | null   |
| Anna Ray  |null              |4333 | null   |
+-----------+------------------+-----+--------+
✅ Fills missing Sales with the average of the existing values, (5000 + 3500 + 4500) / 3 ≈ 4333.33, stored as 4333 because Spark casts the fill value to the integer Sales column.
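For pipelines that impute several numeric columns at once, pyspark.ml offers an Imputer that computes and applies the mean (or median) for each input column; the output column names below are only illustrative.
from pyspark.ml.feature import Imputer
imputer = Imputer(
    inputCols=['Sales', 'Royalty'],
    outputCols=['Sales_filled', 'Royalty_filled'],
    strategy='mean',   # 'median' (and 'mode' in newer Spark versions) also work
)
imputer.fit(df).transform(df).show()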
One-line Meaning for Each Method/Parameter
| Method / Parameter | Meaning |
|---|---|
| df.na.drop() | Drops every row with at least one null value (default behavior). |
| thresh=2 | Keeps rows with at least 2 non-null values; drops the rest. |
| how='all' | Drops a row only if all of its columns are null. |
| how='any' | Drops a row if any single column is null. |
| subset=['Sales'] | Limits the operation (drop or fill) to the Sales column only. |
| df.na.fill(0) | Replaces nulls in numeric columns with 0; string columns are ignored. |
| df.na.fill('unknown') | Replaces nulls in string columns with 'unknown'; numeric columns are ignored. |
| df.na.fill(..., subset=['Author']) | Replaces nulls only in the Author column, leaving others unchanged. |
| mean(df['Sales']) | Calculates the average of all non-null values in Sales. |
| df.na.fill(mean_sales, subset=['Sales']) | Fills missing Sales values with the previously calculated mean. |
🔑 1-Minute Summary — Handling Missing Data in PySpark (Royalties Example)
| Code | What it Does |
|---|---|
| df.na.drop().show() | Drops all rows with any null values |
| df.na.drop(thresh=2).show() | Keeps only rows with at least 2 non-null values |
| df.na.drop(how='all').show() | Drops rows only if all values are null |
| df.na.drop(how='any').show() | Drops rows if any column is null |
| df.na.drop(subset=['Sales']).show() | Drops only the rows where Sales is null |
| df.na.fill(0).show() | Replaces nulls in numeric columns with 0 |
| df.na.fill('unknown').show() | Replaces nulls in string columns with "unknown" |
| df.na.fill('unknown', subset=['Author']).show() | Replaces nulls only in the Author column with "unknown" |
| df.select(mean(df['Sales'])).collect() | Calculates the mean of the Sales column |
| df.na.fill(mean_sales, subset=['Sales']).show() | Fills nulls in Sales with the computed mean |
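Putting it together, a possible cleaning pass over the royalties data might chain the pieces above: drop fully empty rows, label missing strings, and impute the numeric gap. A sketch under the same schema assumptions as the rest of this section:
from pyspark.sql.functions import mean
mean_sales = df.select(mean('Sales')).first()[0]           # mean of the non-null Sales values
cleaned = (
    df.na.drop(how='all')                                  # remove rows that are entirely null
      .na.fill({'Author': 'unknown', 'Book': 'unknown'})   # label missing strings
      .na.fill(mean_sales, subset=['Sales'])               # impute Sales with its mean
)
cleaned.show()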