Joins in PySpark DataFrames
Load Example Datasets
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('joins').getOrCreate()

# Authors dataset
authors = [(1, "John Smith"),
           (2, "Jane Doe"),
           (3, "Alex Lee"),
           (4, "Anna Ray")]
df_authors = spark.createDataFrame(authors, ["AuthorID", "Author"])

# Books dataset
books = [(1, "Deep Learning"),
         (2, "AI Ethics"),
         (5, "Data Science")]
df_books = spark.createDataFrame(books, ["AuthorID", "Book"])
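With both DataFrames created, a quick show() on each confirms the data before joining:

df_authors.show()
df_books.show()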
1. Inner Join
df_authors.join(df_books, on="AuthorID", how="inner").show()
Why do we need this? Use an inner join when you only want records present in both datasets (e.g., authors who actually wrote a book).
Result
+--------+-----------+--------------+
|AuthorID| Author    | Book         |
+--------+-----------+--------------+
| 1      |John Smith |Deep Learning |
| 2      |Jane Doe   |AI Ethics     |
+--------+-----------+--------------+
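The on="AuthorID" shorthand works because both DataFrames share the key column's name, and it keeps a single AuthorID column in the output. With differently named keys you would pass an explicit condition instead; a minimal sketch using the same DataFrames (note that this form retains both AuthorID columns):

# Explicit join condition; retains both AuthorID columns
df_authors.join(df_books, df_authors["AuthorID"] == df_books["AuthorID"], "inner").show()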
2. Left Join
df_authors.join(df_books, on="AuthorID", how="left").show()
Why do we need this? Use a left join when you want all records from the left side (authors), even if they don’t have a matching book.
Result
+--------+-----------+--------------+
|AuthorID| Author    | Book         |
+--------+-----------+--------------+
| 1      |John Smith |Deep Learning |
| 2      |Jane Doe   |AI Ethics     |
| 3      |Alex Lee   |null          |
| 4      |Anna Ray   |null          |
+--------+-----------+--------------+
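The nulls introduced on the unmatched side are often handled right after the join. One option, sketched here with a placeholder label of our choosing:

# Replace missing books with a placeholder (the label is arbitrary)
df_authors.join(df_books, on="AuthorID", how="left") \
    .fillna({"Book": "No book yet"}).show()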
3. Right Join
df_authors.join(df_books, on="AuthorID", how="right").show()
Why do we need this? Use a right join when you want all records from the right side (books), even if they don’t have a matching author.
Result
+--------+-----------+--------------+
|AuthorID| Author    | Book         |
+--------+-----------+--------------+
| 1      |John Smith |Deep Learning |
| 2      |Jane Doe   |AI Ethics    |
| 5      |null       |Data Science  |
+--------+-----------+--------------+
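A right join is simply a left join with the DataFrames swapped, so the call above can equivalently be written with df_books on the left (only the column order differs):

# Equivalent to the right join above, modulo column order
df_books.join(df_authors, on="AuthorID", how="left").show()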
4. Full Outer Join
df_authors.join(df_books, on="AuthorID", how="outer").show()
Why do we need this? Use a full outer join when you need a complete dataset, keeping all records from both sides.
Result
+--------+-----------+--------------+
|AuthorID| Author    | Book         |
+--------+-----------+--------------+
| 1      |John Smith |Deep Learning |
| 2      |Jane Doe   |AI Ethics     |
| 3      |Alex Lee   |null          |
| 4      |Anna Ray   |null          |
| 5      |null       |Data Science  |
+--------+-----------+--------------+
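Spark accepts several spellings for this join type; "outer", "full", and "full_outer" all behave identically:

# Same result as how="outer"
df_authors.join(df_books, on="AuthorID", how="full_outer").show()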
5. Left Semi Join
df_authors.join(df_books, on="AuthorID", how="left_semi").show()
Why do we need this? Use a left semi join when you only want rows from the left that have a match in the right, but don’t care about the right side’s columns. (Example: find authors who wrote at least one book.)
Result
+--------+-----------+
|AuthorID| Author    |
+--------+-----------+
| 1      |John Smith |
| 2      |Jane Doe   |
+--------+-----------+
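A left semi join behaves like an inner join against the right side’s distinct keys while keeping only the left side’s columns; a rough equivalent for comparison:

# Roughly equivalent to the left_semi join above
df_authors.join(df_books.select("AuthorID").distinct(), on="AuthorID", how="inner").show()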
6. Left Anti Join
df_authors.join(df_books, on="AuthorID", how="left_anti").show()
Why do we need this? Use a left anti join when you want rows from the left that have NO match in the right. (Example: find authors who don’t have any book published.)
Result
+--------+-----------+
|AuthorID| Author    |
+--------+-----------+
| 3      |Alex Lee   |
| 4      |Anna Ray   |
+--------+-----------+
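The same rows can be reproduced with a left join followed by a null filter, which makes the semantics explicit; a sketch:

from pyspark.sql import functions as F

# Roughly equivalent to the left_anti join above
df_authors.join(df_books, on="AuthorID", how="left") \
    .filter(F.col("Book").isNull()) \
    .select("AuthorID", "Author").show()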
7. Cross Join (Cartesian Product)
df_authors.crossJoin(df_books).show()
Why do we need this? Use a cross join only for special scenarios like recommendations or testing, where every row of one dataset must be paired with every row of the other. ⚠️ Be careful: it explodes the data size quickly.
Result
+--------+-----------+--------+--------------+
|AuthorID| Author    |AuthorID| Book         |
+--------+-----------+--------+--------------+
| 1      |John Smith | 1      |Deep Learning |
| 1      |John Smith | 2      |AI Ethics     |
| 1      |John Smith | 5      |Data Science  |
... (all combinations)
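The output size is the product of the two row counts (here 4 × 3 = 12), which is cheap to check before running a large cross join:

# A cross join produces left_count * right_count rows
expected = df_authors.count() * df_books.count()  # 4 * 3 = 12
actual = df_authors.crossJoin(df_books).count()
print(expected, actual)  # 12 12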
🔑 1-Minute Summary — Joins in PySpark
| Join Type   | Why We Need It                                      |
|-------------|-----------------------------------------------------|
| inner       | Only matching rows (common records)                 |
| left        | Keep all from left, match where possible            |
| right       | Keep all from right, match where possible           |
| outer       | Keep everything from both sides                     |
| left_semi   | Filter left rows that exist in right                |
| left_anti   | Filter left rows that don’t exist in right          |
| crossJoin() | All combinations (use cautiously; can explode rows) |