Joins in PySpark DataFrames

Load Example Datasets

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('joins').getOrCreate()

# Authors dataset
authors = [(1, "John Smith"),
           (2, "Jane Doe"),
           (3, "Alex Lee"),
           (4, "Anna Ray")]

df_authors = spark.createDataFrame(authors, ["AuthorID", "Author"])

# Books dataset
books = [(1, "Deep Learning"),
         (2, "AI Ethics"),
         (5, "Data Science")]

df_books = spark.createDataFrame(books, ["AuthorID", "Book"])
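
Before joining, it helps to print both DataFrames; the IDs make the match/no-match cases below easy to trace:

df_authors.show()  # AuthorIDs 1-4; authors 3 and 4 have no book
df_books.show()    # AuthorIDs 1, 2, 5; book 5 has no matching author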

1. Inner Join

df_authors.join(df_books, on="AuthorID", how="inner").show()

Why do we need this? Use it when you only want records present in both datasets (e.g., authors who actually wrote a book).

Result

+--------+----------+-------------+
|AuthorID|    Author|         Book|
+--------+----------+-------------+
|       1|John Smith|Deep Learning|
|       2|  Jane Doe|    AI Ethics|
+--------+----------+-------------+
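
The same inner join can also be written with an explicit join condition, which is the form you need when the key columns have different names on each side (here the names match, so on="AuthorID" is cleaner):

df_authors.join(df_books, df_authors["AuthorID"] == df_books["AuthorID"], "inner").show()
# Note: unlike on="AuthorID", this form keeps both AuthorID columns in the result.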

2. Left Join

df_authors.join(df_books, on="AuthorID", how="left").show()

Why do we need this? Use it when you want all records from the left side (authors), even if they don't have a matching book.

Result

+--------+----------+-------------+
|AuthorID|    Author|         Book|
+--------+----------+-------------+
|       1|John Smith|Deep Learning|
|       2|  Jane Doe|    AI Ethics|
|       3|  Alex Lee|         null|
|       4|  Anna Ray|         null|
+--------+----------+-------------+
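
Unmatched left rows come back with null in the Book column. A common follow-up is to fill those nulls with a placeholder (a small sketch; the "No Book" text is just an illustration):

(df_authors.join(df_books, on="AuthorID", how="left")
    .fillna({"Book": "No Book"})   # fill nulls only in the Book column
    .show())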

3. Right Join

df_authors.join(df_books, on="AuthorID", how="right").show()

Why do we need this? Use it when you want all records from the right side (books), even if they don't have a matching author.

Result

+--------+----------+-------------+
|AuthorID|    Author|         Book|
+--------+----------+-------------+
|       1|John Smith|Deep Learning|
|       2|  Jane Doe|    AI Ethics|
|       5|      null| Data Science|
+--------+----------+-------------+
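
A right join is just a left join with the DataFrames swapped, and many teams standardize on left joins for readability:

df_books.join(df_authors, on="AuthorID", how="left").show()
# Same matched/unmatched rows as the right join above; column order differs (Book before Author).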

4. Full Outer Join

df_authors.join(df_books, on="AuthorID", how="outer").show()

Why do we need this? Use it when you need a complete dataset, keeping all records from both sides.

Result

+--------+----------+-------------+
|AuthorID|    Author|         Book|
+--------+----------+-------------+
|       1|John Smith|Deep Learning|
|       2|  Jane Doe|    AI Ethics|
|       3|  Alex Lee|         null|
|       4|  Anna Ray|         null|
|       5|      null| Data Science|
+--------+----------+-------------+
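
Spark also accepts "full" and "full_outer" as aliases for "outer". A full outer join makes it easy to spot unmatched rows on either side (a sketch using pyspark.sql.functions.col):

from pyspark.sql.functions import col

full = df_authors.join(df_books, on="AuthorID", how="outer")
full.filter(col("Author").isNull() | col("Book").isNull()).show()
# Authors 3 and 4 (no book) and book 5 (no author)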

5. Left Semi Join

df_authors.join(df_books, on="AuthorID", how="left_semi").show()

Why do we need this?

Use it when you only want rows from the left side that have a match in the right, but don't care about the right side's columns. (Example: find authors who wrote at least one book.)

Result

+--------+----------+
|AuthorID|    Author|
+--------+----------+
|       1|John Smith|
|       2|  Jane Doe|
+--------+----------+
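
For intuition, a left semi join behaves like a filter on the left side. An equivalent formulation (a sketch, relying on the right side being reduced to distinct keys so no left row gets duplicated) is:

df_authors.join(df_books.select("AuthorID").distinct(), on="AuthorID", how="inner").show()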

6. Left Anti Join

df_authors.join(df_books, on="AuthorID", how="left_anti").show()

Why do we need this? Use it when you want rows from the left side that have NO match in the right. (Example: find authors who don't have any book published.)

Result

+--------+--------+
|AuthorID|  Author|
+--------+--------+
|       3|Alex Lee|
|       4|Anna Ray|
+--------+--------+
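
The classic SQL-style alternative is a left join followed by a null filter; left_anti expresses the same intent more directly:

from pyspark.sql.functions import col

(df_authors.join(df_books, on="AuthorID", how="left")
    .filter(col("Book").isNull())        # keep only unmatched authors
    .select("AuthorID", "Author")        # drop the all-null Book column
    .show())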

7. Cross Join (Cartesian Product)

df_authors.crossJoin(df_books).show()

Why do we need this? Use it only for special scenarios like recommendations or testing, where every row must be paired with every row from the other dataset. ⚠️ Be careful: the output size explodes quickly (left rows × right rows).

Result

+--------+----------+--------+-------------+
|AuthorID|    Author|AuthorID|         Book|
+--------+----------+--------+-------------+
|       1|John Smith|       1|Deep Learning|
|       1|John Smith|       2|    AI Ethics|
|       1|John Smith|       5| Data Science|
... (all 12 combinations: 4 authors × 3 books)
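
A quick way to see the explosion is to count the output; here 4 authors × 3 books gives 12 rows, and real tables grow multiplicatively:

print(df_authors.count(), df_books.count(), df_authors.crossJoin(df_books).count())
# 4 3 12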

🔑 1-Minute Summary — Joins in PySpark

Join Type    | Why We Need It
-------------|---------------------------------------------------
inner        | Only matching rows (common records)
left         | Keep all from left, match where possible
right        | Keep all from right, match where possible
outer        | Keep everything from both sides
left_semi    | Filter left rows that exist in right
left_anti    | Filter left rows that don't exist in right
crossJoin()  | All combinations (use cautiously; can explode rows)
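
To eyeball all of these in one pass, you can loop over the key-based join strategies (a small sketch; crossJoin is excluded since it takes no key):

for how in ["inner", "left", "right", "outer", "left_semi", "left_anti"]:
    print(f"--- {how} ---")
    df_authors.join(df_books, on="AuthorID", how=how).show()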