Joins in PySpark DataFrames
Load Example Datasets
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('joins').getOrCreate()

# Authors dataset
authors = [(1, "John Smith"),
           (2, "Jane Doe"),
           (3, "Alex Lee"),
           (4, "Anna Ray")]
df_authors = spark.createDataFrame(authors, ["AuthorID", "Author"])

# Books dataset
books = [(1, "Deep Learning"),
         (2, "AI Ethics"),
         (5, "Data Science")]
df_books = spark.createDataFrame(books, ["AuthorID", "Book"])
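With both DataFrames created, a quick show() on each confirms the data before joining:

df_authors.show()
df_books.show()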
1. Inner Join
df_authors.join(df_books, on="AuthorID", how="inner").show()
Why do we need this? Use an inner join when you only want records present in both datasets (e.g., authors who actually wrote a book).
Result
+--------+-----------+--------------+
|AuthorID| Author    | Book         |
+--------+-----------+--------------+
| 1      |John Smith |Deep Learning |
| 2      |Jane Doe   |AI Ethics     |
+--------+-----------+--------------+
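The on="AuthorID" shorthand works because both DataFrames share the key column's name, and it keeps a single AuthorID column in the output. With differently named keys you would pass an explicit condition instead; a minimal sketch using the same DataFrames (note that this form retains both AuthorID columns):

# Explicit join condition; retains both AuthorID columns
df_authors.join(df_books, df_authors["AuthorID"] == df_books["AuthorID"], "inner").show()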
2. Left Join
df_authors.join(df_books, on="AuthorID", how="left").show()
Why do we need this? Use a left join when you want all records from the left side (authors), even if they don’t have a matching book.
Result
+--------+-----------+--------------+
|AuthorID| Author    | Book         |
+--------+-----------+--------------+
| 1      |John Smith |Deep Learning |
| 2      |Jane Doe   |AI Ethics     |
| 3      |Alex Lee   |null          |
| 4      |Anna Ray   |null          |
+--------+-----------+--------------+
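The nulls introduced on the unmatched side are often handled right after the join. One option, sketched here with a placeholder label of our choosing:

# Replace missing books with a placeholder (the label is arbitrary)
df_authors.join(df_books, on="AuthorID", how="left") \
    .fillna({"Book": "No book yet"}).show()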
3. Right Join
df_authors.join(df_books, on="AuthorID", how="right").show()
Why do we need this? Use a right join when you want all records from the right side (books), even if they don’t have a matching author.
Result
+--------+-----------+--------------+
|AuthorID| Author    | Book         |
+--------+-----------+--------------+
| 1      |John Smith |Deep Learning |
| 2      |Jane Doe   |AI Ethics    |
| 5      |null       |Data Science  |
+--------+-----------+--------------+
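A right join is simply a left join with the DataFrames swapped, so the call above can equivalently be written with df_books on the left (only the column order differs):

# Equivalent to the right join above, modulo column order
df_books.join(df_authors, on="AuthorID", how="left").show()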
4. Full Outer Join
df_authors.join(df_books, on="AuthorID", how="outer").show()
Why do we need this? Use a full outer join when you need a complete dataset, keeping all records from both sides.
Result
+--------+-----------+--------------+
|AuthorID| Author    | Book         |
+--------+-----------+--------------+
| 1      |John Smith |Deep Learning |
| 2      |Jane Doe   |AI Ethics     |
| 3      |Alex Lee   |null          |
| 4      |Anna Ray   |null          |
| 5      |null       |Data Science  |
+--------+-----------+--------------+
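Spark accepts several spellings for this join type; "outer", "full", and "full_outer" all behave identically:

# Same result as how="outer"
df_authors.join(df_books, on="AuthorID", how="full_outer").show()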
5. Left Semi Join
df_authors.join(df_books, on="AuthorID", how="left_semi").show()
Why do we need this? Use a left semi join when you only want rows from the left that have a match in the right, but don’t care about the right side’s columns. (Example: find authors who wrote at least one book.)
Result
+--------+-----------+
|AuthorID| Author    |
+--------+-----------+
| 1      |John Smith |
| 2      |Jane Doe   |
+--------+-----------+
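A left semi join behaves like an inner join against the right side’s distinct keys while keeping only the left side’s columns; a rough equivalent for comparison:

# Roughly equivalent to the left_semi join above
df_authors.join(df_books.select("AuthorID").distinct(), on="AuthorID", how="inner").show()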
6. Left Anti Join
df_authors.join(df_books, on="AuthorID", how="left_anti").show()
Why do we need this? Use a left anti join when you want rows from the left that have NO match in the right. (Example: find authors who don’t have any book published.)
Result
+--------+-----------+
|AuthorID| Author    |
+--------+-----------+
| 3      |Alex Lee   |
| 4      |Anna Ray   |
+--------+-----------+
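The same rows can be reproduced with a left join followed by a null filter, which makes the semantics explicit; a sketch:

from pyspark.sql import functions as F

# Roughly equivalent to the left_anti join above
df_authors.join(df_books, on="AuthorID", how="left") \
    .filter(F.col("Book").isNull()) \
    .select("AuthorID", "Author").show()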
7. Cross Join (Cartesian Product)
df_authors.crossJoin(df_books).show()
Why do we need this? Use a cross join only for special scenarios like recommendations or testing, where every row of one dataset must be paired with every row of the other. ⚠️ Be careful: it explodes the data size quickly.
Result
+--------+-----------+--------+--------------+
|AuthorID| Author    |AuthorID| Book         |
+--------+-----------+--------+--------------+
| 1      |John Smith | 1      |Deep Learning |
| 1      |John Smith | 2      |AI Ethics     |
| 1      |John Smith | 5      |Data Science  |
... (all combinations)
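The output size is the product of the two row counts (here 4 × 3 = 12), which is cheap to check before running a large cross join:

# A cross join produces left_count * right_count rows
expected = df_authors.count() * df_books.count()  # 4 * 3 = 12
actual = df_authors.crossJoin(df_books).count()
print(expected, actual)  # 12 12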
🔑 1-Minute Summary — Joins in PySpark
| Join Type   | Why We Need It                                      |
|-------------|-----------------------------------------------------|
| inner       | Only matching rows (common records)                 |
| left        | Keep all from left, match where possible            |
| right       | Keep all from right, match where possible           |
| outer       | Keep everything from both sides                     |
| left_semi   | Filter left rows that exist in right                |
| left_anti   | Filter left rows that don’t exist in right          |
| crossJoin() | All combinations (use cautiously; can explode rows) |