Skip to main content

Joins in PySpark DataFrames

Joins are one of the most common operations when working with data.
In PySpark, joins let you combine DataFrames based on matching keys — while taking advantage of Spark’s distributed computing engine.

This guide covers all join types in PySpark, including:

  • Inner Join
  • Left Join
  • Right Join
  • Full Outer Join
  • Left Semi Join
  • Left Anti Join
  • Cross Join

Each join comes with a clean explanation, Python code, and output.


Load Example Datasets

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('joins').getOrCreate()

# Authors dataset
authors = [(1, "John Smith"),
(2, "Jane Doe"),
(3, "Alex Lee"),
(4, "Anna Ray")]

df_authors = spark.createDataFrame(authors, ["AuthorID", "Author"])

# Books dataset
books = [(1, "Deep Learning"),
(2, "AI Ethics"),
(5, "Data Science")]

df_books = spark.createDataFrame(books, ["AuthorID", "Book"])

1. Inner Join

df_authors.join(df_books, on="AuthorID", how="inner").show()

📌 When to Use:

Use inner join when you only need rows that exist in both DataFrames. Example: authors who actually wrote a book.

✅ Result

AuthorIDAuthorBook
1John SmithDeep Learning
2Jane DoeAI Ethics

2. Left Join (Left Outer Join)

df_authors.join(df_books, on="AuthorID", how="left").show()

📌 When to Use:

Use when you want all authors, even if they don’t have a book.

✅ Result

AuthorIDAuthorBook
1John SmithDeep Learning
2Jane DoeAI Ethics
3Alex Leenull
4Anna Raynull

3. Right Join (Right Outer Join)

df_authors.join(df_books, on="AuthorID", how="right").show()

📌 When to Use:

Use when you want all books, even if no author matches.

✅ Result

AuthorIDAuthorBook
1John SmithDeep Learning
2Jane DoeAI Ethics
5nullData Science

4. Full Outer Join

df_authors.join(df_books, on="AuthorID", how="outer").show()

📌 When to Use:

Use this join when you want all records from both sides, matched where possible.

✅ Result

AuthorIDAuthorBook
1John SmithDeep Learning
2Jane DoeAI Ethics
3Alex Leenull
4Anna Raynull
5nullData Science

5. Left Semi Join

df_authors.join(df_books, on="AuthorID", how="left_semi").show()

📌 When to Use:

Use left semi join when you only need rows from the left DataFrame that have a match in the right.

Note: You only get left columns — right DataFrame columns are removed.

Example:

Find authors who have written a book.

✅ Result

AuthorIDAuthor
1John Smith
2Jane Doe

6. Left Anti Join

df_authors.join(df_books, on="AuthorID", how="left_anti").show()

📌 When to Use:

Use when you want rows from the left DataFrame that do NOT exist in the right.

Example:

Find authors without any published book.

✅ Result

AuthorIDAuthor
3Alex Lee
4Anna Ray

7. Cross Join (Cartesian Product)

df_authors.crossJoin(df_books).show()

📌 When to Use:

Use only in special cases, such as generating combinations for recommendations or testing.

⚠️ Warning: Cross join produces all combinations, which can explode to millions of rows.

Example Result (Truncated)

AuthorIDAuthorAuthorIDBook
1John Smith1Deep Learning
1John Smith2AI Ethics
1John Smith5Data Science
............

🔑 1-Minute Summary — PySpark Joins

Join TypeWhat It Does
innerOnly matching rows from both sides
leftAll left rows + matching right rows
rightAll right rows + matching left rows
outerKeep every row from both sides
left_semiLeft rows that have matches in right
left_antiLeft rows that don’t have matches in right
crossJoin()All combinations (use carefully)

Next, we’ll explore Window Functions in PySpark DataFrames, enabling running totals, rankings, and time-based calculations.

Career