Explode, Lateral View, Structs & Arrays — Mastering Complex Columns in PySpark

At NeoMart, customer interactions leave behind massive footprints:

  • A single order contains multiple items
  • Each user has multiple addresses
  • Product metadata comes as nested objects
  • API logs arrive as arrays-of-structs

This is the reality of modern data — deeply nested and annoyingly complex.

To make this data analytics-friendly, PySpark gives us powerful tools:

✔ Arrays
✔ Structs
✔ explode()
✔ posexplode()
✔ LATERAL VIEW in SQL

This chapter helps you master the art of flattening: turning messy nested data into clean, analysis-ready DataFrames.


1. Understanding Complex Columns in PySpark

PySpark represents nested data with three core types:

✔ ArrayType

Used for lists such as items, tags, categories.

✔ StructType

Used for nested objects like address, metadata, payment info.

✔ Array of Structs

The most common format in logs and APIs.
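
For reference, here is a minimal sketch of what these types look like as an explicit schema (the field names are illustrative, not part of NeoMart's actual data model):

from pyspark.sql.types import (
    StructType, StructField, StringType, LongType, ArrayType
)

order_schema = StructType([
    StructField("order_id", LongType()),
    StructField("items", ArrayType(StringType())),          # ArrayType
    StructField("info", StructType([                        # StructType
        StructField("city", StringType()),
        StructField("payment", StringType()),
    ])),
    StructField("logs", ArrayType(StructType([              # Array of Structs
        StructField("event", StringType()),
        StructField("ts", LongType()),
    ]))),
])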


2. Creating a Sample Nested DataFrame

from pyspark.sql import Row
from pyspark.sql import functions as F

# Assumes a running SparkSession bound to `spark` (as in the pyspark shell).
# Nested Row objects are inferred as StructType; a plain Python dict would
# be inferred as MapType instead.
data = [
    Row(order_id=1, items=["Laptop", "Mouse"], info=Row(city="Berlin", payment="Card")),
    Row(order_id=2, items=["Keyboard"], info=Row(city="Paris", payment="UPI")),
    Row(order_id=3, items=["Monitor", "Stand", "HDMI Cable"], info=Row(city="Rome", payment="Cash")),
]

df = spark.createDataFrame(data)
df.show(truncate=False)
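
It is worth verifying the inferred schema before flattening anything; with the Row-based data above, printSchema() reports the array and struct types:

df.printSchema()
# root
#  |-- order_id: long (nullable = true)
#  |-- items: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- info: struct (nullable = true)
#  |    |-- city: string (nullable = true)
#  |    |-- payment: string (nullable = true)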

Input Data

+--------+----------------------------+--------------+
|order_id|items                       |info          |
+--------+----------------------------+--------------+
|1       |[Laptop, Mouse]             |{Berlin, Card}|
|2       |[Keyboard]                  |{Paris, UPI}  |
|3       |[Monitor, Stand, HDMI Cable]|{Rome, Cash}  |
+--------+----------------------------+--------------+

3. explode() — Flatten Array Columns

explode() turns each element of an array into its own row.

df_ex = df.select("order_id", F.explode("items").alias("item"))
df_ex.show()

Output

+--------+----------+
|order_id|      item|
+--------+----------+
|       1|    Laptop|
|       1|     Mouse|
|       2|  Keyboard|
|       3|   Monitor|
|       3|     Stand|
|       3|HDMI Cable|
+--------+----------+

Story

NeoMart’s analytics team needs one row per product sold instead of a nested list. explode() delivers exactly that.
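
That one-row-per-product shape makes downstream aggregation trivial; for example, a per-product sales count is a one-liner (a sketch of that next step):

df_ex.groupBy("item").count().show()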


4. posexplode() — Preserve Position Index

Useful when order matters.

df_pos = df.select(
    "order_id",
    F.posexplode("items").alias("position", "item")
)
df_pos.show()

Output

+--------+--------+----------+
|order_id|position|      item|
+--------+--------+----------+
|       1|       0|    Laptop|
|       1|       1|     Mouse|
|       2|       0|  Keyboard|
|       3|       0|   Monitor|
|       3|       1|     Stand|
|       3|       2|HDMI Cable|
+--------+--------+----------+
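
The index makes position-based logic straightforward; for example, keeping only the first item of each order (a minimal sketch):

first_items = df_pos.filter(F.col("position") == 0)
first_items.show()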

5. explode_outer() — Keep Null / Empty Arrays

df2 = spark.createDataFrame([
    (1, ["A", "B"]),
    (2, None),
    (3, []),
], ["id", "letters"])

df2.select("id", F.explode_outer("letters")).show()

Output

+--+----+
|id| col|
+--+----+
| 1|   A|
| 1|   B|
| 2|null|
| 3|null|
+--+----+
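
Compare this with plain explode(), which silently drops ids 2 and 3 because their arrays are null or empty:

df2.select("id", F.explode("letters")).show()
# +--+---+
# |id|col|
# +--+---+
# | 1|  A|
# | 1|  B|
# +--+---+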

6. Accessing Struct Fields

A struct behaves like a JSON object: its fields are addressed with dot notation.

df.select(
    "order_id",
    "info.city",
    "info.payment"
).show()

Output

+--------+------+-------+
|order_id|  city|payment|
+--------+------+-------+
|       1|Berlin|   Card|
|       2| Paris|    UPI|
|       3|  Rome|   Cash|
+--------+------+-------+
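
To pull every field out of a struct at once, select the struct with star syntax:

# Expands info into top-level city and payment columns
df.select("order_id", "info.*").show()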

7. explode() + Structs — Arrays of Struct

Let’s flatten an array of structs:

# Rows inside the array are inferred as array<struct<event,ts>>; plain dicts
# would be inferred as maps and fail here because the values mix types.
df_logs = spark.createDataFrame([
    (1, [Row(event="click", ts=12), Row(event="view", ts=15)]),
    (2, [Row(event="view", ts=18)]),
], ["session_id", "logs"])

df_logs_ex = df_logs.select(
    "session_id",
    F.explode("logs").alias("log")
)

df_logs_ex.select("session_id", "log.event", "log.ts").show()

Output

+----------+-----+--+
|session_id|event|ts|
+----------+-----+--+
|         1|click|12|
|         1| view|15|
|         2| view|18|
+----------+-----+--+
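
As a shortcut, the SQL generator inline() explodes an array of structs and expands the struct fields in a single step. It is used here through selectExpr(), since the dedicated Python wrapper only appeared in recent Spark releases:

# Produces session_id, event, ts columns directly
df_logs.selectExpr("session_id", "inline(logs)").show()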

8. LATERAL VIEW in Spark SQL

Sometimes SQL is more readable.

SELECT
    order_id,
    item
FROM orders
LATERAL VIEW explode(items) AS item;
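
To run this from PySpark, the DataFrame must first be registered as a temporary view; a minimal sketch using the df from section 2:

df.createOrReplaceTempView("orders")

spark.sql("""
    SELECT order_id, item
    FROM orders
    LATERAL VIEW explode(items) AS item
""").show()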

The output matches Section 3's explode() result (first rows shown):

+--------+--------+
|order_id|    item|
+--------+--------+
|       1|  Laptop|
|       1|   Mouse|
|       2|Keyboard|
+--------+--------+

Story

SQL analysts at NeoMart prefer LATERAL VIEW because it avoids long chains of DataFrame transformations.


9. LATERAL VIEW with Structs (Array of Structs)

SELECT
    session_id,
    log.event,
    log.ts
FROM session_logs
LATERAL VIEW explode(logs) AS log;
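
Again, this assumes the df_logs DataFrame from section 7 has been registered as a view:

df_logs.createOrReplaceTempView("session_logs")

spark.sql("""
    SELECT session_id, log.event, log.ts
    FROM session_logs
    LATERAL VIEW explode(logs) AS log
""").show()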

10. Best Practices for Complex Columns

✔ Avoid deep nesting when possible
✔ Use explode_outer to preserve records with null or empty arrays
✔ Prefer posexplode when element order matters
✔ Use SQL LATERAL VIEW for large joins + explode
✔ Cache DataFrames when exploding huge arrays (see the sketch below)
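
On that last point, a minimal sketch: if an exploded DataFrame feeds several downstream aggregations, caching it avoids recomputing the row-multiplying explode each time:

df_items = df.select("order_id", F.explode("items").alias("item")).cache()

df_items.groupBy("item").count().show()      # first action materializes the cache
df_items.groupBy("order_id").count().show()  # reuses the cached rows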


Summary

By mastering:

  • explode / posexplode / explode_outer
  • struct access
  • arrays & nested objects
  • LATERAL VIEW

…you unlock the true power of PySpark for real-world messy data.

NeoMart’s pipelines depend on these operations daily — and now, so can yours.


Next: Pivot & Unpivot in PySpark — Turning Rows Into Columns (and Back!)