Explode, Lateral View, Structs & Arrays — Mastering Complex Columns in PySpark
At NeoMart, customer interactions leave behind massive footprints:
- A single order contains multiple items
- Each user has multiple addresses
- Product metadata comes as nested objects
- API logs arrive as arrays-of-structs
This is the reality of modern data — deeply nested and annoyingly complex.
To make this data analytics-friendly, PySpark gives us powerful tools:
✔ Arrays
✔ Structs
✔ explode()
✔ posexplode()
✔ LATERAL VIEW in SQL
This chapter helps you master the art of flattening: turning messy nested data into clean, analysis-ready DataFrames.
1. Understanding Complex Columns in PySpark
PySpark handles many nested types:
✔ ArrayType
Used for lists such as items, tags, categories.
✔ StructType
Used for nested objects like address, metadata, payment info.
✔ Array of Structs
The most common format in logs and APIs.
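If you ever need to declare these shapes explicitly, for example when reading JSON with a known layout, here is a minimal schema sketch combining all three (the order_schema name and the logs field are illustrative, not part of the NeoMart examples below):

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, LongType, ArrayType
)

# One schema showing all three nested shapes.
order_schema = StructType([
    StructField("order_id", LongType()),
    StructField("items", ArrayType(StringType())),      # ArrayType
    StructField("info", StructType([                    # StructType
        StructField("city", StringType()),
        StructField("payment", StringType()),
    ])),
    StructField("logs", ArrayType(StructType([          # array of structs
        StructField("event", StringType()),
        StructField("ts", LongType()),
    ]))),
])
```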
2. Creating a Sample Nested DataFrame
```python
from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    Row(order_id=1, items=["Laptop", "Mouse"], info=Row(city="Berlin", payment="Card")),
    Row(order_id=2, items=["Keyboard"], info=Row(city="Paris", payment="UPI")),
    Row(order_id=3, items=["Monitor", "Stand", "HDMI Cable"], info=Row(city="Rome", payment="Cash"))
]

df = spark.createDataFrame(data)
df.show(truncate=False)
```

Note the nested Row for info: a plain Python dict would be inferred as a MapType, not the StructType we want for the struct examples later.
Input Data
| order_id | items | info |
|---|---|---|
| 1 | ["Laptop", "Mouse"] | {"city": "Berlin", "payment": "Card"} |
| 2 | ["Keyboard"] | {"city": "Paris", "payment": "UPI"} |
| 3 | ["Monitor", "Stand", "HDMI Cable"] | {"city": "Rome", "payment": "Cash"} |
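Before flattening anything, confirm what Spark inferred. A quick check (schema shown as comments; exact nullability flags can vary by Spark version):

```python
df.printSchema()
# root
#  |-- order_id: long (nullable = true)
#  |-- items: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- info: struct (nullable = true)
#  |    |-- city: string (nullable = true)
#  |    |-- payment: string (nullable = true)
```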
3. explode() — Flatten Array Columns
explode() turns each element of an array into its own row.
```python
df_ex = df.select("order_id", F.explode("items").alias("item"))
df_ex.show()
```
Output
| order_id | item |
|---|---|
| 1 | Laptop |
| 1 | Mouse |
| 2 | Keyboard |
| 3 | Monitor |
| 3 | Stand |
| 3 | HDMI Cable |
Story
NeoMart’s analytics team needs one row per product sold instead of a nested list.
explode() delivers exactly that.
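Once each product has its own row, ordinary aggregations just work. A quick follow-up on df_ex from above:

```python
# How many times has each product been sold?
df_ex.groupBy("item").count().orderBy("count", ascending=False).show()
```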
4. posexplode() — Preserve Position Index
Useful when order matters.
```python
df_pos = df.select(
    "order_id",
    F.posexplode("items").alias("position", "item")
)
df_pos.show()
```
Output
| order_id | position | item |
|---|---|---|
| 1 | 0 | Laptop |
| 1 | 1 | Mouse |
| 2 | 0 | Keyboard |
| 3 | 0 | Monitor |
| 3 | 1 | Stand |
| 3 | 2 | HDMI Cable |
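The position column makes order-sensitive queries trivial. For example, to keep only the first item of every order (using df_pos from above):

```python
# position is 0-based, so position == 0 is the first element
df_pos.filter(F.col("position") == 0).show()
```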
5. explode_outer() — Keep Null / Empty Arrays
```python
df2 = spark.createDataFrame([
    (1, ["A", "B"]),
    (2, None),
    (3, []),
], ["id", "letters"])

df2.select("id", F.explode_outer("letters")).show()
```
Output
| id | col |
|---|---|
| 1 | A |
| 1 | B |
| 2 | null |
| 3 | null |
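The exploded column gets the default name col; alias it for readability. And if you need both null-preservation and the position index, posexplode_outer combines sections 4 and 5 (a small sketch on the same df2):

```python
# explode_outer with a readable column name
df2.select("id", F.explode_outer("letters").alias("letter")).show()

# posexplode_outer: position + value, null/empty arrays preserved
df2.select("id", F.posexplode_outer("letters").alias("pos", "letter")).show()
```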
6. Accessing Struct Fields
Structs behave like JSON objects: individual fields are reached with dot notation.
```python
df.select(
    "order_id",
    "info.city",
    "info.payment"
).show()
```
Output
| order_id | city | payment |
|---|---|---|
| 1 | Berlin | Card |
| 2 | Paris | UPI |
| 3 | Rome | Cash |
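Dot-path strings are the shortest form, but the same fields can be reached with col() and getField(), which compose better inside larger expressions:

```python
# Two equivalent ways to extract a struct field as a top-level column
df.select(
    F.col("info.city").alias("city"),
    F.col("info").getField("payment").alias("payment"),
).show()
```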
7. explode() + Structs — Arrays of Struct
Let’s expand an array of structs. Note the Row objects here: plain Python dicts would be inferred as a map column rather than an array of structs.

```python
df_logs = spark.createDataFrame([
    (1, [Row(event="click", ts=12), Row(event="view", ts=15)]),
    (2, [Row(event="view", ts=18)])
], ["session_id", "logs"])

df_logs_ex = df_logs.select(
    "session_id",
    F.explode("logs").alias("log")
)

df_logs_ex.select("session_id", "log.event", "log.ts").show()
```
Output
| session_id | event | ts |
|---|---|---|
| 1 | click | 12 |
| 1 | view | 15 |
| 2 | view | 18 |
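If you need just one field and no extra rows, you can project into the array without exploding: a dot path on an array of structs returns an array of that field's values (same df_logs as above):

```python
# No explosion: pull one struct field across the whole array
df_logs.select(
    "session_id",
    F.col("logs.event").alias("events"),   # array<string>
    F.col("logs.ts").alias("timestamps"),  # array<bigint>
).show(truncate=False)
```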
8. LATERAL VIEW in Spark SQL
Sometimes SQL is more readable. Register the DataFrame as a temporary view first with df.createOrReplaceTempView("orders"), then:

```sql
SELECT
    order_id,
    item
FROM orders
LATERAL VIEW explode(items) AS item;
```
The output matches explode():
| order_id | item |
|---|---|
| 1 | Laptop |
| 1 | Mouse |
| 2 | Keyboard |
| 3 | Monitor |
| 3 | Stand |
| 3 | HDMI Cable |
Story
SQL analysts at NeoMart prefer LATERAL VIEW because it avoids long DataFrame transformations.
9. LATERAL VIEW with Structs (Array of Structs)
As before, register the view with df_logs.createOrReplaceTempView("session_logs"), then:

```sql
SELECT
    session_id,
    log.event,
    log.ts
FROM session_logs
LATERAL VIEW explode(logs) AS log;
```

The result is identical to the DataFrame version in section 7.
10. Best Practices for Complex Columns
✔ Avoid deep nesting when possible
✔ Use explode_outer to preserve records with null or empty arrays (see the sketch after this list)
✔ Prefer posexplode when element order matters
✔ Use SQL LATERAL VIEW when an explode has to sit inside a larger query with joins
✔ Cache a DataFrame before exploding huge arrays if the result is reused downstream
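Two of these practices meet in SQL: LATERAL VIEW OUTER is the SQL counterpart of explode_outer, and posexplode works there as well. A sketch assuming the orders view from section 8 is registered (orders_outer is just an illustrative name):

```python
# Keep orders whose items array is null or empty,
# and preserve each item's position within the array.
orders_outer = spark.sql("""
    SELECT order_id, pos, item
    FROM orders
    LATERAL VIEW OUTER posexplode(items) AS pos, item
""")
orders_outer.show()
```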
Summary
By mastering:
- explode / posexplode / explode_outer
- struct access
- arrays & nested objects
- LATERAL VIEW
…you unlock the true power of PySpark for real-world messy data.
NeoMart’s pipelines depend on these operations daily — and now, so can yours.
Next: Pivot & Unpivot in PySpark — Turning Rows Into Columns (and Back!)