Explode, Lateral View, Structs & Arrays — Mastering Complex Columns in PySpark
At NeoMart, customer interactions leave behind massive footprints:
- A single order contains multiple items
- Each user has multiple addresses
- Product metadata comes as nested objects
- API logs arrive as arrays-of-structs
This is the reality of modern data — deeply nested and annoyingly complex.
To make this data analytics-friendly, PySpark gives us powerful tools:
✔ Arrays
✔ Structs
✔ explode()
✔ posexplode()
✔ LATERAL VIEW in SQL
This chapter helps you master the art of flattening: turning messy nested data into clean, analysis-ready DataFrames.
1. Understanding Complex Columns in PySpark
PySpark handles many nested types:
✔ ArrayType
Used for lists such as items, tags, categories.
✔ StructType
Used for nested objects like address, metadata, payment info.
✔ Array of Structs
The most common format in logs and APIs.
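If you ever need to declare these shapes explicitly, for example when reading JSON with a known layout, here is a minimal schema sketch combining all three (the order_schema name and the logs field are illustrative, not part of the NeoMart examples below):

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, LongType, ArrayType
)

# One schema showing all three nested shapes.
order_schema = StructType([
    StructField("order_id", LongType()),
    StructField("items", ArrayType(StringType())),      # ArrayType
    StructField("info", StructType([                    # StructType
        StructField("city", StringType()),
        StructField("payment", StringType()),
    ])),
    StructField("logs", ArrayType(StructType([          # array of structs
        StructField("event", StringType()),
        StructField("ts", LongType()),
    ]))),
])
```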
2. Creating a Sample Nested DataFrame
```python
from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    Row(order_id=1, items=["Laptop", "Mouse"], info=Row(city="Berlin", payment="Card")),
    Row(order_id=2, items=["Keyboard"], info=Row(city="Paris", payment="UPI")),
    Row(order_id=3, items=["Monitor", "Stand", "HDMI Cable"], info=Row(city="Rome", payment="Cash"))
]

df = spark.createDataFrame(data)
df.show(truncate=False)
```

Note the nested Row for info: a plain Python dict would be inferred as a MapType, not the StructType we want for the struct examples later.
Input Data
| order_id | items | info |
|---|---|---|
| 1 | ["Laptop", "Mouse"] | {"city": "Berlin", "payment": "Card"} |
| 2 | ["Keyboard"] | {"city": "Paris", "payment": "UPI"} |
| 3 | ["Monitor", "Stand", "HDMI Cable"] | {"city": "Rome", "payment": "Cash"} |
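Before flattening anything, confirm what Spark inferred. A quick check (schema shown as comments; exact nullability flags can vary by Spark version):

```python
df.printSchema()
# root
#  |-- order_id: long (nullable = true)
#  |-- items: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- info: struct (nullable = true)
#  |    |-- city: string (nullable = true)
#  |    |-- payment: string (nullable = true)
```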
3. explode() — Flatten Array Columns
explode() turns each element of an array into its own row.
```python
df_ex = df.select("order_id", F.explode("items").alias("item"))
df_ex.show()
```
Output
| order_id | item |
|---|---|
| 1 | Laptop |
| 1 | Mouse |
| 2 | Keyboard |
| 3 | Monitor |
| 3 | Stand |
| 3 | HDMI Cable |
Story
NeoMart’s analytics team needs one row per product sold instead of a nested list.
explode() delivers exactly that.
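Once each product has its own row, ordinary aggregations just work. A quick follow-up on df_ex from above:

```python
# How many times has each product been sold?
df_ex.groupBy("item").count().orderBy("count", ascending=False).show()
```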
4. posexplode() — Preserve Position Index
Useful when order matters.
```python
df_pos = df.select(
    "order_id",
    F.posexplode("items").alias("position", "item")
)
df_pos.show()
```
Output
| order_id | position | item |
|---|---|---|
| 1 | 0 | Laptop |
| 1 | 1 | Mouse |
| 2 | 0 | Keyboard |
| 3 | 0 | Monitor |
| 3 | 1 | Stand |
| 3 | 2 | HDMI Cable |
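The position column makes order-sensitive queries trivial. For example, to keep only the first item of every order (using df_pos from above):

```python
# position is 0-based, so position == 0 is the first element
df_pos.filter(F.col("position") == 0).show()
```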
5. explode_outer() — Keep Null / Empty Arrays
```python
df2 = spark.createDataFrame([
    (1, ["A", "B"]),
    (2, None),
    (3, []),
], ["id", "letters"])

df2.select("id", F.explode_outer("letters")).show()
```
Output
| id | col |
|---|---|
| 1 | A |
| 1 | B |
| 2 | null |
| 3 | null |
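The exploded column gets the default name col; alias it for readability. And if you need both null-preservation and the position index, posexplode_outer combines sections 4 and 5 (a small sketch on the same df2):

```python
# explode_outer with a readable column name
df2.select("id", F.explode_outer("letters").alias("letter")).show()

# posexplode_outer: position + value, null/empty arrays preserved
df2.select("id", F.posexplode_outer("letters").alias("pos", "letter")).show()
```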
6. Accessing Struct Fields
Structs behave like JSON objects: individual fields are reached with dot notation.
```python
df.select(
    "order_id",
    "info.city",
    "info.payment"
).show()
```
Output
| order_id | city | payment |
|---|---|---|
| 1 | Berlin | Card |
| 2 | Paris | UPI |
| 3 | Rome | Cash |
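Dot-path strings are the shortest form, but the same fields can be reached with col() and getField(), which compose better inside larger expressions:

```python
# Two equivalent ways to extract a struct field as a top-level column
df.select(
    F.col("info.city").alias("city"),
    F.col("info").getField("payment").alias("payment"),
).show()
```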
7. explode() + Structs — Arrays of Struct
Let’s expand an array of structs. Note the Row objects here: plain Python dicts would be inferred as a map column rather than an array of structs.

```python
df_logs = spark.createDataFrame([
    (1, [Row(event="click", ts=12), Row(event="view", ts=15)]),
    (2, [Row(event="view", ts=18)])
], ["session_id", "logs"])

df_logs_ex = df_logs.select(
    "session_id",
    F.explode("logs").alias("log")
)

df_logs_ex.select("session_id", "log.event", "log.ts").show()
```
Output
| session_id | event | ts |
|---|---|---|
| 1 | click | 12 |
| 1 | view | 15 |
| 2 | view | 18 |
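If you need just one field and no extra rows, you can project into the array without exploding: a dot path on an array of structs returns an array of that field's values (same df_logs as above):

```python
# No explosion: pull one struct field across the whole array
df_logs.select(
    "session_id",
    F.col("logs.event").alias("events"),   # array<string>
    F.col("logs.ts").alias("timestamps"),  # array<bigint>
).show(truncate=False)
```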
8. LATERAL VIEW in Spark SQL
Sometimes SQL is more readable. Register the DataFrame as a temporary view first with df.createOrReplaceTempView("orders"), then:

```sql
SELECT
    order_id,
    item
FROM orders
LATERAL VIEW explode(items) AS item;
```
The output matches explode():
| order_id | item |
|---|---|
| 1 | Laptop |
| 1 | Mouse |
| 2 | Keyboard |
| 3 | Monitor |
| 3 | Stand |
| 3 | HDMI Cable |
Story
SQL analysts at NeoMart prefer LATERAL VIEW because it avoids long DataFrame transformations.
9. LATERAL VIEW with Structs (Array of Structs)
As before, register the view with df_logs.createOrReplaceTempView("session_logs"), then:

```sql
SELECT
    session_id,
    log.event,
    log.ts
FROM session_logs
LATERAL VIEW explode(logs) AS log;
```

The result is identical to the DataFrame version in section 7.
10. Best Practices for Complex Columns
✔ Avoid deep nesting when possible
✔ Use explode_outer to preserve records with null or empty arrays (see the sketch after this list)
✔ Prefer posexplode when element order matters
✔ Use SQL LATERAL VIEW when an explode has to sit inside a larger query with joins
✔ Cache a DataFrame before exploding huge arrays if the result is reused downstream
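Two of these practices meet in SQL: LATERAL VIEW OUTER is the SQL counterpart of explode_outer, and posexplode works there as well. A sketch assuming the orders view from section 8 is registered (orders_outer is just an illustrative name):

```python
# Keep orders whose items array is null or empty,
# and preserve each item's position within the array.
orders_outer = spark.sql("""
    SELECT order_id, pos, item
    FROM orders
    LATERAL VIEW OUTER posexplode(items) AS pos, item
""")
orders_outer.show()
```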
Summary
By mastering:
- explode / posexplode / explode_outer
- struct access
- arrays & nested objects
- LATERAL VIEW
…you unlock the true power of PySpark for real-world messy data.
NeoMart’s pipelines depend on these operations daily — and now, so can yours.
Next: Pivot & Unpivot in PySpark — Turning Rows Into Columns (and Back!)