Handling Semi-Structured Data in PySpark — JSON, XML, Avro
At DataVerse Labs, the engineering teams received data from multiple external partners:
- Marketing APIs → JSON
- Legacy systems → XML
- Event platforms → Avro
Semi-structured data is everywhere — flexible, schema-lite, and perfect for high-velocity environments.
PySpark provides first-class support to parse, transform, and store these formats at scale.
1. Working with JSON in PySpark
JSON is the most common semi-structured format for APIs and event streams.
1.1 Example JSON Input
{"order_id": "O1001", "customer": {"id": "C101", "country": "USA"}, "items": [{"sku": "P01", "qty": 2}, {"sku": "P02", "qty": 1}]}
{"order_id": "O1002", "customer": {"id": "C102", "country": "India"}, "items": [{"sku": "P03", "qty": 1}]}
Stored in: /data/orders.json
1.2 Reading JSON
df_json = spark.read.json("/data/orders.json")
df_json.printSchema()
df_json.show(truncate=False)
Output
root
 |-- customer: struct (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- id: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- qty: long (nullable = true)
 |    |    |-- sku: string (nullable = true)
 |-- order_id: string (nullable = true)
+--------+-------------+--------------------+
|order_id|customer     |items               |
+--------+-------------+--------------------+
|O1001   |{USA, C101}  |[{2, P01}, {1, P02}]|
|O1002   |{India, C102}|[{1, P03}]          |
+--------+-------------+--------------------+
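Schema inference works well for a quick look, but spark.read.json has to scan the data to infer types. For large files it is usually cheaper to declare the schema up front, as the best practices later in this chapter also recommend. A minimal sketch, assuming the same orders file (the variable name df_json_typed is illustrative):
from pyspark.sql.types import (
    StructType, StructField, ArrayType, StringType, LongType
)

# Explicit schema matching the nested orders structure
orders_schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("customer", StructType([
        StructField("id", StringType(), True),
        StructField("country", StringType(), True),
    ]), True),
    StructField("items", ArrayType(StructType([
        StructField("sku", StringType(), True),
        StructField("qty", LongType(), True),
    ])), True),
])

# No inference pass: the declared schema is applied directly
df_json_typed = spark.read.schema(orders_schema).json("/data/orders.json")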
1.3 Extracting Nested Fields
from pyspark.sql.functions import col, explode
df_nested = df_json \
    .withColumn("customer_id", col("customer.id")) \
    .withColumn("country", col("customer.country")) \
    .withColumn("item", explode("items")) \
    .withColumn("sku", col("item.sku")) \
    .withColumn("qty", col("item.qty")) \
    .select("order_id", "customer_id", "country", "sku", "qty")
df_nested.show(truncate=False)
Output
+--------+-----------+-------+---+---+
|order_id|customer_id|country|sku|qty|
+--------+-----------+-------+---+---+
|O1001   |C101       |USA    |P01|2  |
|O1001   |C101       |USA    |P02|1  |
|O1002   |C102       |India  |P03|1  |
+--------+-----------+-------+---+---+
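One caveat: explode drops rows whose items array is null or empty. If orders without items must survive the flattening, explode_outer keeps them with null item fields. A minimal sketch of the same flattening in a single select:
from pyspark.sql.functions import col, explode_outer

# explode_outer keeps orders that have no items (item fields become null)
df_all_orders = df_json \
    .withColumn("item", explode_outer("items")) \
    .select(
        "order_id",
        col("customer.id").alias("customer_id"),
        col("customer.country").alias("country"),
        col("item.sku").alias("sku"),
        col("item.qty").alias("qty"),
    )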
2. Working with XML in PySpark
XML is still widely used in finance, telecom, and healthcare.
PySpark reads XML through the spark-xml connector, attached at launch time with:
--packages com.databricks:spark-xml_2.12:0.17.0
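The connector has to be on the classpath before the first XML read. A minimal sketch of the two usual ways to attach it (the script name and app name are illustrative):
# Option 1: attach the connector when submitting the job
# spark-submit --packages com.databricks:spark-xml_2.12:0.17.0 orders_xml_job.py

# Option 2: request it from inside the application
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("orders-xml") \
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0") \
    .getOrCreate()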
2.1 Example XML Input
<orders>
<order>
<id>O1001</id>
<customer>C101</customer>
<amount>250</amount>
</order>
<order>
<id>O1002</id>
<customer>C102</customer>
<amount>400</amount>
</order>
</orders>
Stored in /data/orders.xml
2.2 Reading XML
df_xml = spark.read \
    .format("xml") \
    .option("rowTag", "order") \
    .load("/data/orders.xml")
df_xml.show()
Output
+-----+--------+------+
|   id|customer|amount|
+-----+--------+------+
|O1001|    C101|   250|
|O1002|    C102|   400|
+-----+--------+------+
2.3 Handling Missing XML Fields
# Replace missing <amount> values with a default of 0
df_clean = df_xml.fillna({"amount": 0})
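If optional tags are common, an explicit schema keeps missing elements as typed nulls instead of relying on inference. A minimal sketch, assuming the orders file above (the variable names are illustrative):
from pyspark.sql.types import StructType, StructField, StringType, LongType

order_schema = StructType([
    StructField("id", StringType(), True),
    StructField("customer", StringType(), True),
    StructField("amount", LongType(), True),   # missing <amount> tags become null
])

df_xml_typed = spark.read \
    .format("xml") \
    .option("rowTag", "order") \
    .schema(order_schema) \
    .load("/data/orders.xml")

df_clean_typed = df_xml_typed.fillna({"amount": 0})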
3. Working with Avro in PySpark
Avro is optimized for:
- Event streaming
- Data pipelines
- Schema evolution
- Kafka integration
PySpark supports Avro via:
--packages org.apache.spark:spark-avro_2.12:3.4.0
3.1 Example Avro File Structure
Avro Schema:
{
  "type": "record",
  "name": "CustomerEvent",
  "fields": [
    {"name": "user", "type": "string"},
    {"name": "event", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
Stored at: /events/customer_events.avro
3.2 Reading Avro
df_avro = spark.read \
    .format("avro") \
    .load("/events/customer_events.avro")
df_avro.show()
Output Example
+-----+--------+-------------+
| user|   event|    timestamp|
+-----+--------+-------------+
|U1001|   click|1700000000000|
|U1002|purchase|1700000012000|
+-----+--------+-------------+
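Because each Avro file embeds the schema it was written with, you can also hand the reader an explicit schema so that files written by older producers are resolved against the current record definition. A minimal sketch, reusing the CustomerEvent schema above as a JSON string:
customer_event_schema = """
{
  "type": "record",
  "name": "CustomerEvent",
  "fields": [
    {"name": "user", "type": "string"},
    {"name": "event", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
"""

# avroSchema acts as the reader schema during schema resolution
df_avro_evolved = spark.read \
    .format("avro") \
    .option("avroSchema", customer_event_schema) \
    .load("/events/customer_events.avro")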
3.3 Writing Avro
df_avro.write \
    .format("avro") \
    .mode("append") \
    .save("/events/output_avro")
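For the Kafka integration mentioned above, spark-avro also provides column-level helpers: from_avro decodes a binary Avro value column, and to_avro encodes one. A minimal sketch of decoding a stream, assuming the Kafka connector package is also on the classpath; the topic name and broker address are illustrative, and the customer_event_schema string comes from the previous sketch:
from pyspark.sql.avro.functions import from_avro
from pyspark.sql.functions import col

# Read raw binary events from a (hypothetical) Kafka topic
raw = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "customer_events") \
    .load()

# Decode the Avro payload into typed columns
events = raw.select(
    from_avro(col("value"), customer_event_schema).alias("event")
).select("event.*")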
4. Comparing JSON, XML, and Avro
| Format | Best For | Pros | Cons |
|---|---|---|---|
| JSON | API data, logs, REST services | Human-readable, flexible | Larger size, verbose |
| XML | Legacy systems | Supports attributes, schemas | Verbose, slower |
| Avro | Kafka, ETL pipelines | Compact, binary, schema evolution | Not human-readable |
5. Best Practices for Semi-Structured Data in PySpark
✔ Always specify the schema explicitly for large files instead of relying on inference
✔ Use explode for array columns and dot notation (e.g. col("customer.id")) for struct fields
✔ Store final processed data in Parquet or Delta (see the sketch after this list)
✔ Use Avro for streaming pipelines
✔ Validate XML against an XSD schema in regulated domains such as finance and healthcare
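As a small illustration of the storage recommendation, a minimal sketch that writes the flattened orders from section 1.3 to Parquet, partitioned by country (the output path is illustrative):
# Persist the curated, flattened orders in a columnar format
df_nested.write \
    .mode("overwrite") \
    .partitionBy("country") \
    .parquet("/data/curated/orders")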
Summary
In this chapter, you learned how to work with the most common semi-structured formats:
🟦 JSON
- Parse nested fields
- Handle arrays with explode
🟩 XML
- Use spark-xml
- Manage hierarchical structures
🟧 Avro
- Read/write optimized binary formats
- Ideal for streaming & ETL
Semi-structured data is everywhere — with PySpark, you can manage it at massive scale, reliably and efficiently.
Next Topic → ETL Pipelines in PySpark — End-to-End Example