Handling Semi-Structured Data in PySpark — JSON, XML, Avro

At DataVerse Labs, the engineering teams received data from multiple external partners:

  • Marketing APIs → JSON
  • Legacy systems → XML
  • Event platforms → Avro

Semi-structured data is everywhere — flexible, schema-lite, and perfect for high-velocity environments.

PySpark provides first-class support to parse, transform, and store these formats at scale.


1. Working with JSON in PySpark

JSON is the most common semi-structured format for APIs and event streams.


1.1 Example JSON Input

{"order_id": "O1001", "customer": {"id": "C101", "country": "USA"}, "items": [{"sku": "P01", "qty": 2}, {"sku": "P02", "qty": 1}]}
{"order_id": "O1002", "customer": {"id": "C102", "country": "India"}, "items": [{"sku": "P03", "qty": 1}]}

Stored in: /data/orders.json


1.2 Reading JSON

df_json = spark.read.json("/data/orders.json")
df_json.printSchema()
df_json.show(truncate=False)

Output

root
 |-- customer: struct (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- id: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- qty: long (nullable = true)
 |    |    |-- sku: string (nullable = true)
 |-- order_id: string (nullable = true)

+-------------+--------------------+--------+
|customer     |items               |order_id|
+-------------+--------------------+--------+
|{USA, C101}  |[{2, P01}, {1, P02}]|O1001   |
|{India, C102}|[{1, P03}]          |O1002   |
+-------------+--------------------+--------+
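
Schema inference costs an extra pass over the data. For large inputs it is safer to declare the schema up front; the sketch below simply mirrors the structure Spark inferred above (orders_schema is an illustrative name):

from pyspark.sql.types import (
    StructType, StructField, StringType, LongType, ArrayType
)

# Explicit schema for the orders JSON; skips inference entirely.
orders_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer", StructType([
        StructField("id", StringType()),
        StructField("country", StringType()),
    ])),
    StructField("items", ArrayType(StructType([
        StructField("sku", StringType()),
        StructField("qty", LongType()),
    ]))),
])

df_json = spark.read.schema(orders_schema).json("/data/orders.json")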

1.3 Extracting Nested Fields

from pyspark.sql.functions import col, explode

df_nested = df_json \
    .withColumn("customer_id", col("customer.id")) \
    .withColumn("country", col("customer.country")) \
    .withColumn("item", explode("items")) \
    .withColumn("sku", col("item.sku")) \
    .withColumn("qty", col("item.qty")) \
    .drop("item", "customer", "items")

df_nested.show(truncate=False)

Output

+--------+-----------+-------+---+---+
|order_id|customer_id|country|sku|qty|
+--------+-----------+-------+---+---+
|O1001   |C101       |USA    |P01|2  |
|O1001   |C101       |USA    |P02|1  |
|O1002   |C102       |India  |P03|1  |
+--------+-----------+-------+---+---+
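
Once the nested structure is flattened, ordinary DataFrame operations apply. As an illustrative follow-up, the total quantity per order:

from pyspark.sql import functions as F

# Total quantity ordered per order, computed from the flattened DataFrame.
df_nested.groupBy("order_id") \
    .agg(F.sum("qty").alias("total_qty")) \
    .show()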

2. Working with XML in PySpark

XML is still widely used in finance, telecom, and healthcare.

PySpark reads XML through the spark-xml connector, typically added with --packages (a sketch of the setup follows the coordinates below):

--packages com.databricks:spark-xml_2.12:0.17.0
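
A minimal sketch of attaching the connector when the session is created (the app name is illustrative; the coordinates repeat the ones above, so check the Scala suffix and version against your cluster):

from pyspark.sql import SparkSession

# spark.jars.packages pulls the connector from Maven when the session starts;
# the same effect comes from passing --packages to pyspark or spark-submit.
spark = SparkSession.builder \
    .appName("semi-structured-demo") \
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0") \
    .getOrCreate()

The same mechanism works for the spark-avro package used in section 3.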

2.1 Example XML Input

<orders>
  <order>
    <id>O1001</id>
    <customer>C101</customer>
    <amount>250</amount>
  </order>
  <order>
    <id>O1002</id>
    <customer>C102</customer>
    <amount>400</amount>
  </order>
</orders>

Stored in: /data/orders.xml


2.2 Reading XML

# rowTag selects the XML element that becomes one DataFrame row;
# rootTag only matters when writing XML back out.
df_xml = spark.read \
    .format("xml") \
    .option("rootTag", "orders") \
    .option("rowTag", "order") \
    .load("/data/orders.xml")

df_xml.show()

Output

+-----+--------+------+
|   id|customer|amount|
+-----+--------+------+
|O1001|    C101|   250|
|O1002|    C102|   400|
+-----+--------+------+

2.3 Handling Missing XML Fields

If some <order> elements omit <amount>, the parsed column comes back as null. Fill those gaps with a default value:

df_clean = df_xml.fillna({"amount": 0})
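
A related pattern, sketched here with the columns from the example above, is to drop rows that are missing fields you consider mandatory before filling the optional ones:

# Drop rows missing a required id or customer, then default the amount.
df_clean = df_xml \
    .na.drop(subset=["id", "customer"]) \
    .fillna({"amount": 0})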

3. Working with Avro in PySpark

Avro is optimized for:

  • Event streaming
  • Data pipelines
  • Schema evolution
  • Kafka integration

PySpark supports Avro through the external spark-avro package (match the artifact version to your Spark release):

--packages org.apache.spark:spark-avro_2.12:3.4.0

3.1 Example Avro File Structure

Avro Schema:

{
  "type": "record",
  "name": "CustomerEvent",
  "fields": [
    {"name": "user", "type": "string"},
    {"name": "event", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}

Stored at: /events/customer_events.avro


3.2 Reading Avro

df_avro = spark.read \
    .format("avro") \
    .load("/events/customer_events.avro")

df_avro.show()

Output Example

+-----+--------+-------------+
| user|   event|    timestamp|
+-----+--------+-------------+
|U1001|   click|1700000000000|
|U1002|purchase|1700000012000|
+-----+--------+-------------+
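
The timestamp field arrives as epoch milliseconds (a long). A common follow-up, sketched here with illustrative column names, is to turn it into a proper timestamp:

from pyspark.sql.functions import col

# Epoch milliseconds -> seconds -> Spark timestamp.
df_events = df_avro.withColumn(
    "event_time", (col("timestamp") / 1000).cast("timestamp")
)

df_events.select("user", "event", "event_time").show(truncate=False)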

3.3 Writing Avro

df_avro.write \
    .format("avro") \
    .mode("append") \
    .save("/events/output_avro")
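
Since Kafka integration is where Avro shows up most often, here is a minimal streaming sketch using the from_avro function that ships with spark-avro. The broker address and topic name are illustrative assumptions, and the Kafka source additionally needs the spark-sql-kafka connector on the classpath:

from pyspark.sql.avro.functions import from_avro
from pyspark.sql.functions import col

# The same Avro schema as above, passed to from_avro as a JSON string.
customer_event_schema = """
{
  "type": "record",
  "name": "CustomerEvent",
  "fields": [
    {"name": "user", "type": "string"},
    {"name": "event", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
"""

# Read raw Kafka records (binary key/value) and decode the Avro-encoded value.
events = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "customer_events") \
    .load() \
    .select(from_avro(col("value"), customer_event_schema).alias("e")) \
    .select("e.*")

Note that from_avro decodes plain Avro payloads; messages framed by Confluent's Schema Registry carry extra header bytes and need registry-aware handling.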

4. Comparing JSON, XML, and Avro

Format | Best For                      | Pros                              | Cons
------ | ----------------------------- | --------------------------------- | --------------------
JSON   | API data, logs, REST services | Human-readable, flexible          | Larger size, verbose
XML    | Legacy systems                | Supports attributes, schemas      | Verbose, slower
Avro   | Kafka, ETL pipelines          | Compact, binary, schema evolution | Not human-readable

5. Best Practices for Semi-Structured Data in PySpark

✔ Always specify the schema explicitly for large files
✔ Use explode and struct access for nested fields
✔ Store final processed data in Parquet or Delta (see the sketch below)
✔ Use Avro for streaming pipelines
✔ Validate XML against an XSD in regulated domains (finance/healthcare)
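
As a minimal sketch of the Parquet recommendation, the flattened orders DataFrame from section 1.3 could be persisted like this (the output path and partition column are illustrative choices):

# Write the flattened orders as Parquet, partitioned by country.
df_nested.write \
    .mode("overwrite") \
    .partitionBy("country") \
    .parquet("/warehouse/orders_flat")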


Summary

In this chapter, you learned how to work with the most common semi-structured formats:

🟦 JSON

  • Parse nested fields
  • Handle arrays with explode

🟩 XML

  • Use spark-xml
  • Manage hierarchical structures

🟧 Avro

  • Read/write optimized binary formats
  • Ideal for streaming & ETL

Semi-structured data is everywhere — with PySpark, you can manage it at massive scale, reliably and efficiently.


Next Topic → ETL Pipelines in PySpark — End-to-End Example