Handling Semi-Structured Data in PySpark — JSON, XML, Avro

At DataVerse Labs, the engineering teams received data from multiple external partners:

  • Marketing APIs → JSON
  • Legacy systems → XML
  • Event platforms → Avro

Semi-structured data is everywhere — flexible, schema-lite, and perfect for high-velocity environments.

PySpark provides first-class support to parse, transform, and store these formats at scale.


1. Working with JSON in PySpark

JSON is the most common semi-structured format for APIs and event streams.


1.1 Example JSON Input

{"order_id": "O1001", "customer": {"id": "C101", "country": "USA"}, "items": [{"sku": "P01", "qty": 2}, {"sku": "P02", "qty": 1}]}
{"order_id": "O1002", "customer": {"id": "C102", "country": "India"}, "items": [{"sku": "P03", "qty": 1}]}

Stored in: /data/orders.json


1.2 Reading JSON

df_json = spark.read.json("/data/orders.json")
df_json.printSchema()
df_json.show(truncate=False)

Output

root
 |-- customer: struct (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- id: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- qty: long (nullable = true)
 |    |    |-- sku: string (nullable = true)
 |-- order_id: string (nullable = true)

+-------------+--------------------+--------+
|customer     |items               |order_id|
+-------------+--------------------+--------+
|{USA, C101}  |[{2, P01}, {1, P02}]|O1001   |
|{India, C102}|[{1, P03}]          |O1002   |
+-------------+--------------------+--------+
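
Schema inference costs an extra pass over the data. For large inputs it is safer to declare the schema up front; the sketch below simply mirrors the structure Spark inferred above (orders_schema is an illustrative name):

from pyspark.sql.types import (
    StructType, StructField, StringType, LongType, ArrayType
)

# Explicit schema for the orders JSON; skips inference entirely.
orders_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer", StructType([
        StructField("id", StringType()),
        StructField("country", StringType()),
    ])),
    StructField("items", ArrayType(StructType([
        StructField("sku", StringType()),
        StructField("qty", LongType()),
    ]))),
])

df_json = spark.read.schema(orders_schema).json("/data/orders.json")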

1.3 Extracting Nested Fields

from pyspark.sql.functions import col, explode

df_nested = df_json \
    .withColumn("customer_id", col("customer.id")) \
    .withColumn("country", col("customer.country")) \
    .withColumn("item", explode("items")) \
    .withColumn("sku", col("item.sku")) \
    .withColumn("qty", col("item.qty")) \
    .drop("item", "customer", "items")

df_nested.show(truncate=False)

Output

+--------+-----------+-------+---+---+
|order_id|customer_id|country|sku|qty|
+--------+-----------+-------+---+---+
|O1001   |C101       |USA    |P01|2  |
|O1001   |C101       |USA    |P02|1  |
|O1002   |C102       |India  |P03|1  |
+--------+-----------+-------+---+---+
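
Once the nested structure is flattened, ordinary DataFrame operations apply. As an illustrative follow-up, the total quantity per order:

from pyspark.sql import functions as F

# Total quantity ordered per order, computed from the flattened DataFrame.
df_nested.groupBy("order_id") \
    .agg(F.sum("qty").alias("total_qty")) \
    .show()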

2. Working with XML in PySpark

XML is still widely used in finance, telecom, and healthcare.

PySpark reads XML through the spark-xml connector, typically added with --packages (a sketch of the setup follows the coordinates below):

--packages com.databricks:spark-xml_2.12:0.17.0
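
A minimal sketch of attaching the connector when the session is created (the app name is illustrative; the coordinates repeat the ones above, so check the Scala suffix and version against your cluster):

from pyspark.sql import SparkSession

# spark.jars.packages pulls the connector from Maven when the session starts;
# the same effect comes from passing --packages to pyspark or spark-submit.
spark = SparkSession.builder \
    .appName("semi-structured-demo") \
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0") \
    .getOrCreate()

The same mechanism works for the spark-avro package used in section 3.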

2.1 Example XML Input

<orders>
  <order>
    <id>O1001</id>
    <customer>C101</customer>
    <amount>250</amount>
  </order>
  <order>
    <id>O1002</id>
    <customer>C102</customer>
    <amount>400</amount>
  </order>
</orders>

Stored in: /data/orders.xml


2.2 Reading XML

# rowTag selects the XML element that becomes one DataFrame row;
# rootTag only matters when writing XML back out.
df_xml = spark.read \
    .format("xml") \
    .option("rootTag", "orders") \
    .option("rowTag", "order") \
    .load("/data/orders.xml")

df_xml.show()

Output

+-----+--------+------+
|   id|customer|amount|
+-----+--------+------+
|O1001|    C101|   250|
|O1002|    C102|   400|
+-----+--------+------+

2.3 Handling Missing XML Fields

If some <order> elements omit <amount>, the parsed column comes back as null. Fill those gaps with a default value:

df_clean = df_xml.fillna({"amount": 0})
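
A related pattern, sketched here with the columns from the example above, is to drop rows that are missing fields you consider mandatory before filling the optional ones:

# Drop rows missing a required id or customer, then default the amount.
df_clean = df_xml \
    .na.drop(subset=["id", "customer"]) \
    .fillna({"amount": 0})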

3. Working with Avro in PySpark

Avro is optimized for:

  • Event streaming
  • Data pipelines
  • Schema evolution
  • Kafka integration

PySpark supports Avro through the external spark-avro package (match the artifact version to your Spark release):

--packages org.apache.spark:spark-avro_2.12:3.4.0

3.1 Example Avro File Structure

Avro Schema:

{
  "type": "record",
  "name": "CustomerEvent",
  "fields": [
    {"name": "user", "type": "string"},
    {"name": "event", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}

Stored at: /events/customer_events.avro


3.2 Reading Avro

df_avro = spark.read \
    .format("avro") \
    .load("/events/customer_events.avro")

df_avro.show()

Output Example

+-----+--------+-------------+
| user|   event|    timestamp|
+-----+--------+-------------+
|U1001|   click|1700000000000|
|U1002|purchase|1700000012000|
+-----+--------+-------------+
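
The timestamp field arrives as epoch milliseconds (a long). A common follow-up, sketched here with illustrative column names, is to turn it into a proper timestamp:

from pyspark.sql.functions import col

# Epoch milliseconds -> seconds -> Spark timestamp.
df_events = df_avro.withColumn(
    "event_time", (col("timestamp") / 1000).cast("timestamp")
)

df_events.select("user", "event", "event_time").show(truncate=False)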

3.3 Writing Avro

df_avro.write \
    .format("avro") \
    .mode("append") \
    .save("/events/output_avro")
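
Since Kafka integration is where Avro shows up most often, here is a minimal streaming sketch using the from_avro function that ships with spark-avro. The broker address and topic name are illustrative assumptions, and the Kafka source additionally needs the spark-sql-kafka connector on the classpath:

from pyspark.sql.avro.functions import from_avro
from pyspark.sql.functions import col

# The same Avro schema as above, passed to from_avro as a JSON string.
customer_event_schema = """
{
  "type": "record",
  "name": "CustomerEvent",
  "fields": [
    {"name": "user", "type": "string"},
    {"name": "event", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
"""

# Read raw Kafka records (binary key/value) and decode the Avro-encoded value.
events = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "customer_events") \
    .load() \
    .select(from_avro(col("value"), customer_event_schema).alias("e")) \
    .select("e.*")

Note that from_avro decodes plain Avro payloads; messages framed by Confluent's Schema Registry carry extra header bytes and need registry-aware handling.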

4. Comparing JSON, XML, and Avro

Format | Best For                      | Pros                              | Cons
------ | ----------------------------- | --------------------------------- | --------------------
JSON   | API data, logs, REST services | Human-readable, flexible          | Larger size, verbose
XML    | Legacy systems                | Supports attributes, schemas      | Verbose, slower
Avro   | Kafka, ETL pipelines          | Compact, binary, schema evolution | Not human-readable

5. Best Practices for Semi-Structured Data in PySpark

✔ Always specify the schema explicitly for large files
✔ Use explode and struct access for nested fields
✔ Store final processed data in Parquet or Delta (see the sketch below)
✔ Use Avro for streaming pipelines
✔ Validate XML against an XSD in regulated domains (finance/healthcare)
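
As a minimal sketch of the Parquet recommendation, the flattened orders DataFrame from section 1.3 could be persisted like this (the output path and partition column are illustrative choices):

# Write the flattened orders as Parquet, partitioned by country.
df_nested.write \
    .mode("overwrite") \
    .partitionBy("country") \
    .parquet("/warehouse/orders_flat")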


Summary

In this chapter, you learned how to work with the most common semi-structured formats:

🟦 JSON

  • Parse nested fields
  • Handle arrays with explode

🟩 XML

  • Use spark-xml
  • Manage hierarchical structures

🟧 Avro

  • Read/write optimized binary formats
  • Ideal for streaming & ETL

Semi-structured data is everywhere — with PySpark, you can manage it at massive scale, reliably and efficiently.


Next Topic → ETL Pipelines in PySpark — End-to-End Example