Creating DataFrames from CSV, JSON, Parquet & Hive Tables
Every analytics pipeline at NeoMart, our growing e-commerce platform, starts with one step: loading data into Spark.
Whether it comes from mobile apps, warehouses, partners, or machine logs, your first job as a data engineer is to convert this raw data into a DataFrame, Spark's most widely used data structure.
DataFrames provide schema, structure, column-level operations, and optimization through Catalyst.
But how you create a DataFrame depends on the file format you're working with.
Let's explore the four most common formats: CSV, JSON, Parquet, and Hive tables.
Why File Formats Matter
Not all file formats behave the same.
Some are slow but simple (CSV), others lightning fast (Parquet), and some ideal for semi-structured workloads (JSON).
Choosing the right format can easily save minutes or even hours in large-scale ETL jobs.
1. Creating DataFrames from CSV Files
CSV files are widely used but come with limitations: no embedded schema, no built-in compression, and slow parsing.
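# Treat the first row as column names and let Spark infer column types
# (inference costs an extra pass over the file).
df = spark.read \
    .option("header", True) \
    .option("inferSchema", True) \
    .csv("/mnt/data/sales.csv")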
When to Use CSV
- During initial ingestion
- When partners/vendors deliver small datasets
- For debugging and quick data inspection
Avoid for big data
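CSV parsing becomes slow as data volume increases: Spark has to parse every field of every row as text, and inferSchema adds an extra pass over the data to work out column types.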
2. Creating DataFrames from JSON Files
JSON is well suited to logs, nested attributes, and NoSQL-like structures.
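# multiLine lets a single JSON record span several lines;
# without it, Spark expects one JSON object per line.
df = spark.read \
    .option("multiLine", True) \
    .json("/mnt/data/events.json")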
Best for
- Clickstream logs
- IoT events
- User activity streams
Story Example
NeoMart's mobile app sends events like:
{
    "user": "123",
    "actions": ["view", "add_to_cart"]
}
JSON allows nested data, which Spark can parse easily.