File Formats (CSV, JSON, Parquet) — With Story and Use Cases
Welcome back to RetailCo, our fictional retail company.
Alice, the data engineer, has received multiple datasets from vendors and needs to decide the right file format to load into Snowflake.
“Choosing the right format affects performance, storage, and query efficiency,” she explains.
Let’s explore the most common file formats in Snowflake: CSV, JSON, and Parquet.
🏗️ 1️⃣ CSV — Comma-Separated Values
CSV is the most common format for tabular data:
- Simple and widely used
- Human-readable
- Works well for small to medium datasets
RetailCo Example:
- Vendor sends daily sales data in `sales_jan.csv`
- Alice stages it in the Snowflake table stage and loads it with:

```sql
COPY INTO SALES
FROM @%SALES
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1);
```
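If the same CSV layout arrives every day, the parsing options can be saved once as a named file format and reused in every COPY command. A minimal sketch, assuming a format name of `RETAILCO_CSV` (the name is illustrative, not from the original):

```sql
-- Define the CSV parsing rules once, at the schema level
CREATE OR REPLACE FILE FORMAT RETAILCO_CSV
  TYPE = CSV
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1;

-- Reuse the named format in any COPY INTO
COPY INTO SALES
FROM @%SALES
FILE_FORMAT = (FORMAT_NAME = 'RETAILCO_CSV');
```

This keeps the parsing rules in one place, so a change (say, a new delimiter) only has to be made once.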
Pros:
- Easy to generate and read
- Supported by almost all tools
Cons:
- Larger file size than binary formats
- No schema enforcement
- Slower queries for large datasets
🏗️ 2️⃣ JSON — Semi-Structured Data
JSON stores hierarchical or semi-structured data:
- Ideal for logs, events, or nested data
- Can use Snowflake’s VARIANT data type
RetailCo Example:
- Customer clickstream data arrives in `clicks.json`:
```json
{
  "customer_id": 101,
  "page": "/home",
  "timestamp": "2025-11-27T12:34:56"
}
```
Load into Snowflake:
```sql
COPY INTO CLICKSTREAM
FROM @CLICKSTREAM_STAGE
FILE_FORMAT = (TYPE = JSON);
```
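Once loaded, the JSON can be queried directly with Snowflake's path notation and casts. A hedged sketch, assuming the table was created with a single VARIANT column named `V` (the column name is an assumption, not from the original):

```sql
-- Assumes: CREATE TABLE CLICKSTREAM (V VARIANT);
SELECT
  V:customer_id::NUMBER        AS customer_id,
  V:page::STRING               AS page,
  V:"timestamp"::TIMESTAMP_NTZ AS event_ts
FROM CLICKSTREAM;
```

The `::` casts turn the loosely typed JSON values into proper columns, which is usually the first step before joining clickstream data to other tables.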
Pros:
- Handles complex nested structures
- Flexible schema
Cons:
- Larger storage size than Parquet
- Queries can be slower if heavily nested
🏗️ 3️⃣ Parquet — Columnar Binary Format
Parquet is a columnar, binary format optimized for analytics:
- Smaller storage footprint
- Faster query performance for large datasets
- Ideal for BI, analytics, and ML pipelines
RetailCo Example:
- Weekly aggregated sales stored as `sales_weekly.parquet`:

```sql
COPY INTO SALES_AGG
FROM @S3_PARQUET_STAGE
FILE_FORMAT = (TYPE = PARQUET);
```
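Because Parquet files carry their own schema, Snowflake can also map file columns to table columns by name rather than by position, which makes loads resilient to column reordering. A sketch reusing the stage and table from the example above:

```sql
COPY INTO SALES_AGG
FROM @S3_PARQUET_STAGE
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```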
Pros:
- Compressed storage
- Faster queries for analytic workloads
- Supports schema evolution
Cons:
- Not human-readable
- Requires specific tools to generate
🧠 Choosing the Right Format
| File Format | Use Case | Pros | Cons |
|---|---|---|---|
| CSV | Tabular, small datasets | Simple, widely supported | Large, no schema, slow queries |
| JSON | Semi-structured, logs/events | Handles nested data, flexible | Larger size, slower queries |
| Parquet | Large datasets, analytics | Compressed, fast queries | Not human-readable |
RetailCo Rule:
- CSV → Vendor uploads, quick ETL
- JSON → Clickstream or API data
- Parquet → Analytics tables and ML pipelines
🧩 Story Recap — RetailCo in Action
- Vendor sends CSV → Alice loads into table stage
- Clickstream JSON → stored in VARIANT column
- Weekly aggregates → saved as Parquet for analytics
Result: efficient storage, fast queries, and a workflow where each dataset lands in the format that suits it best.
🏁 Quick Summary
- CSV, JSON, Parquet are the most common Snowflake file formats
- CSV: simple tabular data
- JSON: semi-structured/nested data
- Parquet: columnar, analytics-optimized
- Best practice: choose format based on dataset size, structure, and query type
- Using the correct file format reduces storage costs, improves performance, and simplifies analytics workflows
🚀 Coming Next
👉 COPY INTO Command — Complete Guide with All Options