PySpark with Snowflake, Databricks, and Hive Integration
Connecting PySpark to the Modern Data Ecosystem
At DataVerse Labs, different teams use different data stores:
- Analytics team → Snowflake
- Machine learning team → Databricks
- Data warehouse team → Hive
PySpark acts as the unifying engine across these platforms, allowing teams to exchange data seamlessly.
This chapter shows you how to connect PySpark with Snowflake, Databricks, and Hive, with clean examples and sample inputs and outputs.
1. Snowflake Integration with PySpark
Snowflake is widely used for cloud analytics and BI dashboards.
PySpark integrates using the Snowflake Spark Connector.
1.1 Installing the Snowflake Connector
Pass the Snowflake JDBC driver and the Spark connector to spark-submit or pyspark; the coordinates must match your Scala and Spark versions:
--packages net.snowflake:snowflake-jdbc:3.13.30,net.snowflake:spark-snowflake_2.12:2.12.0-spark_3.4
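If you prefer to set the packages in code rather than on the command line, the same coordinates can go on the session builder before the session is created. A minimal sketch, assuming the same connector versions as above; the app name is illustrative:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SnowflakeIntegration") \
    .config("spark.jars.packages",
            "net.snowflake:snowflake-jdbc:3.13.30,"
            "net.snowflake:spark-snowflake_2.12:2.12.0-spark_3.4") \
    .getOrCreate()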
1.2 Reading from Snowflake
Example — Loading Customer Table
# Connection options for the Snowflake Spark connector
sf_options = {
    "sfURL": "account.snowflakecomputing.com",
    "sfUser": "USER",
    "sfPassword": "PASSWORD",
    "sfDatabase": "DV_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "COMPUTE_WH"
}

# The fully qualified format name "net.snowflake.spark.snowflake" also works
df_sf = spark.read \
    .format("snowflake") \
    .options(**sf_options) \
    .option("dbtable", "CUSTOMERS") \
    .load()
df_sf.show()
Output Example
+---------+----------+--------+
|cust_id |name |country |
+---------+----------+--------+
|C101 |John Doe |USA |
|C102 |Maria Lee |Canada |
|C103 |Ishan Rao |India |
+---------+----------+--------+
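If you only need a slice of the table, the connector also accepts a query option instead of dbtable, which pushes the work down to Snowflake. A minimal sketch; the query itself is illustrative:
df_usa = spark.read \
    .format("snowflake") \
    .options(**sf_options) \
    .option("query", "SELECT cust_id, name FROM CUSTOMERS WHERE country = 'USA'") \
    .load()
df_usa.show()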
1.3 Writing to Snowflake
df_sf.write \
    .format("snowflake") \
    .options(**sf_options) \
    .option("dbtable", "CUSTOMERS_BACKUP") \
    .mode("overwrite") \
    .save()
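In a real pipeline you would not hard-code credentials as in the examples above. A minimal sketch that reads them from environment variables instead; the variable names are illustrative:
import os

sf_options = {
    "sfURL": os.environ["SNOWFLAKE_URL"],            # illustrative variable names
    "sfUser": os.environ["SNOWFLAKE_USER"],
    "sfPassword": os.environ["SNOWFLAKE_PASSWORD"],
    "sfDatabase": "DV_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "COMPUTE_WH"
}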
2. Databricks Integration with PySpark
Databricks is a managed platform for Spark with built-in:
- Optimized runtimes
- MLflow
- Delta Lake
- Collaborative notebooks
You integrate PySpark with Databricks using:
✔ Databricks Connect ✔ DBFS data access ✔ Delta Lake
2.1 Databricks Connect Setup
Databricks Connect allows you to run PySpark from your laptop and execute on a remote cluster.
pip install databricks-connect==14.0.*
2.2 Configure
With Databricks Connect 13 and above (including the 14.0 release installed above), the connection is configured through a Databricks configuration profile or environment variables such as DATABRICKS_HOST, DATABRICKS_TOKEN, and DATABRICKS_CLUSTER_ID; the databricks-connect configure wizard applies only to the legacy releases (Databricks Runtime 12.2 LTS and below). In either case, you provide:
- Workspace URL
- Personal Access Token
- Cluster ID
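Alternatively, the connection details can be passed explicitly when the session is built. A minimal sketch, assuming Databricks Connect 13+; the host, token, and cluster ID are placeholders:
from databricks.connect import DatabricksSession

# Placeholders -- substitute your workspace URL, personal access token, and cluster ID
spark = DatabricksSession.builder.remote(
    host="https://<workspace-url>",
    token="<personal-access-token>",
    cluster_id="<cluster-id>"
).getOrCreate()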
2.3 Using PySpark with Databricks Connect
from databricks.connect import DatabricksSession

# Picks up the workspace URL, token, and cluster ID configured in step 2.2
spark = DatabricksSession.builder.getOrCreate()
df = spark.read.format("delta").load("/mnt/datalake/customers")
df.show()
Output Example
+---------+-----------+----------+
|cust_id |age |is_active |
+---------+-----------+----------+
|C1 |32 |true |
|C2 |41 |false |
|C3 |29 |true |
+---------+-----------+----------+
2.4 Writing to Delta Lake
df.write.format("delta") \
    .mode("append") \
    .save("/mnt/datalake/customers_new")
3. Hive Integration with PySpark
Hive is a core warehouse in many enterprise systems.
PySpark connects to Hive using the Hive Metastore, allowing SQL queries and table management.
3.1 Enable Hive Support
spark = SparkSession.builder \
    .appName("HiveIntegration") \
    .enableHiveSupport() \
    .getOrCreate()
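Depending on how your cluster is set up, you may also need to point the session at the Hive warehouse directory or an external metastore. A minimal sketch; the path and URI are placeholders:
spark = SparkSession.builder \
    .appName("HiveIntegration") \
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
    .config("hive.metastore.uris", "thrift://metastore-host:9083") \
    .enableHiveSupport() \
    .getOrCreate()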
3.2 Reading a Hive Table
df_hive = spark.sql("SELECT * FROM dv_db.customers")
df_hive.show()
Output Example
+---------+-------------+--------+
|cust_id |city |spend |
+---------+-------------+--------+
|C101 |New York |5600 |
|C102 |Toronto |3200 |
|C103 |Bangalore |4500 |
+---------+-------------+--------+
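The same table is also available through the DataFrame API, which is convenient when you want to chain transformations; columns as in the output above:
df_hive = spark.table("dv_db.customers")
df_hive.groupBy("city").sum("spend").show()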
3.3 Writing Data to Hive
df_hive.write.saveAsTable("dv_db.customers_backup")
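For larger tables you would typically choose a write mode, a columnar format, and a partition column explicitly. A sketch; the target table name and partition column are illustrative:
df_hive.write \
    .mode("overwrite") \
    .format("parquet") \
    .partitionBy("city") \
    .saveAsTable("dv_db.customers_by_city")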
3.4 Creating External Hive Tables
CREATE EXTERNAL TABLE dv_db.logs_raw (
    log STRING
)
LOCATION '/data/logs/raw';
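Because Hive support is enabled on the session, the same DDL can be run directly from PySpark:
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS dv_db.logs_raw (
        log STRING
    )
    LOCATION '/data/logs/raw'
""")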
4. When to Use What?
| Platform | Best For |
|---|---|
| Snowflake | BI analytics, dashboards, cost-efficient warehousing |
| Databricks | Machine learning, Delta Lake, advanced ETL, scalable compute |
| Hive | Legacy warehouses, Hadoop ecosystems, batch pipelines |
5. Best Practices
Snowflake
✔ Use AUTOCOMMIT=OFF for batch jobs
✔ Prefer COPY INTO for large writes
Databricks
✔ Store all data in Delta Lake ✔ Enable auto-optimize and Z-ordering (see the sketch after this list)
Hive
✔ Partition large tables ✔ Use ORC/Parquet formats for storage efficiency
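On Databricks, the Delta recommendations above usually translate into table properties plus OPTIMIZE ... ZORDER BY. A sketch; the table and column names are illustrative, and the property names assume a Databricks runtime with Delta Lake:
# Enable optimized writes and auto compaction on a Delta table (Databricks)
spark.sql("""
    ALTER TABLE dv_db.customers_delta SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")

# Co-locate frequently filtered columns
spark.sql("OPTIMIZE dv_db.customers_delta ZORDER BY (cust_id)")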
Summary
In this chapter, you learned how PySpark integrates seamlessly with:
🔷 Snowflake
- Read/write tables
- Use Snowflake connector
🔷 Databricks
- Connect via Databricks Connect
- Read/write Delta Lake
🔷 Hive
- Enable Hive support
- Query and manage Hive tables
These integrations help enterprises like DataVerse Labs build scalable, multi-platform data pipelines.
Next Topic → Handling Semi-Structured Data (JSON, XML, Avro)