PySpark with Snowflake, Databricks, and Hive Integration

Connecting PySpark to the Modern Data Ecosystem

At DataVerse Labs, different teams use different data stores:

  • Analytics team → Snowflake
  • Machine learning team → Databricks
  • Data warehouse team → Hive

PySpark acts as the unifying engine across these platforms, allowing teams to exchange data seamlessly.

This chapter shows you how to connect PySpark with Snowflake, Databricks, and Hive, with clean examples and real input/output.


1. Snowflake Integration with PySpark

Snowflake is widely used for cloud analytics and BI dashboards.
PySpark integrates with it through the Snowflake Spark Connector, which works alongside the Snowflake JDBC driver.


1.1 Installing the Snowflake Connector

Launch PySpark (or spark-submit) with the connector and its matching JDBC driver on the classpath:

pyspark --packages net.snowflake:snowflake-jdbc:3.13.30,net.snowflake:spark-snowflake_2.12:2.12.0-spark_3.4
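
Alternatively, the same package coordinates can be supplied programmatically through the spark.jars.packages configuration when the session is created in a script. A minimal sketch, assuming the connector versions shown above:

from pyspark.sql import SparkSession

# Create a session that pulls the Snowflake connector and JDBC driver
# from Maven at startup (equivalent to the --packages flag above)
spark = SparkSession.builder \
    .appName("SnowflakeIntegration") \
    .config(
        "spark.jars.packages",
        "net.snowflake:snowflake-jdbc:3.13.30,"
        "net.snowflake:spark-snowflake_2.12:2.12.0-spark_3.4"
    ) \
    .getOrCreate()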

1.2 Reading from Snowflake

Example — Loading Customer Table

sf_options = {
    "sfURL": "account.snowflakecomputing.com",
    "sfUser": "USER",
    "sfPassword": "PASSWORD",
    "sfDatabase": "DV_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "COMPUTE_WH"
}

df_sf = spark.read \
    .format("snowflake") \
    .options(**sf_options) \
    .option("dbtable", "CUSTOMERS") \
    .load()

df_sf.show()

Output Example

+-------+---------+-------+
|cust_id|     name|country|
+-------+---------+-------+
|   C101| John Doe|    USA|
|   C102|Maria Lee| Canada|
|   C103|Ishan Rao|  India|
+-------+---------+-------+
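
The connector also accepts a SQL query in place of a table name, which pushes filtering down to Snowflake instead of pulling the whole table into Spark. A short sketch reusing the same sf_options (the query itself is illustrative):

# Push the filter down to Snowflake; only matching rows cross the wire
df_us = spark.read \
    .format("snowflake") \
    .options(**sf_options) \
    .option("query", "SELECT cust_id, name FROM CUSTOMERS WHERE country = 'USA'") \
    .load()

df_us.show()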

1.3 Writing to Snowflake

df_sf.write \
    .format("snowflake") \
    .options(**sf_options) \
    .option("dbtable", "CUSTOMERS_BACKUP") \
    .mode("overwrite") \
    .save()
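
When the target table already contains data, mode("append") adds rows instead of replacing them. For example, backing up only the Indian customers (an illustrative filter on the frame loaded earlier):

# Append a filtered subset to the backup table without touching existing rows
df_sf.filter(df_sf.country == "India") \
    .write \
    .format("snowflake") \
    .options(**sf_options) \
    .option("dbtable", "CUSTOMERS_BACKUP") \
    .mode("append") \
    .save()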

2. Databricks Integration with PySpark

Databricks is a managed platform for Spark with built-in:

  • Optimized runtimes
  • MLflow
  • Delta Lake
  • Collaborative notebooks

You integrate PySpark with Databricks using:

  • Databricks Connect
  • DBFS data access
  • Delta Lake


2.1 Databricks Connect Setup

Databricks Connect lets you write PySpark code on your local machine and execute it on a remote Databricks cluster.

pip install databricks-connect==14.0.*

2.2 Configuring Databricks Connect

databricks-connect configure

You provide:

  • Workspace URL
  • Personal Access Token
  • Cluster ID
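
With newer Databricks Connect releases (13.x and later, including the 14.0 line pinned above), the same three values can also be supplied in code through DatabricksSession, or picked up from environment variables or a Databricks configuration profile. A minimal sketch with placeholder values:

from databricks.connect import DatabricksSession

# Build a Spark session that executes on the remote Databricks cluster;
# host, token, and cluster ID below are placeholders
spark = DatabricksSession.builder.remote(
    host="https://<workspace-url>",
    token="<personal-access-token>",
    cluster_id="<cluster-id>"
).getOrCreate()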

2.3 Using PySpark with Databricks Connect

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Local-to-Databricks") \
    .getOrCreate()

df = spark.read.format("delta").load("/mnt/datalake/customers")
df.show()

Output Example

+-------+---+---------+
|cust_id|age|is_active|
+-------+---+---------+
|     C1| 32|     true|
|     C2| 41|    false|
|     C3| 29|     true|
+-------+---+---------+
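
Everything written locally against this session runs on the remote cluster. For instance, a quick aggregation over the columns shown above:

# Count active vs. inactive customers; the work executes on the cluster
df.groupBy("is_active").count().show()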

2.4 Writing to Delta Lake

df.write.format("delta") \
    .mode("append") \
    .save("/mnt/datalake/customers_new")

3. Hive Integration with PySpark

Hive is a core warehouse in many enterprise systems.

PySpark connects to Hive using the Hive Metastore, allowing SQL queries and table management.


3.1 Enable Hive Support

spark = SparkSession.builder \
    .appName("HiveIntegration") \
    .enableHiveSupport() \
    .getOrCreate()
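
If the metastore runs on another host, point Spark at it when building the session. A sketch with placeholder values (hive.metastore.uris and spark.sql.warehouse.dir are standard settings):

from pyspark.sql import SparkSession

# Connect to a remote Hive metastore; the URI and warehouse path are placeholders
spark = SparkSession.builder \
    .appName("HiveIntegration") \
    .config("hive.metastore.uris", "thrift://metastore-host:9083") \
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
    .enableHiveSupport() \
    .getOrCreate()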

3.2 Reading a Hive Table

df_hive = spark.sql("SELECT * FROM dv_db.customers")
df_hive.show()

Output Example

+-------+---------+-----+
|cust_id|     city|spend|
+-------+---------+-----+
|   C101| New York| 5600|
|   C102|  Toronto| 3200|
|   C103|Bangalore| 4500|
+-------+---------+-----+
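
The same table can also be read through the DataFrame API with spark.table, which is convenient when chaining transformations. For example, total spend per city:

from pyspark.sql import functions as F

# Equivalent read via the DataFrame API, followed by an aggregation
spend_by_city = spark.table("dv_db.customers") \
    .groupBy("city") \
    .agg(F.sum("spend").alias("total_spend"))

spend_by_city.show()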

3.3 Writing Data to Hive

df_hive.write.saveAsTable("dv_db.customers_backup")
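
saveAsTable fails if the table already exists, so rerunnable jobs usually set an explicit save mode:

# Overwrite the backup table on each run instead of failing when it exists
df_hive.write \
    .mode("overwrite") \
    .saveAsTable("dv_db.customers_backup")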

3.4 Creating External Hive Tables

CREATE EXTERNAL TABLE dv_db.logs_raw (
    log STRING
)
LOCATION '/data/logs/raw';
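
Because the session has Hive support enabled, this DDL can be issued directly from PySpark:

# Create the external table from PySpark; the data stays at /data/logs/raw
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS dv_db.logs_raw (
        log STRING
    )
    LOCATION '/data/logs/raw'
""")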

4. When to Use What?

Platform     Best For
Snowflake    BI analytics, dashboards, cost-efficient warehousing
Databricks   Machine learning, Delta Lake, advanced ETL, scalable compute
Hive         Legacy warehouses, Hadoop ecosystems, batch pipelines

5. Best Practices

Snowflake

✔ Use AUTOCOMMIT=OFF for batch jobs
✔ Prefer COPY INTO for large writes

Databricks

✔ Store all data in Delta Lake
✔ Enable auto-optimize and Z-ordering
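
Both settings are applied with SQL against the Delta table; a sketch, assuming the dv_db.customers_delta table created earlier:

# Enable optimized writes and automatic compaction on a Delta table
spark.sql("""
    ALTER TABLE dv_db.customers_delta SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")

# Co-locate rows with similar cust_id values to speed up selective reads
spark.sql("OPTIMIZE dv_db.customers_delta ZORDER BY (cust_id)")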

Hive

✔ Partition large tables
✔ Use ORC/Parquet formats for storage efficiency
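
Both practices can be combined in a single write. A sketch using the customers frame from section 3.2, partitioned by city:

# Write a partitioned, Parquet-backed Hive table (one directory per city)
df_hive.write \
    .mode("overwrite") \
    .format("parquet") \
    .partitionBy("city") \
    .saveAsTable("dv_db.customers_by_city")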


Summary

In this chapter, you learned how PySpark integrates seamlessly with:

🔷 Snowflake

  • Read/write tables
  • Use Snowflake connector

🔷 Databricks

  • Connect via Databricks Connect
  • Read/write Delta Lake

🔷 Hive

  • Enable Hive support
  • Query and manage Hive tables

These integrations help enterprises like DataVerse Labs build scalable, multi-platform data pipelines.


Next Topic → Handling Semi-Structured Data (JSON, XML, Avro)