PySpark with Snowflake, Databricks, and Hive Integration
Connecting PySpark to the Modern Data Ecosystem
At DataVerse Labs, different teams use different data stores:
- Analytics team → Snowflake
- Machine learning team → Databricks
- Data warehouse team → Hive
PySpark acts as the unifying engine across these platforms, allowing teams to exchange data seamlessly.
This chapter shows you how to connect PySpark with Snowflake, Databricks, and Hive, with clean examples and sample inputs and outputs.
1. Snowflake Integration with PySpark
Snowflake is widely used for cloud analytics and BI dashboards.
PySpark integrates using the Snowflake Spark Connector.
1.1 Installing the Snowflake Connector
Pass the Snowflake JDBC driver and the Spark connector to spark-submit or pyspark; the coordinates must match your Scala and Spark versions:
--packages net.snowflake:snowflake-jdbc:3.13.30,net.snowflake:spark-snowflake_2.12:2.12.0-spark_3.4
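If you prefer to set the packages in code rather than on the command line, the same coordinates can go on the session builder before the session is created. A minimal sketch, assuming the same connector versions as above; the app name is illustrative:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SnowflakeIntegration") \
    .config("spark.jars.packages",
            "net.snowflake:snowflake-jdbc:3.13.30,"
            "net.snowflake:spark-snowflake_2.12:2.12.0-spark_3.4") \
    .getOrCreate()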
1.2 Reading from Snowflake
Example — Loading Customer Table
# Connection options for the Snowflake Spark connector
sf_options = {
    "sfURL": "account.snowflakecomputing.com",
    "sfUser": "USER",
    "sfPassword": "PASSWORD",
    "sfDatabase": "DV_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "COMPUTE_WH"
}

# The fully qualified format name "net.snowflake.spark.snowflake" also works
df_sf = spark.read \
    .format("snowflake") \
    .options(**sf_options) \
    .option("dbtable", "CUSTOMERS") \
    .load()
df_sf.show()
Output Example
+---------+----------+--------+
|cust_id |name |country |
+---------+----------+--------+
|C101 |John Doe |USA |
|C102 |Maria Lee |Canada |
|C103 |Ishan Rao |India |
+---------+----------+--------+
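If you only need a slice of the table, the connector also accepts a query option instead of dbtable, which pushes the work down to Snowflake. A minimal sketch; the query itself is illustrative:
df_usa = spark.read \
    .format("snowflake") \
    .options(**sf_options) \
    .option("query", "SELECT cust_id, name FROM CUSTOMERS WHERE country = 'USA'") \
    .load()
df_usa.show()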
1.3 Writing to Snowflake
df_sf.write \
    .format("snowflake") \
    .options(**sf_options) \
    .option("dbtable", "CUSTOMERS_BACKUP") \
    .mode("overwrite") \
    .save()
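In a real pipeline you would not hard-code credentials as in the examples above. A minimal sketch that reads them from environment variables instead; the variable names are illustrative:
import os

sf_options = {
    "sfURL": os.environ["SNOWFLAKE_URL"],            # illustrative variable names
    "sfUser": os.environ["SNOWFLAKE_USER"],
    "sfPassword": os.environ["SNOWFLAKE_PASSWORD"],
    "sfDatabase": "DV_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "COMPUTE_WH"
}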
2. Databricks Integration with PySpark
Databricks is a managed platform for Spark with built-in:
- Optimized runtimes
- MLflow
- Delta Lake
- Collaborative notebooks
You integrate PySpark with Databricks using:
✔ Databricks Connect ✔ DBFS data access ✔ Delta Lake
2.1 Databricks Connect Setup
Databricks Connect allows you to run PySpark from your laptop and execute on a remote cluster.
pip install databricks-connect==14.0.*
2.2 Configure
With Databricks Connect 13 and above (including the 14.0 release installed above), the connection is configured through a Databricks configuration profile or environment variables such as DATABRICKS_HOST, DATABRICKS_TOKEN, and DATABRICKS_CLUSTER_ID; the databricks-connect configure wizard applies only to the legacy releases (Databricks Runtime 12.2 LTS and below). In either case, you provide:
- Workspace URL
- Personal Access Token
- Cluster ID
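Alternatively, the connection details can be passed explicitly when the session is built. A minimal sketch, assuming Databricks Connect 13+; the host, token, and cluster ID are placeholders:
from databricks.connect import DatabricksSession

# Placeholders -- substitute your workspace URL, personal access token, and cluster ID
spark = DatabricksSession.builder.remote(
    host="https://<workspace-url>",
    token="<personal-access-token>",
    cluster_id="<cluster-id>"
).getOrCreate()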
2.3 Using PySpark with Databricks Connect
from databricks.connect import DatabricksSession

# Picks up the workspace URL, token, and cluster ID configured in step 2.2
spark = DatabricksSession.builder.getOrCreate()
df = spark.read.format("delta").load("/mnt/datalake/customers")
df.show()
Output Example
+---------+-----------+----------+
|cust_id |age |is_active |
+---------+-----------+----------+
|C1 |32 |true |
|C2 |41 |false |
|C3 |29 |true |
+---------+-----------+----------+
2.4 Writing to Delta Lake
df.write.format("delta") \
    .mode("append") \
    .save("/mnt/datalake/customers_new")
3. Hive Integration with PySpark
Hive is a core warehouse in many enterprise systems.
PySpark connects to Hive using the Hive Metastore, allowing SQL queries and table management.
3.1 Enable Hive Support
spark = SparkSession.builder \
    .appName("HiveIntegration") \
    .enableHiveSupport() \
    .getOrCreate()
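Depending on how your cluster is set up, you may also need to point the session at the Hive warehouse directory or an external metastore. A minimal sketch; the path and URI are placeholders:
spark = SparkSession.builder \
    .appName("HiveIntegration") \
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
    .config("hive.metastore.uris", "thrift://metastore-host:9083") \
    .enableHiveSupport() \
    .getOrCreate()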
3.2 Reading a Hive Table
df_hive = spark.sql("SELECT * FROM dv_db.customers")
df_hive.show()
Output Example
+---------+-------------+--------+
|cust_id |city |spend |
+---------+-------------+--------+
|C101 |New York |5600 |
|C102 |Toronto |3200 |
|C103 |Bangalore |4500 |
+---------+-------------+--------+
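The same table is also available through the DataFrame API, which is convenient when you want to chain transformations; columns as in the output above:
df_hive = spark.table("dv_db.customers")
df_hive.groupBy("city").sum("spend").show()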
3.3 Writing Data to Hive
df_hive.write.saveAsTable("dv_db.customers_backup")
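For larger tables you would typically choose a write mode, a columnar format, and a partition column explicitly. A sketch; the target table name and partition column are illustrative:
df_hive.write \
    .mode("overwrite") \
    .format("parquet") \
    .partitionBy("city") \
    .saveAsTable("dv_db.customers_by_city")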
3.4 Creating External Hive Tables
CREATE EXTERNAL TABLE dv_db.logs_raw (
    log STRING
)
LOCATION '/data/logs/raw';
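Because Hive support is enabled on the session, the same DDL can be run directly from PySpark:
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS dv_db.logs_raw (
        log STRING
    )
    LOCATION '/data/logs/raw'
""")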
4. When to Use What?
| Platform | Best For |
|---|---|
| Snowflake | BI analytics, dashboards, cost-efficient warehousing |
| Databricks | Machine learning, Delta Lake, advanced ETL, scalable compute |
| Hive | Legacy warehouses, Hadoop ecosystems, batch pipelines |
5. Best Practices
Snowflake
✔ Use AUTOCOMMIT=OFF for batch jobs
✔ Prefer COPY INTO for large writes
Databricks
✔ Store all data in Delta Lake ✔ Enable auto-optimize and Z-ordering (see the sketch after this list)
Hive
✔ Partition large tables ✔ Use ORC/Parquet formats for storage efficiency
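On Databricks, the Delta recommendations above usually translate into table properties plus OPTIMIZE ... ZORDER BY. A sketch; the table and column names are illustrative, and the property names assume a Databricks runtime with Delta Lake:
# Enable optimized writes and auto compaction on a Delta table (Databricks)
spark.sql("""
    ALTER TABLE dv_db.customers_delta SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")

# Co-locate frequently filtered columns
spark.sql("OPTIMIZE dv_db.customers_delta ZORDER BY (cust_id)")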
Summary
In this chapter, you learned how PySpark integrates seamlessly with:
🔷 Snowflake
- Read/write tables
- Use Snowflake connector
🔷 Databricks
- Connect via Databricks Connect
- Read/write Delta Lake
🔷 Hive
- Enable Hive support
- Query and manage Hive tables
These integrations help enterprises like DataVerse Labs build scalable, multi-platform data pipelines.
Next Topic → Handling Semi-Structured Data (JSON, XML, Avro)