SparkSession, SparkContext, and Configuration Basics
When you run a PySpark job, everything starts with SparkSession and SparkContext. These are the entry points to Spark’s distributed computing power, and understanding them is essential for writing efficient and scalable jobs.
1. SparkContext (sc)
SparkContext is the core connection to a Spark cluster. It allows you to:
- Submit jobs to the cluster
- Access RDDs for distributed operations
- Manage resources and cluster configuration
Analogy: SparkContext is like the conductor of your Spark orchestra, telling each executor what to play.
Example:
from pyspark import SparkContext
sc = SparkContext("local[*]", "MyApp")
# parallelize() distributes the list across the workers; sum() is an action that triggers the computation
print(sc.parallelize([1, 2, 3, 4]).sum())
sc.stop()
Here, "local[*]" runs Spark locally on all CPU cores.
2. SparkSession
From Spark 2.0 onward, SparkSession became the entry point for DataFrame and SQL operations. It encapsulates SparkContext, SQLContext, and HiveContext into a single object.
Features:
- Create DataFrames and Datasets
- Run SQL queries on structured data
- Manage configuration for the Spark job
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("MyDataApp") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()
data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()
spark.stop()
Tip: Always stop your SparkSession at the end to release cluster resources.
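The example above only builds a DataFrame. Because SparkSession also fronts the SQL engine and wraps the SparkContext, both are reachable from the same object; here is a minimal sketch, where the view name and query are purely illustrative:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SqlDemo").getOrCreate()
df = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["Name", "Age"])
# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("people")
# Run a SQL query through the same SparkSession
spark.sql("SELECT Name FROM people WHERE Age > 26").show()
# The SparkContext wrapped by the session is still available for RDD work
print(spark.sparkContext.defaultParallelism)
spark.stop()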
3. Spark Configuration Basics
Spark allows you to customize your job execution using configuration settings:
- spark.app.name — Name of your application
- spark.master — Cluster mode (local[*], yarn, k8s, etc.)
- spark.executor.memory — Memory per executor
- spark.executor.cores — CPU cores per executor
You can set configurations using:
- The builder API on SparkSession (.config(...))
- The spark-submit command-line options
Example (spark-submit):
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4G \
  my_pyspark_job.py
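However the settings are supplied, they can be checked at runtime from the session itself. A quick sketch, assuming the same 2g executor-memory setting used earlier:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("ConfigDemo") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()
# Read a single setting back from the runtime configuration
print(spark.conf.get("spark.executor.memory"))  # 2g
# List every setting Spark resolved for this application
for key, value in spark.sparkContext.getConf().getAll():
    print(key, "=", value)
spark.stop()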
Real-Life Example
At ShopVerse Retail, a daily sales ETL job uses:
- SparkSession for reading CSV files and performing SQL aggregations
- SparkContext for low-level RDD processing of large raw log files
- Configurations tuned for optimal memory and parallelism
This combination ensures the ETL runs fast and avoids out-of-memory errors.
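A condensed sketch of how such a job might be wired together; the file paths, column names, and aggregation are illustrative assumptions, not ShopVerse's actual pipeline:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder \
    .appName("DailySalesETL") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()
# DataFrame/SQL path: read structured sales data and aggregate it
sales = spark.read.csv("/data/sales/2024-06-01.csv", header=True, inferSchema=True)
daily_totals = sales.groupBy("store_id").agg(F.sum("amount").alias("total_sales"))
daily_totals.write.mode("overwrite").parquet("/data/output/daily_totals")
# RDD path: the SparkContext inside the session handles unstructured log lines
logs = spark.sparkContext.textFile("/data/logs/2024-06-01/*.log")
error_count = logs.filter(lambda line: "ERROR" in line).count()
print(f"Errors in today's logs: {error_count}")
spark.stop()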
Key Takeaways
- SparkContext: Core connection to the Spark cluster, mainly for RDD operations.
- SparkSession: Unified entry point for DataFrames, SQL, and configurations.
- Configuration: Controls resources, memory, and cluster behavior for efficient processing.
- Always stop SparkSession to free resources.
- Proper understanding of these components is crucial for scalable PySpark jobs.
Next, we’ll explore First PySpark Job — Hello World Example, where we’ll write our first real Spark job and understand the end-to-end workflow.