SparkSession, SparkContext, and Configuration Basics
When you run a PySpark job, everything starts with SparkSession and SparkContext. These are the entry points to Spark’s distributed computing power, and understanding them is essential for writing efficient and scalable jobs.
1. SparkContext (sc)
SparkContext is the core connection to a Spark cluster. It allows you to:
- Submit jobs to the cluster
- Access RDDs for distributed operations
- Manage resources and cluster configuration
Analogy: SparkContext is like the conductor of your Spark orchestra, telling each executor what to play.
Example:
from pyspark import SparkContext
sc = SparkContext("local[*]", "MyApp")
# parallelize() distributes the list across the workers; sum() is an action that triggers the computation
print(sc.parallelize([1, 2, 3, 4]).sum())
sc.stop()
Here, "local[*]" runs Spark locally on all CPU cores.
2. SparkSession
From Spark 2.0 onward, SparkSession became the entry point for DataFrame and SQL operations. It encapsulates SparkContext, SQLContext, and HiveContext into a single object.
Features:
- Create DataFrames and Datasets
- Run SQL queries on structured data
- Manage configuration for the Spark job
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("MyDataApp") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()
data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()
spark.stop()
Tip: Always stop your SparkSession at the end to release cluster resources.
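The example above only builds a DataFrame. Because SparkSession also fronts the SQL engine and wraps the SparkContext, both are reachable from the same object; here is a minimal sketch, where the view name and query are purely illustrative:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SqlDemo").getOrCreate()
df = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["Name", "Age"])
# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("people")
# Run a SQL query through the same SparkSession
spark.sql("SELECT Name FROM people WHERE Age > 26").show()
# The SparkContext wrapped by the session is still available for RDD work
print(spark.sparkContext.defaultParallelism)
spark.stop()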
3. Spark Configuration Basics
Spark allows you to customize your job execution using configuration settings:
- spark.app.name — Name of your application
- spark.master — Cluster mode (local[*], yarn, k8s, etc.)
- spark.executor.memory — Memory per executor
- spark.executor.cores — CPU cores per executor
You can set configurations using:
- The builder API on SparkSession (.config(...))
- The spark-submit command-line options
Example (spark-submit):
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4G \
  my_pyspark_job.py
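However the settings are supplied, they can be checked at runtime from the session itself. A quick sketch, assuming the same 2g executor-memory setting used earlier:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("ConfigDemo") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()
# Read a single setting back from the runtime configuration
print(spark.conf.get("spark.executor.memory"))  # 2g
# List every setting Spark resolved for this application
for key, value in spark.sparkContext.getConf().getAll():
    print(key, "=", value)
spark.stop()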
Real-Life Example
At ShopVerse Retail, a daily sales ETL job uses:
- SparkSession for reading CSV files and performing SQL aggregations
- SparkContext for low-level RDD processing of large raw log files
- Configurations tuned for optimal memory and parallelism
This combination ensures the ETL runs fast and avoids out-of-memory errors.
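A condensed sketch of how such a job might be wired together; the file paths, column names, and aggregation are illustrative assumptions, not ShopVerse's actual pipeline:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder \
    .appName("DailySalesETL") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()
# DataFrame/SQL path: read structured sales data and aggregate it
sales = spark.read.csv("/data/sales/2024-06-01.csv", header=True, inferSchema=True)
daily_totals = sales.groupBy("store_id").agg(F.sum("amount").alias("total_sales"))
daily_totals.write.mode("overwrite").parquet("/data/output/daily_totals")
# RDD path: the SparkContext inside the session handles unstructured log lines
logs = spark.sparkContext.textFile("/data/logs/2024-06-01/*.log")
error_count = logs.filter(lambda line: "ERROR" in line).count()
print(f"Errors in today's logs: {error_count}")
spark.stop()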
Key Takeaways
- SparkContext: Core connection to the Spark cluster, mainly for RDD operations.
- SparkSession: Unified entry point for DataFrames, SQL, and configurations.
- Configuration: Controls resources, memory, and cluster behavior for efficient processing.
- Always stop SparkSession to free resources.
- Proper understanding of these components is crucial for scalable PySpark jobs.
Next, we’ll explore First PySpark Job — Hello World Example, where we’ll write our first real Spark job and understand the end-to-end workflow.