
Spark UI & Job Debugging Techniques — Monitor and Optimize PySpark Jobs

At NeoMart, monitoring PySpark jobs is crucial:

  • Some jobs run slower than expected
  • Some tasks fail silently or consume too much memory
  • Optimizing joins, aggregations, and shuffles requires visibility

The Spark UI is the most powerful tool for analyzing job execution, stages, tasks, and memory usage.


1. Accessing Spark UI

  • If running locally: open http://localhost:4040
  • On a cluster: access it through the Spark History Server or the Databricks UI (you can also ask the running session for its address, as shown at the end of this section)
  • Tabs to focus on:
    • Jobs → high-level view of actions
    • Stages → detailed breakdown of tasks
    • Storage → cached DataFrames/RDDs
    • SQL → executed SQL queries and plans
    • Environment → configuration and JVM details
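
If you are not sure which address to open, the active session can report its own UI URL. A minimal check, assuming an already-created SparkSession named spark:

# The driver exposes the address its UI is bound to (None if the UI is disabled)
print(spark.sparkContext.uiWebUrl)  # e.g. http://localhost:4040 when running locally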

2. Understanding Jobs and Stages

  • Job: Triggered by an action (e.g., count(), show())
  • Stage: Set of tasks that can run in parallel without shuffle
  • Task: Execution unit processing a partition

Example:

from pyspark.sql import functions as F

df_filtered = df.filter(F.col("price") > 100)  # transformation: lazy, no job yet
df_filtered.count()                            # action: triggers a job
  • One action → one job
  • Spark UI → the Jobs and Stages tabs show the stages, number of tasks, duration, and shuffle read/write for this job
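
To make a job easy to find in the Jobs tab, you can label it before calling the action. A small sketch, assuming the df_filtered DataFrame above and an active SparkSession named spark:

# Label the jobs triggered after this call so they stand out in the Jobs tab
spark.sparkContext.setJobDescription("NeoMart: count products priced over 100")
df_filtered.count()

# Clear the label so later, unrelated jobs are not tagged with it
spark.sparkContext.setJobDescription(None)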

3. DAG Visualization

  • Spark builds Directed Acyclic Graph (DAG) for transformations

  • Narrow and wide transformations look different in the DAG view (see the example below):

    • Narrow: a straight line within a stage → no shuffle
    • Wide: edges converge at a stage boundary → a shuffle happens
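
A minimal contrast, assuming the NeoMart products DataFrame df used earlier (the product_id and category columns are illustrative):

from pyspark.sql import functions as F

# Narrow: each output partition depends on a single input partition → stays in one stage
narrow_df = df.filter(F.col("price") > 100).select("product_id", "price")

# Wide: groupBy redistributes rows by key → a shuffle, and a new stage in the DAG
wide_df = df.groupBy("category").agg(F.avg("price").alias("avg_price"))

wide_df.count()  # action; the Stages tab shows the shuffle boundary between the two stages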

Story: NeoMart analysts visualize the DAG to spot the shuffle-heavy operations slowing their jobs.


4. Storage Tab — Caching and Persisting

  • Shows all cached DataFrames and RDDs

  • Displays:

    • Storage level (memory/disk)
    • Number of cached partitions
    • Memory usage

Example:

df.cache()  # marks df for caching (lazy; nothing is stored yet)
df.count()  # action materializes the cache
  • Check the Storage tab → confirm the DataFrame is cached
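
Caching keeps data in memory by default; for large intermediate results you can allow spilling to disk and release the storage when finished. A short sketch using the standard StorageLevel API:

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  # partitions that do not fit in memory spill to disk
df.count()                                # action materializes the persisted data

# ... reuse df in further queries, then free the storage ...
df.unpersist()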

5. SQL Tab — Query Monitoring

  • For DataFrames registered as temp views or tables, the SQL tab shows:

    • Executed queries
    • Physical plan
    • Execution time

Example:

df.createOrReplaceTempView("products")
spark.sql("SELECT AVG(price) FROM products").show()  # runs a job; listed in the SQL tab
  • The UI shows the aggregation stage and the tasks executed
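
The same plans can also be inspected directly from code without opening the UI, using DataFrame.explain:

avg_price = spark.sql("SELECT AVG(price) FROM products")
avg_price.explain(True)  # prints the parsed, analyzed, optimized, and physical plans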

6. Debugging Common Issues

  1. Long-running tasks:
    • Often due to data skew → consider salting or repartitioning
  2. High shuffle write/read:
    • Use broadcast joins for small tables (see the sketch after this list)
  3. Executor OOM (Out of Memory):
    • Persist intermediate results to disk
    • Increase executor memory
  4. Stragglers:
    • Skewed keys → repartition or salt the join/grouping key
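
A common fix for the shuffle-heavy and skewed cases above is to broadcast the small side of a join and, for badly skewed keys, to aggregate in two steps with a salt. A minimal sketch; the sales_df and products_df DataFrames and their columns are illustrative:

from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

# Broadcast join: ship the small dimension table to every executor,
# avoiding a shuffle of the large table
joined = sales_df.join(broadcast(products_df), "product_id")

# Salting: spread hot keys over N buckets, aggregate partially, then combine
N = 8
salted_totals = (sales_df
    .withColumn("salt", (F.rand() * N).cast("int"))
    .groupBy("product_id", "salt")
    .agg(F.sum("amount").alias("partial_sum"))
    .groupBy("product_id")
    .agg(F.sum("partial_sum").alias("total_amount")))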

7. Using Spark History Server

  • Tracks completed jobs for offline analysis

  • Steps:

    1. Enable event logging. These properties are read when the SparkContext starts, so set them in spark-defaults.conf (or at session creation, as sketched below) rather than with spark.conf.set at runtime:

      spark.eventLog.enabled   true
      spark.eventLog.dir       /tmp/spark-events
    2. Start the history server and open its UI (http://localhost:18080 by default) → view past jobs, DAGs, stages, and tasks
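
A minimal way to apply the event-log settings from PySpark itself (they only take effect when the session is first created, and the directory must already exist):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("neomart-etl")                            # illustrative app name
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "/tmp/spark-events")
    .getOrCreate())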

Story: NeoMart can analyze nightly ETL jobs and spot inefficient transformations even after job completion.


8. Tips for Effective Job Debugging

✔ Use the Spark UI DAG to identify wide transformations
✔ Monitor shuffle read/write bytes → optimize joins and aggregations
✔ Cache frequently reused DataFrames
✔ Check task distribution → prevent stragglers
✔ Use the SQL tab for complex query optimization


Summary

Using Spark UI and history server, you can:

  • Monitor jobs, stages, and tasks
  • Visualize DAGs for performance insight
  • Debug skew, shuffle, and memory issues
  • Optimize iterative and large-scale pipelines

NeoMart engineers rely on Spark UI to save hours of troubleshooting and make PySpark pipelines production-ready.


Next Topic → Catalyst Optimizer & Tungsten Execution Engine