
Spark UI & Job Debugging Techniques — Monitor and Optimize PySpark Jobs

At NeoMart, monitoring PySpark jobs is crucial:

  • Some jobs run slower than expected
  • Some tasks fail silently or consume too much memory
  • Optimizing joins, aggregations, and shuffles requires visibility

The Spark UI is the most powerful tool for analyzing job execution, stages, tasks, and memory usage.


1. Accessing Spark UI

  • If running locally: open http://localhost:4040
  • On a cluster: access it through the Spark History Server or the Databricks UI (you can also ask the running session for its address, as shown at the end of this section)
  • Tabs to focus on:
    • Jobs → high-level view of actions
    • Stages → detailed breakdown of tasks
    • Storage → cached DataFrames/RDDs
    • SQL → executed SQL queries and plans
    • Environment → configuration and JVM details
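
If you are not sure which address to open, the active session can report its own UI URL. A minimal check, assuming an already-created SparkSession named spark:

# The driver exposes the address its UI is bound to (None if the UI is disabled)
print(spark.sparkContext.uiWebUrl)  # e.g. http://localhost:4040 when running locally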

2. Understanding Jobs and Stages

  • Job: Triggered by an action (e.g., count(), show())
  • Stage: Set of tasks that can run in parallel without shuffle
  • Task: Execution unit processing a partition

Example:

from pyspark.sql import functions as F

df_filtered = df.filter(F.col("price") > 100)  # transformation: lazy, no job yet
df_filtered.count()                            # action: triggers a job
  • One action → one job
  • Spark UI → the Jobs and Stages tabs show the stages, number of tasks, duration, and shuffle read/write for this job
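
To make a job easy to find in the Jobs tab, you can label it before calling the action. A small sketch, assuming the df_filtered DataFrame above and an active SparkSession named spark:

# Label the jobs triggered after this call so they stand out in the Jobs tab
spark.sparkContext.setJobDescription("NeoMart: count products priced over 100")
df_filtered.count()

# Clear the label so later, unrelated jobs are not tagged with it
spark.sparkContext.setJobDescription(None)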

3. DAG Visualization

  • Spark builds Directed Acyclic Graph (DAG) for transformations

  • Narrow and wide transformations look different in the DAG view (see the example below):

    • Narrow: a straight line within a stage → no shuffle
    • Wide: edges converge at a stage boundary → a shuffle happens
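
A minimal contrast, assuming the NeoMart products DataFrame df used earlier (the product_id and category columns are illustrative):

from pyspark.sql import functions as F

# Narrow: each output partition depends on a single input partition → stays in one stage
narrow_df = df.filter(F.col("price") > 100).select("product_id", "price")

# Wide: groupBy redistributes rows by key → a shuffle, and a new stage in the DAG
wide_df = df.groupBy("category").agg(F.avg("price").alias("avg_price"))

wide_df.count()  # action; the Stages tab shows the shuffle boundary between the two stages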

Story: NeoMart analysts visualize the DAG to spot the shuffle-heavy operations slowing their jobs.


4. Storage Tab — Caching and Persisting

  • Shows all cached DataFrames and RDDs

  • Displays:

    • Storage level (memory/disk)
    • Number of cached partitions
    • Memory usage

Example:

df.cache()  # marks df for caching (lazy; nothing is stored yet)
df.count()  # action materializes the cache
  • Check the Storage tab → confirm the DataFrame is cached
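
Caching keeps data in memory by default; for large intermediate results you can allow spilling to disk and release the storage when finished. A short sketch using the standard StorageLevel API:

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  # partitions that do not fit in memory spill to disk
df.count()                                # action materializes the persisted data

# ... reuse df in further queries, then free the storage ...
df.unpersist()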

5. SQL Tab — Query Monitoring

  • For DataFrames registered as temp views or tables, the SQL tab shows:

    • Executed queries
    • Physical plan
    • Execution time

Example:

df.createOrReplaceTempView("products")
spark.sql("SELECT AVG(price) FROM products").show()  # runs a job; listed in the SQL tab
  • The UI shows the aggregation stage and the tasks executed
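
The same plans can also be inspected directly from code without opening the UI, using DataFrame.explain:

avg_price = spark.sql("SELECT AVG(price) FROM products")
avg_price.explain(True)  # prints the parsed, analyzed, optimized, and physical plans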

6. Debugging Common Issues

  1. Long-running tasks:
    • Often due to data skew → consider salting or repartitioning
  2. High shuffle write/read:
    • Use broadcast joins for small tables (see the sketch after this list)
  3. Executor OOM (Out of Memory):
    • Persist intermediate results to disk
    • Increase executor memory
  4. Stragglers:
    • Skewed keys → repartition or salt the join/grouping key
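
A common fix for the shuffle-heavy and skewed cases above is to broadcast the small side of a join and, for badly skewed keys, to aggregate in two steps with a salt. A minimal sketch; the sales_df and products_df DataFrames and their columns are illustrative:

from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

# Broadcast join: ship the small dimension table to every executor,
# avoiding a shuffle of the large table
joined = sales_df.join(broadcast(products_df), "product_id")

# Salting: spread hot keys over N buckets, aggregate partially, then combine
N = 8
salted_totals = (sales_df
    .withColumn("salt", (F.rand() * N).cast("int"))
    .groupBy("product_id", "salt")
    .agg(F.sum("amount").alias("partial_sum"))
    .groupBy("product_id")
    .agg(F.sum("partial_sum").alias("total_amount")))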

7. Using Spark History Server

  • Tracks completed jobs for offline analysis

  • Steps:

    1. Enable event logging. These properties are read when the SparkContext starts, so set them in spark-defaults.conf (or at session creation, as sketched below) rather than with spark.conf.set at runtime:

      spark.eventLog.enabled   true
      spark.eventLog.dir       /tmp/spark-events
    2. Start the history server and open its UI (http://localhost:18080 by default) → view past jobs, DAGs, stages, and tasks
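
A minimal way to apply the event-log settings from PySpark itself (they only take effect when the session is first created, and the directory must already exist):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("neomart-etl")                            # illustrative app name
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "/tmp/spark-events")
    .getOrCreate())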

Story: NeoMart can analyze nightly ETL jobs and spot inefficient transformations even after job completion.


8. Tips for Effective Job Debugging

✔ Use the Spark UI DAG to identify wide transformations
✔ Monitor shuffle read/write bytes → optimize joins and aggregations
✔ Cache frequently reused DataFrames
✔ Check task distribution → prevent stragglers
✔ Use the SQL tab for complex query optimization


Summary

Using Spark UI and history server, you can:

  • Monitor jobs, stages, and tasks
  • Visualize DAGs for performance insight
  • Debug skew, shuffle, and memory issues
  • Optimize iterative and large-scale pipelines

NeoMart engineers rely on Spark UI to save hours of troubleshooting and make PySpark pipelines production-ready.


Next Topic → Catalyst Optimizer & Tungsten Execution Engine