Essential PySpark Interview Questions & Answers (Explained Through Real-World Stories) – Part 4

26. How do you use regular expressions in PySpark?

Story/Modern Tech Analogy: Think of regular expressions (regex) as a powerful searchlight in a massive library—you can find all books that match a pattern like “titles starting with A” or “emails ending with .com” instantly.

Professional Explanation: In PySpark, regular expressions allow you to perform pattern-based transformations or filtering on string columns. Functions like regexp_extract and regexp_replace help you extract or modify strings efficiently, making them ideal for cleaning and parsing messy data.

Example:

from pyspark.sql.functions import regexp_extract, regexp_replace

# Extract the domain from an email address (DataFrames are immutable, so assign the result)
df = df.withColumn("domain", regexp_extract("email", r"@(\w+\.\w+)", 1))

# Strip non-alphanumeric characters from names
df = df.withColumn("clean_name", regexp_replace("name", r"[^a-zA-Z0-9]", ""))

27. What are Spark SQL temporary views and how do you create one?

Story/Modern Tech Analogy: Temporary views are like “guest passes” for a VIP lounge—they exist for your session and let you query data like a table using SQL without permanently saving it.

Professional Explanation: A temporary view allows you to expose a DataFrame as a SQL table within your Spark session. It’s useful for running SQL queries on DataFrames and combining SQL and PySpark operations seamlessly.

Example:

df.createOrReplaceTempView("employee_view")
spark.sql("SELECT department, AVG(salary) FROM employee_view GROUP BY department").show()

28. How do you add a monotonically increasing ID to a DataFrame?

Story/Modern Tech Analogy: Adding a monotonically increasing ID is like numbering tickets at a concert with several entrance gates: every number is unique and always goes up, but there are gaps between the blocks each gate hands out.

Professional Explanation: Use monotonically_increasing_id() to generate 64-bit IDs that are guaranteed unique and monotonically increasing, but not consecutive: each partition is assigned its own ID range, so gaps are expected. It is especially useful for indexing or joining data when no natural key exists.

Example:

from pyspark.sql.functions import monotonically_increasing_id

df.withColumn("id", monotonically_increasing_id()).show()

29. How do you explode nested arrays or structs?

Story/Modern Tech Analogy: Imagine a Russian nesting doll—explode() opens up the doll so you can see each piece individually. Similarly, nested arrays or structs are flattened so every element becomes its own row.

Professional Explanation: explode() turns each element of an array (or each key-value pair of a map) into its own row, making nested data accessible for further transformations or analysis. Struct columns are not exploded; their fields are flattened with dot notation instead.

Example:

from pyspark.sql.functions import explode

df.select("name", explode("skills").alias("skill")).show()

30. What is the difference between repartition and coalesce?

Story/Modern Tech Analogy: Repartitioning is like completely reorganizing a library’s shelves, spreading books evenly. Coalesce is like combining a few shelves to reduce clutter, keeping most of the original order intact.

Professional Explanation:

  • repartition(n): Redistributes data into n partitions with a full shuffle; it can increase or decrease the partition count.
  • coalesce(n): Reduces the number of partitions by merging existing ones, avoiding a full shuffle; it can only decrease the count.

Use repartition for even load balancing, and coalesce for cheap reductions such as consolidating output before a write.

Example:

df.repartition(10)  # Shuffle to 10 partitions
df.coalesce(2) # Reduce partitions to 2
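
You can verify the effect by inspecting the partition count through the underlying RDD:

print(df.rdd.getNumPartitions())                  # current count
print(df.repartition(10).rdd.getNumPartitions())  # 10
print(df.coalesce(2).rdd.getNumPartitions())      # 2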

31. How do you cache a DataFrame and why is it useful?

Story/Modern Tech Analogy: Caching is like keeping frequently used documents on your desk instead of fetching them from the archive each time—it speeds up repeated operations.

Professional Explanation: Caching a DataFrame stores it in memory, reducing recomputation of expensive transformations. It’s useful for iterative algorithms, repeated queries, or machine learning pipelines.

Example:

df.cache()
df.show()
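
Note that caching is lazy: the data is materialized on the first action, and should be released once it is no longer needed. A minimal sketch:

df.cache()
df.count()      # first action populates the cache
df.unpersist()  # free the cached data when done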

32. What is the difference between persist and cache?

Story/Modern Tech Analogy: Caching is like putting a file on your desk (memory), while persisting is like storing it in a chosen medium—desk, shelf, or cloud (memory, disk, or both).

Professional Explanation:

  • cache(): Shorthand for persist() with the default storage level, which is MEMORY_AND_DISK for DataFrames (RDD.cache() defaults to MEMORY_ONLY).
  • persist(storageLevel): Lets you choose the storage level explicitly: memory only, disk only, or both, with optional serialization and replication. Use persist when the default does not fit, e.g. DISK_ONLY for data too large to cache in memory.

Example:

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)
df.cache() # Equivalent to persist() with default MEMORY_AND_DISK
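
To confirm which storage level is actually in effect, the DataFrame exposes a storageLevel property:

print(df.storageLevel)  # shows the StorageLevel currently applied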

33. How do you handle string operations in PySpark?

Story/Modern Tech Analogy: Handling strings in PySpark is like using a Swiss Army knife for text—you can slice, search, replace, convert, and format with precision.

Professional Explanation: PySpark provides multiple functions in pyspark.sql.functions to manipulate strings: substring, concat, length, upper/lower, trim, regexp_replace, etc. These functions are vectorized and optimized for distributed processing.

Example:

from pyspark.sql.functions import upper, concat_ws, length, trim

df = df.withColumn("clean_name", trim("name")) \
       .withColumn("upper_name", upper("clean_name")) \
       .withColumn("name_length", length("clean_name")) \
       .withColumn("full_info", concat_ws("-", "clean_name", "department"))

34. How do you handle date and timestamp operations?

Story/Modern Tech Analogy: Dates and timestamps in PySpark are like scheduling apps—you can calculate durations, find weekdays, or extract months to make data actionable.

Professional Explanation: PySpark provides date and timestamp functions like current_date, date_add, year, month, datediff, to_date, and unix_timestamp. These allow transformations, filtering, and time-based analysis on distributed data efficiently.

Example:

from pyspark.sql.functions import current_date, datediff, year

df = df.withColumn("today", current_date()) \
       .withColumn("days_diff", datediff(current_date(), "join_date")) \
       .withColumn("join_year", year("join_date"))

35. Explain dropna, fillna, and replace in PySpark

Story/Modern Tech Analogy: Think of data cleaning like tidying your room:

  • dropna → throw away missing items
  • fillna → fill empty spaces with default values
  • replace → swap incorrect items with the right ones

Professional Explanation:

  • dropna(): Removes rows containing nulls.
  • fillna(): Replaces nulls with a specified value.
  • replace(): Replaces specific values with new ones.

All three are crucial for preparing data for analysis or machine learning.

Example:

df.dropna(subset=["salary"]).show()           # Drop rows with null salary
df.fillna({"salary": 0, "department": "NA"}).show()
df.replace({"IT": "Information Technology"}).show()
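
dropna() also accepts how and thresh arguments for finer control:

df.dropna(how="all").show()    # drop rows only when every column is null
df.dropna(thresh=2).show()     # keep rows with at least 2 non-null values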