Essential PySpark Interview Questions & Answers (Explained Through Real-World Stories) – Part 4
26. How do you use regular expressions in PySpark?
Story/Modern Tech Analogy: Think of regular expressions (regex) as a powerful searchlight in a massive library—you can find all books that match a pattern like “titles starting with A” or “emails ending with .com” instantly.
Professional Explanation:
In PySpark, regular expressions allow you to perform pattern-based transformations or filtering on string columns. Functions like regexp_extract and regexp_replace help you extract or modify strings efficiently, making them ideal for cleaning and parsing messy data.
Example:
from pyspark.sql.functions import regexp_extract, regexp_replace
# Extract domain from email
df.withColumn("domain", regexp_extract("email", r"@(\w+\.\w+)", 1))
# Replace non-alphanumeric characters
df.withColumn("clean_name", regexp_replace("name", r"[^a-zA-Z0-9]", ""))
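Before running these patterns on a full dataset, it can help to check them locally. The sketch below uses Python's built-in re module as a stand-in. Spark evaluates patterns with Java's regex engine, which behaves the same way for simple ASCII patterns like these but is not identical in general. The helper names and sample values are hypothetical.

```python
import re

DOMAIN_PATTERN = r"@(\w+\.\w+)"
NON_ALNUM_PATTERN = r"[^a-zA-Z0-9]"

def extract_group(value, pattern, idx):
    """Mimic regexp_extract: return the group, or "" when nothing matches."""
    match = re.search(pattern, value)
    return match.group(idx) if match else ""

def strip_non_alnum(value):
    """Mimic regexp_replace(name, NON_ALNUM_PATTERN, "")."""
    return re.sub(NON_ALNUM_PATTERN, "", value)

print(extract_group("ana@example.com", DOMAIN_PATTERN, 1))      # example.com
print(extract_group("not-an-email", DOMAIN_PATTERN, 1))         # (empty string)
print(extract_group("bo@mail.example.org", DOMAIN_PATTERN, 1))  # mail.example
print(strip_non_alnum("O'Neil, Jr."))                           # ONeilJr
```

The third case shows that this pattern truncates multi-part domains. If that matters for your data, a broader pattern such as r"@([\w.-]+)$" may be a better fit.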
27. What are Spark SQL temporary views and how do you create one?
Story/Modern Tech Analogy: Temporary views are like “guest passes” for a VIP lounge—they exist for your session and let you query data like a table using SQL without permanently saving it.
Professional Explanation: A temporary view allows you to expose a DataFrame as a SQL table within your Spark session. It’s useful for running SQL queries on DataFrames and combining SQL and PySpark operations seamlessly.
Example:
df.createOrReplaceTempView("employee_view")
spark.sql("SELECT department, AVG(salary) FROM employee_view GROUP BY department").show()
28. How do you add a monotonically increasing ID to a DataFrame?
Story/Modern Tech Analogy: Adding a monotonically increasing ID is like assigning ticket numbers at a concert—you give every row a unique, ever-increasing number without duplicates.
Professional Explanation:
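Use monotonically_increasing_id() to generate unique, increasing 64-bit IDs for DataFrame rows. The IDs are guaranteed to be unique and increasing, but not consecutive, and they depend on how the data is partitioned, so they can change between runs. They work well as surrogate row identifiers when no natural ID exists, but they are not a reliable way to line up rows from two separately built DataFrames.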
Example:
from pyspark.sql.functions import monotonically_increasing_id
df.withColumn("id", monotonically_increasing_id()).show()
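The gaps between IDs come from how the value is built. Spark's documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The plain-Python sketch below reproduces that layout for illustration only; the function name is hypothetical and this is not Spark code.

```python
RECORD_BITS = 33  # lower 33 bits hold the row's position within its partition

def sketch_monotonic_id(partition_id, row_in_partition):
    """Approximate the documented layout: partition ID in the upper bits."""
    return (partition_id << RECORD_BITS) + row_in_partition

# Two partitions with three rows each
ids = [sketch_monotonic_id(p, r) for p in range(2) for r in range(3)]
print(ids)  # [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

If you need consecutive numbers (1, 2, 3, ...), row_number() over an ordered window is a common alternative, at the cost of sorting the data.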
29. How do you explode nested arrays or structs?
Story/Modern Tech Analogy:
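Imagine a Russian nesting doll: explode() opens up the doll so you can see each piece individually. In the same way, every element of a nested array or map becomes its own row.
Professional Explanation:
explode() converts array or map elements into separate rows, making nested data accessible for further transformations or analysis. Rows whose array is null or empty are dropped; explode_outer() keeps them and produces a null instead. Structs are not exploded: their fields are flattened into columns with select("struct_col.*").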
Example:
from pyspark.sql.functions import explode
df.select("name", explode("skills").alias("skill")).show()
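To make the row-multiplication behavior concrete, here is a plain-Python model of explode and explode_outer working on a list of dictionaries. It is a simplified sketch of the semantics, not Spark code, and the function name and sample rows are hypothetical.

```python
def explode_rows(rows, column, alias, outer=False):
    """Yield one output row per array element, like explode / explode_outer."""
    for row in rows:
        values = row.get(column)
        rest = {k: v for k, v in row.items() if k != column}
        if values:
            for value in values:
                yield {**rest, alias: value}
        elif outer:
            # explode_outer keeps the row and fills the new column with null
            yield {**rest, alias: None}

rows = [
    {"name": "Ana", "skills": ["sql", "python"]},
    {"name": "Bo", "skills": []},
    {"name": "Cy", "skills": None},
]

exploded = list(explode_rows(rows, "skills", "skill"))
exploded_outer = list(explode_rows(rows, "skills", "skill", outer=True))
print(exploded)        # two rows, both for Ana
print(exploded_outer)  # four rows: two for Ana, one each for Bo and Cy
```

Note that Bo and Cy disappear from the plain explode result, which is a frequent source of silently missing rows.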
30. What is the difference between repartition and coalesce?
Story/Modern Tech Analogy: Repartitioning is like completely reorganizing a library’s shelves, spreading books evenly. Coalesce is like combining a few shelves to reduce clutter, keeping most of the original order intact.
Professional Explanation:
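repartition(n): Creates exactly n partitions by redistributing all rows across the cluster (full shuffle). It can increase or decrease the partition count.
coalesce(n): Reduces the number of partitions by merging existing ones without a full shuffle. It is cheaper, but it cannot increase the partition count and may leave partitions unevenly sized.
Use repartition for load balancing and coalesce for efficiency when reducing partitions.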
Example:
df_balanced = df.repartition(10) # Full shuffle into 10 partitions; returns a new DataFrame
df_compact = df.coalesce(2) # Merge existing partitions down to 2, no full shuffle
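The difference is easier to see on a toy model. The sketch below represents partitions as plain Python lists: the coalesce stand-in only merges whole partitions, while the repartition stand-in moves every row individually. This is a deliberately simplified illustration, not Spark's actual algorithm (Spark also considers data locality, for example), and both function names are hypothetical.

```python
def sketch_coalesce(partitions, n):
    """Merge whole partitions into at most n groups; rows are never split up."""
    n = min(n, len(partitions))  # coalesce cannot increase the partition count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged

def sketch_repartition(partitions, n):
    """Redistribute every row individually (a stand-in for a full shuffle)."""
    rows = [row for part in partitions for row in part]
    return [rows[i::n] for i in range(n)]

skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]
print(sketch_coalesce(skewed, 2))     # [[1, 2, 3, 4, 5, 6, 7], [8, 9]]
print(sketch_repartition(skewed, 2))  # [[1, 3, 5, 7, 9], [2, 4, 6, 8]]
```

The merged result stays skewed (7 rows versus 2), while the redistributed result is balanced (5 versus 4). That is the trade-off: coalesce avoids moving data, repartition pays for the shuffle to even things out.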
31. How do you cache a DataFrame and why is it useful?
Story/Modern Tech Analogy: Caching is like keeping frequently used documents on your desk instead of fetching them from the archive each time—it speeds up repeated operations.
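Professional Explanation: Caching a DataFrame keeps its computed result in memory (spilling to disk if it does not fit), so later actions reuse it instead of recomputing expensive transformations. Caching is lazy: nothing is stored until the first action runs. It’s useful for iterative algorithms, repeated queries, or machine learning pipelines.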
Example:
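df.cache() # Marks the DataFrame for caching (lazy, nothing is stored yet)
df.count() # First action computes the data and fills the cache
df.show() # Later actions read from the cached data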
32. What is the difference between persist and cache?
Story/Modern Tech Analogy: Caching is like putting a file on your desk (memory), while persisting is like storing it in a chosen medium—desk, shelf, or cloud (memory, disk, or both).
Professional Explanation:
cache(): Shorthand for persist() with the default storage level. For DataFrames the default is MEMORY_AND_DISK; for RDDs it is MEMORY_ONLY.
persist(storageLevel): Lets you choose the storage level: memory, disk, or both.
Use persist when the default level does not suit your workload, for example when the data should be kept on disk only.
Example:
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
df.cache() # Equivalent to persist() with default MEMORY_AND_DISK
33. How do you handle string operations in PySpark?
Story/Modern Tech Analogy: Handling strings in PySpark is like using a Swiss Army knife for text—you can slice, search, replace, convert, and format with precision.
Professional Explanation:
PySpark provides multiple functions in pyspark.sql.functions to manipulate strings: substring, concat, length, upper/lower, trim, regexp_replace, etc. These built-in functions run as native Spark expressions, so they are usually faster than equivalent Python UDFs.
Example:
from pyspark.sql.functions import upper, concat_ws, length, trim
df.withColumn("upper_name", upper("name")) \
.withColumn("name_length", length("name")) \
.withColumn("full_info", concat_ws("-", "name", "department"))
34. How do you handle date and timestamp operations?
Story/Modern Tech Analogy: Dates and timestamps in PySpark are like scheduling apps—you can calculate durations, find weekdays, or extract months to make data actionable.
Professional Explanation:
PySpark provides date and timestamp functions like current_date, date_add, year, month, datediff, to_date, and unix_timestamp. These allow transformations, filtering, and time-based analysis on distributed data efficiently.
Example:
from pyspark.sql.functions import current_date, datediff, to_date, year
df.withColumn("today", current_date()) \
.withColumn("days_diff", datediff(current_date(), "join_date")) \
.withColumn("join_year", year("join_date"))
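A common slip with datediff is the argument order: it takes the end date first and the start date second, and the result is negative when they are swapped. The sketch below checks the expected values with Python's standard datetime module. It is a local sanity check, not Spark code, and the function name and sample dates are hypothetical.

```python
from datetime import date, timedelta

def sketch_datediff(end, start):
    """Days from start to end; negative when end is earlier than start."""
    return (end - start).days

join_date = date(2023, 1, 15)
today = date(2024, 1, 15)

print(sketch_datediff(today, join_date))  # 365
print(sketch_datediff(join_date, today))  # -365
print(join_date + timedelta(days=30))     # 2023-02-14, like date_add(join_date, 30)
print(join_date.year)                     # 2023, like year(join_date)
```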
35. Explain dropna, fillna, and replace in PySpark
Story/Modern Tech Analogy: Think of data cleaning like tidying your room:
dropna → throw away missing items
fillna → fill empty spaces with default values
replace → swap incorrect items with the right ones
Professional Explanation:
dropna(): Removes rows containing nulls.
fillna(): Replaces nulls with a specified value.
replace(): Replaces specific values with new values.
These are crucial for preparing data for analysis or machine learning.
Example:
df.dropna(subset=["salary"]).show() # Drop rows with null salary
df.fillna({"salary": 0, "department": "NA"}).show()
df.replace({"IT": "Information Technology"}).show()
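The three operations touch different things: dropna removes whole rows, fillna changes only nulls, and replace changes only matching non-null values. The plain-Python model below makes that visible on a small list of dictionaries. It is a simplified sketch of the semantics, not Spark code, and the function names and sample rows are hypothetical.

```python
def sketch_dropna(rows, subset):
    """Keep only rows where every column in subset is non-null."""
    return [r for r in rows if all(r[c] is not None for c in subset)]

def sketch_fillna(rows, defaults):
    """Fill nulls, but only in the columns listed in defaults."""
    return [
        {k: (defaults[k] if v is None and k in defaults else v) for k, v in r.items()}
        for r in rows
    ]

def sketch_replace(rows, mapping):
    """Swap matching string values; nulls and other values are left alone."""
    return [
        {k: (mapping.get(v, v) if isinstance(v, str) else v) for k, v in r.items()}
        for r in rows
    ]

rows = [
    {"name": "Ana", "department": "IT", "salary": 5000},
    {"name": "Bo", "department": None, "salary": None},
    {"name": "Cy", "department": "HR", "salary": None},
]

dropped = sketch_dropna(rows, subset=["salary"])
filled = sketch_fillna(rows, {"salary": 0, "department": "NA"})
replaced = sketch_replace(rows, {"IT": "Information Technology"})
print(dropped)   # only Ana's row remains
print(filled)    # Bo gets department "NA" and salary 0; Cy gets salary 0
print(replaced)  # Ana's department is spelled out; nulls are unchanged
```

None of the calls modifies the input rows, which mirrors how DataFrame operations return a new DataFrame rather than changing the original.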