Skip to main content

Essential PySpark Interview Question & Answer(Explained Through Real-World Stories) – Part 3

16. What are UDFs and how do you create them in PySpark?

Story/Modern Tech Analogy: Think of PySpark as a factory assembly line. Each machine (built-in function) does a specific task automatically. But sometimes, you need a custom machine for a unique task—this is your User Defined Function (UDF). You design it, plug it in, and now your assembly line can handle tasks no standard machine can.

Professional Explanation: A UDF in PySpark is a way to apply custom logic to DataFrame columns that aren’t covered by built-in functions. You define a Python function and register it as a UDF to use it in PySpark transformations.

Example:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def square(x):
return x * x

square_udf = udf(square, IntegerType())
df.withColumn("squared_value", square_udf(df["value"]))

17. What are Pandas UDFs and why are they useful?

Story/Modern Tech Analogy: Imagine UDFs are solo musicians, one row at a time. Pandas UDFs are like an entire orchestra—they process batches of data together using vectorized operations, making it way faster.

Professional Explanation: Pandas UDFs leverage Apache Arrow to process batches of data, improving performance over traditional row-wise UDFs. They are especially useful for large datasets where efficiency matters.

Example:

from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf("long")
def add_one(s: pd.Series) -> pd.Series:
return s + 1

df.withColumn("new_value", add_one(df["value"]))

18. How do you perform aggregations in PySpark?

Story/Modern Tech Analogy: Aggregations are like dashboards in a gaming app—they summarize many actions into meaningful metrics, like total score, average time, or max level achieved.

Professional Explanation: Aggregations in PySpark are done using agg() or functions like sum, avg, count, etc., to summarize data. They are commonly combined with groupBy for grouped metrics.

Example:

from pyspark.sql.functions import avg, sum, count
df.groupBy("department").agg(
avg("salary").alias("avg_salary"),
sum("bonus").alias("total_bonus"),
count("*").alias("num_employees")
)

19. Explain groupBy and aggregate functions

Story/Modern Tech Analogy: Imagine a warehouse sorting toys by type (groupBy). Once grouped, you can count, sum, or find the average price per toy type (aggregate functions).

Professional Explanation: groupBy() groups rows based on column values. Aggregation functions like sum(), avg(), count() operate on these groups to produce summary statistics.

Example:

df.groupBy("category").sum("price").show()

20. How do you pivot a DataFrame in PySpark?

Story/Modern Tech Analogy: Pivoting is like rotating a spreadsheet so rows become columns—think of it as turning a long, thin LEGO tower into a flat display so you can see everything at a glance.

Professional Explanation: Pivoting reshapes data by turning unique values from one column into new columns. You usually aggregate the values while pivoting.

Example:

df.groupBy("date").pivot("product").sum("sales").show()

21. How do you join two DataFrames? Explain different join types

Story/Modern Tech Analogy: Joining is like merging two contact lists. You can keep only mutual friends (inner join), everyone including unknown contacts (outer join), or pick the contacts from one list even if the other doesn’t match (left/right join).

Professional Explanation: PySpark supports inner, left, right, full, and semi/anti joins. The choice depends on whether you want all rows, only matches, or exclusive rows from one side.

Example:

df1.join(df2, df1.id == df2.id, "left").show()

22. What are broadcast joins and when should you use them?

Story/Modern Tech Analogy: A broadcast join is like sending a tiny menu to every chef in a big kitchen instead of making them all come to one station—faster when one dataset is small.

Professional Explanation: Broadcast joins replicate the smaller DataFrame to all nodes, avoiding expensive shuffles. Use them when one DataFrame is small and fits in memory.

Example:

from pyspark.sql.functions import broadcast
df_large.join(broadcast(df_small), "id").show()

23. What are window functions in PySpark? Give examples

Story/Modern Tech Analogy: Window functions are like giving each employee a view of their own performance and their peers’ performance, without collapsing the table—like seeing both your score and the team average side by side.

Professional Explanation: Window functions operate over a “window” of rows related to the current row, allowing ranking, cumulative sums, and moving averages without reducing the DataFrame.

Example:

from pyspark.sql.window import Window
from pyspark.sql.functions import rank, desc

window_spec = Window.partitionBy("department").orderBy(desc("salary"))
df.withColumn("rank", rank().over(window_spec)).show()

24. How do you sort a DataFrame by column(s)?

Story/Modern Tech Analogy: Sorting is like arranging your Spotify playlist by mood or release date—everything lines up neatly according to your chosen metric.

Professional Explanation: Use orderBy() or sort() to sort by one or more columns in ascending or descending order.

Example:

df.orderBy("salary", ascending=False).show()

25. How do you filter data using multiple conditions?

Story/Modern Tech Analogy: Filtering with multiple conditions is like using multiple search filters on Amazon—price range + brand + rating—to get exactly what you want.

Professional Explanation: You can use filter() or where() with logical operators (& for AND, | for OR) to filter rows based on multiple conditions.

Example:

df.filter((df.salary > 50000) & (df.department == "IT")).show()