PySpark Quiz — Expert Level 2
⚙️ Expert Level 2 — Optimization & Cluster Tuning
1. You are joining two DataFrames, df1 and df2, where df1 is large and df2 is small. Which join optimization should you use?
2. How do you broadcast df2 before joining in PySpark?
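A minimal sketch of the usual answer to both questions, using the `broadcast` hint from `pyspark.sql.functions` (variable names follow the question):

```python
from pyspark.sql.functions import broadcast

# Broadcasting ships a full copy of the small DataFrame to every executor,
# turning the shuffle join into a map-side broadcast hash join.
result = df1.join(broadcast(df2), 'id')
```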
3. Consider this code snippet. How would you optimize it?

```python
df1 = spark.read.csv('large_file.csv', header=True, inferSchema=True)
df2 = spark.read.parquet('small_file.parquet')
df3 = df1.join(df2, 'id')
df3.count()
```
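One hedged way to optimize the snippet: supply an explicit schema instead of inferring it, and broadcast the small side. The column names in the schema below are placeholders; match them to `large_file.csv`:

```python
from pyspark.sql.functions import broadcast
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema -- an explicit schema avoids the extra full pass
# over the CSV that inferSchema=True triggers.
schema = StructType([
    StructField('id', StringType()),
    StructField('value', IntegerType()),
])

df1 = spark.read.csv('large_file.csv', header=True, schema=schema)
df2 = spark.read.parquet('small_file.parquet')

# Broadcasting the small Parquet side avoids shuffling the large CSV.
df3 = df1.join(broadcast(df2), 'id')
df3.count()
```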
4. How do you handle data skew when joining two large DataFrames?
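One standard technique is key salting. The sketch below assumes `df1` carries the skewed keys and uses a hypothetical salt count of 10:

```python
from pyspark.sql import functions as F

NUM_SALTS = 10  # hypothetical; tune to the observed skew

# Split each hot key on the skewed side into NUM_SALTS sub-keys.
df1_salted = df1.withColumn('salt', (F.rand() * NUM_SALTS).cast('int'))

# Replicate the other side across all salt values so every sub-key matches.
df2_salted = df2.withColumn(
    'salt', F.explode(F.array([F.lit(i) for i in range(NUM_SALTS)]))
)

result = df1_salted.join(df2_salted, ['id', 'salt']).drop('salt')
```

On Spark 3.x, enabling Adaptive Query Execution with `spark.sql.adaptive.skewJoin.enabled` handles many skew cases without manual salting.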
5. How do you persist a DataFrame df to memory and disk for repeated heavy operations?
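A sketch using `StorageLevel.MEMORY_AND_DISK`, which keeps what fits in memory and spills the rest to disk:

```python
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()      # an action materializes the cache
# ... repeated heavy operations on df ...
df.unpersist()  # release executor memory when finished
```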
6. Consider the following code:

```python
# df has millions of rows
df_filtered = df.filter(df.age > 30).select('id', 'name', 'age')
df_filtered.show()
```

What is the recommended optimization?
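One plausible answer: project only the needed columns and filter early, and keep the data in a columnar format so Catalyst can push both down to the scan; `explain()` verifies what was pushed. The Parquet path below is a placeholder:

```python
from pyspark.sql import functions as F

df = spark.read.parquet('people.parquet')  # hypothetical columnar source

# Select only the needed columns and filter early; on Parquet, Catalyst
# pushes the predicate and the column list down to the file scan.
df_filtered = df.select('id', 'name', 'age').filter(F.col('age') > 30)
df_filtered.explain()  # look for PushedFilters and the pruned ReadSchema
df_filtered.show()
```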
7. How do you check the physical execution plan of a DataFrame?
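A short illustration of `explain()` and its modes:

```python
df.explain()                  # physical plan only
df.explain(True)              # parsed, analyzed, optimized, and physical plans
df.explain(mode='formatted')  # Spark 3.0+: sectioned, easier-to-read output
```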
8. How do you increase the number of shuffle partitions for large joins?
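A sketch of the relevant settings; the value 1000 is a placeholder sized to the workload:

```python
# Default is 200; raise it so each shuffle partition stays a manageable size.
spark.conf.set('spark.sql.shuffle.partitions', '1000')

# On Spark 3.x, Adaptive Query Execution can coalesce or split
# shuffle partitions at runtime instead of relying on a fixed number.
spark.conf.set('spark.sql.adaptive.enabled', 'true')
```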
9. Which strategy reduces unnecessary shuffles when multiple DataFrames are joined?
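One common strategy is to co-partition the inputs on the join key once so later joins reuse that partitioning; for recurring jobs, bucketing persists it on disk. A sketch (the table name and bucket count are placeholders):

```python
# Repartition both sides on the join key once; subsequent joins on 'id'
# can reuse the partitioning instead of shuffling again.
df1p = df1.repartition('id')
df2p = df2.repartition('id')
joined = df1p.join(df2p, 'id')

# For repeated batch jobs, bucketing avoids the shuffle across runs.
df1.write.bucketBy(64, 'id').sortBy('id').saveAsTable('df1_bucketed')
```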
10. You need to join two huge DataFrames and get the top 10 results by score:

```python
result = (
    df1.join(df2, 'id')
       .orderBy('score', ascending=False)
       .limit(10)
)
```

Which approach is efficient?
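A hedged note on why this pattern is already efficient: Spark compiles `orderBy(...).limit(n)` into a `TakeOrderedAndProject` physical operator, so each partition keeps only its top rows and no fully sorted dataset is shuffled. This can be verified on the plan:

```python
result = (
    df1.join(df2, 'id')
       .orderBy('score', ascending=False)
       .limit(10)
)
result.explain()  # expect TakeOrderedAndProject in the physical plan
```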