PySpark Quiz — Structured Streaming & Production Pipelines
🌊 Structured Streaming & Production Pipelines
1. You have a Kafka topic 'sales' streaming into PySpark. How do you read it?
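A minimal sketch of the usual answer, assuming an existing `spark` session and a broker at localhost:9092 (the address is illustrative):
df = (spark.readStream
      .format('kafka')
      .option('kafka.bootstrap.servers', 'localhost:9092')  # hypothetical broker address
      .option('subscribe', 'sales')
      .load())
# Kafka key/value arrive as binary; cast value to string before parsing
events = df.selectExpr("CAST(value AS STRING) AS value")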
2. You want to compute total sales per product in 5-minute windows. Which aggregation achieves this?
from pyspark.sql.functions import window, sum
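One way to express this, assuming a parsed stream `events` with `timestamp`, `product_id`, and `amount` columns (all assumed names):
totals = (events
          .groupBy(window('timestamp', '5 minutes'), 'product_id')
          .agg(sum('amount').alias('total_sales')))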
3. How do you write a streaming DataFrame to console with checkpointing?
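A sketch, assuming `df` is a streaming DataFrame; the checkpoint path is illustrative:
query = (df.writeStream
         .format('console')
         .outputMode('append')
         .option('checkpointLocation', '/tmp/checkpoints/console')  # hypothetical path
         .start())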
4. You need to update a Delta Lake table with streaming data. Which sink is recommended?
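The Delta sink (format('delta')) is the streaming-native choice; for upserts, foreachBatch with a MERGE is the common alternative. A sketch with assumed paths:
(df.writeStream
 .format('delta')
 .outputMode('append')
 .option('checkpointLocation', '/tmp/checkpoints/delta')  # hypothetical path
 .start('/delta/sales'))                                  # hypothetical table path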
5. Which approach correctly handles late-arriving data?
from pyspark.sql.functions import col, window
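The standard answer is an event-time watermark applied before the windowed aggregation; a sketch assuming an `events` stream with a `timestamp` column:
on_time = (events
           .withWatermark('timestamp', '10 minutes')                  # tolerate data up to 10 minutes late
           .groupBy(window(col('timestamp'), '5 minutes'), 'product_id')
           .count())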
6. You want to execute custom logic per micro-batch in streaming. Which method should you use?
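foreachBatch is the method for running arbitrary per-micro-batch logic; a sketch with a hypothetical sink path:
def process_batch(batch_df, batch_id):
    # custom batch logic: writes, MERGEs, calls to external systems
    batch_df.write.mode('append').parquet('/tmp/output')  # hypothetical path

query = (df.writeStream
         .foreachBatch(process_batch)
         .option('checkpointLocation', '/tmp/checkpoints/batches')  # hypothetical path
         .start())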
7. How do you add a column 'processed_at' with the current timestamp?
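A one-liner with current_timestamp():
from pyspark.sql.functions import current_timestamp

stamped = df.withColumn('processed_at', current_timestamp())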
8. How do you repartition a streaming DataFrame by 'region' before writing?
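repartition works on streaming DataFrames too; a sketch assuming the stream has a 'region' column and illustrative paths:
by_region = df.repartition('region')
query = (by_region.writeStream
         .format('parquet')
         .option('path', '/tmp/sales_by_region')                      # hypothetical path
         .option('checkpointLocation', '/tmp/checkpoints/by_region')  # hypothetical path
         .start())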
9. How do you monitor streaming query progress programmatically?
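The StreamingQuery handle returned by start() exposes the progress APIs:
query = df.writeStream.format('console').start()

print(query.status)          # whether the query is active and what it is doing now
print(query.lastProgress)    # metrics for the most recent micro-batch (rows/sec, batch duration, ...)
print(query.recentProgress)  # list of recent progress reports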
10. You want to write aggregated streaming data to Delta with checkpointing.
from pyspark.sql.functions import window
# Stream from Kafka (broker address is illustrative; parsing of the Kafka value is omitted for brevity)
df = (spark.readStream.format('kafka')
      .option('kafka.bootstrap.servers', 'localhost:9092')
      .option('subscribe', 'sales')
      .load())
# Aggregate sales per product in 5-minute event-time windows
agg_df = df.groupBy(window('timestamp', '5 minutes'), 'product_id').sum('amount')
What is the correct approach?
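One correct approach, continuing the snippet above (paths are illustrative; 'complete' mode rewrites the full aggregate each micro-batch):
query = (agg_df.writeStream
         .format('delta')
         .outputMode('complete')
         .option('checkpointLocation', '/tmp/checkpoints/sales_agg')  # hypothetical path
         .start('/delta/sales_agg'))                                  # hypothetical Delta table path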