PySpark Quiz — Structured Streaming & Production Pipelines
🌊 Structured Streaming & Production Pipelines
1. You have a Kafka topic 'sales' streaming into PySpark. How do you read it?
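A minimal sketch of the usual answer, assuming an existing `spark` session and a broker at localhost:9092 (the address is illustrative):
df = (spark.readStream
      .format('kafka')
      .option('kafka.bootstrap.servers', 'localhost:9092')  # hypothetical broker address
      .option('subscribe', 'sales')
      .load())
# Kafka key/value arrive as binary; cast value to string before parsing
events = df.selectExpr("CAST(value AS STRING) AS value")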
2. You want to compute total sales per product in 5-minute windows. Which aggregation achieves this?
from pyspark.sql.functions import window, sum
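One way to express this, assuming a parsed stream `events` with `timestamp`, `product_id`, and `amount` columns (all assumed names):
totals = (events
          .groupBy(window('timestamp', '5 minutes'), 'product_id')
          .agg(sum('amount').alias('total_sales')))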
3. How do you write a streaming DataFrame to console with checkpointing?
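A sketch, assuming `df` is a streaming DataFrame; the checkpoint path is illustrative:
query = (df.writeStream
         .format('console')
         .outputMode('append')
         .option('checkpointLocation', '/tmp/checkpoints/console')  # hypothetical path
         .start())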
4. You need to update a Delta Lake table with streaming data. Which sink is recommended?
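The Delta sink (format('delta')) is the streaming-native choice; for upserts, foreachBatch with a MERGE is the common alternative. A sketch with assumed paths:
(df.writeStream
 .format('delta')
 .outputMode('append')
 .option('checkpointLocation', '/tmp/checkpoints/delta')  # hypothetical path
 .start('/delta/sales'))                                  # hypothetical table path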
5. Which approach correctly handles late-arriving data?
from pyspark.sql.functions import col, window
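The standard answer is an event-time watermark applied before the windowed aggregation; a sketch assuming an `events` stream with a `timestamp` column:
on_time = (events
           .withWatermark('timestamp', '10 minutes')                  # tolerate data up to 10 minutes late
           .groupBy(window(col('timestamp'), '5 minutes'), 'product_id')
           .count())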
6. You want to execute custom logic per micro-batch in streaming. Which method should you use?
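foreachBatch is the method for running arbitrary per-micro-batch logic; a sketch with a hypothetical sink path:
def process_batch(batch_df, batch_id):
    # custom batch logic: writes, MERGEs, calls to external systems
    batch_df.write.mode('append').parquet('/tmp/output')  # hypothetical path

query = (df.writeStream
         .foreachBatch(process_batch)
         .option('checkpointLocation', '/tmp/checkpoints/batches')  # hypothetical path
         .start())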
7. How do you add a column 'processed_at' with the current timestamp?
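A one-liner with current_timestamp():
from pyspark.sql.functions import current_timestamp

stamped = df.withColumn('processed_at', current_timestamp())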
8. How do you repartition a streaming DataFrame by 'region' before writing?
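repartition works on streaming DataFrames too; a sketch assuming the stream has a 'region' column and illustrative paths:
by_region = df.repartition('region')
query = (by_region.writeStream
         .format('parquet')
         .option('path', '/tmp/sales_by_region')                      # hypothetical path
         .option('checkpointLocation', '/tmp/checkpoints/by_region')  # hypothetical path
         .start())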
9. How do you monitor streaming query progress programmatically?
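The StreamingQuery handle returned by start() exposes the progress APIs:
query = df.writeStream.format('console').start()

print(query.status)          # whether the query is active and what it is doing now
print(query.lastProgress)    # metrics for the most recent micro-batch (rows/sec, batch duration, ...)
print(query.recentProgress)  # list of recent progress reports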
10. You want to write aggregated streaming data to Delta with checkpointing.
from pyspark.sql.functions import window
# Stream from Kafka (broker address is illustrative; parsing of the Kafka value is omitted for brevity)
df = (spark.readStream.format('kafka')
      .option('kafka.bootstrap.servers', 'localhost:9092')
      .option('subscribe', 'sales')
      .load())
# Aggregate sales per product in 5-minute event-time windows
agg_df = df.groupBy(window('timestamp', '5 minutes'), 'product_id').sum('amount')
What is the correct approach?
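One correct approach, continuing the snippet above (paths are illustrative; 'complete' mode rewrites the full aggregate each micro-batch):
query = (agg_df.writeStream
         .format('delta')
         .outputMode('complete')
         .option('checkpointLocation', '/tmp/checkpoints/sales_agg')  # hypothetical path
         .start('/delta/sales_agg'))                                  # hypothetical Delta table path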