Window Functions in PySpark DataFrames
At NeoMart, analysts often need more than simple aggregations.
They want rankings, running totals, moving averages, and session-level analytics.
For example:
- Rank products by revenue per category
- Calculate each customer’s cumulative spend
- Determine the last purchase date for loyalty scoring
This is where window functions come in.
Why Window Functions Matter
- Allow calculations over partitions of data
- Keep row-level detail while performing aggregations
- Essential for ranking, cumulative metrics, and analytics features
- Perfect for dashboards, reporting, and feature engineering in ML
Unlike groupBy, which collapses each group to a single row, window functions retain every row while adding aggregated context, as the sketch below illustrates.
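A minimal sketch of the contrast, assuming a DataFrame df with category and revenue columns (illustrative names):
from pyspark.sql.window import Window
from pyspark.sql.functions import sum as sum_
# groupBy: one output row per category
df.groupBy("category").agg(sum_("revenue").alias("total_revenue")).show()
# window aggregate: every input row survives, each carrying its category total
w = Window.partitionBy("category")
df.withColumn("category_total", sum_("revenue").over(w)).show()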
1. Import Window Functions
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, dense_rank, sum, avg, desc
2. Ranking Rows: row_number()
window_spec = Window.partitionBy("category").orderBy(desc("revenue"))
df.withColumn("rank", row_number().over(window_spec)).show()
Story Example
NeoMart wants the top 3 products per category for its promotion campaigns; a sketch of that filter follows.
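A minimal sketch of the top-3 filter, reusing the window_spec and rank column defined above:
from pyspark.sql.functions import col
top3 = (
    df.withColumn("rank", row_number().over(window_spec))
      .filter(col("rank") <= 3)
)
top3.show()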
3. Cumulative / Running Totals
window_spec = Window.partitionBy("customer_id").orderBy("date")
df.withColumn("running_total", sum("amount").over(window_spec)).show()
This tracks each customer’s cumulative spend over time, essential for loyalty programs.
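One subtlety: with an orderBy and no explicit frame, Spark defaults to a range frame, so rows sharing the same date receive the same running total. To accumulate strictly row by row, spell the frame out; a sketch:
window_spec = (
    Window.partitionBy("customer_id")
          .orderBy("date")
          .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
df.withColumn("running_total", sum("amount").over(window_spec)).show()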
4. Moving Averages
window_spec = Window.orderBy("date").rowsBetween(-3, 0)
df.withColumn("moving_avg", avg("amount").over(window_spec)).show()
Here rowsBetween(-3, 0) spans the three preceding rows plus the current row, a four-row moving window. Note that a window without partitionBy pulls every row into a single partition, so reserve this pattern for small datasets; a partitioned variant follows the use cases below.
Use Case
- Smooth out daily sales volatility
- Detect trends or anomalies
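To keep the computation distributed, scope the window to a partition. A sketch assuming a store_id column (an illustrative name, not part of the original example):
window_spec = (
    Window.partitionBy("store_id")
          .orderBy("date")
          .rowsBetween(-3, 0)
)
df.withColumn("moving_avg", avg("amount").over(window_spec)).show()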
5. Ranking With Ties: rank() vs dense_rank()
window_spec = Window.partitionBy("category").orderBy(desc("revenue"))
df.withColumn("rank", rank().over(window_spec)).show()
df.withColumn("dense_rank", dense_rank().over(window_spec)).show()
- rank() → leaves gaps after ties (e.g., 1, 1, 3)
- dense_rank() → no gaps (e.g., 1, 1, 2)
Useful for leaderboards or top-n product ranking.
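A toy sketch of the difference on tied revenues (illustrative data; assumes an active SparkSession named spark):
data = [("toys", "A", 100), ("toys", "B", 100), ("toys", "C", 80)]
tied = spark.createDataFrame(data, ["category", "product", "revenue"])
tied.select(
    "product",
    rank().over(window_spec).alias("rank"),              # 1, 1, 3
    dense_rank().over(window_spec).alias("dense_rank"),  # 1, 1, 2
).show()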
Summary
Window functions in PySpark:
- Operate over partitions without reducing rows
- Enable rankings, cumulative sums, moving averages
- Are essential for BI, ML features, and advanced analytics
- Retain the flexibility of DataFrame operations
Next, we’ll cover Handling Missing Data — Drop, Fill, Replace, to clean and prepare datasets for analysis and modeling.