Aggregations & GroupBy — Sum, Count, Avg, Max & Min

At NeoMart, raw data is generated every second — orders, clicks, sessions, and product interactions.
To make sense of billions of rows, analysts need aggregated insights: total revenue, number of orders, average cart value, highest-selling products.

This is where groupBy() and aggregations in PySpark become essential.


Why Aggregations Matter

Aggregations help you:

  • Summarize large datasets efficiently
  • Generate key metrics for dashboards
  • Provide input for machine learning
  • Analyze trends across categories, regions, or time

Without aggregation, data remains just a massive raw table.


1. Basic Aggregations with groupBy

The groupBy() method allows you to group rows by one or more columns and apply aggregate functions.

Example — Total revenue by category

# Import the module as F rather than importing sum, avg, count, max, min
# directly, which would shadow Python's built-in functions of the same names
import pyspark.sql.functions as F

df.groupBy("category").agg(
    F.sum("revenue").alias("total_revenue"),
    F.avg("revenue").alias("avg_revenue"),
    F.count("*").alias("total_orders")
).show()

Story Example

NeoMart wants daily sales metrics by category:

  • Electronics: $1M total revenue, 5,000 orders
  • Clothing: $500K total revenue, 3,200 orders

Aggregation converts raw transactions into actionable metrics.
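
To make this concrete, here is a minimal, self-contained sketch that builds a tiny illustrative DataFrame and runs the same aggregation. The sample rows and the app name are made up for demonstration, not real NeoMart data:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("neomart-agg-demo").getOrCreate()

# Illustrative sample transactions: (category, revenue)
data = [
    ("Electronics", 1200.0),
    ("Electronics", 800.0),
    ("Clothing", 150.0),
    ("Clothing", 60.0),
]
df = spark.createDataFrame(data, ["category", "revenue"])

# One output row per category, with total, average, and order count
df.groupBy("category").agg(
    F.sum("revenue").alias("total_revenue"),
    F.avg("revenue").alias("avg_revenue"),
    F.count("*").alias("total_orders")
).show()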


2. Quick Aggregation Functions

Function | Example          | Description
sum()    | sum("sales")     | Total of a numeric column
avg()    | avg("price")     | Average value
count()  | count("*")       | Count of rows
max()    | max("revenue")   | Maximum value
min()    | min("discount")  | Minimum value

Example — Product-level statistics

df.groupBy("product_id").agg(
max("price").alias("max_price"),
min("price").alias("min_price"),
avg("price").alias("avg_price")
).show()
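
The same functions also work without groupBy() when you want a single summary row over the whole table. A small sketch, assuming df still has a price column:

import pyspark.sql.functions as F

# agg() on the bare DataFrame aggregates across all rows,
# returning a single summary row instead of one row per group
df.agg(
    F.min("price").alias("min_price"),
    F.max("price").alias("max_price"),
    F.avg("price").alias("avg_price")
).show()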

3. Aggregation with Multiple Columns

You can group by multiple columns to analyze intersections:

df.groupBy("category", "region").agg(
sum("revenue").alias("total_revenue"),
count("*").alias("orders_count")
).show()
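
Note that groupBy() makes no ordering guarantees, so for reports it helps to sort by an aggregated column. A sketch, assuming the same category, region, and revenue columns:

import pyspark.sql.functions as F

# Sort the grouped result so the highest-revenue
# category/region combinations appear first
(df.groupBy("category", "region")
    .agg(F.sum("revenue").alias("total_revenue"))
    .orderBy(F.desc("total_revenue"))
    .show())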

Story Example

NeoMart wants revenue by category and region to plan inventory and marketing campaigns.


4. Using SQL for Aggregations (Optional)

PySpark allows SQL-style aggregation:

df.createOrReplaceTempView("sales")
spark.sql("""
    SELECT category,
           SUM(revenue) AS total_revenue,
           COUNT(*) AS orders
    FROM sales
    GROUP BY category
""").show()

This is useful for analysts comfortable with SQL syntax.
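
If you want SQL expressions without registering a temp view, expr() lets you embed SQL snippets directly in the DataFrame API. A sketch, again assuming category and revenue columns:

import pyspark.sql.functions as F

# expr() parses a SQL expression string into a Column,
# so no temp view registration is needed
df.groupBy("category").agg(
    F.expr("SUM(revenue) AS total_revenue"),
    F.expr("COUNT(*) AS orders")
).show()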


Summary

Aggregations transform raw, row-level data into business insights:

  • groupBy() + agg() is the foundation
  • Functions like sum, count, avg, max, min generate metrics
  • Multi-column grouping allows fine-grained analysis
  • SQL syntax provides flexibility for analysts

In short, aggregation is where raw Spark tables turn into intelligence for dashboards, ML, and reporting.


Next, we’ll explore Window Functions in PySpark DataFrames, enabling running totals, rankings, and time-based calculations.
