
📐 Linear Regression — Manual Math Breakdown

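Let’s calculate everything by hand to see where the slope (w) and intercept (b) reported by PySpark’s LinearRegression come from. For a single input feature, the least-squares line has a simple closed-form solution, and that is what we work through below.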


🍋 Use Case: Lemonade Stand Sales

| Temperature (x, °C) | Sales (y, units) |
|---|---|
| 20 | 30 |
| 25 | 50 |
| 30 | 70 |
| 35 | 90 |

📘 Step-by-Step Explanation of the Core Formulas

We are trying to fit the equation:

y = w * x + b

Where:

w is the slope
b is the intercept

To compute this manually, here are the steps:

🧮 Step 1: Calculate the Mean of x-values

To find the average (mean) of your input values:

x̄ = (x₁ + x₂ + x₃ + x₄) / 4

This gives you the center point of all the x-values (like average temperature).

📊 Step 2: Calculate the Mean of y-values

Same idea for the target values:

ȳ = (y₁ + y₂ + y₃ + y₄) / 4

This gives the average output (like average sales).
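
As a minimal sketch of Steps 1 and 2, the two means can be computed in plain Python. The variable names `temperatures` and `sales` are illustrative choices, and the values are the four rows of the lemonade table.

```python
from statistics import mean

# Lemonade stand data (x = temperature in °C, y = units sold)
temperatures = [20, 25, 30, 35]
sales = [30, 50, 70, 90]

# Step 1: mean of the x-values; Step 2: mean of the y-values
x_bar = mean(temperatures)
y_bar = mean(sales)

print("x̄ =", x_bar)
print("ȳ =", y_bar)
```

For this data the means come out as x̄ = 27.5 and ȳ = 60.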

📐 Step 3: Compute the Slope (w)

This formula tells you how much y changes for every 1-unit increase in x:

w = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

Breakdown:
- Subtract the mean from each x and y value to get how far each point is from the average.
- Multiply the differences (xᵢ − x̄) and (yᵢ − ȳ) for each row.
- Sum those products (this gives the numerator).
- Then square each (xᵢ − x̄) and sum the squares (this gives the denominator).
- Divide the numerator by the denominator to get w (the slope).
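
The breakdown above can be sketched as a small function. The name `slope` and the guard against identical x-values are additions for illustration, not part of any library.

```python
def slope(xs, ys):
    """Least-squares slope for one feature: Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of the products of each point's deviations from the means
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of the squared deviations of x
    denominator = sum((x - x_bar) ** 2 for x in xs)
    if denominator == 0:
        # Every x is the same, so no unique line fits the data
        raise ValueError("all x values are identical; the slope is undefined")
    return numerator / denominator


w = slope([20, 25, 30, 35], [30, 50, 70, 90])
print("w =", w)
```

The denominator check matters in practice: if every temperature were the same, the formula would divide by zero.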

📏 Step 4: Compute the Intercept (b)

Once you have the slope, plug into this formula:

b = ȳ − w * x̄

Meaning:
The intercept is the predicted value of y when x = 0. It "shifts" the line up or down to fit the data.
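
Here is a short sketch of the intercept formula. The helper name `intercept` is hypothetical, and the inputs (x̄ = 27.5, ȳ = 60, w = 4.0) are the lemonade values worked out in the step-by-step calculation of this article.

```python
def intercept(x_bar, y_bar, w):
    """Intercept of the fitted line: b = ȳ − w·x̄."""
    return y_bar - w * x_bar


# Means and slope for the lemonade data
b = intercept(27.5, 60, 4.0)
print("b =", b)

# Because b = ȳ − w·x̄, the fitted line always passes through the point (x̄, ȳ)
assert 4.0 * 27.5 + b == 60
```

Here b comes out negative. That is a reminder that x = 0 °C lies outside the observed temperatures, so the intercept is an extrapolation rather than a realistic sales figure.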

✅ Final Output

Put it all together:

y = w * x + b

Now you have a model that can predict future values of y (like sales) from new values of x (like temperature).

2️⃣ Step-by-Step Calculation

Step 1: Calculate Means

| x values | y values |
|---|---|
| 20, 25, 30, 35 | 30, 50, 70, 90 |

x̄ = (20 + 25 + 30 + 35) / 4 = 110 / 4 = 27.5
ȳ = (30 + 50 + 70 + 90) / 4 = 240 / 4 = 60

Step 2: Build Table

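| x | y | x − x̄ | y − ȳ | (x − x̄)(y − ȳ) | (x − x̄)² |
|---|---|---|---|---|---|
| 20 | 30 | -7.5 | -30 | 225 | 56.25 |
| 25 | 50 | -2.5 | -10 | 25 | 6.25 |
| 30 | 70 | 2.5 | 10 | 25 | 6.25 |
| 35 | 90 | 7.5 | 30 | 225 | 56.25 |
| Totals | | | | 500 | 125 |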
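
If you want to double-check the table, here is a rough sketch that rebuilds each row in plain Python. The column labels `dx` and `dy` are shorthand introduced here for (x − x̄) and (y − ȳ).

```python
xs = [20, 25, 30, 35]
ys = [30, 50, 70, 90]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# One tuple per data point: x, y, dx, dy, dx*dy, dx^2
rows = []
for x, y in zip(xs, ys):
    dx = x - x_bar
    dy = y - y_bar
    rows.append((x, y, dx, dy, dx * dy, dx ** 2))

print("{:>4} {:>4} {:>6} {:>6} {:>8} {:>8}".format("x", "y", "dx", "dy", "dx*dy", "dx^2"))
for row in rows:
    print("{:>4} {:>4} {:>6} {:>6} {:>8} {:>8}".format(*row))

# Column totals used as numerator and denominator of the slope
total_products = sum(row[4] for row in rows)
total_squares = sum(row[5] for row in rows)
print("Totals:", total_products, total_squares)
```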

Step 3: Calculate Coefficient & Intercept

w = 500 / 125 = 4.0
b = 60 - (4.0 × 27.5) = -50

✅ Final model: y = 4.0x - 50

3️⃣ Test the Manual Model

| x (°C) | Predicted y = 4x − 50 | Actual y |
|---|---|---|
| 20 | 30 | 30 ✅ |
| 25 | 50 | 50 ✅ |
| 30 | 70 | 70 ✅ |
| 35 | 90 | 90 ✅ |

🎯 Perfect fit!
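
The same check can be sketched in code by computing the residuals (actual minus predicted). The function name `predict` and the 28 °C example are illustrative assumptions. The coefficients are the ones from the manual model.

```python
def predict(x, w=4.0, b=-50.0):
    """Predicted sales at temperature x using the manual model y = 4.0x - 50."""
    return w * x + b


observed = [(20, 30), (25, 50), (30, 70), (35, 90)]

# Residual = actual − predicted; a perfect fit leaves every residual at zero
residuals = [y - predict(x) for x, y in observed]
print("Residuals:", residuals)

# Prediction for a temperature that is not in the table (28 °C is an arbitrary example)
print("Predicted sales at 28 °C:", predict(28))
```

The residuals are all zero only because these four points lie exactly on a line. With real sales data you would expect small non-zero residuals.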

🔁 PySpark Comparison

Here’s how the exact same model looks using PySpark:

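from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Reuse the active session (e.g. in a notebook) or start one; the app name is arbitrary
spark = SparkSession.builder.appName("lemonade-regression").getOrCreate()

# (temperature in °C, sales in units)
data = [(20, 30), (25, 50), (30, 70), (35, 90)]
df = spark.createDataFrame(data, ["temperature", "sales"])

# Spark ML expects the input columns packed into a single vector column
assembler = VectorAssembler(inputCols=["temperature"], outputCol="features")
df_features = assembler.transform(df).select("features", "sales")

# With the default regParam=0.0 there is no regularization (ordinary least squares)
lr = LinearRegression(featuresCol="features", labelCol="sales")
model = lr.fit(df_features)

print("Coefficient:", model.coefficients)
print("Intercept:", model.intercept)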

Result

Coefficient: [4.0]
Intercept: -50.0

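✅ Matches the manual result. Depending on the Spark version and solver, the printed values may differ from 4.0 and -50.0 in the last decimal places because of floating-point rounding.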

🎓 Why Learn This?

| Reason | Benefit |
|---|---|
| Build intuition | Understand what slope and intercept really mean |
| Debugging skills | Check if your ML models are making sense |
| ML foundation | You'll understand more complex models better later |

🔑 1-Minute Summary — Manual Linear Regression (Lemonade Sales Example)

| Step | What You Did |
|---|---|
| 📊 Raw Data | Temperature and sales from a lemonade stand |
| 🧮 Goal | Fit a line y = w * x + b to predict sales from temperature |
| 📌 Formulas Used | Slope: w = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²; Intercept: b = ȳ − w * x̄ |
| 📈 Mean Values | x̄ = 27.5, ȳ = 60 |
| ✍️ Computed Table | Calculated (x − x̄)(y − ȳ) and (x − x̄)² for all data points |
| Sums | Numerator Σ(x − x̄)(y − ȳ) = 500, Denominator Σ(x − x̄)² = 125 |
| 📐 Slope (w) | w = 500 / 125 = 4.0 |
| 🧾 Intercept (b) | b = 60 − (4.0 × 27.5) = -50 |
| Final Equation | y = 4.0x − 50 |
| 🔮 Manual Predictions | All predicted values match the actual ones |
| 🔁 Compared with PySpark | PySpark model gave the same result: Coefficient = 4.0, Intercept = -50.0 |
| 🧠 Why This Matters | Builds intuition, helps interpret what the model means, and validates ML results |

Next topic: Predicting House Price from Size Using Linear Regression (PySpark)
