📐 Linear Regression: Manual Math Breakdown

Let's calculate everything manually to understand how PySpark computes the slope (w) and intercept (b).


🍋 Use Case: Lemonade Stand Sales

| Temperature (x, °C) | Sales (y, units) |
| --- | --- |
| 20 | 30 |
| 25 | 50 |
| 30 | 70 |
| 35 | 90 |

📘 Step-by-Step Explanation of the Core Formulas

We are trying to fit the equation:

y = w * x + b

Where:

w is the slope
b is the intercept

To compute this manually, here are the steps:

🧮 Step 1: Calculate the Mean of x-values

To find the average (mean) of your input values:

x̄ = (x₁ + x₂ + x₃ + x₄) / 4

This gives you the center point of all the x-values (like average temperature).

📊 Step 2: Calculate the Mean of y-values

Same idea for the target values:

ȳ = (y₁ + y₂ + y₃ + y₄) / 4

This gives the average output (like average sales).
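Steps 1 and 2 can be sketched in plain Python. This is a minimal illustration that assumes the lemonade data from the table above; the variable names (`temperatures`, `sales`, `x_bar`, `y_bar`) are chosen here for readability and are not part of PySpark.

```python
# Lemonade-stand data from the use-case table
temperatures = [20, 25, 30, 35]  # x, in °C
sales = [30, 50, 70, 90]         # y, in units

# Mean = sum of the values divided by how many there are
x_bar = sum(temperatures) / len(temperatures)
y_bar = sum(sales) / len(sales)

print(x_bar, y_bar)  # 27.5 60.0
```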

📐 Step 3: Compute the Slope (w)

This formula tells you how much y changes for every 1-unit increase in x:

w = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

Breakdown:
- Subtract the mean from each x and y value to get how far each point is from the average.
- Multiply the differences (xᵢ − x̄) and (yᵢ − ȳ) for each row.
- Sum those products (this gives the numerator).
- Then, square each (xᵢ − x̄), and sum them (this gives the denominator).
- Divide numerator by denominator to get w (slope).
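The five steps in the breakdown map almost line for line onto code. The sketch below is one possible plain-Python version; the function name `slope` is hypothetical, and the guard against identical x values is an addition made here because the denominator would otherwise be zero.

```python
def slope(xs, ys):
    """Least-squares slope w, following the breakdown steps."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator: sum of the products of the x and y deviations
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of the squared x deviations
    denominator = sum((x - x_bar) ** 2 for x in xs)
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    return numerator / denominator


# Lemonade data: temperature (°C) and sales (units)
w = slope([20, 25, 30, 35], [30, 50, 70, 90])
print(w)  # 4.0
```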

📏 Step 4: Compute the Intercept (b)

Once you have the slope, plug into this formula:

b = ȳ − w * x̄

Meaning:
The intercept is the predicted value of y when x = 0. It "shifts" the line up or down to fit the data.
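A short sketch of the intercept step, assuming the lemonade data and the slope of 4.0 that the worked calculation in this article arrives at. The function name `intercept` is hypothetical. Because b is defined as ȳ − w * x̄, the fitted line always passes through the point of means (x̄, ȳ), which the last line checks.

```python
def intercept(xs, ys, w):
    """Intercept b = y_bar - w * x_bar for a given slope w."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    return y_bar - w * x_bar


temperatures = [20, 25, 30, 35]
sales = [30, 50, 70, 90]
w = 4.0  # slope for this data set, taken from the worked calculation

b = intercept(temperatures, sales, w)
print(b)  # -50.0

# The line passes through the point of means (27.5, 60.0)
print(w * 27.5 + b)  # 60.0
```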

✅ Final Output

Put it all together:

y = w * x + b

Now you have a model that can predict future values of y (like sales) from new values of x (like temperature).

2️⃣ Step-by-Step Calculation

Step 1: Calculate Means

| x values | y values |
| --- | --- |
| 20, 25, 30, 35 | 30, 50, 70, 90 |
| x̄ = 27.5 | ȳ = 60 |

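Step 2: Build Table

| x | y | x - x̄ | y - ȳ | (x−x̄)(y−ȳ) | (x−x̄)² |
| --- | --- | --- | --- | --- | --- |
| 20 | 30 | -7.5 | -30 | 225 | 56.25 |
| 25 | 50 | -2.5 | -10 | 25 | 6.25 |
| 30 | 70 | 2.5 | 10 | 25 | 6.25 |
| 35 | 90 | 7.5 | 30 | 225 | 56.25 |
| Totals | | | | 500 | 125 |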

Step 3: Calculate Coefficient & Intercept

w = 500 / 125 = 4.0
b = 60 - (4.0 × 27.5) = -50

✅ Final model: y = 4.0x - 50

3️⃣ Test the Manual Model

| x (°C) | Predicted y = 4x - 50 | Actual y |
| --- | --- | --- |
| 20 | 30 | 30 ✅ |
| 25 | 50 | 50 ✅ |
| 30 | 70 | 70 ✅ |
| 35 | 90 | 90 ✅ |

🎯 Perfect fit!
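The same check can be written as a few lines of Python. This sketch assumes the final model y = 4.0x - 50 from the manual calculation; the `predict` helper and the 40 °C input are hypothetical and only illustrate how the line would be used on a new value. Since 40 °C lies outside the observed 20 to 35 °C range, that prediction is an extrapolation and should be treated with caution.

```python
def predict(x, w=4.0, b=-50.0):
    """Predicted sales for temperature x using y = w * x + b."""
    return w * x + b


data = [(20, 30), (25, 50), (30, 70), (35, 90)]

# Residual = actual value minus predicted value; all zeros means a perfect fit
residuals = [y - predict(x) for x, y in data]
print(residuals)  # [0.0, 0.0, 0.0, 0.0]

# Hypothetical new temperature, not part of the original data
print(predict(40))  # 110.0
```

The residuals are all zero only because this data set was built to lie exactly on a line; real sales data would normally leave some error at each point.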

🔁 PySpark Comparison

Here's how the exact same model looks using PySpark:

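from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

# Reuse the active session (e.g. in a notebook) or start a local one
spark = SparkSession.builder.appName("lemonade-regression").getOrCreate()

# (temperature in °C, sales in units)
data = [(20, 30), (25, 50), (30, 70), (35, 90)]
df = spark.createDataFrame(data, ["temperature", "sales"])

# Spark ML expects the inputs packed into a single vector column
assembler = VectorAssembler(inputCols=["temperature"], outputCol="features")
df_features = assembler.transform(df).select("features", "sales")

# Defaults apply no regularization (regParam=0.0), so this is ordinary least squares
lr = LinearRegression(featuresCol="features", labelCol="sales")
model = lr.fit(df_features)

print("Coefficient:", model.coefficients)
print("Intercept:", model.intercept)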

Result

Coefficient: [4.0]
Intercept: -50.0

✅ Matches the manual result, up to floating-point rounding.

🎓 Why Learn This?

| Reason | Benefit |
| --- | --- |
| Build intuition | Understand what slope and intercept really mean |
| Debugging skills | Check if your ML models are making sense |
| ML foundation | You'll understand more complex models better later |

🔑 1-Minute Summary: Manual Linear Regression (Lemonade Sales Example)

| Step | What You Did |
| --- | --- |
| 📊 Raw Data | Temperature and sales from a lemonade stall |
| 🧮 Goal | Fit a line y = w*x + b to predict sales from temperature |
| 📌 Formulas Used | Slope: w = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²; Intercept: b = ȳ − w*x̄ |
| 📈 Mean Values | x̄ = 27.5, ȳ = 60 |
| ✏️ Computed Table | Calculated (x − x̄)(y − ȳ) and (x − x̄)² for all data points |
| ➕ Sum of Products | Numerator = 500, Denominator = 125 |
| 📐 Slope (w) | w = 500 / 125 = 4.0 |
| 🧾 Intercept (b) | b = 60 - (4.0 * 27.5) = -50 |
| ✅ Final Equation | y = 4.0x - 50 |
| 🔮 Manual Predictions | All predicted values match actual ones perfectly |
| 🔁 Compared with PySpark | PySpark model gave same result: Coefficient = 4.0, Intercept = -50.0 |
| 🧠 Why This Matters | Builds intuition, helps interpret model meaning, and validates ML results |