Linear Regression: Manual Math Breakdown
Let's calculate everything manually to understand how PySpark arrives at the slope (w) and intercept (b).
Use Case: Lemonade Stand Sales
Temperature (x, °C) | Sales (y, units) |
---|---|
20 | 30 |
25 | 50 |
30 | 70 |
35 | 90 |
Step-by-Step Explanation of the Core Formulas
We are trying to fit the equation:
y = w * x + b
Where:
w is the slope
b is the intercept
To compute this manually, here are the steps:
🧮 Step 1: Calculate the Mean of x-values
To find the average (mean) of your input values:
x̄ = (x₁ + x₂ + x₃ + x₄) / 4
This gives you the center point of all the x-values (like average temperature).
Step 2: Calculate the Mean of y-values
Same idea for the target values:
ȳ = (y₁ + y₂ + y₃ + y₄) / 4
This gives the average output (like average sales).
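The two means can be checked with a few lines of plain Python. This is a minimal sketch using the lemonade data from the use-case table; the names `temperatures`, `sales`, and `mean` are illustrative choices and are not part of PySpark.

```python
# Lemonade stand data from the use-case table (illustrative variable names).
temperatures = [20, 25, 30, 35]  # x values, in °C
sales = [30, 50, 70, 90]         # y values, in units


def mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


x_mean = mean(temperatures)
y_mean = mean(sales)

print("mean of x:", x_mean)  # mean of x: 27.5
print("mean of y:", y_mean)  # mean of y: 60.0
```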
Step 3: Compute the Slope (w)
This formula tells you how much y changes for every 1-unit increase in x:
w = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
Breakdown:
- Subtract the mean from each x and y value to get how far each point is from the average.
- Multiply the differences (xᵢ − x̄) and (yᵢ − ȳ) for each row.
- Sum those products (this gives the numerator).
- Square each (xᵢ − x̄) and sum the squares (this gives the denominator).
- Divide the numerator by the denominator to get w (the slope).
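The breakdown above can be sketched as a small Python function. This is an illustration under the assumption of a single input feature; the function name `slope` is hypothetical, and the guard against identical x values is an addition not discussed in the text.

```python
def slope(xs, ys):
    """Least-squares slope for one feature, following the five steps above."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    # Numerator: sum of the products of each point's deviations from the means.
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    # Denominator: sum of the squared x deviations.
    denominator = sum((x - x_mean) ** 2 for x in xs)
    if denominator == 0:
        # Added safeguard: with identical x values the slope is undefined.
        raise ValueError("all x values are identical, so the slope is undefined")
    return numerator / denominator


w = slope([20, 25, 30, 35], [30, 50, 70, 90])
print("slope w:", w)  # slope w: 4.0
```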
Step 4: Compute the Intercept (b)
Once you have the slope, plug it into this formula:
b = ȳ − w * x̄
Meaning:
The intercept is the predicted value of y when x = 0. It "shifts" the line up or down to fit the data.
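As a rough sketch, the intercept formula translates directly into Python. The function name `intercept` is hypothetical, and the slope value 4.0 passed in is the one this walkthrough computes for the lemonade data.

```python
def intercept(xs, ys, w):
    """Intercept b = y_mean - w * x_mean for a line with slope w."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    return y_mean - w * x_mean


xs = [20, 25, 30, 35]  # temperature, in °C
ys = [30, 50, 70, 90]  # sales, in units
b = intercept(xs, ys, w=4.0)
print("intercept b:", b)  # intercept b: -50.0
```

One consequence of this formula is that the fitted line always passes through the point of means: plugging x̄ into y = w * x + b returns ȳ.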
✅ Final Output
Put it all together:
y = w * x + b
Now you have a model that can predict future values of y (like sales) from new values of x (like temperature).
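To make the prediction step concrete, here is a minimal sketch of a line model in Python. `LineModel` is a hypothetical helper, not a PySpark class; the coefficients are the ones this walkthrough derives for the lemonade data, and the new temperature of 28 °C is an invented input for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineModel:
    """A fitted line y = w * x + b (hypothetical helper for illustration)."""

    w: float  # slope
    b: float  # intercept

    def predict(self, x: float) -> float:
        """Return the predicted y for a given x."""
        return self.w * x + self.b


model = LineModel(w=4.0, b=-50.0)
new_temperature = 28  # hypothetical new input, in °C
print("predicted sales:", model.predict(new_temperature))  # predicted sales: 62.0
```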
2️⃣ Step-by-Step Calculation
Step 1: Calculate Means
x values | y values |
---|---|
20, 25, 30, 35 | 30, 50, 70, 90 |
x̄ = 27.5
ȳ = 60
Step 2: Build Table
x | y | x − x̄ | y − ȳ | (x − x̄)(y − ȳ) | (x − x̄)² |
---|---|---|---|---|---|
20 | 30 | -7.5 | -30 | 225 | 56.25 |
25 | 50 | -2.5 | -10 | 25 | 6.25 |
30 | 70 | 2.5 | 10 | 25 | 6.25 |
35 | 90 | 7.5 | 30 | 225 | 56.25 |
Totals | | | | 500 | 125 |
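The table can be reproduced programmatically. The following is a small sketch in plain Python, with illustrative variable names, that builds one row per data point and then sums the last two columns.

```python
xs = [20, 25, 30, 35]  # temperature, in °C
ys = [30, 50, 70, 90]  # sales, in units

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# Each row: x, y, x - x_mean, y - y_mean, product of deviations, squared x deviation.
rows = []
for x, y in zip(xs, ys):
    dx = x - x_mean
    dy = y - y_mean
    rows.append((x, y, dx, dy, dx * dy, dx ** 2))

total_products = sum(row[4] for row in rows)  # numerator of the slope formula
total_squares = sum(row[5] for row in rows)   # denominator of the slope formula

for row in rows:
    print(" | ".join(str(value) for value in row))
print("Totals:", total_products, total_squares)  # Totals: 500.0 125.0
```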
Step 3: Calculate Coefficient & Intercept
w = 500 / 125 = 4.0
b = 60 - (4.0 × 27.5) = 60 - 110 = -50
✅ Final model: y = 4.0x - 50
3️⃣ Test the Manual Model
x (°C) | Predicted y = 4x - 50 | Actual y |
---|---|---|
20 | 30 | 30 ✅ |
25 | 50 | 50 ✅ |
30 | 70 | 70 ✅ |
35 | 90 | 90 ✅ |
🎯 Perfect fit! Every prediction equals the actual value because the four data points lie exactly on a straight line.
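The same check can be scripted by computing the residuals (actual minus predicted) for each point. This is a minimal sketch with illustrative names, using the coefficients from the manual calculation.

```python
xs = [20, 25, 30, 35]  # temperature, in °C
ys = [30, 50, 70, 90]  # actual sales, in units
w, b = 4.0, -50.0      # slope and intercept from the manual calculation

predictions = [w * x + b for x in xs]
# Residual = actual value minus predicted value; zero means an exact match.
residuals = [y - p for y, p in zip(ys, predictions)]

for x, p, y, r in zip(xs, predictions, ys, residuals):
    print(f"x={x}: predicted={p}, actual={y}, residual={r}")

print("all residuals zero:", all(r == 0 for r in residuals))  # all residuals zero: True
```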
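PySpark Comparison
Here's how the exact same model looks using PySpark:
```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Reuse the active session (e.g. in a notebook) or start a new one.
spark = SparkSession.builder.getOrCreate()

data = [(20, 30), (25, 50), (30, 70), (35, 90)]
df = spark.createDataFrame(data, ["temperature", "sales"])

# Spark ML expects the input features packed into a single vector column.
assembler = VectorAssembler(inputCols=["temperature"], outputCol="features")
df_features = assembler.transform(df).select("features", "sales")

# Fit a linear regression; regularization is off by default (regParam=0.0).
lr = LinearRegression(featuresCol="features", labelCol="sales")
model = lr.fit(df_features)
print("Coefficient:", model.coefficients)
print("Intercept:", model.intercept)
```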
Result
Coefficient: [4.0]
Intercept: -50.0
✅ Matches the manual result exactly.
Why Learn This?
Reason | Benefit |
---|---|
Build intuition | Understand what slope and intercept really mean |
Debugging skills | Check if your ML models are making sense |
ML foundation | You'll understand more complex models better later |
1-Minute Summary: Manual Linear Regression (Lemonade Sales Example)
Step | What You Did |
---|---|
Raw Data | Temperature and sales from a lemonade stand |
🧮 Goal | Fit a line y = w*x + b to predict sales from temperature |
Formulas Used | Slope: w = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²; Intercept: b = ȳ − w*x̄ |
Mean Values | x̄ = 27.5, ȳ = 60 |
✏️ Computed Table | Calculated (x − x̄)(y − ȳ) and (x − x̄)² for all data points |
Sums | Numerator = 500, Denominator = 125 |
Slope (w) | w = 500 / 125 = 4.0 |
🧾 Intercept (b) | b = 60 - (4.0 * 27.5) = -50 |
✅ Final Equation | y = 4.0x - 50 |
🔮 Manual Predictions | All predicted values match the actual ones exactly |
Compared with PySpark | PySpark model gave the same result: Coefficient = 4.0, Intercept = -50.0 |
🧠 Why This Matters | Builds intuition, helps interpret the model, and validates ML results |