Linear Regression: Manual Math Breakdown
Let's calculate everything manually to understand how PySpark computes the slope (w) and intercept (b).
Use Case: Lemonade Stand Sales
| Temperature (x, °C) | Sales (y, units) |
|---|---|
| 20 | 30 |
| 25 | 50 |
| 30 | 70 |
| 35 | 90 |
Step-by-Step Explanation of the Core Formulas
We are trying to fit the equation:
y = w * x + b
Where:
w is the slope
b is the intercept
To compute this manually, here are the steps:
Step 1: Calculate the Mean of the x-values
To find the average (mean) of your input values:
x̄ = (x₁ + x₂ + x₃ + x₄) / 4
This gives you the center point of all the x-values (like average temperature).
Step 2: Calculate the Mean of the y-values
Same idea for the target values:
ȳ = (y₁ + y₂ + y₃ + y₄) / 4
This gives the average output (like average sales).
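Steps 1 and 2 can be sketched in plain Python. This is a minimal illustration using the lemonade data from the table above; the names `temperatures`, `sales`, and `mean` are chosen here for the example and are not part of PySpark.

```python
temperatures = [20, 25, 30, 35]  # x values (°C)
sales = [30, 50, 70, 90]         # y values (units)


def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


x_mean = mean(temperatures)
y_mean = mean(sales)
print(x_mean, y_mean)  # 27.5 60.0
```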
Step 3: Compute the Slope (w)
This formula tells you how much y changes for every 1-unit increase in x:
w = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
Breakdown:
- Subtract the mean from each x and y value to get how far each point is from the average.
- Multiply the differences (xᵢ - x̄) and (yᵢ - ȳ) for each row.
- Sum those products (this gives the numerator).
- Square each (xᵢ - x̄) and sum the squares (this gives the denominator).
- Divide the numerator by the denominator to get w (the slope).
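The breakdown above maps almost line for line onto code. The sketch below is a plain-Python illustration, not PySpark's implementation; the function name `slope` and its zero-denominator check are assumptions added for this example.

```python
def slope(xs, ys):
    """Least-squares slope: sum of products of deviations over sum of squared x deviations."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)

    # Numerator: multiply each pair of deviations, then sum the products.
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    # Denominator: square each x deviation, then sum the squares.
    denominator = sum((x - x_mean) ** 2 for x in xs)

    if denominator == 0:
        # All x values are identical, so no unique slope exists.
        raise ValueError("slope is undefined when every x value is the same")
    return numerator / denominator


w = slope([20, 25, 30, 35], [30, 50, 70, 90])
print(w)  # 4.0
```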
Step 4: Compute the Intercept (b)
Once you have the slope, plug it into this formula:
b = ȳ - w * x̄
Meaning:
The intercept is the predicted value of y when x = 0. It "shifts" the line up or down to fit the data.
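A short sketch of this step, assuming the slope is already known. The helper name `intercept` is hypothetical. The last line checks a consequence of the formula: rearranging b = ȳ - w * x̄ shows that the fitted line passes through the point of means (x̄, ȳ).

```python
def intercept(xs, ys, w):
    """Intercept b = y_mean - w * x_mean for a line with slope w."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    return y_mean - w * x_mean


temperatures = [20, 25, 30, 35]
sales = [30, 50, 70, 90]
w = 4.0  # slope for this data set, obtained from the slope formula in Step 3

b = intercept(temperatures, sales, w)
print(b)  # -50.0

# The line evaluated at x_mean = 27.5 returns y_mean = 60.
print(w * 27.5 + b)  # 60.0
```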
Final Output
Put it all together:
y = w * x + b
Now you have a model that can predict future values of y (like sales) from new values of x (like temperature).
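To illustrate prediction, here is a minimal sketch. The names `make_predictor` and `predict_sales` are made up for this example, and the values w = 4.0 and b = -50.0 are the ones worked out for the lemonade data in this walkthrough.

```python
def make_predictor(w, b):
    """Return a function that evaluates the fitted line y = w * x + b."""
    def predict(x):
        return w * x + b
    return predict


predict_sales = make_predictor(4.0, -50.0)

print(predict_sales(28))  # 62.0
# 40 °C lies outside the observed 20-35 °C range, so this is an extrapolation.
print(predict_sales(40))  # 110.0
```

Predictions inside the observed temperature range are generally safer than extrapolations, since nothing in four data points guarantees the trend continues.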
2. Step-by-Step Calculation
Step 1: Calculate Means
| x values | y values |
|---|---|
| 20, 25, 30, 35 | 30, 50, 70, 90 |
x̄ = (20 + 25 + 30 + 35) / 4 = 110 / 4 = 27.5
ȳ = (30 + 50 + 70 + 90) / 4 = 240 / 4 = 60
Step 2: Build the Table
| x | y | x - x̄ | y - ȳ | (x - x̄)(y - ȳ) | (x - x̄)² |
|---|---|---|---|---|---|
| 20 | 30 | -7.5 | -30 | 225 | 56.25 |
| 25 | 50 | -2.5 | -10 | 25 | 6.25 |
| 30 | 70 | 2.5 | 10 | 25 | 6.25 |
| 35 | 90 | 7.5 | 30 | 225 | 56.25 |
| Totals | | | | 500 | 125 |
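The same table can be generated programmatically, which helps when checking hand arithmetic. This is a small plain-Python sketch; the tuple layout of `rows` simply mirrors the table columns and is an arbitrary choice for this example.

```python
temperatures = [20, 25, 30, 35]
sales = [30, 50, 70, 90]

x_mean = sum(temperatures) / len(temperatures)
y_mean = sum(sales) / len(sales)

# One tuple per data point: (x, y, x - x_mean, y - y_mean, product, squared x deviation)
rows = []
for x, y in zip(temperatures, sales):
    dx = x - x_mean
    dy = y - y_mean
    rows.append((x, y, dx, dy, dx * dy, dx ** 2))

for row in rows:
    print(*row, sep="\t")

sum_products = sum(row[4] for row in rows)  # numerator of the slope formula
sum_squares = sum(row[5] for row in rows)   # denominator of the slope formula
print("Totals:", sum_products, sum_squares)  # Totals: 500.0 125.0
```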
Step 3: Calculate Coefficient & Intercept
w = 500 / 125 = 4.0
b = 60 - (4.0 × 27.5) = 60 - 110 = -50
Final model: y = 4.0x - 50
3. Test the Manual Model
| x (°C) | Predicted y = 4x - 50 | Actual y |
|---|---|---|
| 20 | 30 | 30 |
| 25 | 50 | 50 |
| 30 | 70 | 70 |
| 35 | 90 | 90 |
Perfect fit: every predicted value equals the actual value.
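One way to make "perfect fit" precise is to look at the residuals (actual minus predicted) and at R², the coefficient of determination. R² is not used elsewhere in this walkthrough; it is introduced here only as an illustrative check, and the variable names are assumptions of this sketch.

```python
temperatures = [20, 25, 30, 35]
sales = [30, 50, 70, 90]
w, b = 4.0, -50.0  # the manually derived model

predictions = [w * x + b for x in temperatures]
residuals = [y - y_hat for y, y_hat in zip(sales, predictions)]
print(residuals)  # [0.0, 0.0, 0.0, 0.0]

# R² = 1 - (sum of squared residuals) / (total sum of squares around the mean)
y_mean = sum(sales) / len(sales)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((y - y_mean) ** 2 for y in sales)
r_squared = 1 - ss_res / ss_tot
print(r_squared)  # 1.0
```

Real-world data rarely lies exactly on a line, so nonzero residuals and an R² below 1 are the usual case.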
PySpark Comparison
Here's how the same model looks in PySpark:
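```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

# Reuse the active session (as in a notebook) or start a new one.
spark = SparkSession.builder.getOrCreate()

# Each row is (temperature in °C, sales in units).
data = [(20, 30), (25, 50), (30, 70), (35, 90)]
df = spark.createDataFrame(data, ["temperature", "sales"])

# Spark ML expects the features packed into a single vector column.
assembler = VectorAssembler(inputCols=["temperature"], outputCol="features")
df_features = assembler.transform(df).select("features", "sales")

# With the default regParam=0.0 this is an ordinary least-squares fit.
lr = LinearRegression(featuresCol="features", labelCol="sales")
model = lr.fit(df_features)
print("Coefficient:", model.coefficients)
print("Intercept:", model.intercept)
```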
Result
Coefficient: [4.0]
Intercept: -50.0
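This matches the manual result (w = 4.0, b = -50). Spark may print values that differ in the last decimal places because of floating-point rounding.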
Why Learn This?
| Reason | Benefit |
|---|---|
| Build intuition | Understand what slope and intercept really mean |
| Debugging skills | Check if your ML models are making sense |
| ML foundation | You'll understand more complex models better later |
1-Minute Summary: Manual Linear Regression (Lemonade Sales Example)
| Step | What You Did |
|---|---|
| Raw Data | Temperature and sales from a lemonade stand |
| Goal | Fit a line y = w*x + b to predict sales from temperature |
| Formulas Used | Slope: w = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²; Intercept: b = ȳ - w*x̄ |
| Mean Values | x̄ = 27.5, ȳ = 60 |
| Computed Table | Calculated (x - x̄)(y - ȳ) and (x - x̄)² for all data points |
| Sum of Products | Numerator = 500, Denominator = 125 |
| Slope (w) | w = 500 / 125 = 4.0 |
| Intercept (b) | b = 60 - (4.0 * 27.5) = -50 |
| Final Equation | y = 4.0x - 50 |
| Manual Predictions | All predicted values match the actual ones |
| Compared with PySpark | PySpark model gave the same result: Coefficient = 4.0, Intercept = -50.0 |
| Why This Matters | Builds intuition, helps interpret model meaning, and validates ML results |