🏠 Predicting House Price from Size Using Linear Regression (PySpark)

Let’s say you're working with a real estate startup.
They ask:

💬 "Can we estimate a house’s price based on its size in square feet?"

That’s a regression problem: predicting a continuous number (like price).
We’ll use Linear Regression, which fits a straight line (price = w × size + b) through the data.

📊 Example Dataset

| Size (sq.ft) | Price ($1000s) |
|--------------|----------------|
| 850 | 100 |
| 900 | 120 |
| 1000 | 150 |
| 1200 | 200 |
| 1500 | 250 |

✅ Step 1: Import Libraries & Create Data

from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

# Start Spark session
spark = SparkSession.builder.appName("LinearRegressionExample").getOrCreate()

# Create dataset (size, price)
data = [
(850, 100),
(900, 120),
(1000, 150),
(1200, 200),
(1500, 250)
]

df = spark.createDataFrame(data, ["size", "price"])
df.show()

💡 What This Does:

| Line | Meaning |
|------|---------|
| `SparkSession` | Starts Spark (needed for all PySpark work) |
| `createDataFrame(data, columns)` | Creates a DataFrame (like a table) with columns `size` and `price` |
| `df.show()` | Prints the data table |

✅ Step 2: Assemble Features

Spark ML expects features in a single vector column.

assembler = VectorAssembler(inputCols=["size"], outputCol="features")
df_features = assembler.transform(df).select("features", "price")
df_features.show(truncate=False)

💡 What This Does:

| Line | Meaning |
|------|---------|
| `VectorAssembler(...)` | Packs the `size` column into a `features` vector column (needed by Spark ML) |
| `transform(df)` | Applies the assembler to your DataFrame |
| `select("features", "price")` | Keeps only the `features` and `price` columns |

Result

| features | price |
| -------- | ----- |
| [850.0] | 100 |
| [900.0] | 120 |
| [1000.0] | 150 |
| [1200.0] | 200 |
| [1500.0] | 250 |
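
Conceptually, the assemble-then-select step above packs the chosen input columns of each row into one vector and keeps the label next to it. A plain-Python sketch of that idea, assuming rows are dictionaries (the `assemble` helper is hypothetical and is not how Spark implements `VectorAssembler`):

```python
def assemble(row, input_cols, output_col="features"):
    """Pack the values of input_cols into one list stored under output_col."""
    packed = [float(row[col]) for col in input_cols]
    # Keep every column that was not packed (here: the label column).
    rest = {k: v for k, v in row.items() if k not in input_cols}
    return {output_col: packed, **rest}


rows = [{"size": 850, "price": 100}, {"size": 900, "price": 120}]
for row in rows:
    print(assemble(row, ["size"]))
```

With more input columns, for example a hypothetical `bedrooms` column, the same call would produce a two-element vector per row, which is why Spark ML asks for a single vector column rather than one column per feature.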

✅ Step 3: Train the Linear Regression Model

lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(df_features)

💡 What This Does:

| Line | Meaning |
|------|---------|
| `LinearRegression(...)` | Configures the model and tells it which column holds the features and which holds the label |
| `model = lr.fit(df_features)` | Trains (fits) the model using your data |

✅ Step 4: See What the Model Learned

print("Coefficient (w):", model.coefficients)
print("Intercept (b):", model.intercept)

Result (values rounded)

Coefficient (w): [0.22766]
Intercept (b): -84.149

This means your learned equation is approximately:

price = 0.22766 × size - 84.149

👉 So a 1000 sq.ft house would be:

price = 0.22766 × 1000 - 84.149 ≈ 143.5, or about $143.5k
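
For a single feature and no regularization (the `LinearRegression` default, `regParam=0.0`), the least-squares line has a closed form, so the learned numbers can be cross-checked without Spark. A minimal plain-Python sketch, assuming ordinary least squares on the five rows from Step 1; `fit_line` is a hypothetical helper, not part of PySpark:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope w, intercept b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # w = covariance(x, y) / variance(x); the 1/n factors cancel.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    w = s_xy / s_xx
    # The fitted line always passes through (mean_x, mean_y).
    b = mean_y - w * mean_x
    return w, b


sizes = [850, 900, 1000, 1200, 1500]
prices = [100, 120, 150, 200, 250]
w, b = fit_line(sizes, prices)
print(f"w = {w:.5f}, b = {b:.3f}")
print(f"1000 sq.ft -> {w * 1000 + b:.1f} ($1000s)")
```

Comparing the printed `w` and `b` with `model.coefficients` and `model.intercept` is a quick way to confirm that the Spark model was fit the way you expect.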

✅ Step 5: Check Model Performance

training_summary = model.summary
print("RMSE:", training_summary.rootMeanSquaredError)
print("R2:", training_summary.r2)

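💡 What This Does:

| Metric | Meaning |
|--------|---------|
| RMSE | Root Mean Squared Error: the typical size of a prediction error, in the label's units (here $1000s); lower is better |
| R² | R-squared: the share of the variance in price that the model explains; closer to 1 means the model fits the data well |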

Result (values rounded)

RMSE: 7.80
R2: 0.98

✅ Your model explains about 98% of the variance in the training data, with a typical error of roughly $7.8k. That's a strong fit for five data points!
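
Both metrics are simple enough to compute by hand, which helps when reading the summary values. A minimal sketch in plain Python, assuming RMSE averages the squared errors over all rows and R² is one minus the ratio of residual to total sum of squares; the `rmse` and `r2` helpers are hypothetical names, not PySpark functions:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: sqrt of the mean squared residual."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def r2(actual, predicted):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy numbers chosen for easy mental arithmetic (not the house data).
actual = [1, 2, 3, 4]
predicted = [1.5, 1.5, 3.5, 3.5]
print(rmse(actual, predicted))  # every error is 0.5, so RMSE is 0.5
print(r2(actual, predicted))    # 1 - 1.0 / 5.0 = 0.8
```

To check the tutorial's model, collect the `price` and `prediction` columns into two lists and pass them to the same functions.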

✅ Step 6: Make Predictions on Training Data

predictions = model.transform(df_features)
predictions.show()

Result (predictions rounded to two decimals)

| features | price | prediction |
| -------- | ----- | ---------- |
| [850.0] | 100 | 109.36 |
| [900.0] | 120 | 120.74 |
| [1000.0] | 150 | 143.51 |
| [1200.0] | 200 | 189.04 |
| [1500.0] | 250 | 257.34 |

👉 Predictions are close to real prices (but not perfect).

✅ Step 7: Predict New (Unseen) Data

new_data = spark.createDataFrame([(1100,), (1400,)], ["size"])
new_features = assembler.transform(new_data).select("features")
new_predictions = model.transform(new_features)
new_predictions.show()

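💡 What This Does:

| Line | Meaning |
|------|---------|
| `createDataFrame(...)` | Makes a new DataFrame with only the `size` column |
| `assembler.transform(...)` | Turns `size` into a feature vector, reusing the assembler from Step 2 |
| `model.transform(...)` | Applies the trained model to make predictions |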

Result (predictions rounded to two decimals)

| features | prediction |
| -------- | ---------- |
| [1100.0] | 166.28 |
| [1400.0] | 234.57 |

✅ The model can now estimate prices for houses it has not seen, based on size!
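
For a linear regression model, `model.transform` does nothing more mysterious than evaluate the learned line for each row. A minimal sketch of that arithmetic in plain Python; `predict_price` is a hypothetical helper, and the `w` and `b` values below are placeholders for illustration, not the fitted parameters:

```python
def predict_price(size_sqft, w, b):
    """Apply a fitted line: price (in $1000s) = w * size + b."""
    return w * size_sqft + b


# Placeholder parameters for illustration only. In the tutorial you would use
# w = model.coefficients[0] and b = model.intercept from the fitted model.
w, b = 0.25, -50.0
for size in (1100, 1400):
    print(size, predict_price(size, w, b))
```

Plugging in the fitted coefficient and intercept should reproduce the `prediction` column for the same sizes.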

🔑 1-Minute Summary - Predict House Price with Linear Regression

| Step | What Happens |
|------|--------------|
| Raw Data | Create table of house size and price |
| Feature Assembly | Convert size → [size] vector for Spark ML |
| Model Training | Fit linear regression model to learn w and b |
| View Coefficients | See how size affects price |
| Evaluate Model | Check RMSE and R² to measure how good the model is |
| Predict Existing Data | See how close predictions are to real prices |
| Predict New Data | Use model to forecast unseen house prices |

📌 This is how machine learning models are built from the ground up, one clean step at a time.