Predicting House Price from Size Using Linear Regression (PySpark)
Let's say you're working with a real estate startup.
They ask:
"Can we estimate a house's price based on its size in square feet?"
That's a regression problem: predicting a continuous number (like price).
We'll use Linear Regression to do that.
Example Dataset
| Size (sq.ft) | Price ($1000s) |
|--------------|----------------|
| 850 | 100 |
| 900 | 120 |
| 1000 | 150 |
| 1200 | 200 |
| 1500 | 250 |
Step 1: Import Libraries & Create Data
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
# Start Spark session
spark = SparkSession.builder.appName("LinearRegressionExample").getOrCreate()
# Create dataset (size, price)
data = [
(850, 100),
(900, 120),
(1000, 150),
(1200, 200),
(1500, 250)
]
df = spark.createDataFrame(data, ["size", "price"])
df.show()
What This Does:
| Line | Meaning |
|---|---|
| SparkSession | Entry point to Spark (needed for all PySpark work); getOrCreate() starts a session or reuses an existing one |
| createDataFrame(data, columns) | Creates a DataFrame (like a table) with columns size and price |
| df.show() | Prints the data table |
Step 2: Assemble Features
Spark ML expects features in a single vector column.
assembler = VectorAssembler(inputCols=["size"], outputCol="features")
df_features = assembler.transform(df).select("features", "price")
df_features.show(truncate=False)
What This Does:
| Line | Meaning |
|---|---|
| VectorAssembler(...) | Converts the "size" column into a features vector (needed by Spark ML) |
| transform(df) | Applies the assembler to your DataFrame |
| select("features", "price") | Keeps only the features and price columns |
Result
| features | price |
|---|---|
| [850.0] | 100 |
| [900.0] | 120 |
| [1000.0] | 150 |
| [1200.0] | 200 |
| [1500.0] | 250 |
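To make the idea of a "single vector column" concrete, here is a plain-Python sketch of what feature assembly does conceptually. It is not Spark's implementation; the helper name assemble_features and the dict-based rows are assumptions made only for this illustration.

```python
# Plain-Python illustration of feature assembly (not the Spark API).
# `assemble_features` is a hypothetical helper used only for this sketch.

def assemble_features(rows, input_cols):
    """Return each row with a new "features" list built from input_cols."""
    assembled = []
    for row in rows:
        # Collect the chosen columns, in order, as floats.
        features = [float(row[col]) for col in input_cols]
        assembled.append({**row, "features": features})
    return assembled


rows = [
    {"size": 850, "price": 100},
    {"size": 900, "price": 120},
]

for row in assemble_features(rows, ["size"]):
    print(row["features"], row["price"])
```

With more input columns, each features list would simply hold more numbers, one per column.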
Step 3: Train the Linear Regression Model
lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(df_features)
What This Does:
| Line | Meaning |
|---|---|
| LinearRegression(...) | Creates the model and tells it which column holds the features and which holds the label |
| model = lr.fit(df_features) | Trains (fits) the model using your data |
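With a single feature and no regularization, the line that a least-squares fit looks for has a closed-form solution. The sketch below computes it in plain Python on the same five rows, as a way to see what "fitting" means. It is not how Spark computes the result internally, and the function name fit_line is an assumption for this example.

```python
# Ordinary least squares for one feature, written out by hand.
# `fit_line` is a hypothetical helper, not part of PySpark.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


sizes = [850, 900, 1000, 1200, 1500]
prices = [100, 120, 150, 200, 250]

w, b = fit_line(sizes, prices)
print(f"w = {w:.4f}, b = {b:.2f}")
```

Since LinearRegression uses no regularization by default (regParam=0.0), the fitted model should land on essentially the same line, up to numerical precision.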
Step 4: See What the Model Learned
print("Coefficient (w):", model.coefficients)
print("Intercept (b):", model.intercept)
Result (values rounded)
Coefficient (w): [0.2277]
Intercept (b): -84.15
This means your learned equation is approximately:
price = 0.2277 × size - 84.15
So a 1000 sq.ft house would be:
price = 0.2277 × 1000 - 84.15 ≈ 144, or about $144k
Step 5: Check Model Performance
training_summary = model.summary
print("RMSE:", training_summary.rootMeanSquaredError)
print("R2:", training_summary.r2)
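What This Does:
| Metric | Meaning |
|---|---|
| RMSE | Root Mean Squared Error: the typical size of a prediction error, in the same units as price ($1000s); lower is better |
| R² | R-squared: the share of the variance in price explained by the model; closer to 1 means a better fit |
Result (values rounded)
RMSE: 7.80
R2: 0.98
Your model explains about 98% of the variance in the training data. That's a strong fit, though with only five rows and no held-out test set it says little about how the model would do on new houses.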
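Both metrics are short formulas over actual and predicted values. The sketch below writes them out in plain Python. The helper names rmse and r_squared are assumptions for this example, and the two prediction lists are deliberately artificial so that they show the extremes.

```python
import math

# `rmse` and `r_squared` are hypothetical helpers, not PySpark functions.

def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """Share of the variance in `actual` accounted for by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


prices = [100, 120, 150, 200, 250]

# Artificial case 1: perfect predictions.
perfect = list(prices)
print(rmse(prices, perfect), r_squared(prices, perfect))

# Artificial case 2: always predicting the average price.
mean_price = sum(prices) / len(prices)
baseline = [mean_price] * len(prices)
print(rmse(prices, baseline), r_squared(prices, baseline))
```

Perfect predictions give an RMSE of 0 and an R² of 1, while always guessing the average price gives an R² of 0. A useful model lands between those two cases.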
Step 6: Make Predictions on Training Data
predictions = model.transform(df_features)
predictions.show()
Result (predictions rounded)
| features | price | prediction |
| -------- | ----- | ---------- |
| [850.0] | 100 | 109.36 |
| [900.0] | 120 | 120.74 |
| [1000.0] | 150 | 143.51 |
| [1200.0] | 200 | 189.04 |
| [1500.0] | 250 | 257.34 |
Predictions are close to real prices (but not perfect).
Step 7: Predict New (Unseen) Data
new_data = spark.createDataFrame([(1100,), (1400,)], ["size"])
new_features = assembler.transform(new_data).select("features")
new_predictions = model.transform(new_features)
new_predictions.show()
What This Does:
| Line | Meaning |
|---|---|
| createDataFrame(...) | Makes a new DataFrame with only "size" |
| assembler.transform(...) | Turns size into a feature vector, reusing the assembler from Step 2 |
| model.transform(...) | Applies the trained model to make predictions |
Result (predictions rounded)
| features | prediction |
| -------- | ---------- |
| [1100.0] | 166.28 |
| [1400.0] | 234.57 |
The model can now estimate prices for houses it has not seen, based on size.
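One caveat worth keeping in mind: a fitted line is most trustworthy for sizes similar to those it was trained on. The sketch below flags sizes that fall outside the training range. The helper name is_extrapolation and the 3000 sq.ft example are assumptions added for illustration.

```python
# `is_extrapolation` is a hypothetical helper, not part of PySpark.

def is_extrapolation(size, training_sizes):
    """True when size lies outside the range of sizes seen in training."""
    return not (min(training_sizes) <= size <= max(training_sizes))


training_sizes = [850, 900, 1000, 1200, 1500]

# 1100 and 1400 come from Step 7; 3000 is a made-up out-of-range value.
for size in (1100, 1400, 3000):
    if is_extrapolation(size, training_sizes):
        label = "outside training range"
    else:
        label = "within training range"
    print(size, label)
```

A prediction for an out-of-range size is still just the line extended, so it deserves more scepticism than one inside the range.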
1-Minute Summary - Predict House Price with Linear Regression
| Step | What Happens |
|---|---|
| Raw Data | Create table of house size and price |
| Feature Assembly | Convert size → [size] vector for Spark ML |
| Model Training | Fit linear regression model to learn w and b |
| View Coefficients | See how size affects price |
| Evaluate Model | Check RMSE and R² to measure how good the model is |
| Predict Existing Data | See how close predictions are to real prices |
| Predict New Data | Use model to forecast unseen house prices |
This is how machine learning models are built from the ground up, one clean step at a time.