Regression & Classification in PySpark: MLlib Supervised Learning Guide
At DataVerse Labs, the analytics team needed to solve two common business problems:
- Predict numbers: sales forecasting, revenue estimation, product price prediction
- Predict categories: fraud detection, churn prediction, product recommendation
These tasks map directly to:
- Regression: predicting continuous values
- Classification: predicting classes (0/1 or multi-class)
In this chapter, you'll learn both using PySpark MLlib with step-by-step pipelines and clear input/output samples.
1. Regression in PySpark (Predicting Continuous Values)
Let's start with Linear Regression, one of the simplest and most useful models.
Example Scenario: Predicting House Prices
Input Data
+-----+---------+-----------+------+
|rooms|area_sqft|location_ix| price|
+-----+---------+-----------+------+
|    3|     1200|          1|250000|
|    2|      800|          0|180000|
|    4|     1500|          1|320000|
+-----+---------+-----------+------+
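To make concrete what a linear model learns from rows like these, here is a minimal plain-Python sketch of ordinary least squares with a single feature (area_sqft only). It is an illustration under simplifying assumptions, not how MLlib fits the model: MLlib works with all the features and uses its own solvers, and the helper name fit_simple_ols is made up for this example.

```python
# Plain-Python sketch: one-feature ordinary least squares on the sample rows.
# fit_simple_ols is a hypothetical helper, not part of PySpark.

def fit_simple_ols(xs, ys):
    """Return (slope, intercept) minimizing squared error for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squares, both centered on the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# The three sample rows from the input table (area_sqft and price columns)
area_sqft = [1200, 800, 1500]
price = [250000, 180000, 320000]

slope, intercept = fit_simple_ols(area_sqft, price)
fitted = [slope * x + intercept for x in area_sqft]

print(f"price ~ {slope:.2f} * area_sqft + {intercept:.2f}")
for x, y, y_hat in zip(area_sqft, price, fitted):
    print(x, y, round(y_hat, 2))
```

The slope reads as the estimated price change per additional square foot. A model with several features works the same way, with one coefficient per feature.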
Code: Linear Regression Pipeline
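from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline
# df is assumed to be an existing DataFrame with the columns shown in Input Data
# Combine the numeric input columns into one feature vector
assembler = VectorAssembler(
    inputCols=["rooms", "area_sqft", "location_ix"],
    outputCol="features"
)
# Linear regression on the assembled features, with price as the label
lr = LinearRegression(featuresCol="features", labelCol="price")
# Chain the stages, fit on df, then score the same rows
pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(df)
predictions = model.transform(df)
predictions.select("rooms", "area_sqft", "price", "prediction").show()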
Output Example
+-----+---------+------+----------+
|rooms|area_sqft| price|prediction|
+-----+---------+------+----------+
|    3|     1200|250000| 247312.56|
|    2|      800|180000| 185430.22|
|    4|     1500|320000| 318915.47|
+-----+---------+------+----------+
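Model Metrics (Optional)
# The fitted LinearRegressionModel is the last stage of the pipeline model
training_summary = model.stages[-1].summary
# R² and RMSE, both computed on the training data
print(training_summary.r2)
print(training_summary.rootMeanSquaredError)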
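To see what these two numbers measure, the sketch below recomputes them in plain Python from the price and prediction columns of the example output. It assumes the standard definitions, RMSE = sqrt(mean of squared residuals) and R² = 1 - SS_res / SS_tot. The function name regression_metrics is made up for illustration. Spark's summary covers the whole training set, so its values need not match a calculation on three displayed rows.

```python
import math

# Plain-Python sketch of the two metrics; regression_metrics is a
# hypothetical helper, not part of PySpark.
def regression_metrics(actual, predicted):
    """Return (rmse, r2) for paired lists of actual and predicted values."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(r ** 2 for r in residuals)            # unexplained variation
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variation
    rmse = math.sqrt(ss_res / n)
    r2 = 1 - ss_res / ss_tot
    return rmse, r2

# Values copied from the price and prediction columns of the example output
price = [250000, 180000, 320000]
prediction = [247312.56, 185430.22, 318915.47]

rmse, r2 = regression_metrics(price, prediction)
print(f"RMSE: {rmse:.2f}")
print(f"R2:   {r2:.4f}")
```

RMSE is expressed in the same units as the label (here, price), while R² is unitless; an R² close to 1 means the predictions account for most of the variation in the label.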