Regression & Classification in PySpark — MLlib Supervised Learning Guide
At DataVerse Labs, the analytics team needed to solve two common business problems:
- Predict numbers → sales forecasting, revenue estimation, product price prediction
- Predict categories → fraud detection, churn prediction, product recommendation
These tasks map directly to:
- Regression → predicting continuous values
- Classification → predicting classes (0/1 or multi-class)
In this chapter, you’ll learn both using PySpark MLlib with step-by-step pipelines and clear input/output samples.
1. Regression in PySpark (Predicting Continuous Values)
Let’s start with Linear Regression, one of the simplest and most useful models.
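Under the hood, linear regression models the label as a weighted sum of the features plus an intercept. For the house-price example below, that means roughly:
price ≈ w1 * rooms + w2 * area_sqft + w3 * location_ix + b
where the weights w1..w3 and the intercept b are learned from the training data.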
Example Scenario — Predicting House Prices
Input Data
+------+----------+-----------+--------+
|rooms |area_sqft |location_ix|price |
+------+----------+-----------+--------+
|3 |1200 |1 |250000 |
|2 |800 |0 |180000 |
|4 |1500 |1 |320000 |
+------+----------+-----------+--------+
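The pipeline code below assumes this table is already loaded into a DataFrame named df. A minimal sketch that builds it in memory (the app name here is just an illustrative choice):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HousePriceRegression").getOrCreate()

# Toy DataFrame mirroring the input table above
df = spark.createDataFrame(
    [(3, 1200, 1, 250000),
     (2, 800, 0, 180000),
     (4, 1500, 1, 320000)],
    ["rooms", "area_sqft", "location_ix", "price"]
)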
Code — Linear Regression Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline

# Combine the raw feature columns into a single vector column
assembler = VectorAssembler(
    inputCols=["rooms", "area_sqft", "location_ix"],
    outputCol="features"
)

# Linear regression predicts the continuous "price" label
lr = LinearRegression(featuresCol="features", labelCol="price")

# A Pipeline chains the stages so the same steps run at training and scoring time
pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(df)
predictions = model.transform(df)
predictions.select("rooms", "area_sqft", "price", "prediction").show()
Output Example
+-----+----------+--------+------------------+
|rooms|area_sqft |price |prediction |
+-----+----------+--------+------------------+
|3 |1200 |250000 |247312.56 |
|2 |800 |180000 |185430.22 |
|4 |1500 |320000 |318915.47 |
+-----+----------+--------+------------------+
Model Metrics (Optional)
# The last pipeline stage is the fitted LinearRegressionModel
training_summary = model.stages[-1].summary
print(training_summary.r2)                    # R² on the training data
print(training_summary.rootMeanSquaredError)  # RMSE on the training data
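The fitted stage also exposes the learned parameters, which can help sanity-check the model; a quick sketch:
lr_model = model.stages[-1]  # the fitted LinearRegressionModel
print("Coefficients:", lr_model.coefficients)
print("Intercept:", lr_model.intercept)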
2. Classification in PySpark (Predicting Categories)
Classification answers questions like:
- Will a customer churn or stay?
- Is a transaction fraud?
- Will a visitor purchase?
The most common model: Logistic Regression.
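Logistic regression passes the same kind of weighted feature sum through a sigmoid, turning it into a probability between 0 and 1:
P(label = 1) = 1 / (1 + e^-(w·x + b))
By default, rows with P(label = 1) above 0.5 are predicted as class 1.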
Example Scenario — Predicting Customer Churn
Input Data
+---------+--------+-----------+-----+
|tenure |income |contracts |label|
+---------+--------+-----------+-----+
|12 |45000 |1 |0 |
|6 |32000 |0 |1 |
|24 |70000 |2 |0 |
+---------+--------+-----------+-----+
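As in the regression example, the pipeline assumes this table lives in a DataFrame named df; a minimal sketch that builds it (reusing the earlier SparkSession):
# Toy churn DataFrame mirroring the input table above
df = spark.createDataFrame(
    [(12, 45000, 1, 0),
     (6, 32000, 0, 1),
     (24, 70000, 2, 0)],
    ["tenure", "income", "contracts", "label"]
)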
Code — Logistic Regression Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

# Assemble the churn features into a single vector column
assembler = VectorAssembler(
    inputCols=["tenure", "income", "contracts"],
    outputCol="features"
)

# Logistic regression predicts the binary "label" (1 = churn, 0 = stay)
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(df)
predictions = model.transform(df)
predictions.select("tenure", "income", "label", "prediction", "probability").show(truncate=False)
Output Example
+-------+------+-----+----------+--------------------------------------+
|tenure |income|label|prediction|probability |
+-------+------+-----+----------+--------------------------------------+
|12 |45000 |0 |0.0 |[0.82,0.18] |
|6 |32000 |1 |1.0 |[0.21,0.79] |
|24 |70000 |0 |0.0 |[0.93,0.07] |
+-------+------+-----+----------+--------------------------------------+
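The probability column is a vector of [P(class 0), P(class 1)]. If you need the churn probability as a plain numeric column, for example to rank customers, Spark 3.0+ ships vector_to_array; a minimal sketch:
from pyspark.ml.functions import vector_to_array

# Pull P(label = 1) out of the probability vector
scored = predictions.withColumn(
    "churn_probability", vector_to_array("probability")[1]
)
scored.select("tenure", "income", "churn_probability").show()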
3. Train/Test Split (Critical for Real ML Systems)
Evaluating a model on the data it was trained on overstates its quality. Always hold out a test split so your metrics reflect performance on unseen data:
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)  # 80% train, 20% test

model = pipeline.fit(train_df)           # fit only on the training split
predictions = model.transform(test_df)   # score the held-out rows
4. Evaluating Classification Models
PySpark computes classification metrics such as area under the ROC curve (AUC) via BinaryClassificationEvaluator.
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(
    labelCol="label",
    rawPredictionCol="rawPrediction"
)
auc = evaluator.evaluate(predictions)
print("AUC:", auc)
5. Evaluating Regression Models
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(
    labelCol="price",
    predictionCol="prediction",
    metricName="rmse"
)
rmse = evaluator.evaluate(predictions)
print("RMSE:", rmse)
6. Why MLlib for Regression & Classification
✔ Scales to TB-level datasets
✔ Pipeline-ready and production-friendly
✔ Distributed training & evaluation
✔ Consistent workflow across algorithms
✔ Built-in metrics for fast validation
Summary
In this chapter, you learned:
- Regression → predicting continuous values using LinearRegression
- Classification → predicting categories using LogisticRegression
- How to build end-to-end ML pipelines
- How to evaluate models with metrics like R², RMSE, and AUC
PySpark MLlib gives you a clean, scalable approach to supervised learning used by enterprise data teams worldwide.
Next Topic → Clustering & Recommendation Engines in PySpark