Skip to main content

Regression & Classification in PySpark β€” MLlib Supervised Learning Guide

At DataVerse Labs, the analytics team needed to solve two common business problems:

  1. Predict numbers β†’ sales forecasting, revenue estimation, product price prediction
  2. Predict categories β†’ fraud detection, churn prediction, product recommendation

These tasks map directly to:

  • Regression β†’ predicting continuous values
  • Classification β†’ predicting classes (0/1 or multi-class)

In this chapter, you’ll learn both using PySpark MLlib with step-by-step pipelines and clear input/output samples.


1. Regression in PySpark (Predicting Continuous Values)​

Let’s start with Linear Regression, one of the simplest and most useful models.

Example Scenario β€” Predicting House Prices​

Input Data

+------+----------+-----------+--------+
|rooms |area_sqft |location_ix|price |
+------+----------+-----------+--------+
|3 |1200 |1 |250000 |
|2 |800 |0 |180000 |
|4 |1500 |1 |320000 |
+------+----------+-----------+--------+

Code β€” Linear Regression Pipeline​

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline

assembler = VectorAssembler(
inputCols=["rooms", "area_sqft", "location_ix"],
outputCol="features"
)

lr = LinearRegression(featuresCol="features", labelCol="price")

pipeline = Pipeline(stages=[assembler, lr])

model = pipeline.fit(df)
predictions = model.transform(df)

predictions.select("rooms", "area_sqft", "price", "prediction").show()

Output Example​

+-----+----------+--------+------------------+
|rooms|area_sqft |price |prediction |
+-----+----------+--------+------------------+
|3 |1200 |250000 |247312.56 |
|2 |800 |180000 |185430.22 |
|4 |1500 |320000 |318915.47 |
+-----+----------+--------+------------------+

Model Metrics (Optional)​

training_summary = model.stages[-1].summary
print(training_summary.r2)
print(training_summary.rootMeanSquaredError)

2. Classification in PySpark (Predicting Categories)​

Classification answers questions like:

  • Will a customer churn or stay?
  • Is a transaction fraud?
  • Will a visitor purchase?

The most common model: Logistic Regression.

Example Scenario β€” Predicting Customer Churn​

Input Data

+---------+--------+-----------+-----+
|tenure |income |contracts |label|
+---------+--------+-----------+-----+
|12 |45000 |1 |0 |
|6 |32000 |0 |1 |
|24 |70000 |2 |0 |
+---------+--------+-----------+-----+

Code β€” Logistic Regression Pipeline​

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

assembler = VectorAssembler(
inputCols=["tenure", "income", "contracts"],
outputCol="features"
)

lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(df)

predictions = model.transform(df)

predictions.select("tenure", "income", "label", "prediction", "probability").show(truncate=False)

Output Example​

+-------+------+-----+----------+--------------------------------------+
|tenure |income|label|prediction|probability |
+-------+------+-----+----------+--------------------------------------+
|12 |45000 |0 |0.0 |[0.82,0.18] |
|6 |32000 |1 |1.0 |[0.21,0.79] |
|24 |70000 |0 |0.0 |[0.93,0.07] |
+-------+------+-----+----------+--------------------------------------+

3. Train/Test Split (Critical for Real ML Systems)​

Always split data to avoid overfitting:

train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

model = pipeline.fit(train_df)
predictions = model.transform(test_df)

4. Evaluating Classification Models​

PySpark provides metrics via BinaryClassificationEvaluator.

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(
labelCol="label",
rawPredictionCol="rawPrediction"
)

auc = evaluator.evaluate(predictions)
print("AUC:", auc)

5. Evaluating Regression Models​

from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(
labelCol="price",
predictionCol="prediction",
metricName="rmse"
)

rmse = evaluator.evaluate(predictions)
print("RMSE:", rmse)

6. Why MLlib for Regression & Classification​

βœ” Scales to TB-level datasets βœ” Pipeline-ready, production-friendly βœ” Distributed training & evaluation βœ” Consistent workflow across algorithms βœ” Built-in metrics for fast validation


Summary​

In this chapter, you learned:

  • Regression β†’ predicting continuous values using LinearRegression
  • Classification β†’ predicting categories using LogisticRegression
  • How to build end-to-end ML pipelines
  • How to evaluate model metrics like RΒ², RMSE, and AUC

PySpark MLlib gives you a clean, scalable approach to supervised learning used by enterprise data teams worldwide.


Next Topic β†’ Linear Regression - Explained Simply