Regression & Classification in PySpark — MLlib Supervised Learning Guide

At DataVerse Labs, the analytics team needed to solve two common business problems:

  1. Predict numbers: sales forecasting, revenue estimation, product price prediction
  2. Predict categories: fraud detection, churn prediction, product recommendation

These tasks map directly to:

  • Regression → predicting continuous values
  • Classification → predicting classes (0/1 or multi-class)

In this chapter, you’ll learn both using PySpark MLlib with step-by-step pipelines and clear input/output samples.


1. Regression in PySpark (Predicting Continuous Values)

Let’s start with Linear Regression, one of the simplest and most useful models.

Example Scenario — Predicting House Prices

Input Data

+-----+---------+-----------+------+
|rooms|area_sqft|location_ix| price|
+-----+---------+-----------+------+
|    3|     1200|          1|250000|
|    2|      800|          0|180000|
|    4|     1500|          1|320000|
+-----+---------+-----------+------+
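
If you want to run this example end to end, you first need the DataFrame df used below. Here is a minimal sketch, assuming an active SparkSession (the variable name spark and the app name are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HousePriceRegression").getOrCreate()

# Sample rows matching the table above; location_ix is an already-indexed location code
df = spark.createDataFrame(
    [(3, 1200, 1, 250000),
     (2, 800, 0, 180000),
     (4, 1500, 1, 320000)],
    ["rooms", "area_sqft", "location_ix", "price"]
)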

Code — Linear Regression Pipeline

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline

# Combine the raw feature columns into a single vector column
assembler = VectorAssembler(
    inputCols=["rooms", "area_sqft", "location_ix"],
    outputCol="features"
)

lr = LinearRegression(featuresCol="features", labelCol="price")

# Chain feature assembly and the model into one pipeline
pipeline = Pipeline(stages=[assembler, lr])

model = pipeline.fit(df)
predictions = model.transform(df)

predictions.select("rooms", "area_sqft", "price", "prediction").show()

Output Example

+-----+---------+------+------------------+
|rooms|area_sqft| price|        prediction|
+-----+---------+------+------------------+
|    3|     1200|250000|         247312.56|
|    2|      800|180000|         185430.22|
|    4|     1500|320000|         318915.47|
+-----+---------+------+------------------+

Model Metrics (Optional)

# The last pipeline stage is the fitted model, which exposes a training summary
training_summary = model.stages[-1].summary
print(training_summary.r2)                    # coefficient of determination
print(training_summary.rootMeanSquaredError)  # RMSE on the training data
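
Beyond the summary metrics, you can inspect what the model learned. A minimal sketch, assuming the fitted pipeline model from above, whose last stage is the fitted LinearRegressionModel:

# One coefficient per input feature, in VectorAssembler order
lr_model = model.stages[-1]
print("Coefficients:", lr_model.coefficients)  # weights for rooms, area_sqft, location_ix
print("Intercept:", lr_model.intercept)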

2. Classification in PySpark (Predicting Categories)

Classification answers questions like:

  • Will a customer churn or stay?
  • Is a transaction fraud?
  • Will a visitor purchase?

The most common model: Logistic Regression.

Example Scenario — Predicting Customer Churn

Input Data

+------+------+---------+-----+
|tenure|income|contracts|label|
+------+------+---------+-----+
|    12| 45000|        1|    0|
|     6| 32000|        0|    1|
|    24| 70000|        2|    0|
+------+------+---------+-----+
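
As in the regression example, a minimal sketch for building this sample DataFrame, assuming the same active SparkSession named spark:

# Sample rows matching the table above; label = 1 means the customer churned
df = spark.createDataFrame(
    [(12, 45000, 1, 0),
     (6, 32000, 0, 1),
     (24, 70000, 2, 0)],
    ["tenure", "income", "contracts", "label"]
)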

Code — Logistic Regression Pipeline

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

# Assemble the churn features into a single vector column
assembler = VectorAssembler(
    inputCols=["tenure", "income", "contracts"],
    outputCol="features"
)

lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(df)

predictions = model.transform(df)

predictions.select("tenure", "income", "label", "prediction", "probability").show(truncate=False)

Output Example

+------+------+-----+----------+-----------+
|tenure|income|label|prediction|probability|
+------+------+-----+----------+-----------+
|    12| 45000|    0|       0.0|[0.82,0.18]|
|     6| 32000|    1|       1.0|[0.21,0.79]|
|    24| 70000|    0|       0.0|[0.93,0.07]|
+------+------+-----+----------+-----------+
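
The probability column is a two-element vector: [P(label=0), P(label=1)]. To work with the churn probability as a plain numeric column (for custom thresholds or reporting), you can unpack it. A minimal sketch using vector_to_array, available in PySpark 3.0+; the column name churn_probability is illustrative:

from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

# Extract the second element of the probability vector, i.e. P(churn)
predictions = predictions.withColumn(
    "churn_probability",
    vector_to_array(F.col("probability"))[1]
)
predictions.select("tenure", "churn_probability").show()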

3. Train/Test Split (Critical for Real ML Systems)

Always hold out a test set so you measure performance on data the model has never seen; metrics computed on the training data hide overfitting:

# 80/20 split with a fixed seed for reproducibility
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# Fit on the training set, evaluate on the held-out test set
model = pipeline.fit(train_df)
predictions = model.transform(test_df)

4. Evaluating Classification Models

PySpark provides metrics via BinaryClassificationEvaluator.

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(
    labelCol="label",
    rawPredictionCol="rawPrediction"
)

auc = evaluator.evaluate(predictions)
print("AUC:", auc)

5. Evaluating Regression Models

from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(
    labelCol="price",
    predictionCol="prediction",
    metricName="rmse"
)

rmse = evaluator.evaluate(predictions)
print("RMSE:", rmse)

6. Why MLlib for Regression & Classification

  • Scales to TB-level datasets
  • Pipeline-ready and production-friendly
  • Distributed training and evaluation
  • Consistent workflow across algorithms
  • Built-in metrics for fast validation


Summary

In this chapter, you learned:

  • Regression → predicting continuous values using LinearRegression
  • Classification → predicting categories using LogisticRegression
  • How to build end-to-end ML pipelines
  • How to evaluate model metrics like R², RMSE, and AUC

PySpark MLlib gives you a clean, scalable approach to supervised learning used by enterprise data teams worldwide.


Next Topic → Clustering & Recommendation Engines in PySpark