Logistic Regression Mini Project — Forecasting Customer Churn
Welcome to this hands-on mini project! We’ll use PySpark and Logistic Regression to predict whether a customer will leave (churn) or stay, building everything step by step on a small dataset that we create ourselves.
Step 1: Create Your Spark Session
Start by creating a Spark session — this is like starting the engine before driving.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('CustomerChurnForecast').getOrCreate()
Why? We need Spark to process data and run machine learning tasks.
Result
SparkSession - in-memory cluster started for app: CustomerChurnForecast
Step 2: Create a Simple Dataset
Instead of reading from a CSV file, we’ll create our own customer data right inside the code.
data = [
    (1, "Alpha Ltd", 25, 5000, 2.5, 8, 0),
    (2, "Beta Inc", 45, 10000, 5.0, 12, 0),
    (3, "Gamma Co", 30, 3000, 1.0, 5, 1),
    (4, "Delta Corp", 50, 20000, 7.0, 20, 0),
    (5, "Epsilon Ltd", 22, 1500, 1.2, 4, 1),
    (6, "Zeta Works", 39, 7000, 3.8, 9, 0),
    (7, "Eta Systems", 29, 2500, 2.0, 6, 1),
    (8, "Theta Services", 47, 12000, 6.0, 15, 0),
    (9, "Iota Industries", 35, 4000, 2.5, 7, 1),
    (10, "Kappa Co", 42, 9000, 5.0, 11, 0)
]
columns = ["CustomerID", "Company", "Age", "Total_Purchase", "Years", "Num_Sites", "Churn"]
df = spark.createDataFrame(data, columns)
df.show()
Why? We’re creating our own mini dataset of 10 customers with:
- Age
- Total purchase value
- Number of years with company
- Number of sites visited
- Whether they churned (1) or not (0)
Result
+-----------+----------------+---+--------------+-----+---------+-----+
|CustomerID |Company |Age|Total_Purchase|Years|Num_Sites|Churn|
+-----------+----------------+---+--------------+-----+---------+-----+
|1 |Alpha Ltd |25 |5000 |2.5 |8 |0 |
|2 |Beta Inc |45 |10000 |5.0 |12 |0 |
|3 |Gamma Co |30 |3000 |1.0 |5 |1 |
|4 |Delta Corp |50 |20000 |7.0 |20 |0 |
|5 |Epsilon Ltd |22 |1500 |1.2 |4 |1 |
|6 |Zeta Works |39 |7000 |3.8 |9 |0 |
|7 |Eta Systems |29 |2500 |2.0 |6 |1 |
|8 |Theta Services |47 |12000 |6.0 |15 |0 |
|9 |Iota Industries |35 |4000 |2.5 |7 |1 |
|10 |Kappa Co |42 |9000 |5.0 |11 |0 |
+-----------+----------------+---+--------------+-----+---------+-----+
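Tip: before moving on, it’s worth confirming the schema Spark inferred. This is also why every Years value above is written as a decimal: Spark cannot merge integer and double values within a single inferred column.

df.printSchema()  # Years should appear as double; the other numeric columns as long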
Step 3: Split into Train and Test Data
We’ll train our model on roughly 70% of the data and test it on the remaining 30%. Note that randomSplit assigns rows probabilistically, so the exact counts can vary; the seed makes the split reproducible.
train, test = df.randomSplit([0.7, 0.3], seed=42)
print("Training data count:", train.count())
print("Test data count:", test.count())
Why? So the model learns from one portion and is tested on unseen data — like an exam after practice.
Result
Training data count: 7
Test data count: 3
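With only 10 rows, it’s worth checking that both classes (churn and no-churn) actually landed in the training split; a quick sketch:

# Count churners (1) vs. non-churners (0) in the training data
train.groupBy('Churn').count().show()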
Step 4: Assemble the Features
We combine input columns into a single vector — required for Spark ML models.
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(
    inputCols=['Age', 'Total_Purchase', 'Years', 'Num_Sites'],
    outputCol='features'
)
Why? Spark ML models expect all inputs in one column called features.
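If you’re curious what the assembler actually produces, you can apply it directly to the DataFrame. This is just for inspection; the pipeline in Step 6 will run this step for us automatically.

# Each row gets a 'features' vector like [25.0, 5000.0, 2.5, 8.0]
assembled = assembler.transform(df)
assembled.select('features', 'Churn').show(truncate=False)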
Step 5: Build Logistic Regression Model
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(
    featuresCol='features',
    labelCol='Churn',
    predictionCol='Predicted_Churn'
)
Why Logistic Regression? Because it helps predict a Yes (1) or No (0) outcome — in this case, “Will the customer churn?”
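Under the hood, logistic regression feeds a weighted sum of the features through the sigmoid function, which squashes any number into a value between 0 and 1 that can be read as a probability. A minimal pure-Python sketch of the idea:

import math

def sigmoid(z):
    # Maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))   # 0.5  -> right on the decision boundary
print(sigmoid(2.0))   # ~0.88 -> leaning strongly towards churn (1)
print(sigmoid(-2.0))  # ~0.12 -> leaning strongly towards staying (0)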
Step 6: Build a Pipeline
This joins the feature assembler and the model together.
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[assembler, lr])
lr_model_pipeline = pipeline.fit(train)
Why? Pipeline makes workflow cleaner — combining feature preparation + model training.
Result
PipelineModel trained successfully!
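Once fitted, you can peek inside the trained model: the last pipeline stage is the fitted LogisticRegressionModel, which exposes the weights it learned (the exact numbers will depend on your split and Spark version).

lr_model = lr_model_pipeline.stages[-1]
print("Coefficients:", lr_model.coefficients)  # one weight per input feature
print("Intercept:", lr_model.intercept)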
Step 7: Make Predictions
results = lr_model_pipeline.transform(test)
results.select("CustomerID", "Company", "Churn", "Predicted_Churn", "probability").show()
Why? Now we see predictions:
- Churn (actual) — what really happened
- Predicted_Churn — what our model guessed
- probability — how confident the model is
The probability column shows two values: [probability_of_stay, probability_of_churn]
For example: [0.22, 0.78] → 78% chance the customer will churn.
Result
+-----------+----------------+-----+---------------+--------------------------+
|CustomerID |Company |Churn|Predicted_Churn|probability |
+-----------+----------------+-----+---------------+--------------------------+
|3 |Gamma Co |1 |1 |[0.22,0.78] |
|5 |Epsilon Ltd |1 |1 |[0.30,0.70] |
|8 |Theta Services |0 |0 |[0.85,0.15] |
+-----------+----------------+-----+---------------+--------------------------+
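If you’d rather have the churn probability as a plain number instead of a vector, Spark ships vector_to_array; a small sketch, assuming you are on Spark 3.0 or later:

from pyspark.ml.functions import vector_to_array

# Index 1 of the probability vector is P(Churn = 1)
results.withColumn('Churn_Probability', vector_to_array('probability')[1]) \
    .select('Company', 'Churn_Probability').show()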
Step 8: Evaluate the Model
Let’s check the model’s performance using AUC (Area Under the ROC Curve).
from pyspark.ml.evaluation import BinaryClassificationEvaluator
my_eval = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction', labelCol='Churn')
AUC = my_eval.evaluate(results)
print("AUC =", AUC)
Why? AUC measures how well the model separates churners from non-churners:
- 1.0 = Perfect separation
- 0.5 = No better than random guessing
- ~0.8 = Good
Note that AUC is a ranking metric, not an accuracy percentage, so an AUC of 0.8 does not mean “80% of predictions are correct”.
Result
AUC = 0.7997169143665959
It means that if you pick one churner and one non-churner at random, the model ranks the churner as riskier about 80% of the time.
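If you do want plain accuracy (the fraction of test rows classified correctly), you can compute it separately; a sketch using MulticlassClassificationEvaluator:

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

acc_eval = MulticlassClassificationEvaluator(
    labelCol='Churn',
    predictionCol='Predicted_Churn',
    metricName='accuracy'
)
print("Accuracy =", acc_eval.evaluate(results))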
Step 9: View Company Predictions
results.select('Company', 'Predicted_Churn').show()
Why? It gives a clean, per-company summary:
- Gamma Co and Epsilon Ltd are predicted to churn
- Theta Services is predicted to stay
Result
+----------------+---------------+
|Company |Predicted_Churn|
+----------------+---------------+
|Gamma Co |1 |
|Epsilon Ltd |1 |
|Theta Services |0 |
+----------------+---------------+
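For a business-ready takeaway, you can filter straight down to the customers flagged as at risk:

# Keep only the customers the model expects to churn
results.filter(results.Predicted_Churn == 1).select('Company').show()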
🧭 1-Minute Recap: Logistic Regression - Mini Project
| Step | Task | Output | Purpose |
|---|---|---|---|
| 1 | Started Spark | SparkSession created | Runs PySpark |
| 2 | Created dataset | 10 customer rows | Custom data |
| 3 | Split data | 7 train, 3 test | Learn + Test |
| 4 | Assembled features | Feature vector ready | Prepares input |
| 5 | Created model | Logistic Regression | Binary classifier |
| 6 | Built pipeline | Combined steps | Clean workflow |
| 7 | Made predictions | Showed churn results | Forecasted churn |
| 8 | Evaluated AUC | 0.7997 | Model quality |
| 9 | Final results | Company & churn flag | Easy to interpret |