Clustering & Recommendation Engines in PySpark — MLlib Unsupervised Learning
At DataVerse Labs, customer segmentation and product recommendations were essential for improving analytics and boosting conversions. Instead of manually grouping customers or building heuristics, they adopted MLlib’s unsupervised learning tools:
- Clustering → group similar customers/items
- Recommendation Engines → predict what users may like
In this chapter, we explore KMeans, GMM, and ALS with clean pipelines and real input/output examples.
1. Clustering in PySpark
Clustering helps answer questions like:
- What customer groups exist organically?
- Which products are bought by similar audiences?
- Can we target different segments with personalized campaigns?
The most popular algorithm: KMeans.
2. KMeans Clustering Example — Customer Segmentation
Input Data
+--------+-----------+----------+
|income |spending |visits |
+--------+-----------+----------+
|45000 |300 |12 |
|65000 |150 |5 |
|52000 |450 |20 |
|80000 |100 |3 |
+--------+-----------+----------+
Code — KMeans Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline
assembler = VectorAssembler(
inputCols=["income", "spending", "visits"],
outputCol="features"
)
kmeans = KMeans(k=2, seed=42)
pipeline = Pipeline(stages=[assembler, kmeans])
model = pipeline.fit(df)
predictions = model.transform(df)
predictions.select("income", "spending", "visits", "prediction").show()
Output Example
+-------+--------+------+----------+
|income |spending|visits|prediction|
+-------+--------+------+----------+
|45000 |300 |12 |0 |
|65000 |150 |5 |1 |
|52000 |450 |20 |0 |
|80000 |100 |3 |1 |
+-------+--------+------+----------+
Cluster Centers
model.stages[-1].clusterCenters()
Output:
[array([48500.0, 375.0, 16.0]), array([72500.0, 125.0, 4.0])]
3. Gaussian Mixture Models (GMM) — Soft Clustering
GMM assigns probabilities instead of hard labels.
from pyspark.ml.clustering import GaussianMixture
# df_features must already contain an assembled "features" column
# (e.g., the output of the VectorAssembler above)
gmm = GaussianMixture(k=2, seed=42)
gmm_model = gmm.fit(df_features)
gmm_model.transform(df_features).show(truncate=False)
Output (features and probability columns shown):
+-----------------+--------------------------------------------+
|features |probability |
+-----------------+--------------------------------------------+
|[45000,300,12] |[0.87,0.13] |
|[65000,150,5] |[0.10,0.90] |
+-----------------+--------------------------------------------+
Great for overlapping clusters.
4. Recommendation Engines — ALS Collaborative Filtering
The Alternating Least Squares (ALS) algorithm is a matrix-factorization approach to collaborative filtering, popularized by the Netflix Prize and widely used in streaming and e-commerce recommenders.
MLlib provides a scalable ALS implementation.
5. ALS Example — Product Recommendations
Input Data (User Ratings)
+------+--------+--------+
|userId|itemId |rating |
+------+--------+--------+
|1 |101 |5 |
|1 |102 |3 |
|2 |101 |4 |
|2 |103 |2 |
|3 |102 |5 |
+------+--------+--------+
Code — ALS Recommender
from pyspark.ml.recommendation import ALS
als = ALS(
userCol="userId",
itemCol="itemId",
ratingCol="rating",
coldStartStrategy="drop"
)
model = als.fit(df)
# Recommendations for each user
recommendations = model.recommendForAllUsers(3)
recommendations.show(truncate=False)
Output Example
+------+---------------------------------------------+
|userId|recommendations |
+------+---------------------------------------------+
|1 |[{103, 4.8}, {104, 4.3}, {105, 4.1}] |
|2 |[{102, 4.6}, {105, 4.2}, {101, 4.0}] |
|3 |[{101, 4.7}, {103, 4.5}, {104, 4.2}] |
+------+---------------------------------------------+
These predicted scores help businesses:
✔ Recommend relevant products ✔ Increase cart value ✔ Boost engagement
6. Evaluating ALS Models
ALS models are commonly evaluated with RMSE (root mean squared error) on predicted ratings.
from pyspark.ml.evaluation import RegressionEvaluator
# Note: evaluating on the training data here for brevity; use a held-out split in practice
predictions = model.transform(df)
evaluator = RegressionEvaluator(
metricName="rmse",
labelCol="rating",
predictionCol="prediction"
)
rmse = evaluator.evaluate(predictions)
print("RMSE:", rmse)
7. Best Practices for Clustering & Recommendations
Clustering
✔ Standardize numeric features (e.g., with StandardScaler) ✔ Try multiple values of k ✔ Validate with the silhouette score
Recommendation Systems
✔ Use implicit feedback (implicitPrefs=True) for clicks/views
✔ Tune rank, maxIter, regParam
✔ Always set coldStartStrategy='drop'
✔ Evaluate using RMSE
Summary
In this chapter, you learned:
🟦 Clustering
- KMeans for hard cluster assignments
- GMM for probability-based clustering
- Pipelines for clean data-to-cluster flow
🟩 Recommendation Engines
- ALS for collaborative filtering
- Predicting top-N recommendations
- Evaluating rating-prediction accuracy with RMSE
Now your PySpark toolbox includes powerful unsupervised learning capabilities for segmentation and personalized recommendations.
Next Topic → PySpark with Snowflake, Databricks, and Hive Integration