
Clustering & Recommendation Engines in PySpark — MLlib Unsupervised Learning

At DataVerse Labs, customer segmentation and product recommendations were essential to improving analytics and boosting conversions. Instead of grouping customers by hand or maintaining ad-hoc heuristics, the team adopted MLlib’s unsupervised learning tools:

  • Clustering → group similar customers/items
  • Recommendation Engines → predict what users may like

In this chapter, we explore KMeans, GMM, and ALS with clean pipelines and real input/output examples.


1. Clustering in PySpark

Clustering helps answer questions like:

  • What customer groups exist organically?
  • Which products are bought by similar audiences?
  • Can we target different segments with personalized campaigns?

The most popular algorithm: KMeans.


2. KMeans Clustering Example — Customer Segmentation

Input Data

+------+--------+------+
|income|spending|visits|
+------+--------+------+
|45000 |300     |12    |
|65000 |150     |5     |
|52000 |450     |20    |
|80000 |100     |3     |
+------+--------+------+

Code — KMeans Pipeline

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline

# Combine the numeric columns into a single feature vector
assembler = VectorAssembler(
    inputCols=["income", "spending", "visits"],
    outputCol="features"
)

kmeans = KMeans(k=2, seed=42)

pipeline = Pipeline(stages=[assembler, kmeans])
model = pipeline.fit(df)

predictions = model.transform(df)
predictions.select("income", "spending", "visits", "prediction").show()

Output Example

+------+--------+------+----------+
|income|spending|visits|prediction|
+------+--------+------+----------+
|45000 |300     |12    |0         |
|65000 |150     |5     |1         |
|52000 |450     |20    |0         |
|80000 |100     |3     |1         |
+------+--------+------+----------+

Cluster Centers

model.stages[-1].clusterCenters()

Output:

[array([48500.0, 375.0, 16.0]), array([72500.0, 125.0, 4.0])]

3. Gaussian Mixture Models (GMM) — Soft Clustering

GMM assigns probabilities instead of hard labels.

from pyspark.ml.clustering import GaussianMixture

# df_features must already contain the assembled "features" column
gmm = GaussianMixture(k=2, seed=42)
gmm_model = gmm.fit(df_features)
gmm_model.transform(df_features).show(truncate=False)

Output:

+---------------+------------+
|features       |probability |
+---------------+------------+
|[45000,300,12] |[0.87,0.13] |
|[65000,150,5]  |[0.10,0.90] |
+---------------+------------+

Great for overlapping clusters.


4. Recommendation Engines — ALS Collaborative Filtering

The Alternating Least Squares (ALS) algorithm powers recommendation engines at:

  • Netflix
  • Amazon
  • Spotify
  • E-commerce platforms

MLlib provides a scalable ALS implementation.


5. ALS Example — Product Recommendations

Input Data (User Ratings)

+------+------+------+
|userId|itemId|rating|
+------+------+------+
|1     |101   |5     |
|1     |102   |3     |
|2     |101   |4     |
|2     |103   |2     |
|3     |102   |5     |
+------+------+------+

Code — ALS Recommender

from pyspark.ml.recommendation import ALS

als = ALS(
    userCol="userId",
    itemCol="itemId",
    ratingCol="rating",
    coldStartStrategy="drop"
)

model = als.fit(df)

# Top-3 recommendations for each user
recommendations = model.recommendForAllUsers(3)
recommendations.show(truncate=False)

Output Example

+------+---------------------------------------------+
|userId|recommendations |
+------+---------------------------------------------+
|1 |[{103, 4.8}, {104, 4.3}, {105, 4.1}] |
|2 |[{102, 4.6}, {105, 4.2}, {101, 4.0}] |
|3 |[{101, 4.7}, {103, 4.5}, {104, 4.2}] |
+------+---------------------------------------------+

These predicted scores help businesses:

✔ Recommend relevant products
✔ Increase cart value
✔ Boost engagement


6. Evaluating ALS Models

ALS rating predictions are evaluated like a regression problem, typically with RMSE (root mean squared error) via RegressionEvaluator.

from pyspark.ml.evaluation import RegressionEvaluator

predictions = model.transform(df)

evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol="rating",
    predictionCol="prediction"
)

rmse = evaluator.evaluate(predictions)
print("RMSE:", rmse)

7. Best Practices for Clustering & Recommendations

Clustering

✔ Standardize numeric features (optionally with StandardScaler)
✔ Try multiple values of k
✔ Use the silhouette score for validation

Recommendation Systems

✔ Use implicit feedback (implicitPrefs=True) for clicks/views
✔ Tune rank, maxIter, and regParam
✔ Always set coldStartStrategy="drop" to avoid NaN predictions
✔ Evaluate with RMSE on held-out data


Summary

In this chapter, you learned:

🟦 Clustering

  • KMeans for hard cluster assignments
  • GMM for probability-based clustering
  • Pipelines for clean data-to-cluster flow

🟩 Recommendation Engines

  • ALS for collaborative filtering
  • Predicting top-N recommendations
  • Evaluating prediction accuracy with RMSE

Now your PySpark toolbox includes powerful unsupervised learning capabilities for segmentation and personalized recommendations.


Next Topic → PySpark with Snowflake, Databricks, and Hive Integration