Clustering & Recommendation Engines in PySpark — MLlib Unsupervised Learning
At DataVerse Labs, customer segmentation and product recommendations were essential for improving analytics and boosting conversions. Instead of manually grouping customers or building heuristics, they adopted MLlib’s unsupervised learning tools:
- Clustering → group similar customers/items
- Recommendation Engines → predict what users may like
In this chapter, we explore KMeans, GMM, and ALS with clean pipelines and real input/output examples.
1. Clustering in PySpark
Clustering helps answer questions like:
- What customer groups exist organically?
- Which products are bought by similar audiences?
- Can we target different segments with personalized campaigns?
The most popular algorithm: KMeans.
2. KMeans Clustering Example — Customer Segmentation
Input Data
+--------+-----------+----------+
|income |spending |visits |
+--------+-----------+----------+
|45000 |300 |12 |
|65000 |150 |5 |
|52000 |450 |20 |
|80000 |100 |3 |
+--------+-----------+----------+
Code — KMeans Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline
assembler = VectorAssembler(
inputCols=["income", "spending", "visits"],
outputCol="features"
)
kmeans = KMeans(k=2, seed=42)
pipeline = Pipeline(stages=[assembler, kmeans])
model = pipeline.fit(df)
predictions = model.transform(df)
predictions.select("income", "spending", "visits", "prediction").show()
Output Example
+-------+--------+------+----------+
|income |spending|visits|prediction|
+-------+--------+------+----------+
|45000 |300 |12 |0 |
|65000 |150 |5 |1 |
|52000 |450 |20 |0 |
|80000 |100 |3 |1 |
+-------+--------+------+----------+
Cluster Centers
model.stages[-1].clusterCenters()
Output:
[array([48500.0, 375.0, 16.0]), array([72500.0, 125.0, 4.0])]
3. Gaussian Mixture Models (GMM) — Soft Clustering
GMM assigns probabilities instead of hard labels.
from pyspark.ml.clustering import GaussianMixture
# df_features must already contain an assembled "features" column
# (e.g., the output of the VectorAssembler above)
gmm = GaussianMixture(k=2, seed=42)
gmm_model = gmm.fit(df_features)
gmm_model.transform(df_features).show(truncate=False)
Output (features and probability columns shown):
+-----------------+--------------------------------------------+
|features |probability |
+-----------------+--------------------------------------------+
|[45000,300,12] |[0.87,0.13] |
|[65000,150,5] |[0.10,0.90] |
+-----------------+--------------------------------------------+
Great for overlapping clusters.
4. Recommendation Engines — ALS Collaborative Filtering
The Alternating Least Squares (ALS) algorithm is a matrix-factorization approach to collaborative filtering, popularized by the Netflix Prize and widely used in streaming and e-commerce recommenders.
MLlib provides a scalable ALS implementation.
5. ALS Example — Product Recommendations
Input Data (User Ratings)
+------+--------+--------+
|userId|itemId |rating |
+------+--------+--------+
|1 |101 |5 |
|1 |102 |3 |
|2 |101 |4 |
|2 |103 |2 |
|3 |102 |5 |
+------+--------+--------+
Code — ALS Recommender
from pyspark.ml.recommendation import ALS
als = ALS(
userCol="userId",
itemCol="itemId",
ratingCol="rating",
coldStartStrategy="drop"
)
model = als.fit(df)
# Recommendations for each user
recommendations = model.recommendForAllUsers(3)
recommendations.show(truncate=False)
Output Example
+------+---------------------------------------------+
|userId|recommendations |
+------+---------------------------------------------+
|1 |[{103, 4.8}, {104, 4.3}, {105, 4.1}] |
|2 |[{102, 4.6}, {105, 4.2}, {101, 4.0}] |
|3 |[{101, 4.7}, {103, 4.5}, {104, 4.2}] |
+------+---------------------------------------------+
These predicted scores help businesses:
✔ Recommend relevant products ✔ Increase cart value ✔ Boost engagement
6. Evaluating ALS Models
ALS models are commonly evaluated with RMSE (root mean squared error) on predicted ratings.
from pyspark.ml.evaluation import RegressionEvaluator
# Note: evaluating on the training data here for brevity; use a held-out split in practice
predictions = model.transform(df)
evaluator = RegressionEvaluator(
metricName="rmse",
labelCol="rating",
predictionCol="prediction"
)
rmse = evaluator.evaluate(predictions)
print("RMSE:", rmse)
7. Best Practices for Clustering & Recommendations
Clustering
✔ Standardize numeric features (e.g., with StandardScaler) ✔ Try multiple values of k ✔ Validate with the silhouette score
Recommendation Systems
✔ Use implicit feedback (implicitPrefs=True) for clicks/views
✔ Tune rank, maxIter, regParam
✔ Always set coldStartStrategy='drop'
✔ Evaluate using RMSE
Summary
In this chapter, you learned:
🟦 Clustering
- KMeans for hard cluster assignments
- GMM for probability-based clustering
- Pipelines for clean data-to-cluster flow
🟩 Recommendation Engines
- ALS for collaborative filtering
- Predicting top-N recommendations
- Evaluating rating-prediction accuracy with RMSE
Now your PySpark toolbox includes powerful unsupervised learning capabilities for segmentation and personalized recommendations.
Next Topic → PySpark with Snowflake, Databricks, and Hive Integration