Clustering & Recommendation Engines in PySpar
k β MLlib Unsupervised Learning
At DataVerse Labs, customer segmentation and product recommendations were essential for improving analytics and boosting conversions. Instead of manually grouping customers or building heuristics, they adopted MLlibβs unsupervised learning tools:
- Clustering β group similar customers/items
- Recommendation Engines β predict what users may like
In this chapter, we explore KMeans, GMM, and ALS with clean pipelines and real input/output examples.
1. Clustering in PySparkβ
Clustering helps answer questions like:
- What customer groups exist organically?
- Which products are bought by similar audiences?
- Can we target different segments with personalized campaigns?
The most popular algorithm: KMeans.
2. KMeans Clustering Example β Customer Segmentationβ
Input Dataβ
+--------+-----------+----------+
|income |spending |visits |
+--------+-----------+----------+
|45000 |300 |12 |
|65000 |150 |5 |
|52000 |450 |20 |
|80000 |100 |3 |
+--------+-----------+----------+
Code β KMeans Pipelineβ
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline
assembler = VectorAssembler(
inputCols=["income", "spending", "visits"],
outputCol="features"
)
kmeans = KMeans(k=2, seed=42)
pipeline = Pipeline(stages=[assembler, kmeans])
model = pipeline.fit(df)
predictions = model.transform(df)
predictions.select("income", "spending", "visits", "prediction").show()
Output Exampleβ
+-------+--------+------+----------+
|income |spending|visits|prediction|
+-------+--------+------+----------+
|45000 |300 |12 |0 |
|65000 |150 |5 |1 |
|52000 |450 |20 |0 |
|80000 |100 |3 |1 |
+-------+--------+------+----------+
Cluster Centersβ
model.stages[-1].clusterCenters()
Output:
[array([48500.0, 375.0, 16.0]), array([72500.0,125.0,4.0])]
3. Gaussian Mixture Models (GMM) β Soft Clusteringβ
GMM assigns probabilities instead of hard labels.
from pyspark.ml.clustering import GaussianMixture
gmm = GaussianMixture(k=2)
gmm_model = gmm.fit(df_features)
gmm_model.transform(df_features).show(truncate=False)
Output:
+-----------------+--------------------------------------------+
|features |probability |
+-----------------+--------------------------------------------+
|[45000,300,12] |[0.87,0.13] |
|[65000,150,5] |[0.10,0.90] |
+-----------------+--------------------------------------------+
Great for overlapping clusters.
4. Recommendation Engines β ALS Collaborative Filteringβ
The Alternating Least Squares (ALS) algorithm powers recommendation engines at:
- Netflix
- Amazon
- Spotify
- E-commerce platforms
MLlib provides a scalable ALS implementation.
5. ALS Example β Product Recommendationsβ
Input Data (User Ratings)β
+------+--------+--------+
|userId|itemId |rating |
+------+--------+--------+
|1 |101 |5 |
|1 |102 |3 |
|2 |101 |4 |
|2 |103 |2 |
|3 |102 |5 |
+------+--------+--------+
Code β ALS Recommenderβ
from pyspark.ml.recommendation import ALS
als = ALS(
userCol="userId",
itemCol="itemId",
ratingCol="rating",
coldStartStrategy="drop"
)
model = als.fit(df)
# Recommendations for each user
recommendations = model.recommendForAllUsers(3)
recommendations.show(truncate=False)
Output Exampleβ
+------+---------------------------------------------+
|userId|recommendations |
+------+---------------------------------------------+
|1 |[{103, 4.8}, {104, 4.3}, {105, 4.1}] |
|2 |[{102, 4.6}, {105, 4.2}, {101, 4.0}] |
|3 |[{101, 4.7}, {103, 4.5}, {104, 4.2}] |
+------+---------------------------------------------+
These predicted scores help businesses:
β Recommend relevant products β Increase cart value β Boost engagement
6. Evaluating ALS Modelsβ
ALS uses RMSE for evaluation.
from pyspark.ml.evaluation import RegressionEvaluator
predictions = model.transform(df)
evaluator = RegressionEvaluator(
metricName="rmse",
labelCol="rating",
predictionCol="prediction"
)
rmse = evaluator.evaluate(predictions)
print("RMSE:", rmse)
7. Best Practices for Clustering & Recommendationsβ
Clusteringβ
β Standardize numeric features (Optional: StandardScaler) β Try multiple values of k β Use silhouette score for validation
Recommendation Systemsβ
β Use implicit feedback for clicks/views
β Tune rank, maxIter, regParam
β Always set coldStartStrategy='drop'
β Evaluate using RMSE
Summaryβ
In this chapter, you learned:
π¦ Clusteringβ
- KMeans for hard cluster assignments
- GMM for probability-based clustering
- Pipelines for clean data-to-cluster flow
π© Recommendation Enginesβ
- ALS for collaborative filtering
- Predicting top-N recommendations
- Evaluating RMSE for ranking quality
Now your PySpark toolbox includes powerful unsupervised learning capabilities for segmentation and personalized recommendations.
Next Topic β PySpark with Snowflake, Databricks, and Hive Integration