Skip to main content

Clustering & Recommendation Engines in PySpar

k β€” MLlib Unsupervised Learning

At DataVerse Labs, customer segmentation and product recommendations were essential for improving analytics and boosting conversions. Instead of manually grouping customers or building heuristics, they adopted MLlib’s unsupervised learning tools:

  • Clustering β†’ group similar customers/items
  • Recommendation Engines β†’ predict what users may like

In this chapter, we explore KMeans, GMM, and ALS with clean pipelines and real input/output examples.


1. Clustering in PySpark​

Clustering helps answer questions like:

  • What customer groups exist organically?
  • Which products are bought by similar audiences?
  • Can we target different segments with personalized campaigns?

The most popular algorithm: KMeans.


2. KMeans Clustering Example β€” Customer Segmentation​

Input Data​

+--------+-----------+----------+
|income |spending |visits |
+--------+-----------+----------+
|45000 |300 |12 |
|65000 |150 |5 |
|52000 |450 |20 |
|80000 |100 |3 |
+--------+-----------+----------+

Code β€” KMeans Pipeline​

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline

assembler = VectorAssembler(
inputCols=["income", "spending", "visits"],
outputCol="features"
)

kmeans = KMeans(k=2, seed=42)

pipeline = Pipeline(stages=[assembler, kmeans])
model = pipeline.fit(df)

predictions = model.transform(df)
predictions.select("income", "spending", "visits", "prediction").show()

Output Example​

+-------+--------+------+----------+
|income |spending|visits|prediction|
+-------+--------+------+----------+
|45000 |300 |12 |0 |
|65000 |150 |5 |1 |
|52000 |450 |20 |0 |
|80000 |100 |3 |1 |
+-------+--------+------+----------+

Cluster Centers​

model.stages[-1].clusterCenters()

Output:

[array([48500.0, 375.0, 16.0]), array([72500.0,125.0,4.0])]

3. Gaussian Mixture Models (GMM) β€” Soft Clustering​

GMM assigns probabilities instead of hard labels.

from pyspark.ml.clustering import GaussianMixture

gmm = GaussianMixture(k=2)
gmm_model = gmm.fit(df_features)
gmm_model.transform(df_features).show(truncate=False)

Output:

+-----------------+--------------------------------------------+
|features |probability |
+-----------------+--------------------------------------------+
|[45000,300,12] |[0.87,0.13] |
|[65000,150,5] |[0.10,0.90] |
+-----------------+--------------------------------------------+

Great for overlapping clusters.


4. Recommendation Engines β€” ALS Collaborative Filtering​

The Alternating Least Squares (ALS) algorithm powers recommendation engines at:

  • Netflix
  • Amazon
  • Spotify
  • E-commerce platforms

MLlib provides a scalable ALS implementation.


5. ALS Example β€” Product Recommendations​

Input Data (User Ratings)​

+------+--------+--------+
|userId|itemId |rating |
+------+--------+--------+
|1 |101 |5 |
|1 |102 |3 |
|2 |101 |4 |
|2 |103 |2 |
|3 |102 |5 |
+------+--------+--------+

Code β€” ALS Recommender​

from pyspark.ml.recommendation import ALS

als = ALS(
userCol="userId",
itemCol="itemId",
ratingCol="rating",
coldStartStrategy="drop"
)

model = als.fit(df)

# Recommendations for each user
recommendations = model.recommendForAllUsers(3)
recommendations.show(truncate=False)

Output Example​

+------+---------------------------------------------+
|userId|recommendations |
+------+---------------------------------------------+
|1 |[{103, 4.8}, {104, 4.3}, {105, 4.1}] |
|2 |[{102, 4.6}, {105, 4.2}, {101, 4.0}] |
|3 |[{101, 4.7}, {103, 4.5}, {104, 4.2}] |
+------+---------------------------------------------+

These predicted scores help businesses:

βœ” Recommend relevant products βœ” Increase cart value βœ” Boost engagement


6. Evaluating ALS Models​

ALS uses RMSE for evaluation.

from pyspark.ml.evaluation import RegressionEvaluator

predictions = model.transform(df)

evaluator = RegressionEvaluator(
metricName="rmse",
labelCol="rating",
predictionCol="prediction"
)

rmse = evaluator.evaluate(predictions)
print("RMSE:", rmse)

7. Best Practices for Clustering & Recommendations​

Clustering​

βœ” Standardize numeric features (Optional: StandardScaler) βœ” Try multiple values of k βœ” Use silhouette score for validation

Recommendation Systems​

βœ” Use implicit feedback for clicks/views βœ” Tune rank, maxIter, regParam βœ” Always set coldStartStrategy='drop' βœ” Evaluate using RMSE


Summary​

In this chapter, you learned:

🟦 Clustering​

  • KMeans for hard cluster assignments
  • GMM for probability-based clustering
  • Pipelines for clean data-to-cluster flow

🟩 Recommendation Engines​

  • ALS for collaborative filtering
  • Predicting top-N recommendations
  • Evaluating RMSE for ranking quality

Now your PySpark toolbox includes powerful unsupervised learning capabilities for segmentation and personalized recommendations.


Next Topic β†’ PySpark with Snowflake, Databricks, and Hive Integration