Introduction to MLlib — Pipelines, Transformers, and Estimators

At DataVerse Labs, the data science team struggled with a familiar problem:

“Our ML code works… until someone touches it.”

Different preprocessing steps, inconsistent model versions, and messy feature engineering caused pipeline failures every week.

So the team switched to PySpark MLlib Pipelines — a structured, reproducible, production-friendly way to prepare data and train machine learning models.

In this chapter, you’ll understand Pipelines, Transformers, and Estimators one by one, with real examples and output.


1. What Are MLlib Pipelines?

Think of a pipeline as an assembly line:

  1. Raw data enters
  2. It passes through multiple stages
  3. A prediction-ready dataset or model comes out

A PySpark ML Pipeline consists of stages, and each stage is one of two kinds (a short sketch follows the lists below):

  • Transformer → transforms data (e.g., VectorAssembler, StringIndexerModel)
  • Estimator → learns from data and creates a transformer (e.g., LogisticRegression)

A pipeline ensures:

✔ reproducibility
✔ cleaner code
✔ easy hyperparameter tuning
✔ versioned, deployable ML workflow
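
To make the distinction concrete, here is a minimal sketch (assuming a local SparkSession and illustrative column names age, income, and label) showing that a Transformer exposes transform() while an Estimator exposes fit():

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler             # a Transformer
from pyspark.ml.classification import LogisticRegression   # an Estimator

spark = SparkSession.builder.getOrCreate()

assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# A Transformer transforms a DataFrame directly; an Estimator must be fit first,
# and fit() returns a fitted Model, which is itself a Transformer.
print(hasattr(assembler, "transform"))  # True
print(hasattr(lr, "fit"))               # True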


2. Transformers — They Transform Data

Transformers apply a function to your dataset.

Examples: StringIndexerModel, VectorAssembler, StandardScalerModel.

Imagine NeoMart customer data:

Input Data

+-------+------+------+
|user_id|gender|income|
+-------+------+------+
|  U1001|  Male| 50000|
|  U1002|Female| 65000|
|  U1003|Female| 45000|
+-------+------+------+
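
(The original doesn't show how df is created; assuming a local SparkSession, one way to build this example DataFrame is:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Small illustrative dataset matching the table above.
df = spark.createDataFrame(
    [("U1001", "Male", 50000), ("U1002", "Female", 65000), ("U1003", "Female", 45000)],
    ["user_id", "gender", "income"],
)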

Code — StringIndexerModel (Transformer)

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="gender", outputCol="gender_index")
model_indexer = indexer.fit(df)
df_indexed = model_indexer.transform(df)

df_indexed.show()

Output

+-------+------+------+------------+
|user_id|gender|income|gender_index|
+-------+------+------+------------+
|  U1001|  Male| 50000|         1.0|
|  U1002|Female| 65000|         0.0|
|  U1003|Female| 45000|         0.0|
+-------+------+------+------------+

Transformers do not learn — they only apply existing logic.
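
For contrast, a plain Transformer such as VectorAssembler needs no fit() at all; a short sketch, assuming the df_indexed DataFrame from above:

from pyspark.ml.feature import VectorAssembler

# The assembling logic is fixed up front, so we can transform directly.
assembler = VectorAssembler(inputCols=["gender_index", "income"], outputCol="features")
df_features = assembler.transform(df_indexed)

df_features.select("user_id", "features").show(truncate=False)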


3. Estimators — They Learn From Data

Estimators produce transformers after fitting.

Examples:

  • StringIndexer → produces StringIndexerModel
  • StandardScaler → produces StandardScalerModel
  • LogisticRegression → produces LogisticRegressionModel

Code — Logistic Regression (Estimator)

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol="features", labelCol="label")
model_lr = lr.fit(training_df)

Here:

  • lr is an Estimator
  • model_lr is a Transformer created by learning from training data (see the short example below)
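
Because model_lr is a Transformer, scoring is just another transform() call. A sketch, assuming the training_df from above (in practice you would score a held-out test set):

# The fitted model appends rawPrediction, probability and prediction columns.
predictions = model_lr.transform(training_df)
predictions.select("label", "prediction", "probability").show()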

4. Building a Full ML Pipeline (Step-by-Step)

Let’s build a pipeline predicting whether a customer will purchase a product.

Input Data

+------+---+------+-----+
|gender|age|income|label|
+------+---+------+-----+
|  Male| 34| 50000|    1|
|Female| 28| 65000|    0|
|Female| 30| 45000|    1|
+------+---+------+-----+
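
(As before, assuming a local SparkSession, the df used by the pipeline below could be created like this:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("Male", 34, 50000, 1), ("Female", 28, 65000, 0), ("Female", 30, 45000, 1)],
    ["gender", "age", "income", "label"],
)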

Pipeline Code

from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

indexer = StringIndexer(inputCol="gender", outputCol="gender_index")

assembler = VectorAssembler(
    inputCols=["gender_index", "age", "income"],
    outputCol="features"
)

lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, assembler, lr])

model = pipeline.fit(df)
predictions = model.transform(df)

predictions.select("gender", "age", "income", "prediction").show()

Output Example

+------+---+------+----------+
|gender|age|income|prediction|
+------+---+------+----------+
|  Male| 34| 50000|       1.0|
|Female| 28| 65000|       0.0|
|Female| 30| 45000|       1.0|
+------+---+------+----------+

The entire process — encoding, assembling, training — is now automated.
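
The fitted PipelineModel also keeps its fitted stages, so you can inspect each one. A small sketch, using the model object from above:

# model.stages holds the fitted version of every pipeline stage.
for stage in model.stages:
    print(type(stage).__name__)
# StringIndexerModel, VectorAssembler, LogisticRegressionModel

# The last stage is the trained classifier.
lr_model = model.stages[-1]
print(lr_model.coefficients)
print(lr_model.intercept)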


5. Saving & Loading the Pipeline

pipeline.write().overwrite().save("/tmp/pipeline_purchase")
model.write().overwrite().save("/tmp/pipeline_purchase_model")

To load later:

from pyspark.ml import PipelineModel
loaded_model = PipelineModel.load("/tmp/pipeline_purchase_model")

Production teams love this step: it avoids costly retraining and keeps results reproducible.
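
Once loaded, the PipelineModel runs the whole workflow (indexing, assembling, scoring) on new data in a single call. A sketch, using a hypothetical new_customers_df with the same gender, age, and income columns:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

new_customers_df = spark.createDataFrame(
    [("Male", 41, 72000), ("Female", 25, 39000)],
    ["gender", "age", "income"],
)

new_predictions = loaded_model.transform(new_customers_df)
new_predictions.select("gender", "age", "income", "prediction").show()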


6. Why Use MLlib Pipelines?

  • Prevents messy ML code
  • Reusable components
  • Ensures consistent feature engineering
  • Works seamlessly with Spark clusters
  • Best for production-grade ML systems


Summary

In this chapter, you learned:

  • Transformers → transform data
  • Estimators → learn from data
  • Pipelines → combine everything into a clean workflow

MLlib pipelines help companies like DataVerse Labs achieve clean, scalable, reproducible ML engineering.


Next Topic → Regression & Classification in PySpark