Introduction to MLlib — Pipelines, Transformers, and Estimators
At DataVerse Labs, the data science team struggled with a familiar problem:
“Our ML code works… until someone touches it.”
Different preprocessing steps, inconsistent model versions, and messy feature engineering caused pipeline failures every week.
So the team switched to PySpark MLlib Pipelines — a structured, reproducible, production-friendly way to prepare data and train machine learning models.
In this chapter, you’ll understand Pipelines, Transformers, and Estimators one by one, with real examples and output.
1. What Are MLlib Pipelines?
Think of a pipeline as an assembly line:
- Raw data enters
- It passes through multiple stages
- A prediction-ready dataset or model comes out
A PySpark ML Pipeline consists of stages, and each stage is either:
- Transformer → transforms data (e.g., VectorAssembler, StringIndexerModel)
- Estimator → learns from data and creates a transformer (e.g., LogisticRegression)
A pipeline ensures:
✔ reproducibility
✔ cleaner code
✔ easy hyperparameter tuning
✔ versioned, deployable ML workflow
2. Transformers — They Transform Data
Transformers apply a function to your dataset.
Examples: StringIndexerModel, VectorAssembler, StandardScalerModel.
Imagine NeoMart customer data:
Input Data
+--------+--------+-------+
|user_id |gender |income |
+--------+--------+-------+
|U1001 |Male |50000 |
|U1002 |Female |65000 |
|U1003 |Female |45000 |
+--------+--------+-------+
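If you want to run the example yourself, here is a minimal sketch of how this DataFrame could be created (assuming an existing SparkSession named spark):
# small illustrative dataset; values match the table above
df = spark.createDataFrame(
    [("U1001", "Male", 50000),
     ("U1002", "Female", 65000),
     ("U1003", "Female", 45000)],
    ["user_id", "gender", "income"]
)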
Code — StringIndexerModel (Transformer)
(Note: StringIndexer itself is an Estimator; the fit() call below produces the StringIndexerModel Transformer that does the actual work.)
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="gender", outputCol="gender_index")
model_indexer = indexer.fit(df)
df_indexed = model_indexer.transform(df)
df_indexed.show()
Output
+--------+--------+-------+-------------+
|user_id |gender |income |gender_index |
+--------+--------+-------+-------------+
|U1001 |Male |50000 |1.0 |
|U1002 |Female |65000 |0.0 |
|U1003 |Female |45000 |0.0 |
+--------+--------+-------+-------------+
Transformers do not learn — they only apply existing logic.
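By contrast, a pure Transformer such as VectorAssembler needs no fit() step at all. A short sketch, continuing from df_indexed above:
from pyspark.ml.feature import VectorAssembler

# combine numeric columns into a single feature vector; no learning involved
assembler = VectorAssembler(
    inputCols=["gender_index", "income"],
    outputCol="features"
)
df_features = assembler.transform(df_indexed)  # transform() only, no fit()
df_features.show(truncate=False)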
3. Estimators — They Learn From Data
Estimators produce transformers after fitting.
Examples:
- StringIndexer → produces StringIndexerModel
- StandardScaler → produces StandardScalerModel
- LogisticRegression → produces LogisticRegressionModel
Code — Logistic Regression (Estimator)
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(featuresCol="features", labelCol="label")
model_lr = lr.fit(training_df)
Here:
- lr is an Estimator
- model_lr is a Transformer created by learning from training data, as shown below
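A sketch of using the fitted model (assuming a test_df that already has a features column):
# the fitted LogisticRegressionModel is applied like any other Transformer
predictions = model_lr.transform(test_df)
predictions.select("features", "probability", "prediction").show()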
4. Building a Full ML Pipeline (Step-by-Step)
Let’s build a pipeline predicting whether a customer will purchase a product.
Input Data
+-------+--------+-------+------+
|gender |age |income |label |
+-------+--------+-------+------+
|Male |34 |50000 |1 |
|Female |28 |65000 |0 |
|Female |30 |45000 |1 |
+-------+--------+-------+------+
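As before, a minimal sketch for creating this DataFrame (assuming an existing SparkSession named spark):
# label = 1 means the customer purchased, 0 means they did not
df = spark.createDataFrame(
    [("Male", 34, 50000, 1),
     ("Female", 28, 65000, 0),
     ("Female", 30, 45000, 1)],
    ["gender", "age", "income", "label"]
)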
Pipeline Code
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
indexer = StringIndexer(inputCol="gender", outputCol="gender_index")
assembler = VectorAssembler(
inputCols=["gender_index", "age", "income"],
outputCol="features"
)
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[indexer, assembler, lr])
model = pipeline.fit(df)
predictions = model.transform(df)
predictions.select("gender", "age", "income", "prediction").show()
Output Example
+--------+---+-------+----------+
|gender |age|income |prediction|
+--------+---+-------+----------+
|Male |34 |50000 |1.0 |
|Female |28 |65000 |0.0 |
|Female |30 |45000 |1.0 |
+--------+---+-------+----------+
The entire process — encoding, assembling, training — is now automated.
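The fitted PipelineModel keeps every fitted stage in order, so you can also inspect what was learned. A short sketch (the last stage is the fitted LogisticRegressionModel):
# list the fitted stages: StringIndexerModel, VectorAssembler, LogisticRegressionModel
for stage in model.stages:
    print(type(stage).__name__)

lr_model = model.stages[-1]      # the fitted LogisticRegressionModel
print(lr_model.coefficients)     # learned feature weights
print(lr_model.intercept)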
5. Saving & Loading the Pipeline
pipeline.write().overwrite().save("/tmp/pipeline_purchase")
model.write().overwrite().save("/tmp/pipeline_purchase_model")
To load later:
from pyspark.ml import PipelineModel
loaded_model = PipelineModel.load("/tmp/pipeline_purchase_model")
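The loaded model behaves exactly like the original, so new data can be scored without retraining. A sketch, assuming a hypothetical new_customers_df with the same gender, age, and income columns:
new_predictions = loaded_model.transform(new_customers_df)
new_predictions.select("gender", "age", "income", "prediction").show()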
Production teams love this step: it saves retraining costs and ensures reproducibility.
6. Why Use MLlib Pipelines?
✔ Prevents messy ML code
✔ Reusable components
✔ Ensures consistent feature engineering
✔ Works seamlessly with Spark clusters
✔ Best for production-grade ML systems
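Because the whole pipeline is itself an Estimator, hyperparameter tuning can wrap it as a single unit. A minimal sketch with CrossValidator, reusing the pipeline and lr from section 4 (grid values are illustrative and assume a realistically sized dataset):
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# try a few regularization settings for the LogisticRegression stage
param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.01, 0.1])
              .addGrid(lr.elasticNetParam, [0.0, 0.5])
              .build())

cv = CrossValidator(
    estimator=pipeline,                                   # tune the entire pipeline
    estimatorParamMaps=param_grid,
    evaluator=BinaryClassificationEvaluator(labelCol="label"),
    numFolds=3
)

cv_model = cv.fit(df)
best_model = cv_model.bestModel                           # a fitted PipelineModel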
Summary
In this chapter, you learned:
✨ Transformers → transform data
✨ Estimators → learn from data
✨ Pipelines → combine everything into a clean workflow
MLlib pipelines help companies like DataVerse Labs achieve clean, scalable, reproducible ML engineering.
Next Topic → Regression & Classification in PySpark