
Logistic Regression - Practical Hands-On

2. Professional / Technical Style

Below is a refined technical exposition, including code, intermediate snapshots, and explanations of each component in proper ML / Spark terms.

2.1 Setup and Sample Data

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('LogitExample').getOrCreate()

# Let’s create a toy DataFrame instead of reading from file:
data = [
    (0, 3, "male", 22.0, 1, 0, 7.25, "S"),
    (1, 1, "female", 38.0, 1, 0, 71.283, "C"),
    (1, 3, "female", 26.0, 0, 0, 7.925, "S"),
    (1, 1, "female", 35.0, 1, 0, 53.10, "S"),
    (0, 3, "male", 35.0, 0, 0, 8.05, "S"),
    (0, 2, "male", 27.0, 0, 0, 21.0, "S"),
]
columns = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
df = spark.createDataFrame(data, schema=columns)
df.show(truncate=False)

Output (before transformations):

+--------+------+------+----+-----+-----+------+--------+
|Survived|Pclass|Sex   |Age |SibSp|Parch|Fare  |Embarked|
+--------+------+------+----+-----+-----+------+--------+
|0       |3     |male  |22.0|1    |0    |7.25  |S       |
|1       |1     |female|38.0|1    |0    |71.283|C       |
|1       |3     |female|26.0|0    |0    |7.925 |S       |
|1       |1     |female|35.0|1    |0    |53.1  |S       |
|0       |3     |male  |35.0|0    |0    |8.05  |S       |
|0       |2     |male  |27.0|0    |0    |21.0  |S       |
+--------+------+------+----+-----+-----+------+--------+

This corresponds to your final_data.show() before you apply transformations.

2.2 Transformations: Indexing, Encoding, Assembling

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline

# Step 1: indexers
gender_indexer = StringIndexer(inputCol='Sex', outputCol='SexIndex')
embark_indexer = StringIndexer(inputCol='Embarked', outputCol='EmbarkedIndex')

# Step 2: one‑hot encoders
gender_encoder = OneHotEncoder(inputCol='SexIndex', outputCol='SexVec')
embark_encoder = OneHotEncoder(inputCol='EmbarkedIndex', outputCol='EmbarkedVec')

# Step 3: vector assembler
assembler = VectorAssembler(
    inputCols=['Pclass', 'SexVec', 'Age', 'SibSp', 'Parch', 'Fare', 'EmbarkedVec'],
    outputCol='features'
)

pipeline_features = Pipeline(stages=[
    gender_indexer, embark_indexer,
    gender_encoder, embark_encoder,
    assembler
])

# Fit and transform
model_feats = pipeline_features.fit(df)
df_transformed = model_feats.transform(df)
df_transformed.select("Survived", "features").show(truncate=False)

Output (after transformations, assuming Spark 3.x defaults):

+--------+---------------------------------+
|Survived|features                         |
+--------+---------------------------------+
|0       |[3.0,0.0,22.0,1.0,0.0,7.25,1.0]  |
|1       |[1.0,1.0,38.0,1.0,0.0,71.283,0.0]|
|1       |[3.0,1.0,26.0,0.0,0.0,7.925,1.0] |
|1       |[1.0,1.0,35.0,1.0,0.0,53.1,1.0]  |
|0       |[3.0,0.0,35.0,0.0,0.0,8.05,1.0]  |
|0       |[2.0,0.0,27.0,0.0,0.0,21.0,1.0]  |
+--------+---------------------------------+

Here:

  • StringIndexer assigns indexes by descending frequency and, in Spark 3.x, breaks ties alphabetically: "female" → 0 and "male" → 1 (three rows each), "S" → 0 and "C" → 1
  • OneHotEncoder uses dropLast=True by default, so a column with k categories becomes a vector of length k - 1, and the last category is encoded as all zeros
  • The vector length is therefore 1 (Pclass) + 1 (SexVec) + 1 (Age) + 1 (SibSp) + 1 (Parch) + 1 (Fare) + 1 (EmbarkedVec) = 7 features
  • For example, row 1: Pclass = 3, SexVec = [0.0] (male), Age = 22.0, SibSp = 1, Parch = 0, Fare = 7.25, EmbarkedVec = [1.0] (S)
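To make the layout of the features vector concrete, here is a minimal pure-Python sketch of what the indexing, encoding, and assembling stages do to the toy rows. It is not Spark's implementation: the helper names (fit_string_index, one_hot, assemble) are hypothetical, and it assumes Spark 3.x defaults, namely frequency-descending index order with alphabetical tie-breaking and dropLast=True in OneHotEncoder.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index: most frequent first, ties broken alphabetically
    (assumed to mirror StringIndexer's default ordering in Spark 3.x)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """One-hot encode an index; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# (Pclass, Sex, Age, SibSp, Parch, Fare, Embarked) from the toy DataFrame
rows = [
    (3, "male", 22.0, 1, 0, 7.25, "S"),
    (1, "female", 38.0, 1, 0, 71.283, "C"),
    (3, "female", 26.0, 0, 0, 7.925, "S"),
    (1, "female", 35.0, 1, 0, 53.10, "S"),
    (3, "male", 35.0, 0, 0, 8.05, "S"),
    (2, "male", 27.0, 0, 0, 21.0, "S"),
]

sex_index = fit_string_index([r[1] for r in rows])       # {'female': 0, 'male': 1}
embarked_index = fit_string_index([r[6] for r in rows])  # {'S': 0, 'C': 1}

def assemble(row):
    """Concatenate numeric columns and encoded vectors in the assembler's column order."""
    pclass, sex, age, sibsp, parch, fare, embarked = row
    sex_vec = one_hot(sex_index[sex], len(sex_index))
    embarked_vec = one_hot(embarked_index[embarked], len(embarked_index))
    return [float(pclass), *sex_vec, age, float(sibsp), float(parch), fare, *embarked_vec]

features = [assemble(r) for r in rows]
print(features[0])  # [3.0, 0.0, 22.0, 1.0, 0.0, 7.25, 1.0]
```

Passing drop_last=False to one_hot keeps one slot per category, which is what OneHotEncoder(dropLast=False) would produce; the default drops the last slot because it is implied by the others.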

2.3 Logistic Regression Modeling

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

lr = LogisticRegression(featuresCol='features', labelCol='Survived', predictionCol='prediction')

# Combine feature pipeline + logistic regression into a single pipeline
pipeline = Pipeline(stages=[gender_indexer, embark_indexer, gender_encoder, embark_encoder, assembler, lr])

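# With only six rows this split is purely illustrative: the test set may hold just one or two rows.
# If a category (e.g. Embarked = "C") is absent from train, StringIndexer raises on transform
# unless its handleInvalid parameter is set to 'keep' or 'skip'.
train, test = df.randomSplit([0.7, 0.3], seed=42)
lr_pipeline_model = pipeline.fit(train)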

# Apply to test set
results = lr_pipeline_model.transform(test)
results.select("Survived", "prediction", "probability").show(truncate=False)

Output (example):

+--------+----------+------------+
|Survived|prediction|probability |
+--------+----------+------------+
|1       |1.0       |[0.22, 0.78]|
|0       |0.0       |[0.85, 0.15]|
|0       |1.0       |[0.40, 0.60]|
+--------+----------+------------+

Here:

  • probability = [p0, p1], where p1 is the model’s estimate of survival probability
  • prediction is 1.0 if p1 > threshold (0.5 by default), else 0.0
  • On a row where Survived=0 but prediction=1.0, that’s a false positive
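The relationship between the probability and prediction columns can be sketched in a few lines of plain Python. This is an illustration rather than Spark's code: the margin values are made up, and the function names predict and outcome are hypothetical. It assumes the usual binary logistic model, where the margin z = w·x + b is passed through the sigmoid to get p1, and p1 is compared against the threshold.

```python
import math

def predict(margin, threshold=0.5):
    """Turn a linear margin z = w.x + b into [p0, p1] and a hard 0.0/1.0 prediction."""
    p1 = 1.0 / (1.0 + math.exp(-margin))          # sigmoid
    prediction = 1.0 if p1 > threshold else 0.0   # strict comparison against the threshold
    return [1.0 - p1, p1], prediction

def outcome(label, prediction):
    """Name the confusion-matrix cell for one (label, prediction) pair."""
    if prediction == 1.0:
        return "true positive" if label == 1 else "false positive"
    return "false negative" if label == 1 else "true negative"

# A margin of exactly 0 gives p1 = 0.5, which does not exceed the threshold.
print(predict(0.0))                    # ([0.5, 0.5], 0.0)

# A hypothetical margin of 0.4 gives p1 of roughly 0.60; with Survived = 0
# that row is a false positive, like the third row of the example output.
_, pred = predict(0.4)
print(outcome(0, pred))                # false positive
```

Raising the threshold (for example predict(0.4, threshold=0.7)) turns that same row into a true negative, which is why threshold choice trades false positives against false negatives.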

2.4 Evaluation with BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(
    rawPredictionCol='rawPrediction',  # the default
    labelCol='Survived',
    metricName='areaUnderROC'
)
auc = evaluator.evaluate(results)
print("AUC = ", auc)

Explanation:

  • BinaryClassificationEvaluator computes metrics for binary classification tasks
  • metricName = "areaUnderROC" means Area Under the Receiver Operating Characteristic (ROC) curve
  • The ROC curve plots True Positive Rate vs False Positive Rate at various thresholds
  • AUC ranges from 0 to 1; a value closer to 1 means better separability
  • If AUC = 0.5, the model is no better than random guessing
  • You could also use metricName = "areaUnderPR" (area under the precision‑recall curve), which is often more informative when the positive class is rare
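One way to build intuition for AUC is its rank interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that directly in plain Python. It is an equivalent formulation for small inputs, not how BinaryClassificationEvaluator works internally, and the function name auc_roc is hypothetical. The labels and scores are the three rows from the example prediction output.

```python
def auc_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 0, 0]          # Survived
p1 = [0.78, 0.15, 0.60]     # second element of the probability column
print(auc_roc(labels, p1))  # 1.0
```

Note that the result is 1.0 even though the third row is a false positive at the 0.5 threshold: AUC only looks at how the scores rank positives against negatives, not at any single cut-off.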

2.5 Full Code (Concise) & Flow

# 1. Read or create DataFrame → df  
# 2. Pipeline of transformations: StringIndexer, OneHotEncoder, VectorAssembler
# 3. Append the LogisticRegression estimator
# 4. Fit on train set, transform test set
# 5. Inspect predictions + probabilities
# 6. Use BinaryClassificationEvaluator to compute AUC

The result is a trained logistic regression model that, given new passenger data, outputs survival probabilities and predictions.

🧭 1-Minute Recap: Logistic Regression

| Step | Description | Key Tools / Concepts | Output / Purpose |
|------|-------------|----------------------|------------------|
| 1. Gather Data | Collect user or passenger data | Features like age, gender, fare, etc. | Raw dataset |
| 2. Clean & Prepare | Handle missing values, drop irrelevant columns | Imputation, feature selection | Cleaned DataFrame |
| 3. Feature Engineering | Create meaningful features | E.g., watch time, payment failures | Enriched feature set |
| 4. String Indexing | Convert text to numeric indexes | StringIndexer | e.g., "male" → 1, "S" → 0 |
| 5. One-Hot Encoding | Convert index to binary vector | OneHotEncoder | e.g., index 0 → SexVec = [1.0] (last category dropped by default) |
| 6. Assemble Features | Merge all features into one vector | VectorAssembler | Single features column |
| 7. Train Model | Fit logistic regression on training data | LogisticRegression | Model learns weights |
| 8. Predict | Predict on new/test data | .transform() | Outputs: prediction, probability |
| 9. Evaluate | Assess model performance | BinaryClassificationEvaluator | Metric: AUC (e.g., 0.85) |
| 10. Deploy | Use pipeline for new data | Pipeline | End-to-end repeatable workflow |