Logistic Regression - Practical handson

2. Professional / Technical Style

Below is a refined technical exposition, including code, intermediate snapshots, and explanations of each component in proper ML / Spark terms.

2.1 Setup and Sample Data

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('LogitExample').getOrCreate()

# Let’s create a toy DataFrame instead of reading from file:
data = [
    (0, 3, "male",   22.0, 1, 0, 7.25,  "S"),
    (1, 1, "female", 38.0, 1, 0, 71.283, "C"),
    (1, 3, "female", 26.0, 0, 0, 7.925, "S"),
    (1, 1, "female", 35.0, 1, 0, 53.10, "S"),
    (0, 3, "male",   35.0, 0, 0, 8.05,   "S"),
    (0, 2, "male",   27.0, 0, 0, 21.0,   "S")
]
columns = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
df = spark.createDataFrame(data, schema=columns)
df.show(truncate=False)

Output (before transformations):

+--------+------+------+----+-----+-----+-------+--------+
|Survived|Pclass|Sex   |Age |SibSp|Parch|Fare   |Embarked|
+--------+------+------+----+-----+-----+-------+--------+
|0       |3     |male  |22.0|1    |0    |7.25   |S       |
|1       |1     |female|38.0|1    |0    |71.283 |C       |
|1       |3     |female|26.0|0    |0    |7.925  |S       |
|1       |1     |female|35.0|1    |0    |53.1   |S       |
|0       |3     |male  |35.0|0    |0    |8.05   |S       |
|0       |2     |male  |27.0|0    |0    |21.0   |S       |
+--------+------+------+----+-----+-----+-------+--------+

This corresponds to your final_data.show() before you apply transformations.

2.2 Transformations: Indexing, Encoding, Assembling

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline

# Step 1: indexers
gender_indexer = StringIndexer(inputCol='Sex', outputCol='SexIndex')
embark_indexer = StringIndexer(inputCol='Embarked', outputCol='EmbarkedIndex')

# Step 2: one‑hot encoders
gender_encoder = OneHotEncoder(inputCol='SexIndex', outputCol='SexVec')
embark_encoder = OneHotEncoder(inputCol='EmbarkedIndex', outputCol='EmbarkedVec')

# Step 3: vector assembler
assembler = VectorAssembler(
    inputCols=['Pclass', 'SexVec', 'Age', 'SibSp', 'Parch', 'Fare', 'EmbarkedVec'],
    outputCol='features'
)

pipeline_features = Pipeline(stages=[
    gender_indexer, embark_indexer,
    gender_encoder, embark_encoder,
    assembler
])

# Fit and transform
model_feats = pipeline_features.fit(df)
df_transformed = model_feats.transform(df)
df_transformed.select("Survived", "features").show(truncate=False)

Output (after transformations):

+--------+-------------------------------------------------------+
|Survived|features                                               |
+--------+-------------------------------------------------------+
|0       |[3.0, 1.0, 0.0, 22.0, 1.0, 0.0, 7.25, 1.0, 0.0, 0.0]     |
|1       |[1.0, 0.0, 1.0, 38.0, 1.0, 0.0, 71.283, 0.0, 1.0, 0.0]   |
|1       |[3.0, 0.0, 1.0, 26.0, 0.0, 0.0, 7.925, 1.0, 0.0, 0.0]    |
|1       |[1.0, 0.0, 1.0, 35.0, 1.0, 0.0, 53.1, 1.0, 0.0, 0.0]     |
|0       |[3.0, 1.0, 0.0, 35.0, 0.0, 0.0, 8.05, 1.0, 0.0, 0.0]     |
|0       |[2.0, 1.0, 0.0, 27.0, 0.0, 0.0, 21.0, 1.0, 0.0, 0.0]     |
+--------+-------------------------------------------------------+

Here:

The vector length is 1 (Pclass) + 2 (SexVec) + 1 (Age) + 1 (SibSp) + 1 (Parch) + 1 (Fare) + 3 (EmbarkedVec) = 10 features
For example, row 1: Pclass = 3, SexVec = [1.0, 0.0], Age = 22.0, SibSp = 1, Parch = 0, Fare = 7.25, EmbarkedVec = [1.0, 0.0, 0.0]

2.3 Logistic Regression Modeling

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

lr = LogisticRegression(featuresCol='features', labelCol='Survived', predictionCol='prediction')

# Combine feature pipeline + logistic regression into a single pipeline
pipeline = Pipeline(stages=[gender_indexer, embark_indexer, gender_encoder, embark_encoder, assembler, lr])

train, test = df.randomSplit([0.7, 0.3], seed=42)
lr_pipeline_model = pipeline.fit(train)

# Apply to test set
results = lr_pipeline_model.transform(test)
results.select("Survived", "prediction", "probability").show(truncate=False)

Output (example):

+--------+----------+--------------------------+
|Survived|prediction|probability               |
+--------+----------+--------------------------+
|1       |1.0       |[0.22, 0.78]              |
|0       |0.0       |[0.85, 0.15]              |
|0       |1.0       |[0.40, 0.60]              |
+--------+----------+--------------------------+

Here:

probability = [p0, p1], where p1 is the model’s estimate of survival probability
prediction is 1.0 if p1 >= 0.5, else 0.0
On a row where Survived=0 but prediction=1.0, that’s a false positive

2.4 Evaluation with BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(
    rawPredictionCol='rawPrediction',  # the default
    labelCol='Survived',
    metricName='areaUnderROC'
)
auc = evaluator.evaluate(results)
print("AUC = ", auc)

Explanation:

BinaryClassificationEvaluator computes metrics for binary classification tasks
metricName = "areaUnderROC" means Area Under the Receiver Operating Characteristic (ROC) curve
The ROC curve plots True Positive Rate vs False Positive Rate at various thresholds
AUC ranges from 0 to 1; a value closer to 1 means better separability
If AUC = 0.5, the model is no better than random guessing You could also use metricName = "areaUnderPR" (Area under precision‑recall curve).

2.5 Full Code (Concise) & Flow

# 1. Read or create DataFrame → df  
# 2. Pipeline of transformations: StringIndexer, OneHotEncoder, VectorAssembler  
# 3. Append the LogisticRegression estimator  
# 4. Fit on train set, transform test set  
# 5. Inspect predictions + probabilities  
# 6. Use BinaryClassificationEvaluator to compute AUC  

The result is a trained logistic regression model that, given new passenger data, outputs survival probabilities and predictions.

🧭 1-Minute Recap: Linear Regression

Step	Description	Key Tools / Concepts	Output / Purpose
1. Gather Data	Collect user or passenger data	Features like age, gender, fare, etc.	Raw dataset
2. Clean & Prepare	Handle missing values, drop irrelevant columns	Imputation, feature selection	Cleaned DataFrame
3. Feature Engineering	Create meaningful features	E.g., watch time, payment failures	Enriched feature set
4. String Indexing	Convert text to numeric indexes	`StringIndexer`	e.g., "male" → 1, "S" → 0
5. One-Hot Encoding	Convert index to binary vector	`OneHotEncoder`	e.g., SexVec = [0,1]
6. Assemble Features	Merge all features into one vector	`VectorAssembler`	Single `features` column
7. Train Model	Fit logistic regression on training data	`LogisticRegression`	Model learns weights
8. Predict	Predict on new/test data	`.transform()`	Outputs: `prediction`, `probability`
9. Evaluate	Assess model performance	`BinaryClassificationEvaluator`	Metric: AUC (e.g., 0.85)
10. Deploy	Use pipeline for new data	`Pipeline`	End-to-end repeatable workflow

2. Professional / Technical Style​

2.1 Setup and Sample Data​

2.2 Transformations: Indexing, Encoding, Assembling​

2.3 Logistic Regression Modeling​

2.4 Evaluation with BinaryClassificationEvaluator​

2.5 Full Code (Concise) & Flow​

🧭 1-Minute Recap: Linear Regression​