PySpark Quiz — MLlib Advanced & Production Pipelines

1. Which PySpark MLlib class is used to assemble multiple feature columns into a single vector for ML models?

2. You have a DataFrame df with categorical column 'category'. How do you convert it to numeric labels for ML?

3. You want to train a Logistic Regression model in PySpark. Which code is correct?

    from pyspark.ml.classification import LogisticRegression

4. How do you split your dataset into training (70%) and testing (30%) sets?

5. After training, you want to evaluate the model using area under ROC. Which evaluator is correct?

6. You want to save your trained MLlib model to disk. How do you do it?

7. How do you load a saved PySpark MLlib model?

8. You want to apply a trained model on streaming data. Which approach is correct?

9. Consider this code:

    from pyspark.ml.feature import VectorAssembler
    assembler = VectorAssembler(inputCols=['age','salary'], outputCol='features')
    df_transformed = assembler.transform(df)

What does df_transformed contain?

10. You want to perform hyperparameter tuning with cross-validation. Which PySpark classes are used?