This document is a comprehensive cheatsheet for PySpark Machine Learning (MLlib), covering data preparation, feature transformation, model training, evaluation, and persistence. It includes code snippets for various operations such as creating vectors, scaling features, training models, and evaluating performance metrics. The document also addresses advanced topics like hyperparameter tuning, model interpretability, and recommender systems.
PySpark MLlib
● Create a pipeline: from pyspark.ml import Pipeline; pipeline = Pipeline(stages=[assembler, scaler, lr])
● Fit a pipeline: pipeline_model = pipeline.fit(train_data)
● Transform data using a pipeline: predictions = pipeline_model.transform(test_data)
10. Utilities
● Correlation: from pyspark.ml.stat import Correlation; corr_matrix = Correlation.corr(df, "features")
● FP-Growth: from pyspark.ml.fpm import FPGrowth; fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
● PrefixSpan: from pyspark.ml.fpm import PrefixSpan; prefixSpan = PrefixSpan(minSupport=0.1, maxPatternLength=5, maxLocalProjDBSize=32000000)
● Association Rules: from pyspark.ml.fpm import FPGrowth; fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6); model = fpGrowth.fit(data); associationRules = model.associationRules
17. Model Interpretability
● Feature Importance: model.featureImportances
● Decision Tree Visualization: from pyspark.ml.classification import DecisionTreeClassificationModel; model.toDebugString
● Linear Model Coefficients: model.coefficients
● Linear Model Intercept: model.intercept
18. Hyperparameter Tuning
● ParamGridBuilder: from pyspark.ml.tuning import ParamGridBuilder;