This document is a comprehensive cheatsheet for essential Python methods used in machine learning, covering data preprocessing, feature selection, dimensionality reduction, model training and evaluation, model selection and hyperparameter tuning, evaluation metrics, model interpretability, persistence, multiclass and multilabel classification, and clustering. Each section lists various functions and classes along with their primary purposes. The cheatsheet serves as a quick reference for practitioners in the field of machine learning.
By: Waleed Mousa
Data Preprocessing:
● train_test_split(): Split data into training and testing sets.
● StandardScaler(): Standardize features by removing the mean and scaling to unit variance.
● MinMaxScaler(): Scale features to a specified range (default: [0, 1]).
● MaxAbsScaler(): Scale features by their maximum absolute value.
● RobustScaler(): Scale features using statistics that are robust to outliers.
● Normalizer(): Normalize samples individually to unit norm.
● Binarizer(): Binarize data (set feature values to 0 or 1) according to a threshold.
● PolynomialFeatures(): Generate polynomial and interaction features.
● FunctionTransformer(): Construct a transformer from an arbitrary callable.
● KBinsDiscretizer(): Bin continuous data into intervals.
● LabelEncoder(): Encode target labels with integer values between 0 and n_classes-1.
● OneHotEncoder(): Encode categorical features as a one-hot numeric array.
● OrdinalEncoder(): Encode categorical features as an integer array.
● LabelBinarizer(): Binarize labels in a one-vs-all fashion.
● MultiLabelBinarizer(): Transform between an iterable of iterables and a multilabel format.
● SimpleImputer(): Impute missing values using a specified strategy (e.g., mean, median, most_frequent).
● IterativeImputer(): Impute missing values by modeling each feature with missing values as a function of the other features (experimental; requires importing enable_iterative_imputer from sklearn.experimental).
● KNNImputer(): Impute missing values using k-Nearest Neighbors.
● MissingIndicator(): Transform a dataset into a binary matrix indicating the presence of missing values.
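A minimal sketch of a typical flow with these tools, assuming a toy numeric array with one missing value (the data and parameter choices are illustrative, not from the source):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Toy matrix with a missing entry (illustrative assumption)
X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 150.0], [4.0, 300.0]])
y = np.array([0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Impute missing values with the column mean, then standardize
imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()
X_train_clean = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_clean = scaler.transform(imputer.transform(X_test))  # reuse training statistics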
Feature Selection:
● SelectKBest(): Select features according to the k highest scores.
● SelectPercentile(): Select features according to a percentile of the highest scores.
● SelectFpr(): Select features based on a false positive rate test.
● SelectFdr(): Select features based on an estimated false discovery rate.
● SelectFromModel(): Select features based on importance weights.
● SequentialFeatureSelector(): Select features sequentially based on a specified criterion.
● RFE(): Feature ranking with recursive feature elimination.
● RFECV(): Feature ranking with recursive feature elimination and cross-validated selection of the best number of features.
● VarianceThreshold(): Feature selector that removes low-variance features.
● GenericUnivariateSelect(): Univariate feature selector with configurable strategy.
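As a quick illustration, SelectKBest() can score each feature against the target and keep the top k; the dataset and k below are arbitrary choices for demonstration:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask of the retained features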
Dimensionality Reduction:
● PCA(): Perform principal component analysis (PCA) for dimensionality reduction.
● IncrementalPCA(): Perform incremental PCA on a large dataset.
● KernelPCA(): Perform kernel PCA for non-linear dimensionality reduction.
● SparsePCA(): Perform PCA with sparsity constraints.
● TruncatedSVD(): Perform dimensionality reduction using truncated SVD (aka LSA).
● FastICA(): Perform Independent Component Analysis (ICA) for blind source separation.
● NMF(): Perform non-negative matrix factorization (NMF) for dimensionality reduction.
● MiniBatchNMF(): Perform mini-batch non-negative matrix factorization.
● LatentDirichletAllocation(): Perform Latent Dirichlet Allocation (LDA) for topic modeling.
● TSNE(): Perform t-distributed Stochastic Neighbor Embedding for dimensionality reduction.
● Isomap(): Perform Isomap embedding for non-linear dimensionality reduction.
● LocallyLinearEmbedding(): Perform Locally Linear Embedding for non-linear dimensionality reduction.
● MDS(): Perform Multidimensional Scaling (MDS) for dimensionality reduction.
● SpectralEmbedding(): Perform spectral embedding for non-linear dimensionality reduction.
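A short sketch of PCA() in practice, assuming the built-in digits dataset as sample input:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

# Project 64-dimensional digit images onto their first 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # variance captured by each component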
Model Training and Evaluation:
● fit(): Train a model on the given training data.
● predict(): Make predictions using a trained model.
● score(): Return the estimator's default score (e.g., mean accuracy for classifiers) on the given test data and labels.
● cross_val_score(): Perform cross-validation and return the score for each fold.
● cross_val_predict(): Generate cross-validated estimates for each input data point.
● cross_validate(): Evaluate a model using cross-validation, optionally with multiple metrics.
● learning_curve(): Compute the learning curve to assess model performance.
● validation_curve(): Compute the validation curve to assess model performance.
● permutation_test_score(): Perform a permutation test for model evaluation.
● check_cv(): Determine the cross-validation splitting strategy.
● train_test_split(): Split data into training and testing sets.
● KFold(): K-Folds cross-validator.
● StratifiedKFold(): Stratified K-Folds cross-validator.
● LeaveOneOut(): Leave-One-Out cross-validator.
● LeavePOut(): Leave-P-Out cross-validator.
● ShuffleSplit(): Random permutation cross-validator.
● StratifiedShuffleSplit(): Stratified ShuffleSplit cross-validator.
● TimeSeriesSplit(): Time Series cross-validator.
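A sketch combining a few of these: 5-fold stratified cross-validation of a classifier (the estimator and fold count are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold stratified cross-validation; scores holds one accuracy per fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean(), scores.std())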
Model Selection and Hyperparameter Tuning:
● GridSearchCV(): Perform grid search over specified parameter values for an estimator.
● RandomizedSearchCV(): Perform randomized search over specified parameter distributions for an estimator.
● HalvingGridSearchCV(): Perform successive halving with grid search (experimental; requires importing enable_halving_search_cv from sklearn.experimental).
● HalvingRandomSearchCV(): Perform successive halving with randomized search (experimental; see above).
● BayesSearchCV(): Perform Bayesian optimization for hyperparameter tuning (from the scikit-optimize package, not scikit-learn).
● validation_curve(): Compute the validation curve to assess model performance.
● learning_curve(): Compute the learning curve to assess model performance.
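A minimal sketch of GridSearchCV(), assuming an SVC estimator and a small illustrative parameter grid:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Exhaustively try each parameter combination with 5-fold cross-validation
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)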
Model Evaluation Metrics:
● accuracy_score(): Compute the accuracy score.
● balanced_accuracy_score(): Compute the balanced accuracy score.
● average_precision_score(): Compute the average precision score.
● brier_score_loss(): Compute the Brier score loss.
● classification_report(): Build a text report showing the main classification metrics.
● cohen_kappa_score(): Compute Cohen's kappa score.
● confusion_matrix(): Compute the confusion matrix to evaluate the accuracy of a classification.
● dcg_score(): Compute the Discounted Cumulative Gain (DCG) score.
● det_curve(): Compute the Detection Error Tradeoff (DET) curve.
● f1_score(): Compute the F1 score, the harmonic mean of precision and recall.
● fbeta_score(): Compute the F-beta score, the weighted harmonic mean of precision and recall.
● hamming_loss(): Compute the Hamming loss.
● hinge_loss(): Compute the hinge loss for binary classification.
● jaccard_score(): Compute the Jaccard similarity coefficient score.
● log_loss(): Compute the logarithmic loss.
● matthews_corrcoef(): Compute the Matthews correlation coefficient (MCC).
● multilabel_confusion_matrix(): Compute a confusion matrix for each class or sample.
● ndcg_score(): Compute the Normalized Discounted Cumulative Gain (NDCG) score.
● precision_recall_curve(): Compute precision-recall pairs for different probability thresholds.
● precision_recall_fscore_support(): Compute precision, recall, F-measure, and support for each class.
● precision_score(): Compute the precision score.
● recall_score(): Compute the recall score.
● roc_auc_score(): Compute the Area Under the Receiver Operating Characteristic Curve (ROC AUC) score.
● roc_curve(): Compute the Receiver Operating Characteristic (ROC) curve.
● top_k_accuracy_score(): Compute the Top-k accuracy score.
● zero_one_loss(): Compute the Zero-One classification loss.
● explained_variance_score(): Compute the explained variance score.
● max_error(): Compute the maximum residual error.
● mean_absolute_error(): Compute the mean absolute error.
● mean_squared_error(): Compute the mean squared error.
● mean_squared_log_error(): Compute the mean squared logarithmic error.
● median_absolute_error(): Compute the median absolute error.
● r2_score(): Compute the coefficient of determination (R^2) score.
● mean_poisson_deviance(): Compute the mean Poisson deviance.
● mean_gamma_deviance(): Compute the mean Gamma deviance.
● mean_tweedie_deviance(): Compute the mean Tweedie deviance.
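A small example with hypothetical labels showing how a few of the classification metrics are called:

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical true and predicted labels for a binary classifier
y_true = [0, 1, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]

print(accuracy_score(y_true, y_pred))    # fraction of correct predictions
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class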
Model Interpretability:
● permutation_importance(): Compute feature importances using permutation importance.
● partial_dependence(): Compute partial dependence of features for a fitted model.
● plot_partial_dependence(): Plot partial dependence (deprecated in recent scikit-learn versions in favor of PartialDependenceDisplay.from_estimator()).
● plot_tree(): Plot a decision tree.
● export_graphviz(): Export a decision tree in DOT format.
● export_text(): Export a decision tree in text format.
● sklearn.inspection: Module that groups scikit-learn's model inspection utilities (permutation_importance, partial_dependence, etc.).
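A brief sketch of permutation_importance(), assuming a random forest fitted on the iris dataset for illustration:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature in turn and measure the resulting drop in score
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # mean importance per feature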
Model Persistence:
● pickle.dump(): Save a trained model to a file using pickle.
● pickle.load(): Load a trained model from a file using pickle.
● joblib.dump(): Save a trained model to a file using joblib.
● joblib.load(): Load a trained model from a file using joblib.
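A minimal persistence round trip with joblib (the filename is an arbitrary choice):

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Persist the fitted model to disk, then reload and reuse it
joblib.dump(model, "model.joblib")
restored = joblib.load("model.joblib")
print(restored.score(X, y))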
Multiclass and Multilabel Classification:
● OneVsRestClassifier(): One-vs-the-rest (OvR) multiclass strategy.
● OneVsOneClassifier(): One-vs-one (OvO) multiclass strategy.
● OutputCodeClassifier(): (Error-Correcting) Output-Code multiclass strategy.
● ClassifierChain(): A multi-label model that arranges binary classifiers into a chain.
● MultiOutputClassifier(): Multi-target classification.
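A short sketch of the one-vs-rest strategy, assuming LinearSVC as the illustrative base estimator:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Fit one binary LinearSVC per class; prediction picks the most confident one
ovr = OneVsRestClassifier(LinearSVC())
ovr.fit(X, y)
print(ovr.predict(X[:5]))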
Clustering:
● KMeans(): K-Means clustering algorithm.
● MiniBatchKMeans(): Mini-Batch K-Means clustering algorithm.
● AffinityPropagation(): Affinity Propagation clustering algorithm.
● MeanShift(): Mean Shift clustering algorithm.
● SpectralClustering(): Spectral clustering algorithm.
● AgglomerativeClustering(): Agglomerative Hierarchical Clustering algorithm.
● DBSCAN(): Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm.
● OPTICS(): Ordering Points To Identify the Clustering Structure (OPTICS) algorithm.
● Birch(): Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm.
● FeatureAgglomeration(): Agglomerate features.
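A minimal KMeans() sketch on synthetic blob data (all parameters below are illustrative assumptions):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 3 well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)       # cluster assignment for each sample
print(kmeans.cluster_centers_)       # coordinates of the 3 learned centroids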