Enthought Python Machine Learning: scikit-learn Cheat Sheets 1–3 (v1.0)

Regression: Predict Continuous Data

Predict how a dependent variable (output, t) changes when any of the independent variables (inputs, or features, x) change; for example, how house prices change as a function of neighborhood and size, or how time spent on a web page varies as a function of the number of ads and content type. Training data has N samples and D features.

Linear Model O(ND2)

Solves problems of the form y = w0 + w1x1 + … + wDxD, with predicted value y, features x, and fitted weights w.
Solved by minimizing the "least squares error", ED = Σn (yn − tn)², the sum of squared differences between predictions yn and targets tn.
On fitted models, access w as model.coef_ and w0 as model.intercept_.
Gotchas: Features must be uncorrelated; use decomposition.PCA().
Code: linear_model.LinearRegression() if less than 100k samples, or see SGD.

Nonlinear Transformations

When to use them: A "straight line" is not sufficient, for example, predicting temperature as a function of time of day.
How it works: "Reword" a nonlinear model in linear terms using nonlinear basis functions, Φj(x), so we can use the linear model machinery to solve nonlinear problems. The linear model becomes y = w0 + Σj wj Φj(x).

Polynomial Expansion of Order P: E.g., a 2nd-order polynomial model of two features becomes a model with these 6 basis functions: 1, x1, x2, x1x2, x1², x2².
Gotchas: The same feature affects many different coefficients, so an outlier can have a big global effect. The number of basis functions grows very quickly, O((P+1)(D+1)).
Code: poly = preprocessing.PolynomialFeatures(degree)
x_poly = poly.fit_transform(x)

Radial Basis Functions (RBF): Local, Gaussian-shaped functions, defined by centers and width. Turns one feature into P features.

Code: metrics.pairwise.rbf_kernel(x, centers, gamma)
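To make the basis-function recipe concrete, here is a minimal sketch (not part of the original sheet; the synthetic data and variable names are illustrative) that pairs PolynomialFeatures with LinearRegression:

    import numpy as np
    from sklearn import linear_model, preprocessing

    # Synthetic 1-D nonlinear data: t = x**2 plus noise (an assumption for illustration)
    rng = np.random.RandomState(0)
    x = rng.uniform(-3, 3, size=(100, 1))
    t = x[:, 0] ** 2 + 0.1 * rng.randn(100)

    # Expand the feature with polynomial basis functions, then fit a linear model
    poly = preprocessing.PolynomialFeatures(degree=2)
    x_poly = poly.fit_transform(x)          # columns: 1, x, x**2

    model = linear_model.LinearRegression()
    model.fit(x_poly, t)
    print(model.intercept_, model.coef_)    # w0 and w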

Ridge O(ND2)

When to use it: Less than 100k samples, noisy outputs.
How it works: Linear model that limits the size of the weights. Prevents overfitting by increasing bias. Minimizes E instead of ED, where the second term is called the "L2 norm": E = ED + α Σj wj².
Code: linear_model.Ridge(alpha)
• alpha: Regularization strength, alpha > 0, corresponds to 1/C in other models. Increase if noisy samples.
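A minimal usage sketch (illustrative, not from the sheet; the synthetic data and alpha value are assumptions):

    import numpy as np
    from sklearn import linear_model

    rng = np.random.RandomState(0)
    X = rng.randn(200, 5)                        # N=200 samples, D=5 features
    t = X @ [1.0, 0.5, 0.0, 0.0, -2.0] + 0.3 * rng.randn(200)

    ridge = linear_model.Ridge(alpha=1.0)        # increase alpha for noisier data
    ridge.fit(X, t)
    print(ridge.coef_, ridge.intercept_)         # shrunken weights and intercept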

Lasso O(ND2)

When to use it: Less than 100k samples, only some features should be important.
How it works: Linear model that forces small weights to be zero. Minimizes E instead of ED, where the second term is called the "L1 norm": E = ED + α Σj |wj|.
Code: linear_model.Lasso(alpha)
• alpha: Regularization strength, alpha > 0, corresponds to 1/C in other models. Increase if noisy samples.
Tip: Use with feature_selection.SelectFromModel as a transformation stage to select features with non-zero weights.

Support Vector Regressor ~O(N2D)

When to use it: Many important features, more features than samples, nonlinear problem.
How it works: Find a function such that training points fit within a "tube" of acceptable error, with some tolerance towards points that are outside the tube.
Gotchas: Must scale inputs; see StandardScaler and RobustScaler.
Code: Start with svm.LinearSVR(epsilon, C=1). Make C smaller if lots of noisy observations (C = 1/α, small C means more regularization). If LinearSVR doesn't work, use svm.SVR(kernel='rbf', gamma).
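Because inputs must be scaled, a hedged sketch (illustrative, not from the sheet; data and parameter values are assumptions) chains StandardScaler and LinearSVR in a pipeline:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVR

    rng = np.random.RandomState(0)
    X = rng.randn(500, 10)
    t = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(500)

    # Scale the inputs, then fit the tube-based regressor
    svr = make_pipeline(StandardScaler(), LinearSVR(epsilon=0.1, C=1.0))
    svr.fit(X, t)
    print(svr.score(X, t))     # R^2 on the training data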

Stochastic Gradient Descent (SGD) Regressor

When to use it: Fit is too slow with other estimators.
How it works: "Online" method, learns the weights in batches, with a subset of the data each time. Pair with manual basis function expansion to train nonlinear models on really large datasets.
Code: linear_model.SGDRegressor() and the partial_fit() method.

Ridge vs. Lasso – Shape of Ew

With Ridge and Lasso, the error to minimize, E, has an extra component, Ew (E = ED + Ew): quadratic in the weights for Ridge, proportional to their absolute values for Lasso. Lasso produces sparse models because small weights are forced to zero.
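A rough sketch of the online pattern (assumed usage, not from the sheet; the batch sizes and data are illustrative): feed SGDRegressor one batch at a time with partial_fit.

    import numpy as np
    from sklearn import linear_model

    rng = np.random.RandomState(0)
    sgd = linear_model.SGDRegressor()

    # Pretend the data arrives in 10 batches of 1,000 samples each
    for _ in range(10):
        X_batch = rng.randn(1000, 20)
        t_batch = X_batch[:, 0] + 0.1 * rng.randn(1000)
        sgd.partial_fit(X_batch, t_batch)   # updates the weights incrementally

    print(sgd.coef_[:3])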

Performance Metrics in sklearn.metrics

mean_squared_error: Smaller is better. Puts a large weight on outliers.

r2_score: Coefficient of determination. Best score is 1.0. Proportion of explained variance. Default for model.score(x, t).

mean_absolute_error: Smaller is better. Uses the same scale as the data.

median_absolute_error: Robust to outliers.
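For example (an illustrative sketch, not from the sheet; the numbers are made up), all of the metrics above compare targets t with predictions y:

    from sklearn import metrics

    t = [3.0, -0.5, 2.0, 7.0]        # true targets
    y = [2.5,  0.0, 2.0, 8.0]        # model predictions

    print(metrics.mean_squared_error(t, y))
    print(metrics.r2_score(t, y))
    print(metrics.mean_absolute_error(t, y))
    print(metrics.median_absolute_error(t, y))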

Classification: Predict Categorical Data
Predict the class, or label (t), of a sample based on its features (x). Examples: Recognize hand-written digits, or mark email as spam. In scikit-learn, labels are represented as integers and get expanded internally into matrices of binary choices between unique integer labels. Use class_weight='balanced' in most models to adjust for unbalanced datasets (more training data from one class than others). Training data has N samples and D features.

Support Vector Classifier O(ND2) to O(ND3)

When to use it: Large number of features. Slightly more features than samples.
How it works: Maximize the distance between classes in high-dimensional space, i.e., a "maximum margin classifier".
Gotchas: Scale your data.
Code: svm.SVC(kernel, C=1). Make C smaller if lots of noisy samples. If accuracy is important, set kernel='rbf'. If fast training is important, use svm.LinearSVC().
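A minimal sketch (not from the sheet; the iris dataset and parameter values are illustrative) that follows the "scale your data" gotcha by wrapping SVC in a pipeline:

    from sklearn.datasets import load_iris
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, t = load_iris(return_X_y=True)

    # Scale the features, then fit the maximum-margin classifier
    clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
    clf.fit(X, t)
    print(clf.score(X, t))    # training accuracy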

Logistic Regression O(ND2)

When to use it: Need to understand contributions of features. Fast to train, easy to interpret.
How it works: Fits an s-shaped function (the logistic function), which is continuous but has a steep transition between the two classes, and assigns the class based on sign.
Gotchas: Inputs must be scaled and uncorrelated.
Code: linear_model.LogisticRegression(C, solver).
• penalty='l1' to use the estimator for feature selection.
• solver='liblinear' for small datasets or the L1 penalty,
• 'lbfgs', 'sag' or 'newton-cg' for multi-class problems and large datasets, and
• 'sag' for very large datasets.
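An illustrative sketch (not from the sheet; the dataset choice and parameters are assumptions): scale the inputs, fit, and read the per-feature contributions from coef_.

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, t = load_breast_cancer(return_X_y=True)

    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(C=1.0, solver='liblinear'))
    clf.fit(X, t)
    # Weights of the fitted logistic model, one per (scaled) feature
    print(clf.named_steps['logisticregression'].coef_)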

Neighbor Classifiers O(DlogN) to O(DN)

When to use them: Large datasets. Very irregular decision boundary.
How it works: Predict the class by majority vote from nearby data (e.g., the K=3 or K=5 nearest samples).
Gotchas: Efficiency comes at the cost of also having high variance.
Code: neighbors.KNeighborsClassifier(n_neighbors).
• Use RadiusNeighborsClassifier() for unbalanced data and D not too large.
• Try weights='uniform' and 'distance'.
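A small sketch (illustrative, not from the sheet; dataset and n_neighbors are assumptions) comparing the two weighting schemes with cross-validation:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, t = load_iris(return_X_y=True)

    for weights in ('uniform', 'distance'):
        knn = KNeighborsClassifier(n_neighbors=5, weights=weights)
        scores = cross_val_score(knn, X, t, cv=5)   # 5-fold accuracy
        print(weights, scores.mean())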

Decision Tree O(NDlog(N))

When to use it: Need to understand prediction decisions. Data has both continuous and categorical features. No scaling needed.
How it works: Chain binary decisions on increasingly smaller subsets of data. Deeper trees have more complex decision rules and a better fit.
Gotchas: Very often overfits. Consider doing dimensionality reduction beforehand. N must double with each extra level.
Code: tree.DecisionTreeClassifier(max_depth). Start with max_depth=3, then increase. Use tree.export_graphviz to visualize the tree.
(Figure: example tree for answering the phone – known number? → is grandma? → pick up / ignore.)

Stochastic Gradient Descent (SGD) Classifier

When to use it: Very large N and D, e.g., 10^5 samples and 10^5 features.
How it works: "Online" method, learns the weights in batches.
Gotchas: Data must be scaled.
Code: linear_model.SGDClassifier(loss, alpha, n_iter) and the partial_fit() method.
• Use n_iter=np.ceil(10**6/n_samples). loss='hinge' gives SVC, 'log' gives logistic regression.
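A rough sketch of the online pattern for classification (assumed usage, not from the sheet; the synthetic batches are illustrative); note that partial_fit needs the full list of classes up front:

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.preprocessing import StandardScaler

    rng = np.random.RandomState(0)
    clf = SGDClassifier(loss='hinge', alpha=1e-4)    # 'hinge' gives a linear SVM
    scaler = StandardScaler()
    classes = np.array([0, 1])                       # all labels, known up front

    for _ in range(10):                              # pretend 10 batches arrive
        X_batch = rng.randn(1000, 50)
        t_batch = (X_batch[:, 0] > 0).astype(int)
        X_batch = scaler.partial_fit(X_batch).transform(X_batch)  # scale online too
        clf.partial_fit(X_batch, t_batch, classes=classes)

    X_test = rng.randn(200, 50)
    t_test = (X_test[:, 0] > 0).astype(int)
    print(clf.score(scaler.transform(X_test), t_test))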

Ensemble Methods

When to use them: No single estimator gave satisfying results.
How they work: "Wisdom of the crowd". Combine the predictions of multiple weak, biased estimators to create a better one. Two types: averaging methods build many estimators and average their predictions; in boosting methods each new estimator tries to improve the previous one.
Gotchas: Hard to generate the perfect mix of estimators.
Code: All in the ensemble module.
Averaging estimators:
• RandomForestClassifier(max_features)
• ExtraTreesClassifier(max_features)
• Start with these, but always cross-validate: max_features=sqrt(n_features), max_depth=None, min_samples_split=1.
Boosting estimators:
• AdaBoostClassifier()
• GradientBoostingClassifier()
All:
• Parallelize with n_jobs=-1.
• Increasing n_estimators is better, but slower.

Performance Metrics in sklearn.metrics

They take targets, t, and predicted classes, y, as arguments. There's more than one way to be wrong: a fire alarm that always goes off is annoying; one that never goes off is costly.
• confusion_matrix: Explore how the model confuses classes. Visualize with seaborn.heatmap.
• accuracy_score (default for model.score): Fraction correctly predicted. Meaningless if samples are unbalanced. (TP + TN) / Total
• recall_score: Of all the cases where there is actually fire, the fraction predicted as fire. TP / (TP + FN)
• precision_score: Of all the cases predicted as fire, the fraction where there is actually fire. TP / (TP + FP)
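Putting the two halves together, an illustrative sketch (not from the sheet; the dataset and parameters are assumptions): fit a random forest, then inspect the confusion matrix and per-class scores on held-out data.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
    from sklearn.model_selection import train_test_split

    X, t = load_breast_cancer(return_X_y=True)
    X_train, X_test, t_train, t_test = train_test_split(X, t, random_state=0)

    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    clf.fit(X_train, t_train)
    y = clf.predict(X_test)

    print(confusion_matrix(t_test, y))
    print(accuracy_score(t_test, y), precision_score(t_test, y), recall_score(t_test, y))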

Clustering: Unsupervised Learning
Predict the underlying structure in features, without the use of targets or labels. Splits samples into groups called "clusters". With no targets, models are trained by minimizing some definition of "distance" within a cluster. Data has N samples, D features, and the model discovers k clusters. Models can be used for prediction or for transformation, by reducing D features into one with k unique values.

Some models expect geometries that are "flat", or roughly spherical. Clusters with complicated shapes like rings or lines are not flat, and will not work in those models.

Agglomerative Clustering O(N2 logN)

When to use it: Need a flexible definition of distance (e.g. Levenshtein).
How it works: Defines all observations as unique clusters, then merges the closest ones iteratively.
Gotchas: Worst time complexity. "Rich get richer" behavior.
Code: cluster.AgglomerativeClustering(linkage, affinity, connectivity). Set the linkage criterion for merging:
• 'ward': minimize sum of square differences. Minimizes variance. Gives most regular cluster sizes.
• 'complete': minimize max distance between sample pairs.
• 'average': minimize average distance between all sample pairs. Yields uneven cluster sizes.
• affinity: defines the type of distance. 'l1' for sparse features, e.g., text; 'cosine' is invariant to scaling.
• connectivity: provides extra constraints about which nodes can be merged, e.g., neighbors.kneighbors_graph.
(Figure: dendrogram of pairwise merges, a–f → abcdef.)
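A minimal sketch (assumed usage, not from the sheet; the synthetic 2-D blobs are illustrative), clustering with Ward linkage:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    rng = np.random.RandomState(0)
    # Two blobs of points in 2-D
    X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [5, 5]])

    agg = AgglomerativeClustering(n_clusters=2, linkage='ward')
    labels = agg.fit_predict(X)      # cluster index for each sample
    print(np.bincount(labels))       # samples per cluster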

K-Means O(kN)

When to use it: Scales well. Works best on a small number of flat clusters. For large sample sizes, substitute MiniBatchKMeans.
How it works: Assigns samples to the nearest of k cluster centers, then moves the centers to minimize the average distance between centers and samples.
Gotchas: The K-Means algorithm used by scikit-learn is sensitive to the initial location of the centers. Performs poorly on complex, non-flat shapes.
Code: cluster.KMeans(n_clusters). n_jobs=-1 to parallelize.
(Figure: 1. Init, 2. Classification, 3. Update centers; repeat 2 and 3.)
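An illustrative sketch (not from the sheet; the synthetic blobs are assumptions): fit k-means, inspect the centers, and assign a new sample.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(100, 2),
                   rng.randn(100, 2) + [4, 0],
                   rng.randn(100, 2) + [0, 4]])

    km = KMeans(n_clusters=3, random_state=0)
    labels = km.fit_predict(X)       # cluster assignment per sample
    print(km.cluster_centers_)       # the k fitted centers
    print(km.predict([[4.0, 0.0]]))  # assign a new sample to a cluster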

BIRCH O(kN)

When to use it: Large number of observations and a small number of features.
How it works: Builds a balanced tree of groups of data, then clusters those groups instead of the raw data.
Gotchas: Performs poorly with a large number of features.
Code: cluster.Birch(threshold, branching_factor, n_clusters)
(Figure: a balanced tree of data groups with branching factor = 3.)

Mean Shift O(NlogN)

When to use it: Non-flat geometries. Unknown number of clusters. Need to guarantee convergence.
How it works: Finds local maxima given a window size.
Gotchas: Accuracy strongly tied to selecting the correct window.
Code: cluster.MeanShift(bandwidth). Set bandwidth manually to a small value for large datasets. Estimating it is O(N2) and can be the bottleneck.

Affinity Propagation O(N2)

When to use it: Unknown number of clusters. Need to specify own similarity metric (affinity argument).
How it works: Finds data points which maximize similarity within a cluster while minimizing similarity with data outside of the cluster.
Gotchas: O(N2) memory use. Accuracy tied to damping.
Code: cluster.AffinityPropagation(preference, damping)
• preference: Negative. Controls the number of clusters. Explore on a log scale.
• damping: 0.5 to 1.

DBSCAN O(N2)

When to use it: Very non-flat geometries. Very uneven clusters.
How it works: Clusters are contiguous areas with high data density. The bounds of clusters are found using graph connectivity.
Gotchas: O(N2) memory use. Not deterministic at cluster boundaries.
Code: cluster.DBSCAN(min_samples, eps, metric)
• Higher min_samples, or lower eps, requires higher density to form a cluster.
(Figure: dense core samples form clusters; isolated points are outliers.)

Performance Metrics in sklearn.metrics

The metrics do not take into account the exact class values, only their separation. The score is based on ground truth (targets), if available, or on a measure of similarity within a class and difference across classes.
Needs ground truth:
• adjusted_rand_score: -1 to 1 (best). 0 is random classes. Measures similarity. Related to accuracy (% correct).
• adjusted_mutual_info_score: 0 to 1 (best). 0 is random classes. 10x slower than adjusted_rand_score. Measures agreement.
• homogeneity_completeness_v_measure: 0 to 1 (best). homogeneity: each cluster only contains members of one class; completeness: all members of a class are in the same cluster; v_measure_score: the harmonic mean of both. Not normalized for random labeling.
Doesn't need ground truth:
• silhouette_score: -1 to 1 (best). 0 means overlapping clusters. Based on the distance to samples in the same cluster and the distance to the next nearest cluster.
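For example (a hedged sketch, not from the sheet; the synthetic data and ground-truth labels are assumptions), scoring k-means labels both with and without ground truth:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score, silhouette_score

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + [5, 5]])
    truth = np.array([0] * 100 + [1] * 100)    # ground-truth labels

    labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)

    print(adjusted_rand_score(truth, labels))  # needs ground truth
    print(silhouette_score(X, labels))         # does not need ground truth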

Take your Machine Learning skills to the next level! Register at www.enthought.com/python-for-data-science-training
© 2017 Enthought, Inc., licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/
