Enthought Python Machine Learning scikit-learn Cheat Sheets 1–3, v1.0
Regression: Predict Continuous Data

Linear Model O(ND²)
Solves problems of the form: y = w0 + w1·x1 + … + wD·xD, a weighted sum of the features.
Code: linear_model.Ridge(alpha)
• alpha: Regularization strength, alpha > 0, corresponds to 1/C in other models. Increase if noisy samples.

Polynomial Expansion of Order P
E.g. a 2nd order polynomial two-feature model y = w0 + w1·x1 + w2·x2 becomes a model with these 6 basis functions: 1, x1, x2, x1², x1·x2, x2².
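The two boxes above combine naturally. As a hedged sketch (not code from the cheat sheet), the following expands two features into a 2nd-order polynomial basis and fits Ridge; the synthetic dataset and alpha value are assumptions:

# Polynomial basis expansion followed by Ridge regression (illustrative values).
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_regression(n_samples=200, n_features=2, noise=10.0, random_state=0)

# degree=2 on two features yields the 6 basis functions listed above.
model = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0))
model.fit(X, y)
print("basis functions:", model.named_steps["polynomialfeatures"].n_output_features_)
print("training R^2:", model.score(X, y))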
Support Vector Regressor ~O(N²D)
When to use it: Many important features, more features than samples, nonlinear problem.
How it works: Find a function such that training points fit within a “tube” of acceptable error, with some tolerance towards points that are outside the tube.
Gotchas: Must scale inputs; see StandardScaler and RobustScaler.
Code: Start with svm.LinearSVR(epsilon, C=1). Make C smaller if lots of noisy observations (C = 1/α, small C means more regularization). If LinearSVR doesn’t work, use svm.SVR(kernel='rbf', gamma).
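A minimal sketch of this recipe, with assumed synthetic data and illustrative parameter values, scaling first as the Gotchas advise:

# Scale inputs, fit LinearSVR; fall back to a kernel SVR if the linear fit underperforms.
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR, LinearSVR

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

linear_svr = make_pipeline(StandardScaler(), LinearSVR(epsilon=0.1, C=1.0))
linear_svr.fit(X, y)
print("LinearSVR R^2:", linear_svr.score(X, y))

# If the linear model is not flexible enough, try an RBF kernel.
rbf_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, gamma="scale"))
rbf_svr.fit(X, y)
print("SVR (rbf) R^2:", rbf_svr.score(X, y))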
Lasso O(ND²)
When to use it: Less than 100K samples, only some features should be important.
How it works: Linear model that forces small weights to be zero. Minimizes E instead of E_D, where the second term is called the “L1 norm”: E = E_D + α·Σ|w_j|.
Code: linear_model.Lasso(alpha)
• alpha: Regularization strength, alpha > 0, corresponds to 1/C in other models. Increase if noisy samples.
Tip: Use with feature_selection.SelectFromModel as a transformation stage to select features with non-zero weights, as sketched below.
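A minimal sketch of that tip; the dataset and alpha are assumptions:

# Lasso as a feature-selection transformer: keep features with non-zero weights.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

selector = SelectFromModel(Lasso(alpha=1.0))
X_selected = selector.fit_transform(X, y)
print("kept features:", X_selected.shape[1], "of", X.shape[1])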
Stochastic Gradient Descent (SGD) Regressor
When to use it: Fit is too slow with other estimators.
How it works: “Online” method, learns the weights in batches, with a subset of the data each time. Pair with manual basis function expansion to train nonlinear models on really large datasets.
Code: linear_model.SGDRegressor() and the partial_fit() method.

Ridge vs. Lasso – Shape of E_W
With Ridge and Lasso, the error to minimize E has an extra component E_W: Ridge uses the L2 norm (E_W = α·Σw_j²), Lasso the L1 norm (E_W = α·Σ|w_j|). Lasso produces sparse models because small weights are forced to zero.

Performance Metrics in sklearn.metrics
• mean_squared_error: Smaller is better. Puts large weight on outliers.
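A minimal sketch combining the SGD recipe with the metric above; the batch size, dataset, and hyperparameters are assumptions:

# Train SGDRegressor in batches with partial_fit, then score with mean_squared_error.
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=10_000, n_features=20, noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)  # SGD is sensitive to feature scale

model = SGDRegressor()
batch_size = 1_000
for start in range(0, len(X), batch_size):
    batch = slice(start, start + batch_size)
    model.partial_fit(X[batch], y[batch])  # one "online" update per batch

print("MSE:", mean_squared_error(y, model.predict(X)))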
Classification: Predict Categorical Data
Predict the class, or label (t), of a sample based on its features (x). Examples: Recognize hand-written digits, or mark email as spam. In scikit-learn, labels are represented as integers and get expanded internally into matrices of binary choices between unique integer labels. Use class_weight='balanced' in most models to adjust for unbalanced datasets (more training data from one class than others). Training data has N samples and D features.

Support Vector Classifier O(ND²) to O(ND³)
When to use it: Large number of features. Slightly more features than samples.
How it works: Maximize distance between classes in high-dimensional space, i.e., a “maximum margin classifier”.
Gotchas: Scale your data.
Code: svm.SVC(kernel, C=1). Make C smaller if lots of noisy samples. If accuracy is important, set kernel='rbf'. If fast training is important, use svm.LinearSVC().
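A minimal sketch of this recipe with assumed synthetic data; scaling and class_weight='balanced' follow the advice above:

# Scale features, then fit an RBF-kernel SVC on (deliberately) unbalanced classes.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, weights=[0.8, 0.2],
                           random_state=0)

clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", C=1.0, class_weight="balanced"))
clf.fit(X, y)
print("accuracy:", clf.score(X, y))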
Logistic Regression O(ND²)
When to use it: Need to understand contributions of features. Fast to train, easy to interpret.
How it works: Fits an s-shaped function (logistic function), which is continuous but has a steep transition between the two classes, and assigns class based on sign.
Gotchas: Inputs must be scaled and uncorrelated.
Code: linear_model.LogisticRegression(C, solver)
• penalty='l1' to use the estimator for feature selection.
• solver='liblinear' for small datasets or L1 penalty; 'lbfgs', 'sag' or 'newton-cg' for multi-class problems and large datasets; 'sag' for very large datasets.
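A minimal sketch with assumed data, using the liblinear solver and L1 penalty mentioned above:

# An interpretable linear classifier; the coefficients show each feature's contribution.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)  # inputs should be scaled

clf = LogisticRegression(C=1.0, solver="liblinear", penalty="l1")
clf.fit(X, y)
print("non-zero coefficients:", (clf.coef_ != 0).sum())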
Neighbor Classifiers O(D log N) to O(DN)
When to use them: Large datasets. Very irregular decision boundary.
How it works: Predict class by majority vote from nearby data (the K nearest training points).
Gotchas: Efficiency comes at the cost of also storing the training data.
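The cheat sheet’s Code line for neighbor classifiers is missing from this extract; the sketch below uses neighbors.KNeighborsClassifier, the usual choice, with assumed data and K:

# Majority vote among the K nearest training points.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1_000, n_features=5, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5)  # K = 5
clf.fit(X, y)
print("accuracy:", clf.score(X, y))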
Ensemble Methods
How they work: Combine multiple weak, biased estimators to create a better one. Two types: averaging methods build many estimators and average their predictions; in boosting methods each new estimator tries to improve the previous one.
Gotchas: Hard to generate the perfect mix of estimators.
Code: All in the ensemble module.
Averaging estimators:
• RandomForestClassifier(max_features)
• ExtraTreesClassifier(max_features)
• Start with these, but always cross-validate: max_features=sqrt(n_features), max_depth=None, min_samples_split=1.
Boosting estimators:
• AdaBoostClassifier()
• GradientBoostingClassifier()
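A minimal sketch of an averaging estimator, cross-validated as recommended; the data and settings are assumptions:

# Random forest (an averaging ensemble) evaluated with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1_000, n_features=25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                             max_depth=None, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))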
Performance Metrics in sklearn.metrics
• confusion_matrix: Explore how the model confuses classes. Visualize with seaborn.heatmap.
• accuracy_score (default for model.score): Fraction correctly predicted. Meaningless if samples are unbalanced. (TP + TN) / Total.
Think of predicting fire: an alarm that always goes off is annoying, one that never goes off is costly.
• recall_score: Fraction of cases predicted as fire when there is actually fire. TP / (TP + FN).
• precision_score: Fraction of correctly predicted fire out of all cases where fire is predicted, i.e., P predicted as P. TP / (TP + FP).
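A minimal sketch of these metrics on assumed toy predictions:

# Accuracy, recall, precision, and a heatmap of the confusion matrix.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 1, 0, 1]

print("accuracy :", accuracy_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)

sns.heatmap(confusion_matrix(y_true, y_pred), annot=True)
plt.show()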
Clustering: Unsupervised Learning
Predict the underlying structure in features, without the use of targets or labels. Splits samples into groups called “clusters”. With no targets, models are trained by minimizing some definition of “distance” within a cluster. Data has N samples, D features, and the model discovers k clusters. Models can be used for prediction or for transformation, by reducing D features into one feature with k unique values.

Some models expect geometries that are “flat”, or roughly spherical. Clusters with complicated shapes like rings or lines are not flat, and will not work in those models.

Agglomerative Clustering O(N² log N)
When to use it: Need a flexible definition of distance (e.g. Levenshtein).
How it works: Defines all observations as unique clusters, then merges the closest ones iteratively.
Gotchas: Worst time complexity. “Rich get richer” behavior.
Code: cluster.AgglomerativeClustering(linkage, affinity, connectivity). Set the linkage criterion for merging:
• 'ward': minimize sum of squared differences. Minimizes variance. Gives the most regular cluster sizes.
• 'complete': minimize maximum distance between sample pairs.
• 'average': minimize average distance between all sample pairs. Yields uneven cluster sizes.
affinity: defines the type of distance. 'l1' for sparse features, e.g., text; 'cosine' is invariant to scaling.
connectivity: provides extra constraints about which nodes can be merged, e.g., neighbors.kneighbors_graph.
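A minimal sketch with assumed data, using a kneighbors_graph connectivity constraint and ward linkage:

# Agglomerative clustering with a k-nearest-neighbors connectivity constraint.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
model = AgglomerativeClustering(n_clusters=4, linkage="ward",
                                connectivity=connectivity)
labels = model.fit_predict(X)
print(labels[:10])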
K-Means O(kN)
When to use it: Scales well. Works best on a small number of flat clusters. For large sample sizes, substitute MiniBatchKMeans.
How it works: Assigns samples to the nearest of k cluster centers, then moves the centers to minimize the average distance between centers and samples. (Figure: 1. Init → 2. Classification → 3. Update centers → repeat 2 & 3.)
Gotchas: The K-Means algorithm used by scikit-learn is sensitive to the initial location of the centers. Performs poorly on complex, non-flat shapes.
Code: cluster.KMeans(n_clusters). n_jobs=-1 to parallelize.
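A minimal sketch with assumed data and k, including the MiniBatchKMeans substitute for large sample sizes (recent scikit-learn releases removed KMeans’s n_jobs argument, so it is not used here):

# KMeans for moderate sample sizes; MiniBatchKMeans as a drop-in for large ones.
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1_000, centers=3, random_state=0)

labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
big_labels = MiniBatchKMeans(n_clusters=3, random_state=0).fit_predict(X)
print(labels[:10], big_labels[:10])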
BIRCH O(kN)
When to use it: Large number of observations and small number of features.
How it works: Builds a balanced tree of groups of data, then clusters those groups instead of the raw data.
Gotchas: Performs poorly with a large number of features.
Code: cluster.Birch(threshold, branching_factor, n_clusters)
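A minimal sketch with assumed data and thresholds:

# BIRCH builds a tree of subclusters, then clusters those instead of the raw samples.
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, n_features=2, centers=5, random_state=0)

model = Birch(threshold=0.5, branching_factor=50, n_clusters=5)
labels = model.fit_predict(X)
print(len(model.subcluster_centers_), "subclusters reduced to 5 clusters")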
Mean Shift O(N log N)
When to use it: Non-flat geometries. Unknown number of clusters. Need to guarantee convergence.
How it works: Finds local maxima of the sample density given a window size.
Gotchas: Accuracy strongly tied to selecting the correct window.
Code: cluster.MeanShift(bandwidth). Set bandwidth manually to a small value for large datasets; estimating it is O(N²) and can be the bottleneck.
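A minimal sketch with assumed data; estimate_bandwidth is the O(N²) step the Gotchas warn about, so skip it and set bandwidth by hand on large datasets:

# Mean shift clustering; the bandwidth is the "window" that controls accuracy.
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

bandwidth = estimate_bandwidth(X, quantile=0.2)  # O(N^2): avoid on large datasets
model = MeanShift(bandwidth=bandwidth)
labels = model.fit_predict(X)
print("clusters found:", len(set(labels)))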
© 2017 Enthought, Inc., licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/