Python For Data Science Cheat Sheet: Scikit-Learn

This document summarizes key machine learning concepts in Python using the scikit-learn library. It covers loading and preparing data, fitting supervised and unsupervised estimators such as linear regression, KNN, SVM, k-means clustering and PCA, and evaluating model performance with classification metrics (accuracy score, confusion matrix) and regression metrics (mean squared error, R² score). It also covers cross-validation and hyperparameter tuning via grid and randomized search.


PYTHON FOR DATA SCIENCE CHEAT SHEET          Learn Python for Data Science at www.edureka.co

Scikit-Learn

Scikit-learn is an open source Python library that implements a range of machine learning, preprocessing, cross-validation and visualization algorithms using a unified interface.
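Every estimator follows the same create / fit / predict pattern, which is what the unified interface refers to. A minimal sketch of that pattern, assuming a train/test split like the one in the example below (LogisticRegression is just an illustrative stand-in; any estimator works the same way):

>>> from sklearn.linear_model import LogisticRegression  # illustrative choice of estimator
>>> model = LogisticRegression()          # 1. instantiate with hyperparameters
>>> model.fit(X_train, y_train)           # 2. learn from training data
>>> y_pred = model.predict(X_test)        # 3. apply to unseen data
>>> model.score(X_test, y_test)           # 4. built-in default metric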
A Basic Example

>>> from sklearn import neighbors, datasets, preprocessing
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score
>>> iris = datasets.load_iris()
>>> X, y = iris.data[:, :2], iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train = scaler.transform(X_train)
>>> X_test = scaler.transform(X_test)
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred)
Loading The Data

Your data needs to be numeric and stored as NumPy arrays or SciPy sparse matrices. Other types that are convertible to numeric arrays, such as a pandas DataFrame, are also acceptable.

>>> import numpy as np
>>> X = np.random.random((10,5))
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F'])
>>> X[X < 0.7] = 0
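For example, a pandas DataFrame converts cleanly to a NumPy array (the columns below are made up for illustration):

>>> import pandas as pd
>>> df = pd.DataFrame({'height': [1.6, 1.8, 1.7], 'weight': [60, 80, 70]})  # made-up columns
>>> X = df.to_numpy()   # most estimators also accept the DataFrame directly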
Training And Test Data

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Create Your Model

Supervised Learning Estimators

Linear Regression
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()   #normalize=True was removed in newer scikit-learn; scale features beforehand

Support Vector Machines (SVM)
>>> from sklearn.svm import SVC
>>> svc = SVC(kernel='linear')

Naive Bayes
>>> from sklearn.naive_bayes import GaussianNB
>>> gnb = GaussianNB()

KNN
>>> from sklearn import neighbors
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)

Unsupervised Learning Estimators

K Means
>>> from sklearn.cluster import KMeans
>>> k_means = KMeans(n_clusters=3, random_state=0)

Principal Component Analysis (PCA)
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=0.95)

Model Fitting

Supervised learning
>>> lr.fit(X, y)                             #Fit the model to the data
>>> knn.fit(X_train, y_train)
>>> svc.fit(X_train, y_train)

Unsupervised learning
>>> k_means.fit(X_train)                     #Fit the model to the data
>>> pca_model = pca.fit_transform(X_train)   #Fit to data, then transform it
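For clustering, fitting and labeling the training data can be combined into one call (a small sketch using the k_means estimator above):

>>> labels = k_means.fit_predict(X_train)   #Fit, then return a cluster label per training sample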
Prediction

Supervised Estimators
>>> y_pred = svc.predict(np.random.random((2,5)))   #Predict labels
>>> y_pred = lr.predict(X_test)                     #Predict labels
>>> y_pred = knn.predict_proba(X_test)              #Estimate probability of a label

Unsupervised Estimators
>>> y_pred = k_means.predict(X_test)                #Predict labels in clustering algos

Evaluate Your Model's Performance

Classification Metrics

Accuracy Score
>>> knn.score(X_test, y_test)                       #Estimator score method
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_test, y_pred)                  #Metric scoring functions

Classification Report
>>> from sklearn.metrics import classification_report
>>> print(classification_report(y_test, y_pred))    #Precision, recall, f1-score and support

Confusion Matrix
>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(y_test, y_pred))

Regression Metrics

Mean Absolute Error
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2]
>>> mean_absolute_error(y_true, y_pred)

Mean Squared Error
>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(y_test, y_pred)

R² Score
>>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)

Clustering Metrics

Adjusted Rand Index
>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(y_true, y_pred)

Homogeneity
>>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_true, y_pred)

V-measure
>>> from sklearn.metrics import v_measure_score
>>> v_measure_score(y_true, y_pred)

Cross-Validation

>>> from sklearn.model_selection import cross_val_score
>>> print(cross_val_score(knn, X_train, y_train, cv=4))
>>> print(cross_val_score(lr, X, y, cv=2))
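cross_val_score uses the estimator's default score method; a different metric can be requested through the scoring parameter (a brief sketch):

>>> print(cross_val_score(knn, X_train, y_train, cv=4, scoring='f1_macro'))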
Preprocessing The Data

Standardization
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(X_train)
>>> standardized_X = scaler.transform(X_train)
>>> standardized_X_test = scaler.transform(X_test)

Normalization
>>> from sklearn.preprocessing import Normalizer
>>> scaler = Normalizer().fit(X_train)
>>> normalized_X = scaler.transform(X_train)
>>> normalized_X_test = scaler.transform(X_test)

Binarization
>>> from sklearn.preprocessing import Binarizer
>>> binarizer = Binarizer(threshold=0.0).fit(X)
>>> binary_X = binarizer.transform(X)

Encoding Categorical Features
>>> from sklearn.preprocessing import LabelEncoder
>>> enc = LabelEncoder()
>>> y = enc.fit_transform(y)
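LabelEncoder is intended for a one-dimensional target column; for categorical feature columns, scikit-learn provides OneHotEncoder (a small illustrative sketch with made-up data):

>>> from sklearn.preprocessing import OneHotEncoder
>>> ohe = OneHotEncoder()
>>> X_cat = [['red'], ['green'], ['red']]   # made-up categorical column
>>> ohe.fit_transform(X_cat).toarray()      # one column per category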

Imputing Missing Values
>>> from sklearn.impute import SimpleImputer   #Replaces the removed sklearn.preprocessing.Imputer
>>> imp = SimpleImputer(missing_values=0, strategy='mean')
>>> imp.fit_transform(X_train)

Generating Polynomial Features
>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly = PolynomialFeatures(5)
>>> poly.fit_transform(X)
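As a concrete illustration of what gets generated, degree 2 on a two-feature row produces the bias term, both features, their squares and their product:

>>> poly2 = PolynomialFeatures(degree=2)
>>> poly2.fit_transform([[2, 3]])   # [[1., 2., 3., 4., 6., 9.]] = 1, a, b, a², ab, b²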
Tune Your Model

Grid Search
>>> from sklearn.model_selection import GridSearchCV
>>> params = {"n_neighbors": np.arange(1,3), "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn, param_grid=params)
>>> grid.fit(X_train, y_train)
>>> print(grid.best_score_)
>>> print(grid.best_estimator_.n_neighbors)

Randomized Parameter Optimization
>>> from sklearn.model_selection import RandomizedSearchCV
>>> params = {"n_neighbors": range(1,5), "weights": ["uniform", "distance"]}
>>> rsearch = RandomizedSearchCV(estimator=knn, param_distributions=params,
...                              cv=4, n_iter=8, random_state=5)
>>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)
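Either search object also exposes the winning hyperparameter combination directly (the printed values are only an example):

>>> print(grid.best_params_)   # e.g. {'metric': 'euclidean', 'n_neighbors': 2}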
