Python Scikit-Learn Cheat Sheet For Machine Learning

This document provides a summary of machine learning techniques using scikit-learn in Python. It loads iris data, divides it into training and test sets, trains a k-nearest neighbors model on the training set and predicts the test set labels. It then discusses various preprocessing techniques like standardization, binarization, normalization, handling categorical features, imputing missing values, and generating polynomial features to transform data for machine learning models.

Uploaded by

gepiv94928
Copyright
© All Rights Reserved
Python Scikit-Learn Cheat Sheet for Machine Learning

Let’s build a basic example with the scikit-learn library that will:

⚫ Load the data,
⚫ Divide the data into training and test sets,
⚫ Train a model on the training data using the KNN algorithm, and
⚫ Predict the results for the test set.

from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the iris data, keeping only the first two features
iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

# Fit the scaler on the training set only, then apply it to both sets
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Train a 5-nearest-neighbors classifier and score its test predictions
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy_score(y_test, y_pred)

Loading the data


Your data needs to be numeric and stored in NumPy arrays or SciPy sparse matrices.
Other types that can be converted to numeric arrays, such as a Pandas DataFrame,
are also acceptable.

import numpy as np
X = np.random.random((10,5))
y = np.array(['M','M','F','F','M','F','M','M','F','F'])  # one label per row of X
X[X < 0.7] = 0
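
As a minimal sketch of the SciPy sparse option mentioned above, a mostly-zero array can be stored as a CSR matrix. The names X_dense and X_sparse and the fixed seed are illustrative assumptions, not part of the original cheat sheet.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)      # fixed seed, chosen only for reproducibility
X_dense = rng.random((10, 5))
X_dense[X_dense < 0.7] = 0          # zero out small values, as in the snippet above

X_sparse = sparse.csr_matrix(X_dense)   # stores only the non-zero entries
print(X_sparse.shape)                   # (10, 5)
print(X_sparse.nnz, "non-zero values out of", X_dense.size)
```

Many scikit-learn estimators accept such a matrix directly in place of a dense array, which can save memory when most entries are zero.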

Train and Test


Once the data is loaded, your next task is to split the dataset into training data
and testing data.

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
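
The split can be tuned with a few optional parameters. The sketch below uses the iris data from the first example; the values chosen for test_size and random_state are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target            # 150 samples, 3 classes of 50 each

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # hold out 20% of the samples (the default is 25%)
    stratify=y,         # keep class proportions equal in both subsets
    random_state=0,     # fixed seed so the split is reproducible
)

print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)
print(np.bincount(y_test))           # [10 10 10]
```

Stratifying is mainly useful for small or imbalanced datasets, where a purely random split could leave a class under-represented in the test set.
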
Data Preprocessing

Standardization
Standardization is a data preprocessing step that rescales one or more attributes
so that each attribute has a mean of 0 and a standard deviation of 1.
Standardization works best when your data has a roughly Gaussian (bell curve)
distribution.

from sklearn.preprocessing import StandardScaler


scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)
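
To make the rescaling concrete, here is a small sketch on made-up numbers (the arrays are illustrative assumptions). It checks that StandardScaler matches the formula z = (x - mean) / std computed by hand from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler().fit(X_train)       # learns mean and std per column
standardized_X = scaler.transform(X_train)

# Same result by hand: z = (x - mean) / std, using the training statistics
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(standardized_X, manual))   # True
print(scaler.transform(X_test))              # test data reuses the training mean and std
```

Note that the test set is transformed with the statistics learned from the training set, so its own mean and standard deviation will generally not be exactly 0 and 1.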

Binarization
Binarization is a common operation on text count data. It lets the analyst
consider only the presence or absence of a feature rather than the number of times
it occurs: values above the threshold become 1 and all other values become 0.

from sklearn.preprocessing import Binarizer


binarizer = Binarizer(threshold=0.0).fit(X)
binary_X = binarizer.transform(X)
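
A tiny illustration on a made-up count matrix (the array name and values are assumptions for this sketch):

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Rows are documents, columns are word counts
counts = np.array([[0.0, 2.0, 1.0],
                   [3.0, 0.0, 0.0]])

# Anything strictly greater than the threshold becomes 1, the rest becomes 0
binary = Binarizer(threshold=0.0).fit_transform(counts)
print(binary)
# [[0. 1. 1.]
#  [1. 0. 0.]]
```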

Normalization
Normalization is a technique commonly used to prepare data for machine learning.
Its general goal is to bring numeric values onto a common scale without distorting
the differences in the ranges of values. Note that scikit-learn's Normalizer works
per sample rather than per column: it rescales each row independently to unit norm
(L2 by default).

from sklearn.preprocessing import Normalizer


scaler = Normalizer().fit(X_train)
normalized_X = scaler.transform(X_train)
normalized_X_test = scaler.transform(X_test)
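
The sketch below shows what Normalizer does to individual rows, using a few made-up vectors (the values are illustrative assumptions). Each row is divided by its own Euclidean length.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [0.0, 5.0]])

# norm="l2" is the default; it is spelled out here for clarity
normalized = Normalizer(norm="l2").fit_transform(X)

print(normalized)                           # rows: [0.6 0.8], [1. 0.], [0. 1.]
print(np.linalg.norm(normalized, axis=1))   # [1. 1. 1.]
```

Because each row is handled on its own, fit learns nothing from the data here; it exists so that the class has the same interface as other transformers.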

Encoding Categorical Features


LabelEncoder is another preprocessing class, used to encode target labels. It
transforms non-numerical labels into numerical labels with values between 0 and
n_classes-1.

from sklearn.preprocessing import LabelEncoder


enc = LabelEncoder()
y = enc.fit_transform(y)
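
A short sketch of the mapping LabelEncoder produces and how to reverse it (the label list is a made-up example):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["M", "M", "F", "F", "M"]

enc = LabelEncoder()
encoded = enc.fit_transform(labels)   # classes are sorted, then numbered from 0

print(enc.classes_)                   # ['F' 'M']
print(encoded)                        # [1 1 0 0 1]

# inverse_transform maps the integers back to the original labels
decoded = enc.inverse_transform(encoded)
print(decoded.tolist())               # ['M', 'M', 'F', 'F', 'M']
```
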

Imputing missing values
The SimpleImputer class in scikit-learn provides basic strategies for
imputing (filling in) missing values: the mean, the median, or the most frequent
value of the column in which the missing values are located. It replaces the
Imputer class from sklearn.preprocessing, which was removed in scikit-learn 0.22.
The class also lets you specify which value marks an entry as missing.

from sklearn.impute import SimpleImputer

# Treat zeros as missing and replace them with the column mean
imp = SimpleImputer(missing_values=0, strategy='mean')
imp.fit_transform(X_train)
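
In recent scikit-learn releases, imputation is provided by SimpleImputer in sklearn.impute. As a minimal sketch on a made-up array (the values are assumptions), missing entries marked with NaN are filled with the mean of their column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])

# Each NaN is replaced by the mean of the observed values in its column
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imp.fit_transform(X)

print(X_filled)
# [[1. 2.]
#  [2. 4.]
#  [3. 3.]]
```

As with the scalers, the imputer would normally be fitted on the training set and then applied to the test set with transform.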

Generating Polynomial Features


PolynomialFeatures generates a new feature matrix consisting of all polynomial
combinations of the features with degree less than or equal to the specified degree.
For example, if an input sample is two-dimensional and of the form [a, b], then the
degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

from sklearn.preprocessing import PolynomialFeatures


poly = PolynomialFeatures(5)
poly.fit_transform(X)
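
The [a, b] example from the text can be checked directly. This sketch assumes a = 2 and b = 3 purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])            # a = 2, b = 3

poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(X)      # columns: 1, a, b, a^2, ab, b^2

print(expanded)                       # [[1. 2. 3. 4. 6. 9.]]
```

The number of output columns grows quickly with the degree: five input features expanded to degree 5, as in the snippet above, yield 252 columns.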

Full Article and Source


https://www.edureka.co/blog/cheatsheets/python-scikit-learn-cheat-sheet/
