Classification Algorithms II
where P(y_i = 1 | X = x_i) is the probability of the i-th observation's target value, y_i, being class
1, X is the training data, and β_0, β_1 are the parameters to be learned. The effect of the
logistic function is to constrain the value of the function's output to between 0 and 1 so that
it can be interpreted as a probability. If P(y_i = 1 | X = x_i) is greater than or equal to 0.5, class 1
is predicted; otherwise, class 0 is predicted.
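As a quick illustration of this decision rule (a minimal sketch added here, with made-up parameter values, not taken from the text), the logistic function can be written out and thresholded directly:
# A minimal sketch of the logistic (sigmoid) function and the 0.5 decision rule
import numpy as np

def sigmoid(z):
    # Squashes any real number into the interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Hypothetical learned parameters beta_0, beta_1 and a single feature value x_i
beta_0, beta_1, x_i = -1.0, 2.0, 0.8
p = sigmoid(beta_0 + beta_1 * x_i)
predicted_class = 1 if p >= 0.5 else 0
print(p, predicted_class)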
Iris dataset
This is a classical dataset included in scikit-learn in the datasets module. We can load it
by calling the load_iris() function:
from sklearn.datasets import load_iris
iris = load_iris()
The iris object that is returned by load_iris is a Bunch object, which is very similar to a
dictionary. It contains keys and values:
print("Keys of iris_dataset: \n{}".format(iris.keys()))
Keys of iris_dataset:
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR',
'feature_names', 'filename', 'data_module'])
The value of the key DESCR is a short description of the dataset. We show the beginning of
the description here (feel free to look up the rest yourself):
print(iris['DESCR'][:193] + "\n...")
.. _iris_dataset:
The value of the key target_names is an array of strings, containing the species of flower
that we want to predict:
print("Target names: {}".format(iris['target_names']))
The value of feature_names is a list of strings, giving the description of each feature:
print("Feature names: \n{}".format(iris['feature_names']))
Feature names:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal
width (cm)']
The data itself is contained in the target and data fields. data contains the numeric
measurements of sepal length, sepal width, petal length, and petal width in a NumPy array:
print("Type of data: {}".format(type(iris['data'])))
The rows in the data array correspond to flowers, while the columns represent the four
measurements that were taken for each flower:
print("Shape of data: {}".format(iris['data'].shape))
We see that the array contains measurements for 150 different flowers. The following are
the feature values for the first five samples:
print("First five rows of data:\n{}".format(iris['data'][:5]))
From this data, we can see that all of the first five flowers have a petal width of 0.2 cm and
that the first flower has the longest sepal, at 5.1 cm.
print("Shape of target: {}".format(iris['target'].shape))
Shape of target: (150,)
print("Target:\n{}".format(iris['target']))
Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
The meanings of the numbers are given by the iris['target_names'] array: 0 means
"setosa", 1 means "versicolor", and 2 means "virginica".
In scikit-learn, we can learn a logistic regression model using LogisticRegression.
# Load libraries
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
# Load data with only two classes (so that this is a binary problem)
iris = datasets.load_iris()
features = iris.data[:100,:]
target = iris.target[:100]
# Standardize features
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)
# Create logistic regression object
logreg = LogisticRegression(random_state=0)
# Train model
model = logreg.fit(features_standardized, target)
Once it is trained, we can use the model to predict the class of new observations.
# Create new observation
new_observation = [[.5, .5, .5, .5]]
# Predict class
model.predict(new_observation)
array([1])
In this example, our observation was predicted to be class 1. Additionally, we can see the
probability that an observation is a member of each class:
# View predicted probabilities
model.predict_proba(new_observation)
array([[0.17738424, 0.82261576]])
Our observation had a 17.7% chance of being class 0 and an 82.2% chance of being class 1.
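The class returned by predict corresponds to the column with the higher predicted probability; a small added check (not part of the original text) makes that connection explicit:
# Load library
import numpy as np
# The predicted class is the index of the largest predicted probability
np.argmax(model.predict_proba(new_observation), axis=1)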
Multiclass classifier
On their own, logistic regressions are only binary classifiers, meaning they cannot handle
target vectors with more than two classes. However, two clever extensions to logistic
regression do just that.
First, in one-vs-rest logistic regression (OVR) a separate model is trained for each class to
predict whether an observation is that class or not (thus making it a binary classification
problem). It assumes that each classification problem (e.g., class 0 or not) is independent.
To make a prediction, all binary classifiers are run on a test point. The classifier that has the
highest score on its single class "wins", and this class label is returned as the prediction.
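scikit-learn also exposes this strategy explicitly through OneVsRestClassifier; the following sketch (an added illustration, not code from the text) fits one binary logistic regression per class:
# Load libraries
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
# Load and standardize the full three-class iris data
iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data)
target = iris.target
# Fit one binary classifier per class; the highest-scoring class wins
ovr = OneVsRestClassifier(LogisticRegression(random_state=0))
model_ovr = ovr.fit(features_standardized, target)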
Alternatively, in multinomial logistic regression (MLR), the logistic function is replaced
with a softmax function:
\[
P(y_i = k \mid X = x_i) = \frac{e^{\beta_k x_i}}{1 + \sum_{j=1}^{K-1} e^{\beta_j x_i}},
\]
where P(y_i = k | X = x_i) is the probability of the i-th observation's target value, y_i, being class
k, and K is the total number of classes. One practical advantage of the MLR is that its
predicted probabilities using the predict_proba method are more reliable (i.e., better
calibrated).
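To make the formula concrete, here is a tiny numeric sketch (the scores are made up for illustration) using class K as the reference class, as in the equation above:
# Load library
import numpy as np
# Hypothetical linear scores beta_j * x_i for the K-1 non-reference classes
scores = np.array([1.2, -0.3])
denominator = 1 + np.exp(scores).sum()
p_classes = np.exp(scores) / denominator   # probabilities of classes 1..K-1
p_reference = 1 / denominator              # probability of the reference class K
print(p_classes, p_reference, p_classes.sum() + p_reference)  # probabilities sum to 1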
When using LogisticRegression we can select which of the two techniques we want via the
multi_class argument, with ovr selecting one-vs-rest logistic regression (the default in older
versions of scikit-learn). We can switch to an MLR by setting the argument to multinomial.
# Load libraries
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Standardize features
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    features_standardized, target, random_state=0)
# Create one-vs-rest logistic regression object
# (setting multi_class="multinomial" instead would give an MLR)
logreg = LogisticRegression(random_state=0, multi_class="ovr")
# Train model
model = logreg.fit(X_train, y_train)
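Because we held out a test set above, one quick added check (not part of the original text) is to score the fitted model on it:
# Evaluate accuracy on the held-out test set
model.score(X_test, y_test)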
Regularization
Regularization shrinks the model's coefficients by adding a penalty term, scaled by a parameter
α, to the loss function. Higher values of α increase the penalty for larger parameter values (i.e.,
more complex models). For LogisticRegression, the trade-off parameter that determines the strength
of the regularization is called C, where C is the inverse of the regularization strength: C = 1/α,
and higher values of C correspond to less regularization. In other words, when you use a
high value for the parameter C, LogisticRegression tries to fit the training set as best as
possible, while with low values of the parameter C, the models put more emphasis on
finding a coefficient vector (β) that is close to zero.
There is another interesting aspect of how the parameter C acts. Using low values of C will
cause the algorithms to try to adjust to the "majority" of data points, while using a higher
value of C stresses the importance that each individual data point be classified correctly.
# Load libraries
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Standardize features
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    features_standardized, target, random_state=0)
# Create logistic regression object (this value of C is only illustrative)
logreg = LogisticRegression(C=10, random_state=0)
# Train model
model = logreg.fit(X_train, y_train)
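To see the effect described above, a small added experiment (the specific values of C are arbitrary) compares the size of the learned coefficients for different settings of C:
# Load library
import numpy as np
# Smaller C = stronger regularization = coefficients pulled toward zero
for C in [0.01, 1, 100]:
    model_c = LogisticRegression(C=C, random_state=0).fit(features_standardized, target)
    print(C, np.abs(model_c.coef_).sum())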
If we desire a more interpretable model, using L1 regularization might help, as it limits the
model to using only a few features.
# Create multinomial logistic regression object
logreg = LogisticRegression(penalty='l1', solver="saga", C=100,
                            random_state=0, multi_class="multinomial")
# Train model
model = logreg.fit(X_train, y_train)
/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/
_sag.py:352: ConvergenceWarning: The max_iter was reached which means
the coef_ did not converge
warnings.warn(
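The warning means the saga solver stopped at its iteration limit before fully converging. One way to address it (this adjustment is an addition, not shown above) is to allow more iterations via max_iter:
# Allow the saga solver more iterations (the max_iter value is illustrative)
logreg = LogisticRegression(penalty='l1', solver="saga", C=100,
                            random_state=0, multi_class="multinomial",
                            max_iter=5000)
model = logreg.fit(X_train, y_train)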
As in regression, the penalty parameter influences the regularization and whether the
model will use all available features or select only a subset.
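To check whether the L1 penalty actually dropped some features, an added illustrative step is to inspect the fitted coefficients and count how many were set exactly to zero:
# Load library
import numpy as np
# Coefficients that are exactly zero correspond to features the model ignores
print(model.coef_)
print("Number of zeroed coefficients:", np.sum(model.coef_ == 0))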