Classification Algorithms
Scikit-learn's classification API provides a comprehensive suite of tools for performing supervised learning
tasks, where the goal is to predict the category or class label of an input based on labeled training data.
Basic Workflow for Classification:
1. Import the classifier: Choose a classifier from Scikit-learn’s collection of algorithms.
2. Prepare the data: Load and split your dataset into features (X) and labels (y).
3. Train the classifier: Fit the model to your training data.
4. Predict the labels: Use the trained model to make predictions on new data.
5. Evaluate the model: Assess the model’s performance using metrics like accuracy, precision, recall, F1-
score, etc.
Common Classification Algorithms in Scikit-learn:
- Logistic Regression (`sklearn.linear_model.LogisticRegression`)
- Support Vector Machine (SVM) (`sklearn.svm.SVC`)
- k-Nearest Neighbors (k-NN) (`sklearn.neighbors.KNeighborsClassifier`)
- Decision Trees (`sklearn.tree.DecisionTreeClassifier`)
- Random Forests (`sklearn.ensemble.RandomForestClassifier`)
- Naive Bayes (`sklearn.naive_bayes.GaussianNB`)
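Because all of these classifiers share the same estimator interface (`fit`, `predict`, `score`), they can be swapped in and out with minimal code changes. A minimal sketch, using the Iris dataset purely for illustration:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
# Small illustrative dataset and split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# The same fit/score calls work for every classifier
for clf in (LogisticRegression(max_iter=1000), KNeighborsClassifier(), DecisionTreeClassifier()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))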
API Details:
1. Importing a Classifier:
from sklearn.linear_model import LogisticRegression
2. Splitting Data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
3. Training the Model:
model = LogisticRegression()
model.fit(X_train, y_train)
4. Making Predictions:
y_pred = model.predict(X_test)
5. Evaluating the Model:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
6. Hyperparameter Tuning:
- Use `GridSearchCV` or `RandomizedSearchCV` for hyperparameter optimization.
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
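After fitting, the best hyperparameters and the refitted best model are available on the search object; a short follow-up sketch:
print(grid.best_params_)            # best value of C found by the search
print(grid.best_score_)             # mean cross-validated score of that setting
best_model = grid.best_estimator_   # model refitted on the full training set
y_pred = best_model.predict(X_test)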
Commonly Used Functions:
fit(X, y): Trains the model using the input features `X` and target labels `y`.
predict(X): Predicts the labels for new data `X`.
score(X, y): Returns the mean accuracy of the model on the given test data and labels.
predict_proba(X): Returns the probability estimates for the input `X` for each class (for probabilistic classifiers); see the sketch below.
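A minimal sketch of `score` and `predict_proba`, assuming the `model` fitted above and the same train/test split:
# Mean accuracy on the held-out test set
test_accuracy = model.score(X_test, y_test)
# Class-membership probabilities for the first five test samples
# (LogisticRegression exposes predict_proba; not every classifier does)
probabilities = model.predict_proba(X_test[:5])
print(test_accuracy, probabilities)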
Example:
Here's an example using the DecisionTreeClassifier:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load data
data = load_iris()
X, y = data.data, data.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
Scikit-learn Implementation:
In Scikit-learn, Logistic Regression is implemented in the `LogisticRegression` class.
1. Importing Logistic Regression:
from sklearn.linear_model import LogisticRegression
2. Basic Workflow:
- Training: Fit the logistic regression model to your training data.
- Prediction: Use the trained model to predict class labels for new data.
- Evaluation: Evaluate the model's performance using metrics like accuracy, precision, recall, or ROC-AUC.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train Logistic Regression model
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
Categories of Features in the Breast Cancer Dataset:
1. Radius:
- `mean radius`: The average distance from the center to points on the perimeter.
- `radius error`: The standard error of the radius.
- `worst radius`: The largest (or "worst") value of the radius.
2. Texture:
- `mean texture`: The standard deviation of gray-scale values within the tumor.
- `texture error`: The standard error of the texture.
- `worst texture`: The largest (or "worst") value of the texture.
3. Perimeter:
- `mean perimeter`: The average perimeter of the tumor.
- `perimeter error`: The standard error of the perimeter.
- `worst perimeter`: The largest (or "worst") value of the perimeter.
4. Area:
- `mean area`: The average area of the tumor.
- `area error`: The standard error of the area.
- `worst area`: The largest (or "worst") value of the area.
5. Smoothness:
- `mean smoothness`: The average local variation in radius lengths.
- `smoothness error`: The standard error of smoothness.
- `worst smoothness`: The largest (or "worst") value of smoothness.
6. Compactness:
- `mean compactness`: Calculated as \( \frac{\text{perimeter}^2}{\text{area}} - 1.0 \).
- `compactness error`: The standard error of compactness.
- `worst compactness`: The largest (or "worst") value of compactness.
7. Concavity:
- `mean concavity`: The average severity of concave portions of the contour.
- `concavity error`: The standard error of concavity.
- `worst concavity`: The largest (or "worst") value of concavity.
8. Concave Points:
- `mean concave points`: The average number of concave portions of the contour.
- `concave points error`: The standard error of concave points.
- `worst concave points`: The largest (or "worst") value of concave points.
9. Symmetry:
- `mean symmetry`: The average symmetry of the tumor.
- `symmetry error`: The standard error of symmetry.
- `worst symmetry`: The largest (or "worst") value of symmetry.
10. Fractal Dimension:
- `mean fractal dimension`: The average "coastline approximation" of the tumor (a measure of complexity).
- `fractal dimension error`: The standard error of fractal dimension.
- `worst fractal dimension`: The largest (or "worst") value of fractal dimension.
These features capture various geometric and textural properties of the tumor, providing a comprehensive
view of the tumor's characteristics, which are useful for distinguishing between malignant and benign
tumors.
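A quick way to inspect these feature names directly from Scikit-learn, as a minimal sketch:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
print(data.feature_names)   # the 30 feature names described above
print(data.target_names)    # the two target classes: malignant and benign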
Linear Regression
Simple linear regression models the relationship between an independent variable \( x \) and a dependent variable \( y \) with the equation \( y = \beta_0 + \beta_1 x + \epsilon \), where:
- beta_0 is the intercept (the predicted value of y when x = 0).
- beta_1 is the slope of the line (indicating how much y changes for a unit change in x).
- epsilon is the error term (the difference between the actual and predicted values).
Numerical Example
Dataset
Suppose we have the following data on the number of hours studied (independent variable \( x \)) and the
corresponding test scores (dependent variable \( y \)):
| Hours Studied (x) | Test Score (y) |
|-------------------|----------------|
| 1                 | 2              |
| 2                 | 4              |
| 3                 | 5              |
| 4                 | 4              |
| 5                 | 5              |
We want to fit a linear regression model to predict test scores based on hours studied.
First, compute the means of the independent variable \( x \) and the dependent variable \( y \):
\( \bar{x} = \frac{1+2+3+4+5}{5} = 3, \quad \bar{y} = \frac{2+4+5+4+5}{5} = 4 \)
Next, compute the two sums needed for the slope:
\( \sum{(x_i - \bar{x})(y_i - \bar{y})} = (1-3)(2-4) + (2-3)(4-4) + (3-3)(5-4) + (4-3)(4-4) + (5-3)(5-4) = 4 + 0 + 0 + 0 + 2 = 6 \)
\( \sum{(x_i - \bar{x})^2} = (1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2 = 4 + 1 + 0 + 1 + 4 = 10 \)
The slope and intercept are then:
\( \beta_1 = \frac{6}{10} = 0.6, \quad \beta_0 = \bar{y} - \beta_1 \bar{x} = 4 - 0.6 \times 3 = 2.2 \)
So the fitted regression line is:
y = 2.2 + 0.6x
This equation can be used to predict the test score based on the number of hours studied.
Let's use the equation to predict the test score for someone who studied for 6 hours:
\( y = 2.2 + 0.6 \times 6 = 2.2 + 3.6 = 5.8 \)
So, if a student studies for 6 hours, the predicted test score is 5.8.
Conclusion
This simple example demonstrates how to perform a linear regression to find the relationship between two
variables. The equation \( y = 2.2 + 0.6x \) can be used to make predictions based on the independent
variable \( x \).
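As a quick cross-check, the same coefficients can be recovered with Scikit-learn; a minimal sketch using the five points from the table above:
import numpy as np
from sklearn.linear_model import LinearRegression
# Hours studied and test scores from the table above
X = np.array([[1], [2], [3], [4], [5]])   # features must be 2D
y = np.array([2, 4, 5, 4, 5])
reg = LinearRegression().fit(X, y)
print(reg.intercept_, reg.coef_[0])   # approximately 2.2 and 0.6
print(reg.predict([[6]]))             # approximately 5.8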
Logistic Regression
Logistic Regression is a popular classification algorithm that is used when the target variable is categorical.
Despite its name, Logistic Regression is a classification method, not a regression method. It estimates
probabilities using a logistic (sigmoid) function and classifies data into binary or multiple classes.
1. Sigmoid Function:
Logistic Regression passes a linear combination of the input features through the sigmoid function \( \sigma(z) = \frac{1}{1 + e^{-z}} \), which maps any real value to a probability between 0 and 1.
2. Decision Boundary:
Logistic Regression uses a threshold (typically 0.5) to decide the class label. If the predicted probability is greater than the threshold, the instance is classified as class 1; otherwise, it is classified as class 0.
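A minimal sketch of applying a threshold to the predicted probabilities, assuming the fitted breast-cancer `clf` and test split from the earlier example:
# Probability of class 1 for each test sample
probs = clf.predict_proba(X_test)[:, 1]
# Predict class 1 when the probability exceeds the threshold
threshold = 0.5
custom_pred = (probs > threshold).astype(int)
# With threshold 0.5 this matches clf.predict(X_test)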
Polynomial Regression
Polynomial Regression is an extension of linear regression that models the relationship between the
dependent variable \( y \) and one or more independent variables \( X \) as a polynomial. Unlike linear
regression, which assumes a linear relationship, polynomial regression can model more complex
relationships by adding powers of the independent variables to the regression equation.
For a single feature \( X \), the equation for polynomial regression of degree \( n \) is:
\( y = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_n X^n + \epsilon \)
Where:
- \( \beta_0, \beta_1, \dots, \beta_n \) are the model coefficients.
- \( \epsilon \) is the error term.
Polynomial regression is useful when the data shows a non-linear relationship between the
dependent and independent variables. It allows for fitting a curve rather than a straight line to the
data.
Explanation
Pros:
Captures Non-linearity: It can model more complex, non-linear relationships that linear
regression cannot capture.
Flexibility: You can choose the degree of the polynomial to fit the complexity of the data.
Cons:
Overfitting: If the degree of the polynomial is too high, the model may overfit the training
data, resulting in poor generalization to new data.
Complexity: High-degree polynomials can lead to a complex model that is difficult to
interpret and more sensitive to small fluctuations in the data.
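A minimal sketch of polynomial regression in Scikit-learn, combining `PolynomialFeatures` with `LinearRegression` on small made-up data:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
# Illustrative non-linear data: y roughly follows a quadratic in X
rng = np.random.default_rng(42)
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 - X.ravel() + 2 + rng.normal(scale=0.2, size=30)
# Degree-2 polynomial regression as a single pipeline
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[1.5]]))   # prediction for X = 1.5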
Naive Bayes
1. Overview:
- Definition: Naive Bayes is a classification algorithm based on Bayes' Theorem with a strong assumption of independence between features.
- Types:
- Gaussian Naive Bayes: Used for continuous data, assuming a Gaussian distribution.
- Multinomial Naive Bayes: Suitable for discrete data like word counts in text classification.
2. Bayes' Theorem:
\( P(C|X) = \frac{P(X|C) \times P(C)}{P(X)} \)
where \( P(C|X) \) is the posterior probability of class \( C \) given the features \( X \), \( P(X|C) \) is the likelihood, \( P(C) \) is the prior probability of the class, and \( P(X) \) is the evidence.
3. Assumptions:
- Feature Independence: The features are conditionally independent given the class. For example, in text
classification, the occurrence of words is assumed to be independent of each other, given the class label
(e.g., spam or not spam).
4. Example Workflow:
1. Data Preprocessing: Clean and prepare the dataset (e.g., text tokenization for text classification).
2. Training: Estimate the prior probability of each class and the likelihood of each feature given the class from the training data.
3. Prediction: For a new instance, calculate the posterior probability for each class and assign the class with the highest probability.
5. Advantages:
- Works Well with Text Data: Popular for spam detection and sentiment analysis.
6. Disadvantages:
- Strong Independence Assumption: May not hold in real-world data, which can reduce accuracy.
- Zero Frequency Problem: If a category of a feature is not present in the training data, it leads to zero
probability. This can be handled using smoothing techniques (e.g., Laplace smoothing).
7. Applications:
- Text classification tasks such as spam filtering, sentiment analysis, and document categorization.
8. Smoothing Techniques:
- Laplace Smoothing: Adds a small value (often 1) to each probability to handle zero frequencies.
# Code Example (Python with Scikit-learn):
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Train the model (X_train, y_train from an earlier train/test split)
model = GaussianNB()
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
Confusion Matrix
The confusion matrix is a performance measurement tool used in machine learning and statistics to
evaluate the accuracy of a classification algorithm. It is a table that describes the performance of a
classification model on a set of test data for which the true values are known. The matrix compares
the actual target values with the predicted values.
A confusion matrix is typically structured as a 2x2 table for binary classification, but it can be
extended for multi-class classification.
EXAMPLE
A machine learning model is trained to predict tumors in patients. The test dataset consists of 100 people.
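A minimal sketch of computing a confusion matrix with Scikit-learn; the label vectors here are short made-up examples, not the 100-patient dataset above:
from sklearn.metrics import confusion_matrix
# Made-up true and predicted labels (1 = tumor, 0 = no tumor)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))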
Precision, Recall
Both precision and recall are crucial in information retrieval, where the positive class matters far more than the negative class. Why?
When searching the web, we do not care about documents that are both irrelevant and not retrieved (the true negative case). Therefore only TP, FP, and FN are used in Precision and Recall.
Precision
Out of all instances predicted as positive, what percentage is truly positive:
\( \text{Precision} = \frac{TP}{TP + FP} \)
Recall
Out of all actual positives, what percentage is predicted positive. It is the same as TPR (true positive rate):
\( \text{Recall} = \frac{TP}{TP + FN} \)
F1 Score
It is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account and therefore performs well on imbalanced datasets:
\( F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \)
The more general \( F_\beta \) score weights recall against precision:
\( F_\beta = (1 + \beta^2) \times \frac{\text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}} \)
Beta represents how many times recall is more important than precision. If recall is twice as important as precision, the value of Beta is 2.
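A minimal sketch of these metrics in Scikit-learn, reusing the made-up label vectors from the confusion-matrix example:
from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(precision_score(y_true, y_pred))       # TP / (TP + FP)
print(recall_score(y_true, y_pred))          # TP / (TP + FN)
print(f1_score(y_true, y_pred))              # harmonic mean of precision and recall
print(fbeta_score(y_true, y_pred, beta=2))   # weights recall twice as heavily as precision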