Classification Algorithms

Scikit-learn's classification API offers tools for supervised learning to predict class labels using labeled training data. The basic workflow includes importing a classifier, preparing data, training the model, making predictions, and evaluating performance with various metrics. Common algorithms include Logistic Regression, SVM, k-NN, Decision Trees, Random Forests, and Naive Bayes, with examples provided for implementing these models using the Breast Cancer dataset and linear regression techniques.


Scikit-learn's classification API

Scikit-learn's classification API provides a comprehensive suite of tools for performing supervised learning
tasks, where the goal is to predict the category or class label of an input based on labeled training data.
Basic Workflow for Classification:
1. Import the classifier: Choose a classifier from Scikit-learn’s collection of algorithms.
2. Prepare the data: Load and split your dataset into features (X) and labels (y).
3. Train the classifier: Fit the model to your training data.
4. Predict the labels: Use the trained model to make predictions on new data.
5. Evaluate the model: Assess the model’s performance using metrics like accuracy, precision, recall, F1-
score, etc.
Common Classification Algorithms in Scikit-learn:
- Logistic Regression (`sklearn.linear_model.LogisticRegression`)
- Support Vector Machine (SVM) (`sklearn.svm.SVC`)
- k-Nearest Neighbors (k-NN) (`sklearn.neighbors.KNeighborsClassifier`)
- Decision Trees (`sklearn.tree.DecisionTreeClassifier`)
- Random Forests (`sklearn.ensemble.RandomForestClassifier`)
- Naive Bayes (`sklearn.naive_bayes.GaussianNB`)
API Details:
1. Importing a Classifier:
from sklearn.linear_model import LogisticRegression
2. Splitting Data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
3. Training the Model:
model = LogisticRegression()
model.fit(X_train, y_train)
4. Making Predictions:
y_pred = model.predict(X_test)
5. Evaluating the Model:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
6. Hyperparameter Tuning:
- Use `GridSearchCV` or `RandomizedSearchCV` for hyperparameter optimization.
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
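After fitting, the chosen hyperparameters and cross-validated score can be inspected, for example:
print(grid.best_params_)            # e.g. {'C': 1}, the best setting found
print(grid.best_score_)             # mean cross-validated accuracy for that setting
best_model = grid.best_estimator_   # the model refitted on the full training set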
Commonly Used Functions:
- `fit(X, y)`: Trains the model using the input features `X` and target labels `y`.
- `predict(X)`: Predicts the labels for new data `X`.
- `score(X, y)`: Returns the mean accuracy of the model on the given test data and labels.
- `predict_proba(X)`: Returns the probability estimates for the input `X` for each class (for probabilistic classifiers).
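As a quick illustration of these methods (a sketch that continues with the logistic regression model fitted in step 3 above):
test_accuracy = model.score(X_test, y_test)       # mean accuracy on the test set
probabilities = model.predict_proba(X_test[:5])   # class probabilities for the first five test samples
print(test_accuracy)
print(probabilities)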
Example:
Here's an example using the DecisionTreeClassifier:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load data
data = load_iris()
X, y = data.data, data.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train classifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
Scikit-learn makes it easy to switch between classifiers by just changing the imported class, while keeping
the workflow largely the same.

For example, the same workflow with Gaussian Naive Bayes:
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
data = load_iris()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the classifier
clf = GaussianNB()
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Scikit-learn Implementation:
In Scikit-learn, Logistic Regression is implemented in the `LogisticRegression` class.
1. Importing Logistic Regression:
from sklearn.linear_model import LogisticRegression
2. Basic Workflow:
- Training: Fit the logistic regression model to your training data.
- Prediction: Use the trained model to predict class labels for new data.
- Evaluation: Evaluate the model's performance using metrics like accuracy, precision, recall, or ROC-AUC.

3. Example with Binary Logistic Regression:


from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train Logistic Regression model
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Breast Cancer dataset


The Breast Cancer dataset contains 30 features, each derived from digitized images of fine needle aspirate
(FNA) of breast masses. These features describe the characteristics of the cell nuclei present in the images.
For each of the ten underlying characteristics, three values are computed: the mean, the standard error, and
the "worst" (or largest) value.
Here is the list of the 30 features, grouped into 10 categories, along with a brief description of each:

Categories of Features:
1. Radius:
- `mean radius`: The average distance from the center to points on the perimeter.
- `radius error`: The standard error of the radius.
- `worst radius`: The largest (or "worst") value of the radius.
2. Texture:
- `mean texture`: The standard deviation of gray-scale values within the tumor.
- `texture error`: The standard error of the texture.
- `worst texture`: The largest (or "worst") value of the texture.
3. Perimeter:
- `mean perimeter`: The average perimeter of the tumor.
- `perimeter error`: The standard error of the perimeter.
- `worst perimeter`: The largest (or "worst") value of the perimeter.
4. Area:
- `mean area`: The average area of the tumor.
- `area error`: The standard error of the area.
- `worst area`: The largest (or "worst") value of the area.
5. Smoothness:
- `mean smoothness`: The average local variation in radius lengths.
- `smoothness error`: The standard error of smoothness.
- `worst smoothness`: The largest (or "worst") value of smoothness.
6. Compactness:
- `mean compactness`: Calculated as \( \frac{\text{perimeter}^2}{\text{area}} - 1.0 \).
- `compactness error`: The standard error of compactness.
- `worst compactness`: The largest (or "worst") value of compactness.
7. Concavity:
- `mean concavity`: The average severity of concave portions of the contour.
- `concavity error`: The standard error of concavity.
- `worst concavity`: The largest (or "worst") value of concavity.
8. Concave Points:
- `mean concave points`: The average number of concave portions of the contour.
- `concave points error`: The standard error of concave points.
- `worst concave points`: The largest (or "worst") value of concave points.
9. Symmetry:
- `mean symmetry`: The average symmetry of the tumor.
- `symmetry error`: The standard error of symmetry.
- `worst symmetry`: The largest (or "worst") value of symmetry.
10. Fractal Dimension:
- `mean fractal dimension`: The average "coastline approximation" of the tumor (a measure of complexity).
- `fractal dimension error`: The standard error of fractal dimension.
- `worst fractal dimension`: The largest (or "worst") value of fractal dimension.

Complete List of Features:


1. mean radius  2. mean texture  3. mean perimeter  4. mean area  5. mean smoothness
6. mean compactness  7. mean concavity  8. mean concave points  9. mean symmetry
10. mean fractal dimension  11. radius error  12. texture error  13. perimeter error
14. area error  15. smoothness error  16. compactness error  17. concavity error
18. concave points error  19. symmetry error  20. fractal dimension error
21. worst radius  22. worst texture  23. worst perimeter  24. worst area
25. worst smoothness  26. worst compactness  27. worst concavity  28. worst concave points
29. worst symmetry  30. worst fractal dimension

These features capture various geometric and textural properties of the tumor, providing a comprehensive
view of the tumor's characteristics, which are useful for distinguishing between malignant and benign
tumors.
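The feature names and class labels can also be inspected directly in Scikit-learn; a small sketch:

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
print(data.feature_names)   # the 30 feature names listed above
print(data.target_names)    # ['malignant' 'benign']
print(data.data.shape)      # (569, 30): 569 samples, 30 features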

Linear Regression Overview


Linear regression is a method used to model the relationship between a dependent variable \( y \) and one
or more independent variables \( X \). The simplest form, called simple linear regression, involves one
dependent variable and one independent variable, and it assumes that there is a linear relationship
between them.

The linear regression model is represented as:

\( y = \beta_0 + \beta_1 x + \epsilon \)

Where:

- \( y \) is the dependent variable.
- \( x \) is the independent variable.
- \( \beta_0 \) is the intercept (the value of \( y \) when \( x = 0 \)).
- \( \beta_1 \) is the slope of the line (indicating how much \( y \) changes for a unit change in \( x \)).
- \( \epsilon \) is the error term (the difference between the actual and predicted values).

Numerical Example

Let's perform a simple linear regression on a small dataset.

Dataset

Suppose we have the following data on the number of hours studied (independent variable \( x \)) and the
corresponding test scores (dependent variable \( y \)):

| Hours Studied (x) | Test Score (y) |
|-------------------|----------------|
| 1                 | 2              |
| 2                 | 4              |
| 3                 | 5              |
| 4                 | 4              |
| 5                 | 5              |

We want to fit a linear regression model to predict test scores based on hours studied.

Step 1: Calculate the Means of x and y

First, compute the mean (average) of the independent variable x and the dependent variable y.

\( \bar{x} = \frac{1 + 2 + 3 + 4 + 5}{5} = 3 \)

\( \bar{y} = \frac{2 + 4 + 5 + 4 + 5}{5} = 4 \)

Step 2: Calculate the Slope \( \beta_1 \)

The slope \( \beta_1 \) is calculated using the formula:

\( \beta_1 = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sum{(x_i - \bar{x})^2}} \)

Let's calculate the individual terms:

\( \sum{(x_i - \bar{x})(y_i - \bar{y})} = (1-3)(2-4) + (2-3)(4-4) + (3-3)(5-4) + (4-3)(4-4) + (5-3)(5-4) \)

\( = (-2)(-2) + (-1)(0) + (0)(1) + (1)(0) + (2)(1) \)

\( = 4 + 0 + 0 + 0 + 2 = 6 \)

\( \sum{(x_i - \bar{x})^2} = (1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2 \)

\( = (-2)^2 + (-1)^2 + (0)^2 + (1)^2 + (2)^2 \)

\( = 4 + 1 + 0 + 1 + 4 = 10 \)

Now, calculate \( \beta_1 \):

\( \beta_1 = \frac{6}{10} = 0.6 \)

Step 3: Calculate the Intercept \( \beta_0 \)

The intercept \( \beta_0 \) is calculated using the formula:

\( \beta_0 = \bar{y} - \beta_1 \bar{x} \)

Substituting the values:

\( \beta_0 = 4 - (0.6 \times 3) = 4 - 1.8 = 2.2 \)

Step 4: Form the Linear Regression Equation

The linear regression equation is:

\( y = 2.2 + 0.6x \)

This equation can be used to predict the test score based on the number of hours studied.

Step 5: Make Predictions

Let's use the equation to predict the test score for someone who studied for 6 hours.

\( y = 2.2 + 0.6 \times 6 = 2.2 + 3.6 = 5.8 \)

So, if a student studies for 6 hours, the predicted test score is 5.8.

Conclusion

This simple example demonstrates how to perform a linear regression to find the relationship between two
variables. The equation \( y = 2.2 + 0.6x \) can be used to make predictions based on the independent
variable \( x \).
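These hand calculations can be verified with a few lines of NumPy (a quick sketch of the same least-squares formulas):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Slope and intercept using the formulas from Steps 2 and 3
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

print(beta_1, beta_0)        # 0.6 2.2
print(beta_0 + beta_1 * 6)   # 5.8, the predicted score for 6 hours of study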

# Importing required libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
# Sample dataset: Hours Studied (X) vs Test Scores (Y)
data = {'Hours_Studied': [1, 2, 3, 4, 5], 'Test_Score': [2, 4, 5, 4, 5]}
df = pd.DataFrame(data)
# Splitting the dataset into features (X) and target (y)
X = df[['Hours_Studied']] # Independent variable
y = df['Test_Score'] # Dependent variable
# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a linear regression model
model = LinearRegression()
# Train the model using the training data
model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test)
# Model evaluation
# Note: with only five samples, the test split contains a single point,
# so the R-squared score is not well defined for this split.
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2 Score): {r2}")
# Predicting a new value (for 6 hours of study)
# Use a DataFrame with the same column name as the training data to avoid a feature-name warning
new_hours = pd.DataFrame({'Hours_Studied': [6]})
predicted_score = model.predict(new_hours)
print(f"Predicted Test Score for 6 hours of study: {predicted_score[0]}")

# Visualizing the regression line with the dataset
plt.scatter(X, y, color='blue') # Original data points
plt.plot(X, model.predict(X), color='red') # Regression line
plt.title('Hours Studied vs Test Score')
plt.xlabel('Hours Studied')
plt.ylabel('Test Score')
plt.show()

Logistic Regression
Logistic Regression is a popular classification algorithm that is used when the target variable is categorical.
Despite its name, Logistic Regression is a classification method, not a regression method. It estimates
probabilities using a logistic (sigmoid) function and classifies data into binary or multiple classes.

Key Concepts of Logistic Regression:

1. Logistic Function (Sigmoid Function):


Logistic Regression models the probability that a given input belongs to a particular class. The output is
modeled as:

\( P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}} \)

where \( \beta_0 \) is the intercept and \( \beta_1, \dots, \beta_n \) are the coefficients for the features \( x_1, \dots, x_n \). The logistic
function outputs a probability between 0 and 1.

2. Decision Boundary:
Logistic Regression uses a threshold (typically 0.5) to decide the class label. If the predicted probability is
greater than the threshold, the instance is classified as class 1; otherwise, it is classified as class 0.

3. Binary vs. Multiclass Logistic Regression:


- Binary Logistic Regression: Used for binary classification problems (e.g., spam or not spam).
- Multiclass Logistic Regression: For multiclass classification problems, Logistic Regression can be
extended using approaches like One-vs-Rest (OvR) or Softmax Regression (also known as multinomial
logistic regression).
4. Loss Function (Log-Loss / Cross-Entropy Loss):
Logistic Regression is optimized by minimizing the log-loss, which measures the difference between the
predicted probabilities and the actual class labels. The log-loss function for a binary classifier is:

\( L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right] \)

where \( \hat{p}_i \) is the predicted probability of class 1 for the \( i \)-th sample.
5. Regularization:
Logistic Regression often uses regularization (e.g., L1 or L2 regularization) to prevent overfitting. L2
regularization (Ridge) adds a penalty for large coefficients, while L1 regularization (Lasso) encourages
sparsity in the coefficients.
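As a brief sketch of how regularization is chosen in Scikit-learn (the penalty types and strengths below are illustrative choices, not prescribed by the text), `LogisticRegression` exposes the `penalty` and `C` parameters, where a smaller `C` means a stronger penalty:

from sklearn.linear_model import LogisticRegression

# L2 (ridge-style) regularization is the default
l2_model = LogisticRegression(penalty='l2', C=0.5, max_iter=1000)

# L1 (lasso-style) regularization requires a solver that supports it, such as liblinear or saga
l1_model = LogisticRegression(penalty='l1', C=0.5, solver='liblinear', max_iter=1000)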

Polynomial Regression

Polynomial Regression is an extension of linear regression that models the relationship between the
dependent variable y and one or more independent variables X as a polynomial. Unlike linear
regression, which assumes a linear relationship, polynomial regression can model more complex
relationships by adding powers of the independent variables to the regression equation.

Polynomial Regression Equation

For a single feature X, the equation for polynomial regression of degree n is:

\( y = b_0 + b_1 X + b_2 X^2 + \dots + b_n X^n + \epsilon \)

Where:

 y is the dependent variable.
 X is the independent variable.
 \( b_0, b_1, \dots, b_n \) are the coefficients.
 n is the degree of the polynomial.
When to Use Polynomial Regression

Polynomial regression is useful when the data shows a non-linear relationship between the
dependent and independent variables. It allows for fitting a curve rather than a straight line to the
data.

Explanation

1. Feature Transformation: The PolynomialFeatures class transforms the original
feature(s) into polynomial features. For a degree n polynomial, it creates new features like
X^2, X^3, etc.
2. Model Fitting: Once the features are transformed, a linear regression model is fitted to
these features. Though the model is still linear in terms of the coefficients, the features are
now non-linear (polynomial), allowing the model to fit a curve (see the sketch after this list).
3. Visualization: You can visualize the curve fitted by the polynomial regression model,
which should capture the non-linear relationship between the independent and dependent
variables.
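A minimal sketch of this transform-then-fit workflow, using a small synthetic quadratic dataset (the data and the degree-2 choice are illustrative assumptions):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic non-linear data: y roughly follows 2 + 3x - 0.5x^2 plus noise
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 30).reshape(-1, 1)
y = 2 + 3 * X.ravel() - 0.5 * X.ravel() ** 2 + rng.normal(0, 0.3, 30)

# Transform features to degree-2 polynomials, then fit an ordinary linear regression
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)

print(poly_model.predict([[2.5]]))  # prediction on the fitted curve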

Pros and Cons of Polynomial Regression

Pros:

 Captures Non-linearity: It can model more complex, non-linear relationships that linear
regression cannot capture.
 Flexibility: You can choose the degree of the polynomial to fit the complexity of the data.

Cons:

 Overfitting: If the degree of the polynomial is too high, the model may overfit the training
data, resulting in poor generalization to new data.
 Complexity: High-degree polynomials can lead to a complex model that is difficult to
interpret and more sensitive to small fluctuations in the data.

Naive Bayes Classification


1. Overview:

- Definition: Naive Bayes is a classification algorithm based on Bayes' Theorem with a strong assumption
of independence between features.

- Types:

- Gaussian Naive Bayes: Used for continuous data, assuming a Gaussian distribution.

- Multinomial Naive Bayes: Suitable for discrete data like word counts in text classification.

- Bernoulli Naive Bayes: For binary/Boolean features.

2. Bayes' Theorem:
\( P(C|X) = \frac{P(X|C) \times P(C)}{P(X)} \)

- \(P(C|X)\): Posterior probability of class \(C\) given feature set \(X\).

- \(P(X|C)\): Likelihood of feature set \(X\) given class \(C\).

- \(P(C)\): Prior probability of class \(C\).

- \(P(X)\): Evidence (normalizing factor).

3. Assumptions:

- Feature Independence: The features are conditionally independent given the class. For example, in text
classification, the occurrence of words is assumed to be independent of each other, given the class label
(e.g., spam or not spam).

4. Example Workflow:

1. Data Preprocessing: Clean and prepare the dataset (e.g., text tokenization for text classification).

2. Train the Model: Estimate probabilities from the training data.

3. Prediction: For a new instance, calculate the posterior probability for each class and assign the class
with the highest probability.

5. Advantages:

- Fast and Efficient: Suitable for large datasets.

- Works Well with Text Data: Popular for spam detection and sentiment analysis.

- Handles Multiclass Problems.

6. Disadvantages:

- Strong Independence Assumption: May not hold in real-world data, which can reduce accuracy.

- Zero Frequency Problem: If a category of a feature is not present in the training data, it leads to zero
probability. This can be handled using smoothing techniques (e.g., Laplace smoothing).

7. Applications:

- Text Classification: Spam detection, sentiment analysis.

- Medical Diagnosis: Classifying diseases based on symptoms.

- Document Categorization: Classifying documents into predefined categories.

8. Smoothing Techniques:

- Laplace Smoothing: Adds a small value (often 1) to each probability to handle zero frequencies.
# Code Example (Python with Scikit-learn):

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification

# Synthetic sample data as a stand-in (replace with your actual dataset)
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Gaussian Naive Bayes Classifier
model = GaussianNB()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
Confusion Matrix
The confusion matrix is a performance measurement tool used in machine learning and statistics to
evaluate the accuracy of a classification algorithm. It is a table that describes the performance of a
classification model on a set of test data for which the true values are known. The matrix compares
the actual target values with the predicted values.

Structure of a Confusion Matrix:

A confusion matrix is typically structured as a 2x2 table for binary classification, but it can be
extended for multi-class classification.

For binary classification, the matrix looks like this:

|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |

EXAMPLE
A machine learning model is trained to predict tumor in patients. The test dataset
consists of 100 people.

Confusion matrix for tumor detection:

|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | 10 (TP)            | 8 (FN)             |
| Actual Negative | 22 (FP)            | 60 (TN)            |

True Positive (TP) — the model correctly predicts the positive class (prediction and
actual are both positive). In the above example, 10 people who have tumors are
correctly predicted as positive by the model.
True Negative (TN) — the model correctly predicts the negative class (prediction and
actual are both negative). In the above example, 60 people who don't have tumors
are correctly predicted as negative by the model.
False Positive (FP) — the model incorrectly predicts the positive class for an actual negative
(predicted positive, actual negative). In the above example, 22 people are predicted
as having a tumor, although they don't have one. FP is also called
a TYPE I error.
False Negative (FN) — the model incorrectly predicts the negative class for an actual positive
(predicted negative, actual positive). In the above example, 8 people who have tumors are
predicted as negative. FN is also called a TYPE II error.
With the help of these four values, we can calculate the True Positive Rate (TPR), False
Positive Rate (FPR), True Negative Rate (TNR), and False Negative Rate (FNR).
Even if the data is imbalanced, we can judge whether our model is working well or not.
For that, the values of TPR and TNR should be high, and FPR and FNR should
be as low as possible.
With the help of TP, TN, FN, and FP, other performance metrics can be calculated.
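A short sketch of how these counts are obtained with Scikit-learn; the y_true and y_pred arrays below are hypothetical labels (1 = tumor, 0 = no tumor):

from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels
y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")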

Precision, Recall
Both precision and recall are crucial in information retrieval, where the positive class
matters far more than the negative one. Why?
When searching the web, we do not care about documents that are
irrelevant and not retrieved (the true negative case). Therefore,
only TP, FP, and FN are used in precision and recall.

Precision
Out of all instances predicted as positive, what percentage is truly positive:

\( \text{Precision} = \frac{TP}{TP + FP} \)

The precision value lies between 0 and 1.

Recall
Out of all actual positives, what percentage are predicted as positive. It is the same as the TPR
(true positive rate):

\( \text{Recall} = \frac{TP}{TP + FN} \)

EXAMPLE 1 - Credit card fraud detection

Confusion Matrix for Credit Card Fraud Detection
We do not want to miss any fraudulent transactions. Therefore, we want False
Negatives to be as low as possible. In these situations, we can tolerate
lower precision, but recall should be high. Similarly, in medical applications, we don't
want to miss any patient, so we focus on having a high recall.
So far, we have discussed when recall is more important than precision. But when is
precision more important than recall?

EXAMPLE 2 — Spam detection

Confusion Matrix for Spam detection


In spam detection, it is acceptable if some spam mail remains undetected (a false
negative), but it is much worse to lose a critical mail because it is classified as spam (a false
positive). In this situation, False Positives should be as low as possible, so
precision is more important than recall.
When comparing different models, it can be difficult to decide which is better (high
precision and low recall, or vice versa). Therefore, there should be a metric that
combines both. One such metric is the F1 score.

F1 Score
It is the harmonic mean of precision and recall. It takes both false positives and false
negatives into account. Therefore, it performs well on an imbalanced dataset.

\( F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \)

The F1 score gives the same weightage to recall and precision.

There is a weighted F-beta score in which we can give different weightage to recall and
precision. As discussed in the previous section, different problems give different
weightage to recall and precision.

\( F_\beta = (1 + \beta^2) \times \frac{\text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}} \)

Beta represents how many times recall is more important than precision. If
recall is twice as important as precision, the value of Beta is 2.
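A brief sketch computing these metrics with Scikit-learn, reusing the hypothetical y_true and y_pred labels from the confusion matrix sketch above:

from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]

print(precision_score(y_true, y_pred))       # TP / (TP + FP)
print(recall_score(y_true, y_pred))          # TP / (TP + FN)
print(f1_score(y_true, y_pred))              # harmonic mean of precision and recall
print(fbeta_score(y_true, y_pred, beta=2))   # recall weighted twice as heavily as precision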
