Classification Algorithms
Scikit-learn's classification API provides a comprehensive suite of tools for performing supervised learning
tasks, where the goal is to predict the category or class label of an input based on labeled training data.
Basic Workflow for Classification:
1. Import the classifier: Choose a classifier from Scikit-learn’s collection of algorithms.
2. Prepare the data: Load and split your dataset into features (X) and labels (y).
3. Train the classifier: Fit the model to your training data.
4. Predict the labels: Use the trained model to make predictions on new data.
5. Evaluate the model: Assess the model’s performance using metrics like accuracy, precision, recall, F1-
score, etc.
Common Classification Algorithms in Scikit-learn:
- Logistic Regression (`sklearn.linear_model.LogisticRegression`)
- Support Vector Machine (SVM) (`sklearn.svm.SVC`)
- k-Nearest Neighbors (k-NN) (`sklearn.neighbors.KNeighborsClassifier`)
- Decision Trees (`sklearn.tree.DecisionTreeClassifier`)
- Random Forests (`sklearn.ensemble.RandomForestClassifier`)
- Naive Bayes (`sklearn.naive_bayes.GaussianNB`)
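Because all of these classifiers share the same estimator interface (`fit`, `predict`, `score`), they can be swapped in and out with minimal code changes. A minimal sketch, using the Iris dataset purely for illustration:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
# Small illustrative dataset and split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# The same fit/score calls work for every classifier
for clf in (LogisticRegression(max_iter=1000), KNeighborsClassifier(), DecisionTreeClassifier()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))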
API Details:
1. Importing a Classifier:
from sklearn.linear_model import LogisticRegression
2. Splitting Data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
3. Training the Model:
model = LogisticRegression()
model.fit(X_train, y_train)
4. Making Predictions:
y_pred = model.predict(X_test)
5. Evaluating the Model:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
6. Hyperparameter Tuning:
- Use `GridSearchCV` or `RandomizedSearchCV` for hyperparameter optimization.
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
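After fitting, the best hyperparameters and the refitted best model are available on the search object; a short follow-up sketch:
print(grid.best_params_)            # best value of C found by the search
print(grid.best_score_)             # mean cross-validated score of that setting
best_model = grid.best_estimator_   # model refitted on the full training set
y_pred = best_model.predict(X_test)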
Commonly Used Functions:
fit(X, y): Trains the model using the input features `X` and target labels `y`.
predict(X): Predicts the labels for new data `X`.
score(X, y): Returns the mean accuracy of the model on the given test data and labels.
predict_proba(X): Returns the probability estimates for the input `X` for each class (for probabilistic classifiers); see the sketch below.
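A minimal sketch of `score` and `predict_proba`, assuming the `model` fitted above and the same train/test split:
# Mean accuracy on the held-out test set
test_accuracy = model.score(X_test, y_test)
# Class-membership probabilities for the first five test samples
# (LogisticRegression exposes predict_proba; not every classifier does)
probabilities = model.predict_proba(X_test[:5])
print(test_accuracy, probabilities)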
Example:
Here's an example using the DecisionTreeClassifier:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load data
data = load_iris()
X, y = data.data, data.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
Scikit-learn Implementation:
In Scikit-learn, Logistic Regression is implemented in the `LogisticRegression` class.
1. Importing Logistic Regression:
from sklearn.linear_model import LogisticRegression
2. Basic Workflow:
- Training: Fit the logistic regression model to your training data.
- Prediction: Use the trained model to predict class labels for new data.
- Evaluation: Evaluate the model's performance using metrics like accuracy, precision, recall, or ROC-AUC.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train Logistic Regression model
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
Categories of Features in the Breast Cancer Dataset:
1. Radius:
- `mean radius`: The average distance from the center to points on the perimeter.
- `radius error`: The standard error of the radius.
- `worst radius`: The largest (or "worst") value of the radius.
2. Texture:
- `mean texture`: The standard deviation of gray-scale values within the tumor.
- `texture error`: The standard error of the texture.
- `worst texture`: The largest (or "worst") value of the texture.
3. Perimeter:
- `mean perimeter`: The average perimeter of the tumor.
- `perimeter error`: The standard error of the perimeter.
- `worst perimeter`: The largest (or "worst") value of the perimeter.
4. Area:
- `mean area`: The average area of the tumor.
- `area error`: The standard error of the area.
- `worst area`: The largest (or "worst") value of the area.
5. Smoothness:
- `mean smoothness`: The average local variation in radius lengths.
- `smoothness error`: The standard error of smoothness.
- `worst smoothness`: The largest (or "worst") value of smoothness.
6. Compactness:
- `mean compactness`: Calculated as \( \frac{\text{perimeter}^2}{\text{area}} - 1.0 \).
- `compactness error`: The standard error of compactness.
- `worst compactness`: The largest (or "worst") value of compactness.
7. Concavity:
- `mean concavity`: The average severity of concave portions of the contour.
- `concavity error`: The standard error of concavity.
- `worst concavity`: The largest (or "worst") value of concavity.
8. Concave Points:
- `mean concave points`: The average number of concave portions of the contour.
- `concave points error`: The standard error of concave points.
- `worst concave points`: The largest (or "worst") value of concave points.
9. Symmetry:
- `mean symmetry`: The average symmetry of the tumor.
- `symmetry error`: The standard error of symmetry.
- `worst symmetry`: The largest (or "worst") value of symmetry.
10. Fractal Dimension:
- `mean fractal dimension`: The average "coastline approximation" of the tumor (a measure of complexity).
- `fractal dimension error`: The standard error of fractal dimension.
- `worst fractal dimension`: The largest (or "worst") value of fractal dimension.
These features capture various geometric and textural properties of the tumor, providing a comprehensive
view of the tumor's characteristics, which are useful for distinguishing between malignant and benign
tumors.
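A quick way to inspect these feature names directly from Scikit-learn, as a minimal sketch:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
print(data.feature_names)   # the 30 feature names described above
print(data.target_names)    # the two target classes: malignant and benign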
Linear Regression
Simple linear regression models the relationship between an independent variable \( x \) and a dependent variable \( y \) with the equation \( y = \beta_0 + \beta_1 x + \epsilon \), where:
- beta_0 is the intercept (the predicted value of y when x = 0).
- beta_1 is the slope of the line (indicating how much y changes for a unit change in x).
- epsilon is the error term (the difference between the actual and predicted values).
Numerical Example
Dataset
Suppose we have the following data on the number of hours studied (independent variable \( x \)) and the
corresponding test scores (dependent variable \( y \)):
| Hours Studied (x) | Test Score (y) |
|-------------------|----------------|
| 1                 | 2              |
| 2                 | 4              |
| 3                 | 5              |
| 4                 | 4              |
| 5                 | 5              |
We want to fit a linear regression model to predict test scores based on hours studied.
First, compute the means of the independent variable \( x \) and the dependent variable \( y \):
\( \bar{x} = \frac{1+2+3+4+5}{5} = 3, \quad \bar{y} = \frac{2+4+5+4+5}{5} = 4 \)
Next, compute the two sums needed for the slope:
\( \sum{(x_i - \bar{x})(y_i - \bar{y})} = (1-3)(2-4) + (2-3)(4-4) + (3-3)(5-4) + (4-3)(4-4) + (5-3)(5-4) = 4 + 0 + 0 + 0 + 2 = 6 \)
\( \sum{(x_i - \bar{x})^2} = (1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2 = 4 + 1 + 0 + 1 + 4 = 10 \)
The slope and intercept are then:
\( \beta_1 = \frac{6}{10} = 0.6, \quad \beta_0 = \bar{y} - \beta_1 \bar{x} = 4 - 0.6 \times 3 = 2.2 \)
So the fitted regression line is:
y = 2.2 + 0.6x
This equation can be used to predict the test score based on the number of hours studied.
Let's use the equation to predict the test score for someone who studied for 6 hours:
\( y = 2.2 + 0.6 \times 6 = 2.2 + 3.6 = 5.8 \)
So, if a student studies for 6 hours, the predicted test score is 5.8.
Conclusion
This simple example demonstrates how to perform a linear regression to find the relationship between two
variables. The equation \( y = 2.2 + 0.6x \) can be used to make predictions based on the independent
variable \( x \).
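As a quick cross-check, the same coefficients can be recovered with Scikit-learn; a minimal sketch using the five points from the table above:
import numpy as np
from sklearn.linear_model import LinearRegression
# Hours studied and test scores from the table above
X = np.array([[1], [2], [3], [4], [5]])   # features must be 2D
y = np.array([2, 4, 5, 4, 5])
reg = LinearRegression().fit(X, y)
print(reg.intercept_, reg.coef_[0])   # approximately 2.2 and 0.6
print(reg.predict([[6]]))             # approximately 5.8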
Logistic Regression
Logistic Regression is a popular classification algorithm that is used when the target variable is categorical.
Despite its name, Logistic Regression is a classification method, not a regression method. It estimates
probabilities using a logistic (sigmoid) function and classifies data into binary or multiple classes.
1. Sigmoid Function:
Logistic Regression passes a linear combination of the input features through the sigmoid function \( \sigma(z) = \frac{1}{1 + e^{-z}} \), which maps any real value to a probability between 0 and 1.
2. Decision Boundary:
Logistic Regression uses a threshold (typically 0.5) to decide the class label. If the predicted probability is greater than the threshold, the instance is classified as class 1; otherwise, it is classified as class 0.
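A minimal sketch of applying a threshold to the predicted probabilities, assuming the fitted breast-cancer `clf` and test split from the earlier example:
# Probability of class 1 for each test sample
probs = clf.predict_proba(X_test)[:, 1]
# Predict class 1 when the probability exceeds the threshold
threshold = 0.5
custom_pred = (probs > threshold).astype(int)
# With threshold 0.5 this matches clf.predict(X_test)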
Polynomial Regression
Polynomial Regression is an extension of linear regression that models the relationship between the
dependent variable \( y \) and one or more independent variables \( X \) as a polynomial. Unlike linear
regression, which assumes a linear relationship, polynomial regression can model more complex
relationships by adding powers of the independent variables to the regression equation.
For a single feature \( X \), the equation for polynomial regression of degree \( n \) is:
\( y = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_n X^n + \epsilon \)
Where:
- \( \beta_0, \beta_1, \dots, \beta_n \) are the model coefficients.
- \( \epsilon \) is the error term.
Polynomial regression is useful when the data shows a non-linear relationship between the
dependent and independent variables. It allows for fitting a curve rather than a straight line to the
data.
Explanation
Pros:
Captures Non-linearity: It can model more complex, non-linear relationships that linear
regression cannot capture.
Flexibility: You can choose the degree of the polynomial to fit the complexity of the data.
Cons:
Overfitting: If the degree of the polynomial is too high, the model may overfit the training
data, resulting in poor generalization to new data.
Complexity: High-degree polynomials can lead to a complex model that is difficult to
interpret and more sensitive to small fluctuations in the data.
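A minimal sketch of polynomial regression in Scikit-learn, combining `PolynomialFeatures` with `LinearRegression` on small made-up data:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
# Illustrative non-linear data: y roughly follows a quadratic in X
rng = np.random.default_rng(42)
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 - X.ravel() + 2 + rng.normal(scale=0.2, size=30)
# Degree-2 polynomial regression as a single pipeline
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[1.5]]))   # prediction for X = 1.5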
Naive Bayes
1. Overview:
- Definition: Naive Bayes is a classification algorithm based on Bayes' Theorem with a strong assumption of independence between features.
- Types:
- Gaussian Naive Bayes: Used for continuous data, assuming a Gaussian distribution.
- Multinomial Naive Bayes: Suitable for discrete data like word counts in text classification.
2. Bayes' Theorem:
\( P(C|X) = \frac{P(X|C) \times P(C)}{P(X)} \)
where \( P(C|X) \) is the posterior probability of class \( C \) given the features \( X \), \( P(X|C) \) is the likelihood, \( P(C) \) is the prior probability of the class, and \( P(X) \) is the evidence.
3. Assumptions:
- Feature Independence: The features are conditionally independent given the class. For example, in text
classification, the occurrence of words is assumed to be independent of each other, given the class label
(e.g., spam or not spam).
4. Example Workflow:
1. Data Preprocessing: Clean and prepare the dataset (e.g., text tokenization for text classification).
2. Training: Estimate the prior probability of each class and the likelihood of each feature given the class from the training data.
3. Prediction: For a new instance, calculate the posterior probability for each class and assign the class with the highest probability.
5. Advantages:
- Works Well with Text Data: Popular for spam detection and sentiment analysis.
6. Disadvantages:
- Strong Independence Assumption: May not hold in real-world data, which can reduce accuracy.
- Zero Frequency Problem: If a category of a feature is not present in the training data, it leads to zero
probability. This can be handled using smoothing techniques (e.g., Laplace smoothing).
7. Applications:
- Text classification tasks such as spam filtering, sentiment analysis, and document categorization.
8. Smoothing Techniques:
- Laplace Smoothing: Adds a small value (often 1) to each probability to handle zero frequencies.
# Code Example (Python with Scikit-learn):
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Train the model (X_train, y_train from an earlier train/test split)
model = GaussianNB()
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
Confusion Matrix
The confusion matrix is a performance measurement tool used in machine learning and statistics to
evaluate the accuracy of a classification algorithm. It is a table that describes the performance of a
classification model on a set of test data for which the true values are known. The matrix compares
the actual target values with the predicted values.
A confusion matrix is typically structured as a 2x2 table for binary classification, but it can be
extended for multi-class classification.
EXAMPLE
A machine learning model is trained to predict tumors in patients. The test dataset consists of 100 people.
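A minimal sketch of computing a confusion matrix with Scikit-learn; the label vectors here are short made-up examples, not the 100-patient dataset above:
from sklearn.metrics import confusion_matrix
# Made-up true and predicted labels (1 = tumor, 0 = no tumor)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))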
Precision, Recall
Both precision and recall are crucial in information retrieval, where the positive class matters far more than the negative class. Why?
When searching the web, we do not care about documents that are both irrelevant and not retrieved (the true negative case). Therefore only TP, FP, and FN are used in Precision and Recall.
Precision
Out of all instances predicted as positive, what percentage is truly positive:
\( \text{Precision} = \frac{TP}{TP + FP} \)
Recall
Out of all actual positives, what percentage is predicted positive. It is the same as TPR (true positive rate):
\( \text{Recall} = \frac{TP}{TP + FN} \)
F1 Score
It is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account and therefore performs well on imbalanced datasets:
\( F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \)
The more general \( F_\beta \) score weights recall against precision:
\( F_\beta = (1 + \beta^2) \times \frac{\text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}} \)
Beta represents how many times recall is more important than precision. If recall is twice as important as precision, the value of Beta is 2.
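A minimal sketch of these metrics in Scikit-learn, reusing the made-up label vectors from the confusion-matrix example:
from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(precision_score(y_true, y_pred))       # TP / (TP + FP)
print(recall_score(y_true, y_pred))          # TP / (TP + FN)
print(f1_score(y_true, y_pred))              # harmonic mean of precision and recall
print(fbeta_score(y_true, y_pred, beta=2))   # weights recall twice as heavily as precision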