CSE445 NSU Week - 2
Classifier
Classifier (Recall)
Categorical/Discrete variables
►A categorical (discrete) variable is one that has two or more categories (values).
►There are two kinds of categorical variables:
►Nominal
►A nominal variable has no intrinsic ordering of its categories.
►E.g., gender is a categorical variable with values {Male, Female}, which have no intrinsic ordering.
►Ordinal (related to order)
►An ordinal variable has a clear ordering of its categories.
►E.g., Temperature = {low, medium, high}
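The difference between nominal and ordinal variables can be illustrated with a minimal sketch. The sample values and the variable names (`rank`, `readings`, `genders`) are made up for illustration: an ordinal variable supports sorting by its category order, while for a nominal variable only counting and equality checks are meaningful.

```python
from collections import Counter

# Ordinal: the categories have an explicit order, so values can be ranked.
temperature_levels = ["low", "medium", "high"]  # order matters
rank = {level: position for position, level in enumerate(temperature_levels)}

readings = ["high", "low", "medium", "low"]  # hypothetical observations
sorted_readings = sorted(readings, key=lambda level: rank[level])
print(sorted_readings)

# Nominal: no order exists, so we only count how often each category occurs.
genders = ["Male", "Female", "Female"]  # hypothetical observations
gender_counts = Counter(genders)
print(gender_counts)
```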
ZeroR classifier
►ZeroR stands for Zero Rule
►Simplest classification method that relies on the target (output, y) and ignores all
predictors (features, X).
►ZeroR classifier simply predicts the majority category (class).
►Although ZeroR has no predictive power, it is useful for determining a
baseline performance as a benchmark for other classification methods.
Algorithm
Construct a frequency table for the target and select the most frequent value.
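The algorithm above can be sketched in a few lines of plain Python. The function name `zero_r` and the sample labels are assumptions made for this illustration; it is not a library API.

```python
from collections import Counter


def zero_r(targets):
    """Return the most frequent target value, ignoring all predictors."""
    frequency_table = Counter(targets)  # class -> number of occurrences
    majority_class, _count = frequency_table.most_common(1)[0]
    return majority_class


labels = ["yes", "no", "yes", "yes", "no"]  # hypothetical target column
prediction = zero_r(labels)
print(prediction)  # the same class is predicted for every instance
```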
ZeroR classifier
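For binary classification, the training accuracy of ZeroR is at least 50%, because the majority class covers at least half of the instances.
ZeroR is a useful check on imbalanced datasets, where always predicting the majority class already gives a high accuracy.
Training accuracy = (count of the most frequent class in the frequency table) / (total number of instances)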
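As a rough illustration of why the baseline matters on imbalanced data, the sketch below computes the ZeroR training accuracy from a frequency table. The 95/5 class split is invented for the example and does not come from any dataset in this course.

```python
from collections import Counter

# Hypothetical imbalanced target column: 95 negatives, 5 positives.
labels = ["negative"] * 95 + ["positive"] * 5

frequency_table = Counter(labels)
majority_class, majority_count = frequency_table.most_common(1)[0]

# ZeroR predicts the majority class everywhere, so its training accuracy
# is simply the share of the majority class.
training_accuracy = majority_count / len(labels)
print(majority_class, training_accuracy)
```

A model that scores about the same accuracy as this baseline has learned little from the predictors, even though the number looks high.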
Classification Model Evaluation
►Models need to be evaluated, so model evaluation techniques need to be
in place.
►Confusion Matrix
►A confusion matrix shows the number of correct and incorrect predictions made by the
classification model compared to the actual outcomes (target value) in the data.
►The matrix is N × N, where N is the number of classes in the target variable.
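A minimal sketch of building an N × N confusion matrix by hand is shown here. The helper name `confusion_matrix_counts` and the three-class sample data are assumptions for illustration. Rows are predicted classes and columns are actual classes, matching the layout used in these slides; note that scikit-learn's `confusion_matrix` uses the opposite orientation (rows are actual, columns are predicted).

```python
def confusion_matrix_counts(actual, predicted, classes):
    """Count predictions in an N x N table: rows = predicted, columns = actual."""
    index = {label: position for position, label in enumerate(classes)}
    size = len(classes)
    matrix = [[0] * size for _ in range(size)]
    for actual_label, predicted_label in zip(actual, predicted):
        matrix[index[predicted_label]][index[actual_label]] += 1
    return matrix


classes = ["low", "medium", "high"]  # N = 3
actual = ["low", "low", "medium", "high", "high", "medium"]      # hypothetical
predicted = ["low", "medium", "medium", "high", "low", "medium"]  # hypothetical

matrix = confusion_matrix_counts(actual, predicted, classes)
correct = sum(matrix[i][i] for i in range(len(classes)))  # diagonal = correct predictions
print(matrix, correct)
```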
Terminology (Related to the Confusion Matrix)
Here TP, FP, FN, and TN are the counts of true positives, false positives, false negatives, and true negatives.
►Accuracy: The proportion of the total number of predictions that were correct.
►Positive Predictive Value or Precision: The proportion of predicted positive cases
that are actually positive.
►Precision = TP / (TP + FP)
►Negative Predictive Value: The proportion of predicted negative cases that are
actually negative.
►Sensitivity or Recall: The proportion of actual positive cases that are correctly
identified.
►Recall = TP / (TP + FN)
►Specificity: The proportion of actual negative cases that are correctly identified.
►F1-score: Harmonic mean of precision and recall.
►F1 = 2 × Precision × Recall / (Precision + Recall)
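The definitions above translate directly into code. The following is a sketch, not a library function: `binary_metrics` is a hypothetical helper, the counts passed to it are invented, and it assumes no denominator is zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common binary-classification metrics from the four counts."""
    precision = tp / (tp + fp)  # positive predictive value
    recall = tp / (tp + fn)     # sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "negative_predictive_value": tn / (tn + fn),
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```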
2 × 2 Confusion Matrix for two classes (Positive and Negative)

                          Target (Actual)
                          Positive       Negative
Model        Positive     a (TP)         b (FP)         Positive Predictive Value = a/(a+b)
(Predicted)  Negative     c (FN)         d (TN)         Negative Predictive Value = d/(c+d)
                          Sensitivity    Specificity    Accuracy = (a+d)/(a+b+c+d)
                          a/(a+c)        d/(b+d)
Confusion Matrix of the ZeroR Classifier for the “play golf” dataset
Confusion Matrix for a multiclass classification problem
ML library
►WEKA: Machine Learning Software in Java
►Contains tools for data preparation, classification, regression, clustering,
association rules mining, and visualization
►Developed at the University of Waikato, New Zealand
►Anaconda: a distribution of the Python and R programming languages for
scientific computing (data science, machine learning applications, large-scale
data processing, predictive analytics, etc.) that aims to simplify package
management and deployment
►Anaconda Navigator is a desktop graphical user interface (GUI) included in
Anaconda distribution that allows users to launch applications and manage
conda packages, environments and channels without using command-line
commands
WEKA
Weka Explorer
Weka File Selection
Weka
Weka Results
Weka Results
Anaconda libraries
►Jupyter Notebook: a server-client application that allows editing and running
notebook documents via a web browser
►Pandas: reading and writing data, data preprocessing, and general data handling
►Matplotlib, seaborn: data visualization libraries
►Sklearn (Scikit-learn): machine learning library
►TensorFlow: library for machine learning and deep neural networks
►NumPy: numerical Python; provides large, multi-dimensional arrays and matrices,
along with a large collection of high-level mathematical functions to operate on
these arrays
►Change directory from the Anaconda Prompt
►In Jupyter Notebook, a green cell border indicates edit mode; a blue border indicates command mode
Google Colab
►A free cloud service hosted by Google
►Provides free GPU support
►Kaggle: a subsidiary of Google LLC and an online community of data scientists and
machine learning practitioners.
►Kaggle allows users to find and publish data sets, explore and build models in a
web-based data-science environment, work with other data scientists and
machine learning engineers, and enter competitions to solve data science
challenges
OneR Classifier
►OneR stands for “One Rule”
►A simple, yet accurate, classification algorithm that generates one rule for each
predictor in the data, then selects the rule with the smallest total error as its
“one rule”.
OneR Algorithm:
For each predictor:
    For each value of that predictor, make a rule as follows:
        Count how often each value of the target (class) appears
        Find the most frequent class
        Make the rule assign that class to this value of the predictor
    Calculate the total error of the rules of each predictor
Choose the predictor with the smallest total error.
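The steps above can be sketched in plain Python as follows. The function name `one_r`, the dictionary-per-row data layout, and the six-row toy dataset are all assumptions made for this illustration; the toy rows are not the "play golf" dataset. When two predictors tie on total error, this sketch keeps the first one encountered.

```python
from collections import Counter, defaultdict


def one_r(rows, target):
    """Return (best_predictor, rule, total_error) for a list of dict rows."""
    predictors = [name for name in rows[0] if name != target]
    best = None
    for predictor in predictors:
        # Frequency table: predictor value -> counts of each target class.
        table = defaultdict(Counter)
        for row in rows:
            table[row[predictor]][row[target]] += 1
        # Rule: assign the most frequent class to each predictor value.
        rule = {value: counts.most_common(1)[0][0] for value, counts in table.items()}
        # Total error: instances that do not belong to the majority class of their value.
        error = sum(sum(counts.values()) - max(counts.values()) for counts in table.values())
        if best is None or error < best[2]:
            best = (predictor, rule, error)
    return best


# Hypothetical toy data with two predictors and a binary target.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

best_predictor, best_rule, best_error = one_r(rows, target="play")
print(best_predictor, best_rule, best_error)
```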
OneR classifier
OneR classifier
Outlook_error = 2 + 0 + 2 = 4 (smallest error, tied with Humidity; chosen as the winning predictor)
Temp_error = 2 + 2 + 1 = 5
Humidity_error = 3 + 1 = 4
Windy_error = 2 + 3 = 5
Model Evaluation (OneR classifier)
TP = 7
FP = 2
FN = 2
TN = 3
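From the four counts above, the usual evaluation metrics follow by direct arithmetic. The sketch below uses exact fractions so the results are easy to check by hand; the variable names are chosen for this example.

```python
from fractions import Fraction

# Counts reported above for the OneR classifier.
tp, fp, fn, tn = 7, 2, 2, 3

accuracy = Fraction(tp + tn, tp + fp + fn + tn)  # (7 + 3) / 14
precision = Fraction(tp, tp + fp)                # 7 / 9
recall = Fraction(tp, tp + fn)                   # 7 / 9
specificity = Fraction(tn, tn + fp)              # 3 / 5
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, specificity, f1)
print(float(accuracy))  # about 0.714
```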
Machine learning in Python
imports
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import pandas as pd
ZeroR classifier using Python
Reading dataset
# Raw string (r'...') so that backslashes in the Windows path are not treated as escapes
df = pd.read_csv(r'E:\ML-teaching\Sample Datasets\weather.csv')
ZeroR classifier using Python
Separate X and y
X = df.drop(columns = ['Play'])
y = df['Play']
ZeroR classifier using Python
Ideally, we should divide the dataset into a train set and a test set.
The train set should be used to create the model.
The test set should be used to test how well the model predicts
unlabeled cases.
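One possible way to put the split into practice with scikit-learn is sketched below. Because the weather CSV is not available here, the sketch uses a small made-up dataset held in plain lists instead of the DataFrame; the 30% test size, `random_state=0`, and stratified split are choices made for this illustration, not settings taken from the course. `DummyClassifier(strategy="most_frequent")` plays the role of ZeroR.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical data: one placeholder feature and an imbalanced binary target.
X = [[i] for i in range(20)]
y = ["yes"] * 13 + ["no"] * 7

# Hold out 30% of the rows for testing; stratify keeps the class ratio similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# ZeroR: learn the majority class from the train set only.
model = DummyClassifier(strategy="most_frequent")
model.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
# scikit-learn layout: rows = actual classes, columns = predicted classes.
matrix = confusion_matrix(y_test, y_pred, labels=["yes", "no"])

print(test_accuracy)
print(matrix)
```

Since ZeroR never predicts the minority class, the "no" column of the matrix is all zeros, and the test accuracy equals the share of "yes" in the test set.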