
CSE445 NSU Week - 2

The document discusses ZeroR and OneR classifiers, which are simple classification methods used in supervised learning. ZeroR predicts the majority category without considering predictors, serving as a baseline for model performance, while OneR generates a single rule for each predictor and selects the one with the smallest total error. Additionally, it covers model evaluation techniques, confusion matrices, and tools like WEKA and Anaconda for machine learning applications.

Uploaded by

Rabiul

ZeroR and OneR Classifier

Classifier (Recall)

►Supervised learning works with labelled data (X, y).
►X is the data, consisting of instances with attributes <x1, x2, ..., xn> (n features).
►y is the label.
►If y is discrete/categorical, the problem is a classification problem and a classifier is required.
►If y is continuous, the problem is a regression problem.

Categorical/Discrete variables
►A categorical (discrete) variable is one that takes two or more categories (values).
►There are two kinds of categorical variable:
►Nominal
►A nominal variable has no intrinsic ordering among its categories.
►E.g., gender is a categorical variable with values {Male, Female}, which have no intrinsic ordering.
►Ordinal (related to order)
►An ordinal variable has a clear ordering.
►E.g., Temperature = {low, medium, high}

ZeroR classifier
►ZeroR stands for Zero Rule.
►It is the simplest classification method: it relies only on the target (output, y) and ignores all predictors (features, X).
►The ZeroR classifier simply predicts the majority category.
►Although ZeroR has no predictive power, it is useful for establishing a baseline performance as a benchmark for other classification methods.
Algorithm
Construct a frequency table for the target and select the most frequent value.

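The algorithm above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the label list `y` is an assumption, chosen to contain 9 "Yes" and 5 "No" values so that it is consistent with the class counts implied later in these slides for the "play golf" dataset.

```python
from collections import Counter

# Assumed target column: 9 "Yes" and 5 "No" labels (illustrative ordering).
y = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
     "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

# Step 1: build the frequency table of the target.
frequency_table = Counter(y)

# Step 2: pick the most frequent value; ZeroR predicts it for every instance.
majority_class, majority_count = frequency_table.most_common(1)[0]

# Training accuracy is the share of instances that belong to the majority class.
training_accuracy = majority_count / len(y)

print(frequency_table)
print(majority_class, round(training_accuracy, 3))
```

Because the predictors are never consulted, the same prediction is returned for every instance, which is what makes ZeroR a baseline rather than a useful model.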
ZeroR classifier
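For binary classification: accuracy ≥ 50%, because the majority class covers at least half of the instances
Can be tested on an imbalanced dataset
Training accuracy = (number of instances in the majority class) / (total number of instances)

frequency table
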
Classification Model Evaluation
►Models need to be evaluated, so a model evaluation technique must be in place.
►Confusion Matrix
►A confusion matrix shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target values) in the data.
►The matrix is N × N, where N is the number of classes in the target variable.

Terminologies (Related to Confusion Matrix)
►Accuracy: The proportion of the total number of predictions that were correct.
►Positive Predictive Value or Precision: The proportion of predicted positive cases that are actually positive.
►Precision = TP / (TP + FP)
►Negative Predictive Value: The proportion of predicted negative cases that are actually negative.
►Sensitivity or Recall: The proportion of actual positive cases that are correctly identified.
►Recall = TP / (TP + FN)
►Specificity: The proportion of actual negative cases that are correctly identified.
►F1-score: Harmonic mean of precision and recall, F1 = 2 × Precision × Recall / (Precision + Recall)

2 × 2 Confusion Matrix for two classes (Positive and Negative)

Confusion Matrix           Target (Actual)
                           Positive       Negative
Model        Positive      a (TP)         b (FP)         Positive Predictive Value = a/(a+b)
(Predicted)  Negative      c (FN)         d (TN)         Negative Predictive Value = d/(c+d)
                           Sensitivity    Specificity    Accuracy = (a+d)/(a+b+c+d)
                           a/(a+c)        d/(b+d)

a = True Positive (TP)
b = False Positive (FP)
c = False Negative (FN)
d = True Negative (TN)
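
The formulas in the 2 × 2 matrix translate directly into code. The sketch below uses a hypothetical helper name, `confusion_metrics`, and assumes every denominator is non-zero. The example counts (TP = 7, FP = 2, FN = 2, TN = 3) are the ones reported later in these slides for the OneR model.

```python
def confusion_metrics(a, b, c, d):
    """Compute the metrics of a 2 x 2 confusion matrix.

    a = TP, b = FP, c = FN, d = TN, following the notation above.
    Assumes no denominator is zero.
    """
    precision = a / (a + b)          # positive predictive value
    recall = a / (a + c)             # sensitivity
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "precision": precision,
        "negative_predictive_value": d / (c + d),
        "recall": recall,
        "specificity": d / (b + d),
        "f1": 2 * precision * recall / (precision + recall),
    }


metrics = confusion_metrics(a=7, b=2, c=2, d=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```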
Terminologies (Related to Confusion Matrix)
Precision-recall trade-off
►Precision and recall: important for imbalanced/skewed datasets
►E.g., cancer prediction, spam email prediction
►TPR (True Positive Rate) = TP / (TP + FN)
►FPR (False Positive Rate) = FP / (FP + TN)
►ROC curve: TPR vs. FPR
►AUC: Area Under the ROC Curve
►The closer the ROC AUC is to 1, the better.

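To make the ROC curve and AUC concrete, here is a small pure-Python sketch. The labels and scores are made-up illustrative values, and `roc_points` and `auc` are hypothetical helper names. In practice a library routine would normally be used instead.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; an instance is predicted positive
    # when its score is at or above the threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve through the given points (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Illustrative data: 1 = positive class, 0 = negative class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
roc_auc = auc(points)
print(points)
print(roc_auc)
```

Each point on the curve corresponds to one threshold, which is why the ROC curve summarises the trade-off between catching positives (TPR) and raising false alarms (FPR).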
Confusion Matrix of the ZeroR Classifier for the “play golf” dataset

Confusion Matrix for a multiclass classification problem

Precision and recall are calculated for EACH class/category.
The arithmetic (macro) average is the unweighted mean of the per-class scores.
The weighted average takes into account how many samples/instances of each class there are, so a class with fewer instances has less impact on the weighted average of precision, recall, and F1-score.

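The difference between the macro and the weighted average can be seen on a tiny three-class example. The labels below are made up for illustration, and `precision_for` is a hypothetical helper. Only precision is shown; recall and F1 are averaged the same way.

```python
from collections import Counter

# Illustrative true and predicted labels for three classes.
y_true = ["a", "a", "a", "a", "a", "a", "b", "b", "b", "c"]
y_pred = ["a", "a", "a", "a", "a", "b", "b", "b", "a", "c"]


def precision_for(label, y_true, y_pred):
    """Precision of one class: correct predictions of it / all predictions of it."""
    true_of_predicted = [t for t, p in zip(y_true, y_pred) if p == label]
    if not true_of_predicted:
        return 0.0
    return sum(1 for t in true_of_predicted if t == label) / len(true_of_predicted)


labels = sorted(set(y_true))
support = Counter(y_true)  # number of true instances per class
per_class = {label: precision_for(label, y_true, y_pred) for label in labels}

# Macro average: every class counts equally.
macro_precision = sum(per_class.values()) / len(labels)

# Weighted average: every class counts in proportion to its support.
weighted_precision = sum(per_class[l] * support[l] for l in labels) / len(y_true)

print(per_class)
print(round(macro_precision, 3), round(weighted_precision, 3))
```

Here class "c" has a single instance, so its perfect precision lifts the macro average more than it lifts the weighted average.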
ML problem steps
►EDA: Exploratory Data Analysis
►Data preprocessing: remove duplicate entries, missing entries, and incorrect data; detect outliers and noise
►Outlier: a data point that differs significantly from other observations
►Label and one-hot encoding: convert categorical values to numerical data
►Feature scaling: rescale the independent features to a fixed range such as [0, 1] (min-max normalization)
►Hyperparameter optimization/tuning: choosing a set of optimal hyperparameters for a learning algorithm

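The encoding and scaling steps can be illustrated without any library. The sketch below is only an assumption about how these steps are typically done: the helper names and the sample values are hypothetical, and `min_max_scale` assumes the column is not constant.

```python
def label_encode(values):
    """Map each category to an integer code (categories sorted alphabetically)."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Turn each category into a 0/1 vector with one position per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories


def min_max_scale(numbers):
    """Rescale numbers linearly to the range [0, 1]; assumes max > min."""
    low, high = min(numbers), max(numbers)
    return [(x - low) / (high - low) for x in numbers]


# Illustrative columns.
outlook = ["Sunny", "Overcast", "Rainy", "Sunny"]
temperature = [64, 85, 70, 75]

codes, mapping = label_encode(outlook)
vectors, categories = one_hot_encode(outlook)
scaled = min_max_scale(temperature)

print(codes, mapping)
print(categories, vectors)
print(scaled)
```

Label encoding implies an order between the codes, so it suits ordinal variables, while one-hot encoding avoids that implied order for nominal variables.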
ML library
►WEKA: Machine Learning Software in Java
►Contains tools for data preparation, classification, regression, clustering,
association rules mining, and visualization
►Developed at the University of Waikato, New Zealand
►Anaconda: Anaconda is a distribution of the Python and R programming
languages for scientific computing (data science, machine learning applications,
large-scale data processing, predictive analytics, etc.), that aims to simplify
package management and deployment
►Anaconda Navigator is a desktop graphical user interface (GUI) included in
Anaconda distribution that allows users to launch applications and manage
conda packages, environments and channels without using command-line
commands

WEKA

►Weka is a collection of machine learning algorithms for data mining tasks.
►It contains tools for data preparation, classification, regression, clustering, association rules mining, and visualization.
►Weka supports deep learning!
►Built in Java
►Link: https://www.cs.waikato.ac.nz/ml/weka/

Weka Explorer

Weka File Selection

Weka

Weka Results

Weka Results

Anaconda libraries
►Jupyter Notebook: the Jupyter Notebook App is a server-client application that allows editing and running notebook documents via a web browser
►Pandas: reading and writing data, data preprocessing (data handling)
►Matplotlib, seaborn: data visualization libraries
►Sklearn (Scikit-learn): ML library
►TensorFlow: library for machine learning and deep neural networks
►NumPy (Numerical Python): support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays
►Change directory from Anaconda Prompt
►Jupyter Notebook cell border: green = edit mode; blue = command mode

Google Colab
►Free, cloud-hosted notebook service from Google
►Provides free GPU support
►Kaggle: a subsidiary of Google LLC; an online community of data scientists and machine learning practitioners.
►Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges

OneR Classifier
►OneR stands for “One Rule”
►A simple, yet accurate, classification algorithm that generates one rule for each
predictor in the data, then selects the rule with the smallest total error as its
“one rule”.
OneR Algorithm:
For each predictor,
    For each value of that predictor, make a rule as follows:
        Count how often each value of the target (class) appears
        Find the most frequent class
        Make the rule assign that class to this value of the predictor
    Calculate the total error of the rules of each predictor
Choose the predictor with the smallest total error.

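The OneR steps above can be sketched in plain Python. The rows below are the commonly used 14-instance weather ("play golf") data, which is assumed here to be the dataset behind these slides. The column names and the helper `one_r` are likewise illustrative choices, not taken from the slides' own code.

```python
from collections import Counter, defaultdict

COLUMNS = ["Outlook", "Temp", "Humidity", "Windy", "Play"]
ROWS = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]


def one_r(rows, columns, target):
    """Return (best predictor, its rule, total error of every predictor)."""
    target_index = columns.index(target)
    rules, errors = {}, {}
    for index, predictor in enumerate(columns):
        if index == target_index:
            continue
        # Count how often each class appears for each value of the predictor.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[index]][row[target_index]] += 1
        # The rule assigns the most frequent class to each predictor value.
        rules[predictor] = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Total error: instances that do not belong to the most frequent class.
        errors[predictor] = sum(
            sum(c.values()) - c.most_common(1)[0][1] for c in counts.values()
        )
    # min() keeps the first predictor when several share the smallest error.
    best = min(errors, key=errors.get)
    return best, rules[best], errors


best, rule, errors = one_r(ROWS, COLUMNS, "Play")
print(errors)
print(best, rule)
```

In this sketch ties are broken by column order, which is one possible convention; other implementations may break ties differently.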
OneR classifier

OneR classifier
Outlook_error = 2 + 0 + 2 = 4 (smallest error; winning predictor)
Temp_error = 2 + 2 + 1 = 5
Humidity_error = 3 + 1 = 4 (ties with Outlook; Outlook is kept as the winner)
Windy_error = 2 + 3 = 5

Model Evaluation (OneR classifier)

TP = 7
FP = 2
FN = 2
TN = 3

Machine learning in Python

Scikit-learn: Free software machine learning library for the Python programming language

Pandas: Software library written for the Python programming language for data manipulation and analysis

Matplotlib: A plotting library for the Python programming language

NumPy: Library for the Python programming language used for working with arrays and matrices

ZeroR classifier using Python

imports
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import pandas as pd

ZeroR classifier using Python

Reading dataset
# Raw string (r'...') keeps the backslashes in the Windows path literal
df = pd.read_csv(r'E:\ML-teaching\Sample Datasets\weather.csv')

Try the following

print(df)
print(df.shape)
print(df.head())
print(df.describe())

ZeroR classifier using Python

Separate X and y

X = df.drop(columns = ['Play'])
y = df['Play']

ZeroR classifier using Python

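Train and Test

# ZeroR: always predict the most frequent class seen during fit
model = DummyClassifier(strategy='most_frequent', random_state=0)
model.fit(X, y)
# Predictions are made on the training data, so this is training accuracy
predictions = model.predict(X)
score = accuracy_score(y, predictions)
print(score)
print(confusion_matrix(y, predictions))
# zero_division=0 avoids warnings for the class that is never predicted
print(classification_report(y, predictions, zero_division=0))
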
Future

Ideally, the dataset should be divided into a train set and a test set.
The train set should be used to create the model.
The test set should be used to test how well the model predicts unseen cases.

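One possible way to follow this advice with scikit-learn is sketched below. The features are placeholders (ZeroR ignores them), the labels are an assumed 9 "Yes" / 5 "No" target, and the 30% test size and stratified split are illustrative choices rather than anything prescribed by the slides.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder features: ZeroR never looks at X, so any values work here.
X = [[i] for i in range(14)]
# Assumed target column with 9 "Yes" and 5 "No" labels.
y = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
     "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

# Hold out 30% of the instances; stratify keeps the class ratio similar
# in both parts, and random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit on the train set only, then evaluate on the held-out test set.
model = DummyClassifier(strategy="most_frequent")
model.fit(X_train, y_train)
test_predictions = model.predict(X_test)
test_score = accuracy_score(y_test, test_predictions)

print(len(X_train), len(X_test))
print(test_score)
```

The score printed here is a test accuracy, which is a fairer estimate of performance on new data than the training accuracy computed on the full dataset.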
