III Unit Mtech 2023

Duplicate Data

Handling Missing Values


• Removing the missing data: we can delete the rows or columns that contain missing values.
• Imputation: we replace the missing values by filling their positions with some other value.
• Forward fill or backward fill: forward fill replaces a missing value with the previous non-missing value, whereas backward fill replaces it with the next non-missing value.
• Interpolation: the missing values are predicted from the observed values in the dataset.
• Linear interpolation assumes a linear relationship between the observed values and the missing data points; it predicts the missing value by fitting a straight line between the two adjacent non-missing points.
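As an illustration only, here is a minimal pandas sketch of these options, assuming a toy DataFrame df with a single numeric column "value" (the names and data are hypothetical):

import pandas as pd
import numpy as np

# Hypothetical toy data with missing entries
df = pd.DataFrame({"value": [1.0, np.nan, 3.0, np.nan, 5.0]})

dropped = df.dropna()                            # removing rows with missing data
mean_imputed = df.fillna(df["value"].mean())     # imputation with the column mean
ffilled = df.ffill()                             # forward fill: previous non-missing value
bfilled = df.bfill()                             # backward fill: next non-missing value
interpolated = df.interpolate(method="linear")   # linear interpolation between neighbours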
Data discretization refers to a method of converting a large number of data values into a smaller set so that the evaluation and management of the data become easier. In other words, data discretization converts the values of continuous attributes into a finite set of intervals with minimal data loss. There are two forms of data discretization: supervised discretization and unsupervised discretization.
Some well-known data discretization techniques:
• Binarization is a process where numerical features are converted into
binary values based on a specified threshold. Values below the threshold
become 0, while values above or equal to the threshold become 1. This is
particularly useful when converting continuous data into discrete
categories.
The process of binarization involves the selection of a threshold value, and then
converting all pixel values below the threshold to 0 and all pixel values above
the threshold to 1. The choice of threshold is critical and can be determined
using various methods, including manual selection, global thresholding, or
adaptive thresholding.
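A minimal scikit-learn sketch of threshold-based binarization on a small illustrative array; the threshold of 0.5 and the data are arbitrary assumptions. Note that scikit-learn's Binarizer maps values strictly greater than the threshold to 1:

import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[0.2, 0.8], [0.5, 0.1], [0.9, 0.4]])   # toy continuous features

binarizer = Binarizer(threshold=0.5)   # values > 0.5 become 1, others become 0
X_bin = binarizer.fit_transform(X)
print(X_bin)                           # [[0. 1.] [0. 0.] [1. 0.]]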
•Manual Threshold Selection: In this approach, the threshold value is chosen
manually by inspecting the histogram of the image or based on domain
knowledge. This method is straightforward but may not be robust across
different images with varying lighting conditions or contrast levels.
•Global Thresholding:
•Global thresholding techniques use a single threshold value for the entire
image. A popular global thresholding method is Otsu's method, which selects
the threshold by minimizing the intra-class variance of the black and white
pixels, effectively separating the background from the foreground.
•Adaptive Thresholding: Adaptive or local thresholding methods determine the
threshold value based on the local neighborhoods of each pixel. This approach
is more flexible and can handle images with varying illumination by considering
the local context of each pixel.

• (global, local, optimization-based)
Sampling means selecting the group that you will actually collect data from in your research.

Purposive sampling: the sample units are selected with a definite purpose in view.
Sample Size
Architecture for feature subset
selection
Entropy-based discretization is a supervised, top-down splitting approach. It uses class distribution information in its calculation and determination of split-points: the method chooses the value of A that gives the minimum entropy as a split-point, and recursively partitions the resulting intervals to arrive at a hierarchical discretization.
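A minimal sketch of this idea on made-up data (the attribute values and labels are illustrative, not from the lecture): it scans the candidate split-points of attribute A and keeps the one with minimum expected (weighted) entropy.

import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(values, labels):
    """Return the split-point of attribute A with minimum weighted entropy."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_t, best_e = None, np.inf
    for i in range(1, len(values)):
        t = (values[i - 1] + values[i]) / 2            # candidate midpoint
        left, right = labels[values <= t], labels[values > t]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

# Hypothetical toy data
A = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])
y = np.array(["no", "no", "no", "yes", "yes", "yes"])
print(best_split(A, y))   # split around 5.5 with entropy 0.0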
d = √[(x2 – x1)² + (y2 – y1)²]   (Euclidean distance)

d = max(|x2 – x1|, |y2 – y1|)   (Chebyshev / maximum distance)
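A small NumPy sketch of common point-to-point distance measures (Euclidean, Manhattan, Chebyshev); the point values are arbitrary:

import numpy as np

p1 = np.array([1.0, 2.0])   # (x1, y1)
p2 = np.array([4.0, 6.0])   # (x2, y2)

euclidean = np.sqrt(np.sum((p2 - p1) ** 2))   # sqrt[(x2-x1)^2 + (y2-y1)^2] -> 5.0
manhattan = np.sum(np.abs(p2 - p1))           # |x2-x1| + |y2-y1|           -> 7.0
chebyshev = np.max(np.abs(p2 - p1))           # max(|x2-x1|, |y2-y1|)       -> 4.0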
The cosine similarity of two vectors is defined as the cosine of the angle between them, that is, the dot product of the vectors divided by the product of their lengths.
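A minimal NumPy sketch of this definition (the vectors are arbitrary examples):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# dot product divided by the product of the vector lengths
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)   # 1.0, since b points in the same direction as a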
The density-based clustering tool works by detecting areas where points are concentrated and where they are separated by areas that are empty or sparse.

A dissimilarity measure for cluster analysis is presented and used in the context of probabilistic distance (PD) clustering, e.g., using a Gaussian approach.
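As an illustration of density-based clustering (DBSCAN is one such method, not the specific tool named above), here is a minimal scikit-learn sketch on synthetic points; the eps and min_samples values are arbitrary assumptions:

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point in a sparse area
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],
              [4.5, 4.5]])

db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)   # e.g. [0 0 0 1 1 1 -1]; -1 marks a noise point in a sparse area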
The best data visualization tools include Google Charts, Tableau, Grafana, Chartist.js, FusionCharts, Datawrapper, Infogram, ChartBlocks, and D3.js.
Segmentation
• Data Segmentation is the process of taking the data you hold and dividing it up and
grouping similar data together based on the chosen parameters so that you can use it more
efficiently within marketing and operations. Examples of Data Segmentation could be:
Gender.

Types
Demographic, psychographic, behavioural and geographic segmentation are considered the four main types of market segmentation.
• Segmentation can also be approached in three main ways: firmographic, behavioural and needs-based.
• The most basic level of customer segmentation is demographics, also known as firmographics in B2B markets.
• Geographic segmentation splits your audience depending on where they are located (continent, country, region, city, district).
• Psychographic segmentation separates your audience by their personality (interests, attitudes, values).
• Behavioural segmentation divides your audience by their previous behaviour in relation to your brand.
• Needs-based segmentation groups your audience by the similar needs and/or benefits a particular group is seeking (problem-solving needs, emotions, functional needs, value alignment).
demographic transition is a phenomenon and theory which refers to the historical shift
from high birth rates and high death rates in societies with minimal technology,
education (especially of women) and economic development, to low birth rates and
low death rates in societies with advanced technology, education and economic
development, as well as the stages between these two scenarios.
Transactional segmentation, or RFM modelling, looks at the spending
patterns of your customers to identify who your most valuable customers are
and group them by behaviour.
The model catalogues customers according to:
•Recency. How recently a customer purchased from your business.
•Frequency. How often they purchase from you.
•Monetary. How much they spent.
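A minimal pandas sketch of computing recency, frequency and monetary value per customer from a hypothetical transactions table (the column names and data are assumptions, not part of the slide):

import pandas as pd

# Hypothetical transaction history: one row per purchase
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "date": pd.to_datetime(["2023-01-05", "2023-03-10", "2023-02-20",
                            "2023-01-15", "2023-02-25", "2023-03-28"]),
    "amount": [50.0, 20.0, 200.0, 10.0, 15.0, 30.0],
})

snapshot = tx["date"].max() + pd.Timedelta(days=1)
rfm = tx.groupby("customer_id").agg(
    recency=("date", lambda d: (snapshot - d.max()).days),  # days since last purchase
    frequency=("date", "count"),                            # number of purchases
    monetary=("amount", "sum"),                             # total spend
)
print(rfm)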
5 Image Segmentation Techniques

Segmentation is the process of classifying the market into several approachable groups.
•Edge-Based Segmentation.
•Threshold-Based Segmentation.
•Region-Based Segmentation.
•Cluster-Based Segmentation.
•Watershed Segmentation.
• The threshold segmentation process can be regarded as the process of
separating foreground from background. Threshold segmentation mainly
extracts foreground based on gray value information.
• Edge-based segmentation relies on edges found in an image using
various edge detection operators
• The basic idea of region splitting is to break the image into a set of
disjoint regions which are coherent within themselves
• Watershed is a region-based technique that utilizes image morphology
• Cluster-based: a method that performs pixel-wise image segmentation by grouping pixels into clusters, e.g., initially taking each point as a separate cluster
Types of Clustering

• Centroid-based Clustering.
• Density-based Clustering.
• Distribution-based Clustering.
• Hierarchical Clustering.
Data transformation
• Data transformation is the process of converting, cleansing, and structuring
data into a usable format that can be analyzed to support decision making
processes, and to propel the growth of an organization. Data transformation
is the process of converting raw data into a more suitable format or
structure for analysis, to improve its quality and make it compatible with the
requirements of a particular task or system.
Types
The most common types of data transformation are:
• Constructive: The data transformation process adds, copies, or replicates
data.
• Destructive: The system deletes fields or records.
• Aesthetic: The transformation standardizes the data to meet requirements or parameters.
Phases
• Understanding the four stages of digital transformation and what you need to move forward:
• Planning.
• Implementation.
• Acceleration.
• Measurement.
techniques
• The different types of data transformation
techniques such as manipulation,
normalization, attribute construction,
generalization, discretization, aggregation,
and smoothing can help solve various
problems that arise in data analysis projects.
• What are the 5 levels of transformation?
• The five stages of change are precontemplation, contemplation,
preparation, action, and maintenance.

Data Transformations Types


• Bucketing/Binning (data binning, also called data discrete binning or data bucketing, is a data pre-processing technique used to reduce the effects of minor observation errors; a short pandas sketch follows this list).
• Data Aggregation.
• Data Cleansing.
• Data Deduplication (data deduplication is a process that eliminates excessive copies of data and significantly decreases storage capacity requirements).
• Data Derivation.
• Data Filtering.
• Data Integration.
• Data Joining.
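A minimal pandas sketch of bucketing/binning, referenced above; the bin edges, labels and data are arbitrary assumptions:

import pandas as pd

ages = pd.Series([22, 25, 34, 41, 58, 63])

# Hand-chosen bin edges; labels mirror the age groups used later in the lecture
bins = [0, 30, 40, 100]
labels = ["<=30", "31-40", ">40"]
age_bins = pd.cut(ages, bins=bins, labels=labels)
print(age_bins.value_counts())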
Machine Learning Algorithm

CLASSIFIER
SVM—History and Applications
• Vapnik and colleagues (1992)—groundwork from Vapnik & Chervonenkis’
statistical learning theory in 1960s
• Features: training can be slow but accuracy is high owing to their ability to
model complex nonlinear decision boundaries (margin maximization)
• Used both for classification and prediction
• Applications:
– handwritten digit recognition, object recognition, speaker
identification, benchmarking time-series prediction tests



SVM—General Philosophy

(Figure: two candidate separating hyperplanes, one with a small margin and one with a large margin; the training tuples that lie on the margin boundaries are the support vectors.)


SVM—Linearly Separable
• A separating hyperplane can be written as
W · X + b = 0
where W = {w1, w2, ..., wn} is a weight vector and b a scalar (bias)
• For 2-D it can be written as
w0 + w1x1 + w2x2 = 0
• The hyperplanes defining the sides of the margin:
H1: w0 + w1x1 + w2x2 ≥ 1 for yi = +1, and
H2: w0 + w1x1 + w2x2 ≤ –1 for yi = –1
• Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
• This becomes a constrained (convex) quadratic optimization problem: quadratic objective function and linear constraints → Quadratic Programming (QP) → Lagrangian multipliers
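A minimal scikit-learn sketch of a linear SVM on a toy linearly separable dataset; the data are illustrative, and the large C value is an assumption used to approximate the hard-margin case:

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable 2-D data
X = np.array([[1, 1], [2, 1], [1, 2],      # class -1
              [5, 5], [6, 5], [5, 6]])     # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
print(clf.coef_, clf.intercept_)              # learned W and b of W.X + b = 0
print(clf.support_vectors_)                   # training tuples on the margin
print(clf.predict([[2, 2], [6, 6]]))          # -> [-1  1]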
SVM—When Data Is Linearly Separable

Let the data D be (X1, y1), …, (X|D|, y|D|), where Xi is a training tuple and yi its associated class label.
There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data).
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)

SVM vs. Neural Network
• SVM:
– Relatively new concept
– Deterministic algorithm
– Nice generalization properties
– Hard to learn: learned in batch mode using quadratic programming techniques
– Using kernels, can learn very complex functions
• Neural Network:
– Relatively old, but hot again
– Nondeterministic algorithm
– Generalizes well
– Can easily be learned in incremental fashion
– To learn complex functions, use a multilayer perceptron (not that trivial)
– Local minima
Bayesian Classification

CLASSIFIER
Bayesian Classification

• Statistical classifiers
• Based on Bayes' theorem
• Naïve Bayesian classification
• Class conditional independence
• Bayesian belief networks

Bayesian Theorem: Basics
• Let X be a data sample (“evidence”), class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(C|X), the probability that the hypothesis holds
given the observed data sample X
• P(C) (prior probability), the initial probability of C
– E.g., X will buy computer, regardless of age, income, …
• P(X): probability that sample data is observed
• P(X|C) (likelihood), the probability of observing the sample X, given that the
hypothesis H holds
– E.g., Given that X will buy computer, the prob. that X is 31..40, medium
income

Bayesian Theorem
• Given training data X, posteriori probability of a hypothesis H, P(C|X),
follows the Bayes theorem

P(C|X) = P(X|C) P(C) / P(X)
• Informally, this can be written as
posteriori = prior x likelihood / evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest among all
the P(Ck|X) for all the k classes
• Practical difficulty: require initial knowledge of many probabilities,
significant computational cost

Towards Naïve Bayesian Classifier
• Let D be a training set of tuples and their associated
class labels, and each tuple is represented by an n-D
attribute vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori,
i.e., the maximal P(Ci|X)
• This can be derived from Bayes’ theorem (1<= i <=
m)
• Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Derivation of Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):
P(X|Ci) = ∏(k = 1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
• This greatly reduces the computation cost: Only counts
the class distribution
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having
value xk for Ak divided by |Ci, D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))
and P(xk|Ci) = g(xk, μCi, σCi)
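A minimal sketch of this Gaussian likelihood for a continuous attribute; the class statistics μ = 38 and σ = 12 are arbitrary illustrative values:

import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma): Gaussian density used as P(xk|Ci) for a continuous attribute."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Hypothetical: mean and standard deviation of 'age' among tuples of class Ci
mu_ci, sigma_ci = 38.0, 12.0
print(gaussian(35.0, mu_ci, sigma_ci))   # P(age = 35 | Ci) under the Gaussian assumption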
Naïve Bayesian Classifier: Training Dataset
Class: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
Naïve Bayesian Classifier: An Example
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357

• Compute P(X|Ci) for each class


P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)

P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044


P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007

Therefore, X belongs to class (“buys_computer = yes”)
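A short Python sketch that reproduces the arithmetic of this example (the counts are taken from the training table above):

# Naive Bayes arithmetic for X = (age<=30, income=medium, student=yes, credit_rating=fair)
p_yes, p_no = 9 / 14, 5 / 14                              # prior probabilities P(Ci)

likelihood_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)    # P(X|yes) ~ 0.044
likelihood_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)     # P(X|no)  ~ 0.019

score_yes = likelihood_yes * p_yes                        # ~ 0.028
score_no = likelihood_no * p_no                           # ~ 0.007
print("yes" if score_yes > score_no else "no")            # -> yes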

Naïve Bayesian Classifier: Comments
• Advantages
– Easy to implement
– Good results obtained in most of the cases
• Disadvantages
– Assumption: class conditional independence, therefore loss of
accuracy
– Practically, dependencies exist among variables
• E.g., salary and age.
Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by Naïve Bayesian
Classifier
• How to deal with these dependencies?
– Bayesian Belief Networks
Play-tennis example: estimating P(xi|C)
Outlook    Temperature   Humidity   Windy   Class
sunny      hot           high       false   N
sunny      hot           high       true    N
overcast   hot           high       false   P
rain       mild          high       false   P
rain       cool          normal     false   P
rain       cool          normal     true    N
overcast   cool          normal     true    P
sunny      mild          high       false   N
sunny      cool          normal     false   P
rain       mild          normal     false   P
sunny      mild          normal     true    P
overcast   mild          high       true    P
overcast   hot           normal     false   P
rain       mild          high       true    N

P(p) = 9/14, P(n) = 5/14

outlook:      P(sunny|p) = 2/9      P(sunny|n) = 3/5
              P(overcast|p) = 4/9   P(overcast|n) = 0
              P(rain|p) = 3/9       P(rain|n) = 2/5
temperature:  P(hot|p) = 2/9        P(hot|n) = 2/5
              P(mild|p) = 4/9       P(mild|n) = 2/5
              P(cool|p) = 3/9       P(cool|n) = 1/5
humidity:     P(high|p) = 3/9       P(high|n) = 4/5
              P(normal|p) = 6/9     P(normal|n) = 2/5
windy:        P(true|p) = 3/9       P(true|n) = 3/5
              P(false|p) = 6/9      P(false|n) = 2/5
Play-tennis example: classifying X

• An unseen sample X = <rain, hot, high, false>

• P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) =
3/9·2/9·3/9·6/9·9/14 = 0.010582
• P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) =
2/5·2/5·4/5·2/5·5/14 = 0.018286

• Sample X is classified in class n (don’t play)

How effective are Bayesian classifiers?

• The class conditional independence assumption makes the computation feasible
• It yields optimal classifiers when the assumption is satisfied
• But the assumption is seldom satisfied in practice, as attributes (variables) are often correlated
• Attempts to overcome this limitation:
– Bayesian networks, that combine Bayesian reasoning with
causal relationships between attributes
– Decision trees, that reason on one attribute at the time,
considering most important attributes first

Using IF-THEN Rules for Classification

• Represent the knowledge in the form of IF-THEN rules


R: IF age = youth AND student = yes THEN buys_computer = yes
– Rule antecedent/precondition vs. rule consequent
• Assessment of a rule: coverage and accuracy (a small sketch follows this list)
– ncovers = # of data points covered by R
– ncorrect = # of data points correctly classified by R
coverage(R) = ncovers / |D|   /* D: training data set */
accuracy(R) = ncorrect / ncovers
• If more than one rule is triggered, we need conflict resolution
– Size ordering: assign the highest priority to the triggering rule that has the “toughest” requirement (i.e., with the most attribute tests)
– Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality (e.g., accuracy) or by experts
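The sketch referenced above: a minimal computation of coverage and accuracy for one rule over a hypothetical tuple list (the rule and the data are illustrative, not from the lecture):

# Rule R: IF age = "youth" AND student = "yes" THEN buys_computer = "yes"
D = [  # hypothetical training tuples: (age, student, buys_computer)
    ("youth", "yes", "yes"),
    ("youth", "yes", "no"),
    ("youth", "no", "no"),
    ("senior", "yes", "yes"),
]

covered = [t for t in D if t[0] == "youth" and t[1] == "yes"]   # tuples matching the antecedent
correct = [t for t in covered if t[2] == "yes"]                 # covered tuples with the predicted class

coverage = len(covered) / len(D)          # n_covers / |D|        -> 0.5
accuracy = len(correct) / len(covered)    # n_correct / n_covers  -> 0.5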

Rule Extraction from a Decision Tree
(Decision tree figure: the root node tests age? with branches <=30, 31..40 and >40; the <=30 branch leads to a student? node, the >40 branch to a credit rating? node, and each leaf holds a yes/no class prediction.)

• Rules are easier to understand than large trees
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
• Rules are mutually exclusive and exhaustive
• Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = yes
IF age = old AND credit_rating = fair THEN buys_computer = no
