
Advanced Python for NLP

Word embedding

GloVe (Global Vectors for Word Representation)

GloVe is a popular unsupervised machine learning algorithm for learning dense vector representations (embeddings) of words based on their co-occurrence statistics in a corpus. The algorithm was introduced by researchers at Stanford University in 2014 and has since become one of the most widely used methods for creating word embeddings.

The basic idea behind GloVe is to create a vector space in which each word is represented by a
dense vector of real-valued numbers. This vector should capture some of the semantic and syntactic
information associated with the word, so that words that are similar in meaning or that tend to occur
in similar contexts will have similar vector representations.
Word embedding
The training process for GloVe involves constructing a co-occurrence matrix that records the
number of times each word appears in the same context as every other word in the corpus. A
"context" here is a window of a fixed number of words on either side of the target word.

Once the co-occurrence matrix is constructed, GloVe applies a process of matrix factorization
to learn the word embeddings. The goal of this process is to find a low-dimensional
representation of the co-occurrence matrix that preserves the relative frequencies of word co-
occurrences. The resulting word embeddings are dense, real-valued vectors that capture the
meaning and context of each word.
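As a concrete illustration of using such embeddings from Python (this is not part of the GloVe training procedure itself), the sketch below assumes a pre-trained vector file such as glove.6B.100d.txt from the Stanford GloVe project has already been downloaded; the file name and the query word are illustrative assumptions.

```python
# Minimal sketch: load pre-trained GloVe vectors and find nearest neighbours
# by cosine similarity. Assumes "glove.6B.100d.txt" is available locally;
# each line of that file is "word v1 v2 ... v100".
import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

def most_similar(word, embeddings, topn=5):
    """Return the topn words whose vectors are most similar to `word`."""
    target = embeddings[word]
    target = target / np.linalg.norm(target)
    scores = {
        other: float(np.dot(target, vec / np.linalg.norm(vec)))
        for other, vec in embeddings.items() if other != word
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

vectors = load_glove("glove.6B.100d.txt")   # hypothetical local path
print(most_similar("king", vectors))        # semantically close words
```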
SVM—Support Vector Machines
• A classification method for both linear and nonlinear data
• It uses a nonlinear mapping to transform the original training
data into a higher dimension
• With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
• With an appropriate nonlinear mapping to a sufficiently high
dimension, data from two classes can always be separated
by a hyperplane
• SVM finds this hyperplane using support vectors
(“essential” training tuples) and margins (defined by the
support vectors)
SVM—History and Applications
• Vapnik and colleagues (1992)—groundwork from Vapnik &
Chervonenkis’ statistical learning theory in 1960s
• Features: training can be slow but accuracy is high owing
to their ability to model complex nonlinear decision
boundaries (margin maximization)
• Used for: classification and numeric prediction
• Applications:
• handwritten digit recognition, object recognition,
speaker identification, benchmarking time-series
prediction tests
Let’s Learn This Like a Five-Year-Old

We have two colors of balls on the table that we want to separate.
We get a stick and put it on the table. This works pretty well, right?
Some villain comes and places more balls on the table. The stick still kind of works, but one of the balls is on the wrong side, and there is probably a better place to put the stick now.
SVMs try to put the stick
in the best possible
place by having as big a
gap on either side of the
stick as possible.
Now, when the villain returns, the stick is still in a pretty good spot.
There is another trick in the
SVM toolbox that is even
more important. Say the
villain has seen how good
you are with a stick so he
gives you a new challenge.
There’s no stick in the world
that will let you split those
balls well, so what do you
do? You flip the table, of course, throwing the balls into the air. Then, with your
pro ninja skills, you grab a
sheet of paper and slip it
between the balls.
Now, looking at the balls from
where the villain is standing,
the balls will look split by some
curvy line.
Let’s Go a Little Deeper

An SVM attempts to find the hyperplane that divides the two classes with the largest margin. The support vectors are the training points that lie closest to this hyperplane, on the boundary of the margin.

Here, margin means the distance from the closest elements to the hyperplane.
SVM—General Philosophy

[Figure: two candidate separating hyperplanes, one with a small margin and one with a large margin; the support vectors lie on the margin boundaries.]
SVM—Margins and Support Vectors

SVM—When Data Is Linearly Separable

Let data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple and yi is its class label.
There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data).
SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH).
SVM—Linearly Separable
 A separating hyperplane can be written as
W · X + b = 0
where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)
 For 2-D it can be written as
w0 + w1·x1 + w2·x2 = 0
 The hyperplanes defining the sides of the margin:
H1: w0 + w1·x1 + w2·x2 ≥ 1 for yi = +1, and
H2: w0 + w1·x1 + w2·x2 ≤ −1 for yi = −1
 Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
 This becomes a constrained (convex) quadratic optimization problem: a quadratic objective function with linear constraints → Quadratic Programming (QP)
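As a minimal sketch of these equations (assuming scikit-learn is installed; the six 2-D training tuples below are made up purely for illustration), a linear SVM can be fitted and the learned hyperplane W · X + b = 0 and its support vectors inspected directly:

```python
# Fit a linear SVM on toy 2-D data and inspect W, b, and the support vectors.
import numpy as np
from sklearn.svm import SVC

# Made-up training tuples Xi with class labels yi in {+1, -1}.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("W =", clf.coef_[0])                  # weight vector (w1, w2)
print("b =", clf.intercept_[0])             # bias
print("support vectors:\n", clf.support_vectors_)   # tuples lying on H1 / H2
```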
Why Is SVM Effective on High Dimensional Data?

 The complexity of the trained classifier is characterized by the number of support vectors rather than the dimensionality of the data
 The support vectors are the essential or critical training
examples —they lie closest to the decision boundary (MMH)
 If all other training examples are removed and the training is
repeated, the same separating hyperplane would be found
 The number of support vectors found can be used to
compute an (upper) bound on the expected error rate of the
SVM classifier, which is independent of the data
dimensionality
 Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high
SVM—Linearly Inseparable
 Transform the original input data into a higher dimensional space
 Search for a linear separating hyperplane in the new space (see the short example below)
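A tiny sketch of this idea (assuming scikit-learn; make_circles is just a convenient synthetic dataset, and the mapping Φ(x1, x2) = (x1, x2, x1² + x2²) is one illustrative choice of higher-dimensional transform):

```python
# Concentric circles are not linearly separable in 2-D, but after adding the
# feature x1^2 + x2^2 a linear hyperplane separates them in 3-D.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Explicit nonlinear mapping Phi(x1, x2) = (x1, x2, x1^2 + x2^2).
X_mapped = np.column_stack([X[:, 0], X[:, 1], X[:, 0]**2 + X[:, 1]**2])

linear_2d = SVC(kernel="linear").fit(X, y)
linear_3d = SVC(kernel="linear").fit(X_mapped, y)

print("accuracy in the original 2-D space:", linear_2d.score(X, y))
print("accuracy after mapping to 3-D:     ", linear_3d.score(X_mapped, y))
```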
SVM: Different Kernel Functions
 Instead of computing the dot product on the transformed data, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e.,
K(Xi, Xj) = Φ(Xi) · Φ(Xj)
 Typical kernel functions include:
Polynomial kernel of degree h: K(Xi, Xj) = (Xi · Xj + 1)^h
Gaussian radial basis function kernel: K(Xi, Xj) = exp(−‖Xi − Xj‖² / 2σ²)
Sigmoid kernel: K(Xi, Xj) = tanh(κ·Xi · Xj − δ)
 SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional parameters)
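Continuing the circles example above, the kernel trick lets the classifier get the same effect without ever constructing Φ(X) explicitly. The RBF and polynomial kernels below are standard scikit-learn options; the specific hyperparameter values are illustrative assumptions.

```python
# Same linearly inseparable data, but the kernel does the implicit mapping.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

rbf_svm = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)     # Gaussian RBF
poly_svm = SVC(kernel="poly", degree=3, coef0=1.0).fit(X, y)    # polynomial

print("RBF kernel accuracy:       ", rbf_svm.score(X, y))
print("polynomial kernel accuracy:", poly_svm.score(X, y))
```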
Classification
The classification algorithm is a supervised learning technique that is used to identify the category of new observations on the basis of training data.
In classification, a program learns from the given dataset or observations and then classifies new observations into one of a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can also be called targets, labels, or categories.
In a classification algorithm, a discrete output y is produced from the input variable x:
y = f(x), where y is the categorical output
• Classification is a two-step process: a learning step and a prediction step. In the learning step, the model is developed from the given training data. In the prediction step, the model is used to predict the response for new data.
• The algorithm that implements classification on a dataset is known as a classifier. There are two types of classification:
• Binary classifier: if the classification problem has only two possible outcomes, it is called binary classification. Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
• Multi-class classifier: if a classification problem has more than two outcomes, it is called multi-class classification. Examples: classification of types of crops, classification of types of music.
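A minimal binary-classification sketch in the spirit of the "Spam or Not Spam" example above (the four training texts and their labels are made up, and the CountVectorizer plus naive Bayes pipeline is just one reasonable choice, assuming scikit-learn is available):

```python
# Learning step and prediction step for a tiny binary text classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "cheap meds click here",
         "meeting at 10 tomorrow", "lunch with the team today"]
labels = ["spam", "spam", "not spam", "not spam"]

# Learning step: build the model from the given training data.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Prediction step: classify new, unseen observations.
print(model.predict(["free prize waiting for you", "see you at the meeting"]))
```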
Use cases of Classification Algorithms

Classification algorithms can be used in many different places:
• Email spam detection
• Speech recognition
• Identification of cancerous tumor cells
• Drug classification
• Biometric identification, etc.
• As a marketing manager, you may want to identify the set of customers who are most likely to purchase your product.
• Classifying customers into potential and non-potential customers, or loan applications into safe and risky ones.
Decision Tree Algorithm

• A decision tree is a flowchart-like tree structure in which an internal node represents a feature (or attribute), a branch represents a decision rule, and each leaf node represents the outcome.
• The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values.
• It partitions the data recursively, a process called recursive partitioning. This flowchart-like structure helps with decision making.
• Its visualization, like a flowchart diagram, closely mimics human-level thinking, which is why decision trees are easy to understand and interpret.
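As a short sketch (assuming scikit-learn; the iris dataset is used only because it ships with the library), a decision tree can be fitted and its flowchart-like structure of internal nodes, branches, and leaves printed:

```python
# Fit a decision tree and print its flowchart-like structure.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Each internal node tests a feature, each branch is a decision rule,
# and each leaf is an outcome (a predicted class).
print(export_text(tree, feature_names=list(iris.feature_names)))
```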
Decision Tree Induction: An Example

❑ Training data set: Buys_computer
❑ Resulting tree:

age?
  <=30   → student?        (no → no,  yes → yes)
  31..40 → yes
  >40    → credit_rating?  (excellent → no,  fair → yes)
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-conquer
manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are
discretized in advance)
• Examples are partitioned recursively based on selected
attributes
• Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
• There are no samples left
Attribute Selection Measure:
Information Gain (ID3/C4.5)
■ Select the attribute with the highest information gain
■ Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
■ Expected information (entropy) needed to classify a tuple in D:
  Info(D) = − Σ i=1..m pi log2(pi)
■ Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = Σ j=1..v (|Dj|/|D|) × Info(Dj)
■ Information gained by branching on attribute A:
  Gain(A) = Info(D) − Info_A(D)
Attribute Selection: Information Gain

■ Class P: buys_computer = “yes” (9 tuples)
■ Class N: buys_computer = “no” (5 tuples)

Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence

Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048.
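The arithmetic on this slide can be reproduced in a few lines of Python; the class counts (9 yes / 5 no overall, and 2-3 / 4-0 / 3-2 across the three age groups) are taken from the Buys_computer example above.

```python
# Reproduce Info(D), Info_age(D) and Gain(age) for the Buys_computer data.
from math import log2

def info(counts):
    """Expected information (entropy) for a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

info_D = info([9, 5])                               # I(9,5) = 0.940

# age partitions: <=30 -> (2 yes, 3 no), 31..40 -> (4, 0), >40 -> (3, 2)
partitions = [[2, 3], [4, 0], [3, 2]]
info_age = sum(sum(p) / 14 * info(p) for p in partitions)   # 0.694

print("Info(D)   =", round(info_D, 3))
print("Gain(age) =", round(info_D - info_age, 3))           # 0.246
```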
Gain Ratio for Attribute Selection (C4.5)
• The information gain measure is biased towards attributes with a large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):
  SplitInfo_A(D) = − Σ j=1..v (|Dj|/|D|) × log2(|Dj|/|D|)
• GainRatio(A) = Gain(A) / SplitInfo_A(D)
• Ex.: SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
• gain_ratio(income) = 0.029/1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the splitting attribute
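The same style of sketch covers the gain-ratio numbers on this slide; the 4/6/4 split is the number of low, medium, and high income tuples in the Buys_computer data.

```python
# SplitInfo_income(D) and GainRatio(income), matching the slide's figures.
from math import log2

def split_info(sizes):
    """SplitInfo for partition sizes |D1|, ..., |Dv|."""
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes)

si = split_info([4, 6, 4])                          # 1.557
print("SplitInfo_income(D) =", round(si, 3))
print("GainRatio(income)   =", round(0.029 / si, 3))   # 0.029 / 1.557 = 0.019
```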
Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the gini index, gini(D), is defined as
  gini(D) = 1 − Σ j=1..n pj²
  where pj is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the gini index of the split, gini_A(D), is defined as
  gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)
• Reduction in impurity:
  Δgini(A) = gini(D) − gini_A(D)
• The attribute that provides the smallest gini_split(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (this requires enumerating all possible splitting points for each attribute)
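A small sketch of the Gini formulas. The class counts below are an assumption taken from the standard Buys_computer example: 9 yes / 5 no overall, and a binary split on income ∈ {low, medium} vs {high} giving partitions with 7 yes / 3 no and 2 yes / 2 no tuples.

```python
# Gini index of a node and of a candidate binary split.
def gini(counts):
    """gini(D) = 1 - sum_j pj^2 for a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(part1, part2):
    """gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)."""
    n = sum(part1) + sum(part2)
    return sum(part1) / n * gini(part1) + sum(part2) / n * gini(part2)

print("gini(D)           =", round(gini([9, 5]), 3))                # 0.459
print("gini_income split =", round(gini_split([7, 3], [2, 2]), 3))  # 0.443
```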
Comparing Attribute Selection Measures
• The three measures, in general, return good results but
• Information gain:
• biased towards multivalued attributes
• Gain ratio:
• tends to prefer unbalanced splits in which one partition is much
smaller than the others
• Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests that result in equal-sized partitions and purity in
both partitions

Advantages of the Decision Tree
• It is simple to understand, as it follows the same process that a human follows when making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• It requires less data cleaning than many other algorithms.
Disadvantages of the Decision Tree
• A decision tree can contain many layers, which makes it complex.
• It may have an overfitting issue, which can be mitigated by using the Random Forest algorithm.
• With more class labels, the computational complexity of the decision tree may increase.
THANKS!
Any questions?
