AP For NLP-LO2
Word embedding
The basic idea behind GloVe is to create a vector space in which each word is represented by a
dense vector of real-valued numbers. This vector should capture some of the semantic and syntactic
information associated with the word, so that words that are similar in meaning or that tend to occur
in similar contexts will have similar vector representations.
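To make "similar vector representations" concrete, here is a minimal Python sketch that compares embeddings with cosine similarity; the words, the 3-dimensional vectors, and their values are made-up placeholders (real GloVe vectors usually have 50-300 dimensions).

import numpy as np

# Hypothetical 3-d embeddings; real GloVe vectors are much higher dimensional.
emb = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.78, 0.70, 0.12]),
    "apple": np.array([0.10, 0.05, 0.90]),
}

def cosine(u, v):
    # Cosine similarity: close to 1.0 = similar direction, close to 0.0 = unrelated.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(emb["king"], emb["queen"]))  # high: related words
print(cosine(emb["king"], emb["apple"]))  # lower: unrelated words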
Word embedding
The training process for GloVe involves constructing a co-occurrence matrix that records the
number of times each word appears in the same context as every other word in the corpus. A
"context" here is a window of a fixed number of words on either side of the target word.
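A minimal Python sketch of building such a co-occurrence matrix with a symmetric context window; the toy corpus and the window size of 2 are illustrative assumptions, not part of the GloVe description above.

from collections import defaultdict

corpus = ["the cat sat on the mat", "the dog sat on the rug"]  # toy corpus
window = 2  # number of context words counted on either side of the target word

cooc = defaultdict(float)  # (target word, context word) -> co-occurrence count
for sentence in corpus:
    tokens = sentence.split()
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                # GloVe optionally down-weights distant pairs by 1/distance;
                # plain counts are used here to keep the sketch simple.
                cooc[(target, tokens[j])] += 1

print(cooc[("sat", "cat")])  # 1.0: "cat" falls inside the window of "sat" once
print(cooc[("the", "sat")])  # 4.0: accumulated over both toy sentences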
Once the co-occurrence matrix is constructed, GloVe learns the word embeddings through a form
of matrix factorization: it fits low-dimensional word vectors whose dot products approximate the
logarithm of the corresponding co-occurrence counts, so the relative frequencies of word co-
occurrences are preserved. The resulting word embeddings are dense, real-valued vectors that
capture the meaning and context of each word.
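For reference, this factorization is driven by GloVe's weighted least-squares objective, J = Σ f(Xij) (wi · w̃j + bi + b̃j − log Xij)². Below is a minimal numpy sketch of that per-pair loss; the vector dimensionality, initial values, and example count of 12 are placeholders.

import numpy as np

def glove_pair_loss(w_i, w_tilde_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    # Weighting function f(x) down-weights rare co-occurrences and caps very frequent ones.
    f = (x_ij / x_max) ** alpha if x_ij < x_max else 1.0
    # Squared error between the dot product (plus biases) and the log co-occurrence count.
    return f * (w_i @ w_tilde_j + b_i + b_j - np.log(x_ij)) ** 2

rng = np.random.default_rng(0)
w_i = rng.normal(scale=0.01, size=50)        # word vector (50-d, illustrative)
w_tilde_j = rng.normal(scale=0.01, size=50)  # context-word vector
print(glove_pair_loss(w_i, w_tilde_j, b_i=0.0, b_j=0.0, x_ij=12.0))

Training sums this loss over all nonzero entries of the co-occurrence matrix and minimizes it with stochastic gradient methods (the original GloVe implementation uses AdaGrad).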
SVM—Support Vector Machines
• A relatively new classification method for both linear and
nonlinear data
• It uses a nonlinear mapping to transform the original training
data into a higher dimension
• With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
• With an appropriate nonlinear mapping to a sufficiently high
dimension, data from two classes can always be separated
by a hyperplane
• SVM finds this hyperplane using support vectors
(“essential” training tuples) and margins (defined by the
support vectors)
SVM—History and Applications
• Vapnik and colleagues (1992)—groundwork from Vapnik &
Chervonenkis’ statistical learning theory in 1960s
• Features: training can be slow but accuracy is high owing
to their ability to model complex nonlinear decision
boundaries (margin maximization)
• Used for: classification and numeric prediction
• Applications:
• handwritten digit recognition, object recognition,
speaker identification, benchmarking time-series
prediction tests
Let’s Learn This Like a Five-Year-Old
SVM—Margins and Support Vectors
Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training
tuple with associated class label yi
There are infinitely many lines (hyperplanes) separating the two classes, but
we want to find the best one (the one that will minimize classification
error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., the
maximum marginal hyperplane (MMH)
SVM—Linearly Separable
A separating hyperplane can be written as
W●X+b=0
where W={w1, w2, …, wn} is a weight vector and b a scalar
(bias)
For 2-D it can be written as
w0 + w1x1 + w2x2 = 0
The hyperplanes defining the sides of the margin:
H1: w0 + w1x1 + w2x2 ≥ 1 for yi = +1, and
H2: w0 + w1x1 + w2x2 ≤ −1 for yi = −1
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
This becomes a constrained (convex) quadratic optimization
problem: a quadratic objective function with linear constraints,
i.e., Quadratic Programming (QP)
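As a concrete illustration of the linearly separable case, a minimal scikit-learn sketch that recovers W, b, the support vectors, and the margin width 2/||W|| from a fitted linear SVM; the toy 2-D points, labels, and the large C value (to approximate a hard margin) are assumptions for illustration.

import numpy as np
from sklearn.svm import SVC

# Toy 2-D linearly separable data: class +1 upper-right, class -1 lower-left.
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin QP described above.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("W =", clf.coef_[0])                  # weight vector
print("b =", clf.intercept_[0])             # bias
print("support vectors:\n", clf.support_vectors_)
print("margin width =", 2 / np.linalg.norm(clf.coef_[0]))  # 2 / ||W||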
Why Is SVM Effective on High Dimensional Data?
SVM—Linearly Inseparable
Transform the original input data into a higher dimensional space, then
search for a linear separating hyperplane in the new space
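In practice this transformation is performed implicitly via a kernel function; below is a minimal scikit-learn sketch on toy data that is not linearly separable in the original 2-D space. The ring-shaped data, the choice of RBF kernel, and the gamma/C values are illustrative assumptions.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Inner cluster = class 0, surrounding ring = class 1: no straight line separates them in 2-D.
inner = rng.normal(0, 0.3, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, 50)
ring = np.c_[2 * np.cos(angles), 2 * np.sin(angles)] + rng.normal(0, 0.1, size=(50, 2))
X = np.vstack([inner, ring])
y = np.array([0] * 50 + [1] * 50)

# The RBF kernel implicitly maps the data into a higher-dimensional space
# where a linear separating hyperplane exists.
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))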
[Figure: example decision tree splitting on age (<=30, 31..40, >40), then student? (yes/no) and credit rating? (excellent/fair), with yes/no class labels at the leaves]
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-conquer
manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are
discretized in advance)
• Examples are partitioned recursively based on selected
attributes
• Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
• There are no samples left
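A minimal scikit-learn sketch of this top-down induction, using criterion="entropy" so that splits are chosen by information gain as in ID3/C4.5; the hand-encoded toy tuples and their labels below are made up for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy categorical data, one-hot encoded by hand:
# columns = [age<=30, age31..40, age>40, student, credit=excellent]
X = np.array([
    [1, 0, 0, 0, 0],   # young, not a student, fair credit
    [1, 0, 0, 1, 0],   # young student
    [0, 1, 0, 0, 1],   # middle-aged, excellent credit
    [0, 0, 1, 0, 0],   # senior, fair credit
    [0, 0, 1, 1, 1],   # senior student, excellent credit
])
y = np.array([0, 1, 1, 1, 0])   # 1 = positive class, 0 = negative (made-up labels)

# criterion="entropy" selects test attributes by information gain.
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=[
    "age<=30", "age31..40", "age>40", "student", "credit=excellent"]))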
Attribute Selection Measure:
Information Gain (ID3/C4.5)
■ Select the attribute with the highest information gain
■ Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|
■ Expected information (entropy) needed to classify a tuple in D:
Info(D) = − Σ i=1..m pi log2(pi)
■ Information needed (after using attribute A to split D into v partitions) to classify a tuple:
InfoA(D) = Σ j=1..v (|Dj| / |D|) × Info(Dj)
■ Information gained by branching on attribute A:
Gain(A) = Info(D) − InfoA(D)
Attribute Selection: Information Gain
[Worked example: Info(D), InfoA(D), and Gain(A) computed for each candidate attribute]
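A minimal Python sketch of such a calculation on a hypothetical data set with 9 tuples of one class and 5 of the other, split by some attribute A into three partitions; all of the counts are assumptions for illustration.

import numpy as np
from collections import Counter

def info(labels):
    # Expected information (entropy): Info(D) = -sum_i p_i * log2(p_i)
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain(labels, partitions):
    # Gain(A) = Info(D) - sum_j (|D_j| / |D|) * Info(D_j)
    n = len(labels)
    info_a = sum(len(part) / n * info(part) for part in partitions)
    return info(labels) - info_a

D = ["yes"] * 9 + ["no"] * 5                      # 9 positive, 5 negative tuples
print(round(info(D), 3))                          # ~0.940 bits
# Hypothetical split of D by attribute A into three partitions:
parts = [["yes"] * 2 + ["no"] * 3, ["yes"] * 4, ["yes"] * 3 + ["no"] * 2]
print(round(info_gain(D, parts), 3))              # ~0.247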
Gain Ratio for Attribute Selection (C4.5)
• Information gain measure is biased towards attributes with a large number
of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem
(normalization to information gain)
• GainRatio(A) = Gain(A) / SplitInfo(A), where SplitInfoA(D) = − Σ j=1..v (|Dj| / |D|) × log2(|Dj| / |D|)
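A minimal Python sketch of SplitInfo and GainRatio; the partition sizes (5, 4, 5) and the gain value 0.247 are illustrative assumptions carried over from the sketch above.

import math

def split_info(partition_sizes):
    # SplitInfoA(D) = -sum_j (|D_j|/|D|) * log2(|D_j|/|D|)
    n = sum(partition_sizes)
    return -sum((s / n) * math.log2(s / n) for s in partition_sizes if s > 0)

def gain_ratio(gain, partition_sizes):
    # GainRatio(A) = Gain(A) / SplitInfoA(D)
    return gain / split_info(partition_sizes)

# e.g. an attribute that splits 14 tuples into partitions of sizes 5, 4 and 5:
print(round(split_info([5, 4, 5]), 3))         # ~1.577
print(round(gain_ratio(0.247, [5, 4, 5]), 3))  # ~0.157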
Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the gini index
gini(D) is defined as
gini(D) = 1 − Σ j=1..n pj², where pj is the relative frequency of class j in D
• If D is split on attribute A into two subsets D1 and D2, then
giniA(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
• Reduction in Impurity:
Δgini(A) = gini(D) − giniA(D)
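A minimal Python sketch of the Gini index and the impurity reduction for a binary split; the class counts are illustrative assumptions.

def gini(class_counts):
    # gini(D) = 1 - sum_j p_j^2
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gini_reduction(parent_counts, left_counts, right_counts):
    # delta_gini(A) = gini(D) - gini_A(D), where gini_A(D) is the size-weighted
    # average Gini of the two partitions produced by the binary split on A.
    n = sum(parent_counts)
    n_l, n_r = sum(left_counts), sum(right_counts)
    gini_a = (n_l / n) * gini(left_counts) + (n_r / n) * gini(right_counts)
    return gini(parent_counts) - gini_a

print(round(gini([9, 5]), 3))                         # ~0.459
print(round(gini_reduction([9, 5], [6, 1], [3, 4]), 3))  # ~0.092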
Advantages of the Decision Tree
• It is simple to understand because it follows the same process a
human follows when making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think through all the possible outcomes of a
problem.
• It requires less data cleaning than many other algorithms.
Disadvantages of the Decision Tree
• A decision tree can contain many layers, which makes it
complex.
• It may have an overfitting issue, which can be mitigated using
the Random Forest algorithm (see the sketch after this list).
• With more class labels, the computational complexity of the
decision tree may increase.
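Since the list above points to Random Forests as a remedy for overfitting, here is a minimal scikit-learn sketch comparing a single tree with a forest on synthetic data; the data generator and all parameter values are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A single unpruned tree tends to overfit; averaging many randomized trees reduces variance.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("single tree test accuracy:", tree.score(X_te, y_te))
print("random forest test accuracy:", forest.score(X_te, y_te))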
THANKS!
Any questions?