
Introduction to Machine Learning and Deep Learning

Dr. Madan Lal Yadav
Assistant Professor
Indian Institute of Management Bodh Gaya
Agenda of the Session

• Wordcloud
• Document classification
• Feature selection
• Patent classification
• Summary
Word Cloud
• Also called a text cloud or tag cloud.
• A visual representation of textual data.
• Represents the words in a document and their frequencies.
• R has a package called wordcloud for building them.
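A word cloud is driven by a simple term-frequency table. As a minimal sketch (in Python rather than R's wordcloud package; the sample text and stop-word list are made up for illustration), the frequencies behind a cloud can be computed like this:

```python
import re
from collections import Counter

def word_frequencies(text, stopwords=()):
    # Tokenize to lowercase words and count how often each appears;
    # these counts determine the font size of each word in the cloud.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

# Hypothetical example document
doc = "Machine learning and deep learning: learning classifiers from data."
freqs = word_frequencies(doc, stopwords={"and", "from", "the"})
print(freqs.most_common(2))  # "learning" is the most frequent term
```

A plotting package (such as R's wordcloud) would then scale each word by its count.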
Document Classification
• Given a set of classes, we seek to determine which class(es) a given
document belongs to.
• For example, we may want to divide new documents into two classes:
documents about the Economic Development of Brazil and documents that
are not.
Document Classification
• This is referred to as two-class classification.
• A class can be a topic such as “Brazil”, “Coffee”, or “Sports”.
• Such general classes are usually referred to as topics, and the
classification task is then called text classification, text categorization,
topic classification, or topic spotting.
Objectives

• The automatic detection of spam pages.


• Sentiment detection or the automatic classification of a movie or
product review as positive or negative.
• An example application is a user searching for negative reviews
before buying a camera to make sure it has no undesirable features or
quality problems.
Objectives
• Personal email sorting.
• A user may have folders like electronic bills, email from family and friends.
• Use a classifier to classify each incoming email and automatically move it to
the appropriate folder. It is easier to find messages in sorted folders than in a
very large inbox. The most common case of this application is a spam folder
that holds all suspected spam messages.
Text Classification

• Many classification tasks have traditionally been solved manually.
• Books in a library are categorized by librarians.
• But manual classification is expensive to scale.
• In machine learning-based text classification, the set of rules or, more
generally, the decision criterion of the text classifier is learned
automatically from training data.
Some Definitions

• In machine learning-based text classification, we require a number of
good example documents (the training set) for each class.
• Training documents come from a person who has labeled them – where
labeling refers to the process of annotating each document with its class.
• In text classification, we are given a description d ∈ X of a document,
where X is the document space, and a fixed set of classes
C = {c1, c2, . . . , cJ}.
• Classes are also called categories or labels.

Learning Algorithm

• The document space X is a high-dimensional space, and the classes are
human-defined for the needs of an application.
• We are given a training set D of labeled documents (d, c), where
(d, c) ∈ X × C. For example:
(d, c) = (“Beijing joins the World Trade Organization”, China)
• Using a learning method or learning algorithm, we wish to learn a
classifier or classification function τ that maps documents to classes:
τ : X → C
• The learning method Γ takes the training set D as input and returns the
learned classification function τ:
Γ(D) = τ
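As an illustrative sketch, Γ can be viewed as a higher-order function that takes the labeled training set D and returns the classifier τ. The decision criterion below (a toy token-overlap rule, not the Naive Bayes method discussed later) is only a placeholder to show the shape of the interface:

```python
def learn(training_set):
    # Γ: takes labeled (document, class) pairs and returns a classifier τ
    def tau(d):
        # Toy decision criterion: return the class of the training document
        # that shares the most tokens with d.
        def overlap(pair):
            doc, _ = pair
            return len(set(doc.split()) & set(d.split()))
        _, best_class = max(training_set, key=overlap)
        return best_class
    return tau

D = [("Beijing joins the World Trade Organization", "China"),
     ("London congestion Big Ben", "UK")]
tau = learn(D)  # Γ(D) = τ
print(tau("Beijing Trade talks"))  # → China
```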
Example
          regions                        industries               subject areas
Cj        UK           Brazil           poultry     coffee       elections    sports
d1        London       Rio de Janeiro   chicken     beans        votes        baseball
          congestion   Olympics         feed        roasting     recount      diamond
d2        Big Ben      Portuguese       ducks       robusta      run-off      soccer
          Parliament   colonization     pate        arabica      seat         forward
d3        the Queen    Dilma Rousseff,  turkey      harvest      campaign     captain
          Windsor      woman president  bird flu    Kenya        TV ads       Olympic team

d(test) = “first Olympic Delegation Arrives in Rio”

τ(d) = Brazil        τ(d) = sports

Naive Bayes text classification

• Multinomial Naïve Bayes (multinomial NB) is a probabilistic learning
method.
• The probability of a document d being in class c is computed as

    P(c|d) ∝ P(c) · ∏(1 ≤ k ≤ nd) P(tk|c)

• where P(tk|c) is the conditional probability of term tk occurring in a
document of class c.
• In other words, P(tk|c) is a measure of how much evidence tk contributes
that c is the correct class.
• P(c) is the prior probability of a document occurring in class c.
Naïve Bayes

• If a document’s terms do not provide clear evidence for one class versus
another, choose the one that has the higher prior probability.
• (t1, t2, . . . , tnd) are the tokens in d that are part of the vocabulary we use
for classification, and nd is the number of such tokens in d. For example,
(t1, t2, . . . , tnd) for the one-sentence document “first Olympic Delegation
Arrives in Rio” might be
(first, Olympic, Delegation, Arrive, Rio), with nd = 5
Estimation of P̂(c) and P̂(tk|c)

• Estimation is based on relative frequency and corresponds to the most
likely value of each parameter given the training data.
• For the priors this estimate is:

    P̂(c) = Nc / N

where Nc is the number of documents in class c and N is the total number
of documents.
• The conditional probability P̂(t|c) is estimated as the relative frequency of
term t in documents belonging to class c:

    P̂(t|c) = tct / Σt′∈V tct′

where tct is the number of occurrences of t in training documents from
class c, counting multiple occurrences of a term in a document.
Handling zero conditional probability

• It is possible that P̂(t|c) for a particular term is zero.
• If the term Olympics occurred only in Brazil documents in the training
data, then the MLE estimates for the other classes, for example UK, will
be zero:

    P̂(Olympics|UK) = 0

• Then “Britain sends its delegation to Olympics” will get a conditional
probability of zero! This is a consequence of data sparseness.
• To eliminate zeros, use “add-one” or “Laplace” smoothing, which simply
adds one to each count:

    P̂(t|c) = (tct + 1) / Σt′∈V (tct′ + 1) = (tct + 1) / (Σt′∈V tct′ + B)

• B = |V| = number of terms in the vocabulary. Add-one smoothing can be
interpreted as a uniform prior (each term occurs once for each class) that
is then updated as evidence from the training data comes in.
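A minimal sketch of the smoothed estimate (the function name and the UK term counts below are made-up illustrations, not from the slides):

```python
def laplace_estimate(term, counts, vocab_size):
    # P̂(t|c) = (t_ct + 1) / (total token count in class c + B), with B = |V|
    total = sum(counts.values())
    return (counts.get(term, 0) + 1) / (total + vocab_size)

# Hypothetical token counts for class UK, with a 6-term vocabulary
uk_counts = {"Britain": 3, "Parliament": 2}
p = laplace_estimate("Olympics", uk_counts, vocab_size=6)
print(p)  # 1/11: non-zero even though Olympics never occurs in UK documents
```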
Example

               Doc ID   words in document                   in c = Brazil?
training set   1        Brazil Rio Brazil                   yes
               2        Brazil Brazil São Paulo             yes
               3        Chinese Brazil                      yes
               4        Tokyo Brazil Japan                  no
test set       5        Brazil Brazil Brazil Tokyo Japan    ?

Vocabulary = 6 terms

P̂(c) = 3/4  and  P̂(c̄) = 1/4
P̂(Brazil|c) = (5 + 1) / (8 + 6) = 6/14 = 3/7
P̂(Tokyo|c) = P̂(Japan|c) = (0 + 1) / (8 + 6) = 1/14
P̂(Brazil|c̄) = (1 + 1) / (3 + 6) = 2/9
P̂(Tokyo|c̄) = P̂(Japan|c̄) = (1 + 1) / (3 + 6) = 2/9
Calculating the Probabilities

    P(c|d) ∝ P(c) · ∏(1 ≤ k ≤ nd) P(tk|c)

• P̂(c|d5) ∝ 3/4 · (3/7)³ · 1/14 · 1/14 ≈ 0.0003
  (for the terms Brazil, Tokyo and Japan)
• P̂(c̄|d5) ∝ 1/4 · (2/9)³ · 2/9 · 2/9 ≈ 0.0001
  (for the terms Brazil, Tokyo and Japan)
• Classify the document as “c” because its score is higher.
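The worked example above can be reproduced end to end. This is a sketch of multinomial NB with add-one smoothing, using unnormalized scores exactly as on the slides (the helper names are mine):

```python
from collections import Counter

def train_nb(labeled_docs):
    # labeled_docs: list of (tokens, class) pairs
    classes = {c for _, c in labeled_docs}
    vocab = {t for toks, _ in labeled_docs for t in toks}
    prior, cond = {}, {}
    for c in classes:
        class_docs = [toks for toks, cl in labeled_docs if cl == c]
        prior[c] = len(class_docs) / len(labeled_docs)          # P̂(c) = Nc/N
        counts = Counter(t for toks in class_docs for t in toks)
        total = sum(counts.values())
        # Add-one (Laplace) smoothed conditional probabilities
        cond[c] = {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}
    return prior, cond

def score(prior, cond, c, tokens):
    # Unnormalized P(c|d) ∝ P(c) · ∏ P(tk|c); skip out-of-vocabulary terms
    s = prior[c]
    for t in tokens:
        if t in cond[c]:
            s *= cond[c][t]
    return s

train = [(["Brazil", "Rio", "Brazil"], "c"),
         (["Brazil", "Brazil", "São Paulo"], "c"),
         (["Chinese", "Brazil"], "c"),
         (["Tokyo", "Brazil", "Japan"], "not-c")]
prior, cond = train_nb(train)
d5 = ["Brazil", "Brazil", "Brazil", "Tokyo", "Japan"]
print(score(prior, cond, "c", d5))      # ≈ 0.0003
print(score(prior, cond, "not-c", d5))  # ≈ 0.0001
```

Since the score for c exceeds the score for c̄, document 5 is assigned to class c.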
Feature Selection
• The process of selecting a subset of the terms occurring in the training set
and using only this subset as features in text classification.
• It serves two main purposes.
• First, it makes training and applying a classifier more efficient by decreasing
the size of the effective vocabulary.
• Second, it increases classification accuracy by eliminating noise features and
thereby reducing over-fitting.
• Suppose a rare term, say “xyz”, carries no information about a class, say Brazil, but
all instances of “xyz” happen to occur in Brazil documents in the training set.
• Then the learning method might produce a classifier that mis-assigns test
documents containing “xyz” to Brazil.
• Such an incorrect generalization from an accidental property of the training set is
OVERFITTING.
Utility Measures

• For a given class c, we compute a utility measure A(t, c) for each term t of
the vocabulary and select the k terms that have the highest values of
A(t, c).
• All other terms are discarded and not used in classification.
• Three different utility measures:
• expected mutual information, A(t, c) = I(Ut; Cc);
• the χ² test, A(t, c) = χ²(t, c); and
• frequency, A(t, c) = N(t, c).
Mutual Information
• U is a random variable that takes values et = 1 (the document contains
term t) and et = 0 (the document does not contain t),
• and C is a random variable that takes values ec = 1 (the document is in
class c) and ec = 0 (the document is not in class c).

    I(U;C) = Σ(et∈{0,1}) Σ(ec∈{0,1}) P(U = et, C = ec) · log2 [ P(U = et, C = ec) / (P(U = et) · P(C = ec)) ]

Calculation-wise:

    I(U;C) = (N00/N) log2(N·N00/(N0.·N.0)) + (N10/N) log2(N·N10/(N1.·N.0))
           + (N01/N) log2(N·N01/(N0.·N.1)) + (N11/N) log2(N·N11/(N1.·N.1))

where the Ns are counts of documents that have the values of et and ec
indicated by the two subscripts (first subscript: et; second: ec); dots denote
marginals, e.g. N1. = N10 + N11 and N = N00 + N01 + N10 + N11.
Example
Consider the class Coffee and the term Import in a dataset.

                Coffee = 1    Coffee = 0       Sum
Import = 1              49        27,652    27,701
Import = 0             141       774,106   774,247
Sum                    190       801,758   801,948

(et,ec)   Nij/N         N·Nij/(Ni.·N.j)   log2(ratio)   (Nij/N)·log2(ratio)
(1,1)     6.11012E-05   7.466090          2.900353       0.000177215
(0,1)     0.000175822   0.768656         −0.379589      −6.67401E-05
(1,0)     0.034481039   0.998468         −0.002212      −7.62851E-05
(0,0)     0.965282038   1.000055          0.0000791      7.63457E-05
Sum       1                               I(U;C) =       0.000110536
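The table above can be checked with a short computation. A sketch (the function name is mine), using the document counts N11 = 49, N10 = 27,652, N01 = 141, N00 = 774,106:

```python
from math import log2

def mutual_information(n11, n10, n01, n00):
    # Nxy: x = et (term present?), y = ec (document in class?)
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # marginals over et
    n_1, n_0 = n11 + n01, n10 + n00   # marginals over ec
    return (n11 / n * log2(n * n11 / (n1_ * n_1))
          + n01 / n * log2(n * n01 / (n0_ * n_1))
          + n10 / n * log2(n * n10 / (n1_ * n_0))
          + n00 / n * log2(n * n00 / (n0_ * n_0)))

mi = mutual_information(49, 27652, 141, 774106)
print(mi)  # ≈ 0.000110536, matching the table
```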
Mutual Information
• Mutual information measures how much information – in the
information theoretic sense – a term contains about the class.
• If a term’s distribution is the same in the class as it is in the collection as
a whole, then I(U; C) = 0.
• MI reaches its maximum value if the term is a perfect indicator for class
membership, that is, if the term is present in a document if and only if
the document is in the class.
• To select k terms t1, . . . , tk for a given class, compute the utility measure
as A(t, c) = I(Ut; Cc) and select the k terms with the largest values.
χ² Feature Selection

• The χ² test is applied to test the independence of two events, where two
events A and B are defined to be independent if P(AB) = P(A)P(B) or,
equivalently, P(A|B) = P(A) and P(B|A) = P(B).
• In feature selection, the two events are the occurrence of the term and the
occurrence of the class. Rank terms with respect to the following quantity:

    χ²(t, c) = Σ(et∈{0,1}) Σ(ec∈{0,1}) (Net,ec − Eet,ec)² / Eet,ec

where N is the observed count and E is the expected count under the
assumption of independence.
Calculations

Observed values
                Coffee = 1    Coffee = 0       Sum
Import = 1              49        27,652    27,701
Import = 0             141       774,106   774,247
Sum                    190       801,758   801,948

Expected values
                Coffee = 1    Coffee = 0       Sum
Import = 1            6.56     27,694.44    27,701
Import = 0          183.44    774,063.56   774,247
Sum                    190       801,758   801,948

χ² contributions
                Coffee = 1    Coffee = 0       Sum
Import = 1          274.40          0.07    274.47
Import = 0            9.82          0.00      9.82
                          χ² value =        284.29
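The per-cell computation above can be sketched as follows (the function name is mine; the counts are those of the Coffee/Import example):

```python
def chi_square(n11, n10, n01, n00):
    # Observed counts Nxy (x = term present?, y = in class?) compared with
    # expected counts under independence: Exy = (row marginal · column marginal) / N
    n = n11 + n10 + n01 + n00
    row = {1: n11 + n10, 0: n01 + n00}   # marginals over et
    col = {1: n11 + n01, 0: n10 + n00}   # marginals over ec
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    total = 0.0
    for (et, ec), obs in observed.items():
        expected = row[et] * col[ec] / n
        total += (obs - expected) ** 2 / expected
    return total

chi2 = chi_square(49, 27652, 141, 774106)
print(chi2)  # ≈ 284.3, matching the slide
```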
Use of χ²

• Only the relative importance of features matters: χ² feature selection
merely ranks features with respect to their usefulness for the class.
Frequency-based feature selection
• Based on the terms that are most common in the class.
• Frequency can be either defined as document frequency (the
number of documents in the class c that contain the term t) or as
collection frequency (the number of tokens of t that occur in
documents in c).
• Note: Frequency-based feature selection may select some frequent terms
that have no specific information about the class (e.g., the days of the
week, Monday, Tuesday, . . . ), which are frequent across classes in
newswire text.
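A minimal sketch of document-frequency-based selection (the function name and the toy documents are mine, loosely echoing the coffee class from the earlier example):

```python
from collections import Counter

def top_k_by_document_frequency(class_docs, k):
    # Document frequency: the number of documents in the class that
    # contain the term (each document contributes at most once per term).
    df = Counter()
    for tokens in class_docs:
        df.update(set(tokens))
    return [term for term, _ in df.most_common(k)]

coffee_docs = [["coffee", "harvest", "Kenya"],
               ["coffee", "beans", "roasting"],
               ["robusta", "coffee", "arabica"]]
print(top_k_by_document_frequency(coffee_docs, 1))  # ['coffee']
```

Replacing `set(tokens)` with plain `tokens` would give collection frequency instead.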
Case on Patent Classification

• When a patent application is considered or submitted, the search


for previous inventions in the field—known as prior art—relies
crucially on accurate patent classification.
• The retrieval of patent documents is crucial to patent-issuing
authorities, potential inventors, research and development units,
and others concerned with the application or development of
technology.
• The number of patent applications is currently rising rapidly worldwide,
creating the need for an automated categorization system.
• In industry, patents are a major source for gathering intelligence
about competitors’ activities.
IPC
• The International Patent Classification (IPC) is a complex hierarchical classification system
comprising sections, classes, subclasses and groups.
• The latest edition of the IPC contains eight sections, 120 classes, 630 subclasses, and
approximately 69,000 groups
• The IPC divides all technological fields into sections designated by capital letters A to H,
• A: “Human necessities”; B: “Performing operations, transporting”; C: “Chemistry,
metallurgy”; D: “Textiles, paper”; E: “Fixed constructions”; F: “Mechanical engineering,
lighting, heating, weapons, blasting”; G: “Physics”; H: “Electricity”.
• Each section is subdivided into classes, whose symbols consist of the
section symbol followed by a two-digit number, such as A01.
• Each class is divided into several subclasses, whose symbols consist of the
class symbol followed by a capital letter, for example, A01B.
• http://www.wipo.int/classifications/ipc/en/
Document Collection

• The documents in the collection consist of patent applications


submitted to WIPO under the Patent Cooperation Treaty (PCT).
• A patent application includes a title, a list of inventors, a list of applicant
companies or individuals, an abstract, a claims section, and a long
description.
• The documents are in English
Classification Process

• Patent classifiers typically hold university degrees and are domain experts
responsible for classifying documents in a small subset of the IPC.
• Because each patent classifier covers separate areas, it is difficult to
measure accurately the inter-judge agreement for PCT applications.
• While all IPC codes have been allotted with extreme care, it is possible that
two patent offices would classify similar documents differently,
particularly in categories with overlapping content.
• Agreement on classifications is primarily guided by specifying
categorization rules within the text of the IPC and through committee
meetings at WIPO, where delegates from all member states are invited to
participate in revisions of the IPC.
Document Structure
• The document collection consists of a set of XML documents with a
customized set of markup tags.
• The document reference information in the <record> tag contains the
country of origin (cy attribute), a document reference number (dnum
attribute), the kind of publication type (kind attribute), and an application
number and a publication number (an and pn, respectively). Priority
numbers in a <prs> tag indicate patent publication numbers and
publication dates in various countries.
• Inventors and applicant companies are listed in <ins> and <pas> tags
respectively. In the dataset, the title, abstract, claims, and full description
are provided in English, in <tis>, <abs>, <cls>, <txts> tags respectively
Example of XML Document
<?xml version="1.0" encoding="iso-8859-1"?>
<record cy="WO" an="US0024942" pn="WO012006320010322" dnum="0120063" kind="A1">
  <prs><pr prn="US1999091460/153,825"/></prs>
  <ipcs ed="7" mc="D01D00106">
    <ipc ic="D01F00110"></ipc>
    <ipc ic="D01F00606"></ipc></ipcs>
  <ins><in>COOK, Michael, Charles</in>
    <in>MCDOWALL, Debra, Jean</in>
    <in>STANO, Dana, Elizabeth</in></ins>
  <pas><pa>KIMBERLY-CLARK WORLDWIDE, INC.</pa></pas>
  <tis><ti xml:lang="EN">METHOD OF FORMING A TREATED FIBER AND A
  TREATED FIBER FORMED THEREFROM</ti></tis>
Training and Testing Datasets

Dataset    Number of    Avg. docs    Median docs    Avg. docs       Median docs
           documents    per class    per class      per subclass    per subclass
Training     46,324        406           213            102              61
Testing      28,926        253            78             64              19
Total        75,250
Methodology
• Categorization is done using
• multinomial Naïve Bayes (NB),
• k-Nearest Neighbors (k-NN), and
• Support Vector Machine (SVM)
• Indexing is performed at the word level, accounting for word frequencies
in each document.
• 524 common stop words are removed.
• Word stemming is performed.
• Term selection is made on the basis of information gain.
Methodology

• NB and k-NN algorithms
• No term selection was performed; the full training vocabulary was used.
• Term selection performed with the k-NN algorithm did not significantly improve
the system precision.
• The k-NN algorithm samples the 30 closest neighbors throughout all tests.
• The SVM algorithm
• For performance reasons the vocabulary was limited to 20,000 words and at most
500 documents per class
• and 100 documents per subclass.
Evaluation Criteria
• Top-prediction scheme
• Compare the top predicted category with the main IPC category.
• Three-guesses approach
• Compare the top three categories predicted by the classifier,
• If a single match is found, the categorization is deemed successful.
• This measure is suited to evaluating categorization assistance, where a user
ultimately makes the decision on the correct category. In this case, it is
tolerable if the correct guess appears second or third in the list of suggestions.
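Both evaluation schemes reduce to checking whether the correct main category appears among the top n predictions. A sketch (names and the toy data are mine):

```python
def top_n_accuracy(ranked_predictions, gold_categories, n):
    # ranked_predictions: per document, a list of categories ordered best-first
    # gold_categories: the correct main IPC category of each document
    hits = sum(1 for ranked, gold in zip(ranked_predictions, gold_categories)
               if gold in ranked[:n])
    return hits / len(gold_categories)

preds = [["A01", "B02", "C03"],   # top prediction correct
         ["B02", "A01", "C03"]]   # correct category ranked second
gold = ["A01", "A01"]
print(top_n_accuracy(preds, gold, 1))  # 0.5  (top-prediction scheme)
print(top_n_accuracy(preds, gold, 3))  # 1.0  (three-guesses scheme)
```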
Results – Percentage

Class Level Categorization
Evaluation Method   Fields Indexed     Naïve Bayes   k-NN   SVM
Top Prediction      Titles                 45          33     –
                    Claims                 50          18     45
                    First 300 Words        55          51     55
Three Guesses       Titles                 66          52     –
                    Claims                 72          35     63
                    First 300 Words        79          77     73
Results – Percentage

Sub-class Categorization
Evaluation Method   Fields Indexed     Naïve Bayes   k-NN   SVM
Top Prediction      First 300 Words        33          39     41
                    Abstracts              28          26     34
Three Guesses       First 300 Words        53          62     59
                    Abstracts              47          45     52
