Introduction To Machine Learning and Deep Learning
• Wordcloud
• Document classification
• Feature selection
• Patent classification
• Summary
Word Cloud
• Also called a text cloud or tag cloud.
• A visual representation of textual data.
• Represents the words in a document and their frequencies.
• R has a package called wordcloud for creating them.
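Since a word cloud simply scales each word by how often it occurs, the counting step behind it can be sketched in a few lines of Python (the slides use R's wordcloud package; this stdlib sketch only computes the frequencies that such a package would render):

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase, tokenize on letter runs, and count occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

freqs = word_frequencies("Brazil exports coffee. Coffee from Brazil is popular.")
# The most frequent words would be drawn largest in the cloud.
print(freqs.most_common(3))
```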
Document Classification
• Given a set of classes, we seek to determine which class(es) a given
document belongs to.
• In the example, to divide new documents into the two classes:
documents about Economic Development of Brazil and documents not
about Economic Development of Brazil.
Document Classification
• This is referred to as two-class classification.
• A class can be a topic such as “Brazil”, “Coffee”, or “Sports”.
• Such general classes are usually referred to as topics, and the
classification task is then called text classification, text categorization,
topic classification, or topic spotting.
Naïve Bayes Example
• Vocabulary = 6 terms
• P(c) = 3/4 and P(c̄) = 1/4
• P(Brazil | c) = (5 + 1) / (8 + 6) = 6/14 = 3/7
• P(Tokyo | c) = P(Japan | c) = (0 + 1) / (8 + 6) = 1/14
• P(Brazil | c̄) = (1 + 1) / (3 + 6) = 2/9
• P(Tokyo | c̄) = P(Japan | c̄) = (1 + 1) / (3 + 6) = 2/9
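Each of these estimates follows the multinomial Naïve Bayes formula with add-one (Laplace) smoothing, P(t | c) = (count(t, c) + 1) / (tokens in c + |V|). A short Python sketch reproducing the numbers above (the counts are taken directly from the example: class c contains 8 tokens, class c̄ contains 3, and the vocabulary has 6 terms):

```python
from fractions import Fraction

def smoothed_prob(term_count, total_tokens, vocab_size):
    """Add-one (Laplace) smoothed conditional probability P(term | class)."""
    return Fraction(term_count + 1, total_tokens + vocab_size)

V = 6  # vocabulary size from the example

# Class c: 8 tokens total; 'Brazil' occurs 5 times, 'Tokyo'/'Japan' 0 times.
p_brazil_c = smoothed_prob(5, 8, V)      # (5+1)/(8+6) = 6/14 = 3/7
p_tokyo_c = smoothed_prob(0, 8, V)       # (0+1)/(8+6) = 1/14

# Class c-bar: 3 tokens total; 'Brazil', 'Tokyo', and 'Japan' occur once each.
p_brazil_not_c = smoothed_prob(1, 3, V)  # (1+1)/(3+6) = 2/9

print(p_brazil_c, p_tokyo_c, p_brazil_not_c)
```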
Feature Selection: Mutual Information
I(U; C) = (N00/N) · log2( N·N00 / (N0.·N.0) )
        + (N10/N) · log2( N·N10 / (N1.·N.0) )
        + (N01/N) · log2( N·N01 / (N0.·N.1) )
        + (N11/N) · log2( N·N11 / (N1.·N.1) )

The Ns are counts of documents with the values of et (term present or absent) and ec (in the class or not) indicated by the two subscripts; for example, N10 is the number of documents that contain the term but are not in the class, and the marginals are N1. = N11 + N10 and N.1 = N11 + N01.
Example

Consider the class Coffee and the term Import in a dataset. The document counts are:

                Coffee = 1   Coffee = 0        Sum
Import = 1              49       27,652     27,701
Import = 0             141      774,106    774,247
Sum                    190      801,758    801,948

For each cell, the contribution to I(U; C) is (Nt,c/N) · log2( N·Nt,c / (N.c·Nt.) ); summing the four contributions gives the mutual information between the term and the class.
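Plugging the table into the mutual-information formula is mechanical; a small Python sketch (the marginals N1., N.1, etc. are derived from the four cell counts):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Mutual information I(U;C) from a 2x2 table of document counts."""
    n = n11 + n10 + n01 + n00
    n1_ = n11 + n10  # documents containing the term
    n0_ = n01 + n00  # documents not containing the term
    n_1 = n11 + n01  # documents in the class
    n_0 = n10 + n00  # documents not in the class
    total = 0.0
    for nij, ni, nj in [(n11, n1_, n_1), (n01, n0_, n_1),
                        (n10, n1_, n_0), (n00, n0_, n_0)]:
        if nij > 0:  # a zero cell contributes nothing
            total += (nij / n) * math.log2(n * nij / (ni * nj))
    return total

# Coffee / Import example from the table above
mi = mutual_information(49, 27652, 141, 774106)
print(f"{mi:.7f}")  # approximately 0.0001105
```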
Feature Selection: Chi-Square
• The χ² test is applied to test the independence of two events, where two events A and B are defined to be independent if P(AB) = P(A)P(B) or, equivalently, P(A|B) = P(A) and P(B|A) = P(B).
• In feature selection, the two events are the occurrence of the term and the occurrence of the class. Rank terms with respect to the following quantity:
χ² = Σ over et ∈ {0,1}, ec ∈ {0,1} of (N(et,ec) − E(et,ec))² / E(et,ec)

where N(et,ec) is the observed document count and E(et,ec) is the count expected under independence.
Calculations

Observed Values
                Coffee = 1   Coffee = 0        Sum
Import = 1              49       27,652     27,701
Import = 0             141      774,106    774,247
Sum                    190      801,758    801,948

Expected Values
                Coffee = 1   Coffee = 0        Sum
Import = 1            6.56    27,694.44     27,701
Import = 0          183.44   774,063.56    774,247
Sum                    190      801,758    801,948

Chi-Square Contributions
                Coffee = 1   Coffee = 0        Sum
Import = 1          274.40         0.07     274.47
Import = 0            9.82         0.00       9.82
Sum                           χ² value =     284.29
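The table computation can be checked in a few lines of Python: each expected count comes from the marginals (E = row total × column total / N), and each cell contributes (O − E)² / E to the statistic:

```python
def chi_square(n11, n10, n01, n00):
    """Pearson chi-square statistic for a 2x2 contingency table."""
    n = n11 + n10 + n01 + n00
    row = [n11 + n10, n01 + n00]   # Import = 1, Import = 0
    col = [n11 + n01, n10 + n00]   # Coffee = 1, Coffee = 0
    observed = [[n11, n10], [n01, n00]]
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n  # count expected under independence
            total += (observed[i][j] - expected) ** 2 / expected
    return total

print(round(chi_square(49, 27652, 141, 774106), 2))  # matches the 284.29 above
```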
Patent Classification
• Patent classifiers typically hold university degrees and are domain experts
responsible for classifying documents in a small subset of the IPC.
• Because each patent classifier covers separate areas, it is difficult to
measure accurately the inter-judge agreement for PCT applications.
• While all IPC codes have been allotted with extreme care, it is possible that
two patent offices would classify similar documents differently,
particularly in categories with overlapping content.
• Agreement on classifications is primarily guided by specifying categorization
rules within the text of the IPC and through committee meetings at WIPO
where delegates from all member states are invited to participate in the
revisions of the IPC.
Document Structure
• The document collection consists of a set of XML documents with a
customized set of markup tags.
• The document reference information in the <record> tag contains the
country of origin (cy attribute), a document reference number (dnum
attribute), the kind of publication (kind attribute), and an application
number and a publication number (an and pn, respectively). Priority
numbers in a <prs> tag indicate patent publication numbers and
publication dates in various countries.
• Inventors and applicant companies are listed in <ins> and <pas> tags,
respectively. In the dataset, the title, abstract, claims, and full description
are provided in English, in <tis>, <abs>, <cls>, and <txts> tags, respectively.
Example of XML Document
• <?xml version="1.0" encoding="iso-8859-1"?>
• <record cy="WO"
an="US0024942" pn="WO012006320010322" dnum="0120063" kind="A1">
• <prs><pr prn="US1999091460/153,825"/></prs>
• <ipcs ed="7" mc="D01D00106">
• <ipc ic="D01F00110"></ipc>
• <ipc ic="D01F00606"></ipc></ipcs>
• <ins><in>COOK, Michael, Charles</in>
• <in>MCDOWALL, Debra, Jean</in>
• <in>STANO, Dana, Elizabeth</in></ins>
• <pas><pa>KIMBERLY-CLARK WORLDWIDE, INC.</pa></pas>
• <tis><ti xml:lang="EN">METHOD OF FORMING A TREATED FIBER AND A
TREATED FIBER FORMED THEREFROM</ti></tis>
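A sketch of how such a record could be read with Python's standard library XML parser. The sample below is a simplified, well-formed fragment modeled on the record above (tag and attribute names follow the slide's description; a real loader would parse the full files from disk):

```python
import xml.etree.ElementTree as ET

sample = """<record cy="WO" an="US0024942" dnum="0120063" kind="A1">
  <ipcs ed="7" mc="D01D00106">
    <ipc ic="D01F00110"/>
    <ipc ic="D01F00606"/>
  </ipcs>
  <ins><in>COOK, Michael, Charles</in></ins>
  <tis><ti>METHOD OF FORMING A TREATED FIBER</ti></tis>
</record>"""

record = ET.fromstring(sample)
country = record.get("cy")                            # country of origin
main_ipc = record.find("ipcs").get("mc")              # main IPC code
other_ipcs = [ipc.get("ic") for ipc in record.find("ipcs")]
inventors = [i.text for i in record.find("ins")]
title = record.find("tis/ti").text

print(country, main_ipc, other_ipcs, inventors, title)
```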
Training and Testing Datasets

Training Dataset    46,324    406    213    102    61
Testing Dataset     28,926    253     78     64    19
Total               75,250
Methodology
• Categorization is done using
• multinomial Naïve Bayes (NB),
• k-Nearest Neighbors (k-NN), and
• Support Vector Machine (SVM)
• Indexing is performed at word level, accounting for word frequencies in
each document,
• 524 common stop words are removed,
• Word stemming is performed, and
• Term selection is made on the basis of information gain.
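The indexing steps above can be sketched in Python. The stop-word set and the suffix-stripping rule here are illustrative stand-ins only; a real system would use the full 524-word stop list and a proper stemmer such as Porter's:

```python
import re
from collections import Counter

# Illustrative subset of a stop-word list (the study removes 524 common words).
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "are", "to", "by", "for"}

def crude_stem(word):
    """Toy suffix stripping, a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_document(text):
    """Word-level indexing: tokenize, drop stop words, stem, count frequencies."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)

print(index_document("The fibers are formed by treating the fiber"))
```

The resulting per-document term frequencies are what the NB, k-NN, and SVM classifiers consume after term selection.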
Results
Sub-class Categorization
(columns give results for NB, k-NN, and SVM, in the order listed under Methodology)

                                   NB   k-NN   SVM
Top Prediction   First 300 Words   33     39    41
                 Abstracts         28     26    34
Three Guesses    First 300 Words   53     62    59
                 Abstracts         47     45    52