Prof. Mohammed Tanzeem Agra

The document discusses various machine learning techniques including Naive Bayes analysis, support vector machines, and text mining. Naive Bayes analysis uses probability theory to classify instances and compute the probability of a new instance belonging to each target class. Support vector machines find the optimal separating hyperplane between classes that maximizes the margin between them. Text mining discovers patterns from organized text collections through techniques like identifying frequent terms, meaningful phrases, and combining phrases into topics.

Uploaded by

CR world

MODULE 5

Prof. Mohammed Tanzeem Agra

Module 5 – Contents

1) Text Mining
2) Naive Bayes Analysis
3) Support Vector Machines
4) Web Mining
5) Social Network Analysis
NAIVE-BAYES ANALYSIS

➢ The Naive Bayes (NB) technique is a supervised learning technique that uses
probability-theory-based analysis.
➢ It computes the probability of an instance belonging
to each one of many target classes.
➢ Example: classifying text documents.
➢ Probability: from past records, the probability of
something happening in the future can be assessed.
➢ Example: the probability of dying in an airline
accident is estimated by dividing the total number of airline-accident-related
deaths in a time period by the total number of people flying
during that time.
Advantages and Disadvantages

Advantages
● The NB logic is simple for the classification of instances.
● Conditional probability can be computed for discrete data.
● It computes the probability of a new occurrence not only on the recent record, but also on the
basis of prior experience.

Disadvantages
● Posterior probability computation requires time.
● When there are a number of variables in the vector X, the problem can be modeled
using probability functions such as normal, lognormal, gamma and Poisson.
NAIVE-BAYES MODEL
SIMPLE CLASSIFICATION EXAMPLE

● Suppose a salon needs to predict the service required by the next incoming customer.
● Only two services are offered: Hair Cut (R) and Manicure-Pedicure (M).
● Predict: whether the next customer will be for R or M.
● Let the number of classes K = 2.
● Data for one year: 2500 customers for R and 1500 customers for M.
● Answer: the default (prior) probability for the next customer to be for R is 2500/4000 or 5/8, and for M it is
1500/4000 or 3/8, so the next customer would likely be for R.
● Example sequence of recent customers: R, M, R, M, M (2 of 5 for R, 3 of 5 for M).
● NB Posterior Probability P(R) = 5/8 * 2/5 = 10/40 = 0.25
● NB Posterior Probability P(M) = 3/8 * 3/5 = 9/40 = 0.225
● Since P(R) is greater than P(M), the next customer is still more likely to be for R.
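
The arithmetic of the salon example can be checked with a few lines of Python. This is only a sketch of the calculation shown on the slide: the variable names are illustrative, and the scores are the unnormalized products of the prior and the share among recent customers (they do not sum to 1).

```python
from fractions import Fraction

# One year of history from the example: customers for Hair Cut (R) and Manicure-Pedicure (M).
history = {"R": 2500, "M": 1500}
# The recent sequence of customers from the example.
recent = ["R", "M", "R", "M", "M"]

total = sum(history.values())
# Prior probability of each class from the yearly record.
prior = {c: Fraction(n, total) for c, n in history.items()}
# Share of each class among the recent customers.
recent_share = {c: Fraction(recent.count(c), len(recent)) for c in history}

# Unnormalized NB score: prior multiplied by the recent share.
score = {c: prior[c] * recent_share[c] for c in history}
prediction = max(score, key=score.get)

for c in history:
    print(c, "prior =", prior[c], "recent =", recent_share[c], "score =", float(score[c]))
print("predicted next customer:", prediction)
```

Exact fractions are used so that 10/40 and 9/40 can be compared without rounding.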
TEXT CLASSIFICATION EXAMPLE

              DOCUMENT ID   KEYWORDS IN THE DOCUMENT    CLASS = H (HEALTHY)
TRAINING SET  1             Love Happy Joy Joy Love     Yes
              2             Happy Love Kick Joy Happy   Yes
              3             Love Move Joy Good          Yes
              4             Love Happy Joy Pain Love    Yes
              5             Joy Love Pain Kick Pain     No
              6             Pain Pain Love Kick         No
TEST DATA     7             Love Pain Joy Love Kick     ?
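
One way to classify test document 7 is a multinomial Naive Bayes model over the keyword counts in the table. The sketch below is an illustration rather than the method used in the slides: the add-one (Laplace) smoothing and the function names `train` and `log_scores` are assumptions made for this example.

```python
import math
from collections import Counter

# Training documents 1-6 and their class labels, copied from the table.
training = [
    ("Love Happy Joy Joy Love", "Yes"),
    ("Happy Love Kick Joy Happy", "Yes"),
    ("Love Move Joy Good", "Yes"),
    ("Love Happy Joy Pain Love", "Yes"),
    ("Joy Love Pain Kick Pain", "No"),
    ("Pain Pain Love Kick", "No"),
]
test_doc = "Love Pain Joy Love Kick"  # document 7

def train(docs):
    """Count documents per class and keyword occurrences per class."""
    class_docs = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_docs}
    for text, label in docs:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab

def log_scores(model, text, alpha=1.0):
    """Log of prior times smoothed keyword likelihoods, per class."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / n_docs)
        for word in text.split():
            # alpha is an assumed smoothing constant (add-one smoothing).
            p = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(p)
        scores[label] = score
    return scores

model = train(training)
scores = log_scores(model, test_doc)
prediction = max(scores, key=scores.get)
print(scores)
print("predicted class for document 7:", prediction)
```

Working in log space avoids multiplying many small probabilities together. Under these assumptions the word "Pain", which is frequent in the "No" documents, pulls the test document toward the "No" class despite the larger prior for "Yes".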
SUPPORT VECTOR MACHINE (SVM)

SVM is a supervised machine learning algorithm which
can be used for both classification and regression
problems. However, it is most widely used for
classification problems.
Applications: face detection, text and image
classification, handwriting recognition.
In this algorithm, we plot each data item as a point in
n-dimensional space, with the value of each feature
being the value of a particular coordinate.
We perform classification by finding the hyperplane
that best separates the two classes, which are
called segments.
DIAGRAM
SVM MODEL (How it Works)

● An SVM is a classifier function in a high-dimensional space that defines the decision boundary
between two classes.
● It classifies the labeled set of points into two classes called segments. The goal is to find the best
classifier between the points of the two types.
● SVM takes the widest-street approach to demarcate the two classes and thus finds the
hyperplane that has the widest margin.
● In the diagram, the dotted line is the optimal hyperplane. The solid lines are the gutters on the sides
of the two classes.
● The gap between the gutters is the maximum margin.
● The classifier is defined by only those points that fall on the gutters on both sides. These points are
called support vectors.
● The rest of the data points are irrelevant for defining the classifier.
SVM MODEL

● Notations

● The training data of n points: (X1, Y1) ... (Xn, Yn)

● Each Xi is a feature vector and each Yi represents the binary class value, 1 or -1.
● The classifier hyperplane satisfies the equation W.X + b = 0,
where W = normal vector to the hyperplane.
● The hard margin is bounded by the two hyperplanes W.X + b = 1 and W.X + b = -1.
● The width of the hard margin can be calculated as 2/||W||.
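
The notation can be made concrete with a small numeric sketch. The weight vector, offset and four training points below are hypothetical values chosen for illustration, not taken from the slides. The code evaluates W.X + b for each point, computes the margin width 2/||W||, and picks out the points that lie exactly on the gutters.

```python
import math

def decision_value(w, b, x):
    """Return W.X + b for a single point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Hypothetical hyperplane: W = (1, 1), b = -3, i.e. x1 + x2 = 3.
w = (1.0, 1.0)
b = -3.0

# Hypothetical training points (Xi, Yi) with Yi in {1, -1}.
points = [((1.0, 1.0), -1), ((2.0, 2.0), 1), ((0.0, 0.0), -1), ((3.0, 3.0), 1)]

# Width of the hard margin: 2 / ||W||.
margin_width = 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Predicted class is the sign of W.X + b.
predictions = [1 if decision_value(w, b, x) >= 0 else -1 for x, _ in points]

# Points on the gutters satisfy Yi * (W.Xi + b) = 1; these are the support vectors.
support_vectors = [x for x, y in points if math.isclose(y * decision_value(w, b, x), 1.0)]

print("margin width:", margin_width)
print("predictions:", predictions)
print("support vectors:", support_vectors)
```

Only the two points on the gutters are reported as support vectors; the other two lie further from the hyperplane and could be removed without changing it.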
THE KERNEL METHOD

● The heart of an SVM algorithm is the kernel method. Most kernel algorithms are based on
optimization in a convex space.
● Kernel methods operate using what is called the “kernel trick”.
● The trick involves computing and working with the inner products of only the relevant pairs of data
in the feature space, without explicitly computing the coordinates of the data in that space.
● The kernel trick makes the algorithm much less demanding in computational and memory
resources.
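
A small numeric check shows what the kernel trick buys. The sketch below assumes a degree-2 polynomial kernel K(x, z) = (x.z)^2 on two-dimensional inputs, chosen here only as an illustration; the function names are hypothetical. It compares the kernel value with the inner product computed after explicitly mapping both points into the three-dimensional feature space.

```python
import math

def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original 2-D input space."""
    return dot(x, z) ** 2

def feature_map(x):
    """Explicit feature map whose inner product equals poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x = (1.0, 2.0)
z = (3.0, 4.0)

kernel_value = poly2_kernel(x, z)                       # works on the 2-D inputs only
explicit_value = dot(feature_map(x), feature_map(z))    # maps to 3-D first

print(kernel_value, explicit_value)
```

Both routes give the same inner product, but the kernel never builds the higher-dimensional vectors, which is where the savings in computation and memory come from.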
THE KERNEL METHOD

● There are several types of support vector models, including linear, polynomial, and sigmoid.
● In all types, we assign high weights to the abnormal situation and very low weights to the normal
situation.
● A soft-margin SVM is more flexible than a hard-margin SVM and is able to tolerate some amount of
misclassification.
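
The difference between a hard and a soft margin can be illustrated with slack values. In the sketch below the hyperplane, the data points and the penalty constant C are hypothetical, and the objective 0.5*||W||^2 + C * (sum of slacks) is the standard soft-margin formulation rather than something stated in the slides.

```python
def slack(w, b, x, y):
    """Slack of one point: 0 if it is on or outside its gutter, positive otherwise."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

# Hypothetical hyperplane W.X + b = 0 with W = (1, 1), b = -3.
w = (1.0, 1.0)
b = -3.0

# Hypothetical data; the third point lies on the wrong side of the hyperplane.
data = [((1.0, 1.0), -1), ((2.0, 2.0), 1), ((2.0, 1.5), -1), ((0.0, 0.0), -1)]

slacks = [slack(w, b, x, y) for x, y in data]

# A hard margin requires every slack to be zero.
hard_margin_feasible = all(s == 0.0 for s in slacks)

# A soft margin accepts violations but penalizes them with an assumed constant C.
C = 1.0
soft_margin_objective = 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)

print("slacks:", slacks)
print("hard margin feasible:", hard_margin_feasible)
print("soft margin objective:", soft_margin_objective)
```

A larger C makes violations more expensive and moves the model toward hard-margin behavior; a smaller C tolerates more misclassification.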
ADVANTAGES AND DISADVANTAGES

Advantages
● SVMs work very well when the number of features is larger than the number of instances.
● They can work on data sets with a huge feature space; example: spam filtering.
● SVMs are easy to understand. They create an easy-to-understand linear classifier.
● They are computationally efficient.
● SVMs are now available in almost all data analytics toolsets.

Disadvantages
● SVM works well only with real numbers. All the data should be defined as numerical values.
● In its basic form, it works only with binary classification problems.
● Training the SVM is an inefficient and time-consuming process when the data is large.
● SVM does not work well when there is much noise in the data.
● SVM also does not provide a probability estimate of classification.
TEXT MINING

✔ Text mining is the art and science of discovering patterns from an organized collection of
text databases.
✔ Text mining can help with frequency analysis of important terms and their semantic
relationships.
✔ Text mining can be applied to large-scale social media data for gathering preferences and
measuring emotional sentiments.
5.1.1: TEXT MINING – EXAMPLE

● Formats: Word documents, PDF files, XML files, text messages, etc.
● Legal profession: text sources would include laws, court deliberations, court orders, etc.
● Academic research: published papers and articles.
● World of finance: statutory reports, internal histories, discharge summaries, etc.
● Medicine: medical journals, patient histories, discharge summaries.
● Marketing: advertisements, customer comments, etc.
5.1.2: Text Mining Applications

● Marketing

– The voice of the customer can be captured in its native and raw format and then analyzed for
customer preferences and complaints.
– Social personas are a clustering technique to develop the customer segments of interest.
– A listening platform is a text mining application that gathers social media, blogs and
other textual feedback in real time and filters out the chatter to extract true consumer sentiment.
– BPO conversations and records can be analyzed for patterns of customer complaints.
Text Mining Applications

● Business Operations

– Social network analysis and text mining can be applied to email,
blogs and social media to measure the emotional state and the mood
of employee populations.
– Studying people as emotional investors and using text analysis of the
social internet.
● Legal

– Lawyers can more easily search case histories and laws for relevant
documents in a particular case.
– E-discovery platforms help in minimizing risk in the process of sharing
legal documents.
Text Mining Applications

● Governance and Politics

– Social network analysis and text mining of large-scale social
media data can be used for measuring the emotional state and the
mood of constituent populations.
– In geopolitical security, internet chatter can be processed for real-time
information and to connect the dots on any emerging
threats.
5.1.3 : TEXT MINING PROCESS

● There are three levels

● Level 1 : Identifying frequent words

– This creates a bag of important words. Text documents, or smaller messages, can then
be ranked on how well they match a particular bag of words.
– The corpus of text is established and organized.
● Level 2 : Identifying meaningful phrases from the words.
● Example : “ice” and “cream” are two different key words that often come together. However, there is a more meaningful
phrase formed by combining the two words into “ice cream”. This level is also called “Structure Using Term Document Matrix”.
● Level 3 : Multiple phrases can be combined into topics. This level is also called “Mine TDM for Patterns”.

– Related phrases can be put into a common basket; for example, “ice cream” can go into a basket called “Desserts”.

● Refer to the diagram on page no : 138
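
The three levels can be sketched on a toy corpus. Everything in the block below is hypothetical and only for illustration: the three short documents, the stop-word list, the rule that a phrase is a word pair seen together at least twice, and the extra phrase "apple pie" in the "Desserts" basket.

```python
from collections import Counter

# Hypothetical mini-corpus.
docs = [
    "I love ice cream",
    "ice cream and apple pie",
    "cream cheese on a bagel",
]
stopwords = {"i", "and", "on", "a"}  # assumed stop-word list

# Level 1: frequent words (a bag of important words).
tokens_per_doc = [[w for w in d.lower().split() if w not in stopwords] for d in docs]
word_freq = Counter(w for tokens in tokens_per_doc for w in tokens)

# Level 2: meaningful phrases, taken here as adjacent word pairs seen at least twice.
bigrams = Counter(pair for tokens in tokens_per_doc for pair in zip(tokens, tokens[1:]))
phrases = {" ".join(pair) for pair, n in bigrams.items() if n >= 2}

# Level 3: phrases combined into a topic basket (basket contents are assumed).
topics = {"Desserts": {"ice cream", "apple pie"}}
doc_topics = [
    [topic for topic, basket in topics.items() if any(p in d.lower() for p in basket)]
    for d in docs
]

print(word_freq.most_common(3))
print(phrases)
print(doc_topics)
```

The word "cream" is the most frequent single word, yet only the phrase "ice cream" separates the dessert documents from the one about cream cheese, which is the point of moving from Level 1 to Level 2.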

TDM : TERM DOCUMENT MATRIX

● This is the heart of the structuring process. Text can be converted into numerical data, which can
then be mined using regular data mining techniques.
● A key word, a phrase or a topic can be a term of interest. This approach measures the
frequency of selected important terms occurring in each document. This creates a t x d matrix, where
t = number of terms and d = number of documents.
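
A minimal sketch of building a TDM follows. The three documents and the four terms of interest are hypothetical, and the matrix is stored as a plain dictionary from term to a list of per-document counts.

```python
from collections import Counter

# Hypothetical documents, keyed by document id.
docs = {
    "d1": "profit rose as sales rose",
    "d2": "sales fell and profit fell",
    "d3": "new product launch",
}
# Hypothetical terms of interest (t = 4).
terms = ["profit", "sales", "product", "rose"]

# Count every word once per document.
counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# t x d matrix: one row per term, one column per document.
tdm = {term: [counts[doc_id][term] for doc_id in docs] for term in terms}

t, d = len(tdm), len(docs)
print("shape:", t, "x", d)
for term, row in tdm.items():
    print(f"{term:8s}", row)
```

Each row is now ordinary numerical data, so the usual data mining techniques can be applied to it.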
MINING THE TDM

● The TDM can be mined to extract patterns and knowledge. A variety of techniques can be applied.

– Visualize the highest-frequency terms; this can be done using the
word cloud technique.
– Find predictors of desirable terms. For example, the word “profit” is a
desirable word in a document. The number of occurrences of the word
“profit” in a document could be regressed against many other
terms in the TDM.
WEB MINING

Web mining is the art and science of discovering patterns from the World Wide Web (WWW) so as
to improve it.
The WWW is at the heart of the digital revolution; billions of users
use it every day for a variety of purposes.
The web is used for electronic commerce, business communication, and
many other applications.
Data for web mining is collected via web crawlers, web logs and other
means.
Characteristics of web sites

● Appearance : attractive design, well-formatted content, easy to scan and navigate, good color
contrast.
● Content : well-planned information architecture with useful content, fresh content, SEO, links
to other good sites.
● Functionality : accessible to all authorized users, fast loading time, usable forms, mobile
enabled.
● Feedback Analysis : feedback data can be used for commercial advertisement and even for
social engineering.
TYPES OF WEB MINING (3 TYPES)
Web Content Mining

● A website is designed in the form of pages with distinct URLs. A large website may contain
thousands of web pages.
● These pages are managed by specialized software called a Content Management System.
● Every page may have text, graphics, audio, video, forms, applications, and more kinds of
content.
● The website keeps a record of all requests received for its URLs, including the requester
information obtained using cookies.
● The log of these requests could be analyzed to gauge the popularity of those pages among
different segments of the population (compare the PageRank algorithm, which gauges popularity from the links between pages).
Web Content Mining

● The text and application content on the pages could be analyzed for its usage by visit counts.
● Use quality content to attract users.
● Unwanted and unpopular pages could be weeded out, or they can be transformed with different
content and style.
● Assign more resources to keep the more popular pages fresh and inviting.
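
Visit counts of this kind can be computed directly from a request log. The sketch below uses a hypothetical, already-parsed log of (visitor segment, URL) pairs and an assumed threshold of two visits for flagging unpopular pages; real server logs would need parsing first.

```python
from collections import Counter, defaultdict

# Hypothetical parsed request log: (visitor segment, requested URL).
requests = [
    ("student", "/home"), ("student", "/courses"), ("student", "/courses"),
    ("staff", "/home"), ("staff", "/old-news"),
    ("visitor", "/home"), ("visitor", "/courses"), ("visitor", "/home"),
]

# Overall popularity of each page.
visits = Counter(url for _, url in requests)

# Popularity of each page within each segment of the population.
by_segment = defaultdict(Counter)
for segment, url in requests:
    by_segment[segment][url] += 1

# Pages below an assumed threshold are candidates to weed out or rework.
threshold = 2
unpopular = sorted(url for url, n in visits.items() if n < threshold)
most_popular = visits.most_common(1)[0][0]

print("visits:", dict(visits))
print("most popular page:", most_popular)
print("candidates to weed out:", unpopular)
print("student favourites:", by_segment["student"].most_common(1))
```

The per-segment counts show where to spend effort: a page can be unpopular overall and still matter to one group of users.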
WEB STRUCTURE MINING

● The web works through a system of hyperlinks using HTTP.

● Any page can create a hyperlink to any other page. The self-referral nature
of the web lends itself to some unique network analytical algorithms.
● There are two basic strategic models for successful websites: Hubs and Authorities.
WEB STRUCTURE MINING

● Hubs (Gathering Points) :

– These are pages with a large number of interesting links. Media sites like yahoo.com or
government sites could serve that purpose.
– More focused sites like www.google.com could aspire to become hubs for new emerging
areas.
● Authorities :

– These are the pages which provide the most complete and authoritative information on a
particular subject.
– Example : news, advice, user reviews, etc.
– These websites have the maximum number of inbound links from other websites.
– Example : www.mayoclinic.com (medical opinion), www.nytimes.com (daily news)
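
Hub and authority scores can be computed iteratively from the link structure. The slides do not name an algorithm; the sketch below assumes the well-known HITS scheme (a page's authority is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to) and runs it on a hypothetical four-page web.

```python
import math

# Hypothetical link structure: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": [],
    "D": ["C"],
}

def normalize(scores):
    """Scale scores to unit length so they do not grow without bound."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}

def hits(links, iterations=20):
    """Return (hub, authority) scores after a fixed number of HITS iterations."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to this page.
        auth = normalize({p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages})
        # Hub: sum of authority scores of the pages this page links to.
        hub = normalize({p: sum(auth[t] for t in links.get(p, [])) for p in pages})
    return hub, auth

hub, auth = hits(links)
inbound = {p: sum(p in targets for targets in links.values()) for p in auth}
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)

print("inbound links:", inbound)
print("top authority:", top_authority)
print("top hub:", top_hub)
```

Page C, with the most inbound links, comes out as the authority, while page A, which links to the most useful pages, comes out as the hub.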
WEB USAGE MINING

● The goal of web usage mining is to extract useful information and patterns from data
generated through web page visits and transactions.
● The activity data comes from data stored in server access logs, referrer logs, agent logs, and
client-side cookies.
● The user characteristics and usage profiles are also gathered directly or indirectly through
syndicated data.
● Metadata such as page attributes, content attributes and usage data are also gathered.
Web Content Analysis
There are two types of web usage mining

● The Server-Side Analysis

– Displays the relative popularity of the web pages accessed. Those web
sites could be hubs and authorities.
● The Client-Side Analysis

– Focuses on the usage pattern or the actual content consumed and
created by users.
– Clickstream analysis examines web activity for patterns in the sequence of clicks
and the location and duration of visits on websites.
– Clickstream analysis can be useful for web activity analysis, software
testing, market research and analyzing employee productivity.
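
A clickstream can be summarized by counting page-to-page steps and time spent per page. The session data below is hypothetical: each session is an ordered list of (page, seconds on page) pairs.

```python
from collections import Counter

# Hypothetical clickstream: per user, the ordered pages visited and seconds spent on each.
sessions = {
    "u1": [("/home", 10), ("/products", 40), ("/cart", 20)],
    "u2": [("/home", 5), ("/products", 30)],
    "u3": [("/home", 8), ("/blog", 60)],
}

transitions = Counter()   # how often one page is followed by another
time_on_page = Counter()  # total seconds spent on each page

for clicks in sessions.values():
    pages = [page for page, _ in clicks]
    # Each consecutive pair of pages is one step in the click sequence.
    transitions.update(zip(pages, pages[1:]))
    for page, seconds in clicks:
        time_on_page[page] += seconds

most_common_step = transitions.most_common(1)[0]

print("most common step:", most_common_step)
print("time on page:", dict(time_on_page))
```

The most frequent step shows the dominant path through the site, and the time totals show where visitors actually linger.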
There are two types of web usage mining

● The Client-Side Analysis

– Textual information can be analyzed using the text mining technique
called bag-of-words.
– The bag-of-words matrix can be mined using cluster analysis and association
rules for patterns such as popular topics and sentiment.
– Application : it can help to predict user behavior based on previously
learned rules and user profiles, and can help to determine the lifetime
value of clients.
– It can be used to design cross-marketing strategies by using
association rules.
