Prof. Mohammed Tanzeem Agra
1) Text Mining
2) Naive Bayes Analysis
3) Support Vector Machines
4) Web Mining
5) Social Network Analysis
NAIVE-Bayes Analysis
● Suppose a salon needs to predict the service required by an incoming customer.
● Only two services are offered : Hair Cut (R) and Manicure-Pedicure (M)
● Predict : whether the next customer will be for R or M
● Let the number of classes be K = 2
● Data for one year : 2500 customers for R and 1500 customers for M
● Ans : the default (prior) probability that the next customer is for R is 2500/4000 or 5/8, and for M it is
1500/4000 or 3/8, so the next customer would most likely be for R.
● Example Sequence : R, M, R, M, M
● NB Posterior Probability P(R) = 5/8 × 2/5 = 10/40 = 0.25 (prior × share of R in the sequence, 2 of 5)
● NB Posterior Probability P(M) = 3/8 × 3/5 = 9/40 = 0.225 (prior × share of M in the sequence, 3 of 5)
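The arithmetic of the salon example can be reproduced with a short Python sketch. The variable names are illustrative only, and using the share of each service in the example sequence as the second factor is an assumption read off the slide's numbers (2/5 and 3/5). The two products are unnormalized scores; the last step shows one way to rescale them so they add up to 1.

```python
from fractions import Fraction

# One year of history from the salon example.
history = {"R": 2500, "M": 1500}
total = sum(history.values())

# Prior probability of each class: 2500/4000 = 5/8 and 1500/4000 = 3/8.
prior = {c: Fraction(n, total) for c, n in history.items()}

# Example sequence; the share of each class in it is used as the
# second factor (assumption taken from the slide: 2/5 for R, 3/5 for M).
sequence = ["R", "M", "R", "M", "M"]
share = {c: Fraction(sequence.count(c), len(sequence)) for c in prior}

# Unnormalized Naive Bayes score = prior * share.
score = {c: prior[c] * share[c] for c in prior}

# Dividing by the sum rescales the scores so that they add up to 1.
norm = sum(score.values())
posterior = {c: s / norm for c, s in score.items()}

for c in score:
    print(c, float(score[c]), float(posterior[c]))
```

R still has the higher score, so the prediction for the next customer remains R.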
TEXT CLASSIFICATION EXAMPLE
● An SVM is a classifier function in a high-dimensional space that defines the decision boundary
between two classes.
● A labeled set of points is classified into two classes called segments. The goal is to find the best
classifier between the points of the two types.
● SVM takes the widest-street approach to demarcate the two classes and thus finds the
hyperplane that has the widest margin.
● In the diagram, the dotted line is the optimal hyperplane. The solid lines are the gutters on the sides
of the two classes.
● The gap between the gutters is the maximum margin.
● The classifier is defined only by those points that fall on the gutters on both sides. These points are
called support vectors.
● The rest of the data points are irrelevant for defining the classifier.
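The ideas of gutters, margin and support vectors can be illustrated with a small Python sketch. It assumes the standard formulation, which the slides do not spell out: the hyperplane is w·x + b = 0, the gutters are w·x + b = ±1, and the margin width is 2/||w||. The points, `w` and `b` are made up and hand-picked for illustration; the sketch does not train an SVM.

```python
import math

# Hypothetical 2-D points with class labels -1 / +1 (made up for illustration).
points = [((0.0, 0.0), -1), ((1.0, 0.0), -1), ((1.0, 1.0), -1),
          ((3.0, 3.0), +1), ((4.0, 4.0), +1)]

# Assumed separating hyperplane w.x + b = 0, hand-picked for these points.
w = (0.5, 0.5)
b = -2.0

def decision(x):
    """Signed value of w.x + b; its sign gives the predicted class."""
    return w[0] * x[0] + w[1] * x[1] + b

# Width of the "street" between the two gutters.
margin_width = 2.0 / math.hypot(*w)

# Support vectors are the points lying exactly on a gutter: y * (w.x + b) = 1.
support_vectors = [x for x, y in points if math.isclose(y * decision(x), 1.0)]

predictions = [1 if decision(x) >= 0 else -1 for x, _ in points]

print("margin width:", margin_width)
print("support vectors:", support_vectors)
print("predictions:", predictions)
```

Removing any of the other three points would leave `w`, `b` and the margin unchanged, which is what "irrelevant for defining the classifier" means here.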
SVM MODEL
● Notations
● The heart of an SVM algorithm is the kernel method. Most kernel algorithms are based on
optimization in a convex space.
● Kernel methods operate using what is called the “kernel trick”.
● The trick involves computing and working with the inner products of only the relevant pairs of data
in the feature space.
● The kernel trick makes the algorithm much less demanding in computational and memory
resources.
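A minimal Python sketch of the kernel trick, assuming a degree-2 polynomial kernel K(x, y) = (x·y)² on 2-D points (the slides do not fix a particular kernel; this one is chosen for illustration). The explicit feature map `phi` and the sample points are likewise illustrative. The sketch shows that the kernel returns the same inner product as the mapped vectors without ever constructing them.

```python
import math

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for a 2-D point (3-D feature space)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 4.0)

explicit = dot(phi(x), phi(y))   # inner product in the feature space
via_kernel = poly_kernel(x, y)   # same value, no feature map needed

print(explicit, via_kernel)
```

For higher degrees or dimensions the explicit feature space grows quickly, while the kernel still needs only one inner product in the original space; this is the saving in computation and memory that the trick provides.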
THE KERNEL METHOD
● There are several types of support vector models, including linear, polynomial, and sigmoid.
● In all types, high weights are assigned to the abnormal situations and very low weights to the normal
situations.
● SVMs can be made flexible enough to tolerate some amount of misclassification; this is the
distinction between a Soft Margin and a Hard Margin.
ADVANTAGES AND DISADVANTAGES
Advantages:
● SVMs work very well when the number of features is larger than the number of instances.
● They can work on data sets with a huge feature space; example : spam filtering.
● SVMs are easy to understand. They create an easy-to-understand linear classifier.
● They are computationally efficient.
● SVMs are now available with almost all data analytics toolsets.
Disadvantages:
● SVMs work well only with real numbers. All the data should be defined in numerical values.
● SVMs work only with binary classification problems.
● Training an SVM is an inefficient and time-consuming process when the data is large.
● SVMs do not work well when there is much noise in the data.
● SVMs also do not provide a probability estimate of the classification.
TEXT MINING
✔ Text mining is the art and science of discovering patterns from an organized collection of
textual databases.
✔ Text mining can help with frequency analysis of important terms and their semantic
relationships.
✔ Text mining can be applied to large-scale social media data for gathering preferences and
measuring emotional sentiments.
5.1.1: TEXT MINING – EXAMPLE
● Format : Word Documents, PDF Files, XML Files, Text messages etc
● In Legal Profession : text sources would include laws, court deliberations, court orders etc
● Academic Research : published papers and articles
● World of Finance : statutory reports, internal histories, discharge summaries etc
● Medicine : medical journals, patient history, discharge summaries
● Marketing : advertisement, customers comments etc.
5.1.2: Text Mining Applications
● Marketing
● The voice of the customer can be captured in its native and raw format and then analyzed for
customer preferences and complaints.
● Social personas are a clustering technique to develop the customer segments of interest.
● A listening platform is a text mining application that gathers social media, blogs and other
textual feedback in real time and filters out the chatter to extract true consumer sentiments.
● BPO conversations and records can be analyzed for patterns of customer complaints.
● Business Operations
–Lawyers can more easily search case histories and laws for relevant
documents in a particular case.
–E-discovery platforms help in minimizing risk in the process of sharing
legal documents.
Text Mining Applications
–This creates a bag of important words. Text documents, or smaller messages, can then
be ranked on how well they match a particular bag-of-words.
–Establish the corpus of text and organize it.
● Level 2 : Identifying the meaningful phrases from the words.
● Example : “ice” and “cream” will be two different key words that often come together. However, there is a more meaningful
phrase formed by combining the two words into “ice cream”. This is also called “Structure Using Term Document Matrix”.
● Level 3 : Multiple phrases can be combined, or Mine TDM for Patterns
–The two phrases can be put into a common basket, and this basket is called “Desserts”.
● This is the heart of the structuring process. Text can be converted into numerical data, which can
then be mined using regular data mining techniques.
● One could call key words, phrases or topics terms of interest. This approach measures the
frequencies of selected important terms occurring in each document. This creates a t × d
term-document matrix (TDM), where t = number of terms and d = number of documents.
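Building a t × d term-document matrix can be sketched in a few lines of Python. The three documents and the list of terms of interest below are made up for illustration, and `count_term` is a hypothetical helper that counts a word or a multi-word phrase such as "ice cream" as one term.

```python
# Made-up corpus: document id -> text.
documents = {
    "d1": "I love ice cream and cake",
    "d2": "ice cream is a cold dessert",
    "d3": "the cake was served after dinner",
}

# Terms of interest: single key words or phrases (chosen for illustration).
terms = ["ice cream", "cake", "dessert"]

def count_term(text, term):
    """Count occurrences of a word or phrase, matched on whole tokens."""
    tokens = text.lower().split()
    words = term.split()
    n = len(words)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words)

doc_ids = list(documents)

# t x d matrix: one row per term, one column per document.
tdm = [[count_term(documents[d], t) for d in doc_ids] for t in terms]

print("term".ljust(12), *doc_ids)
for term, row in zip(terms, tdm):
    print(term.ljust(12), *row)
```

Each row is now a numeric vector, so regular data mining techniques can be applied to it.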
MINING THE TDM
● The TDM can be mined to extract patterns/knowledge. A variety of techniques can be applied.
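As one possible illustration of mining a TDM, the sketch below compares documents by the cosine similarity of their columns. Cosine similarity is an assumption here, picked as a common choice; the slides do not name a specific technique. The small matrix and its counts are made up.

```python
import math

# Made-up 3 x 3 TDM: rows are terms, columns are documents.
terms = ["ice cream", "cake", "dessert"]
doc_ids = ["d1", "d2", "d3"]
tdm = [
    [2, 1, 0],
    [1, 0, 3],
    [0, 1, 1],
]

def column(j):
    """Term-frequency vector of document j."""
    return [row[j] for row in tdm]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

# Similarity for every pair of documents.
similarity = {
    (doc_ids[i], doc_ids[j]): cosine(column(i), column(j))
    for i in range(len(doc_ids))
    for j in range(i + 1, len(doc_ids))
}

most_similar = max(similarity, key=similarity.get)
print(similarity)
print("most similar pair:", most_similar)
```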
Web mining is the art and science of discovering patterns from the WWW so as
to improve.
The WWW is at the heart of the digital revolution; billions of users are
using it every day for a variety of purposes.
The web is used for electronic commerce, business communication, and
many other applications.
Data for web mining is collected via web crawlers, web logs and other
means.
Characteristics of web sites
● Appearance : Attractive design, well-formatted content, easy to scan and navigate, good color
contrasts.
● Content : Well-planned information architecture with useful content, fresh content, SEO, links
to other good sites.
● Functionality : Accessible to all authorized users, fast loading time, usable forms, mobile
enabled.
● Feedback Analysis : Feedback data can be used for commercial advertisement and even for
social engineering.
TYPES OF WEB MINING (3 TYPES)
Web Content Mining
● A web site is designed in the form of pages with distinct URLs. A large website may contain
thousands of web pages.
● These pages are managed by specialized software called a Content Management System.
● Every page may have text, graphics, audio, video, forms, applications, and other kinds of
content.
● The website keeps a record of all requests received for its URLs, including the requester
information, using cookies.
● The log of these requests could be analyzed to gauge the popularity of those pages among
different segments of the population (PageRank Algorithm).
● The text and application content on the pages could be analyzed for its usage by visit counts.
● Use quality content to attract users.
● Unwanted and unpopular pages could be weeded out, or they can be transformed with different
content and style.
● Assign more resources to keep the more popular pages fresh and inviting.
WEB STRUCTURE MINING
● Hubs:
– These are pages with a large number of interesting links; media sites like yahoo.com or
government sites could serve that purpose.
– More focused sites like www.google.com could aspire to become hubs for new emerging
areas.
● Authorities:
– The pages which provide the most complete and authoritative information on a
particular subject.
– Example : news, advice, user reviews etc
– These web sites have the maximum number of inbound links from other web sites.
– Example : www.mayoclinic.com (medical opinion), www.nytimes.com (daily news)
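Following the descriptions above, a very simple way to spot hub and authority candidates is to count links: pages with many outbound links are hub candidates, and pages with many inbound links are authority candidates. The sketch below does only that counting; the link graph and the `.example` page names are made up, and real structure mining would typically use more refined link-analysis scores.

```python
from collections import Counter

# Hypothetical link graph as (source page, target page) pairs.
links = [
    ("portal.example", "clinic.example"),
    ("portal.example", "news.example"),
    ("portal.example", "blog.example"),
    ("blog.example", "clinic.example"),
    ("news.example", "clinic.example"),
]

# Outbound links per page: many outbound links suggest a hub.
outbound = Counter(src for src, _ in links)

# Inbound links per page: many inbound links suggest an authority.
inbound = Counter(dst for _, dst in links)

top_hub = outbound.most_common(1)[0][0]
top_authority = inbound.most_common(1)[0][0]

print("hub candidate:", top_hub, outbound[top_hub])
print("authority candidate:", top_authority, inbound[top_authority])
```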
WEB USAGE MINING
● The goal of web usage mining is to extract useful information and patterns from data
generated through web page visits and transactions.
● The activity data comes from data stored in server access logs, referrer logs, agent logs, and
client-side cookies.
● The user characteristics and usage profiles are also gathered directly or indirectly through
syndicated data.
● Metadata such as page attributes, content attributes and usage data are also gathered.
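A minimal sketch of extracting a usage pattern from a server access log: it counts successful requests per page to show relative popularity. The log lines are made up, and the layout is assumed to be the Common Log Format; a real server may log in a different format, in which case the regular expression would need adjusting.

```python
import re
from collections import Counter

# Made-up access-log lines, assumed to follow the Common Log Format.
log_lines = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2023:13:56:01 +0000] "GET /pricing.html HTTP/1.1" 200 1804',
    '203.0.113.5 - - [10/Oct/2023:13:57:12 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '198.51.100.7 - - [10/Oct/2023:13:58:40 +0000] "GET /old-page.html HTTP/1.1" 404 512',
]

# Capture the client address, the requested path and the status code.
REQUEST = re.compile(r'^(\S+) .*"(?:GET|POST) (\S+) HTTP/[\d.]+" (\d{3}) ')

page_hits = Counter()   # successful requests per page
visitors = {}           # distinct client addresses per page

for line in log_lines:
    match = REQUEST.match(line)
    if not match:
        continue  # skip lines that do not fit the assumed format
    client, path, status = match.groups()
    if status == "200":
        page_hits[path] += 1
        visitors.setdefault(path, set()).add(client)

for path, hits in page_hits.most_common():
    print(path, hits, "hits from", len(visitors[path]), "visitor(s)")
```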
Web Content Analysis
There are two types of web usage mining
– Display the relative popularity of the web pages accessed. Those web
sites could be hubs and authorities.
● The client-side analysis