Text Mining PPT Merged

Ppz

Uploaded by

K ANIL KUMAR

Text Mining

- extracting essential information from natural language text data.
What is Text Mining?
• Text data mining can be described as the process of
extracting essential data from natural language text.

• Such text includes all the data that we generate via:

o Text messages
o Text documents
o Emails
o Files

• The primary sources of data are e-commerce websites, social
media platforms, published articles, surveys, and many
more.
• The larger part of the generated data is unstructured,
which makes it challenging and expensive for
organizations to analyze manually.
What is Text Mining?
• Text mining, also referred to as text data mining,
roughly equivalent to text analytics, is the process of
deriving high-quality information from text.

• Text mining uses natural language processing (NLP),
allowing machines to understand human language
and process it automatically.

• Natural language processing (NLP) is the ability of a
computer program to understand human language as
it is spoken and written. It is a component of artificial
intelligence (AI).
Why is Text Mining Important?

• Individuals and organizations generate tons of data
every day. A commonly cited estimate is that almost 80%
of existing text data is unstructured, meaning it is not
organized in a predefined way, is hard to search, and is
almost impossible to manage manually. In other words,
it is of little use until it has been processed.
How Does Text Mining Work?
• Text mining helps to analyze large amounts of raw
data and find relevant insights. Combined with
machine learning, it can create text analysis models
that learn to classify or extract specific information
based on previous training.

• The first step in text mining is collecting or gathering
the data.

• Data can be internal (interactions through chats,
emails, surveys, spreadsheets, databases, etc.) or
external (information from social media, review sites,
news outlets, and any other websites).
How Does Text Mining Work?

• The second step is preparing (preprocessing) your
data. Text mining systems use several NLP techniques,
such as segmentation, tokenization, lemmatization,
stemming, and stop-word removal, to build the inputs of
your machine learning model.
The steps to perform preprocessing of data :
• Segmentation:
- Break the entire document/article into its component
sentences using sentence-ending punctuation such as
full stops, question marks, and exclamation marks.
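A minimal sketch of sentence segmentation, assuming that a sentence ends at a full stop, question mark, or exclamation mark followed by whitespace. The function name split_sentences and the sample text are illustrative only; this simple rule will mis-split abbreviations such as "e.g.", which real segmenters handle with extra rules or trained models.

```python
import re


def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    # The lookbehind keeps the punctuation mark attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


document = "Text mining is useful. It uses NLP! Does it work? Yes."
sentences = split_sentences(document)
print(sentences)
```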
The steps to perform preprocessing of data :
• Tokenizing:
- Tokenization is the process of splitting a text or sentence
into smaller units (such as words), which are called tokens.
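As a rough illustration of word-level tokenization, the sketch below lowercases a sentence and pulls out runs of letters and digits with a regular expression. The tokenize helper and the pattern are assumptions made for this example; production tokenizers handle hyphens, numbers, and other languages with more care.

```python
import re


def tokenize(sentence):
    """Return lowercase word tokens; apostrophes inside words are kept."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", sentence.lower())


tokens = tokenize("The cat sat in the hat.")
print(tokens)
```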
The steps to perform preprocessing of data :
• Stemming:
- Stemming is a technique used to extract the base
form of the words by removing affixes from them. For
example, the stem of the words eating, eats, eaten is
eat.

• Lemmatization:
- Lemmatization considers the context and converts
the word to its meaningful base form, which is called
the lemma. For instance, stemming the word 'Caring'
may return 'Car', whereas lemmatizing it
returns 'Care'.
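The difference between the two techniques can be sketched with a toy example. The suffix list and the lemma table below are hypothetical and cover only the words used on this slide: the stemmer blindly strips a suffix, while the lemmatizer looks the word up in a small dictionary. Real systems use algorithms such as the Porter stemmer and full lexical resources instead.

```python
# Illustrative suffix list; not a real stemming algorithm.
SUFFIXES = ("ing", "en", "s")

# Hypothetical lookup table standing in for a real lemma dictionary.
LEMMA_TABLE = {"caring": "care", "eating": "eat", "eats": "eat", "eaten": "eat"}


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def naive_lemmatize(word):
    """Look the word up in the lemma table; unknown words are returned as-is."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


stems = [naive_stem(w) for w in ["eating", "eats", "eaten", "Caring"]]
lemma = naive_lemmatize("Caring")
print(stems, lemma)
```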
The steps to perform preprocessing of data :
• Filtering (or) Removing Stop Words:
- This is the process of removing non-essential words.
Words such as "was", "in", "is", "and", and "the" are called
stop words and can be removed.
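Stop-word filtering can be sketched as a simple set lookup. The stop-word set below contains only the five words named on this slide; real stop-word lists are much longer and depend on the language and the task.

```python
# Only the stop words listed on the slide; an assumption for illustration.
STOP_WORDS = {"was", "in", "is", "and", "the"}


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


filtered = remove_stop_words(["the", "cat", "was", "in", "the", "hat"])
print(filtered)
```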
How Does Text Mining Work?

• The third step is feature extraction from the
preprocessed data.

• The mapping from textual data to real-valued vectors
is called feature extraction.

• One of the commonly used techniques to extract
features from textual data is calculating the frequency
of words/tokens in the document/corpus.
Bag of Words (BOW)
• One of the simplest techniques for representing text in
numerical format (vectors) is Bag of Words (BOW).

• In BOW, we make a list of the unique words in the text
corpus, called the vocabulary. Then we can represent each
sentence or document as a vector, with each word
represented as 1 for presence and 0 for absence.

• More generally, the bag-of-words (BOW) model is a
representation that turns arbitrary text into fixed-length
vectors by counting how many times each word appears.
This process is often referred to as vectorization.
Bag of Words (BOW)
• Let’s understand this with an example. Suppose we
wanted to vectorize the following:
Document 1: the cat sat
Document 2: the cat sat in the hat
Document 3: the cat with the hat

Step 1: Determine the Vocabulary

- We first define our vocabulary, which is the set of all
unique words found in our document set.
- The Words are :
the, cat, sat, in, hat, with
Bag of Words (BOW)
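Step 2: Count
Count how many times each vocabulary word appears in each document:
              the  cat  sat  in  hat  with
Document 1     1    1    1   0    0    0
Document 2     2    1    1   1    1    0
Document 3     2    1    0   0    1    1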
Bag of Words (BOW)
Step 3: Vector Representation
Now we have length-6 vectors for each document.
• the cat sat: [1, 1, 1, 0, 0, 0]
• the cat sat in the hat: [2, 1, 1, 1, 1, 0]
• the cat with the hat: [2, 1, 0, 0, 1, 1]

- Notice that we lose contextual information, e.g.
where in the document the word appeared, when we
use BOW. It’s like a literal bag of words: it only tells
you what words occur in the document,
not where they occurred.
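The three steps above can be reproduced with a short sketch. The helper names (build_vocabulary, count_vector, binary_vector) are illustrative, and the vocabulary is assumed to be ordered by first appearance, which matches the order used on the slides. The count vectors correspond to the worked example, while the binary vectors show the simpler presence/absence variant.

```python
from collections import Counter

documents = [
    "the cat sat",
    "the cat sat in the hat",
    "the cat with the hat",
]


def build_vocabulary(docs):
    """Collect unique words in order of first appearance."""
    vocabulary = []
    for doc in docs:
        for word in doc.split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def count_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


def binary_vector(doc, vocabulary):
    """Mark each vocabulary word as present (1) or absent (0)."""
    words = set(doc.split())
    return [1 if word in words else 0 for word in vocabulary]


vocabulary = build_vocabulary(documents)
count_vectors = [count_vector(doc, vocabulary) for doc in documents]
binary_vectors = [binary_vector(doc, vocabulary) for doc in documents]
print(vocabulary)
print(count_vectors)
```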
Other commonly used feature extraction techniques:
- Term Frequency-Inverse Document Frequency (TF-IDF)
- Word2Vec (W2V)
Multimedia Data Mining
- extracting interesting information from multimedia databases.
What is Multimedia Mining?
• Multimedia data mining discovers interesting patterns
from multimedia databases that store and manage
large collections of multimedia objects.

• The Multimedia data includes the following:

– image data,
– video data,
– audio data,
– sequence data,
– hypertext data containing text.
• Multimedia data mining has a number of uses in
today’s society. An example of this would be the use
of traffic camera footage to analyze traffic flow.

• Multimedia data mining can be defined as a process
that finds patterns in various types of data, including
images, audio, video, and animation.
Categories of Multimedia Data Mining
• Multimedia data mining is classified into two broad
categories: static media (such as text and images) and
dynamic media (such as audio and video).
Text mining
• Text mining, also referred to as text data mining,
is used to find meaningful information in
unstructured texts from various sources.

Image mining
• Image mining systems can discover meaningful
information or image patterns from a huge collection
of images.
Video mining
• Video mining has the objective of describing interesting
patterns from large amounts of video data.

• Video combines several types of multimedia data, such as
images, text, and audio.

• It is widely used in applications such as entertainment,
medicine, education, and sports.

Audio mining
• Audio mining is the technique in which audio signals are
automatically analyzed and searched. This technique is
generally implemented in automatic speech recognition.
Applications of Multimedia Mining:

• Digital Library
• Traffic Video Sequences
• Medical Analysis
• Media Making and Broadcasting
• Surveillance system
Process of Multimedia Data Mining:
Architecture for Multimedia Data Mining:
For similarity search in multimedia data, two main
families of multimedia retrieval systems are considered:

• Description-based retrieval systems build indices
and perform object retrieval based on image descriptions,
such as keywords, captions, size, and creation time.

• Content-based retrieval systems support retrieval
based on image content, such as color histogram,
texture, shape, objects, and wavelet transforms.
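To make content-based retrieval more concrete, here is a minimal sketch of similarity search using color histograms. It assumes each image is a non-empty list of (r, g, b) pixel tuples with values from 0 to 255; the tiny "images", the bin count, and the use of histogram intersection as the similarity measure are all choices made for this example, not something the slides prescribe.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Return a normalized histogram over quantized (r, g, b) colors."""
    step = 256 // bins_per_channel
    histogram = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        index = ((r // step) * bins_per_channel + g // step) * bins_per_channel + b // step
        histogram[index] += 1
    return [count / len(pixels) for count in histogram]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means the two histograms are identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Hypothetical images, each reduced to ten pixels for readability.
images = {
    "sunset": [(250, 120, 30)] * 6 + [(200, 60, 20)] * 4,
    "beach": [(240, 110, 20)] * 5 + [(30, 120, 220)] * 5,
    "forest": [(20, 130, 40)] * 10,
}
histograms = {name: color_histogram(pixels) for name, pixels in images.items()}

# Rank the other images by similarity to the query image.
query = "sunset"
scores = {
    name: histogram_intersection(histograms[query], hist)
    for name, hist in histograms.items()
    if name != query
}
best_match = max(scores, key=scores.get)
print(best_match, scores)
```

A description-based system would instead match on stored keywords or captions, so it never needs to look at the pixel data.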
Models for Multimedia Mining

The data mining models / techniques that are applied
to multimedia data are:
• classification,
• clustering,
• association rule mining.

Spatial Data Mining
- a specialized subfield of data mining that deals
with extracting knowledge from spatial data.
What is Spatial Data Mining?
• Spatial data mining is a specialized subfield of data
mining that deals with extracting knowledge from
spatial data.

• Spatial data refers to data that is associated with a
particular location or geography.

• Examples of spatial data include maps, satellite
images, GPS data, and other geospatial information.
• Spatial data mining involves analyzing and
discovering patterns, relationships, and trends in this
data to gain insights and make informed decisions.

• The use of spatial data mining has become
increasingly important in various fields, such as
logistics, environmental science, urban planning,
transportation, and public health.
• For instance, a transportation company can optimize
its delivery routes for faster and more efficient
deliveries using spatial data mining techniques.

• They can analyze their delivery data along with other
spatial data, such as traffic flow, road network, and
weather patterns, to identify the most efficient routes
for each delivery.
Types of Spatial Data
• Different types of spatial data are used in spatial data
mining. These include point data, line data, and
polygon data.
Point Data
• Point data represents a single location or a set of
locations on a map. Each point is defined by its x and y
coordinates, representing its position in the
geographic space.

• Point data is commonly used to represent geographic
features such as cities, landmarks, or specific
locations of interest. Examples of point data in
transportation include delivery locations, bus stops,
or railway stations.
Line Data
• Line data represents a linear feature, such as a road,
a river, or a pipeline, on a map. Each line is defined by
an ordered sequence of vertices running from the start
point to the end point of the line.

• Line data is commonly used to represent
transportation networks, such as roads, highways,
or railways.
Polygon Data
• Polygon data represents a closed shape or an area on
a map. Each polygon is defined by a set of vertices
that connect to form a closed boundary.

• Polygon data is commonly used to represent
administrative boundaries, land use, or demographic
data.

• In transportation, polygon data can be used to
represent areas of interest, such as delivery zones or
traffic zones.
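The three spatial data types can be sketched with plain coordinate tuples. All coordinates and names below (delivery_stop, route, delivery_zone) are made up for illustration, and planar x/y coordinates are assumed rather than latitude and longitude. The point-in-polygon test uses the standard ray-casting approach; GIS libraries provide more robust versions of these operations.

```python
import math

# Point data: a single (x, y) location, e.g. a delivery stop.
delivery_stop = (2.0, 3.0)

# Line data: an ordered sequence of vertices, e.g. a road segment.
route = [(0.0, 0.0), (3.0, 4.0), (6.0, 4.0)]

# Polygon data: vertices forming a closed boundary, e.g. a delivery zone.
delivery_zone = [(0.0, 0.0), (5.0, 0.0), (5.0, 5.0), (0.0, 5.0)]


def line_length(vertices):
    """Sum the straight-line distances between consecutive vertices."""
    return sum(math.dist(a, b) for a, b in zip(vertices, vertices[1:]))


def point_in_polygon(point, polygon):
    """Ray casting: count polygon edges crossed by a ray going right from the point."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y):  # the edge spans the ray's y value
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


route_length = line_length(route)
stop_in_zone = point_in_polygon(delivery_stop, delivery_zone)
print(route_length, stop_in_zone)
```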
Applications of Spatial Data Mining

The following are some of the applications of spatial data mining:

Urban Planning

• Spatial Data Mining is used by urban planners to analyze and
improve urban dynamics. It can be used to enhance urban
growth, improve transportation systems, and refine decisions
about land.

Public Health

• Spatial Data Mining plays an important role in public health
research. It is used to develop strategies to identify diseases,
track the spread of infections, and optimize healthcare
resources.
Transportation
• Spatial Data Mining can be used to identify traffic
patterns, prevent congestion, manage the transportation
network, and optimize transportation routes.

Environmental Management
• Spatial Data Mining also contributes to environmental
management by detecting changes in the environment,
identifying the land at risk, conserving water and
biodiversity, and monitoring natural resources.

Crime Analysis
• Spatial Data Mining can be used to identify crime
hotspots, understand crime patterns and develop proper
strategies to prevent crimes and hence improve public
safety.
Web Mining
- Discovering interesting and useful information
from Web content and usage data
What is Web Mining?
• Web mining is a data mining technique to extract
knowledge from web data.

• Web data includes :

o web documents
o hyperlinks between documents
o usage logs of web sites

• The WWW is a huge, widely distributed, global
information service centre and is, therefore, a rich
source for data mining.
World Wide Web
• There are about 1.5 billion websites.
• But fewer than 400 million are active.
• The web grows by about 1 million pages a day.
• By the time you finish this class, thousands of new sites will
have spawned.
• Lots of duplication (70-80%)

What is the most visited site in the world?

Go ahead — Google it!

Fun fact:
More than 80% of all Google searches are initiated by the Google staff,
in the process of developing and refining its search algorithms.
How Many Websites Are There in the World
World Wide Web
• Diverse types of data
– Text
– Images
– Audio & Video
Web Mining
• Web mining is the application of data mining
techniques to discover useful information
from the World Wide Web.

• It uses automated methods to extract both
structured and unstructured data from web
pages, server logs and link structures.
Web Mining
Examples:
• Web search, e.g. Google, Yahoo, Bing, Dogpile, DuckDuckGo,
Ecosia, Gigablast, …

• Specialized search, e.g. Squool Tube (search for factual,
educational videos) and Elephind (search the world's historical
newspaper archives).

• eCommerce, e.g. Amazon, Flipkart, eBay, Fiverr, Upwork.

• Advertising, e.g. Google AdSense

• Improving Web site design and performance

Web Mining
Web Content Mining
• Web content mining can be used to extract useful
data, information, and knowledge from web page
content.

• In web content mining, each web page is
considered as an individual document.

• The primary task of content mining is data
extraction, where structured data is extracted
from unstructured websites.
Web Content Mining
• Web content mining can be used to identify
topics on the web.
For example, if a user searches for a specific book
on a search engine, the user gets a list of
suggestions.

• The technologies that are normally used in web
content mining are NLP (Natural Language
Processing) and IR (Information Retrieval).

• Techniques in Web Content Mining :

– Classification
– Clustering
Web Content Mining
Web Usage Mining
• Web usage mining is the process of discovering
interesting usage patterns from large sets of web
data.

• These patterns help you understand user behavior.

• Web usage mining is used for mining the web log
records (access information of web pages) and helps to
discover the user access patterns of web pages.

• The web server registers a web log entry for every web
page request.
Web Usage Mining
• The main sources of data here are the web server
and the application server.

• Log files are created when a user/customer
interacts with a web page.

• Techniques in Web Usage Mining :

– Association Rules
– Classification
– Clustering
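A minimal sketch of mining access patterns from web log records. It assumes the server writes entries in the Common Log Format and treats each IP address as one visitor, which is a simplification; the sample log lines use documentation-only IP addresses and made-up paths. The script counts hits per page and the most frequent page-to-page transitions.

```python
import re
from collections import Counter, defaultdict

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) \S+$'
)

# Hypothetical log entries, already in chronological order.
sample_log = [
    '192.0.2.1 - - [10/Oct/2023:13:55:01 +0000] "GET /home HTTP/1.1" 200 2326',
    '192.0.2.1 - - [10/Oct/2023:13:55:20 +0000] "GET /products HTTP/1.1" 200 5120',
    '192.0.2.2 - - [10/Oct/2023:13:56:02 +0000] "GET /home HTTP/1.1" 200 2326',
    '192.0.2.1 - - [10/Oct/2023:13:56:40 +0000] "GET /cart HTTP/1.1" 200 1042',
    '192.0.2.2 - - [10/Oct/2023:13:57:15 +0000] "GET /products HTTP/1.1" 200 5120',
    '192.0.2.3 - - [10/Oct/2023:13:58:09 +0000] "GET /home HTTP/1.1" 200 2326',
]

page_hits = Counter()
visitor_paths = defaultdict(list)

for line in sample_log:
    match = LOG_PATTERN.match(line)
    if match is None:
        continue  # skip malformed entries
    host, _timestamp, _method, path, _status = match.groups()
    page_hits[path] += 1
    visitor_paths[host].append(path)

# Count how often visitors move directly from one page to another.
transitions = Counter()
for pages in visitor_paths.values():
    transitions.update(zip(pages, pages[1:]))

print(page_hits.most_common())
print(transitions.most_common(1))
```

Counts like these are the raw material for the techniques listed above, for example association rules over pages that are visited together.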
Web Usage Mining
Advantage:
• This technology has enabled e-commerce to do
personalized marketing, which eventually results in
higher trade volumes.

Disadvantage:
• When used on data of a personal nature, this
technology might raise concerns. The most criticized
ethical issue involving web usage mining is the
invasion of privacy.
Web Structure Mining
• Web structure mining is the process of discovering
structure information from the web.

• The structure of the web graph consists of web pages as
nodes, and hyperlinks as edges connecting related pages.

• Structure mining basically produces a structured
summary of a particular website. It identifies
relationships between web pages that are linked by
shared information or by a direct link connection.
• Techniques in Web Structure Mining :
– Association Rules
– Classification
Web Structure Mining
Example:
• The most important application in this regard is
the Google search engine, which ranks its results
using, among other signals, the
PageRank algorithm.

• The rank of a page is decided by the number and
quality of links pointing to the target node.

• In general, Web structure mining can be very
useful to companies to determine the
connection between two commercial websites.
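The idea that a page's rank depends on the number and quality of its incoming links can be sketched with a basic PageRank power iteration. The four-page web graph, the damping factor of 0.85, and the fixed iteration count are assumptions for this example; this is the textbook formulation, not a description of how any search engine is implemented today.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores for a graph given as {page: [linked pages]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical site: web pages are nodes, hyperlinks are directed edges.
web_graph = {
    "home": ["products", "blog"],
    "products": ["home"],
    "blog": ["home", "products"],
    "contact": ["home"],
}
ranks = pagerank(web_graph)
top_page = max(ranks, key=ranks.get)
print(top_page, ranks)
```

In this graph "home" ends up with the highest score because three pages link to it, while "contact" has no incoming links and keeps only the baseline share.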
Web Structure Mining
Web Structure Mining
