Text Mining PPT Merged


Uploaded by

K ANIL KUMAR

Text Mining

- extracting essential information from natural
language text data.
What is Text Mining?
• Text data mining can be described as the process of
extracting essential data from natural language text.

• It covers all the text data that we generate via:
o Text messages
o Text documents
o Emails
o Files

• The primary sources of data are e-commerce websites,
social media platforms, published articles, surveys, and
many more.
• Most of the generated data is unstructured,
which makes it challenging and expensive for
organizations to analyze manually.
What is Text Mining?
• Text mining, also referred to as text data mining and
roughly equivalent to text analytics, is the process of
deriving high-quality information from text.

• Text mining uses natural language processing (NLP),
allowing machines to understand human language
and process it automatically.

• Natural language processing (NLP) is the ability of a
computer program to understand human language as
it is spoken and written. It is a component of artificial
intelligence (AI).
Why is Text Mining Important?

• Individuals and organizations generate large volumes
of data every day. Commonly cited estimates suggest that
almost 80% of existing text data is unstructured, meaning
it is not organized in a predefined way, is hard to
search, and is difficult to manage. In other words, it is
of little use until it has been processed.
How Does Text Mining Work?
• Text mining helps to analyze large amounts of raw
data and find relevant insights. Combined with
machine learning, it can create text analysis models
that learn to classify or extract specific information
based on previous training.

• The first step in text mining is collecting or gathering
the data.

• Data can be internal (interactions through chats,
emails, surveys, spreadsheets, databases, etc.) or
external (information from social media, review sites,
news outlets, and any other websites).
How Does Text Mining Work?

• The second step is preparing (preprocessing) your
data. Text mining systems use several NLP techniques,
such as segmentation, tokenization, lemmatization,
stemming, and stop-word removal, to build the inputs of
your machine learning model.
The steps to perform preprocessing of data :
• Segmentation:
- Break the entire document/article into its component
sentences at sentence-ending punctuation such as full
stops, question marks, and exclamation marks.
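As a rough illustration of segmentation, the sketch below splits text at sentence-ending punctuation using Python's standard `re` module. The function name `split_sentences` and the sample text are assumptions made for this example; real systems also handle abbreviations, decimals, and similar edge cases that this ignores.

```python
import re


def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    # The lookbehind keeps each punctuation mark attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


sample = "Text mining is useful. Does it need NLP? Yes, it does!"
sentences = split_sentences(sample)
print(sentences)
```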
The steps to perform preprocessing of data :
• Tokenizing:
- Tokenization is the process of splitting a text or sentence
into smaller units (typically words), which are called tokens.
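A minimal sketch of word-level tokenization, assuming that lowercasing the text and dropping punctuation is acceptable for the task. The `tokenize` helper is a hypothetical name, not part of any library.

```python
import re


def tokenize(sentence):
    """Return lowercase word tokens, dropping punctuation."""
    # A token here is a run of letters, digits, or apostrophes.
    return re.findall(r"[a-z0-9']+", sentence.lower())


tokens = tokenize("The cat sat in the hat.")
print(tokens)
```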
The steps to perform preprocessing of data :
• Stemming:
- Stemming is a technique used to extract the base
form of the words by removing affixes from them. For
example, the stem of the words eating, eats, eaten is
eat.

• Lemmatization:
- Lemmatization considers the context and converts
the word to its meaningful base form, which is called
the lemma. For instance, stemming the word 'Caring'
may return 'Car', whereas lemmatizing the word
'Caring' would return 'Care'.
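The difference between the two can be sketched with a toy example. The suffix list and the small lemma dictionary below are assumptions chosen only to reproduce the examples on this slide; production stemmers and lemmatizers use far richer rules and vocabularies.

```python
# Toy suffix list, checked in order (assumption for illustration only).
SUFFIXES = ("ing", "en", "s")

# Toy lemma dictionary (assumption); real lemmatizers use full lexicons.
LEMMAS = {"caring": "care", "eating": "eat", "eats": "eat", "eaten": "eat"}


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def lemmatize(word):
    """Look the word up in the lemma dictionary, else return it unchanged."""
    w = word.lower()
    return LEMMAS.get(w, w)


print(naive_stem("Caring"), lemmatize("Caring"))
```

The stemmer blindly removes "ing" and produces a non-word, while the dictionary lookup returns a valid base form.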
The steps to perform preprocessing of data :
• Filtering (or Removing Stop Words):
- This is the process of removing non-essential words.
Words such as was, in, is, and, the are called stop
words and can be removed.
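Stop-word filtering can be sketched as a simple set lookup. The stop-word set below contains only the five words named on this slide; real lists are much longer, and the helper name `remove_stop_words` is an assumption.

```python
# Only the stop words named on the slide; real lists are much longer.
STOP_WORDS = {"was", "in", "is", "and", "the"}


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


filtered = remove_stop_words(["the", "cat", "was", "in", "the", "hat"])
print(filtered)
```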
How Does Text Mining Work?

• The third step is feature extraction from preprocessed
data.

• The mapping from textual data to real-valued vectors
is called feature extraction.

• One commonly used technique to extract
features from textual data is calculating the frequency
of words/tokens in the document/corpus.
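As a small sketch of frequency-based features, the snippet below counts tokens with `collections.Counter` from the standard library. It assumes the text has already been tokenized; the token list is made up for illustration.

```python
from collections import Counter


def term_frequencies(tokens):
    """Count how often each token occurs in a tokenized document."""
    return Counter(tokens)


freqs = term_frequencies(["the", "cat", "sat", "in", "the", "hat"])
print(freqs.most_common(2))
```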
Bag of Words (BOW)
• One of the simplest techniques for representing text
in numerical format (vectors) is Bag of Words (BOW).

• In BOW, we make a list of the unique words in the text
corpus, called the vocabulary. In the simplest (binary)
form, we represent each sentence or document as a vector,
with each vocabulary word represented as 1 for presence
and 0 for absence.

• More commonly, the bag-of-words (BOW) model
turns arbitrary text into fixed-length vectors by
counting how many times each word appears. This
process is often referred to as vectorization.
Bag of Words (BOW)
• Let’s understand this with an example. Suppose we
wanted to vectorize the following:
Document 1: the cat sat
Document 2: the cat sat in the hat
Document 3: the cat with the hat

Step 1: Determine the Vocabulary
- We first define our vocabulary, which is the set of all
unique words found in our document set.
- The words are:
the, cat, sat, in, hat, with
Bag of Words (BOW)
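Step 2: Count
- For each document, count how many times each
vocabulary word appears:

Document                  the  cat  sat  in  hat  with
the cat sat               1    1    1    0   0    0
the cat sat in the hat    2    1    1    1   1    0
the cat with the hat      2    1    0    0   1    1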
Bag of Words (BOW)
Step 3: Vector Representation
Now we have length-6 vectors for each document.
• the cat sat: [1, 1, 1, 0, 0, 0]
• the cat sat in the hat: [2, 1, 1, 1, 1, 0]
• the cat with the hat: [2, 1, 0, 0, 1, 1]

- Notice that we lose contextual information, e.g.
where in the document the word appeared, when we
use BOW. It’s like a literal bag-of-words: it only tells
you what words occur in the document,
not where they occurred.
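The three example documents can be vectorized with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization and the vocabulary order used on the slides; the function names are hypothetical.

```python
VOCABULARY = ["the", "cat", "sat", "in", "hat", "with"]
DOCUMENTS = ["the cat sat", "the cat sat in the hat", "the cat with the hat"]


def count_vector(document, vocabulary):
    """Count-based BOW: how many times each vocabulary word appears."""
    words = document.split()
    return [words.count(term) for term in vocabulary]


def binary_vector(document, vocabulary):
    """Binary BOW: 1 if the vocabulary word is present, else 0."""
    words = set(document.split())
    return [1 if term in words else 0 for term in vocabulary]


count_vectors = [count_vector(d, VOCABULARY) for d in DOCUMENTS]
binary_vectors = [binary_vector(d, VOCABULARY) for d in DOCUMENTS]
for doc, vec in zip(DOCUMENTS, count_vectors):
    print(doc, vec)
```

Sorting the words of a document before vectorizing gives the same vector, which is exactly the loss of word order noted above.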
Other feature extraction techniques:
- Term Frequency-Inverse Document Frequency (TF-IDF)
- Word2Vec (W2V)
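To give a feel for TF-IDF, here is a minimal sketch of one common unsmoothed variant: term frequency (count divided by document length) multiplied by the natural log of (number of documents / number of documents containing the term). The slides do not specify a formula, so this choice is an assumption, and libraries typically apply additional smoothing and normalization.

```python
import math

DOCUMENTS = [
    ["the", "cat", "sat"],
    ["the", "cat", "sat", "in", "the", "hat"],
    ["the", "cat", "with", "the", "hat"],
]


def tf_idf(documents):
    """Return one {term: weight} dict per document (unsmoothed variant)."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = {}
    for doc in documents:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    weights = []
    for doc in documents:
        doc_weights = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)
            idf = math.log(n_docs / doc_freq[term])
            doc_weights[term] = tf * idf
        weights.append(doc_weights)
    return weights


weights = tf_idf(DOCUMENTS)
print(weights[2])
```

Words that occur in every document, such as "the", receive a weight of zero, while a word unique to one document, such as "with", stands out.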
Multimedia Data Mining
- extracting interesting information from
multimedia databases.
What is Multimedia Mining?
• Multimedia data mining discovers interesting patterns
from multimedia databases that store and manage
large collections of multimedia objects.

• Multimedia data includes the following:
– image data,
– video data,
– audio data,
– sequence data,
– hypertext data containing text.
• Multimedia data mining has a number of uses in
today’s society. An example of this would be the use
of traffic camera footage to analyze traffic flow.

• Multimedia data mining can be defined as a process
that finds patterns in various types of data, including
images, audio, video, and animation.
Categories of Multimedia Data Mining
• Multimedia data mining is classified into two broad
categories: static media (such as text and images) and
dynamic media (such as audio and video).
Text mining
• Text mining, also referred to as text data mining, is
used to find meaningful information in unstructured
text from various sources.

Image mining
• Image mining systems can discover meaningful
information or image patterns from a huge collection
of images.
Video mining
• Video mining has the objective of discovering interesting
patterns from large amounts of video data.

• Video combines several types of multimedia data, such as
images, text, audio, and visual content.

• It is widely used in applications such as entertainment,
medicine, education, and sports.

Audio mining
• Audio mining is the technique in which audio signals are
automatically analyzed and searched. This technique is
generally implemented in automatic speech recognition.
Applications of Multimedia Mining:

• Digital Library
• Traffic Video Sequences
• Medical Analysis
• Media Making and Broadcasting
• Surveillance system
Process of Multimedia Data Mining:
Architecture for Multimedia Data Mining:
Similarity search in multimedia data relies on two main
families of multimedia retrieval systems:

• A description-based retrieval system builds indices
and performs object retrieval based on image descriptions,
such as keywords, captions, size, and creation time.

• A content-based retrieval system supports retrieval
based on image content, such as color histograms,
texture, shape, objects, and wavelet transforms.
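As one possible illustration of content-based retrieval, the sketch below compares images by their color histograms using histogram intersection. Images are represented here as plain lists of (R, G, B) tuples, which is an assumption made to keep the example self-contained; the function names, the bin count, and the tiny sample images are also hypothetical. The code assumes non-empty images.

```python
def color_histogram(pixels, bins=4):
    """Normalized joint RGB histogram with `bins` levels per channel."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        index = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[index] += 1
    total = len(pixels)
    return [h / total for h in hist]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))


red_image = [(255, 0, 0)] * 4
mostly_red = [(250, 10, 5)] * 3 + [(0, 0, 255)]
blue_image = [(0, 0, 255)] * 4

query = color_histogram(red_image)
sim_red = histogram_intersection(query, color_histogram(mostly_red))
sim_blue = histogram_intersection(query, color_histogram(blue_image))
print(sim_red, sim_blue)
```

A retrieval system built this way would rank stored images by their similarity to the query image rather than by keywords or captions.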
Models for Multimedia Mining

The data mining models / techniques that are applied
to multimedia data are:
• classification,
• clustering,
• association rule mining.


Spatial Data Mining
- a specialized subfield of data mining that deals
with extracting knowledge from spatial data.
What is Spatial Data Mining?
• Spatial data mining is a specialized subfield of data
mining that deals with extracting knowledge from
spatial data.

• Spatial data refers to data that is associated with a
particular location or geography.

• Examples of spatial data include maps, satellite
images, GPS data, and other geospatial information.
• Spatial data mining involves analyzing and
discovering patterns, relationships, and trends in this
data to gain insights and make informed decisions.

• The use of spatial data mining has become
increasingly important in various fields, such as
logistics, environmental science, urban planning,
transportation, and public health.
• For instance, a transportation company can optimize
its delivery routes for faster and more efficient
deliveries using spatial data mining techniques.

• They can analyze their delivery data along with other
spatial data, such as traffic flow, road network, and
weather patterns, to identify the most efficient routes
for each delivery.
Types of Spatial Data
• Different types of spatial data are used in spatial data
mining. These include point data, line data, and
polygon data.
Point Data
• Point data represents a single location or a set of
locations on a map. Each point is defined by its x and y
coordinates, representing its position in the
geographic space.

• Point data is commonly used to represent geographic
features such as cities, landmarks, or specific
locations of interest. Examples of point data in
transportation include delivery locations, bus stops,
or railway stations.
Line Data
• Line data represents a linear feature, such as a road,
a river, or a pipeline, on a map. Each line is defined by
an ordered set of vertices, from the start point through
any intermediate points to the end point of the line.

• Line data is commonly used to represent
transportation networks, such as roads, highways,
or railways.
Polygon Data
• Polygon data represents a closed shape or an area on
a map. Each polygon is defined by a set of vertices
that connect to form a closed boundary.

• Polygon data is commonly used to represent
administrative boundaries, land use, or demographic
data.

• In transportation, polygon data can be used to
represent areas of interest, such as delivery zones or
traffic zones.
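The three data types can be sketched with plain Python tuples and lists. The example below assumes planar x/y coordinates (not latitude/longitude) and uses made-up values; `path_length` and `point_in_polygon` are hypothetical helper names. The point-in-polygon check uses the standard ray-casting idea and does not treat points lying exactly on the boundary specially.

```python
import math


def path_length(line):
    """Total length of a line given as an ordered list of (x, y) vertices."""
    return sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(line, line[1:])
    )


def point_in_polygon(point, polygon):
    """Ray casting: count how many edges a ray to the right crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


bus_stop = (1.0, 1.0)                           # point data
road = [(0.0, 0.0), (3.0, 4.0), (3.0, 6.0)]     # line data
delivery_zone = [(0, 0), (4, 0), (4, 4), (0, 4)]  # polygon data

print(path_length(road))
print(point_in_polygon(bus_stop, delivery_zone))
```

A delivery company could use a check like this to decide which delivery zone a given address falls into.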
Applications of Spatial Data Mining

The following are some of the applications of spatial data mining:

Urban Planning

• Spatial Data Mining is used by urban planners to analyze and
improve urban dynamics. It can be used to enhance urban
growth, improve transportation systems, and refine decisions
about land use.

Public Health

• Spatial Data Mining plays an important role in public health
research. It is used to develop strategies to identify diseases,
track the spread of infections, and optimize healthcare
resources.
Transportation
• Spatial Data Mining can be used to identify traffic
patterns, prevent congestion, manage the transportation
network, and optimize transportation routes.

Environmental Management
• Spatial Data Mining also contributes to environmental
management by detecting changes in the environment,
identifying the land at risk, conserving water and
biodiversity, and monitoring natural resources.

Crime Analysis
• Spatial Data Mining can be used to identify crime
hotspots, understand crime patterns and develop proper
strategies to prevent crimes and hence improve public
safety.
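One simple way to look for hotspots in point data is to overlay a grid and count the points in each cell. The sketch below assumes planar coordinates and made-up incident locations; the function name `hotspot_cells` and the cell size are hypothetical, and real analyses often use density estimation or clustering instead.

```python
from collections import Counter


def hotspot_cells(points, cell_size, top_n=1):
    """Bin points into square grid cells and return the busiest cells."""
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )
    return counts.most_common(top_n)


incidents = [(0.5, 0.5), (0.7, 0.2), (0.9, 0.9), (5.1, 5.2), (9.0, 1.0)]
hotspots = hotspot_cells(incidents, cell_size=1.0)
print(hotspots)
```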
Web Mining
- Discovering interesting and useful information
from Web content and usage data
What is Web Mining?
• Web mining is a data mining technique to extract
knowledge from web data.

• Web data includes:
o web documents
o hyperlinks between documents
o usage logs of web sites

• The WWW is a huge, widely distributed, global
information service centre and is therefore a rich
source for data mining.
World Wide Web
• There are about 1.5 billion websites.
• But fewer than 400 million are active.
• The web grows at about 1 million pages a day.
• By the time you finish this class, thousands of new sites will
have been created.
• Lots of duplication (70-80%)

What is the most visited site in the world?
Go ahead, Google it!

Fun fact:
More than 80% of all Google searches are initiated by the Google staff,
in the process of developing and refining its search algorithms.
How Many Websites Are There in the World
World Wide Web
• Diverse types of data
– Text
– Images
– Audio & Video
Web Mining
• Web mining is the application of data mining
techniques to discover useful information
from the World Wide Web.

• It uses automated methods to extract both
structured and unstructured data from web
pages, server logs and link structures.
Web Mining
Examples:
• Web search, e.g. Google, Yahoo, Bing, Dogpile, DuckDuckGo,
Ecosia, Gigablast, …

• Specialized search, e.g. Squool Tube (search for factual,
educational videos) and Elephind (search the world's historical
newspaper archives).

• eCommerce, e.g. Amazon, Flipkart, eBay, Fiverr, Upwork.

• Advertising, e.g. Google AdSense

• Improving Web site design and performance

Web Content Mining
• Web content mining can be used to extract useful
data, information, and knowledge from web page
content.

• In web content mining, each web page is
considered as an individual document.

• The primary task of content mining is data
extraction, where structured data is extracted
from unstructured websites.
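Data extraction from a page can be sketched with Python's built-in `html.parser` module. The example below pulls the page title and link targets out of an HTML string; the class name `PageExtractor`, the record layout, and the sample HTML are assumptions for illustration, and real pages usually need more robust handling.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect the <title> text and all <a href="..."> targets."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_record(html):
    """Turn one web page (one document) into a structured record."""
    parser = PageExtractor()
    parser.feed(html)
    return {"title": parser.title.strip(), "links": parser.links}


SAMPLE_HTML = (
    "<html><head><title>Book Shop</title></head><body>"
    "<a href='/books/1'>Book 1</a><a href='/books/2'>Book 2</a>"
    "</body></html>"
)
record = extract_record(SAMPLE_HTML)
print(record)
```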
Web Content Mining
• Web content mining can be used to distinguish
topics on the web.
For example, if a user searches for a specific book
on a search engine, the user gets a list of
suggestions.

• The technologies normally used in web
content mining are NLP (natural language
processing) and IR (information retrieval).

• Techniques in Web Content Mining:
– Classification
– Clustering
Web Usage Mining
• Web usage mining is the application of data mining to
discover interesting usage patterns from large data
sets.

• These patterns help you understand user
behavior.

• Web usage mining is used for mining web log
records (access information of web pages) and helps to
discover the user access patterns of web pages.

• A web server registers a web log entry for every web
page request.
Web Usage Mining
• The main sources of data here are the web server
and the application server.

• Log files are created when a user/customer
interacts with a web page.

• Techniques in Web Usage Mining:
– Association Rules
– Classification
– Clustering
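A first step toward association-style patterns can be sketched by parsing web server log lines and counting which pages are visited together by the same visitor. The example assumes entries in the Common Log Format and identifies a visitor by IP address, which is a simplification; the sample log lines and function names are made up for illustration.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Common Log Format: host ident user [time] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+) [^"]*" (\d{3}) \S+$'
)


def pages_by_visitor(log_lines):
    """Map each visitor (by IP address) to the pages served successfully."""
    visits = defaultdict(set)
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match and match.group(3) == "200":
            visits[match.group(1)].add(match.group(2))
    return visits


def page_pair_counts(visits):
    """Count how many visitors viewed each pair of pages."""
    pairs = Counter()
    for pages in visits.values():
        for pair in combinations(sorted(pages), 2):
            pairs[pair] += 1
    return pairs


LOG_LINES = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:00:05 +0000] "GET /cart HTTP/1.1" 200 128',
    '10.0.0.2 - - [01/Jan/2024:10:01:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:01:09 +0000] "GET /cart HTTP/1.1" 200 128',
    '10.0.0.3 - - [01/Jan/2024:10:02:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.3 - - [01/Jan/2024:10:02:03 +0000] "GET /missing HTTP/1.1" 404 0',
]
visits = pages_by_visitor(LOG_LINES)
pairs = page_pair_counts(visits)
print(pairs.most_common(1))
```

Pairs that occur for many visitors are candidates for association rules such as "visitors who view /home also view /cart".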
Web Usage Mining
Advantage:
• This technology has enabled e-commerce to do
personalized marketing, which eventually results in
higher trade volumes.

Disadvantage:
• This technology, when used on data of a personal
nature, might cause concerns. The most criticized
ethical issue involving web usage mining is the
invasion of privacy.
Web Structure Mining
• Web structure mining is the process of discovering
structure information from the web.

• The structure of the web graph consists of web pages as
nodes, and hyperlinks as edges connecting related pages.

• Structure mining basically shows the structured
summary of a particular website. It identifies
relationships between web pages linked by information or
direct link connection.
• Techniques in Web Structure Mining:
– Association Rules
– Classification
Web Structure Mining
Example:
• The most important application in this regard is
the Google search engine, which was built around
the PageRank algorithm for ranking its results.

• The rank of a page is decided by the number and
quality of links pointing to the target node.

• In general, Web structure mining can be very
useful to companies to determine the
connection between two commercial websites.
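The idea behind PageRank can be sketched with a short power-iteration loop over a small web graph. This is a textbook-style simplification, not Google's production system; the damping factor of 0.85, the iteration count, and the four-page graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a graph given as {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical web graph: pages are nodes, hyperlinks are edges.
WEB_GRAPH = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(WEB_GRAPH)
best_page = max(ranks, key=ranks.get)
print(best_page, round(ranks[best_page], 3))
```

Page C ranks highest because three pages link to it, and page A ranks second because its single incoming link comes from the highly ranked page C, which reflects both the number and the quality of incoming links.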