Data Mining
Definition:
•An organization can mine its data to improve many aspects of its
business, though the technique is particularly useful for improving
sales and customer relations.
• Many Definitions
– Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns
Applications of Data Mining
•Business: Customer segmentation, marketing strategies, sales forecasting.
•Healthcare: Predicting diseases, analyzing patient records, and optimizing
treatments.
•Finance: Fraud detection, risk management, and stock market
predictions.
•E-commerce: Recommender systems and user behavior analysis.
•Science: Analyzing experimental data or identifying patterns in complex
systems.
[Figure: example data table with columns Tid, Refund, Marital Status, Taxable Income, and Cheat]
Classification Example
Training Set (attribute types: categorical, categorical, quantitative, class)

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
…    …         …                   …                           …

Test Set

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?

Training Set → Learn Classifier → Model; the Model is then applied to the Test Set.
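The train-then-apply workflow in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the slides. It uses a 1-nearest-neighbour classifier chosen for brevity. Every training row except the first is hypothetical, and the helper names (`distance`, `learn_classifier`) and the 10-year scaling factor are invented for the sketch.

```python
# Each record: (employed, level_of_education, years_at_present_address)
# Only the first training row appears in the slide; the others are hypothetical.
TRAINING_SET = [
    (("Yes", "Graduate", 5), "Yes"),     # from the slide
    (("No", "High School", 1), "No"),    # hypothetical
    (("Yes", "Undergrad", 8), "Yes"),    # hypothetical
    (("No", "Undergrad", 2), "No"),      # hypothetical
]

# Test records from the slide; their Credit Worthy label is unknown ("?").
TEST_SET = [
    ("Yes", "Undergrad", 7),
    ("No", "Graduate", 3),
    ("Yes", "High School", 2),
]


def distance(a, b):
    """Mismatches on the two categorical attributes plus a scaled gap in years."""
    categorical = (a[0] != b[0]) + (a[1] != b[1])
    quantitative = abs(a[2] - b[2]) / 10.0  # assumption: 10 years counts as one mismatch
    return categorical + quantitative


def learn_classifier(training_set):
    """For 1-NN, 'learning' simply stores the labelled examples."""
    def model(record):
        _, label = min(training_set, key=lambda example: distance(example[0], record))
        return label
    return model


model = learn_classifier(TRAINING_SET)           # Training Set -> Learn Classifier -> Model
predictions = [model(record) for record in TEST_SET]  # apply the Model to the Test Set
print(predictions)
```

With a different training set or distance function the predicted labels would change; the point is only the separation between learning on labelled rows and predicting on unlabelled ones.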
Examples of Classification Task
Machine Learning Example
[Figure: the credit-worthiness example from the Classification Example slide, repeated: a classifier is learned from the Training Set and the resulting Model is applied to the Test Set.]
Core Concepts of Machine Learning
1.Supervised Learning:
1. Learns from labeled data.
2. Example: Predicting stock prices (regression), identifying spam emails
(classification).
2.Unsupervised Learning:
1. Learns from unlabeled data to find patterns or structure.
2. Example: Customer segmentation, anomaly detection.
3.Semi-Supervised Learning:
1. Combines labeled and unlabeled data.
2. Example: Identifying fraudulent transactions with limited labeled data.
4.Reinforcement Learning:
1. Learns by interacting with the environment and receiving feedback as
rewards or penalties.
2. Example: Training a robot to navigate a maze.
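To make the difference between the first two learning types concrete, here is a minimal Python sketch. It is a rough illustration with hypothetical data, not a recommended implementation. `fit_line` learns from labelled (x, y) pairs, which makes it supervised regression. `two_means` groups unlabelled values into two clusters, which makes it unsupervised. Both function names are invented for this example, and `two_means` assumes the input has at least two distinct values.

```python
from statistics import mean


def fit_line(xs, ys):
    """Supervised: least-squares fit of y = slope * x + intercept from labelled pairs."""
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator
    return slope, y_bar - slope * x_bar


def two_means(values, iterations=10):
    """Unsupervised: split 1-D values into two clusters (k-means with k = 2)."""
    low, high = min(values), max(values)  # initial cluster centres
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(near_low), mean(near_high)  # move centres to cluster means
    return near_low, near_high


# Hypothetical labelled data: the target y is known for every x.
slope, intercept = fit_line([1, 2, 3, 4], [2, 4, 6, 8])

# Hypothetical unlabelled data: monthly spend per customer, no target column.
small_spenders, big_spenders = two_means([10, 12, 11, 95, 100, 105])
```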
Learning Types
What is Text Mining?
Text mining, also known as text data mining or text analytics, is the process of
extracting meaningful information and insights from unstructured text data. It
involves converting raw textual data into a structured format to identify patterns,
trends, and valuable knowledge.
•Information Extraction: Extract structured data (like entities, relationships, or
concepts) from unstructured text.
•Text Classification: Categorize text into predefined groups or classes (e.g., spam vs.
non-spam emails).
•Topic Modeling: Discover hidden themes or topics within large collections of text.
•Text Summarization: Create concise summaries of lengthy documents.
•Trend Analysis: Identify trends and patterns in textual data over time.
Key Steps in Text Mining
1.Text Preprocessing: Raw text data often contains noise and inconsistencies. Preprocessing
is critical for cleaning and preparing the text.
1. Tokenization: Splitting text into smaller units, like words or sentences.
2. Stopword Removal: Removing common but insignificant words (e.g., "is," "the,"
"and").
3. Stemming/Lemmatization: Reducing words to their base or root form (e.g.,
"running" → "run").
4. Lowercasing: Converting text to lowercase for uniformity.
5. Removing Punctuation and Numbers: Cleaning non-alphabetic characters.
2.Feature Extraction: Transform text into numerical data for
analysis.
1. Bag of Words (BoW): Represents text as a collection of
word frequencies.
2. TF-IDF: Highlights important terms based on their
frequency in a document and rarity across the corpus.
3. Word Embeddings: Represent words in a dense vector
space (e.g., Word2Vec, GloVe).
3.Text Analysis: Apply statistical or machine learning techniques to analyze the
text.
1. Classification: Assign labels to text (e.g., spam detection).
2. Clustering: Group similar text documents together.
3. Named Entity Recognition (NER): Identify entities like names, dates, or
locations in text.
4. Sentiment Analysis: Evaluate the sentiment expressed in text data.
4.Visualization: Present insights through graphs, word clouds, or other visual
formats.
1. Word clouds for keyword importance.
2. Graphs showing trends in text usage over time.
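The preprocessing and Bag of Words steps above can be sketched with the Python standard library alone. This is a simplified illustration, not a production pipeline. The stopword list is a tiny hypothetical sample, `naive_stem` is a crude suffix stripper invented for this example rather than a real stemmer, and the steps run in a slightly different order from the list (lowercasing first) purely for convenience.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = {"is", "the", "and", "a", "an", "of", "to", "in"}


def naive_stem(word):
    """Crude suffix stripping for illustration only; not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if suffix != "s" and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


def preprocess(text):
    text = text.lower()                                  # lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)                # remove punctuation and numbers
    tokens = text.split()                                # tokenization on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [naive_stem(t) for t in tokens]               # stemming


def bag_of_words(tokens):
    """Bag of Words: map each term to its frequency in the document."""
    return Counter(tokens)


tokens = preprocess("The runner is running 5 miles, and the dog runs too!")
bow = bag_of_words(tokens)
print(tokens)
print(bow)
```

In practice a library tokenizer and stemmer or lemmatizer would replace the hand-written helpers; the sketch only shows how the steps chain together.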
Applications of Text Mining
1.Search Engines: Google and Bing use text mining to retrieve and rank web pages
relevant to search queries.
2.Customer Feedback Analysis: Analyzing reviews, social media posts, and survey
responses to assess customer sentiment.
3.Spam Detection: Filtering spam emails using text classification algorithms.
4.Healthcare: Extracting insights from medical records, research papers, or patient
feedback.
5.Social Media Analysis: Understanding trends and user sentiment on platforms like
Twitter and Instagram.
6.Fraud Detection: Analyzing textual data in financial transactions or insurance
claims to identify fraud.
7.Legal Document Analysis: Extracting important information from contracts, legal
cases, or government documents.
TF/IDF matrix
• TF-IDF stands for "Term Frequency-Inverse Document Frequency".
It is a technique for scoring how important a word is to a document within a set of documents.
• Term Frequency (TF): Measures how frequently a word appears in a
document.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the
document).
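As a quick check of the TF formula, the following Python sketch computes term frequencies for one short document. The sentence and the function name `term_frequency` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def term_frequency(tokens):
    """TF(t) = (times t appears in the document) / (total terms in the document)."""
    total = len(tokens)
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical six-word document: "the" appears twice, every other word once.
tf = term_frequency("the cat sat on the mat".split())
print(tf["the"], tf["cat"])  # 2/6 and 1/6
```

Because each count is divided by the document length, the TF values of a document always sum to 1.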
• Inverse Document Frequency (IDF): Measures whether a term is
common or rare across a given document corpus. It is obtained by
dividing the total number of documents by the number of documents
containing the term and, in the usual formulation, taking the
logarithm of that quotient.
IDF(t) = log(Total number of documents / Number of documents containing term t).
TF/IDF
Combining these two gives the TF-IDF score (w) for a word in a
document in the corpus. It is the product of TF and IDF:
w = TF(t) × IDF(t)
Let’s take an example to get a clearer understanding.
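One possible worked example, sketched in Python under stated assumptions: the three-document corpus is hypothetical, the natural logarithm is used for IDF (other bases only rescale the scores), and the helper names `tf`, `idf`, and `tf_idf` are invented for this illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
corpus = {
    "d1": "data mining finds patterns".split(),
    "d2": "text mining finds patterns in text".split(),
    "d3": "mining needs data".split(),
}


def tf(tokens):
    """Term frequency: count of each term divided by the document length."""
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


def idf(corpus):
    """Inverse document frequency: log(total documents / documents containing the term)."""
    n_docs = len(corpus)
    doc_freq = Counter(term for tokens in corpus.values() for term in set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(corpus):
    """TF-IDF matrix as nested dicts: document -> term -> TF * IDF."""
    idf_scores = idf(corpus)
    return {
        doc: {term: freq * idf_scores[term] for term, freq in tf(tokens).items()}
        for doc, tokens in corpus.items()
    }


weights = tf_idf(corpus)
for doc, scores in weights.items():
    print(doc, {term: round(score, 3) for term, score in scores.items()})
```

In this corpus "mining" appears in every document, so its IDF is log(3/3) = 0 and its TF-IDF score is 0 everywhere. "text" appears twice in d2 and in no other document, so it receives the highest score there.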