
Data mining

Definition:

Data mining is the process of discovering patterns, trends, correlations, or useful information from large datasets using techniques that combine statistics, machine learning, database systems, and artificial intelligence. The goal of data mining is to extract valuable insights from raw data that can help in decision-making, prediction, or problem-solving.
Key Takeaways

• Data mining combines statistics, artificial intelligence, and machine learning to find patterns, relationships, and anomalies in large data sets.

• An organization can mine its data to improve many aspects of its business, though the technique is particularly useful for improving sales and customer relations.

• Data mining can be used to find relationships and patterns in current data and then apply them to new data to predict future trends or detect anomalies, such as fraud.
Key Steps in Data Mining

1. Data Collection and Preparation: Gather relevant data from various sources, then clean the data to remove inconsistencies, errors, and redundancies.
2. Data Exploration: Analyze the data using descriptive statistics and visualization techniques to understand its structure.
3. Modeling: Apply algorithms (e.g., classification, clustering, regression) to find patterns or relationships in the data.
4. Evaluation: Validate the models and ensure the results are accurate, reliable, and meaningful.
5. Deployment: Integrate the insights into business processes for decision-making or automation.
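The five steps can be walked through end-to-end on a tiny made-up dataset. The records, the cleaning rule, and the naive one-threshold "model" below are all illustrative, not from the slides; the point is only to show each step in order:

```python
from statistics import mean

# 1. Collection: toy records gathered from a hypothetical source.
raw = [
    {"income": 125, "cheat": False},
    {"income": 100, "cheat": False},
    {"income": None, "cheat": False},  # a record with a missing value
    {"income": 95,  "cheat": True},
    {"income": 220, "cheat": False},
    {"income": 85,  "cheat": True},
]

# 1b. Preparation: drop records with missing fields.
clean = [r for r in raw if r["income"] is not None]

# 2. Exploration: simple descriptive statistics.
avg_income = mean(r["income"] for r in clean)

# 3. Modeling: a deliberately naive rule "learned" from the data --
#    flag incomes at or below the mean income of the "cheat" group.
cheat_mean = mean(r["income"] for r in clean if r["cheat"])

def model(income):
    return income <= cheat_mean

# 4. Evaluation: accuracy of the rule on the cleaned data.
accuracy = mean(model(r["income"]) == r["cheat"] for r in clean)

# 5. Deployment: score a new, unseen record.
print(f"avg income: {avg_income:.1f}, accuracy: {accuracy:.2f}")
print("new record flagged:", model(90))
```

A real pipeline would replace step 3 with a proper algorithm (decision tree, clustering, regression), but the shape of the workflow stays the same.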
What is Data Mining?

• Many definitions
  – Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Applications of Data Mining

• Business: Customer segmentation, marketing strategies, sales forecasting.
• Healthcare: Predicting diseases, analyzing patient records, and optimizing treatments.
• Finance: Fraud detection, risk management, and stock market predictions.
• E-commerce: Recommender systems and user behavior analysis.
• Science: Analyzing experimental data or identifying patterns in complex systems.
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
Predictive Modeling: Classification

• Find a model for the class attribute as a function of the values of the other attributes.

Training data:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

[Figure: model for predicting credit worthiness, a decision tree. As far as it can be reconstructed from the slide: split first on Employed (No → not credit worthy); for employed applicants, split on Level of Education; Graduate applicants are credit worthy if # years at present address > 3 yr, and {High school, Undergrad} applicants if > 7 yrs.]
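A decision tree like the one on this slide can be written as a plain function. The version below follows the splits shown (Employed, then Level of Education, then the 3-year and 7-year thresholds); the exact branch layout is partly reconstructed from the figure, so treat it as a sketch:

```python
def credit_worthy(employed, education, years_at_address):
    """Decision tree reconstructed from the slide's figure."""
    if not employed:                       # Employed = No
        return "No"
    if education == "Graduate":            # Employed = Yes
        return "Yes" if years_at_address > 3 else "No"
    # High school or Undergrad
    return "Yes" if years_at_address > 7 else "No"

# The four labeled training rows from the slide:
rows = [
    (True,  "Graduate",    5,  "Yes"),
    (True,  "High School", 2,  "No"),
    (False, "Undergrad",   1,  "No"),
    (True,  "High School", 10, "Yes"),
]
for emp, edu, yrs, label in rows:
    print(credit_worthy(emp, edu, yrs), "expected:", label)
```

This hand-written tree classifies all four training rows correctly, which is what a decision-tree learner would aim to produce automatically.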
Classification Example

Training set (the first two attributes are categorical, the third is quantitative, and the last column is the class):

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Test set (class unknown):

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

A classifier is learned from the training set (Training Set → Learn Model → Classifier) and then applied to the test set to predict the unknown labels.
Examples of Classification Tasks

• Classifying credit card transactions as legitimate or fraudulent

• Classifying land cover (water bodies, urban areas, forests, etc.) using satellite data

• Categorizing news stories as finance, weather, entertainment, sports, etc.

• Identifying intruders in cyberspace

• Predicting tumor cells as benign or malignant

• Classifying secondary structures of proteins
What is Machine Learning?

Machine Learning (ML) is a branch of artificial intelligence (AI) that focuses on building systems that can learn from data, identify patterns, and make decisions with minimal human intervention. It enables computers to improve their performance on tasks over time as they gain more experience or data, without being explicitly programmed. It involves developing algorithms that can process large volumes of data, identify patterns, and make predictions or decisions based on that data. At its core, machine learning is about creating systems that can generalize from data.
Machine Learning Example

The same credit-worthiness example illustrates the machine learning workflow: a classifier is learned from the labeled training set (Training Set → Learn Model → Classifier) and then applied to the test set, whose Credit Worthy labels are unknown, to predict them.
Core Concepts of Machine Learning

1. Learning from Data: Machine learning systems are data-driven and rely on datasets to learn. They identify relationships, trends, and patterns within the data to make informed decisions.
2. Model: A machine learning model is the mathematical representation of the patterns learned from the data. Example: a linear regression model that predicts housing prices based on features like square footage, location, etc.
3. Training: The process of feeding data into the algorithm to enable it to "learn." The algorithm adjusts its internal parameters to minimize error and improve predictions.
4. Prediction: Once trained, the model can make predictions on new, unseen data.
5. Feedback: Feedback mechanisms allow models to improve their accuracy by correcting errors over time.
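Concepts 2-4 (model, training, prediction) can be illustrated with the simplest possible case: a one-parameter linear model fitted by gradient descent. The data, learning rate, and iteration count below are made up for illustration:

```python
# Fit y = w * x to toy data (the true relationship is y = 2x) by
# repeatedly nudging w in the direction that reduces squared error.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0        # the model's single internal parameter
lr = 0.01      # learning rate: how big each adjustment is

for _ in range(500):
    # gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad           # "training": adjust the parameter

print(f"learned w = {w:.3f}")            # converges to 2.000
print("prediction for x=5:", round(w * 5, 2))
```

Each loop iteration is one round of "training": the parameter w is adjusted to reduce the error, exactly as described in concept 3, and the final line is a prediction on an unseen input (concept 4).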
Types of Machine Learning

1. Supervised Learning: Learns from labeled data. Example: predicting stock prices (regression), identifying spam emails (classification).
2. Unsupervised Learning: Learns from unlabeled data to find patterns or structure. Example: customer segmentation, anomaly detection.
3. Semi-Supervised Learning: Combines labeled and unlabeled data. Example: identifying fraudulent transactions with limited labeled data.
4. Reinforcement Learning: Learns by interacting with the environment and receiving feedback as rewards or penalties. Example: training a robot to navigate a maze.
What is Text Mining?

Text mining, also known as text data mining or text analytics, is the process of extracting meaningful information and insights from unstructured text data. It involves converting raw textual data into a structured format to identify patterns, trends, and valuable knowledge. Common tasks include:

• Information Extraction: Extract structured data (like entities, relationships, or concepts) from unstructured text.

• Text Classification: Categorize text into predefined groups or classes (e.g., spam vs. non-spam emails).

• Sentiment Analysis: Determine the sentiment or emotion expressed in text (e.g., positive, negative, neutral).

• Topic Modeling: Discover hidden themes or topics within large collections of text.

• Text Summarization: Create concise summaries of lengthy documents.

• Trend Analysis: Identify trends and patterns in textual data over time.
Key Steps in Text Mining

1. Text Preprocessing: Raw text data often contains noise and inconsistencies, so preprocessing is critical for cleaning and preparing the text.
   1. Tokenization: Splitting text into smaller units, like words or sentences.
   2. Stopword Removal: Removing common but insignificant words (e.g., "is," "the," "and").
   3. Stemming/Lemmatization: Reducing words to their base or root form (e.g., "running" → "run").
   4. Lowercasing: Converting text to lowercase for uniformity.
   5. Removing Punctuation and Numbers: Cleaning non-alphabetic characters.
2. Feature Extraction: Transform text into numerical data for analysis.
   1. Bag of Words (BoW): Represents text as a collection of word frequencies.
   2. TF-IDF: Highlights important terms based on their frequency in a document and rarity across the corpus.
   3. Word Embeddings: Represent words in a dense vector space (e.g., Word2Vec, GloVe).
3. Text Analysis: Apply statistical or machine learning techniques to analyze the text.
   1. Classification: Assign labels to text (e.g., spam detection).
   2. Clustering: Group similar text documents together.
   3. Named Entity Recognition (NER): Identify entities like names, dates, or locations in text.
   4. Sentiment Analysis: Evaluate the sentiment expressed in text data.
4. Visualization: Present insights through graphs, word clouds, or other visual formats.
   1. Word clouds for keyword importance.
   2. Graphs showing trends in text usage over time.
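Steps 1 and 2 can be sketched with the Python standard library alone. The stopword list and the -ing-stripping "stemmer" below are deliberately minimal stand-ins for what real NLP libraries such as NLTK or spaCy provide:

```python
import string
from collections import Counter

STOPWORDS = {"a", "an", "and", "is", "on", "the"}   # tiny illustrative list

def stem(token):
    # crude stand-in for a real stemmer: strip -ing, undouble the consonant
    if token.endswith("ing"):
        token = token[:-3]
        if len(token) > 2 and token[-1] == token[-2]:
            token = token[:-1]
    return token

def preprocess(text):
    # lowercase, remove punctuation, tokenize, drop stopwords, stem
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [stem(t) for t in text.split() if t not in STOPWORDS]

def bag_of_words(texts):
    # feature extraction: one word-frequency vector (a Counter) per document
    return [Counter(preprocess(t)) for t in texts]

docs = ["The cat is running.", "The dog is running and barking!"]
print(bag_of_words(docs))
```

After preprocessing, "The cat is running." becomes the frequency vector {cat: 1, run: 1}, which is exactly the structured numerical form the analysis step works on.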
Applications of Text Mining

1. Search Engines: Google and Bing use text mining to retrieve and rank web pages relevant to search queries.
2. Customer Feedback Analysis: Analyzing reviews, social media posts, and survey responses to assess customer sentiment.
3. Spam Detection: Filtering spam emails using text classification algorithms.
4. Healthcare: Extracting insights from medical records, research papers, or patient feedback.
5. Social Media Analysis: Understanding trends and user sentiment on platforms like Twitter and Instagram.
6. Fraud Detection: Analyzing textual data in financial transactions or insurance claims to identify fraud.
7. Legal Document Analysis: Extracting important information from contracts, legal cases, or government documents.
TF-IDF Matrix

• TF-IDF stands for "Term Frequency-Inverse Document Frequency". It is a technique to quantify words in a set of documents.

• TF-IDF is a statistical measure used in text mining and information retrieval to evaluate how important a word is to a document within a collection or corpus. It is commonly used in search engines, document ranking, and natural language processing tasks.
Term Frequency

Term frequency (TF) gives the frequency of a word in each document in the corpus. It is the ratio of the number of times the word appears in a document to the total number of words in that document, so it increases as the number of occurrences of the word within the document increases. Each document has its own TF values.

Formula:

TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
Inverse Document Frequency (IDF)

The inverse document frequency measures whether a term is common or rare in a given document corpus. It is based on the ratio of the total number of documents to the number of documents containing the term; in the common formulation, the logarithm of this ratio is taken:

IDF(t) = log((total number of documents) / (number of documents containing term t))
TF-IDF

Combining these two gives the TF-IDF score (w) for a word in a document in the corpus. It is the product of TF and IDF:

w(t, d) = TF(t, d) × IDF(t)

Let's take an example to get a clearer understanding.

Sentence 1: The car is driven on the road.
Sentence 2: The truck is driven on the highway.

In this example, each sentence is a separate document. We will now calculate the TF-IDF for these two documents, which represent our corpus.
Thank you
