Data Mining

Uploaded by

ahmedjamshaid953

Data mining

Definition:

Data mining is the process of discovering patterns, trends, correlations, or
useful information from large datasets using techniques that combine
statistics, machine learning, database systems, and artificial intelligence. The
goal of data mining is to extract valuable insights from raw data that can
help in decision-making, predictions, or problem-solving.
Key Takeaways

• Data mining combines statistics, artificial intelligence and machine
learning to find patterns, relationships and anomalies in large data
sets.

• An organization can mine its data to improve many aspects of its
business, though the technique is particularly useful for improving
sales and customer relations.

• Data mining can be used to find relationships and patterns in current
data and then apply those to new data to predict future trends or
detect anomalies, such as fraud.
Key Steps in Data Mining

1.Data Collection and Preparation:

1. Gather relevant data from various sources.
2. Clean the data to remove inconsistencies, errors, and redundancies.
2.Data Exploration:
1. Analyze the data using descriptive statistics and visualization techniques to
understand its structure.
3.Modeling:
1. Apply algorithms (e.g., classification, clustering, regression) to find patterns or
relationships in the data.
4.Evaluation:
1. Validate the models and ensure the results are accurate, reliable, and
meaningful.
5.Deployment:
1. Integrate the insights into business processes for decision-making or automation.
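
The first two steps above (preparation and exploration) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names `id` and `income_k`, and the helper names `clean` and `describe` are hypothetical and do not come from the slides.

```python
from statistics import mean, median

# Hypothetical raw records; the duplicate and the missing value are planted
# so that the cleaning step has something to remove.
raw_records = [
    {"id": 1, "income_k": 125},
    {"id": 2, "income_k": 100},
    {"id": 2, "income_k": 100},   # redundant copy of record 2
    {"id": 3, "income_k": None},  # missing value
    {"id": 4, "income_k": 70},
]

def clean(records):
    """Drop records with a missing income and keep one record per id."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec["income_k"] is None or rec["id"] in seen:
            continue
        seen.add(rec["id"])
        cleaned.append(rec)
    return cleaned

def describe(records):
    """Descriptive statistics used to explore the cleaned data."""
    incomes = [rec["income_k"] for rec in records]
    return {"count": len(incomes), "mean": mean(incomes), "median": median(incomes)}

cleaned = clean(raw_records)
summary = describe(cleaned)
print(summary)
```

Modeling, evaluation, and deployment would build on the cleaned records; they are left out here to keep the sketch short.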
What is Data Mining?

• Many definitions exist, for example:
– Exploration and analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns
Applications of Data Mining
•Business: Customer segmentation, marketing strategies, sales forecasting.
•Healthcare: Predicting diseases, analyzing patient records, and optimizing
treatments.
•Finance: Fraud detection, risk management, and stock market
predictions.
•E-commerce: Recommender systems and user behavior analysis.
•Science: Analyzing experimental data or identifying patterns in complex
systems.
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
Predictive Modeling: Classification
• Find a model for the class attribute as a function of the values of other
attributes

Training data (class attribute: Credit Worthy):

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Model for predicting credit worthiness (decision tree):

Employed?
  No  → No
  Yes → Education?
          Graduate → Number of years: > 3 yr → Yes, < 3 yr → No
          { High school, Undergrad } → Number of years: > 7 yrs → Yes, < 7 yrs → No
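
The model on this slide can be written as a small function. This is a sketch of one reading of the tree: not employed gives No; employed graduates need more than 3 years at the present address; employed High School or Undergrad applicants need more than 7. The function name `credit_worthy` and its argument names are hypothetical, and because the slide labels the branches only "> t" and "< t", treating exactly t years as No is an assumption.

```python
def credit_worthy(employed, education, years_at_address):
    """Return "Yes" or "No" by walking the decision tree from the slide."""
    if not employed:
        return "No"
    # Graduates are split at 3 years, the other education levels at 7 years.
    threshold = 3 if education == "Graduate" else 7
    # Assumption: a value exactly equal to the threshold falls on the "No" side.
    return "Yes" if years_at_address > threshold else "No"

# The four training rows shown on the slide: (Employed, Education, Years, Credit Worthy).
training_rows = [
    (True, "Graduate", 5, "Yes"),
    (True, "High School", 2, "No"),
    (False, "Undergrad", 1, "No"),
    (True, "High School", 10, "Yes"),
]

for employed, education, years, label in training_rows:
    print(employed, education, years, "->", credit_worthy(employed, education, years), "(expected", label + ")")
```

Under this reading, the function reproduces the Credit Worthy label of all four training rows.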
Classification Example
Attribute types: Employed (categorical), Level of Education (categorical),
# years at present address (quantitative), Credit Worthy (class)

Training Set:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Test Set:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

Training Set → Learn Classifier → Model; the Model is then applied to the Test Set.
Examples of Classification Task

• Classifying credit card transactions as legitimate or fraudulent

• Classifying land covers (water bodies, urban areas, forests, etc.)
using satellite data

• Categorizing news stories as finance, weather, entertainment, sports, etc.

• Identifying intruders in cyberspace

• Predicting tumor cells as benign or malignant

• Classifying secondary structures of proteins

What is Machine Learning?
Machine Learning (ML) is a branch of artificial intelligence (AI) that focuses on
building systems that can learn from data, identify patterns, and make decisions
with minimal human intervention. It enables computers to improve their
performance on tasks over time as they gain more experience or data.

Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on
enabling machines to learn and improve from experience without being explicitly
programmed. It involves developing algorithms that can process large volumes of
data, identify patterns, and make predictions or decisions based on that data.
At its core, machine learning is about creating systems that can generalize from
data.
Machine Learning Example
Attribute types: Employed (categorical), Level of Education (categorical),
# years at present address (quantitative), Credit Worthy (class)

Training Set:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Test Set:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

Training Set → Learn Classifier → Model; the Model is then applied to the Test Set.
Core Concepts of Machine Learning

1.Learning from Data:

1. Machine learning systems are data-driven and rely on datasets to learn.
2. They identify relationships, trends, and patterns within the data to make
informed decisions.
2.Model: A machine learning model is the mathematical representation of the
patterns learned from the data. Example: A linear regression model that predicts
housing prices based on features like square footage, location, etc.
3.Training:
1. The process of feeding data into the algorithm to enable it to "learn."
2. The algorithm adjusts its internal parameters to minimize error and improve
predictions.
4.Prediction: Once trained, the model can make predictions on new, unseen data.
5.Feedback: Feedback mechanisms allow models to improve their accuracy by
correcting errors over time.
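
The Model, Training, and Prediction concepts above can be illustrated with a one-feature linear regression trained by gradient descent. This is a minimal sketch, not a production method: the house sizes and prices are made-up numbers, and the names `train`, `learning_rate`, and `steps` are hypothetical.

```python
def train(xs, ys, learning_rate=0.1, steps=5000):
    """Fit y = slope * x + intercept by gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors under the current parameters.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to each parameter.
        grad_slope = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_intercept = 2 * sum(errors) / n
        # Adjust the internal parameters in the direction that reduces the error.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

# Made-up training data: size in thousands of square feet, price in thousands.
sizes = [1.0, 1.5, 2.0, 2.5]
prices = [200.0, 300.0, 400.0, 500.0]

slope, intercept = train(sizes, prices)

# Prediction on a new, unseen size.
predicted_price = slope * 3.0 + intercept
print(round(slope, 3), round(intercept, 3), round(predicted_price, 3))
```

The data were chosen to lie exactly on the line price = 200 × size, so the learned parameters can be checked by eye; real data would leave a residual error.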
Types of Machine Learning

1.Supervised Learning:
1. Learns from labeled data.
2. Example: Predicting stock prices (regression), identifying spam emails
(classification).
2.Unsupervised Learning:
1. Learns from unlabeled data to find patterns or structure.
2. Example: Customer segmentation, anomaly detection.
3.Semi-Supervised Learning:
1. Combines labeled and unlabeled data.
2. Example: Identifying fraudulent transactions with limited labeled data.
4.Reinforcement Learning:
1. Learns by interacting with the environment and receiving feedback as
rewards or penalties.
2. Example: Training a robot to navigate a maze.
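
To make the unsupervised case concrete, here is a sketch of customer segmentation with a two-cluster, one-dimensional k-means. No labels are used. The spending figures are made up, the function name `two_means` is hypothetical, and starting the centers at the minimum and maximum value is a simplifying assumption rather than the usual random initialization.

```python
def two_means(values, iterations=10):
    """Split numeric values into two groups around two moving centers."""
    # Assumption: deterministic start at the smallest and largest value.
    centers = [min(values), max(values)]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Assign each value to its nearest center.
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Made-up monthly spend per customer.
spend = [10, 12, 11, 50, 52, 49]
centers, groups = two_means(spend)
print(centers, groups)
```

The algorithm finds a low-spend and a high-spend segment from the data alone, which is the sense in which unsupervised learning "finds structure" without labels.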
What is Text Mining?

Text mining, also known as text data mining or text analytics, is the process of
extracting meaningful information and insights from unstructured text data. It
involves converting raw textual data into a structured format to identify patterns,
trends, and valuable knowledge.
•Information Extraction: Extract structured data (like entities, relationships, or
concepts) from unstructured text.
•Text Classification: Categorize text into predefined groups or classes (e.g., spam vs.
non-spam emails).
•Sentiment Analysis: Determine the sentiment or emotion expressed in text (e.g.,
positive, negative, neutral).
•Topic Modeling: Discover hidden themes or topics within large collections of text.
•Text Summarization: Create concise summaries of lengthy documents.
•Trend Analysis: Identify trends and patterns in textual data over time.
Key Steps in Text Mining

1.Text Preprocessing: Raw text data often contains noise and inconsistencies. Preprocessing
is critical for cleaning and preparing the text.
1. Tokenization: Splitting text into smaller units, like words or sentences.
2. Stopword Removal: Removing common but insignificant words (e.g., "is," "the,"
"and").
3. Stemming/Lemmatization: Reducing words to their base or root form (e.g.,
"running" → "run").
4. Lowercasing: Converting text to lowercase for uniformity.
5. Removing Punctuation and Numbers: Cleaning non-alphabetic characters.
2.Feature Extraction: Transform text into numerical data for analysis.
1. Bag of Words (BoW): Represents text as a collection of word frequencies.
2. TF-IDF: Highlights important terms based on their frequency in a document
and rarity across the corpus.
3. Word Embeddings: Represent words in a dense vector space (e.g., Word2Vec,
GloVe).
3.Text Analysis: Apply statistical or machine learning techniques to analyze the
text.
1. Classification: Assign labels to text (e.g., spam detection).
2. Clustering: Group similar text documents together.
3. Named Entity Recognition (NER): Identify entities like names, dates, or
locations in text.
4. Sentiment Analysis: Evaluate the sentiment expressed in text data.
4.Visualization: Present insights through graphs, word clouds, or other visual
formats.
1. Word clouds for keyword importance.
2. Graphs showing trends in text usage over time.
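
The preprocessing and Bag of Words steps listed above can be sketched with the Python standard library. This is an illustration under stated assumptions: the stopword list is a tiny hand-picked sample, the input sentence is made up, the function names are hypothetical, and stemming/lemmatization is left out because it normally relies on an external library.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"is", "the", "and", "on", "a", "an", "of"}

def preprocess(text):
    """Lowercase, strip punctuation and digits, tokenize, remove stopwords."""
    text = text.lower()
    # Replace every character that is not a letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]

def bag_of_words(tokens):
    """Represent a document as a mapping from word to frequency."""
    return Counter(tokens)

tokens = preprocess("The service is great, and the food is great too!")
bow = bag_of_words(tokens)
print(tokens)
print(bow)
```

The resulting word counts are the numerical features that a classifier or clustering algorithm from the Text Analysis step would consume.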
Applications of Text Mining
1.Search Engines: Google and Bing use text mining to retrieve and rank web pages
relevant to search queries.
2.Customer Feedback Analysis: Analyzing reviews, social media posts, and survey
responses to assess customer sentiment.
3.Spam Detection: Filtering spam emails using text classification algorithms.
4.Healthcare: Extracting insights from medical records, research papers, or patient
feedback.
5.Social Media Analysis: Understanding trends and user sentiment on platforms like
Twitter and Instagram.
6.Fraud Detection: Analyzing textual data in financial transactions or insurance
claims to identify fraud.
7.Legal Document Analysis: Extracting important information from contracts, legal
cases, or government documents.
TF/IDF matrix
• TF-IDF stands for "Term Frequency-Inverse Document Frequency".
It is a technique for weighting each word in a set of documents by how
informative it is.
• Term Frequency (TF): Measures how frequently a word appears in a
document.

TF-IDF (Term Frequency-Inverse Document Frequency)

• TF-IDF is a statistical measure used in text mining and information
retrieval to evaluate how important a word is to a document within a
collection or corpus. It is commonly used in search engines, document
ranking, and natural language processing tasks.
Term Frequency
Term Frequency (tf): gives the frequency of a word in each
document in the corpus. It is the ratio of the number of times the word
appears in a document to the total number of words in that
document. It increases as the number of occurrences of that word
within the document increases. Each document has its own tf.
Formula

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
Inverse Document Frequency (idf):
The inverse document frequency is a measure of whether a
term is common or rare across a given document corpus. It is
obtained by dividing the total number of documents by the
number of documents containing the term and, in the usual
formulation, taking the logarithm of that ratio.
TF/IDF
Combining these two we come up with the TF-IDF
score (w) for a word in a document in the corpus. It is
the product of tf and idf: w = tf × idf.
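
Written out with explicit symbols, the three quantities can be stated as follows. The notation is an assumption introduced here: f_{t,d} is the number of times term t appears in document d, N is the total number of documents, and n_t is the number of documents containing t. The logarithm is the common variant of idf; its base is a matter of convention.

```latex
\[
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{n_t}, \qquad
w_{t,d} = \mathrm{tf}(t, d) \times \mathrm{idf}(t)
\]
```

A term that occurs in every document has N / n_t = 1, so its idf and therefore its TF-IDF score is 0.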
Let's take an example to get a clearer understanding.

Sentence 1: The car is driven on the road.

Sentence 2: The truck is driven on the highway.

In this example, each sentence is a separate document.

We will now calculate the TF-IDF for the above two documents, which
represent our corpus.
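
The calculation for these two sentences can be carried out with a short Python sketch. It assumes tf is the count divided by document length and idf is the base-10 logarithm of (number of documents / number of documents containing the term). The base of the logarithm and the function names are assumptions, not something the slides specify.

```python
import math

documents = {
    "Sentence 1": "The car is driven on the road.",
    "Sentence 2": "The truck is driven on the highway.",
}

def tokenize(text):
    """Lowercase the text, drop the full stop, and split on whitespace."""
    return text.lower().replace(".", "").split()

def term_frequency(tokens):
    """tf = occurrences of the term / total number of terms in the document."""
    total = len(tokens)
    return {term: tokens.count(term) / total for term in set(tokens)}

def inverse_document_frequency(tokenized_docs):
    """idf = log10(number of documents / number of documents containing the term)."""
    n_docs = len(tokenized_docs)
    vocabulary = {term for tokens in tokenized_docs.values() for term in tokens}
    return {
        term: math.log10(n_docs / sum(term in tokens for tokens in tokenized_docs.values()))
        for term in vocabulary
    }

tokenized = {name: tokenize(text) for name, text in documents.items()}
idf = inverse_document_frequency(tokenized)
tfidf = {
    name: {term: tf * idf[term] for term, tf in term_frequency(tokens).items()}
    for name, tokens in tokenized.items()
}

for name, scores in tfidf.items():
    print(name, {term: round(score, 3) for term, score in sorted(scores.items())})
```

Each sentence has 7 terms. The words shared by both sentences ("the", "is", "driven", "on") have idf = log10(2/2) = 0 and so score 0, while the words unique to one sentence ("car", "road", "truck", "highway") score (1/7) × log10(2) ≈ 0.043. TF-IDF therefore highlights exactly the words that distinguish the two documents.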
Thank you
