Data Mining Unit 2

Data mining is the process of discovering patterns and knowledge from large datasets, involving steps such as data cleaning, integration, selection, transformation, mining, evaluation, and presentation. Various types of data can be mined, including relational databases, data warehouses, transactional data, and more, each with specific mining tasks and applications. Technologies such as statistics, machine learning, and information retrieval enhance data mining, which is utilized in applications like customer segmentation, market basket analysis, and fraud detection.

Uploaded by

Misba firdose

UNIT 2

Data mining is the process of discovering interesting patterns and
knowledge from large amounts of data. The data sources can include databases, data
warehouses, the Web, other information repositories, or data that are streamed into the
system dynamically.

Knowledge discovery from data, or KDD

The KDD process is an iterative sequence of the following steps:

1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract
data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge,
based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present mined knowledge to users)
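
The seven steps can be sketched end to end on a toy dataset. This is only an illustration: the records, the field names, and the `min_count` threshold standing in for an interestingness measure are all hypothetical, and the "mining" step here is a plain frequency count.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"id": 1, "item": "milk", "qty": 2}, {"id": 2, "item": None, "qty": 1}]
store_b = [{"id": 3, "item": "Milk", "qty": 1}, {"id": 4, "item": "bread", "qty": 3}]

def kdd(sources, min_count=2):
    # 1. Cleaning: drop records whose item is missing.
    cleaned = [[r for r in src if r["item"] is not None] for src in sources]
    # 2. Integration: combine the cleaned sources into one list.
    records = [r for src in cleaned for r in src]
    # 3. Selection: keep only the fields relevant to the task.
    selected = [(r["item"], r["qty"]) for r in records]
    # 4. Transformation: normalize item names and aggregate quantities.
    totals = Counter()
    for item, qty in selected:
        totals[item.lower()] += qty
    # 5-6. Mining and evaluation: keep items whose total meets the threshold.
    patterns = {item: n for item, n in totals.items() if n >= min_count}
    # 7. Presentation: sorted, human-readable lines.
    return [f"{item}: {n} units" for item, n in sorted(patterns.items())]

report = kdd([store_a, store_b])
print(report)
```

In practice the steps are revisited repeatedly; for example, a poor result at the evaluation step may send the analyst back to cleaning or selection.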

What Kinds of Data Can Be Mined

1. Relational Database Data

• Definition: A collection of tables (relations), each with a unique name, consisting of
attributes (columns) and tuples (rows).
• Example: The AllElectronics relational database includes tables such as customer, item,
employee, and branch.
• Data Access: Relational queries (e.g., SQL) allow data retrieval, aggregate functions, and
trend analysis.
• Data Mining Tasks: Predicting customer behavior, detecting deviations in sales,
discovering patterns and trends.
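
As a rough illustration of relational access with an aggregate function, the sketch below uses Python's built-in `sqlite3` module. The `customer` table name comes from the example above, but its columns, the `purchase` table, and all rows are made up for this example.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE purchase (cust_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?)", [(1, 20.0), (1, 35.0), (2, 10.0)]
)

# Aggregate query: total spending per customer, highest first.
rows = conn.execute(
    """
    SELECT c.name, SUM(p.amount) AS total
    FROM customer AS c
    JOIN purchase AS p ON p.cust_id = c.cust_id
    GROUP BY c.name
    ORDER BY total DESC
    """
).fetchall()
conn.close()
print(rows)
```

A query like this answers a question the analyst already has in mind; data mining goes further by searching for patterns that were not specified in advance.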

2. Data Warehouses

• Definition: A repository of information from multiple sources, stored under a unified
schema, often at a single site.
• Characteristics: Organized around major subjects, provides historical data, and typically
uses a multidimensional data structure (data cube).
• Example: The AllElectronics data warehouse stores summarized sales data for analysis.
• Data Access: Online Analytical Processing (OLAP) operations such as drill-down and
roll-up for multidimensional data analysis.
• Data Mining Tasks: Summarizing sales, detecting patterns over time, facilitating
decision making through aggregated data.
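
Roll-up and drill-down can be pictured as re-aggregating the same facts at a coarser or finer level. The sketch below assumes a tiny, made-up fact table of (city, quarter, sales) rows and a hypothetical city-to-country mapping; a real warehouse would use an OLAP engine rather than Python dictionaries.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, quarter, sales).
facts = [
    ("Chicago", "Q1", 100), ("Chicago", "Q2", 150),
    ("Toronto", "Q1", 80), ("Toronto", "Q2", 120),
]
country_of = {"Chicago": "USA", "Toronto": "Canada"}  # location hierarchy

def aggregate(rows, key):
    """Sum sales grouped by key(city, quarter)."""
    totals = defaultdict(int)
    for city, quarter, sales in rows:
        totals[key(city, quarter)] += sales
    return dict(totals)

# Finest level: one cell per (city, quarter). Drill-down ends here.
by_city_quarter = aggregate(facts, lambda c, q: (c, q))
# Roll-up along the location dimension: city -> country.
by_country = aggregate(facts, lambda c, q: country_of[c])
# Roll-up that removes the location dimension entirely.
by_quarter = aggregate(facts, lambda c, q: q)

print(by_country, by_quarter)
```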

3. Transactional Data

• Definition: Records of transactions, such as purchases, bookings, or user clicks, each
with a unique transaction ID and a list of items involved.
• Example: A transactional database for AllElectronics records each sales transaction with
the items sold.
• Data Access: Queries to analyze transactional patterns.
• Data Mining Tasks: Market basket analysis to identify items frequently sold together,
detecting purchase patterns.

Other Kinds of Data:

4. Time-related or Sequence Data

• Definition: Data collected over time, such as historical records or time-series data.
• Example: Stock exchange data, historical weather data.
• Data Mining Tasks: Detecting trends, seasonal patterns, predicting future events.
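
One simple way to expose a trend in time-series data is a moving average, which smooths out short-term fluctuations. The sketch below is a minimal illustration; the price values and the window size are made up.

```python
def moving_average(values, window):
    """Return the mean of each consecutive run of `window` values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

prices = [10, 12, 11, 15, 14, 18]  # hypothetical daily closing prices
trend = moving_average(prices, window=3)
print(trend)
```

Here the raw series moves up and down from day to day, while the smoothed series rises steadily, which makes the upward trend easier to see.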

5. Data Streams

• Definition: Continuous, real-time data flows from sources such as sensors or online
activities.
• Example: Video surveillance feeds, sensor data.
• Data Mining Tasks: Real-time anomaly detection, monitoring changes over time.
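
Because a stream cannot be stored in full, stream methods usually keep a small running summary instead. The sketch below is one possible approach, not a standard prescribed by these notes: it tracks a running mean and variance with Welford's online algorithm and flags readings that fall more than `threshold` standard deviations from the mean. The class name, parameters, and sensor readings are hypothetical.

```python
import math

class StreamMonitor:
    """Flags readings far from the running mean, using constant memory."""

    def __init__(self, threshold=3.0, warmup=5):
        self.threshold = threshold  # standard deviations that count as anomalous
        self.warmup = warmup        # readings to see before flagging anything
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def observe(self, x):
        is_anomaly = False
        # Need at least two readings for a sample standard deviation.
        if self.n >= max(self.warmup, 2):
            std = math.sqrt(self.m2 / (self.n - 1))
            is_anomaly = std > 0 and abs(x - self.mean) > self.threshold * std
        # Welford update of the running mean and squared deviations.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_anomaly

readings = [20.0, 20.5, 19.8, 20.2, 20.1, 19.9, 35.0, 20.3]
monitor = StreamMonitor()
flags = [monitor.observe(r) for r in readings]
print(flags)
```

Only the reading of 35.0 is flagged. Note that this sketch also folds anomalous readings into the running statistics; a production system might exclude them or use a sliding window instead.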

6. Spatial Data

• Definition: Data related to locations and spatial attributes.
• Example: Geographic Information System (GIS) data, city maps.
• Data Mining Tasks: Analyzing spatial relationships, detecting geographic patterns.

7. Engineering Design Data

• Definition: Data from the design and modeling of systems or structures.
• Example: Building blueprints, circuit designs.
• Data Mining Tasks: Optimizing designs, detecting design patterns.

8. Hypertext and Multimedia Data

• Definition: Text, image, audio, and video data.
• Example: Web pages, multimedia files.
• Data Mining Tasks: Image recognition, video sequence detection, text classification.

9. Graph and Networked Data

• Definition: Data representing connections and relationships.
• Example: Social networks, web graphs.
• Data Mining Tasks: Network analysis, detecting community structures.
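
As a very rough stand-in for community detection, the sketch below finds the connected components of a small friendship graph. Real community detection algorithms also split densely connected groups inside a single component; the names and edges here are made up.

```python
from collections import defaultdict

# Hypothetical undirected "friendship" edges.
edges = [("ann", "bob"), ("bob", "cara"), ("dev", "eli")]

def connected_components(edge_list):
    """Group nodes that can reach each other through any path."""
    graph = defaultdict(set)
    for a, b in edge_list:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first search from an unvisited node.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        components.append(sorted(group))
    return components

groups = connected_components(edges)
print(groups)
```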

10. Web Data

• Definition: Data from the World Wide Web, including web pages and user interactions.
• Example: Website analytics data.
• Data Mining Tasks: Web usage mining, web structure mining, understanding web
dynamics.

Technologies used:

• Statistics:
o Provides methods for collecting, analyzing, and interpreting data.
o Statistical models describe data behavior using random variables and probability
distributions.
o Used for data characterization, classification, prediction, and forecasting.
o Validates data mining results through hypothesis testing.
o Challenges include scaling statistical methods for large datasets and handling noisy or
missing data.

• Machine Learning:
o Focuses on enabling computers to learn from data and improve their performance.
o Includes supervised learning (using labeled data, e.g., for classification), unsupervised
learning (e.g., clustering without labels), and semi-supervised learning (combining
labeled and unlabeled data).
o Active learning lets the learner ask a user (for example, a domain expert) to label
selected examples in order to improve model quality.
o Enhances data mining by improving accuracy and handling complex data types
effectively.
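
To make the supervised case concrete, the sketch below implements a 1-nearest-neighbor classifier, one of the simplest supervised methods: a new point receives the label of the closest labeled training example. The feature values and the "low"/"high" labels are hypothetical.

```python
def nearest_neighbor(train, point):
    """Return the label of the training example closest to `point`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(train, key=lambda example: squared_distance(example[0], point))
    return label

# Labeled training data: (features, label). Values are made up.
train = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"),
    ((8.0, 9.0), "high"), ((9.0, 8.5), "high"),
]

prediction_a = nearest_neighbor(train, (8.5, 8.0))
prediction_b = nearest_neighbor(train, (2.0, 1.0))
print(prediction_a, prediction_b)
```

An unsupervised method would receive the same feature vectors without the labels and would have to discover the two groups on its own.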

• Database Systems and Data Warehouses:
o Manage creation, maintenance, and utilization of large structured datasets.
o Provide efficient storage, retrieval, and querying capabilities.
o Data warehouses integrate data from multiple sources into multidimensional data cubes.
o Facilitate advanced data analysis and mining tasks, ensuring scalability and efficiency.

• Information Retrieval:
o Focuses on searching for and retrieving information from unstructured text and
multimedia data.
o Uses probabilistic models and keyword-based queries to measure document similarities.
o Topic models identify major themes in document collections.
o Integrates with data mining to analyze vast amounts of online unstructured data,
enhancing applications such as digital libraries and health care systems.
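
One common way to score how well a document matches a keyword query is cosine similarity over term counts. This is a swapped-in vector-space technique chosen for simplicity, not the probabilistic models mentioned above, and the documents and query are made up.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two texts' term-count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical document collection and query.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "cooking pasta at home",
}
query = "data mining"

scores = {name: cosine_similarity(query, text) for name, text in docs.items()}
ranked = sorted(docs, key=lambda name: scores[name], reverse=True)
print(ranked, scores)
```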

Applications:

• Customer Segmentation:
o Definition: Customer segmentation involves dividing customers into groups that share
similar characteristics or behaviors.
o Application: By analyzing customer data such as demographics, purchasing history, and
interactions with the company, businesses can segment their customer base. This
segmentation helps in targeted marketing, product recommendations, and personalized
services. For example, an e-commerce company might segment its customers into groups
based on their purchasing habits and tailor promotions accordingly.
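
Segmentation is often done with a clustering algorithm. The sketch below runs a bare-bones k-means on a single feature (monthly spend); the spend values, the starting centers, and the choice of two segments are all assumptions made for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for one-dimensional data."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

monthly_spend = [12, 15, 14, 90, 95, 100]  # hypothetical customers
centers, segments = kmeans_1d(monthly_spend, centers=[0.0, 50.0])
print(centers, segments)
```

The two resulting segments (low spenders and high spenders) could then receive different promotions. Real segmentation would use several features and a tested library implementation.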

• Market Basket Analysis:
o Definition: Market basket analysis examines the purchase behavior of customers to
identify associations between products.
o Application: This technique is commonly used in retail and e-commerce. By analyzing
transactional data, businesses can uncover which products are frequently bought together.
This information is valuable for optimizing product placement, cross-selling, and creating
targeted promotions. For instance, a grocery store might find that customers who buy
cereal are also likely to buy milk, leading to strategic placement of these items in the
store.
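
The cereal-and-milk example can be quantified with the two usual association-rule measures: support (the fraction of all baskets containing both items) and confidence (the fraction of baskets with the first item that also contain the second). The baskets below are made up, and the sketch only counts pairs rather than running a full frequent-itemset algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, one set of items per basket.
baskets = [
    {"cereal", "milk"},
    {"cereal", "milk", "bread"},
    {"bread", "butter"},
    {"cereal", "milk", "butter"},
]

item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def rule_stats(antecedent, consequent):
    """Support and confidence for the rule antecedent -> consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    support = pair_counts[pair] / len(baskets)
    confidence = pair_counts[pair] / item_counts[antecedent]
    return support, confidence

support, confidence = rule_stats("cereal", "milk")
print(support, confidence)
```

In this toy data, cereal and milk appear together in three of four baskets (support 0.75), and every basket with cereal also has milk (confidence 1.0).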

• Risk Management:
o Definition: Risk management involves identifying, assessing, and prioritizing risks,
followed by coordinated efforts to minimize, monitor, and control the impact of these
risks.
o Application: In finance, data mining techniques are used to analyze historical data and
identify patterns associated with credit default, market fluctuations, or fraudulent
activities. By analyzing customer behavior, transaction patterns, and external factors,
financial institutions can assess credit risk, detect fraudulent transactions, and make
informed decisions to mitigate potential losses.

• Fraud Detection:
o Definition: Fraud detection involves identifying and preventing fraudulent activities
within a system or organization.
o Application: Data mining techniques are employed to analyze large volumes of data to
detect anomalous patterns or behaviors that may indicate fraudulent activities. For
example, in banking, algorithms can flag transactions that deviate from a customer's
typical behavior, such as unusually large withdrawals or transactions in unusual
locations. Similarly, in insurance, data mining can help identify patterns associated with
fraudulent claims, such as multiple claims for the same incident.

• Demand Prediction:
o Definition: Demand prediction involves forecasting future demand for products or
services based on historical data, market trends, and other relevant factors.
o Application: By analyzing historical sales data, seasonal trends, market conditions, and
other variables, businesses can predict future demand with a certain level of accuracy.
This information is invaluable for inventory management, production planning, and
supply chain optimization.
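
A very simple forecasting baseline is to fit a straight line to past sales by least squares and extend it one period ahead. The sketch below assumes evenly spaced periods and made-up sales figures; it ignores seasonality and the other factors mentioned above.

```python
def fit_line(ys):
    """Least-squares slope and intercept for y over periods 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

sales = [100, 110, 120, 130]  # hypothetical units sold in four past periods
slope, intercept = fit_line(sales)
forecast = intercept + slope * len(sales)  # prediction for the next period
print(slope, intercept, forecast)
```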

BENEFITS

• Manufacturing:
o In manufacturing, data mining helps optimize production processes, improve
quality control, and reduce operational costs. By analyzing data from sensors,
production lines, and supply chain operations, manufacturers can identify
inefficiencies, predict equipment failures, and streamline production schedules.
This leads to increased productivity, higher product quality, and reduced
downtime, ultimately enhancing competitiveness in the market.
• Mail Order:
o For mail order businesses, data mining enables better customer targeting,
personalized marketing, and improved inventory management. By analyzing
customer purchase history, browsing behavior, and demographic data, mail order
companies can segment their customer base and tailor promotions and offers to
individual preferences. Additionally, data mining helps optimize inventory levels
by predicting demand for different products, ensuring timely stock replenishment
and reducing inventory holding costs.
• Supermarkets:
o Supermarkets leverage data mining to enhance customer satisfaction, optimize
product placement, and increase sales revenue. By analyzing transactional data
through market basket analysis, supermarkets can identify product associations
and cross-selling opportunities, leading to more effective merchandising strategies
and increased basket size. Furthermore, data mining aids in demand forecasting,
enabling supermarkets to manage inventory levels efficiently and minimize
out-of-stock situations, thereby improving customer experience and loyalty.
• Airlines:
o Data mining plays a crucial role in the airline industry by optimizing route
planning, pricing strategies, and customer service. Airlines analyze vast amounts
of data including booking history, flight schedules, weather patterns, and
customer preferences to forecast demand, adjust ticket prices dynamically, and
optimize flight schedules. Additionally, data mining enables airlines to
personalize services, anticipate customer needs, and improve loyalty programs,
ultimately enhancing customer satisfaction and profitability while maintaining
operational efficiency.
• Insurance:
o In the insurance sector, data mining facilitates risk assessment, fraud detection,
and personalized customer experiences. By analyzing historical claims data,
demographic information, and external factors such as economic trends and
environmental risks, insurers can assess risk profiles more accurately, price
policies competitively, and customize coverage options for individual customers.
Moreover, data mining techniques help identify fraudulent claims, mitigate losses,
and enhance the overall integrity of insurance operations, fostering trust and
loyalty among policyholders.

Data Visualization Techniques

1. Line Chart:
o Definition: A line chart displays data points connected by straight lines. It is
commonly used to show trends over time or to compare the relationship between
two variables.
o Example: A line chart might be used to visualize the monthly sales performance
of a company over the course of a year, with each data point representing sales
figures for a specific month.
2. Area Graph:
o Definition: An area graph is similar to a line chart but with the area below the
lines filled in with color. It is used to show the cumulative totals of multiple
variables over time.
o Example: An area graph could be used to depict the total revenue generated by
different product categories over several quarters, with each category represented
by a different colored area.
3. Pie Chart:
o Definition: A pie chart divides a circle into slices to represent the proportion of
different categories within a dataset. It is useful for showing the relative
distribution of categorical data.
o Example: A pie chart might be used to illustrate the percentage breakdown of
expenses in a household budget, with each slice representing a different expense
category such as housing, transportation, or food.
4. Flow Chart:
o Definition: A flow chart is a graphical representation of a process or workflow,
depicting the sequence of steps and decision points in a systematic manner. It is
commonly used for process documentation, analysis, and optimization.
o Example: A flow chart could be used to visualize the steps involved in the
customer support process of a company, including steps such as receiving a
support ticket, assigning it to a representative, troubleshooting, resolving the
issue, and closing the ticket.
5. Scatterplot (Correlation Types):
o Definition: A scatterplot displays individual data points as dots on a
two-dimensional graph, with one variable represented on the x-axis and another
variable on the y-axis. It is used to examine the relationship between two
continuous variables.
o Example: A scatterplot might be used to visualize the relationship between a
person's age and their income level, with each data point representing an
individual's age and income.

Correlation Types in Scatterplots:

• Positive Correlation: When the data points in a scatterplot tend to form a pattern that
slopes upward from left to right, it indicates a positive correlation between the two
variables. This means that as one variable increases, the other variable also tends to
increase.
• Negative Correlation: Conversely, when the data points form a pattern that slopes
downward from left to right, it indicates a negative correlation between the two variables.
This means that as one variable increases, the other variable tends to decrease.
• No Correlation: If the data points in a scatterplot appear randomly distributed with no
discernible pattern, it suggests that there is no correlation between the two variables. In
other words, changes in one variable are not associated with changes in the other
variable.
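
The three cases can be told apart numerically with the Pearson correlation coefficient, which ranges from -1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship). The sketch below computes it by hand on small made-up datasets chosen so that each case comes out exactly.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

xs = [1, 2, 3, 4]
r_positive = pearson(xs, [2, 4, 6, 8])    # points slope upward
r_negative = pearson(xs, [8, 6, 4, 2])    # points slope downward
r_none = pearson(xs, [1, -1, -1, 1])      # no linear pattern
print(r_positive, r_negative, r_none)
```

A coefficient near zero only rules out a linear relationship; the last dataset, for instance, follows a clear U shape.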

Each of these data visualizations serves different purposes and can provide valuable insights into
the underlying data, helping analysts and decision-makers understand relationships, patterns, and
trends more effectively.

Limitations

1. Line Chart:
o Limitation: While line charts are effective for showing trends over time, they may
oversimplify complex data relationships. They are not suitable for displaying
categorical data or data with irregular intervals. Additionally, line charts can
obscure fluctuations within data if there are too many data points or if the data is
highly variable.
2. Area Graph:
o Limitation: Area graphs share the limitations of line charts, as they are
essentially line charts with the area beneath the lines filled in.
They can make it challenging to discern individual data points or accurately
compare values between different categories, especially when multiple variables
are overlaid.
3. Pie Chart:
o Limitation: Pie charts can be misleading when used to represent data with too
many categories or when the differences between categories are small. It can be
difficult to accurately compare the sizes of the slices, especially when there are
many slices or when the slices are of similar sizes. Additionally, pie charts do not
effectively convey trends over time or relationships between variables.
4. Flow Chart:
o Limitation: Flow charts are primarily used for representing processes and
workflows, and they are not suited to visualizing quantitative data such as
magnitudes or proportions. They can also become overly complex and difficult to
interpret when depicting intricate processes with multiple decision points and
branches.
5. Scatterplot (Correlation Types):
o Limitation: While scatterplots are effective for visualizing the relationship
between two variables, they show only two variables at a time, so interactions
among several variables are not visible. They can also be misleading if outliers or
influential data points disproportionately affect the overall pattern. Additionally,
correlation does not imply causation, so caution should be exercised when
interpreting scatterplot relationships.
