Data Mining Unit 2

Data mining is the process of discovering patterns and knowledge from large datasets, involving steps such as data cleaning, integration, selection, transformation, mining, evaluation, and presentation. Various types of data can be mined, including relational databases, data warehouses, transactional data, and more, each with specific mining tasks and applications. Technologies such as statistics, machine learning, and information retrieval enhance data mining, which is utilized in applications like customer segmentation, market basket analysis, and fraud detection.

Uploaded by

Misba firdose

UNIT 2

Data mining is the process of discovering interesting patterns and
knowledge from large amounts of data. The data sources can include databases, data
warehouses, the Web, other information repositories, or data that are streamed into the
system dynamically.

Knowledge discovery from data (KDD) is an iterative sequence of the following steps:

1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract
data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures; see Section 1.4.6)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present mined knowledge to users)
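The seven steps above can be sketched end to end on a toy dataset. This is only an illustrative sketch: the records, the field names, and the `min_support` threshold are assumptions made up for the example, and each step is reduced to a line or two.

```python
from collections import Counter

# Hypothetical raw records from two sources (illustrative data only).
store_sales = [
    {"customer": "c1", "item": "laptop", "price": 900.0},
    {"customer": "c2", "item": "mouse", "price": None},  # missing value
    {"customer": "c1", "item": "mouse", "price": 20.0},
]
web_sales = [
    {"customer": "c2", "item": "laptop", "price": 950.0},
    {"customer": "c3", "item": "laptop", "price": 870.0},
    {"customer": "c3", "item": "keyboard", "price": 45.0},
]

# 1. Data cleaning: drop records with a missing price.
def clean(records):
    return [r for r in records if r["price"] is not None]

# 2. Data integration: combine the two cleaned sources.
integrated = clean(store_sales) + clean(web_sales)

# 3. Data selection: keep only the attributes relevant to the task.
selected = [(r["customer"], r["item"]) for r in integrated]

# 4. Data transformation: consolidate into one basket of items per customer.
baskets = {}
for customer, item in selected:
    baskets.setdefault(customer, set()).add(item)

# 5. Data mining: count how many baskets contain each item.
item_counts = Counter(item for items in baskets.values() for item in items)

# 6. Pattern evaluation: keep items whose support meets an assumed threshold.
min_support = 0.6
frequent = {
    item: count / len(baskets)
    for item, count in item_counts.items()
    if count / len(baskets) >= min_support
}

# 7. Knowledge presentation: report the surviving patterns.
for item, support in sorted(frequent.items()):
    print(f"{item}: support {support:.2f}")
```

Real pipelines iterate over these steps rather than running them once, and each step is usually far more involved than shown here.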

What Kinds of Data Can Be Mined

1. Relational Database Data

• Definition: A collection of tables (relations), each with a unique name, consisting of
attributes (columns) and tuples (rows).
• Example: The AllElectronics relational database includes tables like customer, item,
employee, and branch.
• Data Access: Relational queries (e.g., SQL) allow data retrieval, aggregate functions, and
trend analysis.
• Data Mining Tasks: Predicting customer behavior, detecting deviations in sales,
discovering patterns and trends.
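As a small illustration of relational access with aggregate functions, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `customer` and `purchase` tables, their columns, and the rows are hypothetical; they are not the actual AllElectronics schema.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE purchase (cust_id INTEGER, item TEXT, amount REAL);
INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO purchase VALUES
    (1, 'laptop', 900.0), (1, 'mouse', 20.0), (2, 'mouse', 25.0);
""")

# Join the tables and aggregate: purchases and total spend per customer.
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS n_purchases, SUM(p.amount) AS total
    FROM customer AS c
    JOIN purchase AS p ON p.cust_id = c.cust_id
    GROUP BY c.name
    ORDER BY total DESC
""").fetchall()

for name, n_purchases, total in rows:
    print(name, n_purchases, total)

conn.close()
```

A query like this answers a fixed question; data mining goes further by searching for patterns that were not specified in advance.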

2. Data Warehouses

• Definition: A repository of information from multiple sources, stored under a unified
schema, often at a single site.
• Characteristics: Organized around major subjects, provides historical data, and typically
uses a multidimensional data structure (data cube).
• Example: AllElectronics data warehouse stores summarized sales data for analysis.
• Data Access: Online Analytical Processing (OLAP) operations like drill-down and
roll-up for multidimensional data analysis.
• Data Mining Tasks: Summarizing sales, detecting patterns over time, facilitating
decision making through aggregated data.
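Roll-up and drill-down can be pictured as aggregating the same facts at coarser or finer levels of a dimension. The sketch below is a simplified illustration in plain Python, not an OLAP engine; the fact rows and the `aggregate` helper are assumptions for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, city, sales).
facts = [
    (2023, "Q1", "Chicago", 100),
    (2023, "Q1", "Toronto", 80),
    (2023, "Q2", "Chicago", 120),
    (2024, "Q1", "Chicago", 150),
]
YEAR, QUARTER, CITY, SALES = 0, 1, 2, 3

def aggregate(rows, dims):
    """Sum sales grouped by the chosen dimension columns."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row[SALES]
    return dict(totals)

# Finer level of the time dimension (what a drill-down would show).
by_year_quarter = aggregate(facts, [YEAR, QUARTER])

# Roll-up: climb the time hierarchy from quarter to year.
by_year = aggregate(facts, [YEAR])

print(by_year_quarter)
print(by_year)
```

A real data warehouse typically precomputes or indexes such aggregates so that these operations stay fast on large data.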
3. Transactional Data

• Definition: Records of transactions, such as purchases, bookings, or user clicks, each
with a unique transaction ID and a list of items involved.
• Example: A transactional database for AllElectronics records each sales transaction with
items sold.
• Data Access: Queries to analyze transactional patterns.
• Data Mining Tasks: Market basket analysis to identify items frequently sold together,
detecting purchase patterns.

Other Kinds of Data:

4. Time-related or Sequence Data

• Definition: Data collected over time, like historical records or time-series data.
• Example: Stock exchange data, historical weather data.
• Data Mining Tasks: Detecting trends, seasonal patterns, predicting future events.
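One simple way to expose a trend in time-series data is to smooth it with a moving average. The sketch below assumes a short, made-up price series and a window of 3; it is one elementary technique among many, not a full trend or seasonality analysis.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical daily closing prices (illustrative data only).
prices = [10, 12, 11, 13, 15, 14, 16, 18]

smoothed = moving_average(prices, window=3)

# Crude trend check: compare the first and last smoothed values.
trend = "upward" if smoothed[-1] > smoothed[0] else "downward or flat"

print(smoothed)
print(trend)
```

The raw series moves up and down from day to day, while the smoothed series rises steadily, which makes the underlying trend easier to see.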

5. Data Streams

• Definition: Continuous, real-time data flows from sources like sensors or online
activities.
• Example: Video surveillance feeds, sensor data.
• Data Mining Tasks: Real-time anomaly detection, monitoring changes over time.
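Because a stream cannot be stored and rescanned, stream methods usually keep a small running summary and update it per value. The sketch below uses Welford's online algorithm for the running mean and variance and flags a value that lies more than a chosen number of standard deviations from the mean. The `StreamMonitor` class, the threshold of 3, the warm-up of 5 readings, and the sensor values are all assumptions for illustration.

```python
import math

class StreamMonitor:
    """Running mean/variance (Welford's algorithm) with a simple anomaly rule."""

    def __init__(self, threshold=3.0, warmup=5):
        self.n = 0          # number of values seen so far
        self.mean = 0.0     # running mean
        self.m2 = 0.0       # running sum of squared deviations
        self.threshold = threshold
        self.warmup = warmup

    def update(self, x):
        """Return True if x looks anomalous relative to the values seen so far."""
        is_anomaly = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) > self.threshold * std:
                is_anomaly = True
        # Welford update: fold x into the running statistics in O(1) memory.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_anomaly

# Hypothetical temperature sensor readings arriving one at a time.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 35.0, 20.1]
monitor = StreamMonitor()
flags = [monitor.update(r) for r in readings]
print(flags)
```

In this sketch a flagged value is still folded into the running statistics, which inflates the variance afterwards; a production detector might exclude or down-weight such values.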

6. Spatial Data

• Definition: Data related to locations and spatial attributes.
• Example: Geographic Information System (GIS) data, city maps.
• Data Mining Tasks: Analyzing spatial relationships, detecting geographic patterns.

7. Engineering Design Data

• Definition: Data from the design and modeling of systems or structures.
• Example: Building blueprints, circuit designs.
• Data Mining Tasks: Optimizing designs, detecting design patterns.

8. Hypertext and Multimedia Data

• Definition: Text, images, audio, video data.
• Example: Web pages, multimedia files.
• Data Mining Tasks: Image recognition, video sequence detection, text classification.

9. Graph and Networked Data

• Definition: Data representing connections and relationships.
• Example: Social networks, web graphs.
• Data Mining Tasks: Network analysis, detecting community structures.

10. Web Data

• Definition: Data from the World Wide Web, including web pages and user interactions.
• Example: Website analytics data.
• Data Mining Tasks: Web usage mining, web structure mining, understanding web
dynamics.

Technologies used:

• Statistics:
o Provides methods for collecting, analyzing, and interpreting data.
o Statistical models describe data behavior using random variables and probability
distributions.
o Used for data characterization, classification, prediction, and forecasting.
o Validates data mining results through hypothesis testing.
o Challenges include scaling statistical methods for large datasets and handling noisy or
missing data.

• Machine Learning:
o Focuses on enabling computers to learn from data and improve performance.
o Includes supervised learning (using labeled data for classification), unsupervised learning
(clustering without labels), and semi-supervised learning (combining labeled and
unlabeled data).
o Active learning involves user input to improve model quality.
o Enhances data mining by improving accuracy and handling complex data types
effectively.

• Database Systems and Data Warehouses:
o Manage creation, maintenance, and utilization of large structured datasets.
o Provide efficient storage, retrieval, and querying capabilities.
o Data warehouses integrate data from multiple sources into multidimensional data cubes.
o Facilitate advanced data analysis and mining tasks, ensuring scalability and efficiency.

• Information Retrieval:
o Focuses on searching for and retrieving information from unstructured text and
multimedia data.
o Uses probabilistic models and keyword-based queries to measure document similarities.
o Topic models identify major themes in document collections.
o Integrates with data mining to analyze vast amounts of online unstructured data,
enhancing applications like digital libraries and health care systems.
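To make the idea of measuring similarity between a keyword query and documents concrete, the sketch below ranks documents by cosine similarity over raw term-frequency vectors. Note that this is the vector space approach, swapped in here because it is short to show; it is not one of the probabilistic models mentioned above. The documents and the query are made up for the example.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term frequencies (lowercased, whitespace tokenized)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical document collection.
documents = {
    "d1": "data mining finds patterns in large data sets",
    "d2": "information retrieval searches text documents",
    "d3": "mining patterns from transaction data",
}

query = tf_vector("data mining patterns")
ranked = sorted(
    documents,
    key=lambda d: cosine_similarity(query, tf_vector(documents[d])),
    reverse=True,
)
print(ranked)
```

Here d3 ranks above d1 even though d1 mentions "data" twice, because cosine similarity normalizes for document length; d2 shares no terms with the query and scores zero.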

Applications:

• Customer Segmentation:
o Definition: Customer segmentation involves dividing customers into groups that share
similar characteristics or behaviors.
o Application: By analyzing customer data such as demographics, purchasing history, and
interactions with the company, businesses can segment their customer base. This
segmentation helps in targeted marketing, product recommendations, and personalized
services. For example, an e-commerce company might segment its customers into groups
based on their purchasing habits and tailor promotions accordingly.
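Segmentation by purchasing habits is often done with a clustering method. The sketch below runs plain k-means on two hypothetical features per customer; the data, the choice of two segments, and the starting centroids are assumptions for illustration, and in practice features would typically be scaled before clustering.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
            ]
            clusters[distances.index(min(distances))].append(p)
        # New centroid = per-dimension mean; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers: (orders per year, average order value).
customers = [
    (2, 20.0), (3, 25.0), (1, 30.0),        # occasional, low-value buyers
    (20, 200.0), (22, 180.0), (18, 220.0),  # frequent, high-value buyers
]

centroids, segments = kmeans(customers, centroids=[customers[0], customers[3]])
print(centroids)
print(segments)
```

Each resulting segment could then receive its own promotion, for example loyalty rewards for the frequent buyers and re-engagement offers for the occasional ones.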

• Market Basket Analysis:
o Definition: Market basket analysis examines the purchase behavior of customers to
identify associations between products.
o Application: This technique is commonly used in retail and e-commerce. By analyzing
transactional data, businesses can uncover which products are frequently bought together.
This information is valuable for optimizing product placement, cross-selling, and creating
targeted promotions. For instance, a grocery store might find that customers who buy
cereal are also likely to buy milk, leading to strategic placement of these items in the
store.
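The cereal and milk example can be quantified with the usual association-rule measures: support (how often both items appear together), confidence (how often milk appears given cereal), and lift (confidence relative to how common milk is overall). The transactions below are invented for illustration, and the sketch only handles rules between two single items.

```python
from collections import Counter
from itertools import combinations

# Hypothetical grocery transactions (illustrative data only).
transactions = [
    {"cereal", "milk", "bread"},
    {"cereal", "milk"},
    {"bread", "butter"},
    {"cereal", "milk", "butter"},
    {"milk", "bread"},
]

# Count single items and unordered item pairs across all baskets.
item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(frozenset(p) for p in combinations(sorted(basket), 2))

def rule_metrics(antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    both = pair_counts[frozenset((antecedent, consequent))]
    support = both / n
    confidence = both / item_counts[antecedent]
    lift = confidence / (item_counts[consequent] / n)
    return support, confidence, lift

support, confidence, lift = rule_metrics("cereal", "milk")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items are bought together more often than would be expected if purchases were independent. Algorithms such as Apriori extend this counting idea to larger itemsets efficiently.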

• Risk Management:
o Definition: Risk management involves identifying, assessing, and prioritizing risks,
followed by coordinated efforts to minimize, monitor, and control the impact of these
risks.
o Application: In finance, data mining techniques are used to analyze historical data and
identify patterns associated with credit default, market fluctuations, or fraudulent
activities. By analyzing customer behavior, transaction patterns, and external factors,
financial institutions can assess credit risk, detect fraudulent transactions, and make
informed decisions to mitigate potential losses.

• Fraud Detection:
o Definition: Fraud detection involves identifying and preventing fraudulent activities
within a system or organization.
o Application: Data mining techniques are employed to analyze large volumes of data to
detect anomalous patterns or behaviors that may indicate fraudulent activities. For
example, in banking, algorithms can flag transactions that deviate from a customer's
typical behavior, such as unusually large withdrawals or transactions in unusual
locations. Similarly, in insurance, data mining can help identify patterns associated with
fraudulent claims, such as multiple claims for the same incident.
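Flagging transactions that deviate from a customer's typical behavior can be sketched with a robust outlier score. The example below uses the modified z-score based on the median and the median absolute deviation (MAD); the constant 0.6745 and the cutoff of 3.5 are the values conventionally used with that score, while the function name and the amounts are assumptions for illustration. Real fraud systems combine many more signals, such as location and timing.

```python
from statistics import median

def flag_unusual(history, new_amounts, cutoff=3.5):
    """Flag amounts far from the customer's typical spending (median/MAD score)."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    flagged = []
    for amount in new_amounts:
        # If all past amounts are identical (MAD of 0), this sketch flags nothing.
        score = 0.6745 * abs(amount - med) / mad if mad > 0 else 0.0
        if score > cutoff:
            flagged.append(amount)
    return flagged

# Hypothetical past withdrawals for one customer, then three new ones.
history = [40.0, 55.0, 38.0, 60.0, 45.0, 52.0, 48.0]
flagged = flag_unusual(history, [50.0, 47.0, 900.0])
print(flagged)
```

The median and MAD are used instead of the mean and standard deviation because they are less distorted by the very outliers the check is trying to find.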

• Demand Prediction:
o Definition: Demand prediction involves forecasting future demand for products or
services based on historical data, market trends, and other relevant factors.
o Application: By analyzing historical sales data, seasonal trends, market conditions, and
other variables, businesses can predict future demand with a certain level of accuracy.
This information is invaluable for inventory management, production planning, and
supply chain optimization.
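As a minimal sketch of forecasting from historical sales, the example below fits a straight trend line by ordinary least squares and extends it one period ahead. The monthly figures are invented, and the model deliberately ignores seasonality and market conditions, which a realistic forecast would need to account for.

```python
def fit_line(ys):
    """Least-squares fit of y = intercept + slope * t for t = 0, 1, 2, ..."""
    n = len(ys)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    slope = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys)) / sum(
        (t - t_mean) ** 2 for t in ts
    )
    intercept = y_mean - slope * t_mean
    return intercept, slope

# Hypothetical units sold in six consecutive months.
monthly_units = [100, 110, 120, 130, 140, 150]

intercept, slope = fit_line(monthly_units)

# Forecast for the next month (t = 6).
forecast_next = intercept + slope * len(monthly_units)
print(forecast_next)
```

A forecast like this could feed an inventory reorder rule, with a safety margin added to reflect the uncertainty of the estimate.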

BENEFITS

• Manufacturing:
o In manufacturing, data mining helps optimize production processes, improve
quality control, and reduce operational costs. By analyzing data from sensors,
production lines, and supply chain operations, manufacturers can identify
inefficiencies, predict equipment failures, and streamline production schedules.
This leads to increased productivity, higher product quality, and reduced
downtime, ultimately enhancing competitiveness in the market.
• Mail Order:
o For mail order businesses, data mining enables better customer targeting,
personalized marketing, and improved inventory management. By analyzing
customer purchase history, browsing behavior, and demographic data, mail order
companies can segment their customer base and tailor promotions and offers to
individual preferences. Additionally, data mining helps optimize inventory levels
by predicting demand for different products, ensuring timely stock replenishment
and reducing inventory holding costs.
• Supermarkets:
o Supermarkets leverage data mining to enhance customer satisfaction, optimize
product placement, and increase sales revenue. By analyzing transactional data
through market basket analysis, supermarkets can identify product associations
and cross-selling opportunities, leading to more effective merchandising strategies
and increased basket size. Furthermore, data mining aids in demand forecasting,
enabling supermarkets to manage inventory levels efficiently and minimize
out-of-stock situations, thereby improving customer experience and loyalty.
• Airlines:
o Data mining plays a crucial role in the airline industry by optimizing route
planning, pricing strategies, and customer service. Airlines analyze vast amounts
of data including booking history, flight schedules, weather patterns, and
customer preferences to forecast demand, adjust ticket prices dynamically, and
optimize flight schedules. Additionally, data mining enables airlines to
personalize services, anticipate customer needs, and improve loyalty programs,
ultimately enhancing customer satisfaction and profitability while maintaining
operational efficiency.
• Insurance:
o In the insurance sector, data mining facilitates risk assessment, fraud detection,
and personalized customer experiences. By analyzing historical claims data,
demographic information, and external factors such as economic trends and
environmental risks, insurers can assess risk profiles more accurately, price
policies competitively, and customize coverage options for individual customers.
Moreover, data mining techniques help identify fraudulent claims, mitigate losses,
and enhance the overall integrity of insurance operations, fostering trust and
loyalty among policyholders.

Data Visualization Types

1. Line Chart:
o Definition: A line chart displays data points connected by straight lines. It is
commonly used to show trends over time or to compare the relationship between
two variables.
o Example: A line chart might be used to visualize the monthly sales performance
of a company over the course of a year, with each data point representing sales
figures for a specific month.
2. Area Graph:
o Definition: An area graph is similar to a line chart but with the area below the
lines filled in with color. It is used to show the cumulative totals of multiple
variables over time.
o Example: An area graph could be used to depict the total revenue generated by
different product categories over several quarters, with each category represented
by a different colored area.
3. Pie Chart:
o Definition: A pie chart divides a circle into slices to represent the proportion of
different categories within a dataset. It is useful for showing the relative
distribution of categorical data.
o Example: A pie chart might be used to illustrate the percentage breakdown of
expenses in a household budget, with each slice representing a different expense
category such as housing, transportation, food, etc.
4. Flow Chart:
o Definition: A flow chart is a graphical representation of a process or workflow,
depicting the sequence of steps and decision points in a systematic manner. It is
commonly used for process documentation, analysis, and optimization.
o Example: A flow chart could be used to visualize the steps involved in the
customer support process of a company, including steps such as receiving a
support ticket, assigning it to a representative, troubleshooting, resolving the
issue, and closing the ticket.
5. Scatterplot (Correlation Types):
o Definition: A scatterplot displays individual data points as dots on a two-
dimensional graph, with one variable represented on the x-axis and another
variable on the y-axis. It is used to examine the relationship between two
continuous variables.
o Example: A scatterplot might be used to visualize the relationship between a
person's age and their income level, with each data point representing an
individual's age and income.

Correlation Types in Scatterplots:

• Positive Correlation: When the data points in a scatterplot tend to form a pattern that
slopes upward from left to right, it indicates a positive correlation between the two
variables. This means that as one variable increases, the other variable also tends to
increase.
• Negative Correlation: Conversely, when the data points form a pattern that slopes
downward from left to right, it indicates a negative correlation between the two variables.
This means that as one variable increases, the other variable tends to decrease.
• No Correlation: If the data points in a scatterplot appear randomly distributed with no
discernible pattern, it suggests that there is no correlation between the two variables. In
other words, changes in one variable are not associated with changes in the other
variable.
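The three cases above can also be checked numerically with the Pearson correlation coefficient, which ranges from -1 to +1. The sketch below computes it from its definition; the sample data and the 0.1 cutoff used to call a value "no clear linear correlation" are arbitrary assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def describe(r, weak=0.1):
    """Label a coefficient; the `weak` cutoff is an assumed, arbitrary choice."""
    if r > weak:
        return "positive"
    if r < -weak:
        return "negative"
    return "no clear linear correlation"

# Hypothetical paired observations.
ages = [25, 30, 35, 40, 45]
incomes = [30, 38, 45, 52, 61]      # rises with age
resale_values = [20, 17, 15, 11, 9] # falls with age
scores = [4, 6, 5, 6, 4]            # no linear relation to age

for name, ys in [("income", incomes), ("resale", resale_values), ("score", scores)]:
    r = pearson_r(ages, ys)
    print(name, round(r, 3), describe(r))
```

A coefficient near zero only rules out a linear relationship; the points may still follow a curved pattern, so it is worth looking at the scatterplot itself.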

Each of these data visualizations serves different purposes and can provide valuable insights into
the underlying data, helping analysts and decision-makers understand relationships, patterns, and
trends more effectively.

Limitations
1. Line Chart:
o Limitation: While line charts are effective for showing trends over time, they may
oversimplify complex data relationships. They are not suitable for displaying
categorical data or data with irregular intervals. Additionally, line charts can
obscure fluctuations within data if there are too many data points or if the data is
highly variable.
2. Area Graph:
o Limitation: Area graphs share the limitations of line charts, as they are
essentially line charts with the area beneath the lines filled in.
They can make it challenging to discern individual data points or accurately
compare values between different categories, especially when multiple variables
are overlaid.
3. Pie Chart:
o Limitation: Pie charts can be misleading when used to represent data with too
many categories or when the differences between categories are small. It can be
difficult to accurately compare the sizes of the slices, especially when there are
many slices or when the slices are of similar sizes. Additionally, pie charts do not
effectively convey trends over time or relationships between variables.
4. Flow Chart:
o Limitation: Flow charts are primarily used for representing processes and
workflows, and they may not be suitable for visualizing quantitative data. They
can become overly complex and difficult to interpret when depicting intricate
processes with multiple decision points and branches. Flow charts also lack the
ability to convey quantitative information such as magnitudes or proportions.
5. Scatterplot (Correlation Types):
o Limitation: While scatterplots are effective for visualizing the relationship
between two variables, a single plot cannot show interactions among more than two
variables, and a linear correlation read from it can miss nonlinear relationships.
They can also be misleading if outliers or influential data points
disproportionately affect the overall pattern. Additionally, correlation does not
imply causation, so caution should be exercised when interpreting scatterplot
relationships.
