Data Mining Unit 2
Data mining is the process of discovering interesting patterns and
knowledge from large amounts of data. The data sources can include databases, data
warehouses, the Web, other information repositories, or data that are streamed into the
system dynamically.
2. Time-Related or Sequence Data
Definition: Data collected over time, like historical records or time-series data.
Example: Stock exchange data, historical weather data.
Data Mining Tasks: Detecting trends, seasonal patterns, predicting future events.
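Trend detection on data collected over time can be sketched with a simple moving average. This is a minimal illustration, not a full forecasting method: the function names, the window size, and the closing prices are assumptions invented for the example.

```python
from statistics import mean

def moving_average(values, window):
    """Return the simple moving average of `values` over a sliding window."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]

def trend_direction(values, window=3):
    """Label the trend by comparing the first and last smoothed values."""
    smoothed = moving_average(values, window)
    if smoothed[-1] > smoothed[0]:
        return "upward"
    if smoothed[-1] < smoothed[0]:
        return "downward"
    return "flat"

# Hypothetical daily closing prices, invented for illustration.
closing_prices = [100, 102, 101, 105, 107, 106, 110]
print(moving_average(closing_prices, 3))
print(trend_direction(closing_prices))
```

Smoothing first keeps a single noisy day from deciding the trend label. Detecting seasonal patterns would need a longer series and a comparison across repeating periods.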
5. Data Streams
Definition: Continuous, real-time data flows from sources like sensors or online
activities.
Example: Video surveillance feeds, sensor data.
Data Mining Tasks: Real-time anomaly detection, monitoring changes over time.
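Real-time anomaly detection has to process each reading once, without storing the whole stream. The sketch below is one possible approach, assuming a z-score test over running statistics maintained with Welford's online algorithm. The class name, the threshold of 3 standard deviations, the warm-up length, and the sensor readings are all illustrative assumptions.

```python
import math

class StreamingAnomalyDetector:
    """Flags values far from the running mean, using constant memory."""

    def __init__(self, threshold=3.0, warmup=5):
        self.threshold = threshold  # z-score above which a value is flagged
        self.warmup = warmup        # readings to observe before flagging
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def update(self, value):
        """Return True if `value` looks anomalous, then fold it into the stats."""
        is_anomaly = False
        if self.count >= self.warmup:
            std = math.sqrt(self.m2 / (self.count - 1))
            if std > 0 and abs(value - self.mean) / std > self.threshold:
                is_anomaly = True
        # Welford's online update of the mean and squared deviations.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        return is_anomaly

# Hypothetical temperature readings with one spike, invented for illustration.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 35.0, 20.1]
detector = StreamingAnomalyDetector()
flags = [detector.update(r) for r in readings]
print(flags)
```

Note that this sketch folds flagged values into the running statistics, which inflates the standard deviation after a spike. A production detector might exclude or down-weight them.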
6. Web Data
Definition: Data from the World Wide Web, including web pages and user interactions.
Example: Website analytics data.
Data Mining Tasks: Web usage mining, web structure mining, understanding web
dynamics.
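Web usage mining can be illustrated by counting page views and page-to-page transitions in a click log. This is a minimal sketch: the log format of (session_id, page) tuples in time order, the function name, and the sample entries are assumptions made for the example.

```python
from collections import Counter

def mine_web_usage(click_log):
    """Count page views and within-session page-to-page transitions.

    `click_log` is an iterable of (session_id, page) tuples in time order.
    """
    page_views = Counter()
    transitions = Counter()
    last_page = {}  # most recent page seen in each session
    for session_id, page in click_log:
        page_views[page] += 1
        if session_id in last_page:
            transitions[(last_page[session_id], page)] += 1
        last_page[session_id] = page
    return page_views, transitions

# Hypothetical click log, invented for illustration.
click_log = [
    ("s1", "/home"), ("s1", "/products"), ("s2", "/home"),
    ("s1", "/cart"), ("s2", "/products"), ("s2", "/home"),
]
page_views, transitions = mine_web_usage(click_log)
print(page_views.most_common(1))
print(transitions.most_common(1))
```

The transition counts show which navigation paths are most common, which is the kind of pattern web usage mining looks for.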
Technologies used:
Statistics:
Machine Learning:
Information Retrieval:
Focuses on searching for and retrieving information from unstructured text and
multimedia data.
Uses probabilistic models and keyword-based queries to measure document similarities.
Topic models identify major themes in document collections.
Integrates with data mining to analyze vast amounts of online unstructured data,
enhancing applications like digital libraries and health care systems.
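Measuring document similarity for a keyword query can be sketched with cosine similarity over raw term counts. Note that this is a simple vector-space technique swapped in for illustration, not one of the probabilistic models mentioned above. The function names and the sample documents are assumptions.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent text as a bag of lowercase words with their counts."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical document collection, invented for illustration.
documents = [
    "patient records in health care systems",
    "digital libraries index scanned books",
    "stock prices and market trends",
]
ranking = rank_documents("health care records", documents)
print(ranking[0])
```

Real retrieval systems typically add term weighting, stemming, and stop-word removal. Those steps are omitted here to keep the similarity computation visible.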
Applications:
Customer Segmentation:
Definition: Customer segmentation involves dividing customers into groups that share
similar characteristics or behaviors.
Application: By analyzing customer data such as demographics, purchasing history, and
interactions with the company, businesses can segment their customer base. This
segmentation helps in targeted marketing, product recommendations, and personalized
services. For example, an e-commerce company might segment its customers into groups
based on their purchasing habits and tailor promotions accordingly.
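Segmenting customers by purchasing habits can be sketched with k-means clustering. The notes do not name a specific algorithm, so k-means is an assumption here, as are the two features (orders per year, average order value), the starting centroids, and the customer data.

```python
import math

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]

def k_means(points, initial_centroids, iterations=10):
    """Cluster points by alternating assignment and centroid updates."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                # New centroid is the per-dimension mean of its members.
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return assign(points, centroids), centroids

# Hypothetical customers: (orders per year, average order value).
customers = [(2, 20.0), (3, 25.0), (1, 22.0), (20, 80.0), (22, 90.0), (18, 85.0)]
labels, centroids = k_means(customers, [customers[0], customers[3]])
print(labels)     # segment index for each customer
print(centroids)  # the "typical" customer of each segment
```

Each resulting segment could then receive its own promotion. In practice the features would usually be scaled first so that one with a larger range does not dominate the distance.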
Risk Management:
Fraud Detection:
Demand Prediction:
Benefits
Manufacturing:
o In manufacturing, data mining helps optimize production processes, improve
quality control, and reduce operational costs. By analyzing data from sensors,
production lines, and supply chain operations, manufacturers can identify
inefficiencies, predict equipment failures, and streamline production schedules.
This leads to increased productivity, higher product quality, and reduced
downtime, ultimately enhancing competitiveness in the market.
Mail Order:
o For mail order businesses, data mining enables better customer targeting,
personalized marketing, and improved inventory management. By analyzing
customer purchase history, browsing behavior, and demographic data, mail order
companies can segment their customer base and tailor promotions and offers to
individual preferences. Additionally, data mining helps optimize inventory levels
by predicting demand for different products, ensuring timely stock replenishment
and reducing inventory holding costs.
Supermarkets:
o Supermarkets leverage data mining to enhance customer satisfaction, optimize
product placement, and increase sales revenue. By analyzing transactional data
through market basket analysis, supermarkets can identify product associations
and cross-selling opportunities, leading to more effective merchandising strategies
and increased basket size. Furthermore, data mining aids in demand forecasting,
enabling supermarkets to manage inventory levels efficiently and minimize out-
of-stock situations, thereby improving customer experience and loyalty.
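Market basket analysis looks for products that are bought together. A minimal sketch of the idea computes support (the share of baskets containing both items) and confidence (the share of baskets with the first item that also contain the second) for item pairs. The function name, the minimum support of 0.5, and the transactions are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.5):
    """Return {(antecedent, consequent): (support, confidence)} for item pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            rules[(a, b)] = (support, count / item_counts[a])
            rules[(b, a)] = (support, count / item_counts[b])
    return rules

# Hypothetical supermarket baskets, invented for illustration.
transactions = [
    ["bread", "butter"],
    ["bread", "butter", "milk"],
    ["bread"],
    ["milk", "eggs"],
]
rules = pair_rules(transactions)
for (antecedent, consequent), (support, confidence) in sorted(rules.items()):
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

Here every basket with butter also contains bread, so the rule from butter to bread has higher confidence than the reverse. A rule like that could inform product placement or cross-selling.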
Airlines:
o Data mining plays a crucial role in the airline industry by optimizing route
planning, pricing strategies, and customer service. Airlines analyze vast amounts
of data including booking history, flight schedules, weather patterns, and
customer preferences to forecast demand, adjust ticket prices dynamically, and
optimize flight schedules. Additionally, data mining enables airlines to
personalize services, anticipate customer needs, and improve loyalty programs,
ultimately enhancing customer satisfaction and profitability while maintaining
operational efficiency.
Insurance:
o In the insurance sector, data mining facilitates risk assessment, fraud detection,
and personalized customer experiences. By analyzing historical claims data,
demographic information, and external factors such as economic trends and
environmental risks, insurers can assess risk profiles more accurately, price
policies competitively, and customize coverage options for individual customers.
Moreover, data mining techniques help identify fraudulent claims, mitigate losses,
and enhance the overall integrity of insurance operations, fostering trust and
loyalty among policyholders.
Data Visualizations
1. Line Chart:
o Definition: A line chart displays data points connected by straight lines. It is
commonly used to show trends over time or to show the relationship between
two variables.
o Example: A line chart might be used to visualize the monthly sales performance
of a company over the course of a year, with each data point representing sales
figures for a specific month.
2. Area Graph:
o Definition: An area graph is similar to a line chart but with the area below the
lines filled in with color. When the areas are stacked, it shows how multiple
variables add up to a cumulative total over time.
o Example: An area graph could be used to depict the total revenue generated by
different product categories over several quarters, with each category represented
by a different colored area.
3. Pie Chart:
o Definition: A pie chart divides a circle into slices to represent the proportion of
different categories within a dataset. It is useful for showing the relative
distribution of categorical data.
o Example: A pie chart might be used to illustrate the percentage breakdown of
expenses in a household budget, with each slice representing a different expense
category such as housing, transportation, food, etc.
4. Flow Chart:
o Definition: A flow chart is a graphical representation of a process or workflow,
depicting the sequence of steps and decision points in a systematic manner. It is
commonly used for process documentation, analysis, and optimization.
o Example: A flow chart could be used to visualize the steps involved in the
customer support process of a company, including steps such as receiving a
support ticket, assigning it to a representative, troubleshooting, resolving the
issue, and closing the ticket.
5. Scatterplot (Correlation Types):
o Definition: A scatterplot displays individual data points as dots on a two-
dimensional graph, with one variable represented on the x-axis and another
variable on the y-axis. It is used to examine the relationship between two
continuous variables.
o Example: A scatterplot might be used to visualize the relationship between a
person's age and their income level, with each data point representing an
individual's age and income.
Positive Correlation: When the data points in a scatterplot tend to form a pattern that
slopes upward from left to right, it indicates a positive correlation between the two
variables. This means that as one variable increases, the other variable also tends to
increase.
Negative Correlation: Conversely, when the data points form a pattern that slopes
downward from left to right, it indicates a negative correlation between the two variables.
This means that as one variable increases, the other variable tends to decrease.
No Correlation: If the data points in a scatterplot appear randomly distributed with no
discernible pattern, it suggests that there is no correlation between the two variables. In
other words, changes in one variable are not associated with changes in the other
variable.
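The three correlation types can be quantified with the Pearson correlation coefficient r, which ranges from -1 to 1. The sketch below computes r and maps it to a label. The cutoff of 0.3 for calling a correlation "none or weak" is an illustrative assumption, not a standard, and the data is invented.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)

def describe_correlation(r, cutoff=0.3):
    """Map r to a label; the cutoff is an illustrative assumption."""
    if r >= cutoff:
        return "positive"
    if r <= -cutoff:
        return "negative"
    return "none or weak"

# Hypothetical data sets, invented for illustration.
x = [1, 2, 3, 4]
print(describe_correlation(pearson_r(x, [2, 4, 6, 8])))    # slopes upward
print(describe_correlation(pearson_r(x, [8, 6, 4, 2])))    # slopes downward
print(describe_correlation(pearson_r(x, [1, -1, -1, 1])))  # no linear pattern
```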
Each of these data visualizations serves different purposes and can provide valuable insights into
the underlying data, helping analysts and decision-makers understand relationships, patterns, and
trends more effectively.
Limitations
1. Line Chart:
o Limitation: While line charts are effective for showing trends over time, they may
oversimplify complex data relationships. They are not suitable for displaying
categorical data or data with irregular intervals. Additionally, line charts can
obscure fluctuations within data if there are too many data points or if the data is
highly variable.
2. Area Graph:
o Limitation: Area graphs share the limitations of line charts, as they are
essentially line charts with the area beneath the lines filled in.
They can also make it challenging to discern individual data points or accurately
compare values between different categories, especially when multiple variables
are overlaid.
3. Pie Chart:
o Limitation: Pie charts can be misleading when used to represent data with too
many categories or when the differences between categories are small. It can be
difficult to accurately compare the sizes of the slices, especially when there are
many slices or when the slices are of similar sizes. Additionally, pie charts do not
effectively convey trends over time or relationships between variables.
4. Flow Chart:
o Limitation: Flow charts are primarily used for representing processes and
workflows, and they may not be suitable for visualizing quantitative data. They
can become overly complex and difficult to interpret when depicting intricate
processes with multiple decision points and branches. Flow charts also lack the
ability to convey quantitative information such as magnitudes or proportions.
5. Scatterplot (Correlation Types):
o Limitation: A scatterplot shows only two variables at a time, so it cannot reveal
interactions among three or more variables. Summaries read from it, such as a
linear trend or a correlation coefficient, can miss nonlinear relationships and can
be distorted by outliers or influential data points. Additionally,
correlation does not imply causation, so caution should be exercised when
interpreting scatterplot relationships.
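Two of these limitations can be demonstrated numerically with a linear correlation coefficient. In the sketch below, a perfect but nonlinear relationship (y = x squared) yields a coefficient of zero, and a single outlier turns an uncorrelated data set into a strongly correlated one. The helper name and all data points are assumptions invented for the example.

```python
import math

def linear_correlation(xs, ys):
    """Pearson correlation coefficient, a measure of linear association only."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return cov / spread

# Nonlinear relationship: y is fully determined by x, yet r is zero.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r_nonlinear = linear_correlation(xs, ys)

# Outlier influence: five uncorrelated points, then one extreme point added.
base_x, base_y = [1, 2, 3, 4, 5], [2, 1, 2, 1, 2]
r_without_outlier = linear_correlation(base_x, base_y)
r_with_outlier = linear_correlation(base_x + [20], base_y + [20])

print(f"nonlinear: {r_nonlinear:.2f}")
print(f"without outlier: {r_without_outlier:.2f}")
print(f"with outlier: {r_with_outlier:.2f}")
```

In both cases, looking at the plotted points alongside the coefficient would expose what the single number hides.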