Data Mining
The Apriori algorithm is a popular method in data mining used to identify frequent
itemsets and derive association rules from large datasets. It operates based on the
principle that:
If an itemset is frequent, all its subsets must also be frequent.
It uses a breadth-first search (level-wise) approach to generate itemsets and
count their occurrences in the dataset.
Steps in Apriori Algorithm:
1. Generate Candidate Itemsets: Start with single-item itemsets and iteratively
extend them to larger itemsets.
2. Prune Infrequent Itemsets: Use the minimum support threshold to filter out
infrequent itemsets.
3. Generate Association Rules: From frequent itemsets, calculate confidence to
derive meaningful rules.
Result:
1-itemsets: {Milk}, {Bread}, {Butter}
2-itemsets: {Milk, Bread}, {Milk, Butter}, {Bread, Butter}
No frequent 3-itemsets exist.
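As a rough illustration, the level-wise generate-and-prune procedure can be sketched in Python. This is a minimal sketch only; the four transactions and the minimum support count of 2 are assumed from the Milk/Bread/Butter worked example used in these notes.

from itertools import combinations

# Transactions T1-T4 (the Milk, Bread, Butter example used in these notes)
transactions = [
    {"Milk", "Bread", "Butter"},   # T1
    {"Milk", "Bread"},             # T2
    {"Bread", "Butter"},           # T3
    {"Milk", "Butter"},            # T4
]
min_support = 2   # minimum support count (assumed for this sketch)

def support(itemset):
    # Number of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))
frequent = {}                                   # size -> list of frequent itemsets
k, level = 1, [frozenset([i]) for i in items]
while level:
    # Prune: keep only candidates that meet the minimum support threshold
    level = [c for c in level if support(c) >= min_support]
    if not level:
        break
    frequent[k] = level
    # Join: build (k+1)-candidates whose k-subsets are all frequent (Apriori property)
    candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
    level = [c for c in candidates
             if all(frozenset(s) in frequent[k] for s in combinations(c, k))]
    k += 1

for size, itemsets in frequent.items():
    print(f"{size}-itemsets:", [set(s) for s in itemsets])

Running this prints the frequent 1-itemsets and 2-itemsets listed above and stops before any 3-itemset, matching the result.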
2. What are the different methods through which patterns can be
evaluated?
Conclusion:
The above methods ensure that patterns discovered are not only statistically valid but
also meaningful and practical for decision-making. Each method suits different types
of data mining tasks, such as classification, clustering, or association rule mining.
1. Bar Charts
Description: Used to display and compare the frequency, count, or other
measures (e.g., sum, average) of different categories.
Use Case: Comparing sales data across different products.
Example: A bar chart showing the number of units sold for each product in a
store.
2. Pie Charts
Description: A circular chart divided into slices to illustrate numerical
proportions.
Use Case: Showing the market share of different companies.
Example: A pie chart showing the percentage of sales contributed by different
regions.
3. Line Graphs
Description: Display data points connected by lines to show trends over time.
Use Case: Tracking stock prices over weeks or months.
Example: A line graph showing the temperature change throughout the day.
4. Scatter Plots
Description: Uses dots to represent values for two variables, showing their
relationship.
Use Case: Analyzing the correlation between two variables, like height and
weight.
Example: A scatter plot showing the relationship between age and income.
5. Histograms
Description: Similar to bar charts, but used for continuous data, showing the
distribution of a dataset.
Use Case: Displaying the distribution of test scores or income levels.
Example: A histogram showing the distribution of exam scores for a class.
6. Heatmaps
Description: Uses color gradients to represent data values in a matrix or table.
Use Case: Visualizing correlation matrices or the intensity of an event over
time.
Example: A heatmap showing the website traffic over different hours of the
day.
8. Area Charts
Description: Similar to line charts but with the area below the line filled with
color. Used to show cumulative totals over time.
Use Case: Showing cumulative sales over several months.
Example: An area chart showing the cumulative rainfall over a year.
9. Tree Maps
Description: Represents hierarchical data as a set of nested rectangles, where
each rectangle's area is proportional to the value of the category.
Use Case: Showing the proportion of market share of companies.
Example: A tree map visualizing the budget allocation across different
departments in an organization.
Conclusion:
These data visualization techniques help simplify complex data and provide insights
into patterns, trends, and relationships.
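As a rough illustration of two of these chart types, the following minimal Python sketch (assuming the matplotlib library and purely made-up numbers) draws a bar chart and a histogram:

import matplotlib.pyplot as plt

# Hypothetical data, used only to illustrate the two chart types
products = ["Milk", "Bread", "Butter"]
units_sold = [120, 95, 60]
exam_scores = [45, 52, 55, 58, 62, 64, 67, 70, 72, 75, 78, 81, 85, 90]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.bar(products, units_sold)           # bar chart: compare categories
ax1.set_title("Units sold per product")

ax2.hist(exam_scores, bins=5)           # histogram: distribution of continuous data
ax2.set_title("Distribution of exam scores")

plt.tight_layout()
plt.show()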
4. Explain the decision tree induction method. Write the different steps in the
decision tree induction algorithm?
Decision Tree Induction Method :-
Example:
Transaction ID Items
T1 {Milk, Bread, Butter}
T2 {Milk, Bread}
T3 {Bread, Butter}
T4 {Milk, Butter}
The numbers in parentheses indicate the count of transactions for each item
along the path.
Step 3: Mine the FP-tree
Now, we mine the FP-tree to find the frequent itemsets (using a minimum support
count of 2, with items inserted into the tree in the order Milk, Bread, Butter).
1. For Butter:
o The conditional pattern base for Butter is the set of prefix paths leading to Butter:
Milk → Bread (count = 1)
Milk (count = 1)
Bread (count = 1)
o Milk and Bread each appear twice in this base, so the frequent itemsets for Butter are:
{Butter, Milk} (support 2), {Butter, Bread} (support 2).
2. For Bread:
o The conditional pattern base for Bread is:
Milk (count = 2)
o The frequent itemset for Bread is:
{Bread, Milk} (support 2).
3. For Milk:
o Milk is the first item in the tree, so its conditional pattern base is empty and it
contributes no further itemsets.
No 3-itemset reaches the minimum support, so the frequent itemsets are {Milk},
{Bread}, {Butter}, {Milk, Bread}, {Milk, Butter}, and {Bread, Butter}, matching the
Apriori result above.
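The same frequent itemsets can also be obtained programmatically; the sketch below is a minimal illustration, assuming the third-party mlxtend library is available:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["Milk", "Bread", "Butter"],   # T1
    ["Milk", "Bread"],             # T2
    ["Bread", "Butter"],           # T3
    ["Milk", "Butter"],            # T4
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets with minimum support 0.5 (i.e. 2 of the 4 transactions)
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))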
Linear Regression
Linear regression models the relationship between variables by fitting a straight line
(regression line) to the data.
The equation of the line is:
Y = mX + C
Where:
Y = Dependent variable (target)
X = Independent variable (predictor)
m = Slope of the line (effect of X on Y)
C = Y-intercept
Study Hours (X)   Marks (Y)
1                 50
2                 60
3                 70
Step 1: Find the regression line:
The relationship between X (Study Hours) and Y (Marks) is approximately:
Y = 10 × X + 40
This simple example shows how studying more hours results in higher predicted
marks!
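A minimal Python sketch of the same fit (assuming NumPy; the data are the three points above):

import numpy as np

# Study hours (X) and marks (Y) from the example above
X = np.array([1, 2, 3])
Y = np.array([50, 60, 70])

# Least-squares fit of a straight line Y = m*X + C
m, C = np.polyfit(X, Y, 1)
print(f"Y = {m:.0f} * X + {C:.0f}")   # Y = 10 * X + 40

# Predict marks for a student who studies 4 hours
print(m * 4 + C)                       # 80.0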
Example
Age    Income    Loan Approved (Yes/No)
25     High      Yes
35     Medium    No
40     High      Yes
50     Low       No
Key Concepts:
1. Hyperplane: In SVM, a hyperplane is a decision boundary that separates
different classes of data. For example, in a 2D space, a hyperplane is a line, and
in a 3D space, it is a plane. In higher dimensions, it is a general hyperplane.
Example of SVM:
Imagine you have a dataset with two classes: “Apple” and “Orange” based on two
features: Weight and Color.
Class 1 (Apple): {Weight: 150g, Color: Red}, {Weight: 140g, Color: Green}
Class 2 (Orange): {Weight: 160g, Color: Orange}, {Weight: 170g, Color: Orange}
SVM will attempt to find a hyperplane that maximizes the margin between the two
classes, ensuring that new points can be classified correctly.
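A minimal sketch of this idea using scikit-learn's SVC (the numeric colour codes are an assumption made only for this illustration):

from sklearn.svm import SVC

# Tiny illustrative dataset: [weight in grams, colour code]
# (assumed colour encoding: Red = 0, Green = 1, Orange = 2)
X = [[150, 0], [140, 1],      # Apples
     [160, 2], [170, 2]]      # Oranges
y = ["Apple", "Apple", "Orange", "Orange"]

# A linear kernel looks for the maximum-margin hyperplane (a line in this 2D space)
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.predict([[155, 2]]))   # likely classified as "Orange"
print(clf.support_vectors_)      # the points that define the margin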
Advantages of SVM:
1. Works well with high-dimensional data.
2. Memory efficient (uses only support vectors).
3. Good generalization on unseen data.
Disadvantages of SVM:
1. Computationally expensive for large datasets.
2. Choosing the right kernel is difficult.
Attribute selection measures are techniques used to identify the most relevant
features for a machine learning model. They evaluate how well a feature helps in
predicting the target variable by measuring its importance or correlation. Common
methods include Information Gain, Gini Index, and Chi-Square Test.
1. Information Gain
Purpose: Measures how much splitting the data on an attribute reduces entropy
(uncertainty) about the class label; attributes with higher information gain are more
useful for prediction.
Example: An attribute whose split produces groups that are each almost entirely
"Yes" or entirely "No" has a high information gain.
2. Gini Index
Purpose: Measures the purity of a dataset. A lower Gini index means the data
is more pure (most items belong to the same class).
Example: Splitting data by "Age" and getting clear "Yes" or "No" labels for each
group shows a low Gini index (a small computation is sketched after this list).
3. Chi-Square Test
Purpose: Tests if there’s a relationship between two categorical variables.
Example: If you want to know if "Age" affects whether someone buys a
product ("Yes"/"No"), the Chi-Square test tells you if there's a connection
between the two.
4. Mutual Information
Purpose: Measures how much knowing one variable helps predict the other.
Example: Knowing the "Color" of an object can help predict if it's a "Fruit" or
"Vegetable" if there's a strong relationship (like red = apple).
5. Correlation Coefficient
Purpose: Measures how strongly two variables are related.
Example: If "Height" increases, "Weight" might also increase, showing a strong
positive correlation (closer to 1).
6. ReliefF Algorithm
Purpose: Evaluates how well a feature distinguishes between similar and
different instances.
Example: In health data, "Blood Pressure" may strongly distinguish between
people with "Heart Disease" and those without.
7. Fisher Score
Purpose: Measures how well an attribute can separate different classes.
Example: "Weight" and "Shape" of fruit can distinguish between an "Apple"
and a "Banana," giving them a high Fisher score.
Calculate Likelihood:
P(Free = Yes | Spam) = 1/2 = 0.5
P(Discount = Yes | Spam) = 1/2 = 0.5
P(Free = Yes | Not Spam) = 1/2 = 0.5
P(Discount = Yes | Not Spam) = 0/2 = 0
Calculate Posterior Probability (proportional to prior × likelihoods):
P(Spam | Free = Yes, Discount = Yes) ∝ 0.5 × 0.5 × 0.5 = 0.125
P(Not Spam | Free = Yes, Discount = Yes) ∝ 0.5 × 0 × 0.5 = 0
Since 0.125 > 0, the email is classified as Spam.
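The same computation as a minimal Python sketch (the equal priors of 0.5 are assumed from the 2-spam / 2-not-spam counts implied above):

# Priors and likelihoods taken from the worked example above
priors = {"Spam": 0.5, "Not Spam": 0.5}
likelihoods = {
    "Spam":     {"Free=Yes": 0.5, "Discount=Yes": 0.5},
    "Not Spam": {"Free=Yes": 0.5, "Discount=Yes": 0.0},
}

evidence = ["Free=Yes", "Discount=Yes"]
for cls in priors:
    score = priors[cls]
    for e in evidence:
        score *= likelihoods[cls][e]   # multiply prior by each likelihood
    print(cls, score)                  # Spam 0.125, Not Spam 0.0 -> classify as Spam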
2. Contextual Outliers
Definition: These are data points that are outliers only within a specific
context. Outside the defined context, they may appear normal.
Characteristics:
o Context-specific.
o Requires domain knowledge to detect.
Example in Depth:
Imagine recording rainfall across various regions:
o In a desert, the normal rainfall is 0–5 mm/month. If a desert region
suddenly records 50 mm of rainfall in a month, it is a contextual outlier.
o However, in a rainforest, 50 mm of rainfall would be normal and not an
outlier.
Impact:
o Reveals anomalies that are context-dependent, useful in monitoring
and anomaly detection tasks.
3. Collective Outliers
Definition: These occur when a group of data points, considered together,
deviates significantly from the overall dataset, even if individual points in the
group are not outliers.
Characteristics:
o Outliers only as a group.
o Typically found in time-series or sequential data.
Example in Depth:
In stock market data:
o For a specific sector, stock prices usually fluctuate within 5–10% daily.
o If multiple stocks in the sector suddenly drop by 30% on the same day,
this group of values forms a collective outlier.
o Individually, a sharp drop in one stock might be explained by company-specific
news, but when many stocks fall together on the same day, the group signals a
significant event.
Impact:
o Can indicate systemic issues, such as market crashes or coordinated
activities.
2. Binary Variables
Definition: Variables with only two possible values, typically represented as 0
or 1.
Example: Gender (Male = 0, Female = 1), Yes/No responses.
Key Feature: Can be symmetric (both values are equally important) or
asymmetric (one value is more significant, like "Defective = 1").
Example
Point ID    Coordinates (X, Y)
A           (1, 1)
B           (1, 2)
C           (2, 2)
D           (8, 8)
E           (8, 9)
F           (25, 25)
1. Core Points (taking ε = 2 and MinPts = 2, counting the point itself):
o Points A, B, C: each has at least one other point within distance 2 → Core Points.
o Points D, E: each lies within distance 2 of the other → Core Points.
o Point F: no other point within distance 2 → Not a Core Point.
2. Clusters:
o Points A, B, C → Cluster 1.
o Points D, E → Cluster 2.
3. Noise:
o Point F → Noise (far from all clusters).
Final Output:
Clusters:
o Cluster 1: {A, B, C}
o Cluster 2: {D, E}
Noise: {F}
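A minimal sketch that reproduces this result with scikit-learn's DBSCAN (eps and min_samples are chosen here to match the example; min_samples counts the point itself):

from sklearn.cluster import DBSCAN

# Points A-F from the table above
points = [(1, 1), (1, 2), (2, 2), (8, 8), (8, 9), (25, 25)]

db = DBSCAN(eps=2, min_samples=2).fit(points)
print(db.labels_)   # [ 0  0  0  1  1 -1] -> clusters {A, B, C} and {D, E}, F is noise (-1)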
Advantages
Can find clusters of arbitrary shape.
Handles noise effectively.
Disadvantages
Sensitive to parameter selection (ε and MinPts).
5. With the help of a suitable example, explain the K-Medoids
algorithm?
K-Medoids Algorithm (Easy Explanation)
K-Medoids is a clustering algorithm that, like K-Means, groups data into K clusters,
but instead of using the mean to represent each cluster, it uses an actual data point
(medoid) as the cluster center. This makes K-Medoids more robust to outliers.
Steps in K-Medoids Algorithm:
1. Initialize Medoids:
o Choose K initial medoids randomly from the dataset.
2. Assign Points to Nearest Medoid:
o Assign each point to the closest medoid based on distance (e.g.,
Euclidean distance).
3. Update Medoids:
o For each cluster, choose the point within the cluster that minimizes the
total distance to all other points. This point becomes the new medoid.
4. Repeat:
o Repeat the assign and update steps until the medoids do not change.
Example:
Point ID    Coordinates (X, Y)
A           (1, 2)
B           (2, 3)
C           (3, 3)
D           (6, 7)
E           (7, 8)
F           (8, 8)
We want to form K = 2 clusters.
1. Initialize Medoids:
o Randomly select points A and D as initial medoids.
2. Assign Points:
o Cluster 1: Points A, B, C (closest to A).
o Cluster 2: Points D, E, F (closest to D).
3. Update Medoids:
o For Cluster 1, B becomes the new medoid (it minimizes the total
distance to A and C).
o For Cluster 2, E becomes the new medoid (it minimizes the total
distance to D and F).
4. Repeat:
o Reassign points based on the new medoids (B and E).
o The assignments do not change, so the algorithm stops.
Final Clusters:
Cluster 1: {A, B, C}, medoid = B.
Cluster 2: {D, E, F}, medoid = E.
Advantages:
More robust to outliers compared to K-Means.
Can handle non-Euclidean distance measures.
Disadvantages:
Computationally more expensive than K-Means.
Less efficient with large datasets.
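A minimal Python sketch of the assign/update loop described above, run on the example points (an illustrative sketch only, not an optimized PAM implementation):

from math import dist   # Euclidean distance (Python 3.8+)

# Points from the example above
points = {"A": (1, 2), "B": (2, 3), "C": (3, 3),
          "D": (6, 7), "E": (7, 8), "F": (8, 8)}

def kmedoids(points, medoids):
    # Repeat the assign and update steps until the medoids stop changing
    while True:
        # Assign each point to its nearest medoid
        clusters = {m: [] for m in medoids}
        for name, p in points.items():
            nearest = min(medoids, key=lambda m: dist(p, points[m]))
            clusters[nearest].append(name)
        # Update: pick the member that minimizes total distance within its cluster
        new_medoids = [
            min(members, key=lambda c: sum(dist(points[c], points[o]) for o in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            return clusters
        medoids = new_medoids

print(kmedoids(points, ["A", "D"]))
# {'B': ['A', 'B', 'C'], 'E': ['D', 'E', 'F']}  -> medoids B and E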
UNIT NO 6
1. Give points of difference between data mining and text mining?
Web Usage Mining is the process of analyzing user interaction data with a
website to uncover patterns that help improve website performance and user
experience. It works by collecting data from sources such as web server logs,
user clickstreams, and session histories. This data is then processed to identify
how users navigate the website, which pages they visit, how much time they
spend on each page, and which links they click. By using techniques like
clustering, classification, and sequence mining, web usage mining helps in
grouping users with similar behaviors, predicting their future actions, and
uncovering hidden trends. The insights gained from this analysis can be used
to optimize website content, personalize user experiences, and make informed
decisions on improving website structure and design.
Example: An e-commerce website can use web usage mining to analyze the
browsing patterns of users and recommend products based on their past
behavior or the actions of similar users, enhancing the overall shopping
experience.
7. Explain the concept of text mining and different approaches for text
mining?
Fig. Text Mining
Text Mining is the process of extracting useful information and knowledge from
unstructured text data. It involves analyzing large amounts of textual data to discover
patterns, relationships, and insights that can support decision-making, predictions,
and trend analysis. Text mining is widely used in fields such as business, healthcare,
and social media analysis.
- Text summarization is the procedure of automatically producing a condensed
version of a text that reflects its whole content.
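As a rough illustration of one basic text-mining step, the sketch below (plain Python, with an invented two-sentence document and a tiny stop-word list) turns raw text into word counts that later analysis can build on:

from collections import Counter
import re

# Toy document used only for illustration
document = ("Text mining extracts useful information from unstructured text. "
            "Text mining supports decision-making and trend analysis.")

tokens = re.findall(r"[a-z]+", document.lower())        # simple tokenization
stopwords = {"from", "and", "the", "a", "of", "is"}     # tiny illustrative stop-word list
counts = Counter(t for t in tokens if t not in stopwords)

print(counts.most_common(3))   # e.g. [('text', 3), ('mining', 2), ...]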
Mining Stream Data refers to the process of analyzing and extracting valuable
insights from continuous and fast-flowing data streams. Unlike traditional data
mining, where the data is static and stored, stream data is dynamic, arriving in real-
time from sources like sensors, social media, stock markets, and IoT devices.
Key Concepts:
1. Data Streams: Continuous data flows from sources like sensors, websites, or
social media.
2. Challenges:
o Vast Volume: Handling large amounts of data.
o High Velocity: Rapid data generation needing immediate processing.
o Limited Memory: Data is processed in real-time with limited storage.
o Concept Drift: Data patterns change over time, requiring adaptation.
Techniques:
1. Sliding Window: Keeps only recent data for analysis (a small sketch follows this list).
2. Sampling: Maintains a representative subset of data.
3. Approximation Algorithms: Estimates statistics with limited memory.
4. Online Learning: Updates models incrementally as new data arrives.
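A minimal Python sketch of the sliding-window technique (the window size of 5 and the simulated readings are assumptions made for illustration):

from collections import deque

WINDOW_SIZE = 5                       # window size (assumed for this sketch)
window = deque(maxlen=WINDOW_SIZE)    # old readings drop out automatically

def process(value):
    # Keep only the most recent readings and return a moving average over them
    window.append(value)
    return sum(window) / len(window)

# Simulated stream of sensor readings arriving one at a time
for reading in [10, 12, 11, 50, 13, 12, 11, 14]:
    print(reading, round(process(reading), 2))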
Applications:
1. Real-Time Analytics: Fraud detection, stock market prediction.
2. Sensor Networks: Environmental or health monitoring.
3. Recommendation Systems: Personalized product suggestions.
Example:
For an online store, stream mining analyzes customer activity (clicks, purchases) in
real-time to recommend trending products.