Data Mining
Experiment 1
Objective: To perform classification using the Bayesian classification algorithm in Python.
Bayes Theorem
• Bayes' theorem describes the probability of an event based on prior knowledge of conditions
that may be related to the event.
• It gives us a way of computing a conditional probability.
• Assume we have a hypothesis (H) and evidence (E). According to Bayes' theorem, the
relationship between the probability of the hypothesis before getting the evidence, P(H), and
the probability of the hypothesis after getting the evidence, P(H|E), is:
P(H|E) = P(E|H) * P(H) / P(E)
• Prior probability P(H) is the probability of the hypothesis before getting the evidence.
• Posterior probability P(H|E) is the probability of the hypothesis after getting the evidence.
• In general,
P(class|data) = (P(data|class) * P(class)) / P(data)
Approach:
Naive Bayes classifier calculates the probability of an event in the following steps (a small
worked illustration follows the list):
• Step 1: Calculate the prior probability for the given class labels.
• Step 2: Find the likelihood probability of each attribute for each class.
• Step 3: Put these values into Bayes' formula and calculate the posterior probability.
• Step 4: See which class has the higher posterior probability; the input belongs to that class.
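As a tiny worked illustration of these steps, consider a toy "play tennis" style weather dataset;
the counts below (9 "yes" days, 5 "no" days, and the sunny-outlook counts) are assumptions chosen
only to make the arithmetic concrete.
# assumed counts from a toy 14-day weather dataset
p_yes = 9 / 14                  # prior P(play = yes)
p_no = 5 / 14                   # prior P(play = no)
p_sunny_given_yes = 2 / 9       # likelihood P(outlook = sunny | yes)
p_sunny_given_no = 3 / 5        # likelihood P(outlook = sunny | no)
# unnormalised posteriors; P(sunny) cancels when comparing the two classes
post_yes = p_sunny_given_yes * p_yes   # 2/14 ≈ 0.143
post_no = p_sunny_given_no * p_no      # 3/14 ≈ 0.214
print("Predicted class:", "yes" if post_yes > post_no else "no")   # prints "no"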
Advantages
• It is not only a simple approach but also a fast and accurate method for prediction.
• Naive Bayes has a very low computation cost.
• It can efficiently work on a large dataset.
• It performs well with categorical input variables compared to numerical (continuous) ones.
• It can be used with multiple class prediction problems.
• It also performs well in the case of text analytics problems.
• When the assumption of independence holds, a Naive Bayes classifier performs better
compared to other models like logistic regression.
Disadvantages
• It assumes independent features. In practice, it is almost impossible for a model to get a set
of predictors that are entirely independent.
• If there is no training tuple of a particular class for some attribute value, the posterior
probability becomes zero and the model is unable to make a prediction. This is known as the Zero
Probability/Frequency Problem and is commonly handled with Laplace smoothing.
from sklearn.datasets import make_classification

X, y = make_classification(
    n_features=6,
    n_classes=3,
    n_samples=800,
    n_informative=2,
    random_state=1,
    n_clusters_per_class=1,
)
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], c=y, marker="*")
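The listing jumps from the scatter plot to a prediction and the evaluation code, so the training
step that produces model and X_test is not shown. A minimal sketch using scikit-learn's GaussianNB
(the 67/33 split and the random_state value are assumptions) could look like this:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# split the synthetic data into training and test sets (assumed 67/33 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# fit a Gaussian Naive Bayes model and check a single test sample
model = GaussianNB()
model.fit(X_train, y_train)
print("Actual Value:", y_test[0])
print("Predicted Value:", model.predict(X_test[:1])[0])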
Actual Value: 0
Predicted Value: 0
# model evaluation
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    f1_score,
)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="weighted")
print("Accuracy:", accuracy)
print("F1 Score:", f1)
Accuracy: 0.8484848484848485
F1 Score: 0.8491119695890328
Experiment 2
Objective: To perform cluster analysis by k-means method using python
K-means Clustering:
K-means is an unsupervised learning method for clustering data points. The algorithm iteratively
divides data points into K clusters by minimizing the variance in each cluster.
Approach:
First, each data point is randomly assigned to one of the K clusters. Then, we compute the centroid
(functionally the centre) of each cluster, and reassign each data point to the cluster with the closest
centroid. We repeat this process until the cluster assignments for each data point are no longer
changing.
K-means clustering requires us to select K, the number of clusters we want to group the data into.
The elbow method lets us graph the inertia (a distance-based metric) and visualize the point at
which it starts decreasing linearly. This point is referred to as the "elbow" and is a good estimate for
the best value for K based on our data.
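The snippets below rely on a few setup lines that are not shown in the original listing. A minimal
setup could look like this; the toy x and y point lists are assumptions (in the same style as the
data used in Experiment 3), as are the names data and inertias:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# assumed toy two-dimensional data
x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
data = list(zip(x, y))
inertias = []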
plt.scatter(x, y)
plt.show()

for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

kmeans = KMeans(n_clusters=2)
kmeans.fit(data)
plt.scatter(x, y, c=kmeans.labels_)
plt.show()
Experiment 3
Objective: To perform the hierarchical clustering using python
Hierarchical Clustering: Hierarchical clustering is an unsupervised learning method for clustering
data points. The algorithm builds clusters by measuring the dissimilarities between data.
Unsupervised learning means that a model does not have to be trained, and we do not need a
"target" variable. This method can be used on any data to visualize and interpret the relationship
between individual data points.
We will use Agglomerative Clustering, a type of hierarchical clustering that follows a bottom up
approach. We begin by treating each data point as its own cluster. Then, we join clusters together
that have the shortest distance between them to create larger clusters. This step is repeated until
one large cluster is formed containing all of the data points.
Hierarchical clustering requires us to decide on both a distance and linkage method. We will use
euclidean distance and the Ward linkage method, which attempts to minimize the variance between
clusters.
Approach:
• Step 1: Initially, treat each data point as an independent cluster (e.g., for 6 data points, 6
clusters).
• Step 2: Merge the two closest data points into a single cluster, leaving 5 clusters.
• Step 3: Again, merge the two closest clusters into a single cluster, leaving 4 clusters.
• Step 4: Repeat step 3 until a single cluster containing all of the data points is obtained.
Hierarchical Clustering using Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
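The listing ends after the data definition. A minimal continuation that follows the euclidean/Ward
setup described above could look like this (the names data and linkage_data are assumptions):
data = list(zip(x, y))

# build the linkage matrix using Ward linkage on euclidean distances
linkage_data = linkage(data, method='ward', metric='euclidean')

# visualise the merge hierarchy as a dendrogram
dendrogram(linkage_data)
plt.show()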
Experiment 4
Objective: Study of Regression Analysis Using Python
Regression
The term regression is used when you try to find the relationship between variables.
In Machine Learning, and in statistical modeling, that relationship is used to predict the outcome of
future events.
Linear Regression
Linear regression uses the relationship between the data points to draw a straight line through all
of them.
This line can be used to predict future values.
Python has methods for finding a relationship between data points and for drawing a line of linear
regression. We will show you how to use these methods instead of going through the mathematical
formula.
Multiple Regression
Multiple regression is like linear regression, but with more than one independent value, meaning
that we try to predict a value based on two or more variables.
Polynomial Regression
Polynomial regression, like linear regression, uses the relationship between the variables x and y
to find the best way to draw a curve through the data points.
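No polynomial regression listing is included in this experiment, so the following is only an
illustrative sketch using numpy.polyfit; the toy data (hour of the day versus passing cars) and the
polynomial degree of 3 are assumptions.
import numpy
import matplotlib.pyplot as plt

# assumed toy data: hour of the day vs. number of cars passing
x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]

# fit a 3rd-degree polynomial and draw it over the scatter plot
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
myline = numpy.linspace(1, 22, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()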
Regression Analysis Using Python
# linear regression example
import matplotlib.pyplot as plt
from scipy import stats

# x-axis represents age
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
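The y values and the prediction step that produce the printed speed are missing from the listing; a
plausible completion using scipy.stats.linregress is sketched below, where the y list and the query
age of 10 are assumptions.
# y-axis represents speed (assumed values for illustration)
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
slope, intercept, r, p, std_err = stats.linregress(x, y)
# predict the speed for an age of 10 (assumed query value)
speed = slope * 10 + intercept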
print(speed)
85.59308314937454
# multiple regression example
import pandas
from sklearn import linear_model

df = pandas.read_csv("/content/sample_data/data.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']

regr = linear_model.LinearRegression()
regr.fit(X, y)
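The listing stops after fitting the model; a prediction step could follow, as sketched below. The
query values (a weight of 2300 kg and an engine volume of 1300 cm3) are assumptions for illustration.
# predict CO2 emission for an assumed car weighing 2300 kg with a 1300 cm3 engine
predicted_CO2 = regr.predict([[2300, 1300]])
print(predicted_CO2)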
EXPERIMENT-5
Objective: Outlier detection using Python.
There are several ways to treat outliers in a dataset, depending on the nature of the outliers
and the problem being solved. Here are some of the most common ways of treating outlier
values:
1. Z-Score Treatment
2. IQR-based filtering (see the sketch after this list)
3. Percentile Method
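As an illustration of one of these approaches, the sketch below applies IQR-based filtering to a toy
list of values; the data and the conventional 1.5 * IQR fences are assumptions.
import numpy as np

# assumed toy data containing one obvious outlier (102)
data = np.array([12, 14, 15, 13, 14, 102, 12, 15, 13, 14])

# compute the quartiles and the 1.5 * IQR fences
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# keep only the values inside the fences
filtered = data[(data >= lower) & (data <= upper)]
print(filtered)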
EXPERIMENT-06
Objective: Demonstration of association rule mining using the Apriori algorithm on supermarket
data.
Association rule mining is a data mining technique used to discover interesting patterns or
associations in a dataset. The Apriori algorithm is one of the most widely used algorithms for
this purpose.
Data Preparation: Prepare a dataset containing supermarket transaction data. Each transaction
should list the items purchased by a customer.
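The listing that produces the rules printed below is not included; a minimal sketch using the
mlxtend library on an assumed toy transaction list could look like this (the transactions and the
support/confidence thresholds are assumptions):
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# assumed toy supermarket transactions
transactions = [
    ['milk', 'bread', 'butter'],
    ['bread', 'butter'],
    ['milk', 'bread'],
    ['milk', 'butter'],
    ['bread', 'butter', 'eggs'],
]

# one-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# mine frequent itemsets with Apriori and derive association rules
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)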
print("\nAssociation Rules:")
print(rules)
EXPERIMENT-07
Objective: Demonstration of FP Growth algorithm on supermarket data
The FP-Growth (Frequent Pattern Growth) algorithm is another popular technique for mining
frequent itemsets and association rules in transactional data. It has an advantage over the
Apriori algorithm in terms of speed and efficiency.
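As with the previous experiment, the listing that produces the printed rules is not shown; a minimal
sketch using mlxtend's fpgrowth on the same assumed toy transactions could be:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# assumed toy supermarket transactions (same as in Experiment 6)
transactions = [
    ['milk', 'bread', 'butter'],
    ['bread', 'butter'],
    ['milk', 'bread'],
    ['milk', 'butter'],
    ['bread', 'butter', 'eggs'],
]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# mine frequent itemsets with FP-Growth and derive association rules
frequent_itemsets = fpgrowth(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)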
print("\nAssociation Rules:")
print(rules)
EXPERIMENT-08
Objective: To perform the statistical analysis of data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
data = {
'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
'Income': [50000, 60000, 75000, 80000, 90000, 100000, 110000, 120000, 130000, 140000],
'Score': [75, 80, 85, 88, 90, 92, 95, 96, 98, 99]}
df = pd.DataFrame(data)
summary_statistics = df.describe()
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.scatter(df['Age'], df['Income'])
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Age vs. Income')
plt.subplot(1, 2, 2)
plt.hist(df['Score'], bins=5)
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.title('Score Distribution')
plt.show()
age_income_ttest, p_value = stats.ttest_ind(df['Age'], df['Income'])
if p_value < 0.05:
    print("There is a significant difference between Age and Income.")
else:
    print("There is no significant difference between Age and Income.")
confidence_interval = stats.norm.interval(0.95, loc=df['Score'].mean(), scale=df['Score'].std())
import statsmodels.api as sm
X = sm.add_constant(df['Age'])
model = sm.OLS(df['Income'], X).fit()
regression_summary = model.summary()
print("Summary Statistics:\n", summary_statistics)
EXPERIMENT-9
Objective: To build a simple data warehouse workflow (extract, transform, load and query data) using Python.
1. Data Extraction (Python): Assume you have data in various formats (e.g., CSV files, databases).
import pandas as pd
# Load data from CSV files
sales_data = pd.read_csv('sales_data.csv')
customer_data = pd.read_csv('customer_data.csv')
product_data = pd.read_csv('product_data.csv')
2. Data Transformation
# Merge data
merged_data = pd.merge(sales_data, customer_data, on='customer_id', how='inner')
merged_data = pd.merge(merged_data, product_data, on='product_id', how='inner')
3. Data Loading
import sqlite3
conn = sqlite3.connect('data_warehouse.db')
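The loading step stops after opening the connection; a minimal completion writes the merged data
into a fact table. The table name fact_sales matches the query below, and it is assumed that
merged_data contains the product_name and sales_amount columns used there.
# write the merged data into the warehouse as a fact table (assumed schema)
merged_data.to_sql('fact_sales', conn, if_exists='replace', index=False)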
4. Data Querying (Python/SQL): Once the data is loaded, you can run SQL queries to extract insights.
# Query the data warehouse
query = """
SELECT product_name, SUM(sales_amount) AS total_sales
FROM fact_sales
GROUP BY product_name ORDER BY total_sales DESC
"""
Output:
product_name total_sales
0 Product A 5000
1 Product B 4500
2 Product C 3500
3 Product D 3000