Data Mining Practical Programs

The document contains a series of practical programming exercises focused on data mining techniques, including calculation of the mean and standard deviation, reading CSV files, data filtering, sales analysis, clustering with K-Means, classification with Random Forest, regression analysis, association rule mining, and network visualization. Each practical includes a code listing and its expected output. The exercises use libraries such as NumPy, pandas, scikit-learn, mlxtend, NetworkX, seaborn, and Matplotlib for data manipulation and visualization.

Practical No: 1
Practical Name: Calculate the mean and standard deviation

Code:
import numpy as np

# Example data
data = [10, 20, 30, 40, 50]

# Calculate mean
mean = np.mean(data)

# Calculate standard deviation
std_dev = np.std(data)

print("Mean:", mean)
print("Standard Deviation:", std_dev)

Output:
Mean: 30.0
Standard Deviation: 14.142135623730951
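Note: np.std defaults to the population standard deviation (ddof=0). A minimal sketch, in case the sample standard deviation is wanted instead:

import numpy as np

data = [10, 20, 30, 40, 50]

# Sample standard deviation divides by (n - 1) instead of n
sample_std = np.std(data, ddof=1)
print("Sample Standard Deviation:", sample_std)  # 15.811388300841896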
Practical No: 2
Practical Name: Read the CSV file

Code:
import csv
import pandas as pd

# Open the CSV file
with open('your_file.csv', mode='r') as file:
    csv_reader = csv.reader(file)
    next(csv_reader)  # Skip header if necessary
    for row in csv_reader:
        print(row)

# Read using pandas
df = pd.read_csv('your_file.csv')
print(df.head())

Output:
['John', '25', '50000']
['Sarah', '30', '60000']
['Mike', '28', '55000']

    Name  Age  Salary
0   John   25   50000
1  Sarah   30   60000
2   Mike   28   55000
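The listing assumes a your_file.csv already exists. A minimal sketch that creates one matching the output shown (the file name and values are assumptions for illustration):

import csv

# Hypothetical data matching the sample output above
rows = [['Name', 'Age', 'Salary'],
        ['John', '25', '50000'],
        ['Sarah', '30', '60000'],
        ['Mike', '28', '55000']]

with open('your_file.csv', mode='w', newline='') as file:
    csv.writer(file).writerows(rows)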
Practical No: 3
Practical Name: Perform data filtering and calculate aggregate statistics

Code:
import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv('your_file.csv')

# Filter rows where Age > 30 and Salary > 50000
filtered_data = df[(df['Age'] > 30) & (df['Salary'] > 50000)]

# Calculate aggregate statistics
mean_salary = filtered_data['Salary'].mean()
total_salary = filtered_data['Salary'].sum()
row_count = filtered_data.shape[0]

print(f"Mean Salary: {mean_salary}")
print(f"Total Salary: {total_salary}")
print(f"Number of rows: {row_count}")

Output:
Mean Salary: 75000.0
Total Salary: 150000
Number of rows: 2
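The output above assumes a dataset different from the three-row sample in Practical 2 (where no row has Age > 30). A self-contained variant with in-memory data, chosen as an assumption so that exactly two rows pass the filter:

import pandas as pd

# Hypothetical data: two rows satisfy Age > 30 and Salary > 50000
df = pd.DataFrame({'Name': ['Ann', 'Bob', 'Carl', 'Dee'],
                   'Age': [35, 40, 28, 22],
                   'Salary': [70000, 80000, 65000, 48000]})

filtered_data = df[(df['Age'] > 30) & (df['Salary'] > 50000)]
print(filtered_data['Salary'].mean())  # 75000.0
print(filtered_data['Salary'].sum())   # 150000
print(filtered_data.shape[0])          # 2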
Practical No: 4
Practical Name: Calculate total sales by month

Code:
import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv('your_sales_data.csv')

# Convert 'Date' to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Group by month and sum sales
df['YearMonth'] = df['Date'].dt.to_period('M')
total_sales_by_month = df.groupby('YearMonth')['Sales'].sum().reset_index()

print(total_sales_by_month)

Output:
  YearMonth  Sales
0   2023-01  25000
1   2023-02  27000
2   2023-03  30000
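A self-contained variant with in-memory data (dates and amounts are assumptions chosen to reproduce the monthly totals above):

import pandas as pd

# Hypothetical daily sales; the two January amounts sum to 25000
df = pd.DataFrame({'Date': ['2023-01-10', '2023-01-25', '2023-02-14', '2023-03-05'],
                   'Sales': [12000, 13000, 27000, 30000]})

df['Date'] = pd.to_datetime(df['Date'])
df['YearMonth'] = df['Date'].dt.to_period('M')
print(df.groupby('YearMonth')['Sales'].sum().reset_index())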
Practical No: 5
Practical Name: Implement Clustering using K-Means

Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, random_state=42)

# Apply KMeans (n_init given explicitly; recent scikit-learn versions warn if it is omitted)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_

# Print cluster centers
print("Cluster Centers:", kmeans.cluster_centers_)

Output:
Cluster Centers: [[ 4.6  2.3]
 [ 1.2 -0.3]
 [-2.1  7.5]
 [ 3.2  6.1]]
(Illustrative values; the exact coordinates depend on the blobs generated.)
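A common follow-up is choosing the number of clusters with the elbow method: run K-Means for several values of k and look for the bend in the inertia curve. A minimal sketch on the same synthetic data:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Inertia (within-cluster sum of squares) for k = 1..8
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 9)]

plt.plot(range(1, 9), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()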
Practical No: 6
Practical Name: Classification using Random Forest

Code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load the dataset
data = load_iris()
X = data.data
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf_classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Output:
Accuracy: 0.97
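Random forests also expose per-feature importances, which is a natural next step in this exercise. A short self-contained sketch:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(data.data, data.target)

# Mean decrease in impurity, one value per feature, summing to 1
for name, imp in zip(data.feature_names, rf.feature_importances_):
    print(f"{name}: {imp:.3f}")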
Practical No: 7
Practical Name: Regression Analysis using Linear Regression

Code:
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate sample data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10])

# Train the model
lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Predict
y_pred = lin_reg.predict(X)
print("Predicted Values:", y_pred)

Output:
Predicted Values: [ 2. 4. 6. 8. 10.]
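Because y = 2x exactly, the fitted line should recover a slope of 2 and an intercept of 0 (up to floating-point error); a quick check:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10])

lin_reg = LinearRegression().fit(X, y)
print("Slope:", lin_reg.coef_[0])        # 2.0
print("Intercept:", lin_reg.intercept_)  # ~0.0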
Practical No: 8
Practical Name: Association Rule Mining using Apriori

Code:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Sample transaction dataset (converted to boolean, as mlxtend expects)
data = {'Milk': [1, 1, 0], 'Bread': [1, 0, 1], 'Butter': [0, 1, 1]}
df = pd.DataFrame(data).astype(bool)

# Apply Apriori
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)

print("Rules:", rules)

Output:
Rules: Empty DataFrame
(With min_support=0.5, no itemset of two or more items is frequent, so no rules are generated.)
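Each pair of items here co-occurs in only 1 of 3 transactions (support 1/3), which is below the 0.5 threshold. Lowering the thresholds produces rules; a sketch on the same data (every pair here has lift 0.75, so the lift filter is swapped for a confidence filter to let rules through):

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.DataFrame({'Milk': [1, 1, 0], 'Bread': [1, 0, 1], 'Butter': [0, 1, 1]}).astype(bool)

# min_support=0.3 admits the pairs (support 1/3); each resulting rule has confidence 0.5
frequent_itemsets = apriori(df, min_support=0.3, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.4)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])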
Practical No: 9
Practical Name: Visualize the result of clustering and compare it with the true labels

Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=3, random_state=42)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_

# Plot clustering result
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('K-Means Clustering')
plt.show()

Output:
A scatter plot showing clusters formed by K-Means.
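To make the comparison explicit, the K-Means labels can be plotted next to the true blob labels (cluster IDs may be permuted between the two panels, since K-Means numbers clusters arbitrarily). A minimal sketch:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=50)
axes[0].set_title('True labels')
axes[1].scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
axes[1].set_title('K-Means labels')
plt.show()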
Practical No: 10
Practical Name: Visualize the correlation matrix using a pseudocolor plot

Code:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Generate random data
data = np.random.rand(10, 5)
df = pd.DataFrame(data, columns=[f'Feature {i+1}' for i in range(5)])

# Compute correlation matrix
corr_matrix = df.corr()

# Plot correlation heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.show()

Output:
A heatmap displaying the correlation matrix of generated features.
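The same matrix can also be drawn with matplotlib's pcolormesh directly, matching the "pseudocolor plot" wording of the practical; a sketch (the seed is an assumption added for reproducibility):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # seeded so the plot is reproducible
df = pd.DataFrame(rng.random((10, 5)), columns=[f'Feature {i+1}' for i in range(5)])
corr = df.corr()

plt.pcolormesh(corr.values, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar(label='Correlation')
plt.xticks(np.arange(0.5, 5.5), corr.columns, rotation=45)
plt.yticks(np.arange(0.5, 5.5), corr.columns)
plt.title('Correlation Matrix (pcolormesh)')
plt.show()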
Practical No: 11
Practical Name: Plot the degree distribution of a network

Code:
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np

# Create a random graph
G = nx.erdos_renyi_graph(n=100, p=0.05)

# Compute degrees
degrees = [G.degree(n) for n in G.nodes()]
degree_count = np.bincount(degrees)

# Plot degree distribution
plt.bar(range(len(degree_count)), degree_count, width=0.8, color='b', alpha=0.7)
plt.xlabel('Degree')
plt.ylabel('Frequency')
plt.title('Degree Distribution of the Network')
plt.show()

Output:
A bar plot displaying the degree distribution of a network graph.
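For an Erdos-Renyi graph the expected degree is p * (n - 1) = 0.05 * 99, i.e. about 5, so the distribution should peak near that value. A quick check (the seed is an assumption added for reproducibility):

import networkx as nx

G = nx.erdos_renyi_graph(n=100, p=0.05, seed=42)
degrees = [d for _, d in G.degree()]
print("Average degree:", sum(degrees) / len(degrees))  # close to 5 in expectation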
Practical No: 12
Practical Name: Graph visualization of a network using statistical measures

Code:
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np

# Create a random graph
G = nx.erdos_renyi_graph(n=100, p=0.05)

# Compute degrees and statistical measures
degrees = [G.degree(n) for n in G.nodes()]
degree_min, degree_max = np.min(degrees), np.max(degrees)
degree_median = np.median(degrees)
degree_q1, degree_q3 = np.percentile(degrees, [25, 75])
print(f"Degree stats - min: {degree_min}, Q1: {degree_q1}, median: {degree_median}, "
      f"Q3: {degree_q3}, max: {degree_max}")

# Visualize the network: minimum-degree nodes in blue, maximum-degree nodes in red, the rest in green
pos = nx.spring_layout(G)
node_colors = ['blue' if G.degree(n) == degree_min
               else 'red' if G.degree(n) == degree_max
               else 'green' for n in G.nodes()]

plt.figure(figsize=(10, 8))
nx.draw(G, pos, node_size=50, node_color=node_colors, with_labels=False,
        edge_color='lightgray')
plt.title('Network Visualization Based on Degree Statistics')
plt.show()

Output:
A network graph highlighting nodes based on statistical degree measures.
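A related visualization scales node size by degree instead of coloring only the extremes; a brief sketch (the seed values are assumptions for reproducibility):

import networkx as nx
import matplotlib.pyplot as plt

G = nx.erdos_renyi_graph(n=100, p=0.05, seed=42)
pos = nx.spring_layout(G, seed=42)

# Node size proportional to degree
sizes = [30 + 20 * G.degree(n) for n in G.nodes()]
nx.draw(G, pos, node_size=sizes, node_color='steelblue', edge_color='lightgray')
plt.title('Node Size Proportional to Degree')
plt.show()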
