Data Analytics All Practical

This document is a lab file for a Data Analytics course for B. Tech (CSE-AI) students, detailing various programming exercises in Python. It includes tasks related to numerical operations, data import/export, matrix operations, statistical analysis, data pre-processing, PCA, linear regression, K-Means clustering, and market basket analysis. Each program is accompanied by aims, code snippets, and expected outputs.

[Approved by AICTE, Govt. of India & Affiliated to Dr. APJ Abdul Kalam Technical University, Lucknow, U.P., India]
Department of Computer Science & Engineering (AI)

Lab File
Data Analytics Lab
(BADS651)
ACADEMIC SESSION 2024-25

COURSE: B. TECH (CSE-AI)

SEM: VI

Submitted to:
Mr. Piyush Kushwaha
Assistant Professor
CSE(AI) Department

Submitted by:
Yugank Singh
2201921520200
INDEX
S. No. | List of Programs | Date of Experiment | Date of Submission | Signature

1. To get the input from user and perform numerical operations (MAX, MIN, AVG, SUM, SQRT, ROUND) in Python.
2. To perform data import/export (.CSV, .XLS, .TXT) operations using data frames in Python.
3. To get the input matrix from user and perform matrix addition, subtraction, multiplication, inverse, transpose and division operations using the vector concept in Python.
4. To perform statistical operations (Mean, Median, Mode and Standard deviation) using Python.
5. To perform data pre-processing operations: i) Handling missing data ii) Min-Max normalization.
6. To perform dimensionality reduction operation using PCA for the Houses data set.
7. To perform Simple Linear Regression with Python.
8. To perform K-Means clustering operation and visualize the results for the Iris data set.
9. Write a Python script to diagnose a disease using KNN classification and plot the results.
10. To perform market basket analysis using Association Rules (Apriori).
Program – 1
Aim: To get the input from user and perform numerical operations (MAX, MIN, AVG, SUM, SQRT, ROUND) in Python.
Program:
import math

# Function to perform all the operations
def perform_operations():
    # Get a list of numbers from the user (space-separated)
    user_input = input("Enter numbers separated by space: ")

    # Convert the input string into a list of numbers
    numbers = list(map(float, user_input.split()))

    # Perform the operations
    max_value = max(numbers)
    min_value = min(numbers)
    sum_value = sum(numbers)
    avg_value = sum_value / len(numbers) if len(numbers) > 0 else 0
    sqrt_values = [math.sqrt(num) for num in numbers]
    rounded_values = [round(num, 2) for num in numbers]

    # Display the results
    print(f"Max Value: {max_value}")
    print(f"Min Value: {min_value}")
    print(f"Sum: {sum_value}")
    print(f"Average: {avg_value}")
    print(f"Square Root of each number: {sqrt_values}")
    print(f"Rounded values (to 2 decimal places): {rounded_values}")

# Call the function
perform_operations()
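For reference, a sample run (with the hypothetical input 4 9 16 25) would print:

Max Value: 25.0
Min Value: 4.0
Sum: 54.0
Average: 13.5
Square Root of each number: [2.0, 3.0, 4.0, 5.0]
Rounded values (to 2 decimal places): [4.0, 9.0, 16.0, 25.0]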

Output:

Program – 2

Aim: To perform data import/export (.CSV, .XLS, .TXT) operations using data
frames in Python.
Program:
import pandas as pd

# File paths written as raw strings (r"...") so backslashes are not treated as escapes
csv_path = r"D:\GL BAJAJ\DAata Analytics\customers-100.csv"
excel_path = r"D:\GL BAJAJ\DAata Analytics\Project-Management-Sample-Data.xlsx"
txt_path = r"D:\GL BAJAJ\DAata Analytics\sample-1.txt"

# Load CSV file
try:
    csv_data = pd.read_csv(csv_path)
    print("\nCSV Data:\n", csv_data.head())  # Show first 5 rows
except Exception as e:
    print("Error loading CSV file:", e)

# Load Excel file
try:
    excel_data = pd.read_excel(excel_path)
    print("\nExcel Data:\n", excel_data.head())  # Show first 5 rows
except Exception as e:
    print("Error loading Excel file:", e)

# Load TXT file (tab-separated), skipping malformed lines
try:
    txt_data = pd.read_csv(txt_path, sep="\t", engine="python", on_bad_lines="skip")
    print("\nTXT Data:\n", txt_data.head())  # Show first 5 rows
except Exception as e:
    print("Error loading TXT file:", e)

Output:

Program – 3

Aim: To get the input matrix from user and perform matrix addition, subtraction, multiplication, inverse, transpose and division operations using the vector concept in Python.
Program:
import numpy as np

# Function to get a matrix input from the user
def get_matrix_input():
    rows = int(input("Enter number of rows for the matrix: "))
    cols = int(input("Enter number of columns for the matrix: "))
    print(f"Enter the elements of the {rows}x{cols} matrix (row by row):")

    matrix = []
    for i in range(rows):
        row = list(map(float, input(f"Enter elements for row {i+1} separated by space: ").split()))
        matrix.append(row)

    return np.array(matrix)

# Function to perform matrix operations
def perform_operations(matrix1, matrix2):
    try:
        # Matrix Addition
        matrix_addition = matrix1 + matrix2
        print("Matrix Addition:\n", matrix_addition)

        # Matrix Subtraction
        matrix_subtraction = matrix1 - matrix2
        print("Matrix Subtraction:\n", matrix_subtraction)

        # Matrix Multiplication
        matrix_multiplication = np.dot(matrix1, matrix2)
        print("Matrix Multiplication:\n", matrix_multiplication)

        # Matrix Inverse (only defined for a square matrix)
        if matrix1.shape[0] == matrix1.shape[1]:
            matrix_inverse = np.linalg.inv(matrix1)
            print("Matrix Inverse:\n", matrix_inverse)
        else:
            print("Matrix 1 is not square, so inverse cannot be computed.")

        # Matrix Transpose
        matrix_transpose = np.transpose(matrix1)
        print("Matrix Transpose:\n", matrix_transpose)

        # Matrix Division (element-wise division)
        matrix_division = np.divide(matrix1, matrix2)
        print("Matrix Division (element-wise):\n", matrix_division)

    except Exception as e:
        print(f"Error during matrix operations: {e}")

# Main driver code
def main():
    print("Matrix Operations")

    # Get user input for two matrices
    print("Enter the first matrix:")
    matrix1 = get_matrix_input()
    print("Enter the second matrix:")
    matrix2 = get_matrix_input()

    # Perform the operations
    perform_operations(matrix1, matrix2)

# Run the program
main()
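Note that np.divide above performs element-wise division. If "matrix division" in the linear-algebra sense (multiplying by an inverse) is intended, a minimal sketch, assuming matrix2 is square and non-singular:

# Linear-algebra style "division": matrix1 multiplied by the inverse of matrix2
if matrix2.shape[0] == matrix2.shape[1]:
    matrix_right_division = matrix1 @ np.linalg.inv(matrix2)
    print("Matrix Division (A x inv(B)):\n", matrix_right_division)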

Output:

Program – 4

Aim: To perform statistical operations (Mean, Median, Mode and Standard deviation) using Python.
Program:
import statistics

# Function to perform statistical operations
def perform_statistical_operations():
    # Get user input for the data
    data = list(map(float, input("Enter numbers separated by space: ").split()))

    # Mean
    mean_value = statistics.mean(data)
    print(f"Mean: {mean_value}")

    # Median
    median_value = statistics.median(data)
    print(f"Median: {median_value}")

    # Mode
    try:
        mode_value = statistics.mode(data)
        print(f"Mode: {mode_value}")
    except statistics.StatisticsError:
        print("Mode: No unique mode (multiple modes or no mode)")

    # Standard Deviation (requires at least two data points)
    stdev_value = statistics.stdev(data)
    print(f"Standard Deviation: {stdev_value}")

# Call the function
perform_statistical_operations()
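On Python 3.8 and later, statistics.mode() no longer raises StatisticsError for multimodal data; it simply returns the first mode encountered, so the except branch above may never trigger. A short sketch using statistics.multimode() to list every mode:

import statistics

# multimode() returns all equally common values instead of a single mode
data = [1.0, 1.0, 2.0, 2.0, 3.0]
print(statistics.multimode(data))  # prints [1.0, 2.0] because both values appear twice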

Output:

Program – 5

Aim: To perform data pre-processing operations: i) Handling missing data ii) Min-Max normalization.
Program:
i) Handling Missing Data in Python:

import pandas as pd
import numpy as np

# Create a sample DataFrame with missing data
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, 7, 8, 9],
    'C': [10, 11, 12, np.nan, 14]
}
df = pd.DataFrame(data)

print("Original DataFrame with Missing Data:")
print(df)

# i. Remove rows with any missing values
df_dropna = df.dropna()
print("\nDataFrame after removing rows with missing values:")
print(df_dropna)

# ii. Fill missing values with the mean of the column
df_fill_mean = df.fillna(df.mean())
print("\nDataFrame after filling missing values with column mean:")
print(df_fill_mean)

# iii. Fill missing values with a specific value (e.g., 0)
df_fill_zero = df.fillna(0)
print("\nDataFrame after filling missing values with 0:")
print(df_fill_zero)

# iv. Forward fill missing values (using the previous value)
df_fill_forward = df.fillna(method='ffill')
print("\nDataFrame after forward filling missing values:")
print(df_fill_forward)

# v. Backward fill missing values (using the next value)
df_fill_backward = df.fillna(method='bfill')
print("\nDataFrame after backward filling missing values:")
print(df_fill_backward)
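Depending on the installed pandas version, fillna(method='ffill') and fillna(method='bfill') may emit a deprecation warning (pandas 2.1 and later); the dedicated methods give the same result:

# Equivalent forward/backward fill without the deprecated method= argument
df_fill_forward = df.ffill()
df_fill_backward = df.bfill()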

Output:

ii) Min-Max Normalization in Python:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Create a sample DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Using pandas to perform Min-Max Normalization
df_min_max = (df - df.min()) / (df.max() - df.min())
print("\nDataFrame after Min-Max Normalization (using pandas):")
print(df_min_max)

# Alternatively, using scikit-learn's MinMaxScaler
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print("\nDataFrame after Min-Max Normalization (using scikit-learn):")
print(df_scaled)
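A fitted MinMaxScaler can also be reused to place new observations on the same 0-1 scale; a small sketch with a hypothetical new row:

# Normalize new data using the ranges learned from the original DataFrame
new_data = pd.DataFrame({'A': [2.5], 'B': [25], 'C': [250]})
print(scaler.transform(new_data))  # each value maps to 0.375 relative to the training range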

Output:

Program – 6

Aim: To perform dimensionality reduction operation using PCA for Houses Data
Set.
Program:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the dataset (use a raw string so backslashes in the path are not escapes)
file_path = r"D:\GL BAJAJ\DAata Analytics\House price data .xlsx"
df = pd.read_excel(file_path, engine="openpyxl")  # Ensure openpyxl is installed

# Display the first few rows
print("Original Dataset:\n", df.head())

# Step 1: Select numerical features for PCA
numeric_features = df.select_dtypes(include=[np.number])  # Select only numeric columns
numeric_features = numeric_features.dropna()  # Drop rows with missing values

# Step 2: Standardize the data (PCA works better with scaled data)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_features)

# Step 3: Apply PCA (reduce to 2 principal components)
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_data)

# Step 4: Analyze explained variance
explained_variance = pca.explained_variance_ratio_ * 100
print("\nExplained Variance by Each Principal Component:", explained_variance)

# Step 5: Create a DataFrame for PCA results
pca_df = pd.DataFrame(data=pca_result, columns=["PC1", "PC2"])
print("\nPCA Transformed Data (First 5 Rows):\n", pca_df.head())

# Step 6: Plot the PCA components
plt.figure(figsize=(8, 5))
plt.scatter(pca_result[:, 0], pca_result[:, 1], c="blue", alpha=0.5)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA on House Prices Dataset")
plt.grid()
plt.show()

# Step 7: Check cumulative explained variance for all components
pca_full = PCA().fit(scaled_data)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_) * 100

# Plot cumulative explained variance
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker="o", linestyle="--", color="red")
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Explained Variance (%)")
plt.title("Cumulative Explained Variance vs. Number of Components")
plt.grid()
plt.show()
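Rather than fixing n_components=2, scikit-learn's PCA also accepts a variance fraction and keeps as many components as needed to reach it; a short sketch with an illustrative 95% threshold:

# Keep the smallest number of components explaining at least 95% of the variance
pca_95 = PCA(n_components=0.95)
reduced_data = pca_95.fit_transform(scaled_data)
print("Components kept for 95% variance:", pca_95.n_components_)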

Output:

Program – 7

Aim: To perform Simple Linear Regression with Python.


Program:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# 1. Prepare the dataset (for this example, generate some data)
np.random.seed(0)
X = 2 * np.random.rand(100, 1)           # Feature: 100 random values between 0 and 2
y = 4 + 3 * X + np.random.randn(100, 1)  # Target: y = 4 + 3*X + random noise

# Convert to pandas DataFrame (optional)
data = pd.DataFrame({'X': X.flatten(), 'y': y.flatten()})

# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Create the Linear Regression model
model = LinearRegression()

# 4. Train the model
model.fit(X_train, y_train)

# 5. Make predictions
y_pred = model.predict(X_test)

# 6. Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

# 7. Visualize the results
plt.scatter(X_test, y_test, color='blue', label='Actual data')
plt.plot(X_test, y_pred, color='red', label='Predicted line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple Linear Regression')
plt.legend()
plt.show()
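Because the data were generated as y = 4 + 3x + noise, the fitted parameters can be checked against the true values; a brief sketch:

# The learned intercept and slope should be close to 4 and 3 respectively
print("Intercept:", model.intercept_)
print("Slope:", model.coef_)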

Output:

Program – 8
Aim: To perform K-Means clustering operation and visualize the results for the Iris data set.
Program:
# NOTE: Reconstructed setup (assumption): the steps below require the objects
# iris, X_scaled and I. FAISS k-means is assumed here because the plot title
# and the printed output refer to "FAISS K-Means Clustering".
!pip install faiss-cpu  # faiss-cpu assumed as the FAISS package

import numpy as np
import faiss
from scipy.stats import mode
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Step 1: Load the Iris dataset and standardize the features
iris = load_iris()
X = iris.data.astype('float32')
X_scaled = StandardScaler().fit_transform(X).astype('float32')

# Step 2: Train FAISS K-Means with 3 clusters
kmeans = faiss.Kmeans(d=X_scaled.shape[1], k=3, niter=20)
kmeans.train(X_scaled)

# Step 3: Assign each sample to its nearest centroid
_, I = kmeans.index.search(X_scaled, 1)  # I holds one cluster label per sample
print("Cluster Assignments:", I.flatten())

unique, counts = np.unique(I, return_counts=True)
print("Cluster Distribution:", dict(zip(unique, counts)))

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Reduce dimensions using PCA for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Scatter plot of clusters
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=I.flatten(), cmap='viridis', edgecolor='k')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('FAISS K-Means Clustering on Iris Dataset')
plt.colorbar(label="Cluster")
plt.show()

# Raw accuracy against the true labels (cluster IDs are arbitrary, so this is only approximate)
from sklearn.metrics import accuracy_score
true_labels = iris.target  # Actual labels from the dataset
print("Accuracy (approximate):", accuracy_score(true_labels, I.flatten()))

# Create a mapping between predicted clusters and true labels
mapping = {}
for cluster in range(3):
    mask = (I.flatten() == cluster)  # Find all data points in this cluster
    if np.sum(mask) > 0:  # Ensure the mask is not empty
        most_common_label = mode(true_labels[mask], keepdims=True).mode[0]
        mapping[cluster] = most_common_label

# Map the predicted clusters to corrected labels
mapped_clusters = np.array([mapping[label] for label in I.flatten()])

# Compute accuracy
accuracy = accuracy_score(true_labels, mapped_clusters)
print("Corrected Accuracy:", accuracy)

Output:
Cluster Assignments: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 0 0 0 2 0 0 0 0 0 0 0 0 2 0 0 0 0 2 0 0 0
0 2 2 2 0 0 0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 0 2 2
0 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 0 2 2 2 0 2 2 2 0 2 2 2 0 2
2 0]

Cluster Distribution: {0: 56, 1: 50, 2: 44}

Accuracy (approximate): 0.22

Corrected Accuracy: 0.8133333333333334

Program – 9
Aim: Write a Python script to diagnose a disease using KNN classification and plot the results.
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset
df = pd.read_csv('diabetes.csv')  # replace with your file path if needed

# Features and target
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Normalize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)

# Hyperparameter tuning for KNN
k_range = range(1, 31)
cv_scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

# Plot accuracy vs. k
plt.figure(figsize=(10, 6))
plt.plot(k_range, cv_scores, marker='o')
plt.title('KNN Hyperparameter Tuning')
plt.xlabel('Number of Neighbors K')
plt.ylabel('Cross-Validated Accuracy')
plt.grid()
plt.show()

# Best k
best_k = k_range[cv_scores.index(max(cv_scores))]
print(f"Best K value: {best_k}")

# Train with best K
knn_best = KNeighborsClassifier(n_neighbors=best_k)
knn_best.fit(X_train, y_train)
y_pred_knn = knn_best.predict(X_test)

# Evaluation
print("\n✅ KNN Model Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
print("Classification Report:\n", classification_report(y_test, y_pred_knn))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred_knn)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['No Disease', 'Disease'],
            yticklabels=['No Disease', 'Disease'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - KNN')
plt.tight_layout()
plt.show()

# Optional: Compare with Random Forest and SVM
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "KNN": knn_best
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"\n🔍 {name} Accuracy: {accuracy_score(y_test, y_pred):.2f}")
    print(classification_report(y_test, y_pred))
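Once trained, the model can also score a new patient record; a minimal sketch, assuming the standard Pima diabetes feature order (Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age) and an entirely hypothetical patient:

# Classify one hypothetical patient using the same scaler and the tuned KNN model
new_patient = pd.DataFrame([[2, 120, 70, 20, 79, 25.0, 0.5, 33]], columns=X.columns)
new_patient_scaled = scaler.transform(new_patient)
print("Predicted outcome (1 = disease, 0 = no disease):", knn_best.predict(new_patient_scaled)[0])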

Output:

Best K value: 7

✅ KNN Model Performance:


Accuracy: 0.6875
Classification Report:
precision recall f1-score support

0 0.74 0.78 0.76 123


1 0.57 0.52 0.55 69

accuracy 0.69 192


macro avg 0.66 0.65 0.65 192
weighted avg 0.68 0.69 0.68 192

🔍 Random Forest Accuracy: 0.73
precision recall f1-score support

0 0.80 0.78 0.79 123


1 0.62 0.65 0.64 69

accuracy 0.73 192


macro avg 0.71 0.72 0.71 192
weighted avg 0.74 0.73 0.74 192

🔍 SVM Accuracy: 0.73


precision recall f1-score support

0 0.77 0.82 0.80 123


1 0.64 0.57 0.60 69

accuracy 0.73 192


macro avg 0.71 0.69 0.70 192
weighted avg 0.72 0.73 0.73 192

🔍 KNN Accuracy: 0.69


precision recall f1-score support
...
accuracy 0.69 192
macro avg 0.66 0.65 0.65 192
weighted avg 0.68 0.69 0.68 192

Program – 10
Aim: To perform market basket analysis using Association Rules (Apriori).
Program:

!pip install mlxtend

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Step 1: Define the dataset (list of transactions)
dataset = [
    ['milk', 'bread', 'nuts', 'apple'],
    ['milk', 'bread', 'nuts'],
    ['milk', 'bread'],
    ['milk', 'bread', 'apple'],
    ['milk', 'bread', 'apple']
]

# Step 2: Convert the list of transactions into a one-hot encoded DataFrame
te = TransactionEncoder()
te_data = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_data, columns=te.columns_)

print("🧾 Transaction Data (One-Hot Encoded):")
print(df)

# Step 3: Apply Apriori to find frequent itemsets
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)

print("\n📦 Frequent Itemsets (Support >= 0.6):")
print(frequent_itemsets)

# Step 4: Derive Association Rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

print("\n🔗 Association Rules (Confidence >= 0.7):")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

Output:

   antecedents      consequents    support  confidence  lift
   (milk)           (bread)        1.0      1.0         1.0
   (apple, bread)   (milk)         0.6      1.0         1.0
   (apple, milk)    (bread)        0.6      1.0         1.0
   (apple)          (bread, milk)  0.6      1.0         1.0
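As a quick sanity check of these metrics, the rule (apple) -> (bread, milk) can be verified by hand: apple appears in 3 of the 5 transactions, and all 3 of those also contain bread and milk, so support = 3/5 = 0.6 and confidence = 0.6 / 0.6 = 1.0; since {bread, milk} occurs in every transaction (support 1.0), lift = 1.0 / 1.0 = 1.0, matching the printed values.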
