DA Programs

The document outlines various data preprocessing techniques including handling missing values through mean/median imputation, forward/backward fill, and K-Nearest Neighbors (KNN) imputation. It also discusses noise detection and removal using statistical methods and machine learning, as well as identifying and eliminating data redundancy through correlation analysis. Additionally, the document covers the implementation of linear regression, logistic regression, and decision tree classification with examples and code snippets.


1. Data Preprocessing

a. Handling missing values


1. Mean/Median Imputation
Replace missing values with the mean or median of the respective feature.

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Create a sample DataFrame


df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8]
})
# Replace missing values with the column mean
df['A'] = df['A'].fillna(df['A'].mean())
df['B'] = df['B'].fillna(df['B'].mean())
print(df)

A B
0 1.000000 5.000000
1 2.000000 6.666667
2 2.333333 7.000000
3 4.000000 8.000000

2. Forward/Backward Fill
Replace missing values with the previous or next value in the sequence.

# Forward fill (note: df was already fully imputed above, so the values are unchanged here)
df['A'] = df['A'].ffill()
df['B'] = df['B'].ffill()
print(df)

A B
0 1.000000 5.000000
1 2.000000 6.666667
2 2.333333 7.000000
3 4.000000 8.000000
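The heading covers both directions, but only forward fill is demonstrated above. Below is a minimal sketch of backward fill on a fresh copy of the sample data; this block is an added illustration, not part of the original run.

import pandas as pd
import numpy as np

# Recreate the sample DataFrame with missing values
df_bfill = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8]
})
# Backward fill: each NaN takes the next valid value in its column
df_bfill = df_bfill.bfill()
print(df_bfill)
# Result: A = [1, 2, 4, 4], B = [5, 7, 7, 8]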
3. K-Nearest Neighbors (KNN) Imputation
Replace missing values using the KNN algorithm.

from sklearn.impute import KNNImputer


import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8]
})
# Create a KNN imputer
imputer = KNNImputer(n_neighbors=2)
# Fit and transform the data
df_imputed = imputer.fit_transform(df)
print(df_imputed)

[[1. 5. ]
[2. 6.5]
[2.5 7. ]
[4. 8. ]]

b. Noise detection and removal


1. Statistical Methods
Use statistical measures such as the mean, median, and standard deviation to detect and remove outliers.

import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
'A': [1, 2, 3, 4, 100]
})

# Calculate the mean and standard deviation


mean = df['A'].mean()
std_dev = df['A'].std()
# Remove outliers
df_cleaned = df[(df['A'] >= mean - 2*std_dev) & (df['A'] <= mean + 2*std_dev)]
print(df_cleaned)

A
0 1
1 2
2 3
3 4
4 100
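Note that the 2-sigma rule above does not remove the value 100: with only five points, the outlier itself inflates the standard deviation. A sketch of the interquartile-range (IQR) rule, which is more robust on this sample, is added below for comparison.

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 100]})

# IQR rule: keep points inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = df['A'].quantile(0.25)
q3 = df['A'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df_iqr = df[(df['A'] >= lower) & (df['A'] <= upper)]
print(df_iqr)
# Here Q1 = 2, Q3 = 4, IQR = 2, so 100 lies outside [-1, 7] and is removed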
2. Machine Learning Methods
Use machine learning algorithms such as One-Class SVM and Local Outlier Factor (LOF) to detect anomalies.

from sklearn.svm import OneClassSVM


import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
'A': [1, 2, 3, 4, 100]
})
# Create a One-Class SVM model
model = OneClassSVM(kernel='rbf', gamma=0.1, nu=0.1)
# Fit the model
model.fit(df[['A']])
# Predict anomalies
anomaly = model.predict(df[['A']])
# Remove anomalies
df_cleaned = df[anomaly == 1]

print(df_cleaned)

A
1 2
2 3
3 4
4 100
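The text above also mentions Local Outlier Factor (LOF), which is not demonstrated, and the unscaled One-Class SVM above ends up keeping the extreme value 100 while dropping a normal point. A minimal LOF sketch on the same data is added here for completeness.

import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

df = pd.DataFrame({'A': [1, 2, 3, 4, 100]})

# LOF compares each point's local density with that of its neighbours
lof = LocalOutlierFactor(n_neighbors=2)
labels = lof.fit_predict(df[['A']])  # -1 marks outliers, 1 marks inliers

df_cleaned = df[labels == 1]
print(df_cleaned)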

c. Identifying data redundancy and elimination


Data redundancy can be identified and eliminated using techniques such as:
1. Correlation Analysis
Use correlation analysis to identify highly correlated features and eliminate the redundant ones.

import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [2, 4, 6, 8, 10],
'C': [3, 6, 9, 12, 15]
})
# Calculate the absolute correlation matrix
corr_matrix = df.corr().abs()
# Look only at the upper triangle so each pair of features is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Identify features that are highly correlated with an earlier feature
high_corr_features = [col for col in upper.columns if (upper[col] > 0.9).any()]
# Eliminate the redundant features
df_eliminated = df.drop(high_corr_features, axis=1)
print(df_eliminated)

   A
0  1
1  2
2  3
3  4
4  5

2. Implement any one imputation model
KNN Imputation Model: KNN imputation is a popular method for handling missing values. It works by finding the k most similar data points
(nearest neighbors) to the row with the missing value. The missing value is then imputed using the values from these nearest neighbors.

import pandas as pd
from sklearn.impute import KNNImputer
import numpy as np
# Create a sample dataset with missing values
data = {'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 3, 4, 5, 6],
'C': [7, 8, 9, np.nan, 11]}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)
# Create a KNN imputer with k=3
imputer = KNNImputer(n_neighbors=3)
# Fit the imputer to the data and transform the missing values
imputed_data = imputer.fit_transform(df)
# Convert the imputed data back to a DataFrame
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
print("\nImputed DataFrame:")
print(imputed_df)

Original DataFrame:
A B C
0 1.0 NaN 7.0
1 2.0 3.0 8.0
2 NaN 4.0 9.0
3 4.0 5.0 NaN
4 5.0 6.0 11.0

Imputed DataFrame:
A B C
0 1.000000 4.0 7.000000
1 2.000000 3.0 8.000000
2 2.333333 4.0 9.000000
3 4.000000 5.0 9.333333
4 5.000000 6.0 11.000000

3. Implement Linear Regression
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('salary_Data.csv')

dataset

YearsExperience Salary

0 1.1 39343.0

1 1.3 46205.0

2 1.5 37731.0

3 2.0 43525.0

4 2.2 39891.0

5 2.9 56642.0

6 3.0 60150.0

7 3.2 54445.0

8 3.2 64445.0

9 3.7 57189.0

10 3.9 63218.0

11 4.0 55794.0

12 4.0 56957.0

13 4.1 57081.0

14 4.5 61111.0

15 4.9 67938.0

16 5.1 66029.0

17 5.3 83088.0

18 5.9 81363.0

19 6.0 93940.0

20 6.8 91738.0

21 7.1 98273.0

22 7.9 101302.0

23 8.2 113812.0

24 8.7 109431.0

25 9.0 105582.0

26 9.5 116969.0

27 9.6 112635.0

28 10.3 122391.0

29 10.5 121872.0

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:,1].values

X
array([[ 1.1],
[ 1.3],
[ 1.5],
[ 2. ],
[ 2.2],
[ 2.9],
[ 3. ],
[ 3.2],
[ 3.2],
[ 3.7],
[ 3.9],
[ 4. ],
[ 4. ],
[ 4.1],
[ 4.5],
[ 4.9],
[ 5.1],
[ 5.3],
[ 5.9],
[ 6. ],
[ 6.8],
[ 7.1],
[ 7.9],
[ 8.2],
[ 8.7],
[ 9. ],
[ 9.5],
[ 9.6],
[10.3],
[10.5]])

array([ 39343., 46205., 37731., 43525., 39891., 56642., 60150.,


54445., 64445., 57189., 63218., 55794., 56957., 57081.,
61111., 67938., 66029., 83088., 81363., 93940., 91738.,
98273., 101302., 113812., 109431., 105582., 116969., 112635.,
122391., 121872.])

from sklearn.model_selection import train_test_split


X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

from sklearn.linear_model import LinearRegression


regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()

plt.scatter(X_train, y_train, color = 'red')


plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
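The notebook stops at the plots; the short follow-up below (an addition, not part of the original program) shows one way the fitted regressor could be scored on the held-out test split.

from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Predict salaries for the test set and report common regression metrics
y_pred = regressor.predict(X_test)
print("R^2 on test set:", r2_score(y_test, y_pred))
print("RMSE on test set:", np.sqrt(mean_squared_error(y_test, y_pred)))

# The learned line itself: salary ≈ intercept + slope * years of experience
print("Intercept:", regressor.intercept_)
print("Slope:", regressor.coef_[0])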

4.Implementation of Logistic Regression
Import Libraries

# Import necessary libraries


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc

Read and Explore the data

# Load the diabetes dataset


diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Convert the target variable to binary (1 for diabetes, 0 for no diabetes)


y_binary = (y > np.median(y)).astype(int)

Splitting The Dataset: Train and Test dataset

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(
X, y_binary, test_size=0.2, random_state=42)

Feature Scaling

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Train The Model

# Train the Logistic Regression model


model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

Evaluation Metrics

# Evaluate the model


y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))

Accuracy: 73.03%
Confusion Matrix and Classification Report

# evaluate the model


print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Confusion Matrix:
[[36 13]
[11 29]]

Classification Report:
precision recall f1-score support

0 0.77 0.73 0.75 49


1 0.69 0.72 0.71 40

accuracy 0.73 89
macro avg 0.73 0.73 0.73 89
weighted avg 0.73 0.73 0.73 89

Visualizing the performance of our model.

# Visualize the test-set classes on two informative features
# (column 2 of the diabetes data is BMI; column 8 is the serum measurement s5)

plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_test[:, 2], y=X_test[:, 8], hue=y_test, palette={
    0: 'blue', 1: 'red'}, marker='o')
plt.xlabel("BMI (standardized)")
plt.ylabel("Serum measurement s5 (standardized)")
plt.title("Logistic Regression Test Samples by Class\nAccuracy: {:.2f}%".format(
    accuracy * 100))
plt.legend(title="Diabetes", loc="upper right")
plt.show()

Plotting ROC Curve

# Plot ROC Curve


y_prob = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2,
label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve\nAccuracy: {:.2f}%'.format(
accuracy * 100))
plt.legend(loc="lower right")
plt.show()
5. Decision Tree Induction for Classification

Import necessary libraries


import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

Load Dataset
# Load dataset (Iris dataset)
iris = load_iris()
X, y = iris.data, iris.target # Features and labels

Data Splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Model Training
# Initialize and train the Decision Tree Classifier
model = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
model.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=3, random_state=42)

Predictions
# Make predictions
y_pred = model.predict(X_test)

Model Evaluation
# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

Model Accuracy: 1.0

# Print classification report


print("Classification Report:\n", classification_report(y_test, y_pred))

Classification Report:
precision recall f1-score support

0 1.00 1.00 1.00 10


1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30

Decision Tree Visualization


# Visualize the Decision Tree
plt.figure(figsize=(12, 8))
plot_tree(model, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.title("Decision Tree Visualization")
plt.show()
Feature Importance
# Feature Importance Plot
plt.figure(figsize=(8, 5))
plt.barh(iris.feature_names, model.feature_importances_, color="skyblue")
plt.xlabel("Feature Importance Score")
plt.ylabel("Features")
plt.title("Feature Importance in Decision Tree")
plt.show()

The Decision Tree Classifier is a simple yet effective machine learning model for classification
tasks. In this implementation, we used the Iris dataset to train and evaluate the model. The
decision tree achieves high accuracy, and by visualizing the tree and feature importance, we
gain insights into how decisions are made. This method is useful for explainable AI but can be
prone to overfitting if not carefully tuned. To improve generalization, techniques such as pruning
or ensemble methods like Random Forest can be applied.
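As that remark suggests, cost-complexity pruning is one way to control overfitting. The brief sketch below (an added illustration using the same Iris split) grows a full tree, extracts the candidate pruning strengths, and reports how test accuracy changes as the tree is simplified.

# Cost-complexity pruning: larger ccp_alpha values prune more aggressively
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    print(f"ccp_alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")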

6. Implement Random Forest Classifier
1. Import Required Libraries
We will import pandas, matplotlib, seaborn, and scikit-learn to build the model.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

from sklearn.model_selection import train_test_split


from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

2. Import Dataset

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

df

sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target

0 5.1 3.5 1.4 0.2 0

1 4.9 3.0 1.4 0.2 0

2 4.7 3.2 1.3 0.2 0

3 4.6 3.1 1.5 0.2 0

4 5.0 3.6 1.4 0.2 0

... ... ... ... ... ...

145 6.7 3.0 5.2 2.3 2

146 6.3 2.5 5.0 1.9 2

147 6.5 3.0 5.2 2.0 2

148 6.2 3.4 5.4 2.3 2

149 5.9 3.0 5.1 1.8 2

150 rows × 5 columns

3. Data Preparation

X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

array([[5.1, 3.5, 1.4, 0.2],


[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.1, 3.7, 1.5, 0.4],
[4.6, 3.6, 1. , 0.2],
[5.1, 3.3, 1.7, 0.5],
[4.8, 3.4, 1.9, 0.2],
[5. , 3. , 1.6, 0.2],
[5. , 3.4, 1.6, 0.4],
[5.2, 3.5, 1.5, 0.2],
[5.2, 3.4, 1.4, 0.2],
[4.7, 3.2, 1.6, 0.2],
[4.8, 3.1, 1.6, 0.2],
[5.4, 3.4, 1.5, 0.4],
[5.2, 4.1, 1.5, 0.1],
[5.5, 4.2, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.2],
[5. , 3.2, 1.2, 0.2],
[5.5, 3.5, 1.3, 0.2],
[4.9, 3.6, 1.4, 0.1],
[4.4, 3. , 1.3, 0.2],
[5.1, 3.4, 1.5, 0.2],
[5. , 3.5, 1.3, 0.3],
[4.5, 2.3, 1.3, 0.3],
[4.4, 3.2, 1.3, 0.2],
[5. , 3.5, 1.6, 0.6],
[5.1, 3.8, 1.9, 0.4],
[4.8, 3. , 1.4, 0.3],
[5.1, 3.8, 1.6, 0.2],
[4.6, 3.2, 1.4, 0.2],
[5.3, 3.7, 1.5, 0.2],
[5. , 3.3, 1.4, 0.2],
[7. , 3.2, 4.7, 1.4],
[6.4, 3.2, 4.5, 1.5],
[6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4. , 1.3],
[6.5, 2.8, 4.6, 1.5],
[5.7, 2.8, 4.5, 1.3],
[6.3, 3.3, 4.7, 1.6],
[4.9, 2.4, 3.3, 1. ],
[6.6, 2.9, 4.6, 1.3],
[5.2, 2.7, 3.9, 1.4],
[5. , 2. , 3.5, 1. ],
[5.9, 3. , 4.2, 1.5],
[6. , 2.2, 4. , 1. ],
[6.1, 2.9, 4.7, 1.4],
[5.6, 2.9, 3.6, 1.3],
[6.7, 3.1, 4.4, 1.4],
[5.6, 3. , 4.5, 1.5],
[5.8, 2.7, 4.1, 1. ],
[6.2, 2.2, 4.5, 1.5],
[5.6, 2.5, 3.9, 1.1],
[5.9, 3.2, 4.8, 1.8],
[6.1, 2.8, 4. , 1.3],
[6.3, 2.5, 4.9, 1.5],
[6.1, 2.8, 4.7, 1.2],
[6.4, 2.9, 4.3, 1.3],
[6.6, 3. , 4.4, 1.4],
[6.8, 2.8, 4.8, 1.4],
[6.7, 3. , 5. , 1.7],
[6. , 2.9, 4.5, 1.5],
[5.7, 2.6, 3.5, 1. ],
[5.5, 2.4, 3.8, 1.1],
[5.5, 2.4, 3.7, 1. ],
[5.8, 2.7, 3.9, 1.2],
[6. , 2.7, 5.1, 1.6],
[5.4, 3. , 4.5, 1.5],
[6. , 3.4, 4.5, 1.6],
[6.7, 3.1, 4.7, 1.5],
[6.3, 2.3, 4.4, 1.3],
[5.6, 3. , 4.1, 1.3],
[5.5, 2.5, 4. , 1.3],
[5.5, 2.6, 4.4, 1.2],
[6.1, 3. , 4.6, 1.4],
[5.8, 2.6, 4. , 1.2],
[5. , 2.3, 3.3, 1. ],
[5.6, 2.7, 4.2, 1.3],
[5.7, 3. , 4.2, 1.2],
[5.7, 2.9, 4.2, 1.3],
[6.2, 2.9, 4.3, 1.3],
[5.1, 2.5, 3. , 1.1],
[5.7, 2.8, 4.1, 1.3],
[6.3, 3.3, 6. , 2.5],
[5.8, 2.7, 5.1, 1.9],
[7.1, 3. , 5.9, 2.1],
[6.3, 2.9, 5.6, 1.8],
[6.5, 3. , 5.8, 2.2],
[7.6, 3. , 6.6, 2.1],
[4.9, 2.5, 4.5, 1.7],
[7.3, 2.9, 6.3, 1.8],
[6.7, 2.5, 5.8, 1.8],
[7.2, 3.6, 6.1, 2.5],
[6.5, 3.2, 5.1, 2. ],
[6.4, 2.7, 5.3, 1.9],
[6.8, 3. , 5.5, 2.1],
[5.7, 2.5, 5. , 2. ],
[5.8, 2.8, 5.1, 2.4],
[6.4, 3.2, 5.3, 2.3],
[6.5, 3. , 5.5, 1.8],
[7.7, 3.8, 6.7, 2.2],
[7.7, 2.6, 6.9, 2.3],
[6. , 2.2, 5. , 1.5],
[6.9, 3.2, 5.7, 2.3],
[5.6, 2.8, 4.9, 2. ],
[7.7, 2.8, 6.7, 2. ],
[6.3, 2.7, 4.9, 1.8],
[6.7, 3.3, 5.7, 2.1],
[7.2, 3.2, 6. , 1.8],
[6.2, 2.8, 4.8, 1.8],
[6.1, 3. , 4.9, 1.8],
[6.4, 2.8, 5.6, 2.1],
[7.2, 3. , 5.8, 1.6],
[7.4, 2.8, 6.1, 1.9],
[7.9, 3.8, 6.4, 2. ],
[6.4, 2.8, 5.6, 2.2],
[6.3, 2.8, 5.1, 1.5],
[6.1, 2.6, 5.6, 1.4],
[7.7, 3. , 6.1, 2.3],
[6.3, 3.4, 5.6, 2.4],
[6.4, 3.1, 5.5, 1.8],
[6. , 3. , 4.8, 1.8],
[6.9, 3.1, 5.4, 2.1],
[6.7, 3.1, 5.6, 2.4],
[6.9, 3.1, 5.1, 2.3],
[5.8, 2.7, 5.1, 1.9],
[6.8, 3.2, 5.9, 2.3],
[6.7, 3.3, 5.7, 2.5],
[6.7, 3. , 5.2, 2.3],
[6.3, 2.5, 5. , 1.9],
[6.5, 3. , 5.2, 2. ],
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
4. Splitting the Dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

5. Feature Scaling

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

6. Building Random Forest Classifier

classifier = RandomForestClassifier(n_estimators=100, random_state=42)


classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

7. Evaluation of the Model

accuracy = accuracy_score(y_test, y_pred)


print(f'Accuracy: {accuracy * 100:.2f}%')

conf_matrix = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='g', cmap='Blues', cbar=False,
xticklabels=iris.target_names, yticklabels=iris.target_names)

plt.title('Confusion Matrix Heatmap')


plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

Accuracy: 100.00%
8. Feature Importance

feature_importances = classifier.feature_importances_

plt.barh(iris.feature_names, feature_importances)
plt.xlabel('Feature Importance')
plt.title('Feature Importance in Random Forest Classifier')
plt.show()

Conclusion--From the graph we can see that petal width (cm) is the most
important feature followed closely by petal length (cm). The sepal width (cm)
and sepal length (cm) have lower importance in determining the model’s
predictions. This indicates that the classifier relies more on the petal
measurements to make predictions about the flower species.
7. Implement ARIMA on Time Series data

Import necessary libraries


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

import warnings
warnings.filterwarnings('ignore')

Load and Prepare Dataset


# Generate synthetic time series data
np.random.seed(42)
date_rng = pd.date_range(start='2020-01-01', periods=100, freq='D')
data = np.cumsum(np.random.randn(100)) # Simulated time series
df = pd.DataFrame({'Date': date_rng, 'Value': data})
df.set_index('Date', inplace=True)

df

Value

Date

2020-01-01 0.496714

2020-01-02 0.358450

2020-01-03 1.006138

2020-01-04 2.529168

2020-01-05 2.295015

... ...

2020-04-05 -10.712354

2020-04-06 -10.416233

2020-04-07 -10.155178

2020-04-08 -10.150065

2020-04-09 -10.384652

100 rows × 1 columns

# Plot the time series


df.plot(figsize=(10, 5), title='Synthetic Time Series Data')
plt.show()
Check for Stationarity
# Perform Augmented Dickey-Fuller test
adf_test = adfuller(df['Value'])
print("ADF Statistic:", adf_test[0])
print("p-value:", adf_test[1])

ADF Statistic: -1.3583317659818994


p-value: 0.6020814791099097

If p-value > 0.05, data is non-stationary and needs differencing


Differencing for Stationarity (if needed)

if adf_test[1] > 0.05:
    df['Value_diff'] = df['Value'].diff().dropna()
    df['Value_diff'].plot(figsize=(10, 5), title='Differenced Time Series')
    plt.show()
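Before fixing (p, d, q), the ACF and PACF of the differenced series are normally inspected; plot_acf and plot_pacf were imported above but never called, so the following short sketch is added to fill that gap.

# ACF suggests the MA order (q); PACF suggests the AR order (p)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(df['Value'].diff().dropna(), lags=20, ax=axes[0])
plot_pacf(df['Value'].diff().dropna(), lags=20, ax=axes[1])
plt.show()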

Fit ARIMA Model


# Define ARIMA model (using p=2, d=1, q=2 as an example)
model = ARIMA(df['Value'], order=(2, 1, 2)) # (p, d, q)
model_fit = model.fit()
print(model_fit.summary())

SARIMAX Results
==============================================================================
Dep. Variable: Value No. Observations: 100
Model: ARIMA(2, 1, 2) Log Likelihood -130.434
Date: Sat, 22 Mar 2025 AIC 270.869
Time: 22:03:24 BIC 283.845
Sample: 01-01-2020 HQIC 276.119
- 04-09-2020
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.1299 0.283 0.459 0.646 -0.425 0.684
ar.L2 0.8677 0.230 3.776 0.000 0.417 1.318
ma.L1 -0.0583 0.330 -0.177 0.860 -0.705 0.588
ma.L2 -0.9353 0.281 -3.323 0.001 -1.487 -0.384
sigma2 0.8134 0.148 5.502 0.000 0.524 1.103
===================================================================================
Ljung-Box (L1) (Q): 0.46 Jarque-Bera (JB): 0.33
Prob(Q): 0.50 Prob(JB): 0.85
Heteroskedasticity (H): 1.02 Skew: -0.14
Prob(H) (two-sided): 0.96 Kurtosis: 3.07
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Forecasting Future Values


# Forecast the next 10 time steps
forecast = model_fit.forecast(steps=10)
print("Forecasted Values:", forecast)

Forecasted Values: 2020-04-10 -10.395337


2020-04-11 -10.435398
2020-04-12 -10.449871
2020-04-13 -10.486512
2020-04-14 -10.503829
2020-04-15 -10.537871
2020-04-16 -10.557318
2020-04-17 -10.589381
2020-04-18 -10.610419
2020-04-19 -10.640973
Freq: D, Name: predicted_mean, dtype: float64

# Plot actual vs. forecasted values


plt.figure(figsize=(10, 5))
plt.plot(df.index, df['Value'], label='Actual Data')
forecast_index = pd.date_range(df.index[-1], periods=11, freq='D')[1:] # Ensure correct dimensions
plt.plot(forecast_index, forecast, label='Forecast', color='red')
plt.legend()
plt.title('ARIMA Forecast')
plt.show()
Conclusion

The ARIMA model successfully fits and forecasts time series data by identifying trends and patterns. We checked for stationarity and applied differencing where necessary. ACF and PACF plots help in selecting appropriate ARIMA parameters. The model was trained and used to forecast future values. This implementation demonstrates the effectiveness of ARIMA for time series forecasting in various applications.

8. Object segmentation using hierarchical based methods
Install Required Libraries

pip install scikit-image matplotlib

Defaulting to user installation because normal site-packages is not writeable. Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: scikit-image (0.25.2), matplotlib (3.9.2), and their dependencies.

Import Necessary Libraries

import numpy as np
import matplotlib.pyplot as plt
from skimage import io, segmentation, color
from skimage.filters import sobel

Load an Image

# Load an image from a local file

image_path = 'panda.jpg'
image = io.imread(image_path)

# Display the original image


plt.imshow(image)
plt.axis('off')
plt.title('Original Image')
plt.show()
Convert the Image to Grayscale
# Convert to grayscale
gray_image = color.rgb2gray(image)
gray_image

array([[0.38707765, 0.39492078, 0.41117255, ..., 0.60005961, 0.58437333,


0.56868706],
[0.4013651 , 0.40528667, 0.41509412, ..., 0.60790275, 0.5882949 ,
0.5765302 ],
[0.43609373, 0.43609373, 0.44058078, ..., 0.60790275, 0.59221647,
0.58409059],
...,
[0.05546745, 0.05546745, 0.05154588, ..., 0.6479702 , 0.64012706,
0.63228392],
[0.06331059, 0.06331059, 0.05938902, ..., 0.64012706, 0.63228392,
0.62836235],
[0.06723216, 0.06723216, 0.06331059, ..., 0.63620549, 0.63228392,
0.62836235]])

Compute Edges Using the Sobel Filter


edges = sobel(gray_image)

# Display the edges


plt.imshow(edges, cmap='gray')
plt.axis('off')
plt.title('Edge Detection (Sobel Filter)')
plt.show()

Perform Over-Segmentation Using SLIC Superpixels


segments_slic = segmentation.slic(image, n_segments=300, compactness=10, sigma=1)

# Here:
#n_segments=300 → Controls the number of superpixels.

#compactness=10 → Higher values make segments more compact.

#sigma=1 → Adds smoothing to the image.

Display the Segmented Image


# Display results
plt.imshow(color.label2rgb(segments_slic, image, kind='avg'))
plt.axis('off')
plt.title('SLIC Superpixel Segmentation')
plt.show()
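SLIC on its own only over-segments the image; to make the pipeline genuinely hierarchical, the superpixels can be merged bottom-up on a region adjacency graph (RAG). The sketch below is an added illustration and assumes a scikit-image release that exposes rag_mean_color and merge_hierarchical in skimage.graph, as recent versions do.

from skimage import graph

def merge_mean_color(g, src, dst):
    # Fold src's colour statistics into dst when two regions merge
    g.nodes[dst]['total color'] += g.nodes[src]['total color']
    g.nodes[dst]['pixel count'] += g.nodes[src]['pixel count']
    g.nodes[dst]['mean color'] = (g.nodes[dst]['total color'] /
                                  g.nodes[dst]['pixel count'])

def weight_mean_color(g, src, dst, n):
    # Edge weight: colour distance between the merged region and its neighbour
    diff = g.nodes[dst]['mean color'] - g.nodes[n]['mean color']
    return {'weight': np.linalg.norm(diff)}

# Build a RAG over the SLIC superpixels and merge similar neighbouring regions
rag = graph.rag_mean_color(image, segments_slic)
labels_merged = graph.merge_hierarchical(segments_slic, rag, thresh=35,
                                         rag_copy=False, in_place_merge=True,
                                         merge_func=merge_mean_color,
                                         weight_func=weight_mean_color)

plt.imshow(color.label2rgb(labels_merged, image, kind='avg'))
plt.axis('off')
plt.title('Hierarchical Merging of SLIC Superpixels')
plt.show()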

Conclusion

Hierarchical object segmentation breaks an image into meaningful regions, improving object recognition. We used SLIC superpixels, which efficiently cluster similar pixels while preserving boundaries, as the over-segmentation stage; merging those superpixels (as sketched above) yields progressively coarser object regions. This approach is useful in applications such as medical imaging and object detection.

9. Perform Visualization techniques (types of maps - Bar, Column, Line, Scatter, 3D Cubes etc)

Step 1: Install Required Libraries


Make sure you have the necessary libraries installed:
pip install matplotlib seaborn numpy pandas

Defaulting to user installation because normal site-packages is not writeable.
Requirement already satisfied: matplotlib (3.9.2), seaborn (0.13.2), numpy (1.26.4), pandas (2.2.2), and their dependencies.
Note: you may need to restart the kernel to use updated packages.

Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from mpl_toolkits.mplot3d.art3d import Poly3DCollection

Create Sample Data


# Sample dataset
np.random.seed(42)
categories = ['A', 'B', 'C', 'D', 'E']
values = np.random.randint(10, 100, size=5)
x = np.arange(len(categories))
y = np.random.randint(10, 100, size=10)
z = np.linspace(1, 10, 10)
w = np.random.randint(10, 50, size=10)

# DataFrame for seaborn plots


df = pd.DataFrame({'Category': np.random.choice(categories, 50), 'Value': np.random.randint(10, 100, 50)})

df

Category Value

0 C 98

1 B 69

2 D 50

3 D 38

4 C 24

5 D 54

6 D 74
7 A 98

8 C 80

9 E 18

10 C 97

11 E 10

12 A 17

13 B 97

14 D 72

15 A 20

16 D 90

17 B 17

18 B 44

19 A 44

20 B 42

21 E 14

22 B 50

23 D 37

24 D 16

25 D 82

26 D 81

27 E 21

28 C 43

29 A 42

30 D 57

31 B 32

32 D 71

33 B 97

34 B 46

35 D 53

36 E 95

37 B 44

38 B 74

39 D 56

40 B 87

41 B 12

42 D 10

43 D 14

44 A 99

45 E 23

46 E 36

47 B 18

48 E 88

49 B 24

10 Visualization Techniques

1. Bar Chart
plt.figure(figsize=(6, 4))
plt.bar(categories, values, color='skyblue')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Chart')
plt.show()

2. Column Chart (Horizontal Bar Chart)

plt.figure(figsize=(6, 4))
plt.barh(categories, values, color='salmon')
plt.xlabel('Values')
plt.ylabel('Categories')
plt.title('Column Chart')
plt.show()

3. Line Plot
plt.figure(figsize=(6, 4))
plt.plot(z, w, marker='o', linestyle='-', color='green')
plt.xlabel('X-Axis')
plt.ylabel('Y-Axis')
plt.title('Line Chart')
plt.show()
4. Scatter Plot

plt.figure(figsize=(6, 4))
plt.scatter(y, w, color='purple', alpha=0.7)
plt.xlabel('X-Axis')
plt.ylabel('Y-Axis')
plt.title('Scatter Plot')
plt.show()

5. Histogram

plt.figure(figsize=(6, 4))
plt.hist(df['Value'], bins=8, color='orange', alpha=0.7)
plt.xlabel('Value Ranges')
plt.ylabel('Frequency')
plt.title('Histogram')
plt.show()
6. Box Plot
plt.figure(figsize=(6, 4))
# Assign x to hue with legend=False to avoid seaborn's palette deprecation warning
sns.boxplot(x='Category', y='Value', data=df, hue='Category', palette="Set2", legend=False)
plt.title('Box Plot')
plt.show()


7. Violin Plot
plt.figure(figsize=(6, 4))
sns.violinplot(x='Category', y='Value', data=df, hue='Category', palette="muted", legend=False)
plt.title('Violin Plot')
plt.show()



8. Pie Chart
plt.figure(figsize=(6, 4))
plt.pie(values, labels=categories, autopct='%1.1f%%', colors=['red', 'blue', 'green', 'yellow', 'purple'])
plt.title('Pie Chart')
plt.show()

9. Heatmap

plt.figure(figsize=(6, 4))
data = np.random.rand(5, 5)
sns.heatmap(data, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Heatmap')
plt.show()
10. 3D Cube Visualization
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111, projection='3d')

# Define cube vertices


vertices = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0],
[0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 1, 1]]

# Define cube faces


faces = [[vertices[j] for j in [0, 1, 2, 3]],
[vertices[j] for j in [4, 5, 6, 7]],
[vertices[j] for j in [0, 1, 5, 4]],
[vertices[j] for j in [2, 3, 7, 6]],
[vertices[j] for j in [0, 3, 7, 4]],
[vertices[j] for j in [1, 2, 6, 5]]]

# Draw cube
ax.add_collection3d(Poly3DCollection(faces, alpha=0.3, linewidths=1, edgecolors='r'))
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
ax.set_title('3D Cube')
plt.show()

Conclusion
This script demonstrates 10 visualization techniques to analyze data effectively:

✔ Bar & Column Charts – Show category-wise values.

✔ Line & Scatter Plots – Useful for trends and correlations.

✔ Histograms & Box Plots – Help understand data distribution.

✔ Violin Plots & Heatmaps – Provide in-depth insights into variable relationships.

✔ Pie Charts & 3D Cubes – Visually represent proportions and 3D objects.

10. Perform Descriptive analytics on healthcare data
Install Required Libraries

pip install pandas numpy seaborn matplotlib

Defaulting to user installation because normal site-packages is not writeable.
Requirement already satisfied: pandas (2.2.2), numpy (1.26.4), seaborn (0.13.2), matplotlib (3.9.2), and their dependencies.
Note: you may need to restart the kernel to use updated packages.

Import Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Step 3: Load Sample Healthcare Data

For this analysis, let's create a synthetic healthcare dataset containing:


Age (Patient's age)

Gender (Male/Female)

Blood Pressure (BP)

Cholesterol Level

Diabetes (Yes/No)

Hospital Stay (Days spent in hospital)

# Create a synthetic healthcare dataset


np.random.seed(42)

data = {
'Age': np.random.randint(20, 80, 100),
'Gender': np.random.choice(['Male', 'Female'], 100),
'Blood_Pressure': np.random.randint(90, 180, 100),
'Cholesterol': np.random.randint(150, 300, 100),
'Diabetes': np.random.choice(['Yes', 'No'], 100),
'Hospital_Stay': np.random.randint(1, 15, 100)
}

df = pd.DataFrame(data)

# Display the first few rows of the dataset


print(df.head())
Age Gender Blood_Pressure Cholesterol Diabetes Hospital_Stay
0 58 Female 104 191 No 9
1 71 Male 132 248 Yes 8
2 48 Female 118 156 No 13
3 34 Male 125 293 Yes 5
4 62 Female 102 239 No 1

Perform Descriptive Analytics

1. Basic Statistical Summary

# Summary statistics
print(df.describe())

Age Blood_Pressure Cholesterol Hospital_Stay


count 100.000000 100.000000 100.000000 100.000000
mean 49.580000 134.090000 227.250000 7.590000
std 18.031499 26.413608 44.758922 4.330057
min 21.000000 90.000000 151.000000 1.000000
25% 34.000000 113.000000 186.000000 4.000000
50% 48.000000 133.500000 234.000000 8.000000
75% 66.000000 156.250000 265.000000 12.000000
max 79.000000 179.000000 297.000000 14.000000

2. Count of Male vs Female Patients

# Count of Gender distribution


gender_count = df['Gender'].value_counts()
print(gender_count)

# Visualization
plt.figure(figsize=(5, 4))
sns.countplot(x='Gender', data=df, hue='Gender', palette='coolwarm', legend=False)
plt.title('Gender Distribution')
plt.show()

Gender
Male 61
Female 39
Name: count, dtype: int64

3. Distribution of Age

plt.figure(figsize=(6, 4))
sns.histplot(df['Age'], bins=10, kde=True, color='blue')
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution of Patients')
plt.show()
4. Average Hospital Stay Based on Diabetes Condition
# Group by diabetes and compute mean hospital stay
print(df.groupby('Diabetes')['Hospital_Stay'].mean())

# Visualization
plt.figure(figsize=(5, 4))
sns.boxplot(x='Diabetes', y='Hospital_Stay', data=df, hue='Diabetes', palette='Set2', legend=False)
plt.title('Hospital Stay Duration Based on Diabetes')
plt.show()

Diabetes
No 7.551724
Yes 7.642857
Name: Hospital_Stay, dtype: float64

5. Relationship Between Blood Pressure & Cholesterol

plt.figure(figsize=(6, 4))
sns.scatterplot(x='Blood_Pressure', y='Cholesterol', hue='Diabetes', data=df, palette='coolwarm')
plt.title('Blood Pressure vs Cholesterol Level')
plt.xlabel('Blood Pressure')
plt.ylabel('Cholesterol Level')
plt.show()
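To put a number on this relationship rather than judging the scatter plot by eye, the correlation can be computed directly; this small check is an addition to the original program.

# Pearson correlation between blood pressure and cholesterol
corr = df['Blood_Pressure'].corr(df['Cholesterol'])
print(f"Correlation between Blood Pressure and Cholesterol: {corr:.3f}")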
Conclusion
Using Descriptive Analytics, we derived key insights:

✔ Gender Distribution: The sample contains more male patients (61) than female patients (39).

✔ Age Distribution: Most patients fall between 30 and 70 years.

✔ Diabetes & Hospital Stay: Patients with diabetes stay only marginally longer on average (7.64 vs 7.55 days).

✔ Blood Pressure vs Cholesterol: No strong relationship appears, which is expected because the synthetic variables were generated independently.

11. Perform Predictive analytics on Product Sales data

Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from statsmodels.tsa.holtwinters import ExponentialSmoothing

Step 1: Load Sample Sales Data

file_path = "sales_data_sample.csv"
df = pd.read_csv(file_path, encoding='latin1')

EDA

Step 2: Basic Data Exploration

print("Dataset Overview:\n", df.head()) # Display first few rows


print("\nSummary Statistics:\n", df.describe()) # Summary statistics
print("\nMissing Values:\n", df.isnull().sum()) # Check for missing values

Dataset Overview:
Day MONTH_ID YEAR_ID QUANTITYORDERED PRICEEACH SALES
0 24 2 2003 30 95.70 2871.00
1 7 5 2003 34 81.35 2765.90
2 1 7 2003 41 94.74 3884.34
3 25 8 2003 45 83.26 3746.70
4 10 10 2003 49 100.00 5205.27

Summary Statistics:
Day MONTH_ID YEAR_ID QUANTITYORDERED PRICEEACH \
count 2823.000000 2823.000000 2823.00000 2823.000000 2823.000000
mean 14.291534 7.092455 2003.81509 35.092809 83.658544
std 8.777409 3.656633 0.69967 9.741443 20.174277
min 1.000000 1.000000 2003.00000 6.000000 26.880000
25% 6.000000 4.000000 2003.00000 27.000000 68.860000
50% 14.000000 8.000000 2004.00000 35.000000 95.700000
75% 21.000000 11.000000 2004.00000 43.000000 100.000000
max 31.000000 12.000000 2005.00000 97.000000 100.000000

SALES
count 2823.000000
mean 3553.889072
std 1841.865106
min 482.130000
25% 2203.430000
50% 3184.800000
75% 4508.000000
max 14082.800000

Missing Values:
Day 0
MONTH_ID 0
YEAR_ID 0
QUANTITYORDERED 0
PRICEEACH 0
SALES 0
dtype: int64

Step 3: Data Cleaning

# Convert ORDERDATE to datetime format (if available)


df['ORDERDATE'] = pd.to_datetime(df['ORDERDATE'], errors='coerce')

# Drop rows with missing values


df.dropna(inplace=True)

Step 4: Visualizing Sales Trends


plt.figure(figsize=(12, 6))
sns.lineplot(x=df['ORDERDATE'], y=df['SALES'])
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Sales Trend Over Time')
plt.xticks(rotation=45)
plt.show()
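ExponentialSmoothing was imported at the top of this section but never used; the sketch below shows how monthly sales could be smoothed and projected with it. It is purely illustrative and assumes the ORDERDATE column parsed correctly and that the data covers at least two full yearly cycles.

# Aggregate sales by month and fit a Holt-Winters model with additive trend and seasonality
monthly_sales = df.set_index('ORDERDATE')['SALES'].resample('MS').sum()
hw_model = ExponentialSmoothing(monthly_sales, trend='add',
                                seasonal='add', seasonal_periods=12).fit()
hw_forecast = hw_model.forecast(6)  # project the next six months
print(hw_forecast)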

Step 5: Correlation Analysis

plt.figure(figsize=(10, 6))
sns.heatmap(df[['SALES', 'QUANTITYORDERED', 'PRICEEACH', 'MONTH_ID', 'YEAR_ID']].corr(),
            annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()

Step 6: Distribution of Sales

plt.figure(figsize=(8, 5))
sns.histplot(df['SALES'], bins=30, kde=True)
plt.xlabel('Sales')
plt.ylabel('Frequency')
plt.title('Sales Distribution')
plt.show()

Step 7: Boxplot for Outliers

plt.figure(figsize=(8, 5))
sns.boxplot(y=df['SALES'])
plt.title('Sales Outlier Detection')
plt.show()

print("EDA Completed! Insights generated.")

EDA Completed! Insights generated.

Step 8: Predictive Analysis

# Selecting Features & Target


features = ['QUANTITYORDERED', 'PRICEEACH', 'MONTH_ID', 'YEAR_ID']
X = df[features] # Independent variables
y = df['SALES'] # Target variable

# Splitting the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 9: Train the Model


model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

Step 10: Evaluate Model Performance


y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Display Performance Metrics in Tabular Form


performance_df = pd.DataFrame({
'Metric': ['Mean Absolute Error', 'Mean Squared Error', 'Root Mean Squared Error', 'R2 Score'],
'Value': [mae, mse, rmse, r2]
})
print(performance_df)

Metric Value
0 Mean Absolute Error 6.557240e+02
1 Mean Squared Error 1.019664e+06
2 Root Mean Squared Error 1.009784e+03
3 R2 Score 7.396571e-01

Step 11: Predict Future Sales

future_data = pd.DataFrame({'QUANTITYORDERED': [30, 50, 70],


'PRICEEACH': [100, 200, 150],
'MONTH_ID': [4, 5, 6],
'YEAR_ID': [2025, 2025, 2025]})
future_predictions = model.predict(future_data)
print("Future Sales Predictions:", future_predictions)

Future Sales Predictions: [ 5303.62273106 12792.98585246 11670.12423585]

future_data['PREDICTED_SALES'] = future_predictions
print("\nFuture Sales Predictions:\n")
print(future_data.to_string(index=False))

Future Sales Predictions:

QUANTITYORDERED PRICEEACH MONTH_ID YEAR_ID PREDICTED_SALES


30 100 4 2025 5303.622731
50 200 5 2025 12792.985852
70 150 6 2025 11670.124236

Conclusion
print("\nConclusion:\n")
print("1. The EDA revealed strong correlations between Quantity Ordered, Price Each, and Sales.\n")
print("2. The Linear Regression model achieved an R2 score of {:.2f}, indicating {} predictive accuracy.\n".format(
print("3. Future sales predictions highlight expected revenue based on given input values.\n")
print("4. Businesses can leverage these insights to optimize pricing, inventory, and sales strategies.")

Conclusion:

1. The EDA revealed strong correlations between Quantity Ordered, Price Each, and Sales.

2. The Linear Regression model achieved an R2 score of 0.74, indicating good predictive accuracy.

3. Future sales predictions highlight expected revenue based on given input values.

4. Businesses can leverage these insights to optimize pricing, inventory, and sales strategies.

12. Apply Predictive analytics for Weather forecasting.
Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Step 1: Load the dataset


file_path = 'weather.csv' # Update this if needed
df = pd.read_csv(file_path)

# Step 2: Explore the dataset


print("Data Overview:")
print(df.head())
print("\nMissing Values:")
print(df.isnull().sum())
print("\nDataset Info:")
print(df.info())
Data Overview:
origin year month day hour temp dewp humid wind_dir wind_speed \
0 EWR 2013 1 1 0 37.04 21.92 53.97 230.0 10.35702
1 EWR 2013 1 1 1 37.04 21.92 53.97 230.0 13.80936
2 EWR 2013 1 1 2 37.94 21.92 52.09 230.0 12.65858
3 EWR 2013 1 1 3 37.94 23.00 54.51 230.0 13.80936
4 EWR 2013 1 1 4 37.94 24.08 57.04 240.0 14.96014

wind_gust precip pressure visib time_hour


0 11.918651 0.0 1013.9 10.0 1/01/2013 1:00
1 15.891535 0.0 1013.0 10.0 1/01/2013 2:00
2 14.567241 0.0 1012.6 10.0 1/01/2013 3:00
3 15.891535 0.0 1012.7 10.0 1/01/2013 4:00
4 17.215830 0.0 1012.8 10.0 1/01/2013 5:00

Missing Values:
origin 0
year 0
month 0
day 0
hour 0
temp 1
dewp 1
humid 1
wind_dir 418
wind_speed 3
wind_gust 3
precip 0
pressure 2730
visib 0
time_hour 0
dtype: int64

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26130 entries, 0 to 26129
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 origin 26130 non-null object
1 year 26130 non-null int64
2 month 26130 non-null int64
3 day 26130 non-null int64
4 hour 26130 non-null int64
5 temp 26129 non-null float64
6 dewp 26129 non-null float64
7 humid 26129 non-null float64
8 wind_dir 25712 non-null float64
9 wind_speed 26127 non-null float64
10 wind_gust 26127 non-null float64
11 precip 26130 non-null float64
12 pressure 23400 non-null float64
13 visib 26130 non-null float64
14 time_hour 26130 non-null object
dtypes: float64(9), int64(4), object(2)
memory usage: 3.0+ MB
None

# Step 3: Preprocessing - Handle missing values (if any)


df = df.dropna()

# Step 4: Select features and target variable


features = ['year', 'month', 'day', 'hour', 'dewp', 'humid', 'wind_dir', 'wind_speed', 'wind_gust', 'precip', 'pressure']
target = 'temp'  # Predicting temperature

X = df[features]
y = df[target]

# Step 5: Train-test split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Feature Scaling


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 7: Train the Model (Random Forest)


model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

RandomForestRegressor(random_state=42)

# Step 8: Make Predictions


y_pred = model.predict(X_test)

# Step 9: Evaluate the Model


mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print("Model Evaluation:")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")

Model Evaluation:
Mean Absolute Error (MAE): 0.10836664495115607
Mean Squared Error (MSE): 0.07925993917915358
Root Mean Squared Error (RMSE): 0.2815314177479195

from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print(f"R² Score (Accuracy): {r2}")

R² Score (Accuracy): 0.9997561069719265

# Step 10: Visualizing Predictions


plt.figure(figsize=(15,10))

<Figure size 1500x1000 with 0 Axes>


<Figure size 1500x1000 with 0 Axes>

# Scatter plot of Actual vs Predicted values


plt.subplot(2,2,1)
sns.scatterplot(x=y_test, y=y_pred)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs Predicted Temperature")

Text(0.5, 1.0, 'Actual vs Predicted Temperature')

# 3D Visualization - Differentiating Actual and Predicted Values


fig = plt.figure(figsize=(10,7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_test[:, features.index('humid')], y_test, y_test, c='blue', marker='o', label='Actual Values')
ax.scatter(X_test[:, features.index('humid')], y_test, y_pred, c='red', marker='^', label='Predicted Values')
ax.set_xlabel("Humidity")
ax.set_ylabel("Actual Temperature")
ax.set_zlabel("Predicted Temperature")
ax.set_title("3D Visualization: Humidity vs Actual & Predicted Temperature")
ax.legend()
plt.show()
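The final conclusion names humidity, wind speed, and pressure as key drivers; the quick sketch below (an addition) checks that claim against the model's own impurity-based feature importances.

# Rank the weather features by the Random Forest's impurity-based importance
importances = pd.Series(model.feature_importances_, index=features).sort_values()

plt.figure(figsize=(8, 5))
importances.plot(kind='barh', color='steelblue')
plt.xlabel('Feature Importance Score')
plt.title('Random Forest Feature Importances for Temperature Prediction')
plt.show()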
# Final Conclusion
print("\nFinal Conclusion:")
print(f"The Random Forest model effectively predicts temperature with an accuracy (R² Score) of {r2:.2f}. "
      f"Key features like humidity, wind speed, and pressure significantly impact predictions. "
      f"Visualizations confirm a strong correlation between actual and predicted values.")

Final Conclusion:
The Random Forest model effectively predicts temperature with an accuracy (R² Score) of 1.00. Key features like
humidity, wind speed, and pressure significantly impact predictions. Visualizations confirm a strong correlation
between actual and predicted values.

