DA Programs

The document outlines various data preprocessing techniques including handling missing values through mean/median imputation, forward/backward fill, and K-Nearest Neighbors (KNN) imputation. It also discusses noise detection and removal using statistical methods and machine learning, as well as identifying and eliminating data redundancy through correlation analysis. Additionally, the document covers the implementation of linear regression, logistic regression, and decision tree classification with examples and code snippets.


1. Data Preprocessing

a. Handling missing values


1. Mean/Median Imputation
Replace missing values with the mean or median of the respective feature.

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Create a sample DataFrame


df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8]
})
# Replace missing values with the column mean
df['A'] = df['A'].fillna(df['A'].mean())
df['B'] = df['B'].fillna(df['B'].mean())
print(df)

A B
0 1.000000 5.000000
1 2.000000 6.666667
2 2.333333 7.000000
3 4.000000 8.000000

2. Forward/Backward Fill
Replace missing values with the previous or next value in the sequence.

# Forward fill (note: df was already fully imputed above, so the values are unchanged here)
df['A'] = df['A'].ffill()
df['B'] = df['B'].ffill()
print(df)

A B
0 1.000000 5.000000
1 2.000000 6.666667
2 2.333333 7.000000
3 4.000000 8.000000
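The heading covers both directions, but only forward fill is demonstrated above. Below is a minimal sketch of backward fill on a fresh copy of the sample data; this block is an added illustration, not part of the original run.

import pandas as pd
import numpy as np

# Recreate the sample DataFrame with missing values
df_bfill = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8]
})
# Backward fill: each NaN takes the next valid value in its column
df_bfill = df_bfill.bfill()
print(df_bfill)
# Result: A = [1, 2, 4, 4], B = [5, 7, 7, 8]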
3. K-Nearest Neighbors (KNN) Imputation
Replace missing values using the KNN algorithm.

from sklearn.impute import KNNImputer


import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8]
})
# Create a KNN imputer
imputer = KNNImputer(n_neighbors=2)
# Fit and transform the data
df_imputed = imputer.fit_transform(df)
print(df_imputed)

[[1. 5. ]
[2. 6.5]
[2.5 7. ]
[4. 8. ]]

b. Noise detection and removal


1. Statistical Methods
Use statistical measures such as the mean, median, and standard deviation to detect and remove outliers.

import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
'A': [1, 2, 3, 4, 100]
})

# Calculate the mean and standard deviation


mean = df['A'].mean()
std_dev = df['A'].std()
# Remove outliers
df_cleaned = df[(df['A'] >= mean - 2*std_dev) & (df['A'] <= mean + 2*std_dev)]
print(df_cleaned)

A
0 1
1 2
2 3
3 4
4 100
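Note that the 2-sigma rule above does not remove the value 100: with only five points, the outlier itself inflates the standard deviation. A sketch of the interquartile-range (IQR) rule, which is more robust on this sample, is added below for comparison.

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 100]})

# IQR rule: keep points inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = df['A'].quantile(0.25)
q3 = df['A'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df_iqr = df[(df['A'] >= lower) & (df['A'] <= upper)]
print(df_iqr)
# Here Q1 = 2, Q3 = 4, IQR = 2, so 100 lies outside [-1, 7] and is removed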
2. Machine Learning Methods
Use machine learning algorithms such as One-Class SVM and Local Outlier Factor (LOF) to detect anomalies.

from sklearn.svm import OneClassSVM


import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
'A': [1, 2, 3, 4, 100]
})
# Create a One-Class SVM model
model = OneClassSVM(kernel='rbf', gamma=0.1, nu=0.1)
# Fit the model
model.fit(df[['A']])
# Predict anomalies
anomaly = model.predict(df[['A']])
# Remove anomalies
df_cleaned = df[anomaly == 1]

print(df_cleaned)

A
1 2
2 3
3 4
4 100
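The text above also mentions Local Outlier Factor (LOF), which is not demonstrated, and the unscaled One-Class SVM above ends up keeping the extreme value 100 while dropping a normal point. A minimal LOF sketch on the same data is added here for completeness.

import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

df = pd.DataFrame({'A': [1, 2, 3, 4, 100]})

# LOF compares each point's local density with that of its neighbours
lof = LocalOutlierFactor(n_neighbors=2)
labels = lof.fit_predict(df[['A']])  # -1 marks outliers, 1 marks inliers

df_cleaned = df[labels == 1]
print(df_cleaned)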

c. Identifying data redundancy and elimination


Data redundancy can be identified and eliminated using techniques such as:
1. Correlation Analysis
Use correlation analysis to identify highly correlated features and eliminate the redundant ones.

import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [2, 4, 6, 8, 10],
'C': [3, 6, 9, 12, 15]
})
# Calculate the absolute correlation matrix
corr_matrix = df.corr().abs()
# Look only at the upper triangle so each pair of features is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Identify features that are highly correlated with an earlier feature
high_corr_features = [col for col in upper.columns if (upper[col] > 0.9).any()]
# Eliminate the redundant features
df_eliminated = df.drop(high_corr_features, axis=1)
print(df_eliminated)

   A
0  1
1  2
2  3
3  4
4  5

2. Implement any one imputation model
KNN Imputation Model: KNN imputation is a popular method for handling missing values. It works by finding the k most similar data points
(nearest neighbors) to the row with the missing value. The missing value is then imputed using the values from these nearest neighbors.

import pandas as pd
from sklearn.impute import KNNImputer
import numpy as np
# Create a sample dataset with missing values
data = {'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 3, 4, 5, 6],
'C': [7, 8, 9, np.nan, 11]}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)
# Create a KNN imputer with k=3
imputer = KNNImputer(n_neighbors=3)
# Fit the imputer to the data and transform the missing values
imputed_data = imputer.fit_transform(df)
# Convert the imputed data back to a DataFrame
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
print("\nImputed DataFrame:")
print(imputed_df)

Original DataFrame:
A B C
0 1.0 NaN 7.0
1 2.0 3.0 8.0
2 NaN 4.0 9.0
3 4.0 5.0 NaN
4 5.0 6.0 11.0

Imputed DataFrame:
A B C
0 1.000000 4.0 7.000000
1 2.000000 3.0 8.000000
2 2.333333 4.0 9.000000
3 4.000000 5.0 9.333333
4 5.000000 6.0 11.000000

3. Implement Linear Regression
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('salary_Data.csv')

dataset

YearsExperience Salary

0 1.1 39343.0

1 1.3 46205.0

2 1.5 37731.0

3 2.0 43525.0

4 2.2 39891.0

5 2.9 56642.0

6 3.0 60150.0

7 3.2 54445.0

8 3.2 64445.0

9 3.7 57189.0

10 3.9 63218.0

11 4.0 55794.0

12 4.0 56957.0

13 4.1 57081.0

14 4.5 61111.0

15 4.9 67938.0

16 5.1 66029.0

17 5.3 83088.0

18 5.9 81363.0

19 6.0 93940.0

20 6.8 91738.0

21 7.1 98273.0

22 7.9 101302.0

23 8.2 113812.0

24 8.7 109431.0

25 9.0 105582.0

26 9.5 116969.0

27 9.6 112635.0

28 10.3 122391.0

29 10.5 121872.0

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:,1].values

X
array([[ 1.1],
[ 1.3],
[ 1.5],
[ 2. ],
[ 2.2],
[ 2.9],
[ 3. ],
[ 3.2],
[ 3.2],
[ 3.7],
[ 3.9],
[ 4. ],
[ 4. ],
[ 4.1],
[ 4.5],
[ 4.9],
[ 5.1],
[ 5.3],
[ 5.9],
[ 6. ],
[ 6.8],
[ 7.1],
[ 7.9],
[ 8.2],
[ 8.7],
[ 9. ],
[ 9.5],
[ 9.6],
[10.3],
[10.5]])

array([ 39343., 46205., 37731., 43525., 39891., 56642., 60150.,


54445., 64445., 57189., 63218., 55794., 56957., 57081.,
61111., 67938., 66029., 83088., 81363., 93940., 91738.,
98273., 101302., 113812., 109431., 105582., 116969., 112635.,
122391., 121872.])

from sklearn.model_selection import train_test_split


X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

from sklearn.linear_model import LinearRegression


regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()

plt.scatter(X_train, y_train, color = 'red')


plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
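The notebook stops at the plots; the short follow-up below (an addition, not part of the original program) shows one way the fitted regressor could be scored on the held-out test split.

from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Predict salaries for the test set and report common regression metrics
y_pred = regressor.predict(X_test)
print("R^2 on test set:", r2_score(y_test, y_pred))
print("RMSE on test set:", np.sqrt(mean_squared_error(y_test, y_pred)))

# The learned line itself: salary ≈ intercept + slope * years of experience
print("Intercept:", regressor.intercept_)
print("Slope:", regressor.coef_[0])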

4.Implementation of Logistic Regression
Import Libraries

# Import necessary libraries


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc

Read and Explore the data

# Load the diabetes dataset


diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Convert the target variable to binary (1 for diabetes, 0 for no diabetes)


y_binary = (y > np.median(y)).astype(int)

Splitting The Dataset: Train and Test dataset

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(
X, y_binary, test_size=0.2, random_state=42)

Feature Scaling

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Train The Model

# Train the Logistic Regression model


model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

Evaluation Metrics

# Evaluate the model


y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))

Accuracy: 73.03%
Confusion Matrix and Classification Report

# evaluate the model


print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Confusion Matrix:
[[36 13]
[11 29]]

Classification Report:
precision recall f1-score support

0 0.77 0.73 0.75 49


1 0.69 0.72 0.71 40

accuracy 0.73 89
macro avg 0.73 0.73 0.73 89
weighted avg 0.73 0.73 0.73 89

Visualizing the performance of our model.

# Visualize the test-set classes on two informative features
# (column 2 of the diabetes data is BMI; column 8 is the serum measurement s5)

plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_test[:, 2], y=X_test[:, 8], hue=y_test, palette={
    0: 'blue', 1: 'red'}, marker='o')
plt.xlabel("BMI (standardized)")
plt.ylabel("Serum measurement s5 (standardized)")
plt.title("Logistic Regression Test Samples by Class\nAccuracy: {:.2f}%".format(
    accuracy * 100))
plt.legend(title="Diabetes", loc="upper right")
plt.show()

Plotting ROC Curve

# Plot ROC Curve


y_prob = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2,
label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve\nAccuracy: {:.2f}%'.format(
accuracy * 100))
plt.legend(loc="lower right")
plt.show()
5. Decision Tree Induction for Classification

Import necessary libraries


import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

Load Dataset
# Load dataset (Iris dataset)
iris = load_iris()
X, y = iris.data, iris.target # Features and labels

Data Splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Model Training
# Initialize and train the Decision Tree Classifier
model = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
model.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=3, random_state=42)

Predictions
# Make predictions
y_pred = model.predict(X_test)

Model Evaluation
# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

Model Accuracy: 1.0

# Print classification report


print("Classification Report:\n", classification_report(y_test, y_pred))

Classification Report:
precision recall f1-score support

0 1.00 1.00 1.00 10


1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30

Decision Tree Visualization


# Visualize the Decision Tree
plt.figure(figsize=(12, 8))
plot_tree(model, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.title("Decision Tree Visualization")
plt.show()
Feature Importance
# Feature Importance Plot
plt.figure(figsize=(8, 5))
plt.barh(iris.feature_names, model.feature_importances_, color="skyblue")
plt.xlabel("Feature Importance Score")
plt.ylabel("Features")
plt.title("Feature Importance in Decision Tree")
plt.show()

The Decision Tree Classifier is a simple yet effective machine learning model for classification
tasks. In this implementation, we used the Iris dataset to train and evaluate the model. The
decision tree achieves high accuracy, and by visualizing the tree and feature importance, we
gain insights into how decisions are made. This method is useful for explainable AI but can be
prone to overfitting if not carefully tuned. To improve generalization, techniques such as pruning
or ensemble methods like Random Forest can be applied.
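As that remark suggests, cost-complexity pruning is one way to control overfitting. The brief sketch below (an added illustration using the same Iris split) grows a full tree, extracts the candidate pruning strengths, and reports how test accuracy changes as the tree is simplified.

# Cost-complexity pruning: larger ccp_alpha values prune more aggressively
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    print(f"ccp_alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")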

6. Implement Random Forest Classifier
1. Import Required Libraries
We will import pandas, matplotlib, seaborn, and scikit-learn to build the model.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

from sklearn.model_selection import train_test_split


from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

2. Import Dataset

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

df

sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target

0 5.1 3.5 1.4 0.2 0

1 4.9 3.0 1.4 0.2 0

2 4.7 3.2 1.3 0.2 0

3 4.6 3.1 1.5 0.2 0

4 5.0 3.6 1.4 0.2 0

... ... ... ... ... ...

145 6.7 3.0 5.2 2.3 2

146 6.3 2.5 5.0 1.9 2

147 6.5 3.0 5.2 2.0 2

148 6.2 3.4 5.4 2.3 2

149 5.9 3.0 5.1 1.8 2

150 rows × 5 columns

3. Data Preparation

X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

array([[5.1, 3.5, 1.4, 0.2],


[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.1, 3.7, 1.5, 0.4],
[4.6, 3.6, 1. , 0.2],
[5.1, 3.3, 1.7, 0.5],
[4.8, 3.4, 1.9, 0.2],
[5. , 3. , 1.6, 0.2],
[5. , 3.4, 1.6, 0.4],
[5.2, 3.5, 1.5, 0.2],
[5.2, 3.4, 1.4, 0.2],
[4.7, 3.2, 1.6, 0.2],
[4.8, 3.1, 1.6, 0.2],
[5.4, 3.4, 1.5, 0.4],
[5.2, 4.1, 1.5, 0.1],
[5.5, 4.2, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.2],
[5. , 3.2, 1.2, 0.2],
[5.5, 3.5, 1.3, 0.2],
[4.9, 3.6, 1.4, 0.1],
[4.4, 3. , 1.3, 0.2],
[5.1, 3.4, 1.5, 0.2],
[5. , 3.5, 1.3, 0.3],
[4.5, 2.3, 1.3, 0.3],
[4.4, 3.2, 1.3, 0.2],
[5. , 3.5, 1.6, 0.6],
[5.1, 3.8, 1.9, 0.4],
[4.8, 3. , 1.4, 0.3],
[5.1, 3.8, 1.6, 0.2],
[4.6, 3.2, 1.4, 0.2],
[5.3, 3.7, 1.5, 0.2],
[5. , 3.3, 1.4, 0.2],
[7. , 3.2, 4.7, 1.4],
[6.4, 3.2, 4.5, 1.5],
[6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4. , 1.3],
[6.5, 2.8, 4.6, 1.5],
[5.7, 2.8, 4.5, 1.3],
[6.3, 3.3, 4.7, 1.6],
[4.9, 2.4, 3.3, 1. ],
[6.6, 2.9, 4.6, 1.3],
[5.2, 2.7, 3.9, 1.4],
[5. , 2. , 3.5, 1. ],
[5.9, 3. , 4.2, 1.5],
[6. , 2.2, 4. , 1. ],
[6.1, 2.9, 4.7, 1.4],
[5.6, 2.9, 3.6, 1.3],
[6.7, 3.1, 4.4, 1.4],
[5.6, 3. , 4.5, 1.5],
[5.8, 2.7, 4.1, 1. ],
[6.2, 2.2, 4.5, 1.5],
[5.6, 2.5, 3.9, 1.1],
[5.9, 3.2, 4.8, 1.8],
[6.1, 2.8, 4. , 1.3],
[6.3, 2.5, 4.9, 1.5],
[6.1, 2.8, 4.7, 1.2],
[6.4, 2.9, 4.3, 1.3],
[6.6, 3. , 4.4, 1.4],
[6.8, 2.8, 4.8, 1.4],
[6.7, 3. , 5. , 1.7],
[6. , 2.9, 4.5, 1.5],
[5.7, 2.6, 3.5, 1. ],
[5.5, 2.4, 3.8, 1.1],
[5.5, 2.4, 3.7, 1. ],
[5.8, 2.7, 3.9, 1.2],
[6. , 2.7, 5.1, 1.6],
[5.4, 3. , 4.5, 1.5],
[6. , 3.4, 4.5, 1.6],
[6.7, 3.1, 4.7, 1.5],
[6.3, 2.3, 4.4, 1.3],
[5.6, 3. , 4.1, 1.3],
[5.5, 2.5, 4. , 1.3],
[5.5, 2.6, 4.4, 1.2],
[6.1, 3. , 4.6, 1.4],
[5.8, 2.6, 4. , 1.2],
[5. , 2.3, 3.3, 1. ],
[5.6, 2.7, 4.2, 1.3],
[5.7, 3. , 4.2, 1.2],
[5.7, 2.9, 4.2, 1.3],
[6.2, 2.9, 4.3, 1.3],
[5.1, 2.5, 3. , 1.1],
[5.7, 2.8, 4.1, 1.3],
[6.3, 3.3, 6. , 2.5],
[5.8, 2.7, 5.1, 1.9],
[7.1, 3. , 5.9, 2.1],
[6.3, 2.9, 5.6, 1.8],
[6.5, 3. , 5.8, 2.2],
[7.6, 3. , 6.6, 2.1],
[4.9, 2.5, 4.5, 1.7],
[7.3, 2.9, 6.3, 1.8],
[6.7, 2.5, 5.8, 1.8],
[7.2, 3.6, 6.1, 2.5],
[6.5, 3.2, 5.1, 2. ],
[6.4, 2.7, 5.3, 1.9],
[6.8, 3. , 5.5, 2.1],
[5.7, 2.5, 5. , 2. ],
[5.8, 2.8, 5.1, 2.4],
[6.4, 3.2, 5.3, 2.3],
[6.5, 3. , 5.5, 1.8],
[7.7, 3.8, 6.7, 2.2],
[7.7, 2.6, 6.9, 2.3],
[6. , 2.2, 5. , 1.5],
[6.9, 3.2, 5.7, 2.3],
[5.6, 2.8, 4.9, 2. ],
[7.7, 2.8, 6.7, 2. ],
[6.3, 2.7, 4.9, 1.8],
[6.7, 3.3, 5.7, 2.1],
[7.2, 3.2, 6. , 1.8],
[6.2, 2.8, 4.8, 1.8],
[6.1, 3. , 4.9, 1.8],
[6.4, 2.8, 5.6, 2.1],
[7.2, 3. , 5.8, 1.6],
[7.4, 2.8, 6.1, 1.9],
[7.9, 3.8, 6.4, 2. ],
[6.4, 2.8, 5.6, 2.2],
[6.3, 2.8, 5.1, 1.5],
[6.1, 2.6, 5.6, 1.4],
[7.7, 3. , 6.1, 2.3],
[6.3, 3.4, 5.6, 2.4],
[6.4, 3.1, 5.5, 1.8],
[6. , 3. , 4.8, 1.8],
[6.9, 3.1, 5.4, 2.1],
[6.7, 3.1, 5.6, 2.4],
[6.9, 3.1, 5.1, 2.3],
[5.8, 2.7, 5.1, 1.9],
[6.8, 3.2, 5.9, 2.3],
[6.7, 3.3, 5.7, 2.5],
[6.7, 3. , 5.2, 2.3],
[6.3, 2.5, 5. , 1.9],
[6.5, 3. , 5.2, 2. ],
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
4. Splitting the Dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

5. Feature Scaling

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

6. Building Random Forest Classifier

classifier = RandomForestClassifier(n_estimators=100, random_state=42)


classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

7. Evaluation of the Model

accuracy = accuracy_score(y_test, y_pred)


print(f'Accuracy: {accuracy * 100:.2f}%')

conf_matrix = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='g', cmap='Blues', cbar=False,
xticklabels=iris.target_names, yticklabels=iris.target_names)

plt.title('Confusion Matrix Heatmap')


plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

Accuracy: 100.00%
8. Feature Importance

feature_importances = classifier.feature_importances_

plt.barh(iris.feature_names, feature_importances)
plt.xlabel('Feature Importance')
plt.title('Feature Importance in Random Forest Classifier')
plt.show()

Conclusion--From the graph we can see that petal width (cm) is the most
important feature followed closely by petal length (cm). The sepal width (cm)
and sepal length (cm) have lower importance in determining the model’s
predictions. This indicates that the classifier relies more on the petal
measurements to make predictions about the flower species.
7. Implement ARIMA on Time Series data

Import necessary libraries


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

import warnings
warnings.filterwarnings('ignore')

Load and Prepare Dataset


# Generate synthetic time series data
np.random.seed(42)
date_rng = pd.date_range(start='2020-01-01', periods=100, freq='D')
data = np.cumsum(np.random.randn(100)) # Simulated time series
df = pd.DataFrame({'Date': date_rng, 'Value': data})
df.set_index('Date', inplace=True)

df

Value

Date

2020-01-01 0.496714

2020-01-02 0.358450

2020-01-03 1.006138

2020-01-04 2.529168

2020-01-05 2.295015

... ...

2020-04-05 -10.712354

2020-04-06 -10.416233

2020-04-07 -10.155178

2020-04-08 -10.150065

2020-04-09 -10.384652

100 rows × 1 columns

# Plot the time series


df.plot(figsize=(10, 5), title='Synthetic Time Series Data')
plt.show()
Check for Stationarity
# Perform Augmented Dickey-Fuller test
adf_test = adfuller(df['Value'])
print("ADF Statistic:", adf_test[0])
print("p-value:", adf_test[1])

ADF Statistic: -1.3583317659818994


p-value: 0.6020814791099097

If p-value > 0.05, data is non-stationary and needs differencing


Differencing for Stationarity (if needed)

if adf_test[1] > 0.05:
    df['Value_diff'] = df['Value'].diff().dropna()
    df['Value_diff'].plot(figsize=(10, 5), title='Differenced Time Series')
    plt.show()
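Before fixing (p, d, q), the ACF and PACF of the differenced series are normally inspected; plot_acf and plot_pacf were imported above but never called, so the following short sketch is added to fill that gap.

# ACF suggests the MA order (q); PACF suggests the AR order (p)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(df['Value'].diff().dropna(), lags=20, ax=axes[0])
plot_pacf(df['Value'].diff().dropna(), lags=20, ax=axes[1])
plt.show()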

Fit ARIMA Model


# Define ARIMA model (using p=2, d=1, q=2 as an example)
model = ARIMA(df['Value'], order=(2, 1, 2)) # (p, d, q)
model_fit = model.fit()
print(model_fit.summary())

SARIMAX Results
==============================================================================
Dep. Variable: Value No. Observations: 100
Model: ARIMA(2, 1, 2) Log Likelihood -130.434
Date: Sat, 22 Mar 2025 AIC 270.869
Time: 22:03:24 BIC 283.845
Sample: 01-01-2020 HQIC 276.119
- 04-09-2020
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.1299 0.283 0.459 0.646 -0.425 0.684
ar.L2 0.8677 0.230 3.776 0.000 0.417 1.318
ma.L1 -0.0583 0.330 -0.177 0.860 -0.705 0.588
ma.L2 -0.9353 0.281 -3.323 0.001 -1.487 -0.384
sigma2 0.8134 0.148 5.502 0.000 0.524 1.103
===================================================================================
Ljung-Box (L1) (Q): 0.46 Jarque-Bera (JB): 0.33
Prob(Q): 0.50 Prob(JB): 0.85
Heteroskedasticity (H): 1.02 Skew: -0.14
Prob(H) (two-sided): 0.96 Kurtosis: 3.07
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Forecasting Future Values


# Forecast the next 10 time steps
forecast = model_fit.forecast(steps=10)
print("Forecasted Values:", forecast)

Forecasted Values: 2020-04-10 -10.395337


2020-04-11 -10.435398
2020-04-12 -10.449871
2020-04-13 -10.486512
2020-04-14 -10.503829
2020-04-15 -10.537871
2020-04-16 -10.557318
2020-04-17 -10.589381
2020-04-18 -10.610419
2020-04-19 -10.640973
Freq: D, Name: predicted_mean, dtype: float64

# Plot actual vs. forecasted values


plt.figure(figsize=(10, 5))
plt.plot(df.index, df['Value'], label='Actual Data')
forecast_index = pd.date_range(df.index[-1], periods=11, freq='D')[1:] # Ensure correct dimensions
plt.plot(forecast_index, forecast, label='Forecast', color='red')
plt.legend()
plt.title('ARIMA Forecast')
plt.show()
Conclusion

The ARIMA model successfully fits and forecasts time series data by identifying trends and patterns. We checked for stationarity and applied differencing where necessary. ACF and PACF plots help in selecting appropriate ARIMA parameters. The model was trained and used to forecast future values. This implementation demonstrates the effectiveness of ARIMA for time series forecasting in various applications.

8. Object segmentation using hierarchical based methods
Install Required Libraries

pip install scikit-image matplotlib

Defaulting to user installation because normal site-packages is not writeable. Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: scikit-image (0.25.2), matplotlib (3.9.2), and their dependencies.

Import Necessary Libraries

import numpy as np
import matplotlib.pyplot as plt
from skimage import io, segmentation, color
from skimage.filters import sobel

Load an Image

# Load an image from a local file

image_path = 'panda.jpg'
image = io.imread(image_path)

# Display the original image


plt.imshow(image)
plt.axis('off')
plt.title('Original Image')
plt.show()
Convert the Image to Grayscale
# Convert to grayscale
gray_image = color.rgb2gray(image)
gray_image

array([[0.38707765, 0.39492078, 0.41117255, ..., 0.60005961, 0.58437333,


0.56868706],
[0.4013651 , 0.40528667, 0.41509412, ..., 0.60790275, 0.5882949 ,
0.5765302 ],
[0.43609373, 0.43609373, 0.44058078, ..., 0.60790275, 0.59221647,
0.58409059],
...,
[0.05546745, 0.05546745, 0.05154588, ..., 0.6479702 , 0.64012706,
0.63228392],
[0.06331059, 0.06331059, 0.05938902, ..., 0.64012706, 0.63228392,
0.62836235],
[0.06723216, 0.06723216, 0.06331059, ..., 0.63620549, 0.63228392,
0.62836235]])

Compute Edges Using the Sobel Filter


edges = sobel(gray_image)

# Display the edges


plt.imshow(edges, cmap='gray')
plt.axis('off')
plt.title('Edge Detection (Sobel Filter)')
plt.show()

Perform Over-Segmentation Using SLIC Superpixels


segments_slic = segmentation.slic(image, n_segments=300, compactness=10, sigma=1)

# Here:
#n_segments=300 → Controls the number of superpixels.

#compactness=10 → Higher values make segments more compact.

#sigma=1 → Adds smoothing to the image.

Display the Segmented Image


# Display results
plt.imshow(color.label2rgb(segments_slic, image, kind='avg'))
plt.axis('off')
plt.title('SLIC Superpixel Segmentation')
plt.show()
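SLIC on its own only over-segments the image; to make the pipeline genuinely hierarchical, the superpixels can be merged bottom-up on a region adjacency graph (RAG). The sketch below is an added illustration and assumes a scikit-image release that exposes rag_mean_color and merge_hierarchical in skimage.graph, as recent versions do.

from skimage import graph

def merge_mean_color(g, src, dst):
    # Fold src's colour statistics into dst when two regions merge
    g.nodes[dst]['total color'] += g.nodes[src]['total color']
    g.nodes[dst]['pixel count'] += g.nodes[src]['pixel count']
    g.nodes[dst]['mean color'] = (g.nodes[dst]['total color'] /
                                  g.nodes[dst]['pixel count'])

def weight_mean_color(g, src, dst, n):
    # Edge weight: colour distance between the merged region and its neighbour
    diff = g.nodes[dst]['mean color'] - g.nodes[n]['mean color']
    return {'weight': np.linalg.norm(diff)}

# Build a RAG over the SLIC superpixels and merge similar neighbouring regions
rag = graph.rag_mean_color(image, segments_slic)
labels_merged = graph.merge_hierarchical(segments_slic, rag, thresh=35,
                                         rag_copy=False, in_place_merge=True,
                                         merge_func=merge_mean_color,
                                         weight_func=weight_mean_color)

plt.imshow(color.label2rgb(labels_merged, image, kind='avg'))
plt.axis('off')
plt.title('Hierarchical Merging of SLIC Superpixels')
plt.show()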

Conclusion

Hierarchical object segmentation breaks an image into meaningful regions, improving object recognition. We used SLIC superpixels, which efficiently cluster similar pixels while preserving boundaries, as the over-segmentation stage; merging those superpixels (as sketched above) yields progressively coarser object regions. This approach is useful in applications such as medical imaging and object detection.

9. Perform Visualization techniques (types of maps - Bar, Column, Line, Scatter, 3D Cubes etc)

Step 1: Install Required Libraries


Make sure you have the necessary libraries installed:
pip install matplotlib seaborn numpy pandas

Defaulting to user installation because normal site-packages is not writeable.
Requirement already satisfied: matplotlib (3.9.2), seaborn (0.13.2), numpy (1.26.4), pandas (2.2.2), and their dependencies.
Note: you may need to restart the kernel to use updated packages.

Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from mpl_toolkits.mplot3d.art3d import Poly3DCollection

Create Sample Data


# Sample dataset
np.random.seed(42)
categories = ['A', 'B', 'C', 'D', 'E']
values = np.random.randint(10, 100, size=5)
x = np.arange(len(categories))
y = np.random.randint(10, 100, size=10)
z = np.linspace(1, 10, 10)
w = np.random.randint(10, 50, size=10)

# DataFrame for seaborn plots


df = pd.DataFrame({'Category': np.random.choice(categories, 50), 'Value': np.random.randint(10, 100, 50)})

df

Category Value

0 C 98

1 B 69

2 D 50

3 D 38

4 C 24

5 D 54

6 D 74
7 A 98

8 C 80

9 E 18

10 C 97

11 E 10

12 A 17

13 B 97

14 D 72

15 A 20

16 D 90

17 B 17

18 B 44

19 A 44

20 B 42

21 E 14

22 B 50

23 D 37

24 D 16

25 D 82

26 D 81

27 E 21

28 C 43

29 A 42

30 D 57

31 B 32

32 D 71

33 B 97

34 B 46

35 D 53

36 E 95

37 B 44

38 B 74

39 D 56

40 B 87

41 B 12

42 D 10

43 D 14

44 A 99

45 E 23

46 E 36

47 B 18

48 E 88

49 B 24

10 Visualization Techniques

1. Bar Chart
plt.figure(figsize=(6, 4))
plt.bar(categories, values, color='skyblue')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Chart')
plt.show()

2. Column Chart (Horizontal Bar Chart)

plt.figure(figsize=(6, 4))
plt.barh(categories, values, color='salmon')
plt.xlabel('Values')
plt.ylabel('Categories')
plt.title('Column Chart')
plt.show()

3. Line Plot
plt.figure(figsize=(6, 4))
plt.plot(z, w, marker='o', linestyle='-', color='green')
plt.xlabel('X-Axis')
plt.ylabel('Y-Axis')
plt.title('Line Chart')
plt.show()
4. Scatter Plot

plt.figure(figsize=(6, 4))
plt.scatter(y, w, color='purple', alpha=0.7)
plt.xlabel('X-Axis')
plt.ylabel('Y-Axis')
plt.title('Scatter Plot')
plt.show()

5. Histogram

plt.figure(figsize=(6, 4))
plt.hist(df['Value'], bins=8, color='orange', alpha=0.7)
plt.xlabel('Value Ranges')
plt.ylabel('Frequency')
plt.title('Histogram')
plt.show()
6. Box Plot
plt.figure(figsize=(6, 4))
# Assign x to hue with legend=False to avoid seaborn's palette deprecation warning
sns.boxplot(x='Category', y='Value', data=df, hue='Category', palette="Set2", legend=False)
plt.title('Box Plot')
plt.show()


7. Violin Plot
plt.figure(figsize=(6, 4))
sns.violinplot(x='Category', y='Value', data=df, hue='Category', palette="muted", legend=False)
plt.title('Violin Plot')
plt.show()



8. Pie Chart
plt.figure(figsize=(6, 4))
plt.pie(values, labels=categories, autopct='%1.1f%%', colors=['red', 'blue', 'green', 'yellow', 'purple'])
plt.title('Pie Chart')
plt.show()

9. Heatmap

plt.figure(figsize=(6, 4))
data = np.random.rand(5, 5)
sns.heatmap(data, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Heatmap')
plt.show()
10. 3D Cube Visualization
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111, projection='3d')

# Define cube vertices


vertices = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0],
[0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 1, 1]]

# Define cube faces


faces = [[vertices[j] for j in [0, 1, 2, 3]],
[vertices[j] for j in [4, 5, 6, 7]],
[vertices[j] for j in [0, 1, 5, 4]],
[vertices[j] for j in [2, 3, 7, 6]],
[vertices[j] for j in [0, 3, 7, 4]],
[vertices[j] for j in [1, 2, 6, 5]]]

# Draw cube
ax.add_collection3d(Poly3DCollection(faces, alpha=0.3, linewidths=1, edgecolors='r'))
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
ax.set_title('3D Cube')
plt.show()

Conclusion
This script demonstrates 10 visualization techniques to analyze data effectively:

✔ Bar & Column Charts – Show category-wise values.

✔ Line & Scatter Plots – Useful for trends and correlations.

✔ Histograms & Box Plots – Help understand data distribution.

✔ Violin Plots & Heatmaps – Provide in-depth insights into variable relationships.

✔ Pie Charts & 3D Cubes – Visually represent proportions and 3D objects.

10. Perform Descriptive analytics on healthcare data
Install Required Libraries

pip install pandas numpy seaborn matplotlib

Defaulting to user installation because normal site-packages is not writeable.
Requirement already satisfied: pandas (2.2.2), numpy (1.26.4), seaborn (0.13.2), matplotlib (3.9.2), and their dependencies.
Note: you may need to restart the kernel to use updated packages.

Import Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Step 3: Load Sample Healthcare Data

For this analysis, let's create a synthetic healthcare dataset containing:


Age (Patient's age)

Gender (Male/Female)

Blood Pressure (BP)

Cholesterol Level

Diabetes (Yes/No)

Hospital Stay (Days spent in hospital)

# Create a synthetic healthcare dataset


np.random.seed(42)

data = {
'Age': np.random.randint(20, 80, 100),
'Gender': np.random.choice(['Male', 'Female'], 100),
'Blood_Pressure': np.random.randint(90, 180, 100),
'Cholesterol': np.random.randint(150, 300, 100),
'Diabetes': np.random.choice(['Yes', 'No'], 100),
'Hospital_Stay': np.random.randint(1, 15, 100)
}

df = pd.DataFrame(data)

# Display the first few rows of the dataset


print(df.head())
Age Gender Blood_Pressure Cholesterol Diabetes Hospital_Stay
0 58 Female 104 191 No 9
1 71 Male 132 248 Yes 8
2 48 Female 118 156 No 13
3 34 Male 125 293 Yes 5
4 62 Female 102 239 No 1

Perform Descriptive Analytics

1. Basic Statistical Summary

# Summary statistics
print(df.describe())

Age Blood_Pressure Cholesterol Hospital_Stay


count 100.000000 100.000000 100.000000 100.000000
mean 49.580000 134.090000 227.250000 7.590000
std 18.031499 26.413608 44.758922 4.330057
min 21.000000 90.000000 151.000000 1.000000
25% 34.000000 113.000000 186.000000 4.000000
50% 48.000000 133.500000 234.000000 8.000000
75% 66.000000 156.250000 265.000000 12.000000
max 79.000000 179.000000 297.000000 14.000000

2. Count of Male vs Female Patients

# Count of Gender distribution


gender_count = df['Gender'].value_counts()
print(gender_count)

# Visualization
plt.figure(figsize=(5, 4))
sns.countplot(x='Gender', data=df, hue='Gender', palette='coolwarm', legend=False)
plt.title('Gender Distribution')
plt.show()

Gender
Male 61
Female 39
Name: count, dtype: int64

3. Distribution of Age

plt.figure(figsize=(6, 4))
sns.histplot(df['Age'], bins=10, kde=True, color='blue')
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution of Patients')
plt.show()
4. Average Hospital Stay Based on Diabetes Condition
# Group by diabetes and compute mean hospital stay
print(df.groupby('Diabetes')['Hospital_Stay'].mean())

# Visualization
plt.figure(figsize=(5, 4))
sns.boxplot(x='Diabetes', y='Hospital_Stay', data=df, hue='Diabetes', palette='Set2', legend=False)
plt.title('Hospital Stay Duration Based on Diabetes')
plt.show()

Diabetes
No 7.551724
Yes 7.642857
Name: Hospital_Stay, dtype: float64

5. Relationship Between Blood Pressure & Cholesterol

plt.figure(figsize=(6, 4))
sns.scatterplot(x='Blood_Pressure', y='Cholesterol', hue='Diabetes', data=df, palette='coolwarm')
plt.title('Blood Pressure vs Cholesterol Level')
plt.xlabel('Blood Pressure')
plt.ylabel('Cholesterol Level')
plt.show()
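To put a number on this relationship rather than judging the scatter plot by eye, the correlation can be computed directly; this small check is an addition to the original program.

# Pearson correlation between blood pressure and cholesterol
corr = df['Blood_Pressure'].corr(df['Cholesterol'])
print(f"Correlation between Blood Pressure and Cholesterol: {corr:.3f}")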
Conclusion
Using Descriptive Analytics, we derived key insights:

✔ Gender Distribution: The sample contains more male patients (61) than female patients (39).

✔ Age Distribution: Most patients fall between 30 and 70 years.

✔ Diabetes & Hospital Stay: Patients with diabetes stay only marginally longer on average (7.64 vs 7.55 days).

✔ Blood Pressure vs Cholesterol: No strong relationship appears, which is expected because the synthetic variables were generated independently.

11. Perform Predictive analytics on Product Sales data

Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from statsmodels.tsa.holtwinters import ExponentialSmoothing

Step 1: Load Sample Sales Data

file_path = "sales_data_sample.csv"
df = pd.read_csv(file_path, encoding='latin1')

EDA

Step 2: Basic Data Exploration

print("Dataset Overview:\n", df.head()) # Display first few rows


print("\nSummary Statistics:\n", df.describe()) # Summary statistics
print("\nMissing Values:\n", df.isnull().sum()) # Check for missing values

Dataset Overview:
Day MONTH_ID YEAR_ID QUANTITYORDERED PRICEEACH SALES
0 24 2 2003 30 95.70 2871.00
1 7 5 2003 34 81.35 2765.90
2 1 7 2003 41 94.74 3884.34
3 25 8 2003 45 83.26 3746.70
4 10 10 2003 49 100.00 5205.27

Summary Statistics:
Day MONTH_ID YEAR_ID QUANTITYORDERED PRICEEACH \
count 2823.000000 2823.000000 2823.00000 2823.000000 2823.000000
mean 14.291534 7.092455 2003.81509 35.092809 83.658544
std 8.777409 3.656633 0.69967 9.741443 20.174277
min 1.000000 1.000000 2003.00000 6.000000 26.880000
25% 6.000000 4.000000 2003.00000 27.000000 68.860000
50% 14.000000 8.000000 2004.00000 35.000000 95.700000
75% 21.000000 11.000000 2004.00000 43.000000 100.000000
max 31.000000 12.000000 2005.00000 97.000000 100.000000

SALES
count 2823.000000
mean 3553.889072
std 1841.865106
min 482.130000
25% 2203.430000
50% 3184.800000
75% 4508.000000
max 14082.800000

Missing Values:
Day 0
MONTH_ID 0
YEAR_ID 0
QUANTITYORDERED 0
PRICEEACH 0
SALES 0
dtype: int64

Step 3: Data Cleaning

# Convert ORDERDATE to datetime format (if available)


df['ORDERDATE'] = pd.to_datetime(df['ORDERDATE'], errors='coerce')

# Drop rows with missing values


df.dropna(inplace=True)

Step 4: Visualizing Sales Trends


plt.figure(figsize=(12, 6))
sns.lineplot(x=df['ORDERDATE'], y=df['SALES'])
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Sales Trend Over Time')
plt.xticks(rotation=45)
plt.show()
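ExponentialSmoothing was imported at the top of this section but never used; the sketch below shows how monthly sales could be smoothed and projected with it. It is purely illustrative and assumes the ORDERDATE column parsed correctly and that the data covers at least two full yearly cycles.

# Aggregate sales by month and fit a Holt-Winters model with additive trend and seasonality
monthly_sales = df.set_index('ORDERDATE')['SALES'].resample('MS').sum()
hw_model = ExponentialSmoothing(monthly_sales, trend='add',
                                seasonal='add', seasonal_periods=12).fit()
hw_forecast = hw_model.forecast(6)  # project the next six months
print(hw_forecast)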

Step 5: Correlation Analysis

plt.figure(figsize=(10, 6))
sns.heatmap(df[['SALES', 'QUANTITYORDERED', 'PRICEEACH', 'MONTH_ID', 'YEAR_ID']].corr(),
            annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()

Step 6: Distribution of Sales

plt.figure(figsize=(8, 5))
sns.histplot(df['SALES'], bins=30, kde=True)
plt.xlabel('Sales')
plt.ylabel('Frequency')
plt.title('Sales Distribution')
plt.show()

Step 7: Boxplot for Outliers

plt.figure(figsize=(8, 5))
sns.boxplot(y=df['SALES'])
plt.title('Sales Outlier Detection')
plt.show()

print("EDA Completed! Insights generated.")

EDA Completed! Insights generated.

Step 8: Predictive Analysis

# Selecting Features & Target


features = ['QUANTITYORDERED', 'PRICEEACH', 'MONTH_ID', 'YEAR_ID']
X = df[features] # Independent variables
y = df['SALES'] # Target variable

# Splitting the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 9: Train the Model


model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

Step 10: Evaluate Model Performance


y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Display Performance Metrics in Tabular Form


performance_df = pd.DataFrame({
'Metric': ['Mean Absolute Error', 'Mean Squared Error', 'Root Mean Squared Error', 'R2 Score'],
'Value': [mae, mse, rmse, r2]
})
print(performance_df)

Metric Value
0 Mean Absolute Error 6.557240e+02
1 Mean Squared Error 1.019664e+06
2 Root Mean Squared Error 1.009784e+03
3 R2 Score 7.396571e-01

Step 11: Predict Future Sales

future_data = pd.DataFrame({'QUANTITYORDERED': [30, 50, 70],


'PRICEEACH': [100, 200, 150],
'MONTH_ID': [4, 5, 6],
'YEAR_ID': [2025, 2025, 2025]})
future_predictions = model.predict(future_data)
print("Future Sales Predictions:", future_predictions)

Future Sales Predictions: [ 5303.62273106 12792.98585246 11670.12423585]

future_data['PREDICTED_SALES'] = future_predictions
print("\nFuture Sales Predictions:\n")
print(future_data.to_string(index=False))

Future Sales Predictions:

QUANTITYORDERED PRICEEACH MONTH_ID YEAR_ID PREDICTED_SALES


30 100 4 2025 5303.622731
50 200 5 2025 12792.985852
70 150 6 2025 11670.124236

Conclusion
print("\nConclusion:\n")
print("1. The EDA revealed strong correlations between Quantity Ordered, Price Each, and Sales.\n")
print("2. The Linear Regression model achieved an R2 score of {:.2f}, indicating {} predictive accuracy.\n".format(
print("3. Future sales predictions highlight expected revenue based on given input values.\n")
print("4. Businesses can leverage these insights to optimize pricing, inventory, and sales strategies.")

Conclusion:

1. The EDA revealed strong correlations between Quantity Ordered, Price Each, and Sales.

2. The Linear Regression model achieved an R2 score of 0.74, indicating good predictive accuracy.

3. Future sales predictions highlight expected revenue based on given input values.

4. Businesses can leverage these insights to optimize pricing, inventory, and sales strategies.

12. Apply Predictive analytics for Weather forecasting.
Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Step 1: Load the dataset


file_path = 'weather.csv' # Update this if needed
df = pd.read_csv(file_path)

# Step 2: Explore the dataset


print("Data Overview:")
print(df.head())
print("\nMissing Values:")
print(df.isnull().sum())
print("\nDataset Info:")
print(df.info())
Data Overview:
origin year month day hour temp dewp humid wind_dir wind_speed \
0 EWR 2013 1 1 0 37.04 21.92 53.97 230.0 10.35702
1 EWR 2013 1 1 1 37.04 21.92 53.97 230.0 13.80936
2 EWR 2013 1 1 2 37.94 21.92 52.09 230.0 12.65858
3 EWR 2013 1 1 3 37.94 23.00 54.51 230.0 13.80936
4 EWR 2013 1 1 4 37.94 24.08 57.04 240.0 14.96014

wind_gust precip pressure visib time_hour


0 11.918651 0.0 1013.9 10.0 1/01/2013 1:00
1 15.891535 0.0 1013.0 10.0 1/01/2013 2:00
2 14.567241 0.0 1012.6 10.0 1/01/2013 3:00
3 15.891535 0.0 1012.7 10.0 1/01/2013 4:00
4 17.215830 0.0 1012.8 10.0 1/01/2013 5:00

Missing Values:
origin 0
year 0
month 0
day 0
hour 0
temp 1
dewp 1
humid 1
wind_dir 418
wind_speed 3
wind_gust 3
precip 0
pressure 2730
visib 0
time_hour 0
dtype: int64

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26130 entries, 0 to 26129
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 origin 26130 non-null object
1 year 26130 non-null int64
2 month 26130 non-null int64
3 day 26130 non-null int64
4 hour 26130 non-null int64
5 temp 26129 non-null float64
6 dewp 26129 non-null float64
7 humid 26129 non-null float64
8 wind_dir 25712 non-null float64
9 wind_speed 26127 non-null float64
10 wind_gust 26127 non-null float64
11 precip 26130 non-null float64
12 pressure 23400 non-null float64
13 visib 26130 non-null float64
14 time_hour 26130 non-null object
dtypes: float64(9), int64(4), object(2)
memory usage: 3.0+ MB
None

# Step 3: Preprocessing - Handle missing values (if any)


df = df.dropna()

# Step 4: Select features and target variable


features = ['year', 'month', 'day', 'hour', 'dewp', 'humid', 'wind_dir', 'wind_speed', 'wind_gust', 'precip', 'pressure']
target = 'temp'  # Predicting temperature

X = df[features]
y = df[target]

# Step 5: Train-test split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Feature Scaling


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 7: Train the Model (Random Forest)


model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

RandomForestRegressor(random_state=42)

# Step 8: Make Predictions


y_pred = model.predict(X_test)

# Step 9: Evaluate the Model


mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print("Model Evaluation:")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")

Model Evaluation:
Mean Absolute Error (MAE): 0.10836664495115607
Mean Squared Error (MSE): 0.07925993917915358
Root Mean Squared Error (RMSE): 0.2815314177479195

from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print(f"R² Score (Accuracy): {r2}")

R² Score (Accuracy): 0.9997561069719265

# Step 10: Visualizing Predictions


plt.figure(figsize=(15,10))

<Figure size 1500x1000 with 0 Axes>


<Figure size 1500x1000 with 0 Axes>

# Scatter plot of Actual vs Predicted values


plt.subplot(2,2,1)
sns.scatterplot(x=y_test, y=y_pred)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs Predicted Temperature")

Text(0.5, 1.0, 'Actual vs Predicted Temperature')

# 3D Visualization - Differentiating Actual and Predicted Values


fig = plt.figure(figsize=(10,7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_test[:, features.index('humid')], y_test, y_test, c='blue', marker='o', label='Actual Values')
ax.scatter(X_test[:, features.index('humid')], y_test, y_pred, c='red', marker='^', label='Predicted Values')
ax.set_xlabel("Humidity")
ax.set_ylabel("Actual Temperature")
ax.set_zlabel("Predicted Temperature")
ax.set_title("3D Visualization: Humidity vs Actual & Predicted Temperature")
ax.legend()
plt.show()
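The final conclusion names humidity, wind speed, and pressure as key drivers; the quick sketch below (an addition) checks that claim against the model's own impurity-based feature importances.

# Rank the weather features by the Random Forest's impurity-based importance
importances = pd.Series(model.feature_importances_, index=features).sort_values()

plt.figure(figsize=(8, 5))
importances.plot(kind='barh', color='steelblue')
plt.xlabel('Feature Importance Score')
plt.title('Random Forest Feature Importances for Temperature Prediction')
plt.show()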
# Final Conclusion
print("\nFinal Conclusion:")
print(f"The Random Forest model effectively predicts temperature with an accuracy (R² Score) of {r2:.2f}. "
      f"Key features like humidity, wind speed, and pressure significantly impact predictions. "
      f"Visualizations confirm a strong correlation between actual and predicted values.")

Final Conclusion:
The Random Forest model effectively predicts temperature with an accuracy (R² Score) of 1.00. Key features like
humidity, wind speed, and pressure significantly impact predictions. Visualizations confirm a strong correlation
between actual and predicted values.

