
DEBRE BERHAN UNIVERSITY

COLLEGE OF COMPUTING
DEPARTMENT OF SOFTWARE ENGINEERING
FUNDAMENTALS OF MACHINE LEARNING
COURSE CODE: SEng4091

NAME                        ID NO.
1. Hassen Muhammed          DBUR/0280/13
2. Firdiwek Sisay           DBUR/1510/13
3. Yewoynhareg Mulugeta     DBUR/0035/13
4. Khadar Muhammed          DBUR/3689/13
5. Haileyesus Demes         DBUR/0241/13

Submitted to: Kinde B. (PhD)

Submission date: 06/07/2024
1. Introduction to data preprocessing
#Group members id
#Hassen Muhammed DBUR/0280/13
#Firdiwek Sisay DBUR/1510/13
#Yewoynhareg Mulugeta DBUR/0035/13
#Khadar Muhammed DBUR/3689/13
#Haileyesus Demes DBUR/0241/13

# Import necessary libraries


import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset


# Replace "iris.data" below with the actual path to the file on your
# local machine in case you want to run the code.
# dataset_path = 'path_to_iris.data'
column_names = ['Hassen', 'Firdiwek', 'Yewoyn hareg',
                'Hayleyesus', 'Khadar']
iris_data = pd.read_csv("iris.data", header=None,
                        names=column_names)
# Display the number of rows and columns
rows, columns = iris_data.shape
print(f"Number of rows: {rows}")
print(f"Number of columns: {columns}")

# Display the first five records
print("\nFirst five records:")
print(iris_data.head())
# Display the last five records
print("\nLast five records:")
print(iris_data.tail())

# Display the first ten records
print("\nFirst ten records:")
print(iris_data.head(10))

# Display the statistical summary of the dataset


print("\nStatistical summary:") where is the import statement code?
print(iris_data.describe())

# Display the count of each class in the dataset


print("\nClass count:") where is the import statement code?
print(iris_data['Khadar'].value_counts())
# Extract the independent features (all except the class label)
X = iris_data.drop(columns=['Khadar'])
print(iris_data.iloc[:, :-1].values)
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]
[5.4 3.7 1.5 0.2]
[4.8 3.4 1.6 0.2]
[4.8 3. 1.4 0.1]
[4.3 3. 1.1 0.1]
[5.8 4. 1.2 0.2]
[5.7 4.4 1.5 0.4]
[5.4 3.9 1.3 0.4]
[5.1 3.5 1.4 0.3]
[5.7 3.8 1.7 0.3]
[5.1 3.8 1.5 0.3]
[5.4 3.4 1.7 0.2]
[5.1 3.7 1.5 0.4]
[4.6 3.6 1. 0.2]
[5.1 3.3 1.7 0.5]
[4.8 3.4 1.9 0.2]
[5. 3. 1.6 0.2]
[5. 3.4 1.6 0.4]
[5.2 3.5 1.5 0.2]
[5.2 3.4 1.4 0.2]
[4.7 3.2 1.6 0.2]
[4.8 3.1 1.6 0.2]
[5.4 3.4 1.5 0.4]
[5.2 4.1 1.5 0.1]
[5.5 4.2 1.4 0.2]
[4.9 3.1 1.5 0.1]
[5. 3.2 1.2 0.2]
[5.5 3.5 1.3 0.2]
[4.9 3.1 1.5 0.1]
[4.4 3. 1.3 0.2]
[5.1 3.4 1.5 0.2]
[5. 3.5 1.3 0.3]
[4.5 2.3 1.3 0.3]
[4.4 3.2 1.3 0.2]
[5. 3.5 1.6 0.6]
[5.1 3.8 1.9 0.4]
[4.8 3. 1.4 0.3]
[5.1 3.8 1.6 0.2]
[4.6 3.2 1.4 0.2]
[5.3 3.7 1.5 0.2]
[5. 3.3 1.4 0.2]
[7. 3.2 4.7 1.4]
[6.4 3.2 4.5 1.5]
[6.9 3.1 4.9 1.5]
[5.5 2.3 4. 1.3]
[6.5 2.8 4.6 1.5]
[5.7 2.8 4.5 1.3]
[6.3 3.3 4.7 1.6]
[4.9 2.4 3.3 1. ]
[6.6 2.9 4.6 1.3]
[5.2 2.7 3.9 1.4]
[5. 2. 3.5 1. ]
[5.9 3. 4.2 1.5]
[6. 2.2 4. 1. ]
[6.1 2.9 4.7 1.4]
[5.6 2.9 3.6 1.3]
[6.7 3.1 4.4 1.4]
[5.6 3. 4.5 1.5]
[5.8 2.7 4.1 1. ]
[6.2 2.2 4.5 1.5]
[5.6 2.5 3.9 1.1]
[5.9 3.2 4.8 1.8]
[6.1 2.8 4. 1.3]
[6.3 2.5 4.9 1.5]
[6.1 2.8 4.7 1.2]
[6.4 2.9 4.3 1.3]
[6.6 3. 4.4 1.4]
[6.8 2.8 4.8 1.4]
[6.7 3. 5. 1.7]
[6. 2.9 4.5 1.5]
[5.7 2.6 3.5 1. ]
[5.5 2.4 3.8 1.1]
[5.5 2.4 3.7 1. ]
[5.8 2.7 3.9 1.2]
[6. 2.7 5.1 1.6]
[5.4 3. 4.5 1.5]
[6. 3.4 4.5 1.6]
[6.7 3.1 4.7 1.5]
[6.3 2.3 4.4 1.3]
[5.6 3. 4.1 1.3]
[5.5 2.5 4. 1.3]
[5.5 2.6 4.4 1.2]
[6.1 3. 4.6 1.4]
[5.8 2.6 4. 1.2]
[5. 2.3 3.3 1. ]
[5.6 2.7 4.2 1.3]
[5.7 3. 4.2 1.2]
[5.7 2.9 4.2 1.3]
[6.2 2.9 4.3 1.3]
[5.1 2.5 3. 1.1]
[5.7 2.8 4.1 1.3]
[6.3 3.3 6. 2.5]
[5.8 2.7 5.1 1.9]
[7.1 3. 5.9 2.1]
[6.3 2.9 5.6 1.8]
[6.5 3. 5.8 2.2]
[7.6 3. 6.6 2.1]
[4.9 2.5 4.5 1.7]
[7.3 2.9 6.3 1.8]
[6.7 2.5 5.8 1.8]
[7.2 3.6 6.1 2.5]
[6.5 3.2 5.1 2. ]
[6.4 2.7 5.3 1.9]
[6.8 3. 5.5 2.1]
[5.7 2.5 5. 2. ]
[5.8 2.8 5.1 2.4]
[6.4 3.2 5.3 2.3]
[6.5 3. 5.5 1.8]
[7.7 3.8 6.7 2.2]
[7.7 2.6 6.9 2.3]
[6. 2.2 5. 1.5]
[6.9 3.2 5.7 2.3]
[5.6 2.8 4.9 2. ]
[7.7 2.8 6.7 2. ]
[6.3 2.7 4.9 1.8]
[6.7 3.3 5.7 2.1]
[7.2 3.2 6. 1.8]
[6.2 2.8 4.8 1.8]
[6.1 3. 4.9 1.8]
[6.4 2.8 5.6 2.1]
[7.2 3. 5.8 1.6]
[7.4 2.8 6.1 1.9]
[7.9 3.8 6.4 2. ]
[6.4 2.8 5.6 2.2]
[6.3 2.8 5.1 1.5]
[6.1 2.6 5.6 1.4]
[7.7 3. 6.1 2.3]
[6.3 3.4 5.6 2.4]
[6.4 3.1 5.5 1.8]
[6. 3. 4.8 1.8]
[6.9 3.1 5.4 2.1]
[6.7 3.1 5.6 2.4]
[6.9 3.1 5.1 2.3]
[5.8 2.7 5.1 1.9]
[6.8 3.2 5.9 2.3]
[6.7 3.3 5.7 2.5]
[6.7 3. 5.2 2.3]
[6.3 2.5 5. 1.9]
[6.5 3. 5.2 2. ]
[6.2 3.4 5.4 2.3]
[5.9 3. 5.1 1.8]]

# Extract the dependent feature (the class label)


y = iris_data['Khadar']
print(iris_data.iloc[:, 4].values)
['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica']

# Plotting the histograms


X.hist(figsize=(10, 8))
plt.suptitle("Histograms of Iris Dataset Features")
plt.show()

# Plotting the density plots
X.plot(kind='density', subplots=True, layout=(2, 2),
       sharex=False, figsize=(10, 8))
plt.suptitle("Density Plots of Iris Dataset Features")
plt.show()

# Plotting the boxplots
X.plot(kind='box', subplots=True, layout=(2, 2), sharex=False,
       sharey=False, figsize=(10, 8))
plt.suptitle("Boxplots of Iris Dataset Features")
plt.show()

2. Advanced data preprocessing

import numpy as np
import pandas as pd

from sklearn.impute import SimpleImputer, KNNImputer


from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Load the IRIS dataset


file_path = '/project2.csv'
column_names = ['khadar','furd','hassan','hylayasu','hareg']
iris = pd.read_csv(file_path, header=None, names=column_names)

# Step 2: Introduce missing values into each feature


# Read the CSV file
missing_data = pd.read_csv(file_path)
# Introduce missing values into each feature
# (note: sample(frac=0) selects no rows, so this loop does not actually blank
# any cells; the NaN entries printed below already exist in the CSV file)
for col in missing_data.columns:
    iris.loc[missing_data.sample(frac=0).index, col] = np.nan
print(missing_data[:10])

khadar hassan furdwik hylayassu hareg


0 NaN 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 NaN 3.2 NaN 0.2 Iris-setosa
3 4.6 3.1 NaN 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 NaN 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
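If the intent is for the loop itself to introduce the missing values, rather than rely on gaps already present in the CSV, a minimal sketch along these lines could be used. The 10% fraction and the demo frame are assumptions for illustration:

# Hypothetical sketch (not part of the submitted code): blank roughly 10% of
# each numeric column at random, working on a copy so the original frame is
# left untouched.
demo = missing_data.copy()
for col in demo.columns[:-1]:                      # skip the class-label column
    blank_idx = demo[col].sample(frac=0.1).index   # ~10% of rows, chosen at random
    demo.loc[blank_idx, col] = np.nan
print(demo.isna().sum())                           # NaN count per column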
# Step 3: Impute missing values with mean
imputer_mean = SimpleImputer(strategy='mean')
iris_mean_imputed = pd.read_csv(file_path)
iris_mean_imputed.iloc[:, :-1] = imputer_mean.fit_transform(
    iris_mean_imputed.iloc[:, :-1])
print(iris_mean_imputed[:10])
khadar hassan furdwik hylayassu hareg
0 5.856081 3.500000 1.400000 0.2 Iris-setosa
1 4.900000 3.000000 1.400000 0.2 Iris-setosa
2 5.856081 3.200000 3.790541 0.2 Iris-setosa
3 4.600000 3.100000 3.790541 0.2 Iris-setosa
4 5.000000 3.600000 1.400000 0.2 Iris-setosa
5 5.400000 3.048322 1.700000 0.4 Iris-setosa
6 4.600000 3.400000 1.400000 0.3 Iris-setosa
7 5.000000 3.400000 1.500000 0.2 Iris-setosa
8 4.400000 2.900000 1.400000 0.2 Iris-setosa
9 4.900000 3.100000 1.500000 0.1 Iris-setosa
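The repeated 5.856081 in the first column is that column's mean computed from its observed values, which is exactly what SimpleImputer(strategy='mean') substitutes for each NaN. A toy illustration with made-up numbers:

# Toy illustration (hypothetical data): the NaN is replaced by (1.0 + 4.0) / 2 = 2.5.
import numpy as np
from sklearn.impute import SimpleImputer
toy = np.array([[1.0], [np.nan], [4.0]])
print(SimpleImputer(strategy='mean').fit_transform(toy))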
# Step 4: Adjust precision to 2 decimal places
df_mean_imputed = iris_mean_imputed.copy()
df_mean_imputed = df_mean_imputed.round(2)
print(df_mean_imputed[:10])
khadar hassan furdwik hylayassu hareg
0 5.86 3.50 1.40 0.2 Iris-setosa
1 4.90 3.00 1.40 0.2 Iris-setosa
2 5.86 3.20 3.79 0.2 Iris-setosa
3 4.60 3.10 3.79 0.2 Iris-setosa
4 5.00 3.60 1.40 0.2 Iris-setosa
5 5.40 3.05 1.70 0.4 Iris-setosa
6 4.60 3.40 1.40 0.3 Iris-setosa
7 5.00 3.40 1.50 0.2 Iris-setosa
8 4.40 2.90 1.40 0.2 Iris-setosa
9 4.90 3.10 1.50 0.1 Iris-setosa
# Step 5: Impute missing values with the most frequent value
imputer = SimpleImputer(strategy='most_frequent')
df_most_frequent = pd.read_csv("/project2.csv")
df_most_frequent = pd.DataFrame(imputer.fit_transform(iris),
                                columns=column_names)

# Display the first few rows after imputation


print(df_most_frequent[:10])
khadar furd hassan hylayasu hareg
0 khadar hassan furdwik hylayassu hareg
1 5 3.5 1.4 0.2 Iris-setosa
2 4.9 3 1.4 0.2 Iris-setosa
3 5 3.2 1.5 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5 3.6 1.4 0.2 Iris-setosa
6 5.4 3 1.7 0.4 Iris-setosa
7 4.6 3.4 1.4 0.3 Iris-setosa
8 5 3.4 1.5 0.2 Iris-setosa
9 4.4 2.9 1.4 0.2 Iris-setosa
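Row 0 of this output is the file's own header line appearing as data, a side effect of reading the CSV with header=None although it already contains a header row. A minimal sketch of avoiding that, assuming the same file layout, is to let pandas consume the header:

# Hypothetical fix: let pandas use the file's own header row (the default
# header=0), so it is not carried along as a data row during imputation.
iris_with_header = pd.read_csv(file_path, header=0)
df_most_frequent_fixed = pd.DataFrame(
    SimpleImputer(strategy='most_frequent').fit_transform(iris_with_header),
    columns=iris_with_header.columns)
print(df_most_frequent_fixed[:10])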
# Step 6: Impute missing values with a constant value of 100
imputer = SimpleImputer(strategy='constant', fill_value=100)
df_constant = pd.read_csv('/content/project2.csv')
df_constant = pd.DataFrame(imputer.fit_transform(df_constant),
                           columns=df_constant.columns)

# Display the first few rows after imputation
print(df_constant[:10])

khadar hassan furdwik hylayassu hareg


0 100 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 100 3.2 100 0.2 Iris-setosa
3 4.6 3.1 100 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

# Step 7: Impute missing values with KNN where N=2


imputer_knn = KNNImputer(n_neighbors=2)
iris_knn_imputed = pd.read_csv('/content/project2.csv')
iris_knn_imputed.iloc[:, :-1] = imputer_knn.fit_transform(
    iris_knn_imputed.iloc[:, :-1])
print(iris_knn_imputed[:10])

khadar hassan furdwik hylayassu hareg


0 5.35 3.5 1.4 0.2 Iris-setosa
1 4.90 3.0 1.4 0.2 Iris-setosa
2 4.85 3.2 1.4 0.2 Iris-setosa
3 4.60 3.1 1.5 0.2 Iris-setosa
4 5.00 3.6 1.4 0.2 Iris-setosa
5 5.40 3.4 1.7 0.4 Iris-setosa
6 4.60 3.4 1.4 0.3 Iris-setosa
7 5.00 3.4 1.5 0.2 Iris-setosa
8 4.40 2.9 1.4 0.2 Iris-setosa
9 4.90 3.1 1.5 0.1 Iris-setosa
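As a sanity check on what KNNImputer(n_neighbors=2) does: each missing entry is filled with the mean of that column over the two rows closest to it, using a NaN-aware Euclidean distance on the observed columns. A toy example with made-up numbers:

# Toy illustration (hypothetical data, not the iris file):
import numpy as np
from sklearn.impute import KNNImputer

toy = np.array([[1.0, 2.0],
                [2.0, np.nan],
                [3.0, 6.0],
                [8.0, 8.0]])
print(KNNImputer(n_neighbors=2).fit_transform(toy))
# The NaN becomes (2.0 + 6.0) / 2 = 4.0, the mean of the second column over
# the two nearest rows, [1.0, 2.0] and [3.0, 6.0].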
# Step 8: Delete records with missing values
df_no_missing = missing_data.dropna()
print(df_no_missing[:10])
khadar hassan furdwik hylayassu hareg
1 4.9 3.0 1.4 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa
11 4.8 3.4 1.6 0.2 Iris-setosa
12 4.8 3.0 1.4 0.1 Iris-setosa
13 4.3 3.0 1.1 0.1 Iris-setosa
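Note that dropna() with no arguments removes any row containing at least one missing value, which is why indices 0, 2, 3 and 5 are absent above. If only fully empty rows should go, or only certain columns should be checked, pandas exposes that too; a small sketch (the column name is taken from the output headers above):

# dropna() options: how='all' keeps rows unless every value is missing;
# subset=[...] restricts the NaN check to the listed columns.
rows_all_missing_dropped = missing_data.dropna(how='all')
rows_checked_on_one_col = missing_data.dropna(subset=['hassan'])
print(len(missing_data), len(rows_all_missing_dropped), len(rows_checked_on_one_col))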
# Step 9: Min-Max Normalization
scaler_min_max = MinMaxScaler()
iris_min_max_normalized = df_mean_imputed.drop(columns=['hareg']).copy()
iris_min_max_normalized.iloc[:, :] = scaler_min_max.fit_transform(
    iris_min_max_normalized.iloc[:, :])
iris_min_max_normalized['hareg'] = df_mean_imputed['hareg']

print(iris_min_max_normalized[:10])

0 0.433333 0.625000 0.067797 0.041667 Iris-setosa


1 0.166667 0.416667 0.067797 0.041667 Iris-setosa
2 0.433333 0.500000 0.472881 0.041667 Iris-setosa
3 0.083333 0.458333 0.472881 0.041667 Iris-setosa
4 0.194444 0.666667 0.067797 0.041667 Iris-setosa
5 0.305556 0.437500 0.118644 0.125000 Iris-setosa
6 0.083333 0.583333 0.067797 0.083333 Iris-setosa
7 0.194444 0.583333 0.084746 0.041667 Iris-setosa
8 0.027778 0.375000 0.067797 0.041667 Iris-setosa
9 0.166667 0.458333 0.084746 0.000000 Iris-setosa
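For reference, MinMaxScaler applies x' = (x - min) / (max - min) column by column, so every feature lands in [0, 1]. A quick hand check with assumed numbers that mirror the first feature (minimum 4.3, maximum 7.9):

# Tiny hand check of the min-max formula (values assumed for illustration):
import numpy as np
col = np.array([4.3, 5.0, 7.9])
print((col - col.min()) / (col.max() - col.min()))  # ~[0.0, 0.194, 1.0]
# 0.194 is consistent with the scaled values printed above.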
# Step 10: Z-Score Normalization
scaler_z_score = StandardScaler()
df_z_score_scaled = pd.DataFrame(
    scaler_z_score.fit_transform(df_mean_imputed.drop(columns='hareg')),
    columns=df_mean_imputed.columns[:-1])
df_z_score_scaled['hareg'] = df_mean_imputed['hareg']

print(df_z_score_scaled[:10])

khadar hassan furdwik hylayassu hareg


0 0.004729 1.058877 -1.376256 -1.312977 Iris-setosa
1 -1.169357 -0.113312 -1.376256 -1.312977 Iris-setosa
2 0.004729 0.355564 -0.000307 -1.312977 Iris-setosa
3 -1.536259 0.121126 -0.000307 -1.312977 Iris-setosa
4 -1.047056 1.293314 -1.376256 -1.312977 Iris-setosa
5 -0.557854 0.003907 -1.203542 -1.050031 Iris-setosa
6 -1.536259 0.824439 -1.376256 -1.181504 Iris-setosa
7 -1.047056 0.824439 -1.318685 -1.312977 Iris-setosa
8 -1.780860 -0.347749 -1.376256 -1.312977 Iris-setosa
9 -1.169357 0.121126 -1.318685 -1.444450 Iris-setosa
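Similarly, StandardScaler applies z = (x - mean) / std per column (using the population standard deviation), so each scaled feature has mean 0 and standard deviation 1. A toy check with made-up numbers:

# Toy check of z-score standardization (hypothetical values):
import numpy as np
col = np.array([4.0, 5.0, 6.0])
print((col - col.mean()) / col.std())  # [-1.2247, 0.0, 1.2247]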
df = pd.DataFrame({'Age': [42, 15, 67, 55, 1, 29, 75, 89, 4, 10, 15, 38,
                           22, 77]})

print("Before Transformation: ")
print(df)
Before Transformation:
Age
0 42
1 15
2 67
3 55
4 1
5 29
6 75
7 89
8 4
9 10
10 15
11 38
12 22
13 77
Label = pd.cut(x=df['Age'], bins=[0, 3, 7, 17, 63, 99],
               labels=['Baby', 'Child', 'Teenage', 'Adult', 'Elderly'])

# Print the result after binning the continuous ages into categories
print("After: ")
print(Label)
After:
0 Adult
1 Teenage
2 Elderly
3 Adult
4 Baby
5 Adult
6 Elderly
7 Elderly
8 Child
9 Teenage
10 Teenage
11 Adult
12 Adult
13 Elderly
Name: Age, dtype: category
Categories (5, object): ['Baby' < 'Child' < 'Teenage' < 'Adult' <
'Elderly']
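pd.cut uses right-closed intervals by default, so bins=[0, 3, 7, 17, 63, 99] means (0, 3] is Baby, (3, 7] Child, (7, 17] Teenage, (17, 63] Adult and (63, 99] Elderly; that is why 15 maps to Teenage and 67 to Elderly above. A quick check on a few boundary ages:

# Quick boundary check for the bin edges used above:
import pandas as pd
print(pd.cut([3, 17, 18, 63, 64], bins=[0, 3, 7, 17, 63, 99],
             labels=['Baby', 'Child', 'Teenage', 'Adult', 'Elderly']))
# -> ['Baby', 'Teenage', 'Adult', 'Adult', 'Elderly']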
# Check the number of values in each bin
print("Categories: ")
print(Label.value_counts())
Categories:
Age
Adult 5
Elderly 4
Teenage 3
Baby 1
Child 1
Name: count, dtype: int64
data = pd.concat([df, Label], axis=1)
print ("\n \n \n Merged Data \n \n", data)

Merged Data

Age Age
0 42 Adult
1 15 Teenage
2 67 Elderly
3 55 Adult
4 1 Baby
5 29 Adult
6 75 Elderly
7 89 Elderly
8 4 Child
9 10 Teenage
10 15 Teenage
11 38 Adult
12 22 Adult
13 77 Elderly
