
CS552 Data Mining and Data Warehousing Lab MO 2024

CS551 Machine Learning Lab


Date: 20th Aug. 2024
Lab Assignment 1

Problem 1
Use the Pandas library in Python to create a data frame for detailed analysis of the
data set downloaded from the following URL:
https://www.espncricinfo.com/records/most-wickets-in-career-93276
i) Store the data from the tables into all_tables using the pd.read_html command.
Check the type of the data structures of the tables.
ii) Create a Data Frame df by storing the table and display the data frame.
iii) Display first 11 rows from the data frame and all the features from the data
frame.
iv) Convert the data frame into a NumPy array. Display all the names of the
players. Also display the names of the players along with the number of
wickets taken by each of them.
v) Display the details of the player located in the index 10.
vi) Create a new data frame df1 by setting the name of the players as row index.
vii) Display first 5 records from the new data frame df1 and print the detail records
of the player in the fifth position of the data frame.
viii) Find the total number of wickets taken by the player at index 10 in the data
frame df.
ix) Calculate the wicket per match of all players and create a new field
WicketPerMatch in the data frame.
x) Normalize the values of WicketPerMatch and append the result as an attribute in
the data frame.
xi) Represent the relationship between Strike Rate (SR) and Batting Average (Ave)
using scatterplot (import seaborn library).
xii) Extract country from the attribute “Player” and append as a separate attribute
in the data set.
xiii) Calculate the average wickets collected by each country.
xiv) Represent Strike Rate (SR) Vs Batting Average (Ave) using Scatter plot for
Australia.
xv) Display the records from the data set where SR is less than 55 and Ave is less
than 25.
xvi) Create new fields StartYear and EndYear to store the year when the player
started his career and the year of retirement. Use a function to extract the
proper data from the attribute “Span”. Find out the career length of each player
and store the values for each player.
xvii) Use histogram to represent the career length. Use 20 bins to show the
distribution.
xviii) Visually represent the number of wickets taken by each country. Use catplot()
of Seaborn library and represent wickets Vs country.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

all_tables = pd.read_html("https://www.espncricinfo.com/records/most-wickets-in-career-93276")
all_tables

[                     Player       Span  Mat  Inns  Balls   Overs  Mdns   Runs  Wkts    BBI    Ave  Econ     SR   4   5
 0    M Muralidaran (ICC/SL)  1992-2010  133   230  44039  7339.5  1794  18180   800   9/51  22.72  2.47  55.04  45  67
 1            SK Warne (AUS)  1992-2007  145   273  40705  6784.1  1761  17995   708   8/71  25.41  2.65  57.49  48  37
 2         JM Anderson (ENG)  2003-2024  188   350  40037  6672.5  1730  18627   704   7/42  26.45  2.79  56.87  32  32
 3            A Kumble (IND)  1990-2008  132   236  40850  6808.2  1576  18355   619  10/74  29.65  2.69  65.99  31  35
 4           SCJ Broad (ENG)  2007-2023  167   309  33698  5616.2  1304  16719   604   8/15  27.68  2.97  55.79  28  20
 ..                      ...        ...  ...   ...    ...     ...   ...    ...   ...    ...    ...   ...    ...  ..  ..
 78             MM Ali (ENG)  2014-2023   68   119  12610  2101.4   293   7612   204   6/53  37.31  3.62  61.81  13   5
 79          BA Stokes (ENG)  2013-2024  105   152  11795  1965.5   350   6506   203   6/22  32.04  3.30  58.10   8   4
 80         AME Roberts (WI)  1974-1983   47    90  11135       -   382   5174   202   7/54  25.61  2.78  55.12   8  11
 81            JA Snow (ENG)  1965-1976   49    93  12021       -   415   5387   202   7/40  26.66  2.68  59.50  12   8
 82         JR Thomson (AUS)  1972-1985   51    90  10535       -   301   5601   200   6/46  28.00  3.18  52.67  16   8

 [83 rows x 15 columns]]

type(all_tables)
list

df = all_tables[0]
type(df)

pandas.core.frame.DataFrame

df.head(11)

                    Player       Span  Mat  Inns  Balls   Overs  Mdns   Runs  Wkts    BBI    Ave  Econ     SR   4   5
0   M Muralidaran (ICC/SL)  1992-2010  133   230  44039  7339.5  1794  18180   800   9/51  22.72  2.47  55.04  45  67
1           SK Warne (AUS)  1992-2007  145   273  40705  6784.1  1761  17995   708   8/71  25.41  2.65  57.49  48  37
2        JM Anderson (ENG)  2003-2024  188   350  40037  6672.5  1730  18627   704   7/42  26.45  2.79  56.87  32  32
3           A Kumble (IND)  1990-2008  132   236  40850  6808.2  1576  18355   619  10/74  29.65  2.69  65.99  31  35
4          SCJ Broad (ENG)  2007-2023  167   309  33698  5616.2  1304  16719   604   8/15  27.68  2.97  55.79  28  20
5         GD McGrath (AUS)  1993-2007  124   243  29248  4874.4  1470  12186   563   8/24  21.64  2.49  51.95  28  29
6            NM Lyon (AUS)  2011-2024  129   242  32761  5460.1  1044  16052   530   8/50  30.28  2.93  61.81  24  24
7            CA Walsh (WI)  1984-2001  132   242  30019  5003.1  1144  12688   519   7/37  24.44  2.53  57.84  32  22
8           R Ashwin (IND)  2011-2024  100   189  26166  4361.0   889  12255   516   7/59  23.75  2.81  50.70  25  36
9            DW Steyn (SA)  2004-2019   93   171  18608  3101.2   660  10077   439   7/51  22.95  3.24  42.38  27  26
10       N Kapil Dev (IND)  1978-1994  131   227  27740  4623.2  1060  12867   434   9/83  29.64  2.78  63.91  17  23

arr = df.to_numpy()
type(arr)

numpy.ndarray
player_names = df['Player']

print(player_names)

0 M Muralidaran (ICC/SL)
1 SK Warne (AUS)
2 JM Anderson (ENG)
3 A Kumble (IND)
4 SCJ Broad (ENG)
...
78 MM Ali (ENG)
79 BA Stokes (ENG)
80 AME Roberts (WI)
81 JA Snow (ENG)
82 JR Thomson (AUS)
Name: Player, Length: 83, dtype: object

player_wickets = df[['Player', 'Wkts']]

print(player_wickets)

Player Wkts
0 M Muralidaran (ICC/SL) 800
1 SK Warne (AUS) 708
2 JM Anderson (ENG) 704
3 A Kumble (IND) 619
4 SCJ Broad (ENG) 604
.. ... ...
78 MM Ali (ENG) 204
79 BA Stokes (ENG) 203
80 AME Roberts (WI) 202
81 JA Snow (ENG) 202
82 JR Thomson (AUS) 200

[83 rows x 2 columns]

player_details = df.iloc[10]

print(player_details)

Player    N Kapil Dev (IND)
Span              1978-1994
Mat 131
Inns 227
Balls 27740
Overs 4623.2
Mdns 1060
Runs 12867
Wkts 434
BBI 9/83
Ave 29.64
Econ 2.78
SR 63.91
4 17
5 23
Name: 10, dtype: object

df1 = df.set_index('Player')

print(df1)

                             Span  Mat  Inns  Balls   Overs  Mdns   Runs  Wkts    BBI    Ave  Econ     SR   4   5
Player
M Muralidaran (ICC/SL)  1992-2010  133   230  44039  7339.5  1794  18180   800   9/51  22.72  2.47  55.04  45  67
SK Warne (AUS)          1992-2007  145   273  40705  6784.1  1761  17995   708   8/71  25.41  2.65  57.49  48  37
JM Anderson (ENG)       2003-2024  188   350  40037  6672.5  1730  18627   704   7/42  26.45  2.79  56.87  32  32
A Kumble (IND)          1990-2008  132   236  40850  6808.2  1576  18355   619  10/74  29.65  2.69  65.99  31  35
SCJ Broad (ENG)         2007-2023  167   309  33698  5616.2  1304  16719   604   8/15  27.68  2.97  55.79  28  20
...                           ...  ...   ...    ...     ...   ...    ...   ...    ...    ...   ...    ...  ..  ..
MM Ali (ENG)            2014-2023   68   119  12610  2101.4   293   7612   204   6/53  37.31  3.62  61.81  13   5
BA Stokes (ENG)         2013-2024  105   152  11795  1965.5   350   6506   203   6/22  32.04  3.30  58.10   8   4
AME Roberts (WI)        1974-1983   47    90  11135       -   382   5174   202   7/54  25.61  2.78  55.12   8  11
JA Snow (ENG)           1965-1976   49    93  12021       -   415   5387   202   7/40  26.66  2.68  59.50  12   8
JR Thomson (AUS)        1972-1985   51    90  10535       -   301   5601   200   6/46  28.00  3.18  52.67  16   8

[83 rows x 14 columns]

print(df1.head())

                             Span  Mat  Inns  Balls   Overs  Mdns   Runs  Wkts    BBI    Ave  Econ     SR   4   5
Player
M Muralidaran (ICC/SL)  1992-2010  133   230  44039  7339.5  1794  18180   800   9/51  22.72  2.47  55.04  45  67
SK Warne (AUS)          1992-2007  145   273  40705  6784.1  1761  17995   708   8/71  25.41  2.65  57.49  48  37
JM Anderson (ENG)       2003-2024  188   350  40037  6672.5  1730  18627   704   7/42  26.45  2.79  56.87  32  32
A Kumble (IND)          1990-2008  132   236  40850  6808.2  1576  18355   619  10/74  29.65  2.69  65.99  31  35
SCJ Broad (ENG)         2007-2023  167   309  33698  5616.2  1304  16719   604   8/15  27.68  2.97  55.79  28  20

player_in_fifth_position = df1.iloc[4]

print(player_in_fifth_position)

Span 2007-2023
Mat 167
Inns 309
Balls 33698
Overs 5616.2
Mdns 1304
Runs 16719
Wkts 604
BBI 8/15
Ave 27.68
Econ 2.97
SR 55.79
4 28
5 20
Name: SCJ Broad (ENG), dtype: object

wickets_at_index_10 = df.iloc[10]['Wkts']

print(wickets_at_index_10)
434

df['WicketPerMatch'] = df.apply(
    lambda row: row['Wkts'] / row['Mat'] if row['Mat'] > 0 else 0, axis=1)
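The row-wise apply() works; for reference, the same column can also be computed in
vectorized form (safe here because Mat is never zero in this table):

# vectorized equivalent of the apply() above (Mat is never 0 in this data)
df['WicketPerMatch'] = df['Wkts'] / df['Mat']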

print(df)

                    Player       Span  Mat  Inns  Balls   Overs  Mdns   Runs  Wkts    BBI    Ave  Econ     SR   4   5  WicketPerMatch
0   M Muralidaran (ICC/SL)  1992-2010  133   230  44039  7339.5  1794  18180   800   9/51  22.72  2.47  55.04  45  67        6.015038
1           SK Warne (AUS)  1992-2007  145   273  40705  6784.1  1761  17995   708   8/71  25.41  2.65  57.49  48  37        4.882759
2        JM Anderson (ENG)  2003-2024  188   350  40037  6672.5  1730  18627   704   7/42  26.45  2.79  56.87  32  32        3.744681
3           A Kumble (IND)  1990-2008  132   236  40850  6808.2  1576  18355   619  10/74  29.65  2.69  65.99  31  35        4.689394
4          SCJ Broad (ENG)  2007-2023  167   309  33698  5616.2  1304  16719   604   8/15  27.68  2.97  55.79  28  20        3.616766
..                     ...        ...  ...   ...    ...     ...   ...    ...   ...    ...    ...   ...    ...  ..  ..             ...
78            MM Ali (ENG)  2014-2023   68   119  12610  2101.4   293   7612   204   6/53  37.31  3.62  61.81  13   5        3.000000
79         BA Stokes (ENG)  2013-2024  105   152  11795  1965.5   350   6506   203   6/22  32.04  3.30  58.10   8   4        1.933333
80        AME Roberts (WI)  1974-1983   47    90  11135       -   382   5174   202   7/54  25.61  2.78  55.12   8  11        4.297872
81           JA Snow (ENG)  1965-1976   49    93  12021       -   415   5387   202   7/40  26.66  2.68  59.50  12   8        4.122449
82        JR Thomson (AUS)  1972-1985   51    90  10535       -   301   5601   200   6/46  28.00  3.18  52.67  16   8        3.921569

[83 rows x 16 columns]

min_wicket_per_match = df['WicketPerMatch'].min()
max_wicket_per_match = df['WicketPerMatch'].max()
df['NormalizedWicketPerMatch'] = (df['WicketPerMatch'] - min_wicket_per_match) / (max_wicket_per_match - min_wicket_per_match)

print(df)

                    Player       Span  Mat  Inns  Balls   Overs  Mdns   Runs  Wkts    BBI    Ave  Econ     SR   4   5  WicketPerMatch  NormalizedWicketPerMatch
0   M Muralidaran (ICC/SL)  1992-2010  133   230  44039  7339.5  1794  18180   800   9/51  22.72  2.47  55.04  45  67        6.015038                  1.000000
1           SK Warne (AUS)  1992-2007  145   273  40705  6784.1  1761  17995   708   8/71  25.41  2.65  57.49  48  37        4.882759                  0.733957
2        JM Anderson (ENG)  2003-2024  188   350  40037  6672.5  1730  18627   704   7/42  26.45  2.79  56.87  32  32        3.744681                  0.466552
3           A Kumble (IND)  1990-2008  132   236  40850  6808.2  1576  18355   619  10/74  29.65  2.69  65.99  31  35        4.689394                  0.688524
4          SCJ Broad (ENG)  2007-2023  167   309  33698  5616.2  1304  16719   604   8/15  27.68  2.97  55.79  28  20        3.616766                  0.436497
..                     ...        ...  ...   ...    ...     ...   ...    ...   ...    ...    ...   ...    ...  ..  ..             ...                       ...
78            MM Ali (ENG)  2014-2023   68   119  12610  2101.4   293   7612   204   6/53  37.31  3.62  61.81  13   5        3.000000                  0.291580
79         BA Stokes (ENG)  2013-2024  105   152  11795  1965.5   350   6506   203   6/22  32.04  3.30  58.10   8   4        1.933333                  0.040953
80        AME Roberts (WI)  1974-1983   47    90  11135       -   382   5174   202   7/54  25.61  2.78  55.12   8  11        4.297872                  0.596531
81           JA Snow (ENG)  1965-1976   49    93  12021       -   415   5387   202   7/40  26.66  2.68  59.50  12   8        4.122449                  0.555313
82        JR Thomson (AUS)  1972-1985   51    90  10535       -   301   5601   200   6/46  28.00  3.18  52.67  16   8        3.921569                  0.508114

[83 rows x 17 columns]

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='SR', y='Ave')
plt.title('Relationship between Strike Rate (SR) and Batting Average (Ave)')
plt.xlabel('Strike Rate (SR)')
plt.ylabel('Batting Average (Ave)')

Text(0, 0.5, 'Batting Average (Ave)')

df['Country'] = df['Player'].str.extract(r'\(([^)]+)\)')

print(df)

                    Player       Span  Mat  Inns  Balls   Overs  Mdns   Runs  Wkts    BBI    Ave  Econ     SR   4   5  WicketPerMatch  NormalizedWicketPerMatch  Country
0   M Muralidaran (ICC/SL)  1992-2010  133   230  44039  7339.5  1794  18180   800   9/51  22.72  2.47  55.04  45  67        6.015038                  1.000000   ICC/SL
1           SK Warne (AUS)  1992-2007  145   273  40705  6784.1  1761  17995   708   8/71  25.41  2.65  57.49  48  37        4.882759                  0.733957      AUS
2        JM Anderson (ENG)  2003-2024  188   350  40037  6672.5  1730  18627   704   7/42  26.45  2.79  56.87  32  32        3.744681                  0.466552      ENG
3           A Kumble (IND)  1990-2008  132   236  40850  6808.2  1576  18355   619  10/74  29.65  2.69  65.99  31  35        4.689394                  0.688524      IND
4          SCJ Broad (ENG)  2007-2023  167   309  33698  5616.2  1304  16719   604   8/15  27.68  2.97  55.79  28  20        3.616766                  0.436497      ENG
..                     ...        ...  ...   ...    ...     ...   ...    ...   ...    ...    ...   ...    ...  ..  ..             ...                       ...      ...
78            MM Ali (ENG)  2014-2023   68   119  12610  2101.4   293   7612   204   6/53  37.31  3.62  61.81  13   5        3.000000                  0.291580      ENG
79         BA Stokes (ENG)  2013-2024  105   152  11795  1965.5   350   6506   203   6/22  32.04  3.30  58.10   8   4        1.933333                  0.040953      ENG
80        AME Roberts (WI)  1974-1983   47    90  11135       -   382   5174   202   7/54  25.61  2.78  55.12   8  11        4.297872                  0.596531       WI
81           JA Snow (ENG)  1965-1976   49    93  12021       -   415   5387   202   7/40  26.66  2.68  59.50  12   8        4.122449                  0.555313      ENG
82        JR Thomson (AUS)  1972-1985   51    90  10535       -   301   5601   200   6/46  28.00  3.18  52.67  16   8        3.921569                  0.508114      AUS

[83 rows x 18 columns]

average_wickets_by_country = df.groupby('Country')['Wkts'].mean()
print(average_wickets_by_country)

Country
AUS 316.210526
BAN 237.000000
ENG 312.200000
ENG/ICC 226.000000
ICC/NZ 362.000000
ICC/SA 292.000000
ICC/SL 800.000000
IND 352.272727
NZ 306.500000
PAK 299.714286
SA 344.571429
SL 394.000000
WI 314.111111
ZIM 216.000000
Name: Wkts, dtype: float64

df_aus = df[df['Country'] == 'AUS']

plt.figure(figsize=(10, 6))
sns.regplot(data=df_aus, x='SR', y='Ave', scatter_kws={'s': 100},
            line_kws={'color': 'red'})

plt.title('Strike Rate (SR) vs Batting Average (Ave) for Australia')


plt.xlabel('Strike Rate (SR)')
plt.ylabel('Batting Average (Ave)')

plt.show()
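Item (xv) from the problem list does not appear in the transcript; a minimal sketch,
assuming SR and Ave were parsed as numeric columns by read_html:

# records where strike rate < 55 and bowling average < 25 (item xv)
low_sr_ave = df[(df['SR'] < 55) & (df['Ave'] < 25)]
print(low_sr_ave)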
df[['StartYear', 'EndYear']] = df['Span'].str.split('-', expand=True).astype(int)

df['CareerLength'] = df['EndYear'] - df['StartYear']
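Item (xvi) asks for a function to extract the years from the attribute “Span”; an
equivalent sketch using a helper function (the vectorized split above produces the
same columns):

def parse_span(span):
    # split a "start-end" string such as "1992-2010" into two integers
    start, end = span.split('-')
    return int(start), int(end)

df[['StartYear', 'EndYear']] = df['Span'].apply(lambda s: pd.Series(parse_span(s)))
df['CareerLength'] = df['EndYear'] - df['StartYear']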

plt.figure(figsize=(10, 6))
plt.hist(df['CareerLength'].dropna(), bins=20, edgecolor='black')

plt.title('Distribution of Career Lengths')


plt.xlabel('Career Length (Years)')
plt.ylabel('Frequency')

Text(0, 0.5, 'Frequency')


wickets_by_country = df.groupby('Country')['Wkts'].sum().reset_index()

# note: catplot() creates its own figure, so the plt.figure() call below stays
# empty (hence the "<Figure size 1200x800 with 0 Axes>" in the output)
plt.figure(figsize=(12, 8))
sns.catplot(data=wickets_by_country, x='Country', y='Wkts', kind='bar', height=6, aspect=2)

plt.title('Number of Wickets Taken by Each Country')


plt.xlabel('Country')
plt.ylabel('Number of Wickets')
plt.xticks(rotation=90)

plt.show()

<Figure size 1200x800 with 0 Axes>


CS552 Data Mining and Data Warehousing MO-24
CS551 Machine Learning Lab
Date :27/08/24
Lab Assignment 2
Problem1
1) Load the diabetes dataset from Sklearn
(X, y) = datasets.load_diabetes(return_X_y=True).
Please note that return_X_y=True returns (data, target).
2) Select only one feature from X, for example the third feature of X.
3) Split X so that all values except the last 25 records are used for training, and
the last 25 records are used for testing.
4) Similarly split the data for y.
5) Create a linear regression object and train the model on the training set, using
the LinearRegression class for fitting the model.
6) Predict the value of y_pred using the regression predict method, passing the
testing set (X_test) as the parameter.
7) Find the following
a) Regression Coefficients
b) Mean Squared Errors between y_test and Y_predicted
c) Residual Sum of Squares
Hint from sklearn.metrics import mean_squared_error, r2_score.
The ravel() function converts a two-dimensional array into a contiguous flattened
(one-dimensional) array.
8) Plot the following graph
a) Scatter between X_test and y_test, color='black'
b) Plot between X_test and y_pred, color='red'
c) Set the X_label, Y_label and Title
Problem 2
Find linear regression equation for the following two sets of data:
x 2 4 6 8
y 3 7 5 10
a) Create x and y from the above table
b) Plot graph between x and y and color is ‘k.’
c) Set xlabel, ylabel and title of the graph
d) Apply the LinearRegression Class for Fitting the Model from sklearn.linear_model
import LinearRegression
e) Predict the value of y_val when x=3
f) Plot the linear regression line using x, model.predict(x) and color 'b'.
g) Find the y-intercept by predicting the y if the x is 0
h) Find the gradient of the linear regression line through the coef_ property
i) Find the value of a using the normal equation:

        a = (Xᵗ X)⁻¹ Xᵗ Y

   where

        X = [[1, 2],
             [1, 4],
             [1, 6],
             [1, 8]]

        Y = np.array([3, 7, 5, 10])[np.newaxis]

   Use X.transpose() for the transpose and np.linalg.inv for the inverse.

Problem 3
Find multiple regression equation for the following sets of data:
X1   X2   Yi
 1    4    1
 2    5    6
 3    8    8
 4    2   12
Follow problem number 2 for solving the question.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = datasets.load_diabetes(return_X_y=True)

X = X[:, 2]

X_train = X[:-25]
X_test = X[-25:]

y_train = y[:-25]
y_test = y[-25:]

model = LinearRegression()
model.fit(X_train.reshape(-1, 1), y_train)

LinearRegression()

y_pred = model.predict(X_test.reshape(-1, 1))

coefficients = model.coef_
mean_square_error = mean_squared_error(y_test, y_pred)
residual_sum_of_square = ((y_test - y_pred) ** 2).sum()

print(f"Regression Coefficients: {coefficients}")


print(f"Mean Squared Error: {mean_square_error}")
print(f"Residual Sum of Squares: {residual_sum_of_square}")

Regression Coefficients: [946.22612757]


Mean Squared Error: 3477.2220885685015
Residual Sum of Squares: 86930.55221421254

plt.scatter(X_test, y_test, color='black', label='Actual')


plt.plot(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('X_test')
plt.ylabel('y_test')
plt.title('Linear Regression on Diabetes Dataset')
plt.legend()
plt.show()
In [1]: import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

In [2]: x = np.array([2, 4, 6, 8])


y = np.array([3, 7, 5, 10])

In [3]: plt.scatter(x, y, color='black', marker='.')


plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter plot of x vs y')
plt.show()

In [4]: x_reshaped = x.reshape(-1, 1)


model = LinearRegression()
model.fit(x_reshaped, y)

Out[4]: LinearRegression()

In [5]: x_val = np.array([[3]])


y_val = model.predict(x_val)
print(f"Predicted value of y when x = 3: {y_val[0]}")

Predicted value of y when x = 3: 4.35

In [6]: plt.scatter(x, y, color='black', marker='.')


plt.plot(x, model.predict(x_reshaped), color='blue')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Linear Regression Line')
plt.show()

In [7]: y_intercept = model.predict(np.array([[0]]))


print(f"y-intercept when x = 0: {y_intercept[0]}")

y-intercept when x = 0: 1.5

In [8]: gradient = model.coef_


print(f"Gradient of the line: {gradient[0]}")

Gradient of the line: 0.95

In [9]: X = np.array([[1, 2], [1, 4], [1, 6], [1, 8]])


Xt = X.T
XtX_inv = np.linalg.inv(Xt @ X)
a = XtX_inv @ Xt @ y
print(f"Value of a: {a}")

Value of a: [1.5 0.95]

In [1]: import numpy as np
from sklearn.linear_model import LinearRegression

In [2]: X1 = np.array([1, 2, 3, 4])


X2 = np.array([4, 5, 8, 2])
Y = np.array([1, 6, 8, 12])

In [3]: X = np.column_stack((X1, X2))

In [4]: model = LinearRegression()


model.fit(X, Y)

Out[4]: LinearRegression()

In [6]: X_val = np.array([[3, 7]])


Y_val = model.predict(X_val)
print(f"Predicted value of Y when X1=3 and X2=7 :: {Y_val[0]}")

Predicted value of Y when X1=3 and X2=7 :: 8.368852459016393

In [7]: coefficients = model.coef_


intercept = model.intercept_
print(f"Coefficients :: {coefficients}")
print(f"Intercept :: {intercept}")

Coefficients :: [ 3.48360656 -0.05464481]


Intercept :: -1.699453551912569

In [10]: Xt = X.T
XtX_inv = np.linalg.inv(Xt @ X)
a = XtX_inv @ Xt @ Y
print(f"Value of a: {a}")

Value of a: [ 3.16551127 -0.21663778]
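The In [10] computation omits the bias column, which is why this `a` differs from
the intercept and coefficients reported above. A sketch that prepends a column of
ones (the same construction as in Problem 2) and should reproduce them:

# prepend a bias column of ones so the normal equation also estimates the intercept
Xb = np.column_stack((np.ones(len(X1)), X1, X2))
a_full = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ Y
print(a_full)  # expected: [intercept, coef_X1, coef_X2], matching the model above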

CS552 Data Mining and Data Warehousing Lab MO 2024
CS551 Machine Learning Lab
Date: 03rd Sept. 2024
Lab Assignment 4

1. Modify the data set California Housing Price by converting the values of the attribute
median_income to ‘L’, ‘M’ and ‘H’. The range of the values are given below:

Low (L): less than 4
Medium (M): 4 to 6
High (H): more than 6

Also update the values of the attribute ocean_proximity by replacing 1H OCEAN and
INLAND with 'Far', and NEAR OCEAN, NEAR BAY and ISLAND with 'Near'.
Now calculate the entropy of the updated data set. Also compute the information
provided by median_income and find out the information gain for that.

2. Implement k-nearest neighbour algorithm on Iris data set. Consider only two
attributes and different values of k. Also find out the accuracy of the result. Plot the
accuracy.
import pandas as pd
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

file_path = 'housing.csv'

df = pd.read_csv(file_path)

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0        NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0        NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0        NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0        NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0        NEAR BAY

df['median_income'] = df['median_income'].apply(
    lambda x: 'L' if x < 4 else ('M' if 4 <= x <= 6 else 'H'))

df['ocean_proximity'] = df['ocean_proximity'].apply(
    lambda x: 'Far' if x in ['1H OCEAN', 'INLAND'] else 'Near')

ocean_proximity_counts = df['ocean_proximity'].value_counts(normalize=True)

entropy_ocean_proximity = entropy(ocean_proximity_counts, base=2)

print(f'Entropy of ocean_proximity: {entropy_ocean_proximity}')


Entropy of ocean_proximity: 0.9015243330828411

information_median_income = mutual_info_score(df['median_income'], df['ocean_proximity'])

print(f'Information provided by median_income: {information_median_income}')

Information provided by median_income: 0.026037753109490705

information_gain = entropy_ocean_proximity - information_median_income

print(f'Information Gain for median_income: {information_gain}')

Information Gain for median_income: 0.8754865799733503
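For reference, information gain for an attribute is often computed as H(class) minus
the conditional entropy H(class | attribute), which equals the mutual information; a
minimal cross-check of that form in bits (note that mutual_info_score above reports
nats, not bits):

# information gain of median_income in bits:
# H(ocean_proximity) - H(ocean_proximity | median_income)
cond_entropy = 0.0
for value, group in df.groupby('median_income'):
    p = len(group) / len(df)
    cond_entropy += p * entropy(group['ocean_proximity'].value_counts(normalize=True), base=2)

print(f'Information gain (bits): {entropy_ocean_proximity - cond_entropy}')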

iris = load_iris()

df = pd.DataFrame(iris.data, columns=iris.feature_names)

df['target'] = iris.target

X = df[['sepal length (cm)', 'sepal width (cm)']].values


y = df['target'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

k_values = range(1,23,2)

accuracies = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f'Accuracy for k={k}: {accuracy}')

Accuracy for k=1: 0.7333333333333333


Accuracy for k=3: 0.7666666666666667
Accuracy for k=5: 0.8
Accuracy for k=7: 0.7666666666666667
Accuracy for k=9: 0.8
Accuracy for k=11: 0.7666666666666667
Accuracy for k=13: 0.7666666666666667
Accuracy for k=15: 0.8333333333333334
Accuracy for k=17: 0.8666666666666667
Accuracy for k=19: 0.8666666666666667
Accuracy for k=21: 0.8666666666666667

plt.figure(figsize=(10, 6))
plt.plot(k_values, accuracies, marker='o', linestyle='--', color='b')
plt.title('Accuracy vs. k in k-NN')
plt.xlabel('k (Number of Neighbors)')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()
CS552 Data Mining and Data Warehousing MO-24
CS551 Machine Learning Lab
Date :03/09/24
Lab Assignment 3
Problem 1 is based on data cleaning
Q1
(i) Import not_clean.csv file using pandas and display first ten records from the file.
(ii) How many null values are there in each column? Replace null values by mean value
using fillna for each column. Again, check the results using isnull.sum().
(iii) Again, import not_clean.csv file using pandas. Remove the rows which contains null
values and display the results. Reset the index value using reset_index().
(iv) Again, import not_clean.csv file using pandas and remove the duplicated records
from the file. Display the results.
(v) Remove duplicates in columns by using subset parameter.
(vi) Again, import not_clean.csv file using pandas and store numerical value in another
dataframe using df.loc or df1.select_dtypes. Normalize the numerical column in
dataset to common scale using preprocessing.MinMaxScaler() from sklearn.
(vii) Import not_clean.csv file using pandas and store 'Target_Name' in a separate
data frame variable. Convert the Target_Name labels into numbers using LabelEncoder,
for example 'setosa'=0, 'versicolor'=1, 'virginica'=2
(viii) Import not_clean.csv file using pandas and store 'Target_Name' in a separate
data frame variable. Convert the Target_Name labels into numbers,
for example 'setosa'=1, 'versicolor'=2, 'virginica'=3
(ix) Import not_clean.csv file using pandas and store 'Target_Name' in a separate
data variable. Convert the Target_Name column using OneHotEncoder,
for example 'setosa'=[1, 0, 0], 'versicolor'=[0, 1, 0], 'virginica'=[0, 0, 1]
(x) Find the probability of each class using np.mean()
(xi) Import not_clean.csv file using pandas and divide the not_clean.csv into x and y.
Where x contains 'Sepal_length', 'Sepal_width','Petal_length','Petal_width'
and y contains 'Target_Name'.
xii) Split the training and testing data in such a way that test should contains 20
percentage of data using from sklearn.model_selection import train_test_split.
xiii) Split the training and testing data without using sklearn.model_selection import
train_test_split.
xiv) Find the outliers in the not_clean.csv
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from sklearn.model_selection import train_test_split

df = pd.read_csv('not_clean.csv')

df.head(10)

   Sepal_length  Sepal_width  Petal_length  Petal_width  Target Target_Name  Number
0           5.1          3.5           1.4          0.2       1    'setosa'   20000
1           4.9          3.0           1.4          0.2       1    'setosa'   15000
2           4.7          3.2           1.3          0.2       1    'setosa'   10000
3           4.6          3.1           1.5          0.2       1    'setosa'    5000
4           5.0          3.6           1.4          0.2       1    'setosa'    1000
5           5.4          3.9           1.7          0.4       1    'setosa'    2000
6           4.6          3.4           1.4          0.3       1    'setosa'    3000
7           5.0          3.4           1.5          0.2       1    'setosa'    4000
8           4.4          2.9           1.4          0.2       1    'setosa'    5000
9           4.9          3.1           1.5          0.1       1    'setosa'    6000

numeric_cols = df.select_dtypes(include='number').columns

# count nulls first, fill them with the column means, then re-check
null_values_before = df[numeric_cols].isnull().sum()
means = df[numeric_cols].mean()
df[numeric_cols] = df[numeric_cols].fillna(means)
null_values_after = df.isnull().sum()

print(df.isnull().sum())

Sepal_length 0
Sepal_width 0
Petal_length 0
Petal_width 0
Target 0
Target_Name 0
Number 0
dtype: int64

df = pd.read_csv('not_clean.csv')

df_cleaned = df.dropna()

df_cleaned.reset_index(drop=True, inplace=True)

df_cleaned.head()

   Sepal_length  Sepal_width  Petal_length  Petal_width  Target Target_Name  Number
0           5.1          3.5           1.4          0.2       1    'setosa'   20000
1           4.9          3.0           1.4          0.2       1    'setosa'   15000
2           4.7          3.2           1.3          0.2       1    'setosa'   10000
3           4.6          3.1           1.5          0.2       1    'setosa'    5000
4           5.0          3.6           1.4          0.2       1    'setosa'    1000

df = pd.read_csv('not_clean.csv')
df_no_duplicates = df.drop_duplicates()

df_no_duplicates.head()

   Sepal_length  Sepal_width  Petal_length  Petal_width  Target Target_Name  Number
0           5.1          3.5           1.4          0.2       1    'setosa'   20000
1           4.9          3.0           1.4          0.2       1    'setosa'   15000
2           4.7          3.2           1.3          0.2       1    'setosa'   10000
3           4.6          3.1           1.5          0.2       1    'setosa'    5000
4           5.0          3.6           1.4          0.2       1    'setosa'    1000

df = pd.read_csv('not_clean.csv')

df_subset_no_duplicates = df.drop_duplicates(subset=['Sepal_length', 'Sepal_width'])

df_subset_no_duplicates.head()

   Sepal_length  Sepal_width  Petal_length  Petal_width  Target Target_Name  Number
0           5.1          3.5           1.4          0.2       1    'setosa'   20000
1           4.9          3.0           1.4          0.2       1    'setosa'   15000
2           4.7          3.2           1.3          0.2       1    'setosa'   10000
3           4.6          3.1           1.5          0.2       1    'setosa'    5000
4           5.0          3.6           1.4          0.2       1    'setosa'    1000
df = pd.read_csv('not_clean.csv')

df_numerical = df.select_dtypes(include=['float64', 'int64'])

scaler = MinMaxScaler()
df_numerical_normalized = pd.DataFrame(scaler.fit_transform(df_numerical),
                                       columns=df_numerical.columns)

df_numerical_normalized.head()

   Sepal_length  Sepal_width  Petal_length  Petal_width  Target    Number
0      0.019656     0.023810      0.007407     0.004367     0.0  0.166662
1      0.014742     0.015873      0.007407     0.004367     0.0  0.122804
2      0.009828     0.019048      0.005556     0.004367     0.0  0.078945
3      0.007371     0.017460      0.009259     0.004367     0.0  0.035087
4      0.017199     0.025397      0.007407     0.004367     0.0  0.000000

df_target = df['Target_Name']

label_encoder = LabelEncoder()
df_target_encoded = label_encoder.fit_transform(df_target)

df_target_encoded[:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# the raw labels carry literal quotes ("'setosa'"), so none of these keys
# match and the whole column becomes NaN, as the output below shows
df['Target_Name'] = df['Target_Name'].map({'setosa': 1, 'versicolor': 2, 'virginica': 3})

df['Target_Name'].head()

0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: Target_Name, dtype: float64
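The NaNs come from the literal quotes carried by the raw labels, which also explains
the degenerate one-hot and value_counts outputs further below. A minimal fix sketched
on a separate df_fixed copy, so the transcript that follows is unchanged:

# re-read so Target_Name still holds the quoted strings, then strip the quotes
df_fixed = pd.read_csv('not_clean.csv')
df_fixed['Target_Name'] = df_fixed['Target_Name'].str.strip("'").map(
    {'setosa': 1, 'versicolor': 2, 'virginica': 3})

# probability of each class via np.mean() (item x)
for label in (1, 2, 3):
    print(label, np.mean(df_fixed['Target_Name'] == label))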

encoder = OneHotEncoder(sparse=False)

df_target_encoded = encoder.fit_transform(df[['Target_Name']])

df_target_encoded[:10]
array([[1.],
[1.],
[1.],
[1.],
[1.],
[1.],
[1.],
[1.],
[1.],
[1.]])

probabilities = df['Target_Name'].value_counts(normalize=True)

probabilities

Series([], Name: Target_Name, dtype: float64)

X = df[['Sepal_length', 'Sepal_width', 'Petal_length', 'Petal_width']]


y = df['Target_Name']

X.head(), y.head()

(   Sepal_length  Sepal_width  Petal_length  Petal_width
 0           5.1          3.5           1.4          0.2
 1           4.9          3.0           1.4          0.2
 2           4.7          3.2           1.3          0.2
 3           4.6          3.1           1.5          0.2
 4           5.0          3.6           1.4          0.2,
 0   NaN
 1   NaN
 2   NaN
 3   NaN
 4   NaN
 Name: Target_Name, dtype: float64)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((125, 4), (32, 4), (125,), (32,))

train_size = int(0.8 * len(df))


X_train_manual, X_test_manual = X[:train_size], X[train_size:]
y_train_manual, y_test_manual = y[:train_size], y[train_size:]

X_train_manual.shape, X_test_manual.shape, y_train_manual.shape, y_test_manual.shape

((125, 4), (32, 4), (125,), (32,))


Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

outliers = ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))

outliers

     Sepal_length  Sepal_width  Petal_length  Petal_width  Target  \
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
.. ... ... ... ... ...
152 True True True True True
153 True True True True True
154 False False False False False
155 False False False False False
156 False False False False False

Target_Name Number
0 False False
1 False False
2 False False
3 False False
4 False False
.. ... ...
152 False False
153 False False
154 False False
155 False False
156 False False

[157 rows x 7 columns]


CS552 Data Mining and Data Warehousing MO-24
CS551 Machine Learning Lab
Date :1/10/24
Lab Assignment 6

Q1(a) Load the Diabetes Dataset

(b) Create a variable X that contains the following features from the above datasets
'Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI',
'DiabetesPedigreeFunction','Age'
and y contains Outcome
(c) Split the dataset into a 75 percent training and 25 percent testing set using
train_test_split
(d) Train the model using logistic regression
(e) Use the test set and feed it into the model to obtain the predictions
(f) Show the counts of actual versus predicted labels using a confusion matrix
(g) Create a heatmap for confusion matrix where xlabel is predicted label and ylabel is
Actual label using seaborn
(h) Find the accuracy, precision and recall scores using sklearn.metrics
(i) Find the accuracy of prediction using score(X_test, y_test)
(j) Find the precision, recall, and F1-score of the model using the
classification_report() function of the metrics module
(k) Plot the Receiver Operating Characteristic (ROC) Curve and find the Area Under
the Curve (AUC).

Q2(a) Load the Diabetes Dataset

(b) Copy the first two features (Glucose and BloodPressure) of the dataset into a
two-dimensional list
(c) Plot a scatter plot showing the distribution of points for the two features
(d) Display Diabetes in red and No Diabetes in blue

Q3(a) Load the Diabetes Dataset

(b) Copy the first three features (Glucose, BloodPressure and BMI) of the dataset
into a 3-dimensional list
(c) Plot a scatter plot showing the distribution of points for the three features
(d) Display Diabetes in red and No Diabetes in blue
Q4(a) Load the Diabetes Dataset

(b) Select the first feature (Glucose) into the x variable and Outcome into the y
variable from the dataset
(c) Plot a scatter plot showing the distribution of points for x and y. Use edge
colors red and blue.
(d) Use patches (import matplotlib.patches) for the red and blue colors and label
them on the graph

(e) Display Diabetes in red and No Diabetes in blue color


(f) Train the model using logistic regression and find the intercept and coefficient
(g) Plot the sigmoid curve using the formula below with the plot function
Hint :
def sigmoid(x):
return (1/(1+np.exp(-(log_regress.intercept_[0] +
(log_regress.coef_[0][0]*x)))))

where x1= np.arange(0,200,0.001)


y1 = [sigmoid(n) for n in x1]

Also plot the features x and y using the scatter function, and plot x1 and y1 with
the plot function, where the xlabel is Glucose and the ylabel is Probability.
Please note that x and y are different from x1 and y1.
(h) Make predictions based on the model
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, auc

data = pd.read_csv('diabetes.csv')

data

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0              6      148             72             35        0  33.6                     0.627   50        1
1              1       85             66             29        0  26.6                     0.351   31        0
2              8      183             64              0        0  23.3                     0.672   32        1
3              1       89             66             23       94  28.1                     0.167   21        0
4              0      137             40             35      168  43.1                     2.288   33        1
..           ...      ...            ...            ...      ...   ...                       ...  ...      ...
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0

[768 rows x 9 columns]

X = data[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
          'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']]
y = data['Outcome']

# note: the problem asks for a 75/25 split; this run used test_size=0.29
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.29, random_state=82)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# note: the model is fitted on the unscaled features,
# so the scaled arrays above are not actually used below
log_reg = LogisticRegression(max_iter=500)
log_reg.fit(X_train, y_train)

LogisticRegression(max_iter=500)

y_pred = log_reg.predict(X_test)

conf_matrix = confusion_matrix(y_test, y_pred)


print(conf_matrix)

[[137 8]
[ 31 47]]

sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')


plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.show()
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')

Accuracy: 0.8251121076233184
Precision: 0.8545454545454545
Recall: 0.6025641025641025

score = log_reg.score(X_test, y_test)


print(f'Accuracy using score: {score}')

Accuracy using score: 0.8251121076233184

report = classification_report(y_test, y_pred)


print(report)

              precision    recall  f1-score   support

           0       0.82      0.94      0.88       145
           1       0.85      0.60      0.71        78

    accuracy                           0.83       223
   macro avg       0.84      0.77      0.79       223
weighted avg       0.83      0.83      0.82       223
y_prob = log_reg.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_prob)


roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')


plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('diabetes.csv')

two_dim_data = data[['Glucose', 'BloodPressure']].values.tolist()

plt.scatter(data['Glucose'], data['BloodPressure'], c=data['Outcome'], cmap='bwr')
plt.xlabel('Glucose')
plt.ylabel('Blood Pressure')
plt.show()
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('diabetes.csv')

three_dim_data = data[['Glucose', 'BloodPressure', 'BMI']].values.tolist()

# scatter of Glucose vs Blood Pressure, coloured by Outcome
plt.scatter(data['Glucose'], data['BloodPressure'], c=data['Outcome'], cmap='bwr')
plt.xlabel('Glucose')
plt.ylabel('Blood Pressure')
plt.show()
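Q3 asks for the three features in 3D; a minimal sketch using mpl_toolkits (not part
of the original transcript):

# 3D scatter of Glucose, Blood Pressure and BMI, coloured by Outcome
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (enables the 3d projection)

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(data['Glucose'], data['BloodPressure'], data['BMI'],
           c=data['Outcome'], cmap='bwr')
ax.set_xlabel('Glucose')
ax.set_ylabel('Blood Pressure')
ax.set_zlabel('BMI')
plt.show()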
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from sklearn.linear_model import LogisticRegression
import numpy as np

data = pd.read_csv('diabetes.csv')

x = data['Glucose']
y = data['Outcome']

plt.scatter(x, y, c=y, cmap='bwr')


plt.xlabel('Glucose')
plt.ylabel('Outcome')
plt.show()

plt.scatter(x, y, c=y, cmap='bwr')


red_patch = mpatches.Patch(color='red', label='Diabetes')
blue_patch = mpatches.Patch(color='blue', label='No Diabetes')
plt.legend(handles=[red_patch, blue_patch])
plt.xlabel('Glucose')
plt.ylabel('Outcome')
plt.show()
log_reg_glucose = LogisticRegression()
log_reg_glucose.fit(x.values.reshape(-1, 1), y)

print('Intercept:', log_reg_glucose.intercept_)
print('Coefficient:', log_reg_glucose.coef_)

Intercept: [-5.35002807]
Coefficient: [[0.03787262]]

def sigmoid(x):
    return 1 / (1 + np.exp(-(log_reg_glucose.intercept_[0] +
                             log_reg_glucose.coef_[0][0] * x)))

x1 = np.arange(0, 200, 0.001)


y1 = [sigmoid(n) for n in x1]

plt.scatter(x, y, c=y, cmap='bwr')


plt.plot(x1, y1, color='green')
plt.xlabel('Glucose')
plt.ylabel('Probability')
plt.show()
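Item (h) (make predictions from the model) is not shown in the transcript; a minimal
sketch using a few hypothetical glucose values:

# predicted class and probability for hypothetical glucose values (item h)
samples = np.array([[90], [140], [200]])
print(log_reg_glucose.predict(samples))        # 0 = No Diabetes, 1 = Diabetes
print(log_reg_glucose.predict_proba(samples))  # class probabilities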
CS552 Data Mining and Data Warehousing MO-24
CS551 Machine Learning Lab
Date :15/10/24
Lab Assignment 7

Q1(a) Load the following dataset, and the name of the dataset is ‘svmass7.csv’, which
contains 10 records.

 x1    x2    y
 4     2.9   1
 4     4     1
 1     2.5  -1
 2.5   1    -1
 4.9   4.5   1
 1.9   1.9  -1
 3.5   4     1
 0.5   1.5  -1
 2     2.1  -1
 4.5   2.5   1
Q1(b) Plot the data using Seaborn
Q1(c) Train the model using Scikit-learn's svm module's SVC class. Use a linear
kernel to solve the problem.
Q1(d) Find the following
a) Weights
b) Bias
c) Indices of support vectors
d) Support vectors
e) Number of support vectors of each class
f) Coefficient of support vector in the decision function
Q1(e) Plot the hyperplane and the margins.
Q1(f) Predict the class of the following values
(2,7)
(5,6)
(1,4)
(2,0)
Q2(a) Create two sets of random points (a total of 1000 points and add noise 0.20)
distributed in circular fashion using the make_circles() function
(b) Plot the points out on a 2D chart-using scatter and print the xlabel and ylabel.
(c) Add the third axis, the z-axis (z = x*x + y*y), and plot the chart in 3D

(d) Find the value of x3 and plot the 3D Hyperplane, train the model using the third
dimension. To plot the hyperplane in 3D, use the plot_surface() function

Q3(a) Load the load_breast_cancer dataset from sklearn


Q3(b) Display the first ten records of data and target
Q3(c) Display feature names and target names
Q3(d) Store the first two features in the X variable and target in the y variable
Q3(e) Plot the points using a scatter plot. Also, proper labels, legends, and targets should be
displayed.
Q3(f) Make use of the SVC class with the linear kernel.
Use the following
C=10
SVC(kernel='linear', C=C)

Q3(g) Find the min and max values of the first and second features
Q3(h) Take the step size h = (x_max - x_min)/100
Q3(i) Generate evenly spaced values from x_min to x_max and from y_min to y_max
with the step size h, and pass these to np.meshgrid. This creates two 2D arrays
(xx and yy) representing the grid of points over which the model's predictions
will be made.
Q3(j) Predict each point and store the values in the Z variable. Change the shape
of Z to match that of xx.
Q3(k) Paint the groups (malignant and benign) in colours using the contourf() function.
Also, proper labels, legends, and targets should be displayed.
Use the following parameters in contourf()
cmap=plt.cm.coolwarm, alpha=0.6

Q3(l) Use SVM with varying values of C. Change C = 1, 10^10 or 10^-10.

Q4(a) Repeat the Q3 using the Radial Basis function (RBF), also known as Gaussian Kernel
(Non-Linear kernels). Take C=1
(b) See the effects of classifying the points using the following varying values of C and
Gamma.
a) C=1, gamma =10
b) C=1, gamma =0.1
c) C=10^-10, gamma=10
d) C=10^10, gamma=0.1
(c) Repeat Q3 using the polynomial kernel (non-linear kernels).
See the effects of classifying the points using the following varying values of degree:
a) kernel='poly', degree=4, C=1, gamma='auto'
b) kernel='poly', degree=3, C=1, gamma='auto'
c) kernel='poly', degree=2, C=1, gamma='auto'
d) kernel='poly', degree=1, C=1, gamma='auto' # same as Linear
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import svm
import numpy as np
from sklearn.datasets import make_circles
from mpl_toolkits.mplot3d import Axes3D
from sklearn.datasets import load_breast_cancer

data = pd.read_excel('svmass7.csv.xlsx')
print(data)

x1 x2 y
0 4.0 2.9 1
1 4.0 4.0 1
2 1.0 2.5 -1
3 2.5 1.0 -1
4 4.9 4.5 1
5 1.9 1.9 -1
6 3.5 4.0 1
7 0.5 1.5 -1
8 2.0 2.1 -1
9 4.5 2.5 1

sns.scatterplot(x='x1', y='x2', hue='y', data=data,
                palette='coolwarm', style='y', s=100)
plt.show()
X = data[['x1', 'x2']].values
y = data['y'].values

clf = svm.SVC(kernel='linear')
clf.fit(X, y)

SVC(kernel='linear')

print("Weights (w):", clf.coef_[0])


print("\nBias (b):", clf.intercept_[0])
print("\nIndices of Support Vectors:", clf.support_)
print("\nSupport Vectors:", clf.support_vectors_)
print("\nNumber of Support Vectors of each class:", clf.n_support_)
print("\nCoefficient of Support Vector in Decision Function:",
clf.dual_coef_)

Weights (w): [0.84544361 0.38517609]

Bias (b): -3.4991090319463396

Indices of Support Vectors: [3 8 0]

Support Vectors: [[2.5 1. ]
 [2.  2.1]
 [4.  2.9]]

Number of Support Vectors of each class: [2 1]

Coefficient of Support Vector in Decision Function: [[-0.03615281 -0.3956072   0.43176   ]]

w = clf.coef_[0]
slope = -w[0] / w[1]
b = clf.intercept_[0]
xx = np.linspace(0, 5)
yy = slope * xx - b / w[1]

sns.scatterplot(x='x1', y='x2', hue='y', data=data,
                palette='coolwarm', style='y', s=100)
plt.plot(xx, yy, 'k-')

plt.plot(xx, yy + 1/w[1], 'k--')
plt.plot(xx, yy - 1/w[1], 'k--')
plt.show()
points = [[2,7], [5,6], [1,4], [2,0]]
predictions = clf.predict(points)

print(predictions)

[ 1 1 -1 -1]

X, y = make_circles(n_samples=1000, noise=0.20)

fig = plt.figure(figsize=(10,8))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Accent, s=40)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
z = X[:, 0]**2 + X[:, 1]**2
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], z, c=y, cmap=plt.cm.Accent, s=40)

plt.show()
X, y = make_circles(n_samples=1000, noise=0.20)

z = X[:, 0]**2 + X[:, 1]**2

clf = svm.SVC(kernel='linear')
clf.fit(np.c_[X, z], y)

SVC(kernel='linear')

x3 = lambda x, y: (-clf.intercept_[0] - clf.coef_[0][0]*x -
                   clf.coef_[0][1]*y) / clf.coef_[0][2]
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 100), np.linspace(-1.5, 1.5, 100))
zz = x3(xx, yy)

fig = plt.figure(figsize=(10, 8))


ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], z, c=y, cmap=plt.cm.Accent, s=20)
ax.plot_surface(xx, yy, zz, color='green', alpha=0.2)
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')

plt.show()
data = load_breast_cancer()

print(data.data[:10])
print(data.target[:10])

[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 [1.142e+01 2.038e+01 7.758e+01 ... 2.575e-01 6.638e-01 1.730e-01]
 [2.029e+01 1.434e+01 1.351e+02 ... 1.625e-01 2.364e-01 7.678e-02]
 [1.245e+01 1.570e+01 8.257e+01 ... 1.741e-01 3.985e-01 1.244e-01]
 [1.825e+01 1.998e+01 1.196e+02 ... 1.932e-01 3.063e-01 8.368e-02]
 [1.371e+01 2.083e+01 9.020e+01 ... 1.556e-01 3.196e-01 1.151e-01]
 [1.300e+01 2.182e+01 8.750e+01 ... 2.060e-01 4.378e-01 1.072e-01]
 [1.246e+01 2.404e+01 8.397e+01 ... 2.210e-01 4.366e-01 2.075e-01]]
[0 0 0 0 0 0 0 0 0 0]

print(data.feature_names)
print(data.target_names)
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
'mean smoothness' 'mean compactness' 'mean concavity'
'mean concave points' 'mean symmetry' 'mean fractal dimension'
'radius error' 'texture error' 'perimeter error' 'area error'
'smoothness error' 'compactness error' 'concavity error'
'concave points error' 'symmetry error' 'fractal dimension error'
'worst radius' 'worst texture' 'worst perimeter' 'worst area'
'worst smoothness' 'worst compactness' 'worst concavity'
'worst concave points' 'worst symmetry' 'worst fractal dimension']
['malignant' 'benign']

X = data.data[:, :2]
y = data.target

colors = ['green', 'gray']

for i, color in zip([0, 1], colors):
    plt.scatter(X[y == i, 0], X[y == i, 1],
                label=data.target_names[i], color=color)

plt.xlabel(data.feature_names[0])
plt.ylabel(data.feature_names[1])
plt.legend(loc='best')
plt.show()

C = 10
clf = svm.SVC(kernel='linear', C=C)
clf.fit(X, y)
SVC(C=10, kernel='linear')

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

h = (x_max - x_min) / 100

xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

fig = plt.figure(figsize=(10, 7))


plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.6)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Accent, edgecolors='b')
plt.xlabel(data.feature_names[0])
plt.ylabel(data.feature_names[1])
plt.title("SVM with Linear Kernel")
plt.show()
for C in [1, 10**10, 10**-10]:
    clf = svm.SVC(kernel='linear', C=C)
    clf.fit(X, y)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.6)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolors='k')
    plt.xlabel(data.feature_names[0])
    plt.ylabel(data.feature_names[1])
    plt.title("SVM with Linear Kernel")
    plt.show()
clf = svm.SVC(kernel='rbf', C=1)
clf.fit(X, y)

SVC(C=1)
params = [(1, 10), (1, 0.1), (10**-10, 10), (10**10, 0.1)]
for C, gamma in params:
    clf = svm.SVC(kernel='rbf', C=C, gamma=gamma)
    clf.fit(X, y)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plotting
    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.6)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolors='k')
    plt.xlabel(data.feature_names[0])
    plt.ylabel(data.feature_names[1])
    plt.title(f"SVM with RBF Kernel (C={C}, gamma={gamma})")
    plt.show()
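Q4(c) (the polynomial kernel) is not shown in the transcript; a minimal sketch
reusing the same xx/yy grid, with the parameter settings listed in the problem:

# polynomial kernels of decreasing degree (Q4(c)); degree=1 behaves like linear
for degree in [4, 3, 2, 1]:
    clf = svm.SVC(kernel='poly', degree=degree, C=1, gamma='auto')
    clf.fit(X, y)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.6)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolors='k')
    plt.xlabel(data.feature_names[0])
    plt.ylabel(data.feature_names[1])
    plt.title(f"SVM with Polynomial Kernel (degree={degree})")
    plt.show()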
CS552 Data Mining and Data Warehousing MO-24
CS551 Machine Learning Lab
Date :29/10/24
Lab Assignment 8

Q1(a) Load the following dataset, and the name of the dataset is ‘cluster_data.csv’,
which contains 9 records.

x1 x2
1.0 2.0
1.5 1.8
5.0 8.0
8.0 8.0
1.0 0.6
9.0 11.0
6.0 2.0
7.0 5.0
4.0 7.0

Q1(b) load the CSV file into a Pandas dataframe, and plot a scatter plot showing the
points
Q1(c) Generate three random centroids and mark them on the scatter plot
Q1(d) Implement the K-Means algorithm and plot a scatter plot showing the points
Q1(e) Print out the cluster to which each point belongs
Q1(f) Find the location of each centroid
Q1(g) Now repeat the same exercise using the KMeans class in Scikit-learn, with
cluster size = 3
Q1(h) Train the model using the fit() function
Q1(i) Print the cluster labels and centroids
Q1(j) Plot the points and centroids on a scatter plot
Q1(k) Predict the cluster of the following values
(2,7)
(5,6)
(1,4)
(2,0)
Q1(l) Find the optimal K using the Silhouette Coefficient
Q1(m) Plot a chart showing the various values of K and their corresponding Silhouette
Coefficients
Q2(a) Load the Iris dataset from sklearn using load_iris()
(b) Import the data into a Pandas dataframe
(c) Find out its shape and clean the data if possible
(d) Plot a scatter plot showing the distribution in Sepal length and Sepal width
(e) Cluster the points into three clusters (k=3) using Scikit-learn's KMeans class
(f) Plot a scatter plot showing the distribution in Sepal length and Sepal width
(g) Find the optimal K using the Silhouette Coefficient
(h) Plot a chart showing the various values of K and their corresponding Silhouette
Coefficients
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

iris = load_iris()
iris_data = iris.data
iris_feature_names = iris.feature_names

df = pd.DataFrame(iris_data, columns=iris_feature_names)

print("Shape of the dataset:", df.shape)


print("First few rows of the dataset:")
print(df.head())

Shape of the dataset: (150, 4)
First few rows of the dataset:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
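Q2(c) also asks to clean the data if possible; a minimal check (the sklearn iris
data ships with no missing values, so nothing needs dropping):

# check for missing values before clustering (Q2(c)); iris has none
print(df.isnull().sum())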

plt.scatter(df['sepal length (cm)'], df['sepal width (cm)'],
            label='Data Points')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Scatter Plot of Sepal Length and Sepal Width')
plt.legend()
plt.show()
kmeans_model = KMeans(n_clusters=3, random_state=42)
kmeans_model.fit(df[['sepal length (cm)', 'sepal width (cm)']])

KMeans(n_clusters=3, random_state=42)

plt.scatter(df['sepal length (cm)'], df['sepal width (cm)'],
            c=kmeans_model.labels_, cmap='viridis', label='Data Points')
plt.scatter(kmeans_model.cluster_centers_[:, 0],
            kmeans_model.cluster_centers_[:, 1], color='red', label='Centroids')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Scatter Plot of Sepal Length and Sepal Width with Clusters')
plt.legend()
plt.show()
silhouette_scores = []
K_range = range(2, 10)
for k in K_range:
    kmeans_model = KMeans(n_clusters=k, random_state=42)
    kmeans_model.fit(df[['sepal length (cm)', 'sepal width (cm)']])
    score = silhouette_score(df[['sepal length (cm)', 'sepal width (cm)']],
                             kmeans_model.labels_)
    silhouette_scores.append(score)

plt.plot(K_range, silhouette_scores)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Coefficient')
plt.title('Silhouette Coefficient for Different Values of K')
plt.show()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from copy import deepcopy
from sklearn.metrics import silhouette_score
from sklearn import metrics

data = {'x1' : [1.0, 1.5, 5.0, 8.0, 1.0, 9.0, 6.0, 7.0, 4.0],
'x2' : [2.0, 1.8, 8.0, 8.0, 0.6, 11.0, 2.0, 5.0, 7.0]}

df = pd.DataFrame(data)

plt.scatter(df['x1'], df['x2'])
plt.title("Scatter plot of Data Points")
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()

k = 3

X = np.array(list(zip(df['x1'], df['x2'])))

Cx = np.random.randint(np.min(X[:, 0]), np.max(X[:, 0]), size=k)
Cy = np.random.randint(np.min(X[:, 1]), np.max(X[:, 1]), size=k)

C = np.array(list(zip(Cx, Cy)), dtype=np.float64)

print(C)

[[4. 8.]
[4. 0.]
[5. 8.]]

plt.scatter(df['x1'], df['x2'])
plt.scatter(Cx, Cy)
plt.xlabel('x')
plt.ylabel('y')

Text(0, 0.5, 'y')

print(Cx)
print(Cy)

[4 4 5]
[8 0 8]

print(X)

[[ 1. 2. ]
[ 1.5 1.8]
[ 5. 8. ]
[ 8. 8. ]
[ 1. 0.6]
[ 9. 11. ]
[ 6. 2. ]
[ 7. 5. ]
[ 4. 7. ]]

print(C)

[[4. 8.]
[4. 0.]
[5. 8.]]

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def k_means(X, k, max_iters=100):
    np.random.seed(42)
    # pick k distinct points as the initial centroids
    centroids = X[np.random.choice(X.shape[0], k, replace=False)]
    for _ in range(max_iters):
        # assign every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for point in X:
            distances = [euclidean_distance(point, centroid) for centroid in centroids]
            cluster = np.argmin(distances)
            clusters[cluster].append(point)
        # recompute centroids; keep the old centroid if a cluster is empty
        prev_centroids = centroids
        centroids = [np.mean(cluster, axis=0) if cluster else prev
                     for cluster, prev in zip(clusters, prev_centroids)]
        if np.array_equal(centroids, prev_centroids):
            break
    return np.array(centroids), clusters

X = df.values
centroids, clusters = k_means(X, k)

colors = ['r', 'g', 'b']


for i, cluster in enumerate(clusters):
    cluster = np.array(cluster)
    plt.scatter(cluster[:, 0], cluster[:, 1], color=colors[i])

plt.scatter(centroids[:, 0], centroids[:, 1], color='purple', marker='*', s=100)
plt.show()

# print the cluster each point belongs to (Q1(e)) and the centroid locations (Q1(f))
for cluster_index, cluster in enumerate(clusters):
    for point in cluster:
        print("Point", point, "belongs to cluster", cluster_index)

print("Centroid locations:\n", centroids)

kmeans = KMeans(n_clusters=3, random_state=42).fit(X)


print("Cluster Labels:", kmeans.labels_)
print("Centroids:", kmeans.cluster_centers_)

plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')


plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,
1], marker='*', s=150, c='red')
plt.show()

Cluster Labels: [2 2 1 1 2 1 0 0 1]
Centroids: [[6.5 3.5 ]
[6.5 8.5 ]
[1.16666667 1.46666667]]
new_points = np.array([[2, 7], [5, 6], [1, 4], [2, 0]])
predictions = kmeans.predict(new_points)

print("Cluster predictions for new points:", predictions)

Cluster predictions for new points: [1 0 2 2]

silhouette_avgs = []
min_k = 2

for k in range(min_k, len(X)):
    kmean = KMeans(n_clusters=k).fit(X)
    score = metrics.silhouette_score(X, kmean.labels_)
    print("Silhouette Coefficients for k =", k, "is", score)
    silhouette_avgs.append(score)

f, ax = plt.subplots(figsize=(7, 5))
ax.plot(range(min_k, len(X)), silhouette_avgs)
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Coefficients")

Optimal_K = silhouette_avgs.index(max(silhouette_avgs)) + min_k

print("Optimal K is ", Optimal_K)

Silhouette Coefficients for k = 2 is 0.525054643458589


Silhouette Coefficients for k = 3 is 0.440169862888282
Silhouette Coefficients for k = 4 is 0.5353654148536285
Silhouette Coefficients for k = 5 is 0.4435132403139085
Silhouette Coefficients for k = 6 is 0.3864925695333016
Silhouette Coefficients for k = 7 is 0.25971834812765654
Silhouette Coefficients for k = 8 is 0.1334557693573676
Optimal K is 4
