0% found this document useful (0 votes)

43 views39 pages

AIML Lab Manual

Uploaded by

Mihika

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

43 views39 pages

AIML Lab Manual

Uploaded by

Mihika

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 39

Madhav Institute of Technology and Science, Gwalior

(A Govt. Aided UGC Autonomous & NAAC Accredited Institute)

(Affiliated to R.G.P.V. Bhopal)

Department Of Electronics Engineering

Lab Manual

140617 AIML Lab

Faculty Coordinator

Prof. (Dr.) R.P.Narwaria

Prof. Madhav Singh

Lab Coordinator

Mr. D. S. Tomar
LIST OF THE EXPERIMENT

S.NO NAME OF THE EXPERIMENT PAGE

NO.

1 Perform Creation, indexing, slicing, concatenation and repetition operations 1

on Python built-in data types: Strings, List, Tuples, Dictionary, Set

2 Solve problems using decision and looping statements. 2

3 Apply Python built-in data types: Strings, list, Tuples, Dictionary, Set and their 3
methods to solve any given problem

4 Manipulation of NumPy arrays- Indexing, Slicing, Reshaping, Joining and 3

Splitting.

5 Computation on NumPy arrays using Universal Functions and Mathematica.I 4

methods.

6 Import a CSV file and perform various Statistical and Comparison operations 5
on rows/columns

7 Create Pandas Series and DataFrame from various inputs 6

8 Import any CSV file to Pandas DataFrame and perform the following: 6
1. Visualize the first and last 10 records
2. Get the shape, index and column details
3. Select/Delete the records(rows)/columns based on conditions.
4. Perform ranking and sorting operations.
5. Do required statistical operations on the given columns.
6. Find th count and uniqueness of the given·categorical values.

9 Import any CSV file to Pandas DataFrame and perform the following: 9
1. Handle missing data by detecting and dropping/ filling missing
values.
2. Transform data using different me hods.
3. Detect and filter outliers.
...
4. Perform Vectorized String operations on Pandas Series.
5. Visualize data using Line Plots, Bar Plots, Histograms, Density
Plots and Scatter Plots.

10 Use scikit-learn package in python to implement following machine learning 14

models to solve real world problems using open source datasets:
1. Linear Regression model.
2. Multi-linear regression model.
3. Decision tree classification model.
4. Random forest model.
5. SVM model.
6. K-means clustering model.

: Experiment 1:
Perform Creation. indexing, slicing, concatenation and repetition operations ori Python built-in data types: Strings,
List, Tuples, Dictionary, Set

#String Creation
strl =•Abhinav•
str2 =•chaturvedi•

#string Concatenation
str3=str1+• •+str2
print(str3)

#string indexing
print(str3[8])

#string slicing
print(str3[:7:J)
print(str3[0:7:2])
print(str3[::-1])

#string repetition
print(strl • 3)
Abhinav Chaturvedi
A
Abhinav
Ahnv
idevrutahC vanihbA
AbhinayAbhinavAbhinav

1 [ j: #List Creation
listl =['Python', 'C++', 'JavaScript']
list2 =[l, 2, 3]

#List Concatenation
list3=listl+list2
print(list3)

#List indexing
print(list3[0])

#List slicing
print(list3[:3:])
print(list3[0:7:2])
print(list3[::-1])

#List repetition
print(listl • 3)
(•Python•, •c++•, 'JavaScript', 1, 2, 3]
Python
['Python', 'C++', 'JavaScript']
('Python', 'JavaScript', 2] ,
(3, 2, 1, 'JavaScript', 'C++' • ython
('Python', 'C++', 'JavaScript, Python, •C++•, 'JavaScript', •Python', 'C++', •JavaScript']

I,, [ 1: #tuple Creation

#items of a tupe Ca nnot be changed once create
tuple1
=('Python',
'C++'
'JavaScript')
tuple2 =(l, 2,
3) '

#'tuple Concatena ion

tuple3=tuple1+tuple2
print(tuple3)

#tuple indexing
print(tuple3f0])
print(tuple3f0][1))

#ruple slicing
print{tuple3[:3:])
print{tuple3[0:7:2])
print(tuple3[::-1])

#tuple repention
print(tuplel * 3)
('Python', 'C++,' 'JavaScri•pt', 1, , )
2 3
Python
y
('python', 'C++', 'JavaScript')
{'Python', 'JavaScript', 2)
{3, 2, 1, 'JavaScript', 'C++' 'Pyth ')
('Python', 'C++', 'JavaScript , 'Pyt :n•,
'C++', 'JavaScript', 'Python·, 'C++', 'JavaScript')
In ]: #Dicn.onary Creation
d ct1 ={7:'P hon', 2:'C++', 3:'JavaScript'}
dict2 ={ good :1, 'average':2, 'nice':3}

#diet Concatenation
def Merge(dict1, dict2):
return({**dictl, **dict2})

dict3=Merge(dictl,dict2)
print(dict3)

#diet - element access

print(dict2['good'])
print(dictl.get(l))

{1: 'Python', 2: 'C++', 3: 'JavaScript', 'good': 1, 'average': 2, 'nice': 3}

1
Python

In j: #Set Crearion
#a set cannot have mirtable elements like list:s, sets or dictionaries as its elements.
#You cannot access ite s in a set by referring to an index or a key.
Setl ={'Python', 'C++", 'JavaScript'}
Set2 ={1, 2, 3}

#Set Concatenation Setl.update(Set2) Setl.add('Java·) print(Setl)

{l, 2, 'C++', 3, 'JavaScript', 'Python', 'Java'}
: Experiment 2 :
Solve problems using decision and looping statements.

In J: def print_factors(x):
print("The factors of•,x,nare:")
for i in range(l, x + 1):
if x X i == 0:
print(i)

num=320
print_factors(num)

The factors of 320 are:

1
2
4
s
8
10
16
20
32
40
64
80
160
320
: Experiment 3 :

Apply Python built-in data types: Strings, List, Tuples, Dictionary, Set and their.methods to solve any given problem

In [ ): Age={'person2': 21,'personS':24,'person6':19,'personl':20,'person3':23,'person4':22}
sortedDict = sorted(Age)
print(sortedDict)

#sort:i.ng on the basis of value using Lambda function

print(sorted(Age.items(),·key=lambda x:x[l]))

#without using La bda function

def element_l(x):
return x[l]

sorted(Age.items(), key=element_l)
['personl', 'person2', 'person3', 'person4', 'persons', 'person6']
[('person6', 19), ('personl', 20), ('person2', 21), ('person4', 22), ('person3', 23), ('persons', 24)]
Out[ ): [('person6', 19),
('personl', 20),
('person2', 21),
( •per.son4•, 22),
('person3', 23),
(·persons', 24)]
: Experiment 4:
Manipulation of NumPy arrays- Indexing, Slicing, Reshaping, Joining and Splitting.

In [ J: arr= np.array([l, 2, 3, 4, 5])

print(arr)

print(type(arr))

print{arr[l]) l.
pi-int{arr[::2])
print(arr[l:5])
print(arr[l:5:2])

arrl= np.array([l, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

newarr = arrl.reshape(4, 3)

print(newarr)

newarr 3d = arrl.reshape(2, 3, 2)
print(;ewarr_3d)

arr3 = np.array([100, 200, 300])

arr4= np.array([400, 500, 600])

arr= np.concatenate((arr3, arr4)

print(arr)

arrl = np.array([[l, 2], [3, 4]])

arr2 = np.array([[5, 6], [7, 8]])

arr= np.concatenate((arrl, arr2), axis=l)

print(arr)

arr= np.array([l, 2, 3, 4, 5, 6])

newarr = np.array_split(arr, 3)

print(newarr)

arr= np.array([[l, 2, 3], [4, S, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15), [16, 17, 18]])

newarr = np.array_split(arr, 3)

print(newarr)

[1 2 3 4 5)
<class 'numpy.ndarray'>
2
[1 3 5]
[2 3 4 5]
[2 4]
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 1112]]
[[[ 1 2]
[ 3 4]
[ 5 6]]

[( 7 8]
[ 9 10]
(11 12]]]
[100 200 300 400 500 600]
[[1 2 5 6]
(3 4 7 8]]
[array([l, 2]), array([3, 4]), array([5, 6))]
[array([[l, 2, 3],
(4, 5, £]]), array(([ 7, 8, 9],
(10, 11, 12]]), array([[13, 14, 15],
(16, 17, 18]])]
: Experiment 5:
Computation on NumPy arrays using Universal Functions and Mathematical methods.

Note: Universal functions in Numpy are simple mathematical functions. It is just a term that we gave to mathematical
functions in the Numpy library. Numpy provides various universal functions that cover a wide variety bf operations.
These functions include standard trigonometric functions, functions for arithmetic operations, handling complex
numbers, statistical functions, etc.

In [ ]: # Python code to demonstrate trigonometric function

# create an array of angles

angles= np.array([0, 30, 45, 60, 90, 180])

#conversion of degree into radians using deg2rad function

radians= np.deg2rad(angles)

# sine of angles
print('Sine of angles in the array:')
sine value= np.sin(radians)
print(np.sin(radians))

#hypot function demonstration

base= 4
height= 3
print('hypotenuse of right triangle is·• h
· ,np. ypot(base, height))
#statistical method
weight= np.array([50.7 S2.S 50 58
:i t(. , J ' ' 55 - 6 3, 73.25, 49.5, 45])
pr_n Mean weight of the students: ')
print(np.mean(weight))

Sine of angles in the array:

[0.80000000e+00 S 00000000e 0
l.00000000e+00 1 22464680e=1 ]7.07106781e-01 8.66025404e-01
hypotenuse of right triangle is: 5.0
Mean weight of the students:
S4.322S
: Experiment 6 :
Importa CSV file and perform variou St f . I
s a tst,ca and Comparison operations on rows/columns.
In r ) : df=pd.read_csv('50_Startups.csv')
df.head(10)

Out[ ] : R&DSpend Administration Marketing Spend

State Profit
0 165349.20 136897.80 471784.10 New York 192261.83
1 162597.70 1S13TI.59
443898.53 California 191792.06
2 153441.51 10114S.SS 407934.54 Florida 191050.39
3 144372.41 118671.85 383199.62 New York 182901.99
4 142107.34 91391.77 366168.42 Florida 166187.94
s 131876.90 99814.71 36286136 NewYork 156991.12
6 134615.46 147198.87 127716.82 California 156122.51
7 130298.13 14S530.06 323876.68 Florida 155752.60
8 120542.52 148718.95 311613.29 New York 152211.77
9 123334.88 108679.17 304981.62 California 149759.96

In ):
Where method to compare the values
#
# The values were stored in the new column
# By using the Where() method in NumPy, i1e are given the condition to compare the columns.
# If 'columnl' is Lesser than 'column2' and 'column1' is -Lesser than the_'column3', .
# ive print the values of 'coLumnl'. If the condition fails, we give the value as 'NaN'.
df['-new'J = np.where((df['R&D Spend']<= df['Administrat on']) I (
df['R&D Spend'] <= df['Marketing Spend']), df['R&D Spend'], np.nan)

df.head()

Out[ ] : R&D Spend Administration Marketing Spend State Profit new

0- 16534920 136897.80 4'71784_,0 New York 192261.83 165349.20 ..

1 162597.70 151377.59 443898.53 California 191792.06 162,597.70

2 153441.51 101145.55 407934.54 Florida 191050.39 153441.51

3 144372.41 118671.85 383199.62 New York 182901.99 144372.41

4 142107.34 91391.77 366168.42 Florida 166187.94 142107.34

In ]: print( df.sum())
3686080.78
R&D Spend
6067231.98
Administration
10551254.89
Marketing Spend
YorkCaliforniaFloridaNew YorkFloridaNew Yo ...
State New 5600631.96
Profit 3686080.78
new
dtype: object

In [ ): print( df['Profit'].sum( - ---- --··

. - -•A--·A- •- A •.. . - ..
¥ •

,
5600631.960000001

In ]: print(df.mean())
R&D Spend
73721.6156
Administration 121344.6396
Marketing Spend 211025.0978
Profit 112012.6392
new
73721.6156
dtype: float64

5600631.960000001
: Experiment 7:
Create Pandas Series and DataFrame from variousinputs.

In ); d t
a a= np.array(['personl', 'person2', 'person3', 'person4', 'persons'])
ser_5tud = pd.Series(data)
print(ser_stud)

# create series form a list

list-= ["95' •94• '99' •93· '.97']
ser_num = pd Serie;(list) ,
print(ser_num)
0personl
1 person2
2 person3
3 person4
4 persons
dtype: object
0 95
1 94
2 99
3 93
4 97
dtype: object

In ]: dataframe = { 'Students': ser_stud, 'numbers': ser_num}

#Creating DaraFra,-r:bey passing Dictionary
result= pd.DataFrame(dataframe)
result.head()

Out [ j: Students numbers

0 person1 95
1 person2 94
2 person3 99
3 person4 93
4 persons 97

In [ ]: data= [['person1• 10], ('person2', 15], ('person3', 14]]

df = pd.OataFrame(ea a, c.olumns={'Name', 'Age']) ,.
df.head()

OL1t [ ) ; Name Age

0 person1 10
1 person2 15

2 person3 14
: Experiment 8:
Import any CSV file to Pandas Data Frame and perform the following:

1. Visualize the first and last 10 records

2. Get the shape, index and column details
3. Select/Delete the records(rows)/columns based on conditions.
4. Perform ranking and sorting operations.
. l).o rn1111f1t1tJ I l!t/bllrnl OflfllllIl on 0,1111,,
01
0. I Incl tho f 1111111rtndtmlqt1tn
I 1.6 ,, I II Ill

vu,,1 011111111
'"''
(11111
I H1m111llfl f;/11plMn111tlpln,1,l111rtn• "', tlluc11111,"w
' ,lu, r,;,

11
' l It/ )I/ 1'•11I IIU ,/(lf 11
,JJ 11h lt1§,._1l11t·11i-1Jll. l't!ll(I
,tJ nlmtr1 1J11t11, h i!tl( H)'
V( 'plmrl■flfrt
't U\/')

lp1 ( )
0
"""' '" " i,lu<:o•a co
()
11 tll;,uoll
"''n lhldo11,o ,,,..,1,,. 11ml dlii,h 1r1Jd .a9e "'1n dl bf!tes
1411 J () 3,G (),(i27 1,3700 Trur:
ll!, (l(l
o () 20,(i (J,3!,1 1,1426 Fal
:t 11
''"
1(]'1

., ll9 GG 2
() ()
!M 2ll.1
;>J, Olin
0,1G7
.12
21
OJX)()()
0,9()(j2
True
False
... () 1· 7 40 ·!, 1GB 43,1
2,288 33 1.3790 True
s ll<i 74 0 0 2!.i,6 0201 30 0,0000 False
G 70 50 32 /18 311) 0,248 26 12608 True
7 10 115 0 0 0 35,3 0,134, 29 0.0000 False
tJ 2 197 70 4!i 543 30,5 0,158 53 1,TT30 True
9 0 125 9G 0 0 0,0 0232 54 0,0000 True

ln I J : diobctcs_dotn.tnil(10)

Out[ J : num_prcg glucose_conc d l u tol l c J > p

7S8 106
- -- ·
76
- thickness Insulin bml diab_pred age skin diabetes

0 False
0 37,5 0,197 26 0.0000
7S9 6 190 92 0 0 35.5 0.278 66 0,0000 True

760 2 80 50 26 16 28,4 0,766 22 1,0244 False

761 9 170 14 31 0 44,0 0.403 43 1.2214 True

762 9 89 62 0 0 22,5 0.142 33 0.0000 False

763 10 101 76 48 180 32,9 0,171 63 1,8912 False

764 2 122 70 27 0 36,8 0.340 27 1,0638 False

76S 5 121 72 23 112 26,2 0.245 30 0.9062 False

766 126 60 0 0 30.1 0.349 47 0.0000 True

767 93 70 31 0 30,4 0.315 23 1.2214 False

Jn [ ]: diabetes_data.shape
Out[ J; (768, 10)
In [ ) : dJaqetes_data.describe() •
Ovt( J: thickness Insulin bml diab_pred age skin
num_preg glucose_conc dlastollc_bp

768.000000 768,000000 768.000000 768,000000 768.000000 768.000000 768.000000

count 768.000000 768.000000

mean 69,105469 20,536458 79.799479 31.992578 0.471876 33.240885 0.809136

3.845052 120.894531

19,355807 15,952218 115,244002 7,884160 0.331329 11.760232 0.628517

std 3.369578 31.972618
0.000000 0.000000 0,000000 0.078000 21.000000 0.000000
min 0.000000 0,000000 0,000000

0.000000 0,000000 27,300000 0.243750 24.000000 0.000000

25% 1.000000 99,000000 62.000000

SO% 23,000000 30.500000. 32,000000 0,372500 29.000000 0.906200

3,000000 117,000000 72.000000

32.000000 127,250000 36.600000 0.626250 41.000000 1.260800

15% 6.000000 140.250000 80,000000

max 99,000000 846.000000 67.100000 2.420000 81.000000 3.90000

17,000000 199,000000 122.000000
11122122,

In f J •J 2
diab

diabbetes_data(,dJab
a etes_data. head(5) He!l_data [• d illb e 'J.map(d1abe1:eS.JMP)

Out[ ]· num_preg glucose_conc dlastollc.J>p thlckneu lnwlln bml dlab_pred age sldn dlabrltt
0 6 148 72 35 0 33.6 0,627 so 13790
85 66 0
29 0 26,6 0.351 31 1.142.6
2 6 183 64 0 0 23,3 0,672 32 0.0000
3 89 66 23 94 28.1 0.167 21 0.9062 0
4 0 137 40 35 168 43.1 2.288 33 13790

In l ]: sns.countplot(x='diabetes.,data=diabetes_data)

Ou [ ): <AxesSubplot: xlabel•'diabetes'
1
ylabel
•
•
count'>

,l,,J
C:
:::,
8

1
diabetes

In [ ]: diabetes_data['diabetes').value_counts()

Out[ J: 0 500
1 268
Name: diabetes, dtype: int64

n [ ]: diabetes_data.sort_values("age")
diast bmi dfab_pred age skin diab etes
num_preg glucose_conc olic_bp thickness insulin
Out[ ]:
113 64 35 0 33.6 0.543 21 13790
255
0 0.0 0304 21 0.0000 0
84 0 0
60 2
0 225 0262 21 0.0000 0
125 96 0
102 0
0299 21 0.7880 0
0 74 20 23 27.7
182
115 435 0347 21 1.0638 0
94 70 27
623 0

0 26.8 0.186 69 0.0000 0

80 0
123 s 132
0
0 0 0.0 0.640 69 0.0000
136 82
68,4 5
18 0 325 0235 70 0.7092
145 82
666 4
0 19.6 0.832 72 0.0000 0
0 0
453 2 119
60 25.9 0.460 81 13002 0
74 33
459 9 134
.
'11122122,
Llllt
Tri ( 1: diobct •s_d,,tn(; t1(l rllnk •) dJ,nl,ct;u do• . [,
- , 0 "Cc·1.1·t1111 c,, , cmtH 1111r II c)
In r l: dii'lbete _dntn.heod()

ut[ ] ' nun pn g glucosc_conc dlftstoll t," thlclmnu l11J11lh1 111111 dlilh. Jll'fill IIIJfl 1kl11 cJl111i11tn ii!)fl fitlll(
0 G 140 7 /I!,!;
!I' () ;!;IIi oc,:11 !,O I fJ /I (I
05 (j(j :!!) f.1,1'£l 0 31(),'
I) I, I 03!J1
2 I) 11) 1'1 0 0 ,!.;J, OIJ'/2 ;, 00000 :rnu,
3 09 GG
"
ii. f '1 i·1 0,00l:i2 7 l.O
i'fl,1 ().11,'/
,4 0 137 40 3r OJ,()
160 4·.1 .,?ll/J 1. . 70{1

111 ( J: i l, inplD •r:ol e)llw<l ..J fo,• <1Lumll/J

diabetes_data "' diobetes_cfota .drop([ • nge_ronk,], n > < •

Operations based on conditions

In [ J: diabetes_data.groupby( 'diabetes') ,nienn()

Out[ ] : num_p g glucosc_conc dfostollc_bp thickness Insulin lnnl dl11h_prad

diabetes

0 3.298000 109.980000 68.184000 19.6f,4000 60.792000 30,304200 0.429734 31,190000 0,774762

4.865672 141.257463 70.824627 22.164179 100.335821 35,142537 0,550500 37,067164 0,873269

Renaming Columns
,
In [ ]: diabetes_data.rename(columns = {'glucose_conc': 'glucose', 'diastolic_bp': 'bp'}, inplace • True)

In [ ) : diabetes_data.head()

out[ J: num_preg glucose bp thickness Insulin bml dlab_pred age skin diabetes

0 6 148 72 35 0 33.G 0.627 so 1.3790

85 66 29 0 26.6 0.351 31 1.1426 0

2 8 183 64 0 0 23.3 0.672 32 0.0000

3 89 66 23 94 28.1 0.167 21 0.9062 0

4 0 137 40 35 168 43.1 2.288 33 1,3790

In [ ] : diabetcs_data.age.unique()

Out[ ) : array( [50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 34, 57, 59, 51, 27, 41, 43,
22, 38, 60, 28, 45, 35, 46, 56, 37, 48, 40, 25, 24, SB, 42, 44, 39,
36, 23, 61, 69, 62, 55, 65, 47, 52, 66, 49, 63, 67, 72, 81, 64, 70,
68), dtype=1nt64)

In [ ) : len(diabetes_data.-age.unique())

Out[ J: 52
:Experiment 9:
Import any CSV file to Pandas DataFrame and perform the following:

1. Handle missing data by detecting and dropping/ filling missingvalues.

2. Transform data using different methods.
3. Detect and filter outliers.
4. Perform Vectorized String operations on Pandas Series.
5. Visualize data using Line Plots, Bar Plots, Histograms, Density Plots and ScatterPlots.

Dataset Import and overview

_In [ ): df•pd.read_csv( • S0_Startups. csv')

df.hcad{l0)

()111"[ ]:
-- 0
R&DSpend Administration
165349.20 136897.80
-Marl<--cting Spend

471784.10 New York

State
Profit

192261.83
162597.70 151377.59
443898.53 California 191792.06
2 153441.51 101145.55 407934.54 Florida 191050.39
144372.41
..
3 118671.85 383199.62
142107.34 New York 182901.99
91391.77
366168.42
Florida 166187.94
s 131876.90 99814.71 36286136
New York 156991.12
6 134615.46 147198.87 127716.82
California 156122.51
7 130298.13 145530.06 323876.68
Florida 155752.60
8 120542.52 148718.95 311613.29
New York 152211.77
9 123334.88 108679.17 304981.62
California 149759.96

In f ]: df.shape

out[ ]: (50, 5)

Handling Missing Data

In [ J: #Check if any null or empty data is present in dataset

#if not a numbe,-/nuLL -> df.dropna(inplace=True)
#df{'"R&D Spend"] = df["R&D Spend HJ. replace(np. NaN, df["R&D Spend"] .mean())
df.isna().sum()
••- w••-•• • -

Out[ ]: R&D Spend 0

Administration 0
Marketing Spend 0
State 0
Profit 0
dtype: int64

In [ ]: print("total number of rows : {0}".format(len(df)))

print("number of rows missing R&D Spend: {0}".format(len(df.loc[df['R&D Spena']== 0))))
print("number of rows missing Administration: {0}".format(len(df.loc[df['Administration'] = 0))))
print("number of rows missing Marketing Spend: {0}".format(len(df.locfdf['Marketing Spend']== 0))))
print("number of rows missing State: {0}".format(len(df.loc[df['State'] == 0))))
print("number of rows missing Profit: {0}".format(len(df.loc[df("Profit'] == 0])))·
total number of rows : 50
number of rows missing R&D Spend: 2
number of rows missing Administration: 0
number of rows missing Marketing Spend: 3
number of rows missing State: 0
number of rows missing Profit: 0

In ( ]: df.describe()
Out[ ]: R&DSpend Administration
..
Marketing Spend Profit .
50.000000 50.000000 50.000000
count 50.000000
211025.097800 112012.639200
mean 73721.615600 121344.639600

122290.310726 40306.180338
std 45902.256482 28017.802755

0.000000 14681.400000
min 0.000000 51283.140000

129300.132500 90138.902500
25% 39936370000 103730.875000
212716.240000 107978.190000
SO% 73051.080000 122699.795000
299469.085000 139765.977500
75% 101602.800000 144842.180000
471784.100000 192261.830000
max 165349.200000 182645.560000

J: rd_spend
v,.. ,. ,. -------------------- -_ _. - ,.

In ( = df["R&o Spend"] inplace=True)

rd_spend.replace(to_replace = 0, vaiue = rd_speod.mean(),
11ri7fl ' ,narket_spend = df["Marketing Spend"]
ket spend.replace(to_replace = 0 value=m k
,nar - • ar et_spend.mean(), inplace=True)
Finding Missing Values

print("total number of ows : {0}".format(len(df)))

rri [ 1: nd
print("number off rows miissing R&D_Spe : {0}".format(len(df.loc[df['R&DSpend']==-0])))
•nt("number o rows m ssing Adm1nistratio. { 0}"
pri.nt("num ber of row5 m1•ss1•ng M k
ar eting Spennd.· {e}."fformat(len(df.loc[df["Administration') == 0))))
P Of . . 5t · • ormat(len(df.loc(df['Marketing Spend']== 0])))
print("number rO\--IS m ss ng at : {e}" .format(len(df.loc(df['State'] == 0])))
print("number of r0\--1s missing Prof_it:{0 ".format(len(df.loc[df['Profit'] == e])))
total number of rows: 50
number of rows missing R&o Spend: 0
number of rows missing Administration: 0
number of rows missing Marketing Spend: 0
number of rows missing State: 0
number of rows missing Profit: 0

outliers Detection

For Normal distributions: Use empirical relations of Normal distribution.

-> The data points which fall below mean-3(sigma) or above mean+ 3(sigma) are outliers.

For Skewed distributions: Use Inter-Quartile Range (IQR) proximity rule.

-> The data points which fall below Q1 -1.S IQR or above Q3 + 15 IQR are outliers.

where Q1 and Q3 are the 25th and 75th percentile of the dataset respectively, and IQR represents the inter-quartile
range and given by Q3 - Q1.

Z-score treatment:
Assumption- The features are normally or approximately normally distributed. ·

Plotting Graph

In ]: plt.figure(figsize={l6,10))

plt.subplot(2,2,1)
sns.distplot(df["R&D Spend"])

plt.subplot(2,2,2)
sns.distplot(df["Administration"])

plt.subplot(2,2,3)
sns.distplot(df["Marketing Spend"])

plt.subplot(2,2,4)
sns.distplot(df["Profit"])

plt.show()
-
lo-5
1.2 u

1.2
1.0

1.0

o.a
r!
0.6

0.4

0.2

0.0
R6'D Spend 50000 100000 l.50000 200000
Mministration
10-6
4.0 lo-5
1.2
J.5

1.0
J.0

2.5 0.8
- If! 2.0 -
! 0.6
1.5
0.4
1.0

0.5 0.2

0.0
0.0
100000 200000 lXlOOO!J .ciroooo :!00000 600000 -50000 50000 100000 150000 200000 250000
0
Mulcetingnd
Profit

In[ ): #Finding the Boundary Values

print{"Highest allowed R&D Spend m,df["R&D Spend"J.mean() + 3*df["R&D Spend"J.std())

print("Lowest allowed R&D Spend ",df["fl&D Spend"].rnean() - 3*df["R&D Spend"].std{),"\n")

print{"Highest allowed Administration ",df["Administration"J.mean() + 3*df["Administration"J.std())

print("Lowest allowed Administration ",df["Administration"J.mean() - 3*df["Administration"J.std(),"\n")

print{"Highest allowed Marketing Spend ",df[hMarketing Spend"J.m an() + 3*df["Marketing Spend"J.std())

print{"Lowest allowed Marketing Spend ",df["Marketing Spend"J.mean() - 3*df["Marketing Spend"J.std(),"\n

print("Highest allowed Profit",df["Profit"].mean() + 3*df["Profit"J.std{))

print("Lowest allowed Profit",df["Profit"J.mean() - 3*df["Profit"J.std(),"\n")
Highest allowed R&D Spend 206619.73822903878
Lowest allowed R&D Spend -53278.77778103876

Highest allowed Administration 205398.04786646605

Lowest allowed Administration 37291.23133353396

Highest allowed Marketing Spend 553207.7653167574

Lowest allowed Marketing Spend -105834.55798075726

Highest allowed Profit 232931.18021295167

Lowest allowed Profit -8905.901812951619

In [ ]: # Finding the Outliers

df[(df["R&D Spend"] > 206619.73822) f (df["R&D Spend"]< -53278.7777))
Out[ ] :
R&D Spend Administration Marketing Spend State Profit

In[ J: # Finding the Outliers

1 f[(df["Admi is ration"] > 205398.04786) I (df["Administration"J < 37 91.23133))
Out[ ]:
R&D Spend Administration Marketing Spend State Profit

/ ln [ ) : 1 # Finding the Outliers

r-
.df[(df["Marketing Spend"]> 553207.76531) I (df["Marketing Spend")< -105834.5579))

l -
nb
t1 2/')'
R&D Spend AdmlnlstrAtlon Mnrkatlng Spend Stlltll Profit
,,,,ti 1

11 finding the o,rtL ie,·.

1
,, I J:
df[(df["Pr fit") > 232931.18021) I (df["l'roflt") < •090 .9018 2)]

R&D Spend Administration M11rketlng s1,and State rro11t

Plotting Graph for Outliers

sns,boxplot(df["Admini trntion"))
Jn r ].
<AxesSubplot: >
1tJl{ ):

180000

160000

140000

120000

100000

80000

60000

In [ ] : #finding the IQR since ske1-1ed

percentile25 = df['Administration').quantile(0.25)
percentile75 = df['Administration').quantile(0.75)
iqr=percentile75-percentile25#q3-q1

upper_limit percentile75 + 1.5 • iqr

lower_limit percentile25 - 1.5 • iqr

df[df["Administration"] > upper_limit]

df[df["Administration"] < lower_limit]
multi_df=df

Vectorized String operations

In [ ]: # Panda series use in this section

names pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White', 'Hank Shrader', 'Mike Ehrmantraut',
'Gus Fring'])
names
Out[ ] : 0 Walter White
1 Jesse Pinkman
2 Skyler White
3 Hank Shrader
4 Mike Ehrmantraut
5 Gus Fring
dtype: object
111
[ l
J: names.str.upper()

file:Jt1F:17th ,,
sel1\/Departmental Lab/ml_lab/Abhlnav_Chaturvedl-0901CS191003/lab.html
13/33
Lab
WALTER WHITE
JESSE PINKMAN
SKYLER WHITE
HANK SHRADER
4 MIKE EHRMANTRAUT
5 GUS FRING
dtype: object

rn [ J: names.str.len()
]: 0 12
outf 1 13
2 12
3 12
4 16
S 9
dtype: int64

In f J: names.str.startswith('W')

0 True
out( ]:
1 False
2 False
3 False
4 False
5 False
dtype: bool
Vectorized indexing and slicing

In [ ]: names.str[0]

out[ J: 0 w
1 J
2 s
3 H
4 M
5 G
dtype: object

In [ j: names.str.slice(0,2)

Out[ ]: 0 Wa
1 Je
2 Sk
3 Ha
4 Mi
s Gu
dtype: object

In [ ]: names.str.split()

Out[ j: 0 [Walter, White]

1 [Jesse, Pinkman]
2 [Skyler, White] ..
3 [Hank, Shrader]
4 [Mike, Ehrmantraut]
s [Gus, Fring]
dtype: object
In [ ]: names.str.split().str.get(0)

Out[ ]: 0 Walter
1 Jesse
2 Skyler
3 Hank
4 Mike
s Gus
dtype: object
: Experiment 10:
Use scikit-learn package in python to implement following machine learning models to solve real world problems
·-'e using open source datasets

unear Regression Model

linear df=pd.read_csv('Salary Oat a.csv')

- -

linear_df.head( 10)

VearsExperience S a l a r_
y

1.1 39343.0
0
13 46205.0
1

2
1.5 37731.0

2.0 43525.0
3

22 39891.0
4

s .2.9 56642.0

6 3.0 60150.0

7 32 54445.0

8 32 64445.0

9 3.7 57189.0

linear_df .info() ·-. ........ · ..-..•.•. _ .... _ -··

<class •pandas.core.frame.DataFrame'>
Rangeindex: 30 entries, 0 to 29
Data columns (total 2 columns):
# Column Non-Null Count Dtype

0 YearsExperience 30 non-null float64

1 Salary 30 non-null float64
dtypes: float64(2)
memory usage: 608.0 bytes

x = linear_df[ ['YearsExperience']]
X
sE,cperience 2.0
3
1.1 2.2

v
0
13
2.9
"
e
1 s
a 1.5 3.0
r 2
6
3.2
1
3.2
8
3.7
9
3.9
10
11 4.0

12 4.0

13 4.1

14 4.5
15 4.9

16 5.1

17 53

18 5.9

19 6.0

20 6.8

21 7.1

22 7.9

23 8.2

24 8.7

25 9.0

26 9.5

27 9.6

28 10.3

29 10.5

In[ ) : · ·· ·· ·
Y: linear_df.iloc[:,1].values
y
•
Oltt[ ): 56642., 60150.,
array( [ 39343., 46205., 37731., 43525., 39891.,
56957., 57081.,
54445., 64445., 57189., 63218., 55794., 93940., 91738.,
81363.,
61111., 67938., 66029., 83088., 116969., 112635.,
105582.,
98273., 101302., 113812., 109431.,
122391., 121872.])
In [ ]:
Plt.scatter(x,y)
Plt.show()
•• • •

rn[ ]:

In [ ]:

Out[ J:

In [ ] : y_pred = model. predict(x)

y_pred

Out[ ]: array([ 36187.15875227, 38077.15121656,

39967.14368085, 44692.12484158,
46582.11730587, 53197.09093089, 54142.08716303, 56032.07962732,
56032.07962732, 60757.06078805, 62647.05325234, 63592.04948449,
63592.04948449, 64537.04571663, 68317.03064522, 72097.0155738,
73987.00803809, 75877.00050238, 81546.97789525, 82491.9741274,
90051.94398456, 92886.932681 , 100446.90253816, 103281.8912346,
108006.87239533, 110841.86109176, 115566.84225249, 116511.83848464,
123126.81210966, 125016.80457395])
In [ ] :
Plt.scatter(x,y)
Plt.title("Linear Regression using Ordinary Least Square Meth0d")
plt.plot(x,y_pred,color='red',label='Best Fit Line')
Plt.legend()
It.show()
Ltlb
Linear R gr sslon using Ordl
nary Least
- Oest Flt Lin quare Method
120000

100000

aoooo

60000

40000

1n I ): model.coef_

out[ ]: array([9449.96232146))

In I ): model.intercept_

Out[ ]: 25792.20019866871

In f ]: model.predict([[4)))

Out[ ): array([63592.04948449))

In [ ): from sklearn.metrics import r2_score

r2_score(y,y_pred)•100

Ou [ ) : 95.69566641435085

Mult-linear Regression Model

!n [ ): x = multi_df.iloc[:,:-1) # Independent features

y = multi_df.iloc[:,-1) # Dependent feature
x.head()

Out[ J: R&DSpend Administration Marketing Spend State

0 165349.20 136897.80 471784.10 New Yori::

1 162597.70 151377.59 443898.53 California
2 153441.51 101145.55 407934.54 Florida
3 144372A1 118671.85 383199.62 New Yori::
4 142107.34 91391.77 366168.42 Florida

r I l : y,head()

Ouq J. e 192261.83
1 191792.06
2 191050,39
3 182901.99
4 166187.94
Name: Profit, dtype: float64
In l l: x,State.value_counts()
lab
New York 17
California 17
Florida 16
Name: State, dtype: int64

".,conve,-t these catego,·ic values into one-ltot

rn( 1 ·· one_hot_states pd.get_dummies(x.state) encoding.

one_hot_states.head()
111 [ l:
Califomla Florida New Yori<
outf J:
0 0 0

0 0

2 0 0

3 0 0

0
" 0

IO
): x.drop(["State"], axis= 1, inplace = True)

In [ }: x = pd.concat([x, one_hot_states]. axis l)

In( ]: x.head(S)

out[ ] : R&D Spend Administration Matl<eting Spend California Aorida New Yori<

0 16534920 136897.80 471784.10 0 0

1 162597.70 151377.59 443898.53 .0 0
2 153441.51 101145.55 407934.54 0 0
3 144372.41 118671.85 383199.62 0 0

" 142107.34 91391.77 366168.42 0 0

In ]; from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In[ ]: xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.2, ra dom_state 0)

In ]:· multi_linear_reg = linearRegression()

multi_linear_reg.fit(xtrain, ytrain)

Out [ l: T LinearRegression.

LinearRegression()

rn[ ]: multi_linear_reg_predictions = u1.:::,_1: ear_reg.prejict(xtes )

ln [ l: print("R2 score:", r2_score(ytest, multi_linear_reg_predictions))

R2 score: 0.8711226942394046

Decision Tree Classification Model

In [ ); diabetes_data.head()
bml dlab_pred age skin diabetes
Out[ ]: num_preg glucose bp thickness Insulin

-----------
0
..---·-
35 0 33.6 0.627 so 1.3790
6 148 72
0.351 31 1.1426 0
66 29 0 26.6
85
0 23.3 0.672 32 0.0000
2 8 183 64 0
0.167 21 0.9062 0
3 66 23 94 28.1
89
43.1 2.288 33 1.3790
35 168
4 0 137 40
1f1,(}.2•
diabetes_data.c
olumns
1,, l l·
Index(['num_preg', 'glucose', 'bp', 'thickness',
cwt[ }' •age•, •skin·, 'diabetes'), 'insulin', 'bmi', 'diabJ)red',
dtype='object')

diabetes_data.dcscribc()
JO [ J:
num_preg glucose bp thickness
ouiC 1:
Insulin bml dlab_pred age skin diab<

count 768,000000 768.000000 768,000000 768.000000 768.000000 768,000000 768.000000 768,000000 766.000000 768,000

n,ean 3.64S0S2 120.894S31 69.105469 20.S36458 0.809136 0348

79,799479 31,992578 0.471876 332A0885
std 3.369S78 31.972618 1935S807 1S.9S2218 115,244002 0.628517 OA16
7.884160 0.331329 11.760232
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000
0.000000 0.000000 0.078000 21.000000
2S% 1.000000 99.000000 62.000000 0.000000 0.000000 24.000000 0.000000 0,000
27.300000 0.243750

SO% .3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.906200 0,000

7S% 6.000000 1402S0000 80.000000 32.000000 1272S0000 36.600000 0.6262S0 41.000000 1260800 1.000

max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 3.900600 1,000

'.1't
""
diabetes_data.isnull().sum()
In l ]:
num_preg 0
out[ ] :
glucose 0
bp 0
thickness 0
insulin 0
bmi 0
diab_pred 0
age 0
skin 0
diabetes 0
dtype: int64

plt.figure(figsize=(12,10))
In l :] b has an easy method to showcase heatmap
# sea orn () annot=True,cmap ='RdVlGn')
p = sns.heatmap( diabetes:-:- at . orr , - -· - - -

...

20/33
,
2f7.Z,
ti
1.0

0.28 0.24 0.21

0.8

V
0.39 0.18
£
C
1.,§, 0.2 0.19 -0.6
-

!
- 0.4
0 Hi
'O

'1
.J:J
10
'6
i- 0.2

0.0

10
'6

num_preg glucose bp thickness insulin bmi diab_pred age skin diabetes

In [ ]:·diabetes_data_copy = diabetes_data.copy(deep = True)

diabetes_data_copy[['glucose','bp','thickness','insulin','bmi']] = diabetes_data_copy[['glucose','bp',
'thickness','insulin','bmi']].replace(0,np.NaN)

lu Showing the Count of NANs

lprint(diabetes_data_copy.isnull().sum())
num_preg 0
glucose 5
bp 35
thickness 227
insulin 374
bmi 11
diab_pred 0
age 0
skin 0
diabetes 0
dtype: int64

Data Visualization

In [ l: ·P -= diabetes_data.hist(figsize = (20,20))
Lab

ISO
100

100

40 UC

thkkness
nst.ltin

1SO

200
100

100
5,0

20 40 60 80 l.00 200 400 600 eoo

skin

T 2-0

lia

500

:,00

-- ---
- --
o.2 0.4 0.6 o.a U)
0.0

In [ ): diabetes_data_copy['glucose'].fillna(diabetes_data_copy['glucose').mean(), inplace = True)

diabetes_data_copy['bp'].fillna(diabetes_data_copy['bp').mean(), inplace= True)
diabetes_data_copy['thickness'].fillna(diabetes_data_copy['thickness').median(), inplace = True)
diabetes_data_copy['insulin'].fillna(diabetes_data_copy['insulin'].median(), inplace = True)
diabetes_data_copy['bmi'].fillna(diabetes_data_copy['bmi'].median(), inplace = True)

In [ ]:lp= diabetes_data_copy.hist(figsize= (20,20))

22
1 191
Lob

"••"• 150

uo )00

IO<i
150
10
..., lf,I

..,
Ill
JO

C,O
IW IA<l
"'"
500 JO-,

no 11,
400
I!.</

lOO IJ>

IOO
JOO ,,
50
100
,.
JOO «)O IOO 20
,,, .., 50 C,O

200

IOO
IOO

40 ,0 60 70 10
1..5

clabet s
,00

400

200

IOO

In [ ] : diabetes_data_copy.head()
thickness insulin bmi diab_pred age skin diabetes
num_preg glucose bp
Out[ ] :
125.0 33.6 0.627 50 1.3790
6 148.0 72.0 35.0
0
26.6 0351 31 1.1426 0
85.0 66.0 29.0 125.0
1
125.0 233 0.672 32 0.0000
8 183.0 64.0 29.0
2
0.167 21 0.9062 0
66.0 23.0 94.0 28.1
3 89.0
168.0 43.1 2.288 33 13790
137.0 40.0 35.0
4 0

Standard scaling

In [ ): from sklearn.preprocessing import StandardScaler

1) ,), columns=['num_pre,
!Xsc=_X =pdS.tOaantdafarradmSec(aslce_rX(.)fit_transform(diabetes_data_copy.drop(["diabetes"),axis
•glucose•, 'bp·, ·thickness', 'insulin', · bmi', 'diab_pred', 'age',· skin'])
X,heat
Lab
num_preg glucose bp thickness
Olft[ ]: ----· - --· Insulin bmi diab_pred age skin
0 0.639947 0.865108 -0.033518 0.670643 -0.181541 0.166619 0.468492 1.425995 0.907270
1 -0.844885 -1206162 -0.529859 -0.012301 -0.181541 -0.852200 -0.365061 -0.190672 0.530902
2 1233880 2.015813 -0.695306 -0.012301 -0.181541 -1.332500 0.604397 -0.105584 -1288212
3 -0.844885 -1.074652 -0.529859 -0.695245 -0.540642 -0.633881 -0.920763 -1.041549 0.154533
-1.141852 0.503458 -2.680669 0.670643
"' 0.316566 1.549303 5.484909 -0.020496 0.907270

rn[ ]: y = diabetes_data_copy.diabetes
y

out[ l: 0 1
1 0
2 1
3 0
4 1

763 0
764 0
765 0
766 1
767 0
Name: diabetes, Length: 768, dtype: int64
Splitting the dataset

In ]: X = diabetes_data.drop('diabetes', axis=l)
y = diabetes_data['diabetes']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33,random_state=7)

Model Training - Decision Tree

In ]: from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)

Olft[ ] :

In ]: from sklearn im ort metrics

:predictions= dtr;e.predict(X_test)
print(uAccuracy Sco e =", format(metrics.accuracy_score(y_test,predictions)))
,. - . : - .. - . . - --· . .

Accuracy Score= 0.7086614173228346

In [ ]: from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, predictions))
print{classification_report(y_test,predictions))

[[130 32)
[ 42 50]]
precision recall fl-score support

0 0.76 0.80 0.78 162

1 0.61 0.54 0.57 92

accuracy 0.71 254

macro avg 0.68 0.67 0.68 254
weighted avg 0.70 0.71 0.70 254

Random Forest Model

Lab
from klcnrn.enscmblc import Rond0111ForestClassifir

rfc = RandomForestClassifier(n_cst1mators■200)
rfc.fit(X_train, y_troin)

11d l \. RandomFore t lo sifler --\

domForestCla sifier(n_e timato _..,2.00)_ I

rfc_tra1n "' 1-fc.pr die ( _J:rnin)

In l l {i'Offl sklearn import mclrl s

print(NAc uro y_ ore ·•• format(metrics.accuracy_score(y_tra1n, rfc_tra1n)))

Accuracy_Score = 1.0
so from above we can inference that training dataset our model is overfitted

In )' from sklearn import metrics

predictions= rfc.predict(X_test)
print(•Accuracy_Score =•. format(metrics.accuracy_score(y_test, predictions)))
Accuracy_Score = 0.7SS9055118110236

): from sklearn.metrics import classification_report, confusion_matrix

1
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test,predictions))

[[133 29]
[ 33 59]]
precision recall fl-score support

0 0.80 0.82 0.81 162

1 0.67 0.64 0.66 92
0.76 254
accuracy
0.74 0.73 0.73 254
macro avg
0.75 0.76 0.75 254
weighted avg

SVM Model

InL ): from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(X_train, y_train)

Out(

In ( ) :
.svc_pred = svc_model.predict_(x_ e t)
. .

In [ ]: from sklearn import metrics

svc_pred)))
" format(metrics.accuracy_score(y_t_est,
print("Accuracy Score=•
Accuracy Score= 0.7519685039370079
t confusion_matrix
import classification_repor,
lt1 ( l: froM sklearn.metrics
( t st svc_pred))
Print(confusion_matrix y_ret(,y test,svc_p red))
pr1nt(class1f1cat1on_repo -

:
11ri2f2 I

((146 16]
[ 47 45]]
precision recall fl-score support
0 0.76 0.90 0.82 162
1 0.74 0.49 0.59 92

accuracy 0.75 254

macro avg 0.75 0.70 0.71 254
weighted avg 0.75 0.75 0.74 254

K-Means Clustering Model

Int ]: customers_df=pd.read_csv('Mall_Customers.csv')
customers_df.head()

out[ 1. CustomerlD Gender Age Annual Income (k$)

J • Spending Score (1-100)

0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6

3 4 Female 23 16 77

4 5 Female 31 17 40

Inf ]: customers_df.corr()

Out[ J: Customer1D Age Annual Income (k$) Spending Score (1-100)

CustomertD 1.000000 -0.026763 0.977548 0.013835

.Age -0.026763 1.000000 -0.012398 -0327227

Annual Income (le$) 0.977548 -0.012398 1.000000 0.009903

Spending Score (1-100) -0.013835 -0327227 0.009903 1.000000

In [ J: . #Distribution of Annnua,lIncome
.plt_.figure(figsize=(10, 6))
sns.set(style = 'whitegrid')
sns.•distplot(customers_df['Annual Income (k$) '])
plt.title{'Distribution of Annu l Income (k$)', fontsize = 20)
plt.xlabel{'Range of Annual Income (k$)')
.plt.ylabel('Count')

Out(·]: Text(0, 0 5, 'Count')

. ...
Lab

Distribution of Annual Income (k$)

0.016

0.014

0.012

0.010

8 c 0.008
0.006

0.004

0.002

0.000 --
50 75 100 125 150
0 25
Range of Annual Income (kS)

In [ ): #Distribution of age
plt.figure(figsize=(l0, 6))
sns.set(style = 'whitegrid')
sns.distplot(customers_df['Age'])
plt.title('Distribution of Age', fontsize = 20)
plt.xlabel('Range of Age')
plt.ylabel('Count')

Out[ ]: Text(0! 0.5, 'Count')

Distribution of Age
0.035

0.030

0.025

.... 0.020
C:
:,
8
O.D15

0.010

0.005

0.000

In [ ]: #Distribution of spending score

plt.figure(figsize=(10, 6))
sns.set(style = 'whitegrid')
sns.distplot(customers_df['Spending Score (1-100)'))
plt.title('Distribution of Spending Score (1-100)', fontsize = 20)
plt.xlabel('Range of Spending Score(1- ,
Lab
plt.ylabel('Count') ) )
100
out[ J: rext(0, 0.5, 'Count')

0.018 Distribution of Spending Score (1-100)

0.016

II
0.014

0.012

0.010
c::,
8 0.008
,- .

0.006

0.004

0.002

0.000
-20 0 20 40 60 80 100 120
Range of Spending Score (1-100)

In [ ): genders= customers_df.Gender.value_counts()
sns.set_style("darkgrid")
plt.figure(figsize=(10,4))
sns.barplot(x=genders.index, y=genders.values)
• plt.show()

100

0
Female Male

i -- -
In [ ]: 1#We take just the Annual Income and Spending score
:dfl=customers_df[["CustomerID","Gender","Age","Annual Income (k$)","Spending Score (1-100)"]]
_X=dfl[["Annual Income (k$)","Spending Score (1-100)"]]

In[ ]: i X.head()
Llt{ J

Annual Income (k$) pending Score (1-100)

0 15
39
15
81
2 16 6
3 16 17
4 17 40

111 [ ) : #Scatterplot of the input data

plt.figure(figsize=(le,6))
sns.scatte,rplot(x = 'Annua1 Income (k$), _ ,
plt.xlabel( AnnualIncome(k$)') ,y - Spending Score (1-100)' data c X ,s c 60)
plt.ylabel('Spending Score (1-100)') '
t i t l e ('Spending Score (1-100
plt. s h ow ( ) ) vs Annual Income (k$)')

Spending Score (1-100) vs Annual Income (k$)

100
• e • • • •
• • • ••• i • • • •• •
• • •
00
•• •
0 • •• •
0
"'7
60 • • (0 (0
•
Cl) •e i E
0 e ,... .._v ,..
0 .. , ! !e
Cl)
•09 c .•••"
O>
• •
...
C :
:s 40
•• •• ••
•
C:
Cl)
a.
• •• •• •
Cl)

20 : ••
• •
• ••
•••••
i
.:
•
.• • •
•
••• • •• ••••
0
100 120 140
40 60 80
20
Annual Income (k$)

In [ ): #Importing KMeans from sklearn

from sklearn.cluster import KMeans

In[ ): # Within Cluster Sum of Squared Errors (WSS) for different values of k
wcss=[]
for i in range(l,11):
km=KMeans(n_clusters=i)
km.fit(X)
wcss.append(km.inertia_)

In [ ): #The elbow curve

plt.figure(figsize=(12,6))
plt.plot(range(l,11),wcss)
plt.plot(range(l,11),wcss, linewidth=2, color="red", marker ="8")
plt.xlabel("K Value")
plt.xticks(np.arange(l,11,1))
plt.ylabel("WCSS")
plt.show() .

291
250000

200000

(/)
(/) 150000

100000

50000

2 3 4 5 6 7 8 9 10
KValue

in the graph, after 5 the drop is minimal, so we take s to be the number of clusters.

In [ ]: #Taking 5 clusters
kml=KMeans(n_clusters=S)
#Fitting the input data
kml.fit(X)
#predicting the Labels of the input data
y=kml.predict(X)
#adding the Labels to a column named Label
dfl["label"] = y
#The new dataframe with the clustering done
dfl.head( 10)

Out[ ] : CustomerlD Gender Age Annual Income (k$) Spending Score (1-100) label

0 Male 19 15 39 0

1 2 Male 21 15 81 3

2 3 Female 20 16 6 0

3 4 Female 23 16 77 3

4 5 Female 31 17 40 0

s 6 Female 22 17 76 3

6 7 Female 35 18 6 0

7 8 Female 23 18 94 3

Male 64 19 3 0
8 9

9 10 Female 30 19 72 3

In [ ): #Scatterplot of the clusters

!Plt.figure(figsize=(10,6))
sns.scatterplot(x= 'Annual Income (k$)',y = 'Spending Score (1-100)',hue="label",
palette=['green','orange','brown','dodgerblue','red'), legend='full',data dfl ,s 61
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Spending Score (1-100) vs Annual Income (k$)')
plt.show()
Spending Scar (1 10
100 •• e - 0) vs Annual Income (k$)

• •
• •
-
• I • •
• • ••
-
0
• •
-8
0

,.!.
Q)
60 • label
• 0
Cl)
Cl 1

•• •• • 2
C
:Cs 40
• • • 3
•
CD
a. • 4
Cl)

@
•
0
•• e

20 40 60 80 100 120 140

Annual Income (k$)

In ( ): #Taking the features

df2=customers_df[["CustomerID","Gender","Age","Annual Income (k$)","Spending Score (1-100)"]]
X2=df2[["Age","Annual Income (k$)","Spending Score (1-100)"]]
#Now we caLcuLate the Within Cluster Sum of Squared Errors (WSS) for different values of k.
wcss = []
fork in range(1,11):
kmeans = KMeans(n_clusters=k, init="k-means++")
kmeans.fit(X2)
wcss.append(kmeans.inertia_)
plt.figure(figsize=(l2,6))
plt.plot(range(1,11),wcss, linewidth=2, color="red", marker ="8")
plt.xlabel("K Value")
plt.xticks(np.arange(l,11,1))
plt.ylabel("WCSS")
plt.show()

300000

250000

200000
Cf)
Cf)

150000

100000

50000

2 3 4 5 6 7 8 9 10
KValue

In [ ) : km2 = KMeans(n_clusters=S)
y2 = km.fit_predict(X2)
df2["label"] = y2.
#The dara wirh Labels
df2.head()
CustomerlD Gender Age Annual Income (k$)
Olld l: Spending Score (1·100) label
0 Male 19 15 39
2 Male 21 15 01 5
2 3 Female 20 16 6
3 4 Female 23 16 77 5
4 5 Female 31 17 40

rnr ]: #3D Plot as we did the clustering on the basis of 3 input features
fig= plt.figure(figsize=(20,10}}
ax= fig.add_subplot(lll, projection='3d')
ax.scatter(df2.Age[df2.label == 0], df2["Annual Income (k$)")[df2.label == 0),
df2["Spendinc Score (1-100}"][df2.label == 0), c='purple', s=60}
ax.scatter(df2.Age[df2.label == 1], df2["Annual Income (k$)")[df2.label == 1),
df2["Spending Score (1-100}"][df2.label == 1], c='red', s=60)
ax.scatter(df2.Age[df2.label == 2], df2["Annual Income (k$)"][df2.label == 2),
df2["Spending Score (l-100}"][df2.label == 2), c='blue', s=60}
ax.scatter(df2.Age[df2.label == 3], df2["Annual Income (k$)"][df2.label == 3),
df2["Spending Score (1-100)"][df2.label == 3), c='green', s=60)
ax.scatter(df2.Age[df2.label == 4], df2["Annual Income (k$)"][df2.label == 4],
df2[" pendin3 Score (1-100}"][df2.label == 4), c='yellow', s=60}
ax.view_init(35, 185)
plt.xlabel("Age")
plt.ylabel("Annual Int m! k .)")
ax.set_zlabel('Spe11dir g . ,..re {1-.!Ni)•)
plt.show()

• •• ••••
• • ,. -4ri •

, -· ... . .
60
• • • ••
•.
.
70

• .•.
,....
I 0
50
C/)
60
:C:,t>
e:, : 40
co
• 50
C/)

8 • - Age

-
30
-,
(D 40

-
I 20
0 30
0
10
20
140 120 100 80 60 40 20
Annual Income (kS)

: '
-
custl=df2[df2("label"]==l]
print('Number of customer in 1st r _,
print('They are-•, cust1("Custom!r; ?.J • len(cust1))
print(" .values)
cust2=df2[df2("label"]==2] .,_ ..)
Print('Number of customer in 2nd group::• le (
print('They are-•, cust2["CustomerIO"], n cust2))
print(" :values)
cust3=df2[df2["label"]==0] ----------------- ")
print('Number of customer in 3rd group='
1
print('They are-•, cust3["CustomerIO"], len(cust3))
.. .va ues)
print(
cust4=df2[df2["label"]==3] ----------------- ")
print('Number of customer in 4th group-' ·
• • -, 1en(cust4))
print( They are - , cust4["CustomerID"].values)
print("-----------------------------
cust5=df2(df2("label"]==4] ------------------ ")
print{'Number of customer in 5th group-• 1 (
, , - , en custs))
print( They are - , custS("CustomerID"].values)
print(" ")
Number of customer in 1st group= 12
They are - [ 1 3 5 17 21 27 29 39 43 45 49 50]

Number of customer in 2nd group= 35

They are- [ 44 48 52 53 59 62 66 69 70 76 78 79 82 85 88 89 92 94
95 96 98 100 101104106 112 113 114 115 116 121122 123 133 143)
-
Number of customer in 3rd group= 10
They are - [181183 185 187 189 191193 195 197 199)

Number of customer in 4th group= 17

They are - [127 129 131137141147151153155 161 165 167 169 171175 177 179]

Number of customer in 5th group= 44

They are - [ 41 47 51 54 55 56 57 58 60 61 63 64 65 67 68 71 72 73
74 1s n 80 81 83 84 86 87 90 91 93 97 99 102 103 105 101
108 109 110 111117118 119 120]

From Toxic To Pure Harnessing Renewable Energy For Non Conventional Water Remediation and Green Hydrogen Production
No ratings yet
From Toxic To Pure Harnessing Renewable Energy For Non Conventional Water Remediation and Green Hydrogen Production
10 pages
FitTrack Gold Manual 2019
No ratings yet
FitTrack Gold Manual 2019
178 pages
ҰБТ тест жинағы Ағылшын
No ratings yet
ҰБТ тест жинағы Ағылшын
96 pages
Msds Coagulation
No ratings yet
Msds Coagulation
54 pages
Woobles - Three Peas in A Pod
No ratings yet
Woobles - Three Peas in A Pod
13 pages
Python Lab Manual B.Tech. Sem-2
No ratings yet
Python Lab Manual B.Tech. Sem-2
39 pages
Rufh 2
No ratings yet
Rufh 2
28 pages
IDA Lab Mannual
No ratings yet
IDA Lab Mannual
26 pages
PDSP 1
No ratings yet
PDSP 1
15 pages
Python
No ratings yet
Python
132 pages
Durability of Building and Construction Sealants and Adhesives Volume 2 Andreas T Wolf
100% (2)
Durability of Building and Construction Sealants and Adhesives Volume 2 Andreas T Wolf
132 pages
L and T Lab Assainments - Tejoprakash
No ratings yet
L and T Lab Assainments - Tejoprakash
5 pages
Rescue Asd
No ratings yet
Rescue Asd
12 pages
L and T Lab Assainments - Srikanth
No ratings yet
L and T Lab Assainments - Srikanth
5 pages
Computer Architecture & Design-Slides
No ratings yet
Computer Architecture & Design-Slides
34 pages
Python Lab File
No ratings yet
Python Lab File
4 pages
Ch11a Numpy
No ratings yet
Ch11a Numpy
8 pages
Final DAA
No ratings yet
Final DAA
31 pages
DNA As Code of Life - Rau's IAS
No ratings yet
DNA As Code of Life - Rau's IAS
5 pages
Experiments Op
No ratings yet
Experiments Op
21 pages
Foundation of Data Science Lab Manual Full
No ratings yet
Foundation of Data Science Lab Manual Full
8 pages
Green City Planning
67% (3)
Green City Planning
16 pages
DS Lab Programs
No ratings yet
DS Lab Programs
47 pages
Python
No ratings yet
Python
17 pages
Experiment 2
No ratings yet
Experiment 2
17 pages
TSaiSrikar
No ratings yet
TSaiSrikar
45 pages
Write A Numpy Program To Reverse An Array (First Element Becomes Last)
No ratings yet
Write A Numpy Program To Reverse An Array (First Element Becomes Last)
8 pages
2221 Textile Performance Testing Year II Semester II
No ratings yet
2221 Textile Performance Testing Year II Semester II
2 pages
Distillation of Mixtures: Activity 2.3
No ratings yet
Distillation of Mixtures: Activity 2.3
4 pages
Saanp pt.3-1
No ratings yet
Saanp pt.3-1
21 pages
KJD ML File
No ratings yet
KJD ML File
45 pages
2.5L Diesel 1993 On
No ratings yet
2.5L Diesel 1993 On
2 pages
TLE-CSS Grade9 Q1 LAS1
No ratings yet
TLE-CSS Grade9 Q1 LAS1
6 pages
Python Exps Questions
No ratings yet
Python Exps Questions
10 pages
Final JWT Internship Report Finalised
No ratings yet
Final JWT Internship Report Finalised
33 pages
Manual
No ratings yet
Manual
21 pages
Python Introduction 2019
No ratings yet
Python Introduction 2019
27 pages
Data-Structures Lecture
No ratings yet
Data-Structures Lecture
13 pages
P. Guru Prasad: A LAB Record of
No ratings yet
P. Guru Prasad: A LAB Record of
57 pages
Data Science Lab Exp Lis
No ratings yet
Data Science Lab Exp Lis
72 pages
Computer Science Programs
No ratings yet
Computer Science Programs
13 pages
Autobiography of Joseph Garland
No ratings yet
Autobiography of Joseph Garland
2 pages
ML Lab File Vijay Kumar
No ratings yet
ML Lab File Vijay Kumar
27 pages
Pay Thon
No ratings yet
Pay Thon
19 pages
Ig5 Cat
No ratings yet
Ig5 Cat
16 pages
Python Lab PRG
No ratings yet
Python Lab PRG
20 pages
Development of An Automated Multi-Level Car Parking System: December 2015
No ratings yet
Development of An Automated Multi-Level Car Parking System: December 2015
8 pages
Machine Learning Codes
No ratings yet
Machine Learning Codes
30 pages
Python Internal-2
No ratings yet
Python Internal-2
6 pages
Python Lab Programs
No ratings yet
Python Lab Programs
58 pages
Fdsa Lab Manual Final
No ratings yet
Fdsa Lab Manual Final
70 pages
Python Numpy Programming: Eliot Feibush
No ratings yet
Python Numpy Programming: Eliot Feibush
66 pages
Python Basics
No ratings yet
Python Basics
49 pages
Lab Manual - ML - RIT
No ratings yet
Lab Manual - ML - RIT
54 pages
Be1227 Inductica 2012 Paper MDL PDF
No ratings yet
Be1227 Inductica 2012 Paper MDL PDF
8 pages
Final Lab Manual Python
No ratings yet
Final Lab Manual Python
32 pages
MDA File
No ratings yet
MDA File
37 pages
Grace Python Numpy MB Final
No ratings yet
Grace Python Numpy MB Final
55 pages
File of Pyhton Removed
No ratings yet
File of Pyhton Removed
31 pages
oG1M8adGXOGe DHBiQVrXgXHO6GrHU01tHWZgd tpRqUW65xGX9ufzrZMtM6hjBWlvlYViPn6r2Cgghq2M8oiXNNdf0HeL-DQvJKWM
No ratings yet
oG1M8adGXOGe DHBiQVrXgXHO6GrHU01tHWZgd tpRqUW65xGX9ufzrZMtM6hjBWlvlYViPn6r2Cgghq2M8oiXNNdf0HeL-DQvJKWM
42 pages
Rufh 4
No ratings yet
Rufh 4
24 pages
Python Record
No ratings yet
Python Record
37 pages
DVB-S2 Modem: Sk-Ip / SK-DV / Sk-Ts
No ratings yet
DVB-S2 Modem: Sk-Ip / SK-DV / Sk-Ts
6 pages
Lab1 ML Eac22050
No ratings yet
Lab1 ML Eac22050
17 pages
Document
No ratings yet
Document
16 pages
Eureka Math Grade 2 Module 2 Parent Tip Sheet 1
No ratings yet
Eureka Math Grade 2 Module 2 Parent Tip Sheet 1
2 pages
Lecture Python Basics
No ratings yet
Lecture Python Basics
23 pages
HAVE TO - NEED TO Semana 1
No ratings yet
HAVE TO - NEED TO Semana 1
5 pages
Nss 1
No ratings yet
Nss 1
2 pages
6.0001 Final Cheat Sheet PDF
No ratings yet
6.0001 Final Cheat Sheet PDF
2 pages
Dsup Lab File
No ratings yet
Dsup Lab File
18 pages
1a.install Python and Set Up The Development Environment.: Code: Print ("Hello World") Output
No ratings yet
1a.install Python and Set Up The Development Environment.: Code: Print ("Hello World") Output
14 pages
Combined Cheatsheet
No ratings yet
Combined Cheatsheet
5 pages
Diarrhea Management in The Community
No ratings yet
Diarrhea Management in The Community
23 pages
Bounded and Unbounded Sequence: (A) Definition
No ratings yet
Bounded and Unbounded Sequence: (A) Definition
2 pages
CS3361 - Data Science
No ratings yet
CS3361 - Data Science
56 pages
Python
No ratings yet
Python
16 pages
Easybib 553e7541694d58 39916757
No ratings yet
Easybib 553e7541694d58 39916757
7 pages
Python Numpy Tutorial
No ratings yet
Python Numpy Tutorial
26 pages
Engineering Materials Exp. - 2
No ratings yet
Engineering Materials Exp. - 2
6 pages
Essay Outline For The Great Gatsby
No ratings yet
Essay Outline For The Great Gatsby
5 pages
Python Lab Manual
No ratings yet
Python Lab Manual
17 pages
2024 Summer Model Answer Paper
No ratings yet
2024 Summer Model Answer Paper
28 pages
Noc20 Cs81 Assignment 01 Week 03
No ratings yet
Noc20 Cs81 Assignment 01 Week 03
5 pages
Citizen CLP-8301 Technical Manual
No ratings yet
Citizen CLP-8301 Technical Manual
259 pages
SA Health Cleaning Standard 2014 - (v1.1) CDCB Ics 20180301 PDF
No ratings yet
SA Health Cleaning Standard 2014 - (v1.1) CDCB Ics 20180301 PDF
48 pages
Illustrative Problems 1. Find Minimum in A List
No ratings yet
Illustrative Problems 1. Find Minimum in A List
11 pages
Python: Advanced Guide to Programming Code with Python: Python Computer Programming, #4
From Everand
Python: Advanced Guide to Programming Code with Python: Python Computer Programming, #4
Charlie Masterson
No ratings yet
Python: Advanced Guide to Programming Code with Python
From Everand
Python: Advanced Guide to Programming Code with Python
Charlie Masterson
No ratings yet