0% found this document useful (0 votes)
43 views39 pages

AIML Lab Manual

Uploaded by

Mihika
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
43 views39 pages

AIML Lab Manual

Uploaded by

Mihika
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 39

Madhav Institute of Technology and Science, Gwalior

(A Govt. Aided UGC Autonomous & NAAC Accredited Institute)


(Affiliated to R.G.P.V. Bhopal)

Department Of Electronics Engineering

Lab Manual

140617 AIML Lab

Faculty Coordinator

Prof. (Dr.) R.P.Narwaria

Prof. Madhav Singh

Lab Coordinator

Mr. D. S. Tomar
LIST OF THE EXPERIMENT

S.NO NAME OF THE EXPERIMENT PAGE


NO.

1 Perform Creation, indexing, slicing, concatenation and repetition operations 1


on Python built-in data types: Strings, List, Tuples, Dictionary, Set

2 Solve problems using decision and looping statements. 2

3 Apply Python built-in data types: Strings, list, Tuples, Dictionary, Set and their 3
methods to solve any given problem

4 Manipulation of NumPy arrays- Indexing, Slicing, Reshaping, Joining and 3


Splitting.

5 Computation on NumPy arrays using Universal Functions and Mathematica.I 4


methods.

6 Import a CSV file and perform various Statistical and Comparison operations 5
on rows/columns

7 Create Pandas Series and DataFrame from various inputs 6

8 Import any CSV file to Pandas DataFrame and perform the following: 6
1. Visualize the first and last 10 records
2. Get the shape, index and column details
3. Select/Delete the records(rows)/columns based on conditions.
4. Perform ranking and sorting operations.
5. Do required statistical operations on the given columns.
6. Find th count and uniqueness of the given·categorical values.

9 Import any CSV file to Pandas DataFrame and perform the following: 9
1. Handle missing data by detecting and dropping/ filling missing
values.
2. Transform data using different me hods.
3. Detect and filter outliers.
...
4. Perform Vectorized String operations on Pandas Series.
5. Visualize data using Line Plots, Bar Plots, Histograms, Density
Plots and Scatter Plots.

10 Use scikit-learn package in python to implement following machine learning 14


models to solve real world problems using open source datasets:
1. Linear Regression model.
2. Multi-linear regression model.
3. Decision tree classification model.
4. Random forest model.
5. SVM model.
6. K-means clustering model.

: Experiment 1:
Perform Creation. indexing, slicing, concatenation and repetition operations ori Python built-in data types: Strings,
List, Tuples, Dictionary, Set

#String Creation
strl =•Abhinav•
str2 =•chaturvedi•

#string Concatenation
str3=str1+• •+str2
print(str3)

#string indexing
print(str3[8])

#string slicing
print(str3[:7:J)
print(str3[0:7:2])
print(str3[::-1])

#string repetition
print(strl • 3)
Abhinav Chaturvedi
A
Abhinav
Ahnv
idevrutahC vanihbA
AbhinayAbhinavAbhinav

1 [ j: #List Creation
listl =['Python', 'C++', 'JavaScript']
list2 =[l, 2, 3]

#List Concatenation
list3=listl+list2
print(list3)

#List indexing
print(list3[0])

#List slicing
print(list3[:3:])
print(list3[0:7:2])
print(list3[::-1])

#List repetition
print(listl • 3)
(•Python•, •c++•, 'JavaScript', 1, 2, 3]
Python
['Python', 'C++', 'JavaScript']
('Python', 'JavaScript', 2] ,
(3, 2, 1, 'JavaScript', 'C++' • ython
('Python', 'C++', 'JavaScript, Python, •C++•, 'JavaScript', •Python', 'C++', •JavaScript']

I,, [ 1: #tuple Creation


#items of a tupe Ca nnot be changed once create
tuple1
=('Python',
'C++'
'JavaScript')
tuple2 =(l, 2,
3) '

#'tuple Concatena ion


tuple3=tuple1+tuple2
print(tuple3)

#tuple indexing
print(tuple3f0])
print(tuple3f0][1))

#ruple slicing
print{tuple3[:3:])
print{tuple3[0:7:2])
print(tuple3[::-1])

#tuple repention
print(tuplel * 3)
('Python', 'C++,' 'JavaScri•pt', 1, , )
2 3
Python
y
('python', 'C++', 'JavaScript')
{'Python', 'JavaScript', 2)
{3, 2, 1, 'JavaScript', 'C++' 'Pyth ')
('Python', 'C++', 'JavaScript , 'Pyt :n•,
'C++', 'JavaScript', 'Python·, 'C++', 'JavaScript')
In ]: #Dicn.onary Creation
d ct1 ={7:'P hon', 2:'C++', 3:'JavaScript'}
dict2 ={ good :1, 'average':2, 'nice':3}

#diet Concatenation
def Merge(dict1, dict2):
return({**dictl, **dict2})

dict3=Merge(dictl,dict2)
print(dict3)

#diet - element access


print(dict2['good'])
print(dictl.get(l))

{1: 'Python', 2: 'C++', 3: 'JavaScript', 'good': 1, 'average': 2, 'nice': 3}


1
Python

In j: #Set Crearion
#a set cannot have mirtable elements like list:s, sets or dictionaries as its elements.
#You cannot access ite s in a set by referring to an index or a key.
Setl ={'Python', 'C++", 'JavaScript'}
Set2 ={1, 2, 3}

#Set Concatenation Setl.update(Set2) Setl.add('Java·) print(Setl)


{l, 2, 'C++', 3, 'JavaScript', 'Python', 'Java'}
: Experiment 2 :
Solve problems using decision and looping statements.

In J: def print_factors(x):
print("The factors of•,x,nare:")
for i in range(l, x + 1):
if x X i == 0:
print(i)

num=320
print_factors(num)

The factors of 320 are:


1
2
4
s
8
10
16
20
32
40
64
80
160
320
: Experiment 3 :

Apply Python built-in data types: Strings, List, Tuples, Dictionary, Set and their.methods to solve any given problem

In [ ): Age={'person2': 21,'personS':24,'person6':19,'personl':20,'person3':23,'person4':22}
sortedDict = sorted(Age)
print(sortedDict)

#sort:i.ng on the basis of value using Lambda function


print(sorted(Age.items(),·key=lambda x:x[l]))

#without using La bda function


def element_l(x):
return x[l]

sorted(Age.items(), key=element_l)
['personl', 'person2', 'person3', 'person4', 'persons', 'person6']
[('person6', 19), ('personl', 20), ('person2', 21), ('person4', 22), ('person3', 23), ('persons', 24)]
Out[ ): [('person6', 19),
('personl', 20),
('person2', 21),
( •per.son4•, 22),
('person3', 23),
(·persons', 24)]
: Experiment 4:
Manipulation of NumPy arrays- Indexing, Slicing, Reshaping, Joining and Splitting.

In [ J: arr= np.array([l, 2, 3, 4, 5])

print(arr)

print(type(arr))

print{arr[l]) l.
pi-int{arr[::2])
print(arr[l:5])
print(arr[l:5:2])

arrl= np.array([l, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

newarr = arrl.reshape(4, 3)

print(newarr)

newarr 3d = arrl.reshape(2, 3, 2)
print(;ewarr_3d)

arr3 = np.array([100, 200, 300])

arr4= np.array([400, 500, 600])

arr= np.concatenate((arr3, arr4)

print(arr)

arrl = np.array([[l, 2], [3, 4]])

arr2 = np.array([[5, 6], [7, 8]])

arr= np.concatenate((arrl, arr2), axis=l)

print(arr)

arr= np.array([l, 2, 3, 4, 5, 6])

newarr = np.array_split(arr, 3)

print(newarr)

arr= np.array([[l, 2, 3], [4, S, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15), [16, 17, 18]])

newarr = np.array_split(arr, 3)

print(newarr)

[1 2 3 4 5)
<class 'numpy.ndarray'>
2
[1 3 5]
[2 3 4 5]
[2 4]
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 1112]]
[[[ 1 2]
[ 3 4]
[ 5 6]]

[( 7 8]
[ 9 10]
(11 12]]]
[100 200 300 400 500 600]
[[1 2 5 6]
(3 4 7 8]]
[array([l, 2]), array([3, 4]), array([5, 6))]
[array([[l, 2, 3],
(4, 5, £]]), array(([ 7, 8, 9],
(10, 11, 12]]), array([[13, 14, 15],
(16, 17, 18]])]
: Experiment 5:
Computation on NumPy arrays using Universal Functions and Mathematical methods.

Note: Universal functions in Numpy are simple mathematical functions. It is just a term that we gave to mathematical
functions in the Numpy library. Numpy provides various universal functions that cover a wide variety bf operations.
These functions include standard trigonometric functions, functions for arithmetic operations, handling complex
numbers, statistical functions, etc.

In [ ]: # Python code to demonstrate trigonometric function

# create an array of angles


angles= np.array([0, 30, 45, 60, 90, 180])

#conversion of degree into radians using deg2rad function


radians= np.deg2rad(angles)

# sine of angles
print('Sine of angles in the array:')
sine value= np.sin(radians)
print(np.sin(radians))

#hypot function demonstration


base= 4
height= 3
print('hypotenuse of right triangle is·• h
· ,np. ypot(base, height))
#statistical method
weight= np.array([50.7 S2.S 50 58
:i t(. , J ' ' 55 - 6 3, 73.25, 49.5, 45])
pr_n Mean weight of the students: ')
print(np.mean(weight))

Sine of angles in the array:


[0.80000000e+00 S 00000000e 0
l.00000000e+00 1 22464680e=1 ]7.07106781e-01 8.66025404e-01
hypotenuse of right triangle is: 5.0
Mean weight of the students:
S4.322S
: Experiment 6 :
Importa CSV file and perform variou St f . I
s a tst,ca and Comparison operations on rows/columns.
In r ) : df=pd.read_csv('50_Startups.csv')
df.head(10)

Out[ ] : R&DSpend Administration Marketing Spend


State Profit
0 165349.20 136897.80 471784.10 New York 192261.83
1 162597.70 1S13TI.59
443898.53 California 191792.06
2 153441.51 10114S.SS 407934.54 Florida 191050.39
3 144372.41 118671.85 383199.62 New York 182901.99
4 142107.34 91391.77 366168.42 Florida 166187.94
s 131876.90 99814.71 36286136 NewYork 156991.12
6 134615.46 147198.87 127716.82 California 156122.51
7 130298.13 14S530.06 323876.68 Florida 155752.60
8 120542.52 148718.95 311613.29 New York 152211.77
9 123334.88 108679.17 304981.62 California 149759.96

In ):
Where method to compare the values
#
# The values were stored in the new column
# By using the Where() method in NumPy, i1e are given the condition to compare the columns.
# If 'columnl' is Lesser than 'column2' and 'column1' is -Lesser than the_'column3', .
# ive print the values of 'coLumnl'. If the condition fails, we give the value as 'NaN'.
df['-new'J = np.where((df['R&D Spend']<= df['Administrat on']) I (
df['R&D Spend'] <= df['Marketing Spend']), df['R&D Spend'], np.nan)

df.head()

Out[ ] : R&D Spend Administration Marketing Spend State Profit new

0- 16534920 136897.80 4'71784_,0 New York 192261.83 165349.20 ..

1 162597.70 151377.59 443898.53 California 191792.06 162,597.70

2 153441.51 101145.55 407934.54 Florida 191050.39 153441.51

3 144372.41 118671.85 383199.62 New York 182901.99 144372.41

4 142107.34 91391.77 366168.42 Florida 166187.94 142107.34

In ]: print( df.sum())
3686080.78
R&D Spend
6067231.98
Administration
10551254.89
Marketing Spend
YorkCaliforniaFloridaNew YorkFloridaNew Yo ...
State New 5600631.96
Profit 3686080.78
new
dtype: object

In [ ): print( df['Profit'].sum( - ---- --··


. - -•A--·A- •- A •.. . - ..
¥ •

,
5600631.960000001

In ]: print(df.mean())
R&D Spend
73721.6156
Administration 121344.6396
Marketing Spend 211025.0978
Profit 112012.6392
new
73721.6156
dtype: float64

5600631.960000001
: Experiment 7:
Create Pandas Series and DataFrame from variousinputs.

In ); d t
a a= np.array(['personl', 'person2', 'person3', 'person4', 'persons'])
ser_5tud = pd.Series(data)
print(ser_stud)

# create series form a list


list-= ["95' •94• '99' •93· '.97']
ser_num = pd Serie;(list) ,
print(ser_num)
0personl
1 person2
2 person3
3 person4
4 persons
dtype: object
0 95
1 94
2 99
3 93
4 97
dtype: object

In ]: dataframe = { 'Students': ser_stud, 'numbers': ser_num}


#Creating DaraFra,-r:bey passing Dictionary
result= pd.DataFrame(dataframe)
result.head()

Out [ j: Students numbers

0 person1 95
1 person2 94
2 person3 99
3 person4 93
4 persons 97

In [ ]: data= [['person1• 10], ('person2', 15], ('person3', 14]]


df = pd.OataFrame(ea a, c.olumns={'Name', 'Age']) ,.
df.head()

OL1t [ ) ; Name Age

0 person1 10
1 person2 15

2 person3 14
: Experiment 8:
Import any CSV file to Pandas Data Frame and perform the following:

1. Visualize the first and last 10 records


2. Get the shape, index and column details
3. Select/Delete the records(rows)/columns based on conditions.
4. Perform ranking and sorting operations.
. l).o rn1111f1t1tJ I l!t/bllrnl OflfllllIl on 0,1111,,
01
0. I Incl tho f 1111111rtndtmlqt1tn
I 1.6 ,, I II Ill

vu,,1 011111111
'"''
(11111
I H1m111llfl f;/11plMn111tlpln,1,l111rtn• "', tlluc11111,"w
' ,lu, r,;,

11
' l It/ )I/ 1'•11I IIU ,/(lf 11
,JJ 11h lt1§,._1l11t·11i-1Jll. l't!ll(I
,tJ nlmtr1 1J11t11, h i!tl( H)'
V( 'plmrl■flfrt
't U\/')

lp1 ( )
0
"""' '" " i,lu<:o•a co
()
11 tll;,uoll
"''n lhldo11,o ,,,..,1,,. 11ml dlii,h 1r1Jd .a9e "'1n dl bf!tes
1411 J () 3,G (),(i27 1,3700 Trur:
ll!, (l(l
o () 20,(i (J,3!,1 1,1426 Fal
:t 11
''"
1(]'1

., ll9 GG 2
() ()
!M 2ll.1
;>J, Olin
0,1G7
.12
21
OJX)()()
0,9()(j2
True
False
... () 1· 7 40 ·!, 1GB 43,1
2,288 33 1.3790 True
s ll<i 74 0 0 2!.i,6 0201 30 0,0000 False
G 70 50 32 /18 311) 0,248 26 12608 True
7 10 115 0 0 0 35,3 0,134, 29 0.0000 False
tJ 2 197 70 4!i 543 30,5 0,158 53 1,TT30 True
9 0 125 9G 0 0 0,0 0232 54 0,0000 True

ln I J : diobctcs_dotn.tnil(10)

Out[ J : num_prcg glucose_conc d l u tol l c J > p

7S8 106
- -- ·
76
- thickness Insulin bml diab_pred age skin diabetes

0 False
0 37,5 0,197 26 0.0000
7S9 6 190 92 0 0 35.5 0.278 66 0,0000 True

760 2 80 50 26 16 28,4 0,766 22 1,0244 False

761 9 170 14 31 0 44,0 0.403 43 1.2214 True

762 9 89 62 0 0 22,5 0.142 33 0.0000 False

763 10 101 76 48 180 32,9 0,171 63 1,8912 False

764 2 122 70 27 0 36,8 0.340 27 1,0638 False

76S 5 121 72 23 112 26,2 0.245 30 0.9062 False

766 126 60 0 0 30.1 0.349 47 0.0000 True

767 93 70 31 0 30,4 0.315 23 1.2214 False

Jn [ ]: diabetes_data.shape
Out[ J; (768, 10)
In [ ) : dJaqetes_data.describe() •
Ovt( J: thickness Insulin bml diab_pred age skin
num_preg glucose_conc dlastollc_bp

768.000000 768,000000 768.000000 768,000000 768.000000 768.000000 768.000000


count 768.000000 768.000000

mean 69,105469 20,536458 79.799479 31.992578 0.471876 33.240885 0.809136


3.845052 120.894531

19,355807 15,952218 115,244002 7,884160 0.331329 11.760232 0.628517


std 3.369578 31.972618
0.000000 0.000000 0,000000 0.078000 21.000000 0.000000
min 0.000000 0,000000 0,000000

0.000000 0,000000 27,300000 0.243750 24.000000 0.000000


25% 1.000000 99,000000 62.000000

SO% 23,000000 30.500000. 32,000000 0,372500 29.000000 0.906200


3,000000 117,000000 72.000000

32.000000 127,250000 36.600000 0.626250 41.000000 1.260800


15% 6.000000 140.250000 80,000000

max 99,000000 846.000000 67.100000 2.420000 81.000000 3.90000


17,000000 199,000000 122.000000
11122122,

In f J •J 2
diab

diabbetes_data(,dJab
a etes_data. head(5) He!l_data [• d illb e 'J.map(d1abe1:eS.JMP)

Out[ ]· num_preg glucose_conc dlastollc.J>p thlckneu lnwlln bml dlab_pred age sldn dlabrltt
0 6 148 72 35 0 33.6 0,627 so 13790
85 66 0
29 0 26,6 0.351 31 1.142.6
2 6 183 64 0 0 23,3 0,672 32 0.0000
3 89 66 23 94 28.1 0.167 21 0.9062 0
4 0 137 40 35 168 43.1 2.288 33 13790

In l ]: sns.countplot(x='diabetes.,data=diabetes_data)

Ou [ ): <AxesSubplot: xlabel•'diabetes'
1
ylabel


count'>

,l,,J
C:
:::,
8

1
diabetes

In [ ]: diabetes_data['diabetes').value_counts()

Out[ J: 0 500
1 268
Name: diabetes, dtype: int64

n [ ]: diabetes_data.sort_values("age")
diast bmi dfab_pred age skin diab etes
num_preg glucose_conc olic_bp thickness insulin
Out[ ]:
113 64 35 0 33.6 0.543 21 13790
255
0 0.0 0304 21 0.0000 0
84 0 0
60 2
0 225 0262 21 0.0000 0
125 96 0
102 0
0299 21 0.7880 0
0 74 20 23 27.7
182
115 435 0347 21 1.0638 0
94 70 27
623 0

0 26.8 0.186 69 0.0000 0


80 0
123 s 132
0
0 0 0.0 0.640 69 0.0000
136 82
68,4 5
18 0 325 0235 70 0.7092
145 82
666 4
0 19.6 0.832 72 0.0000 0
0 0
453 2 119
60 25.9 0.460 81 13002 0
74 33
459 9 134
.
'11122122,
Llllt
Tri ( 1: diobct •s_d,,tn(; t1(l rllnk •) dJ,nl,ct;u do• . [,
- , 0 "Cc·1.1·t1111 c,, , cmtH 1111r II c)
In r l: dii'lbete _dntn.heod()

ut[ ] ' nun pn g glucosc_conc dlftstoll t," thlclmnu l11J11lh1 111111 dlilh. Jll'fill IIIJfl 1kl11 cJl111i11tn ii!)fl fitlll(
0 G 140 7 /I!,!;
!I' () ;!;IIi oc,:11 !,O I fJ /I (I
05 (j(j :!!) f.1,1'£l 0 31(),'
I) I, I 03!J1
2 I) 11) 1'1 0 0 ,!.;J, OIJ'/2 ;, 00000 :rnu,
3 09 GG
"
ii. f '1 i·1 0,00l:i2 7 l.O
i'fl,1 ().11,'/
,4 0 137 40 3r OJ,()
160 4·.1 .,?ll/J 1. . 70{1

111 ( J: i l, inplD •r:ol e)llw<l ..J fo,• <1Lumll/J


diabetes_data "' diobetes_cfota .drop([ • nge_ronk,], n > < •

Operations based on conditions

In [ J: diabetes_data.groupby( 'diabetes') ,nienn()

Out[ ] : num_p g glucosc_conc dfostollc_bp thickness Insulin lnnl dl11h_prad


diabetes

0 3.298000 109.980000 68.184000 19.6f,4000 60.792000 30,304200 0.429734 31,190000 0,774762


4.865672 141.257463 70.824627 22.164179 100.335821 35,142537 0,550500 37,067164 0,873269

Renaming Columns
,
In [ ]: diabetes_data.rename(columns = {'glucose_conc': 'glucose', 'diastolic_bp': 'bp'}, inplace • True)

In [ ) : diabetes_data.head()

out[ J: num_preg glucose bp thickness Insulin bml dlab_pred age skin diabetes

0 6 148 72 35 0 33.G 0.627 so 1.3790


85 66 29 0 26.6 0.351 31 1.1426 0

2 8 183 64 0 0 23.3 0.672 32 0.0000

3 89 66 23 94 28.1 0.167 21 0.9062 0

4 0 137 40 35 168 43.1 2.288 33 1,3790

In [ ] : diabetcs_data.age.unique()

Out[ ) : array( [50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 34, 57, 59, 51, 27, 41, 43,
22, 38, 60, 28, 45, 35, 46, 56, 37, 48, 40, 25, 24, SB, 42, 44, 39,
36, 23, 61, 69, 62, 55, 65, 47, 52, 66, 49, 63, 67, 72, 81, 64, 70,
68), dtype=1nt64)

In [ ) : len(diabetes_data.-age.unique())

Out[ J: 52
:Experiment 9:
Import any CSV file to Pandas DataFrame and perform the following:

1. Handle missing data by detecting and dropping/ filling missingvalues.


2. Transform data using different methods.
3. Detect and filter outliers.
4. Perform Vectorized String operations on Pandas Series.
5. Visualize data using Line Plots, Bar Plots, Histograms, Density Plots and ScatterPlots.

Dataset Import and overview

_In [ ): df•pd.read_csv( • S0_Startups. csv')


df.hcad{l0)

()111"[ ]:
-- 0
R&DSpend Administration
165349.20 136897.80
-Marl<--cting Spend

471784.10 New York


State
Profit

192261.83
162597.70 151377.59
443898.53 California 191792.06
2 153441.51 101145.55 407934.54 Florida 191050.39
144372.41
..
3 118671.85 383199.62
142107.34 New York 182901.99
91391.77
366168.42
Florida 166187.94
s 131876.90 99814.71 36286136
New York 156991.12
6 134615.46 147198.87 127716.82
California 156122.51
7 130298.13 145530.06 323876.68
Florida 155752.60
8 120542.52 148718.95 311613.29
New York 152211.77
9 123334.88 108679.17 304981.62
California 149759.96

In f ]: df.shape

out[ ]: (50, 5)

Handling Missing Data

In [ J: #Check if any null or empty data is present in dataset


#if not a numbe,-/nuLL -> df.dropna(inplace=True)
#df{'"R&D Spend"] = df["R&D Spend HJ. replace(np. NaN, df["R&D Spend"] .mean())
df.isna().sum()
••- w••-•• • -

Out[ ]: R&D Spend 0


Administration 0
Marketing Spend 0
State 0
Profit 0
dtype: int64

In [ ]: print("total number of rows : {0}".format(len(df)))


print("number of rows missing R&D Spend: {0}".format(len(df.loc[df['R&D Spena']== 0))))
print("number of rows missing Administration: {0}".format(len(df.loc[df['Administration'] = 0))))
print("number of rows missing Marketing Spend: {0}".format(len(df.locfdf['Marketing Spend']== 0))))
print("number of rows missing State: {0}".format(len(df.loc[df['State'] == 0))))
print("number of rows missing Profit: {0}".format(len(df.loc[df("Profit'] == 0])))·
total number of rows : 50
number of rows missing R&D Spend: 2
number of rows missing Administration: 0
number of rows missing Marketing Spend: 3
number of rows missing State: 0
number of rows missing Profit: 0

In ( ]: df.describe()
Out[ ]: R&DSpend Administration
..
Marketing Spend Profit .
50.000000 50.000000 50.000000
count 50.000000
211025.097800 112012.639200
mean 73721.615600 121344.639600

122290.310726 40306.180338
std 45902.256482 28017.802755

0.000000 14681.400000
min 0.000000 51283.140000

129300.132500 90138.902500
25% 39936370000 103730.875000
212716.240000 107978.190000
SO% 73051.080000 122699.795000
299469.085000 139765.977500
75% 101602.800000 144842.180000
471784.100000 192261.830000
max 165349.200000 182645.560000

J: rd_spend
v,.. ,. ,. -------------------- -_ _. - ,.

In ( = df["R&o Spend"] inplace=True)


rd_spend.replace(to_replace = 0, vaiue = rd_speod.mean(),
11ri7fl ' ,narket_spend = df["Marketing Spend"]
ket spend.replace(to_replace = 0 value=m k
,nar - • ar et_spend.mean(), inplace=True)
Finding Missing Values

print("total number of ows : {0}".format(len(df)))


rri [ 1: nd
print("number off rows miissing R&D_Spe : {0}".format(len(df.loc[df['R&DSpend']==-0])))
•nt("number o rows m ssing Adm1nistratio. { 0}"
pri.nt("num ber of row5 m1•ss1•ng M k
ar eting Spennd.· {e}."fformat(len(df.loc[df["Administration') == 0))))
P Of . . 5t · • ormat(len(df.loc(df['Marketing Spend']== 0])))
print("number rO\--IS m ss ng at : {e}" .format(len(df.loc(df['State'] == 0])))
print("number of r0\--1s missing Prof_it:{0 ".format(len(df.loc[df['Profit'] == e])))
total number of rows: 50
number of rows missing R&o Spend: 0
number of rows missing Administration: 0
number of rows missing Marketing Spend: 0
number of rows missing State: 0
number of rows missing Profit: 0

outliers Detection

For Normal distributions: Use empirical relations of Normal distribution.

-> The data points which fall below mean-3(sigma) or above mean+ 3(sigma) are outliers.

For Skewed distributions: Use Inter-Quartile Range (IQR) proximity rule.

-> The data points which fall below Q1 -1.S IQR or above Q3 + 15 IQR are outliers.

where Q1 and Q3 are the 25th and 75th percentile of the dataset respectively, and IQR represents the inter-quartile
range and given by Q3 - Q1.

Z-score treatment:
Assumption- The features are normally or approximately normally distributed. ·

Plotting Graph

In ]: plt.figure(figsize={l6,10))

plt.subplot(2,2,1)
sns.distplot(df["R&D Spend"])

plt.subplot(2,2,2)
sns.distplot(df["Administration"])

plt.subplot(2,2,3)
sns.distplot(df["Marketing Spend"])

plt.subplot(2,2,4)
sns.distplot(df["Profit"])

plt.show()
-
lo-5
1.2 u

1.2
1.0

1.0

o.a
r!
0.6

0.4

0.2

0.0
R6'D Spend 50000 100000 l.50000 200000
Mministration
10-6
4.0 lo-5
1.2
J.5

1.0
J.0

2.5 0.8
- If! 2.0 -
! 0.6
1.5
0.4
1.0

0.5 0.2

0.0
0.0
100000 200000 lXlOOO!J .ciroooo :!00000 600000 -50000 50000 100000 150000 200000 250000
0
Mulcetingnd
Profit

In[ ): #Finding the Boundary Values

print{"Highest allowed R&D Spend m,df["R&D Spend"J.mean() + 3*df["R&D Spend"J.std())


print("Lowest allowed R&D Spend ",df["fl&D Spend"].rnean() - 3*df["R&D Spend"].std{),"\n")

print{"Highest allowed Administration ",df["Administration"J.mean() + 3*df["Administration"J.std())


print("Lowest allowed Administration ",df["Administration"J.mean() - 3*df["Administration"J.std(),"\n")

print{"Highest allowed Marketing Spend ",df[hMarketing Spend"J.m an() + 3*df["Marketing Spend"J.std())


print{"Lowest allowed Marketing Spend ",df["Marketing Spend"J.mean() - 3*df["Marketing Spend"J.std(),"\n

print("Highest allowed Profit",df["Profit"].mean() + 3*df["Profit"J.std{))


print("Lowest allowed Profit",df["Profit"J.mean() - 3*df["Profit"J.std(),"\n")
Highest allowed R&D Spend 206619.73822903878
Lowest allowed R&D Spend -53278.77778103876

Highest allowed Administration 205398.04786646605


Lowest allowed Administration 37291.23133353396

Highest allowed Marketing Spend 553207.7653167574


Lowest allowed Marketing Spend -105834.55798075726

Highest allowed Profit 232931.18021295167


Lowest allowed Profit -8905.901812951619

In [ ]: # Finding the Outliers


df[(df["R&D Spend"] > 206619.73822) f (df["R&D Spend"]< -53278.7777))
Out[ ] :
R&D Spend Administration Marketing Spend State Profit

In[ J: # Finding the Outliers


1 f[(df["Admi is ration"] > 205398.04786) I (df["Administration"J < 37 91.23133))
Out[ ]:
R&D Spend Administration Marketing Spend State Profit

/ ln [ ) : 1 # Finding the Outliers


r-
.df[(df["Marketing Spend"]> 553207.76531) I (df["Marketing Spend")< -105834.5579))

l -
nb
t1 2/')'
R&D Spend AdmlnlstrAtlon Mnrkatlng Spend Stlltll Profit
,,,,ti 1

11 finding the o,rtL ie,·.


1
,, I J:
df[(df["Pr fit") > 232931.18021) I (df["l'roflt") < •090 .9018 2)]

R&D Spend Administration M11rketlng s1,and State rro11t

Plotting Graph for Outliers

sns,boxplot(df["Admini trntion"))
Jn r ].
<AxesSubplot: >
1tJl{ ):

180000

160000

140000

120000

100000

80000

60000

In [ ] : #finding the IQR since ske1-1ed


percentile25 = df['Administration').quantile(0.25)
percentile75 = df['Administration').quantile(0.75)
iqr=percentile75-percentile25#q3-q1

upper_limit percentile75 + 1.5 • iqr


lower_limit percentile25 - 1.5 • iqr

df[df["Administration"] > upper_limit]


df[df["Administration"] < lower_limit]
multi_df=df

Vectorized String operations

In [ ]: # Panda series use in this section


names pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White', 'Hank Shrader', 'Mike Ehrmantraut',
'Gus Fring'])
names
Out[ ] : 0 Walter White
1 Jesse Pinkman
2 Skyler White
3 Hank Shrader
4 Mike Ehrmantraut
5 Gus Fring
dtype: object
111
[ l
J: names.str.upper()

file:Jt1F:17th ,,
sel1\/Departmental Lab/ml_lab/Abhlnav_Chaturvedl-0901CS191003/lab.html
13/33
Lab
WALTER WHITE
JESSE PINKMAN
SKYLER WHITE
HANK SHRADER
4 MIKE EHRMANTRAUT
5 GUS FRING
dtype: object

rn [ J: names.str.len()
]: 0 12
outf 1 13
2 12
3 12
4 16
S 9
dtype: int64

In f J: names.str.startswith('W')

0 True
out( ]:
1 False
2 False
3 False
4 False
5 False
dtype: bool
Vectorized indexing and slicing

In [ ]: names.str[0]

out[ J: 0 w
1 J
2 s
3 H
4 M
5 G
dtype: object

In [ j: names.str.slice(0,2)

Out[ ]: 0 Wa
1 Je
2 Sk
3 Ha
4 Mi
s Gu
dtype: object

In [ ]: names.str.split()

Out[ j: 0 [Walter, White]


1 [Jesse, Pinkman]
2 [Skyler, White] ..
3 [Hank, Shrader]
4 [Mike, Ehrmantraut]
s [Gus, Fring]
dtype: object
In [ ]: names.str.split().str.get(0)

Out[ ]: 0 Walter
1 Jesse
2 Skyler
3 Hank
4 Mike
s Gus
dtype: object
: Experiment 10:
Use scikit-learn package in python to implement following machine learning models to solve real world problems
·-'e using open source datasets

unear Regression Model

linear df=pd.read_csv('Salary Oat a.csv')


- -

linear_df.head( 10)

VearsExperience S a l a r_
y

1.1 39343.0
0
13 46205.0
1

2
1.5 37731.0

2.0 43525.0
3

22 39891.0
4

s .2.9 56642.0

6 3.0 60150.0

7 32 54445.0

8 32 64445.0

9 3.7 57189.0

linear_df .info() ·-. ........ · ..-..•.•. _ .... _ -··


<class •pandas.core.frame.DataFrame'>
Rangeindex: 30 entries, 0 to 29
Data columns (total 2 columns):
# Column Non-Null Count Dtype

0 YearsExperience 30 non-null float64


1 Salary 30 non-null float64
dtypes: float64(2)
memory usage: 608.0 bytes

x = linear_df[ ['YearsExperience']]
X
sE,cperience 2.0
3
1.1 2.2

v
0
13
2.9
"
e
1 s
a 1.5 3.0
r 2
6
3.2
1
3.2
8
3.7
9
3.9
10
11 4.0

12 4.0

13 4.1

14 4.5
15 4.9

16 5.1

17 53

18 5.9

19 6.0

20 6.8

21 7.1

22 7.9

23 8.2

24 8.7

25 9.0

26 9.5

27 9.6

28 10.3

29 10.5

In[ ) : · ·· ·· ·
Y: linear_df.iloc[:,1].values
y

Oltt[ ): 56642., 60150.,
array( [ 39343., 46205., 37731., 43525., 39891.,
56957., 57081.,
54445., 64445., 57189., 63218., 55794., 93940., 91738.,
81363.,
61111., 67938., 66029., 83088., 116969., 112635.,
105582.,
98273., 101302., 113812., 109431.,
122391., 121872.])
In [ ]:
Plt.scatter(x,y)
Plt.show()
•• • •

10

rn[ ]:

In [ ]:

Out[ J:

In [ ] : y_pred = model. predict(x)


y_pred

Out[ ]: array([ 36187.15875227, 38077.15121656,


39967.14368085, 44692.12484158,
46582.11730587, 53197.09093089, 54142.08716303, 56032.07962732,
56032.07962732, 60757.06078805, 62647.05325234, 63592.04948449,
63592.04948449, 64537.04571663, 68317.03064522, 72097.0155738,
73987.00803809, 75877.00050238, 81546.97789525, 82491.9741274,
90051.94398456, 92886.932681 , 100446.90253816, 103281.8912346,
108006.87239533, 110841.86109176, 115566.84225249, 116511.83848464,
123126.81210966, 125016.80457395])
In [ ] :
Plt.scatter(x,y)
Plt.title("Linear Regression using Ordinary Least Square Meth0d")
plt.plot(x,y_pred,color='red',label='Best Fit Line')
Plt.legend()
It.show()
Ltlb
Linear R gr sslon using Ordl
nary Least
- Oest Flt Lin quare Method
120000

100000

aoooo

60000

40000

1n I ): model.coef_

out[ ]: array([9449.96232146))

In I ): model.intercept_

Out[ ]: 25792.20019866871

In f ]: model.predict([[4)))

Out[ ): array([63592.04948449))

In [ ): from sklearn.metrics import r2_score


r2_score(y,y_pred)•100

Ou [ ) : 95.69566641435085

Mult-linear Regression Model

!n [ ): x = multi_df.iloc[:,:-1) # Independent features


y = multi_df.iloc[:,-1) # Dependent feature
x.head()

Out[ J: R&DSpend Administration Marketing Spend State

0 165349.20 136897.80 471784.10 New Yori::


1 162597.70 151377.59 443898.53 California
2 153441.51 101145.55 407934.54 Florida
3 144372A1 118671.85 383199.62 New Yori::
4 142107.34 91391.77 366168.42 Florida

r I l : y,head()

Ouq J. e 192261.83
1 191792.06
2 191050,39
3 182901.99
4 166187.94
Name: Profit, dtype: float64
In l l: x,State.value_counts()
lab
New York 17
California 17
Florida 16
Name: State, dtype: int64

".,conve,-t these catego,·ic values into one-ltot


rn( 1 ·· one_hot_states pd.get_dummies(x.state) encoding.

one_hot_states.head()
111 [ l:
Califomla Florida New Yori<
outf J:
0 0 0

0 0

2 0 0

3 0 0

0
" 0

IO
): x.drop(["State"], axis= 1, inplace = True)

In [ }: x = pd.concat([x, one_hot_states]. axis l)

In( ]: x.head(S)

out[ ] : R&D Spend Administration Matl<eting Spend California Aorida New Yori<

0 16534920 136897.80 471784.10 0 0


1 162597.70 151377.59 443898.53 .0 0
2 153441.51 101145.55 407934.54 0 0
3 144372.41 118671.85 383199.62 0 0

" 142107.34 91391.77 366168.42 0 0

In ]; from sklearn.model_selection import train_test_split


from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In[ ]: xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.2, ra dom_state 0)

In ]:· multi_linear_reg = linearRegression()


multi_linear_reg.fit(xtrain, ytrain)

Out [ l: T LinearRegression.

LinearRegression()

rn[ ]: multi_linear_reg_predictions = u1.:::,_1: ear_reg.prejict(xtes )

ln [ l: print("R2 score:", r2_score(ytest, multi_linear_reg_predictions))


R2 score: 0.8711226942394046

Decision Tree Classification Model

In [ ); diabetes_data.head()
bml dlab_pred age skin diabetes
Out[ ]: num_preg glucose bp thickness Insulin

-----------
0
..---·-
35 0 33.6 0.627 so 1.3790
6 148 72
0.351 31 1.1426 0
66 29 0 26.6
85
0 23.3 0.672 32 0.0000
2 8 183 64 0
0.167 21 0.9062 0
3 66 23 94 28.1
89
43.1 2.288 33 1.3790
35 168
4 0 137 40
1f1,(}.2•
diabetes_data.c
olumns
1,, l l·
Index(['num_preg', 'glucose', 'bp', 'thickness',
cwt[ }' •age•, •skin·, 'diabetes'), 'insulin', 'bmi', 'diabJ)red',
dtype='object')

diabetes_data.dcscribc()
JO [ J:
num_preg glucose bp thickness
ouiC 1:
Insulin bml dlab_pred age skin diab<

count 768,000000 768.000000 768,000000 768.000000 768.000000 768,000000 768.000000 768,000000 766.000000 768,000

n,ean 3.64S0S2 120.894S31 69.105469 20.S36458 0.809136 0348


79,799479 31,992578 0.471876 332A0885
std 3.369S78 31.972618 1935S807 1S.9S2218 115,244002 0.628517 OA16
7.884160 0.331329 11.760232
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000
0.000000 0.000000 0.078000 21.000000
2S% 1.000000 99.000000 62.000000 0.000000 0.000000 24.000000 0.000000 0,000
27.300000 0.243750

SO% .3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.906200 0,000

7S% 6.000000 1402S0000 80.000000 32.000000 1272S0000 36.600000 0.6262S0 41.000000 1260800 1.000

max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 3.900600 1,000

'.1't
""
diabetes_data.isnull().sum()
In l ]:
num_preg 0
out[ ] :
glucose 0
bp 0
thickness 0
insulin 0
bmi 0
diab_pred 0
age 0
skin 0
diabetes 0
dtype: int64

plt.figure(figsize=(12,10))
In l :] b has an easy method to showcase heatmap
# sea orn () annot=True,cmap ='RdVlGn')
p = sns.heatmap( diabetes:-:- at . orr , - -· - - -

...

20/33
,
2f7.Z,
ti
1.0

QI

0.28 0.24 0.21


0.8

V
0.39 0.18
£
C
1.,§, 0.2 0.19 -0.6
-

!
- 0.4
0 Hi
'O

'1
.J:J
10
'6
i- 0.2

0.0

10
'6

num_preg glucose bp thickness insulin bmi diab_pred age skin diabetes

In [ ]:·diabetes_data_copy = diabetes_data.copy(deep = True)


diabetes_data_copy[['glucose','bp','thickness','insulin','bmi']] = diabetes_data_copy[['glucose','bp',
'thickness','insulin','bmi']].replace(0,np.NaN)

lu Showing the Count of NANs


lprint(diabetes_data_copy.isnull().sum())
num_preg 0
glucose 5
bp 35
thickness 227
insulin 374
bmi 11
diab_pred 0
age 0
skin 0
diabetes 0
dtype: int64

Data Visualization

In [ l: ·P -= diabetes_data.hist(figsize = (20,20))
Lab

ISO
100

100

40 UC

thkkness
nst.ltin

1SO

200
100

100
5,0

20 40 60 80 l.00 200 400 600 eoo

skin

T 2-0

lia

500

--

:,00

-- ---
- --
o.2 0.4 0.6 o.a U)
0.0

In [ ): diabetes_data_copy['glucose'].fillna(diabetes_data_copy['glucose').mean(), inplace = True)


diabetes_data_copy['bp'].fillna(diabetes_data_copy['bp').mean(), inplace= True)
diabetes_data_copy['thickness'].fillna(diabetes_data_copy['thickness').median(), inplace = True)
diabetes_data_copy['insulin'].fillna(diabetes_data_copy['insulin'].median(), inplace = True)
diabetes_data_copy['bmi'].fillna(diabetes_data_copy['bmi'].median(), inplace = True)

In [ ]:lp= diabetes_data_copy.hist(figsize= (20,20))

22
1 191
Lob

"••"• 150

uo )00

IO<i
150
10
..., lf,I

..,
Ill
JO

C,O
IW IA<l
"'"
500 JO-,

no 11,
400
I!.</

lOO IJ>

IOO
JOO ,,
50
100
,.
JOO «)O IOO 20
,,, .., 50 C,O

200

IOO
IOO

40 ,0 60 70 10
1..5

clabet s
,00

400

200

IOO

In [ ] : diabetes_data_copy.head()
thickness insulin bmi diab_pred age skin diabetes
num_preg glucose bp
Out[ ] :
125.0 33.6 0.627 50 1.3790
6 148.0 72.0 35.0
0
26.6 0351 31 1.1426 0
85.0 66.0 29.0 125.0
1
125.0 233 0.672 32 0.0000
8 183.0 64.0 29.0
2
0.167 21 0.9062 0
66.0 23.0 94.0 28.1
3 89.0
168.0 43.1 2.288 33 13790
137.0 40.0 35.0
4 0

Standard scaling

In [ ): from sklearn.preprocessing import StandardScaler


1) ,), columns=['num_pre,
!Xsc=_X =pdS.tOaantdafarradmSec(aslce_rX(.)fit_transform(diabetes_data_copy.drop(["diabetes"),axis
•glucose•, 'bp·, ·thickness', 'insulin', · bmi', 'diab_pred', 'age',· skin'])
X,heat
Lab
num_preg glucose bp thickness
Olft[ ]: ----· - --· Insulin bmi diab_pred age skin
0 0.639947 0.865108 -0.033518 0.670643 -0.181541 0.166619 0.468492 1.425995 0.907270
1 -0.844885 -1206162 -0.529859 -0.012301 -0.181541 -0.852200 -0.365061 -0.190672 0.530902
2 1233880 2.015813 -0.695306 -0.012301 -0.181541 -1.332500 0.604397 -0.105584 -1288212
3 -0.844885 -1.074652 -0.529859 -0.695245 -0.540642 -0.633881 -0.920763 -1.041549 0.154533
-1.141852 0.503458 -2.680669 0.670643
"' 0.316566 1.549303 5.484909 -0.020496 0.907270

rn[ ]: y = diabetes_data_copy.diabetes
y

out[ l: 0 1
1 0
2 1
3 0
4 1

763 0
764 0
765 0
766 1
767 0
Name: diabetes, Length: 768, dtype: int64
Splitting the dataset

In ]: X = diabetes_data.drop('diabetes', axis=l)
y = diabetes_data['diabetes']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33,random_state=7)

Model Training - Decision Tree

In ]: from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)

Olft[ ] :

In ]: from sklearn im ort metrics

:predictions= dtr;e.predict(X_test)
print(uAccuracy Sco e =", format(metrics.accuracy_score(y_test,predictions)))
,. - . : - .. - . . - --· . .

Accuracy Score= 0.7086614173228346

In [ ]: from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, predictions))
print{classification_report(y_test,predictions))

[[130 32)
[ 42 50]]
precision recall fl-score support

0 0.76 0.80 0.78 162


1 0.61 0.54 0.57 92

accuracy 0.71 254


macro avg 0.68 0.67 0.68 254
weighted avg 0.70 0.71 0.70 254

Random Forest Model


Lab
from klcnrn.enscmblc import Rond0111ForestClassifir

rfc = RandomForestClassifier(n_cst1mators■200)
rfc.fit(X_train, y_troin)

11d l \. RandomFore t lo sifler --\


domForestCla sifier(n_e timato _..,2.00)_ I

rfc_tra1n "' 1-fc.pr die ( _J:rnin)


In l l {i'Offl sklearn import mclrl s

print(NAc uro y_ ore ·•• format(metrics.accuracy_score(y_tra1n, rfc_tra1n)))


Accuracy_Score = 1.0
so from above we can inference that training dataset our model is overfitted

In )' from sklearn import metrics


predictions= rfc.predict(X_test)
print(•Accuracy_Score =•. format(metrics.accuracy_score(y_test, predictions)))
Accuracy_Score = 0.7SS9055118110236

): from sklearn.metrics import classification_report, confusion_matrix


1
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test,predictions))

[[133 29]
[ 33 59]]
precision recall fl-score support

0 0.80 0.82 0.81 162


1 0.67 0.64 0.66 92
0.76 254
accuracy
0.74 0.73 0.73 254
macro avg
0.75 0.76 0.75 254
weighted avg

SVM Model

InL ): from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(X_train, y_train)

Out(

In ( ) :
.svc_pred = svc_model.predict_(x_ e t)
. .

In [ ]: from sklearn import metrics


svc_pred)))
" format(metrics.accuracy_score(y_t_est,
print("Accuracy Score=•
Accuracy Score= 0.7519685039370079
t confusion_matrix
import classification_repor,
lt1 ( l: froM sklearn.metrics
( t st svc_pred))
Print(confusion_matrix y_ret(,y test,svc_p red))
pr1nt(class1f1cat1on_repo -

:
11ri2f2 I

((146 16]
[ 47 45]]
precision recall fl-score support
0 0.76 0.90 0.82 162
1 0.74 0.49 0.59 92

accuracy 0.75 254


macro avg 0.75 0.70 0.71 254
weighted avg 0.75 0.75 0.74 254

K-Means Clustering Model

Int ]: customers_df=pd.read_csv('Mall_Customers.csv')
customers_df.head()

out[ 1. CustomerlD Gender Age Annual Income (k$)


J • Spending Score (1-100)

0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6

3 4 Female 23 16 77

4 5 Female 31 17 40

Inf ]: customers_df.corr()

Out[ J: Customer1D Age Annual Income (k$) Spending Score (1-100)

CustomertD 1.000000 -0.026763 0.977548 0.013835

.Age -0.026763 1.000000 -0.012398 -0327227

Annual Income (le$) 0.977548 -0.012398 1.000000 0.009903

Spending Score (1-100) -0.013835 -0327227 0.009903 1.000000

In [ J: . #Distribution of Annnua,lIncome
.plt_.figure(figsize=(10, 6))
sns.set(style = 'whitegrid')
sns.•distplot(customers_df['Annual Income (k$) '])
plt.title{'Distribution of Annu l Income (k$)', fontsize = 20)
plt.xlabel{'Range of Annual Income (k$)')
.plt.ylabel('Count')

Out(·]: Text(0, 0 5, 'Count')

. ...
Lab

Distribution of Annual Income (k$)


0.016

0.014

0.012

0.010

8 c 0.008
0.006

0.004

0.002

0.000 --
50 75 100 125 150
0 25
Range of Annual Income (kS)

In [ ): #Distribution of age
plt.figure(figsize=(l0, 6))
sns.set(style = 'whitegrid')
sns.distplot(customers_df['Age'])
plt.title('Distribution of Age', fontsize = 20)
plt.xlabel('Range of Age')
plt.ylabel('Count')

Out[ ]: Text(0! 0.5, 'Count')


Distribution of Age
0.035

0.030

0.025

.... 0.020
C:
:,
8
O.D15

0.010

0.005

0.000

In [ ]: #Distribution of spending score


plt.figure(figsize=(10, 6))
sns.set(style = 'whitegrid')
sns.distplot(customers_df['Spending Score (1-100)'))
plt.title('Distribution of Spending Score (1-100)', fontsize = 20)
plt.xlabel('Range of Spending Score(1- ,
Lab
plt.ylabel('Count') ) )
100
out[ J: rext(0, 0.5, 'Count')

0.018 Distribution of Spending Score (1-100)

0.016

II
0.014

0.012

0.010
c::,
8 0.008
,- .

0.006

0.004

0.002

0.000
-20 0 20 40 60 80 100 120
Range of Spending Score (1-100)

In [ ): genders= customers_df.Gender.value_counts()
sns.set_style("darkgrid")
plt.figure(figsize=(10,4))
sns.barplot(x=genders.index, y=genders.values)
• plt.show()

100

80

60

40

20

0
Female Male

i -- -
In [ ]: 1#We take just the Annual Income and Spending score
:dfl=customers_df[["CustomerID","Gender","Age","Annual Income (k$)","Spending Score (1-100)"]]
_X=dfl[["Annual Income (k$)","Spending Score (1-100)"]]

In[ ]: i X.head()
Llt{ J

Annual Income (k$) pending Score (1-100)

0 15
39
15
81
2 16 6
3 16 17
4 17 40

111 [ ) : #Scatterplot of the input data


plt.figure(figsize=(le,6))
sns.scatte,rplot(x = 'Annua1 Income (k$), _ ,
plt.xlabel( AnnualIncome(k$)') ,y - Spending Score (1-100)' data c X ,s c 60)
plt.ylabel('Spending Score (1-100)') '
t i t l e ('Spending Score (1-100
plt. s h ow ( ) ) vs Annual Income (k$)')

Spending Score (1-100) vs Annual Income (k$)


100
• e • • • •
• • • ••• i • • • •• •
• • •
00
•• •
0 • •• •
0
"'7
60 • • (0 (0

Cl) •e i E
0 e ,... .._v ,..
0 .. , ! !e
Cl)
•09 c .•••"
O>
• •
...
C :
:s 40
•• •• ••

C:
Cl)
a.
• •• •• •
Cl)

20 : ••
• •
• ••
•••••
i
.:

.• • •

••• • •• ••••
0
100 120 140
40 60 80
20
Annual Income (k$)

In [ ): #Importing KMeans from sklearn


from sklearn.cluster import KMeans

In[ ): # Within Cluster Sum of Squared Errors (WSS) for different values of k
wcss=[]
for i in range(l,11):
km=KMeans(n_clusters=i)
km.fit(X)
wcss.append(km.inertia_)

In [ ): #The elbow curve


plt.figure(figsize=(12,6))
plt.plot(range(l,11),wcss)
plt.plot(range(l,11),wcss, linewidth=2, color="red", marker ="8")
plt.xlabel("K Value")
plt.xticks(np.arange(l,11,1))
plt.ylabel("WCSS")
plt.show() .

291
250000

200000

(/)
(/) 150000

100000

50000

2 3 4 5 6 7 8 9 10
KValue

in the graph, after 5 the drop is minimal, so we take s to be the number of clusters.

In [ ]: #Taking 5 clusters
kml=KMeans(n_clusters=S)
#Fitting the input data
kml.fit(X)
#predicting the Labels of the input data
y=kml.predict(X)
#adding the Labels to a column named Label
dfl["label"] = y
#The new dataframe with the clustering done
dfl.head( 10)

Out[ ] : CustomerlD Gender Age Annual Income (k$) Spending Score (1-100) label

0 Male 19 15 39 0

1 2 Male 21 15 81 3

2 3 Female 20 16 6 0

3 4 Female 23 16 77 3

4 5 Female 31 17 40 0

s 6 Female 22 17 76 3

6 7 Female 35 18 6 0

7 8 Female 23 18 94 3

Male 64 19 3 0
8 9

9 10 Female 30 19 72 3

In [ ): #Scatterplot of the clusters


!Plt.figure(figsize=(10,6))
sns.scatterplot(x= 'Annual Income (k$)',y = 'Spending Score (1-100)',hue="label",
palette=['green','orange','brown','dodgerblue','red'), legend='full',data dfl ,s 61
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Spending Score (1-100) vs Annual Income (k$)')
plt.show()
Spending Scar (1 10
100 •• e - 0) vs Annual Income (k$)

• •
• •
-
• I • •
• • ••
-
0
• •
-8
0

,.!.
Q)
60 • label
• 0
Cl)
Cl 1

•• •• • 2
C
:Cs 40
• • • 3

CD
a. • 4
Cl)

20

@

0
•• e

20 40 60 80 100 120 140


Annual Income (k$)

In ( ): #Taking the features


df2=customers_df[["CustomerID","Gender","Age","Annual Income (k$)","Spending Score (1-100)"]]
X2=df2[["Age","Annual Income (k$)","Spending Score (1-100)"]]
#Now we caLcuLate the Within Cluster Sum of Squared Errors (WSS) for different values of k.
wcss = []
fork in range(1,11):
kmeans = KMeans(n_clusters=k, init="k-means++")
kmeans.fit(X2)
wcss.append(kmeans.inertia_)
plt.figure(figsize=(l2,6))
plt.plot(range(1,11),wcss, linewidth=2, color="red", marker ="8")
plt.xlabel("K Value")
plt.xticks(np.arange(l,11,1))
plt.ylabel("WCSS")
plt.show()

300000

250000

200000
Cf)
Cf)

150000

100000

50000

2 3 4 5 6 7 8 9 10
KValue

In [ ) : km2 = KMeans(n_clusters=S)
y2 = km.fit_predict(X2)
df2["label"] = y2.
#The dara wirh Labels
df2.head()
CustomerlD Gender Age Annual Income (k$)
Olld l: Spending Score (1·100) label
0 Male 19 15 39
2 Male 21 15 01 5
2 3 Female 20 16 6
3 4 Female 23 16 77 5
4 5 Female 31 17 40

rnr ]: #3D Plot as we did the clustering on the basis of 3 input features
fig= plt.figure(figsize=(20,10}}
ax= fig.add_subplot(lll, projection='3d')
ax.scatter(df2.Age[df2.label == 0], df2["Annual Income (k$)")[df2.label == 0),
df2["Spendinc Score (1-100}"][df2.label == 0), c='purple', s=60}
ax.scatter(df2.Age[df2.label == 1], df2["Annual Income (k$)")[df2.label == 1),
df2["Spending Score (1-100}"][df2.label == 1], c='red', s=60)
ax.scatter(df2.Age[df2.label == 2], df2["Annual Income (k$)"][df2.label == 2),
df2["Spending Score (l-100}"][df2.label == 2), c='blue', s=60}
ax.scatter(df2.Age[df2.label == 3], df2["Annual Income (k$)"][df2.label == 3),
df2["Spending Score (1-100)"][df2.label == 3), c='green', s=60)
ax.scatter(df2.Age[df2.label == 4], df2["Annual Income (k$)"][df2.label == 4],
df2[" pendin3 Score (1-100}"][df2.label == 4), c='yellow', s=60}
ax.view_init(35, 185)
plt.xlabel("Age")
plt.ylabel("Annual Int m! k .)")
ax.set_zlabel('Spe11dir g . ,..re {1-.!Ni)•)
plt.show()

• •• ••••
• • ,. -4ri •

, -· ... . .
60
• • • ••
•.
.
70

• .•.
,....
I 0
50
C/)
60
:C:,t>
e:, : 40
co
• 50
C/)

8 • - Age

-
30
-,
(D 40

-
I 20
0 30
0
10
20
140 120 100 80 60 40 20
Annual Income (kS)

: '
-
custl=df2[df2("label"]==l]
print('Number of customer in 1st r _,
print('They are-•, cust1("Custom!r; ?.J • len(cust1))
print(" .values)
cust2=df2[df2("label"]==2] .,_ ..)
Print('Number of customer in 2nd group::• le (
print('They are-•, cust2["CustomerIO"], n cust2))
print(" :values)
cust3=df2[df2["label"]==0] ----------------- ")
print('Number of customer in 3rd group='
1
print('They are-•, cust3["CustomerIO"], len(cust3))
.. .va ues)
print(
cust4=df2[df2["label"]==3] ----------------- ")
print('Number of customer in 4th group-' ·
• • -, 1en(cust4))
print( They are - , cust4["CustomerID"].values)
print("-----------------------------
cust5=df2(df2("label"]==4] ------------------ ")
print{'Number of customer in 5th group-• 1 (
, , - , en custs))
print( They are - , custS("CustomerID"].values)
print(" ")
Number of customer in 1st group= 12
They are - [ 1 3 5 17 21 27 29 39 43 45 49 50]

Number of customer in 2nd group= 35


They are- [ 44 48 52 53 59 62 66 69 70 76 78 79 82 85 88 89 92 94
95 96 98 100 101104106 112 113 114 115 116 121122 123 133 143)
-
Number of customer in 3rd group= 10
They are - [181183 185 187 189 191193 195 197 199)

Number of customer in 4th group= 17


They are - [127 129 131137141147151153155 161 165 167 169 171175 177 179]

Number of customer in 5th group= 44


They are - [ 41 47 51 54 55 56 57 58 60 61 63 64 65 67 68 71 72 73
74 1s n 80 81 83 84 86 87 90 91 93 97 99 102 103 105 101
108 109 110 111117118 119 120]

You might also like