AIML Lab Manual
Faculty Coordinator
Lab Coordinator
Mr. D. S. Tomar
LIST OF EXPERIMENTS
3 Apply Python built-in data types: Strings, List, Tuples, Dictionary, Set and their methods to solve any given problem
6 Import a CSV file and perform various Statistical and Comparison operations on rows/columns
8 Import any CSV file to Pandas DataFrame and perform the following:
1. Visualize the first and last 10 records
2. Get the shape, index and column details
3. Select/Delete the records (rows)/columns based on conditions.
4. Perform ranking and sorting operations.
5. Do required statistical operations on the given columns.
6. Find the count and uniqueness of the given categorical values.
9 Import any CSV file to Pandas DataFrame and perform the following:
1. Handle missing data by detecting and dropping/filling missing values.
2. Transform data using different methods.
3. Detect and filter outliers.
4. Perform Vectorized String operations on Pandas Series.
5. Visualize data using Line Plots, Bar Plots, Histograms, Density Plots and Scatter Plots.
: Experiment 1:
Perform creation, indexing, slicing, concatenation and repetition operations on Python built-in data types: Strings, List, Tuples, Dictionary, Set
#String Creation
str1 = "Abhinav"
str2 = "Chaturvedi"
#String Concatenation
str3 = str1 + " " + str2
print(str3)
#String indexing
print(str3[0])
#String slicing
print(str3[:7:])
print(str3[0:7:2])
print(str3[::-1])
#String repetition
print(str1 * 3)
Abhinav Chaturvedi
A
Abhinav
Ahnv
idevrutahC vanihbA
AbhinavAbhinavAbhinav
In [ ]: #List Creation
list1 = ['Python', 'C++', 'JavaScript']
list2 = [1, 2, 3]
#List Concatenation
list3 = list1 + list2
print(list3)
#List indexing
print(list3[0])
#List slicing
print(list3[:3:])
print(list3[0:7:2])
print(list3[::-1])
#List repetition
print(list1 * 3)
['Python', 'C++', 'JavaScript', 1, 2, 3]
Python
['Python', 'C++', 'JavaScript']
['Python', 'JavaScript', 2]
[3, 2, 1, 'JavaScript', 'C++', 'Python']
['Python', 'C++', 'JavaScript', 'Python', 'C++', 'JavaScript', 'Python', 'C++', 'JavaScript']
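In [ ]: #Tuple Creation (reconstructed from the printed output; the creation lines are missing in the source)
tuple1 = ('Python', 'C++', 'JavaScript')
tuple2 = (1, 2, 3)
#Tuple Concatenation
tuple3 = tuple1 + tuple2
print(tuple3)
#Tuple indexing
print(tuple3[0])
print(tuple3[0][1])
#Tuple slicing
print(tuple3[:3:])
print(tuple3[0:7:2])
print(tuple3[::-1])
#Tuple repetition
print(tuple1 * 3)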
('Python', 'C++', 'JavaScript', 1, 2, 3)
Python
y
('Python', 'C++', 'JavaScript')
('Python', 'JavaScript', 2)
(3, 2, 1, 'JavaScript', 'C++', 'Python')
('Python', 'C++', 'JavaScript', 'Python', 'C++', 'JavaScript', 'Python', 'C++', 'JavaScript')
In [ ]: #Dictionary Creation
dict1 = {1:'Python', 2:'C++', 3:'JavaScript'}
dict2 = {'good':1, 'average':2, 'nice':3}
#Dictionary Concatenation
def Merge(dict1, dict2):
    #Unpack both dictionaries into a new one; on duplicate keys dict2 wins
    return {**dict1, **dict2}
dict3 = Merge(dict1, dict2)
print(dict3)
In [ ]: #Set Creation
#A set cannot have mutable elements like lists, sets or dictionaries as its elements.
#You cannot access items in a set by referring to an index or a key.
Set1 = {'Python', 'C++', 'JavaScript'}
Set2 = {1, 2, 3}
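Sets do not support `+`, `*`, indexing or slicing, so the concatenation step used for the other data types maps to a union instead. A minimal sketch, rebuilding the two sets created above; the names `combined`, `has_python`, `common` and `ordered` are illustrative and not part of the lab code:

```python
# The two sets from the cell above, repeated so the sketch is self-contained
Set1 = {'Python', 'C++', 'JavaScript'}
Set2 = {1, 2, 3}

# "Concatenation" for sets is a union; duplicates are dropped automatically
combined = Set1 | Set2          # same as Set1.union(Set2)
print(len(combined))

# A membership test replaces indexing
has_python = 'Python' in Set1
print(has_python)

# Intersection: elements common to both sets (none here)
common = Set1 & Set2
print(common)

# Sorting returns a list, which can then be indexed or sliced
ordered = sorted(Set1)
print(ordered[0], ordered[:2])
```

Converting to a sorted list is only needed when a stable order matters; a set itself has no defined order.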
In [ ]: def print_factors(x):
    #Print every integer from 1 to x that divides x exactly
    print("The factors of", x, "are:")
    for i in range(1, x + 1):
        if x % i == 0:
            print(i)
num = 320
print_factors(num)
Apply Python built-in data types: Strings, List, Tuples, Dictionary, Set and their methods to solve any given problem
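In [ ]: Age = {'person2': 21, 'person5': 24, 'person6': 19, 'person1': 20, 'person3': 23, 'person4': 22}
#Sorting a dictionary directly sorts its keys
sortedDict = sorted(Age)
print(sortedDict)
#Helper returning the value of a (key, value) pair; its definition is not shown in the source
def element_1(item):
    return item[1]
#Sort the (key, value) pairs by value
print(sorted(Age.items(), key=element_1))
sorted(Age.items(), key=element_1)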
['person1', 'person2', 'person3', 'person4', 'person5', 'person6']
[('person6', 19), ('person1', 20), ('person2', 21), ('person4', 22), ('person3', 23), ('person5', 24)]
Out[ ]: [('person6', 19),
 ('person1', 20),
 ('person2', 21),
 ('person4', 22),
 ('person3', 23),
 ('person5', 24)]
: Experiment 4:
Manipulation of NumPy arrays- Indexing, Slicing, Reshaping, Joining and Splitting.
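import numpy as np
#Array definitions are missing in the source; the ones below are reconstructed from the printed output
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
#Indexing and slicing
print(arr[1])
print(arr[::2])
print(arr[1:5])
print(arr[1:5:2])
#Reshaping a 12-element array to 2-D and 3-D
arr1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr1.reshape(4, 3)
print(newarr)
newarr_3d = arr1.reshape(2, 3, 2)
print(newarr_3d)
#Joining 1-D arrays, then 2-D arrays along the columns
arr = np.concatenate((np.array([100, 200, 300]), np.array([400, 500, 600])))
print(arr)
arr = np.concatenate((np.array([[1, 2], [3, 4]]), np.array([[5, 6], [7, 8]])), axis=1)
print(arr)
#Splitting into 3 parts
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr)
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])
newarr = np.array_split(arr, 3)
print(newarr)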
[1 2 3 4 5]
<class 'numpy.ndarray'>
2
[1 3 5]
[2 3 4 5]
[2 4]
[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
[[[ 1  2]
  [ 3  4]
  [ 5  6]]
 [[ 7  8]
  [ 9 10]
  [11 12]]]
[100 200 300 400 500 600]
[[1 2 5 6]
 [3 4 7 8]]
[array([1, 2]), array([3, 4]), array([5, 6])]
[array([[1, 2, 3],
       [4, 5, 6]]), array([[ 7,  8,  9],
       [10, 11, 12]]), array([[13, 14, 15],
       [16, 17, 18]])]
: Experiment 5:
Computation on NumPy arrays using Universal Functions and Mathematical methods.
Note: Universal functions (ufuncs) in NumPy are mathematical functions that operate element-wise on whole arrays, without an explicit Python loop. NumPy provides ufuncs covering a wide variety of operations: standard trigonometric functions, arithmetic operations, handling of complex numbers, statistical functions, etc.
# sine of angles
print('Sine of angles in the array:')
sine_value = np.sin(radians)
print(np.sin(radians))
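The sine cell above uses a `radians` array whose definition is not shown. A minimal sketch of how such an array might be built and passed through a few ufuncs follows; the angle values and the arrays `a` and `b` are assumptions chosen for illustration:

```python
import numpy as np

# Angles in degrees (illustrative values, not taken from the lab data)
degrees = np.array([0, 30, 45, 60, 90])

# np.deg2rad is itself a ufunc: it converts every element at once
radians = np.deg2rad(degrees)

# Trigonometric ufunc applied element-wise
sine_value = np.sin(radians)
print(np.round(sine_value, 3))

# Arithmetic ufuncs: np.add and np.multiply mirror the + and * operators
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
print(np.add(a, b))
print(np.multiply(a, b))

# Mathematical methods on the array itself
print(sine_value.mean(), sine_value.max())
```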
In [ ]:
# Where method to compare the values
# The values are stored in the new column
# By using the where() method in NumPy, we give the condition to compare the columns.
# If 'column1' is less than or equal to 'column2' or 'column1' is less than or equal to 'column3',
# we keep the value of 'column1'. If the condition fails, we give the value as 'NaN'.
df['new'] = np.where((df['R&D Spend'] <= df['Administration']) | (
    df['R&D Spend'] <= df['Marketing Spend']), df['R&D Spend'], np.nan)
df.head()
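To see both branches of the comparison, the same `np.where` pattern can be tried on a tiny frame. This is only a sketch: the three rows are made up, and only the column names are borrowed from the cell above.

```python
import numpy as np
import pandas as pd

# Small made-up frame using the same column names as the lab's df
df = pd.DataFrame({
    'R&D Spend':       [100.0, 500.0, 300.0],
    'Administration':  [200.0, 100.0, 250.0],
    'Marketing Spend': [150.0, 400.0, 350.0],
})

# Condition: R&D Spend <= Administration OR R&D Spend <= Marketing Spend
cond = (df['R&D Spend'] <= df['Administration']) | (df['R&D Spend'] <= df['Marketing Spend'])

# Keep R&D Spend where the condition holds, NaN elsewhere
df['new'] = np.where(cond, df['R&D Spend'], np.nan)
print(df)
```

Only the second row fails both comparisons, so it is the only one that receives NaN.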
In [ ]: print(df.sum())
R&D Spend                                                 3686080.78
Administration                                            6067231.98
Marketing Spend                                          10551254.89
State              New YorkCaliforniaFloridaNew YorkFloridaNew Yo...
Profit                                                    5600631.96
new                                                       3686080.78
dtype: object
5600631.960000001
In [ ]: print(df.mean())
R&D Spend           73721.6156
Administration     121344.6396
Marketing Spend    211025.0978
Profit             112012.6392
new                 73721.6156
dtype: float64
5600631.960000001
: Experiment 7:
Create Pandas Series and DataFrame from various inputs.
In [ ]: data = np.array(['person1', 'person2', 'person3', 'person4', 'person5'])
ser_stud = pd.Series(data)
print(ser_stud)
0 person1 95
1 person2 94
2 person3 99
3 person4 93
4 person5 97
0 person1 10
1 person2 15
2 person3 14
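The cells that built the tables above are not legible in the source. As a hedged sketch of the usual input types for this experiment, the example below builds a Series from a NumPy array and from a dictionary, and a DataFrame from a dictionary of lists and from a list of dictionaries. The column names `name` and `marks` are assumptions, since the printed tables have lost their headers.

```python
import numpy as np
import pandas as pd

# Series from a NumPy array (default integer index)
names = np.array(['person1', 'person2', 'person3'])
ser_from_array = pd.Series(names)

# Series from a dictionary (keys become the index)
ser_from_dict = pd.Series({'person1': 95, 'person2': 94, 'person3': 99})

# DataFrame from a dictionary of lists (keys become column names)
df_from_dict = pd.DataFrame({'name': ['person1', 'person2', 'person3'],
                             'marks': [95, 94, 99]})

# DataFrame from a list of dictionaries (one dictionary per row)
df_from_rows = pd.DataFrame([{'name': 'person1', 'marks': 10},
                             {'name': 'person2', 'marks': 15}])

print(ser_from_array)
print(ser_from_dict['person2'])
print(df_from_dict.shape, df_from_rows.shape)
```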
: Experiment 8:
Import any CSV file to Pandas Data Frame and perform the following:
In [ ]: diabetes_data.head(10)
Out[ ]:    num_preg  glucose_conc  diastolic_bp  thickness  insulin   bmi  diab_pred  age    skin  diabetes
0         6           148            72         35        0  33.6      0.627   50  1.3790      True
1         1            85            66         29        0  26.6      0.351   31  1.1426     False
2         8           183            64          0        0  23.3      0.672   32  0.0000      True
3         1            89            66         23       94  28.1      0.167   21  0.9062     False
4         0           137            40         35      168  43.1      2.288   33  1.3790      True
5         5           116            74          0        0  25.6      0.201   30  0.0000     False
6         3            78            50         32       88  31.0      0.248   26  1.2608      True
7        10           115             0          0        0  35.3      0.134   29  0.0000     False
8         2           197            70         45      543  30.5      0.158   53  1.7730      True
9         8           125            96          0        0   0.0      0.232   54  0.0000      True
In [ ]: diabetes_data.tail(10)
Out[ ]:      num_preg  glucose_conc  diastolic_bp  thickness  insulin   bmi  diab_pred  age    skin  diabetes
758         1           106            76          0        0  37.5      0.197   26  0.0000     False
759         6           190            92          0        0  35.5      0.278   66  0.0000      True
In [ ]: diabetes_data.shape
Out[ ]: (768, 10)
In [ ]: diabetes_data.describe()
Out[ ]: num_preg glucose_conc diastolic_bp thickness insulin bmi diab_pred age skin
In [ ]: #Map the boolean diabetes column to 1/0 (mapping reconstructed from the output below)
diabetes_map = {True: 1, False: 0}
diabetes_data['diabetes'] = diabetes_data['diabetes'].map(diabetes_map)
diabetes_data.head(5)
Out[ ]:    num_preg  glucose_conc  diastolic_bp  thickness  insulin   bmi  diab_pred  age    skin  diabetes
0         6           148            72         35        0  33.6      0.627   50  1.3790         1
1         1            85            66         29        0  26.6      0.351   31  1.1426         0
2         8           183            64          0        0  23.3      0.672   32  0.0000         1
3         1            89            66         23       94  28.1      0.167   21  0.9062         0
4         0           137            40         35      168  43.1      2.288   33  1.3790         1
In [ ]: sns.countplot(x='diabetes', data=diabetes_data)
Out[ ]: <AxesSubplot: xlabel='diabetes', ylabel='count'>
[Figure: count plot of the diabetes column; x-axis: diabetes, y-axis: count]
In [ ]: diabetes_data['diabetes'].value_counts()
Out[ ]: 0    500
1    268
Name: diabetes, dtype: int64
In [ ]: diabetes_data.sort_values("age")
Out[ ]:      num_preg  glucose_conc  diastolic_bp  thickness  insulin   bmi  diab_pred  age    skin  diabetes
255         1           113            64         35        0  33.6      0.543   21  1.3790         1
60          2            84             0          0        0   0.0      0.304   21  0.0000         0
102         0           125            96          0        0  22.5      0.262   21  0.0000         0
182         1             0            74         20       23  27.7      0.299   21  0.7880         0
623         0            94            70         27      115  43.5      0.347   21  1.0638         0
Out[ ]: [Table: first five rows of diabetes_data with an added rank column; the cell values are illegible in the source]
Renaming Columns
In [ ]: diabetes_data.rename(columns = {'glucose_conc': 'glucose', 'diastolic_bp': 'bp'}, inplace = True)
In [ ]: diabetes_data.head()
Out[ ]: num_preg glucose bp thickness insulin bmi diab_pred age skin diabetes
In [ ]: diabetes_data.age.unique()
Out[ ]: array([50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 34, 57, 59, 51, 27, 41, 43,
       22, 38, 60, 28, 45, 35, 46, 56, 37, 48, 40, 25, 24, 58, 42, 44, 39,
       36, 23, 61, 69, 62, 55, 65, 47, 52, 66, 49, 63, 67, 72, 81, 64, 70,
       68], dtype=int64)
In [ ]: len(diabetes_data.age.unique())
Out[ ]: 52
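The steps of this experiment (selecting and deleting by condition, sorting, ranking, counting and uniqueness) can be tried on a frame small enough to check by hand. The sketch below assumes made-up records; the column names `age`, `glucose` and `outcome` and the new `age_rank` column are illustrative and do not come from the lab's CSV.

```python
import pandas as pd

# Made-up records; column names are illustrative, not the lab's CSV
df = pd.DataFrame({
    'age':     [50, 31, 32, 21, 33, 21],
    'glucose': [148, 85, 183, 89, 137, 84],
    'outcome': ['yes', 'no', 'yes', 'no', 'yes', 'no'],
})

# Select rows by condition and drop a column
older = df[df['age'] > 30]
without_glucose = df.drop(columns=['glucose'])

# Sort by a column, then rank it (ties share the average rank)
by_age = df.sort_values('age')
df['age_rank'] = df['age'].rank()

# Count and uniqueness of column values
counts = df['outcome'].value_counts()
n_unique_ages = df['age'].nunique()

print(older.shape, list(without_glucose.columns))
print(by_age['age'].tolist())
print(counts.to_dict(), n_unique_ages)
```

`nunique()` gives the same number as `len(df['age'].unique())`, the form used in the cell above.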
: Experiment 9:
Import any CSV file to Pandas DataFrame and perform the following:
Out[ ]:    R&D Spend  Administration  Marketing Spend       State     Profit
0       165349.20       136897.80        471784.10    New York  192261.83
1       162597.70       151377.59        443898.53  California  191792.06
2       153441.51       101145.55        407934.54     Florida  191050.39
3       144372.41       118671.85        383199.62    New York  182901.99
4       142107.34        91391.77        366168.42     Florida  166187.94
5       131876.90        99814.71        362861.36    New York  156991.12
6       134615.46       147198.87        127716.82  California  156122.51
7       130298.13       145530.06        323876.68     Florida  155752.60
8       120542.52       148718.95        311613.29    New York  152211.77
9       123334.88       108679.17        304981.62  California  149759.96
In [ ]: df.shape
Out[ ]: (50, 5)
In [ ]: df.describe()
Out[ ]:        R&D Spend  Administration  Marketing Spend         Profit
count      50.000000       50.000000        50.000000      50.000000
mean    73721.615600   121344.639600    211025.097800  112012.639200
std     45902.256482    28017.802755    122290.310726   40306.180338
min         0.000000    51283.140000         0.000000   14681.400000
25%     39936.370000   103730.875000    129300.132500   90138.902500
50%     73051.080000   122699.795000    212716.240000  107978.190000
75%    101602.800000   144842.180000    299469.085000  139765.977500
max    165349.200000   182645.560000    471784.100000  192261.830000
In [ ]: rd_spend
Outliers Detection
-> The data points which fall below mean - 3(sigma) or above mean + 3(sigma) are outliers.
-> The data points which fall below Q1 - 1.5 IQR or above Q3 + 1.5 IQR are outliers,
where Q1 and Q3 are the 25th and 75th percentiles of the dataset respectively, and IQR is the inter-quartile range, given by Q3 - Q1.
Z-score treatment:
Assumption - The features are normally or approximately normally distributed.
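Both rules above can be written as boolean masks over a NumPy array. The following is a minimal sketch on a made-up sample (19 values near 50 plus one value of 300); the variable names are illustrative and the lab's own outlier cells are not shown in the source.

```python
import numpy as np

# Made-up sample with one obvious outlier (the value 300)
values = np.array([48, 50, 52, 49, 51, 50, 47, 53, 50, 49,
                   51, 52, 48, 50, 49, 51, 50, 52, 48, 300], dtype=float)

# Rule 1: z-score, flag points beyond mean +/- 3 * sigma
mean, sigma = values.mean(), values.std()
z_mask = (values < mean - 3 * sigma) | (values > mean + 3 * sigma)

# Rule 2: IQR, flag points beyond Q1 - 1.5 * IQR or Q3 + 1.5 * IQR
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

print(values[z_mask])
print(values[iqr_mask])

# Filtering: keep only the points that pass the IQR rule
filtered = values[~iqr_mask]
print(filtered.size)
```

The z-score rule depends on the mean and standard deviation, which the outlier itself inflates, so on small samples the IQR rule is often the more robust of the two.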
Plotting Graph
In [ ]: plt.figure(figsize=(16,10))
plt.subplot(2,2,1)
sns.distplot(df["R&D Spend"])
plt.subplot(2,2,2)
sns.distplot(df["Administration"])
plt.subplot(2,2,3)
sns.distplot(df["Marketing Spend"])
plt.subplot(2,2,4)
sns.distplot(df["Profit"])
plt.show()
[Figure: 2x2 grid of distribution plots for R&D Spend, Administration, Marketing Spend and Profit]
In [ ]: sns.boxplot(df["Administration"])
Out[ ]: <AxesSubplot: >
[Figure: box plot of Administration; axis ticks from 60000 to 180000]
file:Jt1F:17th ,,
sel1\/Departmental Lab/ml_lab/Abhlnav_Chaturvedl-0901CS191003/lab.html
0        WALTER WHITE
1       JESSE PINKMAN
2        SKYLER WHITE
3        HANK SHRADER
4    MIKE EHRMANTRAUT
5           GUS FRING
dtype: object
In [ ]: names.str.len()
Out[ ]: 0    12
1    13
2    12
3    12
4    16
5     9
dtype: int64
In [ ]: names.str.startswith('W')
Out[ ]: 0     True
1    False
2    False
3    False
4    False
5    False
dtype: bool
Vectorized indexing and slicing
In [ ]: names.str[0]
Out[ ]: 0    W
1    J
2    S
3    H
4    M
5    G
dtype: object
In [ ]: names.str.slice(0,2)
Out[ ]: 0    Wa
1    Je
2    Sk
3    Ha
4    Mi
5    Gu
dtype: object
In [ ]: names.str.split().str[0]
Out[ ]: 0    Walter
1     Jesse
2    Skyler
3      Hank
4      Mike
5       Gus
dtype: object
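The cell that created `names` is not shown in the source. The sketch below rebuilds a Series that is consistent with the outputs above (an assumption) and gathers the vectorized `.str` operations in one place; `has_white` is an extra illustration of using a string test as a filter.

```python
import pandas as pd

# Names reconstructed from the printed outputs, so the sketch is self-contained
names = pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White',
                   'Hank Shrader', 'Mike Ehrmantraut', 'Gus Fring'])

upper = names.str.upper()                 # element-wise upper-casing
lengths = names.str.len()                 # length of each string
first_names = names.str.split().str[0]    # split on whitespace, take the first token
has_white = names.str.contains('White')   # boolean mask, usable for filtering

print(upper.tolist())
print(lengths.tolist())
print(first_names.tolist())
print(names[has_white].tolist())
```

Note that `names.str.split()` on its own returns a list of tokens per row; the trailing `.str[0]` is what reduces each list to the first name.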
: Experiment 10:
Use the scikit-learn package in Python to implement the following machine learning models to solve real world problems using open source datasets
linear_df.head(10)
   YearsExperience   Salary
0              1.1  39343.0
1              1.3  46205.0
2              1.5  37731.0
3              2.0  43525.0
4              2.2  39891.0
5              2.9  56642.0
6              3.0  60150.0
7              3.2  54445.0
8              3.2  64445.0
9              3.7  57189.0
x = linear_df[['YearsExperience']]
x
    YearsExperience
0               1.1
1               1.3
2               1.5
3               2.0
4               2.2
5               2.9
6               3.0
7               3.2
8               3.2
9               3.7
10              3.9
11              4.0
12              4.0
13              4.1
14              4.5
15              4.9
16              5.1
17              5.3
18              5.9
19              6.0
20              6.8
21              7.1
22              7.9
23              8.2
24              8.7
25              9.0
26              9.5
27              9.6
28             10.3
29             10.5
In [ ]: y = linear_df.iloc[:, 1].values
y
Out[ ]: array([ 39343.,  46205.,  37731.,  43525.,  39891.,  56642.,  60150.,
        54445.,  64445.,  57189.,  63218.,  55794.,  56957.,  57081.,
        61111.,  67938.,  66029.,  83088.,  81363.,  93940.,  91738.,
        98273., 101302., 113812., 109431., 105582., 116969., 112635.,
       122391., 121872.])
In [ ]: plt.scatter(x, y)
plt.show()
[Figure: scatter plot of Salary against YearsExperience]
In [ ]: model.coef_
Out[ ]: array([9449.96232146])
In [ ]: model.intercept_
Out[ ]: 25792.20019866871
In [ ]: model.predict([[4]])
Out[ ]: array([63592.04948449])
Out[ ]: 95.69566641435085
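The cells that created and fitted `model` are not legible in the source. A minimal sketch of the usual scikit-learn pattern follows, assuming synthetic data on an exact line so the learned slope and intercept can be checked by eye; it does not use the salary CSV.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data on an exact line y = 3x + 5 (illustrative, not the salary CSV)
x = np.arange(1, 11, dtype=float).reshape(-1, 1)   # shape (10, 1): one feature
y = 3.0 * x.ravel() + 5.0

model = LinearRegression()
model.fit(x, y)

print(model.coef_, model.intercept_)   # slope and intercept learned from the data
print(model.predict([[4]]))            # prediction for a single new sample
print(model.score(x, y))               # R^2 on the data used for fitting
```

A prediction is the intercept plus the coefficient times the input, which is how the value for 4 years of experience above follows from the printed `coef_` and `intercept_`. The last output above (about 95.7) may be such a score expressed as a percentage, but the call that produced it is not shown.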
In [ ]: y.head()
Out[ ]: 0    192261.83
1    191792.06
2    191050.39
3    182901.99
4    166187.94
Name: Profit, dtype: float64
In [ ]: x.State.value_counts()
New York      17
California    17
Florida       16
Name: State, dtype: int64
In [ ]: one_hot_states.head()
Out[ ]:    California  Florida  New York
0           0        0         1
1           1        0         0
2           0        1         0
3           0        0         1
4           0        1         0
In [ ]: x.drop(["State"], axis = 1, inplace = True)
In [ ]: x.head(5)
Out[ ]: R&D Spend Administration Marketing Spend California Florida New York
Out[ ]: LinearRegression()
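The cell that built `one_hot_states` is not shown in the source. One common way to obtain such indicator columns is `pd.get_dummies`; the sketch below assumes that approach and uses a three-row frame with the same State categories, so it is an illustration and not the lab's exact code.

```python
import pandas as pd

# Three rows with the same State categories as the lab data
x = pd.DataFrame({
    'R&D Spend': [165349.20, 162597.70, 153441.51],
    'State':     ['New York', 'California', 'Florida'],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
one_hot_states = pd.get_dummies(x['State'], dtype=int)

# Replace the text column with its indicator columns
x = pd.concat([x.drop(columns=['State']), one_hot_states], axis=1)
print(x)
```

Text categories have to be converted to numbers like this before a model such as `LinearRegression` can be fitted on them.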
In [ ]: diabetes_data.head()
Out[ ]:    num_preg  glucose  bp  thickness  insulin   bmi  diab_pred  age    skin  diabetes
0         6      148  72         35        0  33.6      0.627   50  1.3790         1
1         1       85  66         29        0  26.6      0.351   31  1.1426         0
2         8      183  64          0        0  23.3      0.672   32  0.0000         1
3         1       89  66         23       94  28.1      0.167   21  0.9062         0
4         0      137  40         35      168  43.1      2.288   33  1.3790         1
In [ ]: diabetes_data.columns
Out[ ]: Index(['num_preg', 'glucose', 'bp', 'thickness', 'insulin', 'bmi', 'diab_pred',
       'age', 'skin', 'diabetes'],
      dtype='object')
In [ ]: diabetes_data.describe()
Out[ ]:   num_preg     glucose          bp   thickness     insulin         bmi   diab_pred         age        skin    diabetes
count  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000
50%      3.000000  117.000000   72.000000   23.000000   30.500000   32.000000    0.372500   29.000000    0.906200    0.000000
75%      6.000000  140.250000   80.000000   32.000000  127.250000   36.600000    0.626250   41.000000    1.260800    1.000000
max     17.000000  199.000000  122.000000   99.000000  846.000000   67.100000    2.420000   81.000000    3.900600    1.000000
In [ ]: diabetes_data.isnull().sum()
Out[ ]: num_preg     0
glucose      0
bp           0
thickness    0
insulin      0
bmi          0
diab_pred    0
age          0
skin         0
diabetes     0
dtype: int64
In [ ]: plt.figure(figsize=(12,10))
# seaborn has an easy method to showcase a heatmap
p = sns.heatmap(diabetes_data.corr(), annot=True, cmap='RdYlGn')
[Figure: annotated correlation heatmap of the diabetes_data columns]
Data Visualization
In [ ]: p = diabetes_data.hist(figsize = (20,20))
[Figure: grid of histograms, one per column of diabetes_data]
In [ ]: diabetes_data_copy.head()
Out[ ]:    num_preg  glucose    bp  thickness  insulin   bmi  diab_pred  age    skin  diabetes
0         6    148.0  72.0       35.0    125.0  33.6      0.627   50  1.3790         1
1         1     85.0  66.0       29.0    125.0  26.6      0.351   31  1.1426         0
2         8    183.0  64.0       29.0    125.0  23.3      0.672   32  0.0000         1
3         1     89.0  66.0       23.0     94.0  28.1      0.167   21  0.9062         0
4         0    137.0  40.0       35.0    168.0  43.1      2.288   33  1.3790         1
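In the copy shown above, rows that held 0 for insulin and thickness now hold 125.0 and 29.0, which suggests the zeros were treated as missing values and filled with a per-column statistic. The cell that did this is not in the source, so the following is only a sketch under that assumption, using a median fill on made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up frame where 0 stands for "not recorded"
df = pd.DataFrame({'thickness': [35, 29, 0, 23, 35],
                   'insulin':   [0, 0, 0, 94, 168]})

df_copy = df.copy(deep=True)
cols = ['thickness', 'insulin']

# Step 1: mark zeros as missing so pandas can detect them
df_copy[cols] = df_copy[cols].replace(0, np.nan)
print(df_copy.isnull().sum())

# Step 2: fill each column's gaps with that column's median
df_copy[cols] = df_copy[cols].fillna(df_copy[cols].median())
print(df_copy)
```

Working on a copy keeps the original frame available for comparison, and `dropna()` is the alternative when rows with gaps should be discarded instead of filled.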
Standard scaling
In [ ]: y = diabetes_data_copy.diabetes
y
Out[ ]: 0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: diabetes, Length: 768, dtype: int64
Splitting the dataset
In [ ]: X = diabetes_data.drop('diabetes', axis=1)
y = diabetes_data['diabetes']
#The split call is not legible in the source; test_size=0.33 is an assumption consistent with the 254 test rows reported below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
Out[ ]: DecisionTreeClassifier()
In [ ]: predictions = dtree.predict(X_test)
print("Accuracy Score =", format(metrics.accuracy_score(y_test, predictions)))
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
[[130  32]
 [ 42  50]]
precision recall f1-score support
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)
[[133  29]
 [ 33  59]]
precision recall f1-score support
SVM Model
svc_model = SVC()
svc_model.fit(X_train, y_train)
Out[ ]: SVC()
In [ ]: svc_pred = svc_model.predict(X_test)
[[146  16]
 [ 47  45]]
precision recall f1-score support
0 0.76 0.90 0.82 162
1 0.74 0.49 0.59 92
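The classifier cells above depend on imports, a train/test split and a scaling step that are not legible in the source. The sketch below shows one way the pieces usually fit together in scikit-learn. It runs on synthetic data from `make_classification` (a stand-in for the diabetes CSV), and the `random_state` values are assumptions added only for repeatability.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (stand-in for the diabetes CSV)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out about a third of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# Standard scaling: fit on the training rows only, then apply to both splits
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

for name, clf in [('tree', DecisionTreeClassifier(random_state=0)), ('svm', SVC())]:
    clf.fit(X_train_s, y_train)
    pred = clf.predict(X_test_s)
    print(name, accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))   # rows: true class, columns: predicted class
    print(classification_report(y_test, pred))
```

In each confusion matrix the diagonal holds the correctly classified test rows, so accuracy is the diagonal sum divided by the total, and the per-class recall in the report is each diagonal entry divided by its row sum.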
In [ ]: customers_df = pd.read_csv('Mall_Customers.csv')
customers_df.head()
Out[ ]:    CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
In [ ]: customers_df.corr()
In [ ]: #Distribution of Annual Income
plt.figure(figsize=(10, 6))
sns.set(style = 'whitegrid')
sns.distplot(customers_df['Annual Income (k$)'])
plt.title('Distribution of Annual Income (k$)', fontsize = 20)
plt.xlabel('Range of Annual Income (k$)')
plt.ylabel('Count')
[Figure: distribution of Annual Income (k$); x-axis: Range of Annual Income (k$), y-axis: Count]
In [ ]: #Distribution of Age
plt.figure(figsize=(10, 6))
sns.set(style = 'whitegrid')
sns.distplot(customers_df['Age'])
plt.title('Distribution of Age', fontsize = 20)
plt.xlabel('Range of Age')
plt.ylabel('Count')
[Figure: distribution of Age; x-axis: Range of Age, y-axis: Count]
[Figure: distribution of Spending Score (1-100); x-axis: Range of Spending Score (1-100), y-axis: Count]
In [ ]: genders = customers_df.Gender.value_counts()
sns.set_style("darkgrid")
plt.figure(figsize=(10,4))
sns.barplot(x=genders.index, y=genders.values)
plt.show()
[Figure: bar plot of customer counts by Gender (Female, Male)]
In [ ]: #We take just the Annual Income and Spending score
df1 = customers_df[["CustomerID","Gender","Age","Annual Income (k$)","Spending Score (1-100)"]]
X = df1[["Annual Income (k$)","Spending Score (1-100)"]]
In [ ]: X.head()
Out[ ]:    Annual Income (k$)  Spending Score (1-100)
0                  15                      39
1                  15                      81
2                  16                       6
3                  16                      77
4                  17                      40
[Figure: scatter plot of the customers; x-axis: Annual Income (k$)]
In [ ]: # Within Cluster Sum of Squared Errors (WCSS) for different values of k
wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters=i)
    km.fit(X)
    wcss.append(km.inertia_)
[Figure: elbow plot of WCSS against K Value, for K from 1 to 10]
In the graph, the drop in WCSS after K = 5 is minimal, so we take 5 to be the number of clusters.
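The elbow reading can be reproduced on data where the right answer is known in advance. The sketch below assumes three synthetic, well-separated blobs in place of the mall-customer columns; `n_init` and `random_state` are set only for repeatability, and `drops` is an illustrative helper list.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs (stand-in for the mall-customer columns)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# WCSS (inertia) for k = 1..6
wcss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)

# Decrease in WCSS from each k to the next; it shrinks sharply after the true cluster count
drops = [wcss[i] - wcss[i + 1] for i in range(len(wcss) - 1)]
print([round(w, 1) for w in wcss])
print([round(d, 1) for d in drops])
```

Here the decrease from k = 2 to k = 3 is large and the decrease from k = 3 to k = 4 is small, which is the same pattern that points to 5 clusters in the lab's plot.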
In [ ]: #Taking 5 clusters
km1 = KMeans(n_clusters=5)
#Fitting the input data
km1.fit(X)
#Predicting the labels of the input data
y = km1.predict(X)
#Adding the labels to a column named label
df1["label"] = y
#The new dataframe with the clustering done
df1.head(10)
Out[ ]:    CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)  label
0           1    Male   19                  15                      39      0
1           2    Male   21                  15                      81      3
2           3  Female   20                  16                       6      0
3           4  Female   23                  16                      77      3
4           5  Female   31                  17                      40      0
5           6  Female   22                  17                      76      3
6           7  Female   35                  18                       6      0
7           8  Female   23                  18                      94      3
8           9    Male   64                  19                       3      0
9          10  Female   30                  19                      72      3
[Figure: scatter plot of the clustered customers, colored by label (0 to 4)]
[Figure: elbow plot of WCSS against K Value, for K from 2 to 10]
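In [ ]: km2 = KMeans(n_clusters=5)
#km2 is the intended model; the recorded run called km.fit_predict, reusing the last model from the elbow loop, so the labels printed below are not limited to 0-4
y2 = km2.fit_predict(X2)
df2["label"] = y2
#The data with labels
df2.head()
Out[ ]:    CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)  label
0           1    Male   19                  15                      39      1
1           2    Male   21                  15                      81      5
2           3  Female   20                  16                       6      1
3           4  Female   23                  16                      77      5
4           5  Female   31                  17                      40      1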
In [ ]: #3D Plot as we did the clustering on the basis of 3 input features
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df2.Age[df2.label == 0], df2["Annual Income (k$)"][df2.label == 0],
           df2["Spending Score (1-100)"][df2.label == 0], c='purple', s=60)
ax.scatter(df2.Age[df2.label == 1], df2["Annual Income (k$)"][df2.label == 1],
           df2["Spending Score (1-100)"][df2.label == 1], c='red', s=60)
ax.scatter(df2.Age[df2.label == 2], df2["Annual Income (k$)"][df2.label == 2],
           df2["Spending Score (1-100)"][df2.label == 2], c='blue', s=60)
ax.scatter(df2.Age[df2.label == 3], df2["Annual Income (k$)"][df2.label == 3],
           df2["Spending Score (1-100)"][df2.label == 3], c='green', s=60)
ax.scatter(df2.Age[df2.label == 4], df2["Annual Income (k$)"][df2.label == 4],
           df2["Spending Score (1-100)"][df2.label == 4], c='yellow', s=60)
ax.view_init(35, 185)
plt.xlabel("Age")
plt.ylabel("Annual Income (k$)")
ax.set_zlabel('Spending Score (1-100)')
plt.show()
[Figure: 3D scatter plot of the clusters; axes: Age, Annual Income (k$), Spending Score (1-100)]
cust1 = df2[df2["label"]==1]
print('Number of customer in 1st group=', len(cust1))
print('They are -', cust1["CustomerID"].values)
print("--------------------------------------------")
cust2 = df2[df2["label"]==2]
print('Number of customer in 2nd group=', len(cust2))
print('They are -', cust2["CustomerID"].values)
print("--------------------------------------------")
cust3 = df2[df2["label"]==0]
print('Number of customer in 3rd group=', len(cust3))
print('They are -', cust3["CustomerID"].values)
print("--------------------------------------------")
cust4 = df2[df2["label"]==3]
print('Number of customer in 4th group=', len(cust4))
print('They are -', cust4["CustomerID"].values)
print("--------------------------------------------")
cust5 = df2[df2["label"]==4]
print('Number of customer in 5th group=', len(cust5))
print('They are -', cust5["CustomerID"].values)
print("--------------------------------------------")
Number of customer in 1st group= 12
They are - [ 1 3 5 17 21 27 29 39 43 45 49 50]