45B AIML Prac1.3
3 Roll No 45/B
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often
employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier
for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
   col1  col2  col3
0     1   444   abc
1     2   555   def
2     4   666  aghi
3     5   444   xyz
4     6   777   ghj
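The cell that created ahmed_df1 does not appear in this printout. The following is a minimal sketch of how such a frame could be built, assuming a plain dictionary of lists as input. The values are copied from the output above, and the column names are the ones that info() reports further down.

```python
import pandas as pd

# Assumed construction: the original cell is not shown in the printout.
ahmed_df1 = pd.DataFrame({
    "col1": [1, 2, 4, 5, 6],
    "col2": [444, 555, 666, 444, 777],
    "col3": ["abc", "def", "aghi", "xyz", "ghj"],
})

print(ahmed_df1)
```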
The head() method returns the first five rows by default. Here we pass a number to customize how many rows are returned.
ahmed_df1.head(2)
   col1  col2 col3
0     1   444  abc
1     2   555  def
ahmed_df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col1 5 non-null int64
1 col2 5 non-null int64
2 col3 5 non-null object
dtypes: int64(2), object(1)
memory usage: 248.0+ bytes
The unique() method returns the unique values of a series, while nunique() returns the number of unique values in a dataframe or a series.
ahmed_df1['col2'].unique()
For a dataframe, nunique(), by default, returns the results by column. Otherwise, passing in axis=1 or axis='columns' will give the results by
row.
ahmed_df1['col2'].nunique()
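To make the difference between unique() and nunique(), and the effect of the axis argument, easier to see, here is a small sketch. The frame df is a hypothetical stand-in that reuses the col1 and col2 values from this notebook.

```python
import pandas as pd

# Hypothetical stand-in frame with the notebook's col1/col2 values.
df = pd.DataFrame({
    "col1": [1, 2, 4, 5, 6],
    "col2": [444, 555, 666, 444, 777],
})

values = df["col2"].unique()      # array of distinct values, in order of first appearance
n_series = df["col2"].nunique()   # number of distinct values in the series: 4
per_column = df.nunique()         # one count per column (default axis=0)
per_row = df.nunique(axis=1)      # one count per row

print(values)
print(n_series)
print(per_column)
print(per_row)
```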
The value_counts() method returns the count of each unique value in a series. By default, the counts are not normalized, they are sorted in descending order, and null values are excluded.
ahmed_df1['col2'].value_counts()
https://colab.research.google.com/drive/1Lfim_GIVYiuKA_HcLE10P9EAXBNIbWTo#scrollTo=LDgk0XnNtKCn&printMode=true 1/10
2/11/24, 2:37 AM 45_AIML_Practical1.3_EDA.ipynb - Colaboratory
444 2
555 1
666 1
777 1
Name: col2, dtype: int64
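The defaults of value_counts() can each be overridden. The sketch below shows the normalize and dropna parameters on a hypothetical series that reuses the col2 values with one missing value appended for illustration. The missing value is an assumption and is not part of the original data.

```python
import numpy as np
import pandas as pd

# col2 values plus one NaN added purely for illustration.
s = pd.Series([444, 555, 666, 444, 777, np.nan], name="col2")

counts = s.value_counts()                  # absolute counts, NaN excluded, descending
shares = s.value_counts(normalize=True)    # fractions of the non-null values
with_nulls = s.value_counts(dropna=False)  # NaN gets its own row

print(counts)
print(shares)
print(with_nulls)
```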
The head() method returns the first five rows by default. Here we pass a number to customize how many rows are returned.
print(ahmed_df1.head(2))
The tail() method returns the last five rows by default. Here we pass a number to customize how many rows are returned.
print(ahmed_df1.tail(2))
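As a small illustration of how the row count passed to head() and tail() behaves, the sketch below uses a hypothetical one-column frame df. It also shows that a negative number selects everything except that many rows from the opposite end.

```python
import pandas as pd

# Hypothetical stand-in frame; only the row positions matter here.
df = pd.DataFrame({"col1": [1, 2, 4, 5, 6]})

first_two = df.head(2)       # rows at positions 0 and 1
last_two = df.tail(2)        # rows at positions 3 and 4
all_but_last = df.head(-1)   # negative n: every row except the last one

print(first_two, last_two, all_but_last, sep="\n\n")
```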
Applying Functions
ahmed_df1['col3'].apply(len)
0 3
1 3
2 4
3 3
4 3
Name: col3, dtype: int64
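apply() accepts any callable, not only built-ins such as len. The sketch below, on a hypothetical series holding the col3 values, shows a lambda and the equivalent string-accessor call for the length case.

```python
import pandas as pd

# Stand-in series with the notebook's col3 values.
s = pd.Series(["abc", "def", "aghi", "xyz", "ghj"], name="col3")

lengths = s.apply(len)                  # built-in function applied to each element
upper = s.apply(lambda x: x.upper())    # any callable works, including a lambda
lengths_vectorized = s.str.len()        # string accessor giving the same lengths

print(lengths)
print(upper)
```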
The sum() method adds the values in a column and returns the total. Called on a dataframe, it returns the sum of each column.
ahmed_df1['col1'].sum()
18
ahmed_df1['col1'].median()
4.0
ahmed_df1['col1'].mean()
3.6
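The three statistics above can also be computed in one call. A minimal sketch using agg() on a stand-in series with the col1 values follows; the name summary is introduced here only for illustration.

```python
import pandas as pd

# Stand-in series with the notebook's col1 values.
col1 = pd.Series([1, 2, 4, 5, 6], name="col1")

# agg() with a list of function names returns one value per name.
summary = col1.agg(["sum", "median", "mean"])

print(summary)
```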
del ahmed_df1['col1']
ahmed_df1
col2 col3
0 444 abc
1 555 def
2 666 aghi
3 444 xyz
4 777 ghj
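The del statement removes the column from the frame in place. When the original frame should stay intact, drop() is one alternative, and pop() is another option when the removed column is still needed. A short sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical two-row frame, used only to show the three behaviours.
df = pd.DataFrame({"col1": [1, 2], "col2": [444, 555], "col3": ["abc", "def"]})

trimmed = df.drop(columns=["col1"])    # returns a new frame; df still has col1
columns_after_drop = list(df.columns)

popped = df.pop("col2")                # removes col2 in place and returns it
columns_after_pop = list(df.columns)

print(trimmed)
print(columns_after_drop, columns_after_pop)
```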
ahmed_df1.columns
ahmed_df1.index
ahmed_df1
col2 col3
0 444 abc
1 555 def
2 666 aghi
3 444 xyz
4 777 ghj
ahmed_df1.sort_values(by='col3')
col2 col3
0 444 abc
2 666 aghi
1 555 def
4 777 ghj
3 444 xyz
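sort_values() returns a sorted copy and keeps the original index labels, which is why the index above reads 0, 2, 1, 4, 3. The sketch below, on a stand-in frame with the same col2 and col3 values, shows descending order and sorting by more than one key.

```python
import pandas as pd

# Stand-in frame with the notebook's col2/col3 values.
df = pd.DataFrame({
    "col2": [444, 555, 666, 444, 777],
    "col3": ["abc", "def", "aghi", "xyz", "ghj"],
})

by_col2_desc = df.sort_values(by="col2", ascending=False)

# col2 ascending, ties broken by col3 descending.
by_two_keys = df.sort_values(by=["col2", "col3"], ascending=[True, False])

print(by_col2_desc)
print(by_two_keys)
```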
Data Skewness
ahmed_df1.describe()
col2
count 5.000000
mean 577.200000
std 144.726293
min 444.000000
25% 444.000000
50% 555.000000
75% 666.000000
max 777.000000
ahmed_df1.describe().transpose()
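describe() does not report skewness itself. One way to measure it, sketched here under the assumption that the sample skewness from Series.skew() is what this section is after, is shown on a stand-in series with the col2 values. A mean above the median is a common informal sign of right skew.

```python
import pandas as pd

# Stand-in series with the notebook's col2 values.
col2 = pd.Series([444, 555, 666, 444, 777], name="col2")

skewness = col2.skew()                            # positive means a longer right tail
mean_minus_median = col2.mean() - col2.median()   # 577.2 - 555.0

print(skewness)
print(mean_minus_median)
```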
ahmed_df1.min()
col2 444
col3 abc
dtype: object
ahmed_df1.max()
col2 777
col3 xyz
dtype: object
   Customer ID  Gender  Age  Married  Number of Dependents          City  Zip Code   Latitude    Longitude
0   0002-ORFBO  Female   37      Yes                     0  Frazier Park     93225  34.827662  -118.999073
1   0003-MKNFE    Male   46       No                     0      Glendale     91206  34.162515  -118.203869
2   0004-TLHLJ    Male   50       No                     0    Costa Mesa     92627  33.645672  -117.922613
3   0011-IGKFF    Male   78      Yes                     0      Martinez     94553  38.014457  -122.115432
4   0013-EXCHZ  Female   75      Yes                     0     Camarillo     93010  34.227846  -119.079903
5 rows × 38 columns
ahmed_df.head(3)
   Customer ID  Gender  Age  Married  Number of Dependents          City  Zip Code   Latitude    Longitude
0   0002-ORFBO  Female   37      Yes                     0  Frazier Park     93225  34.827662  -118.999073
1   0003-MKNFE    Male   46       No                     0      Glendale     91206  34.162515  -118.203869
2   0004-TLHLJ    Male   50       No                     0    Costa Mesa     92627  33.645672  -117.922613
3 rows × 38 columns
ahmed_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 38 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Customer ID 7043 non-null object
1 Gender 7043 non-null object
2 Age 7043 non-null int64
3 Married 7043 non-null object
4 Number of Dependents 7043 non-null int64
5 City 7043 non-null object
6 Zip Code 7043 non-null int64
7 Latitude 7043 non-null float64
8 Longitude 7043 non-null float64
9 Number of Referrals 7043 non-null int64
10 Tenure in Months 7043 non-null int64
11 Offer 7043 non-null object
12 Phone Service 7043 non-null object
13 Avg Monthly Long Distance Charges 6361 non-null float64
14 Multiple Lines 6361 non-null object
15 Internet Service 7043 non-null object
16 Internet Type 5517 non-null object
17 Avg Monthly GB Download 5517 non-null float64
18 Online Security 5517 non-null object
19 Online Backup 5517 non-null object
20 Device Protection Plan 5517 non-null object
21 Premium Tech Support 5517 non-null object
22 Streaming TV 5517 non-null object
23 Streaming Movies 5517 non-null object
24 Streaming Music 5517 non-null object
25 Unlimited Data 5517 non-null object
26 Contract 7043 non-null object
27 Paperless Billing 7043 non-null object
28 Payment Method 7043 non-null object
29 Monthly Charge 7043 non-null float64
30 Total Charges 7043 non-null float64
31 Total Refunds 7043 non-null float64
32 Total Extra Data Charges 7043 non-null int64
33 Total Long Distance Charges 7043 non-null float64
34 Total Revenue 7043 non-null float64
35 Customer Status 7043 non-null object
36 Churn Category 1869 non-null object
37 Churn Reason 1869 non-null object
dtypes: float64(9), int64(6), object(23)
memory usage: 2.0+ MB
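The non-null counts above imply missing values in several columns; for example, Churn Category has 7043 - 1869 = 5174 missing entries. The sketch below shows how those counts could be computed directly with isna(). It runs on a small toy frame whose column names are borrowed from the output above; the values in it are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame: column names from the info() output, values invented for illustration.
toy = pd.DataFrame({
    "Customer Status": ["Stayed", "Stayed", "Churned", "Churned"],
    "Churn Category": [np.nan, np.nan, "Competitor", "Dissatisfaction"],
    "Internet Type": ["Cable", np.nan, "Fiber Optic", "Fiber Optic"],
})

missing = toy.isna().sum()         # number of missing values per column
missing_share = toy.isna().mean()  # fraction of missing values per column

print(missing)
print(missing_share)
```

On the real frame, the same two calls on ahmed_df would give the per-column counts without subtracting by hand.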
ahmed_df['Total Revenue'].unique()
ahmed_df['Total Revenue'].nunique()
6975
ahmed_df['Total Revenue'].value_counts()
24.80 3
116.27 3
68.41 3
66.56 3
3386.40 2
..
976.70 1
300.65 1
3258.42 1
1713.52 1
3707.60 1
Name: Total Revenue, Length: 6975, dtype: int64
print(ahmed_df.head(5))
Monthly Charge Total Charges Total Refunds Total Extra Data Charges \
0 65.6 593.30 0.00 0
1 -4.0 542.40 38.33 10
2 73.9 280.85 0.00 0
3 98.0 1237.85 0.00 0
4 83.9 267.40 0.00 0
Total Long Distance Charges Total Revenue Customer Status Churn Category \
0 381.51 974.81 Stayed NaN
1 96.21 610.28 Stayed NaN
2 134.60 415.45 Churned Competitor
3 361.66 1599.51 Churned Dissatisfaction
4 22.14 289.54 Churned Dissatisfaction
Churn Reason
0 NaN
1 NaN
2 Competitor had better devices
3 Product dissatisfaction
4 Network reliability
[5 rows x 38 columns]
print(ahmed_df.tail(5))
7042 9995-HOTOH Male 36 Yes 0 Sierra City
Total Extra Data Charges Total Long Distance Charges Total Revenue \
7038 0 606.84 1349.74
7039 0 356.40 2230.10
7040 0 37.24 129.99
7041 0 142.04 4769.69
7042 0 0.00 3707.60
[5 rows x 38 columns]
0 10
1 10
2 10
3 10
4 10
..
7038 10
7039 10
7040 10
7041 10
7042 10
Name: Customer ID, Length: 7043, dtype: int64
ahmed_df['Total Revenue'].sum()
21371131.69
ahmed_df['Total Revenue'].median()
2108.64
ahmed_df['Total Revenue'].mean()
3034.3790558000856
ahmed_df
      Customer ID  Gender  Age  Married  Number of Dependents          City  Zip Code   Latitude  Longitude
0      0002-ORFBO  Female   37      Yes                     0  Frazier Park     93225  34.827662  -118.9990
1      0003-MKNFE    Male   46       No                     0      Glendale     91206  34.162515  -118.2038
2      0004-TLHLJ    Male   50       No                     0    Costa Mesa     92627  33.645672  -117.9226
3      0011-IGKFF    Male   78      Yes                     0      Martinez     94553  38.014457  -122.1154
4      0013-EXCHZ  Female   75      Yes                     0     Camarillo     93010  34.227846  -119.0799
7038   9987-LUTYD  Female   20       No                     0       La Mesa     91941  32.759327  -116.9972
7039   9992-RRAMN    Male   40      Yes                     0     Riverbank     95367  37.734971  -120.9542
7040   9992-UJOEL    Male   22       No                     0           Elk     95432  39.108252  -123.6451
7041   9993-LHIEB    Male   21      Yes                     0  Solana Beach     92075  33.001813  -117.2636
7042   9995-HOTOH    Male   36      Yes                     0   Sierra City     96125  39.600599  -120.6363
ahmed_df.index
ahmed_df
      Customer ID  Gender  Age  Married  Number of Dependents          City  Zip Code   Latitude  Longitude
0      0002-ORFBO  Female   37      Yes                     0  Frazier Park     93225  34.827662  -118.9990
1      0003-MKNFE    Male   46       No                     0      Glendale     91206  34.162515  -118.2038
2      0004-TLHLJ    Male   50       No                     0    Costa Mesa     92627  33.645672  -117.9226
3      0011-IGKFF    Male   78      Yes                     0      Martinez     94553  38.014457  -122.1154
4      0013-EXCHZ  Female   75      Yes                     0     Camarillo     93010  34.227846  -119.0799
7038   9987-LUTYD  Female   20       No                     0       La Mesa     91941  32.759327  -116.9972
7039   9992-RRAMN    Male   40      Yes                     0     Riverbank     95367  37.734971  -120.9542
7040   9992-UJOEL    Male   22       No                     0           Elk     95432  39.108252  -123.6451
7041   9993-LHIEB    Male   21      Yes                     0  Solana Beach     92075  33.001813  -117.2636
7042   9995-HOTOH    Male   36      Yes                     0   Sierra City     96125  39.600599  -120.6363
ahmed_df.sort_values(by='Total Charges')
      Customer ID  Gender  Age  Married  Number of Dependents           City  Zip Code   Latitude  Longitude
2060   2967-MXRAV    Male   29      Yes                     0    Los Angeles     90003  33.964131    -118.27
6560   9318-NKNFC    Male   53       No                     0          Twain     95984  40.022184    -121.06
6350   8992-CEUEN  Female   59       No                     0         Arnold     95223  38.321530    -120.23
7033   9975-SKRNR    Male   24       No                     0    Sierraville     96126  39.559709    -120.34
981    1423-BMPBQ  Female   62      Yes                     1        Anaheim     92808  33.850452    -117.72
6275   8879-XUAHX    Male   42      Yes                     0         Irvine     92614  33.680302    -117.83
6892   9788-HNGUT    Male   45      Yes                     0        Cabazon     92230  33.929812    -116.76
6855   9739-JLPQJ  Female   58      Yes                     1     Long Beach     90822  33.778436    -118.11
5360   7569-NMZYQ  Female   33      Yes                     3     Middletown     95461  38.787446    -122.58
2003   2889-FPWRM    Male   31      Yes                     0  Mckinleyville     95519  40.965011    -124.01
Age  Number of Dependents  Zip Code  Latitude  Longitude  Number of Referrals
ahmed_df.describe().transpose()
                                    count       mean        std       min        25%        50%
Number of Dependents               7043.0   0.468692   0.962802  0.000000   0.000000   0.000000
Number of Referrals                7043.0   1.951867   3.001199  0.000000   0.000000   0.000000
Tenure in Months                   7043.0  32.386767  24.542061  1.000000   9.000000  29.000000
Avg Monthly Long Distance Charges  6361.0  25.420517  14.200374  1.010000  13.050000  25.690000
Avg Monthly GB Download            5517.0  26.189958  19.586585  2.000000  13.000000  21.000000
ahmed_df.min()
<ipython-input-39-21277c8acac6>:1: FutureWarning: The default value of numeric_only in DataFrame.min is deprecated. In a future vers
ahmed_df.min()
Customer ID 0002-ORFBO
Gender Female
Age 19
Married No
Number of Dependents 0
City Acampo
Zip Code 90001
Latitude 32.555828
Longitude -124.301372
Number of Referrals 0
Tenure in Months 1
Offer None
Phone Service No
Avg Monthly Long Distance Charges 1.01
Internet Service No
Avg Monthly GB Download 2.0
Contract Month-to-Month
Paperless Billing No
Payment Method Bank Withdrawal
Monthly Charge -10.0
Total Charges 18.8
Total Refunds 0.0
Total Extra Data Charges 0
Total Long Distance Charges 0.0
Customer Status Churned
dtype: object
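The FutureWarning above appears because min() was called on a frame with mixed column types without stating numeric_only. One way to make the intent explicit, sketched here on a toy frame built from three rows shown earlier in this notebook (the name toy is introduced only for illustration), is to pass numeric_only=True so that only numeric columns are reduced.

```python
import pandas as pd

# Toy frame using Gender, Age and Monthly Charge values from the first three rows.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "Age": [37, 46, 50],
    "Monthly Charge": [65.6, -4.0, 73.9],
})

numeric_min = toy.min(numeric_only=True)   # Gender is left out of the result
numeric_max = toy.max(numeric_only=True)

print(numeric_min)
print(numeric_max)
```

If the minimum of the text columns is also wanted, those columns can be reduced separately so that each result has a single, predictable dtype.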
ahmed_df.max()
<ipython-input-41-cf016402c008>:1: FutureWarning: The default value of numeric_only in DataFrame.max is deprecated. In a future vers
ahmed_df.max()
Customer ID 9995-HOTOH
Gender Male
Age 80
Married Yes
Number of Dependents 9
City Zenia
Zip Code 96150
Latitude 41.962127
Longitude -114.192901
Number of Referrals 11
Tenure in Months 72