
45B AIML Prac1.3

Uploaded by

Ahmed Shaikh

Ahmed Shaikh AIML Prac 1.3 Roll No 45/B

Name of Student: Ahmed Mobin Ahmed Shaikh

Roll Number: 45 Lab Practical Number: 1.3

Title of Lab Assignment: To perform Exploratory Data Analysis on provided dataset.

DOP: 06/02/24 DOS: 06/02/24

CO Mapped: CO1
PO Mapped: PO1, PO2, PO3, PSO1, PSO2
Signature:
2/11/24, 2:37 AM 45_AIML_Practical1.3_EDA.ipynb - Colaboratory

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often
employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier
for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

Exploratory Data Analysis (EDA) - DataFrame


import pandas as pd
ahmed_df1 = pd.DataFrame({'col1':[1,2,4,5,6], 'col2':[444,555,666,444,777], 'col3':['abc','def','aghi','xyz','ghj']})
ahmed_df1.head()

col1 col2 col3

0 1 444 abc

1 2 555 def

2 4 666 aghi

3 5 444 xyz

4 6 777 ghj


The head() method returns the first five rows by default. Here we pass a number to customize how many rows are returned.

ahmed_df1.head(2)

col1 col2 col3

0 1 444 abc

1 2 555 def

Info on unique values

The info() method prints a concise summary of a dataframe: index range, column names, non-null counts, dtypes, and memory usage.

ahmed_df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col1 5 non-null int64
1 col2 5 non-null int64
2 col3 5 non-null object
dtypes: int64(2), object(1)
memory usage: 248.0+ bytes
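Because info() prints its summary rather than returning it, the text can be captured through the method's buf parameter when it needs to be saved or searched. The following is only a small sketch; the variable names df, buffer, and summary are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Same toy data as the practical, rebuilt so the sketch runs on its own.
df = pd.DataFrame({
    "col1": [1, 2, 4, 5, 6],
    "col2": [444, 555, 666, 444, 777],
    "col3": ["abc", "def", "aghi", "xyz", "ghj"],
})

buffer = io.StringIO()
result = df.info(buf=buffer)   # the summary is written into the buffer
summary = buffer.getvalue()    # the summary as an ordinary string

print(result)    # info() itself returns None
print(summary)
```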

The unique() method returns the unique values of a series, while nunique() returns the number of unique values in a dataframe or a series.

ahmed_df1['col2'].unique()

array([444, 555, 666, 777])

For a dataframe, nunique(), by default, returns the results by column. Otherwise, passing in axis=1 or axis='columns' will give the results by
row.

ahmed_df1['col2'].nunique()
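The call above works on a single series. As a sketch of the dataframe behaviour described in the text, the example below rebuilds the same toy data under the illustrative name df and counts unique values per column and per row.

```python
import pandas as pd

# Same toy data as the practical, rebuilt so the sketch runs on its own.
df = pd.DataFrame({
    "col1": [1, 2, 4, 5, 6],
    "col2": [444, 555, 666, 444, 777],
    "col3": ["abc", "def", "aghi", "xyz", "ghj"],
})

per_column = df.nunique()        # default axis=0: one count per column
per_row = df.nunique(axis=1)     # axis=1: one count per row

print(per_column)
print(per_row)
```

Here col2 has four distinct values because 444 appears twice, while every row holds three distinct values.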

The value_counts() method returns the count of each unique value in a series. By default, the counts are not normalized, they are sorted in descending order, and null values are excluded.

ahmed_df1['col2'].value_counts()

https://colab.research.google.com/drive/1Lfim_GIVYiuKA_HcLE10P9EAXBNIbWTo#scrollTo=LDgk0XnNtKCn&printMode=true 1/10

444 2
555 1
666 1
777 1
Name: col2, dtype: int64
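The defaults of value_counts() can be changed through its normalize and dropna parameters. The sketch below uses the col2 values from above plus a second, made-up series containing a missing value; the names s and with_gap are illustrative.

```python
import pandas as pd

s = pd.Series([444, 555, 666, 444, 777], name="col2")

counts = s.value_counts()                 # raw counts, most frequent first
shares = s.value_counts(normalize=True)   # relative frequencies instead of counts

# Made-up series with one missing value, to show how nulls are handled.
with_gap = pd.Series([444, None, 444, 555])
without_nulls = with_gap.value_counts()            # nulls excluded by default
with_nulls = with_gap.value_counts(dropna=False)   # nulls counted as their own entry

print(counts)
print(shares)
print(with_nulls)
```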

The head() method returns the first five rows by default. Here we pass a number to customize how many rows are returned.

print(ahmed_df1.head(2))

col1 col2 col3


0 1 444 abc
1 2 555 def

The tail() method returns the last five rows by default. Here we pass a number to customize how many rows are returned.

print(ahmed_df1.tail(2))

col1 col2 col3


3 5 444 xyz
4 6 777 ghj

Applying Functions

ahmed_df1['col3'].apply(len)

0 3
1 3
2 4
3 3
4 3
Name: col3, dtype: int64
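apply() accepts any callable, not only built-ins such as len. As a brief sketch with the same col3 values (the variable names are illustrative), the example below compares apply(len) with the vectorized string accessor and applies a small lambda.

```python
import pandas as pd

col3 = pd.Series(["abc", "def", "aghi", "xyz", "ghj"], name="col3")

lengths_apply = col3.apply(len)    # calls len() on each element
lengths_str = col3.str.len()       # string accessor, same result for strings
upper = col3.apply(lambda text: text.upper())   # any function of one value works

print(lengths_apply.tolist())
print(upper.tolist())
```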

The sum() method adds the values in each column and returns the sum of each column; called on a single column, as here, it returns one total.

ahmed_df1['col1'].sum()

18

The median() method returns a series with the median value of each column; called on a single column, as here, it returns one value.

ahmed_df1['col1'].median()

4.0

The mean() method returns the average of each column; called on a single column, as here, it returns one value.

ahmed_df1['col1'].mean()

3.6
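The three statistics above can also be computed for several numeric columns in one call with agg(). This is a sketch on the same toy data; the names df and summary are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 4, 5, 6],
    "col2": [444, 555, 666, 444, 777],
})

# One row per statistic, one column per numeric column.
summary = df[["col1", "col2"]].agg(["sum", "median", "mean"])
print(summary)
```

The col1 column of the result reproduces the values shown above: 18, 4.0, and 3.6.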

Permanently Removing a Column

del ahmed_df1['col1']

ahmed_df1

col2 col3

0 444 abc

1 555 def

2 666 aghi

3 444 xyz

4 777 ghj
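The del statement changes the dataframe in place. If the original should stay intact, drop() is one alternative, since it returns a new dataframe by default. A minimal sketch with illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 4, 5, 6],
    "col2": [444, 555, 666, 444, 777],
    "col3": ["abc", "def", "aghi", "xyz", "ghj"],
})

trimmed = df.drop(columns=["col1"])   # new dataframe without col1

print(list(trimmed.columns))   # col1 is gone from the copy
print(list(df.columns))        # the original still has all three columns
```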

Get column and index names

ahmed_df1.columns

Index(['col2', 'col3'], dtype='object')

ahmed_df1.index

RangeIndex(start=0, stop=5, step=1)

Sorting and ordering a dataframe

ahmed_df1

col2 col3

0 444 abc

1 555 def

2 666 aghi

3 444 xyz

4 777 ghj

ahmed_df1.sort_values(by='col3')

col2 col3

0 444 abc

2 666 aghi

1 555 def

4 777 ghj

3 444 xyz
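sort_values() also accepts several keys and a sort direction per key. The sketch below sorts the same two columns by col2 in descending order and breaks the tie between the two 444 rows by col3; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "col2": [444, 555, 666, 444, 777],
    "col3": ["abc", "def", "aghi", "xyz", "ghj"],
})

# col2 descending first, then col3 ascending for rows with equal col2.
ordered = df.sort_values(by=["col2", "col3"], ascending=[False, True])
renumbered = ordered.reset_index(drop=True)   # optional: fresh 0..n-1 index

print(ordered)
print(renumbered)
```

sort_values() returns a sorted copy and keeps the original row labels, which is why reset_index(drop=True) is shown as an optional last step.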

Data Skewness

ahmed_df1.describe()

col2

count 5.000000

mean 577.200000

std 144.726293

min 444.000000

25% 444.000000

50% 555.000000

75% 666.000000

max 777.000000

ahmed_df1.describe().transpose()

count mean std min 25% 50% 75% max

col2 5.0 577.2 144.726293 444.0 444.0 555.0 666.0 777.0

ahmed_df1.min()

col2 444
col3 abc
dtype: object

ahmed_df1.max()

col2 777
col3 xyz
dtype: object
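describe() reports the mean and the quartiles, from which skewness can only be judged indirectly. As a sketch that goes one step beyond the notebook, Series.skew() gives a skewness coefficient directly; the name col2 simply mirrors the column used above.

```python
import pandas as pd

col2 = pd.Series([444, 555, 666, 444, 777], name="col2")

skewness = col2.skew()                        # sample skewness coefficient
mean_above_median = col2.mean() > col2.median()

print(skewness)
print(mean_above_median)
```

A positive coefficient together with a mean (577.2) above the median (555) points to a longer right tail, although five values are far too few to draw firm conclusions.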


Exploratory Data Analysis (EDA) - telecom_customer_churn_Dataset


import pandas as pd
ahmed_df = pd.read_csv("telecom_customer_churn.csv")
ahmed_df.head()

   Customer ID  Gender  Age  Married  Number of Dependents  City          Zip Code  Latitude   Longitude
0  0002-ORFBO   Female  37   Yes      0                     Frazier Park  93225     34.827662  -118.999073
1  0003-MKNFE   Male    46   No       0                     Glendale      91206     34.162515  -118.203869
2  0004-TLHLJ   Male    50   No       0                     Costa Mesa    92627     33.645672  -117.922613
3  0011-IGKFF   Male    78   Yes      0                     Martinez      94553     38.014457  -122.115432
4  0013-EXCHZ   Female  75   Yes      0                     Camarillo     93010     34.227846  -119.079903

5 rows × 38 columns

ahmed_df.head(3)

   Customer ID  Gender  Age  Married  Number of Dependents  City          Zip Code  Latitude   Longitude
0  0002-ORFBO   Female  37   Yes      0                     Frazier Park  93225     34.827662  -118.999073
1  0003-MKNFE   Male    46   No       0                     Glendale      91206     34.162515  -118.203869
2  0004-TLHLJ   Male    50   No       0                     Costa Mesa    92627     33.645672  -117.922613

3 rows × 38 columns
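With a wide file such as this one (38 columns), it can help to load only a few columns or rows first. The sketch below is illustrative only: it reads a tiny in-memory CSV built from three of the columns and two of the rows shown above instead of the real telecom_customer_churn.csv, and the names csv_text, sample_df, and preview are assumptions.

```python
import io

import pandas as pd

# Stand-in for the real file: two rows and three columns copied from the output above.
csv_text = (
    "Customer ID,Gender,Age\n"
    "0002-ORFBO,Female,37\n"
    "0003-MKNFE,Male,46\n"
)

sample_df = pd.read_csv(io.StringIO(csv_text))

# usecols limits the columns that are parsed, nrows limits the rows that are read.
preview = pd.read_csv(io.StringIO(csv_text), usecols=["Customer ID", "Age"], nrows=1)

print(sample_df.shape)
print(preview)
```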

Info on unique values

ahmed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 38 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Customer ID 7043 non-null object
1 Gender 7043 non-null object
2 Age 7043 non-null int64
3 Married 7043 non-null object
4 Number of Dependents 7043 non-null int64
5 City 7043 non-null object
6 Zip Code 7043 non-null int64
7 Latitude 7043 non-null float64
8 Longitude 7043 non-null float64
9 Number of Referrals 7043 non-null int64
10 Tenure in Months 7043 non-null int64
11 Offer 7043 non-null object
12 Phone Service 7043 non-null object
13 Avg Monthly Long Distance Charges 6361 non-null float64
14 Multiple Lines 6361 non-null object
15 Internet Service 7043 non-null object
16 Internet Type 5517 non-null object
17 Avg Monthly GB Download 5517 non-null float64
18 Online Security 5517 non-null object
19 Online Backup 5517 non-null object
20 Device Protection Plan 5517 non-null object
21 Premium Tech Support 5517 non-null object
22 Streaming TV 5517 non-null object
23 Streaming Movies 5517 non-null object
24 Streaming Music 5517 non-null object

25 Unlimited Data 5517 non-null object
26 Contract 7043 non-null object
27 Paperless Billing 7043 non-null object
28 Payment Method 7043 non-null object
29 Monthly Charge 7043 non-null float64
30 Total Charges 7043 non-null float64
31 Total Refunds 7043 non-null float64
32 Total Extra Data Charges 7043 non-null int64
33 Total Long Distance Charges 7043 non-null float64
34 Total Revenue 7043 non-null float64
35 Customer Status 7043 non-null object
36 Churn Category 1869 non-null object
37 Churn Reason 1869 non-null object
dtypes: float64(9), int64(6), object(23)
memory usage: 2.0+ MB
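The non-null counts above imply missing values: for example, 7043 - 5517 = 1526 rows have no Internet Type, and 7043 - 1869 = 5174 rows have no Churn Category. These gaps can be counted directly with isna(). The sketch below uses a small made-up dataframe, not the real dataset; its values and the names sample_df, missing, and share are hypothetical.

```python
import pandas as pd

# Hypothetical three-row sample that borrows column names from the dataset.
sample_df = pd.DataFrame({
    "Internet Type": ["Cable", None, "Fiber Optic"],
    "Churn Category": [None, None, "Competitor"],
    "Age": [37, 46, 50],
})

missing = sample_df.isna().sum()    # number of missing values per column
share = sample_df.isna().mean()     # fraction of missing values per column

print(missing)
print(share)
```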

ahmed_df['Total Revenue'].unique()

array([ 974.81, 610.28, 415.45, ..., 129.99, 4769.69, 3707.6 ])

ahmed_df['Total Revenue'].nunique()

6975

ahmed_df['Total Revenue'].value_counts()

24.80 3
116.27 3
68.41 3
66.56 3
3386.40 2
..
976.70 1
300.65 1
3258.42 1
1713.52 1
3707.60 1
Name: Total Revenue, Length: 6975, dtype: int64
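For a nearly continuous column such as Total Revenue (6975 distinct values in 7043 rows), counting exact values says little. One option is the bins parameter of value_counts(), which groups values into equal-width intervals. The sketch below uses six revenue figures taken from the outputs above; the names revenue and binned are illustrative.

```python
import pandas as pd

# Six Total Revenue values copied from the outputs above.
revenue = pd.Series([24.80, 116.27, 974.81, 610.28, 415.45, 3707.60])

# Three equal-width intervals; sort=False keeps them in interval order.
binned = revenue.value_counts(bins=3, sort=False)
print(binned)
```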

print(ahmed_df.head(5))

Customer ID Gender Age Married Number of Dependents City \


0 0002-ORFBO Female 37 Yes 0 Frazier Park
1 0003-MKNFE Male 46 No 0 Glendale
2 0004-TLHLJ Male 50 No 0 Costa Mesa
3 0011-IGKFF Male 78 Yes 0 Martinez
4 0013-EXCHZ Female 75 Yes 0 Camarillo

Zip Code Latitude Longitude Number of Referrals ... Payment Method \


0 93225 34.827662 -118.999073 2 ... Credit Card
1 91206 34.162515 -118.203869 0 ... Credit Card
2 92627 33.645672 -117.922613 0 ... Bank Withdrawal
3 94553 38.014457 -122.115432 1 ... Bank Withdrawal
4 93010 34.227846 -119.079903 3 ... Credit Card

Monthly Charge Total Charges Total Refunds Total Extra Data Charges \
0 65.6 593.30 0.00 0
1 -4.0 542.40 38.33 10
2 73.9 280.85 0.00 0
3 98.0 1237.85 0.00 0
4 83.9 267.40 0.00 0

Total Long Distance Charges Total Revenue Customer Status Churn Category \
0 381.51 974.81 Stayed NaN
1 96.21 610.28 Stayed NaN
2 134.60 415.45 Churned Competitor
3 361.66 1599.51 Churned Dissatisfaction
4 22.14 289.54 Churned Dissatisfaction

Churn Reason
0 NaN
1 NaN
2 Competitor had better devices
3 Product dissatisfaction
4 Network reliability

[5 rows x 38 columns]

print(ahmed_df.tail(5))

Customer ID Gender Age Married Number of Dependents City \


7038 9987-LUTYD Female 20 No 0 La Mesa
7039 9992-RRAMN Male 40 Yes 0 Riverbank
7040 9992-UJOEL Male 22 No 0 Elk
7041 9993-LHIEB Male 21 Yes 0 Solana Beach

7042 9995-HOTOH Male 36 Yes 0 Sierra City

Zip Code Latitude Longitude Number of Referrals ... \


7038 91941 32.759327 -116.997260 0 ...
7039 95367 37.734971 -120.954271 1 ...
7040 95432 39.108252 -123.645121 0 ...
7041 92075 33.001813 -117.263628 5 ...
7042 96125 39.600599 -120.636358 1 ...

Payment Method Monthly Charge Total Charges Total Refunds \


7038 Credit Card 55.15 742.90 0.0
7039 Bank Withdrawal 85.10 1873.70 0.0
7040 Credit Card 50.30 92.75 0.0
7041 Credit Card 67.85 4627.65 0.0
7042 Bank Withdrawal 59.00 3707.60 0.0

Total Extra Data Charges Total Long Distance Charges Total Revenue \
7038 0 606.84 1349.74
7039 0 356.40 2230.10
7040 0 37.24 129.99
7041 0 142.04 4769.69
7042 0 0.00 3707.60

Customer Status Churn Category Churn Reason


7038 Stayed NaN NaN
7039 Churned Dissatisfaction Product dissatisfaction
7040 Joined NaN NaN
7041 Stayed NaN NaN
7042 Stayed NaN NaN

[5 rows x 38 columns]

Applying Functions


ahmed_df['Customer ID'].apply(len)

0 10
1 10
2 10
3 10
4 10
..
7038 10
7039 10
7040 10
7041 10
7042 10
Name: Customer ID, Length: 7043, dtype: int64

ahmed_df['Total Revenue'].sum()

21371131.69

ahmed_df['Total Revenue'].median()

2108.64

ahmed_df['Total Revenue'].mean()

3034.3790558000856

Permanently Removing a Column

del ahmed_df['Total Revenue']

ahmed_df


      Customer ID  Gender  Age  Married  Number of Dependents  City          Zip Code  Latitude   Longitu
0     0002-ORFBO   Female  37   Yes      0                     Frazier Park  93225     34.827662  -118.9990
1     0003-MKNFE   Male    46   No       0                     Glendale      91206     34.162515  -118.2038
2     0004-TLHLJ   Male    50   No       0                     Costa Mesa    92627     33.645672  -117.9226
3     0011-IGKFF   Male    78   Yes      0                     Martinez      94553     38.014457  -122.1154
4     0013-EXCHZ   Female  75   Yes      0                     Camarillo     93010     34.227846  -119.0799
...   ...          ...     ...  ...      ...                   ...           ...       ...        ...
7038  9987-LUTYD   Female  20   No       0                     La Mesa       91941     32.759327  -116.9972
7039  9992-RRAMN   Male    40   Yes      0                     Riverbank     95367     37.734971  -120.9542
7040  9992-UJOEL   Male    22   No       0                     Elk           95432     39.108252  -123.6451
7041  9993-LHIEB   Male    21   Yes      0                     Solana Beach  92075     33.001813  -117.2636
7042  9995-HOTOH   Male    36   Yes      0                     Sierra City   96125     39.600599  -120.6363

7043 rows × 37 columns

Get column and index names


ahmed_df.columns

Index(['Customer ID', 'Gender', 'Age', 'Married', 'Number of Dependents',
       'City', 'Zip Code', 'Latitude', 'Longitude', 'Number of Referrals',
       'Tenure in Months', 'Offer', 'Phone Service',
       'Avg Monthly Long Distance Charges', 'Multiple Lines',
       'Internet Service', 'Internet Type', 'Avg Monthly GB Download',
       'Online Security', 'Online Backup', 'Device Protection Plan',
       'Premium Tech Support', 'Streaming TV', 'Streaming Movies',
       'Streaming Music', 'Unlimited Data', 'Contract', 'Paperless Billing',
       'Payment Method', 'Monthly Charge', 'Total Charges', 'Total Refunds',
       'Total Extra Data Charges', 'Total Long Distance Charges',
       'Customer Status', 'Churn Category', 'Churn Reason'],
      dtype='object')

ahmed_df.index

RangeIndex(start=0, stop=7043, step=1)

Sorting and ordering a Dataset

ahmed_df


      Customer ID  Gender  Age  Married  Number of Dependents  City          Zip Code  Latitude   Longitu
0     0002-ORFBO   Female  37   Yes      0                     Frazier Park  93225     34.827662  -118.9990
1     0003-MKNFE   Male    46   No       0                     Glendale      91206     34.162515  -118.2038
2     0004-TLHLJ   Male    50   No       0                     Costa Mesa    92627     33.645672  -117.9226
3     0011-IGKFF   Male    78   Yes      0                     Martinez      94553     38.014457  -122.1154
4     0013-EXCHZ   Female  75   Yes      0                     Camarillo     93010     34.227846  -119.0799
...   ...          ...     ...  ...      ...                   ...           ...       ...        ...
7038  9987-LUTYD   Female  20   No       0                     La Mesa       91941     32.759327  -116.9972
7039  9992-RRAMN   Male    40   Yes      0                     Riverbank     95367     37.734971  -120.9542
7040  9992-UJOEL   Male    22   No       0                     Elk           95432     39.108252  -123.6451
7041  9993-LHIEB   Male    21   Yes      0                     Solana Beach  92075     33.001813  -117.2636
7042  9995-HOTOH   Male    36   Yes      0                     Sierra City   96125     39.600599  -120.6363

7043 rows × 37 columns

ahmed_df.sort_values(by='Total Charges')

      Customer ID  Gender  Age  Married  Number of Dependents  City           Zip Code  Latitude   Longi
2060  2967-MXRAV   Male    29   Yes      0                     Los Angeles    90003     33.964131  -118.27
6560  9318-NKNFC   Male    53   No       0                     Twain          95984     40.022184  -121.06
6350  8992-CEUEN   Female  59   No       0                     Arnold         95223     38.321530  -120.23
7033  9975-SKRNR   Male    24   No       0                     Sierraville    96126     39.559709  -120.34
981   1423-BMPBQ   Female  62   Yes      1                     Anaheim        92808     33.850452  -117.72
...   ...          ...     ...  ...      ...                   ...            ...       ...        ...
6275  8879-XUAHX   Male    42   Yes      0                     Irvine         92614     33.680302  -117.83
6892  9788-HNGUT   Male    45   Yes      0                     Cabazon        92230     33.929812  -116.76
6855  9739-JLPQJ   Female  58   Yes      1                     Long Beach     90822     33.778436  -118.11
5360  7569-NMZYQ   Female  33   Yes      3                     Middletown     95461     38.787446  -122.58
2003  2889-FPWRM   Male    31   Yes      0                     Mckinleyville  95519     40.965011  -124.01

7043 rows × 37 columns
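When only the extremes of a sorted column matter, nlargest() and nsmallest() return them without displaying the whole sorted dataframe. The sketch below uses five Customer ID and Total Charges pairs copied from the head() output earlier; the names sample_df, top2, and bottom2 are illustrative.

```python
import pandas as pd

# First five customers and their Total Charges, copied from the head() output.
sample_df = pd.DataFrame({
    "Customer ID": ["0002-ORFBO", "0003-MKNFE", "0004-TLHLJ", "0011-IGKFF", "0013-EXCHZ"],
    "Total Charges": [593.30, 542.40, 280.85, 1237.85, 267.40],
})

top2 = sample_df.nlargest(2, "Total Charges")       # two highest totals
bottom2 = sample_df.nsmallest(2, "Total Charges")   # two lowest totals

print(top2)
print(bottom2)
```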

Data Skewness


ahmed_df.describe()


       Age          Number of Dependents  Zip Code      Latitude     Longitude    Number of Referrals
count  7043.000000  7043.000000           7043.000000   7043.000000  7043.000000  7043.000000          70
mean   46.509726    0.468692              93486.070567  36.197455    -119.756684  1.951867
std    16.750352    0.962802              1856.767505   2.468929     2.154425     3.001199
min    19.000000    0.000000              90001.000000  32.555828    -124.301372  0.000000
25%    32.000000    0.000000              92101.000000  33.990646    -121.788090  0.000000
50%    46.000000    0.000000              93518.000000  36.205465    -119.595293  0.000000
75%    60.000000    0.000000              95329.000000  38.161321    -117.969795  3.000000
max    80.000000    9.000000              96150.000000  41.962127    -114.192901  11.000000

ahmed_df.describe().transpose()

                                   count   mean          std          min           25%           50%
Age                                7043.0  46.509726     16.750352    19.000000     32.000000     46.000000
Number of Dependents               7043.0  0.468692      0.962802     0.000000      0.000000      0.000000
Zip Code                           7043.0  93486.070567  1856.767505  90001.000000  92101.000000  93518.000000
Latitude                           7043.0  36.197455     2.468929     32.555828     33.990646     36.205465
Longitude                          7043.0  -119.756684   2.154425     -124.301372   -121.788090   -119.595293
Number of Referrals                7043.0  1.951867      3.001199     0.000000      0.000000      0.000000
Tenure in Months                   7043.0  32.386767     24.542061    1.000000      9.000000      29.000000
Avg Monthly Long Distance Charges  6361.0  25.420517     14.200374    1.010000      13.050000     25.690000
Avg Monthly GB Download            5517.0  26.189958     19.586585    2.000000      13.000000     21.000000
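By default describe() covers only the numeric columns, so text columns such as Gender or Married are left out of the tables above. One way to summarize them, sketched here on a small sample built from the first five rows shown earlier, is to select the non-numeric columns first; the names sample_df and text_summary are illustrative.

```python
import pandas as pd

# Gender, Married, and Age for the first five customers in the head() output.
sample_df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Male", "Female"],
    "Married": ["Yes", "No", "No", "Yes", "Yes"],
    "Age": [37, 46, 50, 78, 75],
})

# For text columns describe() reports count, unique, top (most frequent), and freq.
text_summary = sample_df.select_dtypes(exclude="number").describe()
print(text_summary)
```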

ahmed_df.min()

<ipython-input-39-21277c8acac6>:1: FutureWarning: The default value of numeric_only in DataFrame.min is deprecated. In a future vers
ahmed_df.min()
Customer ID 0002-ORFBO
Gender Female
Age 19
Married No
Number of Dependents 0
City Acampo
Zip Code 90001
Latitude 32.555828
Longitude -124.301372
Number of Referrals 0
Tenure in Months 1
Offer None
Phone Service No
Avg Monthly Long Distance Charges 1.01
Internet Service No
Avg Monthly GB Download 2.0
Contract Month-to-Month
Paperless Billing No
Payment Method Bank Withdrawal
Monthly Charge -10.0
Total Charges 18.8
Total Refunds 0.0
Total Extra Data Charges 0
Total Long Distance Charges 0.0
Customer Status Churned
dtype: object
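The FutureWarning above concerns the numeric_only argument. Passing it explicitly restricts the result to numeric columns and avoids relying on the default. The sketch below uses two rows copied from the head() output; the names sample_df, numeric_min, and negative_charges are illustrative. It also counts negative Monthly Charge values, since the minimum of -10.0 reported above may be worth a closer look.

```python
import pandas as pd

# Two rows (Glendale and Martinez) copied from the head() output.
sample_df = pd.DataFrame({
    "City": ["Glendale", "Martinez"],
    "Age": [46, 78],
    "Monthly Charge": [-4.0, 98.0],
})

numeric_min = sample_df.min(numeric_only=True)   # text columns such as City are skipped
negative_charges = (sample_df["Monthly Charge"] < 0).sum()

print(numeric_min)
print(negative_charges)
```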

ahmed_df.max()


<ipython-input-41-cf016402c008>:1: FutureWarning: The default value of numeric_only in DataFrame.max is deprecated. In a future vers
ahmed_df.max()
Customer ID 9995-HOTOH
Gender Male
Age 80
Married Yes
Number of Dependents 9
City Zenia
Zip Code 96150
Latitude 41.962127
Longitude -114.192901
Number of Referrals 11
Tenure in Months 72
