
EDA Session-3 Categorical Data Analysis

The document discusses analyzing and summarizing data from a dataset containing visa application information. It shows how to import necessary libraries, read in the data, and view the column headers. Specific columns like 'continent' are accessed as both series and dataframes. The number of unique continent values and counts of applications per continent are calculated. The continent frequencies are stored in a new dataframe and exported to a CSV file.

Uploaded by

jeeshu048


Import the packages

In [1]: import numpy as np

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Read the data

In [4]: path=r"C:\Users\omkar\OneDrive\Documents\Data science\Naresh IT\Datafiles\Vi

visa_df=pd.read_csv(path)
visa_df.head(3)

Out[4]: case_id continent education_of_employee has_job_experience requires_job_training no_o

0 EZYV01 Asia High School N N

1 EZYV02 Asia Master's Y N

2 EZYV03 Asia Bachelor's N Y

 

Reading a specific column

In [5]: visa_df['continent'] # series type

Out[5]: 0 Asia
1 Asia
2 Asia
3 Asia
4 Africa
...
25475 Asia
25476 Asia
25477 Asia
25478 Asia
25479 Asia
Name: continent, Length: 25480, dtype: object
In [6]: visa_df[['continent']] # data frame

Out[6]: continent

0 Asia

1 Asia

2 Asia

3 Asia

4 Africa

... ...

25475 Asia

25476 Asia

25477 Asia

25478 Asia

25479 Asia

25480 rows × 1 columns

In [7]: visa_df.continent # series

Out[7]: 0 Asia
1 Asia
2 Asia
3 Asia
4 Africa
...
25475 Asia
25476 Asia
25477 Asia
25478 Asia
25479 Asia
Name: continent, Length: 25480, dtype: object

In [ ]: visa_df['continent'] # series
visa_df.continent # series
visa_df[['continent']] # df
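The three access styles above can be compared side by side. This is a minimal sketch on a hypothetical three-row stand-in for visa_df (the name demo_df and its rows are assumptions for illustration, not the real CSV):

```python
import pandas as pd

# Hypothetical miniature of visa_df; the real data comes from the CSV read above.
demo_df = pd.DataFrame({
    "case_id": ["EZYV01", "EZYV02", "EZYV03"],
    "continent": ["Asia", "Asia", "Africa"],
})

as_series = demo_df["continent"]    # single brackets -> Series
as_attr = demo_df.continent         # attribute access -> the same Series
as_frame = demo_df[["continent"]]   # list of column names -> DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so the bracket form is the safer default.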

In [8]: visa_df.columns

Out[8]: Index(['case_id', 'continent', 'education_of_employee', 'has_job_experience',
'requires_job_training', 'no_of_employees', 'yr_of_estab',
'region_of_employment', 'prevailing_wage', 'unit_of_wage',
'full_time_position', 'case_status'],
dtype='object')
In [9]: cols=['continent','education_of_employee']
visa_df[cols]

Out[9]: continent education_of_employee

0 Asia High School

1 Asia Master's

2 Asia Bachelor's

3 Asia Bachelor's

4 Africa Master's

... ... ...

25475 Asia Bachelor's

25476 Asia High School

25477 Asia Master's

25478 Asia Master's

25479 Asia Bachelor's

25480 rows × 2 columns

In [11]: visa_df.values

# list of all the samples
# list of all the observations
# list of all the tuples

Out[11]: array([['EZYV01', 'Asia', 'High School', ..., 'Hour', 'Y', 'Denied'],

['EZYV02', 'Asia', "Master's", ..., 'Year', 'Y', 'Certified'],
['EZYV03', 'Asia', "Bachelor's", ..., 'Year', 'Y', 'Denied'],
...,
['EZYV25478', 'Asia', "Master's", ..., 'Year', 'N', 'Certified'],
['EZYV25479', 'Asia', "Master's", ..., 'Year', 'Y', 'Certified'],
['EZYV25480', 'Asia', "Bachelor's", ..., 'Year', 'Y', 'Certified']],
dtype=object)

In [ ]: # list of rows -> DataFrame : pd.DataFrame(<list>)

# DataFrame -> array of rows : <df>.values

continent
In [16]: l1=[1,2,3]
l2=['A','B','C']
l=[l1,l2]
l
pd.DataFrame(l)

Out[16]: 0 1 2

0 1 2 3

1 A B C
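The round trip between a list of rows and a DataFrame can be sketched as follows. The variable names rows and back are assumptions for illustration; .values returns a NumPy array, so .tolist() is added here to get plain Python lists back:

```python
import pandas as pd

rows = [[1, 2, 3], ["A", "B", "C"]]   # each inner list becomes one row
df = pd.DataFrame(rows)               # list of lists -> DataFrame

back = df.values.tolist()             # DataFrame -> NumPy array -> list of rows
print(back)
```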
In [17]: col=['continent']
visa_df[col]

Out[17]: continent

0 Asia

1 Asia

2 Asia

3 Asia

4 Africa

... ...

25475 Asia

25476 Asia

25477 Asia

25478 Asia

25479 Asia

25480 rows × 1 columns

unique
In [18]: # which unique labels are there
visa_df['continent'].unique()

Out[18]: array(['Asia', 'Africa', 'North America', 'Europe', 'South America',

'Oceania'], dtype=object)

In [19]: # python basic logics

l1=['A','A','B','C'] # ['A','B','C']
set(l1)

Out[19]: {'A', 'B', 'C'}

In [21]: set(visa_df['continent'].values)

Out[21]: {'Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America'}

nunique
In [22]: visa_df['continent'].nunique()
# number of unique elements

Out[22]: 6

In the continent column, only 6 distinct values occur, repeated across the rows:

{'Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America'}
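One difference between unique(), nunique() and set() shows up only when the column has missing values. The sketch below uses a hypothetical five-element Series with one NaN (the continent column in this notebook may well have none):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value, for illustration only.
s = pd.Series(["Asia", "Africa", "Asia", np.nan, "Europe"])

labels = s.unique()                    # distinct values in order of first appearance, NaN included
n_labels = s.nunique()                 # number of distinct values, NaN excluded by default
n_with_nan = s.nunique(dropna=False)   # count NaN as its own label

print(labels, n_labels, n_with_nan)
```

unique() keeps the order of first appearance, which is why Out[18] lists 'Asia' first, while a Python set has no defined order.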

Q1) Out of the total observations, how many are from Asia?

In [26]: con=visa_df['continent']=='Asia' # True and False

visa_df[con]

Out[26]: case_id continent education_of_employee has_job_experience requires_job_trainin

0 EZYV01 Asia High School N

1 EZYV02 Asia Master's Y

2 EZYV03 Asia Bachelor's N

3 EZYV04 Asia Bachelor's N

5 EZYV06 Asia Master's Y

... ... ... ... ...

25475 EZYV25476 Asia Bachelor's Y

25476 EZYV25477 Asia High School Y

25477 EZYV25478 Asia Master's Y

25478 EZYV25479 Asia Master's Y

25479 EZYV25480 Asia Bachelor's Y

16861 rows × 12 columns

 

In [27]: con=visa_df['continent']=='Asia' # True and False

len(visa_df[con])

Out[27]: 16861

In [28]: con=visa_df['continent']=='Africa' # True and False

len(visa_df[con])

Out[28]: 551

In [31]: unique_labels= visa_df['continent'].unique()

for i in unique_labels:
    con=visa_df['continent']==i # True and False
    print(i,":",len(visa_df[con]))

Asia : 16861
Africa : 551
North America : 3292
Europe : 3732
South America : 852
Oceania : 192
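Because the condition is a Series of True/False values, the same count can be obtained without filtering the dataframe first. A minimal sketch on a hypothetical five-row column (demo is an assumed name, not part of the notebook):

```python
import pandas as pd

# Hypothetical stand-in for visa_df['continent'].
demo = pd.Series(["Asia", "Asia", "Africa", "Europe", "Asia"], name="continent")

mask = demo == "Asia"        # boolean Series: True where the label matches
n_filter = len(demo[mask])   # approach used above: filter, then take the length
n_sum = int(mask.sum())      # True counts as 1, so the sum is the number of matches
share = mask.mean()          # fraction of rows that match

print(n_filter, n_sum, share)
```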

Frequency table
In [35]: unique_labels= visa_df['continent'].unique()
count=[]
for i in unique_labels:
    con=visa_df['continent']==i # True and False
    count.append(len(visa_df[con]))

continent_df=pd.DataFrame(zip(unique_labels,count),
columns=['Continent','Count'])
continent_df.to_csv('continent_df.csv',index=False)
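The effect of index=False in the export can be checked without touching the disk. This sketch writes a two-row excerpt of the frequency table to an in-memory buffer (io.StringIO stands in for 'continent_df.csv'; the names freq and restored are assumptions):

```python
import io
import pandas as pd

# Two rows taken from the frequency table, for illustration.
freq = pd.DataFrame({"Continent": ["Asia", "Africa"], "Count": [16861, 551]})

buffer = io.StringIO()             # in-memory stand-in for 'continent_df.csv'
freq.to_csv(buffer, index=False)   # index=False: do not write the 0, 1, ... row labels
print(buffer.getvalue())

buffer.seek(0)
restored = pd.read_csv(buffer)     # reading it back gives the same two columns
```

Without index=False, the row labels would be written as an extra unnamed first column and would come back as "Unnamed: 0" on reading.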

In [ ]: visa_df # Total data frame

visa_df['continent'] # specific column
visa_df['continent']=='Asia' # Specific label
#####################################################
len(visa_df[visa_df['continent']=='Asia'])

##################################################

unique_labels= visa_df['continent'].unique()
count=[]
for i in unique_labels:
    con=visa_df['continent']==i # True and False
    count.append(len(visa_df[con]))

#####################################################
continent_df=pd.DataFrame(zip(unique_labels,count),
columns=['Continent','Count'])

########################################################
continent_df.to_csv('continent_df.csv',index=False)

In [36]: continent_df

Out[36]: Continent Count

0 Asia 16861

1 Africa 551

2 North America 3292

3 Europe 3732

4 South America 852

5 Oceania 192

value_counts
In [38]: continent_vc=visa_df['continent'].value_counts() # series
continent_vc

Out[38]: continent
Asia 16861
Europe 3732
North America 3292
South America 852
Africa 551
Oceania 192
Name: count, dtype: int64

In [ ]: visa_df
visa_df['continent']
visa_df['continent'].unique()
visa_df['continent'].nunique()
visa_df['continent'].value_counts()

In [39]: continent_vc.keys()

Out[39]: Index(['Asia', 'Europe', 'North America', 'South America', 'Africa',

'Oceania'],
dtype='object', name='continent')

In [41]: continent_vc.values

Out[41]: array([16861, 3732, 3292, 852, 551, 192], dtype=int64)

In [43]: continent_vc=visa_df['continent'].value_counts() # series

l1=continent_vc.keys()
l2=continent_vc.values
continent_vc_df=pd.DataFrame(zip(l1,l2),
columns=['continent','count'])

continent_vc_df

Out[43]: continent count

0 Asia 16861

1 Europe 3732

2 North America 3292

3 South America 852

4 Africa 551

5 Oceania 192
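The zip of keys and values above can also be written as one chained expression. This is a sketch of an alternative, not the notebook's method; rename_axis and reset_index(name=...) are used so that the column names come out the same regardless of the pandas version, and demo is a hypothetical six-row column:

```python
import pandas as pd

# Hypothetical stand-in for visa_df['continent'].
demo = pd.Series(["Asia", "Europe", "Asia", "Africa", "Asia", "Europe"], name="continent")

freq_df = (
    demo.value_counts()             # Series: label -> count, sorted by count descending
        .rename_axis("continent")   # name the index so it becomes the 'continent' column
        .reset_index(name="count")  # move the index into a column, call the counts 'count'
)
print(freq_df)
```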
In [46]: visa_df # Total data frame
visa_df['continent'] # specific column
visa_df['continent']=='Asia' # Specific label
#####################################################
len(visa_df[visa_df['continent']=='Asia'])
len(visa_df[visa_df['continent']=='Africa'])
len(visa_df[visa_df['continent']=='Europe'])
len(visa_df[visa_df['continent']=='North America'])
len(visa_df[visa_df['continent']=='South America'])
len(visa_df[visa_df['continent']=='Oceania'])

########-------Method-1---------#########################

unique_labels= visa_df['continent'].unique()
count=[]
for i in unique_labels:
    con=visa_df['continent']==i # True and False
    count.append(len(visa_df[con]))

continent_df=pd.DataFrame(zip(unique_labels,count),
columns=['Continent','Count'])

print(continent_df)

##################------ M-2----(Value counts)----##########################
continent_vc=visa_df['continent'].value_counts() # series
l1=continent_vc.keys()
l2=continent_vc.values
continent_vc_df=pd.DataFrame(zip(l1,l2),
columns=['continent','count'])

print(continent_vc_df)

Continent Count
0 Asia 16861
1 Africa 551
2 North America 3292
3 Europe 3732
4 South America 852
5 Oceania 192
continent count
0 Asia 16861
1 Europe 3732
2 North America 3292
3 South America 852
4 Africa 551
5 Oceania 192

In [47]: continent_vc

Out[47]: continent
Asia 16861
Europe 3732
North America 3292
South America 852
Africa 551
Oceania 192
Name: count, dtype: int64
In [48]: continent_df

Out[48]: Continent Count

0 Asia 16861

1 Africa 551

2 North America 3292

3 Europe 3732

4 South America 852

5 Oceania 192

Bar chart

To draw a bar chart, we need:

one categorical column
one numerical column
package: matplotlib
dataframe: continent_vc_df

In [51]: #plt.bar(<cat>,<numer>,<data>)
continent_vc_df

Out[51]: continent count

0 Asia 16861

1 Europe 3732

2 North America 3292

3 South America 852

4 Africa 551

5 Oceania 192
In [61]: plt.figure(figsize=(10,6)) # to increase the plot size
plt.bar('continent',
'count',
data=continent_vc_df)
plt.xlabel("continent") # x-axis name
plt.ylabel('count') # y-axis name
plt.title("Bar chart") # title of the chart
plt.savefig('continent_bar.jpg')
plt.show()

we read the data

we read the categorical column
we made a frequency table using value_counts
we plotted the bar chart using matplotlib
plt.bar needs 3 arguments here:
x: categorical column (bar positions)
height: numerical column (bar heights)
data: the frequency table (continent_vc_df)
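The same chart can also be drawn by passing the two columns directly instead of column names plus data=. A minimal sketch, assuming the counts shown in Out[43] and a non-interactive backend (the "Agg" backend and the names freq, bars and heights are choices made for this example):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Frequency table copied from Out[43].
freq = pd.DataFrame({
    "continent": ["Asia", "Europe", "North America", "South America", "Africa", "Oceania"],
    "count": [16861, 3732, 3292, 852, 551, 192],
})

fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(freq["continent"], freq["count"])  # first argument: x positions, second: bar heights
ax.set_xlabel("continent")
ax.set_ylabel("count")
ax.set_title("Bar chart")

heights = [bar.get_height() for bar in bars]     # one rectangle per category
print(heights)
```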

Count plot

A count plot can be drawn with the seaborn package

It needs only the entire dataframe and the categorical column
entire dataframe name: visa_df
categorical column name: continent
order: the order in which the bars should be plotted
In [65]: visa_df['continent'].value_counts().keys()

Out[65]: Index(['Asia', 'Europe', 'North America', 'South America', 'Africa',

'Oceania'],
dtype='object', name='continent')

In [70]: plt.figure(figsize=(10,6))
# l=['Asia', 'Oceania', 'North America', 'South America', 'Africa',
# 'Europe']
l=visa_df['continent'].value_counts().keys() # order taken from value_counts (highest count first)
sns.countplot(data=visa_df,
x='continent',
order=l)
plt.xlabel("continent") # x-axis name
plt.ylabel('count') # y-axis name
plt.title("Bar chart") # title of the chart
plt.savefig('continent_bar.jpg')
plt.show()
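The order argument used above accepts any list of labels, so other orderings can be prepared with pandas before plotting. The sketch below only builds the lists (it does not call seaborn); the counts are copied from Out[38] and the variable names are assumptions:

```python
import pandas as pd

# Counts copied from Out[38].
counts = pd.Series({
    "Asia": 16861, "Europe": 3732, "North America": 3292,
    "South America": 852, "Africa": 551, "Oceania": 192,
})

descending = list(counts.sort_values(ascending=False).index)  # same order as value_counts()
ascending = list(counts.sort_values().index)                  # smallest bar first
alphabetical = sorted(counts.index)                           # A to Z

print(descending)
print(ascending)
print(alphabetical)
```

Any of these lists could then be passed as order= in the sns.countplot call.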

In [ ]: # perform the same analysis on education_of_employee

# share the plots in the WhatsApp group
# take a screenshot and post it in the group
In [1]: # Import packages
# and read data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

path=r"C:\Users\omkar\OneDrive\Documents\Data science\Naresh IT\Datafiles\Vi
visa_df=pd.read_csv(path)
visa_df.head(3)

Out[1]:
case_id continent education_of_employee has_job_experience requires_job_training no_o

0 EZYV01 Asia High School N N

1 EZYV02 Asia Master's Y N

2 EZYV03 Asia Bachelor's N Y

 

Method-3
Method 1: we created a frequency table and plotted it with matplotlib
Method 2: we created the bar chart using seaborn, which needs only
the main dataframe
the column name
Method 3: plot directly from the value counts
In [14]: values=visa_df['continent'].value_counts()
ax=values.plot(kind='bar')
ax.bar_label(ax.containers[0])

Out[14]: [Text(0, 0, '16861'),

Text(0, 0, '3732'),
Text(0, 0, '3292'),
Text(0, 0, '852'),
Text(0, 0, '551'),
Text(0, 0, '192')]
In [15]: plt.subplot(2,2,1)
plt.subplot(2,2,2)
plt.subplot(2,2,3)
plt.subplot(2,2,4)

Out[15]: <Axes: >
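plt.subplot(rows, columns, position) selects one cell of a grid, and whatever is plotted next lands in that cell. A minimal sketch that fills two cells of the 2x2 grid, assuming the counts from Out[38] and a non-interactive backend (both the "Agg" backend and the choice of charts are assumptions for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Counts copied from Out[38].
counts = pd.Series({
    "Asia": 16861, "Europe": 3732, "North America": 3292,
    "South America": 852, "Africa": 551, "Oceania": 192,
})

plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)                    # 2 rows, 2 columns, first cell (top left)
plt.bar(counts.index, counts.values)
plt.title("Vertical bars")

plt.subplot(2, 2, 2)                    # second cell (top right)
plt.barh(counts.index, counts.values)
plt.title("Horizontal bars")

n_axes = len(plt.gcf().axes)            # number of cells created so far
print(n_axes)
```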

In [ ]: ######################## M-1 ###############################################
plt.figure(figsize=(10,6)) # to increase the plot size
plt.bar('continent',
'count',
data=continent_vc_df)
plt.xlabel("continent") # x-axis name
plt.ylabel('count') # y-axis name
plt.title("Bar chart") # title of the chart
plt.savefig('continent_bar.jpg')
plt.show()

######################## M-2 ###############################################

plt.figure(figsize=(10,6))
# l=['Asia', 'Oceania', 'North America', 'South America', 'Africa',
# 'Europe']
l=visa_df['continent'].value_counts().keys() # order taken from value_counts (highest count first)
sns.countplot(data=visa_df,
x='continent',
order=l)
plt.xlabel("continent") # x-axis name
plt.ylabel('count') # y-axis name
plt.title("Bar chart") # title of the chart
plt.savefig('continent_bar.jpg')
plt.show()

############################# M-3 ##########################################

values=visa_df['continent'].value_counts()
ax=values.plot(kind='bar')
ax.bar_label(ax.containers[0])

Relative frequency
In [18]: visa_df['continent'].value_counts(normalize=True)

Out[18]: continent
Asia 0.661735
Europe 0.146468
North America 0.129199
South America 0.033438
Africa 0.021625
Oceania 0.007535
Name: proportion, dtype: float64
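The proportions returned by normalize=True are simply each count divided by the total number of rows. A short sketch that reproduces them by hand from the counts in Out[38] (the names counts, total, relative and percent are assumptions):

```python
import pandas as pd

# Counts copied from Out[38].
counts = pd.Series({
    "Asia": 16861, "Europe": 3732, "North America": 3292,
    "South America": 852, "Africa": 551, "Oceania": 192,
})

total = counts.sum()                  # total number of observations
relative = counts / total             # relative frequency of each label
percent = (relative * 100).round(2)   # the same values as percentages

print(total)
print(relative.round(6))
print(percent)
```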

Pie chart

- plt.pie converts the values to percentages automatically (they are shown when autopct is set)

- so pass the raw value_counts(), without normalize=True

- x is the data, given as a list (or array) of counts

- labels are also given as a list
In [22]: keys=visa_df['continent'].value_counts().keys()
values=visa_df['continent'].value_counts().values
values

Out[22]: array([16861, 3732, 3292, 852, 551, 192], dtype=int64)

In [34]: plt.pie(values,
labels=keys,
autopct="%0.3f%%",
explode=[0.1,0.1,0.1,0.1,0.1,0.1],
startangle=180, # rotation of the first wedge
radius=2) # size of the pie
plt.show()
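The explode list above has to contain one entry per label, so it can be generated from the data instead of typed by hand. A minimal sketch, assuming the counts from Out[22] and a non-interactive backend (the "Agg" backend and the one-decimal autopct format are choices made for this example):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

labels = ["Asia", "Europe", "North America", "South America", "Africa", "Oceania"]
values = [16861, 3732, 3292, 852, 551, 192]   # raw counts, not normalized

explode = [0.1] * len(values)                 # same offset for every wedge, whatever the number of labels

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(
    values,
    labels=labels,
    autopct="%0.1f%%",   # format of the percentage printed inside each wedge
    explode=explode,
    startangle=180,
)
print(len(wedges))
```

When autopct is given, pie returns three lists: the wedges, the label texts and the percentage texts.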
