
Exploratory Data Analysis Using Python

The document is a Jupyter notebook for exploratory data analysis using Python, specifically analyzing the Stack Overflow Developer Survey 2020 dataset. It details the process of downloading the dataset, loading it into a Pandas DataFrame, and preparing the data for analysis by selecting relevant columns and handling missing values. The analysis focuses on demographics, programming experience, and employment-related information of survey respondents.

Uploaded by

obrama59


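05/06/2023, 02:02 Exploratory Data Analysis using Python.ipynb - Colaboratory
https://colab.research.google.com/drive/1pILANvXLtIlJAjSdfl5q0Vxj9vqthN3T#scrollTo=yqxgBbUIECP5&printMode=true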

!pip install opendatasets --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

Requirement already satisfied: opendatasets in /usr/local/lib/python3.10/dist-packages (0.1.22)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from opendatasets) (4.65.0)
Requirement already satisfied: kaggle in /usr/local/lib/python3.10/dist-packages (from opendatasets) (1.5.13)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from opendatasets) (8.1.3)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.10/dist-packages (from kaggle->opendatasets) (1.1
Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from kaggle->opendatasets) (2022.
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.10/dist-packages (from kaggle->opendatasets
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from kaggle->opendatasets) (2.27
Requirement already satisfied: python-slugify in /usr/local/lib/python3.10/dist-packages (from kaggle->opendatasets)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.10/dist-packages (from kaggle->opendatasets) (1.26.
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.10/dist-packages (from python-slugify->
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.10/dist-packages (from requests->
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->kaggle->opend

import opendatasets as od

od.download('stackoverflow-developer-survey-2020')
Downloading https://raw.githubusercontent.com/JovianML/opendatasets/master/data/stackoverflow-developer-survey-2020/
94609408it [00:15, 5945838.42it/s]
Downloading https://raw.githubusercontent.com/JovianML/opendatasets/master/data/stackoverflow-developer-survey-2020/
16384it [00:00, 190956.92it/s]
Downloading https://raw.githubusercontent.com/JovianML/opendatasets/master/data/stackoverflow-developer-survey-2020/
8192it [00:00, 71149.66it/s]

Let's verify that the dataset was downloaded into the directory stackoverflow-developer-survey-2020 and retrieve the list of files in the
dataset.

import os


os.listdir('stackoverflow-developer-survey-2020')

['survey_results_public.csv', 'survey_results_schema.csv', 'README.txt']

You can browse through the downloaded files using the "File" > "Open" menu option in Jupyter. It seems like the dataset contains three files:

README.txt - Information about the dataset

survey_results_schema.csv - The list of questions, and shortcodes for each question
survey_results_public.csv - The full list of responses to the questions

Let's load the CSV files using the Pandas library. We'll use the name survey_raw_df for the data frame to indicate this is unprocessed data
that we might clean, filter, and modify to prepare a data frame ready for analysis.

import pandas as pd

survey_raw_df = pd.read_csv('stackoverflow-developer-survey-2020/survey_results_public.csv')

survey_raw_df

       Respondent  MainBranch                                         Hobbyist  Age   Age1stCode  CompFreq  CompTotal  ConvertedComp  Country             CurrencyDe
0      1           I am a developer by profession                     Yes       NaN   13          Monthly   NaN        NaN            Germany             European Eu
1      2           I am a developer by profession                     No        NaN   19          NaN       NaN        NaN            United Kingdom      Pound sterli
2      3           I code primarily as a hobby                        Yes       NaN   15          NaN       NaN        NaN            Russian Federation  Na
3      4           I am a developer by profession                     Yes       25.0  18          NaN       NaN        NaN            Albania             Albanian l
4      5           I used to be a developer by profession, but no...  Yes       31.0  16          NaN       NaN        NaN            United States       Na
...    ...         ...                                                ...       ...   ...         ...       ...        ...            ...                 ...
64456  64858       NaN                                                Yes       NaN   16          NaN       NaN        NaN            United States       Na
64457  64867       NaN                                                Yes       NaN   NaN         NaN       NaN        NaN            Morocco             Na
64458  64898       NaN                                                Yes       NaN   NaN         NaN       NaN        NaN            Viet Nam            Na
64459  64925       NaN                                                Yes       NaN   NaN         NaN       NaN        NaN            Poland              Na
64460  65112       NaN                                                Yes       NaN   NaN         NaN       NaN        NaN            Spain               Na

64461 rows × 61 columns

The dataset contains over 64,000 responses to 60 questions (although many questions are optional). The responses have been anonymized to
remove personally identifiable information, and each respondent has been assigned a randomized respondent ID.

Let's view the list of columns in the data frame.

survey_raw_df.columns
Index(['Respondent', 'MainBranch', 'Hobbyist', 'Age', 'Age1stCode', 'CompFreq',
'CompTotal', 'ConvertedComp', 'Country', 'CurrencyDesc',
'CurrencySymbol', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
'DevType', 'EdLevel', 'Employment', 'Ethnicity', 'Gender', 'JobFactors',
'JobSat', 'JobSeek', 'LanguageDesireNextYear', 'LanguageWorkedWith',
'MiscTechDesireNextYear', 'MiscTechWorkedWith',
'NEWCollabToolsDesireNextYear', 'NEWCollabToolsWorkedWith', 'NEWDevOps',
'NEWDevOpsImpt', 'NEWEdImpt', 'NEWJobHunt', 'NEWJobHuntResearch',
'NEWLearn', 'NEWOffTopic', 'NEWOnboardGood', 'NEWOtherComms',
'NEWOvertime', 'NEWPurchaseResearch', 'NEWPurpleLink', 'NEWSOSites',
'NEWStuck', 'OpSys', 'OrgSize', 'PlatformDesireNextYear',
'PlatformWorkedWith', 'PurchaseWhat', 'Sexuality', 'SOAccount',
'SOComm', 'SOPartFreq', 'SOVisitFreq', 'SurveyEase', 'SurveyLength',
'Trans', 'UndergradMajor', 'WebframeDesireNextYear',
'WebframeWorkedWith', 'WelcomeChange', 'WorkWeekHrs', 'YearsCode',
'YearsCodePro'],
dtype='object')

It appears that shortcodes for questions have been used as column names.

We can refer to the schema file to see the full text of each question. The schema file contains only two columns: Column and QuestionText.
We can load it as a Pandas Series with Column as the index and QuestionText as the value.

schema_fname = 'stackoverflow-developer-survey-2020/survey_results_schema.csv'
schema_raw = pd.read_csv(schema_fname, index_col='Column').QuestionText

schema_raw

Column
Respondent Randomized respondent ID number (not in order ...
MainBranch Which of the following options best describes ...
Hobbyist Do you code as a hobby?
Age What is your age (in years)? If you prefer not...
Age1stCode At what age did you write your first line of c...
...
WebframeWorkedWith Which web frameworks have you done extensive d...
WelcomeChange Compared to last year, how welcome do you feel...

WorkWeekHrs On average, how many hours per week do you wor...
YearsCode Including any education, how many years have y...
YearsCodePro NOT including education, how many years have y...
Name: QuestionText, Length: 61, dtype: object

We can now use schema_raw to retrieve the full question text for any column in survey_raw_df .

schema_raw['YearsCodePro']

'NOT including education, how many years have you coded professionally (as a part of your work)?'
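Looking up a shortcode that is not in the schema raises a KeyError. The sketch below shows one way to guard against that with Series.get. The small schema_demo series and the question_for helper are hypothetical stand-ins for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical two-row stand-in for the schema series (the real one has 61 entries).
schema_demo = pd.Series(
    {
        "Hobbyist": "Do you code as a hobby?",
        "YearsCodePro": "NOT including education, how many years have you coded "
                        "professionally (as a part of your work)?",
    },
    name="QuestionText",
)
schema_demo.index.name = "Column"


def question_for(column, schema=schema_demo):
    """Return the question text for a column, or a placeholder if it is unknown."""
    return schema.get(column, "<no question text found>")


print(question_for("Hobbyist"))
print(question_for("NotAColumn"))
```

In the notebook itself, the same helper could be called with schema=schema_raw.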

We've now loaded the dataset. We're ready to move on to the next step of preprocessing & cleaning the data for our analysis.

Save and upload your notebook

Whether you're running this Jupyter notebook online or on your computer, it's essential to save your work from time to time. You can continue
working on a saved notebook later or share it with friends and colleagues to let them execute your code. Jovian offers an easy way of saving
and sharing your Jupyter notebooks online.

# Select a project name

project='python-eda-stackoverflow-survey'

Data Preparation & Cleaning

While the survey responses contain a wealth of information, we'll limit our analysis to the following areas:

Demographics of the survey respondents and the global programming community

Distribution of programming skills, experience, and preferences
Employment-related information, preferences, and opinions


Let's select a subset of columns with the relevant data for our analysis.

selected_columns = [
    # Demographics
    'Country',
    'Age',
    'Gender',
    'EdLevel',
    'UndergradMajor',
    # Programming experience
    'Hobbyist',
    'Age1stCode',
    'YearsCode',
    'YearsCodePro',
    'LanguageWorkedWith',
    'LanguageDesireNextYear',
    'NEWLearn',
    'NEWStuck',
    # Employment
    'Employment',
    'DevType',
    'WorkWeekHrs',
    'JobSat',
    'JobFactors',
    'NEWOvertime',
    'NEWEdImpt'
]

len(selected_columns)


Let's extract a copy of the data from these columns into a new data frame survey_df. We can continue to modify it further without affecting the
original data frame.

survey_df = survey_raw_df[selected_columns].copy()

schema = schema_raw[selected_columns]

Let's view some basic information about the data frame.

survey_df.shape

(64461, 20)

survey_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64461 entries, 0 to 64460
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Country                 64072 non-null  object
 1   Age                     45446 non-null  float64
 2   Gender                  50557 non-null  object
 3   EdLevel                 57431 non-null  object
 4   UndergradMajor          50995 non-null  object
 5   Hobbyist                64416 non-null  object
 6   Age1stCode              57900 non-null  object
 7   YearsCode               57684 non-null  object
 8   YearsCodePro            46349 non-null  object
 9   LanguageWorkedWith      57378 non-null  object
 10  LanguageDesireNextYear  54113 non-null  object
 11  NEWLearn                56156 non-null  object
 12  NEWStuck                54983 non-null  object
 13  Employment              63854 non-null  object
 14  DevType                 49370 non-null  object
 15  WorkWeekHrs             41151 non-null  float64
 16  JobSat                  45194 non-null  object
 17  JobFactors              49349 non-null  object
 18  NEWOvertime             43231 non-null  object
 19  NEWEdImpt               48465 non-null  object
dtypes: float64(2), object(18)
memory usage: 9.8+ MB

Most columns have the data type object , either because they contain values of different types or contain empty values ( NaN ). It appears that
every column contains some empty values since the Non-Null count for every column is lower than the total number of rows (64461). We'll
need to deal with empty values and manually adjust the data type for each column on a case-by-case basis.

Only two of the columns were detected as numeric columns (Age and WorkWeekHrs), even though a few other columns have mostly numeric
values. To make our analysis easier, let's convert some other columns into numeric data types. Any value that cannot be parsed as a number
is converted to NaN.

survey_df['Age1stCode'] = pd.to_numeric(survey_df.Age1stCode, errors='coerce')

survey_df['YearsCode'] = pd.to_numeric(survey_df.YearsCode, errors='coerce')
survey_df['YearsCodePro'] = pd.to_numeric(survey_df.YearsCodePro, errors='coerce')
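To see what errors='coerce' does in isolation, here is a minimal sketch on a made-up series. The sample values are hypothetical and only mimic the kind of mixed text and numeric answers such a column might hold.

```python
import pandas as pd

# Hypothetical answers: numeric strings mixed with free-text options and a missing value.
years_demo = pd.Series(["5", "12", "Less than 1 year", "More than 50 years", None])

# Values that cannot be parsed as numbers become NaN instead of raising an error.
years_numeric = pd.to_numeric(years_demo, errors="coerce")

print(years_numeric)
print("Values lost to coercion or already missing:", years_numeric.isna().sum())
```

Comparing the non-null counts before and after the conversion is a quick way to check how many text answers were turned into NaN.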

Let's now view some basic statistics about numeric columns.

survey_df.describe()


       Age           Age1stCode    YearsCode     YearsCodePro  WorkWeekHrs
count  45446.000000  57473.000000  56784.000000  44133.000000  41151.000000
mean   30.834111     15.476572     12.782051     8.869667      40.782174
std    9.585392      5.114081      9.490657      7.759961      17.816383
min    1.000000      5.000000      1.000000      1.000000      1.000000
25%    24.000000     12.000000     6.000000      3.000000      40.000000
50%    29.000000     15.000000     10.000000     6.000000      40.000000
75%    35.000000     18.000000     17.000000     12.000000     44.000000
max    279.000000    85.000000     50.000000     50.000000     475.000000

There seems to be a problem with the age column, as the minimum value is 1 and the maximum is 279. This is a common issue with surveys:
responses may contain invalid values due to accidental or intentional errors while responding. A simple fix would be to ignore the rows where
the age is higher than 100 years or lower than 10 years as invalid survey responses. We can do this using the .drop method.

survey_df.drop(survey_df[survey_df.Age < 10].index, inplace=True)
survey_df.drop(survey_df[survey_df.Age > 100].index, inplace=True)

The same holds for WorkWeekHrs. Let's ignore entries where the value for the column is higher than 140 hours (~20 hours per day).

survey_df.drop(survey_df[survey_df.WorkWeekHrs > 140].index, inplace=True)
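The three .drop calls can also be written as a single boolean filter. The sketch below is an alternative formulation on a small made-up frame (demo_df and its values are hypothetical). Like the .drop version, it keeps rows where the value is missing and removes only rows with a value outside the plausible range.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering the edge cases: too young, too old, too many hours, missing.
demo_df = pd.DataFrame({
    "Age": [1.0, 25.0, 279.0, np.nan, 40.0],
    "WorkWeekHrs": [40.0, 475.0, 35.0, 50.0, np.nan],
})

# A row is acceptable if the value is missing or lies inside the plausible range.
age_ok = demo_df.Age.isna() | demo_df.Age.between(10, 100)
hours_ok = demo_df.WorkWeekHrs.isna() | (demo_df.WorkWeekHrs <= 140)

clean_df = demo_df[age_ok & hours_ok].copy()
print(clean_df)
```

Building the masks first makes it easy to count how many rows each rule removes (for example, (~age_ok).sum()) before committing to the filter.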

The gender column also allows for picking multiple options. We'll remove values containing more than one option to simplify our analysis.

survey_df['Gender'].value_counts()

Man                                                            45895
Woman                                                           3835
Non-binary, genderqueer, or gender non-conforming                385
Man;Non-binary, genderqueer, or gender non-conforming            121
Woman;Non-binary, genderqueer, or gender non-conforming           92
Woman;Man                                                         73
Woman;Man;Non-binary, genderqueer, or gender non-conforming       25
Name: Gender, dtype: int64

import numpy as np

survey_df.where(~(survey_df.Gender.str.contains(';', na=False)), np.nan, inplace=True)
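Note that the .where call above is applied to the whole data frame, so, as far as can be told from how pandas broadcasts a row-wise condition, it replaces every column with NaN for the matching rows, not only Gender. If the goal is to blank out just the multi-option gender values while keeping the rest of each response, a sketch like the following could be used instead. The demo_df frame is hypothetical, and this alternative would change the row counts shown later in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame: one single answer, one multi-option answer, one missing.
demo_df = pd.DataFrame({
    "Country": ["India", "Spain", "Poland"],
    "Gender": ["Man", "Woman;Man", None],
})

# True only for rows whose Gender value contains more than one option.
multi_gender = demo_df.Gender.str.contains(";", na=False)

# Blank out only the Gender cell; the other columns stay untouched.
demo_df.loc[multi_gender, "Gender"] = np.nan
print(demo_df)
```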


We've now cleaned up and prepared the dataset for analysis. Let's take a look at a sample of rows from the data frame.

survey_df.sample(10)


       Country         Age   Gender  EdLevel                                             UndergradMajor                                     Hobbyist  Age1stCode  YearsCode  YearsCodePro  ...
55323  France          27.0  Woman   Master’s degree (M.A., M.S., M.Eng., MBA, etc.)     Another engineering discipline (such as civil,...  Yes       13.0        6.0        4.0           Bash/S
35965  United States   NaN   NaN     Bachelor’s degree (B.A., B.S., B.Eng., etc.)        Computer science, computer engineering, or sof...  No        11.0        3.0        1.0
53520  United States   28.0  Man     Bachelor’s degree (B.A., B.S., B.Eng., etc.)        Computer science, computer engineering, or sof...  No        19.0        9.0        6.0
51707  South Africa    62.0  Man     Some college/university study without earning ...   Fine arts or performing arts (such as graphic ...  Yes       33.0        7.0        2.0
54551  India           22.0  Man     Bachelor’s degree (B.A., B.S., B.Eng., etc.)        Computer science, computer engineering, or sof...  Yes       16.0        7.0        3.0           C;HTML
24140  Poland          NaN   Man     Bachelor’s degree (B.A., B.S., B.Eng., etc.)        A business discipline (such as accounting, fin...  Yes       20.0        15.0       10.0
7061   India           28.0  Man     Bachelor’s degree (B.A., B.S., B.Eng., etc.)        Computer science, computer engineering, or sof...  Yes       17.0        10.0       6.0
3263   United States   40.0  Man     Bachelor’s degree (B.A., B.S., B.Eng., etc.)        Another engineering discipline (such as civil,...  Yes       20.0        14.0       12.0
24309  Sweden          23.0  Man     Some college/university study without earning ...   Web development or web design                      Yes       17.0        5.0        2.0           Bash/Sh
55423  United Kingdom  36.0  Man     Master’s degree (M.A., M.S., M.Eng., MBA, etc.)     Computer science, computer engineering, or sof...  Yes       15.0        13.0       10.0          C#;HTML

Exploratory Analysis and Visualization

Before we ask questions about the survey responses, it would help to understand the respondents' demographics, i.e., country, age, gender,
education level, employment level, etc. It's essential to explore these variables to understand how representative the survey is of the worldwide
programming community. A survey of this scale generally tends to have some selection bias.

Let's begin by importing matplotlib.pyplot and seaborn.

import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Country
Let's look at the number of countries from which there are responses in the survey and plot the fifteen countries with the highest number of
responses.

schema.Country

'Where do you live?'

survey_df.Country.nunique()

183

We can identify the countries with the highest number of respondents using the value_counts method.

top_countries = survey_df.Country.value_counts().head(15)
top_countries

United States         12371
India                  8364
United Kingdom         3881
Germany                3864
Canada                 2175
France                 1884
Brazil                 1804
Netherlands            1332
Poland                 1259
Australia              1199
Spain                  1157
Italy                  1115
Russian Federation     1085
Sweden                  879
Pakistan                802
Name: Country, dtype: int64

We can visualize this information using a bar chart.

plt.figure(figsize=(12,6))
plt.xticks(rotation=75)
plt.title(schema.Country)
sns.barplot(x=top_countries.index, y=top_countries);


It appears that a disproportionately high number of respondents are from the US and India, probably because the survey is in English, and these
countries have the highest English-speaking populations. We can already see that the survey may not be representative of the global
programming community: programmers from non-English-speaking countries are almost certainly underrepresented.

Exercise: Try finding the percentage of responses from English-speaking vs. non-English speaking countries. You can use this list of languages
spoken in different countries.

Age
The distribution of respondents' age is another crucial factor to look at. We can use a histogram to visualize it.

plt.figure(figsize=(12, 6))
plt.title(schema.Age)
plt.xlabel('Age')
plt.ylabel('Number of respondents')

plt.hist(survey_df.Age, bins=np.arange(10,80,5), color='purple');


It appears that a large percentage of respondents are 20-45 years old. It's somewhat representative of the programming community in general.
Many young people have taken up computer science as their field of study or profession in the last 20 years.

Exercise: You may want to filter out responses by age (or age group) if you'd like to analyze and compare the survey results for different age
groups. Create a new column called AgeGroup containing values like Less than 10 years , 10-18 years , 18-30 years , 30-45 years ,
45-60 years and Older than 60 years . Then, repeat the analysis in the rest of this notebook for each age group.
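One possible starting point for this exercise is pd.cut, which assigns each value to a labelled bin. The sketch below uses a made-up ages_demo frame. The choice of left-inclusive bin edges (so that an 18-year-old falls into "18-30 years" and a 60-year-old into "Older than 60 years") is an assumption, since the exercise does not say which group the boundary ages belong to.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including boundary values and a missing answer.
ages_demo = pd.DataFrame({"Age": [9.0, 15.0, 18.0, 29.0, 44.0, 59.0, 72.0, np.nan]})

age_bins = [0, 10, 18, 30, 45, 60, np.inf]
age_labels = [
    "Less than 10 years",
    "10-18 years",
    "18-30 years",
    "30-45 years",
    "45-60 years",
    "Older than 60 years",
]

# right=False makes each bin include its lower edge and exclude its upper edge.
ages_demo["AgeGroup"] = pd.cut(ages_demo.Age, bins=age_bins, labels=age_labels, right=False)
print(ages_demo)
```

Missing ages stay missing in AgeGroup. After the cleaning step that drops ages below 10, the first group would be expected to be empty.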


Gender
Let's look at the distribution of responses in the Gender column. It's a well-known fact that women and non-binary genders are underrepresented in the
programming community, so we might expect to see a skewed distribution here.

schema.Gender

'Which of the following describe you, if any? Please check all that apply. If you prefer not to answer, you may
leave this question blank.'

gender_counts = survey_df.Gender.value_counts()
gender_counts

Man 45895
Woman 3835
Non-binary, genderqueer, or gender non-conforming 385
Name: Gender, dtype: int64

A pie chart would be a great way to visualize the distribution.

plt.figure(figsize=(12,6))
plt.title(schema.Gender)
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=180);


Only about 8% of survey respondents who have answered the question identify as women or non-binary. This number is lower than the overall
percentage of women & non-binary genders in the programming community - which is estimated to be around 12%.

Exercise: It would be interesting to compare the survey responses & preferences across genders. Repeat this analysis with these breakdowns.
How do the relative education levels differ across genders? How do the salaries vary? You may find this analysis on the Gender Divide in Data
Science useful.

Education Level
Formal education in computer science is often considered an essential requirement for becoming a programmer. However, there are many free
resources & tutorials available online to learn programming. Let's compare the education levels of respondents to gain some insight into this.
We'll use a horizontal bar plot here.

sns.countplot(y=survey_df.EdLevel)
plt.xticks(rotation=75);
plt.title(schema['EdLevel'])
plt.ylabel(None);


It appears that well over half of the respondents hold a bachelor's or master's degree, so most programmers seem to have some college
education. However, it's not clear from this graph alone if they hold a degree in computer science.

Exercises: The graph currently shows the number of respondents for each option. Can you modify it to show the percentage instead? Further,
try comparing the percentages for each degree for men vs. women.
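For the first part of the exercise, one option is to compute the percentages with value_counts(normalize=True) before plotting. The sketch below runs on a small made-up series (ed_demo and its shortened labels are hypothetical) and shows only the calculation, not the chart.

```python
import pandas as pd

# Hypothetical education answers, including one missing response.
ed_demo = pd.Series(
    ["Bachelor's degree", "Master's degree", "Bachelor's degree", None, "Bachelor's degree"],
    name="EdLevel",
)

# normalize=True returns fractions of the non-missing answers; multiply by 100 for percentages.
ed_pct = ed_demo.value_counts(normalize=True) * 100
print(ed_pct)
```

The resulting series can then be passed to a bar plot in place of the raw counts. Note that missing answers are excluded from the denominator.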

Let's also plot undergraduate majors, but this time we'll convert the numbers into percentages and sort the values to make it easier to visualize
the order.

schema.UndergradMajor

'What was your primary field of study?'

undergrad_pct = survey_df.UndergradMajor.value_counts() * 100 / survey_df.UndergradMajor.count()

sns.barplot(x=undergrad_pct, y=undergrad_pct.index)

plt.title(schema.UndergradMajor)
plt.ylabel(None);
plt.xlabel('Percentage');


It turns out that 40% of programmers holding a college degree have a field of study other than computer science - which is very encouraging. It
seems to suggest that while a college education is helpful in general, you do not need to pursue a major in computer science to become a
successful programmer.

Exercises: Analyze the NEWEdImpt column for respondents who hold some college degree vs. those who don't. Do you notice any difference in
opinion?

Employment
Freelancing or contract work is a common choice among programmers, so it would be interesting to compare the breakdown between full-
time, part-time, and freelance work. Let's visualize the data from the Employment column.

schema.Employment

'Which of the following best describes your current employment status?'

(survey_df.Employment.value_counts(normalize=True, ascending=True)*100).plot(kind='barh', color='g')

plt.title(schema.Employment)
plt.xlabel('Percentage');

It appears that close to 10% of respondents are employed part-time or as freelancers.

Exercise: Add a new column EmploymentType containing the values Enthusiast (student or not employed but looking for work),
Professional (employed full-time, part-time or freelancing), and Other (not employed or retired). For each of the graphs that follow, show a
comparison between Enthusiast and Professional .
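A dictionary lookup with Series.map is one way to approach this exercise. In the sketch below, the answer strings in employment_type_map are assumptions for illustration and should be checked against survey_df.Employment.unique() before use. The emp_demo frame is hypothetical.

```python
import pandas as pd

# ASSUMPTION: these answer strings are illustrative; verify them against the real column.
employment_type_map = {
    "Student": "Enthusiast",
    "Not employed, but looking for work": "Enthusiast",
    "Employed full-time": "Professional",
    "Employed part-time": "Professional",
    "Independent contractor, freelancer, or self-employed": "Professional",
    "Not employed, and not looking for work": "Other",
    "Retired": "Other",
}

# Hypothetical sample with one missing answer.
emp_demo = pd.DataFrame({"Employment": ["Student", "Employed full-time", "Retired", None]})

# Answers that are missing or not in the mapping become NaN.
emp_demo["EmploymentType"] = emp_demo.Employment.map(employment_type_map)
print(emp_demo)
```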


The DevType field contains information about the roles held by respondents. Since the question allows multiple answers, the column contains
lists of values separated by a semi-colon ; , making it a bit harder to analyze directly.

schema.DevType

'Which of the following describe you? Please select all that apply.'

survey_df.DevType.value_counts()

Developer, full-stack                                                                                                                                                          4396
Developer, back-end                                                                                                                                                            3056
Developer, back-end;Developer, front-end;Developer, full-stack                                                                                                                 2214
Developer, back-end;Developer, full-stack                                                                                                                                      1465
Developer, front-end                                                                                                                                                           1390
                                                                                                                                                                               ...
Database administrator;Developer, back-end;Developer, front-end;Developer, full-stack;Developer, QA or test;Senior executive/VP                                                   1
Database administrator;Developer, back-end;Developer, front-end;Developer, full-stack;Product manager;Senior executive/VP                                                         1
Developer, back-end;Developer, full-stack;Developer, mobile;DevOps specialist;Educator;System administrator                                                                       1
Data or business analyst;Database administrator;Developer, back-end;Developer, desktop or enterprise applications;Developer, front-end;Developer, mobile;Engineering manager     1
Data or business analyst;Developer, mobile;Senior executive/VP;System administrator                                                                                               1
Name: DevType, Length: 8213, dtype: int64

Let's define a helper function that turns a column containing lists of values (like survey_df.DevType ) into a data frame with one column for
each possible option.

def split_multicolumn(col_series):
    result_df = col_series.to_frame()
    options = []
    # Iterate over the non-null values of the column
    # (.items() replaces the deprecated .iteritems())
    for idx, value in col_series[col_series.notnull()].items():
        # Break each value into a list of options
        for option in value.split(';'):
            # Add the option as a new column, initialized to False
            if option not in result_df.columns:
                options.append(option)
                result_df[option] = False
            # Mark the value in the option column as True
            result_df.at[idx, option] = True
    return result_df[options]

dev_type_df = split_multicolumn(survey_df.DevType)
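For comparison, pandas also offers a built-in, vectorized way to split delimiter-separated values into indicator columns: Series.str.get_dummies. The sketch below is an alternative to split_multicolumn, not the notebook's method, and runs on a made-up roles_demo series.

```python
import pandas as pd

# Hypothetical multi-select answers, including a missing response.
roles_demo = pd.Series([
    "Developer, back-end;Developer, full-stack",
    None,
    "Designer;Developer, full-stack",
])

# One 0/1 column per distinct option; missing answers become all-zero rows.
roles_df = roles_demo.str.get_dummies(sep=";").astype(bool)
print(roles_df)
print(roles_df.sum())
```

One difference to keep in mind: get_dummies orders the columns alphabetically, while split_multicolumn keeps them in the order in which the options first appear.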

dev_type_df


       Developer, desktop or enterprise applications  Developer, full-stack  Developer, mobile  Designer  Developer, front-end  Developer, back-end  Developer, QA or test  DevOps specialist  Developer, game or graphics
0      True                                           True                   False              False     False                 False                False                  False              False
1      False                                          True                   True               False     False                 False                False                  False              False
2      False                                          False                  False              False     False                 False                False                  False              False
3      False                                          False                  False              False     False                 False                False                  False              False
4      False                                          False                  False              False     False                 False                False                  False              False
...    ...                                            ...                    ...                ...       ...                   ...                  ...                    ...                ...
64456  False                                          False                  False              False     False                 False                False                  False              False
64457  False                                          False                  False              False     False                 False                False                  False              False
64458  False                                          False                  False              False     False                 False                False                  False              False
64459  False                                          False                  False              False     False                 False                False                  False              False
64460  False                                          False                  False              False     False                 False                False                  False              False

64306 rows × 23 columns

The dev_type_df has one column for each option that can be selected as a response. If a respondent has chosen an option, the
corresponding column's value is True. Otherwise, it is False.

We can now use the column-wise totals to identify the most common roles.

dev_type_totals = dev_type_df.sum().sort_values(ascending=False)
dev_type_totals

Developer, back-end 26996

Developer, full-stack 26915
Developer, front-end 18128
Developer, desktop or enterprise applications 11687
Developer, mobile 9406
DevOps specialist 5915
Database administrator 5658
Designer 5262
System administrator 5185
Developer, embedded applications or devices 4701
Data or business analyst 3970
Data scientist or machine learning specialist 3939
Developer, QA or test 3893

Engineer, data 3700
Academic researcher 3502
Educator 2895
Developer, game or graphics 2751
Engineering manager 2699
Product manager 2471
Scientist 2060
Engineer, site reliability 1921
Senior executive/VP 1292
Marketing or sales professional 625
dtype: int64

As one might expect, the most common roles include "Developer" in the name.

Exercises:

Can you figure out what percentage of respondents work in roles related to data science?
Which positions have the highest percentage of women?
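As a hint for the first exercise, the boolean role columns can be combined with .any(axis=1) to flag respondents holding at least one role from a chosen set. The sketch below uses a made-up dev_demo frame. Which roles count as "related to data science" is a judgment call, so the data_roles list is an assumption.

```python
import pandas as pd

# Hypothetical four-respondent slice of a boolean role table like dev_type_df.
dev_demo = pd.DataFrame({
    "Data scientist or machine learning specialist": [True, False, False, False],
    "Data or business analyst": [True, True, False, False],
    "Engineer, data": [False, False, False, False],
    "Developer, back-end": [False, True, True, False],
})

# ASSUMPTION: these three roles are treated as data-science related for this sketch.
data_roles = [
    "Data scientist or machine learning specialist",
    "Data or business analyst",
    "Engineer, data",
]

# True for respondents who selected at least one of the chosen roles.
in_data_role = dev_demo[data_roles].any(axis=1)

# The mean of a boolean series is the fraction of True values.
pct_data = in_data_role.mean() * 100
print(f"{pct_data:.1f}% of respondents hold a data-related role")
```

Here the denominator is every row, including respondents who did not answer the question. Restricting it to rows with at least one True value would give the share among those who answered.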

We've only explored a handful of columns from the 20 columns that we selected. Explore and visualize the remaining columns using the empty
cells below.


Asking and Answering Questions


We've already gained several insights about the respondents and the programming community by exploring individual columns of the dataset.
Let's ask some specific questions and try to answer them using data frame operations and visualizations.

Q: What are the most popular programming languages in 2020?

To answer this, we can use the LanguageWorkedWith column. Similar to DevType, respondents were allowed to choose multiple options here.

survey_df.LanguageWorkedWith

0 C#;HTML/CSS;JavaScript
1 JavaScript;Swift
2 Objective-C;Python;Swift
3 NaN
4 HTML/CSS;Ruby;SQL
...
64456 NaN
64457 Assembly;Bash/Shell/PowerShell;C;C#;C++;Dart;G...
64458 NaN
64459 HTML/CSS
64460 C#;HTML/CSS;Java;JavaScript;SQL
Name: LanguageWorkedWith, Length: 64306, dtype: object

First, we'll split this column into a data frame containing one column for each language listed in the options.

languages_worked_df = split_multicolumn(survey_df.LanguageWorkedWith)

<ipython-input-56-26a445763b0d>:5: FutureWarning: iteritems is deprecated and will be removed in a future version. U
for idx, value in col_series[col_series.notnull()].iteritems():
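
The FutureWarning above comes from the helper's call to Series.iteritems, which pandas 2.0 removed in favor of Series.items. As a minimal sketch, and not the notebook's own helper, a similar one-column-per-option boolean data frame can be built with Series.str.get_dummies. The function name split_multicolumn_sketch and the sample responses are assumptions made for illustration. This variant orders the columns alphabetically rather than by first appearance.

```python
import numpy as np
import pandas as pd


def split_multicolumn_sketch(col_series, sep=";"):
    """Hypothetical stand-in for the notebook's split_multicolumn helper."""
    # One indicator column per distinct option; missing responses
    # become rows in which every option is False.
    return col_series.str.get_dummies(sep=sep).astype(bool)


# Made-up responses in the same "a;b;c" format as LanguageWorkedWith.
sample = pd.Series(["C#;HTML/CSS;JavaScript", "JavaScript;Swift", np.nan, "HTML/CSS"])
sample_df = split_multicolumn_sketch(sample)
print(sample_df)
```

Because no explicit Python loop is involved, this approach also sidesteps the iteritems/items question entirely.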

languages_worked_df

       C#     HTML/CSS  JavaScript  Swift  Objective-C  Python  Ruby   SQL    Java   PHP    ...  VBA    Perl   Scala  C++
0      True   True      True        False  False        False   False  False  False  False  ...  False  False  False  False
1      False  False     True        True   False        False   False  False  False  False  ...  False  False  False  False
2      False  False     False       True   True         True    False  False  False  False  ...  False  False  False  False
3      False  False     False       False  False        False   False  False  False  False  ...  False  False  False  False
4      False  True      False       False  False        False   True   True   False  False  ...  False  False  False  False
...    ...    ...       ...         ...    ...          ...     ...    ...    ...    ...    ...  ...    ...    ...    ...
64456  False  False     False       False  False        False   False  False  False  False  ...  False  False  False  False
64457  True   True      True        True   True         True    True   True   True   True   ...  True   True   True   True
64458  False  False     False       False  False        False   False  False  False  False  ...  False  False  False  False
64459  False  True      False       False  False        False   False  False  False  False  ...  False  False  False  False
64460  True   True      True        False  False        False   False  True   True   False  ...  False  False  False  False

64306 rows × 25 columns

It appears that a total of 25 languages were included among the options. Let's aggregate these to identify the percentage of respondents who
selected each language.

languages_worked_percentages = languages_worked_df.mean().sort_values(ascending=False) * 100
languages_worked_percentages

JavaScript 59.893323
HTML/CSS 55.801947
SQL 48.444935
Python 39.001026
Java 35.618760
Bash/Shell/PowerShell 29.239884
C# 27.803004
PHP 23.130035
TypeScript 22.461357
C++ 21.114670
C 19.236152
Go 7.758219
Kotlin 6.887382
Ruby 6.229590
Assembly 5.447392
VBA 5.394520
Swift 5.226573
R 5.064846
Rust 4.498803
Objective-C 3.603085
Dart 3.517557
Scala 3.150561
Perl 2.757130
Haskell 1.861413
Julia 0.782198
dtype: float64
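
Taking the mean of a boolean column works here because True counts as 1 and False as 0, so the mean is the fraction of rows that are True. A small sketch on made-up data (the frame toy_df and its values are illustrative assumptions, not survey results):

```python
import pandas as pd

# Made-up boolean frame: one row per respondent, one column per language.
toy_df = pd.DataFrame({
    "Python": [True, False, True, True],
    "Rust": [False, False, True, False],
})

# Fraction of True values per column, scaled to a percentage.
toy_percentages = toy_df.mean().sort_values(ascending=False) * 100
print(toy_percentages)
```

Rows that are False in every column (for example, respondents who skipped the question) still count in the denominator, so each figure is a percentage of all rows in the frame.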

We can plot this information using a horizontal bar chart.

plt.figure(figsize=(12, 12))
sns.barplot(x=languages_worked_percentages, y=languages_worked_percentages.index)
plt.title("Languages used in the past year");
plt.xlabel('Percentage of respondents');

Perhaps unsurprisingly, JavaScript & HTML/CSS come out on top, as web development is one of today's most sought-after skills. It also happens
to be one of the easiest fields to get started in. SQL is necessary for working with relational databases, so it's no surprise that nearly half
of the respondents work with SQL regularly. Python seems to be the popular choice for other forms of development, beating out Java, which was
the industry standard for server & application development for over two decades.

Exercises:

What are the most common languages used by students? How does the list compare with the most common languages used by
professional developers?
What are the most common languages among respondents who do not describe themselves as "Developer, front-end"?
What are the most common languages among respondents who work in fields related to data science?
What are the most common languages used by developers older than 35 years of age?
What are the most common languages used by developers in your home country?

Q: Which languages are most people interested in learning over the next year?

For this, we can use the LanguageDesireNextYear column, with similar processing as the previous one.

languages_interested_df = split_multicolumn(survey_df.LanguageDesireNextYear)
languages_interested_percentages = languages_interested_df.mean().sort_values(ascending=False) * 100
languages_interested_percentages

<ipython-input-56-26a445763b0d>:5: FutureWarning: iteritems is deprecated and will be removed in a future version. U
for idx, value in col_series[col_series.notnull()].iteritems():

Python 41.143906
JavaScript 40.425466
HTML/CSS 32.028116
SQL 30.799614
TypeScript 26.451653
C# 21.058688
Java 20.464653
Go 19.432090
Bash/Shell/PowerShell 18.057413
Rust 16.270643
C++ 15.014151
Kotlin 14.760676
PHP 10.947657
C 9.359935
Swift 8.692812
Dart 7.308805
R 6.571704
Ruby 6.425528
Scala 5.326097
Haskell 4.593662
Assembly 3.766367
Julia 2.540976
Objective-C 2.338818
Perl 1.761888
VBA 1.611047
dtype: float64

plt.figure(figsize=(12, 12))
sns.barplot(x=languages_interested_percentages, y=languages_interested_percentages.index)
plt.title("Languages people are interested in learning over the next year");
plt.xlabel('Percentage of respondents');

Once again, it's not surprising that Python is the language most people are interested in learning, since it is an easy-to-learn general-purpose
programming language well suited for a variety of domains: application development, numerical computing, data analysis, machine learning,
big data, cloud automation, web scraping, scripting, etc. We're using Python for this very analysis, so we're in good company!

Exercises: Repeat the exercises from the previous question, replacing "most common languages" with "languages people are interested in
learning/using."

Q: Which are the most loved languages, i.e., languages that a high percentage of their current users want to continue learning
& using over the next year?

While this question may seem tricky at first, it's straightforward to solve using Pandas array operations. Here's what we can do:

Create a new data frame languages_loved_df that contains a True value for a language only if the corresponding values in
languages_worked_df and languages_interested_df are both True
Take the column-wise sum of languages_loved_df and divide it by the column-wise sum of languages_worked_df to get the
percentage of respondents who "love" the language
Sort the results in decreasing order and plot a horizontal bar graph

languages_loved_df = languages_worked_df & languages_interested_df

languages_loved_percentages = (languages_loved_df.sum() * 100 / languages_worked_df.sum()).sort_values(ascending=False)

plt.figure(figsize=(12, 12))
sns.barplot(x=languages_loved_percentages, y=languages_loved_percentages.index)
plt.title("Most loved languages");
plt.xlabel("Percentage of the language's users");
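
To make the "loved" calculation concrete, here is a minimal sketch on made-up data. The frames worked and interested stand in for languages_worked_df and languages_interested_df; their values are invented for illustration.

```python
import pandas as pd

# Made-up stand-ins for languages_worked_df and languages_interested_df.
worked = pd.DataFrame({
    "Rust": [True, True, False, False],
    "VBA": [True, True, True, True],
})
interested = pd.DataFrame({
    "Rust": [True, True, True, False],
    "VBA": [True, False, False, False],
})

# True only where a respondent both used the language and wants to keep using it.
loved = worked & interested

# Share of each language's current users who want to continue with it.
loved_percentages = (loved.sum() * 100 / worked.sum()).sort_values(ascending=False)
print(loved_percentages)
```

The third respondent's interest in Rust does not count toward its score because they have not used it. The denominator is the number of people who worked with the language, not the total number of respondents.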

Rust has been Stack Overflow's most-loved language for five years in a row as of the 2020 survey. The second most-loved language is TypeScript,
a popular alternative to JavaScript for web development.

Python features at number 3, despite already being one of the most widely used languages in the world. Python has a solid foundation, is easy
to learn & use, has a large ecosystem of domain-specific libraries, and has a massive worldwide community.

Exercises: What are the most dreaded languages, i.e., languages that people have used in the past year but do not want to learn/use over the
next year? Hint: ~languages_interested_df.

Q: In which countries do developers work the highest number of hours per week? Consider countries with more than 250
responses only.

To answer this question, we'll need to use the groupby data frame method to aggregate the rows for each country. We'll also need to filter the
results to only include the countries with more than 250 respondents.

countries_df = survey_df.groupby('Country')[['WorkWeekHrs']].mean().sort_values('WorkWeekHrs', ascending=False)

high_response_countries_df = countries_df.loc[survey_df.Country.value_counts() > 250].head(15)

high_response_countries_df

               WorkWeekHrs
Country
Iran           44.337748
Israel         43.915094
China          42.150000
United States  41.802982
Greece         41.402724
Viet Nam       41.391667
South Africa   41.023460
Turkey         40.982143
Sri Lanka      40.612245
New Zealand    40.457551
Belgium        40.444444
Canada         40.208837
Hungary        40.194340
Bangladesh     40.097458
India          40.090603

Asian countries such as Iran, Israel, and China report the highest average working hours, followed by the United States. However, there isn't
much variation overall: the averages for all fifteen countries listed fall between about 40 and 44 hours per week.
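
The filter used for high_response_countries_df relies on countries_df and survey_df.Country.value_counts() sharing the same index labels. An alternative sketch computes the mean and the response count in a single groupby and filters on the count explicitly. The toy frame, the column name responses, and the threshold of 1 are assumptions for illustration; the notebook uses a threshold of 250.

```python
import pandas as pd

# Made-up survey rows; None marks a respondent who skipped WorkWeekHrs.
toy_survey = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "C"],
    "WorkWeekHrs": [40.0, 50.0, None, 35.0, 45.0, 60.0],
})

summary = toy_survey.groupby("Country").agg(
    WorkWeekHrs=("WorkWeekHrs", "mean"),  # missing values are skipped by mean
    responses=("WorkWeekHrs", "size"),    # size counts every row in the group
)

# Keep countries with more than 1 response (the notebook uses 250).
high_response = (
    summary[summary.responses > 1]
    .sort_values("WorkWeekHrs", ascending=False)
)
print(high_response)
```

Keeping the count next to the mean also makes it easy to see how many responses each average is based on.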

Exercises:

How do the average work hours compare across continents? You may find this list of countries in each continent useful.
Which role has the highest average number of hours worked per week? Which one has the lowest?
How do the hours worked compare between freelancers and developers working full-time?

Q: How important is it to start young to build a career in programming?

Let's create a scatter plot of Age vs. YearsCodePro (i.e., years of professional coding experience) to answer this question.

schema.YearsCodePro

'NOT including education, how many years have you coded professionally (as a part of your work)?'

sns.scatterplot(x='Age', y='YearsCodePro', hue='Hobbyist', data=survey_df)

plt.xlabel("Age")
plt.ylabel("Years of professional coding experience");

You can see points all over the graph, which indicates that you can start programming professionally at any age. Many people who have been
coding for several decades professionally also seem to enjoy it as a hobby.
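
One way to put a number on this observation, offered as a sketch rather than as part of the original analysis, is to estimate the age at which each respondent started coding professionally as Age minus YearsCodePro. The toy data below is made up, and the column name StartAgePro is hypothetical.

```python
import pandas as pd

# Made-up respondents; None marks a missing YearsCodePro answer.
toy = pd.DataFrame({
    "Age": [25.0, 40.0, 52.0, 33.0],
    "YearsCodePro": [3.0, 5.0, 30.0, None],
})

# Approximate age at the start of professional coding;
# rows with a missing value stay NaN.
toy["StartAgePro"] = toy["Age"] - toy["YearsCodePro"]
print(toy["StartAgePro"].describe())
```

A wide spread in this derived column would support the reading that people enter the profession at many different ages.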

We can also view the distribution of the Age1stCode column to see when the respondents tried programming for the first time.

plt.title(schema.Age1stCode)
sns.histplot(x=survey_df.Age1stCode, bins=30, kde=True);

As you might expect, most people seem to have had some exposure to programming before the age of 40. However, there are people of all
ages and walks of life learning to code.

Exercises:

How does programming experience change opinions & preferences? Repeat the entire analysis while comparing the responses of people
who have more than ten years of professional programming experience vs. those who don't. Do you see any interesting trends?
Compare the years of professional coding experience across different genders.

Hopefully, you are already thinking of many more questions you'd like to answer using this data. Use the empty cells below to ask and answer
more questions.

Let's save and commit our work before continuing.

import jovian

jovian.commit()

[jovian] Detected Colab notebook...

[jovian] jovian.commit() is no longer required on Google Colab. If you ran this notebook from Jovian,
then just save this file in Colab using Ctrl+S/Cmd+S and it will be updated on Jovian.
Also, you can also delete this cell, it's no longer necessary.

Inferences and Conclusions

We've drawn many inferences from the survey. Here's a summary of a few of them:

Based on the survey respondents' demographics, we can infer that the survey is somewhat representative of the overall programming
community. However, it has fewer responses from programmers in non-English-speaking countries and women & non-binary genders.

The programming community is not as diverse as it can be. Although things are improving, we should make more efforts to support &
encourage underrepresented communities, whether in terms of age, country, race, gender, or otherwise.

Although most programmers hold a college degree, a reasonably large percentage did not have computer science as their college major.
Hence, a computer science degree isn't compulsory for learning to code or building a career in programming.

A significant percentage of programmers either work part-time or as freelancers, which can be a great way to break into the field,
especially when you're just getting started.

JavaScript & HTML/CSS are the most used programming languages in 2020, closely followed by SQL & Python.

Python is the language most people are interested in learning - since it is an easy-to-learn general-purpose programming language well
suited for various domains.

Rust and TypeScript are the most "loved" languages in 2020, both of which have small but fast-growing communities. Python is a close
third, despite already being a widely used language.

Programmers worldwide seem to be working for around 40 hours a week on average, with slight variations by country.

You can learn and start programming professionally at any age. You're likely to have a long and fulfilling career if you also enjoy
programming as a hobby.

Exercises
There's a wealth of information to be discovered using the survey, and we've barely scratched the surface. Here are some ideas for further
exploration:

Repeat the analysis for different age groups & genders, and compare the results
Pick a different set of columns (we chose 20 out of 65) to analyze other facets of the data
Prepare an analysis focusing on diversity - and identify areas where underrepresented communities are at par with the majority (e.g.,
education) and where they aren't (e.g., salaries)
Compare the results of this year's survey with the previous years and identify interesting trends

References and Future Work

Check out the following resources to learn more about the dataset and tools used in this notebook:

Stack Overflow Developer Survey: https://insights.stackoverflow.com/survey
Pandas user guide: https://pandas.pydata.org/docs/user_guide/index.html
Matplotlib user guide: https://matplotlib.org/3.3.1/users/index.html
Seaborn user guide & tutorial: https://seaborn.pydata.org/tutorial.html
opendatasets Python library: https://github.com/JovianML/opendatasets

As a next step, you can try out a project on another dataset of your choice: https://jovian.ml/aakashns/zerotopandas-course-project-starter

import jovian

jovian.commit()

[jovian] Detected Colab notebook...


