Python for Data Science

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
Topics covered so far
1. Python Foundations
a. NumPy and Pandas - Operations & Functions
b. Introduction to visualization
c. Common libraries for visualization
2. Exploratory Data Analysis
a. Univariate and Multivariate Analysis
b. Missing values treatment
c. Working with outliers

Gauge Your Understanding
1. What are the different libraries for data manipulation in Python?
2. What are the key operations that can be performed using NumPy & Pandas?

Key Libraries for Data Manipulation - NumPy & Pandas

NumPy
• Stands for Numerical Python
• Fundamental package for scientific computing
• Provides a powerful N-dimensional array object - ndarray
• Offers linear algebra, vector calculus, and random number generation capabilities, etc.

Pandas
● Extremely useful for data manipulation and exploratory analysis
● Offers two major data structures - Series & DataFrame
● A DataFrame is made up of several Series - each column of a DataFrame is a Series
● In a DataFrame, each column can have its own data type, unlike a NumPy array, in which all entries share the same data type
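The difference in data type handling can be seen in a minimal sketch like the one below. The column names and values are made up for illustration and are not part of the course material.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype: mixing ints and floats upcasts every entry to float64.
arr = np.array([1, 2.5, 3])
print(arr.dtype)

# A DataFrame keeps a separate dtype per column (hypothetical sample data).
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen"],
    "age": [31, 45, 28],
    "height_m": [1.62, 1.80, 1.75],
})
print(df.dtypes)

# Selecting one column returns a Series.
age_column = df["age"]
print(type(age_column))
```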

NumPy Key Operations
NumPy provides many useful operations for data manipulation. Some of the most commonly used
operations and functions of NumPy are:

| Operation | NumPy function |
|---|---|
| Declare a NumPy array or convert a list into a NumPy array | array() |
| Reshape an n-dimensional array without changing the data inside the array | reshape() |
| Concatenate two or more arrays along a specified axis | concatenate() |
| Create evenly spaced elements in an interval, particularly useful while working with loops | arange(), linspace() |
| Work with matrices and perform different operations on them | dot(), transpose(), eye() |
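The functions listed above can be tried out with a short sketch such as the following. The array values are arbitrary and chosen only to keep the shapes easy to follow.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])          # array(): list -> ndarray
m = a.reshape(2, 3)                        # reshape(): same data, now 2 rows x 3 columns
stacked = np.concatenate([m, m], axis=0)   # concatenate(): join along rows -> shape (4, 3)

steps = np.arange(0, 10, 2)                # arange(): start, stop (exclusive), step
grid = np.linspace(0.0, 1.0, 5)            # linspace(): 5 evenly spaced points, endpoints included

identity = np.eye(3)                       # eye(): 3x3 identity matrix
product = np.dot(m, identity)              # dot(): (2x3) times (3x3) -> (2x3), unchanged by identity
flipped = np.transpose(m)                  # transpose(): shape (3, 2)

print(stacked.shape, steps, grid, flipped.shape)
```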

Pandas Key Operations
Pandas is one of the most widely used data manipulation tools and is built on top of NumPy. Some of the
commonly used operations and functions of Pandas are:

| Operation | Pandas function |
|---|---|
| Load or import the data from different sources/formats | read_csv(), read_excel(), read_html(), read_json() |
| Information about the data - dimension, column dtypes, non-null values and memory usage | info() |
| View of basic statistical details of numeric data - quartiles, min, max, mean, std | describe() |
| Merge two data frames with different types of join - inner join, left join, right join, and full outer join | merge() |
| Explore data frames by different groups, and apply summary functions on each group | groupby() |
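A minimal sketch of these Pandas functions is shown below. The CSV text, table names, and column names are hypothetical; in practice read_csv() would usually be given a file path rather than an in-memory string.

```python
import io
import pandas as pd

# Hypothetical CSV content; the last row has a missing salary.
csv_text = """emp_id,dept,salary
1,Sales,50000
2,Sales,60000
3,IT,75000
4,IT,
"""
employees = pd.read_csv(io.StringIO(csv_text))

employees.info()                 # dimensions, column dtypes, non-null counts, memory usage
summary = employees.describe()   # count, mean, std, min, quartiles, max for numeric columns

# Second hypothetical table, joined on the shared "dept" column.
locations = pd.DataFrame({"dept": ["Sales", "HR"], "city": ["Pune", "Delhi"]})
inner = employees.merge(locations, on="dept", how="inner")   # only departments present in both
outer = employees.merge(locations, on="dept", how="outer")   # departments from either table

# Mean salary per department; missing salaries are skipped by default.
mean_salary = employees.groupby("dept")["salary"].mean()
print(mean_salary)
```

The `how` argument of merge() also accepts "left" and "right" for the other two join types named in the table.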

Gauge Your Understanding
1. What do you understand by data visualization and why is it important?
2. What are the commonly used libraries for data visualization in Python?
3. How do you choose the right visualization for your analysis?

Introduction to Visualization

What is Data Visualization?
● Visual representation of data
● Helps to observe and communicate patterns & trends with the naked eye

Why is Data Visualization important?
● Data visualization helps to communicate information in a manner that is universal, fast, and effective
● Communicating insights to non-technical decision makers is one of the most critical phases in a data science project

Common Libraries for Visualization

Matplotlib
● Matplotlib is one of the most popular libraries for data visualization
● It provides high-quality graphics and a variety of plots such as histograms, bar charts, pie charts, etc.
● Some important functions - plot(), hist(), bar(), pie(), scatter(), text(), legend(), etc.

Seaborn
● Seaborn is complementary to Matplotlib and specifically targets statistical data visualization
● A saying around Matplotlib and Seaborn is, “Matplotlib tries to make easy things easy and hard things possible, Seaborn tries to make a well-defined set of hard things easy too.”
● Some important functions - displot(), boxplot(), stripplot(), pairplot()
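As a rough illustration of how the two libraries work together, the sketch below draws a Matplotlib scatter plot and a Seaborn box plot side by side. The data is made up, and the non-interactive "Agg" backend is an assumption so the script also runs on machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove to open a window with plt.show()
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small made-up dataset (not from the course material).
df = pd.DataFrame({
    "total_bill": [10.5, 20.0, 15.2, 30.1, 25.4, 12.3],
    "tip": [1.5, 3.0, 2.0, 5.0, 4.0, 2.0],
    "time": ["Lunch", "Dinner", "Lunch", "Dinner", "Dinner", "Lunch"],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Matplotlib: scatter() draws the points, legend() adds the label box.
ax1.scatter(df["total_bill"], df["tip"], label="bills")
ax1.set_xlabel("total_bill")
ax1.set_ylabel("tip")
ax1.legend()

# Seaborn: boxplot() works directly on DataFrame columns and draws on a Matplotlib axis.
sns.boxplot(data=df, x="time", y="tip", ax=ax2)

fig.tight_layout()
```

Calling `fig.savefig("plots.png")` afterwards would write the figure to a file; the file name is only an example.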

Which visualization to use (1/2)
There are numerous types of plots available in Matplotlib and Seaborn, and each one suits a specific kind of
data. Choosing the right visualization for the right purpose is very important.

| Type | X variable | Y variable | Purpose of analysis | Type of chart | Example |
|---|---|---|---|---|---|
| Univariate | Continuous | - | How are the values of the X variable distributed? | Histogram, Distribution plot | Distribution of cholesterol ranges; distribution of horsepower of cars |
| Univariate | Categorical | - | What is the count of observations in each category of the X variable? | Count plot | What is the count of employees for each type of degree in an organization? |
| Bivariate | Continuous | Continuous | How is Y correlated with X? | Scatter plot | How does the tip vary with the total bill? |
| Bivariate | Time related (months, hours, etc.) | Continuous | How does Y change over time? | Line plot | How do sales vary on different days? |
| Bivariate | Continuous | Categorical | How does the range of X vary for various category levels? | Box plot, Swarm plot | How does the tip vary at lunch and dinner? How do tips vary with the day of the week? |
| Bivariate | Categorical | Categorical | What is the number or % of records of X that fall under each category of Y? | Stacked bar plot | What is the percentage of smokers and non-smokers across fitness levels? |

Note: Univariate plots can also be used to visualize relationships among two or more variables by using arguments like ‘hue’ in the plot.
Which visualization to use (2/2)
Multivariate analysis is used to study the interaction between more than two variables. Exploring more
combinations of variables helps to extract deeper insights that cannot be observed with univariate or bivariate
analysis. Examples: Correlation, Regression analysis, etc.

| Type | Variables | Purpose of analysis | Type of chart | Example |
|---|---|---|---|---|
| Multivariate | Continuous (more than two) | How to visualize relationships across multiple combinations of variables? | Pair plot | Relation between three variables - horsepower, weight, and acceleration |
| Multivariate | Continuous (more than two) | How to visualize the spread of values in the data with color-encoding? | Heatmap | Correlation matrix for three variables - horsepower, weight, and acceleration |

Note: Pair plot and heatmap can also be used with only two variables but are generally preferred and more useful
for visualizing more than two variables.

Gauge Your Understanding
1. What do you mean by Exploratory Data Analysis and why do you need it?
2. What do you mean by Data Preprocessing and why do you need it?
3. What are the steps involved in doing Exploratory Data Analysis?
4. What are two important steps involved in data preprocessing?

Exploratory Data Analysis (EDA)

What is EDA?
● Combination of visualization techniques and statistical methods
● Exploring and summarizing key information within the data

Why EDA?
● First step in any analysis
● Gain good insights into the data
● Uncover the underlying structure of the data
● Detect and figure out the best strategy to handle unclean data (missing values, outliers, etc.)
● Identify an initial set of observations and insights

What and Why of Data Preprocessing?
Data preprocessing refers to the process of preparing the raw data into a structured format before building a machine
learning model.
● Raw data is often incomplete, inconsistent, and has many other flaws
● This makes it unsuitable for any statistical analysis
○ It might lead to wrong insights
○ Decisions taken from this data can be counterproductive for the organization
● So you can’t directly go from data to insights

In the real world, the journey looks more like this:

Raw Data → Data Preprocessing → Structured Data → Exploratory Data Analysis → Insights, Visualizations
Steps of EDA

1. Overview of Data: Gain a basic understanding of the data - shape, data types, etc.
2. Summary Statistics: Check descriptive statistics about the data - mean, std, median, etc.
3. Univariate Analysis: Check the distribution of variables in the data, missing values, outliers
4. Bivariate Analysis: Find the patterns or relationships between different variables
5. Multivariate Analysis: Explore more combinations of variables to unearth deeper insights
6. Key fixes and summarize: Identify and do the key fixes in the data. Finally, summarize the key findings from EDA
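The first few EDA steps can be sketched with Pandas as follows. The dataset and column names are hypothetical and chosen only to illustrate each step; a real analysis would add plots for the univariate and bivariate stages.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset used only for illustration.
df = pd.DataFrame({
    "horsepower": [130.0, 165.0, 150.0, np.nan, 95.0, 88.0],
    "weight": [3504, 3693, 3436, 3433, 2372, 2130],
    "origin": ["usa", "usa", "usa", "usa", "japan", "japan"],
})

# Overview of data: shape and data types.
print(df.shape)
print(df.dtypes)

# Summary statistics for the numeric columns.
print(df.describe())

# Univariate checks: missing values per column and counts per category.
missing_per_column = df.isna().sum()
origin_counts = df["origin"].value_counts()

# Bivariate / multivariate checks: pairwise correlation and a group-wise summary.
corr = df[["horsepower", "weight"]].corr()
mean_weight_by_origin = df.groupby("origin")["weight"].mean()
print(corr)
```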

Missing Values

What are missing values?

● Missing values occur when no data value is stored for the variable in an observation

● Missing values can have a significant effect on the inferences that are drawn from the data

What are different types of missing values?

● Values that are not actually missing and represent some information about the data

○ Example: Number of hours an employee works every day. A missing value can mean that the employee was absent or on leave that particular day.

● Values that are actually missing and provide no information about the data

○ Example: A weighing scale that is running out of batteries. Some data will simply be missing randomly.

How to deal with missing values?

● If values are not actually missing, then we can replace the missing values with the value they actually represent in
the data

○ In the example above, we can replace all the missing working hours of an employee with 0

● If values are actually missing, then we must explore the importance and extent of missing values in the data

○ If the variable has a large percentage of missing values (say more than 70%) and is not significant for our
analysis, then we can drop that variable

○ If the percentage of missing values is small or the variable is significant for our analysis, then we can replace
the missing values in that variable using:

■ Mean or median of that variable if the variable is continuous

■ Mode of that variable if the variable is categorical

■ Sometimes we use functions like min, max, etc. to replace the missing values depending on the dataset
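The strategies above can be sketched in Pandas as shown below. The data is hypothetical, and the 70% cut-off simply reuses the example figure from the text; whether a column is dropped should also depend on how important it is for the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical data: hours_worked is NaN when the employee was on leave.
df = pd.DataFrame({
    "hours_worked": [8.0, np.nan, 7.5, 8.0],
    "salary": [50.0, 60.0, np.nan, 100.0],
    "dept": ["IT", None, "IT", "Sales"],
    "notes": [np.nan, np.nan, np.nan, "ok"],
})

# Not actually missing: replace with the value it represents (0 hours on leave).
df["hours_worked"] = df["hours_worked"].fillna(0)

# Drop columns whose share of missing values exceeds the threshold.
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.70].index)

# Continuous variable: impute with the median (the mean is the other common choice).
df["salary"] = df["salary"].fillna(df["salary"].median())

# Categorical variable: impute with the mode (most frequent value).
df["dept"] = df["dept"].fillna(df["dept"].mode()[0])

print(df)
```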

Outliers
What are outliers?

Outliers are the observations which are very different from the other observations. In order to analyze the data, sometimes
we have to work with outliers.

How to detect outliers?

1. Box plot: We can visualize the outliers by using a box plot.

2. Scatter plot: We can check for outliers using a scatter plot by identifying the data points that lie far from the other
observations.

How to deal with outliers?

Handling outliers depends on the business problem we are trying to solve, but some general practices are as follows:

● We should analyze outliers before treating them

● If an outlier represents the general trend, then there is no need to treat it

○ Example: Income is generally a skewed variable but all extreme points might not be outliers

● If we decide to treat the outlier after analyzing it, then:

○ We can drop them but we would lose information in other columns of the data

○ We can cap outliers at certain values, say at the 5th percentile on the low end and the 95th percentile on the high end

○ We can set thresholds using the IQR (commonly Q1 - 1.5 × IQR and Q3 + 1.5 × IQR) and remove the observations that fall outside them
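A minimal sketch of the IQR and capping approaches is given below. The income values are made up, and the 1.5 × IQR multiplier is the conventional choice rather than something fixed by the course material.

```python
import pandas as pd

# Hypothetical income values with one extreme point.
income = pd.Series([30, 32, 35, 36, 38, 40, 41, 45, 48, 200], name="income")

# IQR rule: flag points more than 1.5 * IQR below Q1 or above Q3.
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (income < lower) | (income > upper)
without_outliers = income[~is_outlier]

# Capping: clip values to the 5th and 95th percentiles instead of dropping rows.
p05, p95 = income.quantile(0.05), income.quantile(0.95)
capped = income.clip(lower=p05, upper=p95)

print(is_outlier.sum(), without_outliers.max(), capped.max())
```

Dropping removes the whole observation, while capping keeps every row and only changes the extreme values.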

Case Study
Uber Case Study

Appendix

Scatter Plot
● A scatter plot uses dots to represent values for two different numeric variables.
● The position of each dot on the horizontal and vertical axis indicates values for an individual data point.
● Scatter plots are used to observe relationships between continuous variables.

● This plot shows the relationship between the tip and the total bill.
● We can say that if the total bill is large, the tip can also be large.

Bar Plot
● A bar chart is a chart that presents categorical data with rectangular bars with heights or lengths proportional to the
values that they represent.
● The bars can be plotted vertically or horizontally.

● Most of the students celebrated their birthday in June.
● In August, very few students celebrated their birthdays.

Stacked Bar plot
Stacked Bar plots are used to show how a larger category is divided into smaller categories and what relationship each
category of one variable has with each category of another variable.

● This plot shows the percentage of smokers and non-smokers for different fitness levels.
● The percentage of smokers is very high for people with very poor fitness.
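The percentages behind such a stacked bar plot can be computed with a cross-tabulation. The sketch below uses made-up records and hypothetical column names.

```python
import pandas as pd

# Hypothetical records: one row per person.
df = pd.DataFrame({
    "fitness": ["poor", "poor", "poor", "good", "good", "good", "good"],
    "smoker": ["yes", "yes", "no", "no", "no", "yes", "no"],
})

# Row-wise percentages: each fitness level sums to 100.
pct = pd.crosstab(df["fitness"], df["smoker"], normalize="index") * 100
print(pct)
```

With Matplotlib installed, `pct.plot(kind="bar", stacked=True)` would draw the table as a stacked bar plot.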

Line Plot
A line graph is a graphical display of information that changes continuously over time.

● This plot shows the relationship between the sales and the number of days.
● We can say that sales were the highest on day 7.

Histogram and skewness in data
● A histogram is a graphical display of data using bars of different heights.
● In a histogram, each bar groups numbers into ranges

Skewness refers to distortion or asymmetry relative to a symmetrical bell curve in a set of data

● If the longer tail of the curve extends to the left, so the bulk of the values sits on the right, it is called left skewed. (leftmost curve in the below fig.)
● If the longer tail of the curve extends to the right, so the bulk of the values sits on the left, it is called right skewed. (rightmost curve in the below fig.)
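Skewness can also be checked numerically. The sketch below uses the common convention that a right-skewed distribution has its longer tail on the right and a positive skewness value; the numbers are made up.

```python
import pandas as pd

right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 20])  # one large value stretches the right tail
left_skewed = -right_skewed                              # mirror image: tail on the left
symmetric = pd.Series([1, 2, 3, 4, 5])

# Series.skew() is positive for right skew, negative for left skew, near zero for symmetric data.
print(right_skewed.skew(), left_skewed.skew(), symmetric.skew())
```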

Count Plot
Count plot shows the count of observations in each category of a categorical variable using bars. A count
plot can be thought of as a histogram across a categorical, instead of continuous, variable.

● This plot shows the count of employees for each type of degree in an organization.
● We can see that the majority of the employees have a bachelor's degree, followed by a master's degree.

Box Plot
A box plot is a type of chart often used in exploratory data analysis to visualize the distribution of numerical data and get an
idea about the skewness and outliers in the data by displaying the items included in the five-number summary. The five-number
summary includes:

● The minimum
● Q1 (the first quartile, or the 25% mark)
● The median (the second quartile, or the 50% mark)
● Q3 (the third quartile, or the 75% mark)
● The maximum
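The five values listed above can be computed directly with NumPy, as in this small sketch on made-up data.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])  # hypothetical data

minimum = values.min()
q1, median, q3 = np.percentile(values, [25, 50, 75])  # quartiles via linear interpolation
maximum = values.max()

print(minimum, q1, median, q3, maximum)  # 1 3.0 5.0 7.0 9
```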

● This plot shows how the tip varies at lunch and dinner times in a restaurant.
● We can see that the median value of the tip is larger at the time of dinner.

Swarm Plot
A swarm plot is like a categorical scatter plot with non-overlapping points. The data points are adjusted so that they
don’t overlap. This gives a better representation of the distribution and spread of values.

● This plot shows the amount of tip for each day in the data.
● We can see that the largest number of tips are on Saturday and Sunday.
● The largest tip amount is on Saturday.
● The most common tip on all days is 2 dollars.

Distribution Plot
A distribution plot is a method for visualizing the distribution of observations in data. Relative to a histogram, a
distribution plot can produce a graph that is less cluttered and more interpretable, especially when drawing
multiple distributions

● This plot shows the distribution of horsepower for different types of cars.
● We can see that the distribution is slightly right skewed.
● The majority of values are less than 100.
● The range of values is high. It varies from less than 50 to approximately 250.

Pair Plot

● It is used to visualize relationships across multiple combinations of variables in a dataset.
● It gives a square matrix of plots where each numeric variable in the data is shared across the y-axes of a single row and the x-axes of a single column.
● The diagonal plots are univariate distribution plots.
● The plot in the figure shows the pairwise relation between three variables of the mpg data from Seaborn - horsepower, weight, and acceleration.
● We can see that horsepower is positively correlated with weight and negatively correlated with acceleration.
● The distribution plot for acceleration shows that it follows an approximately normal distribution.
Heatmap

● It is used to visualize the spread of values as a rectangular table using color-encoding to highlight very low and very high values.

● The plot in the figure shows the heatmap of the correlation coefficients between three variables - horsepower, weight, and acceleration.

● The plot shows that acceleration is negatively correlated with horsepower and weight.

● The variable horsepower is positively correlated with weight.
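The correlation matrix that such a heatmap displays can be computed with Pandas. The values below are made up to loosely resemble car data and are not taken from the mpg dataset.

```python
import pandas as pd

# Made-up values; heavier, more powerful cars are given lower acceleration figures.
cars = pd.DataFrame({
    "horsepower": [130, 165, 150, 95, 88, 70],
    "weight": [3504, 3693, 3436, 2372, 2130, 1950],
    "acceleration": [12.0, 11.5, 11.0, 15.0, 14.5, 17.0],
})

# Pairwise Pearson correlation coefficients between the numeric columns.
corr = cars.corr()
print(corr.round(2))
```

With Seaborn installed, `sns.heatmap(corr, annot=True)` would draw this matrix as a color-encoded heatmap, and `sns.pairplot(cars)` would draw the corresponding pair plot.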

Happy Learning!
