MLS 1 - Python for Data Science
Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
Topics covered so far
1. Python Foundations
a. Numpy and Pandas - Operations & Functions
b. Introduction to visualization
c. Common libraries for visualization
2. Exploratory Data Analysis
a. Univariate and Multivariate Analysis
b. Missing values treatment
c. Working with outliers
Gauge Your Understanding
1. What are the different libraries for data manipulation in Python?
2. What are the key operations that can be performed using NumPy & Pandas?
Key Libraries for Data Manipulation - NumPy & Pandas
Numpy
• Numerical Python
• Fundamental package for scientific computing
• A powerful N-dimensional array object - ndarray
• Useful for linear algebra, vector calculus, random number generation, etc.
Pandas
● Extremely useful for data manipulation and exploratory analysis
● Offers two major data structures - Series & DataFrame
● A DataFrame is made up of several Series - Each column of a DataFrame is a Series
● In a DataFrame, each column can have its own data type, unlike a NumPy array, in which all entries
share the same data type
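The difference between the two structures can be checked directly. The following is a minimal sketch with invented example data (the column names and values are hypothetical, not from the course material):

```python
import numpy as np
import pandas as pd

# A NumPy array holds one dtype; mixed numeric inputs are coerced to a common type.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# Hypothetical example data: each DataFrame column keeps its own dtype.
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen"],
    "age": [31, 45, 28],
    "height_m": [1.62, 1.80, 1.75],
})
print(df.dtypes)

# Selecting a single column of a DataFrame returns a Series.
age = df["age"]
print(type(age).__name__)  # Series
```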
NumPy Key Operations
NumPy provides many useful operations for data manipulation. Some of the most commonly used
operations and functions of NumPy are:
● Reshape an n-dimensional array without changing the data inside the array: reshape()
● Create evenly spaced elements in an interval, particularly useful while working with loops: arange(), linspace()
● Work with matrices and perform different operations on them: dot(), transpose(), eye()
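A short sketch of these NumPy functions on small arrays (the variable names are chosen here for illustration):

```python
import numpy as np

a = np.arange(6)                 # integers 0..5, stop value excluded
m = a.reshape(2, 3)              # same data arranged as 2 rows x 3 columns
pts = np.linspace(0.0, 1.0, 5)   # 5 evenly spaced points, both ends included

identity = np.eye(3)             # 3x3 identity matrix
mt = np.transpose(m)             # rows and columns swapped, shape (3, 2)
prod = np.dot(m, identity)       # multiplying by the identity leaves m unchanged

print(m)
print(pts)
print(mt.shape)
```

Note that arange() takes a step size while linspace() takes the number of points, so linspace() is usually the safer choice when the end point must be included.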
Pandas Key Operations
Pandas is one of the most widely used data manipulation tools and is built on top of NumPy. Some of the
commonly used operations and functions of Pandas are:
● Load or import the data from different sources/formats: read_csv(), read_excel(), read_html(), read_json()
● Information about the data - dimension, column dtypes, non-null values and memory usage: info()
● View of basic statistical details of numeric data - quartiles, min, max, mean, std: describe()
● Merge two data frames with different types of join - inner join, left join, right join, and full outer join: merge()
● Explore data frames by different groups, and apply summary functions on each group: groupby()
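The sketch below exercises info(), describe(), merge() and groupby() on two small frames. The data is invented and built in memory so the example is self-contained; in practice it would usually be loaded with a reader such as read_csv().

```python
import pandas as pd

# Hypothetical data, invented for illustration.
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [250.0, 100.0, 300.0, 50.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "city": ["Pune", "Delhi", "Delhi"],
})

orders.info()               # dimensions, column dtypes, non-null counts, memory usage
print(orders.describe())    # count, mean, std, min, quartiles, max

# Inner join keeps only customer_ids present in both frames.
inner = orders.merge(customers, on="customer_id", how="inner")
# Left join keeps every order; orders without a matching customer get NaN for city.
left = orders.merge(customers, on="customer_id", how="left")

# Total order amount per city.
per_city = inner.groupby("city")["amount"].sum()
print(per_city)
```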
Gauge Your Understanding
1. What do you understand by data visualization and why is it important?
2. What are the commonly used libraries for data visualization in Python?
3. How do you choose the right visualization for your analysis?
Introduction to Visualization
Common Libraries for Visualization
Which visualization to use (1/2)
There are numerous types of plots available in Matplotlib and Seaborn, and each is suited to specific
kinds of data. Choosing the right visualization for the right purpose is very important.
Analysis | X variable | Y variable | Question | Plot | Example
Bivariate | Continuous | Continuous | How is Y correlated with X? | Scatter plot | How does the tip vary with the total bill?
Bivariate | Time-related (months, hours, etc.) | Continuous | How does Y change over time? | Line plot | How do sales vary on different days?
Bivariate | Continuous | Categorical | How does the range of X vary across category levels? | Box plot, Swarm plot | How does the tip vary between lunch and dinner? How do tips vary with the day of the week?
Bivariate | Categorical | Categorical | What number or % of records of X fall under each category of Y? | Stacked bar plot | What is the percentage of smokers and non-smokers across fitness levels?
Note: Univariate plots can also be used to visualize relationships among two or more variables by using arguments like ‘hue’ in the plot.
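As a rough illustration of matching a plot to the variable types, the sketch below draws a scatter plot (continuous vs continuous) and a box plot (continuous vs categorical) with Matplotlib. The restaurant data is invented, and the off-screen "Agg" backend is an assumption made so the code runs without a display. Seaborn offers higher-level equivalents such as scatterplot() and boxplot(), which also accept the hue argument mentioned in the note.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical restaurant data, invented for illustration.
df = pd.DataFrame({
    "total_bill": [10.5, 23.0, 15.2, 40.1, 18.7, 30.3],
    "tip": [1.5, 3.5, 2.0, 6.0, 3.0, 4.5],
    "time": ["Lunch", "Dinner", "Lunch", "Dinner", "Lunch", "Dinner"],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Continuous vs continuous: scatter plot of tip against total bill.
ax1.scatter(df["total_bill"], df["tip"])
ax1.set_xlabel("total_bill")
ax1.set_ylabel("tip")

# Continuous vs categorical: one box of tips per category level of "time".
labels = sorted(df["time"].unique())
groups = [df.loc[df["time"] == lab, "tip"].to_numpy() for lab in labels]
ax2.boxplot(groups)
ax2.set_xticks(range(1, len(labels) + 1))
ax2.set_xticklabels(labels)
ax2.set_ylabel("tip")

fig.tight_layout()
```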
Which visualization to use (2/2)
Multivariate analysis is used to study the interaction between more than two variables. Exploring more
combinations of variables helps to extract deeper insights that could not be observed with univariate or bivariate
analysis. Examples: Correlation, Regression analysis, etc.
Note: Pair plot and heatmap can also be used with only two variables but are generally preferred and more useful
for visualizing more than two variables.
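A heatmap is typically drawn from a correlation matrix. The sketch below computes one with pandas on invented data (the column names are hypothetical); the resulting table is what a heatmap would color-code.

```python
import pandas as pd

# Hypothetical numeric data, invented for illustration.
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
    "tip": [1.0, 2.1, 2.9, 4.2, 5.0],
    "wait_minutes": [30, 25, 22, 15, 10],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()
print(corr.round(2))
```

Values close to +1 or -1 indicate a strong linear relationship between a pair of variables; values near 0 indicate a weak one.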
Gauge Your Understanding
1. What do you mean by Exploratory Data Analysis and why do you need it?
2. What do you mean by Data Preprocessing and why do you need it?
3. What are the steps involved in doing Exploratory Data Analysis?
4. What are two important steps involved in data preprocessing?
Exploratory Data Analysis (EDA)
What and Why of Data Preprocessing?
Data preprocessing refers to the process of converting raw data into a structured format before building a machine
learning model.
● Raw data is often incomplete, inconsistent and has many other flaws
● This makes it unsuitable for any statistical analysis
○ It might lead to wrong insights
○ Decisions taken from this data can be counterproductive for the organization
● So you can’t directly go from data to insights
Raw Data → Data Preprocessing → Structured Data → Insights, Visualizations
Steps of EDA
● Overview of Data: Gain a basic understanding of the data - shape, data types, etc.
● Key fixes and summarize: Identify and make the key fixes in the data. Finally, summarize the key findings from EDA.
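The overview step can be sketched with a few pandas calls. The ride data below is invented for illustration and is not the case-study dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ride data, invented for illustration.
df = pd.DataFrame({
    "pickup_hour": [8, 9, 18, 23],
    "fare": [120.5, np.nan, 310.0, 95.0],
    "borough": ["A", "B", "A", None],
})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
print(df.head(2))  # first two rows

# Number of missing values per column, a common first check before any fixes.
missing = df.isnull().sum()
print(missing)
```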
Missing Values
● Missing values occur when no data value is stored for the variable in an observation
● Missing values can have a significant effect on the inferences that are drawn from the data
● Some values are not actually missing and represent some information about the data
○ Example: Number of hours an employee works every day. Missing values can mean that the employee was absent
or on leave on that particular day.
● Other values are actually missing and provide no information about the data
○ Example: A weighing scale that is running out of batteries. Some data will simply be missing at random.
How to deal with missing values?
● If values are not actually missing, then we can replace the missing values with the value they actually represent in
the data
○ In the example above, we can replace all the missing working hours of an employee with 0
● If values are actually missing, then we must explore the importance and extent of missing values in the data
○ If the variable has a large percentage of missing values (say more than 70%) and is not significant for our
analysis, then we can drop that variable
○ If the percentage of missing values is small or the variable is significant for our analysis, then we can replace
the missing values in that variable using:
■ Sometimes we use functions like min, max, etc. to replace the missing values depending on the dataset
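The rules above can be sketched in pandas as follows. The data is invented, and the choice of the median as the replacement value is an assumption made for this example; the right statistic depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "hours_worked": [8.0, np.nan, 7.5, np.nan, 9.0],    # NaN means absent or on leave
    "weight_kg": [70.2, 68.9, np.nan, 71.5, 69.4],      # NaN is a genuine gap
    "notes": [np.nan, np.nan, np.nan, np.nan, "late"],  # 80% missing
})

# Not actually missing: an absent employee worked 0 hours.
df["hours_worked"] = df["hours_worked"].fillna(0)

# Fraction of missing values in each column.
missing_share = df.isnull().mean()

# Drop columns with more than 70% missing (assumed not significant here).
to_drop = missing_share[missing_share > 0.70].index
df = df.drop(columns=to_drop)

# Assumption: fill the remaining gaps with the column median.
df["weight_kg"] = df["weight_kg"].fillna(df["weight_kg"].median())
print(df)
```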
Outliers
What are outliers?
Outliers are observations that are very different from the other observations. To analyze the data, we sometimes
have to work with outliers.
How to deal with outliers?
How outliers are handled depends on the business problem we are trying to solve, but some general practices are as follows:
○ Example: Income is generally a skewed variable but all extreme points might not be outliers
○ We can drop them but we would lose information in other columns of the data
○ We can cap outliers at certain values, say 5th percentile or 95th percentile
○ We can set a threshold using IQR and remove the outliers greater than that threshold value
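A minimal sketch of the IQR threshold and percentile capping on an invented income series. The 1.5 multiplier on the IQR is a common convention and an assumption here; the slides do not specify a value.

```python
import pandas as pd

# Hypothetical incomes (in thousands), invented for illustration.
income = pd.Series([32, 35, 38, 40, 41, 43, 45, 47, 50, 250], name="income")

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles (assumed multiplier).
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (income < lower) | (income > upper)

# Option 1: drop the flagged observations.
trimmed = income[~is_outlier]

# Option 2: keep every row but cap values at the 5th and 95th percentiles.
capped = income.clip(lower=income.quantile(0.05), upper=income.quantile(0.95))

print(is_outlier.sum(), trimmed.max(), capped.max())
```

Capping keeps the row, so information in other columns of a DataFrame is not lost, whereas dropping removes the whole observation.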
Case Study
Uber Case Study
Appendix
Scatter Plot
● A scatter plot uses dots to represent values for two different numeric variables.
● The position of each dot on the horizontal and vertical axis indicates values for an individual data point.
● Scatter plots are used to observe relationships between continuous variables.
Bar Plot
● A bar chart is a chart that presents categorical data with rectangular bars with heights or lengths proportional to the
values that they represent.
● The bars can be plotted vertically or horizontally.
Stacked Bar plot
Stacked Bar plots are used to show how a larger category is divided into smaller categories and what relationship each
category of one variable has with each category of another variable.
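The numbers behind a stacked bar plot are a cross-tabulation of two categorical variables. The sketch below builds one with pd.crosstab() on invented data (the column names are hypothetical).

```python
import pandas as pd

# Hypothetical survey data, invented for illustration.
df = pd.DataFrame({
    "fitness": ["Low", "Low", "High", "High", "High", "Low"],
    "smoker": ["Yes", "No", "No", "No", "Yes", "Yes"],
})

# Row-wise percentages: share of smokers and non-smokers within each fitness level.
pct = pd.crosstab(df["fitness"], df["smoker"], normalize="index") * 100
print(pct.round(1))

# With Matplotlib installed, pct.plot(kind="bar", stacked=True) draws the stacked bars.
```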
Line Plot
A line graph is a graphical display of information that changes continuously over time.
Histogram and skewness in data
● A histogram is a graphical display of data using bars of different heights.
● In a histogram, each bar groups numbers into ranges
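● If the tail of the distribution extends to the left, with most values concentrated on the right, it is called left skewed. (leftmost curve in the figure below)
● If the tail of the distribution extends to the right, with most values concentrated on the left, it is called right skewed. (rightmost curve in the figure below)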
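Skewness can also be checked numerically. In this sketch on invented data, pandas' Series.skew() is positive when the long tail is on the right and negative when it is on the left.

```python
import pandas as pd

# Hypothetical values with a long right tail, invented for illustration.
right_skewed = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 10, 20])
# Mirror image of the same data, which has a long left tail.
left_skewed = right_skewed.max() - right_skewed

print(right_skewed.skew())  # positive: tail on the right
print(left_skewed.skew())   # negative: tail on the left

# In a right-skewed sample the mean is pulled above the median by the tail.
print(right_skewed.mean(), right_skewed.median())
```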
Count Plot
Count plot shows the count of observations in each category of a categorical variable using bars. A count
plot can be thought of as a histogram across a categorical, instead of continuous, variable.
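The bar heights of a count plot are simply the category counts, which pandas can compute with value_counts(). A small sketch on invented data:

```python
import pandas as pd

# Hypothetical categorical data, invented for illustration.
days = pd.Series(["Sat", "Sun", "Sat", "Thu", "Sat", "Sun"], name="day")

# Number of observations in each category, sorted from most to least frequent.
counts = days.value_counts()
print(counts)
```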
Box Plot
A box plot is a type of chart often used in exploratory data analysis to visualize the distribution of numerical data and get an
idea about the skewness and outliers in the data by displaying the items included in the five-number summary. The five-number
summary includes:
● The minimum
● Q1 (the first quartile, or the 25% mark)
● The median (the second quartile, or the 50% mark)
● Q3 (the third quartile, or the 75% mark)
● The maximum
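The five values listed above can be computed with np.percentile(). This is a minimal sketch on an invented sample, using NumPy's default linear interpolation between data points.

```python
import numpy as np

# Hypothetical sample, invented for illustration.
data = np.array([2, 4, 4, 5, 7, 8, 9, 11, 12])

# 0th, 25th, 50th, 75th and 100th percentiles.
minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])
print(minimum, q1, median, q3, maximum)  # 2.0 4.0 7.0 9.0 12.0
```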
Swarm Plot
A swarm plot is like a categorical scatterplot with non-overlapping points. The data points are adjusted so that they
don’t overlap. This gives a better representation of the distribution and spread of values.
Distribution Plot
A distribution plot is a method for visualizing the distribution of observations in data. Relative to a histogram, a
distribution plot can produce a graph that is less cluttered and more interpretable, especially when drawing
multiple distributions.
Pair Plot
Happy Learning !