Unit - III - EDA

Data Science

What is Data Science?


 Data science is the deep study of massive amounts of data. It involves extracting
meaningful insights from raw, structured, and unstructured data using the
scientific method, different technologies, and algorithms.
 It is a multidisciplinary field that uses tools and techniques to manipulate data so that we
can find something new and meaningful.
 In short, we can say that data science is all about:
 Asking the correct questions and analyzing the raw data.
 Modeling the data using various complex and efficient algorithms.
 Visualizing the data to get a better perspective.
 Understanding the data to make better decisions and finding the final result.

Data Science Lifecycle :


 The life-cycle of data science consists of 6 stages.

1. Discovery:

 The first phase is discovery, which involves asking the right questions.
 When we start any data science project, we need to determine the basic
requirements, priorities, and project budget.
 In this phase, we need to determine all the requirements of the project, such as the number of
people, technology, time, data, and the end goal; we can then frame the business problem at a
first hypothesis level.
2. Data preparation:

 Data preparation is also known as Data Munging (Transformation).


 In this phase, we need to perform the following tasks:
1. Data cleaning
2. Data Reduction
3. Data integration
4. Data transformation
 After performing all the above tasks, we can easily use this data for our further processes.
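As a rough illustration, the four preparation tasks above can be sketched in pandas; the column names and values below are hypothetical, not taken from the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a duplicate row and a missing Salary value
df = pd.DataFrame({
    "Name": ["Ann", "Ben", "Ben", "Cara"],
    "Salary": [50000.0, 60000.0, 60000.0, np.nan],
    "Dept": ["HR", "IT", "IT", "HR"],
})

# 1. Data cleaning: drop exact duplicates, fill missing salaries with the mean
df = df.drop_duplicates()
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())

# 2. Data reduction: keep only the columns needed downstream
df = df[["Name", "Salary"]]

# 3. Data integration: merge in a second (hypothetical) source
bonus = pd.DataFrame({"Name": ["Ann", "Ben", "Cara"], "Bonus": [1000, 2000, 1500]})
df = df.merge(bonus, on="Name")

# 4. Data transformation: derive a new column
df["Total"] = df["Salary"] + df["Bonus"]
print(df)
```

Each step maps to one of the four tasks listed above; real projects would of course use the project's own data sources.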

3. Model Planning:

 In this phase, we need to determine the various methods and techniques to establish the
relations between input variables.
 We will apply Exploratory Data Analysis (EDA), using various statistical formulas and
visualization tools, to understand the relations between variables and to see what the data
can tell us.
 Common tools used for model planning are:
1. SQL Analysis Services
2. R
3. SAS
4. Python

4. Model-building:

 In this phase, the process of model building starts.


 We will create datasets for training and testing purposes.
 We will apply different techniques, such as association, classification, and clustering, to build
the model.
 Following are some common Model building tools:
1. SAS Enterprise Miner
2. WEKA
3. SPSS Modeler
4. MATLAB
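The training and testing datasets mentioned above can be sketched with a simple random split; the 80/20 ratio and the toy data below are assumptions for illustration, not part of the original text.

```python
import pandas as pd

# Hypothetical dataset of 10 labeled rows
df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

# 80/20 train/test split; a fixed random_state keeps it reproducible
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

print(len(train), len(test))
```

The held-out test rows are those not sampled into the training set, so the two sets never overlap.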

5. Operationalize:

 In this phase, we will deliver the final reports of the project, along with briefings, code, and
technical documents.
 This phase gives us a clear overview of the complete project's performance and other
components on a small scale before full deployment.

6. Communicate results:

 In this phase, we check whether we reached the goal set in the initial phase. We
communicate the findings and the final result to the business team.
Exploratory Data Analysis(EDA)
What is Data Analysis:

Data Analysis is a process of inspecting, cleaning, transforming, and modeling data to


discover useful information for business decision-making.

Steps for Data Analysis, Data Manipulation and Data Visualization:

 Transform Raw Data in a Desired Format


 Clean the Transformed Data (Step 1 and 2 also called as a Pre-processing of Data)
 Prepare a Model
 Analyze Trends and Make Decisions

Exploratory Data Analysis(EDA)

 Exploratory Data Analysis (EDA) is the first step in the data analysis process.
 EDA was developed by John Tukey in the 1970s.

What is Exploratory Data Analysis ?

 Exploratory Data Analysis (EDA) is the process of understanding data sets by summarizing their main
characteristics, often by plotting them visually.
 This step is very important, especially when we arrive at modeling the data in order to apply
machine learning.
 Plotting in EDA consists of histograms, box plots, scatter plots, and many more.
 It often takes much time to explore the data.
 Exploratory Data Analysis helps us to –
 Give insight into a data set.
 Understand the underlying structure.
 Extract important parameters and relationships that hold between them.
 Test underlying assumptions

How to perform Exploratory Data Analysis ?


 There is no single common method for performing EDA.
 How EDA is performed depends on the dataset we are working with.
Example:
Consider a data set related to employees.
This dataset contains 1000 rows and 8 columns.

1. Importing the required libraries for EDA

 import pandas as pd
 import numpy as np
 import seaborn as sns
 import matplotlib.pyplot as plt
 %matplotlib inline
 sns.set(color_codes=True)
2. Loading the data into the data frame.

 Loading the data into a pandas data frame is certainly one of the most important steps in
EDA.
 The values in the data set are comma-separated, so we read the CSV file into a pandas data
frame.
df =pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/employees.csv")
 df.head() # To display the top 5 rows

 df.tail() # To display the bottom 5 rows

3. Checking the types of data

 Here we check the datatypes because sometimes a column such as the Salary of the employees
could be stored as a string. In that case, we have to convert the string to an integer;
only then can we plot the data via a graph.
 Here, in this case, the data is already in integer format, so there is nothing to worry about.

 df.dtypes
 Let’s see the shape of the data using the shape attribute.
df.shape
(1000,8)
This means that this dataset has 1000 rows and 8 columns.

 Let’s get a quick summary of the dataset using the describe() method. The describe() function
applies basic statistical computations to the dataset, such as extreme values, count of data points,
standard deviation, etc. Any missing or NaN value is automatically skipped. The describe()
function gives a good picture of the distribution of the data.
df.describe()

 Now, let’s also see the columns and their data types. For this, we will use the info() method.
df.info()
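If a numeric column such as Salary had arrived as strings (the case warned about above), a hedged sketch of the conversion would look like this; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame where Salary arrived as strings
df = pd.DataFrame({"Salary": ["97308", "61933", "130590"]})
print(df.dtypes)   # Salary is 'object' (strings)

# Convert to a numeric dtype; errors='coerce' turns unparseable values into NaN
df["Salary"] = pd.to_numeric(df["Salary"], errors="coerce")
print(df.dtypes)   # Salary is now numeric
```

`errors='coerce'` is a deliberate choice here: a bad value becomes NaN and can be handled in the missing-value step, rather than raising an exception.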
4. Dropping irrelevant columns.

 This step is often needed in EDA because sometimes there are many columns
that we never use. In such cases, drop the irrelevant columns.
 For example, in this case, columns such as Last Login Time and Senior Management don't
add anything useful, so we just drop them.

df.drop(['Last Login Time', 'Senior Management'], axis=1, inplace=True)

df.head(5)

5. Renaming the columns

 In this instance, some of the column names are confusing to read, so we rename them.
 This is a good approach; it improves the readability of the data set.
df = df.rename(columns={"Start Date": "SDate"})
df.head(5)

6. Dropping the duplicate rows

 First, find the number of rows and columns.


df.shape
df.count() # Used to count the number of rows
 Finding the number of duplicate rows.
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)
number of duplicate rows: (0, 6)
 Removing the duplicate rows.
df = df.drop_duplicates()
df.count() # Used to count the number of rows

7. Dropping the missing or null values.

 Missing data is a very big problem in real-life scenarios. Missing data is also referred to as
NA (Not Available) values in pandas. There are several useful functions for detecting,
removing, and replacing null values in a pandas DataFrame:
print(df.isnull().sum())

 We can see that every column has a different number of missing values: Gender has 145
missing values, while Salary has 0. To handle these missing values, there are several
options, such as dropping the rows containing NaN or replacing NaN with the mean, median,
mode, or some other value.
 Now, let’s try to fill the missing values of gender with the string “No Gender”.
df["Gender"].fillna("No Gender", inplace = True)
df.isnull().sum()

 Now, Let’s fill the senior management with the mode value.
mode = df['Senior Management'].mode().values[0]
df['Senior Management']= df['Senior Management'].replace(np.nan, mode)
df.isnull().sum()

 Now for the first name and team, we cannot fill the missing values with arbitrary data, so,
let’s drop all the rows containing these missing values.
df = df.dropna(axis = 0, how ='any')
print(df.isnull().sum())
df.shape
8. Detecting Outliers:

 An outlier is an extreme value that deviates from the other observations in the
dataset.
 An outlier is a point or set of points that differ from the other points.
 Outlier values can be very high or very low.
 It's often a good idea to detect and remove outliers, because they are one of the
primary reasons for a less accurate model.
 The IQR (Inter-Quartile Range) score technique is used to detect and remove outliers.
 Outliers can be seen with visualizations using a box plot.
 sns.boxplot(x=df['Salary'])

 Here, in the plot, we can see some points outside the box; they are none other than
outliers.
 num = df.select_dtypes(include='number')
 Q1 = num.quantile(0.25)
 Q3 = num.quantile(0.75)
 IQR = Q3 - Q1
 print(IQR)
 df = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]
 df.shape

9. Data Visualization

Data Visualization is the process of presenting data in the form of graphs or maps, making it a lot
easier to understand the trends or patterns in the data. There are various types of visualizations –

Histogram Plot

 A histogram is a graphical representation commonly used to visualize the distribution of
numerical data.
 A histogram divides the values within a numerical variable into "bins" and counts the
number of observations that fall into each bin.

Example:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(x='Salary', data=df, )
plt.show()
Box Plot

Box Plot is the visual representation of groups of numerical data through their
quartiles. A boxplot is also used to detect outliers in a data set. It captures the summary of the data
efficiently with a simple box and whiskers and allows us to compare easily across groups.

Example:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot( x="Salary", y='Team', data=df, )
plt.show()

Heat Maps Plot

 A heat map is a type of plot used to find dependencies between variables.
 One of the best ways to find the relationships between features is to use a heat
map.
Example:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10,5))
c = df.corr(numeric_only=True)
sns.heatmap(c, cmap="BrBG", annot=True)

Scatter Plot

 A scatter plot is a type of data visualization technique that shows the relationship between
two numerical variables.
 Calling the scatter() method on a matplotlib axes draws a plot between two variables or two
columns of a pandas DataFrame.

Example:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df['First Name'], df['Salary'])
ax.set_xlabel('First Name')
ax.set_ylabel('Salary')
plt.show()
Descriptive Statistics
 Descriptive Statistics is the default process in Data analysis.
 Exploratory Data Analysis (EDA) is not complete without a Descriptive Statistic analysis.
 Descriptive Statistics is divided into two parts:
1. Measure of Central Data points and
2. Measure of Dispersion.

1. Measure of Central Data points:

The following operations are performed under Measure of Central Data points. Each of these
measures describes a different indication of the typical or central value in the distribution.

1. Count
2. Mean
3. Mode
4. Median

2. Measure of Dispersion

The following operations are performed under Measure of Dispersion. Measures of dispersion can
be defined as positive real numbers that measure how homogeneous or heterogeneous the given data is.

1. Range
2. Percentiles (or) Quartiles
3. Standard deviation
4. Variance
5. Skewness

Example:

 Consider a file:

https://media.geeksforgeeks.org/wp-content/uploads/employees.csv

 Before starting the descriptive statistics analysis, complete the data collection and cleaning
process.

Data Collection:

 # loading data set as Pandas dataframe


import pandas as pd
df = pd.DataFrame()
df = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/employees.csv")

 df.head() # To display the top 5 rows


 df.tail() # To display the last 5 rows
 Here we check for the datatypes
df.dtypes
 Let’s see the shape of the data using the shape attribute.
df.shape
df.count() # Used to count the number of rows
 Now, let’s also see the columns and their data types. For this, we will use the info() method.
df.info()

Data Cleaning :

Data cleaning means fixing bad data in your data set before data analysis.

 Empty cells or Null values


 Data in wrong format
 Wrong data
 Duplicates

Describe() method :

 Let’s get a quick summary of the dataset using the describe() method.
 The describe() function applies basic statistical computations to the dataset, such as extreme
values, count of data points, standard deviation, etc.
 Any missing value or NaN value is automatically skipped.
 describe() function gives a good picture of the distribution of data.

Measure of Central Data points:

Count

It calculates the total count of a numerical column's data, or of each category of a categorical
variable.

# get column height from df and count its non-null values

height = df["height"]

print(height.count())

Mean

 The sum of the values in a column divided by the total number of rows in that column is known as the
mean.

 It is also known as average.


Median

 The center value of an attribute is known as the median.

 The median divides the data points into two parts: 50% of the data points lie
above the median and 50% below it.

Mode

 The mode is the data point whose count is maximum in a column.

 There is only one mean and one median value for each column, but an attribute can have more
than one mode value.

#Calculating mean, median, mode of dataset height


mean = height.mean()
median =height.median()
mode = height.mode()
print(mean , median, mode)

Output:

53.73152709359609 54.1 0 50.8 dtype: float64

Measures of Dispersion

Range

 The difference between the maximum value and the minimum value in a column is known as the range.
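A minimal sketch of computing the range with pandas (the height values below are hypothetical):

```python
import pandas as pd

# Hypothetical height values
height = pd.Series([50.8, 54.1, 52.3, 57.6, 49.9])

# Range = maximum value - minimum value
value_range = height.max() - height.min()
print(value_range)
```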

Standard Deviation

 The standard deviation value tells us how much the data points deviate from the mean value.

 The standard deviation is affected by outliers because it uses the mean in its calculation.

 Formula for Standard Deviation

σ = √( Σ (xi − x̄)² / n )

 Notations for Standard Deviation

 σ = Standard Deviation
 xi = Terms Given in the Data
 x̄ = Mean
 n = Total number of Terms

 Example for standard deviation

# standard deviation of the data set using the std() function
std_dev = df.std(numeric_only=True)
print(std_dev)
# standard deviation of a specific column
sv_height = df.loc[:, "height"].std()
print(sv_height)

Output:

2.442525704031867

Variance

 Variance is the square of standard deviation.

Variance = (Standard deviation)² = σ²

 In the case of outliers, the variance value becomes large and noticeable.

 Example for variance

# variance of the data set using the var() function
variance = df.var(numeric_only=True)
print(variance)
# variance of a specific column
var_height = df.loc[:, "height"].var()
print(var_height)

Output:

5.965931814856368

Skewness

 Ideally, the distribution of data should be in the shape of a Gaussian (bell curve).

 But practically, data shapes are skewed or have asymmetry. This is known as skewness in
data.

 Formula for skewness is:

Skew = 3 * (Mean – Median) / Standard Deviation

 The skewness value can be negative (left skew) or positive (right skew). Ideally, its value should be
close to zero.
 Example for skewness

 df.skew(numeric_only=True)

# skewness of the specific column

df.loc[:,"height"].skew()

output:

0.06413448813322854
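The formula above can also be applied by hand; note that pandas' skew() uses a moment-based estimator, so the Pearson approximation below will generally differ from it slightly. The data values here are hypothetical.

```python
import pandas as pd

# Hypothetical height values
height = pd.Series([50.8, 54.1, 52.3, 57.6, 49.9, 53.0])

# Pearson's second skewness coefficient: 3 * (mean - median) / standard deviation
skew_approx = 3 * (height.mean() - height.median()) / height.std()
print(skew_approx)
```

Here the mean exceeds the median, so the approximation comes out positive, indicating a right skew.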

Percentiles or Quartiles

 The spread of a column's values can be summarized by calculating several percentiles.

 Median is also known as the 50th percentile of data.

 Here are the different percentiles:

1. The minimum value equals the 0th percentile.
2. The maximum value equals the 100th percentile.
3. The first quartile equals the 25th percentile.
4. The third quartile equals the 75th percentile.

Quartiles

 It divides the data set into four equal parts.

 First quartile = 25th percentile

 Second quartile = 50th percentile (Median)

 Third quartile = 75th percentile

 Based on the quartiles, there is another measure called the inter-quartile range (IQR) that also
measures the variability in the dataset. It is defined as:

IQR = Q3 - Q1

 IQR is not affected by the presence of outliers.


 Example :
import numpy as np
price = df.price.sort_values()
Q1 = np.percentile(price, 25)
Q2 = np.percentile(price, 50)
Q3 = np.percentile(price, 75)
IQR = Q3 - Q1
IQR

Output:

8718.5
Basic tools (plots, graphs and summary statistics) of EDA
 Exploratory data analysis, or “EDA”, is a critical first step in analyzing data.
 The uses of EDA are:
1. Detection of mistakes
2. Checking of assumptions
3. Preliminary selection of appropriate models
4. Determining relationships among the explanatory variables

Typical data format (Data Types) and the types of EDA

Data Types:

Data types are mainly classified into two types. Those are:

Categorical Data

 Categorical data represents characteristics.


 It can represent things like a person’s gender, language etc.
 Categorical data can also take on numerical values.
Example: 1 for female and 0 for male.
 These numbers don’t have mathematical meaning
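Such a numeric coding can be sketched with a pandas mapping; the column name and the 1/0 coding below are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"Gender": ["Female", "Male", "Female"]})

# Map the categories to numeric labels: 1 for female, 0 for male
df["Gender_code"] = df["Gender"].map({"Female": 1, "Male": 0})
print(df)
```

As noted above, these numbers are labels only; averaging or ordering them has no mathematical meaning.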

Nominal Data

 Nominal values represent discrete units and are used to label variables that have no
quantitative value.
 Nominal data has no order.
 If the order of the values were changed, their meaning would not change.

Examples: gender, language, nationality.
Ordinal Data

 Ordinal values represent discrete and ordered units.


 It is the same as nominal data, except that its ordering matters.
Example: education level (elementary, high school, college).

Numerical Data

1. Discrete Data

 Discrete data contains values as distinct and separate.


 This type of data can't be measured, but it can be counted.
 It basically represents information that can be categorized into a classification.
 An example is the number of heads in 100 coin flips.

2. Continuous Data

 Continuous data represents measurements; its values can't be counted, but they can be
measured.
 An example would be the height of a person, which can be described using intervals on the real
number line.

Interval Data

 Interval values represent ordered units that have the same difference.

Example: temperature in degrees Celsius.

Ratio Data
 Ratio values are also ordered units that have the same difference.
 Ratio values are the same as interval values, with the difference that they have an absolute
zero.
 Good examples are height, weight, length etc.


Types of EDA

 EDA techniques are either graphical or quantitative (non-graphical).
 Graphical methods summarize the data in a diagrammatic or visual way.
 Quantitative methods, on the other hand, involve the calculation of summary statistics.
 These two types of methods are further divided into univariate and multivariate methods.
 Univariate methods consider one variable (data column) at a time.
 Multivariate methods consider two or more variables at a time to explore relationships.
 In total, there are four types of EDA:
1. Univariate graphical,
2. Multivariate graphical,
3. Univariate non-graphical, and
4. Multivariate non-graphical.

Univariate non-graphical:

 This is the simplest form of data analysis among the four options.
 In this type of analysis, the data that is being analysed consists of just a single variable.
 The main purpose of this analysis is to describe the data and to find patterns.

Univariate graphical:

 The graphical method provides a fuller picture of the data.
 The three main methods of analysis under this type are the histogram, the stem-and-leaf plot, and
the box plot.
 The histogram represents the total count of cases for a range of values.
 Along with the data values, the stem-and-leaf plot shows the shape of the distribution.
 Box plots graphically depict a summary of the minimum, first quartile, median, third quartile,
and maximum.
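As a sketch, the five values a box plot depicts can be computed directly with pandas (the data points below are hypothetical):

```python
import pandas as pd

# Hypothetical data points
data = pd.Series([7, 15, 36, 39, 40, 41])

# The five values a box plot depicts
summary = {
    "min": data.min(),
    "Q1": data.quantile(0.25),
    "median": data.median(),
    "Q3": data.quantile(0.75),
    "max": data.max(),
}
print(summary)
```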

Multivariate non-graphical:
 The multivariate non-graphical type of EDA generally depicts the relationship between
multiple variables of data through cross-tabulation or statistics.
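A minimal sketch of such a cross-tabulation with pd.crosstab (the data below is hypothetical):

```python
import pandas as pd

# Hypothetical data with two categorical variables
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Team": ["HR", "HR", "IT", "IT", "IT"],
})

# Count how many observations fall into each Gender x Team cell
table = pd.crosstab(df["Gender"], df["Team"])
print(table)
```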

Multivariate graphical:

 This type of EDA displays the relationship between two or more sets of data.
 An example is a grouped bar chart, where each group represents a level of one variable and each bar
within a group represents the levels of the other variables.

Summary Statistics of EDA


 One purpose of EDA is to spot problems in data (as part of data wrangling) and to understand
variable properties like:
1. central trends (mean)
2. spread (variance)
3. skew
 Suggest possible modeling strategies (e.g., probability distributions)
 EDA is used to understand the relationships between pairs of variables, e.g. their correlation or
covariance.

COVARIANCE

 Covariance is a measure of the relationship between two variables that is scale dependent, i.e. how
much one variable changes when another variable changes.
 This can be represented with the following equation:

cov(x, y) = Σ (xi − x̄)(yi − ȳ) / N

where
 xi is the ith observation in variable x,
 x̄ is the mean for variable x,
 yi is the ith observation in variable y,
 ȳ is the mean for variable y, and
 N is the number of observations.
 This can be calculated easily within Python, particularly when using pandas:
import pandas as pd
df = pd.read_csv("data.csv")
df.cov()
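As a sanity check, df.cov() can be compared with the definition above on toy data; note that pandas divides by N − 1 (sample covariance) rather than N, so the manual calculation below does the same.

```python
import pandas as pd

# Toy data: y is exactly 2 * x
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})

# Sample covariance by the definition: sum((x - x̄)(y - ȳ)) / (N - 1)
x_dev = df["x"] - df["x"].mean()
y_dev = df["y"] - df["y"].mean()
manual = (x_dev * y_dev).sum() / (len(df) - 1)

print(manual, df.cov().loc["x", "y"])
```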

CORRELATION

 Correlation is a statistical metric that measures to what extent different variables are interdependent.
 In short, if one variable changes, how does it affect the other variable?

r = Σ (xi − x̄)(yi − ȳ) / (N · sx · sy)

where
 xi is the ith observation in variable x,
 x̄ is the mean for variable x,
 yi is the ith observation in variable y,
 ȳ is the mean for variable y,
 N is the number of observations,
 sx is the standard deviation for variable x, and
 sy is the standard deviation for variable y.

 This can be calculated easily within Python, particularly when using pandas:

import pandas as pd
df = pd.read_csv("data.csv")
df.corr()
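Similarly, df.corr() can be checked against r = cov(x, y) / (sx · sy) on toy data (the values below are hypothetical):

```python
import pandas as pd

# Toy data: y is roughly linear in x
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [1.5, 3.1, 4.4, 6.2]})

# Pearson correlation from its definition: cov(x, y) / (sx * sy)
r_manual = df["x"].cov(df["y"]) / (df["x"].std() * df["y"].std())

print(r_manual, df.corr().loc["x", "y"])
```

Because pandas uses the same N − 1 convention in both cov() and std(), the divisor cancels and the manual value matches df.corr() exactly.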

Philosophy of Exploratory Data Analysis

 The important reasons to implement EDA when working with data are:
1. To gain intuition about the data;
2. To make comparisons between distributions;
3. For sanity checking (making sure the data is on the scale we expect, in the
format we thought it should be);
4. To find out where data is missing or if there are outliers; and to summarize the data.
 In the context of data generated from logs, EDA also helps with debugging the logging
process.
 In the end, EDA helps us to make sure the product is performing as intended.
 There’s lots of visualization involved in EDA.
 The distinction between EDA and data visualization is that EDA is done toward the
beginning of an analysis, while data visualization is done toward the end to communicate one's
findings.
 With EDA, the graphics are done solely for us to understand what's going on.
 EDA is also used to improve the development of algorithms.
Data Visualization
What is Data Visualization :

 Data Visualization is the presentation of data in a graphical format.

 It presents huge amounts of data in a simple, easy-to-understand format and helps
communicate information clearly and effectively.
 Data in a graphical format allows users to identify new trends and patterns easily.
 The main benefits of data visualization are as follows:
1. It simplifies the complex quantitative information
2. It helps analyze and explore big data easily
3. It identifies the areas that need attention or improvement
4. It identifies the relationship between data points and variables
5. It explores new patterns and reveals hidden patterns in the data
 Three major considerations for Data Visualization:
 Clarity
 Accuracy
 Efficiency

Clarity - Clarity ensures that the data set is complete and relevant.

Accuracy – Accuracy ensures using appropriate graphical representation to convey the right
message.

Efficiency - Efficiency uses efficient visualization technique which highlights all the data
points

 Some basic factors to be aware of before visualizing the data.

 Visual effect
 Coordination System
 Data Types and Scale
 Informative Interpretation
Visual effect - Visual effect includes the usage of appropriate shapes, colors, and
sizes to represent the analyzed data.
Coordination System - The coordinate system helps organize the data points
within the provided coordinates.
Data Types and Scale - Data types and scale determine the type of data, such as
numeric or categorical.
Informative Interpretation - Informative interpretation helps create visuals in
an effective and easily interpretable manner using labels, titles, legends, and pointers.

 Python offers multiple great graphing libraries.

 Some popular plotting libraries:

 Matplotlib
 Pandas Visualization
 Seaborn
 ggplot
 Plotly
 Plots (graphics), also known as charts, are a visual representation of data in the form of
colored (mostly) graphics.

Histogram Plot

 A histogram is a graphical representation commonly used to visualize the distribution of
numerical data.
 A histogram divides the values within a numerical variable into "bins" and counts the
number of observations that fall into each bin.

Example:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(x='Salary', data=df, )
plt.show()

Box Plot

Box Plot is the visual representation of groups of numerical data through their
quartiles. A boxplot is also used to detect outliers in a data set. It captures the summary of the data
efficiently with a simple box and whiskers and allows us to compare easily across groups.

Example:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot( x="Salary", y='Team', data=df, )
plt.show()
Heat Maps Plot

 A heat map is a type of plot used to find dependencies between variables.
 One of the best ways to find the relationships between features is to use a heat
map.

Example:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10,5))
c = df.corr(numeric_only=True)
sns.heatmap(c, cmap="BrBG", annot=True)

Scatter Plot

 A scatter plot is a type of data visualization technique that shows the relationship between
two numerical variables.
 Calling the scatter() method on a matplotlib axes draws a plot between two variables or two
columns of a pandas DataFrame.
Example:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df['First Name'], df['Salary'])
ax.set_xlabel('First Name')
ax.set_ylabel('Salary')
plt.show()
