Unit 4

The document outlines the course structure and objectives for a Data Analytics program at the Noida Institute of Engineering and Technology, focusing on exploratory data analysis and various data manipulation techniques. It details the evaluation scheme, course outcomes, and prerequisites, along with the importance of handling missing data and employing visualization methods. The course aims to equip students with the necessary skills to analyze and interpret data effectively using programming languages like R and Python.

Uploaded by

asdrhmn8


Noida Institute of Engineering and Technology, Greater Noida

DATA ANALYTICS

Unit: 4

Exploratory Data Analysis

Mr. Ravi Pandey
Assistant Professor, ECE
B Tech VIIth Sem

Sanchi Kaushik UNIT 04 Data Analytics 1

12/12/2023
Faculty Introduction

Mr. Ravi has been a faculty member in the Discipline of Electronics and Communication Engineering at Noida Institute of Engineering and Technology, Greater Noida, Gautam Budh Nagar, Uttar Pradesh, India, since February 2016. He received the B.Tech degree in electronics and telecommunication engineering and the M.Tech degree in instrumentation and signal processing. His research interests include biomedical signal processing, pattern recognition, machine learning and deep neural networks.
Evaluation Scheme

Columns: Sl. No. | Subject Code | Subject | Periods (L T P) | Evaluation Schemes (CT TA Total PS) | End Semester (TE PE) | Total | Credit

1 Departmental Core - I 3 0 0 30 20 50 100 150 3
2 Departmental Elective 3 0 0 30 20 50 100 150 3
3 Open Elective II 3 0 0 30 20 50 100 150 3
4 Open Elective III 3 0 0 30 20 50 100 150 3
5 Lab – I 0 0 2 25 25 50 1
6 Internship Assessment 0 0 2 50 50 1
MOOCs (Essential for Hons. Degree) 0 0 2
Total 700 14

Course applicable for: B.Tech. Data Science/AI-ML/AI/IOT/CSBS
CONTENT
Course objective
B. TECH. (Data Science)

Course code L T P Credits

3 0 0 3

Course title Data Analytics

Course objective:

The objective of this course is to understand the fundamental concepts of Data Science
and to learn about various types of data formats and their manipulation. It helps students
learn exploratory data analysis and visualization techniques, in addition to the R
programming language.

15/06/2022 Nisha UNIT 01 4

CONTENT
Course Outcomes
Course outcomes: After completion of this course, students will be able to:

CO 1 Understand the fundamental concepts of data analytics in the areas that play a major role within the realm of data science. (K1)
CO 2 Explain and exemplify the most common forms of data and their representations. (K2)
CO 3 Understand and apply data pre-processing techniques. (K3)
CO 4 Analyse data using exploratory data analysis. (K4)
CO 5 Illustrate various visualization methods for different types of data sets and application scenarios. (K3)

Course Contents / Syllabus

Text Books

1) Glenn J. Myatt, Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining, John Wiley Publishers, 2007.
2) Data Analysis and Data Mining, 2nd Edition, John Wiley & Sons Publication, 2014.

Reference Books

1) Neha Sharma, Santanu Ghosh, Monodeep Saha, Open Data for Sustainable Community: Glocalized Sustainable Development Goals, Springer, 2021.
2) Field Cady, The Data Science Handbook, John Wiley & Sons, Inc, 2017.
3) Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, Third Edition, Morgan Kaufmann, 2012.

Branch wise Applications

• Security
• Transportation
• Risk detection
• Risk management
• Delivery
• Fast internet allocation
• Reasonable expenditure
• Interaction with customers
• Planning of cities

12 December 2023 Neeti Taneja ACSE0403A OS Unit-3 9

Course Objectives

• The objective of this course is to understand the fundamental concepts of data analytics and to learn about various types of data formats and their manipulation.

• It helps students learn exploratory data analysis and visualization techniques, in addition to tools such as R, Python and Tableau.
Course Outcomes

Course outcome: After completion of this course, students will be able to:

CO1 Understand the fundamentals of operating systems, their structure and functions. (K1, K2)
CO2 Implement the concepts of process management policies, CPU scheduling and thread management. (K5)
CO3 Understand and implement the requirements of process synchronization and apply deadlock handling algorithms. (K2, K5)
CO4 Evaluate memory management and its allocation policies. (K5)
CO5 Understand and analyze I/O management and file systems. (K2, K4)
Program Outcomes

1. Engineering knowledge
2. Problem analysis
3. Design/development of solutions
4. Conduct investigations of complex problems
5. Modern tool usage
6. The engineer and society
7. Environment and sustainability
8. Ethics
9. Individual and team work
10. Communication
11. Project management and finance
12. Life-long learning

COs and POs Mapping

Course Outcome  PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12
1               3    2    2    -    -    -    -    -    -    -     -     1
2               3    3    3    -    -    -    -    -    -    -     -     1
3               3    3    3    -    -    -    -    -    -    -     -     1
4               3    2    1    -    -    -    -    -    -    -     -     1
5               3    2    2    -    -    -    -    -    -    -     -     1
Average         3    2.4  2.2  -    -    -    -    -    -    -     -     1

Program Specific Outcomes (PSOs)

On successful completion of the B. Tech. (DS) Program, the Data Science graduates will be able to:

• PSO1: Analyse, design and develop solutions by applying fundamental concepts of Data Science.
• PSO2: Apply technical knowledge while using modern tools and technologies for solving complex problems.
• PSO3: Collaborate across different fields of science and technology with the right attitude, work as an individual or as a team, and demonstrate professional ethics for the well-being of society.

COs and PSOs Mapping

Course Outcome  PSO1  PSO2  PSO3
1               3     -     -
2               3     2     -
3               3     2     -
4               3     2     2
5               3     2     -
Average         3     2     2

Program Educational Objectives (PEOs)

• Solve real-time complex problems and adapt to technological changes with the ability of lifelong learning.

• Work as data scientists, entrepreneurs, and bureaucrats for the goodwill of the society and pursue higher education.

• Exhibit professional ethics and moral values with good leadership qualities and effective interpersonal skills.

Faculty wise Result Analysis

• NA

End Semester Question Paper Templates (Offline Pattern/Online Pattern)

CONTENT
Unit -4 Exploratory Data Analysis
• Handling missing data
• Removing redundant variables
• Variable selection
• Identifying outliers
• Removing outliers
• Time series analysis
• Data transformation and dimensionality reduction techniques:
  • Principal Component Analysis (PCA)
  • Factor Analysis (FA)
  • Linear Discriminant Analysis (LDA)
• Univariate and multivariate exploratory data analysis
• Data munging and data wrangling: APIs and other tools for scraping data from the web/internet using R/Python

Prerequisite and Recap

Prerequisite:

• Basic knowledge of statistics and probability.

• Basic knowledge of Python programming.

Recap:

• Students are clear about the types of data, data classification and data manipulation.

• Students know the basic concepts of data pre-processing.

• Students have knowledge of data reduction.

Unit Objective

The objectives of Unit 4 are:

1. To describe how to handle missing data.
2. To remove redundant variables.
3. To explore data transformation and dimensionality reduction techniques such as Principal Component Analysis (PCA), Factor Analysis (FA) and Linear Discriminant Analysis (LDA).
4. To describe the services an operating system provides to users, processes, and other systems.
5. To discuss data munging, data wrangling, APIs and other tools for scraping data from the web/internet using Python.

Topic Objective

Objective: To study exploratory data analysis (EDA) and the various kinds of tools used for EDA.

Recap: Basic knowledge of data pre-processing.

CONTENT
Exploratory Data Analysis

• Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.
• It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
• EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task, and it provides a better understanding of data set variables and the relationships between them.
CONTENT
Exploratory Data Analysis

Exploratory data analysis tools:-

•Clustering and dimension reduction techniques, which help create graphical
displays of high-dimensional data containing many variables.
•Univariate visualization of each field in the raw dataset, with summary
statistics.
•Bivariate visualizations and summary statistics that allow you to assess the
relationship between each variable in the dataset and the target variable.
•Multivariate visualizations, for mapping and understanding interactions
between different fields in the data.


CONTENT
Exploratory Data Analysis

Types of exploratory data analysis: there are four primary types of EDA:

• Univariate non-graphical: This is the simplest form of data analysis, where the data being analyzed consists of just one variable.
• Univariate graphical: Non-graphical methods don't provide a full picture of the data, so graphical methods are also required.
• Multivariate non-graphical: Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.
• Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships between two or more sets of data.
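The two non-graphical types above can be sketched with pandas. This is a minimal illustration, assuming a small made-up table; the column names and values are hypothetical and not taken from the course material.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "passed": ["yes", "no", "yes", "yes", "no", "no"],
    "score": [82, 45, 91, 67, 38, 52],
})

# Univariate non-graphical: summary statistics of a single variable.
score_summary = df["score"].describe()
print(score_summary)

# Multivariate non-graphical: cross-tabulation of two categorical variables.
table = pd.crosstab(df["gender"], df["passed"])
print(table)
```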

Topic Objective

Objective: To understand how to handle missing data.

Recap: Basic knowledge of data pre-processing.

CONTENT
Handling Missing data

Missing Values:
The data has some missing values in its columns. There are three
major categories of missing values:
1. MCAR (Missing completely at random): The values are missing at random, and the missingness does not depend on any other values, observed or unobserved.
2. MAR (Missing at random): Whether a value is missing depends on other observed features, but not on the missing value itself.
3. MNAR (Missing not at random): There is a reason behind why these values are missing; the missingness depends on the unobserved value itself.
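Before deciding which category applies, it helps to see how much is missing in each column. A minimal sketch with pandas, assuming a hypothetical two-column table:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps, invented for illustration only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, np.nan],
    "city": ["Noida", "Delhi", None, "Agra", "Delhi"],
})

# Number and share of missing values per column.
missing_counts = df.isna().sum()
missing_share = df.isna().mean()
print(missing_counts)
print(missing_share)
```

Counting alone cannot tell MCAR, MAR and MNAR apart; that judgment needs knowledge of how the data was collected.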
CONTENT
Handling Missing data

Removing missing data
Once you know that the data is MCAR and only a relatively small
fraction of observations have missing values, it may be
safe to remove those observations.

Encoding as missing
In the MCAR or MAR case, for a categorical attribute y, a useful
approach is to encode the fact that a value is missing as a new
category and include that category in subsequent analysis of attribute y.
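Both options can be sketched with pandas. The table, the column names and the label "Missing" are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "income": [52.0, 48.5, np.nan, 61.0],
    "segment": ["A", None, "B", "A"],
})

# Removing missing data: keep only rows with no missing value.
complete_rows = df.dropna()

# Encoding as missing: treat "missing" as its own category.
df["segment_encoded"] = df["segment"].fillna("Missing")
print(complete_rows)
print(df)
```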

CONTENT
Handling Missing data

Imputation
1. MCAR (also usable for MAR, but this is not ideal)
In this case we can use a simple method for imputation of y.
For a numeric attribute y, we replace missing values with
the mean of the non-missing values of y.
For a categorical attribute y, we replace missing values with
the most common category (the mode) among the non-missing values of y.
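A minimal sketch of mean and mode imputation with pandas, assuming a hypothetical table with one numeric and one categorical column:

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "height": [150.0, 160.0, np.nan, 170.0],
    "colour": ["red", "blue", "red", None],
})

# Numeric attribute: fill with the mean of the non-missing values.
df["height"] = df["height"].fillna(df["height"].mean())

# Categorical attribute: fill with the most common category.
df["colour"] = df["colour"].fillna(df["colour"].mode()[0])
print(df)
```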

CONTENT
Handling Missing data

Imputation
2. MAR
In this case we use a more complex method: missing values of
attribute y are replaced by predictions from other
variables x when the variables are related (for example, linear
regression using the lm and predict functions in R, covered later on).
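The slide refers to R's lm and predict. The same idea can be sketched in Python with a straight-line fit from NumPy, which is a stand-in chosen for this illustration rather than the method used in the course. The data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data where y is roughly linear in x, invented for illustration.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, np.nan, 8.0, 10.0],
})

# Fit y = slope * x + intercept on the rows where y is observed.
observed = df["y"].notna()
slope, intercept = np.polyfit(df.loc[observed, "x"], df.loc[observed, "y"], deg=1)

# Replace the missing y values with the model's predictions.
df.loc[~observed, "y"] = slope * df.loc[~observed, "x"] + intercept
print(df)
```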

CONTENT
Handling Missing data

Implications of imputation
Imputation has some effects that can impact analysis.
1. The central tendency of the data is retained. For example, if we impute
missing data using the mean of a numeric variable, the mean after
imputation will not change. This is a good reason to impute based on
estimates of central tendency.
2. The spread of the data will change. After imputation, the spread of the
data will be smaller than the spread computed when missing values are ignored. This
could be problematic, as underestimating the spread of data can yield
over-confident inferences in downstream analysis.
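Both effects can be checked numerically. A minimal sketch with pandas on a hypothetical series:

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing values, invented for illustration.
s = pd.Series([10.0, 20.0, np.nan, 30.0, np.nan, 40.0])
imputed = s.fillna(s.mean())

# pandas skips NaN by default, so "before" ignores the missing values.
mean_before, mean_after = s.mean(), imputed.mean()
std_before, std_after = s.std(), imputed.std()

print(mean_before, mean_after)  # the mean is unchanged
print(std_before, std_after)    # the standard deviation shrinks
```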

Topic Objective

Objective: To learn the techniques for removing redundant variables.

Recap: Basic knowledge of data pre-processing.

CONTENT
Removing Redundant variables

1. Removing redundant columns
Sometimes more than one column contains the same or very similar
values. In that case, having both columns does not add any value
to the model, so it is wise to delete the redundant column.
2. Removing redundant rows
This depends on the use case, but if having duplicate
records does not make sense, then it is wise to remove the
redundant rows:
df.drop_duplicates()
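A minimal sketch of both steps with pandas. The table is hypothetical, and dropping only exact column copies is an assumption of this example; near-duplicate columns would need a similarity check such as correlation.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "price_inr": [100, 200, 200, 300],
    "price_copy": [100, 200, 200, 300],  # exact copy of price_inr
    "qty": [1, 2, 2, 3],
})

# Redundant columns: transpose so columns become rows, then flag repeats.
deduped_cols = df.loc[:, ~df.T.duplicated()]

# Redundant rows: drop records that repeat an earlier record.
deduped = deduped_cols.drop_duplicates()
print(deduped)
```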
CONTENT
Variable selection

Variable selection is the process in which candidate model variables
are tested for significance during model training. Candidate model
variables are also known as independent variables, predictors,
attributes, model factors, covariates, regressors, features, or
characteristics.
Variable selection is a parsimonious process that aims to identify
a minimal set of predictors for maximum gain (predictive
accuracy). This approach is the opposite of data preparation,
where as many meaningful variables as possible are added to the
mining view.
Variable selection
CONTENT

Typical preventive measures during variable selection include:

•Collaboration with experts in the field to identify the important
variables.
•Awareness of any problems in relation to data source, reliability
or mismeasurement.
•Cleaning the data.
•Using control variables to account for banned variables or
specific events such as an economic drift.



CONTENT
Identifying Outliers

An outlier is an observation that is separate or different from the crowd.
Outliers can be the result of a mistake during data collection, or they
can simply be an indication of variance in your data. Some of the
methods for detecting and handling outliers are:
• Box plot
• Scatter plot
• Z-score
• IQR (Inter-Quartile Range)
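The two numeric methods can be sketched with NumPy. The data is hypothetical, and the cut-offs (|z| > 2 and 1.5 × IQR) are common conventions assumed for this example, not values given in the slides; |z| > 3 is also widely used on larger samples.

```python
import numpy as np

# Hypothetical measurements with one unusual value, invented for illustration.
values = np.array([12.0, 13.0, 12.5, 11.8, 12.2, 13.1, 12.7, 30.0])

# Z-score: distance from the mean in units of standard deviation.
z = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z) > 2]

# IQR: flag points far outside the middle 50% of the data.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

print(z_outliers, iqr_outliers)
```

Removing the flagged points is then a matter of keeping the complement, for example values[(values >= lower) & (values <= upper)].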

Topic Objective

Objective: To understand the concept of time series analysis.

Recap: Basic knowledge of data pre-processing.

CONTENT
Time Series Analysis

Time series analysis is a specific way of analyzing a sequence of
data points collected over an interval of time. In time series
analysis, analysts record data points at consistent intervals over a
set period of time rather than just recording the data points
intermittently or randomly. However, this type of analysis is not
merely the act of collecting data over time: what sets it apart is
that the analysis can show how variables change over time.

CONTENT
Time Series Analysis

Why organizations use time series data analysis

Time series analysis helps organizations understand the
underlying causes of trends or systemic patterns over time.
Using data visualizations, business users can see seasonal trends
and dig deeper into why these trends occur.
When organizations analyze data over consistent intervals, they
can also use time series forecasting to predict the likelihood of
future events. Time series forecasting is part of predictive
analytics.

CONTENT
Time Series Analysis

Time series analysis examples: Examples of time series analysis in action include:
• Weather data
• Rainfall measurements
• Temperature readings
• Heart rate monitoring (EKG)
• Brain monitoring (EEG)
• Quarterly sales
• Stock prices
• Automated stock trading
• Industry forecasts
• Interest rates

CONTENT
Time Series Analysis

Time Series Analysis Types

• Classification: Identifies and assigns categories to the data.
• Curve fitting: Plots the data along a curve to study the relationships of variables within the data.
• Descriptive analysis: Identifies patterns in time series data, like trends, cycles, or seasonal variation.
• Explanative analysis: Attempts to understand the data and the relationships within it, as well as cause and effect.
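As a small example of descriptive analysis, a rolling mean smooths short-term fluctuation so the trend is easier to see. A minimal sketch with pandas, assuming hypothetical daily sales and a 3-day window chosen arbitrarily:

```python
import pandas as pd

# Hypothetical daily sales at consistent intervals, invented for illustration.
dates = pd.date_range("2023-01-01", periods=8, freq="D")
sales = pd.Series([10, 12, 11, 13, 15, 14, 16, 18], index=dates)

# 3-day rolling mean; the first two entries are NaN because the window is not yet full.
trend = sales.rolling(window=3).mean()
print(trend)
```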

CONTENT
Time Series Analysis

Time Series Analysis Types (continued)

• Exploratory analysis
• Forecasting
• Intervention analysis
• Segmentation

Topic Objective

Objective: To explore data transformation and dimensionality reduction techniques such as Principal Component Analysis (PCA), Factor Analysis (FA) and Linear Discriminant Analysis (LDA).

Recap: Basic knowledge of data pre-processing.

CONTENT
Data transformation and dimensionality reduction

Dimensionality reduction refers to the technique of reducing
the dimension of a data feature set. Usually, machine learning
datasets (feature sets) contain hundreds of columns (i.e.,
features), which can be pictured as an array of points forming a
massive sphere in a three-dimensional space. By applying
dimensionality reduction, you can bring the number of columns
down to a manageable count, in the same way that projecting the
three-dimensional sphere gives a two-dimensional object (a circle).

CONTENT
Data transformation and dimensionality reduction

Dimensionality reduction has many other benefits, such as:

• It eliminates noise and redundant features.
• It helps improve the model's accuracy and performance.
• It facilitates the usage of algorithms that are unfit for a large number of dimensions.
• It reduces the amount of storage space required (less data needs less storage space).
• It compresses the data, which reduces the computation time and facilitates faster training on the data.

CONTENT
Data transformation and dimensionality reduction

Dimensionality Reduction Techniques: Dimensionality reduction techniques can be categorized into two broad categories:
1. Feature selection
The feature selection method aims to find a subset of the
input variables (those that are most relevant) from the original
dataset. Feature selection includes three strategies, namely:
• Filter strategy
• Wrapper strategy
• Embedded strategy
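The filter strategy scores each input variable on its own, without training a model. A minimal sketch with pandas, assuming a hypothetical table, a zero-variance filter, and an arbitrary correlation cut-off of 0.5:

```python
import pandas as pd

# Hypothetical data with target column "score", invented for illustration.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "constant_flag": [1, 1, 1, 1, 1, 1],
    "shoe_size": [7, 9, 6, 9, 7, 8],
    "score": [35, 45, 50, 62, 70, 81],
})
features = df.drop(columns="score")

# Filter 1: drop features that never vary, since they carry no information.
features = features.loc[:, features.var() > 0]

# Filter 2: keep features whose absolute correlation with the target is high.
corr = features.corrwith(df["score"]).abs()
selected = corr[corr > 0.5].index.tolist()
print(selected)
```

Wrapper and embedded strategies differ in that they involve a model: wrappers search over feature subsets by repeatedly fitting it, and embedded methods select features as part of the fitting itself.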
CONTENT
Data transformation and dimensionality reduction

Dimensionality Reduction Techniques

2. Feature extraction
Feature extraction, also called feature projection, converts the data
from the high-dimensional space to one with fewer
dimensions. This data transformation may be either linear
or nonlinear. The technique finds a
smaller set of new variables, each of which is a
combination of the input variables and which together retain
most of the information in the input variables.
CONTENT
Data transformation and dimensionality reduction

1. Principal Component Analysis (PCA)

• Principal Component Analysis is one of the leading linear techniques of dimensionality reduction.
• It is a statistical procedure that orthogonally converts the 'n' coordinates of a dataset into a new set of n coordinates, known as the principal components. The first principal component has the maximum variance, and each following component has the largest remaining variance while being orthogonal to the earlier ones.
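The procedure can be sketched directly with NumPy: centre the data, take the eigenvectors of the covariance matrix, and project onto the leading ones. The synthetic data and the choice of keeping two components are assumptions of this example; libraries such as scikit-learn provide ready-made implementations.

```python
import numpy as np

# Synthetic data: the second column is almost a multiple of the first.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(scale=0.5, size=200),
])

# Centre the data and compute the covariance matrix of the columns.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigenvectors of the covariance matrix are the principal directions.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()        # share of variance per component
scores = Xc @ eigvecs[:, :2]               # data projected onto the first two components
print(explained)
```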

CONTENT
Data transformation and dimensionality reduction

Linear discriminant analysis (LDA)

Linear discriminant analysis is a generalization of
Fisher's linear discriminant method that is widely applied
in statistics, pattern recognition, and machine learning.
LDA represents data in a way that maximizes class
separability. After projection, objects belonging to the same class
are placed close together, while objects from different
classes are arranged far apart.
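For two classes, Fisher's linear discriminant has a closed form: the projection direction is the inverse within-class scatter matrix applied to the difference of the class means. A minimal sketch with NumPy on synthetic data; the class locations and sizes are assumptions of this example.

```python
import numpy as np

# Two synthetic classes in two dimensions, invented for illustration.
rng = np.random.default_rng(1)
class_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
class_b = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(50, 2))

mean_a, mean_b = class_a.mean(axis=0), class_b.mean(axis=0)

# Within-class scatter: spread of each class around its own mean, summed.
S_w = (class_a - mean_a).T @ (class_a - mean_a) \
    + (class_b - mean_b).T @ (class_b - mean_b)

# Fisher direction: w solves S_w w = (mean_b - mean_a).
w = np.linalg.solve(S_w, mean_b - mean_a)

# Project both classes onto the single discriminant axis.
proj_a, proj_b = class_a @ w, class_b @ w
print(proj_a.mean(), proj_b.mean())
```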

CONTENT
Data transformation and dimensionality reduction

Factor Analysis
• Factor Analysis (FA) is an exploratory data analysis method used to search for influential underlying factors or latent variables from a set of observed variables. It helps in data interpretation by reducing the number of variables. It extracts the maximum common variance from all variables and puts it into a common score.
• Factor analysis is a linear statistical model. It is used to explain the variance among the observed variables and to condense a set of observed variables into unobserved variables called factors.
CONTENT
Data transformation and dimensionality reduction

Types of Factor Analysis

• Exploratory Factor Analysis (EFA): It is the most popular factor analysis approach among social and management researchers. Its basic assumption is that any observed variable is directly associated with any factor.
• Confirmatory Factor Analysis (CFA): Its basic assumption is that each factor is associated with a particular set of observed variables. CFA tests whether the data fit this expected structure.

CONTENT
Data transformation and dimensionality reduction

How does Factor Analysis Work?

The primary objective of factor analysis is to reduce the
number of observed variables and find unobservable
variables. These unobserved variables help the market
researcher to conclude the survey. This conversion of the
observed variables to unobserved variables can be achieved
in two steps:
1. Factor Extraction
2. Factor Rotation
CONTENT
Data transformation and dimensionality reduction

• Factor Extraction: In this step, the number of factors and the approach for extraction are selected using variance partitioning methods such as principal components analysis and common factor analysis.
• Factor Rotation: In this step, rotation tries to convert the factors into uncorrelated factors. The main goal of this step is to improve the overall interpretability. Many rotation methods are available, such as the Varimax, Quartimax and Promax rotation methods.
Topic Objective

Objective: To learn about univariate and multivariate exploratory data analysis.

Recap: Basic knowledge of exploratory data analysis.

CONTENT
Univariate and Multivariate Exploratory Data Analysis

Exploratory Data Analysis is majorly performed using the

following methods:
1. Univariate analysis
2. Bivariate analysis
3. Multivariate analysis


CONTENT
Univariate and Multivariate Exploratory Data Analysis

• Univariate analysis: provides summary statistics for each field in the raw data set, that is, a summary of only one variable at a time. Examples: CDF, PDF, box plot, violin plot.
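As one univariate example, an empirical CDF gives, for each value, the fraction of observations at or below it. A minimal sketch with NumPy on hypothetical values:

```python
import numpy as np

# Hypothetical observations of a single variable, invented for illustration.
values = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

# Sort the values; the i-th sorted value has i/n of the data at or below it.
xs = np.sort(values)
ecdf = np.arange(1, len(xs) + 1) / len(xs)

# Fraction of observations less than or equal to 3.
share_le_3 = np.mean(values <= 3.0)
print(xs, ecdf, share_le_3)
```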

CONTENT
Univariate and Multivariate Exploratory Data Analysis

• Bivariate analysis: is performed to find the relationship between each variable in the dataset and the target variable of interest, that is, it takes two variables and finds the relationship between them. Examples: box plot, violin plot.

CONTENT
Univariate and Multivariate Exploratory Data Analysis

• Multivariate analysis: is performed to understand interactions between different fields in the dataset, that is, interactions among more than two variables. Examples: pair plot and 3D scatter plot.

Topic Objective

Objective: To understand the concept of data munging and data wrangling, and APIs and other tools for scraping data from the web/internet using R/Python.

Recap: Basic knowledge of exploratory data analysis.

CONTENT
Data Munging

What is Data Munging?

Data munging, also known as data wrangling, is the
process of converting raw data into a more usable
format. Often, data munging occurs as a precursor
to data analytics or data integration. High-quality data is
essential for sophisticated data operations.
The munging process typically begins with a large
volume of raw data. Data scientists will mung the data
into shape by removing any errors or inconsistencies.


CONTENT
Data Munging

The modern data munging process now involves six main steps:

1. Discover: First, the data scientist performs a degree of data exploration. This is a first glance at the data to establish the most important patterns. It also allows the scientist to identify any major structural issues, such as invalid data formats.

CONTENT
Data Munging

2. Structure: Raw data might not have an appropriate structure for the intended usage. The data scientists will organize and normalize the data so that it's more manageable. This also makes it easier to perform the next steps in the munging process.

CONTENT
Data Munging

3. Clean: Raw data can contain corrupt, empty, or invalid cells. There may also be values that require conversion, such as dates and currencies. For instance, the state in a customer's address might appear as Texas, Tex, or TX. The cleaning process will standardize this value for every address.
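The state example can be sketched with pandas. The lookup table below is hypothetical and covers only the spellings used in this illustration; a real cleaning step would need a complete mapping.

```python
import pandas as pd

# Hypothetical address data with inconsistent state spellings.
df = pd.DataFrame({"state": ["Texas", "Tex", "TX", " tx ", "California"]})

# Hypothetical lookup from normalised spelling to the standard code.
state_map = {"texas": "TX", "tex": "TX", "tx": "TX", "california": "CA"}

# Trim whitespace, lower-case, then map every variant to one standard value.
df["state_clean"] = df["state"].str.strip().str.lower().map(state_map)
print(df)
```

Spellings that are not in the mapping come out as missing values, which makes unhandled variants easy to find.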

CONTENT
Data Munging

4. Enrich: Data enrichment is the process of filling in missing details by referring to other data sources. Data enrichment lets you fill in all address fields by looking up the missing values elsewhere, such as in the CRM database or a postal records lookup.

CONTENT
Data Munging

5. Validate: Finally, it's time to ensure that all data values are logically consistent. This means checking things like whether all phone numbers have nine digits, that there are no numbers in name fields, and that all dates are valid calendar dates. Data validation also involves some deeper checks, such as ensuring that all values are compatible with the specified data type.
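The three checks named above can be sketched with the Python standard library. The function names are hypothetical, the nine-digit rule simply follows the slide (real phone formats vary by country), and the date format is an assumption.

```python
import re
from datetime import datetime

def is_valid_phone(value, digits=9):
    """True if the value consists of exactly `digits` digits."""
    return re.fullmatch(rf"\d{{{digits}}}", value) is not None

def is_valid_name(value):
    """True if the name field contains no digits."""
    return not any(ch.isdigit() for ch in value)

def is_valid_date(value, fmt="%Y-%m-%d"):
    """True if the value is a real calendar date in the assumed format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

print(is_valid_phone("123456789"), is_valid_name("R2D2"), is_valid_date("2023-02-30"))
```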

CONTENT
Data Munging

6. Publish: When the data munging process is complete, the data science team will push it towards its final destination. Often this is a data repository, where it will integrate with data from other sources. This will make the munged data permanently available to all consumers.

CONTENT
Data Munging

Web Scraping with Python

Web scraping is an automated method used to extract large amounts of data from websites. The data on websites is unstructured. Web scraping helps collect this unstructured data and store it in a structured form. There are different ways to scrape websites, such as online services, APIs, or writing your own code.

CONTENT
Data Munging

To extract data using web scraping with Python, you need to follow these basic steps:
1. Find the URL that you want to scrape
2. Inspect the page
3. Find the data you want to extract
4. Write the code
5. Run the code and extract the data
6. Store the data in the required format

CONTENT
Data Munging

Libraries used for Web Scraping

As we know, Python has various applications, and there are different libraries for different purposes. In our further demonstration, we will be using the following libraries:

• Selenium: Selenium is a web testing library. It is used to automate browser activities.

CONTENT
Data Munging

Libraries used for Web Scraping

• BeautifulSoup: Beautiful Soup is a Python package for parsing HTML and XML documents. It creates parse trees that are helpful for extracting the data easily.

• Pandas: Pandas is a library used for data manipulation and analysis. It is used to extract the data and store it in the desired format.

CONTENT
Data Munging

Web Scraping Example: Scraping the Flipkart Website

Step 1: Find the URL that you want to scrape
For this example, we are going to scrape the Flipkart website to extract the
Price, Name, and Rating of laptops. The URL for this page
is https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2.

CONTENT
Data Munging

Step 2: Inspecting the Page

The data is usually nested in tags. So, we inspect the page
to see under which tag the data we want to scrape is
nested. To inspect the page, right-click on the element
and click on “Inspect”.


When you click on the “Inspect” tab, you will see a “Browser
Inspector Box” open.


Step 4: Write the code

First, create a Python file. To do this, open the terminal
in Ubuntu and type gedit <your file name> with a .py
extension, for example:
gedit web-s.py
Next, import the necessary libraries (Selenium, BeautifulSoup and pandas).


To configure the web driver to use the Chrome browser, we have to set the
path to the ChromeDriver executable.

The driver is then used to open the URL chosen in Step 1.
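
A minimal sketch of this step, assuming Selenium 4 and a local Chrome installation. The function names make_chrome_driver and fetch_page_source are hypothetical helpers introduced here for illustration, and the driver path is whatever applies on your machine. Selenium is imported inside the helper so that the rest of the script can be loaded and tested without a browser.

```python
def make_chrome_driver(driver_path=None):
    """Create a Chrome WebDriver (needs the selenium package and Chrome)."""
    # Imported lazily so this file can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    if driver_path is None:
        # Recent Selenium 4 releases can locate a matching driver themselves.
        return webdriver.Chrome()
    # Otherwise point Selenium at an explicit ChromeDriver executable.
    return webdriver.Chrome(service=Service(executable_path=driver_path))


def fetch_page_source(url, make_driver=make_chrome_driver):
    """Open url in a browser and return the HTML of the loaded page."""
    driver = make_driver()
    try:
        driver.get(url)            # navigate the browser to the page
        return driver.page_source  # HTML as currently rendered
    finally:
        driver.quit()              # always close the browser window
```

Calling fetch_page_source with the URL from Step 1 returns the page HTML as a string, which can then be passed to BeautifulSoup. The make_driver parameter exists only so that a stand-in driver can be supplied when no browser is available.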


Find the div tags with the class names identified while inspecting the page, extract
the data, and store it in variables.
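
The extraction loop can be sketched as follows. This is an illustration under stated assumptions, not the code from the original slide: the sample HTML and the class names "product", "product-name", "product-price" and "product-rating" are placeholders, and on a real page they must be replaced with the class names seen in the browser inspector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; real class names come from inspecting the page.
SAMPLE_HTML = """
<div class="product">
  <div class="product-name">Laptop A</div>
  <div class="product-price">45,990</div>
  <div class="product-rating">4.3</div>
</div>
<div class="product">
  <div class="product-name">Laptop B</div>
  <div class="product-price">52,490</div>
</div>
"""


def text_of(parent, class_name):
    """Return the stripped text of the first matching div, or None if absent."""
    tag = parent.find("div", attrs={"class": class_name})
    return tag.get_text(strip=True) if tag is not None else None


def extract_products(html):
    """Collect name, price and rating of every product card into three lists."""
    soup = BeautifulSoup(html, "html.parser")
    names, prices, ratings = [], [], []
    for card in soup.find_all("div", attrs={"class": "product"}):
        names.append(text_of(card, "product-name"))
        prices.append(text_of(card, "product-price"))
        ratings.append(text_of(card, "product-rating"))
    return names, prices, ratings


names, prices, ratings = extract_products(SAMPLE_HTML)
print(names, prices, ratings)
```

Recording None for a missing field keeps the three lists the same length, so each index still refers to one product.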


Step 5: Run the code and extract the data

To run the code, pass the file created in Step 4 (web-s.py in this example) to the Python interpreter from the terminal.

Step 6: Store the data in a required format

After extracting the data, you will usually want to store it. The format
depends on your requirements. For this example, we will store the extracted data in
CSV (Comma-Separated Values) format.
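
As a minimal sketch of this step, the example below writes three lists to CSV using Python's standard csv module rather than pandas, so it runs without extra packages. The sample values, the column titles and the helper name write_products_csv are assumptions made for illustration.

```python
import csv
import io

# Hypothetical values standing in for the lists filled during extraction.
names = ["Laptop A", "Laptop B"]
prices = ["45,990", "52,490"]
ratings = ["4.3", "4.1"]


def write_products_csv(fileobj, names, prices, ratings):
    """Write a header row followed by one row per product."""
    writer = csv.writer(fileobj)
    writer.writerow(["Product Name", "Price", "Rating"])
    # zip pairs up the i-th name, price and rating into one row.
    writer.writerows(zip(names, prices, ratings))


# Written to an in-memory buffer here; to create a file instead, pass
# open("products.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_products_csv(buffer, names, prices, ratings)
csv_text = buffer.getvalue()
print(csv_text)
```

Values that contain a comma, such as the prices above, are quoted automatically by the writer. With pandas, the same result can be obtained by building a DataFrame from the lists and calling its to_csv method.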


Faculty Video Links, YouTube & NPTEL Video Links and Online Courses Details

YouTube videos

https://www.youtube.com/watch?v=q4pyaVZjqk0

https://www.youtube.com/watch?v=7sJaRHF03K8

https://www.youtube.com/watch?v=mKxFfjNyj3c

https://www.youtube.com/watch?v=azXCzI57Yfc

https://www.youtube.com/watch?v=83x5X66uWK0

MCQ unit wise/weekly

1. Which of the following is not an example of a time series model?

A) Naive approach
B) Exponential smoothing
C) Moving Average
D) None of the above


2. Which of the following can’t be a component of a time series plot?

A) Seasonality
B) Trend
C) Cyclical
D) Noise
E) None of the above


3. Which of the following is relatively easier to estimate in time series modeling?

A) Seasonality
B) Cyclical
C) No difference between Seasonality and Cyclical


4. Which of the following techniques would perform better for reducing dimensions of a data set?

A. Removing columns which have too many missing values

B. Removing columns which have high variance in data

C. Removing columns with dissimilar data trends

D. None of these


5. [True or False] Dimensionality reduction algorithms are one of the possible ways to reduce the computation time required to
build a model.

A. TRUE

B. FALSE


6. Which of the following algorithms cannot be used for reducing the dimensionality of data?

A. t-SNE

B. PCA

C. LDA

D. None of these


7. Which of the following statements is TRUE?

(A) Outliers should always be identified and removed from a
dataset.
(B) Outliers can never be present in the testing dataset.
(C) An outlier is a data point that is significantly close to other data
points.
(D) The nature of the business problem determines how
outliers are used.


8. __________ are the data objects that don’t comply with the
general model or behaviour of the available data:

a. Evolution Analysis

b. Outlier Analysis

c. Classification

d. Prediction


9. The primary use of data cleaning is:

a. Removing the noisy data

b. Correction of the data inconsistencies

c. Transformations for correcting the wrong data

d. All of the above


10. __________ describes and models regularities or trends for objects whose behavior changes
over time.

a. Evolution Analysis

b. Outlier Analysis

c. Classification

d. Prediction


Weekly/monthly/Unit Wise Assignment

Assignment 1
1. What is an outlier, and how are outliers identified?
2. What are the steps of Data Cleaning?
3. What are missing values? How do you handle missing values?
4. Name two pandas methods that are useful for handling
missing values.
5. Explain the phrase “Curse of Dimensionality”.
6. What do you mean by Feature Splitting?
7. What is the importance of using PCA before clustering?
8. Explain the standardization scaling method to normalize data.
9. Differentiate between Univariate, Bivariate, and
Multivariate analysis.

Glossary Questions

1. ___________ is really just making up data to artificially inflate
results. It’s better to just drop cases with missing data than to
impute.
2. ____________ is always the best way to deal with missing data.
3. Principal Component Analysis (PCA) is an example
of _______________.
4. PCA reduces the dimension by finding a few ________.
5. ______ is a tool which is used to reduce the dimension of the data.


6. PCA is used to find _______.
7. ______ is a non-zero vector that stays parallel after matrix
multiplication.
8. ______ is a dimensionality reduction technique which is commonly
used for supervised classification problems.
9. Discriminative Learning Algorithms include _______.
10. The predictions for generative learning algorithms are made using
_______.


Expected Questions for University Exam

1. What is Data Manipulation?
2. Define Outliers. How are they identified?
3. Name some methods for missing-value imputation.
4. Explain the standardization scaling method to normalize data.
5. Name the top two techniques to handle missing data.
6. Explain Principal Component Analysis, its assumptions, and its equations.
7. What is the importance of using PCA before clustering?
8. Explain the Curse of Dimensionality.
9. How can you evaluate the performance of a dimensionality reduction
algorithm on your dataset?

Summary

• This unit covers the fundamentals of Exploratory Data Analysis
and its latest trends in industry.
• It also covers the different types of data, how to handle
missing data, and the concept of model building as it is
used in industry.
• It introduces web scraping using Python.
