
Unit 4

The document outlines the course structure and objectives for a Data Analytics program at the Noida Institute of Engineering and Technology, focusing on exploratory data analysis and various data manipulation techniques. It details the evaluation scheme, course outcomes, and prerequisites, along with the importance of handling missing data and employing visualization methods. The course aims to equip students with the necessary skills to analyze and interpret data effectively using programming languages like R and Python.


Noida Institute of Engineering and Technology, Greater Noida

DATA ANALYTICS

Unit: 4

Exploratory Data Analysis

Mr. Ravi Pandey
Assistant Professor, ECE
B Tech VIIth Sem

12/12/2023 Sanchi Kaushik UNIT 04 Data Analytics

Faculty Introduction

Mr. Ravi has been a faculty member in the Discipline of Electronics
and Communication Engineering at Noida Institute of Engineering and
Technology, Greater Noida, Gautam Budh Nagar, Uttar Pradesh, India,
since February 2016. He received the B.Tech degree in electronics and
telecommunication engineering and the M.Tech degree in instrumentation
and signal processing. His research interests include biomedical signal
processing, pattern recognition, machine learning and deep neural
networks.
Evaluation scheme

Periods: L, T, P. Evaluation Schemes: CT, TA, Total, PS. End Semester: TE, PE.

Sl. No.  Subject Codes  Subject                              L  T  P   CT  TA  Total  PS   TE   PE   Total  Credit
1        -              Departmental Core - I                3  0  0   30  20  50     -    100  -    150    3
2        -              Departmental Elective V              3  0  0   30  20  50     -    100  -    150    3
3        -              Open Elective II                     3  0  0   30  20  50     -    100  -    150    3
4        -              Open Elective III                    3  0  0   30  20  50     -    100  -    150    3
5        -              Lab - I                              0  0  2   -   -   -      25   -    25   50     1
6        -              Internship Assessment                0  0  2   -   -   -      50   -    -    50     1
         -              MOOCs (Essential for Hons. Degree)   0  0  2
                        Total                                                                        700    14

Course applicable for: B.Tech. Data Science/AI-ML/AI/IOT/CSBS
Course objective
B. TECH. (Data Science)

Course code L T P Credits


3 0 0 3

Course title Data Analytics

Course objective:

The objective of this course is to understand the fundamental concepts of Data Science
and to learn about various types of data formats and their manipulation. It helps students
learn exploratory data analysis and visualization techniques in addition to the R
programming language.

15/06/2022 Nisha UNIT 01 4


Course Outcomes
Course outcomes: After completion of this course, students will be able to:

CO 1 Understand the fundamental concepts of data analytics in the areas that play a major role
within the realm of data science. (K1)

CO 2 Explain and exemplify the most common forms of data and their representations. (K2)

CO 3 Understand and apply data pre-processing techniques. (K3)

CO 4 Analyse data using exploratory data analysis. (K4)

CO 5 Illustrate various visualization methods for different types of data sets and application
scenarios. (K3)


Course Contents / Syllabus


Text Books

Text books:

1) Glenn J. Myatt, Making Sense of Data: A Practical Guide to Exploratory Data Analysis
and Data Mining, John Wiley Publishers, 2007.
2) Data Analysis and Data Mining, 2nd Edition, John Wiley & Sons Publication, 2014.

Reference Books:

1) Neha Sharma, Santanu Ghosh, Monodeep Saha, Open Data for Sustainable Community:
Glocalized Sustainable Development Goals, Springer, 2021.
2) Field Cady, The Data Science Handbook, John Wiley & Sons, Inc, 2017.
3) Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques,
Third Edition, Morgan Kaufmann, 2012.


Branch wise Applications

• Security
• Transportation
• Risk detection
• Risk management
• Delivery
• Fast internet allocation
• Reasonable expenditure
• Interaction with customers
• Planning of cities

12 December 2023 Neeti Taneja ACSE0403A OS Unit-3


Course Objectives

• The objective of this course is to understand the fundamental concepts of data
analytics and learn about various types of data formats and their manipulation.

• It helps students learn exploratory data analysis and visualization techniques in
addition to R/Python programming and Tableau.
Course Outcomes

Course outcome: After completion of this course, students will be able to:

CO1 Understand the fundamentals of operating systems, their structure and
functions. (K1, K2)

CO2 Implement the concepts of process management policies, CPU scheduling and
thread management. (K5)

CO3 Understand and implement the requirements of process synchronization
and apply deadlock handling algorithms. (K2, K5)

CO4 Evaluate memory management and its allocation policies. (K5)

CO5 Understand and analyze I/O management and file systems. (K2, K4)
Program Outcomes

1. Engineering knowledge
2. Problem analysis
3. Design/development of solutions
4. Conduct investigations of complex problems
5. Modern tool usage
6. The engineer and society
7. Environment and sustainability
8. Ethics
9. Individual and team work
10. Communication
11. Project management and finance
12. Life-long learning


COs and POs Mapping

Course Outcome  PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12
1               3    2    2    -    -    -    -    -    -    -     -     1
2               3    3    3    -    -    -    -    -    -    -     -     1
3               3    3    3    -    -    -    -    -    -    -     -     1
4               3    2    1    -    -    -    -    -    -    -     -     1
5               3    2    2    -    -    -    -    -    -    -     -     1
Average         3    2.4  2.2  -    -    -    -    -    -    -     -     1


Program Specific Outcomes (PSOs)

On successful completion of the B. Tech. (DS) Program, the Data
Science graduates will be able to:

• PSO1: Analyse, design and develop solutions by applying fundamental concepts of
Data Science.
• PSO2: Apply technical knowledge while using modern tools and technologies for
solving complex problems.
• PSO3: Collaborate across different fields of science and technology with the right
attitude, work as an individual or as a team, and demonstrate professional ethics for the
well-being of society.


COs and PSOs Mapping

Course Outcome  PSO1  PSO2  PSO3
1               3     -     -
2               3     2     -
3               3     2     -
4               3     2     2
5               3     2     -
Average         3     2     2


Program Educational Objectives (PEOs)

• Solve real-time complex problems and adapt to technological changes with the ability of
lifelong learning.

• Work as data scientists, entrepreneurs, and bureaucrats for the goodwill of the society
and pursue higher education.

• Exhibit professional ethics and moral values with good leadership qualities and effective
interpersonal skills.


Faculty wise Result Analysis

• NA


End Semester Question Paper Templates (Offline Pattern/Online Pattern)


CONTENT
Unit -4 Exploratory Data Analysis
• Handling missing data
• Removing redundant variables
• Variable selection
• Identifying outliers
• Removing outliers
• Time series analysis
• Data transformation and dimensionality reduction techniques:
Principal Component Analysis (PCA), Factor Analysis (FA) and
Linear Discriminant Analysis (LDA)
• Univariate and multivariate exploratory data analysis
• Data munging, data wrangling: APIs and other tools for scraping
data from the web/internet using R/Python


Prerequisite and Recap

Prerequisite:-

• Basic knowledge of Statistics and Probability.

• Basic knowledge of Python programming.

Recap-

• Students are clear about the types of data, data
classification and data manipulation.

• Students know the basic concepts of data pre-processing.

• Students have knowledge of data reduction.


Unit Objective

The objectives of Unit 4 are:

1. To describe how to handle missing data.
2. To remove redundant variables.
3. To explore data transformation and dimensionality reduction
techniques such as Principal Component Analysis (PCA), Factor
Analysis (FA) and Linear Discriminant Analysis (LDA).
4. To describe the services an operating system provides to users,
processes, and other systems.
5. To discuss data munging, data wrangling, APIs and other tools for
scraping data from the web/internet using Python.


Topic Objective

Objective: To study Exploratory Data Analysis
(EDA) and the various kinds of tools used for EDA.

Recap: Basic knowledge of data pre-processing.


Exploratory Data Analysis

• Exploratory data analysis (EDA) is used by data scientists to
analyze and investigate data sets and summarize their main
characteristics, often employing data visualization methods.
• It helps determine how best to manipulate data sources to get the
answers you need, making it easier for data scientists to discover
patterns, spot anomalies, test a hypothesis, or check assumptions.
• EDA is primarily used to see what data can reveal beyond the
formal modeling or hypothesis testing task, and it provides a
better understanding of data set variables and the
relationships between them.

Exploratory data analysis tools:-

•Clustering and dimension reduction techniques, which help create graphical
displays of high-dimensional data containing many variables.
•Univariate visualization of each field in the raw dataset, with summary
statistics.
•Bivariate visualizations and summary statistics that allow you to assess the
relationship between each variable in the dataset and the target variable.
•Multivariate visualizations, for mapping and understanding interactions
between different fields in the data.


Types of exploratory data analysis:- There are four primary types of EDA:

•Univariate non-graphical: This is the simplest form of data analysis, where the data
being analyzed consists of just one variable.
•Univariate graphical: Non-graphical methods don’t provide a full picture of the
data, so graphical methods are also required.
•Multivariate non-graphical: Multivariate data arises from more than one
variable. Multivariate non-graphical EDA techniques generally show the
relationship between two or more variables of the data through
cross-tabulation or statistics.
•Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships
between two or more sets of data.


Topic Objective

Objective: To understand how to handle missing data.

Recap: Basic knowledge of data pre-processing.


Handling Missing data

Missing Values:-
The data may have missing values in some of its columns. There are three
major categories of missing values:
1. MCAR (Missing completely at random): These are values that
are randomly missing and do not depend on any other values.
2. MAR (Missing at random): Whether these values are missing depends on
other observed features, but not on the missing value itself.
3. MNAR (Missing not at random): There is a reason behind why
these values are missing: the missingness depends on the unobserved value itself.

Handling missing data


Removing missing data
Once you know that data is MCAR and a relatively small
fraction of observations have missing values, then it may be
safe to remove observations.
Encoding as missing
In the MCAR or MAR case for categorical attributes y, a useful
approach is to encode the fact that a value is missing as a new
category and include that in subsequent analysis of attribute y.
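
The two options above (removing observations and encoding missingness as a category) might look like this in pandas. This is a minimal sketch: the column names and values are hypothetical, and the label "Missing" is an arbitrary choice, not something the slides prescribe.

```python
import pandas as pd

# Hypothetical data: one numeric and one categorical attribute with gaps.
df = pd.DataFrame({
    "age": [25.0, None, 31.0, 40.0],
    "city": ["Noida", "Delhi", None, "Noida"],
})

# Option 1: remove every observation (row) that has at least one missing value.
dropped = df.dropna()

# Option 2: keep all rows and encode "missing" as its own category.
encoded = df.copy()
encoded["city"] = encoded["city"].fillna("Missing")
```

Dropping rows is only reasonable when the data is MCAR and few rows are affected; here it discards half of the toy data set, which shows why the encoding option can be preferable.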

Imputation
1. MCAR (also for MAR, but this is not ideal)
In this case we can use a simple method for imputation of y.
For numeric attributes we replace missing values in y with
the mean of non-missing values of y.
For categorical attributes y, we replace missing values with
the most common category in the non-missing values of y.
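
Simple imputation as described above can be sketched in pandas as follows. The data frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data with one numeric and one categorical attribute.
df = pd.DataFrame({
    "income": [30.0, None, 50.0, 40.0],
    "grade": ["A", "B", None, "A"],
})

# Numeric attribute: replace missing values with the mean of the non-missing values.
df["income"] = df["income"].fillna(df["income"].mean())

# Categorical attribute: replace missing values with the most common category.
df["grade"] = df["grade"].fillna(df["grade"].mode()[0])
```

Both `mean()` and `mode()` skip missing values by default, so the statistics are computed from the observed entries only.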

2. MAR
In this case we use a more complex method: missing values of
attribute y are replaced by predictions from other variables x
when the variables are related (for example, linear regression
using the lm and predict functions in R, covered later on).
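
The slide refers to R's lm and predict. As a rough Python equivalent, the sketch below fits a straight line with NumPy's `polyfit` on the rows where y is observed and uses it to fill the gap. The arrays are hypothetical, and a single predictor x is assumed for simplicity.

```python
import numpy as np

# Hypothetical data: y is related to x, and one y value is missing.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, np.nan, 8.0, 10.0])

observed = ~np.isnan(y)

# Fit y = slope * x + intercept using only the rows where y is observed.
slope, intercept = np.polyfit(x[observed], y[observed], deg=1)

# Predict the missing y values from their x values.
y_imputed = y.copy()
y_imputed[~observed] = slope * x[~observed] + intercept
```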


Implications of imputation
Imputation has some effects that can impact analysis.
1.The central tendency of data is retained. For example, if we impute
missing data using the mean of a numeric variable, the mean after
imputation will not change. This is a good reason to impute based on
estimates of central tendency.
2.The spread of the data will change. After imputation, the spread of the
data will be smaller relative to spread if we ignore missing values. This
could be problematic as underestimating the spread of data can yield
over-confident inferences in downstream analysis.
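
Both effects can be checked numerically. The sketch below uses only the standard library and a small made-up sample in which three values are assumed to be missing and are filled with the observed mean.

```python
import statistics

# Hypothetical observed values; three further values are assumed missing.
observed = [10.0, 12.0, 14.0, 16.0, 18.0]
n_missing = 3

mean_observed = statistics.mean(observed)

# Mean imputation: every missing value is replaced by the observed mean.
imputed = observed + [mean_observed] * n_missing

mean_after = statistics.mean(imputed)        # central tendency is retained
spread_before = statistics.stdev(observed)   # spread when missing values are ignored
spread_after = statistics.stdev(imputed)     # spread after imputation is smaller
```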

Topic Objective

Objective: To learn the techniques for removing
redundant variables.

Recap: Basic knowledge of data pre-processing.


Removing Redundant variables

1. Removing redundant columns -
Sometimes more than one column contains the
same or similar values. In that case the second column does
not add any value to the model, so it is wise to delete the
redundant column.
2. Removing redundant rows -
This depends on the use case, but if having duplicate
records does not make sense, then it is wise to remove the
redundant rows, for example with pandas:
df.drop_duplicates()
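
A slightly fuller sketch of both steps is shown below. The data frame is hypothetical, and detecting duplicate columns by transposing the frame is one possible approach that only catches exact copies, not merely similar columns.

```python
import pandas as pd

# Hypothetical data: "height_copy" repeats "height_cm", and one row is repeated.
df = pd.DataFrame({
    "height_cm": [170, 180, 170, 165],
    "height_copy": [170, 180, 170, 165],
    "weight_kg": [65, 80, 65, 55],
})

# Redundant columns: transpose so columns become rows, then flag exact duplicates.
df = df.loc[:, ~df.T.duplicated()]

# Redundant rows: drop records that repeat an earlier record.
df = df.drop_duplicates()
```

Note that `drop_duplicates()` returns a new data frame, so the result has to be assigned back (or `inplace=True` passed) for the change to persist.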
Variable selection

Variable selection is the process of testing a collection of candidate
model variables for significance during model training. Candidate model
variables are also known as independent variables, predictors,
attributes, model factors, covariates, regressors, features, or
characteristics.
Variable selection is a parsimonious process that aims to identify
a minimal set of predictors for maximum gain (predictive
accuracy). This approach is the opposite of data preparation,
where as many meaningful variables as possible are added to the
mining view.

Typical preventive measures during variable selection include:


•Collaboration with experts in the field to identify the important
variables.
•Awareness of any problems in relation to data source, reliability
or mismeasurement.
•Cleaning the data.
•Using control variables to account for banned variables or
specific events such as an economic drift.

Identifying Outliers

An outlier is something separate or different from the crowd.


Outliers can be a result of a mistake during data collection or it
can be just an indication of variance in your data. Some of the
methods for detecting and handling outliers:
•Box Plot
•Scatter plot
•Z-score
•IQR(Inter-Quartile Range)
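
Two of the methods listed above, the Z-score and the IQR rule, can be sketched with the standard library alone. The data and the cut-offs are assumptions: a Z-score cut-off of 3 is common for large samples, but in a sample of only ten values no point can exceed a Z-score of 3, so 2 is used here. The 1.5 × IQR fence is the usual box-plot convention.

```python
import statistics

# Hypothetical measurements with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 95]

# Z-score method: how many standard deviations each value lies from the mean.
mean = statistics.mean(data)
sd = statistics.pstdev(data)
z_outliers = [x for x in data if abs(x - mean) / sd > 2]

# IQR method: values beyond 1.5 * IQR from the quartiles are flagged.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < lower or x > upper]
```

Once flagged, outliers can be removed with the same conditions, for example `[x for x in data if lower <= x <= upper]`.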

Topic Objective

Objective: To understand the concept of time series
analysis.

Recap: Basic knowledge of data pre-processing.


Time Series Analysis

Time series analysis is a specific way of analyzing a sequence of
data points collected over an interval of time. In time series
analysis, analysts record data points at consistent intervals over a
set period of time rather than just recording the data points
intermittently or randomly. However, this type of analysis is not
merely the act of collecting data over time.

Why organizations use time series data analysis


Time series analysis helps organizations understand the
underlying causes of trends or systemic patterns over time.
Using data visualizations, business users can see seasonal trends
and dig deeper into why these trends occur.
When organizations analyze data over consistent intervals, they
can also use time series forecasting to predict the likelihood of
future events. Time series forecasting is part of predictive
analytics.

Time series analysis examples:- Examples of time series analysis
in action include:
•Weather data
•Rainfall measurements
•Temperature readings
•Heart rate monitoring (EKG)
•Brain monitoring (EEG)
•Quarterly sales
•Stock prices
•Automated stock trading
•Industry forecasts
•Interest rates

Time Series Analysis Types


•Classification: Identifies and assigns categories to the data.
•Curve fitting: Plots the data along a curve to study the
relationships of variables within the data.
•Descriptive analysis: Identifies patterns in time series data,
like trends, cycles, or seasonal variation.
•Explanative analysis: Attempts to understand the data and
the relationships within it, as well as cause and effect.

•Exploratory analysis
•Forecasting
•Intervention analysis
•Segmentation
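
As a small illustration of descriptive analysis, a moving average smooths short-term fluctuation so that the underlying trend becomes visible. The monthly figures and the window length below are hypothetical.

```python
# Hypothetical monthly observations recorded at consistent intervals.
series = [12, 15, 11, 14, 18, 13, 16, 20, 15, 18, 22, 17]

window = 3  # assumed window length

# Simple moving average: mean of each run of `window` consecutive points.
moving_avg = [
    sum(series[i:i + window]) / window
    for i in range(len(series) - window + 1)
]

# A rising moving average suggests an upward trend in the series.
trend_is_upward = moving_avg[-1] > moving_avg[0]
```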

Topic Objective

Objective: To explore data transformation and
dimensionality reduction techniques such as Principal
Component Analysis (PCA), Factor Analysis (FA) and Linear
Discriminant Analysis (LDA).

Recap: Basic knowledge of data pre-processing.


Data transformation and dimensionality reduction

Dimensionality reduction refers to the technique of reducing
the dimension of a data feature set. Machine learning
datasets (feature sets) often contain hundreds of columns (i.e.,
features). By applying dimensionality reduction,
you can bring the number of columns down to a manageable
count. By analogy, this is like projecting a three-dimensional
sphere of points onto a two-dimensional object (a circle).

Dimensionality reduction has many other benefits, such as:


•It eliminates noise and redundant features.
•It helps improve the model’s accuracy and performance.
•It facilitates the usage of algorithms that are unfit for more
substantial dimensions.
•It reduces the amount of storage space required (less data
needs lesser storage space).
•It compresses the data, which reduces the computation time
and facilitates faster training of the data.

Dimensionality Reduction Techniques:- Dimensionality
reduction techniques can be categorized into two broad
categories:
1. Feature selection
The feature selection method aims to find a subset of the
input variables (that are most relevant) from the original
dataset. Feature selection includes three strategies, namely:
•Filter strategy
•Wrapper strategy
•Embedded strategy
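
Of the three strategies, the filter strategy is the simplest to sketch: each feature is scored independently of any model and only the high-scoring ones are kept. The example below scores features by absolute correlation with the target. The synthetic data, the feature names, and the 0.5 threshold are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data: "signal" drives the target, "noise" is unrelated to it.
signal = rng.normal(size=n)
noise = rng.normal(size=n)
target = 3.0 * signal + rng.normal(scale=0.1, size=n)

features = {"signal": signal, "noise": noise}

# Filter strategy: score each feature on its own, here by |correlation| with the target.
scores = {
    name: abs(np.corrcoef(values, target)[0, 1])
    for name, values in features.items()
}

# Keep only the features whose score passes the (assumed) threshold.
selected = [name for name, score in scores.items() if score > 0.5]
```

Wrapper and embedded strategies differ in that they evaluate feature subsets through a model, either by repeatedly training it or as part of its training.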
2. Feature extraction
Feature extraction, also called feature projection, converts the data
from the high-dimensional space to one with fewer
dimensions. This data transformation may be either linear
or nonlinear. This technique finds a
smaller set of new variables, each of which is a
combination of input variables (retaining as much of the
information in the input variables as possible).

1. Principal Component Analysis (PCA)

• Principal Component Analysis is one of the leading linear
techniques of dimensionality reduction.
• It is a statistical procedure that applies an orthogonal
transformation to convert the n coordinates of a dataset into a new set
of n coordinates, known as the principal components.
The first principal component has the maximum variance, and each
subsequent component has the largest variance possible while being
orthogonal to the preceding ones.
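
A minimal NumPy sketch of PCA via the eigen-decomposition of the covariance matrix is given below. The two-column synthetic data set is an assumption, and in practice a library implementation such as scikit-learn's would normally be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data in which the second column is almost a multiple of the first.
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

# 1. Center the data.
Xc = X - X.mean(axis=0)

# 2. Eigen-decompose the covariance matrix (eigh is for symmetric matrices).
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort components by decreasing variance.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Share of total variance captured by each principal component.
explained = eigvals / eigvals.sum()

# 5. Project onto the first principal component (2 dimensions -> 1).
scores = Xc @ eigvecs[:, :1]
```

Because the two columns are nearly collinear, almost all of the variance ends up in the first component, which is why one dimension is enough here.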

Linear discriminant analysis (LDA)


The linear discriminant analysis is a generalization of
Fisher’s linear discriminant method that is widely applied
in statistics, pattern recognition, and machine learning.
LDA represents data in a way that maximizes class
separability. While objects belonging to the same class
are juxtaposed via projection, objects from different
classes are arranged far apart.
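
For two classes, Fisher's linear discriminant reduces to finding one projection direction w that separates the class means relative to the within-class scatter. The sketch below uses synthetic, well-separated classes, and the midpoint threshold is a simplifying assumption rather than part of the method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data in 2-D.
class0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
class1 = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))

m0, m1 = class0.mean(axis=0), class1.mean(axis=0)

# Within-class scatter matrix: sum of the scatter of each class around its own mean.
Sw = (class0 - m0).T @ (class0 - m0) + (class1 - m1).T @ (class1 - m1)

# Fisher direction: w is proportional to Sw^{-1} (m1 - m0).
w = np.linalg.solve(Sw, m1 - m0)

# Project both classes onto w (2 dimensions -> 1).
p0, p1 = class0 @ w, class1 @ w

# Assumed decision rule: midpoint between the projected class means.
threshold = (p0.mean() + p1.mean()) / 2
```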

Factor Analysis
• Factor Analysis (FA) is an exploratory data analysis
method used to search for influential underlying factors or
latent variables from a set of observed variables. It helps in
data interpretation by reducing the number of variables. It
extracts the maximum common variance from all variables and
puts it into a common score.
• Factor analysis is a linear statistical model. It is used to
explain the variance among the observed variables and
condense a set of observed variables into
unobserved variables called factors.
Types of Factor Analysis

•Exploratory Factor Analysis: It is the most popular factor analysis
approach among social and management researchers. Its basic
assumption is that any observed variable is directly associated with
any factor.
•Confirmatory Factor Analysis (CFA): Its basic assumption is that
each factor is associated with a particular set of observed variables.
CFA confirms what is expected.

How does Factor Analysis Work?


The primary objective of factor analysis is to reduce the
number of observed variables and find unobservable
variables. These unobserved variables help the market
researcher to conclude the survey. This conversion of the
observed variables to unobserved variables can be achieved
in two steps:
1. Factor Extraction
2. Factor Rotation
•Factor Extraction: In this step, the number of factors and the
approach for extraction are selected using variance partitioning
methods such as principal components analysis and common
factor analysis.
•Factor Rotation: In this step, rotation transforms the extracted
factors, and the main goal is to
improve the overall interpretability. Many rotation
methods are available, such as the Varimax and Quartimax
methods (orthogonal, which keep the factors uncorrelated) and the
Promax method (oblique, which allows correlated factors).
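
The extraction step can be sketched with the principal components approach: loadings are taken from the leading eigenvectors of the correlation matrix. Everything below is an illustrative assumption (the synthetic latent factor, the four observed variables, and the choice of a single factor), and the rotation step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300

# Synthetic data: three observed variables driven by one latent factor,
# plus a fourth variable that is unrelated to it.
factor = rng.normal(size=n)
related = [factor + rng.normal(scale=0.3, size=n) for _ in range(3)]
unrelated = rng.normal(size=n)
observed = np.column_stack(related + [unrelated])

# Extraction via principal components of the correlation matrix.
R = np.corrcoef(observed, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

n_factors = 1  # assumed; in practice chosen from the eigenvalues or a scree plot

# Loadings: correlation between each observed variable and each extracted factor.
loadings = eigvecs[:, :n_factors] * np.sqrt(eigvals[:n_factors])
```

The first three variables load heavily on the extracted factor while the fourth barely loads at all, which is the pattern factor analysis is meant to expose. The sign of the loadings is arbitrary.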
Topic Objective

Objective: To learn about Univariate and Multivariate
Exploratory Data Analysis.

Recap: Basic knowledge of Exploratory Data Analysis.


Univariate and Multivariate Exploratory Data Analysis

Exploratory Data Analysis is mainly performed using the
following methods:
1. Univariate analysis
2. Bivariate analysis
3. Multivariate analysis

•Univariate analysis:- provides summary statistics for each
field in the raw data set, i.e. a summary of only one
variable at a time. Ex:- CDF, PDF, box plot, violin plot.

•Bivariate analysis:- is performed to find the relationship
between each variable in the dataset and the target
variable of interest, i.e. it uses two variables and finds the
relationship between them. Ex:- box plot, violin plot.

•Multivariate analysis:- is performed to understand
interactions between different fields in the dataset, i.e. it finds
interactions between more than two variables. Ex:- pair plot and
3D scatter plot.
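
The non-graphical side of the three kinds of analysis can be sketched in pandas as follows. The data frame is hypothetical, and the plots named above would be produced separately with a plotting library.

```python
import pandas as pd

# Hypothetical student data.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 58, 61, 70, 74],
    "sleep": [8, 7, 7, 6, 5],
})

# Univariate: summary statistics of a single variable.
univariate = df["score"].describe()

# Bivariate: relationship between one variable and the target.
bivariate = df["hours"].corr(df["score"])

# Multivariate: pairwise relationships among all variables at once.
multivariate = df.corr()
```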

Topic Objective

Objective: To understand the concept of data munging and
data wrangling, and APIs and other tools for scraping data
from the web/internet using R/Python.

Recap: Basic knowledge of Exploratory Data Analysis.


Data Munging

What is Data Munging?

Data munging, also known as data wrangling, is the
process of converting raw data into a more usable
format. Often, data munging occurs as a precursor
to data analytics or data integration. High-quality data is
essential for sophisticated data operations.
The munging process typically begins with a large
volume of raw data. Data scientists will munge the data
into shape by removing any errors or inconsistencies.

The modern data munging process now involves six
main steps:

1. Discover: First, the data scientist performs a degree
of data exploration. This is a first glance at the data to
establish the most important patterns. It also allows the
scientist to identify any major structural issues, such as
invalid data formats.

2. Structure: Raw data might not have an appropriate
structure for the intended usage. The data scientists will
organize and normalize the data so that it’s more
manageable. This also makes it easier to perform the
next steps in the munging process.

3. Clean: Raw data can contain corrupt, empty, or invalid
cells. There may also be values that require conversions,
such as dates and currencies. For instance, the state in a
customer's address might appear as Texas, Tex, or TX.
The cleaning process will standardize this value for every
address.
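
The state-name example can be sketched as a small lookup. The alias table and the function name `standardize_state` are hypothetical, and a real data set would need a much fuller table.

```python
# Hypothetical alias table mapping known spellings to one standard code.
STATE_ALIASES = {"texas": "TX", "tex": "TX", "tx": "TX"}


def standardize_state(raw: str) -> str:
    """Return the standard code for a state name, or the trimmed input if unknown."""
    key = raw.strip().rstrip(".").lower()
    return STATE_ALIASES.get(key, raw.strip())


cleaned = [standardize_state(s) for s in ["Texas", "Tex", " TX ", "Tex."]]
```

Unknown values are passed through unchanged rather than guessed, so they can be reviewed separately.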

4. Enrich: Data enrichment is the process of filling in
missing details by referring to other data sources. Data
enrichment lets you fill in all address fields by looking up
the missing values elsewhere, such as in the CRM
database or a postal records lookup.

5. Validate: Next, it’s time to ensure that all data values
are logically consistent. This means checking things like
whether all phone numbers have nine digits, that there are
no numbers in name fields, and that all dates are valid
calendar dates. Data validation also involves some deeper
checks, such as ensuring that all values are compatible with
the specified data type.
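
The three checks mentioned above might be sketched like this. The record layout, the function name `validate_record`, and the ISO date format are assumptions. The nine-digit phone length follows the slide and is kept as a parameter, since real phone formats vary.

```python
from datetime import datetime


def validate_record(record: dict, phone_digits: int = 9) -> list:
    """Return the names of the fields that fail validation (empty list if all pass)."""
    errors = []

    # Phone numbers must consist of exactly `phone_digits` digits.
    phone = record["phone"]
    if not (phone.isdigit() and len(phone) == phone_digits):
        errors.append("phone")

    # Name fields must not contain digits.
    if any(ch.isdigit() for ch in record["name"]):
        errors.append("name")

    # Dates must be valid calendar dates (assumed format: YYYY-MM-DD).
    try:
        datetime.strptime(record["date"], "%Y-%m-%d")
    except ValueError:
        errors.append("date")

    return errors


good = validate_record({"name": "Asha", "phone": "123456789", "date": "2023-12-12"})
bad = validate_record({"name": "R2D2", "phone": "12345", "date": "2023-02-30"})
```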

6. Publish: When the data munging process is complete,
the data science team will push it towards its final
destination. Often this is a data repository, where it will
integrate with data from other sources. This will make the
munged data permanently available to all consumers.

Web Scraping with Python

Web scraping is an automated method used to extract large
amounts of data from websites. The data on websites is unstructured. Web
scraping helps collect this unstructured data and store it in a structured form. There
are different ways to scrape websites, such as online services, APIs, or writing your own
code.

To extract data using web scraping with Python, you need to follow these basic steps:
1. Find the URL that you want to scrape
2. Inspect the page
3. Find the data you want to extract
4. Write the code
5. Run the code and extract the data
6. Store the data in the required format
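
Steps 3 to 6 can be sketched without network access. This example is a stand-in under stated assumptions: it parses a small inline HTML string instead of a live page, and it uses the standard library's `html.parser` and `csv` modules rather than the Selenium, BeautifulSoup, and pandas stack the slides name. The tag structure and class names are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment; a real page would be downloaded from its URL.
SAMPLE_HTML = """
<div class="product"><span class="name">Laptop A</span><span class="price">45000</span></div>
<div class="product"><span class="name">Laptop B</span><span class="price">52000</span></div>
"""


class ProductParser(HTMLParser):
    """Collect the name and price nested inside each product <div>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # field whose text is expected next

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "product":
            self.rows.append({})
        elif tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()
            self._field = None


# Run the code and extract the data.
parser = ProductParser()
parser.feed(SAMPLE_HTML)

# Store the data in the required format (CSV text here).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(parser.rows)
csv_text = buffer.getvalue()
```

Before scraping a real site, it is worth checking its terms of use and robots.txt, and preferring an official API where one exists.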

12/12/2023 Sanchi Kaushik UNIT 04 Data Analytics 74


CONTENT
Data Munging

Libraries used for Web Scraping


Python has various applications, and there are different libraries for
different purposes. The demonstration that follows uses the following
libraries:

• Selenium: Selenium is a web testing library. It is used to automate browser
activities.

• BeautifulSoup: Beautiful Soup is a Python package for parsing HTML and XML
documents. It creates parse trees that make it easy to extract data.

• Pandas: Pandas is a library used for data manipulation and analysis. Here it is
used to hold the extracted data and store it in the desired format.

Web Scraping Example: Scraping the Flipkart Website

Step 1: Find the URL that you want to scrape
For this example, we are going to scrape the Flipkart website to extract the
Price, Name, and Rating of laptops. The URL for this page is
https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2

Step 2: Inspect the page

The data is usually nested in tags. So, we inspect the page
to see under which tag the data we want to scrape is
nested. To inspect the page, right-click on the element
and click on “Inspect”.

When you click on the “Inspect” tab, you will see a “Browser
Inspector Box” open.

Step 4: Write the code

First, create a Python file. To do this, open the terminal
in Ubuntu and type gedit <your file name> with the .py
extension:
gedit web-s.py
Next, import the necessary libraries (Selenium, BeautifulSoup, and pandas).

To configure the web driver to use the Chrome browser, set the
path to ChromeDriver. Then use the driver to open the URL.
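
The slide's own code is not reproduced here. One possible way to configure the driver and open the URL is sketched below, assuming Selenium 4 and a locally installed ChromeDriver. The function name fetch_page_source and the driver path passed to it are hypothetical.

```python
def fetch_page_source(url, driver_path):
    """Open `url` in Chrome through Selenium and return the rendered HTML."""
    # Imported inside the function so that this file can still be loaded
    # on a machine where Selenium or Chrome is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # Point Selenium at the ChromeDriver executable (Selenium 4 style).
    driver = webdriver.Chrome(service=Service(executable_path=driver_path))
    try:
        driver.get(url)            # load the page in the browser
        return driver.page_source  # HTML after the browser has rendered it
    finally:
        driver.quit()              # always close the browser window


# Example call (not executed here; the driver path is a placeholder):
# html = fetch_page_source("https://www.flipkart.com/", "/path/to/chromedriver")
```

Returning the page source as a string keeps the browser automation separate from the parsing step that follows.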

Find the div tags with the respective class names, extract
the data, and store it in a variable.
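
A small sketch of this extraction step with Beautiful Soup follows. The HTML snippet and the class names (product, name, price, rating) are invented for illustration; on a real page, use the class names found while inspecting it.

```python
from bs4 import BeautifulSoup

# Sample markup standing in for the downloaded page source.
# The structure and class names are hypothetical.
html = """
<div class="product">
  <div class="name">Laptop A</div>
  <div class="price">45,990</div>
  <div class="rating">4.3</div>
</div>
<div class="product">
  <div class="name">Laptop B</div>
  <div class="price">52,490</div>
  <div class="rating">4.5</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# One list per field, filled in page order.
products, prices, ratings = [], [], []
for card in soup.find_all("div", class_="product"):
    products.append(card.find("div", class_="name").get_text(strip=True))
    prices.append(card.find("div", class_="price").get_text(strip=True))
    ratings.append(card.find("div", class_="rating").get_text(strip=True))

print(products, prices, ratings)
```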

Step 5: Run the code and extract the data

To run the code, run the script from the terminal, for example: python web-s.py

Step 6: Store the data in the required format

After extracting the data, you might want to store it in a particular format. This
format varies depending on your requirement. For this example, we will store the
extracted data in a CSV (Comma-Separated Values) format.
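
Storing the extracted values as CSV could look like the sketch below. The sample values, column names, and the file name mentioned in the comment are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical values, as they might come out of the extraction step.
products = ["Laptop A", "Laptop B"]
prices = ["45,990", "52,490"]
ratings = ["4.3", "4.5"]

df = pd.DataFrame({"Product Name": products, "Price": prices, "Rating": ratings})

# Writing to an in-memory buffer keeps the example free of side effects;
# pass a file name such as "products.csv" instead to write to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the prices contain commas, pandas wraps them in quotes so that each row still has exactly three fields.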

THE CONCEPT LEARNING TASK
Faculty Video Links, YouTube & NPTEL Video Links and Online Courses Details

YouTube videos

https://www.youtube.com/watch?v=q4pyaVZjqk0

https://www.youtube.com/watch?v=7sJaRHF03K8

https://www.youtube.com/watch?v=mKxFfjNyj3c

https://www.youtube.com/watch?v=azXCzI57Yfc

https://www.youtube.com/watch?v=83x5X66uWK0

THE CONCEPT LEARNING TASK
MCQ unit wise/weekly

1. Which of the following is not an example of a time series
model?

A) Naive approach
B) Exponential smoothing
C) Moving Average
D) None of the above

2. Which of the following can’t be a component of a time series
plot?

A) Seasonality
B) Trend
C) Cyclical
D) Noise
E) None of the above

3. Which of the following is relatively easier to estimate in time
series modeling?

A) Seasonality
B) Cyclical
C) No difference between Seasonality and Cyclical

4. Which of the following techniques would perform better for
reducing the dimensions of a data set?

A. Removing columns which have too many missing values

B. Removing columns which have high variance in data

C. Removing columns with dissimilar data trends

D. None of these

5. [True or False] Dimensionality reduction algorithms are one
of the possible ways to reduce the computation time required to
build a model.

A. TRUE

B. FALSE

6. Which of the following algorithms cannot be used for
reducing the dimensionality of data?

A. t-SNE

B. PCA

C. LDA

D. None of these

7. Which of the following statements is TRUE?

(A) Outliers should always be identified and removed from a
dataset.
(B) Outliers can never be present in the testing dataset.
(C) An outlier is a data point that is significantly close to other data
points.
(D) The nature of our business problem determines how
outliers are used.

8. __________ are the data objects that don’t comply with the
general model or behaviour of the available data:

a. Evolution Analysis

b. Outlier Analysis

c. Classification

d. Prediction

9. The primary use of data cleaning is:

a. Removing the noisy data

b. Correction of the data inconsistencies

c. Transformations for correcting the wrong data

d. All of the above

10. __________ means describing and modeling regularities or
trends for those objects whose behavior changes over time.

a. Evolution Analysis

b. Outlier Analysis

c. Classification

d. Prediction

THE CONCEPT LEARNING TASK
Weekly/monthly/Unit Wise Assignment

Assignment 1
1. What is an outlier, and how are outliers identified?
2. What are the steps of Data Cleaning?
3. What are missing values? How do you handle missing values?
4. Name two pandas methods that are useful for handling
missing values.
5. Explain the phrase “Curse of Dimensionality”.
6. What do you mean by Feature Splitting?
7. What is the importance of using PCA before clustering?
8. Explain the standardization scaling method to normalize data.
9. Differentiate between Univariate, Bivariate, and
Multivariate analysis.

THE CONCEPT LEARNING TASK
Glossary Questions

1. ___________ is really just making up data to artificially inflate
results. It’s better to just drop cases with missing data than to
impute.
2. ____________ is always the best way to deal with missing data.
3. Principal Component Analysis (PCA) is an example
of _______________.
4. PCA reduces the dimension by finding a few ________.
5. ______ is a tool which is used to reduce the dimension of the data.

6. PCA is used to find _______.
7. ______ is a non-zero vector that stays parallel after matrix
multiplication.
8. ______ is a dimensionality reduction technique which is commonly
used for supervised classification problems.
9. Discriminative Learning Algorithms include _______.
10. The predictions for generative learning algorithms are made using
_______.

THE CONCEPT LEARNING TASK
Expected Questions for University Exam

1. What is Data Manipulation?
2. Define Outliers. How are they identified?
3. Name some methods for imputing missing values.
4. Explain the standardization scaling method to normalize data.
5. Name the top 2 techniques to handle missing data.
6. Explain Principal Component Analysis, its assumptions, and its equations.
7. What is the importance of using PCA before clustering?
8. Explain the Curse of Dimensionality.
9. How can you evaluate the performance of a dimensionality reduction
algorithm on your dataset?

THE CONCEPT LEARNING TASK
Summary

• This unit covers the fundamentals of Exploratory Data
Analysis and its latest trends in industry.
• It also covers the different types of data, the important
question of how to handle missing data, and the concept of
model building as it is used in industry.
• Finally, it introduces web scraping using Python.
