Unit 4
DATA ANALYTICS
Unit: 4
Mr.Ravi Pandey
B Tech VIIth Sem Assistant Professor
ECE
12/12/2023
Evaluation scheme
5 Lab – I 0 0 2 25 25 50 1
6 Internship Assessment 0 0 2 50 50 1
Course objective:
The objective of this course is to understand the fundamental concepts of Data Science
and to learn about various types of data formats and their manipulation. It helps students
learn exploratory data analysis and visualization techniques, in addition to the R
programming language.
CO 1 Understand the fundamental concepts of data analytics in the areas that play a major role
within the realm of data science. K1
CO 2 Explain and exemplify the most common forms of data and their representations. K2
CO 5 Illustrate various visualization methods for different types of data sets and application
scenarios. K3
Text books:
1) Glenn J. Myatt, Making sense of Data: A practical Guide to Exploratory Data Analysis
and Data Mining, John Wiley Publishers, 2007.
2) Data Analysis and Data Mining, 2nd Edition, John Wiley & Sons Publication, 2014.
Reference Books:
• Security.
• Transportation.
• Risk detection.
• Risk management.
• Delivery.
• Fast internet allocation.
• Reasonable expenditure.
• Interaction with customers.
• Planning of cities.
Course Outcomes
Course outcome: After completion of this course students will be able to:
CO5 Understand and analyze the I/O management and File systems K2, K4
Program Outcomes
1. Engineering knowledge
2. Problem analysis
3. Design/development of solutions
4. Conduct investigations of complex problems
5. Modern tool usage
6. The engineer and society
7. Environment and sustainability
8. Ethics
9. Individual and team work
10. Communication
11. Project management and finance
12. Life-long learning
Course Outcome  PO1   PO2   PO3   PO4   PO5   PO6   PO7   PO8   PO9   PO10  PO11  PO12
1               3     2     2     -     -     -     -     -     -     -     -     1
2               3     3     3     -     -     -     -     -     -     -     -     1
3               3     3     3     -     -     -     -     -     -     -     -     1
4               3     2     1     -     -     -     -     -     -     -     -     1
5               3     2     2     -     -     -     -     -     -     -     -     1
Average         3     2.4   2.2   -     -     -     -     -     -     -     -     1
1               3     -     -
2               3     2     -
3               3     2     -
4               3     2     2
5               3     2     -
Average         3     2     2
• Solve real-time complex problems and adapt to technological changes through
lifelong learning.
• Work as data scientists, entrepreneurs, and bureaucrats for the good of society,
and pursue higher education.
• Exhibit professional ethics and moral values with good leadership qualities and effective
interpersonal skills.
• NA
Prerequisite:-
Recap-
Missing Values:-
The data may have missing values in some of its columns. There are three
major categories of missing values:
1. MCAR (Missing completely at random): The values are missing purely
at random; the missingness does not depend on any other values.
2. MAR (Missing at random): The missingness depends on other,
observed features.
3. MNAR (Missing not at random): The missingness depends on the
missing value itself, so there is a systematic reason why these values
are missing.
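The three categories can be illustrated on synthetic data. The sketch below is only an illustration under stated assumptions: the `age` and `income` fields, the 20% and 50% probabilities, and the 50000 threshold are all hypothetical choices, not part of the course material.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic records (hypothetical): age is always observed, income may go missing.
people = []
for _ in range(1000):
    age = random.randint(20, 60)
    income = 1000 * age + random.gauss(0, 5000)
    people.append({"age": age, "income": income})

def mcar(person):
    # MCAR: missingness ignores every value in the record.
    return random.random() < 0.2

def mar(person):
    # MAR: missingness depends only on an observed variable (age).
    return person["age"] > 50 and random.random() < 0.5

def mnar(person):
    # MNAR: missingness depends on the unobserved value itself (income).
    return person["income"] > 50000

def apply_rule(rule):
    # Replace income with None wherever the rule marks it as missing.
    return [None if rule(p) else p["income"] for p in people]

income_mcar = apply_rule(mcar)
income_mar = apply_rule(mar)
income_mnar = apply_rule(mnar)

for name, column in [("MCAR", income_mcar), ("MAR", income_mar), ("MNAR", income_mnar)]:
    print(name, "missing:", sum(v is None for v in column))
```

In the MNAR column every surviving income is at most 50000, so any statistic computed from the observed values alone is biased downwards.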
12/12/2023 Sanchi Kaushik UNIT 04 Data Analytics 31
Handling Missing data
Imputation
1. MCAR (also for MAR, but this is not ideal)
In this case we can use a simple method for imputation of y.
For numeric attributes we replace missing values in y with
the mean of non-missing values of y.
For categorical attributes y, we replace missing values with
the most common category in the non-missing values of y.
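A minimal sketch of this simple imputation, assuming missing entries are represented as `None`. The function name `impute_simple` and the sample columns are hypothetical and only serve the illustration.

```python
from statistics import mean, mode

def impute_simple(values):
    """Fill None entries with the mean (numeric) or the most common category."""
    observed = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in observed):
        fill = mean(observed)   # numeric attribute: mean of non-missing values
    else:
        fill = mode(observed)   # categorical attribute: most common category
    return [fill if v is None else v for v in values]

# Hypothetical example columns
ages = [20, None, 30, 40, None]
cities = ["Delhi", "Noida", None, "Delhi"]

print(impute_simple(ages))
print(impute_simple(cities))
```

With pandas, the same idea is usually written with `fillna`, for example `df["age"].fillna(df["age"].mean())`.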
Imputation
2. MAR
In this case we use a more complex method: we replace
missing values of attribute y with predictions from other
variables x, provided the variables are related (for example,
by linear regression, which we will see later on using the R
functions lm and predict).
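The idea can be sketched in Python with a hand-written least-squares fit on a single predictor. This is an illustration, not the lm/predict workflow mentioned above; the helper names and the `experience`/`salary` data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def impute_by_regression(xs, ys):
    """Replace None entries in ys with predictions from a line fitted on complete pairs."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([x for x, _ in complete], [y for _, y in complete])
    return [slope * x + intercept if y is None else y for x, y in zip(xs, ys)]

# Hypothetical data: salary is missing for one record, experience is always observed.
experience = [1, 2, 3, 4, 5]
salary = [30, 40, None, 60, 70]

filled = impute_by_regression(experience, salary)
print(filled)
```

The fit uses only the rows where y is observed, so the imputed value is only as trustworthy as the relationship between x and y.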
Implications of imputation
Imputation has some effects that can impact analysis.
1. The central tendency of the data is retained. For example, if we impute
missing data using the mean of a numeric variable, the mean after
imputation will not change. This is a good reason to impute based on
estimates of central tendency.
2. The spread of the data will change. After imputation, the spread of the
data will be smaller than the spread of the observed values alone, because
every imputed value sits at the centre. This can be problematic, as
underestimating the spread of the data can yield over-confident inferences
in downstream analysis.
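Both effects can be checked numerically. The sketch below assumes five observed values and three missing ones filled with the mean; the numbers are made up for illustration.

```python
from statistics import mean, stdev

# Hypothetical column: five observed values and three missing ones.
observed = [10, 20, 30, 40, 50]
n_missing = 3

# Mean imputation: every missing entry becomes the mean of the observed values.
imputed = observed + [mean(observed)] * n_missing

mean_before, mean_after = mean(observed), mean(imputed)
sd_before, sd_after = stdev(observed), stdev(imputed)

print("mean before/after:", mean_before, mean_after)
print("std dev before/after:", round(sd_before, 2), round(sd_after, 2))
```

The sum of squared deviations stays the same while the number of values grows, which is why the standard deviation shrinks.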
1. Exploratory analysis
2. Forecasting
3. Intervention analysis
4. Segmentation
Factor Analysis
Factor Analysis (FA) is an exploratory data analysis
method used to search for influential underlying factors or
latent variables in a set of observed variables. It helps in
data interpretation by reducing the number of variables. It
extracts the maximum common variance from all variables and
puts it into a common score.
Factor analysis is a linear statistical model. It is used to
explain the variance among the observed variables and to
condense a set of observed variables into unobserved
variables called factors.
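The linear model behind factor analysis can be written out as follows. This is the standard orthogonal factor model, given here as a sketch; the symbols (x for the observed variables, f for the factors, Λ for the loadings, Ψ for the unique variances) are notation introduced for this illustration and do not come from the slides.

```latex
% p observed variables x, k < p latent factors f
x = \mu + \Lambda f + \varepsilon,
\qquad
\operatorname{E}[f] = 0,\;
\operatorname{Cov}(f) = I_k,\;
\operatorname{Cov}(\varepsilon) = \Psi \text{ (diagonal)},\;
\operatorname{Cov}(f, \varepsilon) = 0

% Implied covariance of the observed variables:
\operatorname{Cov}(x) = \Lambda \Lambda^{\top} + \Psi
```

The term ΛΛᵀ is the common variance shared through the factors, and Ψ is the variance unique to each observed variable.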
Data transformation and dimensionality reduction
Web Scraping with Python
Web scraping is an automated method used to extract large amounts of data from
websites. The data on websites is mostly unstructured. Web scraping helps collect
this unstructured data and store it in a structured form. There are different ways
to scrape websites, such as online services, APIs, or writing your own code.
To extract data using web scraping with Python, you need to follow these basic steps:
1. Find the URL that you want to scrape
2. Inspect the page
3. Find the data you want to extract
4. Write the code
5. Run the code and extract the data
6. Store the data in the required format
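The steps above can be sketched as follows. To stay self-contained and runnable offline, this sketch parses a small hypothetical HTML snippet instead of fetching a real URL, and it uses Python's built-in html.parser and csv modules instead of third-party scraping libraries. The page content, the class names (product, name, price) and the `ProductParser` class are all assumptions made for the illustration.

```python
import csv
import io
from html.parser import HTMLParser

# Steps 1-3 (hypothetical): this string stands in for the page source. In practice
# it could be fetched with urllib.request.urlopen(url).read().decode().
PAGE = """
<html><body>
  <div class="product"><span class="name">Laptop A</span><span class="price">45000</span></div>
  <div class="product"><span class="name">Laptop B</span><span class="price">52000</span></div>
</body></html>
"""

# Step 4: write the code that picks out the elements found while inspecting the page.
class ProductParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per product
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "product":
            self.rows.append({})
        elif tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

# Step 5: run the code and extract the data.
parser = ProductParser()
parser.feed(PAGE)

# Step 6: store the data in a structured format (CSV, written to memory here).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(parser.rows)
print(buffer.getvalue())
```

Writing to a file instead of `io.StringIO` only requires replacing `buffer` with `open("products.csv", "w", newline="")`.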
• Pandas: Pandas is a library used for data manipulation and analysis. It is used to
extract the data and store it in the desired format.
When you click on the “Inspect” tab, you will see a “Browser
Inspector Box” open.
https://www.youtube.com/watch?v=q4pyaVZjqk0
https://www.youtube.com/watch?v=7sJaRHF03K8
https://www.youtube.com/watch?v=mKxFfjNyj3c
https://www.youtube.com/watch?v=azXCzI57Yfc
https://www.youtube.com/watch?v=83x5X66uWK0
A) Naive approach
B) Exponential smoothing
C) Moving Average
D) None of the above
A) Seasonality
B) Trend
C) Cyclical
D) Noise
E) None of the above
A) Seasonality
B) Cyclical
C) No difference between Seasonality and Cyclical
D) None of these
A. TRUE
B. FALSE
A. t-SNE
B. PCA
C. LDA
D. None of these
8. __________ are the data objects that don’t comply with the
general model or behaviour of the available data:
a. Evolution Analysis
b. Outlier Analysis
c. Classification
d. Prediction
a. Evolution Analysis
b. Outlier Analysis
c. Classification
d. Prediction
Assignment 1
1. What is an outlier, and how do you identify one?
2. What are the steps of data cleaning?
3. What are missing values? How do you handle them?
4. Name two pandas methods that are useful for handling
missing values.
5. Explain the phrase “Curse of Dimensionality”.
6. What do you mean by feature splitting?
7. What is the importance of using PCA before clustering?
8. Explain the standardization scaling method used to normalize data.
9. Differentiate between univariate, bivariate, and
multivariate analysis.
Glossary Questions