
Exercise – 6: DS203-2024-S1

Problem 1:

Performing EDA on e6-htr-current.csv:

• First, convert 'Timestamp' to the datetime data type.


• A simple data.shape gives us "data shape: (82388, 2)", i.e. 82388 rows and 2
columns. Excluding the header, this corresponds to 82387 data entries with two columns named
'Timestamp' and 'HT R Phase Current'.
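• A minimal sketch of these loading and conversion steps, assuming the file e6-htr-current.csv is read with pandas (the exact read options used in the original notebook are not stated):

    import pandas as pd

    # Load the data and convert the 'Timestamp' column to datetime.
    data = pd.read_csv('e6-htr-current.csv')
    data['Timestamp'] = pd.to_datetime(data['Timestamp'])

    print('data shape:', data.shape)      # e.g. (82388, 2)
    print(data.columns.tolist())          # ['Timestamp', 'HT R Phase Current']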
• Here are the descriptive statistics of the given dataset:

    statistic   Timestamp           HT R Phase Current
    count       81726               81726
    mean        08:28.6             16.912767
    min         12/23/2018 5:30     0
    25%         3/6/2019 4:06       0.08
    50%         5/16/2019 2:42      0.12
    75%         7/26/2019 1:18      28.58
    max         10/4/2019 23:55     98.5
    std         NaN                 27.174448
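• These statistics can be reproduced with pandas; a minimal sketch (how the datetime column is summarized depends on the pandas version):

    # Descriptive statistics for both columns.
    print(data.describe(include='all'))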

• The line plot of current vs. day looks as follows:


• Clearly we are not able to see the true nature of the data because of its size, so let us plot again
with fewer entries this time:

• The above gives the current plot for the first 1000 entries; the distribution is clearer here.
• The former graph is helpful because it gives us an overview of the data, while the latter gives a
more local view.
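• A minimal sketch of how these two line plots can be produced, assuming matplotlib is used for plotting (the report does not state the plotting library):

    import matplotlib.pyplot as plt

    # Full series: an overview of the whole period.
    data.plot(x='Timestamp', y='HT R Phase Current', figsize=(12, 4))
    plt.title('HT R Phase Current - full dataset')
    plt.show()

    # First 1000 entries: a more local view of the behaviour.
    data.head(1000).plot(x='Timestamp', y='HT R Phase Current', figsize=(12, 4))
    plt.title('HT R Phase Current - first 1000 entries')
    plt.show()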
• The code below helps us identify the number of entries per day:
• data['date'] = data['Timestamp'].dt.date          # calendar date of each reading
• entries_per_day = data.groupby('date').size()     # number of rows recorded on each date
• entries_per_day
• We see there are around 250-300 entries per day.
• This will be helpful in later analysis.
• Now we create a box plot to get a view of outliers.
• The min, max and quantiles are listed in the descriptive statistics table above.
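• A minimal box-plot sketch:

    import matplotlib.pyplot as plt

    # Box plot of the current values to visualise outliers.
    data.boxplot(column='HT R Phase Current')
    plt.title('HT R Phase Current')
    plt.show()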


• To find the two-week interval with maximum variation, we create a new column containing the
difference between consecutive current values, and then sum these differences over two-week intervals (a code sketch follows the result below).
• We get the following result
o Timestamp        fluctuations   missing
o 2018-12-23 6754.35 0.0
o 2019-01-06 3669.36 0.0
o 2019-01-20 5202.19 0.0
o 2019-02-03 5664.76 0.0
o 2019-02-17 3134.01 0.0
o 2019-03-03 4855.82 0.0
o 2019-03-17 5191.14 0.0
o 2019-03-31 6145.58 0.0
o 2019-04-14 4086.05 0.0
o 2019-04-28 6512.57 0.0
o 2019-05-12 4933.57 0.0
o 2019-05-26 5636.68 0.0
o 2019-06-09 6689.27 0.0
o 2019-06-23 5439.55 0.0
o 2019-07-07 9562.62 0.0
o 2019-07-21 8602.02 0.0
o 2019-08-04 7463.69 0.0
o 2019-08-18 9559.11 0.0
o 2019-09-01 6652.22 0.0
o 2019-09-15 11524.25 0.0
o 2019-09-29 6706.29 0.0
o 2019-10-13 NaN NaN
o 2019-10-27 4505.84 0.0
o 2019-11-10 NaN NaN
o 2019-11-24 3028.91 0.0
o 2019-12-08 3477.18 0.0

• We get the maximum sum for the two-week interval starting at 2019-09-15.
• We plot the current values for this period and get:


Start Date : 2019-09-15

End Date : 2019-09-29
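• A sketch of this two-week aggregation; the report does not spell out the exact formula, so the absolute difference between consecutive readings and the count of missing readings are assumptions here:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Change between consecutive readings (assumption: magnitude of the change).
    data['fluctuations'] = data['HT R Phase Current'].diff().abs()
    # One possible definition of the 'missing' column: count of NaN readings.
    data['missing'] = data['HT R Phase Current'].isna().astype(float)

    # Sum both columns over 14-day windows starting at the first timestamp.
    two_weekly = data.set_index('Timestamp')[['fluctuations', 'missing']].resample('14D').sum()
    print(two_weekly)

    # Two-week interval with the largest total variation, and a plot of that interval.
    start = two_weekly['fluctuations'].idxmax()
    end = start + pd.Timedelta(days=14)
    print('Start Date :', start.date(), ' End Date :', end.date())
    mask = (data['Timestamp'] >= start) & (data['Timestamp'] < end)
    data[mask].plot(x='Timestamp', y='HT R Phase Current', figsize=(12, 4))
    plt.show()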

• To improve the quality of this data we can use the following methods:
o Winsorization: replacing outliers (defined here as the entries in the bottom and top 5%
of the value distribution) with the nearest retained values. Winsorization is not useful here
because the poor quality of our data comes from sudden changes in the values rather than from
extreme outliers. Nevertheless, applying winsorization gives us the following plot, which, as
mentioned, is no better.
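A minimal winsorization sketch, clipping at the 5th and 95th percentiles as described above:

    import matplotlib.pyplot as plt

    # Replace values below the 5th / above the 95th percentile with those percentiles.
    lower = data['HT R Phase Current'].quantile(0.05)
    upper = data['HT R Phase Current'].quantile(0.95)
    data['current_winsorized'] = data['HT R Phase Current'].clip(lower, upper)

    data.plot(x='Timestamp', y='current_winsorized', figsize=(12, 4))
    plt.show()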

o Rolling Mean: to smooth this dataset, we use a rolling mean. It replaces each entry with the
average of a window of neighbouring observations. We tried several window sizes to find one that
gives fairly smooth data without significantly distorting it, and obtained the following plot (a
sketch is given at the end of this list).

o The problem of sudden fluctuations has, to some extent, been taken care of, but the data still
has issues; these are due to the missing values.
o We will use the whole dataset available to assign some values here.
o The data for 2019-09-25 is missing. We can use the data for the same date of the previous month
to fill it in.
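A sketch of the smoothing and the missing-day fill; the window size (50 readings) and the one-month shift are illustrative assumptions:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Rolling mean smoothing; the window size is illustrative, several values were tried.
    data['current_smooth'] = data['HT R Phase Current'].rolling(window=50, min_periods=1).mean()
    data.plot(x='Timestamp', y='current_smooth', figsize=(12, 4))
    plt.show()

    # Fill the missing day using the same calendar date one month earlier, as suggested above.
    ts = data.set_index('Timestamp')
    donor = ts.loc['2019-08-25', ['HT R Phase Current']].copy()   # assumes 2019-08-25 is present
    donor.index = donor.index + pd.DateOffset(months=1)           # shift readings to 2019-09-25
    filled = pd.concat([ts[['HT R Phase Current']], donor]).sort_index()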

Problem 2:

• data.shape gives us the following info about the data:


o There are 100 columns named c1 to c100 and there are 1025 entries.
• Using data.dtypes we see that all columns are either int or float.
• The detailed descriptive statistics are in the shared descstats.xlsx file; they are too bulky to be
copied here.
• Pairplots of some columns, created using seaborn (sns), are as follows:
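• A minimal pairplot sketch; the columns shown (c1 to c4) are illustrative, since the report does not list which columns were plotted, and the DataFrame is assumed to be named data:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Pairwise scatter plots and histograms for a few columns.
    sns.pairplot(data[['c1', 'c2', 'c3', 'c4']])
    plt.show()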


• Before moving forward we will standardize the data.
• Before that, let's make sure there are no missing values:
o A simple data.isnull().sum() shows us there are no missing values in the data.
• We proceed with standardization: using StandardScaler from sklearn.preprocessing we standardize
the data. Pairplots of the above columns after standardization come out as follows:

• Observe that the shapes of the plots have not changed, but the scales have.
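• A sketch of the missing-value check and the standardization step, assuming the DataFrame is named data:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Confirm there are no missing values.
    print(data.isnull().sum().sum())          # 0 expected

    # Standardize every column to zero mean and unit variance.
    scaler = StandardScaler()
    data_std = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)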
• To decide whether to drop certain columns, we will create a correlation heat map and
remove columns that show high correlation with other columns.
• We obtain the following heat map

• Now we identify the columns with high correlation values and remove them (a sketch follows the list below).
• The columns with high correlation values are
o Highly correlated columns (correlation > 0.9):
o ['c24', 'c25', 'c33', 'c54', 'c56', 'c57', 'c69', 'c75', 'c76', 'c77', 'c83', 'c91', 'c92', 'c93', 'c95',
'c99', 'c100']
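• A sketch of the heat map and the high-correlation filter; the 0.9 threshold is from the report, and scanning only the upper triangle is one common way to decide which column of each correlated pair to drop:

    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    corr = data_std.corr()

    # Correlation heat map.
    sns.heatmap(corr, cmap='coolwarm', center=0)
    plt.show()

    # Columns whose absolute correlation with an earlier column exceeds 0.9.
    upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
    print('Highly correlated columns (correlation > 0.9):', to_drop)

    data_reduced = data_std.drop(columns=to_drop)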

Problem 3:

1. Subjecting the MNIST dataset to PCA


a. First the dataset is standardized and then subjected to PCA.
b. We get the following elbow plot:
c. Number of components explaining 90% variance: 193
d. Conclusions:
i. The elbow point is at 193 components.
ii. A common rule of thumb is to take the number of components that accounts for
90% of the variation in the data.
iii. Beyond this point, taking more components gives diminishing returns; there is a
trade-off between simplicity and accuracy.
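A sketch of this PCA variance analysis, assuming the MNIST pixel matrix is loaded as X (for example via sklearn's fetch_openml('mnist_784')):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    X_std = StandardScaler().fit_transform(X)

    pca = PCA().fit(X_std)
    cum_var = np.cumsum(pca.explained_variance_ratio_)

    # Elbow / cumulative explained variance plot.
    plt.plot(cum_var)
    plt.xlabel('Number of components')
    plt.ylabel('Cumulative explained variance')
    plt.show()

    # Smallest number of components explaining at least 90% of the variance.
    n_90 = int(np.argmax(cum_var >= 0.90)) + 1
    print('Number of components explaining 90% variance:', n_90)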
2. Now we subject the standardized dataset to PCA with 2 components and scatter plot PC2
vs. PC1; we obtain the following scatter plot:
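A sketch of the two-component projection, assuming X_std and the digit labels y from the previous step:

    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # Project onto the first two principal components.
    pcs = PCA(n_components=2).fit_transform(X_std)

    # Scatter of PC2 vs PC1, colored by digit label.
    plt.scatter(pcs[:, 0], pcs[:, 1], c=y.astype(int), cmap='tab10', s=5)
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.colorbar(label='digit')
    plt.show()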

3. Now we subject the dataset to t-SNE analysis.


a. Use TSNE on the standardized dataset.
b. Obtain a scatter plot of the reduced-dimension dataset. We obtain the following scatter plot:

c. Interpretation:
i. The x- and y-axis values of a t-SNE scatter plot do not have any intrinsic meaning.
ii. Each point on this scatter plot is a data point of the original dataset.
iii. The closeness of two points on the plot indicates their similarity in the higher-
dimensional space.
iv. We can clearly see points that have the same label forming a cluster.
v. This means all instances of the same digit have a high degree of similarity in
the original dataset.
vi. Some outliers can be observed, which are the points that are far away from their
digit clusters.
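A t-SNE sketch on the standardized dataset; the subsample size is an assumption, since t-SNE is slow on the full MNIST set:

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # 2-D t-SNE embedding of a subsample (illustrative size).
    emb = TSNE(n_components=2, random_state=0).fit_transform(X_std[:5000])

    plt.scatter(emb[:, 0], emb[:, 1], c=y[:5000].astype(int), cmap='tab10', s=5)
    plt.colorbar(label='digit')
    plt.show()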
4. Subjecting the e6-Run2-June22 subset dataset to t-SNE gives the following scatter plot:
a. We have colored the data points month-wise.
b. This helps us see that data points corresponding to the same month are similar in the
higher-dimensional space.
c. Still, we can see multiple clusters with the same color; this might be due to grouping the
dates month-wise. If we instead grouped them into 10-day windows, we would probably see
better clustering (a coloring sketch follows).
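A sketch of the month-wise coloring; the DataFrame name run2 and its layout (a 'Timestamp' column plus numeric feature columns) are assumptions, since the report does not describe the subset's columns:

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE
    from sklearn.preprocessing import StandardScaler

    # Standardize the numeric features and embed them with t-SNE.
    features = StandardScaler().fit_transform(run2.drop(columns=['Timestamp']))
    emb = TSNE(n_components=2, random_state=0).fit_transform(features)

    # Color each point by the month of its timestamp.
    months = run2['Timestamp'].dt.month
    plt.scatter(emb[:, 0], emb[:, 1], c=months, cmap='tab20', s=5)
    plt.colorbar(label='month')
    plt.show()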

Problem 4: Major Learnings:

• Performing EDA using scatter, line and box plots.


• Identifying missing values and handling them.
• Smoothing data using outlier handling and rolling means.
• Applying PCA and t-SNE to a dataset and interpreting the elbow plot and scatter plots
generated.
