
Exercise – 6: DS203-2024-S1

Problem 1:

Performing EDA on e6-htr-current.csv:

• First, convert 'Timestamp' to the datetime data type.


• A simple data.shape gives us "data shape: (82388, 2)", i.e. 82388 rows and 2
columns. Excluding the header, this corresponds to 82387 data entries with two columns named
'Timestamp' and 'HT R Phase Current'.
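• A minimal sketch of these loading and conversion steps, assuming the file e6-htr-current.csv is read with pandas (the exact read options used in the original notebook are not stated):

    import pandas as pd

    # Load the data and convert the 'Timestamp' column to datetime.
    data = pd.read_csv('e6-htr-current.csv')
    data['Timestamp'] = pd.to_datetime(data['Timestamp'])

    print('data shape:', data.shape)      # e.g. (82388, 2)
    print(data.columns.tolist())          # ['Timestamp', 'HT R Phase Current']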
• Here are the descriptive statistics of the given dataset:

    statistic   Timestamp           HT R Phase Current
    count       81726               81726
    mean        08:28.6             16.912767
    min         12/23/2018 5:30     0
    25%         3/6/2019 4:06       0.08
    50%         5/16/2019 2:42      0.12
    75%         7/26/2019 1:18      28.58
    max         10/4/2019 23:55     98.5
    std         NaN                 27.174448
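• These statistics can be reproduced with pandas; a minimal sketch (how the datetime column is summarized depends on the pandas version):

    # Descriptive statistics for both columns.
    print(data.describe(include='all'))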

• The line plot of current vs. day looks as follows:


• Clearly we are not able to see the true nature of the data because of its size, so let us plot again
with fewer entries this time:

• The above gives the current plot for the first 1000 entries; the distribution is clearer here.
• The former graph is helpful because it gives us an overview of the data, while the latter gives a
more local view.
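• A minimal sketch of how these two line plots can be produced, assuming matplotlib is used for plotting (the report does not state the plotting library):

    import matplotlib.pyplot as plt

    # Full series: an overview of the whole period.
    data.plot(x='Timestamp', y='HT R Phase Current', figsize=(12, 4))
    plt.title('HT R Phase Current - full dataset')
    plt.show()

    # First 1000 entries: a more local view of the behaviour.
    data.head(1000).plot(x='Timestamp', y='HT R Phase Current', figsize=(12, 4))
    plt.title('HT R Phase Current - first 1000 entries')
    plt.show()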
• The code below helps us identify the number of entries per day:
• data['date'] = data['Timestamp'].dt.date          # calendar date of each reading
• entries_per_day = data.groupby('date').size()     # number of rows recorded on each date
• entries_per_day
• We see there are around 250-300 entries per day.
• This will be helpful in later analysis.
• Now we create a box plot to get a view of outliers.
• The min, max and quantiles are listed in the descriptive statistics table above.
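• A minimal box-plot sketch:

    import matplotlib.pyplot as plt

    # Box plot of the current values to visualise outliers.
    data.boxplot(column='HT R Phase Current')
    plt.title('HT R Phase Current')
    plt.show()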


• To find the two-week interval with maximum variation, we create a new column containing the
difference between consecutive current values, and then sum these differences over two-week intervals (a code sketch follows the result below).
• We get the following result
o Timestamp        fluctuations   missing
o 2018-12-23 6754.35 0.0
o 2019-01-06 3669.36 0.0
o 2019-01-20 5202.19 0.0
o 2019-02-03 5664.76 0.0
o 2019-02-17 3134.01 0.0
o 2019-03-03 4855.82 0.0
o 2019-03-17 5191.14 0.0
o 2019-03-31 6145.58 0.0
o 2019-04-14 4086.05 0.0
o 2019-04-28 6512.57 0.0
o 2019-05-12 4933.57 0.0
o 2019-05-26 5636.68 0.0
o 2019-06-09 6689.27 0.0
o 2019-06-23 5439.55 0.0
o 2019-07-07 9562.62 0.0
o 2019-07-21 8602.02 0.0
o 2019-08-04 7463.69 0.0
o 2019-08-18 9559.11 0.0
o 2019-09-01 6652.22 0.0
o 2019-09-15 11524.25 0.0
o 2019-09-29 6706.29 0.0
o 2019-10-13 NaN NaN
o 2019-10-27 4505.84 0.0
o 2019-11-10 NaN NaN
o 2019-11-24 3028.91 0.0
o 2019-12-08 3477.18 0.0

• We get the maximum sum for the two-week interval starting at 2019-09-15.
• We plot the current values for this period and get:


Start Date : 2019-09-15

End Date : 2019-09-29
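• A sketch of this two-week aggregation; the report does not spell out the exact formula, so the absolute difference between consecutive readings and the count of missing readings are assumptions here:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Change between consecutive readings (assumption: magnitude of the change).
    data['fluctuations'] = data['HT R Phase Current'].diff().abs()
    # One possible definition of the 'missing' column: count of NaN readings.
    data['missing'] = data['HT R Phase Current'].isna().astype(float)

    # Sum both columns over 14-day windows starting at the first timestamp.
    two_weekly = data.set_index('Timestamp')[['fluctuations', 'missing']].resample('14D').sum()
    print(two_weekly)

    # Two-week interval with the largest total variation, and a plot of that interval.
    start = two_weekly['fluctuations'].idxmax()
    end = start + pd.Timedelta(days=14)
    print('Start Date :', start.date(), ' End Date :', end.date())
    mask = (data['Timestamp'] >= start) & (data['Timestamp'] < end)
    data[mask].plot(x='Timestamp', y='HT R Phase Current', figsize=(12, 4))
    plt.show()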

• To improve the quality of this data we can use the following methods:
o Winsorization: replacing outliers (defined here as the entries in the bottom and top 5%
of the value distribution) with the nearest retained values. Winsorization is not useful here
because the poor quality of our data comes from sudden changes in the values rather than from
extreme outliers. Nevertheless, applying winsorization gives us the following plot, which, as
mentioned, is no better.
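A minimal winsorization sketch, clipping at the 5th and 95th percentiles as described above:

    import matplotlib.pyplot as plt

    # Replace values below the 5th / above the 95th percentile with those percentiles.
    lower = data['HT R Phase Current'].quantile(0.05)
    upper = data['HT R Phase Current'].quantile(0.95)
    data['current_winsorized'] = data['HT R Phase Current'].clip(lower, upper)

    data.plot(x='Timestamp', y='current_winsorized', figsize=(12, 4))
    plt.show()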

o Rolling Mean: to smooth this dataset, we use a rolling mean. It replaces each entry with the
average of a window of neighbouring observations. We tried several window sizes to find one that
gives fairly smooth data without significantly distorting it, and obtained the following plot (a
sketch is given at the end of this list).

o The problem of sudden fluctuations has, to some extent, been taken care of, but the data still
has issues; these are due to the missing values.
o We will use the whole dataset available to assign some values here.
o The data for 2019-09-25 is missing. We can use the data for the same date of the previous month
to fill it in.
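A sketch of the smoothing and the missing-day fill; the window size (50 readings) and the one-month shift are illustrative assumptions:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Rolling mean smoothing; the window size is illustrative, several values were tried.
    data['current_smooth'] = data['HT R Phase Current'].rolling(window=50, min_periods=1).mean()
    data.plot(x='Timestamp', y='current_smooth', figsize=(12, 4))
    plt.show()

    # Fill the missing day using the same calendar date one month earlier, as suggested above.
    ts = data.set_index('Timestamp')
    donor = ts.loc['2019-08-25', ['HT R Phase Current']].copy()   # assumes 2019-08-25 is present
    donor.index = donor.index + pd.DateOffset(months=1)           # shift readings to 2019-09-25
    filled = pd.concat([ts[['HT R Phase Current']], donor]).sort_index()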

Problem 2:

• data.shape gives us the following info about the data:


o There are 100 columns named c1 to c100 and there are 1025 entries.
• Using data.dtypes we see that all columns are either int or float.
• The detailed descriptive statistics are in the shared descstats.xlsx file; they are too bulky to be
copied here.
• Pairplots of some columns, created using seaborn (sns), are as follows:
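• A minimal pairplot sketch; the columns shown (c1 to c4) are illustrative, since the report does not list which columns were plotted, and the DataFrame is assumed to be named data:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Pairwise scatter plots and histograms for a few columns.
    sns.pairplot(data[['c1', 'c2', 'c3', 'c4']])
    plt.show()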


• Before moving forward we will standardize the data.
• Before that, let's make sure there are no missing values:
o A simple data.isnull().sum() shows us there are no missing values in the data.
• We proceed with standardization: using StandardScaler from sklearn.preprocessing we standardize
the data. Pairplots of the above columns after standardization come out as follows:

• Observe that the shapes of the plots have not changed, but the scales have.
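• A sketch of the missing-value check and the standardization step, assuming the DataFrame is named data:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Confirm there are no missing values.
    print(data.isnull().sum().sum())          # 0 expected

    # Standardize every column to zero mean and unit variance.
    scaler = StandardScaler()
    data_std = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)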
• To decide whether to drop certain columns, we will create a correlation heat map and
remove columns that show high correlation with other columns.
• We obtain the following heat map

• Now we identify the columns with high correlation values and remove them (a sketch follows the list below).
• The columns with high correlation values are
o Highly correlated columns (correlation > 0.9):
o ['c24', 'c25', 'c33', 'c54', 'c56', 'c57', 'c69', 'c75', 'c76', 'c77', 'c83', 'c91', 'c92', 'c93', 'c95',
'c99', 'c100']
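• A sketch of the heat map and the high-correlation filter; the 0.9 threshold is from the report, and scanning only the upper triangle is one common way to decide which column of each correlated pair to drop:

    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    corr = data_std.corr()

    # Correlation heat map.
    sns.heatmap(corr, cmap='coolwarm', center=0)
    plt.show()

    # Columns whose absolute correlation with an earlier column exceeds 0.9.
    upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
    print('Highly correlated columns (correlation > 0.9):', to_drop)

    data_reduced = data_std.drop(columns=to_drop)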

Problem 3:

1. Subjecting the MNIST dataset to PCA


a. First the dataset is standardized and then subjected to PCA.
b. We get the following elbow plot:
c. Number of components explaining 90% variance: 193
d. Conclusions:
i. The elbow point is at 193 components.
ii. A common rule of thumb is to take the number of components that accounts for
90% of the variation in the data.
iii. Beyond this point, taking more components gives diminishing returns; there is a
trade-off between simplicity and accuracy.
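A sketch of this PCA variance analysis, assuming the MNIST pixel matrix is loaded as X (for example via sklearn's fetch_openml('mnist_784')):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    X_std = StandardScaler().fit_transform(X)

    pca = PCA().fit(X_std)
    cum_var = np.cumsum(pca.explained_variance_ratio_)

    # Elbow / cumulative explained variance plot.
    plt.plot(cum_var)
    plt.xlabel('Number of components')
    plt.ylabel('Cumulative explained variance')
    plt.show()

    # Smallest number of components explaining at least 90% of the variance.
    n_90 = int(np.argmax(cum_var >= 0.90)) + 1
    print('Number of components explaining 90% variance:', n_90)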
2. Now we subject the standardized dataset to PCA with 2 components and scatter plot PC2
vs. PC1; we obtain the following scatter plot:
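A sketch of the two-component projection, assuming X_std and the digit labels y from the previous step:

    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # Project onto the first two principal components.
    pcs = PCA(n_components=2).fit_transform(X_std)

    # Scatter of PC2 vs PC1, colored by digit label.
    plt.scatter(pcs[:, 0], pcs[:, 1], c=y.astype(int), cmap='tab10', s=5)
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.colorbar(label='digit')
    plt.show()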

3. Now we subject the dataset to t-SNE analysis.


a. Use TSNE on the standardized dataset.
b. Obtain a scatter plot of the reduced-dimension dataset. We obtain the following scatter plot:

c. Interpretation:
i. The x- and y-axis values of a t-SNE scatter plot do not have any intrinsic meaning.
ii. Each point on this scatter plot is a data point of the original dataset.
iii. The closeness of two points on the plot indicates their similarity in the higher-
dimensional space.
iv. We can clearly see points that have the same label forming a cluster.
v. This means all instances of the same digit have a high degree of similarity in
the original dataset.
vi. Some outliers can be observed, which are the points that are far away from their
digit clusters.
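A t-SNE sketch on the standardized dataset; the subsample size is an assumption, since t-SNE is slow on the full MNIST set:

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # 2-D t-SNE embedding of a subsample (illustrative size).
    emb = TSNE(n_components=2, random_state=0).fit_transform(X_std[:5000])

    plt.scatter(emb[:, 0], emb[:, 1], c=y[:5000].astype(int), cmap='tab10', s=5)
    plt.colorbar(label='digit')
    plt.show()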
4. Subjecting the e6-Run2-June22 subset dataset to t-SNE gives the following scatter plot:
a. We have colored the data points month-wise.
b. This helps us see that data points corresponding to the same month are similar in the
higher-dimensional space.
c. Still, we can see multiple clusters with the same color; this might be due to grouping the
dates month-wise. If we instead grouped them into 10-day windows, we would probably see
better clustering (a coloring sketch follows).
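A sketch of the month-wise coloring; the DataFrame name run2 and its layout (a 'Timestamp' column plus numeric feature columns) are assumptions, since the report does not describe the subset's columns:

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE
    from sklearn.preprocessing import StandardScaler

    # Standardize the numeric features and embed them with t-SNE.
    features = StandardScaler().fit_transform(run2.drop(columns=['Timestamp']))
    emb = TSNE(n_components=2, random_state=0).fit_transform(features)

    # Color each point by the month of its timestamp.
    months = run2['Timestamp'].dt.month
    plt.scatter(emb[:, 0], emb[:, 1], c=months, cmap='tab20', s=5)
    plt.colorbar(label='month')
    plt.show()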

Problem 4: Major Learnings:

• Performing EDA using scatter, line and box plots.


• Identifying missing values and handling them.
• Smoothing data using outlier handling and rolling means.
• Applying PCA and t-SNE to a dataset and interpreting the elbow plot and scatter plots
generated.
