Exercise - 6: DS203-2024-S1 Problem1:: Statistics
Problem1:
• [Plot: current values, full dataset]
• The full dataset is too large for a single plot to reveal its true structure, so let us plot again with fewer entries this time:
• [Plot: current values, first 1000 entries]
• The plot above shows the first 1000 entries; the distribution is clearer here.
• The former graph is still helpful as an overview of the data, while the latter gives a more local view.
• The code below counts the number of entries per day:
• data['date'] = data['Timestamp'].dt.date
• entries_per_day = data.groupby('date').size()
• entries_per_day
• We see there are around 250-300 entries per day.
• This will be helpful in later analysis.
• Now we create a box plot to get a view of outliers.
• The min, max, and quartiles are listed in the descriptive-statistics table above.
• [Box plot of the current values]
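The box-plot step above can be sketched as follows; the series here is synthetic (the real column name and values differ), and the IQR rule mirrors the whiskers matplotlib draws by default:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs headless
import matplotlib.pyplot as plt

# Hypothetical current readings; the real column in the dataset differs.
values = pd.Series([4.8, 5.1, 5.0, 4.9, 5.2, 12.0, 5.0, 4.7, -3.0, 5.1])

# IQR rule: points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are flagged as
# outliers, matching matplotlib's default whisker placement.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

fig, ax = plt.subplots()
ax.boxplot(values)
ax.set_ylabel("Current")
fig.savefig("boxplot.png")
print(sorted(outliers.tolist()))
```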
• To find the 2-week interval with maximum variation, we create a new column containing the difference between consecutive current values, and then sum these differences over 2-week intervals.
• We get the following result (the fluctuation sum and missing-value count per 2-week interval):
o Timestamp      fluctuations   missing
o 2018-12-23 6754.35 0.0
o 2019-01-06 3669.36 0.0
o 2019-01-20 5202.19 0.0
o 2019-02-03 5664.76 0.0
o 2019-02-17 3134.01 0.0
o 2019-03-03 4855.82 0.0
o 2019-03-17 5191.14 0.0
o 2019-03-31 6145.58 0.0
o 2019-04-14 4086.05 0.0
o 2019-04-28 6512.57 0.0
o 2019-05-12 4933.57 0.0
o 2019-05-26 5636.68 0.0
o 2019-06-09 6689.27 0.0
o 2019-06-23 5439.55 0.0
o 2019-07-07 9562.62 0.0
o 2019-07-21 8602.02 0.0
o 2019-08-04 7463.69 0.0
o 2019-08-18 9559.11 0.0
o 2019-09-01 6652.22 0.0
o 2019-09-15 11524.25 0.0
o 2019-09-29 6706.29 0.0
o 2019-10-13 NaN NaN
o 2019-10-27 4505.84 0.0
o 2019-11-10 NaN NaN
o 2019-11-24 3028.91 0.0
o 2019-12-08 3477.18 0.0
• We get the maximum sum for the 2-week interval starting at 2019-09-15.
• We plot the current values for this period:
• [Plot: current values for the 2-week interval starting 2019-09-15]
Start Date : 2019-09-15
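The 2-week variation computation can be sketched like this; the data and the column name `Values` are assumptions, and absolute first differences are used so that upward and downward swings both count:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the timestamped current readings; the real data
# has roughly 250-300 entries per day.
rng = np.random.default_rng(0)
idx = pd.date_range("2019-09-01", "2019-10-12", freq="6h")
data = pd.DataFrame({"Values": rng.normal(5.0, 0.5, len(idx))}, index=idx)

# Absolute first difference measures point-to-point fluctuation; summing
# it over 2-week bins gives the total variation per interval.
data["fluctuation"] = data["Values"].diff().abs()
two_week_var = data["fluctuation"].resample("2W").sum()

max_start = two_week_var.idxmax()
print(two_week_var)
print("Interval with maximum variation starts at:", max_start)
```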
• To improve the quality of this data we can use the following methods:
o Winsorization: replacing the outliers (defined here as the entries in the bottom and top 5% of the distribution) with the nearest retained values. Winsorization is not useful here because the badness of our data comes from sudden changes in the values rather than from extreme levels. Nevertheless, applying winsorization gives the following plot, which, as mentioned, is no better.
o [Plot: data after winsorization]
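As a sketch of the winsorization just described, on a synthetic series; scipy's `winsorize` performs the percentile replacement, with the 5% limits matching the definition above:

```python
import numpy as np
from scipy.stats.mstats import winsorize

# 20 synthetic readings: 18 normal values plus one spike in each direction.
values = np.concatenate([np.full(18, 5.0), [100.0, -50.0]])

# Replace the bottom and top 5% of values (here, 1 value on each side)
# with the nearest value that is kept.
clipped = winsorize(values, limits=[0.05, 0.05])

print(clipped.min(), clipped.max())
```

Note that the order of the series is preserved; only the extreme values are overwritten, which is why sudden mid-range jumps survive winsorization untouched.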
o Rolling Mean: to smooth this dataset we use a rolling mean, which replaces each entry with the average of a window of neighbouring observations. We tried several window sizes to find one that yields fairly smooth data without significantly distorting it, and obtained the following plot.
o [Plot: data after rolling-mean smoothing]
o The problem of sudden fluctuations has to some extent been taken care of, but a degree of badness remains in the data; this is due to the missing values.
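The rolling-mean smoothing can be sketched as follows, on a synthetic noisy series; the window size of 10 is illustrative, not the one actually chosen above:

```python
import numpy as np
import pandas as pd

# Synthetic noisy stand-in for the current readings.
rng = np.random.default_rng(1)
noisy = pd.Series(5.0 + rng.normal(0.0, 1.0, 200))

# Centered rolling mean: each point becomes the average of a window of
# neighbours. min_periods=1 avoids NaNs at the edges; larger windows
# smooth more but also blur genuine changes.
smooth = noisy.rolling(window=10, center=True, min_periods=1).mean()

# Smoothing should reduce the average point-to-point fluctuation.
print(noisy.diff().abs().mean(), smooth.diff().abs().mean())
```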
o We will use the rest of the dataset to fill in values here.
o The data for 2019-09-25 is missing; we can replace it with the data from the same date of the previous month.
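The previous-month fill can be sketched like this; the dates and values are synthetic stand-ins, with only the 2019-09-25 gap mirroring the report:

```python
import numpy as np
import pandas as pd

# Daily series with a gap on 2019-09-25 (the date flagged as missing above).
idx = pd.date_range("2019-08-20", "2019-09-30", freq="D")
s = pd.Series(5.0, index=idx)
s.loc["2019-08-25"] = 7.3      # reading from the same date one month earlier
s.loc["2019-09-25"] = np.nan   # the missing entry

# Fill each missing day with the value from the same date of the previous
# month (a simple seasonal fill; interpolation is an alternative).
for ts in s[s.isna()].index:
    prev = ts - pd.DateOffset(months=1)
    if prev in s.index and pd.notna(s.loc[prev]):
        s.loc[ts] = s.loc[prev]

print(s.loc["2019-09-25"])
```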
Problem2
• Before moving forward we will standardize the data.
• Before that, let's make sure there are no missing values.
o A simple data.isnull().sum() shows us there are no missing values in the data.
• We proceed with standardization: using StandardScaler from sklearn.preprocessing we standardize the data. The pair plots of the above columns after standardization come out as follows.
• Observe that the nature of the plots may not have changed, but the scales have.
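The standardization step can be sketched as follows; the two-column matrix is synthetic:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature matrix with very different means and spreads.
rng = np.random.default_rng(2)
X = rng.normal(loc=[10.0, -3.0], scale=[4.0, 0.5], size=(500, 2))

# StandardScaler subtracts each column's mean and divides by its standard
# deviation, so every feature ends up with mean ~0 and std ~1.
X_std = StandardScaler().fit_transform(X)

print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```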
• To decide whether to drop certain columns or not, we will create a correlation heat map, and
remove columns which show high correlation with other columns.
• We obtain the following heat map
• [Correlation heat map]
• Now we identify the columns with high correlation values and remove these columns
• The columns with high correlation values are
o Highly correlated columns (correlation > 0.9):
o ['c24', 'c25', 'c33', 'c54', 'c56', 'c57', 'c69', 'c75', 'c76', 'c77', 'c83', 'c91', 'c92', 'c93', 'c95',
'c99', 'c100']
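The correlation-based column dropping can be sketched like this on a toy frame; the 0.9 threshold matches the one above, while the column names `c1`..`c3` are illustrative:

```python
import numpy as np
import pandas as pd

# Toy frame where c2 is (almost) a linear copy of c1.
rng = np.random.default_rng(3)
df = pd.DataFrame({"c1": rng.normal(size=300)})
df["c2"] = df["c1"] * 2 + rng.normal(scale=0.01, size=300)
df["c3"] = rng.normal(size=300)

# Use only the upper triangle of |corr| so each pair is counted once and
# the first column of every highly correlated pair is kept.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

reduced = df.drop(columns=to_drop)
print("dropped:", to_drop)
```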
Problem 3:
c. Interpretation:
i. The x and y axis values of t-SNE scatter plot do not have any significance.
ii. Each point on this scatter plot is a data point of the original dataset.
iii. The closeness of two points on the plot reflects their similarity in the higher-dimensional space.
iv. We can clearly see points that have the same label forming a cluster.
v. This means the instances of the same digit have a high degree of similarity in the original dataset.
vi. Some outliers can be observed, which are the points that are far away from their
digit clusters.
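A minimal t-SNE sketch illustrating the interpretation above; two synthetic Gaussian blobs stand in for the digit classes:

```python
import numpy as np
from sklearn.manifold import TSNE

# Two well-separated blobs in 10 dimensions, labelled 0 and 1.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(8, 1, (50, 10))])
labels = np.array([0] * 50 + [1] * 50)

# perplexity must be smaller than the number of samples; the resulting 2-D
# coordinates carry no meaning by themselves, only relative closeness does.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Points of the same class should land closer to their own centroid than
# the two class centroids are to each other.
d_within = np.linalg.norm(emb[:50] - emb[:50].mean(axis=0), axis=1).mean()
d_between = np.linalg.norm(emb[:50].mean(axis=0) - emb[50:].mean(axis=0))
print(d_within < d_between)
```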
4. Subjecting the e6-Run2-June22 subset dataset to t-SNE gives the following scatter plot
a. We have coloured the data points month-wise.
b. This helps us see that data points from the same month are similar in the higher-dimensional space.
c. Still, we can see multiple clusters with the same colour; this might be due to the month-wise grouping of dates. If we instead grouped the dates into 10-day intervals, we would probably see cleaner clusters.
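The month-wise versus 10-day grouping can be sketched as follows; the embedding coordinates and timestamps here are synthetic:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical 2-D embedding with one timestamp per point; colouring by
# month (or by 10-day bins) just means mapping dates to integer codes.
ts = pd.Series(pd.date_range("2019-01-01", periods=90, freq="D"))
emb = np.random.default_rng(5).normal(size=(90, 2))

month_codes = ts.dt.month             # month-wise grouping (3 groups here)
ten_day_codes = ts.dt.dayofyear // 10  # finer 10-day grouping (10 groups)

fig, ax = plt.subplots()
ax.scatter(emb[:, 0], emb[:, 1], c=ten_day_codes, cmap="tab10")
fig.savefig("tsne_10day.png")
print(month_codes.nunique(), ten_day_codes.nunique())
```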