Exploratory Data Analysis For Machine Learning
The dataset contains 1,800 monthly temperature-fluctuation values, from 1870 to 2019.
The data series contains 150 rows, corresponding to the years, and 12 columns, one for each month.
After that, I made a histogram for each of the 12 features (early), creating a single plot with
the histograms for all features overlaid. This was done using the Pandas plotting functionality.
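A minimal sketch of how such an overlaid plot could be produced is shown here. The real data is not reproduced: the DataFrame `df`, its month column names, the synthetic values and the bin count are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption): 150 years x 12 months.
rng = np.random.default_rng(0)
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
df = pd.DataFrame(
    rng.normal(loc=0.0, scale=0.4, size=(150, 12)),
    index=pd.RangeIndex(1870, 2020, name="year"),
    columns=months,
)

# DataFrame.plot.hist draws every column on the same axes;
# alpha < 1 keeps the overlapping histograms readable.
ax = df.plot.hist(bins=30, alpha=0.4, figsize=(10, 6))
ax.set_xlabel("Temperature anomaly (ºC)")
ax.set_title("Monthly anomaly distributions, overlaid")
```

With twelve overlaid histograms the plot can get crowded; `df.hist()` would instead draw one panel per month, which may be easier to read.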
For feature engineering, I will set out to improve on a baseline set of features by deriving new features
from the existing data. This should make the difference between a weak model and a strong one. I
plan to use visual exploration, intuition and domain understanding to construct new
features that improve the forecasting capabilities of a future model.
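As an illustration of what deriving new features from the existing columns might look like, the sketch below builds a few candidate features from a year-by-month table. The feature names, the 10-year window and the synthetic data are hypothetical choices, not the features actually selected in this analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption): 150 years x 12 months.
rng = np.random.default_rng(1)
df = pd.DataFrame(
    rng.normal(loc=0.0, scale=0.4, size=(150, 12)),
    index=pd.RangeIndex(1870, 2020, name="year"),
    columns=[f"month_{m:02d}" for m in range(1, 13)],
)

features = pd.DataFrame(index=df.index)
# Average anomaly over the twelve months of each year.
features["annual_mean"] = df.mean(axis=1)
# Spread between the warmest and coldest month of each year.
features["annual_range"] = df.max(axis=1) - df.min(axis=1)
# Lagged feature: the previous year's mean (NaN for the first year).
features["prev_year_mean"] = features["annual_mean"].shift(1)
# Smoothed trend: mean of the current and previous nine years.
features["rolling_10y_mean"] = features["annual_mean"].rolling(window=10).mean()

print(features.tail())
```

Lagged and rolling features leave missing values at the start of the series, so the first rows would need to be dropped or imputed before modelling.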
After analyzing the plots, I came to think about what they tell us about the distribution of the target.
The finding was that each of the data sets follows an almost normal
distribution, with some exceptions (as expected).
By looking at the scatterplots of the data series, we can see that linear regression is not feasible,
either by the method of least squares or by any other method that would allow the behavior
of the data to be described by linear, polynomial or other equations. The dispersion of the
data is such that other, more complex statistical analyses would be required.
Also, the data sets do not differ in nature from one another, as they all reflect the behavior of
a natural variable. Despite the possible sources of error in the measurement of these data, it is
not possible to say that these data are influenced by each other. The environmental conditions
vary according to the time of year, but this is an external variable.
1) The ENSO phenomenon will have an impact on sea level rise if the anomalies mostly oscillate
between -0.5 and 1.5 ºC. Has this been true in the time interval between 1870 and 2020?
2) The temperature anomalies, on average, are higher than -0.10. The decision is based on a
check of a random sample of this data.
3) Can we say that there has not been a period of twelve consecutive months in which these
temperature anomalies have fallen from 0.5 ºC?
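Hypotheses 1 and 3 can both be checked by simple counting over the chronological series. The sketch below is one possible way to do so, under two assumptions: the data is a synthetic stand-in, and hypothesis 3 is read as "twelve consecutive months below 0.5 ºC". The helper `longest_run_below` is a hypothetical name introduced for this example.

```python
import numpy as np

def longest_run_below(values, threshold):
    """Length of the longest run of consecutive values below threshold."""
    longest = current = 0
    for value in values:
        if value < threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

# Synthetic stand-in (assumption): 150 years x 12 months, flattened
# row by row so that the values are in chronological order.
rng = np.random.default_rng(2)
table = rng.normal(loc=0.0, scale=0.4, size=(150, 12))
series = table.ravel()

# Hypothesis 1: share of anomalies inside the band [-0.5, 1.5] ºC.
share_in_band = np.mean((series >= -0.5) & (series <= 1.5))

# Hypothesis 3: was there ever a run of 12 or more months below 0.5 ºC?
run_length = longest_run_below(series, threshold=0.5)
has_twelve_month_run = run_length >= 12

print(f"share in band: {share_in_band:.2%}")
print(f"longest run below 0.5 ºC: {run_length} months")
```

If hypothesis 3 is meant differently (for example, a continuous decline starting at 0.5 ºC), the comparison inside the loop would have to change accordingly.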
On average, the temperature anomalies are higher than -0.10. The decision is based on a check of
a random sample of this data.
The approach is to use the one-sample t-test, which determines whether the sample mean is
statistically different from a known or hypothesized population mean. The one-sample t-test is a
parametric test.
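The test described above could be run along the following lines. This is a sketch under stated assumptions, not the exact procedure used here: the anomalies are synthetic, the sample size of 100 and the 0.05 significance level are illustrative, and the one-sided alternative "the mean is less than -0.10" is chosen to match the hypotheses as worded in this report.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-in for the 1,800 monthly anomalies (assumption).
anomalies = rng.normal(loc=0.0, scale=0.3, size=1800)

# Random sample on which the decision is based (size is an assumption).
sample = rng.choice(anomalies, size=100, replace=False)

hypothesized_mean = -0.10
alpha = 0.05  # assumed significance level

# H0: the mean anomaly is higher than -0.10.
# H1: the mean anomaly is not higher than -0.10 (one-sided, "less").
result = stats.ttest_1samp(sample, popmean=hypothesized_mean,
                           alternative="less")
t_stat, p_value = result.statistic, result.pvalue

decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.4f} -> {decision}")
```

A large p-value here means the sample gives no grounds to reject the null hypothesis; it does not prove the null hypothesis true.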
After the test was performed, and observing that the p-value was well above the standard
significance level, we fail to reject the hypothesis that, on average, the temperature anomalies are
higher than -0.1. There is almost no evidence in support of the alternative hypothesis, namely
that, on average, the temperature anomalies are not higher than -0.1.