Exploratory Data Analysis For Machine Learning

The document summarizes exploratory data analysis of a climate dataset containing monthly temperature fluctuations from 1870 to 2019. Key findings include that the data follows an almost normal distribution with some outliers, linear regression is not suitable due to the non-linear dispersion of the data, and the temperature readings are not influenced by each other but rather by external seasonal variables. A one-sample t-test was run on the hypothesis that the average temperature anomaly is higher than -0.1°C. Further non-linear analysis is suggested due to the non-linear behavior of the data series. The data set is of high quality with no missing values.

Uploaded by

Gabriel Pessine
Copyright
© All Rights Reserved

Exploratory Data Analysis for Machine Learning

01. Brief description of the data set and a summary of its attributes

The dataset comes from Brazil's INPE and contains climate data. This one specifically relates to the
ENSO (El Niño-Southern Oscillation) phenomenon.

The dataset contains 1,800 monthly temperature-fluctuation values, from 1870 to 2019.

The data series contains 150 rows, one for each year, and 12 columns, one for each month.
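
As a quick sanity check on the layout described above, the following minimal sketch builds a placeholder table with the same shape and flattens it into one chronological series. The zeros are stand-ins, not INPE values, and the names `table`, `months` and `series` are hypothetical.

```python
# Placeholder layout: 150 rows (years 1870-2019) x 12 columns (months).
# The zeros stand in for the real anomaly values, which are not reproduced here.
years = list(range(1870, 2020))
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
table = {year: [0.0] * len(months) for year in years}

# Flatten the wide year-by-month table into one chronological series
# of (year, month, value) records.
series = [
    (year, month, value)
    for year in years
    for month, value in zip(months, table[year])
]

print(len(table), "rows x", len(months), "columns =", len(series), "values")
```

The long form is convenient when the series is treated as a single time series, while the wide form suits month-by-month statistics.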

02. Initial plan for data exploration


The initial plan for data exploration was to determine the mean, median, quantiles and ranges (max and
min) for each of the monthly data measurements we have.
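
A minimal sketch of this plan for a single month, using only Python's standard `statistics` module. The `january` values are hypothetical and the helper name `summarize` is an assumption; with the data in a pandas DataFrame, `DataFrame.describe()` would give a comparable summary for all 12 columns at once.

```python
import statistics


def summarize(values):
    """Return mean, median, quartiles and range for one month's readings."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
        "min": min(values),
        "max": max(values),
    }


# Hypothetical anomalies (in °C) for a single month across a few years.
january = [-0.4, 0.1, 0.3, -0.2, 1.2, 0.0, -0.6, 0.5]
for name, value in summarize(january).items():
    print(f"{name:>6}: {value:.3f}")
```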

03. Actions taken for data cleaning and feature engineering


Regarding data cleaning, I made a scatter plot of the dataset with Matplotlib. This initial step
served as the cleaning pass, since outliers are easy to notice graphically. Some data points did not
follow the trend of the rest of the data, so, once identified visually, they could be cataloged as
anomalies.
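
The anomalies above were identified visually. As a numeric complement, here is a minimal sketch of a different technique, the interquartile-range (Tukey fence) rule, which flags points far from the bulk of the data. This is not the method used in this analysis; the `readings` values and the multiplier `k=1.5` are illustrative assumptions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside the Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical readings with one value far from the rest.
readings = [-0.3, 0.1, 0.2, -0.1, 0.0, 0.4, -0.2, 0.3, 3.5]
print(iqr_outliers(readings))  # [3.5]
```

A rule like this makes the visual judgment reproducible, although the choice of `k` remains a judgment call.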

After that, I made a histogram for each of the 12 features (yearly), thus creating a single plot with
the histograms for each feature overlaid. This was done using Pandas plotting functionality.

For feature engineering, I will set out to improve on a baseline set of features by deriving new features
from our existing data. This should make the difference between a weak model and a strong one. I
plan to use visual exploration, intuition and domain understanding to construct new
features that improve the forecasting capabilities of a future model.

04. Key Findings and Insights, which synthesizes the results of Exploratory Data Analysis in an insightful and actionable manner

After the analysis, I considered what these plots tell us about the distribution of the target.
The finding was that each of the data sets follows an almost normal
distribution, with some exceptions (as expected).
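
The "almost normal" reading of the histograms can be backed by two shape statistics: skewness and excess kurtosis, both close to zero for a normal distribution. The sketch below is an illustration on hypothetical values, not a computation on the INPE data, and `shape_stats` is an assumed helper name.

```python
import statistics


def shape_stats(values):
    """Return (skewness, excess kurtosis) computed from population moments."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    n = len(values)
    skew = sum(((v - mean) / sd) ** 3 for v in values) / n
    excess_kurtosis = sum(((v - mean) / sd) ** 4 for v in values) / n - 3
    return skew, excess_kurtosis


symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]      # balanced around the mean
right_tailed = [0.0, 0.1, 0.2, 0.3, 5.0]     # one large value stretches the tail

print(shape_stats(symmetric))
print(shape_stats(right_tailed))
```

A skewness near zero is consistent with a symmetric, bell-like shape; a clearly positive value points to a long right tail driven by a few large readings.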

By looking at the scatterplots of the data series, we can see that a linear regression, whether by
the method of least squares or by any other method that describes the behavior of the data with
linear, polynomial or other equations, would not give a useful fit. The dispersion of the
data is such that other, more complex statistical analyses would be required.

Also, the data sets do not differ in nature from one another, as they all reflect the behavior of
the same natural variable. Despite the possible sources of error in the measurement of these data, it is
not possible to say that these data are influenced by each other. The environmental conditions
vary according to the time of year, but this is an external variable.
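
The point about linear regression can be quantified with the coefficient of determination (R²) of a least-squares line: a value near 1 means the line explains the data, a value near 0 means it does not. The sketch below uses two hypothetical series to show the contrast; it is not a fit to the INPE data.

```python
def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


steps = [0, 1, 2, 3, 4, 5, 6, 7]
trending = [2 * x + 1 for x in steps]                        # perfectly linear
oscillating = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]   # no linear trend

print(round(r_squared(steps, trending), 3))
print(round(r_squared(steps, oscillating), 3))
```

An oscillating series such as an anomaly index behaves like the second case: the fitted line is nearly flat and explains very little of the variation.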

05. Formulating at least 3 hypotheses about this data


As requested, 3 hypotheses were formulated, as follows:

1) The ENSO phenomenon will have an impact on sea level rise if the anomalies mostly oscillate
between -0.5 and 1.5 °C. Has this been true in the time interval between 1870 and 2020?
2) The temperature anomalies are, on average, higher than -0.10. The decision is based on a
check of a random sample of this data.
3) Can we say that there has not been a period of twelve months in a row in which these
temperature anomalies have fallen from 0.5 °C?
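
Hypotheses 1 and 3 can be checked directly on the flattened monthly series. The sketch below is one possible reading: it measures the share of readings inside the -0.5 to 1.5 °C band, and the longest streak of consecutive months satisfying a condition. Interpreting hypothesis 3 as "below 0.5 °C" is an assumption, and the `monthly` values are hypothetical.

```python
def share_in_band(values, low=-0.5, high=1.5):
    """Fraction of readings inside [low, high] (the band from hypothesis 1)."""
    return sum(low <= v <= high for v in values) / len(values)


def longest_run(values, condition):
    """Length of the longest streak of consecutive readings meeting condition."""
    best = current = 0
    for v in values:
        current = current + 1 if condition(v) else 0
        best = max(best, current)
    return best


# Hypothetical monthly anomalies in chronological order.
monthly = [0.2, 0.6, 0.7, 0.9, -0.6, 0.1, 1.6, 0.8, 0.55, 0.3]

print(share_in_band(monthly))                    # 0.8
print(longest_run(monthly, lambda v: v < 0.5))   # 2
```

On the full series, a longest run of 12 or more would indicate that such a twelve-month period did occur.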

06. Conducting a formal significance test for one of the hypotheses and discussing the results

Hypothesis 2 was chosen for the formal significance test.

On average, the temperature anomalies are higher than -0.10. The decision is based on a check of
a random sample of this data.

Population: X = 'Water temperature anomalies in sector 3.4 of the Pacific Ocean'

H0: μ = -0.1 (null hypothesis)

Ha: μ > -0.1 (alternative hypothesis)

The approach is to use the one-sample t-test, which determines whether the sample mean is
statistically different from a known or hypothesized population mean. The one-sample t-test is a
parametric test.
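
For illustration, the mechanics of the one-sample t-test can be sketched with the standard library alone. The `sample` values are hypothetical, not the INPE sample, and the critical value 1.833 is the one-sided 5% value for 9 degrees of freedom from a standard t table. In practice, `scipy.stats.ttest_1samp(sample, popmean=-0.1, alternative="greater")` returns the p-value directly.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: mean == mu0."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)            # sample standard deviation (n - 1)
    t = (mean - mu0) / (sd / math.sqrt(n))   # distance from mu0 in standard errors
    return t, n - 1


# Hypothetical random sample of anomalies, NOT the INPE values.
sample = [0.3, -0.2, 0.1, 0.4, -0.5, 0.0, 0.2, -0.1, 0.6, 0.2]
t_stat, dof = one_sample_t(sample, mu0=-0.1)

# One-sided critical value for alpha = 0.05 and 9 degrees of freedom.
T_CRIT_9_DOF = 1.833
reject_h0 = t_stat > T_CRIT_9_DOF

print(f"t = {t_stat:.3f}, dof = {dof}, reject H0: {reject_h0}")
```

For Ha: μ > -0.1, H0 is rejected only when the t statistic exceeds the critical value, which is equivalent to a p-value below the significance level.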

After the test was performed, the p-value turned out to be far higher than the standard
significance level. With the hypotheses as stated above, this means that the null hypothesis
H0: μ = -0.1 cannot be rejected: the sample provides almost no evidence in support of the alternative
hypothesis that, on average, the temperature anomalies are higher than -0.1. Concluding that the mean
is higher than -0.1 would require a p-value below the significance level.

07. Suggestions for next steps in analyzing this data


Even though the data series presents an almost normal distribution, the
scatterplot shows that it does not follow a linear trend, nor can it be
described by a reasonable equation, because of the wide dispersion of the data, as reflected
in the standard deviations. This indicates that non-linear approaches are needed to analyze
the behavior of the series more precisely.

08. A paragraph that summarizes the quality of this data set and a request for additional data if needed

The INPE data set is of great quality: no data was missing, and no inconsistencies or outliers were
observed far from the range of values in the data set. It also has an almost normal distribution, and it
showed that different statistical analyses could be carried out on it.
