Exploratory Data Analysis For Machine Learning

The document summarizes an exploratory data analysis of a climate dataset containing monthly temperature fluctuations from 1870 to 2019. Key findings include that the data follow an approximately normal distribution with some outliers, that linear regression is not suitable because of the dispersion of the data, and that the temperature readings are not influenced by each other but rather by external seasonal variables. A one-sample t-test was run on the hypothesis that the average temperature anomaly is higher than -0.1 °C. Further non-linear analysis is suggested because of the non-linear behavior of the data series. The data set is of high quality with no missing values.

Uploaded by

Gabriel Pessine

Exploratory Data Analysis for Machine Learning

01. Brief description of the data set and a summary of its attributes

The dataset comes from Brazil's INPE (National Institute for Space Research), which publishes climate data. This one specifically is related to the ENSO phenomenon (El Niño-Southern Oscillation, abbreviated ENOS in Portuguese).

The dataset contains 1,800 monthly temperature fluctuation values, from 1870 to 2019.

The data series contains 150 rows, corresponding to the years, and 12 columns, one for each month.
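
The layout described above can be sketched as follows. This is a minimal illustration on synthetic numbers, not the INPE file itself: the random values, the month labels and the variable names `df` and `series` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real table (assumption): 150 yearly rows x 12 monthly columns.
rng = np.random.default_rng(0)
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
df = pd.DataFrame(rng.normal(0.0, 0.8, size=(150, 12)),
                  index=pd.Index(range(1870, 2020), name="year"),
                  columns=months)

# Flatten the wide table row by row into one chronological series of monthly values.
series = pd.Series(df.to_numpy().ravel(), name="anomaly")

print(df.shape)               # (150, 12)
print(len(series))            # 1800
print(int(df.isna().sum().sum()))  # number of missing cells in the table
```

Keeping both the wide table (one column per month) and the flattened series is convenient: the first suits per-month summaries, the second suits time-ordered checks.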

02. Initial plan for data exploration

The initial plan for data exploration was to determine the mean, median, quantiles and ranges (max and min) for each of the monthly data series we have.
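
A plan like this could be carried out with a few pandas calls. The sketch below assumes the data sit in a 150 x 12 DataFrame with one column per month; the synthetic values and the column names (`q25`, `q50`, `q75`, `range`) are illustrative choices, not taken from the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 150 x 12 monthly table (assumption).
rng = np.random.default_rng(0)
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
df = pd.DataFrame(rng.normal(0.0, 0.8, size=(150, 12)), columns=months)

# One row per month: mean, median, min, max.
summary = df.agg(["mean", "median", "min", "max"]).T
summary["range"] = summary["max"] - summary["min"]

# Quartiles per month, joined onto the same table.
quartiles = df.quantile([0.25, 0.5, 0.75]).T
quartiles.columns = ["q25", "q50", "q75"]
summary = summary.join(quartiles)

print(summary.round(3))
```

`df.describe()` gives most of these figures in one call; building the table explicitly just makes the range column and the quantile choices visible.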

03. Actions taken for data cleaning and feature engineering

Regarding data cleaning, I made a scatter plot of the dataset with Matplotlib. This initial step served as the cleaning pass, since it makes outliers visible graphically. Some points did not follow the trend of the rest of the data, so, by identifying them visually, they could be cataloged as anomalies.
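
The outliers above were identified by eye. As a complement, a numeric rule can flag candidates automatically. The sketch below uses Tukey's 1.5 x IQR fences, which is a swapped-in technique and not the method used in this report; the synthetic values and the two planted anomalies are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 1,800 monthly values (assumption),
# with two artificial anomalies planted at known positions.
rng = np.random.default_rng(0)
values = rng.normal(0.0, 0.8, size=1800)
values[[100, 900]] = [6.0, -6.0]
s = pd.Series(values, name="anomaly")

# Tukey fences: anything beyond 1.5 * IQR from the quartiles is flagged.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (s < lower) | (s > upper)

print(f"fences: [{lower:.2f}, {upper:.2f}]")
print(f"flagged points: {int(is_outlier.sum())}")
```

Flagged points are only candidates: for a climate series, strong El Niño or La Niña months can be real extremes rather than measurement errors, so each one still deserves a visual check.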

After that, I made a histogram for each of the 12 monthly features, creating a single plot with the histograms of all features overlaid. This was done using Pandas plotting functionality.
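
An overlaid histogram of this kind might look like the sketch below. The data are synthetic and the bin count, transparency and labels are assumptions; only `DataFrame.plot.hist` itself reflects the Pandas plotting route mentioned above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd

# Synthetic stand-in for the 150 x 12 monthly table (assumption).
rng = np.random.default_rng(0)
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
df = pd.DataFrame(rng.normal(0.0, 0.8, size=(150, 12)), columns=months)

# One histogram per column, drawn on shared axes; alpha keeps overlaps readable.
ax = df.plot.hist(bins=20, alpha=0.4, figsize=(8, 5))
ax.set_xlabel("temperature anomaly (°C)")
ax.set_title("Monthly anomaly distributions, overlaid")

# ax.figure.savefig("monthly_histograms.png")  # uncomment to write the figure to disk
```

With twelve overlaid series the plot gets crowded; `df.hist(bins=20)` draws a grid of separate panels instead, which can be easier to read.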

For feature engineering, I will set out to improve on a baseline set of features by deriving new features from our existing data. This should make the difference between a weak model and a strong one. I plan to use visual exploration, intuition and domain understanding to construct new features that improve the forecasting capabilities of a future model.
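
As an illustration of what deriving new features could mean for a monthly series, the sketch below builds lagged values, a short rolling mean and a calendar-month column. The specific features, the column names and the synthetic data are assumptions for the example; the report does not say which features will be built.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series covering 1870-01 to 2019-12 (assumption).
rng = np.random.default_rng(0)
idx = pd.date_range("1870-01-01", periods=1800, freq="MS")
anom = pd.Series(rng.normal(0.0, 0.8, size=1800), index=idx, name="anomaly")

features = pd.DataFrame({
    "anomaly": anom,
    "lag_1": anom.shift(1),                 # previous month
    "lag_12": anom.shift(12),               # same month, previous year
    "roll_mean_3": anom.rolling(3).mean(),  # 3-month running mean
})
features["month"] = features.index.month    # calendar month as a seasonal indicator

# The first 12 rows lack a full history, so drop them.
features = features.dropna()
print(features.head())
```

Lag and rolling features only use past and current values, which matters for forecasting: a feature that looks ahead in time would leak the target into the inputs.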

04. Key Findings and Insights, which synthesizes the results of Exploratory Data Analysis in an insightful and actionable manner

After analyzing the plots, I considered what they tell us about the distribution of the target. The finding was that each of the data sets follows an almost normal distribution, with some exceptions (as expected).
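
The "almost normal" impression from the histograms can be cross-checked numerically. The sketch below computes skewness and a Shapiro-Wilk test per month with SciPy. It runs on synthetic data, and the choice of Shapiro-Wilk is an assumption; the report itself relied on visual inspection.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the 150 x 12 monthly table (assumption).
rng = np.random.default_rng(0)
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
df = pd.DataFrame(rng.normal(0.0, 0.8, size=(150, 12)), columns=months)

rows = []
for month in df.columns:
    x = df[month].to_numpy()
    w, p = stats.shapiro(x)  # H0: the sample comes from a normal distribution
    rows.append({"month": month, "skew": stats.skew(x),
                 "shapiro_W": w, "shapiro_p": p})

normality = pd.DataFrame(rows).set_index("month")
print(normality.round(3))
```

A small Shapiro-Wilk p-value is evidence against normality for that month; a large one only means the test found no clear departure.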

By looking at the scatterplots of the data series, we can see that linear regression is not appropriate here, whether by the method of least squares or by any other method that describes the behavior of the data with a linear, polynomial or other simple equation. The dispersion of the data is such that other, more complex statistical analyses would be required.

Also, the data sets (one per month) do not differ in nature from one another, as they all reflect the behavior of the same natural variable. Despite the possible sources of error in the measurement of these data, it is not possible to say that these data are influenced by each other. The environmental conditions vary according to the time of year, but this is an external variable.
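
One way to put a number on how little a straight line explains is the coefficient of determination of a least-squares trend. The sketch below fits a line with `numpy.polyfit` and computes R² by hand. The synthetic series is an assumption, so the value printed here says nothing about the real INPE data.

```python
import numpy as np

# Synthetic stand-in for the 1,800 monthly values in time order (assumption).
rng = np.random.default_rng(0)
t = np.arange(1800)
y = rng.normal(0.0, 0.8, size=1800)

# Ordinary least-squares line y ~ slope * t + intercept.
slope, intercept = np.polyfit(t, y, deg=1)
fitted = slope * t + intercept

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"slope per month: {slope:.6f}, R^2: {r2:.4f}")
```

An R² close to zero supports the reading that a linear trend captures almost none of the variation; repeating the fit with a higher `deg` shows whether a polynomial does any better.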

05. Formulating at least 3 hypotheses about this data

As requested, 3 hypotheses were formulated, as follows:

1) The ENSO phenomenon will have an impact on sea level rise if the anomalies mostly oscillate between -0.5 and 1.5 °C. Has this been true in the time interval between 1870 and 2020?
2) The temperature anomalies, on average, are higher than -0.10. The decision is based on a check of a random sample of this data.
3) Can we say that there has not been a period of twelve consecutive months in which these temperature anomalies have fallen below 0.5 °C?
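
Hypotheses 1 and 3 can be checked by simple counting. The sketch below uses a short hand-made list of anomalies instead of the real series, and it reads hypothesis 3 as "anomalies below 0.5 °C for twelve months in a row", which is an interpretation of the wording above. The function names are illustrative.

```python
def share_in_band(values, low, high):
    """Fraction of values that fall inside [low, high]."""
    inside = sum(1 for v in values if low <= v <= high)
    return inside / len(values)


def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


# Tiny hand-made example (assumption), in °C.
anomalies = [0.2, 0.6, 0.7, 0.1, -0.3, 0.4, 0.9, 0.3, 1.8, -0.7]

# Hypothesis 1: do the anomalies mostly stay between -0.5 and 1.5 °C?
print(share_in_band(anomalies, -0.5, 1.5))      # 0.8

# Hypothesis 3: longest stretch of consecutive months below 0.5 °C.
print(longest_run([a < 0.5 for a in anomalies]))  # 3
```

On the real series, hypothesis 3 would hold under this reading if the longest run stays below 12.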

06. Conducting a formal significance test for one of the hypotheses and discussing the results

Hypothesis 2 was chosen for the formal significance test.

On average, the temperature anomalies are higher than -0.10. The decision is based on a check of a random sample of this data.

Population: X = 'Water temperature anomalies on sector 3.4 of the Pacific Ocean'

H0: μ = -0.1 (null hypothesis)

Ha: μ > -0.1 (alternative hypothesis)

The approach is to use the one-sample t-test, which determines whether the sample mean is statistically different from a known or hypothesized population mean. The one-sample t-test is a parametric test.
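
A one-sided, one-sample t-test of this form can be sketched with SciPy as shown below. The population is synthetic and the sample size of 100 and the level α = 0.05 are assumptions, since the report does not state them; the printed numbers therefore do not reproduce the report's result.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the 1,800 anomaly values (assumption).
rng = np.random.default_rng(0)
population = rng.normal(0.0, 0.8, size=1800)

# Random sample without replacement; n = 100 is an assumed sample size.
sample = rng.choice(population, size=100, replace=False)

mu0 = -0.1    # hypothesized mean under H0
alpha = 0.05  # assumed significance level

# H0: mu = mu0 against Ha: mu > mu0 (one-sided test).
result = stats.ttest_1samp(sample, popmean=mu0, alternative="greater")

# The same statistic by hand: t = (mean - mu0) / (s / sqrt(n)).
n = sample.size
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
print("reject H0" if result.pvalue < alpha else "fail to reject H0")
```

With `alternative="greater"`, a small p-value supports Ha: μ > -0.1, while a large p-value means the sample does not give enough evidence to reject H0.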

After the test was performed, the p-value was far higher than the standard significance level, so the null hypothesis cannot be rejected. With Ha: μ > -0.1 as stated above, this means the sample provides almost no evidence that, on average, the temperature anomalies are higher than -0.1. Failing to reject H0 does not prove that the mean equals -0.1; it only shows that the sample is compatible with that value.

07. Suggestions for next steps in analyzing this data

Even though the data series presents an almost normal distribution, the scatterplot shows that it has no linear trend and cannot be described by a simple equation, because of the large dispersion of the data, which is reflected in the standard deviations. The analysis of this series therefore needs non-linear approaches in order to describe its behavior more precisely.

08. A paragraph that summarizes the quality of this data set and a request for additional data if needed

The INPE data set is of great quality: no data were missing, no inconsistencies were found, and no values fell far outside the overall range of the data set. It also has an approximately normal distribution, and it showed that several different statistical analyses can be carried out on it.
