Introduction to Data Science
Lecture as part of TERI-NORCE Research School //
Funded by the Norwegian Ministry of Foreign Affairs
Dr. Michel d. S. Mesquita (1, 2, 3)
(1) NORCE Norwegian Research Centre, Norway / (2) Bjerknes Centre for
Climate Research, Norway / (3) M2Lab.org, Norway
Main reference //
This lecture is based on the book cited below (hereafter ‘KT2018’). It provides
an accessible and concise introduction to data science.
Kelleher and Tierney (2018) Data Science. Cambridge, MA: The
MIT Press Essential Knowledge Series. URL:
https://mitpress.mit.edu/books/data-science
TERI-NORCE CRS, October 2019 - New Delhi 2
Definitions //
Data science /
«… a set of principles, problem definitions,
algorithms, and processes for extracting nonobvious
and useful patterns from large data sets»
Machine learning (ML) /
«… focuses on the design and evaluation of
algorithms for extracting patterns from data»
Data mining /
«… generally deals with the analysis of structured
data and often implies an emphasis on commercial
applications»
KT2018
When to use data science? //
It has to be useful
When we deal with a large number of data examples or the
patterns are too complex for humans to discover and
extract manually
Actionable insight /
Insight = the pattern should give us relevant
information about the problem that is not obvious
Actionable = the insight we get should be something
we can use in some way
Brief history of data science //
Data science starts in the 1990s drawing knowledge from two historical
fields: data collection and data analysis
Data gathering /
Earliest records: markings tracking solstices and sunrises
Transactional data: advent of writing brought record keeping, e.g. Mesopotamia ~3200 BC
Nontransactional data: demographic data, e.g. earliest census in pharaonic Egypt ~3000 BC
Relational data model: in 1970 Edgar F. Codd published a paper on the relational model, which
laid the foundation for modern databases and the Structured Query Language (SQL)
Big data: today we have volume, variety, velocity; NoSQL databases
Brief history of data science //
Thomas Bayes (Wikipedia)
Data science starts in the 1990s drawing knowledge from two historical
fields: data collection and data analysis
Data analysis /
Statistics: the branch of science that deals with the collection and analysis of data
Probability theory: 17th and 18th centuries, e.g. work of Blaise Pascal, Jakob
Bernoulli, Thomas Bayes, and others. Probability distributions became the new tool to
move beyond descriptive statistics
Statistical learning and modern data science: 19th century; new developments in
probability theory facilitated statistical learning, e.g. works by Pierre-Simon Laplace and
Carl Friedrich Gauss, among others; the method of least squares laid the foundation for
linear regression and, later, artificial neural networks
Data visualisation and exploratory data analysis: 1780-1820 William Playfair invented
statistical graphics, such as the line chart, area chart, bar chart, pie chart; these
laid the foundation for modern methods, such as the t-distributed stochastic neighbor
embedding (t-SNE) algorithm
Brief history of data science //
The role of women
The Countess of Lovelace
(Wikipedia)
Ada Lovelace's notes were labelled alphabetically from A
to G. In note G, she describes an algorithm for the
Analytical Engine to compute Bernoulli numbers. It is
considered to be the first published algorithm ever
specifically tailored for implementation on a computer,
and Ada Lovelace has often been cited as the first
computer programmer for this reason.
(Wikipedia)
Brief history of data science //
Claude Shannon (Wikipedia)
Data science starts in the 1990s drawing knowledge from two historical
fields: data collection and data analysis
Data analysis (cont.) /
20th century
Karl Pearson developed modern hypothesis testing; R. A. Fisher developed statistical
methods for multivariate analysis and maximum likelihood estimation
Alan Turing's work in the Second World War led to the invention of the electronic computer
Warren McCulloch and Walter Pitts in 1943 proposed the first mathematical model of a
neural network
Claude Shannon in 1948 published «A Mathematical Theory of Communication» and founded
information theory
Brief history of data science //
Alan Turing (Wikipedia)
Turing (1936, 1950)
Papers by Alan Turing on the topics of computable
numbers and artificial intelligence.
Brief history of data science //
Nils Nilsson (Wikipedia)
Data science starts in the 1990s drawing knowledge from two historical
fields: data collection and data analysis
Data analysis (cont.) /
In 1951 Evelyn Fix and Joseph Hodges proposed a model for discriminatory analysis
(the classification or pattern-recognition problem), which became the basis for modern
nearest-neighbor models
The field of artificial intelligence was established at a workshop at Dartmouth
College in 1956
The term machine learning began to be used to describe programs that gave a
computer the ability to learn from data
Nils Nilsson’s book Learning Machines in 1965 showed how neural networks could be used
to learn linear models for classification
Brief history of data science //
Ensemble fatigue (Benestad et al., 2017, Nature Climate Change)
Data science starts in the 1990s drawing knowledge from two historical
fields: data collection and data analysis
Data analysis (cont.) /
Earl Hunt, Janet Marin, and Philip Stone in 1966 developed the concept-learning system
framework, which was the progenitor of an important family of ML algorithms that
induced decision-tree models
A number of independent researchers developed and published earlier versions of the
k-means clustering algorithm
Today: some of the most important developments include ensemble models, where
predictions are made using a set of models, and deep-learning neural networks, which
have multiple layers of neurons (with applications in machine vision, natural-language
processing,…)
The Emergence and Evolution of Data Science //
1990s: the term «data science» came to prominence in discussions of the need for statisticians to join with computer
scientists to bring mathematical rigor to computational analysis of large data sets
1997: C.F. Jeff Wu’s public lecture «Statistics = Data Science?»
2001: William S. Cleveland published an action plan for creating a university department
in the field of data science
Also in 2001 //
Leo Breiman published the paper «Statistical Modeling: The Two Cultures» /
His distinction between a statistical focus on models that explain the data versus an algorithmic focus on
models that can actually predict the data highlights a core difference between statisticians and ML researchers
Today //
The role of a data scientist has
become very broad
It is difficult for an individual
to master all of the skill areas
KT2018
What are data? //
Datum /
A datum (or piece of information) is an abstraction of a
real-world entity (person, object, or event)
The terms variable, feature, and attribute are
often used interchangeably to denote an
individual abstraction
Each entity is typically described by a number of
attributes. For instance, a book might have the
following attributes: author, title, topic,
genre, publisher, price, etc.
What are data? What is a data set? //
Data set /
It consists of the data relating to a collection
of entities, with each entity described in terms
of a set of attributes
In its most basic form a data set is organized in
an n × m data matrix (n entities as rows, m attributes as columns) called the analytics record
KT2018
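The n × m layout described above can be sketched in a few lines of Python. This is only an illustration: the three books, their values, and the helper names `shape` and `column` are hypothetical and do not come from KT2018.

```python
# Attribute names (columns); a subset of the book attributes listed earlier.
attributes = ["author", "title", "genre", "price"]

# Hypothetical entities (rows): every value here is invented for illustration.
analytics_record = [
    ["Author A", "Title A", "science", 15.0],
    ["Author B", "Title B", "fiction", 9.5],
    ["Author C", "Title C", "history", 12.0],
]


def shape(record):
    """Return (n, m): the number of entities and the number of attributes."""
    n = len(record)
    m = len(record[0]) if record else 0
    return n, m


def column(record, names, attribute):
    """Return the values of one attribute across all entities."""
    j = names.index(attribute)
    return [row[j] for row in record]


n, m = shape(analytics_record)
prices = column(analytics_record, attributes, "price")
print(f"{n} entities x {m} attributes; prices = {prices}")
```

Each row is one entity and each column is one attribute, so selecting a column gives the distribution of a single attribute across the whole data set.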
Challenges //
Choice of attributes /
How do you choose the most appropriate
attributes?
Different attribute types /
Numeric, nominal and ordinal.
The data type of an attribute affects the methods we can use
to analyze and understand the data, including both the basic
statistics we can use to describe the distribution of values
that an attribute takes and the more complex algorithms we
use to identify the patterns of relationships between
attributes
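A small sketch of why the attribute type matters for even the most basic statistics. The attribute names and values are hypothetical, and the pairing of statistic to type (mean for numeric, mode for nominal, median category for ordinal) is a common convention assumed here rather than something stated on the slide.

```python
from statistics import mean, mode

# Hypothetical attribute values, invented for illustration.
temperature_c = [21.5, 23.0, 19.5, 23.0]                 # numeric
region = ["north", "south", "south", "east"]             # nominal (no order)
intensity = ["low", "high", "medium", "low", "high"]     # ordinal (ordered)

# An explicit ranking is what makes the ordinal attribute orderable.
INTENSITY_ORDER = ["low", "medium", "high"]


def ordinal_median(values, order):
    """Median category of an ordinal attribute (lower middle for even counts)."""
    ranks = sorted(order.index(v) for v in values)
    return order[ranks[(len(ranks) - 1) // 2]]


numeric_summary = mean(temperature_c)                      # arithmetic is meaningful
nominal_summary = mode(region)                             # only counting is meaningful
ordinal_summary = ordinal_median(intensity, INTENSITY_ORDER)  # ordering is meaningful

print(numeric_summary, nominal_summary, ordinal_summary)
```

Taking the mean of `region` would be meaningless, and taking the mean of `intensity` would require pretending the gaps between categories are equal, which is why the type has to be known before choosing a method.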
Perspectives on data //
Structured data /
Can be stored in a table, and every instance in the table has the same
structure (i.e., set of attributes)
Unstructured data /
Each instance in the data set may have its own internal structure, and
this structure is not necessarily the same in every instance
CRISP-DM Model //
KT2018
Data Science ecosystem //
KT2018
Machine learning (ML) //
ML involves a two-step process /
Step 1: Identify useful patterns given some data
(modelling). Some methods include:
- Decision trees
- Regression models
- Neural networks
Step 2: Once a model is created, it is used for analysis
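The two steps can be sketched with the simplest possible decision tree, a single split on one numeric attribute (often called a decision stump). This is a simplified illustration, not a full decision-tree algorithm; the rainfall values, the "flood reported" labels, and the function names are hypothetical.

```python
def train_stump(xs, labels):
    """Step 1 (modelling): pick the threshold on one numeric attribute
    that misclassifies the fewest training instances.

    The resulting model predicts True when x > threshold.
    Returns None if all attribute values are identical.
    """
    candidates = sorted(set(xs))
    best_threshold, best_errors = None, len(xs) + 1
    # Candidate thresholds are midpoints between adjacent distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        errors = sum((x > threshold) != label for x, label in zip(xs, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def predict(threshold, x):
    """Step 2 (analysis): apply the learned model to a new instance."""
    return x > threshold


# Hypothetical training data: seasonal rainfall (mm) and a "flood reported" label.
rain_mm = [900, 950, 980, 1000, 1100, 1140]
flood = [False, False, False, True, True, True]

threshold = train_stump(rain_mm, flood)
print(threshold, predict(threshold, 1050), predict(threshold, 920))
```

The model here is nothing more than the learned threshold: training searches for it, and analysis reuses it on instances that were not in the training data.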
Supervised versus Unsupervised Learning //
Supervised learning /
Learn a function that maps from the values of the attributes to the value of
another (target) attribute.
«Supervised» because each of the instances in the data set lists both the
input values and the output (target) value for each instance.
Each instance in the data set must be labeled with the value of the target
attribute.
Unsupervised learning /
There is no target attribute. The algorithm has the more general task of
looking for regularities in the data.
- Cluster analysis
- Euclidean distance
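For the unsupervised case, cluster analysis with Euclidean distance can be sketched as a plain k-means loop. The points and starting centroids are hypothetical, and fixing the starting centroids by hand (instead of choosing them at random) is a simplification made here so the run is repeatable.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with the same number of attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return assignments, centroids


# Hypothetical unlabeled instances with two numeric attributes each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
assignments, centroids = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(assignments)
```

No instance carries a target label; the grouping comes only from how close the instances are to each other in attribute space.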
Methods //
Source: https://tinyurl.com/yydmyo8x
Activity //
Q1 and Q2
Simple linear regression //
Hypothetical dataset
Ten years of accumulated All India Rainfall in mm for the monsoon season and
pre-monsoon DJF mean ENSO values for the corresponding year
All_India_Rainfall (mm)    El_Nino_Southern_Oscillation
1030                       -1.50
950                         0.50
980                         0.80
1099                       -1.80
1100                       -1.80
1140                       -1.90
999                         0.90
950                         0.85
988                         0.88
Definitions /
Let $x_i$ be the value of the explanatory variable for the ith datapoint, and
$Y_i$ the random variable representing the response for the same datapoint
Simple linear regression //
Assumptions /
(1) $E(Y_i) = \alpha + \beta x_i$
The mean of the response variable $Y_i$ depends on the value $x_i$ of the
explanatory variable in a linear fashion
(2) $Y_i = \alpha + \beta x_i + \epsilon_i$
The variation of the response variable $Y_i$ about the mean is represented by a
random variable $\epsilon_i$
(3) $E(\epsilon_i) = 0$
Note that for the mean of $Y_i$ to be as in equation (1) above, the mean of
$\epsilon_i$ has to be zero
(4) $\epsilon_i \sim N(0, \sigma^2)$
The variance of $\epsilon_i$ is the same for all values of the explanatory
variable (error variance). Also, the different $\epsilon_i$ ($i = 1, 2, \ldots, n$)
are independent of each other
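The assumptions above can be checked informally by simulation. The sketch below draws responses from the model with independent normal errors and confirms that the errors average close to zero. The parameter values, the seed, and the function name `simulate_responses` are hypothetical choices made for illustration.

```python
import random
from statistics import mean, stdev


def simulate_responses(x_values, alpha, beta, sigma, rng):
    """Draw Y_i = alpha + beta * x_i + eps_i with independent eps_i ~ N(0, sigma^2)."""
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in x_values]


# Hypothetical parameter values, chosen only for illustration.
ALPHA, BETA, SIGMA = 1000.0, -50.0, 20.0

rng = random.Random(42)  # fixed seed so the run is repeatable
x_values = [-2.0, -1.0, 0.0, 1.0, 2.0] * 2000
y_values = simulate_responses(x_values, ALPHA, BETA, SIGMA, rng)

# Errors recovered from the known line: their mean should be near zero and
# their spread should be near SIGMA, whatever the value of x.
errors = [y - (ALPHA + BETA * x) for x, y in zip(x_values, y_values)]
print(f"mean error = {mean(errors):.3f}, error sd = {stdev(errors):.3f}")
```

With real data the true line is unknown, so the errors cannot be computed this way; the simulation only illustrates what the assumptions say about them.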
Simple linear regression //
Alternative way to write the model /
Property of distributions: if $X \sim N(\mu, \sigma^2)$, then
$aX + b \sim N(a\mu + b, a^2\sigma^2)$
Also, assuming that $Y_i$ is a fixed quantity $\alpha + \beta x_i$ plus a random
variable $\epsilon_i \sim N(0, \sigma^2)$, then:
$Y_i \sim N(\alpha + \beta x_i, \sigma^2)$
This is the simple linear regression model
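As a sketch of how the model might be fitted, the snippet below estimates $\alpha$ and $\beta$ by ordinary least squares from the nine (ENSO, rainfall) pairs legible in the hypothetical data set. The slides do not state an estimation method, so the use of least squares here is an assumption, as are the function and variable names.

```python
from statistics import mean

# The nine pairs legible in the hypothetical data set:
# explanatory variable x = ENSO index, response Y = All India Rainfall (mm).
enso = [-1.50, 0.50, 0.80, -1.80, -1.80, -1.90, 0.90, 0.85, 0.88]
rainfall_mm = [1030, 950, 980, 1099, 1100, 1140, 999, 950, 988]


def fit_simple_linear_regression(x, y):
    """Least-squares estimates (alpha, beta) for the line y = alpha + beta * x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta = s_xy / s_xx
    alpha = y_bar - beta * x_bar
    return alpha, beta


alpha_hat, beta_hat = fit_simple_linear_regression(enso, rainfall_mm)

# Residuals are the observed responses minus the fitted means alpha + beta * x_i.
residuals = [y - (alpha_hat + beta_hat * x) for x, y in zip(enso, rainfall_mm)]

print(f"alpha_hat = {alpha_hat:.1f} mm, beta_hat = {beta_hat:.1f} mm per ENSO unit")
```

The fitted slope is negative for these values, meaning that higher ENSO values go with lower rainfall in this hypothetical data set; the residuals play the role of the $\epsilon_i$ in the model.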
Ladder of powers //
Sometimes we need to transform the data. Provided that x > 1, we can use
the following transformation alternatives:
- Reduce the high values in a dataset relative to low values
- Stretch out high values relative to low ones
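The two directions can be illustrated numerically. The slide's own list of transformations did not survive, so the rungs below are the ones commonly given for Tukey's ladder of powers and should be read as an assumption; the sample values and the helper `upper_to_lower_gap` are also hypothetical.

```python
import math

# Commonly cited rungs of the ladder of powers, ordered from the strongest
# compression of high values to the strongest stretching. Assumes x > 1.
LADDER = {
    "-1/x": lambda x: -1.0 / x,
    "log(x)": math.log,
    "sqrt(x)": math.sqrt,
    "x": lambda x: x,
    "x^2": lambda x: x ** 2,
    "x^3": lambda x: x ** 3,
}


def upper_to_lower_gap(transform, low, mid, high):
    """Ratio of the transformed upper gap (mid..high) to the lower gap (low..mid).

    For equally spaced raw values the ratio is 1 before transformation:
    below 1 means high values were pulled in, above 1 means they were stretched.
    """
    t_low, t_mid, t_high = transform(low), transform(mid), transform(high)
    return (t_high - t_mid) / (t_mid - t_low)


# Three equally spaced hypothetical values.
ratios = {
    name: upper_to_lower_gap(transform, 10.0, 55.0, 100.0)
    for name, transform in LADDER.items()
}
for name, ratio in ratios.items():
    print(f"{name:8s} upper/lower gap ratio = {ratio:.2f}")
```

Moving down the ladder (square root, log, reciprocal) is the usual choice for right-skewed data, while moving up (square, cube) is used when the low values are bunched together.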
Artificial Neural Networks //
Deeplearning.ai
Video by Andrew Ng
URL: https://youtu.be/n1l-9lIMW7E
Deep Learning //
MATLAB Tech Talks
Video by Shyamal Patel
URL: https://www.youtube.com/watch?v=3cSjsTKtN9M
Example //
Recognising objects with Convolutional Neural Networks
Michel’s solution, based on video from Deeplearning.ai
Discussion based on the paper by Dhar (2013) //
Question 1
In which ways is data
science different from
statistics?
Discussion based on the paper by Dhar (2013) //
Question 2
How much do you
agree/disagree with Karl
Popper’s argument for
evaluating a theory and
scientific progress?
Karl Popper (Wikipedia)
Discussion based on the paper by Dhar (2013) //
Question 3
What powerful capability
is machine learning able
to give us?
Discussion based on the paper by Dhar (2013) //
Question 4
«…simpler models are more likely
to hold up on future observations
than more complex ones, all else
being equal» (p.68)
How much do you
agree/disagree with this
statement?
Discussion based on the paper by Dhar (2013) //
Question 5
Why are problem
formulation skills so
important?
Discussion based on the paper by Dhar (2013) //
Question 6
Where do errors come from?
Discussion based on the paper by Dhar (2013) //
Question 7
How does the Internet
contribute to inexpensive
large-scale randomized
experiments?
Discussion based on the paper by Dhar (2013) //
Question 8
Why is earth science
considered to be one of
the «data-starved areas of
inquiry»?
Discussion based on the paper by Dhar (2013) //
Question 9
What should universities
do to improve their
students’ data science
skills?
Discussion based on the paper by Dhar (2013) //
Question 10
How would you like to
integrate data science in
your own research work?
Summary and future work //
Thank you for your attention!
Feel free to get in touch with us //
[email protected]
+47 4760 2340
In conclusion //
Data science has become a growing field, thanks to many past
researchers in the field. Climate science will benefit even more
in the years to come, as more and more researchers and students
learn how to apply different algorithms to solve climate
problems.
We hope this Climate Research School can inspire you as well! Over
the coming days, you will learn several tools that can
become part of your professional toolbox!