Unit6 Part3 General Procedure

The document outlines the five main steps of data science: (1) asking an interesting question, (2) obtaining data, (3) exploring the data, (4) modelling the data, and (5) communicating results. Key aspects of each step are discussed, including splitting data into training, validation, and test sets for modelling, and using visualizations and appropriate communication methods to share results with audiences.


Foundations of Data Science

7COM1073

The Five Steps of Data Science


Learning Outcomes
At the end of this session, you should know:
• The five steps of data science
• The aim of splitting data into a training set, a validation set and a test set
The Five Steps of Data Science
Asking an Interesting Question
• You may be given the question, or you may have a proposal but no specific research question yet.
➢Consult an expert in the field [1]
➢Is it a supervised or an unsupervised learning problem?

Obtaining the Data
• Automatic object detection and categorisation in deep astronomical imaging surveys using unsupervised
machine learning [2]

• Identify and classify compounds with potential as transdermal enhancers [3]

• Text data
Exploring the Data
• The analyst should spend some time acquiring knowledge of the domain
• Extract features/predictors/attributes
• Aims:
➢Understand characteristics of the predictors
➢Understand relationships among the predictors
➢What can we infer from the preliminary inferential statistics? We want to be
able to understand our data a bit more than when we first found it. [1]
• Using graphs and/or unsupervised learning algorithms
Basic Questions for Data Exploration [1]
• Is the data organized or not? (structured or unstructured?)
➢ Transform unorganized data into a row/column structure.
• What does each row represent?
➢ Identify what each row actually represents.
• What does each column represent?
➢ Identify each column by its level of measurement, whether it is quantitative or qualitative, and so on.
• Are there any missing data points?
➢ Data scientists must make decisions about how to deal with missing data.
• Do we need to perform any transformations on the columns?
➢ For example, generally speaking, for the sake of statistical modelling and machine learning, we would like each
column to be numerical.
➢ Data normalisation.
• Is the dataset balanced or imbalanced?
➢ Use sampling methods for an imbalanced dataset.
• Are there any inconsistent data points?
➢ Make a decision about how to deal with them.
Modelling the Data

• Model tuning
✓ Data splitting
✓ Model choosing

• Model performance
✓ Supervised learning
❖ Classification – accuracy rate; confusion matrix
❖ Regression – mean squared error
✓ Unsupervised learning
❖ For example: quantisation error
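The performance measures named above can be sketched in a few lines; scikit-learn is an assumption here (the slides do not prescribe a library), and the small arrays are illustrative only:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

# Classification: accuracy rate and confusion matrix
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
acc = accuracy_score(y_true, y_pred)       # fraction of correct predictions
cm = confusion_matrix(y_true, y_pred)      # rows: true class, columns: predicted class

# Regression: mean squared error
y_obs = [2.0, 3.5, 4.0]
y_hat = [2.5, 3.0, 4.0]
mse = mean_squared_error(y_obs, y_hat)     # average of squared residuals

print(acc)   # 0.8
print(cm)
print(mse)
```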
Modelling the Data – Data Splitting
• Training set: used to determine values of the parameters.
• Validation set: used to tune the model's hyperparameters and to
evaluate the accuracy of the model.
• Test set: the selected model should be further tested by measuring its
accuracy on a third independent set of data.

Always use parameters extracted from the training set to normalise the
validation/test set
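The three-way split above can be sketched with two calls to scikit-learn's `train_test_split` (the library choice and the 60/20/20 proportions are illustrative assumptions, not prescribed by the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# First hold out 20% as the independent test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# ...then split the remainder 75/25, giving 60/20/20 overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```

The training set fits the parameters, the validation set tunes hyperparameters, and the test set is touched only once, for the final accuracy estimate.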
Python – Data Preprocessing

Which of the following code segments should be used?

Always use parameters extracted from the training set to normalise the validation/test set
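A sketch of the correct pattern the slide is pointing at: the normalisation parameters (mean and standard deviation) come from the training set only, and the same fitted transformer is then applied to the test data. `StandardScaler` is an assumption; any normaliser follows the same fit-on-train, transform-everything pattern:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [4.0]])

scaler = StandardScaler().fit(X_train)    # parameters estimated from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training-set mean/std

# The wrong pattern would be fitting on X_test (or on the combined data):
# that leaks information from the test set into preprocessing.
print(scaler.mean_)  # [2.]
```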
Modelling the Data - Data Splitting [4]
Training set, validation set and test set

➢The simplest way to split the data into a training and test set is to take a
simple random sample.

– To account for the outcome when splitting the data, stratified random sampling applies
random sampling within subgroups (such as the classes).
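Stratified random sampling can be sketched via the `stratify` argument of scikit-learn's `train_test_split` (an assumption; the data below is synthetic and deliberately imbalanced):

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 80 + [1] * 20)   # imbalanced outcome: 80% class 0, 20% class 1
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Random sampling is applied within each class, so the class-1
# proportion is preserved in both partitions.
print(np.mean(y_tr == 1), np.mean(y_te == 1))
```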
Modelling the Data - Data Splitting [4]
Training, validation and test set

➢With small sample sizes:

– The model may need every possible data point to adequately determine model values.

– The uncertainty of the test set can be considerably large to the point where different test
sets may produce very different results.

– A small test set would have limited utility as a judge of performance.


Modelling the Data - Learning Curve [6]

“A schematic illustration of the behaviour of training and validation set errors during a typical training session, as a function of the iteration step 𝜏. The goal of achieving the best generalisation performance suggests that training should be stopped at the point 𝜏̂ corresponding to the minimum of the validation set error.”
Modelling the Data - Learning Curve [6]
• When a model overfits a small dataset:
➢The training error is relatively small, while the validation error is relatively large.
• When a model underfits a large dataset:
➢The training error may increase, but the validation error may decrease.
• A model will never, except by chance, give a lower error on the validation set than on the training set.

Useful links:
• https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
• https://rstudio-conf-2020.github.io/dl-keras-tf/notebooks/learning-curve-diagnostics.nb.html
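Bishop's curve can be reproduced by recording training and validation errors at each iteration step 𝜏 and stopping at the validation minimum. The sketch below uses a synthetic dataset and scikit-learn's `SGDClassifier` with `partial_fit`, all illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SGDClassifier(random_state=0)
train_err, val_err = [], []
for tau in range(50):                            # iteration step
    clf.partial_fit(X_tr, y_tr, classes=np.unique(y))
    train_err.append(1 - clf.score(X_tr, y_tr))  # training-set error
    val_err.append(1 - clf.score(X_val, y_val))  # validation-set error

# Early stopping: the point corresponding to the minimum validation error
best_tau = int(np.argmin(val_err))
print(best_tau, val_err[best_tau])
```

Plotting `train_err` and `val_err` against `tau` gives the learning curve discussed in the links above.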
Modelling the Data - Choosing between Models [4]
Modelling the Data - Selecting the Best Model [5]
“If our estimator is underperforming, how should we move forward?
• Use a more complicated/more flexible model
• Use a less complicated/less flexible model
• Gather more training samples
• Gather more data to add features to each sample
The answer to this question is often counterintuitive.
The ability to determine what steps will improve your model is what separates the successful machine learning practitioners from the unsuccessful.”
Communicating and Visualising the Results
• This is arguably the most important step. While it might seem obvious
and simple, the ability to conclude your results in a digestible format
is much more difficult than it seems. [1]

• Depending on your audience, it can greatly matter how you present your findings. Your results are only as good as your vehicle of communication. You can predict the movement of the market with 99.99% accuracy, but if your program is impossible to execute, your results will go unused. Likewise, if your vehicle is inappropriate for the field, your results will go equally unused. [1]
A Feature Selection Method [4]
A less theoretical, more heuristic approach
1. Calculate the correlation matrix of the predictors.
2. Determine the two predictors associated with the largest absolute
pairwise correlation (call them predictors A and B).
3. Determine the average correlation between A and the other
variables; then also determine the average correlation between B
and the other variables.
4. If A has a larger average correlation, remove it; otherwise, remove
predictor B.
5. Repeat Steps 2-4 until no absolute correlations are above the
threshold.
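Steps 1-5 above can be sketched directly in NumPy; the threshold of 0.75 and the synthetic data are illustrative assumptions:

```python
import numpy as np

def correlation_filter(X, threshold=0.75):
    """Return indices of the columns of X to keep, after iteratively
    removing the predictor with the larger average absolute correlation."""
    keep = list(range(X.shape[1]))
    while len(keep) >= 2:
        # Step 1: absolute correlation matrix of the remaining predictors
        corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
        np.fill_diagonal(corr, 0.0)
        # Step 2: the pair (A, B) with the largest absolute correlation
        a, b = np.unravel_index(np.argmax(corr), corr.shape)
        if corr[a, b] <= threshold:   # Step 5: stop below the threshold
            break
        # Steps 3-4: remove whichever has the larger average correlation
        drop = a if corr[a].mean() > corr[b].mean() else b
        keep.pop(drop)
    return keep

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     x1 + 0.01 * rng.normal(size=200),  # near-duplicate of x1
                     rng.normal(size=200)])             # independent predictor
print(correlation_filter(X))  # one of the two highly correlated columns is dropped
```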
References
[1] S. Ozdemir: Principles of Data Science. Packt Publishing, 2016. Chapters 1-3.
[2] A. Hocking: Automatic object detection and categorisation in deep astronomical imaging surveys using unsupervised machine learning. PhD thesis, University of Hertfordshire, July 2018.
[3] A. Shah: Performing Classification and Regression Analysis on Potential Transdermal Enhancers. MSc thesis, University of Hertfordshire, 2012.
[4] M. Kuhn and K. Johnson: Applied Predictive Modeling. Springer, 2013. Chapters 3 & 4.
[5] J. VanderPlas: Python Data Science Handbook. O'Reilly, 2016. Pages 359-379.
[6] C. M. Bishop: Neural Networks for Pattern Recognition. Oxford University Press, 1995.
