Unit6 Part3 General Procedure
7COM1073
Exploring the Data
• The analyst should spend some time learning about the problem domain
• Extract features/predictors/attributes
• Aims:
➢Understand characteristics of the predictors
➢Understand relationships among the predictors
➢What can we infer from the preliminary inferential statistics? We want to be
able to understand our data a bit more than when we first found it. [1]
• Using graphs and/or unsupervised learning algorithms
Basic Questions for Data Exploration [1]
• Is the data organized or not? (structured or unstructured?)
➢ Transform unorganized data into a row/column structure.
• What does each row represent?
➢ Identify what each row actually represents.
• What does each column represent?
➢ Identify each column by its level of measurement and whether it is quantitative or qualitative, and so on.
• Are there any missing data points?
➢ Data scientists must make decisions about how to deal with missing data.
• Do we need to perform any transformations on the columns?
➢ For example, generally speaking, for the sake of statistical modelling and machine learning, we would like each
column to be numerical.
➢ Data normalisation.
• Is the dataset balanced or imbalanced?
➢ Use sampling methods for an imbalanced dataset.
• Are there any inconsistent data points?
➢ Make decisions about how to deal with them.
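The checks above can be sketched in a few lines of pandas. The DataFrame below is a hypothetical toy dataset (column names and values are illustrative only); the pandas calls themselves are standard.

```python
import pandas as pd

# Hypothetical toy dataset: names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "income": [30000, 45000, 52000, None],
    "label": ["yes", "no", "yes", "yes"],
})

# What does each column represent, and is it quantitative or qualitative?
print(df.dtypes)

# Are there any missing data points?
print(df.isna().sum())

# Is the dataset balanced or imbalanced?
print(df["label"].value_counts())
```

Each printout answers one of the basic questions: the dtypes separate numerical from categorical columns, `isna().sum()` counts missing points per column, and `value_counts()` on the target reveals class imbalance.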
Modelling the Data
• Model tuning
✓ Data splitting
✓ Model choosing
• Model performance
✓ Supervised learning
❖ Classification – accuracy rate; confusion matrix
❖ Regression – mean squared error
✓ Unsupervised learning
❖ For example: quantisation error
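The supervised metrics listed above are available directly in scikit-learn; a minimal sketch with made-up labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

# Classification: accuracy rate and confusion matrix (toy labels)
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
acc = accuracy_score(y_true, y_pred)   # fraction of correct predictions: 4/5 = 0.8
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

# Regression: mean squared error (toy targets)
mse = mean_squared_error([2.0, 3.0], [2.5, 2.0])  # ((0.5)**2 + (1.0)**2) / 2 = 0.625
```

The confusion matrix breaks the accuracy down per class, which matters on imbalanced datasets where a high overall accuracy can hide poor performance on the minority class.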
Modelling the Data – Data Splitting
• Training set: used to determine values of the parameters.
• Validation set: used to tune the model's hyperparameters and to
evaluate the accuracy of the model.
• Test set: the selected model should be further tested by measuring its
accuracy on a third independent set of data.
Always use parameters extracted from the training set to normalise the
validation/test set
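One common way to obtain the three sets is two successive random splits; the sketch below uses scikit-learn's `train_test_split` on a tiny synthetic array (the 60/20/20 proportions are an illustrative choice, not a rule).

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# First split off the test set, then split the remainder into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
# Result: 60% training, 20% validation, 20% test
```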
Python – Data Preprocessing
Always use parameters extracted from the training set to normalise the validation/test set
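In scikit-learn this rule corresponds to calling `fit` only on the training set and reusing the fitted scaler on the other sets. A minimal sketch with a toy array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0]])

scaler = StandardScaler()
scaler.fit(X_train)                   # mean and std are computed from the training set only
X_train_n = scaler.transform(X_train)
X_test_n = scaler.transform(X_test)   # the same parameters are reused; never refit on test data
```

Fitting the scaler on the test set (or on the combined data) would leak information from the test set into preprocessing and bias the performance estimate.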
Modelling the Data - Data Splitting [4]
Training set, validation set and test set
➢The simplest way to split the data into a training and test set is to take a
simple random sample.
– To account for the outcome when splitting the data, stratified random sampling applies
random sampling within subgroups (such as the classes).
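Stratified random sampling is the `stratify` argument of `train_test_split`; the sketch below uses a deliberately imbalanced toy label vector to show that the class proportions are preserved in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)   # imbalanced: 80% class 0, 20% class 1

# stratify=y applies random sampling within each class,
# so both subsets keep the 80/20 class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)
```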
Modelling the Data - Data Splitting [4]
Training, validation and test set
When the dataset is small, holding out a separate test set can be problematic:
– The model may need every possible data point to adequately determine model values.
– The uncertainty of a small test set can be considerably large, to the point where different
test sets may produce very different results.
“A schematic illustration of the behaviour of training and validation set errors during a typical
training session, as a function of the iteration step 𝜏. The goal of achieving the best generalisation
performance suggests that training should be stopped at the point 𝜏̂ corresponding to the minimum of
the validation set error.”
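The stopping rule described in the quote can be sketched as a simple early-stopping loop. The validation-error values below are made up for illustration, and the `patience` mechanism (tolerating a few rising steps before stopping) is a common practical addition, not part of the quote.

```python
# Minimal early-stopping sketch; the error curve is hypothetical.
val_errors = [0.9, 0.6, 0.4, 0.35, 0.38, 0.45]  # validation error per iteration step

best_step, best_err = 0, float("inf")
patience, waited = 2, 0
for step, err in enumerate(val_errors):
    if err < best_err:
        best_step, best_err = step, err   # new minimum of the validation error
        waited = 0
    else:
        waited += 1
        if waited >= patience:            # stop once the error has risen for `patience` steps
            break
# best_step plays the role of the stopping point at the validation-error minimum
```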
Modelling the Data - Learning Curve [6]
• When a model overfits a small dataset:
➢The training error is relatively small, while the validation error is relatively
large.
• When a model underfits a large dataset:
➢The training error may increase, but the validation error may decrease.
Useful links:
• https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
• https://rstudio-conf-2020.github.io/dl-keras-tf/notebooks/learning-curve-diagnostics.nb.html
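A learning curve like the ones discussed in the links above can be computed with scikit-learn's `learning_curve` helper; the dataset and model below are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=200, random_state=0)

# Score the model on growing fractions of the training data, with 5-fold CV
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    train_sizes=[0.2, 0.5, 1.0], scoring="accuracy")

# A large gap between training and validation scores suggests overfitting;
# two low, converging curves suggest underfitting.
```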
Modelling the Data - Choosing between Models [4]
Modelling the Data - Selecting the Best Model [5]
“If our estimator is underperforming, how should we move forward?
The ability to determine what steps will improve your model is what
separates the successful machine learning practitioners from the
unsuccessful.”
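One systematic way to choose between candidate models or hyperparameter settings is a cross-validated grid search; the classifier and the parameter grid below are illustrative choices, not a recommendation from the source.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, random_state=0)

# Cross-validated search over a small, hypothetical hyperparameter grid
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5]}, cv=5)
grid.fit(X, y)

print(grid.best_params_)   # the setting with the best cross-validation score
```

Because each setting is scored on held-out folds rather than the training data, the comparison between models is less biased by overfitting.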
Communicating and Visualising the Results
• This is arguably the most important step. While it might seem obvious
and simple, the ability to communicate your results in a digestible
format is much more difficult than it seems. [1]