Week5 Modified
Modelling/Analysis
Feature Engineering
Data Science Workflow: Explore
● We’ve spent a significant amount of time in this course on this
part, which is the exploration of our data!
○ Sometimes this is called EDA (exploratory data analysis).
● Here we look at statistics on our data, check whether there are
any outliers that might impact our analysis, and visualize the
data to put everything into perspective.
○ Sometimes this step might show us limitations in our data
which might make us circle back to the previous step of
getting good data!
○ For example, do we need more data? Did we capture some
information incorrectly?
● This is a part of the front end work of data science that gets
recognized and often presented.
● In short, this part mainly focuses on cleaning, profiling and
wrangling your data.
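As a minimal sketch of the outlier check described above, the snippet below prints basic statistics and flags values outside the usual 1.5 × IQR fences. The `orders` data and the `iqr_outliers` helper are hypothetical, introduced only for illustration; the IQR rule is one common convention, not the only way to define an outlier.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample: daily order counts with one suspicious entry.
orders = [12, 15, 14, 13, 16, 15, 14, 98, 13, 15]

print("mean:", statistics.mean(orders))
print("median:", statistics.median(orders))
print("outliers:", iqr_outliers(orders))
```

The gap between the mean and the median is itself a hint that something extreme is in the data, which is the kind of finding that may send you back to the data collection step.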
Data Science Workflow: Model
● We now arrive at a key step in the data science process, the
modelling of our data to predict, forecast and provide insights
on our data.
● This part comes with its own challenges, because:
○ How do we choose the best model?
○ How do we test our model(s)?
○ How do we ensure our model will give good results for data
not used to train our model?
● Make sure you document your models and explain how they are
being used in the context of your business problem.
○ Your models will or at least should go through a model
validation process.
○ Consider validation a layer of protection for the business:
it makes sure your models don’t put the business at risk.
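One common way to check how a model behaves on data not used for training is to hold out a test set before fitting anything. The sketch below assumes a simple random holdout is appropriate (it would not be for time series, for example); `train_test_split` here is a hypothetical helper written for this illustration, not a library function.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset: 100 row identifiers.
rows = list(range(100))
train, test = train_test_split(rows, test_fraction=0.2)

print(len(train), "training rows,", len(test), "test rows")
```

The model is fitted on `train` only; `test` is touched once, at the end, to estimate performance on unseen data.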
Data Science Workflow: Share
● Last but not least, after your intense amount of work in the first
4 steps of the data science workflow, we arrive at finally
sharing our results.
● A common misconception is that once your model is built, you
are done. This is not the case!
○ Once your model is done, you must be able to demonstrate
how you will be using your model (an API, run it manually
on a periodic basis, report via Tableau, etc).
○ What conclusions were you able to deduce from your
model?
● As data scientists and machine learning engineers, your job is
not to simply build but to assess the impact of what you have
discovered.
Data Science Workflow: Visual Summary
Data exploration: data preprocessing
● Once you have your hands on the data sources you’ll need for
your problem, the next step is data preprocessing.
● Data preprocessing is a technique which is used to convert
the raw data set into a clean data set.
● It often has many steps involved, like profiling, cleaning,
wrangling, transforming, etc.
● A visual understanding of this process is given as follows:
Data preprocessing: cleaning
● Data cleaning is the process of rectifying data quality issues,
eliminating bad data, and replacing missing values.
● If you have data that is not in the right format for your analysis,
you can encounter many issues in the model building phase
especially if the data is large.
● Sometimes, while getting your data, the process used to acquire
certain variables may have gone wrong, leading to bad data.
● A more extreme case, which ironically happens more often than
not, is that many variables are empty because systems don’t
accurately capture the data.
● If you have data missing, you might have to impute, which is
the process of substituting for missing values using the
average, as one example.
● Some variables may not comply with your company’s rules and
regulations, meaning you can’t use them!
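Mean imputation, mentioned above as one example, can be sketched as follows. The `ages` column and the `impute_mean` helper are hypothetical, and missing values are assumed to be represented as `None`; in practice the right fill strategy depends on why the data is missing.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries.
ages = [34, None, 29, 41, None, 36]
print(impute_mean(ages))
```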
Data preprocessing: reduction
● Besides simply removing variables which might not improve the
performance of our model, what if we had, say, 1000+ variables?
● Making scatter plots for something like this and assessing all
directions of relations is very time consuming.
● There are dimension reduction techniques and feature
importance techniques that make this process much
easier!
● Some techniques are:
○ PCA (principal component analysis)
○ LASSO regression
○ Random Forests
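To make the first technique concrete, here is a minimal sketch of PCA computed with a singular value decomposition in NumPy. The data is synthetic and purely illustrative (5 features generated from 2 hidden factors plus small noise), and the `pca` function is a hypothetical helper, not a library API; a production workflow would more likely use an established implementation.

```python
import numpy as np


def pca(X, n_components):
    """Project X (rows = samples) onto its top principal components.

    Returns the projected data and the fraction of variance explained
    by each kept component.
    """
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]


rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 5 features driven by 2 hidden factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

Z, ratio = pca(X, n_components=2)
print("reduced shape:", Z.shape)
print("variance explained by 2 components:", ratio.sum())
```

Because the five features here are built from only two underlying factors, two components retain almost all of the variance. On real data the explained-variance ratio is what tells you how many components are worth keeping.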
Model building: benchmark models
● Once your data is ready to be used in a model, first ask
yourself: does the organization or stakeholder I work with
already have a benchmark model?
● A benchmark model is the current model used for the
business problem.
○ Based on how the model was evaluated, you must bring up
a model that challenges it, called a challenger model.
● Challenger models and benchmark models must be compared
under the same conditions.
○ They must be trained and tested on the same data.
○ You must compare them using the same evaluation
metric (MSE, R², accuracy, precision, F1, etc).
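A minimal sketch of such a comparison follows, under stated assumptions: the data is made up, the "benchmark" is a stand-in that always predicts the training mean, and the "challenger" is a least-squares line. Both are fitted on the same training data and scored with the same metric (MSE) on the same test data.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_mean(xs, ys):
    """Benchmark stand-in: always predict the training mean."""
    avg = sum(ys) / len(ys)
    return lambda x: avg


def fit_line(xs, ys):
    """Challenger stand-in: ordinary least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


# Hypothetical data, split once and shared by both models.
train_x, train_y = [1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
test_x, test_y = [7, 8], [14.1, 15.9]

models = {
    "benchmark": fit_mean(train_x, train_y),
    "challenger": fit_line(train_x, train_y),
}
scores = {
    name: mse(test_y, [model(x) for x in test_x]) for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: test MSE = {score:.3f}")
```

The important design choice is that the split and the metric are fixed before either model is evaluated, so any difference in score reflects the models rather than the conditions.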
Model building: What model do I choose?
● We want to make sure our model is accurate and precise.