
BDM 2053

Big Data Algorithms and Statistics


Weekly Course Objectives
● Data Science Workflow
● Data pre-processing
○ Data cleaning
○ Data reduction
● Model building
○ Benchmark models.
○ What model do I use?
● Class imbalance
● NO EXAMPLES IN PYTHON TODAY!
What is the data science workflow?
● A data science workflow outlines the end-to-end steps
needed to complete a data science project.
● It helps all members involved in the data science project align on and understand the key milestones, steps and blockers associated with the project.
○ For project methodology, you can use Agile, Waterfall, etc.
● A proper workflow acts as a set of "guard rails" for your project, preventing it from derailing and making sure everyone is on the same page.
● The breakdown is given by the following flow diagram:

ASK → GET → EXPLORE → MODEL → SHARE


Data Science Workflow: Ask
● The first step in a data science workflow is to ask the questions
you wish to answer. This should always be your first priority.
○ Ex: Who is most likely to end their business with us?
● Many times, business leaders and HiPPOs (the highest paid person's opinion) overshadow what the data suggests.
○ Data can show us what is actually happening, versus what people think they see.
● Machine learning and data science are really powerful ways to get results, but sometimes they can be overkill.
○ After some experience working with data, you can quickly recognize whether your stakeholder is looking for something quick and analytical or a full machine learning process.
● If a project feels like it is derailing, you can always look back at this step to make sure you are still aiming to solve your main question(s).
Data Science Workflow: Get
● Once you know what you want to answer (or while you are still refining it), the next step is to get your data.
● Data is often stored across many different areas within a large organization (typically in the cloud or in data stores like Amazon Web Services or Hadoop).
● Very rarely is your data packaged neatly for you.
● You will often have to process your data, join it across many other tables, and implement business logic to capture the metrics, KPIs or features that might be important in later steps.
● This is typically the hardest part of data science and of any machine learning task.
● This step has many sub-steps, which we will get into in more depth in later slides.
Data Science Iceberg analogy

● Tip of the iceberg: Modelling/Analysis. All the "sexy" stuff we see with analytics, machine learning and data science as a whole.
● Below the waterline: Documentation, Getting Data, Data Preprocessing and Feature Engineering. The visible work would not be possible without this underlying work.
● Bottom line: if you have garbage data, you get garbage results.
Data Science Workflow: Explore
● We’ve spent a significant amount of time in this course on this
part, which is the exploration of our data!
○ Sometimes this is called EDA (exploratory data analysis).
● Here we look at statistics on our data, see if there are any outliers that might impact our analysis, and visualize the data to put everything into perspective.
○ Sometimes this step might show us limitations in our data
which might make us circle back to the previous step of
getting good data!
○ For example, do we need more data? Did we capture some
information incorrectly?
● This is the part of the front-end work of data science that gets recognized and is often presented.
● In short, this part mainly focuses on cleaning, profiling and
wrangling your data.
Data Science Workflow: Model
● We now arrive at a key step in the data science process: modelling our data to predict, forecast and provide insights.
● This part comes with its own challenges, because:
○ How do we choose the best model?
○ How do we test our model(s)?
○ How do we ensure our model will give good results for data
not used to train our model?
● Make sure you document your models and explain how they are being used in the context of your business problem.
○ Your models will, or at least should, go through a model validation process.
○ Consider model validation a layer of protection for the business, making sure you don't put it at risk.
Data Science Workflow: Share
● Last but not least, after the intense amount of work in the first four steps of the data science workflow, we finally arrive at sharing our results.
● A common misconception is that once your model is built, you are done. This is not the case!
○ Once your model is done, you must be able to demonstrate
how you will be using your model (an API, run it manually
on a periodic basis, report via Tableau, etc).
○ What conclusions were you able to deduce from your
model?
● As data scientists and machine learning engineers, your job is
not to simply build but to assess the impact of what you have
discovered.
Data Science Workflow: Visual Summary
Data exploration: data preprocessing
● Once you have your hands on the data sources you’ll need for
your problem, the next step is data preprocessing.
● Data preprocessing is the set of techniques used to convert a raw data set into a clean data set.
● It often involves many steps, like profiling, cleaning, wrangling, transforming, etc.
● A visual understanding of this process is given as follows:
Data preprocessing: cleaning
● Data cleaning is the process of rectifying data quality issues, eliminating bad data and replacing missing values.
● If you have data that is not in the right format for your analysis, you can encounter many issues in the model building phase, especially if the data is large.
● Sometimes while getting your data, you may have messed up a process used to acquire certain variables, leading to bad data.
● In more extreme cases, which ironically happen more often than not, many variables are empty because systems don't accurately capture the data.
● If you have missing data, you might have to impute, which is the process of substituting missing values with, for example, the average (see the sketch below).
● Some variables may not be in compliance with your company's rules and regulations, meaning you can't use them!
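As a minimal sketch (not part of the original slides), here is how mean imputation might look with pandas; the DataFrame and column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data set with missing values in the "age" column
df = pd.DataFrame({"age": [25, 31, np.nan, 40, np.nan],
                   "churned": ["N", "Y", "N", "Y", "N"]})

# Impute missing ages with the column mean (one simple strategy among many)
df["age"] = df["age"].fillna(df["age"].mean())
print(df)
```

Mean imputation is just one option; medians, modes or model-based imputation may be more appropriate depending on the variable.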
Data preprocessing: reduction
● Besides simply removing variables that might not add any predictive value to our model, what if we had, say, 1,000+ variables?
● Making scatter plots for something like this and assessing all pairwise relationships is very time consuming.
● There are dimension reduction techniques, and feature importance techniques, that make this process much easier (a sketch of one follows below)!
● Some techniques are:
○ PCA (principal component analysis)
○ LASSO regression
○ Random Forests
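A minimal sketch of the first technique, PCA with scikit-learn; the feature matrix here is a random, hypothetical stand-in:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical wide feature matrix: 500 observations, 1,000 variables
X = np.random.rand(500, 1000)

# Keep enough principal components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("components kept:", pca.n_components_)
```

LASSO and random forest feature importances tackle the same problem differently, by selecting or ranking the original variables rather than combining them.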
Model building: benchmark models
● After you have data that is ready to be used in your model, you must first ask yourself: does the organization or the stakeholders I work with already have a benchmark model?
● A benchmark model is the current model used for the business problem.
○ Based on how that model was evaluated, you must bring forward a model that challenges it, called a challenger model.
● Challenger models and benchmark models must be compared under the same conditions (as in the sketch below).
○ They must be trained and tested on the same data.
○ You must compare them on the basis of the same evaluation metric (MSE, R², accuracy, precision, F1, etc.).
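A minimal sketch of a benchmark vs. challenger comparison, assuming both are scikit-learn regressors evaluated with MSE on the same split; the models and data set are hypothetical:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same data and same split for both models
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

benchmark = LinearRegression().fit(X_train, y_train)                       # current model
challenger = RandomForestRegressor(random_state=0).fit(X_train, y_train)   # challenger model

# Same evaluation metric (MSE) on the same test set
for name, model in [("benchmark", benchmark), ("challenger", challenger)]:
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: MSE = {mse:.2f}")
```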
Model building: What model do I choose?
● We want to make sure our model is accurate and precise.
● We also want to make sure our model does a good job of forecasting on out-of-sample data (data it has never seen).
○ If you gave me 100 new observations, do you know how accurate your model would be?
What model do I choose: Overfitting
● Making sure our model is accurate and precise is relatively easy. However, the trickier question to address is: will your model do well on observations it has never seen before?
● This is where we get into a topic called overfitting, which
happens very often in machine learning.
● Overfitting occurs when your model fits almost perfectly to
the data used to build your model, called training data.
However, when you give your model data that wasn’t used to
train, testing data, you start seeing very poor accuracy.
● How do we address this? Logically, you split your data into 2 subsets: a training subset and a testing subset.
○ The most classic breakdown is to use 80% of your data to train your model, and 20% to test your model (a minimal sketch follows below).
● But now we have another dilemma…
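A minimal sketch of the 80/20 split described above, using scikit-learn; the data set is a hypothetical stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical classification data set
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 80% of the rows for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), "training rows,", len(X_test), "testing rows")
```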
What model do I choose: k cross validation
● Say we used 80% of our data to train. We then only get to test on 20%, which might not give us the level of certainty we need. But if we go with, say, a 70-30 split, we get less data to train our model on. And what if we just got a really bad sample of data to train on?
● k-fold cross validation is the process of dividing your data into k (roughly) equally sized subsets and doing k assessments of your model (see the sketch below)!
○ You take your data, make k subsets, use k-1 subsets to train your model, and gather your accuracy, precision and other statistics on the left-out subset.
○ Iterate over the other subsets until you get k measures.
○ Average over the statistics to get an "average" performance of your model!
What model do I choose: k cross validation cont.
● It is recommended to use 10-fold cross validation, as it has been the most cited!
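A minimal sketch of 10-fold cross validation with scikit-learn; the model and data set are hypothetical stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical classification data set
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 10-fold CV: train on 9 folds, score on the held-out fold, repeat 10 times
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=10, scoring="accuracy")

print("per-fold accuracy:", scores.round(3))
print("average accuracy:", round(scores.mean(), 3))
```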
Class imbalance
● In assignment 1, the linear regression problem I gave was attempting to introduce you to a popular type of prediction called classification.
● When building a classification model, we typically have 2 (sometimes more) labels that we want to predict. For example:
○ Will someone buy a product (Y or N)
○ Will this customer leave (Y or N)
○ Is this email spam or not (Y or N)
● One big issue is that in our training data, we will likely have this target variable, but the proportion of the class of interest is very small. For example:
○ Is this claim a fraud or not (Y or N)
● We must do something to our data to solve this…
Class imbalance cont.
● We can use some form of sampling to get a better representation of our minority class (the class with the far smaller proportion of points). We can either:
○ Oversample the minority class - sample from the observed data set, with replacement, such that we get a bigger representation (sketched below).
○ Undersample the majority class - remove observations randomly from the majority class until we get the same number of points as the minority class.
○ Over-under sample - oversample the minority class and undersample the majority class until a balanced data set is obtained.
○ Synthetic oversampling - oversample the minority class by creating synthetic variations of its observations.
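A minimal sketch of the first option, oversampling the minority class with replacement using pandas and scikit-learn's resample utility; the data set and column names are hypothetical (synthetic oversampling such as SMOTE is typically done with a separate library like imbalanced-learn):

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced data set: only 5 fraud cases out of 100 claims
df = pd.DataFrame({"amount": range(100),
                   "fraud": ["Y"] * 5 + ["N"] * 95})

majority = df[df["fraud"] == "N"]
minority = df[df["fraud"] == "Y"]

# Oversample the minority class with replacement until it matches the majority class
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)

balanced = pd.concat([majority, minority_upsampled])
print(balanced["fraud"].value_counts())
```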
Class imbalance example.
● Say we had the following data:
Class imbalance example.
● Synthetic sampling would look at your original data and produce new data points from your minority class using an algorithm, like k-NN (which we will learn about in the coming weeks).
● It would look like the following:
Model deployment
● Think of your ML model as a data product.
● With that you can do many things like:
1. Save your ML models as objects. Think of this like a “save
point” in a video game. In Python you can save your model
as a pickle object, allowing you to load your model and
apply it to your new data set if you so choose.
2. Transfer ML model logic as a function to be applied as rules
in dashboards. For example, a linear regression equation is
literally a formula that you can encode once you have your
model results, and apply this over your data!
3. Integrate your model into a web framework via an application programming interface (API). This lets users send requests to your ML model and run their data against it.
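A minimal sketch of option 1, saving a trained model as a pickle object and loading it back; the model and file name are hypothetical:

```python
import pickle

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Train a hypothetical model
X, y = make_regression(n_samples=200, n_features=3, random_state=1)
model = LinearRegression().fit(X, y)

# "Save point": serialize the trained model to disk
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later (or in another process): load the model back and score new data
with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict(X[:5]))
```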
Data science workflow example
● For a great example of what a data scientist's workflow looks like, watch the following: https://www.youtube.com/watch?v=MpPLp-TBwF8
● Note, this is his own personal workflow. Many core elements are similar, and he may cover other steps; however, it typically aligns with the material presented today.
Thank You
