
BDM 2053

Big Data Algorithms and Statistics


Weekly Course Objectives
● Data Science Workflow
● Data pre-processing
○ Data cleaning
○ Data reduction
● Model building
○ Benchmark models.
○ What model do I use?
● Class imbalance
● NO EXAMPLES IN PYTHON TODAY!
What is the data science workflow?
● A data science workflow outlines the end-to-end steps
needed to complete a data science project.
● It helps all members involved in the data science project align on and understand the key milestones, steps and blockers associated with the project.
○ For project methodology, you can use Agile, Waterfall, etc.
● A proper workflow acts as a set of "guard rails" for your project, preventing it from derailing and making sure everyone is on the same page.
● The breakdown is given by the following flow diagram:

ASK → GET → EXPLORE → MODEL → SHARE


Data Science Workflow: Ask
● The first step in a data science workflow is to ask the questions
you wish to answer. This should always be your first priority.
○ Ex: Who is most likely to end their business with us?
● Many times, business leaders and HiPPOs (the highest paid person's opinion) overshadow what the data suggests.
○ Data can show us what is actually happening, versus what people think they see.
● Machine learning and data science are really powerful ways to get results, but sometimes they can be overkill.
○ After some experience working with data, you can quickly recognize whether your stakeholder is looking for something quick and analytical or a full machine learning process.
● If a project feels like it is derailing, you can always look back at this step to make sure you are still aiming to solve your main question(s).
Data Science Workflow: Get
● Once you know what you want to answer (or while you are still refining it), the next step is to get your data.
● Data is often stored across many different areas within a large organization (typically in the cloud or in data stores like Amazon Web Services or Hadoop).
● Very rarely is your data packaged neatly for you.
● You will often have to process your data, join it across many other tables, and implement business logic to capture the metrics, KPIs or features that might be important in later steps.
● This is typically the hardest part of data science and of any machine learning task.
● This step has many sub-steps, which we will get into in more depth in later slides.
Data Science Iceberg analogy

● Tip of the iceberg: Modelling/Analysis. All the "sexy" stuff we see with analytics, machine learning and data science as a whole.
● Below the waterline: Documentation, Getting Data, Data Preprocessing and Feature Engineering. The visible work would not be possible without this underlying work.
● Bottom line: if you have garbage data, you get garbage results.
Data Science Workflow: Explore
● We’ve spent a significant amount of time in this course on this
part, which is the exploration of our data!
○ Sometimes this is called EDA (exploratory data analysis).
● Here we look at statistics on our data, see if there are any outliers that might impact our analysis, and visualize the data to put everything into perspective.
○ Sometimes this step might show us limitations in our data
which might make us circle back to the previous step of
getting good data!
○ For example, do we need more data? Did we capture some
information incorrectly?
● This is the part of the front-end work of data science that gets recognized and is often presented.
● In short, this part mainly focuses on cleaning, profiling and
wrangling your data.
Data Science Workflow: Model
● We now arrive at a key step in the data science process: modelling our data to predict, forecast and provide insights.
● This part comes with its own challenges, because:
○ How do we choose the best model?
○ How do we test our model(s)?
○ How do we ensure our model will give good results for data
not used to train our model?
● Make sure you document your models and explain how they are being used in the context of your business problem.
○ Your models will, or at least should, go through a model validation process.
○ Consider model validation a layer of protection for the business, making sure you don't put it at risk.
Data Science Workflow: Share
● Last but not least, after the intense amount of work in the first four steps of the data science workflow, we finally arrive at sharing our results.
● A common misconception is that once your model is built, you are done. This is not the case!
○ Once your model is done, you must be able to demonstrate
how you will be using your model (an API, run it manually
on a periodic basis, report via Tableau, etc).
○ What conclusions were you able to deduce from your
model?
● As data scientists and machine learning engineers, your job is
not to simply build but to assess the impact of what you have
discovered.
Data Science Workflow: Visual Summary
Data exploration: data preprocessing
● Once you have your hands on the data sources you’ll need for
your problem, the next step is data preprocessing.
● Data preprocessing is the set of techniques used to convert a raw data set into a clean data set.
● It often involves many steps, like profiling, cleaning, wrangling, transforming, etc.
● A visual understanding of this process is given as follows:
Data preprocessing: cleaning
● Data cleaning is the process of rectifying data quality issues, eliminating bad data and replacing missing values.
● If you have data that is not in the right format for your analysis, you can encounter many issues in the model building phase, especially if the data is large.
● Sometimes while getting your data, you may have messed up a process used to acquire certain variables, leading to bad data.
● In more extreme cases, which ironically happen more often than not, many variables are empty because systems don't accurately capture the data.
● If you have missing data, you might have to impute, which is the process of substituting missing values with, for example, the average (see the sketch below).
● Some variables may not be in compliance with your company's rules and regulations, meaning you can't use them!
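As a minimal sketch (not part of the original slides), here is how mean imputation might look with pandas; the DataFrame and column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data set with missing values in the "age" column
df = pd.DataFrame({"age": [25, 31, np.nan, 40, np.nan],
                   "churned": ["N", "Y", "N", "Y", "N"]})

# Impute missing ages with the column mean (one simple strategy among many)
df["age"] = df["age"].fillna(df["age"].mean())
print(df)
```

Mean imputation is just one option; medians, modes or model-based imputation may be more appropriate depending on the variable.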
Data preprocessing: reduction
● Besides simply removing variables that might not add any predictive value to our model, what if we had, say, 1,000+ variables?
● Making scatter plots for something like this and assessing all pairwise relationships is very time consuming.
● There are dimension reduction techniques, and feature importance techniques, that make this process much easier (a sketch of one follows below)!
● Some techniques are:
○ PCA (principal component analysis)
○ LASSO regression
○ Random Forests
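A minimal sketch of the first technique, PCA with scikit-learn; the feature matrix here is a random, hypothetical stand-in:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical wide feature matrix: 500 observations, 1,000 variables
X = np.random.rand(500, 1000)

# Keep enough principal components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("components kept:", pca.n_components_)
```

LASSO and random forest feature importances tackle the same problem differently, by selecting or ranking the original variables rather than combining them.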
Model building: benchmark models
● After you have data that is ready to be used in your model, you must first ask yourself: does the organization or the stakeholders I work with already have a benchmark model?
● A benchmark model is the current model used for the business problem.
○ Based on how that model was evaluated, you must bring forward a model that challenges it, called a challenger model.
● Challenger models and benchmark models must be compared under the same conditions (as in the sketch below).
○ They must be trained and tested on the same data.
○ You must compare them on the basis of the same evaluation metric (MSE, R², accuracy, precision, F1, etc.).
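A minimal sketch of a benchmark vs. challenger comparison, assuming both are scikit-learn regressors evaluated with MSE on the same split; the models and data set are hypothetical:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same data and same split for both models
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

benchmark = LinearRegression().fit(X_train, y_train)                       # current model
challenger = RandomForestRegressor(random_state=0).fit(X_train, y_train)   # challenger model

# Same evaluation metric (MSE) on the same test set
for name, model in [("benchmark", benchmark), ("challenger", challenger)]:
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: MSE = {mse:.2f}")
```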
Model building: What model do I choose?
● We want to make sure our model is accurate and precise.
● We also want to make sure our model does a good job of forecasting on out-of-sample data (data it has never seen).
○ If you gave me 100 new observations, do you know how accurate your model would be?
What model do I choose: Overfitting
● Making sure our model is accurate and precise is relatively easy. However, the trickier question to address is: will your model do well on observations it has never seen before?
● This is where we get into a topic called overfitting, which
happens very often in machine learning.
● Overfitting occurs when your model fits almost perfectly to
the data used to build your model, called training data.
However, when you give your model data that wasn’t used to
train, testing data, you start seeing very poor accuracy.
● How do we address this? Logically, you split your data into 2 subsets: a training subset and a testing subset.
○ The most classic breakdown is to use 80% of your data to train your model, and 20% to test your model (a minimal sketch follows below).
● But now we have another dilemma…
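A minimal sketch of the 80/20 split described above, using scikit-learn; the data set is a hypothetical stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical classification data set
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 80% of the rows for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), "training rows,", len(X_test), "testing rows")
```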
What model do I choose: k cross validation
● Say we used 80% of our data to train. We then only get to test on 20%, which might not give us the level of certainty we need. But if we go with, say, a 70-30 split, we get less data to train our model on. And what if we just got a really bad sample of data to train on?
● k-fold cross validation is the process of dividing your data into k (roughly) equally sized subsets and doing k assessments of your model (see the sketch below)!
○ You take your data, make k subsets, use k-1 subsets to train your model, and gather your accuracy, precision and other statistics on the left-out subset.
○ Iterate over the other subsets until you get k measures.
○ Average over the statistics to get an "average" performance of your model!
What model do I choose: k cross validation cont.
● It is recommended to use 10-fold cross validation, as it has been the most cited!
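A minimal sketch of 10-fold cross validation with scikit-learn; the model and data set are hypothetical stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical classification data set
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 10-fold CV: train on 9 folds, score on the held-out fold, repeat 10 times
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=10, scoring="accuracy")

print("per-fold accuracy:", scores.round(3))
print("average accuracy:", round(scores.mean(), 3))
```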
Class imbalance
● In assignment 1, the linear regression problem I gave was attempting to introduce you to a popular type of prediction called classification.
● When building a classification model, we typically have 2 (sometimes more) labels that we want to predict. For example:
○ Will someone buy a product (Y or N)
○ Will this customer leave (Y or N)
○ Is this email spam or not (Y or N)
● One big issue is that in our training data, we will likely have this target variable, but the proportion of the class of interest is very small. For example:
○ Is this claim a fraud or not (Y or N)
● We must do something to our data to solve this…
Class imbalance cont.
● We can use some form of sampling to get a better representation of our minority class (the class with the far smaller proportion of points). We can either:
○ Oversample the minority class - sample from the observed data set, with replacement, such that we get a bigger representation (sketched below).
○ Undersample the majority class - remove observations randomly from the majority class until we get the same number of points as the minority class.
○ Over-under sample - oversample the minority class and undersample the majority class until a balanced data set is obtained.
○ Synthetic oversampling - oversample the minority class by creating synthetic variations of its observations.
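A minimal sketch of the first option, oversampling the minority class with replacement using pandas and scikit-learn's resample utility; the data set and column names are hypothetical (synthetic oversampling such as SMOTE is typically done with a separate library like imbalanced-learn):

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced data set: only 5 fraud cases out of 100 claims
df = pd.DataFrame({"amount": range(100),
                   "fraud": ["Y"] * 5 + ["N"] * 95})

majority = df[df["fraud"] == "N"]
minority = df[df["fraud"] == "Y"]

# Oversample the minority class with replacement until it matches the majority class
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)

balanced = pd.concat([majority, minority_upsampled])
print(balanced["fraud"].value_counts())
```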
Class imbalance example.
● Say we had the following data:
Class imbalance example.
● Synthetic sampling would look at your original data and produce new data points from your minority class using an algorithm, like k-NN (which we will learn about in the coming weeks).
● It would look like the following:
Model deployment
● Think of your ML model as a data product.
● With that you can do many things like:
1. Save your ML models as objects. Think of this like a “save
point” in a video game. In Python you can save your model
as a pickle object, allowing you to load your model and
apply it to your new data set if you so choose.
2. Transfer ML model logic as a function to be applied as rules
in dashboards. For example, a linear regression equation is
literally a formula that you can encode once you have your
model results, and apply this over your data!
3. Integrate your model into a web framework via an application programming interface (API). This lets users send requests to your ML model and run their data against it.
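A minimal sketch of option 1, saving a trained model as a pickle object and loading it back; the model and file name are hypothetical:

```python
import pickle

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Train a hypothetical model
X, y = make_regression(n_samples=200, n_features=3, random_state=1)
model = LinearRegression().fit(X, y)

# "Save point": serialize the trained model to disk
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later (or in another process): load the model back and score new data
with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict(X[:5]))
```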
Data science workflow example
● For a great example of what a data scientist's workflow looks like, watch the following: https://www.youtube.com/watch?v=MpPLp-TBwF8
● Note, this is his own personal workflow. Many core elements are similar, and he may cover other steps; however, it typically aligns with the material presented today.
Thank You
