The Friendly Data Science Handbook 2020

The Data Science Pipeline

OSEMN Pipeline
Understanding the typical workflow of the data science pipeline is a crucial
step towards business understanding and problem solving. If you are intimidated by
how the data science pipeline works, say no more. I'll make this simple for you. Hilary
Mason and Chris Wiggins coined a very simple acronym that you can use throughout
your data science pipeline. That is O.S.E.M.N.

OSEMN Pipeline
O — Obtaining our data
S — Scrubbing / Cleaning our data
E — Exploring / Visualizing our data will allow us to find patterns and trends
M — Modeling our data will give us our predictive power as a wizard
N — Interpreting our data

Business Question
Before we even begin the OSEMN pipeline, the most crucial step is understanding what
problem we’re trying to solve. Let’s say this again. Before we even begin doing anything
with “Data Science”, we must first consider what problem we’re trying to solve. If you have
a small problem you want to solve, then at most you’ll get a small solution. If you have a
BIG problem to solve, then you have the possibility of a BIG solution.

Ask yourself:
● How can we translate data into dollars?
● What impact do I want to make with this data?
● What business value does our model bring to the table?
● What will save us lots of money?
● What can be done to make our business run more efficiently?

Knowing this fundamental concept will take you far and set you up for success as a Data
Scientist. No matter how well your model predicts, no matter how much data you acquire,
and no matter how OSEMN your pipeline is… your solution or actionable insight will only
be as good as the problem you set for yourself.

“Good data science is more about the questions you pose of the data rather than data
munging and analysis” — Riley Newman

O — Obtaining our data


You cannot do anything as a data scientist without data. As a rule of thumb,
there are a few things to consider when obtaining your data. You must identify
all of your available datasets (which can come from the internet or from
internal/external databases), and you must extract the data into a usable
format (.csv, .json, .xml, etc.).

Skills Required:
● Database Management: MySQL, PostgreSQL, MongoDB
● Querying Relational Databases
● Retrieving Unstructured Data: text, videos, audio files, documents
● Distributed Storage: Hadoop, Apache Spark/Flink
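
To make this concrete, here is a minimal sketch of the obtaining step in Python with pandas; the
file name, database, and table are hypothetical stand-ins:

import sqlite3
import pandas as pd

# Flat file, e.g. downloaded from the internet or exported from an internal system
sales = pd.read_csv("sales_2020.csv")                  # hypothetical file name

# Relational database (SQLite here for simplicity; MySQL/PostgreSQL work the same way)
conn = sqlite3.connect("company.db")                   # hypothetical database
customers = pd.read_sql("SELECT * FROM customers", conn)
conn.close()

print(sales.shape, customers.shape)                    # quick sanity check on what we obtained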

S — Scrubbing / Cleaning our data


This phase of the pipeline usually requires the most time and effort, because the
results and output of your machine learning model are only as good as what you put
into it. Basically: garbage in, garbage out. Your objective here is to examine the data,
understand every feature you’re working with, identify errors, missing values, and
corrupt records, clean the data, and replace and/or fill the missing values.

Skills Required:
● Scripting language: Python, R, SAS
● Data Wrangling Tools: Python Pandas, R
● Distributed Processing: Hadoop, MapReduce / Spark
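
As a small illustration, here is a pandas sketch of what this phase often looks like; the file and
column names are made up:

import pandas as pd

df = pd.read_csv("sales_2020.csv")                             # hypothetical raw data

# Examine the data and every feature you are working with
print(df.info())
print(df.describe())

# Identify errors, duplicates, missing values, and corrupt records, then clean them up
df = df.drop_duplicates()
df["price"] = pd.to_numeric(df["price"], errors="coerce")      # corrupt entries become NaN
df["price"] = df["price"].fillna(df["price"].median())         # fill missing values
df = df.dropna(subset=["customer_id"])                         # drop rows missing a key field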

E — Exploring / Visualizing our data


Now during the exploration phase, we try to understand what patterns and values
our data has. We’ll be using different types of visualizations and statistical testing to
back up our findings. This is where we will be able to derive hidden meanings
behind our data through various graphs and analyses. Go out and explore!

Your objective here is to find patterns in your data through visualizations and charts
and to extract features by using statistics to identify and test significant variables.

Skills Required:
● Python: NumPy, Matplotlib, Pandas, SciPy
● R: ggplot2, dplyr
● Inferential statistics
● Experimental Design
● Data Visualization
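
For example, here is a quick exploration sketch with pandas and Matplotlib; the file and column
names are hypothetical:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales_2020.csv")            # hypothetical cleaned data

# Look for patterns and trends with simple visualizations
df["sales"].hist(bins=30)
plt.title("Distribution of daily sales")
plt.show()

df.plot.scatter(x="temperature", y="sales")   # is there a relationship between the two?
plt.show()

# Use summary statistics to spot potentially significant variables
print(df.corr(numeric_only=True)["sales"].sort_values(ascending=False))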

M — Modeling our data


Models are general rules in a statistical sense. Think of machine learning models as
tools in your toolbox. You will have access to many algorithms and use them to
accomplish different business goals. The better the features you use, the better your
predictive power will be. After cleaning your data and finding which features are most
important, using your model as a predictive tool will only enhance your business
decision making.

Your objective here is to perform in-depth analytics by creating predictive models.


Machine learning algorithms may be better at predicting, detecting, and processing
patterns than you. But they can't reason. And that's where you come in!
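
To make the toolbox idea concrete, here is a minimal scikit-learn sketch that tries a few
algorithms on the same data; the dataset is synthetic and the model choices are illustrative only:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for your cleaned feature matrix X and labels y
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Treat each algorithm as a tool in the toolbox and compare them on the same data
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(max_depth=5),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))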

Machine Learning Analogy


Think of machine learning algorithms as students. Just like students, all of the
algorithms learn differently. Each has its own set of qualities and its own way of learning. Some
algorithms learn faster, while others learn slower. Some algorithms are lazy learners (e.g.
KNN) while others are eager learners. Some algorithms are parametric (e.g.
Linear Regression) while others are non-parametric (e.g. Decision Trees). Just
like students, some algorithms perform better on a certain problem, whereas others may
perform better on another, e.g. linearly separable vs. non-linearly separable problems. Just
like students, these algorithms learn from the patterns and relationships in the
data. That's why it's important to perform EDA and visualizations: if you don't see
a pattern, your model probably won't either. Just like students, if you give an algorithm garbage
information to learn from, then it won't perform well. That's why it's important to choose your
features carefully and make sure each one has some relationship to the problem. So remember,
choose the algorithm that is most appropriate for your problem, because there is no
"best" learner, but there is always the "right" learner.

"Machines can predict the future, as long as the future doesn't look too different from the
past."

Skills Required:
● Machine Learning: Supervised/Unsupervised algorithms
● Evaluation methods: MSE, Precision/Recall, ROC/AUC
● Machine Learning Libraries: Python (scikit-learn) / R (caret)
● Linear algebra & Multivariate Calculus
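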

N — Interpreting our data


The most important step in the pipeline is understanding how to explain
your findings through communication. Telling the story is key; don’t underestimate
it. It’s about connecting with people, persuading them, and helping them. The art of
understanding your audience and connecting with them is one of the best parts of
data storytelling.

The power of emotion


Emotion plays a big role in data storytelling. People aren’t going to magically
understand your findings. The best way to make an impact is to tell your story
through emotion. We as humans are naturally influenced by emotions. If you can
tap into your audience’s emotions, then you, my friend, are in control. When you’re
presenting your data, keep in mind the power of psychology.

The objective you should set for yourself is to identify business insights,
relate them back to the business problem, visualize your findings accordingly, keep it
simple and priority driven, and tell a clear and actionable story.

Skills Required:
● Business Domain Knowledge
● Data Visualization Tools: Tableau, D3.JS, Matplotlib, GGplot, Seaborn
● Communication: Presenting/Speaking & Reporting/Writing

Conclusion
Data is not about statistics, machine learning, visualization, or wrangling. Data is about
understanding. Understanding the problem and how you can solve it using data with
whatever tools or techniques you choose. Understand your problem. Understand your
data. And the rest will follow.

Most of the problems you will face are, in fact, engineering problems. Even with all the
resources of a great machine learning god, most of the impact will come from great
features, not great machine learning algorithms.

So, the basic approach is:


● Make sure your pipeline is solid end to end
● Start with a reasonable objective
● Understand your data intuitively
● Make sure that your pipeline stays solid

Machine Learning 101

What is Machine Learning?


Machine Learning involves teaching a computer to recognize patterns from examples, rather than
programming it with specific rules. These patterns are found within data.

Machine = Your machine or computer


Learning = Finding patterns in data

Machine Learning is about creating algorithms (sets of rules) that learn complex
functions (patterns) from data in order to make predictions on new data.

In short, Machines can predict the future, as long as the future doesn’t look too different
from the past.

Essentially, it can be summarized in 3 Steps:


1. It takes some data
2. It finds patterns in the data
3. It uses those patterns to make predictions on new data

Applications of Machine Learning


Before we get started, here is a quick overview of what machine learning is capable of:
● Healthcare: Predicting patient diagnostics for doctors to review
● Social Network: Predicting certain match preferences on a dating website for better
compatibility
● Finance: Predicting fraudulent activity on a credit card
● E-commerce: Predicting customer churn
● Biology: Finding patterns in gene mutations that could represent cancer

How Do Machines Learn?
To keep things simple, just know that machines “learn” by finding patterns in similar
data. Think of data as information you acquire from the world. The more data given to a
machine, the “smarter” it gets.

But not all data are the same. Imagine you're a pirate and your life's mission is to find the
buried treasure somewhere on an island. In order to find the treasure, you're going to
need a sufficient amount of information. Like data, this information can lead you in the
right direction or the wrong direction. The better the information/data you obtain,
the more uncertainty is reduced, and vice versa. So it's important to keep in mind the
type of data you're giving your machine to learn from.

Nonetheless, once a sufficient amount of data is given, the machine can make
predictions. Machines can predict the future, as long as the future doesn’t look too
different from the past.

A machine really “learns” by using old data to work out what is most likely to happen.
If the old data looks a lot like the new data, then the things you can say about the old
data will probably be relevant to the new data. It’s like looking back to look forward.

Types of Machine Learning


There are three main categories of machine learning:

1. Supervised learning: The machine learns from labeled data. Normally, the data is
labeled by humans.
2. Unsupervised learning: The machine learns from unlabeled data. Meaning, there is no
“right” answer given to the machine; instead, it must find patterns in the data on its own
to come up with an answer.
3. Reinforcement learning: The machine learns through a reward-based system.

Supervised Machine Learning
Supervised learning is the most common and studied type of learning because it is easier
to train a machine to learn with labeled data than with unlabeled data. Depending on what
you want to predict, supervised learning can be used to solve two types of problems:
regression or classification.

Regression
If you want to predict continuous values, such as trying to predict the cost of a house or
the weather outside in degrees, you would use regression. This type of problem doesn’t
have a specific value constraint because the value could be any number with no limits.

Classification
If you want to predict discrete values, such as classifying something into categories, you
would use classification. A problem like "Will he make this purchase?" will have an answer
that falls into two specific categories: yes or no. This is also called a binary classification
problem.

Unsupervised Machine Learning


Since there is no labeled data for machines to learn from, the goal of unsupervised
machine learning is to detect patterns in the data and to group them. Unsupervised
learning is machines trying to learn “on their own” without help. Imagine someone
throwing you piles of data and saying, “Here you go, find some patterns and group them
for me. Thanks and have fun.”

Depending on what you want to group together, unsupervised learning can group data
together by: clustering or association.

Clustering Problem
Unsupervised learning tries to solve this problem by looking for similarities in the data. If
there is a common cluster or group, the algorithm would then categorize them in a certain
form. An example of this could be trying to group customers based on past buying
behavior.
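
As a hedged illustration, here is a tiny k-means sketch with scikit-learn; the customer data is
synthetic and the two groups are invented purely for the example:

import numpy as np
from sklearn.cluster import KMeans

# Synthetic "past buying behavior": yearly spend ($) and number of purchases per customer
rng = np.random.default_rng(42)
spend = np.concatenate([rng.normal(200, 30, 50), rng.normal(900, 100, 50)])
purchases = np.concatenate([rng.normal(5, 1, 50), rng.normal(25, 3, 50)])
X = np.column_stack([spend, purchases])

# Ask k-means to group the customers into two clusters without any labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])        # cluster assigned to the first ten customers
print(kmeans.cluster_centers_)    # the "typical" customer in each cluster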

Association Problem
Unsupervised learning tries to solve this problem by trying to understand the rules and
meaning behind different groups. Finding a relationship between customer purchases is a
common example of an association problem. Stores may want to know what type of
products were purchased together and could possibly use this information to organize the
placement of these products for easier access. One store found that there was a strong
association between customers buying beer and diapers. They deduced from this that
men who had gone out to buy diapers for their babies also tended to buy beer.

Reinforcement Machine Learning


This type of machine learning requires the use of a reward/penalty system. The goal is to
reward the machine when it learns correctly and to penalize the machine when it learns
incorrectly.

Reinforcement Machine Learning is a subset of Artificial Intelligence. Because of the wide
range of possible actions, this type of learning is an iterative process: the machine
continuously learns through trial and error.

Examples of Reinforcement Learning:


● Training a machine to learn how to play (Chess, Go)
● Training a machine how to learn and play Super Mario by itself
● Self-driving cars

Machine Learning Algorithms
With huge amounts of data everywhere, the power to derive meaning from all of it relies on
the use of Machine Learning. Machine learning algorithms are used to learn the structure
of the data. Is there structure in the data? Can we learn that structure from the data? And
if we can, we can then use it for prediction, description, compression, and more.

Each machine learning algorithm/model has its own strengths and weaknesses. A common
trade-off in machine learning is between understanding the details of the algorithm being
used (interpretability) and its prediction accuracy. Some models are easier to
interpret and understand but lack predictive power, whereas other models may make very
accurate predictions but lack interpretability.

This section will not go into detail on what goes inside each algorithm, but will cover the
high-level overview of what each machine learning algorithm does.

Let’s talk about the 7 important machine learning algorithms:


1. Linear Regression
2. Logistic Regression
3. K-Nearest Neighbors (KNN)
4. Support Vector Machines (SVM)
5. Decision Tree
6. Random Forest
7. Gradient Boosting Machine

Linear Regression
This is the go-to method for regression problems. The linear regression algorithm is used
to model the relationship between the predictor (explanatory) variables and the response
variable. This relationship is either a positive, negative, or neutral change between the
variables. In its simplest form, it attempts to fit a straight line to your training data. This
line can then be used as a reference to predict future data.

Imagine you’re an ice cream man. Your intuition from previous sales is that you sell more
ice cream when it is hotter outside. Now you want to know how much ice cream you’ll sell
at a certain temperature. We can use linear regression to predict just that! The linear
regression algorithm is represented by the formula y = mx + b, where “y” is your
dependent variable (ice cream sales) and “x” is your independent variable (temperature).

Example: If the temperature is about 75 degrees outside, you would expect to sell about
$150 worth of ice cream. This shows that as temperature increases, so do ice cream sales.
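
A minimal sketch of that example with scikit-learn; the historical numbers are made up:

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up historical data: temperature (degrees F) vs. ice cream sales ($)
temperature = np.array([[60], [65], [70], [75], [80], [85]])
sales = np.array([110, 125, 138, 150, 164, 178])

model = LinearRegression().fit(temperature, sales)   # fits the line y = mx + b
print(model.coef_[0], model.intercept_)              # slope m and intercept b
print(model.predict([[75]]))                         # predicted sales at 75 degrees, around $150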

Strengths: Linear regression is very fast to implement, easy to understand, and less
prone to overfitting. It’s a great go-to algorithm to use as your first model, and it works
really well on linear relationships.

Weaknesses: Linear regression performs poorly when there are non-linear relationships.
It is hard to use on complex data sets.

Logistic Regression
This is the go-to method for classification problems and is commonly used when
interpretability matters. It is typically used to predict the probability of an event occurring.
Logistic regression is an algorithm borrowed from statistics and uses a logistic/sigmoid
function to transform its output into a value between 0 and 1.

Example: Imagine you’re a banker and you want a machine learning algorithm to predict
the probability of a person paying you back the money. They will either pay you back or
not pay you back. Since a probability lies in the range (0–1), using a linear regression
algorithm wouldn’t make sense here because the line would extend past 0 and 1. You can’t
have a probability that’s negative or above 1.
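
A small scikit-learn sketch of that idea; the loan data below is invented:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: applicant income (in $1000s) and whether they paid the loan back (1 = yes)
income = np.array([[20], [25], [30], [40], [55], [60], [75], [90]])
paid_back = np.array([0, 0, 0, 1, 1, 1, 1, 1])

model = LogisticRegression().fit(income, paid_back)
# The sigmoid keeps the output between 0 and 1, so it can be read as a probability
print(model.predict_proba([[35]])[0, 1])   # probability that a $35k applicant pays back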

Strengths: Similar to its sister, linear regression, logistic regression is easy to interpret and
is less prone to overfitting. It’s fast to implement as well and has surprisingly great accuracy
for its simple design.

Weaknesses: This algorithm performs poorly when there are multiple or non-linear
decision boundaries, and it struggles to capture complex relationships.

K-Nearest Neighbors
The K-Nearest Neighbors algorithm is one of the simplest classification techniques. It
classifies an object based on its closest training examples in the feature space. The K in
K-Nearest Neighbors refers to the number of nearest neighbors the model uses for its
prediction.

How it Works:

1. Assign K a value (preferably a small odd number)
2. Find the K closest points to the new point
3. Assign the new point to the majority class among those K neighbors

Example: Looking at the unknown person in the graph, how would you classify him:
Dothraki or Westerosian? We’ll assign K=5, so in this case we’ll look at the 5 closest
neighbors to our unassigned person and assign him to the majority class. If you picked
Dothraki, then you are correct! Out of the 5 neighbors, 4 of them were Dothraki and 1 was
Westerosian.
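
Here is a minimal KNN sketch with scikit-learn; the coordinates and labels are invented to mirror
the example:

from sklearn.neighbors import KNeighborsClassifier

# Invented 2-D feature space: each point is a person, labeled by group
X = [[1, 2], [2, 1], [2, 3], [3, 2], [3, 3],   # Dothraki cluster
     [7, 8], [8, 7], [8, 9], [9, 8]]           # Westerosian cluster
y = ["Dothraki"] * 5 + ["Westerosian"] * 4

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict([[3, 4]]))   # the unknown person is assigned to the majority class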

Strengths: This algorithm works well when there is plenty of representative training data, it
learns complex patterns well, it can handle linearly or non-linearly distributed data, and it is
robust to noisy training data.

Weaknesses: It’s hard to find the right K value, it performs poorly on high-dimensional data,
and it requires a lot of computation as the number of features and examples grows. It’s
expensive and slow.

Support Vector Machine
If you want to classify data by focusing on the extreme values in your dataset, the SVM
algorithm is the way to go. It draws a decision boundary, also known as a hyperplane, that
best segregates the two classes from one another. Or you can think of it as an algorithm
that looks for a pattern in the data points and finds the best line that can separate the
pattern(s).

S — Support refers to the extreme values/points in your dataset.
V — Vector refers to the values/points in the dataset / feature space.
M — Machine refers to the machine learning algorithm that focuses on the support
vectors to classify groups of data. This algorithm literally only focuses on the extreme
points and ignores the rest of the data.

Example: This algorithm only focuses on the extreme values (support vectors) to create
this decision line, which are the two cats and one dog circled in the graph.
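
A minimal SVM sketch with scikit-learn on synthetic two-class data (standing in for the cats and
dogs):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two synthetic, roughly separable groups (think cats vs. dogs)
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

svm = SVC(kernel="linear").fit(X, y)
print(len(svm.support_vectors_))   # only these extreme points define the decision boundary
print(svm.predict([[0, 0]]))       # classify a new point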

Strengths: This model is good for classification problems and can handle very
high-dimensional data. It’s also good at modeling non-linear relationships.

Weaknesses: It’s hard to interpret and requires a lot of memory and processing power. It
also does not provide probability estimations and is sensitive to outliers.

Decision Tree
A decision tree is made up of nodes. Each node represents a question about the data, and
the branches from each node represent the possible answers. Visually, this algorithm is
very easy to understand. Every decision tree has a root node, which represents the
topmost question. The order of importance of the features is reflected top-down in the
tree: the higher the node, the more important its property/feature.

Strengths: The decision tree is very easy to understand and visualize. It’s fast to train,
robust to outliers, and can work on non-linear relationships. This model is commonly used
where it’s important to understand which features drive a decision, such as medical
diagnosis and credit risk analysis. It also has built-in feature selection.

Weaknesses: The biggest drawback of a single decision tree is its relatively weak predictive
power on its own. A downside to decision trees is the possibility of building an overly
complex tree which does not generalize well to future data (overfitting). Decision trees can
also be unstable: small variations in the data can produce a completely different tree.
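
For instance, a short scikit-learn sketch using the library's built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(iris.data, iris.target)

# Print the question at each node; higher nodes correspond to more important features
print(export_text(tree, feature_names=iris.feature_names))
print(tree.feature_importances_)   # the built-in feature selection mentioned above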

Random Forest
Random Forest is one of the most widely used and powerful supervised machine learning
algorithms when prediction accuracy matters. Think of this algorithm as a bunch of decision
trees, instead of the single tree used by the Decision Tree algorithm. This grouping of
models, in this case decision trees, is called an ensemble method. Its accurate performance
comes from averaging the predictions of many decision trees. Random Forest is naturally
hard to interpret because of the many decision trees it combines. But if you want a model
that is a predictive powerhouse, use this algorithm!

Strengths: Random Forest is known for its great accuracy. It has automatic feature
selection, which identifies what features are most important. It can handle missing data
and imbalanced classes and generalizes well.

Weaknesses: A drawback of random forest is that you have very little control over what goes
on inside the algorithm. It’s hard to interpret and won’t perform well if given a set of bad
features.
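
A short sketch of the ensemble idea and the feature importances it produces, again using the
built-in iris data:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# A "forest" of 200 decision trees whose predictions are averaged together
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(iris.data, iris.target)

# Which features mattered most across all the trees
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(name, round(importance, 3))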

Gradient Boosting Machine
Gradient Boosting Machine is another type of ensemble method: it aggregates many
models and combines their predictions. A simple key concept to remember about this
algorithm is this: it turns a sequence of weak models into a stronger model. Try not to
over-complicate things here.

How It Works: Your first model is considered the “weak” model. You train it and find out
what errors it produced. The second tree then uses these errors from the first tree to
recalibrate, placing more emphasis on the examples the first tree got wrong. The third tree
repeats this process on the errors the second tree made, and so on. Essentially, this
approach builds a team of models that work together to correct each other’s weaknesses.
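
A minimal boosting sketch with scikit-learn on synthetic data; each shallow tree in the sequence
focuses on the errors the previous trees left behind:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 300 shallow "weak" trees trained one after another, each correcting the last
gbm = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05, max_depth=2,
                                 random_state=42).fit(X_train, y_train)
print(gbm.score(X_test, y_test))   # accuracy on held-out data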

Strengths: Gradient Boosting Machines are very good for prediction. It’s one of
the best off-the-shelf algorithms for high accuracy with decent run time and memory
usage.

Weaknesses: Like other ensemble methods, GBM lacks interpretability. It is not well suited
to very high-dimensional data because training can take a lot of time and computation, so
balancing computational cost and accuracy is a concern.
