The Friendly Data Science Handbook 2020

The Data Science Pipeline

OSEMN Pipeline
Understanding the typical workflow of the data science pipeline is a crucial
step towards business understanding and problem solving. If you are intimidated by
how the data science pipeline works, say no more. I'll make this simple for you. Hilary
Mason and Chris Wiggins coined a very simple acronym that you can use throughout
your data science pipeline. That is O.S.E.M.N.

OSEMN Pipeline
O — Obtaining our data
S — Scrubbing / Cleaning our data
E — Exploring / Visualizing our data will allow us to find patterns and trends
M — Modeling our data will give us our predictive power as a wizard
N — Interpreting our data

Business Question
Before we even begin the OSEMN pipeline, the most crucial step is understanding what
problem we’re trying to solve. Let’s say this again. Before we even begin doing anything
with “Data Science”, we must first consider what problem we’re trying to solve. If you have
a small problem you want to solve, then at most you’ll get a small solution. If you have a
BIG problem to solve, then you have the possibility of a BIG solution.

Ask yourself:
● How can we translate data into dollars?
● What impact do I want to make with this data?
● What business value does our model bring to the table?
● What will save us lots of money?
● What can be done to make our business run more efficiently?

Knowing this fundamental concept will take you far and set you up for success as a Data
Scientist. No matter how well your model predicts, no matter how much data you acquire,
and no matter how OSEMN your pipeline is… your solution or actionable insight will only
be as good as the problem you set for yourself.

“Good data science is more about the questions you pose of the data rather than data
munging and analysis” — Riley Newman

O — Obtaining our data


You cannot do anything as a data scientist without data. As a rule of thumb,
there are a few things to consider when obtaining your data. You must identify
all of your available datasets (which can come from the internet or from
internal/external databases), and you must extract the data into a usable
format (.csv, .json, .xml, etc.).

Skills Required:
● Database Management: MySQL, PostgreSQL, MongoDB
● Querying Relational Databases
● Retrieving Unstructured Data: text, videos, audio files, documents
● Distributed Storage: Hadoop, Apache Spark/Flink
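
To make this concrete, here is a minimal sketch of the obtaining step in Python with pandas; the
file name, database, and table are hypothetical stand-ins:

import sqlite3
import pandas as pd

# Flat file, e.g. downloaded from the internet or exported from an internal system
sales = pd.read_csv("sales_2020.csv")                  # hypothetical file name

# Relational database (SQLite here for simplicity; MySQL/PostgreSQL work the same way)
conn = sqlite3.connect("company.db")                   # hypothetical database
customers = pd.read_sql("SELECT * FROM customers", conn)
conn.close()

print(sales.shape, customers.shape)                    # quick sanity check on what we obtained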

S — Scrubbing / Cleaning our data


This phase of the pipeline usually requires the most time and effort, because the
results and output of your machine learning model are only as good as what you put
into it. Basically: garbage in, garbage out. Your objective here is to examine the data,
understand every feature you’re working with, identify errors, missing values, and
corrupt records, clean the data, and replace and/or fill the missing values.

Skills Required:
● Scripting language: Python, R, SAS
● Data Wrangling Tools: Python Pandas, R
● Distributed Processing: Hadoop, MapReduce / Spark
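
As a small illustration, here is a pandas sketch of what this phase often looks like; the file and
column names are made up:

import pandas as pd

df = pd.read_csv("sales_2020.csv")                             # hypothetical raw data

# Examine the data and every feature you are working with
print(df.info())
print(df.describe())

# Identify errors, duplicates, missing values, and corrupt records, then clean them up
df = df.drop_duplicates()
df["price"] = pd.to_numeric(df["price"], errors="coerce")      # corrupt entries become NaN
df["price"] = df["price"].fillna(df["price"].median())         # fill missing values
df = df.dropna(subset=["customer_id"])                         # drop rows missing a key field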

E — Exploring / Visualizing our data


Now during the exploration phase, we try to understand what patterns and values
our data has. We’ll be using different types of visualizations and statistical testing to
back up our findings. This is where we will be able to derive hidden meanings
behind our data through various graphs and analyses. Go out and explore!

Your objective here is to find patterns in your data through visualizations and charts
and to extract features by using statistics to identify and test significant variables.

Skills Required:
● Python: NumPy, Matplotlib, Pandas, SciPy
● R: ggplot2, dplyr
● Inferential statistics
● Experimental Design
● Data Visualization
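
For example, here is a quick exploration sketch with pandas and Matplotlib; the file and column
names are hypothetical:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales_2020.csv")            # hypothetical cleaned data

# Look for patterns and trends with simple visualizations
df["sales"].hist(bins=30)
plt.title("Distribution of daily sales")
plt.show()

df.plot.scatter(x="temperature", y="sales")   # is there a relationship between the two?
plt.show()

# Use summary statistics to spot potentially significant variables
print(df.corr(numeric_only=True)["sales"].sort_values(ascending=False))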

M — Modeling our data


Models are general rules in a statistical sense. Think of machine learning models as
tools in your toolbox. You will have access to many algorithms and use them to
accomplish different business goals. The better the features you use, the better your
predictive power will be. After cleaning your data and finding which features are most
important, using your model as a predictive tool will only enhance your business
decision making.

Your objective here is to perform in-depth analytics by creating predictive models.


Machine learning algorithms may be better at predicting, detecting, and processing
patterns than you. But they can't reason. And that's where you come in!
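
To make the toolbox idea concrete, here is a minimal scikit-learn sketch that tries a few
algorithms on the same data; the dataset is synthetic and the model choices are illustrative only:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for your cleaned feature matrix X and labels y
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Treat each algorithm as a tool in the toolbox and compare them on the same data
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(max_depth=5),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))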

Machine Learning Analogy


Think of machine learning algorithms as students. Just like students, all of the
algorithms learn differently. Each has its own set of qualities and its own way of learning. Some
algorithms learn faster, while others learn slower. Some algorithms are lazy learners (e.g.
KNN) while others are eager learners. Some algorithms are parametric (e.g.
Linear Regression) while others are non-parametric (e.g. Decision Trees). Just
like students, some algorithms perform better on a certain problem, whereas others may
perform better on another, e.g. linearly separable vs. non-linearly separable problems. Just
like students, these algorithms learn from the patterns and relationships in the
data. That's why it's important to perform EDA and visualizations: if you don't see
a pattern, your model probably won't either. Just like students, if you give an algorithm garbage
information to learn from, then it won't perform well. That's why it's important to choose your
features carefully and make sure each one has some relationship to the problem. So remember,
choose the algorithm that is most appropriate for your problem, because there is no
"best" learner, but there is always the "right" learner.

"Machines can predict the future, as long as the future doesn't look too different from the
past."

Skills Required:
● Machine Learning: Supervised/Unsupervised algorithms
● Evaluation methods: MSE, Precision/Recall, ROC/AUC
● Machine Learning Libraries: Python (scikit-learn) / R (caret)
● Linear algebra & Multivariate Calculus
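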

N — Interpreting our data


The most important step in the pipeline is understanding how to explain
your findings through communication. Telling the story is key; don’t underestimate
it. It’s about connecting with people, persuading them, and helping them. The art of
understanding your audience and connecting with them is one of the best parts of
data storytelling.

The power of emotion


Emotion plays a big role in data storytelling. People aren’t going to magically
understand your findings. The best way to make an impact is to tell your story
through emotion. We as humans are naturally influenced by emotions. If you can
tap into your audience’s emotions, then you, my friend, are in control. When you’re
presenting your data, keep in mind the power of psychology.

The objective you should set for yourself is to identify business insights,
relate them back to the business problem, visualize your findings accordingly, keep it
simple and priority driven, and tell a clear and actionable story.

Skills Required:
● Business Domain Knowledge
● Data Visualization Tools: Tableau, D3.JS, Matplotlib, GGplot, Seaborn
● Communication: Presenting/Speaking & Reporting/Writing

Conclusion
Data is not about statistics, machine learning, visualization, or wrangling. Data is about
understanding. Understanding the problem and how you can solve it using data with
whatever tools or techniques you choose. Understand your problem. Understand your
data. And the rest will follow.

Most of the problems you will face are, in fact, engineering problems. Even with all the
resources of a great machine learning god, most of the impact will come from great
features, not great machine learning algorithms.

So, the basic approach is:


● Make sure your pipeline is solid end to end
● Start with a reasonable objective
● Understand your data intuitively
● Make sure that your pipeline stays solid

Machine Learning 101

What is Machine Learning?


Machine Learning involves teaching a computer to recognize patterns from examples, rather than
programming it with specific rules. These patterns are found within data.

Machine = Your machine or computer


Learning = Finding patterns in data

Machine Learning is about creating algorithms (sets of rules) that learn complex
functions (patterns) from data in order to make predictions on new data.

In short, Machines can predict the future, as long as the future doesn’t look too different
from the past.

Essentially, it can be summarized in 3 Steps:


1. It takes some data
2. It finds patterns in the data
3. It uses those patterns to make predictions on new data

Applications of Machine Learning


Before we get started, here is a quick overview of what machine learning is capable of:
● Healthcare: Predicting patient diagnostics for doctors to review
● Social Network: Predicting certain match preferences on a dating website for better
compatibility
● Finance: Predicting fraudulent activity on a credit card
● E-commerce: Predicting customer churn
● Biology: Finding patterns in gene mutations that could represent cancer

How Do Machines Learn?
To keep things simple, just know that machines “learn” by finding patterns in similar
data. Think of data as information you acquire from the world. The more data given to a
machine, the “smarter” it gets.

But not all data are the same. Imagine you're a pirate and your life's mission is to find the
buried treasure somewhere on an island. In order to find the treasure, you're going to
need a sufficient amount of information. Like data, this information can lead you in the
right direction or the wrong direction. The better the information/data you obtain,
the more uncertainty is reduced, and vice versa. So it's important to keep in mind the
type of data you're giving your machine to learn from.

Nonetheless, once a sufficient amount of data is given, the machine can make
predictions. Machines can predict the future, as long as the future doesn’t look too
different from the past.

A machine really “learns” by using old data to work out what is most likely to happen.
If the old data looks a lot like the new data, then the things you can say about the old
data will probably be relevant to the new data. It’s like looking back to look forward.

Types of Machine Learning


There are three main categories of machine learning:

1. Supervised learning: The machine learns from labeled data. Normally, the data is
labeled by humans.
2. Unsupervised learning: The machine learns from unlabeled data. Meaning, there is no
“right” answer given to the machine; instead, it must find patterns in the data on its own
to come up with an answer.
3. Reinforcement learning: The machine learns through a reward-based system.

Supervised Machine Learning
Supervised learning is the most common and studied type of learning because it is easier
to train a machine to learn with labeled data than with unlabeled data. Depending on what
you want to predict, supervised learning can be used to solve two types of problems:
regression or classification.

Regression
If you want to predict continuous values, such as trying to predict the cost of a house or
the weather outside in degrees, you would use regression. This type of problem doesn’t
have a specific value constraint because the value could be any number with no limits.

Classification
If you want to predict discrete values, such as classifying something into categories, you
would use classification. A problem like "Will he make this purchase?" will have an answer
that falls into two specific categories: yes or no. This is also called a binary classification
problem.

Unsupervised Machine Learning


Since there is no labeled data for machines to learn from, the goal of unsupervised
machine learning is to detect patterns in the data and to group them. Unsupervised
learning is machines trying to learn “on their own” without help. Imagine someone
throwing you piles of data and saying, “Here you go, find some patterns and group them
for me. Thanks and have fun.”

Depending on what you want to group together, unsupervised learning can group data
together by: clustering or association.

Clustering Problem
Unsupervised learning tries to solve this problem by looking for similarities in the data. If
there is a common cluster or group, the algorithm would then categorize them in a certain
form. An example of this could be trying to group customers based on past buying
behavior.
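
As a hedged illustration, here is a tiny k-means sketch with scikit-learn; the customer data is
synthetic and the two groups are invented purely for the example:

import numpy as np
from sklearn.cluster import KMeans

# Synthetic "past buying behavior": yearly spend ($) and number of purchases per customer
rng = np.random.default_rng(42)
spend = np.concatenate([rng.normal(200, 30, 50), rng.normal(900, 100, 50)])
purchases = np.concatenate([rng.normal(5, 1, 50), rng.normal(25, 3, 50)])
X = np.column_stack([spend, purchases])

# Ask k-means to group the customers into two clusters without any labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])        # cluster assigned to the first ten customers
print(kmeans.cluster_centers_)    # the "typical" customer in each cluster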

Association Problem
Unsupervised learning tries to solve this problem by trying to understand the rules and
meaning behind different groups. Finding a relationship between customer purchases is a
common example of an association problem. Stores may want to know what type of
products were purchased together and could possibly use this information to organize the
placement of these products for easier access. One store found that there was a strong
association between customers buying beer and diapers. They deduced from this that
men who had gone out to buy diapers for their babies also tended to buy beer.

Reinforcement Machine Learning


This type of machine learning requires the use of a reward/penalty system. The goal is to
reward the machine when it learns correctly and to penalize the machine when it learns
incorrectly.

Reinforcement Machine Learning is a subset of Artificial Intelligence. Because of the wide
range of possible actions, this type of learning is an iterative process: the machine
continuously learns through trial and error.

Examples of Reinforcement Learning:


● Training a machine to learn how to play (Chess, Go)
● Training a machine how to learn and play Super Mario by itself
● Self-driving cars

Machine Learning Algorithms
With huge amounts of data everywhere, the power to derive meaning from all of it relies on
the use of Machine Learning. Machine learning algorithms are used to learn the structure
of the data. Is there structure in the data? Can we learn that structure from the data? And
if we can, we can then use it for prediction, description, compression, and more.

Each machine learning algorithm/model has its own strengths and weaknesses. A common
trade-off in machine learning is between understanding the details of the algorithm being
used (interpretability) and its prediction accuracy. Some models are easier to
interpret and understand but lack predictive power, whereas other models may make very
accurate predictions but lack interpretability.

This section will not go into detail on what goes inside each algorithm, but will cover the
high-level overview of what each machine learning algorithm does.

Let’s talk about the 7 important machine learning algorithms:


1. Linear Regression
2. Logistic Regression
3. K-Nearest Neighbors (KNN)
4. Support Vector Machines (SVM)
5. Decision Tree
6. Random Forest
7. Gradient Boosting Machine

Linear Regression
This is the go-to method for regression problems. The linear regression algorithm is used
to model the relationship between the predictor (explanatory) variables and the response
variable. This relationship is either a positive, negative, or neutral change between the
variables. In its simplest form, it attempts to fit a straight line to your training data. This
line can then be used as a reference to predict future data.

Imagine you’re an ice cream man. Your intuition from previous sales is that you sell more
ice cream when it is hotter outside. Now you want to know how much ice cream you’ll sell
at a certain temperature. We can use linear regression to predict just that! The linear
regression algorithm is represented by the formula y = mx + b, where “y” is your
dependent variable (ice cream sales) and “x” is your independent variable (temperature).

Example: If the temperature is about 75 degrees outside, you would expect to sell about
$150 worth of ice cream. This shows that as temperature increases, so do ice cream sales.
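
A minimal sketch of that example with scikit-learn; the historical numbers are made up:

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up historical data: temperature (degrees F) vs. ice cream sales ($)
temperature = np.array([[60], [65], [70], [75], [80], [85]])
sales = np.array([110, 125, 138, 150, 164, 178])

model = LinearRegression().fit(temperature, sales)   # fits the line y = mx + b
print(model.coef_[0], model.intercept_)              # slope m and intercept b
print(model.predict([[75]]))                         # predicted sales at 75 degrees, around $150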

Strengths: Linear regression is very fast to implement, easy to understand, and less
prone to overfitting. It’s a great go-to algorithm to use as your first model, and it works
really well on linear relationships.

Weaknesses: Linear regression performs poorly when there are non-linear relationships.
It is hard to use on complex data sets.

Logistic Regression
This is the go-to method for classification problems and is commonly used when
interpretability matters. It is typically used to predict the probability of an event occurring.
Logistic regression is an algorithm borrowed from statistics and uses a logistic/sigmoid
function to transform its output into a value between 0 and 1.

Example: Imagine you’re a banker and you want a machine learning algorithm to predict
the probability of a person paying you back the money. They will either pay you back or
not pay you back. Since a probability lies in the range (0–1), using a linear regression
algorithm wouldn’t make sense here because the line would extend past 0 and 1. You can’t
have a probability that’s negative or above 1.
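
A small scikit-learn sketch of that idea; the loan data below is invented:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: applicant income (in $1000s) and whether they paid the loan back (1 = yes)
income = np.array([[20], [25], [30], [40], [55], [60], [75], [90]])
paid_back = np.array([0, 0, 0, 1, 1, 1, 1, 1])

model = LogisticRegression().fit(income, paid_back)
# The sigmoid keeps the output between 0 and 1, so it can be read as a probability
print(model.predict_proba([[35]])[0, 1])   # probability that a $35k applicant pays back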

Strengths: Similar to its sister, linear regression, logistic regression is easy to interpret and
is less prone to overfitting. It’s fast to implement as well and has surprisingly great accuracy
for its simple design.

Weaknesses: This algorithm performs poorly when there are multiple or non-linear
decision boundaries, and it struggles to capture complex relationships.

K-Nearest Neighbors
The K-Nearest Neighbors algorithm is one of the simplest classification techniques. It
classifies an object based on its closest training examples in the feature space. The K in
K-Nearest Neighbors refers to the number of nearest neighbors the model uses for its
prediction.

How it Works:

1. Assign K a value (preferably a small odd number)
2. Find the K closest points to the new point
3. Assign the new point to the majority class among those K neighbors

Example: Looking at the unknown person in the graph, how would you classify him:
Dothraki or Westerosian? We’ll assign K=5, so in this case we’ll look at the 5 closest
neighbors to our unassigned person and assign him to the majority class. If you picked
Dothraki, then you are correct! Out of the 5 neighbors, 4 of them were Dothraki and 1 was
Westerosian.
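
Here is a minimal KNN sketch with scikit-learn; the coordinates and labels are invented to mirror
the example:

from sklearn.neighbors import KNeighborsClassifier

# Invented 2-D feature space: each point is a person, labeled by group
X = [[1, 2], [2, 1], [2, 3], [3, 2], [3, 3],   # Dothraki cluster
     [7, 8], [8, 7], [8, 9], [9, 8]]           # Westerosian cluster
y = ["Dothraki"] * 5 + ["Westerosian"] * 4

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict([[3, 4]]))   # the unknown person is assigned to the majority class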

Strengths: This algorithm works well when there is plenty of representative training data, it
learns complex patterns well, it can handle linearly or non-linearly distributed data, and it is
robust to noisy training data.

Weaknesses: It’s hard to find the right K value, it performs poorly on high-dimensional data,
and it requires a lot of computation as the number of features and examples grows. It’s
expensive and slow.

Support Vector Machine
If you want to classify data by focusing on the extreme values in your dataset, the SVM
algorithm is the way to go. It draws a decision boundary, also known as a hyperplane, that
best segregates the two classes from one another. Or you can think of it as an algorithm
that looks for a pattern in the data points and finds the best line that can separate the
pattern(s).

S — Support refers to the extreme values/points in your dataset.
V — Vector refers to the values/points in the dataset / feature space.
M — Machine refers to the machine learning algorithm that focuses on the support
vectors to classify groups of data. This algorithm literally only focuses on the extreme
points and ignores the rest of the data.

Example: This algorithm only focuses on the extreme values (support vectors) to create
this decision line, which are the two cats and one dog circled in the graph.
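
A minimal SVM sketch with scikit-learn on synthetic two-class data (standing in for the cats and
dogs):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two synthetic, roughly separable groups (think cats vs. dogs)
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

svm = SVC(kernel="linear").fit(X, y)
print(len(svm.support_vectors_))   # only these extreme points define the decision boundary
print(svm.predict([[0, 0]]))       # classify a new point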

Strengths: This model is good for classification problems and can handle very
high-dimensional data. It’s also good at modeling non-linear relationships.

Weaknesses: It’s hard to interpret and requires a lot of memory and processing power. It
also does not provide probability estimations and is sensitive to outliers.

Decision Tree
A decision tree is made up of nodes. Each node represents a question about the data, and
the branches from each node represent the possible answers. Visually, this algorithm is
very easy to understand. Every decision tree has a root node, which represents the
topmost question. The order of importance of the features is reflected top-down in the
tree: the higher the node, the more important its property/feature.

Strengths: The decision tree is very easy to understand and visualize. It’s fast to train,
robust to outliers, and can work on non-linear relationships. This model is commonly used
where it’s important to understand which features drive a decision, such as medical
diagnosis and credit risk analysis. It also has built-in feature selection.

Weaknesses: The biggest drawback of a single decision tree is its relatively weak predictive
power on its own. A downside to decision trees is the possibility of building an overly
complex tree which does not generalize well to future data (overfitting). Decision trees can
also be unstable: small variations in the data can produce a completely different tree.
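
For instance, a short scikit-learn sketch using the library's built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(iris.data, iris.target)

# Print the question at each node; higher nodes correspond to more important features
print(export_text(tree, feature_names=iris.feature_names))
print(tree.feature_importances_)   # the built-in feature selection mentioned above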

Random Forest
Random Forest is one of the most widely used and powerful supervised machine learning
algorithms when prediction accuracy matters. Think of this algorithm as a bunch of decision
trees, instead of the single tree used by the Decision Tree algorithm. This grouping of
models, in this case decision trees, is called an ensemble method. Its accurate performance
comes from averaging the predictions of many decision trees. Random Forest is naturally
hard to interpret because of the many decision trees it combines. But if you want a model
that is a predictive powerhouse, use this algorithm!

Strengths: Random Forest is known for its great accuracy. It has automatic feature
selection, which identifies what features are most important. It can handle missing data
and imbalanced classes and generalizes well.

Weaknesses: A drawback of random forest is that you have very little control over what goes
on inside the algorithm. It’s hard to interpret and won’t perform well if given a set of bad
features.
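
A short sketch of the ensemble idea and the feature importances it produces, again using the
built-in iris data:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# A "forest" of 200 decision trees whose predictions are averaged together
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(iris.data, iris.target)

# Which features mattered most across all the trees
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(name, round(importance, 3))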

Gradient Boosting Machine
Gradient Boosting Machine is another type of ensemble method: it aggregates many
models and combines their predictions. A simple key concept to remember about this
algorithm is this: it turns a sequence of weak models into a stronger model. Try not to
over-complicate things here.

How It Works: Your first model is considered the “weak” model. You train it and find out
what errors it produced. The second tree then uses these errors from the first tree to
recalibrate, placing more emphasis on the examples the first tree got wrong. The third tree
repeats this process on the errors the second tree made, and so on. Essentially, this
approach builds a team of models that work together to correct each other’s weaknesses.
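
A minimal boosting sketch with scikit-learn on synthetic data; each shallow tree in the sequence
focuses on the errors the previous trees left behind:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 300 shallow "weak" trees trained one after another, each correcting the last
gbm = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05, max_depth=2,
                                 random_state=42).fit(X_train, y_train)
print(gbm.score(X_test, y_test))   # accuracy on held-out data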

Strengths: Gradient Boosting Machines are very good for prediction. It’s one of
the best off-the-shelf algorithms for high accuracy with decent run time and memory
usage.

Weaknesses: Like other ensemble methods, GBM lacks interpretability. It is not well suited
to very high-dimensional data because training can take a lot of time and computation, so
balancing computational cost and accuracy is a concern.
