Prediction Phase 4
COURSE INSTRUCTOR
Anam Shahid
Data science workflow
This chapter covers the final stage of the data science workflow, beginning with experiments.
Terminology Review
Sample size is the number of data points used in an experiment. In data science, you'll
often hear the question "Are your results significant?" A statistically significant result is
one that is probably not due to chance, given the statistical assumptions made.
Statistical tests help determine this, and there are many to choose from. Let's take a
closer look at A/B testing.
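As a sketch of what such a significance test can look like in code, here is a minimal two-proportion z-test in plain Python. The conversion counts are invented for illustration; a real analysis would typically use a statistics library rather than a hand-rolled test.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)      # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))    # two-sided normal tail
    return z, p_value

# Hypothetical A/B test: variant B converts 120/1000 vs. A's 100/1000.
z, p = two_proportion_z_test(100, 1000, 120, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

Here the p-value comes out around 0.15, so despite the apparent lift, this made-up result would not be called significant at the usual 5% level.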
The sample size depends on the sensitivity we want. A test's sensitivity tells us how
small a change in our metric we can detect. Larger sample sizes allow us to detect
smaller changes. You might think that we always want high sensitivity, but we actually
want to optimize for the smallest change that is meaningful for our question.
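The trade-off between sensitivity and sample size can be sketched with the standard approximation for a two-proportion A/B test. The baseline rate and the effect sizes below are hypothetical, and the formula assumes a conventional 5% significance level and 80% power.

```python
import math

def sample_size_per_group(p_baseline, mde, z_alpha=1.96, z_beta=0.84):
    """Approximate sample size per group for a two-proportion A/B test.

    p_baseline: current conversion rate
    mde: minimum detectable effect (smallest absolute change we care about)
    z_alpha, z_beta: normal quantiles for 5% significance and 80% power
    """
    variance = p_baseline * (1 - p_baseline)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Detecting a 1-point lift from a 10% baseline needs far more data
# than detecting a 5-point lift:
print(sample_size_per_group(0.10, 0.01))  # small change -> large sample
print(sample_size_per_group(0.10, 0.05))  # larger change -> smaller sample
```

Halving the detectable change roughly quadruples the required sample, which is why we aim for a meaningful effect size rather than maximum sensitivity.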
Predictive modeling
Predictive modeling is a sub-category of modeling used to make predictions. Once we
have modeled a process, we can feed the model new inputs and see what outcome it
predicts. For instance, entering a future date into a model of the unemployment rate
gives a prediction of what the rate will be next month.
The output can be the probability of an outcome, for example, the probability that a
tweet is fake.
Consider a subscription business with historical customer data. Some of those
customers will have maintained their subscription, while others will have churned. We
eventually want to be able to predict the label for each customer: churned or subscribed.
We'll need features to make this prediction. Features are different pieces of information
about each customer that might affect our label. For example, perhaps age, gender, the
date of last purchase, or household income will predict cancellations. The magic of
machine learning is that we can analyze many features all at once. We use these labels
and features to train a model to make predictions on new data.
Suppose we have a customer who may or may not churn soon. We can collect feature
data on this customer, such as age or date of last purchase, and our trained model will
give us a prediction. If the customer is not in danger of churning, we can count on their
revenue for another month! If they are in danger of churning, we can reach out to them
to try to keep them subscribed.
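The workflow above can be sketched with a toy 1-nearest-neighbour model in plain Python. The customers, the features (age and days since last purchase), and the labels are all invented for illustration, and a real project would typically use a library such as scikit-learn instead of hand-written code.

```python
def predict_churn(train, labels, customer):
    """Label a new customer with the label of the closest training point."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(range(len(train)), key=lambda i: dist(train[i], customer))
    return labels[nearest]

# (age, days since last purchase) for five historical customers
train = [(25, 10), (31, 12), (48, 90), (52, 120), (40, 15)]
labels = ["subscribed", "subscribed", "churned", "churned", "subscribed"]

print(predict_churn(train, labels, (50, 100)))  # resembles the churners
print(predict_churn(train, labels, (28, 8)))    # resembles the subscribers
```

The labeled historical customers play the role of training data; the new customer's features are all the model needs to produce a prediction.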
Model evaluation
After training a model, how do we know whether it's any good? It's good practice not to
allocate all of your historical data for training. The withheld data, called a test set, can
be used to evaluate the model. In our example, we could ask the model to predict
whether a set of customers would churn, and then measure the accuracy of those
predictions.
For example, let's say we test our model on a test set of 1000 customers, of whom only
30 have actually churned. We feed that test data into our newly trained model, and it
predicts that all the customers remain subscribed.
If we calculate the overall accuracy of that model, it technically scores a high 97%,
because it was correct on 970 of the 1000 customers. Yet it never correctly labels a
single churning customer. Checking both outcomes is important for rare events: only by
examining the accuracy of each label separately do we discover that the model is 0%
accurate at predicting churn when churn was the actual outcome. This model is not
useful in its current state, so we'd have to re-train it with different parameters or more
data.
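The gap between overall and per-label accuracy is easy to verify in code. This snippet reproduces the numbers from the example: 1000 test customers, 30 of whom churned, and a model that predicts "subscribed" for everyone.

```python
# 1000 test customers, 30 actually churned, and a naive model that
# predicts "subscribed" for every single one of them.
actual = ["churned"] * 30 + ["subscribed"] * 970
predicted = ["subscribed"] * 1000

correct = sum(a == p for a, p in zip(actual, predicted))
overall_accuracy = correct / len(actual)

churn_hits = sum(a == p == "churned" for a, p in zip(actual, predicted))
churn_accuracy = churn_hits / actual.count("churned")

print(f"overall accuracy: {overall_accuracy:.0%}")            # 97%
print(f"accuracy on churned customers: {churn_accuracy:.0%}")  # 0%
```

The per-label breakdown is what exposes the model's uselessness, which a single overall accuracy number hides.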
Clustering
Previously, we learned how to use Supervised Learning to make predictions based on
labeled data. In this lesson, we’ll cover another subset of machine learning called
clustering.
What is clustering?
Clustering is a set of machine learning algorithms that divide data into categories, called
clusters. Clustering can help us see patterns in messy datasets. Machine Learning
Scientists use clustering to divide customers into segments, images into categories, or
behaviors into typical and anomalous.
Supervised vs. unsupervised machine learning
Clustering is part of a broader category within Machine Learning called "Unsupervised
Learning". Unsupervised Learning differs from Supervised Learning in the structure of
the training data. While Supervised Learning uses data with features and labels,
Unsupervised Learning uses data with only features. This makes Unsupervised
Learning, and clustering, particularly appealing: you can use it even when you don't
know much about your dataset.
1. Defining features
The first step is defining features. Luckily, you've been meticulous in your data gathering
and measured over 100 flowers. We can use your measurements as features for our
model. This is indeed an unsupervised learning problem: we have features, but we don't
know what species each flower belongs to, or even how many species there are!
Clustering review
Let's review. Clustering is an Unsupervised Machine Learning method that divides an
unlabeled dataset into different categories. In order to perform clustering, we must first
select relevant features of our dataset. Next, we select the number of clusters based on
hypotheses about our data. Finally, we use the results of our clustering to solve our
problem, whether that's defining new species, segmenting customers, or classifying
movies into genres. Clustering has many diverse uses!
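The steps above can be sketched with a minimal k-means implementation in plain Python. The flower measurements (petal length and width) are made up for illustration, and a real analysis would typically reach for a library implementation such as scikit-learn's KMeans.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign points to the nearest centre, recompute centres."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centres[i])),
            )
            clusters[nearest].append(p)
        centres = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else centres[i]
            for i, cluster in enumerate(clusters)
        ]
    return centres, clusters

# Two made-up groups of flower measurements: (petal length, petal width)
points = [(1.4, 0.2), (1.3, 0.3), (1.5, 0.2), (4.7, 1.4), (4.5, 1.5), (4.9, 1.3)]
centres, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # each cluster recovers one group
```

Note that we chose k=2 ourselves: picking the number of clusters is exactly the "hypotheses about our data" step in the review above.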