Prediction Phase 4

The document discusses experiments and prediction in data science. It covers topics like A/B testing, sample sizes, statistical significance, time series forecasting, seasonality, and supervised machine learning. Supervised machine learning uses historical labeled data to train models that can predict labels on new data based on features.


Experimentation and Prediction

UNDERSTANDING DATA SCIENCE


https://campus.datacamp.com/courses/understanding-data-science/experimentation-and-prediction?ex=14

COURSE INSTRUCTOR
Anam Shahid
Data science workflow
This chapter covers the final stage, beginning with experiments.

What are experiments in data science?


Experiments help drive decisions and draw conclusions. Generally, they begin with a
question and a hypothesis, then data collection, followed by a statistical test and its
interpretation.

Case study: which is the better blog post title?


For instance, suppose we want to pick the best blog post title. Our question is: does
blog title A or blog title B result in more clicks? We hypothesize that both titles will
result in the same number of clicks. To collect data, we randomly divide our audience
into two groups, each seeing a different title. We run the experiment until our sample
size is reached (more on that later).
Then, we use a statistical test to see whether the difference between the titles' click-
through rates is significant. Finally, we interpret the results. In our case, we want to
choose the better-performing title. Often, however, the results are inconclusive, leading
to more questions and experiments.

What is A/B Testing?


This is called A/B testing, also known as Champion/Challenger testing, and it's used to
choose between two options.

Terminology Review
Sample size is the number of data points used. In data science, you'll hear "Are your
results significant?". A statistically significant result means that your result is probably
not due to chance given the statistical assumptions made. Statistical tests help
determine this and there are many to choose from. Let's take a closer look at A/B
testing.

A/B Testing Steps


There are four steps in an A/B test: picking a metric to track, calculating sample size,
running the experiment, and checking for significance. Let's examine each step.

1. Pick a metric to track: click-through rate


Our metric is click-through rate, the percent of people who click on a link after viewing
the title.
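
As a quick illustration, click-through rate is simply clicks divided by views, expressed as a percentage. The counts below are made-up numbers, not course data:

```python
# Click-through rate: the share of viewers who clicked, as a percentage.
def click_through_rate(clicks, views):
    return clicks / views * 100

# Made-up counts for the two titles in our experiment
ctr_a = click_through_rate(42, 1000)  # title A
ctr_b = click_through_rate(57, 1000)  # title B
print(f"Title A: {ctr_a:.1f}%  Title B: {ctr_b:.1f}%")
```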

2. Calculate your sample size


Next, we'll run the experiment until we reach a sample size large enough to be certain
that results aren't due to random chance. The size depends on a "baseline metric" that
helps gauge any changes; in our case, that's how often people usually click on a link to
one of our blogs. If that rate is much larger or smaller than 50%, we need a larger
sample size.

The sample size depends on the sensitivity we want. A test's sensitivity tells us how
small of a change in our metric we can detect. Larger sample sizes allow us to detect
smaller changes. You might think that we want a high sensitivity, but we actually want to
optimize for an amount that is meaningful for our question.
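
A sketch of how such a calculation might look, using the standard normal-approximation formula for comparing two proportions. The function name and the default significance/power values are our own choices, not from the course:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_baseline, min_detectable_diff,
                          alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-proportion test."""
    p1 = p_baseline
    p2 = p_baseline + min_detectable_diff
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / min_detectable_diff ** 2
    return ceil(n)

# Detecting a 2-point lift from a 10% baseline needs far more data
# than detecting a 10-point lift:
print(sample_size_per_group(0.10, 0.02))
print(sample_size_per_group(0.10, 0.10))
```

This is why we optimize sensitivity for a difference that actually matters: every halving of the detectable difference roughly quadruples the required sample.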

3. Run your experiment


We run our experiment until we reach the calculated size. Stopping the experiment
before or after could lead to biased results.
4. Check for significance
Once we reach the target size, we check our metric. We see some difference between
the titles, but how do we know if it's meaningful? We check by performing a test of
statistical significance. If the results are significant, we can be reasonably sure that the
difference is not due to random chance, but to an actual difference in preference.
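
One common choice for comparing two click-through rates is a two-proportion z-test. Here is a minimal sketch using only the standard library; the click counts are invented for illustration:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference between two click-through rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)  # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

p = two_proportion_z_test(100, 2000, 150, 2000)  # 5.0% vs 7.5% CTR
print(f"p-value: {p:.4f}")  # below 0.05 -> difference is significant
```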

What if the results aren't significant?


What if the results aren't significant? If there are any differences in click rates, they're
smaller than the threshold we chose when determining the sensitivity. Running our test
longer won't help: it would only detect a smaller difference, and we've already decided
that smaller differences are irrelevant! There still might be a difference in click rates
between the titles, but that difference isn't important for making our decision.
Time Series Forecasting
Now, we'll learn about modeling and time series forecasting.

Modeling in data science


Data scientists and machine learning scientists spend a lot of time building models.
Models attempt to represent a real-world process with statistics. At a high level, models
define relationships between variables with equations. These definitions are based on
statistical assumptions and historical data.

Predictive modeling
Predictive modeling is a sub-category of modeling used for prediction. By modeling a
process, we can enter new inputs and see what output it produces. For instance, you
can enter a future date into a model of the unemployment rate to get a prediction of
what the unemployment rate will be next month.

The output can be the probability of an outcome, for example, the probability that a
tweet is fake.

Predictive models range from a simple linear equation with an x and a y variable to a
deep learning algorithm that is uninterpretable by humans. Let's look at using predictive
modeling on time series data.
Time series data
Time series data is a series of data points sequenced by time. Examples include daily
stock and gas prices over the years. Oftentimes it comes in the form of rates, such as
monthly unemployment rates or a patient's heart rate during surgery. It can also be
measurements, like CO2 levels or the height of tides, recorded regularly over a period
of time.

1. Plotting time series data


Let's plot an example. We have time series data of Canadian unemployment rates
measured monthly from 1976 to 2015. Time series data is usually plotted as a line
graph like this, with time on the x-axis.

2. Seasonality in time series


Oftentimes when plotting time series, you can find patterns. For example, this plot
graphs the average temperature in Boston over three years. Can you figure out the
pattern here?
3. Seasonality in time series
The line peaks during summer months and reaches its lowest during winter months. If
we graphed ice cream sales, we'd see a similar pattern. This is called seasonality.
Seasonality is when there are repeating patterns related to time such as months and
weeks. Another example is spending spikes at the end of the month when people
receive a paycheck.
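
One simple way to expose seasonality in code is to average the series by month of year. This sketch uses a synthetic temperature series with a built-in yearly cycle, not the real Boston data:

```python
import math
from collections import defaultdict

# Three years of toy monthly temperatures with a yearly cycle built in
readings = []  # (month_of_year, temperature)
for m in range(36):
    month = m % 12 + 1
    temp = 10 - 12 * math.cos(2 * math.pi * m / 12)  # peaks mid-year
    readings.append((month, temp))

# Averaging by month of year exposes the repeating pattern
by_month = defaultdict(list)
for month, temp in readings:
    by_month[month].append(temp)
avg = {month: sum(v) / len(v) for month, v in by_month.items()}

print(f"January avg: {avg[1]:.1f}  July avg: {avg[7]:.1f}")
```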

Forecasting time series


Time series data is used in predictive modeling to predict metrics at future dates; we
call this forecasting. Examples range from predicting next month's rainfall, to the state
of traffic or the stock market in a couple of hours, to what the population will be in 20
years. We can build predictive models using time series data from past years or
decades to generate predictions, using a combination of statistical and machine
learning methods. Let's look at an example.

Example: Pea prices in Rwanda


The United Nations provides open data on global food prices. Here we have the price
of peas in Rwandan Francs from 2011 to 2016. There's some seasonality here; can you
spot it? Prices are lowest around December and January, but peak around August.
Some years show a second peak around April. There also seems to be a general
increase in pea prices annually. Can we forecast what will happen, taking this
seasonality and price increase into account?

Forecasting pea prices in Rwanda


Here is the forecast of a predictive model. The blue line depicts the forecast. The
seasonality remains, and the model anticipates a continued increase in pea prices,
seen in the higher peaks and troughs. There are also two blue areas shown along the
forecast. These are confidence intervals, which are extremely useful for evaluating
predictions. We see two confidence intervals: 80% and 95%. The model is 80% sure
that the true value will be in the area labeled 80; the same goes for the area labeled 95.
If we're using this forecast to make big decisions, confidence intervals can help us
buffer for the unexpected.
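
Real forecasting models (ARIMA, exponential smoothing, and the like) produce such intervals automatically, but a seasonal-naive sketch shows the idea: repeat last year's pattern, add the average year-over-year trend, and derive rough intervals from the spread of past changes. All numbers below are synthetic:

```python
import math
import random
from statistics import NormalDist, stdev

# Toy monthly price series with seasonality, an upward trend, and noise
random.seed(0)
history = [100 + 2 * t + 15 * math.sin(2 * math.pi * t / 12)
           + random.gauss(0, 3) for t in range(48)]  # four years

# Seasonal-naive forecast: repeat last year's values, shifted by the
# average year-over-year change.
yoy_changes = [history[t] - history[t - 12] for t in range(12, 48)]
trend = sum(yoy_changes) / len(yoy_changes)
forecast = [history[36 + m] + trend for m in range(12)]

# 80% and 95% intervals from the spread of past year-over-year changes
for level in (0.80, 0.95):
    z = NormalDist().inv_cdf(0.5 + level / 2)
    print(f"{level:.0%} interval: forecast +/- {z * stdev(yoy_changes):.1f}")
```
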
Supervised machine learning
Now, we'll dive into machine learning!

As we learned previously, machine learning is a set of methods for making predictions
based on existing data, hence it belongs in the last step of the workflow.

What is supervised machine learning?


Supervised machine learning is a subset of machine learning where the existing data
has a specific structure: it has labels and features. More on that later. Examples of its
applications include recommendation systems, diagnosing biomedical images,
recognizing hand-written digits, and predicting customer churn. Let's define these new
terms with a case study.

Case study: churn prediction


Suppose we have a subscription business and want to predict whether a given
customer is likely to stay subscribed or cancel their subscription, also known as churn.
First, we need some training data on which to build our model. This would be historical
customer data.

Some of those customers will have maintained their subscription, while others will have
churned. We eventually want to be able to predict the label for each customer: churned
or subscribed.
We'll need features to make this prediction. Features are different pieces of information
about each customer that might affect our label. For example, perhaps age, gender, the
date of last purchase, or household income will predict cancellations. The magic of
machine learning is that we can analyze many features all at once. We use these labels
and features to train a model to make predictions on new data.
Suppose we have a customer who may or may not churn soon. We can collect feature
data on this customer, such as age, or date of last purchase.

Case study: churn prediction (continued)

We can feed this data into our trained model

and then, our trained model will give us a prediction. If the customer is not in danger of
churning, we can count on their revenue for another month! If they are in danger of
churning, we can reach out to them to try to keep them subscribed.

Supervised machine learning recap


Let's recap. Machine learning makes a prediction based on data. In supervised machine
learning, that data has two characteristics: features and labels. Labels are what we want
to predict, like the customer churning. Features are data that might help predict the
label, such as profession or date of last purchase. Once we have the features and
labels, we train a model and use it to make predictions on new data.
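
In practice you'd train such a model with a library like scikit-learn, but a tiny nearest-neighbour classifier shows the features/labels idea end to end. All customer data here is invented:

```python
import math

# Toy training data: each customer is (age, days_since_last_purchase)
# with a label: 1 = churned, 0 = stayed. All values are made up.
features = [(25, 5), (30, 10), (22, 7), (60, 90), (55, 120), (48, 80)]
labels = [0, 0, 0, 1, 1, 1]

def predict(new_customer, features, labels, k=3):
    """Predict by majority vote among the k nearest labeled examples."""
    dists = sorted(
        (math.dist(new_customer, f), label)
        for f, label in zip(features, labels)
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)

# A customer who hasn't purchased in 100 days looks like the churners
print(predict((50, 100), features, labels))  # 1 -> likely to churn
```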

Model evaluation
After training a model, how do we know if it's any good? It's always good practice not to
allocate all of your historical data for training. Withheld data is called a test set and can
be used to evaluate the goodness of the model. In our example, we could ask the model
to predict whether a set of customers would churn, and then measure the accuracy of
our prediction.
For example, let's say we're testing our model on a test set made up of 1000
customers, where only 30 of the customers have actually churned. We put that test
data into our newly trained model, and it predicts that all the customers remain.

If we calculate the overall accuracy of that model, it technically has a high accuracy of
97%, because it was correct on 970 of the 1000 customers. This is despite never
correctly labeling a single churning customer. Checking both outcomes is important for
rare events: only by examining the accuracy of each label do we see that the model is
0% accurate at predicting churn when churn was the actual outcome. This model is not
useful in its current state, so we'll have to re-train it with different parameters or more
data.
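
The numbers in that example are easy to reproduce:

```python
# Reproduce the example: 1000 test customers, 30 actually churned,
# and a model that predicts "stays" for everyone.
actual = [1] * 30 + [0] * 970  # 1 = churned, 0 = stayed
predicted = [0] * 1000         # model never predicts churn

correct = sum(a == p for a, p in zip(actual, predicted))
overall_accuracy = correct / len(actual)

# Accuracy on the churned customers only (recall for the churn label)
churn_correct = sum(p == 1 for a, p in zip(actual, predicted) if a == 1)
churn_accuracy = churn_correct / 30

print(f"Overall: {overall_accuracy:.0%}  On churners: {churn_accuracy:.0%}")
```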

Clustering
Previously, we learned how to use Supervised Learning to make predictions based on
labeled data. In this lesson, we’ll cover another subset of machine learning called
clustering.

What is clustering?
Clustering is a set of machine learning algorithms that divide data into categories, called
clusters. Clustering can help us see patterns in messy datasets. Machine Learning
Scientists use clustering to divide customers into segments, images into categories, or
behaviors into typical and anomalous.
Supervised vs. unsupervised machine learning
Clustering is part of a broader category within Machine Learning called "Unsupervised
Learning". Unsupervised Learning differs from Supervised Learning in the structure of
the training data. While Supervised Learning uses data with features and labels,
Unsupervised Learning uses data with only features. This makes Unsupervised
Learning, and clustering, particularly appealing: you can use it even when you don't
know much about your dataset.

Case study: discovering new species


Let's say you are a botanist and you've been doing field work on a previously
unexplored island. Notably, you have several observations on these flowers you've
never seen before. You believe you might have discovered a couple new flower
species, but you're not sure how many and how to classify each flower. Let's see if
clustering can help.

1. Defining features
The first step is finding features. Luckily, you've been meticulous in your data gathering
and measured over 100 flowers. We can use your measurements as features for our
model. This is indeed an unsupervised learning problem because we have features but
we're not sure what species each flower belongs to or even how many new species
there are!

2. Defining number of clusters


Some clustering algorithms need us to define how many clusters we want to create. The
number of clusters we ask for greatly affects how the algorithm will segment our data.
Here's our flower data graphed over three features: petal width, sepal length, and
number of petals on the x,y, and z axes, respectively.
3. Comparing number of clusters
Here is how the algorithm divides the data if we ask for two clusters. One color
represents one cluster, in our case, one new flower species. And here is how it divides
the same data if we ask for three clusters.

4. Comparing number of clusters


And these are the results when we ask for four and eight clusters. We can tell intuitively
that eight is probably too many clusters, because there aren't that many clear-cut areas
in our graph.
Clustering won't tell us exactly how many clusters we have, but it can help us make an
informed decision. In your case, it seems like you've found three or four new species.
Having a strong hypothesis about our data helps us get better results from the
clustering algorithm. For example, you may know from your experience as a botanist
that petal width usually has wide variance within a species and shouldn't be given too
much weight. You can use this information to design a better clustering algorithm.
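
A minimal k-means implementation makes the "pick k, assign, recompute" loop concrete. The flower measurements below are invented, and a real analysis would use a library such as scikit-learn:

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Minimal k-means: assign points to nearest centroid, recompute."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep old centroid if a cluster empties out
                centroids[i] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, clusters

# Toy flower measurements: (petal width, sepal length) for two loose groups
flowers = [(0.2, 4.9), (0.3, 5.1), (0.25, 4.8),
           (1.8, 6.9), (1.9, 7.1), (2.0, 6.8)]
centroids, clusters = kmeans(flowers, k=2)
print(sorted(len(c) for c in clusters))  # two clusters of three flowers
```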

Clustering review
Let's review. Clustering is an unsupervised machine learning method that divides an
unlabeled dataset into different categories. In order to perform clustering, we must first
select relevant features of our dataset. Next, we select the number of clusters based on
hypotheses about our data. Finally, we use the results of our clustering to solve our
problem, whether that's defining new species, segmenting customers, or classifying
movies into genres. There are many diverse uses for clustering!
