Prediction Phase 4
COURSE INSTRUCTOR
Anam Shahid
Data science workflow
This chapter covers the final stage of the data science workflow, beginning with experiments.
Terminology Review
Sample size is the number of data points used in an experiment. In data science, you'll
often hear the question "Are your results significant?" A statistically significant result is
one that is probably not due to chance, given the statistical assumptions made.
Statistical tests help determine this, and there are many to choose from. Let's take a
closer look at A/B testing.
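As a sketch of what such a significance test can look like in code, here is a minimal two-proportion z-test in plain Python. The conversion counts are invented for illustration; a real analysis would typically use a statistics library rather than a hand-rolled test.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)      # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))    # two-sided normal tail
    return z, p_value

# Hypothetical A/B test: variant B converts 120/1000 vs. A's 100/1000.
z, p = two_proportion_z_test(100, 1000, 120, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

Here the p-value comes out around 0.15, so despite the apparent lift, this made-up result would not be called significant at the usual 5% level.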
The sample size depends on the sensitivity we want. A test's sensitivity tells us how
small a change in our metric we can detect. Larger sample sizes allow us to detect
smaller changes. You might think that we always want high sensitivity, but we actually
want to optimize for the smallest change that is meaningful for our question.
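The trade-off between sensitivity and sample size can be sketched with the standard approximation for a two-proportion A/B test. The baseline rate and the effect sizes below are hypothetical, and the formula assumes a conventional 5% significance level and 80% power.

```python
import math

def sample_size_per_group(p_baseline, mde, z_alpha=1.96, z_beta=0.84):
    """Approximate sample size per group for a two-proportion A/B test.

    p_baseline: current conversion rate
    mde: minimum detectable effect (smallest absolute change we care about)
    z_alpha, z_beta: normal quantiles for 5% significance and 80% power
    """
    variance = p_baseline * (1 - p_baseline)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Detecting a 1-point lift from a 10% baseline needs far more data
# than detecting a 5-point lift:
print(sample_size_per_group(0.10, 0.01))  # small change -> large sample
print(sample_size_per_group(0.10, 0.05))  # larger change -> smaller sample
```

Halving the detectable change roughly quadruples the required sample, which is why we aim for a meaningful effect size rather than maximum sensitivity.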
Predictive modeling
Predictive modeling is a sub-category of modeling used to make predictions. Once we
have modeled a process, we can feed the model new inputs and see what outcome it
predicts. For instance, entering a future date into a model of the unemployment rate
gives a prediction of what the rate will be next month.
The output can be the probability of an outcome, for example, the probability that a
tweet is fake.
Consider a subscription business with historical customer data. Some of those
customers will have maintained their subscription, while others will have churned. We
eventually want to be able to predict the label for each customer: churned or subscribed.
We'll need features to make this prediction. Features are different pieces of information
about each customer that might affect our label. For example, perhaps age, gender, the
date of last purchase, or household income will predict cancellations. The magic of
machine learning is that we can analyze many features all at once. We use these labels
and features to train a model to make predictions on new data.
Suppose we have a customer who may or may not churn soon. We can collect feature
data on this customer, such as age or date of last purchase, and our trained model will
give us a prediction. If the customer is not in danger of churning, we can count on their
revenue for another month! If they are in danger of churning, we can reach out to them
to try to keep them subscribed.
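The workflow above can be sketched with a toy 1-nearest-neighbour model in plain Python. The customers, the features (age and days since last purchase), and the labels are all invented for illustration, and a real project would typically use a library such as scikit-learn instead of hand-written code.

```python
def predict_churn(train, labels, customer):
    """Label a new customer with the label of the closest training point."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(range(len(train)), key=lambda i: dist(train[i], customer))
    return labels[nearest]

# (age, days since last purchase) for five historical customers
train = [(25, 10), (31, 12), (48, 90), (52, 120), (40, 15)]
labels = ["subscribed", "subscribed", "churned", "churned", "subscribed"]

print(predict_churn(train, labels, (50, 100)))  # resembles the churners
print(predict_churn(train, labels, (28, 8)))    # resembles the subscribers
```

The labeled historical customers play the role of training data; the new customer's features are all the model needs to produce a prediction.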
Model evaluation
After training a model, how do we know whether it's any good? It's good practice not to
allocate all of your historical data for training. The withheld data, called a test set, can
be used to evaluate the model. In our example, we could ask the model to predict
whether a set of customers would churn, and then measure the accuracy of those
predictions.
For example, let's say we test our model on a test set of 1000 customers, of whom only
30 have actually churned. We feed that test data into our newly trained model, and it
predicts that all the customers remain subscribed.
If we calculate the overall accuracy of that model, it technically scores a high 97%,
because it was correct on 970 of the 1000 customers. Yet it never correctly labels a
single churning customer. Checking both outcomes is important for rare events: only by
examining the accuracy of each label separately do we discover that the model is 0%
accurate at predicting churn when churn was the actual outcome. This model is not
useful in its current state, so we'd have to re-train it with different parameters or more
data.
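The gap between overall and per-label accuracy is easy to verify in code. This snippet reproduces the numbers from the example: 1000 test customers, 30 of whom churned, and a model that predicts "subscribed" for everyone.

```python
# 1000 test customers, 30 actually churned, and a naive model that
# predicts "subscribed" for every single one of them.
actual = ["churned"] * 30 + ["subscribed"] * 970
predicted = ["subscribed"] * 1000

correct = sum(a == p for a, p in zip(actual, predicted))
overall_accuracy = correct / len(actual)

churn_hits = sum(a == p == "churned" for a, p in zip(actual, predicted))
churn_accuracy = churn_hits / actual.count("churned")

print(f"overall accuracy: {overall_accuracy:.0%}")            # 97%
print(f"accuracy on churned customers: {churn_accuracy:.0%}")  # 0%
```

The per-label breakdown is what exposes the model's uselessness, which a single overall accuracy number hides.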
Clustering
Previously, we learned how to use Supervised Learning to make predictions based on
labeled data. In this lesson, we’ll cover another subset of machine learning called
clustering.
What is clustering?
Clustering is a set of machine learning algorithms that divide data into categories, called
clusters. Clustering can help us see patterns in messy datasets. Machine Learning
Scientists use clustering to divide customers into segments, images into categories, or
behaviors into typical and anomalous.
Supervised vs. unsupervised machine learning
Clustering is part of a broader category within Machine Learning called "Unsupervised
Learning". Unsupervised Learning differs from Supervised Learning in the structure of
the training data. While Supervised Learning uses data with features and labels,
Unsupervised Learning uses data with only features. This makes Unsupervised
Learning, and clustering, particularly appealing: you can use it even when you don't
know much about your dataset.
1. Defining features
The first step is defining features. Luckily, you've been meticulous in your data gathering
and measured over 100 flowers. We can use your measurements as features for our
model. This is indeed an unsupervised learning problem: we have features, but we don't
know what species each flower belongs to, or even how many species there are!
Clustering review
Let's review. Clustering is an Unsupervised Machine Learning method that divides an
unlabeled dataset into different categories. In order to perform clustering, we must first
select relevant features of our dataset. Next, we select the number of clusters based on
hypotheses about our data. Finally, we use the results of our clustering to solve our
problem, whether that's defining new species, segmenting customers, or classifying
movies into genres. Clustering has many diverse uses!
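The steps above can be sketched with a minimal k-means implementation in plain Python. The flower measurements (petal length and width) are made up for illustration, and a real analysis would typically reach for a library implementation such as scikit-learn's KMeans.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign points to the nearest centre, recompute centres."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centres[i])),
            )
            clusters[nearest].append(p)
        centres = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else centres[i]
            for i, cluster in enumerate(clusters)
        ]
    return centres, clusters

# Two made-up groups of flower measurements: (petal length, petal width)
points = [(1.4, 0.2), (1.3, 0.3), (1.5, 0.2), (4.7, 1.4), (4.5, 1.5), (4.9, 1.3)]
centres, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # each cluster recovers one group
```

Note that we chose k=2 ourselves: picking the number of clusters is exactly the "hypotheses about our data" step in the review above.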