Using Models To Explore

Uploaded by

anubavroshan
Copyright
© © All Rights Reserved
Using Models to Explore Your Data
• A model is something we construct to help us understand the
real world.
• A common example is the use of an animal which mimics a
human disease to help us understand and, hopefully, prevent
and/or treat the disease.
• The same concept applies to a set of data: presumably you
are using the data to understand the real world.
• In the world of politics, a pollster has a dataset on a sample of
likely voters, and the pollster’s job is to use this sample to
predict the election outcome.
• A data analyst uses the polling data to construct a model to
predict what will happen on election day.
• The process of building a model involves imposing a specific
structure on the data and creating a summary of the data.
• In the polling data example, you may have thousands of
observations, so the model is a mathematical equation that
reflects the shape or pattern of the data, and the equation
allows you to summarize the thousands of observations with,
for example, one number, which might be the percentage of
voters who will vote for your candidate.
• A statistical model serves two key purposes in a data analysis:
to provide a quantitative summary of your data, and to impose a
specific structure on the population from which the data were
sampled.
• Imagine you wanted to conduct a survey of 20
people to ask them how much they’d be willing to
spend on a product you’re developing.
• What is the goal of this survey? Probably, if you’re
spending time and money developing a new product,
you believe that there is a large population of people
out there who are willing to buy this product.
• However, it’s far too costly and complicated to ask
everyone in that population what they’d be willing to
pay.
• So you take a sample from that population to get a
sense of what the population would pay.
• One of us (Roger) recently published a book titled R
Programming for Data Science.
• Before the book was published, interested readers could
submit their name and email address to the book’s web site
to be notified about the book’s publication.
• In addition, there was an option to specify how much they’d
be willing to pay for the book.
• Below is a random sample of 20 responses from people who
volunteered this information.
• 25 20 15 5 30 7 5 10 12 40 30 30 10 25 10 20 10
10 25 5
• “What do the data say?” One thing you could do is simply
hand over the data—all 20 numbers.
• The first key element of a statistical model is data reduction.
• The basic idea is you want to take the original set of numbers
consisting of your dataset and transform them into a smaller set
of numbers.
• The process of data reduction typically ends up with a statistic.
• A statistic is any summary of the data. The sample mean, or
average, is a statistic.
• So is the median, the standard deviation, the maximum, the
minimum, and the range.
• Some statistics are more or less useful than others but they are
all summaries of the data.
• The simplest data reduction you can produce is the mean, or the
simple arithmetic average, of the data, which in this case is
$17.2.
• Going from 20 numbers to 1 number is about as much
reduction as you can do.
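To make data reduction concrete, here is a minimal sketch in R that computes the statistics named above from the 20 quoted responses. The variable names `x` and `stats` are illustrative choices, not part of the original slides.

```r
# The 20 sampled responses (in dollars) quoted in the slides
x <- c(25, 20, 15, 5, 30, 7, 5, 10, 12, 40,
       30, 30, 10, 25, 10, 20, 10, 10, 25, 5)

# Each statistic reduces the 20 numbers to a single number
stats <- list(
  mean   = mean(x),
  median = median(x),
  sd     = sd(x),            # sample standard deviation (n - 1 denominator)
  min    = min(x),
  max    = max(x),
  range  = diff(range(x))    # max minus min
)

str(stats)
```

All six values are summaries of the same data; which one is most useful depends on the question being asked.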
1. Models as Expectations:
• A simple summary statistic, such as the mean of a set of
numbers, is not enough to formulate a model.
• A statistical model must also impose some structure on the
data.
• A statistical model provides a description of how the world
works and how the data were generated.
• The model is essentially an expectation of the relationships
between various factors in the real world and in your dataset.
Applying the Normal model:
• Perhaps the most popular statistical model in the world is the
Normal model.
• This model says that the randomness in a set of data can be
explained by the Normal distribution, or a bell-shaped curve.
• The Normal distribution is fully specified by two parameters—
the mean and the standard deviation.
• To apply the Normal model to this dataset, we just need to calculate the
mean and standard deviation.
• In this case, the mean is $17.2 and the standard deviation is $10.39.
• Given those parameters, our expectation under the Normal model is that
the distribution of prices that people are willing to pay looks something
like this.
• According to the model, about 68% of the population would be willing to
pay somewhere between $6.81 and $27.59 for this new product (the mean plus
or minus one standard deviation). Whether that is useful information or not
depends on the specifics of the situation.
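The numbers in this slide can be reproduced with a short sketch in R. It assumes the 20 responses quoted earlier are stored in `x`; the names `mu`, `sigma`, `interval`, and `coverage` are illustrative.

```r
# The 20 sampled responses (in dollars) quoted in the slides
x <- c(25, 20, 15, 5, 30, 7, 5, 10, 12, 40,
       30, 30, 10, 25, 10, 20, 10, 10, 25, 5)

# The Normal model is fully specified by these two parameters
mu    <- mean(x)
sigma <- sd(x)

# One standard deviation either side of the mean
interval <- c(lower = mu - sigma, upper = mu + sigma)
print(round(interval, 2))

# Share of the population the Normal model places inside that interval
coverage <- pnorm(mu + sigma, mean = mu, sd = sigma) -
  pnorm(mu - sigma, mean = mu, sd = sigma)
print(coverage)
```

The 68% figure is a property of the Normal distribution itself, not of this dataset: any Normal model puts roughly that share within one standard deviation of its mean.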
• You can use the statistical model to answer more complex questions if you
want.
• For example, suppose you wanted to know “What proportion of
the population would be willing to pay more than $30 for this
book?”
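x <- c(25, 20, 15, 5, 30, 7, 5, 10, 12, 40, 30, 30, 10, 25, 10, 20, 10, 10, 25, 5)  # the 20 responses
# Upper-tail probability P(price > 30) under the fitted Normal model
pnorm(30, mean = mean(x), sd = sd(x), lower.tail = FALSE)
[1] 0.1089893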
• So about 11% of the population would be willing to pay more than
$30 for the product.
• Again, whether this is useful to you depends on your specific goals.
• Notice that we used the data to draw the picture (to calculate the mean and
standard deviation of the Normal distribution), but ultimately the
data do not appear directly in the plot.
• In this case we are using the Normal distribution to tell us what
the population looks like, not what the data look like.
• The Normal distribution is our expectation for what the data
should look like.
2. Comparing Model Expectations to Reality:
• How do we know if our expectations match with reality?
Drawing a fake picture
• To begin with, we can make some pictures, like a histogram of the data.
• For the fake picture, we simulated 20 data points from a Normal distribution
and overlaid the theoretical Normal curve on top of the histogram.
• If the population followed roughly a Normal distribution, and
the data were a random sample from that population, then
the distribution estimated by the histogram should look like
the theoretical model provided by the Normal distribution.
• When the histogram and the curve agree in this way, the Normal
distribution is a good statistical model for the data.
• One caveat: the Normal distribution allows for negative values, but we
don’t really expect that people will say that they’d be willing to pay
negative dollars for a book.
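A fake picture of this kind can be sketched in R as follows. The slides do not say which parameters or seed were used for the simulation, so the mean of 17.2, the standard deviation of 10.39, and the seed below are assumptions chosen only for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Assumed parameters for the simulated population (not from the slides)
sim_mean <- 17.2
sim_sd   <- 10.39

# Draw 20 values from a Normal distribution: the "fake" sample
sim <- rnorm(20, mean = sim_mean, sd = sim_sd)

# Histogram on the density scale so it is comparable with the curve
hist(sim, freq = FALSE,
     main = "Simulated sample vs. theoretical Normal curve",
     xlab = "Price (dollars)")
curve(dnorm(x, mean = sim_mean, sd = sim_sd), add = TRUE, col = "blue")

# Share of the population that this Normal model places below zero
p_negative <- pnorm(0, mean = sim_mean, sd = sim_sd)
print(p_negative)
```

With only 20 points, even a sample that truly comes from a Normal distribution gives a fairly ragged histogram, which is worth keeping in mind when judging the match by eye.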
The real picture:
• The real picture is a histogram of the data from the sample of 20
respondents, with the Normal curve (drawn in blue) overlaid.
• At first glance, the histogram and the Normal distribution don’t match
very well.
• The histogram has a large spike around $10, a feature
that is not present in the blue curve.
• Also, the Normal distribution allows for negative values
on the left-hand side of the plot, but there are no data
points in that region of the plot.
• So the Normal model isn’t really a very good
representation of the population, given the
data that we sampled from the population.
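The comparison described above can be sketched in R by drawing the real histogram with the fitted Normal curve and putting observed and expected proportions side by side. The $5 bin width and the names `h` and `comparison` are assumptions made for this sketch.

```r
# The 20 sampled responses (in dollars) quoted in the slides
x <- c(25, 20, 15, 5, 30, 7, 5, 10, 12, 40,
       30, 30, 10, 25, 10, 20, 10, 10, 25, 5)
mu    <- mean(x)
sigma <- sd(x)

# Histogram of the real data on the density scale (assumed $5 bins)
h <- hist(x, freq = FALSE, breaks = seq(0, 40, by = 5),
          main = "Observed prices vs. fitted Normal curve",
          xlab = "Price (dollars)")
curve(dnorm(p, mean = mu, sd = sigma), xname = "p", add = TRUE, col = "blue")

# Observed share of responses per bin vs. share expected under the model
comparison <- data.frame(
  bin_upper = h$breaks[-1],
  observed  = h$counts / length(x),
  expected  = diff(pnorm(h$breaks, mean = mu, sd = sigma))
)
print(comparison)

# The model puts some mass below zero; the data put none there
model_below_zero <- pnorm(0, mean = mu, sd = sigma)
data_below_zero  <- mean(x < 0)
```

The bin ending at $10 holds far more responses than the model expects, which is the spike visible in the histogram.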
3. Reacting to Data: Refining Our Expectations:
• The model and the data don’t match very well, as
was indicated by the histogram above.
• So what do we do? Well, we can either
1. Get a different model; or
2. Get different data
• Or we could do both.
• We will choose a different statistical model to represent
the population, the Gamma distribution.
• This distribution has the feature that it only allows
positive values, so it eliminates the problem we had
with negative values with the Normal distribution.
• With the new model, we do the following:
• 1. Develop expectations: draw a fake picture. What
do we expect to see before looking at the data?
• 2. Compare our expectations to the data
• 3. Refine our expectations, given what the data show
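One way to try the Gamma model is sketched below in R. The slides do not say how the Gamma parameters were estimated, so this sketch uses method-of-moments estimates (shape = mean²/variance, rate = mean/variance) as an assumption; a maximum-likelihood fit would give somewhat different values.

```r
# The 20 sampled responses (in dollars) quoted in the slides
x <- c(25, 20, 15, 5, 30, 7, 5, 10, 12, 40,
       30, 30, 10, 25, 10, 20, 10, 10, 25, 5)

# Method-of-moments estimates (an assumed fitting method for illustration)
m <- mean(x)
v <- var(x)
shape <- m^2 / v
rate  <- m / v

# Compare expectations to the data: histogram with the Gamma density
hist(x, freq = FALSE, breaks = seq(0, 40, by = 5),
     main = "Observed prices vs. fitted Gamma curve",
     xlab = "Price (dollars)")
curve(dgamma(p, shape = shape, rate = rate), xname = "p",
      add = TRUE, col = "blue")

# The Gamma model places no probability on negative prices
p_negative_gamma <- pgamma(0, shape = shape, rate = rate)

# Same question as before: share willing to pay more than $30
p_over_30_gamma <- pgamma(30, shape = shape, rate = rate, lower.tail = FALSE)
print(c(shape = shape, rate = rate, p_over_30 = p_over_30_gamma))
```

By construction, the fitted Gamma has the same mean and variance as the sample, so any difference from the Normal model comes from the shape of the distribution rather than its center or spread.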
4. Examining Linear Relationships:
• It’s common to look at data and try to understand linear relationships
between variables of interest. The most common statistical
technique to help with this task is linear
regression.
• The same three principles apply to linear regression:
1. developing expectations,
2. comparing our expectations to data,
3. refining our expectations.
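The three steps can be sketched for linear regression in R as follows. The slides provide no regression dataset, so the data here are simulated; the variable names, the seed, and the true intercept and slope are all hypothetical choices for illustration.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Simulated data (hypothetical): true intercept 1, true slope 2, Normal noise
n         <- 100
predictor <- runif(n, min = 0, max = 10)
response  <- 1 + 2 * predictor + rnorm(n, sd = 1)

# 1. Develop expectations: we expect a straight line with slope near 2
# 2. Compare expectations to data: fit the line and plot it over the points
fit <- lm(response ~ predictor)
print(coef(fit))

plot(predictor, response, main = "Simulated data with fitted line")
abline(fit, col = "blue")

# 3. Refine expectations: inspect residuals for leftover structure
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)
```

If the residual plot shows a curve or a funnel shape rather than a patternless band around zero, that is a sign the linear model needs refining.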
5. When Do We Stop?
• In some cases, a single iteration may be sufficient, but in
most real-life cases, you’ll need to iterate at least a few
times.
• Depending on your circumstances, you might be able to iterate
over and over again.
• Every answer will usually raise more questions and
require further digging into the data.
• When exactly do you stop the process then? Statistical
theory suggests a number of different approaches to
determining when a statistical model is “good enough”
and fits the data well.
• There are also a few high-level criteria for deciding when you
might consider stopping the data analysis iteration.
Summary:
• First set your expectations for how a model
should characterize a dataset before you actually
apply a model to data.
• Then you can check to see how your model
conforms to your expectation.
• Often, there will be features of the dataset that do
not conform to your model and you will have to
either refine your model or examine the data
collection process.
