Question Set: Varian, "Big Data: New Tricks for Econometrics"
Current Issues
Varian, Hal: "Big Data: New Tricks for Econometrics"
P2.T9.902. Big data techniques including machine learning
P2.T9.802. Big Data: New Tricks for Econometrics by Hal Varian
902.1. About the analysis of big data, Hal Varian says "Conventional statistical and econometric
techniques such as regression often work well, but there are issues unique to big datasets that
may require different tools. First, the sheer size of the data involved may require more powerful
data manipulation tools. Second, we may have more potential predictors than appropriate for
estimation, so we need to do some kind of variable selection. Third, large datasets may allow for
more flexible relationships than simple linear models. Machine learning techniques such as
decision trees, support vector machines, neural nets, deep learning, and so on may allow for
more effective ways to model complex relationships." Which of the following statements is TRUE?
a) Excel remains the best database for Big Data because it contains fully 2^20 rows
b) The goal of machine learning is to develop good in-sample predictions but such methods
are not helpful if the data is "too fat" or "too tall"
c) NoSQL databases are table-based relational databases that are more sophisticated than
(i.e., "less primitive than") structured query language
d) Data analysis includes four categories: prediction (a primary concern of machine
learning), summarization, estimation, and hypothesis testing
902.2. Hal Varian introduces "trees" as non-linear methods that are effective alternatives to
linear or logistic regression for prediction. Classification trees, typically built as binary trees
(i.e., two branches at each node), are used for discrete outcomes, while regression trees handle
continuous dependent variables. In regard to some of the different tools and techniques for
manipulating and analyzing big data, each of the following statements is true EXCEPT which is
inaccurate?
a) Random forests is a technique that uses multiple classification and/or regression trees
b) The primary drawback of trees is that, because they lack methods for coping with
missing values, trees require all observations in the dataset to be complete cases
c) Trees sometimes do not work well when the underlying relationship is linear, but on the
other hand they tend to thrive when there are important non-linear relationships and
interactions
d) Elastic net regression adds a penalty term to the sum of squared residuals in a
multivariate regression model such that it includes the special case of ordinary least
squares (OLS) when the penalty term equals zero (see the brief sketch following these choices)
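In regard to choice (d), here is a minimal sketch of the elastic net's OLS special case. It is not taken from the reading: the toy data, variable names, and scikit-learn usage are illustrative assumptions, and scikit-learn's alpha/l1_ratio parameterization is its own convention rather than the reading's notation. With a (near-)zero penalty the elastic net coefficients coincide with OLS; with a positive penalty they are shrunk toward zero.

import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                        # 200 observations, 5 predictors
beta_true = np.array([1.0, 0.5, 0.0, -2.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)
# A (near-)zero penalty reduces the elastic net objective to ordinary least squares
enet_no_penalty = ElasticNet(alpha=1e-8, l1_ratio=0.5, max_iter=100_000).fit(X, y)
# A positive penalty shrinks the coefficients (the lasso part can set some exactly to zero)
enet_penalized = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)

print(np.round(ols.coef_, 3))               # OLS estimates
print(np.round(enet_no_penalty.coef_, 3))   # essentially identical to the OLS estimates
print(np.round(enet_penalized.coef_, 3))    # shrunk toward zero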
902.3. In regard to areas of potential collaboration between econometrics and machine learning,
according to Hal Varian each of the following statements is true EXCEPT which is inaccurate?
a) In big datasets, model uncertainty tends to be small but sampling uncertainty tends to be
quite large
b) Machine learning tends to find that averaging over many small models tends to give
better out-of-sample prediction than choosing a single model
c) In order to model the average treatment effect as a function of other variables, we
typically need to model both the observed difference in outcome and the selection bias
d) Prediction methods can assist with the thorny problem of estimating causation; for
example, Bayesian Structural Time Series (BSTS) is a machine learning technique that
can be used to forecast a counterfactual and estimate the causal effect of certain
variables
Answers:
902.1. D. True: Data analysis includes four categories: prediction (a primary concern of
machine learning), summarization, estimation, and hypothesis testing
Writes Hal Varian: "Data analysis in statistics and econometrics can be broken down into four
categories: 1) prediction, 2) summarization, 3) estimation, and 4) hypothesis testing. Machine
learning is concerned primarily with prediction; the closely related field of data mining is also
concerned with summarization, and particularly with finding interesting patterns in the data.
Econometricians, statisticians, and data mining specialists are generally looking for insights that
can be extracted from the data. Machine learning specialists are often primarily concerned with
developing high-performance computer systems that can provide useful predictions in the
presence of challenging computational constraints. Data science, a somewhat newer term, is
concerned with both prediction and summarization, but also with data manipulation,
visualization, and other similar tasks. Note that terminology is not standardized in these areas,
so these descriptions reflect general usage, not hard-and-fast definitions. Other terms used to
describe computer-assisted data analysis include knowledge extraction, information discovery,
information harvesting, data archaeology, data pattern processing, and exploratory data
analysis."
902.2. B. is FALSE because classification and regression trees are good at handling
incomplete cases (i.e., observations with missing values): several methods exist for
coping with missing values.
902.3. A. False. Instead, says Hal Varian, "In this period of big data, it seems strange to
focus on sampling uncertainty, which tends to be small with large datasets, while
completely ignoring model uncertainty, which may be quite large. One way to address
this is to be explicit about examining how parameter estimates vary with respect to
choices of control variables and instruments."
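To make this contrast concrete, here is a minimal sketch (simulated data; the statsmodels calls and variable names are assumptions for illustration, not part of the reading). With 100,000 observations the standard error on the coefficient of interest (sampling uncertainty) is tiny, yet the estimate swings from roughly 2.0 to roughly 1.0 depending on whether a plausible control variable is included (model uncertainty).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100_000                        # "big data": sampling uncertainty will be tiny
z = rng.normal(size=n)             # a candidate control variable (confounder)
x = 0.8 * z + rng.normal(size=n)   # the variable of interest, correlated with z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)

for regressors, label in [(np.column_stack([x]), "x only"),
                          (np.column_stack([x, z]), "x plus control z")]:
    fit = sm.OLS(y, sm.add_constant(regressors)).fit()
    print(label, ": beta_x = %.3f, se = %.4f" % (fit.params[1], fit.bse[1]))

# Both specifications report standard errors of only a few thousandths, but the
# estimate of beta_x moves from about 1.98 to about 1.00 once z is controlled for:
# the model uncertainty dwarfs the sampling uncertainty.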
In regard to true (D), see the mini case study concerning the estimation of a causal
effect of advertising on sales (page 22) and the article's several mentions of Bayesian
Structural Time Series (BSTS); e.g., "The ideal way to estimate advertising effectiveness
is, of course, to run a controlled experiment. In this case the control group provides an
estimate of the counterfactual: what would have happened without ad exposures. But
this ideal approach can be quite expensive, so it is worth looking for alternative ways to
predict the counterfactual. One way to do this is to use the Bayesian Structural Time
Series (BSTS) method described earlier."
P2.T9.802. Big Data: New Tricks for Econometrics by Hal Varian
Learning objectives: Describe the issues unique to big datasets. Explain and assess
different tools and techniques for manipulating and analyzing big data. Examine the
areas for collaboration between econometrics and machine learning.
802.1. Below is Hal Varian's simple classification tree that predicts Titanic survivors (Figure 1 in
the reading).
According to the tree, each of the following statements is true EXCEPT which is inaccurate?
802.2. With respect to tools and techniques for manipulating and analyzing big data, each of the
following statements is true EXCEPT which is false?
802.3. As an illustrative example of the "most important area for collaboration" between
econometrics and machine learning, Hal Varian considers a case study: the relationship
between advertising campaigns and website visits. With respect to this case study, which of the
following BEST summarizes the key insight that illustrates a collaboration between
econometrics and machine learning?
a) The study substitutes a predictive model for a conventional control group in order to
demonstrate causality
b) The study employs machine learning in order to generate a model with a higher multiple
coefficient of determination
c) The study borrows from econometrics in a way that better generates exploratory data
analysis (EDA) and renders the complex relationships easier to understand
d) A BSTS model directly forecasts the beta coefficient of advertising spend as an
explanatory variable; then econometric methods are employed to overlay time-series
covariates
Answers:
802.1. B. False. Passengers in First or Second Class (i.e., Class < 2.5) who are younger
than 16 live, but all Third Class passengers (including the young) do not survive;
although, among Third Class, this prediction misclassifies 1 - 370/501 = 26.1% of the group.
In regard to (A), (C) and (D), each is TRUE.
In regard to true (A), 501 + 36 + 233 + 276 = 1,046 (the raw data file count is 1,309 but
there are only 1,046 complete cases) and the two features (aka, predictors) are Class
(1st, 2nd, or 3rd) and Age.
In regard to true (C), 1 - 723/1,046 = 30.9%
In regard to true (D), in terms of the tree, First Class passengers are located in either of
the two "lived" nodes; Young Second Class passengers live, but old Second Class
passengers do not live.
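As a quick arithmetic check of the figures cited above, here is a sketch that uses only the group counts quoted in this answer (not the underlying Titanic data file):

group_sizes = [501, 36, 233, 276]      # the four terminal groups cited for true (A)
complete_cases = sum(group_sizes)       # = 1,046 of the 1,309 raw records
error_third_class = 1 - 370 / 501       # = 26.1%, the misclassification cited for false (B)
rate_for_c = 1 - 723 / 1_046            # = 30.9%, the percentage cited for true (C)
print(complete_cases, f"{error_third_class:.1%}", f"{rate_for_c:.1%}")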
802.2. C. is inaccurate. Rather, the inverse is true: random forests are something of a
black box but their performance is generally superior!
According to Varian, "Random Forests: Random forests is a technique that uses multiple
trees. A typical procedure uses the following steps: 1. Choose a bootstrap sample of the
observations and start to grow a tree; 2. At each node of the tree, choose a random sample of
the predictors to make the next decision. Do not prune the trees; 3. Repeat this process many
times to grow a forest of trees; 4. In order to determine the classification of a new observation,
have each tree make a classification and use a majority vote for the final prediction ... This
method produces surprisingly good out-of-sample fits, particularly with highly nonlinear data. In
fact, Howard and Bowles (2012) claim 'ensembles of decision trees (often known as Random
Forests) have been the most successful general-purpose algorithm in modern times.' ... One
defect of random forests is that they are a bit of a black box—they don’t offer simple summaries
of relationships in the data. As we have seen earlier, a single tree can offer some insight about
how predictors interact. But a forest of a thousand trees cannot be easily interpreted. However,
random forests can determine which variables are 'important' in predictions in the sense of
contributing the biggest improvements in prediction accuracy."
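Here is a minimal sketch of the four-step procedure Varian describes, using scikit-learn's random forest; the synthetic dataset and parameter choices are illustrative assumptions, not part of the reading.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A synthetic classification problem with a few informative and several noise predictors
X, y = make_classification(n_samples=2_000, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Steps 1-3: each tree is grown (unpruned) on a bootstrap sample of the observations,
# and at each node only a random subset of the predictors is considered (max_features)
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            bootstrap=True, random_state=0).fit(X_train, y_train)

# Step 4: a new observation is classified by majority vote across the 500 trees
print("out-of-sample accuracy:", rf.score(X_test, y_test))

# Variable-importance scores partially address the "black box" criticism noted above
print("feature importances:", np.round(rf.feature_importances_, 3))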
In regard to true (B) and (D), see below.
Varian: "General Considerations for Prediction: Our goal with prediction is typically to get good
out-of-sample predictions. Most of us know from experience that it is all too easy to construct a
predictor that works well in-sample but fails miserably out-of-sample. To take a trivial example,
(n) linearly independent regressors will fit (n) observations perfectly but will usually have poor
out-of-sample performance. Machine learning specialists refer to this phenomenon as the
'overfitting problem' and have come up with several ways to deal with it.
First, since simpler models tend to work better for out-of-sample forecasts, machine learning
experts have come up with various ways to penalize models for excessive complexity. In the
machine learning world, this is known as 'regularization,' and we will describe some examples
below. Economists tend to prefer simpler models for the same reason, but have not been as
explicit about quantifying complexity costs.
Second, it is conventional to divide the data into separate sets for the purpose of training,
testing, and validation. You use the training data to estimate a model, the validation data to
choose your model, and the testing data to evaluate how well your chosen model performs.
(Often validation and testing sets are combined.)
... The test-train cycle and cross-validation are very commonly used in machine learning and, in
my view, should be used much more in economics, particularly when working with large
datasets. For many years, economists have reported in-sample goodness-of-fit measures using
the excuse that we had small datasets. But now that larger datasets have become available,
there is no reason not to use separate training and testing sets. Cross-validation also turns out
to be a very useful technique, particularly when working with reasonably large data. It is also a
much more realistic measure of prediction performance than measures commonly used in
economics."
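A minimal sketch of the two ideas in this passage, under assumed simulated data and scikit-learn calls that are not part of the reading: n linearly independent regressors fit n observations perfectly in-sample yet predict poorly out-of-sample, and a held-out test set (or cross-validation) exposes the overfitting.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 30
X = rng.normal(size=(n, n))           # n linearly independent regressors, n observations
y = X[:, 0] + rng.normal(size=n)      # only the first regressor actually matters

overfit_model = LinearRegression().fit(X, y)
print("in-sample R^2:", overfit_model.score(X, y))             # effectively 1.0 (perfect fit)

X_new = rng.normal(size=(1_000, n))                            # fresh "test" data
y_new = X_new[:, 0] + rng.normal(size=1_000)
print("out-of-sample R^2:", overfit_model.score(X_new, y_new)) # poor, typically negative

# Cross-validation on a parsimonious one-regressor model gives a far more honest
# (and far better) estimate of prediction performance
cv_scores = cross_val_score(LinearRegression(), X[:, :1], y, cv=5)
print("5-fold cross-validated R^2, one-regressor model:", cv_scores.mean())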
802.3. A. TRUE: The study substitutes a predictive model for a conventional control
group in order to demonstrate causality. Rather than a conventional control group, the case
study employs a machine learning time series method (i.e., Bayesian Structural Time Series,
BSTS) in order to PREDICT website visits without advertising spend; aka, the "as if"
counterfactual. In this way, an experiment is SIMULATED rather than explicitly conducted such
that causal inferences can be drawn; e.g., advertising has a significant causal impact on website
visits. In general, to establish causation (rather than correlation), an experiment is required.
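To illustrate the counterfactual logic, here is a deliberately simplified sketch: a plain regression fitted on the pre-campaign period stands in for BSTS, and the simulated series, the week-100 campaign start, and the true lift of 25 visits are assumptions for illustration only.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
weeks = 120
control = 100 + np.cumsum(rng.normal(size=weeks))     # a related series never exposed to the ads
visits = 50 + 0.8 * control + rng.normal(size=weeks)  # website visits that track the control series
visits[100:] += 25                                     # ad campaign starts at week 100 (true lift = 25)

# Fit a predictive model on the pre-campaign period only
pre, post = slice(0, 100), slice(100, weeks)
model = LinearRegression().fit(control[pre].reshape(-1, 1), visits[pre])

# Forecast the counterfactual: visits "as if" no campaign had been run
counterfactual = model.predict(control[post].reshape(-1, 1))
estimated_lift = (visits[post] - counterfactual).mean()
print("estimated causal effect per week:", round(estimated_lift, 1))   # close to the true lift of 25

Varian's actual example relies on BSTS (as in Google's CausalImpact), which also models trend and seasonal components and attaches posterior uncertainty to the counterfactual; the simple regression above captures only the "predict the counterfactual, then compare" logic.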