
P2.T9.

Current Issues

Varian, Hal: “Big Data: New Tricks for Econometrics”

Bionic Turtle FRM Practice Questions


By David Harper, CFA FRM CIPM
www.bionicturtle.com
Varian, Hal: “Big Data: New Tricks for Econometrics”
P2.T9.902. BIG DATA TECHNIQUES INCLUDING MACHINE LEARNING
P2.T9.802. BIG DATA: NEW TRICKS FOR ECONOMETRICS BY HAL VARIAN


P2.T9.902. Big data techniques including machine learning


Learning objectives: Describe the issues unique to big datasets. Explain and assess
different tools and techniques for manipulating and analyzing big data. Examine the
areas for collaboration between econometrics and machine learning.

902.1. About the analysis of big data, Hal Varian says "Conventional statistical and econometric
techniques such as regression often work well, but there are issues unique to big datasets that
may require different tools. First, the sheer size of the data involved may require more powerful
data manipulation tools. Second, we may have more potential predictors than appropriate for
estimation, so we need to do some kind of variable selection. Third, large datasets may allow for
more flexible relationships than simple linear models. Machine learning techniques such as
decision trees, support vector machines, neural nets, deep learning, and so on may allow for
more effective ways to model complex relationships."

Further, according to Hal Varian, which of the following statements is TRUE?

a) Excel remains the best database for Big Data because it contains fully 2^18 rows
b) The goal of machine learning is to develop good in-sample predictions but such methods
are not helpful if the data is "too fat" or "too tall"
c) NoSQL databases are table-based relational databases that are more sophisticated than
(i.e., "less primitive than") structured query language
d) Data analysis includes four categories: prediction (a primary concern of machine
learning), summarization, estimation, and hypothesis testing

902.2. Hal Varian introduces "trees" as non-linear methods that are effective alternatives to
linear or logistic regression for prediction. Classification trees (e.g., binary trees with two
branches at each node) are used for discrete outcomes, while regression trees handle
continuous dependent variables. In regard to some of the different tools and techniques for
manipulating and analyzing big data, each of the following statements is true EXCEPT which
is inaccurate?

a) Random forests is a technique that uses multiple classification and/or regression trees
b) The primary drawback of trees is that, because they lack methods for coping with
missing values, trees require all observations in the dataset to be complete cases
c) Trees sometimes do not work well when the underlying relationship is linear, but on the
other hand they tend to thrive when there are important non-linear relationships and
interactions
d) Elastic net regression adds a penalty term to the sum of squared residuals in a
multivariate regression model such that it includes the special case of ordinary least
squares (OLS) when the penalty term equals zero

902.3. In regard to areas of potential collaboration between econometrics and machine learning,
according to Hal Varian each of the following statements is true EXCEPT which is inaccurate?

a) In big datasets, model uncertainty tends to be small but sampling uncertainty tends to be
quite large
b) Machine learning tends to find that averaging over many small models tends to give
better out-of-sample prediction than choosing a single model
c) In order to model the average treatment effect as a function of other variables, we
typically need to model both the observed difference in outcome and the selection bias
d) Prediction methods can assist with the thorny problem of estimating causation; for
example, Bayesian Structural Time Series (BSTS) is a machine learning technique that
can be used to forecast a counterfactual and estimate the causal effect of certain
variables

Answers:

902.1. D. True: Data analysis includes four categories: prediction (a primary concern of
machine learning), summarization, estimation, and hypothesis testing

Writes Hal Varian: "Data analysis in statistics and econometrics can be broken down into four
categories: 1) prediction, 2) summarization, 3) estimation, and 4) hypothesis testing. Machine
learning is concerned primarily with prediction; the closely related field of data mining is also
concerned with summarization, and particularly with finding interesting patterns in the data.
Econometricians, statisticians, and data mining specialists are generally looking for insights that
can be extracted from the data. Machine learning specialists are often primarily concerned with
developing high-performance computer systems that can provide useful predictions in the
presence of challenging computational constraints. Data science, a somewhat newer term, is
concerned with both prediction and summarization, but also with data manipulation,
visualization, and other similar tasks. Note that terminology is not standardized in these areas,
so these descriptions reflect general usage, not hard-and-fast definitions. Other terms used to
describe computer-assisted data analysis include knowledge extraction, information discovery,
information harvesting, data archaeology, data pattern processing, and exploratory data
analysis."

In regard to (A), (B) and (C), each is FALSE.


 In regard to false (A) and (C): Economists have historically dealt with data that fits in a
spreadsheet, but that is changing as new more-detailed data becomes available (see
Einav and Levin 2013, for several examples and discussion). If you have more than a
million or so rows in a spreadsheet, you probably want to store it in a relational
database, such as MySQL. Relational databases offer a flexible way to store,
manipulate, and retrieve data using a Structured Query Language (SQL), which is easy
to learn and very useful for dealing with medium-sized datasets. However, if you have
several gigabytes of data or several million observations, standard relational databases
become unwieldy. Databases to manage data of this size are generically known as
“NoSQL” databases. The term is used rather loosely, but is sometimes interpreted as
meaning “not only SQL.” NoSQL databases are more primitive than SQL databases in
terms of data manipulation capabilities but can handle larger amounts of data.
 In regard to false (B): In machine learning, the x-variables are usually called
“predictors” or “features.” The focus of machine learning is to find some function that
provides a good prediction of y as a function of x. Historically, most work in machine
learning has involved cross-section data where it is natural to think of the data being
independent and identically distributed (IID) or at least independently distributed. The
data may be “fat,” which means lots of predictors relative to the number of observations,
or “tall” which means lots of observations relative to the number of predictors ... Our goal
with prediction is typically to get good out-of-sample predictions.
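
The first bullet above describes the SQL workflow for medium-sized datasets. As a minimal, hedged illustration, Python's built-in sqlite3 module can stand in for a server such as MySQL; the table name and columns below are invented for the example, not taken from the reading.

```python
# Illustrative sketch only: a tiny SQL workflow via Python's standard-library sqlite3.
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database
con.execute("CREATE TABLE visits (day TEXT, source TEXT, n INTEGER)")
con.executemany("INSERT INTO visits VALUES (?, ?, ?)",
                [("2014-01-01", "search", 120),
                 ("2014-01-01", "ads", 45),
                 ("2014-01-02", "search", 130)])

# SQL makes storage, manipulation, and retrieval straightforward, e.g. visits by traffic source:
for source, total in con.execute("SELECT source, SUM(n) FROM visits GROUP BY source"):
    print(source, total)
```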

902.2. B. is FALSE because classification and regression trees are good at handling
incomplete cases (i.e., observations with missing values): several methods exist for
coping with missing values.

In regard to (A), (C) and (D), each is TRUE.


 In regard to true (A), "Random Forests is a technique that uses multiple trees. A typical
procedure uses the following steps: 1. Choose a bootstrap sample of the observations
and start to grow a tree; 2. At each node of the tree, choose a random sample of the
predictors to make the next decision. Do not prune the trees; 3. Repeat this process
many times to grow a forest of trees; 4. In order to determine the classification of a new
observation, have each tree make a classification and use a majority vote for the final
prediction ... This method produces surprisingly good out-of-sample fits, particularly with
highly nonlinear data. In fact, Howard and Bowles (2012) claim ensembles of decision
trees (often known as ‘Random Forests’) have been the most successful general-
purpose algorithm in modern times.”
 In regard to false (B) and true (C), "Trees tend to work well for problems where there
are important nonlinearities and interactions ... Trees also handle missing data well ...
Interestingly enough, trees tend not to work very well if the underlying relationship really
is linear, but there are hybrid models such as RuleFit (Friedman and Popescu 2005) that
can incorporate both tree and linear relationships among variables."
 In regard to true (D), see the section on Variable Selection > LASSO and Friends
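
For reference on (D), one common way to write the elastic net objective (a general sketch of the form; not necessarily Varian's exact notation) is:

```latex
\min_{\beta}\; \sum_{i=1}^{n} \left(y_i - x_i^{\top}\beta\right)^{2}
\;+\; \lambda \left[ \alpha \sum_{j} \lvert \beta_j \rvert + (1-\alpha) \sum_{j} \beta_j^{2} \right]
```

With lambda = 0 the penalty vanishes and the problem reduces to ordinary least squares, exactly as statement (D) says; for lambda > 0, the pure L1 case is the lasso and the pure L2 case is ridge regression.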

902.3. A. False. Instead, says Hal Varian, "In this period of big data, it seems strange to
focus on sampling uncertainty, which tends to be small with large datasets, while
completely ignoring model uncertainty, which may be quite large. One way to address
this is to be explicit about examining how parameter estimates vary with respect to
choices of control variables and instruments."

In regard to (B), (C), and (D) each is TRUE.


 In regard to true (B), "An important insight from machine learning is that averaging over
many small models tends to give better out-of-sample prediction than choosing a single
model."
 In regard to true (C), "To state this in a slightly more formal way, consider the identity
from Angrist and Pischke (2009, p. 11): observed difference in outcome = average
treatment effect on the treated + selection bias ... If you want to model the average
treatment effect as a function of other variables, you will usually need to model both the
observed difference in outcome and the selection bias. The better your predictive model
for those components, the better your estimate of the average treatment effect will be. Of
course, if you have a true randomized treatment–control experiment, selection bias goes
away and those treated are an unbiased random sample of the population."

 In regard to true (D), see the mini case study concerning the estimation of a causal
effect of advertising on sales (page 22) and the article's several mentions of Bayesian
Structural Time Series (BSTS); e.g., "The ideal way to estimate advertising effectiveness
is, of course, to run a controlled experiment. In this case the control group provides an
estimate of the counterfactual: what would have happened without ad exposures. But
this ideal approach can be quite expensive, so it is worth looking for alternative ways to
predict the counterfactual. One way to do this is to use the Bayesian Structural Time
Series (BSTS) method described earlier."
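
In the spirit of this point, below is a minimal counterfactual-forecasting sketch. BSTS itself is Bayesian; the non-Bayesian structural time-series model from statsmodels used here is only a stand-in, and the synthetic data and variable names are assumptions rather than anything from the reading.

```python
# Rough stand-in for the BSTS idea: fit a structural time-series model on the
# pre-campaign period, forecast the counterfactual path of visits, read off the lift.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_pre, n_post = 100, 20                                      # pre-campaign vs. campaign periods
control = 50 + np.cumsum(rng.normal(0, 1, n_pre + n_post))   # a control series (e.g., related queries)
visits = 10 + 0.8 * control + rng.normal(0, 2, n_pre + n_post)
visits[n_pre:] += 15                                         # true lift caused by the campaign

# 1) Fit on the pre-period only, using the control series as a regressor
model = sm.tsa.UnobservedComponents(visits[:n_pre], level="local linear trend",
                                    exog=control[:n_pre].reshape(-1, 1))
fit = model.fit(disp=False)

# 2) Forecast the counterfactual: what visits would have been without the campaign
counterfactual = fit.get_forecast(steps=n_post,
                                  exog=control[n_pre:].reshape(-1, 1)).predicted_mean

# 3) The estimated causal effect is actual minus counterfactual
lift = visits[n_pre:] - counterfactual
print(f"estimated average lift: {lift.mean():.1f} visits per period (true lift: 15)")
```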

Discuss here in forum: https://www.bionicturtle.com/forum/threads/p2-t9-902-big-data-techniques-including-machine-learning-varian.22100/

P2.T9.802. Big Data: New Tricks for Econometrics by Hal Varian
Learning objectives: Describe the issues unique to big datasets. Explain and assess
different tools and techniques for manipulating and analyzing big data. Examine the
areas for collaboration between econometrics and machine learning.

802.1. Below is Hal Varian's simple classification tree that predicts Titanic survivors (Figure 1 in
the reading).

According to the tree, each of the following statements is true EXCEPT which is inaccurate?

a) This testing set contains 1,046 observations and two features
b) The tree predicts that all passengers who are younger than 16 will live
c) The rules in this tree misclassify about 30.9% of the "in sample" (testing set)
observations
d) The tree predicts that all First Class passengers live, but only some Second Class
passengers live

802.2. With respect to tools and techniques for manipulating and analyzing big data, each of the
following statements is true EXCEPT which is false?

a) Classifier performance is often improved by adding randomness; examples of this
include boosting, bagging, and bootstrapping
b) When using a large data set (e.g., big data), the data should be parsed at least into
separate training and testing sets; or even training, validation, and testing sets
c) Random forests have the advantage of intuitive usability by offering simple summaries of
data relationships, but their disadvantage is inferior out-of-sample performance
especially with nonlinear data
d) Pruning a tree is an example of regularization because it imposes a cost for tree
complexity (e.g., number of terminal nodes) with the goal of simplifying the model and
generating better out-of-sample predictions

802.3. As an illustrative example of the "most important area for collaboration" between
econometrics and machine learning, Hal Varian considers, as a case study, the relationship
between advertising campaigns and website visits. With respect to this case study, which of the
following BEST summarizes the key insight that illustrates a collaboration between
econometrics and machine learning?

a) The study substitutes a predictive model for a conventional control group in order to
demonstrate causality
b) The study employs machine learning in order to generate a model with a higher multiple
coefficient of determination
c) The study borrows from econometrics in a way that better generates exploratory data
analysis (EDA) and renders the complex relationships easier to understand
d) A BSTS model forecasts directly the beta coefficient of advertising spend as an
explanatory variable, then econometric methods are employed to overlay time-series
covariates

Answers:

802.1. B. False. Passengers in First or Second Class (i.e., Class < 2.5) who are younger
than 16 live, but all Third Class passengers (including the young) do not survive;
although among Third class, this misclassifies 1 - 370/501 = 26.1% of this group.

In regard to (A), (C) and (D), each is TRUE.
 In regard to true (A), 501 + 36 + 233 + 276 = 1,046 (the raw data file count is 1,309 but
there are only 1,046 complete cases) and the two features (aka, predictors) are Class
(1st, 2nd, or 3rd) and Age.
 In regard to true (C), 1 - 723/1,046 = 30.9%
 In regard to true (D), in terms of the tree, First Class passengers are located in either of
the two "lived" nodes; Young Second Class passengers live, but old Second Class
passengers do not live.
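
A small, hedged sketch that re-checks the figures above: the node counts and error rates come straight from this answer, while the optional scikit-learn refit assumes the OpenML copy of the Titanic data mirrors the 1,309-row file Varian uses (it requires internet access).

```python
# 1) Arithmetic check of the counts cited in the answer
nodes = [501, 36, 233, 276]
assert sum(nodes) == 1046                                     # complete cases
print(f"overall in-sample error: {1 - 723 / 1046:.1%}")       # about 30.9%
print(f"third-class error:       {1 - 370 / 501:.1%}")        # about 26.1%

# 2) Optional refit (assumption: OpenML "titanic" matches the file used in the reading)
from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeClassifier

titanic = fetch_openml("titanic", version=1, as_frame=True).frame
df = titanic[["pclass", "age", "survived"]].dropna()          # 1,309 rows -> 1,046 complete cases
X = df[["pclass", "age"]].astype(float)
tree = DecisionTreeClassifier(max_depth=2).fit(X, df["survived"])
print(f"depth-2 tree, in-sample accuracy: {tree.score(X, df['survived']):.1%}")
```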

802.2. C. is inaccurate. Rather, the opposite is true: random forests are something of a
black box, but their out-of-sample performance is generally superior, especially with
nonlinear data!

According to Varian, "Random Forests: Random forests is a technique that uses multiple
trees. A typical procedure uses the following steps: 1. Choose a bootstrap sample of the
observations and start to grow a tree; 2. At each node of the tree, choose a random sample of
the predictors to make the next decision. Do not prune the trees; 3. Repeat this process many
times to grow a forest of trees; 4. In order to determine the classification of a new observation,
have each tree make a classification and use a majority vote for the final prediction ... This
method produces surprisingly good out-of-sample fits, particularly with highly nonlinear data. In
fact, Howard and Bowles (2012) claim 'ensembles of decision trees (often known as Random
Forests) have been the most successful general-purpose algorithm in modern times.' ... One
defect of random forests is that they are a bit of a black box—they don’t offer simple summaries
of relationships in the data. As we have seen earlier, a single tree can offer some insight about
how predictors interact. But a forest of a thousand trees cannot be easily interpreted. However,
random forests can determine which variables are 'important' in predictions in the sense of
contributing the biggest improvements in prediction accuracy."
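
As a hedged illustration of the recipe just quoted, scikit-learn's RandomForestClassifier already bundles the four steps (bootstrap samples of observations, a random subset of predictors at each split, unpruned trees, and a majority vote across the forest); the synthetic data below are an assumption, not anything from the reading.

```python
# Minimal random-forest sketch mapping scikit-learn parameters to the quoted steps.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,        # step 3: grow many trees
    bootstrap=True,          # step 1: each tree sees a bootstrap sample of observations
    max_features="sqrt",     # step 2: random subset of predictors at each node
    max_depth=None,          # do not prune the trees
    random_state=0,
).fit(X_train, y_train)

# step 4: predictions are a majority vote across the trees
print(f"out-of-sample accuracy: {forest.score(X_test, y_test):.3f}")
# variable "importance" in the sense mentioned at the end of the quote:
print(forest.feature_importances_[:5])
```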

In regard to (A), (B) and (D) each is TRUE:


 In regard to true (A), "Boosting, Bagging, Bootstrap: There are several useful ways to
improve classifier performance. Interestingly enough, some of these methods work by
adding randomness to the data. This seems paradoxical at first, but adding randomness
turns out to be a helpful way of dealing with the overfitting problem. Bootstrap involves
choosing (with replacement) a sample of size n from a dataset of size n to estimate the
sampling distribution of some statistic. A variation is the 'm out of n bootstrap' which
draws a sample of size m from a dataset of size n > m. Bagging involves averaging
across models estimated with several different bootstrap samples in order to improve the
performance of an estimator. Boosting involves repeated estimation where
misclassified observations are given increasing weight in each repetition. The final
estimate is then a vote or an average across the repeated estimate."

 In regard to true (B) and (D), see below.
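
Before that passage, here is a minimal sketch of the three resampling ideas quoted in (A); the scikit-learn estimators and the synthetic data are stand-ins chosen purely for illustration.

```python
# Bootstrap, bagging, and boosting in a few lines (illustrative assumptions only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
rng = np.random.default_rng(1)

# Bootstrap: resample n observations (with replacement) from n to study a statistic
idx = rng.integers(0, len(y), size=len(y))
boot_mean = y[idx].mean()

# Bagging: average/vote over models fit to many different bootstrap samples
bagger = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200, random_state=1).fit(X, y)

# Boosting: refit repeatedly, giving misclassified observations more weight each round
booster = AdaBoostClassifier(n_estimators=200, random_state=1).fit(X, y)
```
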
Varian: "General Considerations for Prediction: Our goal with prediction is typically to get good
out-of-sample predictions. Most of us know from experience that it is all too easy to construct a
predictor that works well in-sample but fails miserably out-of-sample. To take a trivial example,
n linearly independent regressors will fit n observations perfectly but will usually have poor
out-of-sample performance. Machine learning specialists refer to this phenomenon as the
'overfitting problem' and have come up with several ways to deal with it.

First, since simpler models tend to work better for out-of-sample forecasts, machine learning
experts have come up with various ways to penalize models for excessive complexity. In the
machine learning world, this is known as 'regularization,' and we will describe some examples
below. Economists tend to prefer simpler models for the same reason, but have not been as
explicit about quantifying complexity costs.

Second, it is conventional to divide the data into separate sets for the purpose of training,
testing, and validation. You use the training data to estimate a model, the validation data to
choose your model, and the testing data to evaluate how well your chosen model performs.
(Often validation and testing sets are combined.)

... The test-train cycle and cross-validation are very commonly used in machine learning and, in
my view, should be used much more in economics, particularly when working with large
datasets. For many years, economists have reported in-sample goodness-of-fit measures using
the excuse that we had small datasets. But now that larger datasets have become available,
there is no reason not to use separate training and testing sets. Cross-validation also turns out
to be a very useful technique, particularly when working with reasonably large data. It is also a
much more realistic measure of prediction performance than measures commonly used in
economics."
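
A minimal sketch of that train / validate / test workflow plus k-fold cross-validation follows; the synthetic data and the ridge-regression model are illustrative assumptions, not choices made in the reading.

```python
# Split the data, choose a model by cross-validation, then score it once on a held-out test set.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=5000, n_features=50, noise=10.0, random_state=0)

# Split once into training+validation (80%) and a held-out test set (20%)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation on the training data to choose the penalty (model selection)
best_alpha, best_score = None, -float("inf")
for alpha in (0.01, 0.1, 1.0, 10.0):
    score = cross_val_score(Ridge(alpha=alpha), X_trainval, y_trainval, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

# Evaluate the chosen model once on the untouched test set
final = Ridge(alpha=best_alpha).fit(X_trainval, y_trainval)
print(f"alpha={best_alpha}, out-of-sample R^2 = {final.score(X_test, y_test):.3f}")
```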

802.3. A. TRUE: The study substitutes a predictive model for a conventional control
group in order to demonstrate causality. Rather than a conventional control group, the case
study employs a machine learning time series method (i.e., Bayesian Structural Time Series,
BSTS) in order to PREDICT website visits without advertising spend; aka, the "as if"
counterfactual. In this way, an experiment is SIMULATED rather than explicitly conducted such
that causal inferences can be drawn; e.g., advertising has a significant causal impact on website
visits. In general, to establish causation (rather than correlation), an experiment is required.

Discuss here in forum: https://www.bionicturtle.com/forum/threads/p2-t9-802-big-data-new-tricks-for-econometrics-by-hal-varian.13462/
