Assignment 3
1. Given a 5 GB data set with 50 attributes (each containing 100 distinct values) and
512 MB of main memory in your laptop, outline an efficient method that constructs
decision trees in such large data sets. Justify your answer by rough calculation of
your main memory usage.
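As a starting point for the rough calculation, here is a minimal sketch. It assumes a RainForest-style approach that keeps only per-node class-count tables (AVC-sets) in memory instead of the raw records. The question does not state the number of classes or the counter width, so `num_classes=2` and `bytes_per_count=4` are assumptions, and the function name `avc_group_bytes` is hypothetical.

```python
def avc_group_bytes(num_attributes, distinct_values, num_classes, bytes_per_count=4):
    """Estimate the memory for one node's AVC-group.

    An AVC-set for one attribute holds one counter per
    (attribute value, class label) pair; the AVC-group is the
    collection of AVC-sets for all attributes at that node.
    """
    return num_attributes * distinct_values * num_classes * bytes_per_count


# Values from the question: 50 attributes, 100 distinct values each.
# ASSUMPTION: 2 classes and 4-byte counters (not given in the question).
root_bytes = avc_group_bytes(num_attributes=50, distinct_values=100, num_classes=2)

memory_bytes = 512 * 1024**2  # 512 MB of main memory
groups_that_fit = memory_bytes // root_bytes

print(f"AVC-group at the root: {root_bytes} bytes")
print(f"AVC-groups that fit in 512 MB: {groups_that_fit}")
```

Under these assumptions the count tables are far smaller than the 5 GB data set, since their size depends on the number of attribute values and classes, not on the number of records. A different class count or counter width changes the estimate proportionally.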
2. Given a decision tree, you have the option of (a) converting the decision tree to rules
and then pruning the resulting rules, or (b) pruning the decision tree and then
converting the pruned tree to rules. What advantage does (a) have over (b)?
3. Consider a training set that contains 100 positive examples and 400 negative
examples. For each of the following candidate rules,
R1: A → + (covers 4 positive and 1 negative examples),
R2: B → + (covers 30 positive and 10 negative examples),
R3: C → + (covers 100 positive and 90 negative examples),
determine which is the best and worst candidate rule according to:
4. (a) Suppose the fraction of undergraduate students who smoke is 15% and the
fraction of graduate students who smoke is 23%. If one-fifth of the college students
are graduate students and the rest are undergraduates, what is the probability that a
student who smokes is a graduate student?
(b) Given the information in part (a), is a randomly chosen college student more
likely to be a graduate or undergraduate student?
(c) Repeat part (b) assuming that the student is a smoker.
(d) Suppose 30% of the graduate students live in a dorm but only 10% of the
undergraduate students live in a dorm. If a student smokes and lives in the dorm, is
he or she more likely to be a graduate or undergraduate student? You can assume
independence between students who live in a dorm and those who smoke.
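Question 4 can be checked numerically with Bayes' theorem. The sketch below is one way to set it up, not a definitive solution: the helper `posterior` and the dictionary keys are hypothetical names, and part (d) is read as conditional independence of smoking and dorm residence given the student type, which is an interpretation of the stated assumption.

```python
def posterior(priors, *likelihoods):
    """Return P(class | evidence) for each class.

    priors:      dict mapping class -> P(class)
    likelihoods: dicts mapping class -> P(evidence_i | class);
                 multiplying them assumes the pieces of evidence are
                 conditionally independent given the class.
    """
    joint = {}
    for cls, prior in priors.items():
        value = prior
        for likelihood in likelihoods:
            value *= likelihood[cls]
        joint[cls] = value
    total = sum(joint.values())  # P(evidence)
    return {cls: value / total for cls, value in joint.items()}


# Numbers taken from the question text.
priors = {"grad": 0.2, "undergrad": 0.8}      # one-fifth are graduate students
p_smoke = {"grad": 0.23, "undergrad": 0.15}   # P(smokes | type)
p_dorm = {"grad": 0.30, "undergrad": 0.10}    # P(dorm | type)

given_smoker = posterior(priors, p_smoke)                 # parts (a) and (c)
given_smoker_dorm = posterior(priors, p_smoke, p_dorm)    # part (d)

print("P(type | smokes):", given_smoker)
print("P(type | smokes, dorm):", given_smoker_dorm)
```

Part (b) needs no computation beyond comparing the two priors. For the other parts, comparing the two entries of each returned dictionary shows which student type is more likely given the evidence.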
Table 1