Week Two Assignment A

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 1

2.2 Describe the difference in roles assumed by the validation partition and the test partition.

The validation partition is used to assess the performance of each supervised learning model so
that we can compare models and pick the best one. In some algorithms (e.g., classification and
regression trees, k-nearest neighbors) the validation partition may be used in automated
fashion to tune and improve the model. This means that the validation data are actually used to
help build the model. The test data partition is used for assessing the performance of the final
chosen model on new data. The test data are not used to build models, or to further tweak the
model or improve its fit. (If the test data were used for these purposes, they would play a role
in building or selecting the best model and would no longer provide an unbiased assessment of
the chosen model's performance with completely new data.)
2.3 Consider the sample from a database of credit applicants in Table 2.15. Comment on the
likelihood that it was sampled randomly, and whether it is likely to be a useful sample.
Typically, we perform data mining on less than the complete database. Data mining algorithms
will have varying limitations on what they can handle in terms of the numbers of records and
variables, limitations that may be specific to computing power and capacity as well as software
limitations. Even within those limits, many algorithms will execute faster with smaller samples.
Accurate models can often be built with as few as several thousand records. Hence, we will
want to sample a subset of records for credit applications.
2.4 Consider the sample from a bank database shown in Table 2.16; it was selected randomly
from a larger database to be the training set. Personal Loan indicates whether a solicitation
for a personal loan was accepted and is the response variable. A campaign is planned for a
similar solicitation in the future and the bank is looking for a model that will identify likely
responders. Examine the data carefully and indicate what your next step would be.
After carefully observing the sample from a bank database, there is many randomly selected
databases are to be arranged for the training set. According to the given sample database
personal Loan indicates whether a solicitation for a personal loan was accepted and is the
response variable. The next step would be reducing the mortgage so it is easily available for
both high and low income and improving the performance of securities acct.
2.7 A dataset has 1000 records and 50 variables with 5% of the values missing, spread
randomly throughout the records and variables. An analyst decides to remove records with
missing values. About how many records would you expect to be removed?
For a record to have all values present, it must avoid having a missing value (P = 0.95) for each
of 50 records. The chance that a given record will escape having a missing value for two
variables is 0.95 * 0.95 = 0.903. The chance that a given record would escape having a missing
value for all 50 records is (0.95)50=0.076945. This implies that 1-0.076944=0.9231 (92.31%) of
all records will have missing values and would be deleted.

You might also like