PREDICTIVE ANALYSIS
LAB MANUAL
Vision and Mission of the Department
Vision: To create excellence in business management for nurturing value-driven business leaders with an analytical and entrepreneurial mindset, fostering innovative ideas to transform the world and serve society.
M1: Design a unique competency-directed and industry-relevant curriculum with an outcome-oriented teaching-learning process facilitated by world-class infrastructure.
M2: Enhance students' cognitive, research, analytical, ethical and behavioural competencies through programs that equip them to meet global business challenges in the professional world.
M3: Facilitate a sound, student-centric academic environment with co-curricular and extra-curricular activities to groom and develop future-ready business professionals.
M4: Design a transparent evaluation system for objective assessment of the program learning.
M5: Align meaningful interactions with academia, industry and the community to facilitate value-driven holistic development of the students.
M6: Develop an ethical and socially responsible entrepreneurial attitude for harnessing environmental opportunities through creativity and innovation for a vibrant and sustainable society.
Program Educational Objectives (PEOs)
Program Educational Objectives (PEOs) are the broad statements that describe career and
professional accomplishments that graduates will attain within a few years of graduation. After
successful completion of the MBA program from Chandigarh University, the graduates will:
PEO1: Make a significant impact as successful management professionals with sound business and entrepreneurial acumen leading to a promising career in the various management domains.
PEO2: Develop the professional competence for astute decision making, organizational skills, planning and its efficient implementation, research, data analysis and interpretation with a solution-finding approach.
PEO3: Be known for their team-player qualities to handle diversity and the leadership skills to make sound decisions while working with peers in an interdisciplinary environment with people of cross-cultural attributes.
PEO4: Be adaptable to new technology, innovations and changes in the world economy that positively impact and contribute towards industry, academia and the community at large.
PEO5: Be responsible citizens with high ethical conduct who will empower business organizations with high integrity, moral values, social effectiveness and legal business intelligence.
PO2: Foster analytical and critical thinking abilities for data-based decision making.
PO3: Ability to develop value-based leadership.
PO4: Ability to understand, analyze and communicate global, economic, legal, and ethical aspects of business.
Version-2018

Ist Semester
Subject Code   Subject                                                    L  T  P  Cr
BAT-601        Accounting for Managers                                    4  0  0  4
BAT-602        Fundamentals of Management and Organizational Behaviour    4  0  0  4
BAT-603        Managerial Economics                                       4  0  0  4
BAT-604        Quantitative Techniques for Managers                       4  0  0  4
BAT-605        Marketing Management                                       3  0  0  3
PCT-610        Professional Business Communication                        3  0  0  3
PCP-611        Professional Business Communication (LAB)                  0  0  2  1
BAT-608        Computer Applications for Business                         2  0  2  3
BAT-609        Supply Chain Management                                    3  0  0  3
TOTAL CREDITS IN SEMESTER: 29
Mentoring Lectures                                                         2  0  0  0
TOTAL NO. OF SESSIONS: 33
IInd Semester
Subject Code   Subject                                    L  T  P  Cr
BAT-660        Legal and Business Environment             3  0  0  3
BAT-661        Corporate Finance                          4  0  0  4
BAT-662        Operations Management and Research         4  0  0  4
BAT-654        Research Methodology                       4  0  0  4
BAT-655        Social Media and Digital Marketing         3  0  0  3
BAT-656        Human Resource Management                  3  0  0  3
BAP-657        Introduction to SPSS                       0  0  4  2
TOTAL CREDITS IN SEMESTER: 23
TOTAL NO. OF SESSIONS: 25
TOTAL NO. OF SESSIONS (8 SESSIONS/WEEK TO TPP + 2 sessions of Mentoring): 35

IN ADDITION TO COMPULSORY SUBJECTS, A STUDENT HAS TO CHOOSE TWO SUBJECTS FROM EACH OPTED SPECIALISATION
IIIrd Semester
Subject Code   Subject                                    L  T  P  Cr
BAT-731        Retail Management                          3  0  0  3
BAT-732        Corporate Strategy                         3  0  0  3
BAP-703        Predictive Analysis                        0  0  4  2
SPECIALISATION (MARKETING, HR, FINANCE & INTERNATIONAL BUSINESS)
Specialisation Group A/B/C/D/E: Elective I                4  0  0  4
Specialisation Group A/B/C/D/E: Elective II               4  0  0  4
Specialisation Group A/B/C/D/E: Elective III              4  0  0  4
Specialisation Group A/B/C/D/E: Elective IV               4  0  0  4
BAI-705        Summer Training Report                     0  0  0  4
TOTAL CREDITS IN SEMESTER: 28
TOTAL NO. OF SESSIONS: 26
TOTAL NO. OF SESSIONS (8 SESSIONS/WEEK TO TPP + 1 session of Mentoring): 35

AFTER THIRD SEMESTER, A STUDENT HAS TO CHOOSE TWO SUBJECTS FROM ONE SPECIALISATION
IVth Semester
Subject Code   Subject                                               L  T  P  Cr
BAT-780        Corporate Social Responsibility and Sustainability    3  0  0  3
BAT-781        Indian Ethos and Business Ethics                      3  0  0  3
SPECIALISATION (ENTREPRENEURSHIP & INTERNATIONAL BUSINESS)
Specialisation Group A/B/C/D/E: Elective I                           4  0  0  4
Specialisation Group A/B/C/D/E: Elective II                          4  0  0  4
Specialisation Group A/B/C/D/E: Elective III                         4  0  0  4
Specialisation Group A/B/C/D/E: Elective IV                          4  0  0  4
BAR-752        Final Research Project and Publications               0  0  0  4
TOTAL CREDITS IN SEMESTER: 26
TOTAL NO. OF SESSIONS: 22

Final Research Project & Publications: Students are required to conduct a research project survey under the supervision of an assigned supervisor (faculty). Each faculty member will be assigned 4 groups (each group containing 5 students), and each group has to publish two research papers from their research project work in a UGC-listed journal. Each published research paper will carry 2 credits.
FINANCE (A)
SPECIALIZATIONS (Semester-3)
BAA-735 Investment Analysis and Portfolio Management 4 0 0 4
BAA-736 Managing Banks and Financial Institutions 4 0 0 4
SPECIALIZATIONS (Semester-4)
BAA-785 Financial Markets and Services 4 0 0 4
BAA-786 Taxation 4 0 0 4
MARKETING (B)
SPECIALIZATIONS (Semester-3)
BAB-711 Consumer Behaviour 4 0 0 4
BAB-749 Rural Marketing 4 0 0 4
SPECIALIZATIONS (Semester-4)
BAB-787 Sales and Distribution Management 4 0 0 4
BAB-788 Services Marketing 4 0 0 4
ENTREPRENEURSHIP (E)
SPECIALIZATIONS (Semester-3)
BAE-746 Entrepreneurial Strategies-I 4 0 0 4
BAE-747 Social Entrepreneurship 4 0 0 4
SPECIALIZATIONS (Semester-4)
BAE-793 Entrepreneurial Strategies-II 4 0 0 4
BAE-794 Entrepreneurial Marketing 4 0 0 4
Subject Name: Predictive Analytics – LAB MANUAL
Subject Code: BAP-703

BAP-703 Predictive Analysis: L 0, T 0, P 4, Cr 2 (Total Contact Hours: 60)
Programme: MBA
Prerequisite: Logical reasoning and aptitude
Marks: 100 (Internal: 100, External: 0)
Course Objective
To develop a fundamental understanding and application of mathematics and statistics in business organizations.

Unit  Course Outcome
1     Students will be able to initiate effective use of SPSS in business problems.
2     Students will learn to evaluate and solve business problems logically.
3     Students will be able to avoid risks and spot opportunities.

Content of the Syllabus
Unit-I
Introduction to IBM SPSS Statistics: Course Introduction, Introducing IBM SPSS Statistics,
Reading Data, Variable Properties, Working with the Data editor, Summarizing Individual
Variables, Modifying Data Values: Recode, Modifying Data Values: Compute, Describing
Relationship between Variables, Selecting Cases, Creating and Editing Charts, Output in the
Viewer, Syntax Basics, Course Summary, Menus and the Help System.
Unit-II
Data Management and Manipulation with IBM SPSS Statistics, Helpful Data Management
Features, Transformations: Computing with Numeric Data, Transformations: Computing with
Date and Time Variables, Transformations: Computing with Alphanumeric Data. Additional
Data Transformations, Identifying Duplicates and Restructuring Data, Aggregating Data,
Merging Files – Adding Cases, Adding Variables, Analyzing Multiple Response Questions,
Working with Pivot Tables, Working with Charts, Exporting Tables and Charts, An
Introduction to Output Management System, Automating IBM SPSS Statistics, Controlling the
IBM SPSS Statistics Environment
Unit-III
Multidimensional scaling, Factor analysis and Cluster analysis, Concepts of Logistic Regression,
Comparison of Several Populations (One way analysis of Variance and Analysis of Variance of
Ranks)
Recommended Books:

Instructions for the Paper Setter: The syllabus has been divided into three units. The paper setter will set 3 questions from each unit and 1 compulsory question spread over the whole syllabus, consisting of 5 short-answer questions. The compulsory question will be placed at number one. Candidates shall be required to attempt 6 questions in all, including the compulsory question and selecting not more than 2 questions from each unit. All questions carry equal marks.
SPSS (Statistical Package for Social Sciences) is a versatile and responsive program designed to
undertake a range of statistical procedures. SPSS software is widely used in a range of disciplines and is
available from all computer pools within the University of South Australia.
It is important to note that SPSS is not the only statistical software – there are many others that you may come across if you pursue a career that requires you to work with data. Some of the common statistical packages include Stata and SAS (and there are many others).
1. Input window: This is the first window displayed when we open SPSS, and it is where the data is primarily entered. It has two views:
Data view: The data view is used to store and show your data. It is much like an ordinary spreadsheet, although in general the data is structured so that rows are cases and columns are the different variables that relate to each case.
Variable view: The variable view contains the variables in your data set, so it defines the properties of your data set. Each row defines one variable, and its properties include name, type, width, decimals, label, values, missing, columns, align and measure.
2. Output window: This window is used to show the results that have been output from your data analysis. Depending on the analysis you are carrying out, this may include the Chart Editor window or the Pivot Table window.
3. Syntax window: This window shows the underlying commands that have executed your data analysis. If you are a confident coder, this is where you can amend the code or write your own from scratch and then run a custom analysis on your data set (a short syntax sketch follows this list).
4. Script window: Scripts can be used to customize operations within a particular stream, and they are saved with that stream. Scripts can be used to specify a particular execution order for the terminal nodes within a stream. It is mainly used for coding and programming.
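As an illustration of what appears in the Syntax window, the sketch below shows a minimal piece of SPSS syntax; the variable name age is hypothetical and serves only as an example.

* Minimal example of SPSS syntax: request a frequency table for one variable.
* The variable name age is hypothetical.
FREQUENCIES VARIABLES=age
  /STATISTICS=MEAN STDDEV.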
Q.3. Uses of SPSS in business?
SPSS is used as a data collection tool by researchers. The data entry screen in SPSS looks much like any other spreadsheet software. We can enter variables and quantitative data and save the file as a data file. Furthermore, we can organize our data in SPSS by assigning properties to different variables.
Once data is collected and entered into the data sheet in SPSS, we can create an output file from the data. For example, we can create a frequency distribution of our data to determine whether our data is normally distributed. The frequency distribution is displayed in an output file.
The most obvious use for SPSS is to run statistical tests. SPSS has all of the most widely used statistical tests built into the software, so we do not have to do any mathematical equations by hand.
SPSS helps to create reports of questionnaire data in the form of graphical presentations which are ready for publication and reporting.
SPSS can compare and explore the differences between responses to two or more questions. It is very easy to find the difference between two batches of data.
Using this tool, we can carry out analyses in line with the expected research goals and obtain the desired results.
SPSS is a powerful tool for data analysis and is also used for data entry. Data entry is the first step in the statistical process, and it is very important to input data correctly.
SORTING FILE
Sorting data allows us to re-organize the data in ascending or descending order with respect to a specific variable. Some procedures in SPSS require that your data be sorted in a certain way before the procedure will execute.
One of the functions you will often want to perform in the Data View of the Data Editor is sorting by a variable's values to bring cases of similar value together. This can be very useful when exploring the raw data in your datasheet. There are two ways to do this. One is quite simple; the other allows sorting on more than one variable.
STEPS
The simple way to sort variable values is to make sure you are looking at the Data View tab. Then scroll to the variable by which you want to sort. Right-click on its column heading and a context menu will appear. At the bottom of the context menu are two sorting options: "Sort Ascending" and "Sort Descending". Choosing the first will move the smallest values to the top of the data set, while a descending sort will bring up the largest values. Keep in mind that SPSS (unlike Excel) automatically moves all rows in unison, so you don't need to worry about cases becoming misaligned when sorting.
Sometimes it is useful to sort on multiple variables, which means that SPSS will sort the data set by the values of the first variable, then break ties in that sort by sorting on the values of the second variable, and so on. To sort on multiple variables at once, choose Data > Sort Cases.
The "Sort Order" box allows the choice between ascending and descending order once again. On the left is a list of all the variables in the data set. Choose the most important sort variable first by clicking it and then the blue arrow, and repeat for the remaining variables.
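The same sort can also be typed or pasted as syntax. The sketch below is a minimal example; the variable names region and income are hypothetical.

* Sort cases by region in ascending order, then by income in descending order within each region.
SORT CASES BY region (A) income (D).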
SPLIT FILE
STEPS
The Data > Split File dialog lets you control this mechanism. To activate it, choose either:
Compare groups: produces single tables including all groups.
Organize output by groups: produces separate tables for each group.
Select a variable (Region in our example) for the Groups Based on field.
Make sure Sort the file by grouping variables is selected if you are not certain that the file is already sorted on the grouping variable.
After clicking OK, split file is activated (you can see "Split File On" in the status line of the SPSS window), and any procedure you invoke now will be performed separately on each group defined by the grouping variables, as long as you do not turn off Split File.
If you need to know what the current split variable is, you will have to open the Data > Split File dialog.
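In syntax, the same split can be set up roughly as in the sketch below; the grouping variable region follows the example above, while the FREQUENCIES call and the variable income are hypothetical.

* Split file requires the data to be sorted on the grouping variable first.
SORT CASES BY region.
* Compare groups produces single layered tables; use SEPARATE BY for separate tables per group.
SPLIT FILE LAYERED BY region.
* Any procedure run now is repeated for each region, for example a frequency table.
FREQUENCIES VARIABLES=income.
* Turn split file off again when finished.
SPLIT FILE OFF.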
COMPUTE
Sometimes you may need to compute a new variable based on existing information (from other variables) in your data.
In this tutorial we will discuss how to compute variables in SPSS using numeric expressions, built-in functions and conditional logic.
Click Transform > Compute Variable. The Compute Variable window will open, where you specify how to calculate your new variable.
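A small syntax sketch of the same operation is shown below; the variable names (math, reading, writing, avg_score, passed) and the cut-off of 50 are hypothetical.

* Create a new variable as the mean of three existing test scores.
COMPUTE avg_score = MEAN(math, reading, writing).
* Conditional logic: flag cases whose average is at least 50.
IF (avg_score >= 50) passed = 1.
IF (avg_score < 50) passed = 0.
EXECUTE.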
RECODE
Sometimes you will want to transform a variable by grouping its categories or values together. For example, you may want to change a continuous variable into a categorical variable, or you may want to merge the categories of a nominal variable. In SPSS this type of transformation is called recoding.
Recoding into a different variable transforms an original variable into a new variable. That is, the changes do not overwrite the original variable; they are instead applied to a copy of the original variable under a new name.
STEPS
To recode into different variables, click Transform > Recode into Different Variables.
The Recode into Different Variables window will appear. The left column lists all of the variables in your data set. Select the variable you wish to recode by clicking it. Click the arrow in the center to move the selected variable to the center text box.
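The equivalent syntax, as pasted from the dialog, looks roughly like the sketch below; the variable age, the new variable age_group and the cut points are hypothetical.

* Recode a continuous variable into a new categorical variable (original variable is left unchanged).
RECODE age (LOWEST THRU 29=1) (30 THRU 49=2) (50 THRU HIGHEST=3) INTO age_group.
* Label the new variable and run the pending transformation.
VARIABLE LABELS age_group 'Age group'.
EXECUTE.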
SELECT CASES
Data files are not always organized to meet a specific user's needs. For example, users may wish to select specific subjects or split the data file into separate groups for analysis. If you have two or more subject groups in your data and you want to analyze each group independently, you can use the Select Cases option.
STEPS
Open the part1.sav data file provided on your computer. Make sure you are on Data View.
Click the Data menu, and then click Select Cases. The Select Cases dialog box opens.
Select the If condition is satisfied option.
Click the If button. The Select Cases: If dialog box opens. Select the gender variable in the left box, and then click the transfer arrow button to move it to the right box. Click the = button and then click the 1 button. Because the value 1 represents females according to our value labels, we are telling SPSS to select only female participants.
Click the Continue button to return to the Select Cases dialog box. Click the OK button to return to the Data View. All males will be excluded from the statistical analysis.
Run an analysis. Note the crossed-out participants in the data file. These represent all the male participants.
To undo the selection, open the Select Cases dialog box and click the Reset button.
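When you paste rather than click OK, SPSS generates filter syntax similar to the sketch below; the variable gender and the code 1 for females follow the example above.

* Build a filter variable that is 1 for females and 0 otherwise.
USE ALL.
COMPUTE filter_$ = (gender = 1).
* Keep only the selected (female) cases in subsequent analyses.
FILTER BY filter_$.
EXECUTE.
* Remove the filter again when it is no longer needed.
FILTER OFF.
USE ALL.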
Experiment 2: A Chi-square goodness-of-fit test in SPSS
The four steps below show you how to analyse your data using a chi-square goodness-of-fit test
in SPSS Statistics when you have hypothesised that you have equal expected proportions (N.B.,
if you are unclear about the differences between equal and unequal expected proportions, see
the Introduction). Also, it is important to note that this procedure will only give you the correct
results if you have set up your data correctly in SPSS Statistics (N.B., if you have entered the
summated frequencies for each group of your categorical variable, this procedure will only work
if you have already "weighted" your cases, as we explained in the Data Setup section earlier, but
if you have entered all of your data into SPSS Statistics in raw form, this procedure will not give
the correct results). In our enhanced chi-square goodness-of-fit test guide, we show all the SPSS
Statistics procedures for when you have equal and unequal expected proportions, as well as when
you have to weight your cases or have not summated your data. If you only need to follow this
"quick start" guide for equal expected proportions (without the weighting of cases), the four
steps you need are shown below. At the end of these four steps, we show you how to interpret
the results from this test.
Click Analyze > Nonparametric Tests > Legacy Dialogs > Chi-square... on the top menu.
Note: If you are on older versions of SPSS Statistics, you will not have to go through the Legacy Dialogs menu.
You will be presented with the Chi-square Test dialogue box.
Transfer the gift_type variable into the Test Variable List: box by using the arrow button.
Keep the All categories equal option selected in the –Expected Values– area, as we are assuming equal proportions for each category.
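The same test can be run from syntax. The sketch below assumes the gift_type variable from the example, with an optional weighting step shown as a comment for the case where summated frequencies were entered (the weight variable name freq is hypothetical).

* Optional: weight cases first if summated frequencies were entered (variable freq is hypothetical).
* WEIGHT BY freq.
* Chi-square goodness-of-fit test with equal expected proportions.
NPAR TESTS
  /CHISQUARE=gift_type
  /EXPECTED=EQUAL.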
Experiment 3: The Factor Analysis in SPSS
In the dialog box of the factor analysis, we start by adding our variables (the standardized tests math, reading, and writing, as well as the aptitude tests 1-5) to the list of variables.
In the dialog Descriptives… we need to add a few statistics to verify the assumptions made by the factor analysis. To verify the assumptions, we need the KMO and Bartlett's test of sphericity and the Anti-Image Correlation matrix.
The dialog box Extraction… allows us to specify the extraction method and the cut-off value for the
extraction. Generally, SPSS can extract as many factors as we have variables. In an exploratory
analysis, the eigenvalue is calculated for each factor extracted and can be used to determine the
number of factors to extract. A cutoff value of 1 is generally used to determine factors based on
eigenvalues.
Next, an appropriate extraction method needs to be selected. Principal components is the default extraction method in SPSS. It extracts uncorrelated linear combinations of the variables and gives the first factor the maximum amount of explained variance. All following factors explain smaller and smaller portions of the variance and are all uncorrelated with each other. This method is
appropriate when the goal is to reduce the data, but is not appropriate when the goal is to identify
latent constructs.
The second most common extraction method is principal axis factoring. This method is appropriate
when attempting to identify latent constructs, rather than simply reducing the data. In our research
question, we are interested in the dimensions behind the variables, and therefore we are going to use
principal axis factoring.
The next step is to select a rotation method. After extracting the factors, SPSS can rotate the factors
to better fit the data. The most commonly used method is varimax. Varimax is an orthogonal
rotation method that tends to produce factor loadings that are either very high or very low, making it
easier to match each item with a single factor. If non-orthogonal factors are desired (i.e., factors
that can be correlated), a direct oblimin rotation is appropriate. Here, we choose varimax.
In the dialog box Options we can manage how missing values are treated – it might be appropriate to
replace them with the mean, which does not change the correlation matrix but ensures that we do not
over-penalize missing values. Also, we can specify in the output that we do not want to display all factor loadings. The factor loading tables are much easier to read when we suppress small factor loadings. The default value is 0.1, but in this case we will increase it to 0.4. The last step
would be to save the results in the Scores… dialog. This automatically creates standardized scores
representing each extracted factor.
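Pasted as syntax, the factor analysis described above might look roughly like the sketch below. The variable names follow the example, but the exact SPSS variable names and the particular subcommand choices are only one reasonable, assumed configuration.

* Principal axis factoring with varimax rotation and an eigenvalue-greater-than-1 criterion.
* Requests KMO and Bartlett, anti-image output, suppresses small loadings, replaces missing values with the mean and saves factor scores.
FACTOR
  /VARIABLES math reading writing apt1 apt2 apt3 apt4 apt5
  /MISSING MEANSUB
  /PRINT INITIAL KMO AIC EXTRACTION ROTATION
  /FORMAT BLANK(.40)
  /CRITERIA MINEIGEN(1)
  /EXTRACTION PAF
  /ROTATION VARIMAX
  /SAVE REG(ALL).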
Experiment 4: The Cluster Analysis in SPSS
Our research question for this example cluster analysis is as follows:
What homogenous clusters of students emerge based on standardized test scores in mathematics,
reading, and writing?
In SPSS, Cluster Analysis can be found in Analyze/Classify…. SPSS offers three methods for the cluster analysis: K-Means Cluster, Hierarchical Cluster, and Two-Step Cluster.
K-means cluster is a method to quickly cluster large data sets. The researcher defines the number of clusters in advance. This is useful for testing different models with a different assumed number of clusters.
Hierarchical cluster is the most common method. It generates a series of models with cluster
solutions from 1 (all cases in one cluster) to n (each case is an individual cluster). Hierarchical
cluster also works with variables as opposed to cases; it can cluster variables together in a
manner somewhat similar to factor analysis. In addition, hierarchical cluster analysis can handle
nominal, ordinal, and scale data; however, it is not recommended to mix different levels of
measurement.
Two-step cluster analysis identifies groupings by running pre-clustering first and then by running
hierarchical methods. Because it uses a quick cluster algorithm upfront, it can handle large data
sets that would take a long time to compute with hierarchical cluster methods. In this respect, it
is a combination of the previous two approaches. Two-step clustering can handle scale and
ordinal data in the same model, and it automatically selects the number of clusters.
The hierarchical cluster analysis follows three basic steps: 1) calculate the distances, 2) link the
clusters, and 3) choose a solution by selecting the right number of clusters.
First, we have to select the variables upon which we base our clusters. In the dialog window we
add the math, reading, and writing tests to the list of variables. Since we want to cluster cases we
leave the rest of the tick marks on the default.
In the dialog box Statistics… we can specify whether we want to output the proximity matrix
(these are the distances calculated in the first step of the analysis) and the predicted cluster
membership of the cases in our observations. Again, we leave all settings on default.
In the dialog box Plots… we should add the Dendrogram. The Dendrogram will graphically show
how the clusters are merged and allows us to identify what the appropriate number of clusters is.
The dialog box Method… allows us to specify the distance measure and the clustering method.
First, we need to define the correct distance measure. SPSS offers three large blocks of distance
measures for interval (scale), counts (ordinal), and binary (nominal) data.
For interval data, the most common is the Squared Euclidean Distance. It is based on the Euclidean distance between two observations, which is the square root of the sum of squared differences. Since the Euclidean distance is squared, it increases the importance of large distances, while weakening the importance of small distances.
If we have ordinal data (counts) we can select between Chi-Square or a standardized Chi-Square called Phi-Square. For binary data, the Squared Euclidean Distance is commonly used.
In our example, we choose Interval and Squared Euclidean Distance.
Next, we have to choose the Cluster Method. Typically, choices are between-groups linkage
(distance between clusters is the average distance of all data points within these clusters), nearest
neighbor (single linkage: distance between clusters is the smallest distance between two data
points), furthest neighbor (complete linkage: distance is the largest distance between two data
points), and Ward’s method (distance is the distance of all clusters to the grand average of the
sample). Single linkage works best with long chains of clusters, while complete linkage works
best with dense blobs of clusters. Between-groups linkage works with both cluster types. It is recommended to use single linkage first. Although single linkage tends to create chains of
clusters, it helps in identifying outliers. After excluding these outliers, we can move onto Ward’s
method. Ward’s method uses the F value (like in ANOVA) to maximize the significance of
differences between clusters.
A last consideration is standardization. If the variables have different scales and means we
might want to standardize either to Z scores or by centering the scale. We can also transform the
values to absolute values if we have a data set where this might be appropriate.
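A syntax sketch of the hierarchical cluster analysis described above is shown below. The test variables follow the example; the z-standardization step is an assumption that is only needed when scales and means differ.

* Optional: create z-scores (Zmath, Zreading, Zwriting) when scales and means differ.
DESCRIPTIVES VARIABLES=math reading writing /SAVE.
* Ward's method on squared Euclidean distances; print the agglomeration schedule and the dendrogram.
CLUSTER Zmath Zreading Zwriting
  /MEASURE=SEUCLID
  /METHOD WARD
  /PRINT SCHEDULE
  /PLOT DENDROGRAM.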
Experiment 5: The Logistic Regression Analysis in SPSS
First we need to check that all cells in our model are populated. Although the logistic regression is robust against violations of multivariate normality and therefore better suited for smaller samples than a probit model, we still need to check. Because we don't have any categorical variables in our design, we will skip this step.
Logistic Regression is found in SPSS under Analyze/Regression/Binary Logistic…
The next 3 tables are the results for the intercept-only model. That is the maximum likelihood model if only the intercept is included, without any of the independent variables in the analysis. This is basically only interesting for calculating the Pseudo R² that describes the goodness of fit for the logistic model.
The relevant tables can be found in the section ‘Block 1’ in the SPSS output of our logistic regression
analysis. The first table includes the Chi-Square goodness of fit test. It has the null hypothesis that
intercept and all coefficients are zero. We can reject this null hypothesis.
The next table includes the Pseudo R²; the -2 log likelihood is the minimization criterion used by SPSS. We see that Nagelkerke's R² is 0.409, which indicates that the model is good but not great. Cox & Snell's R² is the nth root (in our case the 107th) of the -2 log likelihood improvement. Thus we can interpret this as: about 30% of the probability of the event (passing the exam) is explained by the logistic model.
The next table contains the classification results, with almost 80% correct classification the model is
not too bad – generally a discriminant analysis is better in classifying data correctly.
The last table is the most important one for our logistic regression analysis. It shows the regression
function -1.898 + .148*x1 – .022*x2 – .047*x3 – .052*x4 + .011*x5. The table also includes the test
of significance for each of the coefficients in the logistic regression model. For small samples the t-
values are not valid and the Wald statistic should be used instead. Wald is basically t² which is Chi-
Square distributed with df=1. However, SPSS gives the significance levels of each coefficient. As we can see, only Apt1 is significant; all other variables are not.
If we change the method from Enter to Forward:Wald the quality of the logistic regression
improves. Now only the significant coefficients are included in the logistic regression equation. In
our case this is Apt1 and the intercept.
We know that a 1 point higher score in the Apt1 test multiplies the odds of passing the exam by 1.17 (exp(.158)). We can also calculate the critical value, which is Apt1 > -intercept/coefficient = -(-5.270)/.158 ≈ 33.35. That is, if a pupil scored higher than 33.35 on Aptitude Test 1, the logistic regression predicts that this pupil will pass the final exam.
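The forward stepwise (Wald) logistic regression above corresponds roughly to the syntax sketch below; the dependent variable name pass is an assumption based on the pass/fail outcome, and apt1 to apt5 stand in for the five aptitude tests.

* Binary logistic regression with forward stepwise selection based on the Wald statistic.
LOGISTIC REGRESSION VARIABLES pass
  /METHOD=FSTEP(WALD) apt1 apt2 apt3 apt4 apt5
  /PRINT=GOODFIT CI(95)
  /CRITERIA=PIN(.05) POUT(.10).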
Experiment 6: The Discriminant Function Analysis in SPSS
Imagine that you are working as a statistician for the Internal Revenue Service. You are told that another IRS employee has developed four composite scores (X1-X4), easily computable from the information that taxpayers provide on their income tax returns and from
other databases to which the IRS has access. These composite scores were developed in the
hope that they would be useful for discriminating tax cheaters from other persons. To see if
these composite scores actually have any predictive validity, the IRS selects a random sample of
taxpayers and audits their returns. Based on this audit, each taxpayer is placed into one of three
groups: Group 1 is persons who overpaid their taxes by a considerable amount, Group 2 is
persons who paid the correct amount, and Group 3 is persons who underpaid their taxes by a
considerable amount. X1 through X4 are then computed for each of these taxpayers. You are
given a data file with group membership, X1, X2, X3, and X4 for each taxpayer, with an equal number of subjects in each group. Your job is to use discriminant function analysis to develop a
pair of discriminant functions (weighted sums of X1 through X4) to predict group membership.
You use a fully stepwise selection procedure to develop a (maybe) reduced (less than four
predictors) model. You employ the WILKS method of selecting variables to be entered or
deleted, using the default p criterion for entering and removing variables.
Click Method and select “Wilks’ lambda” and “Use probability of F.” Click Continue.
Under Statistics, ask for the group means. Under Classify, ask for a territorial map. Continue,
OK.
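The same stepwise analysis can be requested with syntax along the lines of the sketch below; the grouping variable name group and its coded range 1 to 3 are assumptions based on the three audit groups described above.

* Stepwise discriminant function analysis using Wilks' lambda with probability-of-F entry and removal criteria.
* Also requests group means and the territorial map.
DISCRIMINANT
  /GROUPS=group(1 3)
  /VARIABLES=x1 x2 x3 x4
  /METHOD=WILKS
  /PIN=.05
  /POUT=.10
  /STATISTICS=MEAN
  /PLOT=MAP.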
Look at the output, “Variables Not in the Analysis.” At Step 0 the tax groups (overpaid, paid
correct, underpaid) differ most on X3 (Wilks' lambda drops to .636 if X3 is entered) and "Sig. of F to enter" is
less than .05, so that predictor is entered first. After entering X3, all remaining predictors are
eligible for entry, but X1 most reduces lambda, so it enters. The Wilks lambda is reduced from
.635 to .171. On the next step, only X2 is eligible to enter, and it does, lowering Wilks lambda to
.058. At this point no variable already in meets the criterion for removal and no variable out
meets the criterion for entry, so the analysis stops.
Look back at the Step 0 statistics. Only X2 and X3 were eligible for entry. Note, however, that
after X3 was entered, the p to enter dropped for all remaining predictors. Why? X3 must
suppress irrelevant variance in the other predictors (and vice versa). After X1 is added to X3, p to
enter for X4 rises, indicating redundancy of X4 with X1.
If you look at the standardized coefficients and loadings you will see that high scores on DF1
result from high X3 and low X1. If you look back at the group means you will see that those
who underpaid are characterized by having low X3 and high X1, and thus low DF1. This
suggests that DF1 is good for discriminating the cheaters (those who underpaid) from the others.
The centroids confirm this.
If you look at the standardized coefficients and loadings for DF2 you will see that high DF2
scores come from having high X2 and low X1. From the group means you see that those who
overpaid will have low DF2 (since they have a low X2 and a high X1). DF2 seems to be good for
separating those who overpaid from the others, as confirmed by the centroids for DF2.
In the territorial map the underpayers are on the left, having a low DF1 (high X1 and low X3).
The overpayers are on the lower right, having a high DF1 and a low DF2 (low X2, high X3, high
X1). Those who paid the correct amount are in the upper right, having a high DF1 and a high
DF2 (low X1, high X2, high X3).
Experiment 7: Confirmatory Factor Analysis
Confirmatory factor analysis (CFA) is a multivariate statistical procedure that is used to test how
well the measured variables represent the number of constructs. Confirmatory factor analysis
(CFA) and exploratory factor analysis (EFA) are similar techniques, but in exploratory factor analysis (EFA), the data is simply explored, which provides information about the number of factors required to represent the data. In exploratory factor analysis, all measured variables are related to
every latent variable. But in confirmatory factor analysis (CFA), researchers can specify the
number of factors required in the data and which measured variable is related to which latent
variable. Confirmatory factor analysis (CFA) is a tool that is used to confirm or reject the
measurement theory.
Applying Time Series Models
1. From the menus choose: Analyze > Forecasting > Apply Models...
2. Enter the file specification for a model file or click Browse and select a model file (model files are created with the Time Series Modeler procedure).
Optionally, you can:
- Reestimate model parameters using the data in the active dataset. Forecasts are created using the reestimated parameters.
- Save predictions, confidence intervals, and noise residuals.
- Save reestimated models in XML format.

Model Parameters and Goodness of Fit Measures
Load from model file. Forecasts are produced using the model parameters from the model file without reestimating those parameters. Goodness of fit measures displayed in output and used to filter models (best- or worst-fitting) are taken from the model file and reflect the data used when each model was developed (or last updated). With this option, forecasts do not take into account historical data (for either dependent or independent variables) in the active dataset. You must choose Reestimate from data if you want historical data to impact the forecasts. In addition, forecasts do not take into account values of the dependent series in the forecast period, but they do take into account values of independent variables in the forecast period. If you have more current values of the dependent series and want them to be included in the forecasts, you need to reestimate, adjusting the estimation period to include these values.
Reestimate from data. Model parameters are reestimated using the data in the active dataset. Reestimation of model parameters has no effect on model structure. For example, an ARIMA(1,0,1) model will remain so, but the autoregressive and moving-average parameters will be reestimated. Reestimation does not result in the detection of new outliers. Outliers, if any, are always taken from the model file.
- Estimation Period. The estimation period defines the set of cases used to reestimate the model parameters. By default, the estimation period includes all cases in the active dataset. To set the estimation period, select Based on time or case range in the Select Cases dialog box. Depending on available data, the estimation period used by the procedure may vary by model and thus differ from the displayed value. For a given model, the true estimation period is the period left after eliminating any contiguous missing values, from the model's dependent variable, occurring at the beginning or end of the specified estimation period.

Forecast Period
The forecast period for each model always begins with the first case after the end of the estimation period and goes through either the last case in the active dataset or a user-specified date. If parameters are not reestimated (this is the default), then the estimation period for each model is the set of cases used when the model was developed (or last updated).
- First case after end of estimation period through last case in active dataset. Select this option when the end of the estimation period is prior to the last case in the active dataset, and you want forecasts through the last case.
- First case after end of estimation period through a specified date. Select this option to explicitly specify the end of the forecast period. Enter values for all of the cells in the Date grid. If no date specification has been defined for the active dataset, the Date grid shows the single column Observation. To specify the end of the forecast period, enter the row number (as displayed in the Data Editor) of the relevant case. The Cycle column (if present) in the Date grid refers to the value of the CYCLE_ variable in the active dataset.

Output
Available output includes results for individual models as well as results across all models. Results for individual models can be limited to a set of best- or poorest-fitting models based on user-specified criteria.

Statistics and Forecast Tables
The Statistics tab provides options for displaying tables of model fit statistics, model parameters, autocorrelation functions, and forecasts. Unless model parameters are reestimated (Reestimate from data on the Models tab), displayed values of fit measures, Ljung-Box values, and model parameters are those from the model file and reflect the data used when each model was developed (or last updated). Outlier information is always taken from the model file.
Display fit measures, Ljung-Box statistic, and number of outliers by model. Select (check) this option to display a table containing selected fit measures, the Ljung-Box value, and the number of outliers for each model.
Fit Measures. You can select one or more of the following for inclusion in the table containing fit measures for each model:
- Stationary R-square
- R-square
- Root mean square error
- Mean absolute percentage error
- Mean absolute error
- Maximum absolute percentage error
- Maximum absolute error
- Normalized BIC
Statistics for Comparing Models. This group of options controls the display of tables containing statistics across all models. Each option generates a separate table. You can select one or more of the following options:
- Goodness of fit. Table of summary statistics and percentiles for stationary R-square, R-square, root mean square error, mean absolute percentage error, mean absolute error, maximum absolute percentage error, maximum absolute error, and normalized Bayesian Information Criterion.
- Residual autocorrelation function (ACF). Table of summary statistics and percentiles for autocorrelations of the residuals across all estimated models. This table is only available if model parameters are reestimated (Reestimate from data on the Models tab).
- Residual partial autocorrelation function (PACF). Table of summary statistics and percentiles for partial autocorrelations of the residuals across all estimated models. This table is only available if model parameters are reestimated (Reestimate from data on the Models tab).
Statistics for Individual Models. This group of options controls the display of tables containing detailed information for each model. Each option generates a separate table. You can select one or more of the following options:
- Parameter estimates. Displays a table of parameter estimates for each model. Separate tables are displayed for exponential smoothing and ARIMA models. If outliers exist, parameter estimates for them are also displayed in a separate table.
- Residual autocorrelation function (ACF). Displays a table of residual autocorrelations by lag for each estimated model. The table includes the confidence intervals for the autocorrelations. This table is only available if model parameters are reestimated (Reestimate from data on the Models tab).
- Residual partial autocorrelation function (PACF). Displays a table of residual partial autocorrelations by lag for each estimated model. The table includes the confidence intervals for the partial autocorrelations. This table is only available if model parameters are reestimated (Reestimate from data on the Models tab).
Display forecasts. Displays a table of model forecasts and confidence intervals for each model.
Building a CHAID Decision Tree in SPSS Modeler
To view the data, drag in a Table node and attach it to the Statistics node already on the canvas. Click Run and double-click to view the table output.
The next step in the process is to read in the data using a Type node. The Type node specifies metadata and data properties for each field: the measurement level, data values, the role and missing value definitions. Here, the field enroll is our target.
Next drag a CHAID node and attach it to the existing Type node. CHAID stands for Chi-squared Automatic Interaction Detection and is one of the more popular decision tree models. What is really going on behind the scenes is that the model is running the chi-square test many times. This will make more sense in just a little bit, but essentially, the model is picking the predictors with the strongest relationship with the outcome field, and that is determined by the field that has the highest chi-square statistic.
Edit the CHAID node and make sure the target is set correctly. You can also remove
any inputs or predictors that you don’t want included in the model. When you’re
ready click the Run icon in the lower left hand corner to create the model.
If all goes well you will get the golden nugget. Double click the nugget to see the results.
What we have here are the top predictors of enrollment. The top three predictors are financial
aid, overnight visit and alumni meeting. In total there were six predictors that the model deemed
important. To view the decision tree click on the Viewer tab.
As we continue to work our way down the tree, we see that the next most important variable is
an overnight visit. If a student was offered a financial aid package of 51%-75% and also took an
overnight visit we were able to accurately predict that they would enroll around 93% of the time.
Alternatively, if students did not take an overnight visit we predicted that they would not enroll
63% of the time. This rule applied to 289 students and we were accurate about 183 times.
And just like that we continue to work our way down the tree to the next most significant
variable until we reach a terminal node, which signifies that the prediction has ended.
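For readers working in SPSS Statistics rather than Modeler, a roughly analogous CHAID tree can be grown with the TREE command (Decision Trees module). The sketch below is an approximation of the stream described above, and the field names (enroll, financial_aid, overnight_visit, alumni_meeting) are assumed rather than taken from the actual data set.

* CHAID decision tree for a nominal target with three hypothetical predictors.
* Requires the SPSS Statistics Decision Trees module.
TREE enroll [n] BY financial_aid [o] overnight_visit [n] alumni_meeting [n]
  /METHOD TYPE=CHAID.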
This was a simple decision tree aimed at showing which variables help us to accurately predict student enrollment. Keep in mind that predictive analytics can be applied in a variety of industries including education, retail, healthcare and finance, just to name a few.