FBR & IT Applications: Compiled and Presented by DR - Deepak Joshi For Academic Use Only
https://www.ibm.com/account/reg/in-en/signup?formid=urx-19774
21-11-2023
Story
• Once upon a time, there was a beautiful girl named Cinderella. She
was 20 years old. She had blue eyes, long golden hair and a fair
complexion. She lived unhappily with her two stepsisters and their
mother. They treated Cinderella very badly. One day, an invitation to a
ball at the palace arrived. But Cinderella’s stepmother would not let
her go. Cinderella was made to sew two new party gowns each for
her stepmother and stepsisters, and curl their hair. They then went to
the ball, leaving Cinderella alone at home.
Data Visualisation
• Median: the midpoint of all the data. It is not distorted by skewed values (outliers), but it is rarely used in further calculations.
• Arrange the data from lowest to highest (Highest – Lowest = Range).
• Take the middlemost value if the count is odd; if even, take the average of the two middlemost values.
Thus the median of the second case is 5500 (the spend of an average student, NOT the average spend of a person, as we sorted the spend first).
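The sort-then-take-the-middle rule can be sketched in Python. The spend figures below are hypothetical and chosen only for illustration; they are not the data behind the slide.

```python
def median(values):
    """Middle value if the count is odd, mean of the two middle values if even."""
    ordered = sorted(values)          # arrange from lowest to highest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical monthly spends (illustrative only)
odd_case = [4000, 7000, 5000, 9000, 3000]
even_case = [4000, 7000, 5000, 9000, 3000, 6000]

print(median(odd_case))                 # middle of the 5 sorted values
print(median(even_case))                # average of the two middle values
print(max(odd_case) - min(odd_case))    # range = highest - lowest
```

Because the values are sorted first, one very large spend would move the mean but leave the median almost unchanged.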
• 5* = 91 (455)
• 4* = 21 (84)
• 3* = 14 (42)
• 2* = 08 (16)
• 1* = 31 (31)
• Avg Rating = Total Rating / No. of People
• 628 / 165 = 3.8
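The same arithmetic, written as a weighted average in Python. The counts are the ones on the slide; the variable names are illustrative.

```python
# Number of people who gave each star rating (figures from the slide)
counts = {5: 91, 4: 21, 3: 14, 2: 8, 1: 31}

# Total rating = sum of (stars x number of people): 455 + 84 + 42 + 16 + 31
total_rating = sum(stars * n for stars, n in counts.items())
people = sum(counts.values())
average_rating = total_rating / people

print(total_rating, people, round(average_rating, 1))
```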
Pic credits: https://www.geyerinstructional.com/
• Dispersion:
• Indicates the degree to which data spreads around an average value, i.e. the central tendency (CT)
• Measures: Range, Interquartile Range, Std Dev & Variance
• Skewness:
• Indicates symmetry in the data
• A dataset is symmetric if its distribution is spread equally to the right and left of the midpoint
• +ve score: right-skewed, with outliers on the right side pulling the mean to the right; -ve score: left-skewed
• Kurtosis:
• Indicates the concentration around the central part and measures whether the data is heavily tailed or lightly tailed relative to a normal distribution
• Datasets with low kurtosis do not concentrate heavily around the midpoint
• Normally distributed data has skewness near 0 and kurtosis near 3
a: Kurtosis > 3, i.e. Leptokurtic
b: Kurtosis = 3, i.e. Normal
c: Kurtosis < 3, i.e. Platykurtic
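A minimal sketch of these measures using only the Python standard library. The two datasets are hypothetical. Skewness and kurtosis here are the simple moment-based versions (dividing by n); statistical packages often apply sample-size corrections, and some report kurtosis minus 3, so their numbers can differ slightly.

```python
import statistics

def describe(data):
    """Range, IQR, variance, std dev, skewness and kurtosis of a list of numbers."""
    n = len(data)
    mean = statistics.fmean(data)
    q1, _, q3 = statistics.quantiles(data, n=4)   # quartile cut points
    # Central moments, dividing by n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return {
        "range": max(data) - min(data),
        "iqr": q3 - q1,
        "variance": statistics.variance(data),    # sample variance (n - 1)
        "std_dev": statistics.stdev(data),
        "skewness": m3 / m2 ** 1.5,               # 0 for symmetric data
        "kurtosis": m4 / m2 ** 2,                 # about 3 for normal data
    }

symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 2, 2, 3, 3, 3, 20]   # one large outlier on the right

print(describe(symmetric))
print(describe(right_skewed))
```

The outlier in the second dataset gives a positive skewness and pulls the mean above the median.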
• The normal distribution (ND) is key to statistics because of the central limit theorem (CLT): averages calculated from independent, identically distributed random variables have an approximately normal distribution. (As the sample size increases, sample means tend to
follow normality, and the mean of the sample means ≈ the mean of the population.)
• Normal Distribution:
• Std Dev is 1 (for the standard normal distribution, whose mean is 0)
• Zero skewness
• Kurtosis is 3 (normally tailed, rather than heavily tailed or lightly tailed)
• Mean, median & mode are at one point
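A small simulation can illustrate the CLT statement. This is a sketch under stated assumptions: the population is uniform on [0, 1) (clearly not normal, with mean 0.5), the sample size of 30 and the 2000 repetitions are arbitrary choices, and the seed is fixed only to make the run repeatable.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n independent draws from a uniform [0, 1) population."""
    return statistics.fmean([random.random() for _ in range(n)])

# Draw many samples of size 30 and keep each sample's mean
means = [sample_mean(30) for _ in range(2000)]

mean_of_means = statistics.fmean(means)     # should sit close to 0.5
spread_of_means = statistics.stdev(means)   # much smaller than the population spread

print(round(mean_of_means, 3), round(spread_of_means, 3))
```

A histogram of `means` would look bell-shaped even though the underlying population is flat.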
2. Select Excel
4. Select Open
Will automatically read first row as variable names
Output: Statistics Viewer, to see results when we execute something
This is the Data Editor, where we do the work
Click this to open a file
OPEN OTHER TYPES OF FILE, LIKE CSV ETC.
Press this to open
See, we selected the corresponding option
Look at the preview and click Next
See what is indicated: select Yes if the top line contains the variable names
Select Next
Select Next
2 TYPES OF VIEW
1. DATA VIEW (default, what you see)
2. VARIABLE VIEW
Age of an Average Person
Average Age
Std Deviation = deviation from the mean score; it summarises continuous data only. Larger values indicate more spread of observations from the central tendency (CT).
Case
The market research team at AdRight is assigned the task to identify the profile of the typical customer for each treadmill
product offered by CardioGood Fitness.
The market research team decides to investigate whether there are differences across the product lines with respect to
customer characteristics. The team decides to collect data on individuals who purchased a treadmill at a CardioGood Fitness
retail store during the prior three months. The data are stored in the CardioGoodFitness.csv file.
The team identifies the following customer variables to study:
• product purchased, TM195, TM498, or TM798;
• gender;
• age, in years;
• education, in years;
• relationship status, single or partnered;
• annual household income ($);
• average number of times the customer plans to use the treadmill each week;
• average number of miles the customer expects to walk/run each week;
• and self-rated fitness on a 1-to-5 scale, where 1 is poor shape and 5 is excellent shape.
Perform descriptive analytics to create a customer profile for each CardioGood Fitness treadmill product line.
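As a rough sketch of what a per-product profile could look like outside SPSS, the following groups rows by product and averages the numeric variables. The column names ("Product", "Age", "Income", "Miles") and all the rows are hypothetical; the real file's headers may differ, and real rows would be read from CardioGoodFitness.csv (for example with `csv.DictReader`).

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical rows standing in for the CSV contents (illustrative only)
rows = [
    {"Product": "TM195", "Age": 24, "Income": 40000, "Miles": 80},
    {"Product": "TM195", "Age": 30, "Income": 46000, "Miles": 90},
    {"Product": "TM498", "Age": 28, "Income": 50000, "Miles": 95},
    {"Product": "TM498", "Age": 32, "Income": 52000, "Miles": 85},
    {"Product": "TM798", "Age": 27, "Income": 75000, "Miles": 160},
    {"Product": "TM798", "Age": 29, "Income": 85000, "Miles": 180},
]

def profile(rows, numeric_columns):
    """Mean of each numeric column, computed separately for each product."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["Product"]].append(row)
    return {
        product: {
            col: fmean([float(r[col]) for r in members])
            for col in numeric_columns
        }
        for product, members in groups.items()
    }

for product, means in profile(rows, ["Age", "Income", "Miles"]).items():
    print(product, means)
```

Categorical variables such as gender or relationship status would be summarised with counts per product rather than means.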
When you have added all values for that, click Continue
After Continue you shall land here. Finally, click OK
Hypothesis
• Conjecture about a population
• A statement about a population parameter
• A premise or claim that one wants to test
Analyse
• Accept
• Reject
Population
Sample: individuals with characteristics
- Age
- Gender
- Color
- Region
Uncertainty Leads
• α (Alpha) = Significance Level = the probability of rejecting the null
hypothesis when it is in fact true.
• Often 5% (if the null hypothesis is true, 5 times out of 100 we shall be wrong in rejecting it)
• P value is the calculated probability: the probability in the tail beyond the
sample mean, assuming that the null hypothesis is correct
• The calculation might differ based on the technique, but the interpretation is the same (the
probability of obtaining data at least as extreme as your sample data, IF the null hypothesis is
true)
• P Value > .05 (α): we fail to reject (loosely, "accept") the null hypothesis (we want stronger evidence to reject it)
• P Value < .05 (α): we reject the null hypothesis
• Confidence Level + Alpha = 1
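The decision rule can be sketched as follows. The example assumes a test statistic that follows the standard normal distribution (a z statistic) and a two-tailed test; the function names are illustrative and not part of any package.

```python
import math

def two_tailed_p_from_z(z):
    """Two-tailed p-value for a standard normal test statistic z."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the area in both tails
    return math.erfc(abs(z) / math.sqrt(2))

def decide(p_value, alpha=0.05):
    """Compare the p-value with the significance level alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

for z in (0.5, 1.96, 3.0):
    p = two_tailed_p_from_z(z)
    print(z, round(p, 4), decide(p))
```

With alpha = 0.05 the confidence level is 1 - 0.05 = 0.95, and z = 1.96 sits almost exactly on the boundary.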
Chi-Square
• Test for Independence / Pearson's Chi-square: test of association
between 2 categorical variables
• Discovers if there is a relationship b/w 2 categorical variables
• Categorical variables like: gender, area, profession, education level, etc.
• Is gender associated with shopping frequency, defined as High & Low?
• Is gender associated with preferred buying mode (online/physical)?
• Are young and old voters equally likely to vote for BJP, Congress, etc.?
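A minimal sketch of how the chi-square statistic is built from a contingency table. The counts are hypothetical (rows = gender, columns = shopping frequency High/Low); the function is illustrative and does not compute a p-value.

```python
def chi_square(observed):
    """Pearson chi-square statistic, degrees of freedom and expected counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected count under independence = row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected

# Hypothetical counts (illustrative only)
table = [[30, 20],
         [15, 35]]
statistic, dof, expected = chi_square(table)

# Share of cells with an expected count below 5 (rule of thumb: at most 20%)
cells = [e for row in expected for e in row]
share_small = sum(e < 5 for e in cells) / len(cells)

print(round(statistic, 2), dof, share_small)
```

The larger the gap between observed and expected counts, the larger the statistic and the stronger the evidence of association.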
In Statistics, see whether Chi-square is selected or not
In Cells, don’t forget to check Expected
As the above is a chi-square table of 2x5, and not more than 20% of all the cells have an expected count of
less than 5 (Yates, Moore & McCabe, 1999, p. 734), and χ²(4) = 18.99, p = .001, we reject the null
hypothesis and conclude that the relationship between mode of buying and safety
concern is statistically significant.
Thus, based on the above test, it is established that people's preferred mode of buying cosmetics (online or
physical) is associated with how safe they consider that mode.
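The reported p-value can be cross-checked from the statistic and its degrees of freedom. The sketch below uses a closed form of the chi-square upper tail that holds only when the degrees of freedom are even (here 4, from a 2x5 table); the function name is illustrative.

```python
import math

def chi2_sf_even_dof(x, dof):
    """Upper-tail probability of a chi-square variable (closed form, even dof only)."""
    if dof <= 0 or dof % 2:
        raise ValueError("this closed form needs a positive even dof")
    half = x / 2
    # P(X > x) = exp(-x/2) * sum_{k < dof/2} (x/2)^k / k!
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(dof // 2))

p_value = chi2_sf_even_dof(18.99, 4)
print(p_value < 0.05)   # compare with alpha = .05
```

The result is well below .05, which agrees with the decision to reject the null hypothesis.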
Z & T Test
• Both compare 2 population means (to test whether they are the same or different)
• Z test: when the population parameters (variance/SD) are known and the
sample size is large.
• T test: when the population parameters are unknown or the sample size is small (less
than 30)
• Z (z-score) indicates how many std devs above or below the population
mean the score calculated from the Z test is.
• For a single score x: z = (x − 𝜇)/𝜎; for a sample mean x̄ from n observations: z = (x̄ − 𝜇)/(𝜎/√n) (𝜇 = population mean, 𝜎 = population std dev)
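A short sketch of the z statistic for a sample mean. When the score being standardised is a sample mean, the population standard deviation is divided by √n (the standard error); with n = 1 the formula reduces to the z-score of a single value. The numbers are hypothetical.

```python
import math

def z_statistic(sample_mean, population_mean, population_sd, n):
    """z = (sample mean - population mean) / (population sd / sqrt(n))."""
    standard_error = population_sd / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error

# Hypothetical example: population mean 50, sd 10, sample of 100 with mean 52
z = z_statistic(52, 50, 10, 100)
print(z)   # number of standard errors the sample mean lies above the population mean
```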
T Test
• 1 sample t-test: compares the mean of a single group against a known mean.
• A college may claim that the average income of its entire Batch 2020 is Rs. 50000 or
more.
• A school may claim that its students have an above-average IQ.
• Independent samples t-test: compares the means of two groups
• MFM-D/MFM-C w.r.t. salary
• Type of exercise (A/B) w.r.t. BP level
• Men & women w.r.t. shopping time
• Paired sample t-test: compares means from the same group at different
times
• Spend on medicine pre-Covid vs spend on medicine post-Covid
• Whether a specific training improved the running time of runners (pre-training vs post-training run time)
NULL HYPO = THERE IS NO DIFFERENCE BETWEEN THE TRUE MEAN AND THE
COMPARED MEAN
There is no difference between the sample mean and the normal population
mean
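The t statistics for the one-sample and paired cases can be sketched with the standard library. All figures are hypothetical, and the functions return only the statistic and its degrees of freedom; the p-value would come from a t table or statistical software.

```python
import math
import statistics

def one_sample_t(sample, claimed_mean):
    """t = (sample mean - claimed mean) / (sample sd / sqrt(n)), with n - 1 df."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    t = (statistics.fmean(sample) - claimed_mean) / standard_error
    return t, n - 1

def paired_t(before, after):
    """Paired t-test = one-sample t-test on the differences against 0."""
    differences = [a - b for b, a in zip(before, after)]
    return one_sample_t(differences, 0)

# Hypothetical incomes tested against a claimed mean of Rs. 50000
incomes = [52000, 54000, 51000, 53000, 55000]
t_one, df_one = one_sample_t(incomes, 50000)

# Hypothetical run times (same runners) before and after training
before = [10, 12, 11, 13]
after = [9, 11, 11, 12]
t_paired, df_paired = paired_t(before, after)

print(round(t_one, 3), df_one, round(t_paired, 3), df_paired)
```

A negative paired statistic here means the run times after training are lower on average.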
                      Variable 1    Variable 2
Mean                  5.927273      6.395454545
Variance              2.107792      1.857597403
Observations          22            22
Pearson Correlation   0.746811
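If this table is the summary of a paired two-sample t-test (the Pearson Correlation row suggests so, but that is an assumption), the t statistic can be recovered from the summary figures alone. The variance of the paired differences follows from the two variances and the correlation.

```python
import math

# Summary figures from the table
mean1, mean2 = 5.927273, 6.395454545
var1, var2 = 2.107792, 1.857597403
n = 22
r = 0.746811

# Var(X1 - X2) = Var(X1) + Var(X2) - 2 * r * sd1 * sd2
var_diff = var1 + var2 - 2 * r * math.sqrt(var1 * var2)
standard_error = math.sqrt(var_diff / n)

t_stat = (mean1 - mean2) / standard_error
dof = n - 1

print(round(t_stat, 3), dof)
```

The sign is negative because Variable 2 has the larger mean.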
For ANOVA, in New Value we shall put numerics, i.e. 1, 2, 3, etc. Otherwise we can also give string values, like E for English, of any character width as required, but for that we shall have to check this.
When you have added all values for that, click Continue
Finally, click OK
Click on Options and check Descriptives
AT SIG .041, THERE IS A STATISTICALLY SIGNIFICANT DIFFERENCE BETWEEN THE GROUPS…
BUT WHICH OF THESE GROUPS ARE DIFFERENT CAN'T BE CONFIRMED. FOR THAT WE HAVE….?
Multiple Comparisons shall indicate which groups are different from each other….
THUS TUKEY'S POST HOC TEST (others exist as well) IS A COMMON WAY TO DO MULTIPLE COMPARISONS
NON-SIGNIFICANT B/W 3 AND 2
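The F statistic behind a one-way ANOVA table can be sketched as the ratio of between-group to within-group variation. The three groups below are hypothetical; the sketch stops at F and its degrees of freedom and does not implement the significance value or the Tukey post hoc comparisons.

```python
from statistics import fmean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    all_values = [x for g in groups for x in g]
    grand_mean = fmean(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical scores for three groups (illustrative only)
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_between, df_within = one_way_anova(groups)
print(round(f_stat, 2), df_between, df_within)
```

A large F only says that at least one group mean differs; the pairwise comparisons are what identify which ones.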
CORRELATION IS NOT CAUSATION
Pic courtesy: www.pinterest.com/pin/179440366372836984/
CORRELATION
• Association between two variables
• Checks if one moves with the other
• Movement direction & strength vary
• Experience in years and salary
• Height & weight of kids
• Supply & price
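Direction and strength can be illustrated with the Pearson correlation coefficient. The figures are hypothetical: one pair that moves together and one that moves in opposite directions.

```python
import math
from statistics import fmean

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both spreads, between -1 and +1."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data (illustrative only)
experience = [1, 2, 3, 4, 5]        # years
salary = [30, 35, 45, 50, 60]       # rises with experience
supply = [10, 20, 30, 40, 50]
price = [50, 40, 30, 20, 10]        # falls as supply rises

print(round(pearson_r(experience, salary), 3))   # strong positive
print(round(pearson_r(supply, price), 3))        # strong negative
```

A coefficient near +1 or -1 shows a strong linear association, but by itself it says nothing about which variable, if either, causes the other.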
CORRELATION
• Relationship between Mother’s Height and Babies
• Relationship between Mother’s Weight and Babies
Factor Analysis
• Many variables to fewer factors
• Many observed, correlated variables to latent variables
• Types
• Exploratory Factor Analysis
• Confirmatory Factor Analysis
• Method
• Principal Component Analysis (max variance put into the 1st factor): most
commonly used
• Other methods, like Common Factor Analysis (finds common variation to put
into factors), are less commonly used
Assumptions
• All variables should be continuous (ordinal variables are also often used, but
they should be equidistant, like a Likert scale)
• Sampling adequacy: large sample, 10 times the number of items (KMO at least
.5)
• Adequate correlation b/w variables (Bartlett's Test of Sphericity)
• No significant outliers
In Rotation: Varimax rotation, so that there is no repetition of variables in the Component Matrix.
In Options:
.7 / .4
Perception of Stadium: Condition of stadium, Outer appearance of stadium, Interior design of stadium
Value: Entry Price, Price of season ticket
No of Star Players
Regression - Definition
y = mx + c
y = c + mx
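A minimal sketch of how the slope m and intercept c of such a line can be estimated by least squares. The data points are hypothetical and lie exactly on y = 2x + 3, so the fit should recover those values.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Least-squares slope m and intercept c for y = c + m*x."""
    mx, my = fmean(xs), fmean(ys)
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    c = my - m * mx          # the fitted line passes through (mean x, mean y)
    return m, c

# Hypothetical points on the line y = 2x + 3
xs = [1, 2, 3, 4, 5]
ys = [5, 7, 9, 11, 13]

m, c = fit_line(xs, ys)
print(m, c)
print(c + m * 6)   # prediction for x = 6
```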
Click on Analyse, then Regression, then Linear
Select Dependent & Independent accordingly
Click Statistics and select R squared change, Durbin-Watson, etc. as per requirement
Y = β0 + β1·x1 + β2·x2 + ... + βk·xk + ε
y = mx + c
y = c + m1·x1 + m2·x2 + err
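With more than one predictor, the coefficients can be estimated by solving the least-squares normal equations. This is a sketch under stated assumptions: the data are hypothetical and generated without any error term from y = 1 + 2·x1 + 3·x2, and the helper functions are illustrative rather than any package's API.

```python
def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):                     # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_multiple(x_rows, ys):
    """Least-squares coefficients [c, m1, ..., mk] from (X'X) b = X'y."""
    design = [[1.0] + list(row) for row in x_rows]   # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical predictors; responses built from y = 1 + 2*x1 + 3*x2 with no error
x_rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in x_rows]

c, m1, m2 = fit_multiple(x_rows, ys)
print(round(c, 6), round(m1, 6), round(m2, 6))
```

With real data the error term is not zero, so the fitted coefficients are estimates and the residuals carry the leftover variation.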
The assumptions that are made in multiple linear regression model are as follows: