BRM Unit 3 & 5 Data Analysis
Urooj A Siddiqui
Data – raw facts, especially numerical facts, collected together for reference or information.
Data is collected on one or more particular variables.
Data analysis is the processing of data to derive useful information.
Information – knowledge communicated concerning some particular fact.
The knowledge created helps in APPLICATION / DECISION MAKING.
Categorical: Qualitative
Continuous: Quantitative
Data: Categorical, Continuous
Editing / Cleaning
Coding
Classification
Tabulation
Graphical Representation
Tables / Crosstab
Graph / Figure
Statistical Analysis
1. Descriptive Methods
   Frequency, %age, Ratio
   Mean, Median, Standard Deviation (Variance)
2. Inferential Methods
   Comparison (t/z-test/Anova)
   Association (chi square test)
   Correlation (r)
   Prediction / Regression (y = ax + b)
Editing / Data Cleaning
examining the collected raw data to detect any errors and omit or correct them where possible
Coding
assigning numerals to answers so that responses can be put into a limited number of categories
Classification
Grouping of data on some basis (a large volume of raw data is reduced into homogeneous groups)
I. Attribute – on the basis of demographic attributes, e.g. gender, rural/urban, day scholar/hosteller
II. Class Interval – on the basis of some numeric range, e.g. 0-10, 10-20 etc.
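Coding and class-interval classification can be sketched in a few lines of Python. The responses, the marks and the code scheme (male = 1, female = 2) are hypothetical, and the rule that a boundary value such as 10 goes to the upper interval (10-20) is an assumed convention, not something the slides specify.

```python
# Hypothetical survey answers and marks (not from the slides).
responses = ["male", "female", "female", "male", "female"]
marks = [4, 17, 10, 25, 9]

# Coding: assign a numeral to each answer category (assumed code scheme).
gender_codes = {"male": 1, "female": 2}
coded = [gender_codes[r] for r in responses]

# Classification by class interval: groups of width 10, lower limit inclusive,
# so a value of 10 is placed in 10-20 rather than 0-10 (assumed convention).
def class_interval(value, width=10):
    lower = (value // width) * width
    return f"{lower}-{lower + width}"

intervals = [class_interval(m) for m in marks]

print(coded)
print(intervals)
```

Whichever boundary convention is chosen, it has to be applied consistently so that every value falls in exactly one class.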
I. Tabulation
is the process of displaying raw data in tabular form and summarising it for further analysis: an orderly arrangement of data in columns and rows
Tabulation is essential because
It conserves space and reduces explanatory statements
It facilitates the summation of items, comparison, and the detection of errors and omissions
It provides a basis for various statistical computations
Name  Gender  Caste  Age  Mob. No.    Edu  Yrs in school  IQ  Pain level  Temp of locality (deg cel)
7     1       1      60   9450366367  1    0              16  0           4
2     1       2      65   8004896712  1    16             14  2           20
5     2       1      35   9934876545  2    19             0   0           15
4     2       1      90   2542543598  1    8              16  0           0
3     2       3      38   9458098734  3    21             13  3           0
6     1       1      48   9412890112  4    23             20  2           -1
1     1       1      45   8796654398  0    12             10  2           30
Nominal and Ordinal scales are called qualitative; Interval and Ratio scales are called quantitative.
Single / Multi Variable Table – one or more variables (no interaction)

Raw data:
Roll No.  Age (yr)
1         22
2         24
3         23
4         26
5         19
6         25
.         .
60        22

Single Variable Freq. Table
Age Group (years)  Freq.
Below 20           2
20-22              28
22-24              16
24-26              10
Above 26           4
Total              60

**Multiple Variable Table – as presented in the slide above
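A single-variable frequency table can be built by assigning each value to its class and counting. The sketch below uses only the six ages visible in the raw table, so its counts will not match the full 60-student table. The boundary rule (lower limit inclusive, and "Above 26" read as 26 and over) is an assumption, since the slide does not say where a value such as 22 or 26 belongs.

```python
from collections import Counter

# The six ages visible in the slide's raw table (Roll No. 1 to 6).
ages = [22, 24, 23, 26, 19, 25]

def age_group(age):
    # Assumed convention: lower limit inclusive, upper limit exclusive.
    if age < 20:
        return "Below 20"
    if age < 22:
        return "20-22"
    if age < 24:
        return "22-24"
    if age < 26:
        return "24-26"
    return "Above 26"  # assumption: read as 26 and over

groups = ["Below 20", "20-22", "22-24", "24-26", "Above 26"]
freq = Counter(age_group(a) for a in ages)

# Keep the classes in their natural order; a class with no cases gets 0.
table = [(g, freq[g]) for g in groups]
for group, count in table:
    print(f"{group:<10}{count:>3}")
```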
Crosstabs – interaction of two or more
variables
Two Variable Interaction – Crosstab, e.g. Gender cross-classified with a second variable
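A two-variable crosstab counts how many cases fall into each combination of categories. The sketch below is a minimal illustration with hypothetical records; pairing gender with a specialisation is an assumption made for the example, not the table from the slide.

```python
from collections import Counter

# Hypothetical records: (gender, specialisation).
records = [
    ("M", "HR"), ("M", "HR"), ("F", "Fin"),
    ("F", "Mktg"), ("F", "Fin"), ("M", "Mktg"),
]

# Count each (row category, column category) combination.
counts = Counter(records)
row_labels = sorted({gender for gender, _ in records})
col_labels = sorted({spec for _, spec in records})

# Print the crosstab: one row per gender, one column per specialisation.
print("Gender".ljust(8) + "".join(c.rjust(6) for c in col_labels))
for g in row_labels:
    cells = "".join(str(counts[(g, c)]).rjust(6) for c in col_labels)
    print(g.ljust(8) + cells)
```

Each cell is a joint frequency; summing a row or a column gives back the single-variable frequency for that category.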
Bar Chart
[Figure: bar chart of Sales by year, 2018 to 2021]
Clustered Bar
[Figure: clustered bar chart of four series (1st, 2nd, 3rd, 4th) by year, 2018 to 2020]
Histogram
To show the distribution of a quantitative variable
[Figure: histogram; x-axis: Class Interval / Variable Unit (10 to 50), y-axis: Frequency (0 to 12); shown beside the Roll No. / Age (yr) raw data table (1: 22, 2: 24, 3: 23, 4: 26, 5: 19, 6: 25, ..., 60: 22)]
Line Diagram
To show the change in a variable over a particular time period or along some reference range
[Figure: line diagram of Stock Price (₹ 5.60 to ₹ 7.40) over the Last 10 Days]
Line Diagram
May also be used to compare 2 or more variables along the range
[Figure: line diagram comparing Adani, Tata and Reliance over points 1 to 8]
Scatter Plot
It is used to express the relationship between two variables
[Figure: scatter plot; x-axis: Adv Budget (in 10 Lacs), y-axis: Sales (in Crore)]
Scatter Plot
Trend Lines - Correlation
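The strength of the linear relationship that a trend line suggests is summarised by the correlation coefficient r. The sketch below computes Pearson's r from its definition; the advertising-budget and sales figures are hypothetical, not read off the slide's scatter plot.

```python
import math

# Hypothetical paired observations (not from the slides).
adv_budget = [1, 2, 3, 4]   # x
sales = [2, 3, 5, 6]        # y

mean_x = sum(adv_budget) / len(adv_budget)
mean_y = sum(sales) / len(sales)

# Sums of products and squares of deviations from the means.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(adv_budget, sales))
s_xx = sum((x - mean_x) ** 2 for x in adv_budget)
s_yy = sum((y - mean_y) ** 2 for y in sales)

# Pearson's r lies between -1 and +1.
r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 3))
```

A value of r near +1 means the points lie close to an upward-sloping line, a value near -1 a downward-sloping line, and a value near 0 no linear relationship.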
Income / day   No. of families
0-500          20
500-1000       30
1000-1500      50
1500-2000      70
2000-2500      40
2500-3000      30
3000-3500      10
[Figure: plot of No. of families (0 to 80) against Income (0 to 4000)]
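Student  age (xi)  x̄ − xi  (x̄ − xi) sqr.
A        21        2       4
B        22        1       1
C        23        0       0
D        24        -1      1
E        25        -2      4
Sum                0       10 (sum of (x̄ − xi) sqr.)
mean x̄ = 23
SD = 2 (years), as stated on the slide; note that for the five rows shown alone, 10 / 5 = 2 is the variance, and its square root, about 1.41, is the SD
s – sample SD (known); 𝜎 – population SD (unknown)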
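The deviation table can be reproduced step by step. This is a sketch using only the five ages A to E. For these five values the figure 2 is the variance and the SD is about 1.41 (or about 1.58 with the n − 1 divisor), whereas the slides work with an SD of 2 years.

```python
import math

ages = [21, 22, 23, 24, 25]   # students A to E

mean_age = sum(ages) / len(ages)
deviations = [mean_age - x for x in ages]     # the (x̄ − xi) column
squared = [d ** 2 for d in deviations]        # the squared-deviation column
sum_sq = sum(squared)

# Population formulas divide by n; sample formulas divide by n - 1.
variance_pop = sum_sq / len(ages)
sd_pop = math.sqrt(variance_pop)
sd_sample = math.sqrt(sum_sq / (len(ages) - 1))

print(mean_age, sum(deviations), sum_sq)
print(variance_pop, round(sd_pop, 2), round(sd_sample, 2))
```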
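A distribution of the frequencies of observations is known as a probability distribution
[Figure: distribution curve centred on the mean of 23 years; 21 to 25 marks ±1 SD, 19 to 27 marks ±2 SD, 17 to 29 marks ±3 SD]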
Example of our case
95% CI = µ ± 2 × SD
Lower limit (LL) = µ − 2 × SD, Upper limit (UL) = µ + 2 × SD
LL = 23 − (2 × 2) = 19, UL = 23 + (2 × 2) = 27
95% CI Range = 19-27 years
About 95% of the students in the class are in the range of 19-27 yrs
We are 95% confident that if we randomly select a student from the class, his/her age will be within this range (19-27 yrs)
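The limits can be checked numerically. The sketch below takes the mean of 23 and SD of 2 from the example and assumes the ages follow a normal distribution, which is what the "about 95%" statement relies on. The multiplier 2 is a rounded form of 1.96.

```python
from statistics import NormalDist

mean_age = 23   # µ from the example
sd_age = 2      # SD from the example
k = 2           # rounded multiplier; 1.96 is the more exact value for 95%

lower = mean_age - k * sd_age
upper = mean_age + k * sd_age

# Share of a normal distribution lying between the two limits
# (assumes ages are normally distributed).
dist = NormalDist(mu=mean_age, sigma=sd_age)
coverage = dist.cdf(upper) - dist.cdf(lower)

print(lower, upper, round(coverage, 4))
```

With k = 2 the covered share comes out slightly above 95%, which is why 2 is an acceptable rounding of 1.96.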
The reverse is Hypothesis Testing
If the mean and SD of a population are known and some value is given, can we determine whether it belongs to this population or distribution?
[Figure: distribution curve with z scores marked at 0, ±0.5, ±1 and ±1.5]
Finding Probability
Calculate the z score (test statistic) of the observed or hypothesized value with the formula z = (x − µ) / SD
Determine the p value associated with that z score and compare it with the selected significance level (5%)
The p value can be looked up in the tables of the particular test
[Formula: t statistic]
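The z score and its p value can be sketched as follows. The mean of 23 and SD of 2 come from the class example; the observed age of 28 is hypothetical, and the two-tailed p value is an assumed choice, since the slides do not say whether the test is one-tailed or two-tailed.

```python
from statistics import NormalDist

mu = 23         # population mean from the class example
sd = 2          # population SD from the class example
observed = 28   # hypothetical observed value

# z score: how many SDs the observed value lies from the mean.
z = (observed - mu) / sd

# Two-tailed p value from the standard normal distribution (assumed choice).
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05    # 5% significance level
reject_null = p_value < alpha

print(z, round(p_value, 4), reject_null)
```

Here p is below 0.05, so an age of 28 would be judged unlikely to come from a population with mean 23 and SD 2.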
Two types of Hypothesis: Null – H0, Alternate – Ha (also written H1)
P Value Method: determine the p value
Table Value Method: calculate the test statistic
1  1  1  22  9450366367  87  72  HR-3
2  1  2  24  8004896712  65  68  HR-3
3  2  1  26  9934876545  48  56  Fin.-2
4  2  1  21  2542543598  95  83  Mktg.-1
5  2  3  22  9458098734  65  58  Fin.-2
6  1  1  23  9412890112  74  65  Mktg.-1
Example - H0 & H1
H0 – There is no difference between the means of the two groups
H1 – There is a significant difference between the means of the two groups
H0 – There is no difference between the mean marks of males and females
H1 – There is a significant difference between the mean marks of males and females
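The male/female comparison can be sketched with a pooled two-sample t statistic and the table value method. The two lists of marks are hypothetical, and the equal-variance (pooled) form of the test is an assumed choice. The critical value 2.776 is the two-tailed 5% table value for 4 degrees of freedom.

```python
import math

# Hypothetical marks for the two groups (not confirmed by the slides).
males = [87, 65, 74]
females = [48, 95, 65]

def mean(values):
    return sum(values) / len(values)

def sum_sq_dev(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

n1, n2 = len(males), len(females)
df = n1 + n2 - 2

# Pooled variance assumes both groups share the same population variance.
pooled_var = (sum_sq_dev(males) + sum_sq_dev(females)) / df
std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (mean(males) - mean(females)) / std_error

T_CRITICAL = 2.776   # t table, two-tailed, 5% level, df = 4
reject_null = abs(t_stat) > T_CRITICAL

print(round(t_stat, 3), reject_null)
```

Because |t| is smaller than the table value, H0 is not rejected for this hypothetical data: the difference in mean marks is not significant.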
Hypothesis Testing steps
1. Set the null value (µ1 = µ2, i.e. µ1 − µ2 = 0)
2. Make the null distribution
3. Calculate the sample test statistic (z / t)
4. Compare it with the table value, or compare its p value with the set significance level
5. Reject or fail to reject the null
F Test
Used to compare the variances of two samples
Employed in ANOVA – analysis of variance
ANOVA is used when there are more than two groups and their means are to be compared
Example
Comparison of marks among three streams of students: arts, commerce and science
H0 – There is no difference among the mean marks of the three groups
H1 – There is a significant difference among the mean marks of the three groups
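The three-stream comparison can be sketched as a one-way ANOVA, in which the F statistic is the ratio of the between-group variance to the within-group variance. The marks below are hypothetical. The critical value 5.14 is the 5% table value for F with (2, 6) degrees of freedom.

```python
# Hypothetical marks for three streams (not from the slides).
groups = {
    "arts": [60, 65, 70],
    "commerce": [70, 75, 80],
    "science": [80, 85, 90],
}

all_marks = [m for marks in groups.values() for m in marks]
grand_mean = sum(all_marks) / len(all_marks)

ss_between = 0.0   # variation of group means around the grand mean
ss_within = 0.0    # variation of marks around their own group mean
for marks in groups.values():
    group_mean = sum(marks) / len(marks)
    ss_between += len(marks) * (group_mean - grand_mean) ** 2
    ss_within += sum((m - group_mean) ** 2 for m in marks)

df_between = len(groups) - 1
df_within = len(all_marks) - len(groups)

# F = mean square between groups / mean square within groups.
f_stat = (ss_between / df_between) / (ss_within / df_within)

F_CRITICAL = 5.14   # F table, 5% level, df = (2, 6)
reject_null = f_stat > F_CRITICAL

print(f_stat, reject_null)
```

For this data F exceeds the table value, so H0 is rejected: at least one stream's mean marks differ from the others.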
Mean/Variance Estimation, Z test, T test, Chi Square, F Test
Correlation, Regression, Discriminant, Cluster Analysis etc.
Correlation
Regression analysis
1 dependent variable/DV (continuous)
many independent variables/IV (continuous)
Y = a.x1 + b.x2 + c.x3 + … + k.xn (a, b, c, …, k are the coefficients)
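The simplest case, one independent variable with y = ax + b, can be fitted by least squares as sketched below. The advertising-budget and sales figures are hypothetical. The same idea extends to several independent variables, where a statistics package is normally used.

```python
# Hypothetical observations (not from the slides).
x = [1, 2, 3, 4]   # independent variable, e.g. advertising budget
y = [2, 3, 5, 6]   # dependent variable, e.g. sales

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Least-squares slope a and intercept b for y = a*x + b.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
a = s_xy / s_xx
b = mean_y - a * mean_x

def predict(new_x):
    # Predicted value of the dependent variable for a new x.
    return a * new_x + b

print(a, b, predict(5))
```

The slope a is the expected change in y for a one-unit change in x, and predict applies the fitted line to a value that was not observed.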
Discriminant analysis
1 dependent variable (categorical)
many independent variables (continuous)
Z (yes/no) = a.x1 + b.x2 + c.x3 + … + k.xn
Cluster analysis
No DV/IV
Used to group respondents/customers into various clusters
Employed in market segmentation
Factor analysis
No DV/IV
Used to group variables into a smaller number of more condensed variables (factors)