BRM Unit 3 & 5 Data Analysis

The document discusses different types of data and methods of data analysis. It defines categorical and continuous data and different types of each. It also explains various steps and methods used for data preparation, summarization, and analysis including tabulation, graphical representation, descriptive and inferential statistics.

Uploaded by

Aman Singh
Copyright
© All Rights Reserved

Dr. Urooj A Siddiqui
 Data – Raw facts, especially numerical facts, collected together for reference or information.
 Data is collected on some particular variable/s
 Data analysis is the processing of data to derive useful information
 Information – Knowledge communicated concerning some particular fact
 The created knowledge helps in APPLICATION / DECISION MAKING
 Categorical: Qualitative (Nominal, Ordinal)
 Continuous: Quantitative (Interval, Ratio)


 Variable: Any phenomenon which takes at least two different values/observations
 Data: The set of values/observations collected on a variable is called data
 Nominal
 Ordinal
 Interval
 Ratio
1. Data Preparation / Initial Operations
 Editing / Cleaning
 Coding
 Classification
 Tabulation
 Graphical Representation

2. Summarizing Data / Data Analysis Operations
 Tables / Crosstab
 Graph / Figure
 Statistical Analysis
1. Descriptive Methods
 Frequency, %age, Ratio
 Mean, Median, Standard Deviation (Variance)
2. Inferential Methods
 Comparison (t/z-test/ANOVA)
 Association (chi square test)
 Correlation (r)
 Prediction / Regression (y = ax + b)
 Editing / Data Cleaning
 examining the collected raw data to detect any errors and omit/correct them if possible
 Coding
 assigning numerals to answers so that responses can be put into a limited number of categories
 Classification
 Grouping of data on some basis (a large volume of raw data is reduced into homogeneous groups)
I. Attribute - on the basis of demographic attributes, e.g. gender, rural/urban, day scholar/hosteller
II. Class Interval - on the basis of some numeric range, e.g. 0-10, 10-20 etc.
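The coding and classification steps above can be sketched in a few lines of Python. This is only an illustration: the responses, the code book (`GENDER_CODES`) and the helper `class_interval` are hypothetical and are not taken from the slides.

```python
from collections import Counter

# Hypothetical raw responses (illustrative values only, not the author's data).
responses = [
    {"name": "A", "gender": "M", "age": 19},
    {"name": "B", "gender": "F", "age": 21},
    {"name": "C", "gender": "F", "age": 25},
    {"name": "D", "gender": "M", "age": 27},
]

# Coding: assign a numeral to each answer category (assumed code book).
GENDER_CODES = {"M": 1, "F": 2}


def class_interval(age, width=10):
    """Return the class interval label for an age, e.g. 25 -> '20-30'.

    The lower bound is inclusive and the upper bound exclusive.
    """
    lower = (age // width) * width
    return f"{lower}-{lower + width}"


coded = [
    {"gender": GENDER_CODES[r["gender"]], "age_class": class_interval(r["age"])}
    for r in responses
]

# Classification by attribute (gender) and by class interval (age).
by_gender = Counter(row["gender"] for row in coded)
by_age_class = Counter(row["age_class"] for row in coded)

print(dict(by_gender))
print(dict(by_age_class))
```

Treating the lower bound as inclusive is one common convention for intervals such as 0-10, 10-20; whichever rule is chosen should be applied consistently so that a boundary value falls into exactly one class.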
I. Tabulation
 is the process of displaying raw data in tabular
form and summarising it for further analysis
 orderly arranging data in columns and rows
Tabulation is essential because
 It conserves space and reduces statements
 It facilitates the process of summation of
items, comparison, detection of errors and
omissions
 Basis for various statistical computations
Name     Gender  Caste   Age  Mob. No.    Edu    Yrs in school  IQ  Pain level  Temp. of locality (deg C)
Ram      M       Hindu   60   9450366367  NIL    0              16  Mild-0      -4
Akbar    M       Muslim  65   8004896712  HS     16             14  Mod-1       20
Sita     F       Hindu   309  9934876545  Int.   19             0   Mild-0      15
Shalini  F       Hindu   90   2542543598  HS     8              16  Mild-0      0
Mehnaj   F       Sikh    38   9458098734  UG     21             13  Severe-2    0
Ravi     M       Hindu   48   9412890112  PG     23             20  Mod-1       -1
Hari     M       Hindu   45   8796654398  Prim   12             10  Mod-1       30


Name  Gender  Caste  Age  Mob. No.    Edu level  Yrs in sch.  IQ  Pain level  Temp. of locality (deg C)
7     1       1      60   9450366367  1          0            16  0           4
2     1       2      65   8004896712  1          16           14  2           20
5     2       1      35   9934876545  2          19           0   0           15
4     2       1      90   2542543598  1          8            16  0           0
3     2       3      38   9458098734  3          21           13  3           0
6     1       1      48   9412890112  4          23           20  2           -1
1     1       1      45   8796654398  0          12           10  2           30

Nominal & Ordinal data are called qualitative. Interval & Ratio data are called quantitative.
 Single / Multi Variable Table - one or more variables (no interaction)

Raw data
Roll No.  Age (yr)
1         22
2         24
3         23
4         26
5         19
6         25
.         .
60        22

Single Variable Freq. Table
Age Group (years)  Freq.
Below 20           2
20-22              28
22-24              16
24-26              10
Above 26           4
Total              60

**Multiple Variable Table – as presented in above slide
 Crosstabs – interaction of two or more
variables
Two Variable Interaction – Crosstab

Age Group   Male  Female  Total
Below 20    1     1       2
20-22       18    10      28
22-24       9     7       16
24-26       7     3       10
Above 26    3     1       4
Total       38    22      60
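As a rough sketch of how a single variable frequency table and a crosstab are built from raw records, the following Python uses only the standard library. The six records are hypothetical stand-ins, not the 60 students from the table above, and `crosstab_row` is an illustrative helper.

```python
from collections import Counter

# Hypothetical (age group, gender) records; not the 60 students in the slides.
records = [
    ("20-22", "Male"), ("20-22", "Female"), ("22-24", "Male"),
    ("20-22", "Male"), ("24-26", "Female"), ("22-24", "Female"),
]

# Single variable frequency table: how many records fall in each age group.
freq = Counter(age_group for age_group, _ in records)

# Crosstab: how many records fall in each (age group, gender) cell.
cells = Counter(records)


def crosstab_row(age_group):
    """Return (male, female, total) counts for one age group."""
    male = cells[(age_group, "Male")]
    female = cells[(age_group, "Female")]
    return male, female, male + female


print("Age Group  Male  Female  Total")
for group in sorted(freq):
    print(group, *crosstab_row(group))
```

The row totals of the crosstab reproduce the single variable frequency table, which is a quick check that no record was lost while tabulating.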
Graphical Representation of Data
 Pie Chart
 Bar Graph
 Histogram
 Line Graph
 Scatter Plot
 Scatter Plot & Correlation
Pie Charts
 It is used to represent %ages, distribution of 1
variable at various levels

[Figure: pie chart "Sales (in mn)" with slices for 1st Qtr, 2nd Qtr, 3rd Qtr and 4th Qtr; slice labels 8.2 (58%), 3.2 (23%), 1.4 (10%), 1.2 (8%)]
Bar Chart
 It is used to represent 1 variable at various levels
 Levels can be year/ groups etc.

[Figure: bar chart "Sales" for the years 2018, 2019, 2020 and 2021; bar labels 4.3, 4.5, 3.5 and 2.5]
Bar Chart
[Figure: clustered bar chart "Clustered Bar" with four series (1st, 2nd, 3rd, 4th) for each of the years 2018, 2019 and 2020]
Histogram
 To show the distribution of a quantitative variable

Roll No.  Age (yr)
1         22
2         24
3         23
4         26
5         19
6         25
.         .
60        22

[Figure: histogram; x-axis: Class Interval/Variable Unit (10 to 50), y-axis: Frequency (0 to 12)]
Line Diagram
 To show change in variable in a particular time
period / on some reference range

[Figure: line graph; y-axis: Stock Price (₹ 5.60 to ₹ 7.40), x-axis: Last 10 Days]
Line Diagram
 May also be used to compare 2 or more variables
along the range
[Figure: line graph comparing three series (Adani, Tata, Reliance) over periods 1 to 8; y-axis 0 to 14]
Scatter Plot
 It is used to express relationships between two
variables
[Figure: scatter plot; x-axis: Adv Budget in 10 Lacs (0 to 4), y-axis: Sales in Crore (0 to 6)]
Scatter Plot
 to express relationships between two variables
Scatter Plot
 Trend Lines - Correlation
Income / day  No. of families
0-500         20
500-1000      30
1000-1500     50
1500-2000     70
2000-2500     40
2500-3000     30
3000-3500     10

[Figure: scatter plot with trend line; x-axis: Income (0 to 4000), y-axis: No. of families (0 to 80)]
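To connect the scatter plot with the correlation coefficient r, here is a minimal sketch that computes Pearson's r for the income table. It assumes each income class can be represented by its midpoint (250, 750, ...), which is an assumption made here for illustration and not something stated in the slides; `pearson_r` is an illustrative helper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Class midpoints of the income classes (assumption: midpoint represents the class).
income_midpoints = [250, 750, 1250, 1750, 2250, 2750, 3250]
families = [20, 30, 50, 70, 40, 30, 10]

r = pearson_r(income_midpoints, families)
print(round(r, 3))
```

For this table r comes out weakly negative, even though the plot shows a clear pattern: the number of families rises and then falls with income. Pearson's r only measures the straight-line part of a relationship, so the scatter plot should be read alongside it.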
Student  age (xi)  x - xi  (x - xi) sqr.
A        21        2       4
B        22        1       1
C        23        0       0
D        24        -1      1
E        25        -2      4
Sum                0       10

Mean (x) = 23
Variance (average of squared deviations) = 10 / 5 = 2, n = 5
SD (root of variance) s = 1.41
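The table above can be reproduced step by step in Python. This sketch follows the slide in dividing the sum of squared deviations by n = 5; the variable names are illustrative.

```python
# Ages of the five students A to E from the table above.
ages = {"A": 21, "B": 22, "C": 23, "D": 24, "E": 25}

n = len(ages)
mean_age = sum(ages.values()) / n

# Deviation of each age from the mean, and its square.
deviations = {name: mean_age - age for name, age in ages.items()}
squared = {name: d ** 2 for name, d in deviations.items()}

# Variance as the average squared deviation (divide by n, as in the slide).
variance = sum(squared.values()) / n
sd = variance ** 0.5

print(mean_age, sum(deviations.values()), sum(squared.values()), variance, round(sd, 2))
```

Dividing by n treats the five ages as the whole group of interest. When the five ages are a sample used to estimate a population variance, the usual estimator divides by n - 1 instead, which would give 10 / 4 = 2.5 here.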
Roll No.  Age (yr)
1         22
2         24
3         23
4         26
5         19
6         22
.         .
60        22

Age Group (years)  Freq.  Probability
Below 20           2      2/60
20-22              28     28/60
22-24              16     16/60
24-26              10     10/60
Above 26           4      4/60
Total              60

Mean = 23 (years) (x - sample - known; µ - population - unknown)
SD = 2 (years) (s - sample - known; σ - population - unknown)
A distribution of the frequencies of observations is known as a probability distribution

 Z – Normal Distribution/Test – Mean (µ), SD (σ)
 To compare means (1 or 2 means)
 t – Distribution/Test – Mean (x), SD (s)
 To compare means (1 or 2 means)
 Chi Square Distribution / Test
 To compare sample SD with population SD
 F Test
 To compare two sample variances
A freq. distribution with a bell-shaped curve and some known properties
 Parameters - Mean (µ), SD (σ)
 Known properties
 About 68% of values are within µ ± 1 SD
 About 95% of values are within µ ± 2 SD
 About 99.7% of values are within µ ± 3 SD
 95% CI = µ ± 2.SD (range)
 Lower limit µ - 2.SD
 Upper limit µ + 2.SD
[Figure: normal curve centred at 23, with 21 and 25 marked at ± 1 SD, 19 and 27 at ± 2 SD, 17 and 29 at ± 3 SD]
Example of our case
 95% CI = µ ± 2.SD
 Lower limit = µ - 2.SD, Upper limit = µ + 2.SD
 LL = 23 - 2 x 2 = 19, UL = 23 + 2 x 2 = 27
 95% CI Range = 19-27 years
 About 95% of the students in the class are in the range of 19-27 yrs
 We are 95% confident that if we randomly select a student from the class his/her age will be within this range (19-27 yrs)
 Reverse is Hypothesis Testing
 If the mean and SD of a population are known and some value is given, can we determine whether it belongs to this population or distribution?
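The arithmetic of this example can be checked with Python's standard `statistics.NormalDist`. The sketch assumes, as the slides do, that ages follow a normal distribution with mean 23 and SD 2; the variable names are illustrative.

```python
from statistics import NormalDist

mean_age, sd_age = 23, 2  # values from the class example above

# Range covering roughly 95% of values: mean ± 2 SD.
lower = mean_age - 2 * sd_age
upper = mean_age + 2 * sd_age

# Share of a normal(23, 2) distribution that lies between the two limits.
ages = NormalDist(mu=mean_age, sigma=sd_age)
share_within = ages.cdf(upper) - ages.cdf(lower)

print(lower, upper, round(share_within, 4))
```

The share within ± 2 SD is slightly above 95%, which is why "2 SD" is used as a convenient rounding of the exact multiplier (about 1.96).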
[Figure: standard normal curve with z values marked at 0, ± 0.5, ± 1 and ± 1.5]
Finding Probability
 Calculate the z score (test statistic) of the observed value or hypothesized value with the formula below
 Determine the p value associated with the particular z score at the selected significance level (5%)
 The p value can be seen in the tables of the particular test

When Population SD is KNOWN: z = (x̄ - µ) / (σ / √n)
When Population SD is UNKNOWN: t = (x̄ - µ) / (s / √n)
(x̄ = sample mean, µ = population mean, σ = population SD, s = sample SD, n = sample size)
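As an illustration of finding the probability for a z score, the sketch below uses `statistics.NormalDist` in place of a printed z table. The observed value of 28 is hypothetical; the population mean 23 and SD 2 are carried over from the class example, and both helper functions are illustrative.

```python
from statistics import NormalDist


def z_score(value, mu, sigma, n=1):
    """z for a single value (n = 1) or for the mean of a sample of size n."""
    return (value - mu) / (sigma / n ** 0.5)


def two_tailed_p(z):
    """Probability of a z at least this far from 0 in either direction."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical observed age of 28 against a population with mean 23 and SD 2.
z = z_score(28, mu=23, sigma=2)
p = two_tailed_p(z)

print(z, round(p, 4))
```

Here p is below 0.05, so at the 5% significance level a value of 28 would be judged unlikely to belong to this population.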
 Two types of Hypothesis: Null - H0, Alternate - Ha

P Value Method
 Determine p value
 Compare with selected alpha level (0.05)
 p ≤ 0.05 – Reject Null
 p > 0.05 – Fail to Reject null / accept null
 This method is generally employed by data analysis software – Excel, SPSS

Table Value Method
 Calculate test statistic – TSCal
 Determine Critical value of test statistic at selected significance level – TSTab
 If TSCal ≥ TSTab – Reject Null
 If TSCal < TSTab – Fail to Reject null / accept null
 This method is generally employed when manual testing is done
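The two decision rules can be put side by side for a z test. This is a sketch under the assumption of a two-tailed test at alpha = 0.05; the function names are illustrative, and the critical value is computed from the standard normal distribution instead of being read from a table.

```python
from statistics import NormalDist

ALPHA = 0.05  # selected significance level


def p_value_method(z, alpha=ALPHA):
    """Reject the null when the two-tailed p value is at most alpha."""
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return "reject null" if p <= alpha else "fail to reject null"


def table_value_method(z, alpha=ALPHA):
    """Reject the null when |TSCal| is at least the critical value TSTab."""
    ts_tab = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return "reject null" if abs(z) >= ts_tab else "fail to reject null"


for ts_cal in (0.5, 1.0, 2.5, -3.0):
    print(ts_cal, p_value_method(ts_cal), table_value_method(ts_cal))
```

Both methods always lead to the same decision, because the critical value is simply the test statistic whose p value equals alpha.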
RN  Gender (G)  Caste (C)  Age (A)  Mob.No.     No. of Classes (N)  Marks Obtained (M)  Specialization Opted (S)
1   1           1          22       9450366367  87                  72                  HR-3
2   1           2          24       8004896712  65                  68                  HR-3
3   2           1          26       9934876545  48                  56                  Fin.-2
4   2           1          21       2542543598  95                  83                  Mktg.-1
5   2           3          22       9458098734  65                  58                  Fin.-2
6   1           1          23       9412890112  74                  65                  Mktg.-1

• Mean & Variance (SD) – e.g. A, N, M – sample statistics – x, s
• Correlation – e.g. N-M, A-N, A-M – r
• Association between Gender and Specialization Opted (G and S) – chi square
Note: Sample characteristic – Statistic; Population characteristic – Parameter
 Assume a population – N, µ, σ
 Now assume we take many samples of size n and calculate the mean for each sample
 x1, x2, x3, x4, x5, x6, . . . . . . . . x100
 Can we make a freq. distribution of these values and draw a curve?
 When we draw a distribution of these values we get an average (x) and an SD (s)
 This average is called the mean of means and is considered the mean of the population
 The SD of this distribution of sample means is σ / √n, which is called the Standard Error
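A small simulation can make the idea of a distribution of sample means concrete. The population below (normal with mean 23 and SD 2), the sample size of 25 and the number of samples are all assumptions chosen for illustration.

```python
import random
from statistics import pstdev

rng = random.Random(0)          # fixed seed so the run is repeatable

MU, SIGMA = 23, 2               # assumed population mean and SD
SAMPLE_SIZE = 25                # n, the size of each sample
NUM_SAMPLES = 2000              # how many samples are drawn


def sample_mean():
    """Draw one sample of size n from the population and return its mean."""
    values = [rng.gauss(MU, SIGMA) for _ in range(SAMPLE_SIZE)]
    return sum(values) / SAMPLE_SIZE


sample_means = [sample_mean() for _ in range(NUM_SAMPLES)]

mean_of_means = sum(sample_means) / NUM_SAMPLES
sd_of_means = pstdev(sample_means)
standard_error = SIGMA / SAMPLE_SIZE ** 0.5   # sigma / sqrt(n)

print(round(mean_of_means, 2), round(sd_of_means, 2), standard_error)
```

The mean of the sample means lands close to the population mean, and their SD lands close to σ / √n, which is the quantity the slides call the Standard Error.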
 Sample mean & their difference - z / t
 Sample correlation statistic– z / t (derived from r)
 Variance (SD²) – F
 Association – Chi Sqr.

 Central Limit Theorem
 If we collect many samples and draw the distribution of their means, the mean of this distribution is the population mean and its SD is σ / √n (the Standard Error)
 We use CLT in Hypothesis Testing
 z - when σ is known and sample size is ≥ 30
 t - when σ is unknown and sample size is < 30
 In sample estimation the t test is employed

 Example - H0 & H1
 H0 – There is no difference b/w the means of two groups
 H1 – There is a significant difference b/w the means of two groups
 H0 – There is no difference b/w mean marks of males & females
 H1 – There is a significant difference b/w mean marks of males & females
 Hypothesis Testing steps
 Set Null Value (µ1 = µ2, µ1 - µ2 = 0) – Make Null Distribution – Calculate z/t sample test statistic – compare with table value/set p value – reject/accept null
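The steps above can be sketched for the marks of the two gender groups in the six-student sample table (gender code 1: 72, 68, 65; gender code 2: 56, 83, 58). The sketch assumes a pooled-variance two-sample t test, which is one common choice and not necessarily the one intended in the slides; the critical value 2.776 is the standard two-tailed t table value for 4 degrees of freedom at the 5% level.

```python
from statistics import mean, variance

# Marks Obtained (M) by gender code, taken from the sample table.
marks_gender_1 = [72, 68, 65]
marks_gender_2 = [56, 83, 58]


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance; returns (t, df)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (mean(a) - mean(b)) / se, df


ts_cal, df = pooled_t(marks_gender_1, marks_gender_2)

# Two-tailed critical value for df = 4 at the 5% level, from a t table.
TS_TAB = 2.776
decision = "reject null" if abs(ts_cal) >= TS_TAB else "fail to reject null"

print(round(ts_cal, 3), df, decision)
```

With only three students per group the test has very little power, so this is a demonstration of the mechanics only.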
F Test
 Used to compare the variances of two samples
 Employed in ANOVA – analysis of variance
 When there are more than two groups and their means are to be compared
 Example
 Comparison of marks among three streams of students: arts, commerce and science
 H0 – There is no difference among mean marks of three groups
 H1 – There is a significant difference among mean marks of three groups
 Set Null Value (µ1 = µ2 = µ3) – Make Null Distribution – Calculate F test statistic – compare with table value/p value – reject/accept null
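A minimal sketch of the F statistic for a one-way ANOVA is given below. The marks for the three streams are hypothetical, and `one_way_f` is an illustrative helper; the critical value 5.14 is the standard F table value for (2, 6) degrees of freedom at the 5% level.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical marks for three streams (illustrative values only).
arts = [60, 65, 70]
commerce = [70, 75, 80]
science = [80, 85, 90]

f_cal, df_b, df_w = one_way_f([arts, commerce, science])

F_TAB = 5.14  # critical value for df = (2, 6) at the 5% level, from an F table
decision = "reject null" if f_cal >= F_TAB else "fail to reject null"

print(f_cal, df_b, df_w, decision)
```

F is the ratio of the between-group variance to the within-group variance, so a large value means the group means differ by more than the spread inside the groups would explain.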
Test of Independence

 It is used to determine association between two categorical variables (nominal & ordinal)
 Example
 Gender (M/F) and Opted Specialization (Mktg./Fin./HR)
 Questions like ‘Is any specialisation preferred by females?’ are answered
 H0 – There is no association b/w gender and opted specialisation
 H1 – There is a significant association b/w gender & opted specialisation
 Here, the mean is not calculated; instead the frequency of categories is taken into consideration
 Actual Frequency and Expected Frequency
 Cross tabs are used to calculate actual & expected freq

Two Variable Interaction – Crosstab

Opted Specialization  Total (60)  Male (40)  Female (20)
Mktg.                 30          20         8
Fin.                  15          10         2
HR                    15          10         10
Total                 60          40         20

 Hypothesis Testing steps
 Set Null Value (actual freq. = expected freq.) – Make Null Distribution – Calculate chi sqr. sample test statistic – compare with table value/set p value – reject/accept null
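The calculation of expected frequencies and the chi square statistic can be sketched as follows. The observed counts are hypothetical: they were chosen to be consistent with the totals in the crosstab (30/15/15 by specialization, 40 males and 20 females) and are not the author's data. The critical value 5.991 is the standard chi square table value for 2 degrees of freedom at the 5% level.

```python
# Hypothetical observed (actual) frequencies: specialization -> (male, female).
observed = {
    "Mktg.": (22, 8),
    "Fin.": (13, 2),
    "HR": (5, 10),
}

row_totals = {spec: sum(counts) for spec, counts in observed.items()}
col_totals = [sum(counts[i] for counts in observed.values()) for i in range(2)]
grand_total = sum(row_totals.values())

# Expected frequency of a cell = row total x column total / grand total.
expected = {
    spec: tuple(row_totals[spec] * col / grand_total for col in col_totals)
    for spec in observed
}

# Chi square = sum over all cells of (actual - expected)^2 / expected.
chi_sq = sum(
    (o - e) ** 2 / e
    for spec in observed
    for o, e in zip(observed[spec], expected[spec])
)

df = (len(observed) - 1) * (len(col_totals) - 1)

CHI_TAB = 5.991  # critical value for df = 2 at the 5% level, from a chi square table
decision = "reject null" if chi_sq >= CHI_TAB else "fail to reject null"

print(expected, round(chi_sq, 2), df, decision)
```

For these illustrative counts the null of no association would be rejected: under independence 5 females would be expected in HR, while 10 were observed.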
 Set Null and Alternate Hypothesis – H0 H1
 Select the null value
 Null – status quo, no difference, no effect
 Status quo – no change
 No difference – 0 difference
 No relationship – 0 effect / 0 correlation
 No association – 0 relationship (b/w nominal variables)
 It is assumed that H0 is true in population
 Draw Null Distribution – find range of expected values
if null is true (µ ± 2.SE)
 Take observed value from sample and compare with
expected null values
 If observed value is among expected null range –
accept null
 If observed value is different from null range – reject
null
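The range-based decision described above can be sketched as a small function. The null value, standard error and observed values below are hypothetical, and the function names are illustrative.

```python
def null_range(null_value, se, k=2):
    """Range of values expected if the null is true: null value ± k x SE."""
    return null_value - k * se, null_value + k * se


def decide(observed, null_value, se):
    """Accept the null when the observed value lies inside the null range."""
    low, high = null_range(null_value, se)
    return "accept null" if low <= observed <= high else "reject null"


# Hypothetical case: null difference in means = 0, standard error = 1.5.
print(null_range(0, 1.5))
print(decide(2.0, 0, 1.5))
print(decide(4.0, 0, 1.5))
```

This is the same logic as comparing a z score with about 2: an observed value more than two standard errors away from the null value falls outside the expected range.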
1. Univariate/Bi-variate
 Mean/Variance Estimation
 Z test
 T test
 Chi Square
 F Test
 Correlation

2. Multi-variate
 Correlation
 Regression
 Discriminant
 Cluster Analysis etc.
 Regression analysis
 1 dependent variable/DV (continuous)
 many independent variables/IV (continuous)
 Y = a.x1 + b.x2 + c.x3 + … + k.xn
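For the simplest case with a single independent variable (y = ax + b), the coefficients can be estimated by least squares as sketched below. The advertising budget and sales figures are hypothetical, and `fit_line` is an illustrative helper; with several independent variables the same idea is applied with matrix methods or statistical software.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx               # slope
    b = mean_y - a * mean_x     # intercept
    return a, b


# Hypothetical data: advertising budget (x) and sales (y).
budget = [1, 2, 3, 4]
sales = [2, 3, 5, 6]

a, b = fit_line(budget, sales)

# Prediction for a new budget value using the fitted line.
predicted = a * 5 + b

print(round(a, 2), round(b, 2), round(predicted, 2))
```

The slope a is the expected change in the dependent variable for a one-unit change in the independent variable, and b is the predicted value when x is 0.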

 Discriminant analysis
 1 dependent variable (categorical)
 many independent variables (continuous)
 Z (yes/no) = a.x1 + b.x2 + c.x3 + … + k.xn
 Cluster analysis
 No DV/IV
 Used to group respondents/customers into various clusters
 Employed in market segmentation

 Factor analysis
 No DV/IV
 Used to group variables into various clusters of more condensed variables
