
Unit 1

The document discusses various topics related to data analysis including quantitative and qualitative data, scales of measurement, measures of central tendency and dispersion, data visualization techniques like histograms and box plots, and introduction to big data. It provides examples and explanations of key concepts.

Uploaded by

udaywal.nandini
Copyright
© All Rights Reserved
Data and its Descriptive Analysis

UNIT- 1:
Topics to be Covered
• Quantitative and Qualitative Data,
• Attributes and variables,
• Scales of measurement: nominal, ordinal, interval and
ratio,
• Measures of Central Value: Mean, Median, Mode,
• Measures of Dispersion: Absolute and Relative
measures of dispersion – Range, Quartile Deviation,
Mean Deviation, Standard Deviation, Moments,
Skewness, Kurtosis.
• Visualization of Data: Histograms, Stem and Leaf Plots,
Five Number Summary and Box Plots.
• Introduction to Big Data: Characteristics and Stages.
Readings
• Statistics for Business by Stine R. & Foster D.,
Pearson India
(Chapter 2,3,4)
• Statistical Methods by S.P.Gupta, Vol. 1
(Chapter 5,6,7,8,9)
• Business Statistics by N.D.Vohra
(Box and Whisker Plot, Moments, Five Point
Summary)
Quantitative & Qualitative Data
• Quantitative Data:
• Quantitative data involves numerical information
and is used to quantify observations or
measurements. It deals with objective,
measurable data.
• Examples: Sales figures, revenue, market share,
number of employees, customer satisfaction
scores.
• Methods of Collection: Surveys, experiments,
structured observations, numerical records.
• Qualitative Data:
• Qualitative data involves non-numerical
information and provides insights into the
qualities or characteristics of a phenomenon. It is
subjective and exploratory.
• Examples: Customer feedback, interview
responses, focus group discussions, open-ended
survey responses.
• Methods of Collection: Interviews, focus groups,
open-ended surveys, observations, content
analysis.
Attributes
– Attributes are characteristics or qualities that
describe an object, person, or phenomenon.
– They are the labels or categories that can be used
to classify and describe elements in a study.
– Example: In the context of employee
performance, attributes could include job
satisfaction, leadership skills, and communication
abilities.
Variables
– A variable is a measurable characteristic that can
take different values. Variables can be classified as
independent (predictor) or dependent (outcome)
in a study.
– Example: In a study on the impact of training on
employee productivity, the type and duration of
training could be independent variables, while the
measured increase in productivity would be the
dependent variable.
MEASUREMENT
• It is the process of assigning numbers or some
other symbols to the characteristics of certain
objects
• For example, in research, people's attitudes and
perceptions are measured by assigning numbers to
such characteristics
TYPES OF MEASUREMENT SCALE
1. NOMINAL SCALE
2. ORDINAL SCALE
3. INTERVAL SCALE
4. RATIO SCALE
NOMINAL SCALE
• Lowest level of measurement
• Numbers are assigned to identify objects
• An object assigned a higher number is in no way
superior to one assigned a lower number
• Objects are divided into mutually exclusive and
collectively exhaustive categories
• Assigned number cannot be added, subtracted,
multiplied or divided, can only be counted
• A frequency distribution table can be prepared and
the chi-square test can be applied to such data
Examples
• What is your religion?
a. Hinduism
b. Sikhism
c. Christianity
d. Islam
e. Any other

A Hindu can be assigned number 1, a Sikh number 2,
and so on; a religion assigned a higher number is in
no way superior to one assigned a lower number.
• Are you married?
a. Yes
b. No

• In which of the following departments do you
work?
a. Marketing
b. HR
c. Finance
ORDINAL SCALE
• Next higher level of measurement than nominal
scale
• Tells whether an object has more or less of
characteristics than some other objects
• It cannot answer how much more or how much
less
• Assigned number cannot be added, subtracted,
multiplied or divided
• Permissible statistics: median, percentiles, quartiles,
rank order correlation coefficient, sign test
Examples
• Rank the following attributes while choosing a
restaurant for dinner where most important
attribute can be ranked 1 and so on.

Attribute Rank
Food quality 1
Prices 3
Menu variety 2
Ambience 5
Service 4
INTERVAL SCALE
• Next higher level of measurement than ordinal scale
• Takes care of limitation of ordinal scale where
difference between scores on ordinal scale does not
have any meaningful interpretation
• In interval scale the difference of the score on the
scale has meaningful interpretation
• Mathematical form:
Y = a + bX where a is not equal to 0
• Interval scale data has an arbitrary origin (non-zero
origin).
Example
• Celsius and Fahrenheit
• C = 5/9 (F – 32)
• The difference in scores has a meaningful
interpretation but the ratio of the scores is not
meaningful
• How important is price to you while buying a car?
a. Least important (1)
b. Unimportant (2)
c. Neutral (3)
d. Important (4)
e. Most Important (5)

• How do you rate the work environment of your
organization?
a. Very good (5)
b. Good (4)
c. Neither good nor bad (3)
d. Bad (2)
e. Very bad (1)
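The Celsius and Fahrenheit example above can be checked numerically. The sketch below is only an illustration: the function name c_to_f and the three temperatures are chosen here, not taken from the slides. It shows that equal differences stay equal after conversion, while ratios change.

```python
def c_to_f(celsius):
    """Convert Celsius to Fahrenheit (the inverse of C = 5/9 (F - 32))."""
    return celsius * 9 / 5 + 32

# Three illustrative temperatures, 10 degrees Celsius apart
temps_c = [10, 20, 30]
temps_f = [c_to_f(c) for c in temps_c]

# Differences: equal gaps in Celsius remain equal gaps in Fahrenheit
gaps_c = [temps_c[1] - temps_c[0], temps_c[2] - temps_c[1]]
gaps_f = [temps_f[1] - temps_f[0], temps_f[2] - temps_f[1]]

# Ratios: "twice as hot" in Celsius is not twice as hot in Fahrenheit
ratio_c = temps_c[1] / temps_c[0]
ratio_f = temps_f[1] / temps_f[0]

print(gaps_c, gaps_f)
print(ratio_c, ratio_f)
```

Because the zero point is arbitrary, the ratio depends on the unit chosen, which is why only differences are interpreted on an interval scale.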
RATIO SCALE
• Highest level of measurement over all the
scales
• Ratio scale has a meaningful ratio
interpretation
• In ratio scale we have a natural zero unlike
interval scale
Examples
• How many chemist shops are there in your
locality?
• How many students are there in the MBA
programme in IIFT?
• How much distance do you need to travel
from your residence to reach the airport?
1.2 Pictorial and Tabular Methods in
Descriptive Statistics
1. Stem and Leaf Diagram
Q. Complete the stem and leaf diagram for the
following items:
7.6 8.1 9.2 6.8 5.9 6.2 6.1 5.8 7.3 8.1 8.8 7.4 7.7
8.2
STEM LEAF
5 8 9
6 1 2 8
7 3 4 6 7
8 1 1 2 8
9 2
Example 2
• Example Stem-and-Leaf Diagram for Exam
Scores:
• Suppose you have the following set of exam
scores for a group of students:
• 68, 72, 75, 78, 80, 82, 84, 85, 88, 90
• To create a stem-and-leaf diagram, you would
separate each score into a stem and a leaf. The
stem represents the tens digit, and the leaf
represents the units digit.
• The stems are: 6, 7, 8, 9 (representing 60s, 70s,
80s, and 90s).
• The leaves are the units digits of the scores.
• The stem-and-leaf diagram would look like this:
STEM LEAF
6 8
7 2 5 8
8 0 2 4 5 8
9 0
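Both stem-and-leaf diagrams above can be reproduced with a few lines of Python. This is a minimal sketch: the helper stem_and_leaf and its scale parameter are illustrative names, and it assumes every value has at most one digit after the split point.

```python
from collections import defaultdict

def stem_and_leaf(values, scale=1):
    """Return {stem: sorted leaves}. Use scale=10 for data with one decimal place."""
    plot = defaultdict(list)
    for v in values:
        # Shift the value so the leaf is the last digit, then split it off
        stem, leaf = divmod(round(v * scale), 10)
        plot[stem].append(leaf)
    return {stem: sorted(leaves) for stem, leaves in sorted(plot.items())}

decimals = [7.6, 8.1, 9.2, 6.8, 5.9, 6.2, 6.1, 5.8, 7.3, 8.1, 8.8, 7.4, 7.7, 8.2]
exam_scores = [68, 72, 75, 78, 80, 82, 84, 85, 88, 90]

for data, scale in [(decimals, 10), (exam_scores, 1)]:
    print("STEM LEAF")
    for stem, leaves in stem_and_leaf(data, scale).items():
        print(stem, " ".join(str(leaf) for leaf in leaves))
```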
3. HISTOGRAM
1. Histogram for equal class interval
Q. Marks of 20 students in a class are given
below.
36 25 38 46 55 68 72 55 36 38 67 45 22 48 91 46
52 61 58 55
CLASS INTERVAL FREQUENCY NUMBERS INCLUDED
20-30 2 25,22
30-40 4 36,38,36,38
40-50 4 46,45,48,46
50-60 5 55,55,52,58,55
60-70 3 68,67,61
70-80 1 72
80-90 0 -
90-100 1 91
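The frequency table above can be rebuilt from the raw marks as follows. This is a sketch under one assumption not stated on the slide: each class includes its lower limit and excludes its upper limit (so a mark of 30 would fall in 30-40). The function name frequency_table is illustrative.

```python
marks = [36, 25, 38, 46, 55, 68, 72, 55, 36, 38,
         67, 45, 22, 48, 91, 46, 52, 61, 58, 55]

def frequency_table(values, start, stop, width):
    """Count values in equal-width classes [lower, lower + width)."""
    table = {}
    for lower in range(start, stop, width):
        upper = lower + width
        table[(lower, upper)] = sum(lower <= v < upper for v in values)
    return table

table = frequency_table(marks, 20, 100, 10)

# A text histogram: one star per observation in the class
for (lower, upper), freq in table.items():
    print(f"{lower}-{upper}: {'*' * freq} ({freq})")
```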
2. HISTOGRAM WITH UNEQUAL CLASS
WIDTH
• Frequency density = FREQUENCY / CLASS WIDTH
• The height of each bar is the frequency density, so
that the area of the bar represents the frequency

Age                0-40       40-50        50-60        60-90      90-110
Freq               80         15           25           90         30
Class width        40         10           10           30         20
Frequency density  80/40= 2   15/10= 1.5   25/10= 2.5   90/30= 3   30/20= 1.5
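The frequency densities for the age table can be computed directly. This is a minimal sketch; the tuple layout (lower limit, upper limit, frequency) is just a convenient representation chosen here.

```python
# (lower limit, upper limit, frequency) for each age class
age_classes = [(0, 40, 80), (40, 50, 15), (50, 60, 25), (60, 90, 90), (90, 110, 30)]

densities = []
for lower, upper, freq in age_classes:
    width = upper - lower
    density = freq / width          # bar height for a histogram with unequal widths
    densities.append(density)
    print(f"{lower}-{upper}: width={width}, frequency density={density}")

# Bar area (height x width) recovers the original frequency
areas = [d * (upper - lower) for d, (lower, upper, _) in zip(densities, age_classes)]
```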
4. BOXPLOTS
• Helps in describing most prominent features
of data set like:
1. Center
2. Spread
3. Extent and nature of any departure from
symmetry
4. Identification of outliers (observations that
lie unusually far from the main body of data)
Steps
• Order the n observations from smallest to largest and
separate the smallest half from the largest half
• The median is included in both halves if n is odd
• Then the lower fourth is the median of the smallest half and
the upper fourth is the median of the largest half
• A measure of spread that is resistant to outliers is the
fourth spread i.e. fs given by
Interquartile range = fs = upper fourth – lower fourth
• It ignores the smallest 25% and the largest 25% of the data and
hence it is resistant to outliers
• Simplest boxplot is based on 5 number summary:
Smallest xi lower fourth median upper fourth largest xi
• Draw a horizontal measurement scale
• Then place a rectangle above this axis such that
the left edge of the rectangle is at the lower
fourth and right edge is at upper fourth (box
width = fs)
• Place a vertical line segment inside the rectangle
at the location of median (position of median
conveys skewness in the middle 50% of the data)
• Draw whiskers out from either end of the
rectangle to the smallest and largest observations
Interpretation of Box plot
• Positively Skewed: If the distance from the
median to the maximum is greater than the
distance from the median to the minimum, then
the box plot is positively skewed.
• Negatively Skewed: If the distance from the
median to minimum is greater than the distance
from the median to the maximum, then the box
plot is negatively skewed.
• Symmetric: The box plot is said to be symmetric if
the median is equidistant from the maximum and
minimum values.
Q. Construct a boxplot for the
following data.
40 52 55 60 70 75 85 85 90 90 92 94 94 95 98 100 115 125 125
ANSWER
• 5 number summary is as follows:
Smallest xi = 40
Lower fourth = 72.5, i.e. (70 + 75)/2
Median = 90
Upper fourth = 96.5, i.e. (95 + 98)/2
Largest xi = 125
The right edge of the box (96.5) is much closer to the median (90) than the
left edge (72.5), and the distance from the median to the minimum (50) is
greater than the distance from the median to the maximum (35), so the data
are negatively skewed.
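The steps for the five number summary can be sketched in Python as follows. The function name five_number_summary is illustrative; it follows the rule stated in the steps, where the median is included in both halves when n is odd.

```python
from statistics import median

def five_number_summary(values):
    """Return (smallest, lower fourth, median, upper fourth, largest)."""
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2               # the median is counted in both halves when n is odd
    lower_half, upper_half = xs[:half], xs[n - half:]
    return xs[0], median(lower_half), median(xs), median(upper_half), xs[-1]

scores = [40, 52, 55, 60, 70, 75, 85, 85, 90, 90,
          92, 94, 94, 95, 98, 100, 115, 125, 125]

smallest, lower_fourth, med, upper_fourth, largest = five_number_summary(scores)
fourth_spread = upper_fourth - lower_fourth

# Skewness rule from the interpretation slide: compare median-to-min with median-to-max
if med - smallest > largest - med:
    shape = "negatively skewed"
elif med - smallest < largest - med:
    shape = "positively skewed"
else:
    shape = "symmetric"

print(smallest, lower_fourth, med, upper_fourth, largest, fourth_spread, shape)
```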
Mean

Mean for grouped data:

EXAMPLE
Class Interval Mid points (X)
200-400 300
400-600 500
600-800 700
800-1000 900= A
1000-1200 1100
1200-1400 1300
1400-1600 1500
total
Question.
Class Interval Mid points F d fd
(X)
200-400 300 500 -3 -1500
400-600 500 300 -2 -600
600-800 700 280 -1 -280
800-1000 900= A 120 0 0
1000-1200 1100 100 1 100
1200-1400 1300 80 2 160
1400-1600 1500 20 3 60
total 1400 -2060
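The table's d and fd columns suggest the step-deviation method, with assumed mean A = 900 and class width h = 200, so that d = (X - A)/h and mean = A + (sum of fd / N) * h. Treating that reading as an assumption, the sketch below completes the calculation and cross-checks it against the direct formula (sum of fX / N).

```python
midpoints = [300, 500, 700, 900, 1100, 1300, 1500]
freqs = [500, 300, 280, 120, 100, 80, 20]

A = 900   # assumed mean, marked "= A" in the table
h = 200   # class width

d = [(x - A) // h for x in midpoints]          # step deviations: -3 ... 3
fd = [f * di for f, di in zip(freqs, d)]       # matches the fd column
N = sum(freqs)

mean_step = A + sum(fd) / N * h
mean_direct = sum(f * x for f, x in zip(freqs, midpoints)) / N

print(N, sum(fd), round(mean_step, 2))
```

Both routes give a mean of about 605.71.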
Combined Mean

Boys Girls
Number 100 50
Mean Weight 60 kg 45 kg
Answer
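A minimal sketch of the calculation, assuming the usual combined-mean formula (N1 * mean1 + N2 * mean2) / (N1 + N2); the variable names are illustrative.

```python
n_boys, mean_boys = 100, 60     # number of boys, mean weight in kg
n_girls, mean_girls = 50, 45    # number of girls, mean weight in kg

# Each group mean is weighted by its group size
combined_mean = (n_boys * mean_boys + n_girls * mean_girls) / (n_boys + n_girls)

print(combined_mean)  # 55.0 kg
```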

Median

Question
Class interval f cf
18-22 120 120
22-26 125 245
26-30 280 525
30-34 260 785
34-38 155 940
38-42 184 1124
42-46 162 1286
46-50 86 1372
50-54 75 1447
54-58 53 1500

Median = size of (N/2)th observation = 1500/2 = 750th observation
Median lies in the class 30-34
Median = 30 + (750-525)/260 * 4 = 33.46
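The same interpolation, Median = L + ((N/2 - cf of preceding class) / f) * h, can be sketched in Python. The list names are illustrative; bounds holds the class limits from the table.

```python
from bisect import bisect_left
from itertools import accumulate

bounds = [18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58]   # class limits
freqs = [120, 125, 280, 260, 155, 184, 162, 86, 75, 53]

cf = list(accumulate(freqs))        # cumulative frequencies
N = cf[-1]

i = bisect_left(cf, N / 2)          # first class whose cf reaches N/2
L = bounds[i]                       # lower limit of the median class
h = bounds[i + 1] - bounds[i]       # class width
cf_prev = cf[i - 1] if i > 0 else 0

grouped_median = L + (N / 2 - cf_prev) / freqs[i] * h
print(L, round(grouped_median, 2))
```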
MODE

Question
Class interval f Cf
Below 60 12 12
60-62 18 30
62-64 25 55
64-66 30 85
66-68 10 95
68-70 3 98
70-72 2 100

Modal class = 64-66 (highest frequency, 30)
Mode = 64 + (30-25)/ (2*30-25-10) * 2 = 64.4
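The calculation uses Mode = L + (f1 - f0) / (2*f1 - f0 - f2) * h, where f1 is the frequency of the modal class and f0, f2 are the frequencies of the classes before and after it. A sketch, assuming the modal class is neither the first nor the last class; None stands for the unknown lower limit of the open class "Below 60".

```python
# (lower limit, upper limit, frequency); None marks the open-ended lower limit
classes = [(None, 60, 12), (60, 62, 18), (62, 64, 25), (64, 66, 30),
           (66, 68, 10), (68, 70, 3), (70, 72, 2)]

freqs = [f for _, _, f in classes]
i = freqs.index(max(freqs))                 # position of the modal class

lower, upper, f1 = classes[i]
f0, f2 = freqs[i - 1], freqs[i + 1]         # neighbouring class frequencies
h = upper - lower

mode = lower + (f1 - f0) / (2 * f1 - f0 - f2) * h
print(round(mode, 2))
```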


Quartiles

Question
CI f cf
Below 375 69 69
375-450 167 236 Q1
450-525 207 443
525-600 65 508 Q3
600-675 58 566
675-750 24 590
750-825 10 600
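The quartile classes marked in the table can be located and interpolated in the same way as the median, using the (N/4)th and (3N/4)th observations. The sketch below assumes that convention; grouped_quantile is an illustrative helper, and None stands for the unknown lower limit of "Below 375".

```python
def grouped_quantile(classes, p):
    """Interpolate the p-th quantile (0 < p < 1) from (lower, upper, freq) classes."""
    n = sum(f for _, _, f in classes)
    target = p * n
    cum = 0
    for lower, upper, f in classes:
        if cum + f >= target:
            if lower is None:
                raise ValueError("quantile falls in an open-ended class")
            return lower + (target - cum) / f * (upper - lower)
        cum += f

classes = [(None, 375, 69), (375, 450, 167), (450, 525, 207), (525, 600, 65),
           (600, 675, 58), (675, 750, 24), (750, 825, 10)]

q1 = grouped_quantile(classes, 0.25)   # 150th observation, class 375-450
q3 = grouped_quantile(classes, 0.75)   # 450th observation, class 525-600
print(round(q1, 2), round(q3, 2))
```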

Measures of Dispersion
• The degree to which the numerical data tend
to spread about an average value.
1. Range
2. Standard Deviation
3. Quartile Deviation
4. Mean Deviation
Range
• Simplest measure of dispersion
• = largest item – smallest item
Standard deviation

s.no. Xi
1 27.3
2 27.9
3 32.9
4 35.2
5 44.9
6 39.9
7 30
8 29.7
9 28.5
10 32
11 37.6
total 365.9
PRACTICE QUESTION
• OBTAIN THE SAMPLE STANDARD DEVIATION
FOR THE FOLLOWING DATA
s.no. Xi (xi-xbar) (xi-xbar)^2
1 27.3 -5.96 35.522
2 27.9 -5.36 28.73
3 32.9 -0.36 0.13
4 35.2 1.94 3.76
5 44.9 11.64 135.49
6 39.9 6.64 44.09
7 30 -3.26 10.62
8 29.7 -3.56 12.67
9 28.5 -4.76 22.65
10 32 -1.26 1.58
11 37.6 4.34 18.83
total 365.9 0 314.106
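• xbar = 365.9/11 = 33.26
• SAMPLE SD (DIVISOR n-1) = ROOT OF (314.106/10) = 5.60
• DIVIDING BY n INSTEAD: ROOT OF (314.106/11) = 5.34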
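The table can be checked with Python's standard statistics module. Note the two divisors: statistics.pstdev divides the sum of squared deviations by n, while statistics.stdev divides by n - 1 (the usual sample standard deviation).

```python
from statistics import mean, pstdev, stdev

x = [27.3, 27.9, 32.9, 35.2, 44.9, 39.9, 30, 29.7, 28.5, 32, 37.6]

xbar = mean(x)
sum_sq_dev = sum((xi - xbar) ** 2 for xi in x)   # the (xi-xbar)^2 column total

sd_divisor_n = pstdev(x)        # root of (sum_sq_dev / n)
sd_divisor_n_minus_1 = stdev(x) # root of (sum_sq_dev / (n - 1))

print(round(xbar, 2), round(sum_sq_dev, 3))
print(round(sd_divisor_n, 2), round(sd_divisor_n_minus_1, 2))
```

With these 11 values the n divisor gives about 5.34 and the n - 1 divisor about 5.60.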
Coefficient of Variation
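The coefficient of variation is a relative measure of dispersion: CV = (standard deviation / mean) * 100. As an illustration only, the sketch below applies it to the boys' and girls' weight data used in the combined standard deviation example; the function name is chosen here.

```python
from math import sqrt

def coefficient_of_variation(mean, variance):
    """CV in percent: standard deviation expressed relative to the mean."""
    return sqrt(variance) / mean * 100

cv_boys = coefficient_of_variation(mean=60, variance=9)    # SD = 3 kg
cv_girls = coefficient_of_variation(mean=45, variance=4)   # SD = 2 kg

print(round(cv_boys, 2), round(cv_girls, 2))
```

The group with the larger CV is the more variable one relative to its own mean, which allows groups with different means or units to be compared.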

Combined Standard Deviation

EXAMPLE
Boys Girls
Number 100 50
Mean Weight 60 kg 45 kg
Variance 9 4

• CALCULATE COMBINED S.D.
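A sketch of the calculation, assuming the usual combined-variance formula: combined variance = [N1(var1 + d1^2) + N2(var2 + d2^2)] / (N1 + N2), where d1 and d2 are the differences between each group mean and the combined mean. Variable names are illustrative.

```python
from math import sqrt

n1, mean1, var1 = 100, 60, 9    # boys
n2, mean2, var2 = 50, 45, 4     # girls

combined_mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)

d1 = mean1 - combined_mean      # deviation of boys' mean from combined mean
d2 = mean2 - combined_mean      # deviation of girls' mean from combined mean

combined_var = (n1 * (var1 + d1 ** 2) + n2 * (var2 + d2 ** 2)) / (n1 + n2)
combined_sd = sqrt(combined_var)

print(combined_mean, round(combined_var, 2), round(combined_sd, 2))
```

The combined S.D. (about 7.57 kg) is larger than either group's S.D. because the two group means are far apart.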




Quartile deviation
• Interquartile range = Q3-Q1
• Q.D. = (Q3-Q1) / 2
• It is an absolute measure of dispersion
• Example: obtain the quartile deviation for the
following data:
490, 540, 590, 600, 620, 650, 680, 770, 830, 840, 890,
900
• Here Q1= median of lower half of set (490, 540, 590,
600, 620, 650) = (590+ 600)/ 2 = 595
• And Q3 = median of upper half of set (680, 770, 830,
840, 890, 900) = (830 + 840)/2 = 835
• Hence Q.D. = (835-595)/ 2 = 120
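The same steps in Python, as a minimal sketch. It splits the sorted data into a lower and an upper half, as in the example; how to treat the middle value when n is odd is not covered by the example, so this version simply leaves it out of both halves.

```python
from statistics import median

values = sorted([490, 540, 590, 600, 620, 650, 680, 770, 830, 840, 890, 900])

half = len(values) // 2
q1 = median(values[:half])      # median of the lower half
q3 = median(values[-half:])     # median of the upper half

iqr = q3 - q1                   # interquartile range
qd = iqr / 2                    # quartile deviation

print(q1, q3, iqr, qd)
```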
Mean Deviation

Example
• Suppose you have the following set of exam
scores for a class:
• X = {68, 72, 75, 78, 80}
Steps to Calculate Mean Deviation:
• Calculate the Mean:
xbar = (68 + 72 + 75 + 78 + 80)/5 = 373/5 = 74.6
• Calculate the Deviation of Each Data Point from the Mean:
– Deviations: −6.6, −2.6, 0.4, 3.4, 5.4
• Calculate the Absolute Deviation for Each Data Point:
– Absolute Deviations: 6.6, 2.6, 0.4, 3.4, 5.4
• Calculate the Mean Deviation:
Mean Deviation = (6.6 + 2.6 + 0.4 + 3.4 + 5.4)/5 = 18.4/5 = 3.68
Interpretation:
• In practical terms, a mean deviation of 3.68
suggests that, on average, each exam score
deviates from the mean score by
approximately 3.68 points.
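The steps for the mean deviation about the mean translate directly into Python; the variable names in this sketch are illustrative.

```python
scores = [68, 72, 75, 78, 80]

xbar = sum(scores) / len(scores)                      # step 1: mean
deviations = [x - xbar for x in scores]               # step 2: deviations from the mean
abs_deviations = [abs(d) for d in deviations]         # step 3: absolute deviations
mean_deviation = sum(abs_deviations) / len(scores)    # step 4: their average

print(xbar, round(mean_deviation, 2))
```

The signed deviations always sum to zero, which is why the absolute values are averaged instead.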
Skewness
• Measures the degree of asymmetry of the
distribution.
• 0 skewness- perfectly symmetrical
• + ve skewness- right tail is more prominent
• -ve skewness- left tail is more prominent
Measure of Skewness

KURTOSIS- peakedness of the frequency distribution/ curve
Flat curve, lighter tails- platykurtic
Normal curve- mesokurtic
Sharp peak, heavier tails- leptokurtic
Measure of Kurtosis

Question
xi
32
36
36
37
39
41
45
46
48
xi Xi- xbar (Xi- xbar)^2 (Xi- xbar)^3 (Xi- xbar)^4
32 -8 64 -512 4096
36 -4 16 -64 256
36 -4 16 -64 256
37 -3 9 -27 81
39 -1 1 -1 1
41 1 1 1 1
45 5 25 125 625
46 6 36 216 1296
48 8 64 512 4096
total 360 0 232 186 10708

Mean = xbar = 360/ 9= 40
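The column totals give the central moments mu2 = 232/9, mu3 = 186/9 and mu4 = 10708/9. Assuming the moment-based measures are intended here (beta1 = mu3^2 / mu2^3 and gamma1 = mu3 / mu2^1.5 for skewness, beta2 = mu4 / mu2^2 for kurtosis), the question can be completed as in the sketch below; central_moment is an illustrative helper.

```python
x = [32, 36, 36, 37, 39, 41, 45, 46, 48]
n = len(x)
xbar = sum(x) / n

def central_moment(values, mean, r):
    """r-th moment about the mean, with divisor n."""
    return sum((v - mean) ** r for v in values) / len(values)

mu2 = central_moment(x, xbar, 2)
mu3 = central_moment(x, xbar, 3)
mu4 = central_moment(x, xbar, 4)

beta1 = mu3 ** 2 / mu2 ** 3      # skewness measure (always non-negative)
gamma1 = mu3 / mu2 ** 1.5        # signed skewness
beta2 = mu4 / mu2 ** 2           # kurtosis measure; 3 for a normal curve
gamma2 = beta2 - 3               # excess kurtosis

print(round(gamma1, 3), round(beta2, 3), round(gamma2, 3))
```

Under these assumptions gamma1 is slightly positive (a mild right skew) and beta2 is below 3, which corresponds to a platykurtic distribution.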

