Descriptive Statistics
TEXTBOOKS (REQUIRED MATERIALS)
1. Statistics for Business & Economics by David R. Anderson; Dennis J. Sweeney; Thomas A. Williams; Jeffrey
D. Camm; James J. Cochran. Cengage Learning
Additional References
• Aczel, A. D., & Sounderpandian, J. (1999). Complete business statistics. Boston, MA: Irwin/McGraw Hill.
• Business Statistics for Contemporary Decision Making. Ken Black. Wiley India.
• Lecture Notes (Notes will be distributed each week by the faculty and/or shared through google classroom.)
In God we trust, all others must bring data
- W Edwards Deming
Components of Analytics
• Descriptive analytics: data synthesis and visualization
• Predictive analytics: predicting future events
• Prescriptive analytics: optimization and decision making
Descriptive Predictive Prescriptive
• Most shoppers turn towards the right when they enter a retail store.
• Descriptive statistics
• Collect data (e.g., survey)
• Present data (e.g., tables and graphs)
• Summarize data (e.g., sample mean)
• Inferential statistics
• Drawing conclusions about a population based only on sample data
Descriptive Statistics
• Most of the statistical information in newspapers, magazines, company
reports, and other publications consists of data that are summarized and
presented in a form that is easy to understand.
• Such summaries of data, which may be tabular, graphical, or numerical, are
referred to as descriptive statistics.
Example
The manager of Honda Auto would like to have a better understanding of the
cost of parts used in the engine tune-ups performed in her/his shop. She/he
examines 50 customer invoices for tune-ups. The costs of parts, rounded to
the nearest Indian Rs, are listed on the next slide.
Example: Honda Auto Repair
Sample of Parts Cost (Million Indian Rs) for 50 Tune-ups
91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73
Inferential Statistics
Population: The set of all elements of interest in a particular study.
Sample: A subset of the population.
Data Set
Company        Stock Exchange   Annual Sales ($M)   Earnings per share ($)
Dataram        NQ                73.10              0.86
EnergySouth    N                 74.00              1.67
Keystone       N                365.70              0.86
LandCare       NQ               111.40              0.33
Psychemedics   N                 17.60              0.13
Element names: the companies in the first column. Observation: one row of the table (for example, EnergySouth).
Data and Data Sets
• Data are the facts and figures collected, analyzed, and summarized for
presentation and interpretation.
• All the data collected in a particular study are referred to as the data
set for the study.
• Data: Collections of any number of related observations.
Elements, Variables, and Observations
• Elements are the entities on which data are collected.
• A variable is a characteristic of interest for the elements.
• The set of measurements obtained for a particular element is called
an observation.
• A data set with n elements contains n observations.
• The total number of data values in a complete data set is the number
of elements multiplied by the number of variables.
Structured and Unstructured Data
• Structured data are data arranged in matrix form with labelled rows
and columns.
• Any data not originally in matrix form with rows and columns are
unstructured data.
Categorical and Quantitative Data
• Data can be further classified as being categorical or quantitative.
• The statistical analysis that is appropriate depends on whether the
data for the variable are categorical or quantitative.
• In general, there are more alternatives for statistical analysis when
the data are quantitative.
Categorical Data
• Labels or names are used to identify an attribute of each element
• Often referred to as qualitative data
• Use either the nominal or ordinal scale of measurement
• Can be either numeric or nonnumeric
• Appropriate statistical analyses are rather limited
Quantitative Data
• Quantitative data indicate how many or how much.
• Quantitative data are always numeric.
• Ordinary arithmetic operations are meaningful for quantitative data.
Sources of data
• Primary data:
• Primary data are collected by the investigator for the purpose of a specific inquiry
or study. Such data are original in character and are generated by surveys
conducted by individuals, research institutions, or other organisations.
• Direct personal interviews.
• Indirect Oral interviews.
• Information from correspondents.
• Mailed questionnaire method.
• Schedules sent through enumerators.
• Secondary data:
• Secondary data are data that have already been collected and analysed by some
earlier agency for its own use and are later used by a different agency.
• Published sources, and
• Unpublished sources.
Data Sources
Data Available From Selected Government Agencies
Agency                       Website                   Data available
Federal Reserve Board        www.federalreserve.gov    Data on money supply, exchange rates, discount rates
Office of Mgmt. & Budget     www.whitehouse.gov/omb    Data on revenue, expenditures, debt of federal government
Department of Commerce       www.doc.gov               Data on business activity, value of shipments, profit by industry
Bureau of Labor Statistics   www.bls.gov               Customer spending, unemployment rate, hourly earnings, safety record
Data Type
Mean
$$\text{Mean} = \bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$$
• The symbol $\bar{X}$ is frequently used to represent the estimated value of the mean
from a sample.
• If the entire population is available and we calculate the mean based on the entire
population, then we have the population mean, which is denoted by $\mu$.
• For the salary data in the table, the average salary is given by
$$\bar{X} = \frac{(270 + 220 + 240 + 250 + 180 + 300 + 240 + 235 + 425 + 240) \times 1000}{10} = 260000$$
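The arithmetic above can be checked with a few lines of Python. This is a minimal sketch using the ten salary figures from the worked example; the helper name `mean` is chosen here for illustration.

```python
# Salary figures from the worked example, before the factor of 1000 is applied.
salaries = [270, 220, 240, 250, 180, 300, 240, 235, 425, 240]


def mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(values) / len(values)


average_salary = mean(salaries) * 1000
print(average_salary)  # 260000.0
```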
Property of Mean
An important property of the mean is that the deviations of the observations from
the mean sum to zero, that is,
$$\sum_{i=1}^{n} (X_i - \bar{X}) = 0$$
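The property follows directly from the definition of the sample mean. A short derivation, using the same symbols as above:

```latex
\begin{aligned}
\sum_{i=1}^{n} (X_i - \bar{X})
  &= \sum_{i=1}^{n} X_i - n\bar{X} \\
  &= n\bar{X} - n\bar{X}
     \qquad \text{since } \bar{X} = \tfrac{1}{n}\sum_{i=1}^{n} X_i \\
  &= 0
\end{aligned}
```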
Median (or Mid) Value
• The median is the value that divides the data into two equal parts, that is, 50% of the
observations lie below the median and 50% lie above it.
• The easiest way to find the median is to arrange the data in increasing order. When n is
odd, the median is the value at position (n + 1)/2. When n is even, the median is the
average of the (n/2)th and ((n + 2)/2)th observations in the ordered data.
• Ex: The number of deposits in a branch of a bank in a week is
Day                  1    2    3    4    5    6    7
Number of deposits   245  326  180  226  445  319  260
• In ascending order, the data in the table are 180, 226, 245, 260, 319, 326, and 445.
• Now (n + 1)/2 = (8/2) = 4. Thus the median is the 4th value in the data after arranging them
in increasing order; in this case it is 260.
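The odd and even cases of the median rule can be sketched in Python as follows. The function name `median` is illustrative; the deposit counts are the seven values from the example.

```python
def median(values):
    """Median by sorting: middle value for odd n, mean of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # 1-based position (n + 1) / 2 is 0-based index n // 2.
        return ordered[mid]
    # 1-based positions n/2 and n/2 + 1 are 0-based indices mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2


deposits = [245, 326, 180, 226, 445, 319, 260]
print(median(deposits))  # 260
```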
Mode
• Mode is the most frequently occurring value in the dataset
• Mode is the only measure of central tendency which is valid for qualitative (nominal) data
since the mean and median for nominal data are meaningless.
• For example, assume that a retailer's customer data record the marital status of each
customer as (a) Married, (b) Unmarried, (c) Divorced Male, or (d) Divorced Female.
Mean and median are meaningless for qualitative data such as marital status. The mode,
on the other hand, captures the marital status that occurs most frequently in the
database.
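A small Python sketch shows how the mode can be found for nominal data. The customer records below are hypothetical and only reuse the category labels from the example; the function returns a list because a data set can have more than one mode.

```python
from collections import Counter

# Hypothetical customer records (made up for illustration).
marital_status = [
    "Married", "Unmarried", "Married", "Divorced Male",
    "Married", "Divorced Female", "Unmarried",
]


def modes(values):
    """Return every value that attains the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


print(modes(marital_status))  # ['Married']
```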
Measures of Variation
• Predictive analytics techniques such as regression attempt to explain variation
in the outcome variable (Y) using predictor variables (X)
• Variability in the data is measured using the following measures:
• Range
• Inter-Quartile Distance (IQD)
• Variance
• Standard Deviation
Sample Variance
• For a sample, the sample variance ($S^2$) is calculated using
$$S^2 = \sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{n - 1}$$
• While calculating the sample variance $S^2$, the sum of squared deviations
$\sum_{i=1}^{n} (X_i - \bar{X})^2$ is divided by (n - 1); this is known as Bessel's correction.
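As an illustration, the sample variance with Bessel's correction can be computed directly from its definition. This sketch reuses the seven deposit counts from the median example and cross-checks the result against `statistics.variance` from the Python standard library, which also divides by n - 1.

```python
import statistics

deposits = [245, 326, 180, 226, 445, 319, 260]


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = sum((x - x_bar) ** 2 for x in values)
    return squared_deviations / (n - 1)


s2 = sample_variance(deposits)
# The hand-written formula and the library function should agree.
print(abs(s2 - statistics.variance(deposits)) < 1e-9)  # True
```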
Range, IQD and Variance
• Range is the difference between maximum and minimum value of the
data. It captures the data spread.
• Inter-quartile distance (IQD), also called inter-quartile range (IQR), is the
distance between Quartile 1 (Q1) and Quartile 3 (Q3).
• Variance is a measure of variability in the data from the mean value.
The variance of a population, $\sigma^2$, is calculated using
$$\sigma^2 = \sum_{i=1}^{n} \frac{(X_i - \mu)^2}{n}$$
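A brief sketch of the range and the inter-quartile distance, again using the deposit counts from the median example. Quartile conventions differ between textbooks and software; the choice of `statistics.quantiles` with `method="exclusive"`, which places the i-th quartile at position i(n + 1)/4 of the sorted data, is an assumption made here.

```python
import statistics

deposits = [245, 326, 180, 226, 445, 319, 260]

# Range: difference between the largest and smallest observation.
data_range = max(deposits) - min(deposits)

# Quartiles under the (n + 1)-based position rule (assumed convention).
q1, q2, q3 = statistics.quantiles(deposits, n=4, method="exclusive")
iqr = q3 - q1

print(data_range)   # 265
print(q1, q3, iqr)  # 226.0 326.0 100.0
```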
Standard Deviation
The population standard deviation ($\sigma$) and sample standard deviation ($S$) are given by
$$\sigma = \sqrt{\sum_{i=1}^{n} \frac{(X_i - \mu)^2}{n}} \qquad S = \sqrt{\sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{n - 1}}$$
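The two formulas differ only in the divisor, which a short Python comparison makes visible. Treating the seven deposit counts from the median example once as a full population and once as a sample is an assumption made purely for illustration.

```python
import math
import statistics

deposits = [245, 326, 180, 226, 445, 319, 260]

n = len(deposits)
mean = sum(deposits) / n
sum_sq = sum((x - mean) ** 2 for x in deposits)

sigma = math.sqrt(sum_sq / n)    # population formula: divide by n
s = math.sqrt(sum_sq / (n - 1))  # sample formula: divide by n - 1

# The sample value is always the larger of the two for the same data.
print(s > sigma)  # True
```

The standard library functions `statistics.pstdev` and `statistics.stdev` implement the population and sample versions respectively.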
Degrees of Freedom
Histogram
• A histogram is a visual representation of the data that can be used to assess the
probability distribution (frequency distribution) of the data.
Step 1: Divide the data into non-overlapping, consecutive bins (intervals). The number of
bins is N = (Xmax - Xmin)/W.
Here Xmax and Xmin are the maximum and minimum values of the data and
W is the desired width of the bin (interval). Intervals in histograms are usually of equal size.
Step 2: Count the number of observations from the data that fall under each bin (interval).
Step 3: Create a frequency distribution (bins on the horizontal axis and frequency on the
vertical axis) using the information obtained in Steps 1 and 2.
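The binning steps can be sketched in plain Python. The data are the 40 parts-cost values listed in the tune-up example of these notes. The bin width of 10, rounding the number of bins up to a whole number, and starting the first bin at the minimum value are all assumptions made for this sketch.

```python
import math

# The 40 parts-cost values listed in the tune-up example.
parts_cost = [
    91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
    71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
    85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
    62, 82, 98, 101, 79, 105, 79, 69, 62, 73,
]


def histogram_counts(values, bin_width):
    """Return (left_edge, count) pairs for equal-width bins starting at the minimum."""
    x_min, x_max = min(values), max(values)
    # Step 1: number of bins, (x_max - x_min) / W rounded up (assumption).
    n_bins = max(1, math.ceil((x_max - x_min) / bin_width))
    counts = [0] * n_bins
    # Step 2: count the observations falling in each bin [edge, edge + W).
    for x in values:
        index = min(int((x - x_min) // bin_width), n_bins - 1)
        counts[index] += 1
    # Step 3: pair each bin's left edge with its frequency.
    edges = [x_min + i * bin_width for i in range(n_bins)]
    return list(zip(edges, counts))


# Text rendering: one row per bin, one star per observation.
for left_edge, count in histogram_counts(parts_cost, 10):
    print(f"{left_edge:>3} to {left_edge + 10:>3}: {'*' * count}")
```

A plotting library would normally draw the bars; the text rendering here keeps the sketch free of dependencies.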
Use of Histogram
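• The following formula for sample skewness is commonly used for a sample with n
observations (Joanes and Gill, 1998):
$$G_1 = \frac{\sqrt{n(n - 1)}}{n - 2} \, g_1$$
where $g_1$ is the moment-based skewness coefficient of the data.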
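A minimal Python sketch of this adjustment follows. The notes do not define g1 at this point; the sketch assumes the usual moment-based definition g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments computed with divisor n.

```python
import math


def skewness_g1(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def skewness_G1(values):
    """Adjusted sample skewness G1 = sqrt(n (n - 1)) / (n - 2) * g1."""
    n = len(values)
    return math.sqrt(n * (n - 1)) / (n - 2) * skewness_g1(values)


# Deposit counts from the median example: one large value pulls the tail right.
deposits = [245, 326, 180, 226, 445, 319, 260]
print(skewness_G1(deposits) > 0)  # True
```

The adjustment needs at least three observations, since the factor divides by n - 2.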
Kurtosis
• Kurtosis is another measure of shape, aimed at shape of the tail, that is,
whether the tail of the data distribution is heavy or light. Kurtosis is
measured using the following equation:
$$\text{Kurtosis} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^4 / n}{\sigma^4}$$
$$\text{Excess Kurtosis} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^4 / n}{\sigma^4} - 3$$
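The kurtosis calculation can be sketched in Python as the fourth central moment divided by the fourth power of the standard deviation. Computing the standard deviation with divisor n is an assumption of this sketch; the function names are illustrative.

```python
def kurtosis(values):
    """Fourth central moment divided by sigma**4, with sigma**2 computed using divisor n."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # sigma squared
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2


def excess_kurtosis(values):
    """Kurtosis minus 3, so that a normal distribution has excess kurtosis 0."""
    return kurtosis(values) - 3


# Evenly spaced values have light tails, so the excess kurtosis is negative.
print(excess_kurtosis([1, 2, 3, 4, 5]) < 0)  # True
```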
Chebyshev’s Theorem
$$P(\mu - k\sigma \le X \le \mu + k\sigma) \ge 1 - \frac{1}{k^2}$$
• Ex: The amount spent per month by a segment of credit card users of a bank has a
mean of 12000 and a standard deviation of 2000. What proportion of customers
spend between 8000 and 16000?
• Solution:
$$P(8000 \le X \le 16000) = P(\mu - 2\sigma \le X \le \mu + 2\sigma) \ge 1 - \frac{1}{2^2} = 0.75$$
That is, the proportion of customers spending between 8000 and 16000 is at least 0.75 (or 75%)
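The same bound can be computed for any interval that is symmetric about the mean. The sketch below uses a hypothetical helper, `chebyshev_lower_bound`, which is not part of any library.

```python
def chebyshev_lower_bound(mean, std_dev, low, high):
    """Lower bound on P(low <= X <= high) for an interval symmetric about the mean."""
    k_low = (mean - low) / std_dev
    k_high = (high - mean) / std_dev
    if abs(k_low - k_high) > 1e-9:
        raise ValueError("interval must be symmetric about the mean")
    k = k_high
    if k <= 1:
        # For k <= 1 the bound 1 - 1/k^2 is not positive, so it says nothing useful.
        return 0.0
    return 1 - 1 / k ** 2


print(chebyshev_lower_bound(12000, 2000, 8000, 16000))  # 0.75
```

The result is only a lower bound; it holds for any distribution with the given mean and standard deviation.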
Example (Percentile Calculation)
Time between failures of wire-cut (in hours)
2 22 32 39 46 56 76 79 88 93
3 24 33 44 46 66 77 79 89 99
5 24 34 45 47 67 77 86 89 99
9 26 37 45 55 67 78 86 89 99
21 31 39 46 56 75 78 87 90 102
1. Calculate the mean, median, and mode of time between failures of wire-cuts
2. The company would like to know by what time 10% (10th percentile, P10) and
90% (90th percentile, P90) of the wire-cuts will fail.
3. Calculate the values of P25 and P75.
Solution
1. Mean = 57.64, median = 56, and mode = 46 (in the data as listed, 89 and 99 also occur three times each, so the mode is not unique)
2. Instead of rounding the position obtained from the percentile equation, we can interpolate
between neighbouring values. Position of P10 = 10 × (50 + 1)/100 = 5.1
Value at 5th position is 21. Value at position 5.1 is approximated as 21 + 0.1 × (value at 6th
position – value at 5th position) = 21 + 0.1(1) = 21.1
Position of P90 = 90 × 51/100 = 45.9
The value at position 45 is 90, so the value at position 45.9 is approximated as 90 + 0.9 × (value
at 46th position - value at 45th position) = 90 + 0.9 × 3 = 92.7
That is, 90% of the wire-cuts will fail by 92.7 hours
3. Position of P25 (1st quartile, Q1) = 25 × 51/100 = 12.75. The value at the 12th position is 33, so
P25 = 33 + 0.75 × (value at 13th position - value at 12th position) = 33 + 0.75 × 1 = 33.75
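The interpolation rule used in this solution can be written as a short Python function. It is a sketch of the position rule p(n + 1)/100 with linear interpolation between neighbouring ordered values; the function name and the clamping at the ends of the data are choices made here. The same rule also gives P75, which the worked solution does not compute.

```python
# Time between failures (hours), rows as listed in the example table.
times = sorted([
    2, 22, 32, 39, 46, 56, 76, 79, 88, 93,
    3, 24, 33, 44, 46, 66, 77, 79, 89, 99,
    5, 24, 34, 45, 47, 67, 77, 86, 89, 99,
    9, 26, 37, 45, 55, 67, 78, 86, 89, 99,
    21, 31, 39, 46, 56, 75, 78, 87, 90, 102,
])


def percentile(sorted_values, p):
    """Value at 1-based position p * (n + 1) / 100, interpolating linearly."""
    n = len(sorted_values)
    position = p * (n + 1) / 100
    lower = int(position)          # whole part of the position
    fraction = position - lower    # fractional part used for interpolation
    if lower < 1:
        return sorted_values[0]    # position falls before the first value
    if lower >= n:
        return sorted_values[-1]   # position falls at or after the last value
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + fraction * (above - below)


for p in (10, 25, 75, 90):
    print(p, round(percentile(times, p), 2))
# 10 21.1
# 25 33.75
# 75 86.0
# 90 92.7
```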
Pie Chart
• Pie chart is mainly used for categorical data and is a circular chart that
displays the proportion of each category in the dataset
Scatter Plot
• A scatter plot is a plot of two variables that helps data scientists
understand whether there is any relationship between the two variables.
• A scatter plot is also useful for assessing the strength of the relationship
and for finding outliers in the data.
Box Plot (or Box and Whisker Plot)
• The box plot is constructed using IQR, minimum and maximum values
[Figure: Box plot of Bollywood movie budgets]