
Business Statistics

Descriptive Statistics
TEXTBOOKS (REQUIRED MATERIALS)

1. Statistics for Business & Economics by David R. Anderson; Dennis J. Sweeney; Thomas A. Williams; Jeffrey
D. Camm; James J. Cochran. Cengage Learning

Additional References
• Aczel, A. D., & Sounderpandian, J. (1999). Complete business statistics. Boston, MA: Irwin/McGraw Hill.

• Business Statistics for Contemporary Decision Making. Ken Black. Wiley India.

• Statistics for Management. Richard I. Levin & David S. Rubin. Pearson.

• Lecture Notes (Notes will be distributed each week by the faculty and/or shared through Google Classroom.)
In God we trust; all others must bring data.
- W. Edwards Deming
Components of Analytics

• Descriptive Analytics: data synthesis and visualization. Answers the question "What happened?"
• Predictive Analytics: predicting future events. Answers the question "What will happen?"
• Prescriptive Analytics: optimization and decision making. Answers the question "What action to take?"


Descriptive Applications

• Most shoppers turn towards the right when they enter a retail store.
• The conversion rate of women shoppers is higher than that of male shoppers among electronic gadget purchasers.
• Strawberry Pop-Tarts sell seven times more during hurricanes than during regular periods.
• Women car buyers prefer women salespersons.

Predictive Problems

• Which product is the customer likely to buy in his/her next purchase (recommender system)?
• Which customer is likely to default on his/her loan payment?
• Who is likely to cancel a product ordered through an e-commerce portal?
Prescriptive Problems

• What is the optimal product mix?
• What is the optimal route for a delivery truck?
• What is the best markdown pricing for fashion products?
• What is the optimal assignment of aircraft to flights?
• How should a company manage its fleet of vehicles for employee drop and pick-up?
Stages of Statistics
• Collection of data:
• This is the first step and the foundation on which the entire study rests. Careful planning is essential before collecting the data. There are different methods of data collection, such as census, sampling, primary, and secondary, and the investigator should use the correct method.
• Presentation of data:
• The mass of data collected should be presented in a suitable, concise form for further analysis. The collected data may be presented in tabular, diagrammatic, or graphic form.
• Analysis of data:
• The presented data should be carefully analysed to draw inferences, using measures such as central tendency, dispersion, correlation, and regression.
• Interpretation of data:
• The final step is drawing conclusions from the collected data. A valid conclusion must be drawn on the basis of the analysis. A high degree of skill and experience is necessary for the interpretation.
Two Branches of Statistics

• Descriptive statistics
• Collect data (e.g., survey)
• Present data (e.g., tables and graphs)
• Summarize data (e.g., sample mean)

• Inferential statistics
• Drawing conclusions about a population based only on sample data
Descriptive Statistics
• Most of the statistical information in newspapers, magazines, company
reports, and other publications consists of data that are summarized and
presented in a form that is easy to understand.
• Such summaries of data, which may be tabular, graphical, or numerical, are
referred to as descriptive statistics.

Example
The manager of Honda Auto would like to have a better understanding of the
cost of parts used in the engine tune-ups performed in her/his shop. She/he
examines 50 customer invoices for tune-ups. The costs of parts, rounded to
the nearest Indian Rs, are listed on the next slide.
Example: Honda Auto Repair
Sample of Parts Costs (Indian Rs) for 50 Tune-ups

91 78 93 57 75 52 99 80 97 62

71 69 72 89 66 75 79 75 72 76

104 74 62 68 97 105 77 65 80 109

85 97 88 68 83 68 71 69 67 74

62 82 98 101 79 105 79 69 62 73
Inferential Statistics
Population: The set of all elements of interest in a particular study.
Sample: A subset of the population.
Inferential statistics: The process of using data obtained from a sample to make estimates and test hypotheses about the characteristics of a population.
Census: Collecting data for the entire population.
Sample survey: Collecting data for a sample.
Process of Statistical Inference
Example: Honda Auto

• Step 1: The population consists of all tune-ups. The average cost of parts is unknown.
• Step 2: A sample of 50 engine tune-ups is examined.
• Step 3: The sample data provide a sample average parts cost of Rs 79 per tune-up.
• Step 4: The sample average is used to estimate the population average.
Data, Data Sets, Elements, Variables, and Observations

Company (element names)   Stock Exchange   Annual Sales ($M)   Earnings per Share ($)
Dataram                   NQ               73.10               0.86
EnergySouth               N                74.00               1.67
Keystone                  N                365.70              0.86
LandCare                  NQ               111.40              0.33
Psychemedics              N                17.60               0.13

The companies are the elements; Stock Exchange, Annual Sales, and Earnings per Share are the variables; each row is an observation; the entire table is the data set.
Data and Data Sets
• Data are the facts and figures collected, analyzed, and summarized for
presentation and interpretation.
• All the data collected in a particular study are referred to as the data
set for the study.
• Data: Collections of any number of related observations.
Elements, Variables, and Observations
• Elements are the entities on which data are collected.
• A variable is a characteristic of interest for the elements.
• The set of measurements obtained for a particular element is called
an observation.
• A data set with n elements contains n observations.
• The total number of data values in a complete data set is the number
of elements multiplied by the number of variables.
Structured and Unstructured Data
• Structured data means that the data is described in a matrix form with labelled rows and columns.
• Any data that is not originally in matrix form with rows and columns is unstructured data.
Categorical and Quantitative Data
• Data can be further classified as being categorical or quantitative.
• The statistical analysis that is appropriate depends on whether the
data for the variable are categorical or quantitative.
• In general, there are more alternatives for statistical analysis when
the data are quantitative.
Categorical Data
• Labels or names are used to identify an attribute of each element
• Often referred to as qualitative data
• Use either the nominal or ordinal scale of measurement
• Can be either numeric or nonnumeric
• Appropriate statistical analyses are rather limited
Quantitative Data
• Quantitative data indicate how many or how much.
• Quantitative data are always numeric.
• Ordinary arithmetic operations are meaningful for quantitative data.
Sources of data
• Primary data:
• Primary data are collected by the investigator himself/herself for a specific inquiry or study. Such data are original in character and are generated by surveys conducted by individuals, research institutions, or other organisations.
• Direct personal interviews.
• Indirect oral interviews.
• Information from correspondents.
• Mailed questionnaire method.
• Schedules sent through enumerators.

• Secondary data:
• Secondary data are data that have already been collected and analysed by some earlier agency for its own use, and the same data are later used by a different agency.
• Published sources, and
• Unpublished sources.
Data Sources
Data Available From Selected Government Agencies

Government Agency Web address Some of the Data Available


Census Bureau www.census.gov Population data, number of households, household income

Federal Reserve Board www.federalreserve.gov Data on money supply, exchange rates, discount rates

Office of Mgmt. & Budget www.whitehouse.gov/omb Data on revenue, expenditures, debt of federal government

Department of Commerce www.doc.gov Data on business activity, value of shipments, profit by industry

Bureau of Labor Statistics www.bls.gov Customer spending, unemployment rate, hourly earnings, safety record
Data Types

• Cross-Sectional Data: Data collected on many variables of interest at the same point or period of time are called cross-sectional data.
• Time Series Data: Data collected for a single variable, such as demand for smartphones, over several time intervals (weekly, monthly, etc.) are called time series data.
• Panel Data: Data collected on several variables (multiple dimensions) over several time intervals are called panel data (also known as longitudinal data).
TYPES OF DATA MEASUREMENT SCALES

• Nominal scale refers to variables that are basically names (qualitative data); such variables are also known as categorical variables.
• Ordinal scale refers to a variable in which the value of the data is drawn from an ordered set and is recorded in order of magnitude.
• Interval scale corresponds to a variable in which the value is chosen from an interval set. Variables such as temperature (measured in centigrade) or intelligence quotient (IQ) score are examples of interval scale.
• Any variable for which ratios can be computed and are meaningful is measured on a ratio scale.
Population And Sample

• Population is the set of all possible observations (often called cases, records, subjects, or data points) for a given context of the problem.
• Sample is a subset taken from a population.
Descriptive Statistics
Measures Of Central Tendency: Mean, median, mode

• Mean (or Average) Value
• Mean is the arithmetic average of the data and is one of the most frequently used measures of central tendency:

$\text{Mean} = \bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$
Mean
• The symbol $\bar{X}$ is frequently used to represent the estimated value of the mean from a sample.
• If the entire population is available and we calculate the mean from the entire population, then we have the population mean, denoted by $\mu$.
• In the following table, the average salary is given by

$\bar{X} = \frac{(270 + 220 + 240 + 250 + 180 + 300 + 240 + 235 + 425 + 240) \times 1000}{10} = 260{,}000$

Property of Mean
An important property of the mean is that the sum of deviations of the observations from the mean is zero, that is,

$\sum_{i=1}^{n} \left( X_i - \bar{X} \right) = 0$
Median (or Mid) Value

• Median is the value that divides the data into two equal parts, that is, the proportion of
observations below median and above median will be 50%.
• The easiest way to find the median is to arrange the data in increasing order; the median is the value at position (n + 1)/2 when n is odd. When n is even, the median is the average of the (n/2)th and ((n + 2)/2)th observations after arranging the data in increasing order.

• Example:
• The number of deposits in a branch of a bank in a week is:

Day                  1    2    3    4    5    6    7
Number of Deposits   245  326  180  226  445  319  260

• The ascending order of the data is 180, 226, 245, 260, 319, 326, and 445.
• Now (n + 1)/2 = 8/2 = 4. Thus the median is the 4th value after arranging the data in increasing order; in this case it is 260.
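A short Python sketch of the median rule described above, handling both odd and even n (plain standard-library code, for illustration only):

```python
def median(values):
    """Median via the (n + 1)/2 position rule described above."""
    data = sorted(values)
    n = len(data)
    if n % 2 == 1:                      # odd n: take the middle value
        return data[(n + 1) // 2 - 1]   # -1 adjusts for 0-based indexing
    mid = n // 2                        # even n: average the two middle values
    return (data[mid - 1] + data[mid]) / 2

deposits = [245, 326, 180, 226, 445, 319, 260]
print(median(deposits))  # 260
```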
Mode
• Mode is the most frequently occurring value in the dataset

• Mode is the only measure of central tendency that is valid for qualitative (nominal) data, since the mean and median of nominal data are meaningless.

• For example, assume that a retailer's customer data includes the marital status of each customer: (a) Married, (b) Unmarried, (c) Divorced Male, and (d) Divorced Female. Mean and median are meaningless when applied to qualitative data such as marital status. The mode, on the other hand, captures the customer type (in terms of marital status) that occurs most frequently in the database.
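A minimal sketch of computing the mode of a categorical variable in Python using collections.Counter; the marital-status values below are made up for illustration:

```python
from collections import Counter

# Hypothetical marital-status column from a customer table
marital_status = ["Married", "Unmarried", "Married", "Divorced Female",
                  "Married", "Unmarried", "Divorced Male", "Married"]

counts = Counter(marital_status)
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # Married 4
```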
Measures of Variation
• Predictive analytics techniques such as regression attempt to explain variation
in the outcome variable (Y) using predictor variables (X)
• Variability in the data is measured using the following measures:
• Range
• Inter-Quartile Distance (IQD)
• Variance
• Standard Deviation
Sample Variance
• In the case of a sample, the sample variance $S^2$ is calculated using

$S^2 = \frac{\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2}{n - 1}$

• While calculating the sample variance $S^2$, the sum of squared deviations $\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2$ is divided by $(n - 1)$; this is known as Bessel's correction.
Range, IQD and Variance
• Range is the difference between the maximum and minimum values of the data. It captures the data spread.
• Inter-quartile distance (IQD), also called inter-quartile range (IQR), is a measure of the distance between Quartile 1 (Q1) and Quartile 3 (Q3).
• Variance is a measure of variability in the data from the mean value. The population variance, $\sigma^2$, is calculated using

$\text{Variance} = \sigma^2 = \frac{\sum_{i=1}^{n} \left( X_i - \mu \right)^2}{n}$
Standard Deviation
The population standard deviation ($\sigma$) and sample standard deviation ($S$) are given by

$\sigma = \sqrt{\frac{\sum_{i=1}^{n} \left( X_i - \mu \right)^2}{n}}, \qquad S = \sqrt{\frac{\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2}{n - 1}}$
Degrees of Freedom

• Degrees of freedom is equal to the number of independent variables in the model (Trochim, 2005).

• For example, we can create any sample of size n with a mean value of $\bar{X}$ by randomly selecting (n - 1) values; the remaining value is then fixed. Thus the number of independent values in this case is (n - 1).

• Degrees of freedom is defined as the difference between the number of observations in the sample and the number of parameters estimated (Walker, 1940; Toothaker and Miller, 1996). If there are n observations in the sample and k parameters are estimated from the sample, then the degrees of freedom is (n - k).
Data Visualization
• There are many useful charts, such as the histogram, bar chart, pie chart, and box plot, that assist data scientists with visualization of the data.

Histogram
• A histogram is a visual representation of the data that can be used to assess the probability distribution (frequency distribution) of the data.
• Histograms are created for continuous (numerical) data.
• A histogram is a frequency distribution of the data arranged in consecutive, non-overlapping intervals.
Steps to Construct Histograms
Step 1: Divide the data into a finite number of non-overlapping, consecutive bins (intervals):

Number of bins, $N = \frac{X_{\max} - X_{\min}}{W}$

Here $X_{\max}$ and $X_{\min}$ are the maximum and minimum values of the data and W is the desired width of the bin (interval). Intervals in histograms are usually of equal size.

Step 2: Count the number of observations from the data that fall in each bin (interval).

Step 3: Create a frequency distribution (bins on the horizontal axis and frequency on the vertical axis) using the information obtained in Steps 1 and 2.
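A minimal sketch of these three steps in Python, applied to the Honda Auto parts-cost data; matplotlib is assumed to be available, and the bin width W = 10 is an illustrative choice:

```python
import math
import matplotlib.pyplot as plt

# Tune-up parts cost data (Rs) from the Honda Auto example
costs = [91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
         71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
         104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
         85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
         62, 82, 98, 101, 79, 105, 79, 69, 62, 73]

# Step 1: choose a bin width W and compute the number of bins
W = 10
n_bins = math.ceil((max(costs) - min(costs)) / W)

# Steps 2 and 3: count observations per bin and plot the frequency distribution
plt.hist(costs, bins=n_bins, edgecolor="black")
plt.xlabel("Parts cost (Rs)")
plt.ylabel("Frequency")
plt.title("Histogram of tune-up parts cost")
plt.show()
```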
Use of Histogram

A histogram is very useful since it assists the data scientist in identifying the following:
• The shape of the distribution, and thereby the probability distribution of the data.
• Measures of central tendency such as the median and mode.
• Measures of variability such as the spread.
• Measures of shape such as skewness.
Measures of Shape: Skewness and Kurtosis

• Skewness is a measure of symmetry or lack of symmetry. A dataset is symmetrical when the proportion of data at equal distances (measured in terms of standard deviation) from the mean (or median) is equal. That is, the proportion of data between $\mu$ and $\mu - k\sigma$ is the same as that between $\mu$ and $\mu + k\sigma$, where k is some positive constant.

• Pearson's moment coefficient of skewness for a dataset with n observations is given by

$g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^3}{\sigma^3}$

• The value of $g_1$ will be close to 0 when the data are symmetrical.
• A positive value of $g_1$ indicates positive skewness and a negative value indicates negative skewness.
Skewness

• The following formula is usually used for a sample with n observations (Joanes and Gill, 1998):

$G_1 = \frac{\sqrt{n(n-1)}}{n-2} \, g_1$
Kurtosis
• Kurtosis is another measure of shape, aimed at the shape of the tail, that is, whether the tail of the data distribution is heavy or light. Kurtosis is measured using the following equation:

$\text{Kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^4}{\sigma^4}$

• A distribution with kurtosis less than 3 is called platykurtic, and one with kurtosis greater than 3 is called leptokurtic.
• A kurtosis value of 3 indicates a normal distribution (also called mesokurtic).
Leptokurtic, mesokurtic, and platykurtic distributions
Excess Kurtosis

• The excess kurtosis is a measure that captures the deviation from the kurtosis of a normal distribution and is given by:

$\text{Excess Kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^4}{\sigma^4} - 3$
Chebyshev's Theorem

• Chebyshev's theorem (also known as Chebyshev's inequality) allows us to predict the proportion of observations that is likely to lie within an interval defined using the mean and standard deviation. The probability of finding a randomly selected value in the interval $\mu \pm k\sigma$ is at least $1 - \frac{1}{k^2}$, that is,

$P\left( \mu - k\sigma \le X \le \mu + k\sigma \right) \ge 1 - \frac{1}{k^2}$

• Example: The amount spent per month by a segment of credit card users of a bank has a mean value of 12000 and a standard deviation of 2000. Calculate the proportion of customers who spend between 8000 and 16000.
• Solution:

$P(8000 \le X \le 16000) = P(\mu - 2\sigma \le X \le \mu + 2\sigma) \ge 1 - \frac{1}{2^2} = 0.75$

That is, the proportion of customers spending between 8000 and 16000 is at least 0.75 (or 75%).
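A tiny Python sketch of the bound for this example; k is derived from the interval endpoints, and the 0.75 figure is the guaranteed minimum proportion, not the exact proportion:

```python
def chebyshev_lower_bound(mean, sd, low, high):
    """Chebyshev lower bound on P(low <= X <= high) for an interval around the mean."""
    k = min(mean - low, high - mean) / sd   # standard deviations covered on each side
    return 1 - 1 / k ** 2 if k > 1 else 0.0

print(chebyshev_lower_bound(mean=12000, sd=2000, low=8000, high=16000))  # 0.75
```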
Example (Percentile Calculation)
Time between failures of wire-cuts (in hours)
2 22 32 39 46 56 76 79 88 93

3 24 33 44 46 66 77 79 89 99

5 24 34 45 47 67 77 86 89 99

9 26 37 45 55 67 78 86 89 99

21 31 39 46 56 75 78 87 90 102

1. Calculate the mean, median, and mode of the time between failures of wire-cuts.
2. The company would like to know by what time 10% (tenth percentile, or P10) and 90% (ninetieth percentile, or P90) of the wire-cuts will have failed.
3. Calculate the values of P25 and P75.
Solution
1. Mean = 57.64, median = 56, and mode = 46.

2. Note that the data are arranged in increasing order down the columns.

Position corresponding to Px = x(n + 1)/100
The position of P10 = 10 × 51/100 = 5.1.
We can round off 5.1 to its nearest integer, which is 5. The corresponding value from the table is 21 (10% of the observations have a value of less than or equal to 21).
That is, by 21 hours, 10% of the wire-cuts will fail. In asset management (and reliability theory), this value is called the P10 life.

Instead of rounding the position, we can interpolate. The position of P10 is 5.1; the value at the 5th position is 21, so the value at position 5.1 is approximated as 21 + 0.1 × (value at 6th position - value at 5th position) = 21 + 0.1(1) = 21.1.
P90 = 90 × 51/100 = 45.9.
The value at position 45 is 90, so the value at position 45.9 is 90 + 0.9 × (3) = 92.7.
That is, 90% of the wire-cuts will fail by 92.7 hours.

3. P25 (1st Quartile, Q1): position = 25 × 51/100 = 12.75. The value at the 12th position is 33, so
P25 = 33 + 0.75 × (value at 13th position - value at 12th position) = 33 + 0.75(1) = 33.75.
P75 (3rd Quartile, Q3): position = 75 × 51/100 = 38.25. The value at the 38th position is 86, so
P75 = 86 + 0.25 × (value at 39th position - value at 38th position) = 86 + 0.25(0) = 86.
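A short Python sketch of the Px = x(n + 1)/100 position rule with linear interpolation, reproducing the values above; note that this convention differs slightly from numpy.percentile's default method:

```python
def percentile(values, x):
    """Value of Px using the position rule x * (n + 1) / 100 with linear interpolation."""
    data = sorted(values)
    n = len(data)
    pos = x * (n + 1) / 100            # 1-based fractional position
    lower = int(pos)                   # integer part of the position
    frac = pos - lower                 # fractional part used for interpolation
    if lower < 1:                      # guard: position falls before the first value
        return data[0]
    if lower >= n:                     # guard: position falls after the last value
        return data[-1]
    return data[lower - 1] + frac * (data[lower] - data[lower - 1])

failures = [2, 22, 32, 39, 46, 56, 76, 79, 88, 93,
            3, 24, 33, 44, 46, 66, 77, 79, 89, 99,
            5, 24, 34, 45, 47, 67, 77, 86, 89, 99,
            9, 26, 37, 45, 55, 67, 78, 86, 89, 99,
            21, 31, 39, 46, 56, 75, 78, 87, 90, 102]

print(percentile(failures, 10))   # 21.1
print(percentile(failures, 90))   # 92.7
print(percentile(failures, 25))   # 33.75
print(percentile(failures, 75))   # 86.0
```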
Percentile
• Percentile, decile, and quartile are frequently used to identify the position of an observation in the dataset.
• The percentile, denoted Px, is the value below which x per cent of the data lie.

Position corresponding to Px = x(n + 1)/100

• This gives the position of Px in the data, where n is the number of observations.
Decile and Quartile

• Decile corresponds to special values of the percentile that divide the data into 10 equal parts.
• The first decile contains the first 10% of the data, the second decile contains the first 20% of the data, and so on.
• Quartile divides the data into 4 equal parts.
• The first quartile (Q1) contains the first 25% of the data; Q2 contains 50% of the data and is also the median; Quartile 3 (Q3) accounts for 75% of the data.
Bar Chart
• A bar chart is a frequency chart for a qualitative (categorical) variable.
• A bar chart can be used to assess the most-occurring and least-occurring categories within a dataset.
• Histograms cannot be used when the variable is qualitative.

Pie Chart
• A pie chart is mainly used for categorical data; it is a circular chart that displays the proportion of each category in the dataset.
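A brief matplotlib sketch of a bar chart and a pie chart for a hypothetical categorical variable; the category counts below are made up for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical counts of customers by marital status
categories = ["Married", "Unmarried", "Divorced Male", "Divorced Female"]
counts = [120, 80, 25, 30]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(categories, counts)                            # bar chart: frequency per category
ax1.set_ylabel("Frequency")
ax2.pie(counts, labels=categories, autopct="%1.0f%%")  # pie chart: proportion per category
plt.show()
```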
Scatter Plot

• A scatter plot is a plot of two variables that assists data scientists in understanding whether there is any relationship between the two variables.
• The relationship could be linear or non-linear.
• A scatter plot is also useful for assessing the strength of the relationship and for finding outliers in the data.
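A minimal matplotlib sketch of a scatter plot for two hypothetical variables (advertising spend versus sales; the data are made up for illustration):

```python
import matplotlib.pyplot as plt

# Hypothetical data: advertising spend (Rs '000) and sales (Rs '000)
ad_spend = [10, 15, 20, 25, 30, 35, 40, 45, 50]
sales    = [25, 31, 38, 41, 52, 55, 60, 68, 70]

plt.scatter(ad_spend, sales)
plt.xlabel("Advertising spend (Rs '000)")
plt.ylabel("Sales (Rs '000)")
plt.title("Scatter plot of sales vs. advertising spend")
plt.show()
```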
Box Plot (or Box and Whisker Plot)

• A box plot (also known as a box and whisker plot) is a graphical representation of numerical data that can be used to understand the variability of the data and the existence of outliers.
• A box plot is constructed by identifying the following descriptive statistics:
• Lower quartile (1st quartile), median, and upper quartile (3rd quartile).
• Lowest and highest values.
• Inter-quartile range (IQR).

IQR Box Plot

• The box plot is constructed using the IQR and the minimum and maximum values.
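A short matplotlib sketch of a box plot, drawn here for the wire-cut failure-time data used in the percentile example:

```python
import matplotlib.pyplot as plt

failures = [2, 22, 32, 39, 46, 56, 76, 79, 88, 93,
            3, 24, 33, 44, 46, 66, 77, 79, 89, 99,
            5, 24, 34, 45, 47, 67, 77, 86, 89, 99,
            9, 26, 37, 45, 55, 67, 78, 86, 89, 99,
            21, 31, 39, 46, 56, 75, 78, 87, 90, 102]

plt.boxplot(failures, vert=False)    # whiskers, box (Q1 to Q3), and median line
plt.xlabel("Time between failures (hours)")
plt.title("Box plot of wire-cut failure times")
plt.show()
```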
Bollywood Movie Budget Boxplot

• The box plot for the Bollywood movie budget data.
