Business Data Analysis and Interpretation Notes Lecture Notes Lectures 1 13
Types of Variables
Scales of Measurement
Bill scored 1200 on the SAT and entered college as a physics major. He changed to business, however,
because he thought it was more interesting. Because he made the Dean’s List last semester, his parents
gave him $30 to buy a new Casio calculator. Identify at least one piece of information in the:
a) Nominal scale – Going to college, will buy a calculator, was a physics major, is a business major,
was on the Dean’s List
b) Ordinal scale of measurement – Bill is a freshman
c) Interval scale of measurement – scored 1200 on the SAT
d) Ratio scale of measurement – Bill received $30
Accounting – Public accounting firms use statistical sampling procedures when conducting audits for
their clients
Economics – Economists use statistical information in making forecasts about the future of the
economy or some aspect of it
Finance – Financial advisors use price-earnings ratios and dividend yields to guide their investment
advice
Production – A variety of statistical quality control charts are used to monitor the output of a
production process
Information systems – A variety of statistical information helps administrators assess the performance of
computer networks
Data: The facts and figures collected, analysed and summarized for presentation and interpretation.
All the data collected in a particular study are referred to as the data set for the study.
Scales of Measurement
Nominal
Ordinal
The data have the properties of nominal data and the order or rank of the data is meaningful.
A nonnumeric label or numeric code may be used.
Example: Students of a university are classified by their class standing using a nonnumeric label such as
freshman, sophomore, junior or senior. Alternatively, a numeric code could be used for the class standing
variable (e.g. 1 denotes freshman, 2 denotes sophomore, and so on)
Interval
The data have the properties of ordinal data, and the interval between observations is expressed in terms
of a fixed unit of measure.
Interval data are always numeric
Example: Melissa has an SAT score of 1985, while Kevin has an SAT score of 1880. Melissa scored 105
points more than Kevin
Ratio
The data have all the properties of interval data and the ratio of two values is meaningful.
Variables such as distance, height, weight, and time use the ratio scale.
This scale must contain a zero value that indicates that nothing exists for the variable at the zero point.
Example: Melissa’s college record shows 36 credit hours earned, while Kevin’s shows 72 credit hours.
Kevin has twice as many credit hours earned as Melissa
Categorical Data
Quantitative Data
Cross-sectional data are collected at the same or approximately the same point in time
Example: Data detailing the number of building permits issued in November 2012 in each of the
counties of Ohio
Time series data are collected over several time periods
Example: Data detailing the number of building permits issued in Lucas County, Ohio, in each of the last 36
months
Key Terms:
Data array: An orderly presentation of data in either ascending or descending numerical order.
Frequency Distribution: A table that represents the data in classes and that shows the number of
observations in each class.
Class: The category
Frequency: The number in each class
Class limits: Boundaries for each class
Class interval: Width of each class
Class mark: Midpoint of each class
Sturges’ Rule
- How to set the approximate number of classes to begin constructing a frequency distribution
$k = 1 + 3.322 \log_{10}(n)$
Where:
k = approximate number of classes to use
n = the number of observations in the data set
1. Number of classes
- Choose an approximate number of classes for your data. Sturges’ rule can help.
2. Estimate the class interval
- Divide the approximate number of classes (from step 1) into the range of your data to find the
approximate class interval, where the range is defined as the largest data value minus the
smallest data value
3. Determine the class interval
- Round the estimate (from step 2) to a convenient value
4. Lower Class Limit
- Determine the lower class limit for the first class by selecting a convenient number that is smaller
than the lowest data value
5. Class Limits
- Determine the other class limits by repeatedly adding the class width (from step 3) to the prior
class limit, starting with the lower class limit (from step 4)
6. Define the class
- Use the sequence of class limits to define the classes
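The six steps above can be sketched in Python. This is only an illustration: the data set, the rounded class width of 10 and the lower limit of 10 are hypothetical choices, and the helper names are not from the notes.

```python
import math

def sturges_classes(n):
    """Approximate number of classes from Sturges' rule: k = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)

def frequency_distribution(data, width, lower):
    """Count observations in classes [lower, lower + width), [lower + width, ...)."""
    k = int((max(data) - lower) // width) + 1   # classes needed to reach the largest value
    counts = [0] * k
    for x in data:
        counts[int((x - lower) // width)] += 1
    return [(lower + i * width, lower + (i + 1) * width, c) for i, c in enumerate(counts)]

# Hypothetical data set of 20 observations
data = [12, 15, 17, 21, 22, 22, 25, 28, 29, 31,
        33, 34, 36, 38, 41, 42, 45, 47, 52, 58]

k = round(sturges_classes(len(data)))        # step 1: approximate number of classes
estimate = (max(data) - min(data)) / k       # step 2: range / k = 46 / 5 = 9.2
width = 10                                   # step 3: round 9.2 to a convenient value
lower = 10                                   # step 4: convenient number below the minimum (12)
table = frequency_distribution(data, width, lower)   # steps 5 and 6

for low, high, count in table:
    print(f"{low} to under {high}: {count}")
```

Each class here includes its lower limit and excludes its upper limit, which is one common convention; the notes do not say which convention to use.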
An Alternative: Accrue the relative frequencies for each class instead of the raw frequencies. Then you
don’t have to divide by the total to get percentages
NOTE *
- If the measures are computed for data from a sample, they are called sample statistics
- If the measures are computed for data from a population, they are called population parameters
- A sample statistic is referred to as the point estimator of the corresponding population
parameter
Mean
Sample mean
$\bar{x} = \frac{\sum x_i}{n}$
Population mean
$\mu = \frac{\sum x_i}{N}$
Median
- The median of a data set is the value in the middle when the data items are arranged in
ascending order
- Whenever a data set has extreme values, the median is the preferred measure of central location
- The median is the measure of location most often reported for annual income and property values
- A few extreme incomes or property values can inflate the mean.
When there is an even number of items, the median is the average of the two middle values
Mode
- The mode of a data set is the value that occurs with the greatest frequency
- The greatest frequency can occur at two or more different values
- If the data have two modes, the data are bimodal
- If the data have more than two modes, the data are multimodal
* Caution – if the data are bimodal or multimodal, Excel’s MODE function will incorrectly identify a
single mode
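As a small illustration of the three measures of location, the sketch below uses Python's standard `statistics` module on a hypothetical data set that contains one extreme value and two modes.

```python
import statistics

# Hypothetical data: one extreme value (54) and two modes (5 and 8)
data = [3, 5, 5, 7, 8, 8, 10, 54]

mean = statistics.mean(data)        # sum of the values divided by n
median = statistics.median(data)    # average of the two middle values when n is even
modes = statistics.multimode(data)  # every value that occurs with the greatest frequency

print(mean, median, modes)
```

The extreme value pulls the mean (12.5) well above the median (7.5), and `multimode` reports both modes rather than a single one.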
Measures of Variability
- Range
- Variance
- Standard deviation
- Coefficient of variation
- Covariance, correlation
Range
- The range of a data set is the difference between the largest and smallest data values
- It is the simplest measure of variability
- It is very sensitive to the smallest and largest data values
Variance
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ - For sample
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ - For population
Standard deviation
$s = \sqrt{s^2}$ – Sample
$\sigma = \sqrt{\sigma^2}$ – Population
Coefficient of Variation
- The coefficient of variation indicates how large the standard deviation is in relation to the mean
$\left( \frac{s}{\bar{x}} \times 100 \right)\%$ - Sample
$\left( \frac{\sigma}{\mu} \times 100 \right)\%$ - Population
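The measures of variability listed above can be computed with Python's standard `statistics` module. The data set is hypothetical, chosen so that the population variance is a whole number.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values with mean 5

data_range = max(data) - min(data)        # largest value minus smallest value
pop_var = statistics.pvariance(data)      # divides by N (population)
pop_sd = statistics.pstdev(data)          # square root of the population variance
sample_var = statistics.variance(data)    # divides by n - 1 (sample)
sample_sd = statistics.stdev(data)        # square root of the sample variance

mean = statistics.mean(data)
cv_pop = pop_sd / mean * 100              # coefficient of variation, population, in %
cv_sample = sample_sd / mean * 100        # coefficient of variation, sample, in %

print(data_range, pop_var, pop_sd, cv_pop)
```

Here the population standard deviation is 2 and the mean is 5, so the population coefficient of variation is 40%.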
Distribution Shape: Skewness
Z-scores
$z_i = \frac{x_i - \bar{x}}{s}$
- An observation’s z-score is a measure of the relative location of the observation in a data set
- A data value less than the sample mean will have a z-score less than zero
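A short sketch of the z-score calculation, using the sample mean and the sample standard deviation. The five data values are hypothetical.

```python
import statistics

def z_scores(data):
    """Return (x_i - sample mean) / sample standard deviation for each observation."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return [(x - mean) / s for x in data]

data = [46, 54, 42, 46, 32]   # hypothetical sample: mean 44, s = 8
scores = z_scores(data)
print(scores)
```

Values below the sample mean of 44 (here 42 and 32) come out with negative z-scores, as the last point above states.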
Detecting Outliers
- Thus far we have examined numerical methods used to summarize the data for one
variable at a time. Often managers or decision makers are interested in the relationship between
two variables
- Two descriptive measures of the relationship between two variables are covariance and the correlation coefficient
Covariance
$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$ - For population
Correlation Coefficient
$r_{xy} = \frac{s_{xy}}{s_x s_y}$ - For samples
$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$ - For population
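The covariance and correlation formulas can be sketched as below. The paired data are hypothetical, and the `sample` flag is an illustrative convention for switching the divisor between n − 1 (sample) and N (population).

```python
import math

def covariance(x, y, sample=True):
    """Sum of cross-deviations divided by n - 1 (sample) or N (population)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    total = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return total / (n - 1 if sample else n)

def correlation(x, y):
    """Covariance divided by the product of the two standard deviations.

    The divisor cancels, so the sample and population versions give the same value.
    """
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_sample = covariance(x, y)               # 6 / 4
cov_pop = covariance(x, y, sample=False)    # 6 / 5
r = correlation(x, y)
print(cov_sample, cov_pop, round(r, 4))
```

The correlation coefficient always lies between −1 and +1, whereas the covariance depends on the units of x and y.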
Key Concepts
Probability – a numerical value that represents the chance, likelihood or possibility that an event will
occur (always between 0 and 1)
Event – Each possible outcome of a variable
If the probability that event A occurs is a/b, then the odds of event A occurring are a to (b − a)
Intersection – Both/And
- The intersection of A and B is the set of all outcomes for which both A and B occur
Union – Either/Or
- The union of A and B is the set of all outcomes for which either A or B (or both) occur
The general rule of addition P(A or B) = P(A) + P(B) – P(A and B) is always true
When events A and B are mutually exclusive, the last term in the rule, P(A and B), will be zero by
definition
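A worked sketch of the general rule of addition and of converting a probability into odds. The card-deck events are a hypothetical example (A = drawing a heart, B = drawing a face card from a standard 52-card deck); exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def p_union(p_a, p_b, p_a_and_b):
    """General rule of addition: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b

def odds_in_favour(p):
    """For a probability a/b, return the odds as the pair (a, b - a)."""
    p = Fraction(p)
    return p.numerator, p.denominator - p.numerator

# A = heart (13 cards), B = face card (12 cards), A and B = heart face cards (3 cards)
p_heart_or_face = p_union(Fraction(13, 52), Fraction(12, 52), Fraction(3, 52))

# Mutually exclusive events: a card cannot be both a heart and a spade, so P(A and B) = 0
p_heart_or_spade = p_union(Fraction(13, 52), Fraction(13, 52), 0)

print(p_heart_or_face, p_heart_or_spade, odds_in_favour(Fraction(13, 52)))
```

A probability of 13/52 reduces to 1/4, so the odds of drawing a heart are 1 to 3.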
Random Variables:
- Discrete
- Continuous
Probability Distributions
The probability distribution for a random variable describes how probabilities are distributed over the
values of the random variable
The probability distribution is defined by a probability function, denoted by f(x), which provides the
probability for each value of the random variable.
The required conditions for a discrete probability function are:
$f(x) \geq 0$
$\sum f(x) = 1$
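The two required conditions can be checked mechanically. In this sketch a discrete probability function is represented as a dictionary mapping each value x to f(x); that representation and the example numbers are assumptions made for illustration.

```python
import math

def is_valid_discrete_distribution(f):
    """Check that every f(x) >= 0 and that the probabilities sum to 1."""
    probabilities = list(f.values())
    non_negative = all(p >= 0 for p in probabilities)
    sums_to_one = math.isclose(sum(probabilities), 1.0)
    return non_negative and sums_to_one

# Hypothetical distribution for a random variable taking the values 0 to 3
f = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

print(is_valid_discrete_distribution(f))                    # both conditions hold
print(is_valid_discrete_distribution({0: 0.5, 1: 0.6}))     # sums to 1.1
print(is_valid_discrete_distribution({0: -0.1, 1: 1.1}))    # contains a negative value
```

`math.isclose` is used instead of `==` because sums of decimal fractions are not always exact in floating point.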
A continuous random variable is a variable that can assume any value on a continuum (can assume an
infinite number of values.)
These can potentially take on any value depending on the ability to measure accurately
$z = \frac{x - \mu}{\sigma}$
- We can think of z as a measure of the number of standard deviations x is from μ
Z formula
- Standardises any normal distribution
Z score
- Computed by the z formula
- The number of standard deviations which a value is away from the mean
Z-scores
- The standardised z-score is how far above or below the population mean an individual value lies,
in units of standard deviations
X – Z – Prob
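The X to Z to probability sequence can be sketched with `statistics.NormalDist` from the Python standard library. The population mean of 100, standard deviation of 15 and the value x = 130 are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 100, 15      # hypothetical population mean and standard deviation
x = 130                  # value of interest

z = (x - mu) / sigma                    # X -> Z: standardise the value
p_below = NormalDist().cdf(z)           # Z -> Prob: area to the left of z
p_above = 1 - p_below                   # area in the upper tail

# The same probability, computed directly on the original scale
p_direct = NormalDist(mu, sigma).cdf(x)

print(z, round(p_below, 4), round(p_above, 4))
```

A value two standard deviations above the mean has about 97.7% of the distribution below it.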
Types of Studies
Exploratory
- Understand a problem, identify relevant variables, formulate hypotheses
Descriptive
- Establish reliable measurements
Causal
- Determine relationships among variables
Predictive
- Use analysis to forecast
Sources of Data
Primary
- Data generated by the researcher for this study
- Survey, experimental and observational research are the most popular
- Tend to require more time and expense than secondary data
Secondary
- Data gathered from another source or for another purpose
- Internal sources within the researcher’s organisation
- External sources, including governmental, trade, commercial and internet sources
Types of Surveys
Mail survey
- A mailed questionnaire with cover letter and return envelope
Personal interview
- A purposeful conversation
Telephone interview
- An interview conducted over the telephone
Web Survey
- A questionnaire completed over the internet
Sampling Error
- Random, non-directional
- When a sample is used instead of a census
Non-sampling Error
- Directional bias, overstating or understating the true population parameter
- Potential sources include: poor sample design, poor measurement, poor instruments
Sampling Methods
Simple random
- Every person has an equal chance of being selected. Best when roster of the population exists
Systematic
- Randomly enter a stream of elements and sample every kth element. Best when elements are
randomly ordered, no cyclic variation
Stratified
- Randomly sample elements from every layer, or stratum of the population. Best when elements
within strata are homogenous
Cluster
- Randomly sample elements within some of the clusters (groups) of the population. Best when elements
within clusters are heterogeneous
Convenience
- Elements are sampled because of ease and availability
Quota
- Elements are sampled, but not randomly, from every layer, or stratum, of the population
Purposive
- Elements are sampled because they are atypical, not representative of the population
Judgement
- Elements are sampled because the researcher believes the members are representative of the
population
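Three of the probability sampling methods above can be sketched with Python's `random` module. The population of 100 numbered elements, the two strata and the function names are hypothetical; a seeded generator is used only so the run is repeatable.

```python
import random

def simple_random_sample(population, n, rng):
    """Every element has an equal chance of selection (sampling without replacement)."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Enter the list at a random point in the first k elements, then take every kth element."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample from every stratum."""
    return {name: rng.sample(members, n_per_stratum) for name, members in strata.items()}

rng = random.Random(42)               # seeded so the example is repeatable
population = list(range(1, 101))      # hypothetical roster of 100 elements

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
strata = {"low": population[:50], "high": population[50:]}
stratified = stratified_sample(strata, 5, rng)

print(srs)
print(systematic)
print(stratified)
```

Cluster sampling would differ from the stratified version by first choosing some of the groups at random and then sampling only within those.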
- Sometimes we want to select a sample, but find it not possible to obtain a list of all the elements in
a population
- As a result, we cannot use the random number selection procedure
- Most often this situation occurs in infinite population cases
- Populations are often generated by an ongoing process where there is no upper limit on the
number of units that can be generated
Some examples of ongoing processes (with infinite populations) are:
- Parts being manufactured on a production line
- Transactions occurring at a bank
- Telephone calls at a help centre
- Customers entering a store
A random sample from an infinite population is a sample selected such that the following conditions are
satisfied
- Each element selected comes from the population of interest
- Each element is selected independently
Point estimation
Practical advice
When the population from which we are selecting a random sample does not have a normal distribution,
the central limit theorem is helpful in identifying the shape of the sampling distribution of x̄
Central limit theorem: In selecting random samples of size n from a population, the sampling distribution of
the sample mean can be approximated by a normal distribution as the sample size becomes large
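The central limit theorem can be illustrated with a small simulation. This sketch assumes a strongly skewed population (exponential with mean 1 and standard deviation 1), a sample size of 30 and 2000 repeated samples; all three are hypothetical choices.

```python
import math
import random
import statistics

def sample_means(n, repetitions, rng):
    """Draw `repetitions` samples of size n from an exponential population and return their means."""
    means = []
    for _ in range(repetitions):
        sample = [rng.expovariate(1.0) for _ in range(n)]   # population mean 1, sd 1
        means.append(statistics.mean(sample))
    return means

rng = random.Random(0)     # seeded so the run is repeatable
n = 30
means = sample_means(n, 2000, rng)

centre = statistics.mean(means)       # should be close to the population mean, 1
spread = statistics.stdev(means)      # should be close to sigma / sqrt(n)
expected_spread = 1 / math.sqrt(n)

print(round(centre, 3), round(spread, 3), round(expected_spread, 3))
```

Although the population itself is far from normal, a histogram of `means` would look roughly bell-shaped, centred near 1 with a spread near 1/√30.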
Unbiased – if the expected value of the sample statistic is equal to the population parameter being
estimated, the sample statistic is said to be an unbiased estimator of the population parameter
Efficiency - Given the choice of two unbiased estimators of the same population parameter, we would
prefer to use the point estimator with the smaller standard deviation, since it tends to provide estimates
closer to the population parameter.
The point estimator with the smaller standard deviation is said to have greater relative efficiency than
the other.
Consistency - A point estimator is consistent if the values of the point estimator tend to become closer to
the population parameter as the sample size becomes larger.
Topic 9 (week 9) – Interval Estimation, Confidence Interval, and Sample Size Determination
- A point estimator cannot be expected to provide the exact value of the population parameter
- An interval estimate can be computed by adding and subtracting a margin of error to the point
estimate.
- Point Estimate +/- Margin of Error
- The purpose of an interval estimate is to provide information about how close the point estimate
is to the value of the parameter
- The general form of an interval estimate of a population mean is:
Xbar ± Margin of Error
- In order to develop an interval estimate of a population mean, the margin of error must be
computed using either:
The population standard deviation σ, or
The sample standard deviation s
- σ is rarely known exactly, but often a good estimate can be obtained based on historical data or
other information
- We refer to such cases as the σ known case
Interval Estimate of μ
$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
Where:
$\bar{x}$ is the sample mean
$1 - \alpha$ is the confidence coefficient
$z_{\alpha/2}$ is the z value providing an area of $\alpha/2$ in the upper tail of the standard normal probability
distribution
$\sigma$ is the population standard deviation
n is the sample size
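A minimal sketch of the σ known interval estimate, using `statistics.NormalDist` to look up the z value. The sample mean of 82, σ of 20 and n of 100 are hypothetical inputs.

```python
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    """Interval estimate of a population mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)     # z value with area alpha/2 in the upper tail
    margin = z * sigma / math.sqrt(n)           # margin of error
    return xbar - margin, xbar + margin

# Hypothetical inputs: sample mean 82, sigma 20, n = 100
lower, upper = z_interval(xbar=82, sigma=20, n=100, confidence=0.95)
print(round(lower, 2), round(upper, 2))
```

With 95% confidence the z value is about 1.96, so the margin of error is roughly 1.96 × 20 / 10 = 3.92.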
Meaning of Confidence
- Because 90% of all intervals constructed using x̄ ± 1.645σ/√n will contain the population
mean, we say we are 90% confident that the interval x̄ ± 1.645σ/√n includes the population
mean μ.
- We say that this interval has been established at the 90% confidence level
- The value 0.90 is referred to as the confidence coefficient
- If an estimate of the population standard deviation σ cannot be developed prior to sampling, we use
the sample standard deviation s to estimate σ
- This is the σ unknown case; the interval estimate of μ is based on the t distribution
- We’ll assume for now that the population is normally distributed
t Distribution
- The t distribution is a family of similar probability distributions
- A specific t distribution depends on a parameter known as the degrees of freedom
- Degrees of freedom refer to the number of independent pieces of information that go into the
computation of s
- A t distribution with more degrees of freedom has less dispersion
- As the degrees of freedom increases, the difference between the t distribution and the standard
normal probability distribution becomes smaller and smaller
Interval Estimate
$\bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}}$
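A sketch of the σ unknown interval estimate. The Python standard library has no t distribution, so the t value is passed in as an argument, as if read from a t table with n − 1 degrees of freedom. The sample mean, s and n are hypothetical; 2.131 is the tabulated t value for an upper-tail area of 0.025 with 15 degrees of freedom.

```python
import math

def t_interval(xbar, s, n, t_critical):
    """Interval estimate of a population mean when sigma is unknown.

    t_critical is the t value for an upper-tail area of alpha/2 with n - 1
    degrees of freedom, supplied by the caller (for example from a t table).
    """
    margin = t_critical * s / math.sqrt(n)
    return xbar - margin, xbar + margin

# Hypothetical sample: mean 50, s = 5, n = 16, so 15 degrees of freedom
lower, upper = t_interval(xbar=50, s=5, n=16, t_critical=2.131)
print(round(lower, 3), round(upper, 3))
```

Because 2.131 is larger than the corresponding z value of 1.96, the interval is wider than it would be if σ were known, reflecting the extra uncertainty in estimating σ with s.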
Hypothesis testing – can be used to determine whether a statement about the value of a population
parameter should or should not be rejected.
The null hypothesis, denoted by H0, is the tentative assumption about the population parameter
The alternative hypothesis, denoted by Ha, is the opposite of what is stated in the null hypothesis
Example: A new teaching method is developed that is believed to be better than the current method.
Alternative Hypothesis: The new teaching method is better
Null Hypothesis: The new method is no better than the old method
- We might begin with a belief or assumption that a statement about the value of a population parameter is
true
- We then use a hypothesis test to challenge the assumption and determine if there is statistical
evidence to conclude that the assumption is incorrect
- In these situations, it is helpful to develop the null hypothesis first
Example: The label on a soft drink bottle states that it contains 67.6 fluid ounces.
Null hypothesis: The label is correct, μ ≥ 67.6 ounces
Alternative hypothesis: The label is incorrect, μ < 67.6 ounces
Summary of Forms for Null and Alternative Hypothesis about a Population Mean
- The equality part of the hypothesis always appears in the null hypothesis
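- In general, a hypothesis test about the value of a population mean μ must take one of the following
three forms (where μ0 is the hypothesized value of the population mean):
H0: μ ≥ μ0, Ha: μ < μ0 (lower-tail test)
H0: μ ≤ μ0, Ha: μ > μ0 (upper-tail test)
H0: μ = μ0, Ha: μ ≠ μ0 (two-tailed test)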
Type 1 Error
Type 2 Errors
- p-value approach: the p-value is a probability, computed using the test statistic, that measures the support
(or lack of support) provided by the sample for the null hypothesis
- If the p-value is less than or equal to the level of significance α, the value of the test statistic is in
the rejection region
- Reject H0 if the p-value is less than or equal to α
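The p-value rule can be sketched for the lower-tail soft drink test (H0: μ ≥ 67.6, Ha: μ < 67.6). The notes give no sample data, so the sample mean of 67.5, the known σ of 0.3, the sample size of 36 and the 0.05 level of significance are all hypothetical, and a z test statistic is assumed.

```python
import math
from statistics import NormalDist

def lower_tail_p_value(xbar, mu0, sigma, n):
    """z test statistic and p-value for H0: mu >= mu0 against Ha: mu < mu0 (sigma known)."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    p_value = NormalDist().cdf(z)      # area in the lower tail below z
    return z, p_value

# Hypothetical sample results for the soft drink example
z, p_value = lower_tail_p_value(xbar=67.5, mu0=67.6, sigma=0.3, n=36)

alpha = 0.05
reject_h0 = p_value <= alpha           # reject H0 if the p-value <= alpha

print(round(z, 2), round(p_value, 4), reject_h0)
```

With these made-up numbers the test statistic is about −2 and the p-value is about 0.023, which is below 0.05, so H0 would be rejected.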
Regression analysis – can be used to develop an equation showing how the variables are related
- The variable being predicted is called the dependent variable and is denoted by y
- The variables being used to predict the value of the dependent variable are called the independent
variables and are denoted by x
- Simple linear regression involves one independent variable and one dependent variable
- The relationship between the two variables is approximated by a straight line
- Regression analysis involving two or more independent variables is called multiple regression
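A minimal sketch of simple linear regression. The notes do not say how the straight line is chosen; this example assumes the least squares method, and the data and function names are hypothetical.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx                  # slope
    b0 = mean_y - b1 * mean_x       # intercept
    return b0, b1

def predict(b0, b1, x_new):
    """Predicted value of the dependent variable for a given x."""
    return b0 + b1 * x_new

# Hypothetical observations: x is the independent variable, y the dependent variable
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
y_hat = predict(b0, b1, 6)
print(round(b0, 2), round(b1, 2), round(y_hat, 2))
```

For these data the fitted line is approximately y = 2.2 + 0.6x, giving a prediction of about 5.8 at x = 6.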
Nonparametric Methods
- Most of the statistical methods referred to as parametric require the use of interval- or
ratio-scaled data
- Nonparametric methods are often the only way to analyse categorical (nominal or ordinal) data
and draw statistical conclusions
- Nonparametric methods require no assumptions about the population probability distributions
- Nonparametric methods are often called distribution-free methods
- Whenever the data are quantitative, we transform them into categorical data in order to
conduct a nonparametric test
- The Wilcoxon signed-rank test is a procedure for analysing data from a matched-samples
experiment
- The test uses quantitative data but does not require the assumption that the differences between
the paired observations are normally distributed
- It only requires the assumption that the differences have a symmetric distribution
- This occurs whenever the shapes of the two populations are the same and the focus is on
determining if there is a difference between the two population medians.
- Tests for equality between two population medians using matched samples
- Differences between the paired observations need not be normally distributed
- Uses a distribution-free procedure
- Use the normal approximation if the sample size is larger than 10
- If the medians of the two populations are equal, we would expect the sum of the negative signed
ranks and the sum of the positive signed ranks to be approximately the same
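The signed-rank sums described above can be sketched as follows. The matched samples are hypothetical, and two conventions are assumed that the notes do not spell out: pairs with a zero difference are dropped, and tied absolute differences share the average of their ranks. The sketch stops at the rank sums and does not compute a p-value.

```python
def signed_rank_sums(sample_1, sample_2):
    """Return (sum of positive signed ranks, sum of negative signed ranks) for matched samples."""
    # Differences between paired observations; zero differences are discarded
    diffs = [a - b for a, b in zip(sample_1, sample_2) if a != b]
    ordered = sorted(abs(d) for d in diffs)

    # Assign ranks to absolute differences, averaging the ranks of tied values
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        rank_of[ordered[i]] = (i + 1 + j) / 2    # average of ranks i + 1 .. j
        i = j

    t_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    t_minus = sum(rank_of[abs(d)] for d in diffs if d < 0)
    return t_plus, t_minus

# Hypothetical matched samples (for example, the same six workers under two methods)
method_a = [10, 12, 9, 14, 11, 8]
method_b = [12, 10, 12, 18, 11, 13]

t_plus, t_minus = signed_rank_sums(method_a, method_b)
print(t_plus, t_minus)
```

Here the two sums (1.5 and 13.5) are far apart, which is the pattern that would cast doubt on equal medians; whether the gap is statistically significant requires the test's critical values or the normal approximation.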
- This is another nonparametric method for determining whether there is a difference between
two populations
- This test is based on two independent samples
- Advantages of this procedure are:
a) It can be used with either ordinal data or quantitative data
b) It does not require the assumption that the populations have a normal distribution
- Instead of testing for the difference between the medians of two populations, this method tests to
determine whether the two populations are identical