Business Data Analysis and Interpretation Notes Lecture Notes Lectures 1 13

The document provides an overview of key concepts in business data analysis and interpretation. It discusses topics like descriptive statistics, data sources, and constructing frequency distributions. Specifically, it defines statistics, qualitative and quantitative variables, scales of measurement, types of data sources like cross-sectional and time series data. It also explains how to calculate the number of classes and construct a frequency distribution, including converting it to a relative frequency distribution.

Business Data Analysis and Interpretation Notes – Lecture notes, lectures 1–13
Business Data Analysis and Interpretation (James Cook University)



Downloaded by Justin Paul Vallinan ([email protected])

Business Data Analysis and Interpretation Notes


Topic 1 (Week 1) – Key Concepts of Statistics and Applications

Statistics is the collection, representation, summarization, interpretation, and analysis of data.


Statistics – Two usages
- The study of statistics
- Statistics as reported sample measures (descriptive and inferential)
Key Terms
Census: An official count or survey, especially of a population
Selected subset: A sample of data taken from a larger pool, such as a census

Parameter vs. Statistic

Parameter – descriptive measure of the population (represented by μ)
Statistic – descriptive measure of a sample (represented by x̅)

Types of Variables

Qualitative Variables – Attributes, Categories


Examples: male/female, registered to vote/not, ethnicity, eye colour…
Quantitative Variables
- Discrete: Usually take on integer value but can take on fractions when variable allows – counts,
how many…
- Continuous: Can take on any value at any point along an interval – Measurements, how much…

Problem 1.16 – Types of variables

a) Whether you own a Panasonic television set


- Qualitative variable (two levels: yes/no and no measurement)
b) Your status as a full time or part time student
- Qualitative variable (two levels: Full/part time and no measurement)
c) Number of people who attended the school graduation
- Quantitative Discrete variable (a countable whole number + can only be whole numbers)

Scales of Measurement

Nominal Scale – Labels represent various levels of categorical variable


Ordinal Scale – Labels represent an order that indicates either preference or ranking
Interval scale – Numerical labels indicate order and distance between elements. There is no absolute
zero, so ratios of values are not meaningful
Ratio Scale – Numerical labels indicate order and distance between elements, and there is an absolute zero, so ratios of values are meaningful.

Problem 1.20: Scales of measurement

Bill scored 1200 on the SAT and entered college as a physics major; however, he changed to business
because he thought it was more interesting. Because he made the Dean’s list last semester, his parents
gave him $30 to buy a new Casio calculator. Identify at least one piece of information in the:
a) Nominal scale – going to college, will buy a calculator, was a physics major, is a business major,
was on the Dean’s list
b) Ordinal scale of measurement – Bill is a freshman
c) Interval scale of measurement – scored 1200 on the SAT
d) Ratio scale of measurement – Bill received $30

Topic 2 – Types of Data Sources, and Introduction to Statistical Packages

Applications in Business and Economics

Accounting – Public accounting firms use statistical sampling procedures when conducting audits for
their clients
Economics – Economists use statistical information in making forecasts about the future of the
economy or some aspect of it
Finance – Financial advisors use price-earnings ratios and dividend yields to guide their investment
advice
Production – A variety of statistical quality control charts are used to monitor the output of a
production process
Information systems – A variety of statistical information helps administrators assess the performance of
computer networks

Data are the facts and figures collected, analysed, and summarized for presentation and interpretation
All the data collected in a particular study are referred to as the data set for the study.

Elements are the entities on which data are collected


A variable is a characteristic of interest for the elements
The set of measurements obtained for a particular element is called an observation
A data set with n elements contains n observations

Scales of Measurement

Scales of measurement include: Nominal – Ordinal – Interval – Ratio


- The scale determines the amount of information contained in the data.
- The scale indicates the data summarization and statistical analysis that are most appropriate

Nominal

Data are labels or names used to identify an attribute of the element.


A nonnumeric label or numeric code may be used.
Example: Students of a university are classified by the school in which they are enrolled, using a
nonnumeric label such as Business, Humanities, etc. Alternatively, a numeric code could be used for the
school variable (e.g. 1 denotes Business, 2 denotes Humanities)

Ordinal
The data have the properties of nominal data and the order or rank of the data is meaningful.
A nonnumeric label or numeric code may be used.
Example: Students of a university are classified by their class standing using a nonnumeric label such as
Freshman, Sophomore, Junior, or Senior. Alternatively, a numeric code could be used for the class standing
variable (e.g. 1 denotes Freshman, 2 denotes Sophomore, and so on…)

Interval
The data have the properties of ordinal data, and the interval between observations is expressed in terms
of a fixed unit of measure.
Interval data are always numeric
Example: Melissa has an SAT score of 1985, while Kevin has an SAT score of 1880. Melissa scored 105
points more than Kevin

Ratio
The data have all the properties of interval data and the ratio of two values is meaningful.
Variables such as distance, height, weight, and time use the ratio scale.
This scale must contain a zero value that indicates that nothing exists for the variable at the zero point.
Example: Melissa’s college record shows 36 credit hours earned, while Kevin’s shows 72 credit hours.
Kevin has earned twice as many credit hours as Melissa

Categorical Data

- Label or names used to identify an attribute of each element


- Often referred to as qualitative data
- Use either nominal or ordinal scale of measurement
- Can be either numeric or nonnumeric
- Appropriate statistical analyses are rather limited

Quantitative Data

- Quantitative data indicate how many or how much


- Discrete: if counting how many
- Continuous: if measuring how much
- Quantitative data is always numeric
- Ordinary arithmetic operations are meaningful for quantitative data.

Cross-Sectional Data

Cross-sectional data are collected at the same, or approximately the same, point in time
Example: Data detailing the number of building permits issued in November 2012 in each of the
counties of Ohio

Time Series Data

Time series data are collected over several time periods

Example: Data detailing the number of building permits issued in Lucas County, Ohio, over the last 36
months

Topic 3 (week 3) – Descriptive Statistics: Summarising Data, Graphs, Tables, and Shapes

Key Terms:

Data array: An orderly presentation of data in either ascending or descending numerical order.
Frequency Distribution: A table that represents the data in classes and that shows the number of
observations in each class.
Class: The category
Frequency: The number in each class
Class limits: Boundaries for each class
Class interval: Width of each class
Class mark: Midpoint of each class

Sturges’ Rule

- How to set the approximate number of classes to begin constructing a frequency distribution

k = 1 + 3.322(log10 n)

Where:
k = approximate number of classes to use
n = the number of observations in the data set
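As a sketch of the rule in Python (the sample sizes below are arbitrary examples):

```python
import math

def sturges_classes(n: int) -> int:
    """k = 1 + 3.322 * log10(n), rounded to the nearest whole number."""
    return round(1 + 3.322 * math.log10(n))

print(sturges_classes(50))    # suggests 7 classes for 50 observations
print(sturges_classes(100))   # suggests 8 classes for 100 observations
```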

How to Construct a Frequency Distribution

1. Number of classes
- Choose an approximate number of classes for your data. Sturges’ rule can help.
2. Estimate the class interval
- Divide the approximate number of classes (from step 1) into the range of your data to find the
approximate class interval, where the range is defined as the largest data value minus the
smallest data value
3. Determine the class interval
- Round the estimate (from step 2) to a convenient value
4. Lower Class Limit
- Determine the lower class limit for the first class by selecting a convenient number that is smaller
than the lowest data value
5. Class Limits
- Determine the other class limits by repeatedly adding the class width (from step 3) to the prior
class limit, starting with the lower class limit (from step 4)
6. Define the class
- Use the sequence of class limits to define the classes
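The six steps above can be sketched in Python with hypothetical data; in this sketch the last class is made inclusive of its upper limit so the maximum value is counted:

```python
import math

def frequency_distribution(data, k):
    """Minimal sketch of the construction steps; classes are half-open
    except the last, which includes its upper limit."""
    lo, hi = min(data), max(data)
    width = math.ceil((hi - lo) / k)         # steps 2-3: estimate, then round up
    lower = lo                               # step 4: convenient lower limit
    table = []
    for i in range(k):                       # steps 5-6: build classes and count
        a = lower + i * width
        b = a + width
        if i < k - 1:
            count = sum(1 for x in data if a <= x < b)
        else:
            count = sum(1 for x in data if a <= x <= b)
        table.append(((a, b), count))
    return table

data = [1, 2, 2, 3, 5, 6, 7, 8, 9, 10]       # hypothetical data
print(frequency_distribution(data, 3))       # [((1, 4), 4), ((4, 7), 2), ((7, 10), 4)]
```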

Converting to a Relative Frequency Distribution

1. Retain the same classes defined in the frequency distribution.


2. Sum the total number of observations across all classes of the frequency distribution
3. Divide the frequency for each class by the total number of observations, then multiply by 100,
forming the percentage of data values in each class

Forming the Cumulative Relative Frequency Distribution

1. List the number of observations in the lowest class


2. Add the frequency of the lowest class to the frequency of the second class. Record that cumulative
sum for the second class
3. Continue to add the prior cumulative sum to the frequency for that class, so that the cumulative
sum for the final class is the total number of observations in the data set
4. Divide the accumulated frequencies for each class by the total number of observations, giving you
the percentage of all observations that occurred up to and including that class

An Alternative: Accrue the relative frequencies for each class instead of the raw frequencies. Then you
don’t have to divide by the total to get percentages
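Both conversions can be sketched with hypothetical class frequencies:

```python
freqs = [4, 2, 4]                      # hypothetical class frequencies
n = sum(freqs)                         # total number of observations

# relative frequencies, as percentages of all observations
rel = [100 * f / n for f in freqs]

# cumulative relative frequencies: accumulate, then divide by the total
cum = []
running = 0
for f in freqs:
    running += f
    cum.append(100 * running / n)

print(rel)   # [40.0, 20.0, 40.0]
print(cum)   # [40.0, 60.0, 100.0]
```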

Topic 4 (week 4) – Describing Data, Numerical Descriptive Statistics


Measure of Location

NOTE *
- If the measures are computed for data from a sample, they are called sample statistics
- If the measures are computed for data from a population, they are called population parameters
- A sample statistic is referred to as the point estimator of the corresponding population
parameter

Mean

- The mean provides a measure of central location


- The mean of a data set is the average of all the data values
- The sample mean (x̄) is the point estimator of the population mean (μ)

Sample mean

x̄ = ∑xi / n

Population mean

μ = ∑xi / N

Median

- The median of a data set is the value in the middle when the data items are arranged in
ascending order
- Whenever a data set has extreme values, the median is the preferred measure of central location
- The median is the measure of location most often reported for annual income and property values
- A few extreme incomes and property values can inflate the mean
- With an even number of observations, the median is the average of the two middle values

Mode

- The mode of a data set is the value that occurs with the greatest frequency
- The greatest frequency can occur at two or more different values
- If the data have two modes, the data are bimodal
- If the data have more than two modes, the data are multimodal
* Caution – if the data are bimodal or multimodal, Excel’s MODE function will incorrectly identify a
single mode
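A quick illustration with Python's standard `statistics` module and hypothetical data; note that `multimode` reports every mode, avoiding the single-mode pitfall noted above:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]           # hypothetical sample

mean = statistics.mean(data)         # 30 / 6 = 5
median = statistics.median(data)     # even n: average of 3 and 5 = 4.0
modes = statistics.multimode(data)   # [3] -- returns all modes, so bimodal
                                     # data would give a two-element list
```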

Measure of Variability

- It is often desirable to consider measures of variability (dispersion), as well as measures of


location
- For example, in choosing supplier A or supplier B we might consider not only the average delivery
time for each, but also the variability in delivery time for each

Measures of Variability

- Range
- Variance
- Standard deviation
- Coefficient of variation
- Covariance, correlation

Range

- The range of a data set is the difference between the largest and smallest data values
- It is the simplest measure of variability
- It is very sensitive to the smallest and largest data values

Variance

- The variance is a measure of variability that utilizes all the data


- It is based on the difference between the value of each observation (xi) and the mean
- The variance is useful in comparing the variability of two or more variables
- The variance is the average of the squared difference between each data value and the mean
The variance is computed as follows:

s² = ∑(xi − x̄)² / (n − 1) – for a sample

σ² = ∑(xi − μ)² / N – for a population

Standard deviation

The standard deviation is computed as follows:

s = √s² – sample

σ = √σ² – population

Coefficient of Variation

- The coefficient of variation indicates how large the standard deviation is in relation to the mean

(s / x̄ × 100)% – sample
(σ / μ × 100)% – population
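These measures can be checked with the standard `statistics` module (the sample values are hypothetical):

```python
import statistics

sample = [4, 8, 6, 5, 3, 7]                  # hypothetical sample, mean 5.5

s2 = statistics.variance(sample)             # sample variance, divisor n - 1
s = statistics.stdev(sample)                 # sample standard deviation, sqrt(s2)
sigma2 = statistics.pvariance(sample)        # population variance, divisor N
cv = 100 * s / statistics.mean(sample)       # coefficient of variation, %

print(s2)             # 3.5
print(round(cv, 1))   # 34.0
```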
Distribution Shape: Skewness

- An important measure of the shape of a distribution is called skewness

Z-scores

- The z-score is often called the standardized value


- It denotes the number of standard deviations a data value xi is from the mean:

zi = (xi − x̄) / s

- An observation’s z-score is a measure of the relative location of the observation in a data set
- A data value less than the sample mean will have a z-score less than zero

Detecting Outliers

- An outlier is an unusually small or large value in a data set


- A data value with a z-score less than −3 or greater than +3 might be considered an outlier
It might be:
- An incorrectly recorded data value
- A data value that was incorrectly included in the data set
- A correctly recorded data value that belongs in the data set
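A sketch of z-score outlier screening on hypothetical data:

```python
import statistics

data = [10] * 20 + [100]                 # hypothetical data; 100 looks suspicious
mean = statistics.mean(data)
s = statistics.stdev(data)

z = [(x - mean) / s for x in data]       # standardized value for each observation
outliers = [x for x, zi in zip(data, z) if abs(zi) >= 3]
print(outliers)                          # only the value 100 is flagged
```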

Measure of Association Between Two Variables

- Thus far we have examined numerical methods used to summarize the data for one
variable at a time. Often managers or decision makers are interested in the relationship between
two variables
- Two descriptive measures of the relationship between two variables are covariance and the
correlation coefficient

Covariance

- The covariance is a measure of the linear association between two variables


- Positive values indicate a positive relationship
- Negative values indicate negative relationships
Covariance is computed as follows

sxy = ∑(xi − x̄)(yi − ȳ) / (n − 1) – for samples

σxy = ∑(xi − μx)(yi − μy) / N – for populations

Correlation Coefficient

- Correlation is a measure of linear association and not necessarily causation


- Just because two variables are highly correlated, it does not mean that one variable is the cause of
another
Correlation Coefficient is computed as follows:

rxy = sxy / (sx sy) – for samples

ρxy = σxy / (σx σy) – for populations
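A small sketch that mirrors the sample formulas term by term (the x and y values are hypothetical):

```python
import math

def sample_cov(x, y):
    """s_xy = sum((x_i - xbar)(y_i - ybar)) / (n - 1)"""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def sample_corr(x, y):
    """r_xy = s_xy / (s_x * s_y)"""
    sx = math.sqrt(sample_cov(x, x))     # cov of x with itself is the variance
    sy = math.sqrt(sample_cov(y, y))
    return sample_cov(x, y) / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(sample_cov(x, y))                  # 1.5
print(round(sample_corr(x, y), 4))       # 0.7746
```

A positive covariance and a correlation well below 1 illustrate the caution above: the variables move together, but this alone says nothing about causation.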

Topic 5 (week 5) – Basic Introduction to Probability, Discrete Probability Distribution

Key Concepts

Probability – a numerical value that represents the chance, likelihood, or possibility that an event will
occur (always between 0 and 1)
Event – Each possible outcome of a variable

The probability of the complement of event A, written A′, is:

P(A′) = 1 − P(A)

The law of large numbers: over a large number of trials, the relative frequency with which an event
occurs will approach the probability of its occurrence for a single trial

Odds vs. Probability

If the probability that event A occurs is a/b, then the odds of event A occurring are a to b − a

Mutually exclusive events


Events A and B are mutually exclusive if both cannot occur at the same time, that is, if their intersection
is empty. In a Venn diagram, mutually exclusive events are usually shown as nonintersecting areas. If
intersecting areas are shown, the intersections are empty

Intersections vs. Unions

Intersections – Both/And
- The intersection of A and B is the set of all events for which both A and B occur
Union – Either/Or
- The union of A and B is the set of all events for which either A or B occurs

Working with Unions and Intersections

The general rule of addition P(A or B) = P(A) + P(B) – P(A and B) is always true
When events A and B are mutually exclusive, the last term in the rule, P(A and B), will be zero by
definition

Key Terms (2)

Random Variables:
- Discrete
- Continuous
Probability Distributions

The probability distribution for a random variable describes how probabilities are distributed over the
values of the random variable
The probability distribution is defined by a probability function, denoted by f(x), which provides the
probability for each value of the random variable.
The required conditions for a discrete probability function are:
f(x) ≥ 0
∑f(x) = 1

The Bernoulli Process, Characteristics

1. There are two or more consecutive trials


2. In each trial, there are just two possible outcomes
3. The trials are statistically independent
4. The probability of success remains constant from trial to trial

- Our interest is in the number of successes occurring in the n trials


- We let x denote the number of successes occurring in the n trials
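With x denoting the number of successes in n trials and p the constant probability of success, the binomial probability function for such a process can be sketched as (the numbers are illustrative):

```python
import math

def binomial_pmf(n, x, p):
    """P(X = x): x successes in n independent trials, success probability p."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

# e.g. probability of exactly 2 successes in 5 trials with p = 0.5
print(binomial_pmf(5, 2, 0.5))           # 0.3125
```

Summing the function over x = 0, …, n returns 1, satisfying the required conditions for a discrete probability function given above.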

Topic 6 – Continuous Probability Normal Distribution

Continuous Probability Distribution

A continuous random variable is a variable that can assume any value on a continuum (can assume an
infinite number of values.)
These can potentially take on any value depending on the ability to measure accurately

The Normal Distribution

- An important family of continuous distributions


- Bell shaped, symmetric, and asymptotic
- To specify a particular distribution in this family, two parameters must be given

- Location is determined by the mean


- Spread is determined by the standard deviation
- The random variable x has an infinite range

The Normal Distribution shape

The Standard Normal Distribution

- Also known as the z distribution


- Mean is 0
- Standard deviation is 1

Converting to the Standard Normal Distribution

z = (x − μ) / σ

- We can think of z as a measure of the number of standard deviations x is from μ
Z formula
- Standardises any normal distribution
Z score
- Computed by the z formula
- The number of standard deviations which a value is away from the mean

Z-scores

- The standardised z-score measures how far above or below the population mean an individual
value lies, in units of standard deviations
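A short sketch of the conversion, using `statistics.NormalDist` for the probability lookup; the μ, σ, and x values are hypothetical:

```python
from statistics import NormalDist

mu, sigma = 100, 15                      # hypothetical population parameters
x = 130

z = (x - mu) / sigma                     # 2.0 standard deviations above the mean
prob = NormalDist().cdf(z)               # P(Z <= 2.0) under the standard normal
print(z, round(prob, 4))                 # 2.0 0.9772
```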

Normal Probability Distribution Empirical Rule

- Approximately 68.3% of values lie within 1 standard deviation of the mean, 95.4% within 2, and 99.7% within 3
- To find a probability, convert the value x to a z-score, then look up the probability (x → z → prob)

Topic 7 (week 7) – Sampling and Surveys

Types of Studies

Exploratory
- Understand a problem, identify relevant variables, formulate hypothesis
Descriptive
- Establish reliable measurements
Causal
- Determine relationships among variables
Predictive
- Use analysis to forecast

The research Process

1. Define the problem


2. Decide on type of data needed
3. Determine how to gather data
4. Plan collection of data/select sample
5. Collect and analyse the data
6. Draw conclusions and report findings
7. Make decisions in terms of research

Sources of Data

Primary
- Data generated by the researcher for this study
- Survey, experimental, observation research most popular
- Tend to require more time and expense than secondary data
Secondary
- Data gathered from another source or for another purpose
- Internal sources within the researcher’s organisation
- External sources, including governmental, trade, commercial and internet sources

Types of Surveys

Mail survey
- A mailed questionnaire with cover letter and return envelope
Personal interview
- A purposeful conversation
Telephone interview
- An interview conducted over the telephone
Web Survey
- A questionnaire completed over the internet

Types and Sources of Error

Sampling Error
- Random, non-directional
- When a sample is used instead of a census
Non-sampling Error
- Directional bias overstating or understating the true population parameter
- Potential sources include: poor sample design, poor measurement, poor instrumentation

Experimental vs. Observation

Experimental – action and reaction


- Independent variable, or treatment
- Dependent variable, or measurement
- Internal validity – did the treatment produce the effect?
- External validity – Will the treatment produce the effect again in other people or settings?
Observation – Watching or listening

Sampling Methods

Simple random
- Every person has an equal chance of being selected. Best when roster of the population exists
Systematic
- Randomly enter a stream of elements and sample every kth element. Best when elements are
randomly ordered, no cyclic variation
Stratified
- Randomly sample elements from every layer, or stratum of the population. Best when elements
within strata are homogenous
Cluster
- Randomly sample elements within some of the clusters. Best when elements within clusters are
heterogeneous

Convenience
- Elements are sampled because of ease and availability

Quota
- Elements are sampled, but not randomly, from every layer, or stratum, of the population
Purposive
- Elements are sampled because they are atypical, not representative of the population
Judgement
- Elements are sampled because the researcher believes the members are representative of the
population

Topic 8 (week 8) – Sampling Distribution of the Sample Mean

Sampling from a Finite Population

Finite populations are often defined by lists such as:


- Organisation membership rosters
- Credit card account numbers
- Inventory product numbers
A simple random sample of size n from a finite population of size N is selected without replacement,
so that each possible sample of size n has the same probability of being selected
In large sampling projects, computer-generated random numbers are often used to automate the sample
selection process
Excel’s RAND function can be used to generate random numbers
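In Python, the same idea can be sketched with `random.sample`, which draws without replacement (the roster here is hypothetical):

```python
import random

roster = list(range(1, 101))             # hypothetical membership roster, N = 100

random.seed(1)                           # seed for a reproducible illustration
sample = random.sample(roster, 10)       # simple random sample of n = 10,
                                         # drawn without replacement
print(len(sample), len(set(sample)))     # 10 10 -- no element repeats
```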

Sampling from an Infinite Population

- Sometimes we want to select a sample, but find it not possible to obtain a list of all the elements in
a population
- As a result, we cannot use the random number selection procedure
- Most often this situation occurs in infinite population cases
- Populations are often generated by an ongoing process where there is no upper limit on the
number of units that can be generated
Some examples of ongoing processes (with infinite populations) are:
- Parts being manufactured on a production line
- Transactions occurring at a bank
- Telephone calls at a help centre
- Customers entering a store
A random sample from an infinite population is a sample selected such that the following conditions are
satisfied
- Each element selected comes from the population of interest
- Each element is selected independently

Point estimation

- Point estimation is a form of statistical inference


- In point estimation we use data from the sample to compute a value of a sample statistic that
serves as an estimate of a population parameter
- s is the point estimator of the population standard deviation σ

Practical advice

- The target population is the population we want to make inferences about
- The sampled population is the population from which the sample is actually taken

Central limit theorem

When the population from which we are selecting a random sample does not have a normal distribution,
the central limit theorem is helpful in identifying the shape of the sampling distribution of x̄

Central limit theorem: In selecting random samples of size n from a population, the sampling distribution of
the sample mean can be approximated by a normal distribution as the sample size becomes large

Properties of Point estimates

Unbiased – if the expected value of the sample statistic is equal to the population parameter being
estimated, the sample statistic is said to be an unbiased estimator of the population parameter

Efficiency - Given the choice of two unbiased estimators of the same population parameter, we would
prefer to use the point estimator with the smaller standard deviation, since it tends to provide estimates
closer to the population parameter.
The point estimator with the smaller standard deviation is said to have greater relative efficiency than
the other.

Consistency - A point estimator is consistent if the values of the point estimator tend to become closer to
the population parameter as the sample size becomes larger.

Topic 9 (week 9) – Interval Estimation, Confidence Interval, and Sample Size Determination

Margin of Error and the Interval Estimate

- A point estimator cannot be expected to provide the exact value of the population parameter
- An interval estimate can be computed by adding and subtracting a margin of error to the point
estimate.
- Point Estimate +/- Margin of Error
- The purpose of an interval estimate is to provide information about how close the point estimate
is to the value of the parameter
- The general form of an interval estimate of a population mean is:
x̄ ± margin of error

Interval Estimate of a Population Mean: σ Known

- In order to develop an interval estimate of a population mean, the margin of error must be
computed using either:
the population standard deviation σ, or
the sample standard deviation s
- σ is rarely known exactly, but often a good estimate can be obtained based on historical data or
other information
- We refer to such cases as the σ known case

Interval Estimate of μ

x̄ ± z(α/2) · σ/√n

Where:
x̄ is the sample mean
1 − α is the confidence coefficient

z(α/2) is the z value providing an area of α/2 in the upper tail of the standard normal probability
distribution
σ is the population standard deviation
n is the sample size
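A sketch of the σ known interval; the sample mean, σ, and n below are hypothetical:

```python
from statistics import NormalDist
import math

xbar, sigma, n = 80, 12, 36              # hypothetical sample mean, known sigma
conf = 0.95
z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_(a/2), about 1.96 for 95%

margin = z * sigma / math.sqrt(n)        # margin of error
lo, hi = xbar - margin, xbar + margin
print(round(lo, 2), round(hi, 2))        # 76.08 83.92
```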

Meaning of Confidence

- Because 90% of all intervals constructed using x̄ ± 1.645σx̄ will contain the population
mean, we say we are 90% confident that the interval x̄ ± 1.645σx̄ includes the population
mean μ
- We say that this interval has been established at the 90% confidence level
- The value .90 is referred to as the confidence coefficient

Adequate Sample Size


- In most applications, a sample size of n = 30 is adequate
- If the population distribution is highly skewed or contains outliers, a sample size of 50 or more is
recommended
- If the population is not normally distributed but is roughly symmetric, a sample size as small as 15
will suffice
- If the population is believed to be at least approximately normal, a sample size of less than 15 can
be used.

Interval Estimate of a Population Mean: σ Unknown

- If an estimate of the population standard deviation σ cannot be developed prior to sampling, we use
the sample standard deviation s to estimate σ
- This is the σ unknown case; the interval estimate of μ is based on the t distribution
- We’ll assume for now that the population is normally distributed

t Distribution
- The t distribution is a family of similar probability distributions
- A specific t distribution depends on a parameter known as the degrees of freedom
- Degrees of freedom refer to the number of independent pieces of information that go into the
computation of s
- A t distribution with more degrees of freedom has less dispersion
- As the degrees of freedom increases, the difference between the t distribution and the standard
normal probability distribution becomes smaller and smaller

Interval Estimate

x̄ ± t(α/2) · s/√n

where:
1 − α = the confidence coefficient
t(α/2) = the t value providing an area of α/2 in the upper tail of a t distribution with n − 1 degrees of freedom
s = the sample standard deviation

Topic 10 (week 10) – Hypothesis Testing

Hypothesis testing – can be used to determine whether a statement about the value of a population
parameter should or should not be rejected.
The null hypothesis, denoted by H0, is the tentative assumption about the population parameter
The alternative hypothesis, denoted by Ha, is the opposite of what is stated in the null hypothesis

Developing Null and Alternative Hypothesis

Example: A new teaching method is developed that is believed to be better than the current method.
Alternative Hypothesis: The new teaching method is better
Null Hypothesis: The new method is no better than the old method

Null hypothesis as an assumption to be challenged

- We might begin with a belief or assumption that a statement about the value of a population parameter is
true
- We then use a hypothesis test to challenge the assumption and determine if there is statistical
evidence to conclude that the assumption is incorrect
- In these situations, it is helpful to develop the null hypothesis first

Example: The label on a soft drink bottle states that it contains 67.6 fluid ounces.
Null hypothesis: The label is correct, μ ≥ 67.6 ounces
Alternative hypothesis: The label is incorrect, μ < 67.6 ounces

Summary of Forms for Null and Alternative Hypothesis about a Population Mean

- The equality part of the hypothesis always appears in the null hypothesis
- In general, a hypothesis test about the value of a population mean μ must take one of the following
three forms (where μ0 is the hypothesised value of the population mean)

Type 1 Error

A type 1 error is rejecting H0 when it is true


- The probability of making a type 1 error when the null hypothesis is true as an equality is called
the level of significance
- Applications of hypothesis testing that only control the type 1 error are often called
significance tests.

Type 2 Errors

- A type 2 error is accepting H0 when it is false
- It is difficult to control the probability of making a type 2 error
- Statisticians avoid the risk of making a type 2 error by saying “do not reject H0” rather than “accept
H0”

- p-value approach: the p-value is the probability, computed using the test statistic, that measures the support
(or lack of support) provided by the sample for the null hypothesis
- If the p-value is less than or equal to the level of significance α, the value of the test statistic is in
the rejection region
- Reject H0 if the p-value is less than or equal to α

Suggested Guidelines for Interpreting p-value

Less than 0.01 – Overwhelming evidence to conclude Ha is true


Between 0.01 and 0.05 – Strong evidence to conclude Ha is true
Between 0.05 and 0.1 – Weak evidence to conclude Ha is true
Greater than 0.1 – Insufficient evidence to conclude Ha is true
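The p-value rejection rule above can be sketched in Python for the lower-tail bottle-filling example. This is a minimal illustration, not part of the lecture: the sample data is made up, and scipy's one-sample t-test is assumed as the test procedure.

```python
# Sketch of the p-value approach for H0: mu >= 67.6 vs Ha: mu < 67.6.
# Sample fill volumes are made up for illustration.
from scipy import stats

sample = [67.4, 67.6, 67.5, 67.3, 67.7, 67.4, 67.5, 67.6, 67.2, 67.5]

# One-sided (lower-tail) one-sample t-test against the hypothesized mean
t_stat, p_value = stats.ttest_1samp(sample, popmean=67.6, alternative="less")

alpha = 0.05
reject_h0 = p_value <= alpha  # reject H0 when the p-value <= level of significance
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}, reject H0: {bool(reject_h0)}")
```

With this made-up sample the p-value falls between 0.01 and 0.05, which the guidelines above would read as strong evidence that Ha is true.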

Topic 11 – Simple Linear Regression

Regression analysis can be used to develop an equation showing how the variables are related
- The variable being predicted is called the dependent variable and is denoted by y
- The variables being used to predict the value of the dependent variable are called the independent
variables and are denoted by x
- Simple linear regression involves one independent variable and one dependent variable
- The relationship between the two variables is approximated by a straight line
- Regression analysis involving two or more independent variables is called multiple regression

Simple Linear Regression Model

The simple linear regression model is: y = β0 + β1x + ε


Where:
β0 and β1 are called parameters of the model
ε is a random variable called the error term

Simple linear Regression Equation

The simple linear regression equation is E(y) = β0 + β1x


- Graph of regression equation is a straight line


- β0 is the y-intercept of the regression line
- β1 is the slope of the regression line
- E(y) is the expected value of y for a given x value

Least Square Method

Least squares criterion: min Σ(yi − ŷi)²


Where yi = observed value of the dependent variable for the ith observation
ŷi = estimated value of the dependent variable for the ith observation
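The least squares criterion can be sketched with numpy on made-up data: the slope b1 and intercept b0 below are the values that minimize the sum of squared deviations between the observed and estimated values.

```python
# Sketch of the least squares method (data made up for illustration).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares estimates:
#   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),  b0 = ybar - b1 * xbar
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

y_hat = b0 + b1 * x                 # estimated values of y
sse = np.sum((y - y_hat) ** 2)      # the quantity the criterion minimizes
print(f"yhat = {b0:.3f} + {b1:.3f} x, SSE = {sse:.4f}")
```

For this data the fitted line is approximately ŷ = 0.05 + 1.99x; any other choice of b0 and b1 would give a larger sum of squared residuals.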

Assumptions About term error e

1. The error ε is a random variable with a mean of zero

Topic 12 – Non – Parametric Statistics

Nonparametric Methods

- Most of the statistical methods referred to as parametric require the use of interval- or ratio-
scaled data
- Nonparametric methods are often the only way to analyse categorical (nominal or ordinal) data
and draw statistical conclusions
- Nonparametric methods require no assumptions about the population probability distributions
- Nonparametric methods are often called distribution-free methods
- Whenever the data are quantitative, we transform the data into categorical data in order to
conduct a nonparametric test

Wilcoxon Signed-Rank Test

- The Wilcoxon signed-rank test is a procedure for analysing data from a matched samples
experiment
- The test uses quantitative data but does not require the assumption that the differences between
the paired observations are normally distributed
- It only requires the assumption that the differences have a symmetric distribution
- This occurs whenever the shapes of the two populations are the same and the focus is on
determining if there is a difference between the two population medians.

Wilcoxon Signed-Rank Test for Differences in Two Medians

- Tests for equality between two population medians using matched samples
- Differences between the paired observations need not be normally distributed
- Uses a distribution-free procedure
- Must use the normal approximation if the sample size is larger than 10

Wilcoxon Signed Rank Test

Let T− denote the sum of the negative signed ranks


Let T+ denote the sum of the positive signed ranks

- If the medians of the two populations are equal, we would expect the sum of the negative signed
ranks and the sum of the positive signed ranks to be approximately the same


- We use T+ as the test statistic
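The signed-rank idea can be sketched with scipy on made-up matched samples. Note one assumption beyond the notes: under the two-sided alternative, scipy reports the smaller of T+ and T− as the test statistic rather than T+ itself.

```python
# Sketch: Wilcoxon signed-rank test on matched samples (data made up for illustration).
from scipy import stats

before = [8.2, 9.1, 7.5, 8.8, 9.4, 8.0, 8.5, 9.0]
after  = [7.9, 8.4, 7.6, 8.2, 8.9, 7.2, 8.1, 8.8]

# H0: the median of the paired differences is zero
# scipy returns the smaller of T+ and T- as the statistic (two-sided test)
t_stat, p_value = stats.wilcoxon(before, after)
print(f"T = {t_stat}, p-value = {p_value:.4f}")
```

Here only one paired difference is negative (with the smallest absolute rank), so the reported statistic is small and the p-value is below 0.05, suggesting a difference between the two population medians.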

(Know definition of alpha)

Mann – Whitney – Wilcoxon Test

- This is another nonparametric method for determining whether there is a difference between
two populations
- This test is based on two independent samples
- Advantages of this procedure are:
a) It can be used with either ordinal data or quantitative data
b) It does not require the assumption that the populations have a normal distribution
- Instead of testing for the difference between the medians of two populations, this method tests to
determine whether the two populations are identical

The hypotheses are:


H0: The two populations are identical
Ha: The two populations are not identical
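A minimal scipy sketch of this test on two made-up independent samples; the U statistic and exact p-value come from `scipy.stats.mannwhitneyu`, which is assumed here as the implementation of the Mann–Whitney–Wilcoxon procedure.

```python
# Sketch: Mann-Whitney-Wilcoxon test on two independent samples
# (data made up for illustration).
from scipy import stats

sample_a = [12, 15, 11, 18, 14, 16, 13]
sample_b = [22, 19, 25, 21, 24, 20, 23]

# H0: the two populations are identical
# Ha: the two populations are not identical (two-sided)
u_stat, p_value = stats.mannwhitneyu(sample_a, sample_b, alternative="two-sided")
print(f"U = {u_stat}, p-value = {p_value:.4f}")
```

In this extreme made-up case every value in sample_b exceeds every value in sample_a, so U for the first sample is 0 and the p-value is very small, leading to rejection of H0.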
