
C3 Comm213

Chapter 3 discusses the definitions and differences between populations and samples, emphasizing the importance of sampling methods to reduce costs and biases in data collection. It covers various statistical concepts including descriptive and inferential statistics, measures of central tendency and dispersion, and probability distributions. Additionally, it highlights the significance of visualizing data through tools like histograms and box plots for better interpretation and analysis.

Uploaded by

laurabosselet
Copyright
© © All Rights Reserved

Chapter 3: Defining population and samples

LO 3.1 Defining pop and samples


Population:
- Group with something in common (parameter)
- Expensive/impossible to get all
- Parameter: characteristic of a population

VS Sample:
- Subset of a population
- Representative and focus on statistics = characteristics of a sample
- Used to make inferences (conclusion) about the characteristic of a pop
- To reduce cost = sample is drawn out of population
→ Sample statistic (and the inference drawn from it) may match the population parameter

Descriptive/Summary statistics: measures that describe the visible components of a population or sample
Inferential statistics: measure calculated only using sample
Hypothesis: proposed explanation made on basis of sample/limited evidence
- Starting point for more investigation
- Uses inferential statistics
- T-test, Z-test, P-test, F-test

LO 3.2: Sampling methods, data reduction, and bias

1. Simple random sampling:
- All elements have an equal chance of being selected, in the hope that the sample is representative of the whole population
- Suitable when specific subgroups don't matter, or when the data are homogeneous/similar
2. Stratified random sampling:
- Population is divided into subgroups (strata) based on specific characteristics or attributes → calculate the proportion of the population in each stratum and randomly sample each stratum so that an appropriate number from each is represented
- Ex: ensure representation of both happy/unhappy customers
3. Cluster sampling:
- Dividing into groups, calculate proportions, select only a few clusters that are relevant + proportionate
- Select few groups: geography, time zone
- More efficient and cost-effective
4. Convenience/non-probability sampling
- Chosen for ease of access and availability when resources are limited and budgets are low
- Some bias
→ if data already collected = simply subset the data
→ if data need to be collected = distribute survey, then stop collecting once the target # is reached

Data reduction: process of reducing the size of a data set to a more manageable and suitable size for a business analysis project
- To retain meaningful information
- Focus on most interesting, critical, and abnormal items
- Speeds up analysis + reduce cost
How? Filtering:
→ Data → Sort & Filter → Filter: remove rows that aren't interesting/relevant, or choose the interesting ones
Decision process of using subset of data: need to consider purpose of analysis, time and cost.
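
When the subset is drawn by sampling rather than filtering, the three probability sampling methods listed under LO 3.2 could look roughly like this in Python. This is a minimal sketch using only the standard library. The customer records, field layout, and function names are hypothetical, and the cluster version keeps every member of the chosen clusters (one-stage cluster sampling), which is an assumption rather than something the notes specify.

```python
import random
from collections import defaultdict

def simple_random_sample(population, n, seed=0):
    """Every element has the same chance of being selected."""
    return random.Random(seed).sample(population, n)

def group_by(population, key):
    """Split the population into groups (strata or clusters) by a key function."""
    groups = defaultdict(list)
    for item in population:
        groups[key(item)].append(item)
    return groups

def stratified_sample(population, stratum_of, n, seed=0):
    """Randomly sample each stratum in proportion to its population share."""
    rng = random.Random(seed)
    sample = []
    for members in group_by(population, stratum_of).values():
        # Rounding means the total can differ slightly from n in general.
        k = round(n * len(members) / len(population))
        sample.extend(rng.sample(members, k))
    return sample

def cluster_sample(population, cluster_of, n_clusters, seed=0):
    """Randomly pick a few whole clusters and keep all of their members."""
    rng = random.Random(seed)
    clusters = group_by(population, cluster_of)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

# Hypothetical customers: (id, satisfaction, region)
customers = [
    (i, "unhappy" if i % 4 == 0 else "happy", f"region{i % 5}")
    for i in range(100)
]

srs = simple_random_sample(customers, 10)
strat = stratified_sample(customers, lambda c: c[1], 20)
clus = cluster_sample(customers, lambda c: c[2], 2)

print(len(srs), len(strat), len(clus))
```

The stratified version guarantees that both happy and unhappy customers appear in proportion to their share, while the cluster version trades some representativeness for lower collection cost.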

Bias in business analytics:


- Prejudice in favor of or against
- Intentional vs unintentional
- During data collection, analysis, results
Types:
1. Nonresponse: sample not answering survey = potential distortion in results
- Partiality that results when respondents differ from non-respondents
→ to prevent: inform the sample about the survey's importance, or offer an incentive (e.g., a gift card)
2. Selection: analyst selected an inappropriate sample
→ when the analyst purposefully selects portions of participants/data likely to provide answers aligning with the analyst's belief = not representative, so results can't be generalized
3. Confirmation: favoring info that confirms a pre-existing belief – does not reflect the truth = inaccurate conclusions. Arises in data that have already been collected
4. Outlier: extreme values influencing the interpretation of results. Arises in data that have already been collected

LO 3.3: Understanding statistics:

Probability distributions:
1. Random Variable: quantifies the outcomes of random occurrences
- To measure things that happen by chance
2. Data distribution: shows all possible values for a variable and how frequently each occurs
- Tells you what numbers came up and how often each number appeared
- Deals with observed data you already have
3. Probability distribution:
- Statistical function that describes the possible values in a population and the likelihood that any given observation (random variable) takes a given value or falls in a given range
- How probable an outcome is
- Deals with hypothetical/predicted data
- If continuous distribution = probabilities are presented for ranges

Types of numerical data: determine types of probability distribution


1. Discrete data: whole numbers; a finite set of values between any 2 observations
Ex: inventory, vehicles
2. Continuous data: any numerical value; an infinite set of values between any 2 observations
Ex: weight, currency

Measures of central tendency: describe the center point of a data set; some (e.g., the mean) are susceptible to outliers!
Mean: average = Sum/n – cannot be used for categorical data + susceptible to outliers
Median: midpoint of the data distribution once sorted from smallest to highest – cannot be used for categorical data – average of the two middle values if n is even – not susceptible to outliers
Mode: simplest + most common value – most important for categorical data – uni/bi/multi-modal
Symmetry = when all of the above are equal
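
The three measures above can be checked quickly with Python's standard `statistics` module. This is a small sketch on made-up numbers (the data values and variable names are hypothetical) showing how a single extreme value pulls the mean but not the median.

```python
import statistics

# Hypothetical order values with one extreme value (40)
data = [2, 3, 3, 4, 5, 6, 40]

mean = statistics.mean(data)      # sum / n, pulled upward by the outlier
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most common value

print(mean, median, mode)

# The mode also works for categorical data; multimode returns every most-common value.
colors = ["red", "blue", "red", "green", "blue"]
modes = statistics.multimode(colors)
print(modes)

# Mean above median hints that high outliers are pulling the mean to the right.
right_skew_hint = mean > median
```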
Kurtosis: distribution shape + thickness of tails
- Excess kurtosis closer to 0 = normal distribution
- Whether observations are more clustered in the peak or the tails
- Positive kurtosis: more extreme values in the tails, more peaked at center
- Negative kurtosis: fewer extreme values, so flatter around center
Works with skewness: helps determine the likelihood of an event falling in a tail
- Positive skewness: skewed right (longer right tail)
- Negative skewness: skewed left (longer left tail)
→ Mean higher than median = implies some outliers are skewing it right

Measures of dispersion:
Range: Max – Min
- Affected by outliers
Interquartile range (IQR): 4 quartiles to determine shape
- Data should be in sorted order
- Q1: lowest 25% of observations
- Q2: next 25% – from 25% to 50% (= median)
- Q3: median to 75%
- Q4: 75% and more
- “Inter” = implies interest not in all 4 quartiles but in the middle section specifically (Q2 and Q3, i.e., the middle 50%)
→ Suppose you have a data set: 24, 25, 25, 25, 26, 27, 28, 28, 29
- Q2 value = median = 26
- Q1 value = median of lower half (24, 25, 25, 25) = (25+25)/2 = 25
- Q3 value = median of upper half (27, 28, 28, 29) = (28+28)/2 = 28
- Interquartile range: Q3 – Q1 = 28 – 25 = 3
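
The quartile example can be reproduced with a short Python sketch. It assumes the median-of-halves method used in the worked example; the function name is hypothetical, and other tools may use different interpolation rules and return slightly different quartiles.

```python
import statistics

def quartiles_median_of_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method from the example."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # With an odd count, the median itself is left out of both halves.
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return statistics.median(lower), statistics.median(data), statistics.median(upper)

q1, q2, q3 = quartiles_median_of_halves([24, 25, 25, 25, 26, 27, 28, 28, 29])
iqr = q3 - q1
print(q1, q2, q3, iqr)
```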
Variance: average squared deviation from the mean
- Measure how individual data points in a dataset differ from the average
- Variance = [(x1 – mean)^2 + (x2 – mean)^2 + … + (xn – mean)^2] / n, where n = # of x values
Standard deviation:
- Square root of Variance
- Same unit as data value
Coefficient of variation: σ/μ
- Measure of relative variability by comparing standard deviation to mean
- Useful when comparing different datasets or units or scales
- [SD/MEAN]*100
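
As a rough illustration of the three dispersion formulas, the sketch below computes them step by step on hypothetical data. It divides by n (the population form, as in the notes); Python's `statistics.variance` would divide by n – 1 instead, which is the usual choice for a sample.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)  # average squared deviation
std_dev = math.sqrt(variance)                               # same unit as the data
cv_percent = std_dev / mean * 100                           # relative variability in %

print(mean, variance, std_dev, cv_percent)

# Cross-check against the standard library's population versions.
assert math.isclose(variance, statistics.pvariance(data))
assert math.isclose(std_dev, statistics.pstdev(data))
```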

Continuous probability distributions:


Normal (Gaussian) distribution:
- Bell shaped, with most data points clustered near the average/mean
- Naturally occurring, symmetric = skewness is 0, and kurtosis is 3
- 68% of data fall within 1 SD, 95% within 2 SD, 99.7% within 3 SD of the mean
- Ex: weight, pop height, pop IQ, shoe size

Standard Normal distribution:


- Theoretical distribution: doesn't represent a real data set – used only to make comparison between distributions easier + to calculate probabilities of individual observations
- Mean = median = mode = 0
- SD = 1

Z score: standardized values


- Measures number of SD a data point is away from the mean
- Z = (x-mean)/ SD
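
A brief Python sketch of the z-score formula, together with a check of the 68% / 95% / 99.7% figures using the standard normal distribution from the standard library. The IQ-style mean of 100 and SD of 15 are illustrative assumptions, and the helper names are hypothetical.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x lies away from the mean."""
    return (x - mean) / sd

# Hypothetical scale with mean 100 and SD 15
z = z_score(130, mean=100, sd=15)
print(z)

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```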

Uniform distribution:
- Probability distribution that describes a continuous random variable where every value within a given range is equally likely to occur
- Flat and constant distribution – no peaks no tails
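
Because the distribution is flat, the probability of landing in any sub-range is just that sub-range's share of the total width. A minimal sketch, with a hypothetical function name and a made-up 0 to 10 range:

```python
def uniform_probability(a, b, low, high):
    """P(low <= X <= high) for X uniformly distributed on [a, b]."""
    if b <= a:
        raise ValueError("need a < b")
    overlap = min(b, high) - max(a, low)
    return max(0.0, overlap) / (b - a)

# Two sub-ranges of equal width are equally likely, wherever they sit.
p_first = uniform_probability(0, 10, 2, 5)
p_second = uniform_probability(0, 10, 6, 9)
print(p_first, p_second)
```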
LO 3.4: Using Software tools to Create Summary Statistics:

Argument: value that the function uses to perform calculations (IQR exceptions)
Descriptive data: Data → Data Analysis (on right) → Descriptive Statistics → OK = get data
Tableau Summary statistics – descriptive statistics: Worksheet → Show Summary

LO 3.5 Interpreting and Visualizing statistics:
Frequency distribution:
- For numerical data b/c for categorical data we can just count
- Bins, classes, and intervals (categories in numerical data)
- Table uses bins to list the frequency of various outcomes in a sample
- Symmetrical = expect the middle bin (e.g., the 5th) to contain the most observations
- Skewed right = more observations in the 1st bins
- Skewed left = more in the last ones

Bin        Frequency
[0-50)     65
[50-100)   63
[100-150)  30
[150-200)  8
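
Building a frequency table from raw numbers can be sketched in a few lines of Python. The observations and the function name below are hypothetical, the bins are half-open intervals of the form [lo, hi), and the sketch assumes the bin width divides the overall range evenly.

```python
def frequency_table(values, start, stop, width):
    """Count values per bin [lo, lo + width); values outside [start, stop) are skipped."""
    counts = {(lo, lo + width): 0 for lo in range(start, stop, width)}
    for v in values:
        if start <= v < stop:
            lo = start + int((v - start) // width) * width
            counts[(lo, lo + width)] += 1
    return counts

# Hypothetical observations
values = [12, 49, 50, 75, 99, 100, 149, 150, 199]
table = frequency_table(values, start=0, stop=200, width=50)

# Print each bin with its count and a simple text bar.
for (lo, hi), count in table.items():
    print(f"[{lo}-{hi})".ljust(10), str(count).rjust(3), "#" * count)
```

Note that a value sitting exactly on a boundary, such as 50, is counted in the bin that starts there, not the one that ends there.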

Histogram:
- Visual representation for frequency distribution
- Size of bins can change the shape
- Shows stats like: mean, sum, count, etc.
- No gaps in between bins unless needed to indicate the absence of data points for that interval
- Reference lines for mean and median
- Like a vertical bar chart, but bars are replaced by bins
- Y-axis: descriptive measure in a bar chart, but # of observations in a histogram

Box Plot:
- Shows dispersion in terms of quartiles
Box: represents the IQR: 25th percentile (Q1) to 75th percentile (Q3)
- Length: reflects spread of the middle 50% of the data
Line/whisker: represents the range of the data beyond the quartiles (up to Q3 + 1.5 IQR and down to Q1 – 1.5 IQR)
Median: horizontal bar which divides the data into two equal halves
Outliers: individual points beyond
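
The numbers behind a box plot can be computed directly. The sketch below is a hedged illustration: it uses the median-of-halves quartile method, takes Q1 – 1.5 IQR and Q3 + 1.5 IQR as the whisker limits, and treats points outside those limits as outliers, which is the common convention and an assumption here. The data set (with 40 added as an extreme value) and the function name are hypothetical.

```python
import statistics

def box_plot_summary(values):
    """Return quartiles, whisker limits, whisker ends, and outliers for a box plot."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    q1 = statistics.median(lower)
    median = statistics.median(data)
    q3 = statistics.median(upper)
    iqr = q3 - q1
    low_limit, high_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if low_limit <= x <= high_limit]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "limits": (low_limit, high_limit),
        # Whiskers end at the most extreme points still inside the limits.
        "whiskers": (min(inside), max(inside)),
        "outliers": [x for x in data if x < low_limit or x > high_limit],
    }

summary = box_plot_summary([24, 25, 25, 25, 26, 27, 28, 28, 29, 40])
print(summary)
```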
