C3 Comm213
Population vs. Sample:
- Subset of a population
- Should be representative; focus is on statistics = characteristics of a sample
- Used to make inferences (conclusions) about the characteristics of a population
- A sample is drawn from the population to reduce cost
The inference drawn from the sample and the population parameter may match
Data reduction: process of reducing the size of a data set to a more manageable and suitable size for a
business analysis project
- To retain meaningful information
- Focus on most interesting, critical, and abnormal items
- Speeds up analysis + reduces cost
How? Filtering:
- Data → Sort & Filter → Filter: removes rows that aren’t interesting/relevant, i.e. keeps only the
interesting ones
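The same sort-and-filter idea can be sketched in Python. This is only an illustration: the order records, field names, and the 100 threshold are made up and are not from the course material.

```python
# Hypothetical order records; field names and values are illustrative only.
orders = [
    {"id": 1, "region": "East", "amount": 250.0},
    {"id": 2, "region": "West", "amount": 40.0},
    {"id": 3, "region": "East", "amount": 980.0},
    {"id": 4, "region": "North", "amount": 15.0},
]


def filter_rows(rows, keep):
    """Return only the rows for which keep(row) is True."""
    return [row for row in rows if keep(row)]


# Filter: keep only the rows relevant to the analysis
# (here, East-region orders of at least 100).
relevant = filter_rows(
    orders, lambda r: r["region"] == "East" and r["amount"] >= 100
)

# Sort: put the largest (most critical) items first.
relevant.sort(key=lambda r: r["amount"], reverse=True)

print([r["id"] for r in relevant])
```

The reduced list keeps the meaningful rows and is smaller than the original, which is the point of data reduction.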
Deciding whether to use a subset of the data: need to consider the purpose of the analysis, time, and cost.
LO 3 Understanding statistics:
Probability distributions:
1. Random Variable: quantifies the outcomes of random occurrences
- To measure things that happen by chance
2. Data distribution: shows all possible values for a variable and how frequently each occurs
- Tells you which numbers came up and how often each one appears
- Deals with observed data you already have
3. Probability distribution:
- Statistical function that describes the possible values in a population and the likelihood that any
given observation (random variable) takes a given value or falls in a given range
- How probable an outcome is
- Deals with hypothetical/predicted data
- If continuous distribution = probabilities are presented for ranges; if discrete = for individual values
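A small sketch of the three ideas above, assuming a fair six-sided die as the random variable. The die and the list of rolls are made-up examples, not course data.

```python
from collections import Counter
from fractions import Fraction

# Random variable: the face shown by one roll of a fair six-sided die (assumed).
faces = [1, 2, 3, 4, 5, 6]

# Probability distribution: every possible value with its likelihood.
probability = {face: Fraction(1, 6) for face in faces}

# Data distribution: each value with how often it occurs in a set of rolls.
rolls = [2, 5, 5, 1, 6, 2, 5, 3]  # made-up rolls
data_distribution = Counter(rolls)

# Likelihood of a range of values, here P(X <= 2).
p_at_most_two = sum(p for face, p in probability.items() if face <= 2)

print(sum(probability.values()))  # the probabilities add up to 1
print(data_distribution[5])       # how many times a 5 was rolled
print(p_at_most_two)
```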
Measures of central tendency: describe the center point of a data set + some are susceptible to outliers!
Mean: average = Sum/n – cannot be used for categorical data + susceptible to outliers
Median: midpoint of the sorted data distribution (smallest – highest) – cannot be used for categorical
data – average of the two middle values if n is even – not susceptible to outliers
Mode: simplest + most common value – most important for categorical data – can be uni/bi/multi-modal
Symmetry = when all of the above are equal (mean = median = mode)
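A quick check of these measures with Python's standard `statistics` module. The salary figures and colours are hypothetical; the last salary is a deliberate outlier to show which measures move.

```python
import statistics

# Hypothetical salaries in thousands.
salaries = [40, 42, 45, 45, 48, 50]
with_outlier = salaries + [400]  # one extreme value added

# Without the outlier, mean, median, and mode all agree (symmetry).
print(statistics.mean(salaries), statistics.median(salaries), statistics.mode(salaries))

# With the outlier, the mean jumps but the median stays put.
print(statistics.mean(with_outlier), statistics.median(with_outlier))

# Mode is the only one of the three that works on categorical data.
colours = ["red", "blue", "red", "green"]
print(statistics.mode(colours))

# multimode returns every most-common value, so it reveals bi-modal data.
print(statistics.multimode([1, 1, 2, 2, 3]))
```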
Kurtosis: distribution shape + thickness of tails
- Closer to 0 = closer to a normal distribution
- Whether observations are more clustered in the peak or in the tails
- Positive kurtosis: more extreme values (thicker tails), peaked at center
- Negative kurtosis: fewer extreme values, so flatter around center
Works with skewness: together they help determine the likelihood of an event falling in a tail
- Positive skewness: skewed right (longer right tail)
- Negative skewness: skewed left (longer left tail)
Mean higher than median = implies some outliers are skewing the distribution right
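A rough sketch of how skewness and kurtosis can be computed. It assumes the population (divide by n) moment formulas and reports excess kurtosis (fourth standardized moment minus 3), which is the version where 0 corresponds to a normal distribution. Software tools may use slightly different sample-adjusted formulas. Both data sets are made up.

```python
import statistics


def skewness(values):
    """Population skewness: the third standardized moment."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((x - m) / sd) ** 3 for x in values) / len(values)


def excess_kurtosis(values):
    """Population excess kurtosis: fourth standardized moment minus 3."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((x - m) / sd) ** 4 for x in values) / len(values) - 3


symmetric = [1, 2, 3, 4, 5]          # evenly spread, no tails
right_skewed = [1, 1, 2, 2, 3, 20]   # one large value pulls the mean right

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed))
print(statistics.fmean(right_skewed), statistics.median(right_skewed))
```

For the right-skewed list the mean ends up above the median, matching the rule of thumb above.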
Measures of dispersion:
Range: Max – Min
- Affected by outliers
Interquartile range (IQR): data split into 4 quartiles to determine shape
- Data should be in sorted order
- Q1: lowest 25% of observations
- Q2: next 25% - from 25% to 50% (= median)
- Q3: median to 75%
- Q4: 75% and above
- “Inter” = implies we are not interested in all 4 quartiles but in the specific middle section, Q2 and Q3
(the middle 50% of the data)
Suppose you have a data set: 24, 25, 25, 25, 26, 27, 28, 28, 29
- Q2 value = median = 26
- Q1 value = median of lower half (24, 25, 25, 25) = (25+25)/2 = 25
- Q3 value = median of upper half (27, 28, 28, 29) = (28+28)/2 = 28
- Interquartile range: Q3 – Q1 = 28 – 25 = 3
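The worked example can be reproduced with a short function. This is a sketch of the median-of-halves method used in the example (for an odd number of values the median is left out of both halves). The function name is made up, and other tools use different quartile conventions that can give slightly different values.

```python
import statistics


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) values using the median-of-halves method.

    With an odd number of values, the median is excluded from both halves.
    """
    data = sorted(values)  # quartiles require sorted data
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return statistics.median(lower), statistics.median(data), statistics.median(upper)


data = [24, 25, 25, 25, 26, 27, 28, 28, 29]
q1, q2, q3 = quartiles_by_halves(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```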
Variance: average squared deviation from the mean
- Measures how far individual data points in a data set are from the average
- [(x1 – mean)^2 + (x2 – mean)^2 + … + (xn – mean)^2] / n, where n = # of x values
Standard deviation:
- Square root of Variance
- Same unit as data value
Coefficient of variation: σ/μ
- Measure of relative variability by comparing standard deviation to mean
- Useful when comparing different datasets or units or scales
- [SD/MEAN]*100
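The three formulas above, computed step by step on the data set from the IQR example. This sketch divides by the number of values, as in the notes; a sample variance would divide by n – 1 instead (that is what `statistics.variance` does).

```python
import math
import statistics

data = [24, 25, 25, 25, 26, 27, 28, 28, 29]  # data set from the IQR example

n = len(data)
mean = sum(data) / n

# Variance: average squared deviation from the mean (dividing by n).
variance = sum((x - mean) ** 2 for x in data) / n

# Standard deviation: square root of the variance, same unit as the data.
std_dev = math.sqrt(variance)

# Coefficient of variation: standard deviation relative to the mean, in percent.
cv = std_dev / mean * 100

print(round(variance, 3), round(std_dev, 3), round(cv, 2))

# Cross-check against the standard library's population formulas.
print(statistics.pvariance(data), statistics.pstdev(data))
```

Because the coefficient of variation is unit-free, it can be compared across data sets measured in different units or scales.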
Uniform distribution:
- Probability distribution that describes a continuous random variable where every
value within a given range is equally likely to occur
- Flat and constant distribution – no peaks, no tails
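A minimal sketch of a continuous uniform distribution on a range [a, b]. The function names and the 0 to 10 range are made up for illustration. The density is the same everywhere in the range, and the probability of a sub-range is just its share of the total length.

```python
def uniform_pdf(x, a, b):
    """Density of a continuous uniform distribution on [a, b]."""
    return 1 / (b - a) if a <= x <= b else 0.0


def uniform_prob(lo, hi, a, b):
    """P(lo <= X <= hi) for X uniform on [a, b]: overlap length / total length."""
    overlap = max(0.0, min(hi, b) - max(lo, a))
    return overlap / (b - a)


# Hypothetical example: a value equally likely to fall anywhere between 0 and 10.
a, b = 0, 10
print(uniform_pdf(2, a, b), uniform_pdf(7, a, b))  # same height: flat, no peak
print(uniform_pdf(12, a, b))                       # zero outside the range
print(uniform_prob(2, 5, a, b))                    # share of the range covered
```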
LO 3.4: Using Software tools to Create Summary Statistics:
Argument: value that the function uses to perform calculations (IQR exceptions)
Descriptive data: Data → Data Analysis (on right) → DS (Descriptive Statistics) → OK = get data
Tableau summary statistics – descriptive statistics: Worksheet → Show Summary
LO 3.5 Interpreting and Visualizing statistics:
Frequency distribution:
- For numerical data b/c for categorical data we can just count
- Bins, classes, and intervals (categories in numerical data)
- Table uses bins to list the frequency of various outcomes in the sample
- Symmetrical = expect the 5th/middle bin to contain the most observations
- Skewed right = more observations in the 1st bins
- Skewed left = more in the last ones
Example frequency table:
Bin          Frequency
[0-50)       65
[50-100)     63
[100-150)    30
[150-200)    8
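A frequency table like this can be built by assigning each value to a half-open bin and counting. In this sketch only the bin width of 50 and the [start-end) labels follow the notes; the ten observations are made up.

```python
from collections import Counter

bin_width = 50

# Hypothetical observations, for illustration only.
values = [12, 49, 50, 75, 99, 100, 149, 151, 30, 60]


def bin_label(value, width):
    """Label of the half-open bin [start, start + width) that contains value."""
    start = (value // width) * width
    return f"[{start}-{start + width})"


# Count how many observations fall into each bin.
frequency = Counter(bin_label(v, bin_width) for v in values)

# Print the bins in order, lowest first.
for start in range(0, 200, bin_width):
    label = f"[{start}-{start + bin_width})"
    print(label, frequency[label])
```

Note that 50 lands in [50-100) and not in [0-50): the round bracket means the upper edge is excluded. Most of these made-up observations fall in the first bins, the pattern described above as skewed right.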
Histogram:
- Visual representation for frequency distribution
- Size of bins can change the shape
- Shows stats like: mean, sum, count, etc.
- No gaps in between unless need to indicate absence of data point for that interval
- Reference lines for mean and median
- Like vertical bar chart but bars are replaced by bins
- Y axis in a bar graph shows a descriptive measure, but in a histogram it shows the # of observations
Box Plot:
- Shows dispersion in terms of quartiles
- Box: represents the IQR, from the 25th percentile (Q1) to the 75th percentile (Q3)
- Length: reflects the spread of the middle 50% of the data