Chapter 1 & 2 - Stats
Types of Data
• Data: Compilations of facts, figures, or other contents, both numerical and non-numerical.
- Data of all types and formats are generated from multiple sources.
- Customers and businesses use data to help make decisions.
- Statistics is the language of data.
• Statistics: The science that deals with collecting, preparing, analyzing, interpreting, and
presenting data.
• First: find the right data and prepare it for the analysis.
• Second: use the appropriate statistical tool, which depends on the data.
• Third: clearly communicate information with actionable business insights.
Branches of Statistics
• Descriptive Statistics: Summarizes IMPORTANT ASPECTS OF A DATA SET, including
collecting, organizing, and presenting data in charts and tables.
- Often involves calculating numerical measures (typical value, variability).
• Inferential Statistics: Draws conclusions about a LARGER SET OF DATA (population) based
on a smaller set of data (sample). It involves analyzing sample data to make inferences about
an unknown population parameter.
- A population consists of all items/members of interest.
- A sample is a subset of the population.
GENERALLY: It is not feasible to obtain data on the entire population
- (e.g., every cellphone user in the Philippines)
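A minimal Python sketch of the two branches, using only the standard library. The population here is simulated with made-up numbers (hypothetical monthly phone bills), purely so the sample estimate can be compared with a value that is normally unknown. The `describe` helper is an illustrative name, not something from the course material.

```python
import random
import statistics

# Hypothetical population: monthly phone bills for 10,000 customers (simulated, made-up values).
random.seed(42)
population = [random.gauss(50, 12) for _ in range(10_000)]

# In practice only a sample is observed: draw 200 members without replacement.
sample = random.sample(population, 200)


def describe(values):
    """Descriptive statistics: a typical value (mean) and a measure of variability (stdev)."""
    return {"mean": statistics.mean(values), "stdev": statistics.stdev(values)}


# Descriptive step: summarize the sample.
sample_summary = describe(sample)

# Inferential step: use the sample mean as an estimate of the population mean.
estimated_population_mean = sample_summary["mean"]

# Only possible here because the population was simulated.
population_mean = statistics.mean(population)

print(f"sample mean (estimate): {estimated_population_mean:.2f}")
print(f"population mean (normally unknown): {population_mean:.2f}")
```

The sample mean should land close to the population mean, which is the idea behind using a subset to learn about the whole.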
Big Data
• Three characteristics of big data:
o Volume: Immense amount of data compiled from multiple sources.
o Velocity: Data generated at a rapid speed.
o Variety: Different types, forms, and granularity of data.
• Additional characteristics:
o Veracity: Credibility and quality of the data.
o Value: Methodological plan for formulating questions and unlocking hidden potential.
• Challenges: Difficulty in managing, processing, and analyzing large volumes of data using
traditional tools.
Measurement Scales
• Nominal Scale: LEAST SOPHISTICATED. Represents categories or groups without a specific
order (e.g., marital status).
• Ordinal Scale: STRONGER LEVEL OF MEASUREMENT. Categorizes and ranks data with
respect to some characteristic, but differences between ranks are not meaningful (e.g., star
ratings).
• Interval Scale: MEANINGFUL DIFFERENCES. Categorizes and ranks data with meaningful
differences, but zero is arbitrary (e.g., temperature in Celsius or Fahrenheit). Ratios are NOT
meaningful.
• Ratio Scale: STRONGEST LEVEL OF MEASUREMENT. CONSISTENT AND
MEANINGFUL with a true zero point, allowing meaningful ratios (e.g., weight, height, profits).
Arithmetic operations are valid on interval- and ratio-scaled variables.
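The difference between the scales can be checked with a short Python sketch. All data values below are made up for illustration, and the helper names `c_to_f` and `kg_to_lb` are hypothetical. It shows which summary fits each scale and why ratios break down on an interval scale: changing the unit changes the ratio for temperature but not for weight.

```python
import statistics


def c_to_f(celsius):
    """Convert Celsius to Fahrenheit (both are interval scales with an arbitrary zero)."""
    return celsius * 9 / 5 + 32


def kg_to_lb(kg):
    """Convert kilograms to pounds (both are ratio scales with a true zero)."""
    return kg * 2.20462


# Nominal: only counting categories makes sense, so the mode is a suitable summary.
marital_status = ["single", "married", "married", "divorced"]
most_common_status = statistics.mode(marital_status)  # "married"

# Ordinal: ranks are meaningful, so the median is a suitable summary.
star_ratings = [3, 5, 4, 4, 2]
median_rating = statistics.median(star_ratings)  # 4

# Interval: 20 C looks like "twice" 10 C, but the ratio changes with the unit.
ratio_celsius = 20 / 10                         # 2.0
ratio_fahrenheit = c_to_f(20) / c_to_f(10)      # 68 / 50 = 1.36

# Ratio: 80 kg is twice 40 kg in any unit.
ratio_kg = 80 / 40                              # 2.0
ratio_lb = kg_to_lb(80) / kg_to_lb(40)          # 2.0

print(most_common_status, median_rating)
print(ratio_celsius, ratio_fahrenheit)
print(ratio_kg, ratio_lb)
```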
Data Preparation
• Inspecting and Preparing Data: Involves counting, sorting, handling missing values, and
subsetting.
o Counting and Sorting: Help verify that the data are complete, reveal missing values, and
review the range of values.
o Strategies in handling missing values:
▪ Omission Strategy: EXCLUDE OBSERVATIONS with missing values.
▪ Imputation Strategy: REPLACE missing values with reasonable imputed values (e.g.,
average for numeric variables, predominant category for categorical variables).
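The inspection steps and the two missing-value strategies above can be sketched in plain Python. The records, field names (`age`, `segment`), and values are hypothetical, and `None` is assumed to mark a missing value.

```python
import statistics
from collections import Counter

# Hypothetical customer records; None marks a missing value.
records = [
    {"age": 34, "segment": "retail"},
    {"age": None, "segment": "retail"},
    {"age": 51, "segment": None},
    {"age": 29, "segment": "corporate"},
    {"age": 46, "segment": "retail"},
]

# Counting: how many values are missing in each field?
missing_counts = {key: sum(r[key] is None for r in records) for key in records[0]}

# Sorting: review the range of the observed numeric values.
sorted_ages = sorted(r["age"] for r in records if r["age"] is not None)
age_range = (sorted_ages[0], sorted_ages[-1])

# Omission strategy: keep only observations with no missing values.
complete = [r for r in records if all(v is not None for v in r.values())]

# Imputation strategy: mean for the numeric field, predominant category for the categorical one.
mean_age = statistics.mean(sorted_ages)
top_segment = Counter(
    r["segment"] for r in records if r["segment"] is not None
).most_common(1)[0][0]
imputed = [
    {
        "age": r["age"] if r["age"] is not None else mean_age,
        "segment": r["segment"] if r["segment"] is not None else top_segment,
    }
    for r in records
]

print(missing_counts, age_range)
print(len(complete), "complete observations after omission")
print(imputed)
```

Omission shrinks the data set (five records become three here), while imputation keeps every observation at the cost of introducing estimated values.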
Stem-and-Leaf Diagram
• Stem-and-Leaf Diagram (stem: left-most digits; leaf: last digit): Provides a visual method for
displaying a numerical variable, showing where observations are centered and dispersed.
o Example: Ages of the wealthiest people in the world.
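A minimal Python sketch of how such a diagram can be built, assuming non-negative integer values where the leaf is the last digit and the stem is everything before it. The function name `stem_and_leaf` is illustrative, and the ages are made-up values, not the actual ages of the wealthiest people.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative integers (leaf = last digit)."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)

    rows = []
    # Include stems with no leaves so gaps in the data stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(digit) for digit in groups.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return rows


# Made-up ages for illustration only.
ages = [36, 48, 52, 55, 58, 61, 64, 64, 67, 73, 75, 81, 88, 94]
for row in stem_and_leaf(ages):
    print(row)
```

For these made-up ages, the longest row is the stem 6 (the sixties), which shows where the observations are centered, while the spread from stem 3 to stem 9 shows how dispersed they are.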