
Chapter 1 & 2 - Stats

The document covers the fundamentals of data and data preparation, including types of data, branches of statistics, and methods of data collection. It discusses various measurement scales, the characteristics of big data, and the importance of data preparation techniques. Additionally, it outlines methods for visualizing data through tables and graphs, emphasizing the significance of effective communication of insights derived from data analysis.

Uploaded by

Jvnz Lee
Copyright
© All Rights Reserved

Chapter 1: Data and Data Preparation

Types of Data
• Data: Compilations of facts, figures, or other contents, both numerical and non-numerical.
- All types/formats are generated from multiple sources.
- Customers and businesses use data to help make decisions.
- Statistics is the language of data.
• Statistics: The science that deals with collecting, preparing, analyzing, interpreting, and
presenting data.

• First: find the right data and prepare it for the analysis.
• Second: use the appropriate statistical tool, which depends on the data.
• Third: clearly communicate information with actionable business insights.

Branches of Statistics
• Descriptive Statistics: Summarizes IMPORTANT ASPECTS OF A DATA SET, including
collecting, organizing, and presenting data in charts and tables.
- Often involves calculating numerical measures (typical value, variability).
• Inferential Statistics: Draws conclusions about a LARGER SET OF DATA (population) based
on the smaller set of data (sample). It involves analyzing sample data to make inferences about
the unknown population parameter.
- A population consists of all items/members of interest.
- A sample is a subset of the population.
GENERALLY: It is not feasible to obtain data for an entire population
- (e.g., every cellphone user in the Philippines)
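
The population/sample idea above can be sketched in a few lines of Python. This is only an illustration: the population of monthly phone bills is simulated (hypothetical values, not real data), and the sample size of 100 is an arbitrary assumption. It shows a sample statistic being used to estimate a population parameter that would normally be unknown.

```python
import random
from statistics import mean

# Hypothetical population: simulated monthly phone bills for 10,000 people.
rng = random.Random(42)  # fixed seed so the sketch is reproducible
population = [rng.uniform(200, 2000) for _ in range(10_000)]

# In practice only a sample is observed; here 100 members are drawn at random.
sample = rng.sample(population, 100)

population_mean = mean(population)  # the parameter (normally unknown)
sample_mean = mean(sample)          # the statistic used to estimate it

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two printed values will generally be close but not equal, which is why inferential statistics attaches uncertainty to conclusions drawn from a sample.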

Two ways of collecting data


• Cross-sectional Data: refers to data collected by recording a characteristic of many subjects at
the SAME POINT IN TIME or without regard to differences in time.
o Example: NBA Eastern Conference standings for a specific season.
• Time Series Data: Data collected OVER SEVERAL TIME PERIODS focusing on certain
groups, events, or objects.
- Time series data can include hourly, daily, weekly, monthly, quarterly, or annual
observations.
o Example: Homeownership rates over several years.
Structured and Unstructured Data
• Structured Data: Resides in a PRE-DEFINED row-column format, such as spreadsheets or
databases. It is NUMERICAL AND OBJECTIVE.
- Today, only about 20% of all data used in business decisions are structured.
• Unstructured Data: DOES NOT conform to a PRE-DEFINED format and includes textual and
multimedia content, such as social media data.
- Does not conform to the row-column model required by most database systems.
o Example: social media data from Twitter, YouTube, Facebook, and blogs.

Big Data
• 3 Characteristics of Big data:
o Volume: Immense amount of data compiled from multiple sources.
o Velocity: Data generated at a rapid speed.
o Variety: Different types, forms, and granularity of data.

• Additional characteristics:
o Veracity: Credibility and quality of the data.
o Value: Methodological plan for formulating questions and unlocking hidden potential.
• Challenges: Difficulty in managing, processing, and analyzing large volumes of data using
traditional tools.

Variables and Scales of Measurement


2 types of variables
• Categorical Variables: Qualitative data representing categories (e.g., marital status).
• Numerical Variables: Quantitative data, either discrete (countable values) or continuous
(uncountable values within an interval).
NOTE: In order to choose the appropriate techniques for summarizing and analyzing variables, we
need to distinguish between the different measurement scales.

Measurement Scales
• Nominal Scale: LEAST SOPHISTICATED. Represents categories or groups without a specific
order (e.g., marital status).
• Ordinal Scale: STRONGER LEVEL OF MEASUREMENT. Categorizes and ranks data with
respect to some characteristic, but differences between ranks are not meaningful (e.g., star
ratings).
• Interval Scale: MEANINGFUL DIFFERENCES. Categorizes and ranks data with meaningful
differences, but zero is arbitrary (e.g., temperature). Ratios are NOT meaningful.
• Ratio Scale: STRONGEST LEVEL OF MEASUREMENT. CONSISTENT AND
MEANINGFUL with a true zero point, allowing meaningful ratios (e.g., weight, height, profits).
Arithmetic operations are valid on interval- and ratio-scaled variables.
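
A small sketch of why ratios are not meaningful on an interval scale but are on a ratio scale. The temperatures and weights are made-up values, and the helper names c_to_f and kg_to_lb are introduced here only for illustration. Changing the unit of an interval-scaled variable changes the ratio, while differences stay consistent; changing the unit of a ratio-scaled variable leaves the ratio unchanged.

```python
def c_to_f(celsius):
    """Convert Celsius to Fahrenheit (both are interval scales: zero is arbitrary)."""
    return celsius * 9 / 5 + 32

def kg_to_lb(kg):
    """Convert kilograms to pounds (both are ratio scales: zero means 'no weight')."""
    return kg * 2.20462

# Interval scale: the ratio depends on the unit, so it carries no meaning.
temp_c = (10.0, 20.0)
temp_f = tuple(c_to_f(c) for c in temp_c)      # (50.0, 68.0)
ratio_c = temp_c[1] / temp_c[0]                # 2.0
ratio_f = temp_f[1] / temp_f[0]                # 1.36

# Differences remain consistent: every 10-degree C gap is an 18-degree F gap.
gap_low = c_to_f(20.0) - c_to_f(10.0)          # 18.0
gap_high = c_to_f(30.0) - c_to_f(20.0)         # 18.0

# Ratio scale: the ratio is the same in any unit.
weight_kg = (40.0, 80.0)
weight_lb = tuple(kg_to_lb(w) for w in weight_kg)
ratio_kg = weight_kg[1] / weight_kg[0]
ratio_lb = weight_lb[1] / weight_lb[0]

print(ratio_c, ratio_f)    # the two temperature ratios differ
print(gap_low, gap_high)   # the two temperature gaps match
print(ratio_kg, ratio_lb)  # the two weight ratios match
```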

Data Preparation
• Inspecting and Preparing Data: Involves counting, sorting, handling missing values, and
subsetting.
o Counting and Sorting: Helps verify data completeness or determine if there are missing
values and review value ranges.
o Strategies in handling missing values:
▪ Omission Strategy: EXCLUDE OBSERVATIONS with missing values.
▪ Imputation Strategy: REPLACE missing values with reasonable imputed values.
• Numeric variables: replace with the average.
• Categorical variables: replace with the predominant category.
o Subsetting: EXTRACTING RELEVANT PORTION of the data set for analysis,
eliminating low-quality data, and excluding redundant variables.
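
The preparation steps above (counting, sorting, omission, imputation, subsetting) can be sketched with the Python standard library. The five records, the variable names, and the income cutoff of 50,000 are all hypothetical and chosen only to keep the example small.

```python
from statistics import mean, mode

# Hypothetical data set; None marks a missing value.
records = [
    {"age": 34,   "status": "Married", "income": 52000},
    {"age": None, "status": "Single",  "income": 48000},
    {"age": 45,   "status": None,      "income": 61000},
    {"age": 29,   "status": "Single",  "income": 39000},
    {"age": 52,   "status": "Single",  "income": 75000},
]

# Counting: how many values are missing for each variable?
missing_counts = {key: sum(r[key] is None for r in records) for key in records[0]}

# Sorting: review the range of a variable.
by_income = sorted(records, key=lambda r: r["income"])
income_range = (by_income[0]["income"], by_income[-1]["income"])

# Omission strategy: drop every observation that has a missing value.
complete = [r for r in records if all(v is not None for v in r.values())]

# Imputation strategy: average for numeric, predominant category for categorical.
avg_age = mean(r["age"] for r in records if r["age"] is not None)
top_status = mode(r["status"] for r in records if r["status"] is not None)
imputed = [
    {**r,
     "age": r["age"] if r["age"] is not None else avg_age,
     "status": r["status"] if r["status"] is not None else top_status}
    for r in records
]

# Subsetting: keep only the relevant observations and variables.
subset = [{"age": r["age"], "income": r["income"]}
          for r in imputed if r["income"] >= 50000]

print(missing_counts)
print(income_range)
print(len(complete), "complete observations")
print(avg_age, top_status)
print(subset)
```

Omission shrinks the data set (here from five observations to three), while imputation keeps every observation at the cost of introducing estimated values.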
Chapter 2: Tabular and Graphical Methods
Introductory Case: House Prices in Punta Gorda
• Objective: Use sample information to:
1. Make summary statements concerning the range of house prices.
2. Comment on where house prices tend to cluster.
3. Examine the relationship between house price and size.

Methods to Visualize a Categorical Variable


• Frequency Distribution: Group data into categories and record the number of observations
in each category. Calculate relative frequency and percentage frequency.
• Relative frequency is a category's frequency divided by the total number of observations;
multiply the relative frequency (proportion) by 100 to get the percentage frequency.
o Example: Myers-Briggs assessment personality types for 1,000 employees.
• Bar Chart: Depicts frequency or relative frequency for each category using horizontal or vertical
bars.
- Series of either horizontal or vertical bars.
- Bar lengths proportional to the values they are depicting.
Note: The vertical axis should not be given an excessively high upper limit, because this
compresses the bars and makes differences between categories harder to see.
• Pie Chart: Segmented circle portraying relative frequencies of categories.
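
A minimal sketch of a frequency distribution for a categorical variable, with a text-only bar chart. The ten responses are hypothetical category labels, not the Myers-Briggs data from the example above.

```python
from collections import Counter

# Hypothetical sample of 10 employees' personality categories (illustrative only).
responses = ["Sentinel", "Diplomat", "Analyst", "Sentinel", "Diplomat",
             "Explorer", "Sentinel", "Analyst", "Diplomat", "Sentinel"]

frequency = Counter(responses)  # number of observations in each category
n = len(responses)

relative = {cat: count / n for cat, count in frequency.items()}       # proportion
percent = {cat: 100 * count / n for cat, count in frequency.items()}  # percentage

# One row per category; the '#' bar length is proportional to the frequency.
for category, count in frequency.most_common():
    print(f"{category:<10}{count:>3}{relative[category]:>7.2f}"
          f"{percent[category]:>6.0f}%  {'#' * count}")
```

The relative frequencies always sum to 1 and the percentage frequencies to 100, which is a quick check that no observation was missed.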

Methods to Visualize the Relationship Between TWO Categorical Variables


• Contingency Table: Examines the relationship BETWEEN TWO categorical variables by
showing frequencies for each combination of values.
o Example: Myers-Briggs personality assessment and sex.
• Stacked Column Chart: Visualizes MORE THAN ONE categorical variable, allowing
comparison within each category.
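
A contingency table can be sketched by counting each combination of the two categorical variables. The eight (type, sex) pairs below are hypothetical and are not the data from the example above.

```python
from collections import Counter

# Hypothetical (personality type, sex) pairs, one per person.
observations = [
    ("Analyst", "Female"), ("Analyst", "Male"),
    ("Diplomat", "Female"), ("Diplomat", "Female"),
    ("Sentinel", "Male"), ("Sentinel", "Male"), ("Sentinel", "Female"),
    ("Explorer", "Male"),
]

# Frequency of each combination of values; absent combinations count as 0.
cell_counts = Counter(observations)

types = sorted({t for t, _ in observations})  # row labels
sexes = sorted({s for _, s in observations})  # column labels

# Print the table with a row total for each personality type.
print(f"{'':<10}" + "".join(f"{s:>8}" for s in sexes) + f"{'Total':>8}")
for t in types:
    row = [cell_counts[(t, s)] for s in sexes]
    print(f"{t:<10}" + "".join(f"{c:>8}" for c in row) + f"{sum(row):>8}")
```

The same cell counts are what a stacked column chart would draw: one column per row of the table, split into segments by the second variable.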

Methods to Visualize a Numeric Variable


• Categorical variable: the raw data can be categorized in a well-defined way.
• Numerical variable: each observation represents a meaningful amount or count.
• Frequency Distribution: Summarizes a numerical variable by constructing intervals or classes.
o Example: House prices in Punta Gorda.
Intervals:
- Mutually exclusive
- The total number of intervals usually ranges from 5 to 20
- Intervals are exhaustive.
- Easy to recognize and interpret.
3 other items to compute:
• Relative frequency: PROPORTION or fraction of observations that falls into
EACH INTERVAL.
• Cumulative frequency: NUMBER OF OBSERVATIONS that fall BELOW THE
UPPER LIMIT of each interval.
• Cumulative relative frequency: PROPORTION or fraction of observations that
falls BELOW THE UPPER LIMIT of each interval.
• Histogram: Graphically represents a frequency distribution using rectangles with heights
representing frequency or relative frequency.
- Symmetric: mirror image of itself (same both sides)
- Skewed: Positive (elongated right tail) or negative (elongated left tail).
• Polygon: Plots the frequency or relative frequency at the midpoint of each interval and
connects the points with straight lines to show the shape of a distribution.
• Ogive: Depicts cumulative frequency or cumulative relative frequency using points
connected by straight lines.
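
The quantities above can be computed directly. This sketch uses ten hypothetical house prices (in $1,000s), four intervals of width 100 starting at 300, and assumes each interval includes its upper limit but not its lower limit, so that the intervals are mutually exclusive and exhaustive. None of these choices come from the Punta Gorda data.

```python
from itertools import accumulate

# Hypothetical house prices in $1,000s (illustrative only).
prices = [320, 355, 390, 410, 425, 460, 480, 515, 560, 650]

# Interval limits: 300 up to 400, 400 up to 500, 500 up to 600, 600 up to 700.
lower, width, k = 300, 100, 4
edges = [lower + i * width for i in range(k + 1)]
intervals = list(zip(edges, edges[1:]))

n = len(prices)
frequency = [sum(lo < p <= hi for p in prices) for lo, hi in intervals]
relative = [f / n for f in frequency]
cumulative = list(accumulate(frequency))
cumulative_relative = [c / n for c in cumulative]

# Points a polygon and an ogive would connect with straight lines.
polygon_points = [((lo + hi) / 2, r) for (lo, hi), r in zip(intervals, relative)]
ogive_points = [(hi, cr) for (_, hi), cr in zip(intervals, cumulative_relative)]

# Table plus a text histogram: bar length represents the frequency.
for (lo, hi), f, r, c, cr in zip(intervals, frequency, relative,
                                 cumulative, cumulative_relative):
    print(f"{lo} < x <= {hi}  {f:>2}  {r:.2f}  {c:>2}  {cr:.2f}  {'#' * f}")
```

With these values the longest bars sit in the lower intervals and the bars thin out toward higher prices, which is the positively skewed shape described above.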

More Data Visualization Methods


• Scatterplot: Examines the relationship between two numerical variables, revealing linear,
nonlinear, or no relationship.
- Determine if two numerical variables are related in some systematic way.
- Each point represents a pair of observations of the two variables.
- Refer to one variable as x (x-axis) and the other as y (y-axis).
o Example: House prices and square footage in Punta Gorda.
• Scatterplot with a Categorical Variable: Modifies a basic scatterplot by incorporating a
categorical variable in addition to the two numerical variables.
- Encode the categorical variable with color.
- Giving each category a distinct hue makes it easy to see which category each point belongs to.
• Line Chart: Displays a numerical variable as a series of consecutive observations connected by a
line, useful for tracking changes or trends over time.
o Example: Monthly stock prices for Apple and Merck.

Stem-and-Leaf Diagram
• Stem-and-Leaf Diagram: Provides a visual method for displaying a numerical variable,
showing where observations are centered and dispersed.
- Each observation is split into a stem (the left-most digits) and a leaf (the last digit).
o Example: Age of the wealthiest people in the world.
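
A stem-and-leaf diagram is simple to build by hand or in code. The sketch below assumes non-negative whole numbers and uses hypothetical ages, not the actual ages from the example above; the function name stem_and_leaf is introduced here for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group each value's last digit (leaf) under its left-most digits (stem)."""
    stems = defaultdict(list)
    for value in sorted(values):  # sorting keeps the leaves in ascending order
        stems[value // 10].append(value % 10)
    return dict(stems)

# Hypothetical ages (illustrative only).
ages = [36, 48, 51, 55, 57, 62, 64, 64, 68, 71, 73, 79, 80, 85, 91]

diagram = stem_and_leaf(ages)
for stem in sorted(diagram):
    print(f"{stem} | {' '.join(str(leaf) for leaf in diagram[stem])}")
```

Each row reads like a sideways bar: the row for stem 6 holds the most leaves, so these hypothetical ages are centered in the 60s.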
