Chapter 1
Data – compilations of facts, figures, or other contents, both numerical and non-numerical.

Statistics – the language of data. It is the science that deals with the collection, preparation, analysis, interpretation, and presentation of data.

First: find the right data and prepare it for the analysis.
Second: use the appropriate statistical tool, which depends on the data.
Third: clearly communicate information with actionable business insights.

2 Branches of Statistics:

Descriptive Statistics
Refers to the summary of important aspects of a data set.
Includes collecting, organizing, and presenting the data in the form of charts and tables.
Often involves calculating numerical measures (typical value, variability).

Inferential Statistics

2 ways to collect sample data generally:
1. Cross-sectional Data - refers to data collected by recording a characteristic of many subjects at the same point in time, or without regard to differences in time. (e.g., 2018-2019 NBA Eastern Conference Standings)
2. Time Series Data - refers to data collected over several time periods focusing on certain groups of people, specific events, or objects. It can include hourly, daily, weekly, monthly, quarterly, or annual observations. (e.g., homeownership rates (%) in the US)
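
To make the difference between cross-sectional and time series data more concrete, here is a minimal Python sketch. It also computes the kind of numerical measures that descriptive statistics reports (a typical value and a measure of variability). The team names, win counts, and quarterly rates are made-up values for illustration only, and the helper name describe is hypothetical.

```python
from statistics import mean, median, stdev

# Cross-sectional data: one characteristic (wins) recorded for many
# subjects (teams) at the same point in time. Values are made up.
wins_by_team = {"Team A": 60, "Team B": 58, "Team C": 51, "Team D": 49, "Team E": 42}

# Time series data: one characteristic (homeownership rate, %) recorded
# over several time periods. Values are made up.
rate_by_quarter = [("Q1", 64.2), ("Q2", 64.3), ("Q3", 64.8), ("Q4", 65.1)]


def describe(values):
    """Return typical values (mean, median) and variability (sample std dev)."""
    values = list(values)
    return {"mean": mean(values), "median": median(values), "stdev": stdev(values)}


cross_summary = describe(wins_by_team.values())
series_summary = describe(rate for _, rate in rate_by_quarter)

# The order of observations matters only for the time series, for example
# when measuring the change from the first period to the last.
change = rate_by_quarter[-1][1] - rate_by_quarter[0][1]

print(cross_summary)
print(series_summary)
print(round(change, 1))
```

In the cross-sectional case the teams could be listed in any order without changing the summary. In the time series case the summary is also order-independent, but the change over time is not.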
Unstructured Data
Do not conform to a pre-defined, row-column format.
Textual and multimedia content.
Do not conform to database structures.
These data may have some implied structure, but are still considered unstructured.
Do not conform to a row-column model required in most database systems.
Example: social media data such as Twitter, YouTube, Facebook, and blogs.

Big Data
Businesses generate and gather more and more data at an increasing pace.
A massive volume of structured and unstructured data.
Extremely difficult to manage, process, and analyze using traditional data processing tools.
Presents great opportunities to gain knowledge and game-changing intelligence.
May not be used even when available, since it is inconvenient and computationally burdensome.

There is an abundance of data on the Internet. Many experts believe that 90% of the data in the world today was created in the last two years alone. It is easy to access and find data by using a search engine like Google.

Variables and Scales of Measurement

Variable – a characteristic of interest that differs in kind or degree among various observations (records).

2 Types of Variables:

Categorical Variables
Also called qualitative
Represent categories
Labels or names to identify distinguishing characteristics
Can be defined by two or more categories
Coded into numbers for data processing
Example: marital status, grade in a course

Numerical Variables
Music: nominal
Food quality: ordinal
Closing time: interval
Own money spent: ratio
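
The four scales of measurement named above differ in which calculations are meaningful. The following is a minimal Python sketch of that idea, and of how categories can be coded into numbers for data processing. All responses are made-up values for illustration, and the numeric codes assigned to the food quality categories are an assumption, not part of the notes.

```python
from statistics import mean, median, mode

# Hypothetical responses (made-up values, for illustration only).
music = ["rock", "pop", "rock", "jazz"]                       # nominal
food_quality = ["good", "poor", "excellent", "good", "fair"]  # ordinal
closing_hour = [21, 22, 23]                                   # interval (clock time)
money_spent = [10.0, 25.0, 40.0, 5.0]                         # ratio

# Nominal: categories have no order, so only counting makes sense.
music_mode = mode(music)

# Ordinal: code the categories as numbers that preserve their ranking
# (assumed codes), then take the middle-ranked response.
quality_codes = {"poor": 1, "fair": 2, "good": 3, "excellent": 4}
labels_by_code = {code: label for label, code in quality_codes.items()}
median_quality = labels_by_code[median(quality_codes[q] for q in food_quality)]

# Interval: differences are meaningful, but ratios are not
# (hour 22 is not "twice as late" as hour 11).
closing_gap = max(closing_hour) - min(closing_hour)

# Ratio: a true zero exists, so means and ratios are both meaningful.
mean_spent = mean(money_spent)
spend_ratio = max(money_spent) / min(money_spent)

print(music_mode, median_quality, closing_gap, mean_spent, spend_ratio)
```

The numeric codes for an ordinal variable carry only the ranking. The distance between "fair" and "good" is not assumed to equal the distance between "good" and "excellent", which is why the sketch reports a median rather than a mean for food quality.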