
Chapter 1

This document discusses data preparation and analysis. It defines key terms like data, statistics, descriptive statistics, inferential statistics, population and sample. It also covers types of data like structured and unstructured data, and characteristics of big data. Variables are introduced as characteristics that differ among observations. Scales of measurement are covered, including nominal, ordinal, interval and ratio scales. Common data preparation tasks like counting, sorting, and handling missing values are also summarized.

Uploaded by

Cruzzy Kait
Copyright
© All Rights Reserved

Chapter 1 – Data and Data Preparation

Data – compilations of facts, figures, or other contents, both numerical and non-numerical.

Statistics – the language of data. It is the science that deals with the collection, preparation, analysis, interpretation, and presentation of data.

• First: find the right data and prepare it for the analysis.
• Second: use the appropriate statistical tool, which depends on the data.
• Third: clearly communicate information with actionable business insights.

2 Branches of Statistics:

Descriptive Statistics

• Refers to the summary of important aspects of a data set.
• Includes collecting, organizing, and presenting the data in the form of charts and tables.
• Often involves calculating numerical measures (typical value, variability).

Inferential Statistics

• Refers to drawing conclusions about a larger set of data (population) based on a smaller set of data (sample).
• Population – consists of all items/members of interest.
• Sample – is a subset of the population.

We rely on sample data to make inferences about various characteristics of the population.

It is generally not feasible to obtain population data. Obtaining information on the entire population is expensive, and it is often impossible to examine every member of the population.

2 ways to collect sample data generally:

1. Cross-sectional Data – refers to data collected by recording a characteristic of many subjects at the same point in time, or without regard to differences in time. (e.g., 2018-2019 NBA Eastern Conference Standings)
2. Time Series Data – refers to data collected over several time periods focusing on certain groups of people, specific events, or objects. It can include hourly, daily, weekly, monthly, quarterly, or annual observations. (e.g., homeownership rates % in the US)

Types of Data:

Structured Data

• Reside in a pre-defined, row-column format.
• Entered, stored, queried, and analyzed with spreadsheet or database applications.
• Numerical information that is objective and not open to interpretation.
• Today, only about 20% of all data used in business decisions are structured.
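As a rough illustration of the row-column format and of the two ways of collecting sample data, the sketch below stores a few records as rows and then pulls out a cross-sectional view and a time series view. The column names (`region`, `year`, `rate`), the helper functions, and all values are hypothetical and made up for this example; they do not come from the chapter.

```python
# Hypothetical structured data: each dict is one row, each key is one column.
# All values are invented for illustration only.
rows = [
    {"region": "North", "year": 2018, "rate": 64.1},
    {"region": "South", "year": 2018, "rate": 66.0},
    {"region": "West", "year": 2018, "rate": 59.8},
    {"region": "North", "year": 2019, "rate": 64.5},
    {"region": "South", "year": 2019, "rate": 66.3},
    {"region": "West", "year": 2019, "rate": 60.2},
]


def cross_section(data, year):
    """Cross-sectional view: many subjects at the same point in time."""
    return [r for r in data if r["year"] == year]


def time_series(data, region):
    """Time series view: one subject over several periods, in time order."""
    return sorted(
        (r for r in data if r["region"] == region),
        key=lambda r: r["year"],
    )


snapshot_2018 = cross_section(rows, 2018)
north_over_time = time_series(rows, "North")

print([r["region"] for r in snapshot_2018])
print([(r["year"], r["rate"]) for r in north_over_time])
```

The same table serves both views: filtering on a single period gives cross-sectional data, while following one subject across periods gives time series data.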

Unstructured Data

• Do not conform to a pre-defined, row-column format.
• Textual and multimedia content.
• Do not conform to database structures.
• These data may have some implied structure but are still considered unstructured.
• Do not conform to a row-column model required in most database systems.
• Example: social media data such as Twitter, YouTube, Facebook, and blogs.

Big Data

• Businesses generate and gather more and more data at an increasing pace.
• A massive volume of structured and unstructured data.
• Extremely difficult to manage, process, and analyze using traditional data processing tools.
• Presents great opportunities to gain knowledge and game-changing intelligence.
• May go unused even when available because it is inconvenient and computationally burdensome.

3 Characteristics of Big Data:

• Volume: immense amount of data compiled from a single or multiple sources
• Velocity: generated at a rapid speed, so management is a critical issue
• Variety: all types, forms, granularity, structured or unstructured

Additional Characteristics:

• Veracity: credibility and quality of the data, reliability
• Value: methodological plan for formulating questions, curating the right data, and unlocking hidden potential

Having a plethora of data does not guarantee that useful insights or measurable improvements will be generated.

There is an abundance of data on the Internet. Many experts believe that 90% of the data in the world today was created in the last two years alone. It is easy to access and find data by using a search engine like Google.

Variables and Scales of Measurement

Variable – a characteristic of interest that differs in kind or degree among various observations (records).

2 Types of Variables:

Categorical Variables

• Also called qualitative
• Represent categories
• Labels or names to identify distinguishing characteristics
• Can be defined by two or more categories
• Coded into numbers for data processing
• Example: marital status, grade in a course

Numerical Variables

• Numeric Data
o Also called quantitative
o Represent meaningful numbers
o Either discrete or continuous
• A discrete variable assumes a countable number of values.
o The values need not be whole numbers
o Example: number of children in a family
• A continuous variable assumes an uncountable number of values within an interval.
o In practice, often measured in discrete values
o Example: weight of a newborn baby

4 Major Scales:

1. Nominal
- Least sophisticated
- Represent categories or groups
- Values differ by label or name
- Example: marital status
2. Ordinal
- Stronger level of measurement
- Categorize and rank data with respect to some characteristic
- Cannot interpret the difference between the ranked values; the numbers are arbitrary
- Example: reviews from 1 star (poor) to 5 stars (outstanding)

• Nominal and ordinal scales are used for categorical variables. Categorical variables are typically expressed in words but are coded into numbers for purposes of data processing.
- Typically count the number of observations that fall into each category (or find percentages)
- Unable to perform meaningful arithmetic operations

3. Interval
- Categorize and rank; differences are meaningful
- Zero value is arbitrary and does not reflect absence of characteristic
- Ratios are not meaningful
- Example: temperature
4. Ratio
- Strongest level of measurement
- A true zero point, reflects absence of characteristic
- Ratios are meaningful
- Example: profits

• Interval and ratio scales are used for numerical variables. Arithmetic operations are valid on interval- and ratio-scaled variables.

Example: The owner of a ski resort gathers data on tweens.

• Music: nominal
• Food quality: ordinal
• Closing time: interval
• Own money spent: ratio

We often spend a considerable amount of time inspecting and preparing the data for the subsequent analysis. Common ways include counting and sorting, handling missing values, and subsetting.

Counting and Sorting

• Among the very first tasks analysts perform
• Gain a better understanding and insights into the data
• Help to verify that the data set is complete or determine if there are missing values
• Sorting allows us to review the range of values for each variable
• Sort based on a single or multiple variables

2 common strategies for dealing with missing values:

• Omission strategy – recommends that observations with missing values be excluded from subsequent analysis.
• Imputation strategy – recommends that the missing values be replaced with some reasonable imputed values.
o Numeric variables: replace with the average
o Categorical variables: replace with the predominant category

Subsetting – the process of extracting a portion of the data set that is relevant for subsequent statistical analysis.

• Used when the objective of the analysis is to compare two subsets of the data.
• Eliminate observations that contain missing values, low-quality data, or outliers.
• Exclude variables that contain redundant information, or variables with excessive amounts of missing values.
• We can also subset data based on data ranges.
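The data preparation tasks covered in this chapter (counting, sorting, the omission and imputation strategies for missing values, and subsetting) can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the records, the column names (`id`, `status`, `spend`), and the use of `None` to mark a missing value are all assumptions made for the example, and the imputation rules follow the chapter's suggestions of the average for numeric variables and the predominant category for categorical ones.

```python
from collections import Counter
from statistics import mean

# Hypothetical records; None marks a missing value. All values are made up.
records = [
    {"id": 1, "status": "single", "spend": 40.0},
    {"id": 2, "status": "married", "spend": None},
    {"id": 3, "status": None, "spend": 25.0},
    {"id": 4, "status": "married", "spend": 55.0},
    {"id": 5, "status": "married", "spend": 30.0},
]

# Counting: observations per category, and missing values per variable.
status_counts = Counter(r["status"] for r in records if r["status"] is not None)
missing = {k: sum(r[k] is None for r in records) for k in ("status", "spend")}

# Omission strategy: keep only observations with no missing values.
complete = [r for r in records if None not in r.values()]

# Imputation strategy: average for the numeric variable,
# predominant category for the categorical variable.
avg_spend = mean(r["spend"] for r in records if r["spend"] is not None)
top_status = status_counts.most_common(1)[0][0]
imputed = [
    {
        **r,
        "spend": avg_spend if r["spend"] is None else r["spend"],
        "status": top_status if r["status"] is None else r["status"],
    }
    for r in records
]

# Sorting on multiple variables: by status, then by spend (largest first).
ordered = sorted(imputed, key=lambda r: (r["status"], -r["spend"]))

# Subsetting: by category, and by a range of values.
married = [r for r in imputed if r["status"] == "married"]
mid_spend = [r for r in imputed if 30.0 <= r["spend"] <= 50.0]

print(status_counts, missing)
print([r["id"] for r in ordered])
```

Omission shrinks the data set (here from five observations to three), while imputation keeps every observation at the cost of introducing estimated values, so the choice depends on how much data can be spared.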
