Chapter 1

This document discusses data preparation and analysis. It defines key terms like data, statistics, descriptive statistics, inferential statistics, population and sample. It also covers types of data like structured and unstructured data, and characteristics of big data. Variables are introduced as characteristics that differ among observations. Scales of measurement are covered, including nominal, ordinal, interval and ratio scales. Common data preparation tasks like counting, sorting, and handling missing values are also summarized.

Uploaded by

Cruzzy Kait

Chapter 1 – Data and Data Preparation

Data – compilations of facts, figures, or other contents, both numerical and non-numerical.

Statistics – is the language of data. It is the science that deals with the collection, preparation, analysis, interpretation, and presentation of data.

• First: find the right data and prepare it for the analysis.
• Second: use the appropriate statistical tool, which depends on the data.
• Third: clearly communicate information with actionable business insights.

2 general ways to collect sample data:

1. Cross-sectional Data - refers to data collected by recording a characteristic of many subjects at the same point in time, or without regard to differences in time. (e.g., 2018-2019 NBA Eastern Conference Standings)
2. Time Series Data - refers to data collected over several time periods focusing on certain groups of people, specific events, or objects. It can include hourly, daily, weekly, monthly, quarterly, or annual observations. (e.g., homeownership rates (%) in the US)

2 Branches of Statistics:

Descriptive Statistics

• Refers to the summary of important aspects of a data set.
• Includes collecting, organizing, and presenting the data in the form of charts and tables.
• Often calculate numerical measures (typical value, variability).
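The numerical measures mentioned above can be sketched in a few lines of Python. This is only an illustration: the sales figures and the `describe` helper are made up for the example, and the standard library's `statistics` module is just one of several ways to compute a typical value (mean, median) and variability (sample standard deviation).

```python
import statistics

# Hypothetical sample: monthly sales (in thousands) for eight stores
sales = [12, 15, 11, 18, 15, 14, 20, 15]


def describe(values):
    """Return a few common descriptive measures for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),      # typical value
        "median": statistics.median(values),  # typical value, robust to outliers
        "min": min(values),
        "max": max(values),
        "stdev": statistics.stdev(values),    # variability (sample standard deviation)
    }


summary = describe(sales)
for name, value in summary.items():
    print(f"{name}: {value}")
```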

Inferential Statistics

• Refers to drawing conclusions about a larger set of data (population) based on a smaller set of data (sample).
• Population – consists of all items/members of interest.
• Sample – is a subset of the population.

We rely on sample data to make inferences about various characteristics of the population.

It is generally not feasible to obtain population data. Obtaining information on the entire population is expensive. It is impossible to examine every member of the population.

Types of Data:

Structured Data

• Reside in a pre-defined, row-column format.
• Spreadsheet or database applications.
• Enter, store, query, and analyze.
• Numerical information that is objective and not open to interpretation.
• Today, only about 20% of all data used in business decisions are structured.
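As a rough sketch of what "row-column format" means in practice, the snippet below reads a tiny, made-up table and then queries and analyzes it. The column names (`customer`, `region`, `purchases`) and the values are hypothetical, and Python's `csv` module stands in here for a spreadsheet or database application.

```python
import csv
import io

# Hypothetical structured data: every record has the same named columns
raw = """customer,region,purchases
Ana,North,3
Ben,South,5
Cai,North,2
"""

# Store: each row becomes a dict keyed by column name
rows = list(csv.DictReader(io.StringIO(raw)))

# Query: customers in the North region
north = [r["customer"] for r in rows if r["region"] == "North"]

# Analyze: total purchases across all records
total_purchases = sum(int(r["purchases"]) for r in rows)

print(north)
print(total_purchases)
```

Because every record shares the same columns, the query and the total each take a single expression.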

Unstructured Data

• Do not conform to a pre-defined, row-column format.
• Textual and multimedia content.
• Do not conform to database structures.
• These data may have some implied structure but are still considered unstructured.
• Do not conform to a row-column model required in most database systems.
• Example: social media data such as Twitter, YouTube, Facebook, and blogs.

Big Data

• Businesses generate and gather more and more data at an increasing pace.
• A massive volume of structured and unstructured data.
• Extremely difficult to manage, process, and analyze using traditional data processing tools.
• Presents great opportunities to gain knowledge and game-changing intelligence.
• May not be used when available since it is inconvenient and computationally burdensome.

3 Characteristics of Big Data:

• Volume: immense amount of data compiled from a single or multiple sources
• Velocity: generated at a rapid speed; management is a critical issue
• Variety: all types, forms, granularity, structured or unstructured

Additional Characteristics:

• Veracity: credibility and quality of the data, reliability
• Value: methodological plan for formulating questions, curating the right data, and unlocking hidden potential

Having a plethora of data does not guarantee that useful insights or measurable improvements will be generated.

There is an abundance of data on the Internet. Many experts believe that 90% of the data in the world today was created in the last two years alone. It is easy to access and find data by using a search engine like Google.

Variables and Scales of Measurement

Variable – is a characteristic of interest that differs in kind or degree among various observations (records).

2 Types of Variables:

Categorical Variables

• Also called qualitative
• Represent categories
• Labels or names to identify distinguishing characteristics
• Can be defined by two or more categories
• Coded into numbers for data processing
• Example: marital status, grade in a course

Numerical Variables

• Numeric Data
  o Also called quantitative
  o Represent meaningful numbers
  o Either discrete or continuous
• A discrete variable assumes a countable number of values.
  o The values need not be whole numbers
  o Example: number of children in a family
• A continuous variable assumes an uncountable number of values within an interval.
  o In practice, often measured in discrete values
  o Example: weight of a newborn baby

4 Major Scales:

1. Nominal
- Least sophisticated
- Represent categories or groups
- Values differ by label or name
- Example: marital status
2. Ordinal
- Stronger level of measurement
- Categorize and rank data with respect to some characteristic
- Cannot interpret the difference between the ranked values; the numbers are arbitrary
- Example: reviews from 1 star (poor) to 5 stars (outstanding)

• Nominal and ordinal scales are used for categorical variables. Categorical variables are typically expressed in words but are coded into numbers for purposes of data processing.
- Typically count the number of observations that fall into each category (or find percentages)
- Unable to perform meaningful arithmetic operations

3. Interval
- Categorize and rank, differences are meaningful
- Zero value is arbitrary and does not reflect absence of characteristic
- Ratios are not meaningful
- Example: temperature (in Celsius or Fahrenheit)
4. Ratio
- Strongest level of measurement
- A true zero point, reflects absence of characteristic
- Ratios are meaningful
- Example: profits

• Interval and ratio scales are used for numerical variables. Arithmetic operations are valid on interval- and ratio-scaled variables.

Example: The owner of a ski resort gathers data on tweens.

• Music: nominal
• Food quality: ordinal
• Closing time: interval
• Own money spent: ratio

We often spend a considerable amount of time inspecting and preparing the data for the subsequent analysis. Common ways to do this: counting & sorting, handling missing values, and subsetting.

Counting and Sorting

• Among the very first tasks analysts perform
• Gain a better understanding and insights into the data
• Help to verify that the data set is complete or determine if there are missing values
• Sorting allows us to review the range of values for each variable
• Sort based on a single or multiple variables

2 common strategies for dealing with missing values:

• Omission strategy – recommends that observations with missing values be excluded from subsequent analysis.
• Imputation strategy – recommends that the missing values be replaced with some reasonable imputed values.
  o Numeric variables: replace with the average
  o Categorical variables: replace with the predominant category

Subsetting – is the process of extracting a portion of the data set that is relevant for subsequent statistical analysis.

• Used when the objective of the analysis is to compare two subsets of the data.
• Eliminate observations that contain missing values, low-quality data, or outliers.
• Exclude variables that contain redundant information, or variables with excessive amounts of missing values.
• We can also subset data based on data ranges.
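The data preparation tasks above (counting, sorting, the omission and imputation strategies, and subsetting) can be sketched on a tiny data set. Everything in this example is an assumption made for illustration: the records, the field names (`id`, `age`, `status`), the use of `None` to mark a missing value, and the age cutoff of 30.

```python
from collections import Counter
from statistics import mean

# Hypothetical records; None marks a missing value
records = [
    {"id": 1, "age": 34, "status": "Married"},
    {"id": 2, "age": None, "status": "Single"},
    {"id": 3, "age": 28, "status": None},
    {"id": 4, "age": 45, "status": "Married"},
    {"id": 5, "age": 23, "status": "Single"},
    {"id": 6, "age": 50, "status": "Married"},
]

# Counting: observations per category, and how many ages are missing
status_counts = Counter(r["status"] for r in records if r["status"] is not None)
missing_age = sum(1 for r in records if r["age"] is None)

# Sorting: order by age (missing ages last) to review the range of values
by_age = sorted(records, key=lambda r: (r["age"] is None, r["age"] or 0))

# Omission strategy: keep only observations with no missing values
complete = [r for r in records if None not in r.values()]

# Imputation strategy: average for the numeric variable,
# predominant category for the categorical variable
avg_age = mean(r["age"] for r in records if r["age"] is not None)
top_status = status_counts.most_common(1)[0][0]
imputed = [
    {
        **r,
        "age": avg_age if r["age"] is None else r["age"],
        "status": top_status if r["status"] is None else r["status"],
    }
    for r in records
]

# Subsetting: extract the portion relevant to a question (here, age 30 or older)
age_30_plus = [r for r in imputed if r["age"] >= 30]

print(status_counts, missing_age)
print([r["id"] for r in by_age])
print(len(complete), avg_age, top_status)
print([r["id"] for r in age_30_plus])
```

Omission shrinks the data set (here from six records to four), while imputation keeps every record at the cost of filling in values that were never observed.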
