Unit 3
Types of Data
Contents
Terminologies in Data Analytics: Observation, Data Sampling, Dataset and Prediction
Types of Data: Structured, Unstructured and Semi-structured
Qualitative and Quantitative Data
Data Levels of Measurement: Nominal, Ordinal, Interval and Ratio
Data Warehousing
Terminologies in Data Analytics
Observation
◦ An individual data point or record collected during a study or analysis.
◦ It represents a specific instance or measurement within a dataset.
Data Sampling
◦ The process of selecting a subset of data from a larger population or dataset for analysis.
◦ Sampling is often done to reduce the computational complexity or cost of analyzing the entire dataset while
still maintaining statistical representativeness.
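As a minimal sketch of data sampling, the Python snippet below draws a simple random sample without replacement and compares the sample mean with the population mean. The population of 1,000 numbered observations, the sample size of 50, and the helper name `simple_random_sample` are all made up for illustration.

```python
import random

# Hypothetical population: 1,000 numbered observations (illustrative only).
population = list(range(1, 1001))


def simple_random_sample(data, k, seed=None):
    """Return k items drawn from data without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(data, k)


sample = simple_random_sample(population, k=50, seed=42)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"population size: {len(population)}, mean: {population_mean:.1f}")
print(f"sample size:     {len(sample)}, mean: {sample_mean:.1f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean. That gap is the price paid for analyzing 50 observations instead of 1,000.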
Dataset
◦ A collection of related data points or observations.
◦ It refers to the entire set of data that is used for analysis or modeling, including all variables and records.
Prediction
◦ The process of using historical data and statistical or machine learning techniques to make an estimate or
forecast about future events or outcomes.
◦ Predictions are based on patterns, trends, and relationships discovered in the data.
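The terms above can be tied together with a small Python sketch: each (month, sales) pair is an observation, the list of pairs is the dataset, and a straight line fitted to it gives a prediction for the next month. The sales figures are invented for illustration, and ordinary least squares is just one simple technique chosen here; the helper name `fit_line` is hypothetical.

```python
# Hypothetical dataset: each tuple is one observation (month number, units sold).
dataset = [(1, 10), (2, 12), (3, 15), (4, 16), (5, 19)]


def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


months = [month for month, _ in dataset]
sales = [units for _, units in dataset]

slope, intercept = fit_line(months, sales)

# Prediction: extend the historical trend to month 6.
predicted_month_6 = intercept + slope * 6
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"predicted sales for month 6: {predicted_month_6:.1f}")
```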
Types of Data
Structured Data
• Structured data is factual, to the point, and highly organized; it follows a fixed format.
• It is often quantitative in nature, i.e., it contains measurable values such as numbers, dates, and times.
• Structured data generally exists in tables, such as Excel files and Google Sheets spreadsheets.
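A short Python sketch of why tabular data counts as structured: every row has the same named columns, so values can be read by column name and aggregated directly. The order table below is invented for illustration.

```python
import csv
import io

# Hypothetical order table; in practice this would come from a CSV or spreadsheet export.
csv_text = """order_id,order_date,quantity,unit_price
1001,2024-01-05,3,19.99
1002,2024-01-06,1,5.50
1003,2024-01-06,2,12.00
"""

# DictReader uses the header row to name every column.
rows = list(csv.DictReader(io.StringIO(csv_text)))

total_quantity = sum(int(row["quantity"]) for row in rows)
revenue = sum(int(row["quantity"]) * float(row["unit_price"]) for row in rows)

print(f"columns: {list(rows[0].keys())}")
print(f"rows: {len(rows)}, total quantity: {total_quantity}, revenue: {revenue:.2f}")
```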
Unstructured Data
• Unstructured data has no predefined format or schema; it includes free-form files such as log files, audio files, and image files.
• Examples of human-generated unstructured data: text files, email, social media, media files, mobile data, and business applications.
• Examples of machine-generated unstructured data: satellite images, scientific data, sensor data, digital surveillance, and many more.
Semi-structured Data
Semi-structured data is a type of data that is not purely structured, but also not completely
unstructured.
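JSON is a common example of a semi-structured format: records carry named fields, but different records may have different fields and nested values. The Python sketch below illustrates this with invented contact records.

```python
import json

# Hypothetical contact records: every record has a name,
# but the remaining fields vary from record to record.
json_text = """
[
  {"name": "Asha", "email": "asha@example.com", "tags": ["student"]},
  {"name": "Ravi", "phone": "555-0100"},
  {"name": "Mei", "email": "mei@example.com", "address": {"city": "Pune"}}
]
"""

records = json.loads(json_text)

# Collect every field name that appears in at least one record.
all_keys = sorted({key for record in records for key in record})

# A missing field has to be handled explicitly; .get() returns None when it is absent.
emails = [record.get("email") for record in records]

print(f"fields seen: {all_keys}")
print(f"emails: {emails}")
```

Unlike a table, there is no guarantee that a given field exists in every record, which is why the lookup uses `.get()` rather than direct indexing.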
Qualitative and Quantitative Data
Quantitative data is numbers-based, countable, or measurable.
Quantitative data tells us how many, how much, or how often in calculations.
Quantitative data is fixed and universal.
Quantitative data is analyzed using statistical analysis.
Data Levels of Measurement
Nominal
Nominal level
You can categorize your data by labelling them in mutually exclusive groups, but there is no order between the categories.
Examples of nominal scales
•City of birth
•Gender
•Ethnicity
•Car brands
•Marital status
Ordinal
Ordinal level
You can categorize and rank your data in an order, but you cannot say anything about the intervals between the rankings.
Although you can rank the top 5 Olympic medalists, this scale does not tell you how close or far apart they are in number of wins.
Examples of ordinal scales
•Top 5 Olympic medalists
•Language ability (e.g., beginner, intermediate, fluent)
•Likert-type questions (e.g., very dissatisfied to very satisfied)
Interval
Interval level
You can categorize, rank, and infer equal intervals between neighboring data points, but there is no true zero point.
The difference between any two adjacent temperatures is the same: one degree. But zero degrees is defined differently depending on the scale – it doesn’t mean an absolute absence of temperature.
The same is true for test scores and personality inventories. A zero on a test is arbitrary; it does not mean that the test-taker has an absolute lack of the trait being measured.
Examples of interval scales
•Test scores (e.g., IQ or exams)
•Personality inventories
•Temperature in Fahrenheit or Celsius
Ratio
Ratio level
You can categorize, rank, and infer equal intervals between neighboring data points, and there is a true zero point.
A true zero means there is an absence of the variable of interest. In ratio scales, zero does mean an absolute lack of the variable.
For example, in the Kelvin temperature scale, there are no negative degrees of temperature – zero means an absolute lack of thermal energy.
Examples of ratio scales
•Height
•Age
•Weight
•Temperature in Kelvin
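The difference between the interval and ratio levels can be checked with temperature. In the Python sketch below, two made-up readings of 10 °C and 20 °C are converted to Kelvin: the difference is the same on both scales, but only the Kelvin ratio is meaningful, because only Kelvin has a true zero.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin."""
    return celsius + 273.15


# Hypothetical readings chosen for illustration.
cold_c, warm_c = 10.0, 20.0
cold_k, warm_k = celsius_to_kelvin(cold_c), celsius_to_kelvin(warm_c)

# Differences agree on both scales (interval property).
diff_c = warm_c - cold_c
diff_k = warm_k - cold_k

# Ratios disagree: the Celsius zero is arbitrary, so "twice as hot" is not valid there.
ratio_c = warm_c / cold_c
ratio_k = warm_k / cold_k

print(f"difference: {diff_c:.2f} °C vs {diff_k:.2f} K")
print(f"ratio:      {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (Kelvin)")
```

The Celsius ratio suggests the warmer reading is "twice" the colder one, while the Kelvin ratio shows it is only a few percent higher.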
Why are levels of measurement important?
The level at which you measure a variable determines how you can analyze your data.
The different levels limit which descriptive statistics you can use to get an overall summary of
your data, and which type of inferential statistics you can perform on your data to support or
refute your hypothesis.
In many cases, your variables can be measured at different levels, so you have to choose the
level of measurement you will use before data collection begins.
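As a rough sketch of how the level of measurement limits the summary statistics available, the Python snippet below pairs each level with a statistic commonly considered appropriate for it: the mode for nominal data, the median for ordinal data, and the mean for ratio data. All values are invented for illustration.

```python
import statistics

# Nominal: categories with no order, so only counting (the mode) makes sense.
car_brands = ["Brand A", "Brand B", "Brand A", "Brand C", "Brand A"]

# Ordinal: Likert codes 1 (very dissatisfied) to 5 (very satisfied); order matters,
# but the gaps between codes are unknown, so the median is used rather than the mean.
satisfaction = [1, 2, 2, 4, 5]

# Ratio: heights in centimetres have equal intervals and a true zero, so the mean is valid.
heights_cm = [150.0, 160.0, 170.0, 180.0]

most_common_brand = statistics.mode(car_brands)
median_satisfaction = statistics.median(satisfaction)
mean_height = statistics.mean(heights_cm)

print(f"nominal -> mode:   {most_common_brand}")
print(f"ordinal -> median: {median_satisfaction}")
print(f"ratio   -> mean:   {mean_height}")
```

For an ordinal list with an even number of values, `statistics.median_low` avoids averaging two codes into a value that is not on the scale.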
Example
You can measure the variable of income at an ordinal or ratio level.
Ordinal level: You create brackets of income ranges: $0–$19,999, $20,000–$39,999, and $40,000–$59,999. You ask participants to select
the bracket that represents their annual income.
The brackets are coded with numbers from 1–3.
Ratio level: You collect data on the exact annual incomes of your participants.
Participant    Income (ordinal level)    Income (ratio level)
A              Bracket 1                 $12,550
B              Bracket 2                 $39,700
C              Bracket 3                 $40,300
At a ratio level, you can see that the difference between A and B’s incomes is far greater than the difference between B and C’s incomes.
At an ordinal level, however, you only know the income bracket for each participant, not their exact income. Since you cannot say exactly
how much each income differs from the others in your data set, you can only order the income levels and group the participants.
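The income example can be reproduced in a few lines of Python. The incomes and bracket ranges are taken from the example; the names `BRACKETS` and `income_bracket` are hypothetical helpers introduced only for this sketch.

```python
# Exact annual incomes (ratio level), as given in the example table.
incomes = {"A": 12_550, "B": 39_700, "C": 40_300}

# Bracket ranges from the example, coded 1 to 3 (ordinal level).
BRACKETS = [(0, 19_999, 1), (20_000, 39_999, 2), (40_000, 59_999, 3)]


def income_bracket(income):
    """Return the ordinal bracket code for an exact income."""
    for low, high, code in BRACKETS:
        if low <= income <= high:
            return code
    raise ValueError(f"income {income} is outside the defined brackets")


ordinal = {person: income_bracket(value) for person, value in incomes.items()}

# Ratio level: differences between exact incomes are meaningful.
gap_ab = incomes["B"] - incomes["A"]
gap_bc = incomes["C"] - incomes["B"]

# Ordinal level: the codes only give an order; both gaps look identical.
code_gap_ab = ordinal["B"] - ordinal["A"]
code_gap_bc = ordinal["C"] - ordinal["B"]

print(f"brackets: {ordinal}")
print(f"ratio gaps:   A->B = ${gap_ab:,}, B->C = ${gap_bc:,}")
print(f"ordinal gaps: A->B = {code_gap_ab}, B->C = {code_gap_bc}")
```

Converting exact incomes to brackets is a one-way step: the bracket codes can always be derived from the exact values, but the exact values cannot be recovered from the codes.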