Data Science Class2
Types of data
• Digital data is classified into the following categories:
• Structured data
• Semi-structured data
• Unstructured data
Structured Data
• It conforms to a dedicated data model.
• Data Mining
• Natural Language Processing (NLP)
• Text Analytics
• Noisy Text Analytics
• Data Mining:
• 1970s and before was the era of mainframes. The data was
essentially primitive and structured.
• 2000 and beyond: The World Wide Web (WWW) and the
Internet of Things (IoT) have led to an onslaught of structured,
unstructured, and multimedia data.
• Characteristics of Big Data
• Volume
• Variety
• Velocity
• Veracity
• Value
• Volume:
• Value: Mine the data, i.e., turn raw data into useful data.
Value represents the benefits of data to your business, such as
insights and results that were not possible earlier.
STATISTICS
• Descriptive Statistics
– Frequencies & percentages
– Means & standard deviations
• Inferential Statistics
– Correlation
– T-tests
– Chi-square
– Logistic Regression
Descriptive Statistics
Ways to display frequencies and percentages:
• Pie chart
• Table (frequency distributions): good if more than 20 observations
• Bar chart: good if more than 20 observations
Distributions
The distribution of scores or values can also be
displayed using box-and-whisker plots and histograms.
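As a rough illustration of what these two displays summarize, the sketch below computes the five-number summary that a box-and-whisker plot draws and the bin counts behind a histogram. The scores, the bin width of 3, and the choice of the "inclusive" quartile method are all assumptions made for this example; other quartile conventions give slightly different values.

```python
import statistics
from collections import Counter

# Hypothetical scores (made-up values for illustration only).
scores = [2, 4, 4, 5, 6, 7, 7, 8, 9, 12]

# Five-number summary: the values a box-and-whisker plot is built from.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
five_number = (min(scores), q1, median, q3, max(scores))
print("min, Q1, median, Q3, max:", five_number)

# Histogram: count how many scores fall into each bin of width 3.
BIN_WIDTH = 3  # assumed bin width
bins = Counter((s // BIN_WIDTH) * BIN_WIDTH for s in scores)
for start in sorted(bins):
    label = f"{start}-{start + BIN_WIDTH - 1}"
    print(f"{label:>6} | {'#' * bins[start]}")
```

The text histogram printed at the end is only a stand-in for a plotted chart; a plotting library would normally be used for the real figure.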
Continuous to Categorical
It is possible to take continuous data (such as hemoglobin levels)
and turn it into categorical data by grouping values together. Then
we can calculate frequencies and percentages for each group.
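The grouping step can be sketched in a few lines of Python. The hemoglobin values and the cut-offs (12.0 and 16.0 g/dL) are hypothetical, chosen only to make the example run; they are not clinical thresholds from these notes.

```python
from collections import Counter

# Hypothetical hemoglobin levels in g/dL (made-up values for illustration only).
hemoglobin = [9.8, 11.2, 12.5, 13.1, 13.9, 14.4, 15.0, 15.7, 16.3, 17.1]

def hemoglobin_group(value):
    """Map a continuous value to a category (cut-offs are assumptions)."""
    if value < 12.0:
        return "low"
    if value < 16.0:
        return "mid"
    return "high"

# Frequencies: how many observations fall into each group.
counts = Counter(hemoglobin_group(v) for v in hemoglobin)

# Percentages: each group's share of all observations.
total = len(hemoglobin)
percentages = {group: 100 * n / total for group, n in counts.items()}

for group in ("low", "mid", "high"):
    print(f"{group}: n={counts[group]}, {percentages[group]:.1f}%")
```

Because the raw values are kept, the cut-offs can be changed later and the frequencies recomputed.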
Continuous to Categorical
[Figure: Distribution of Glasgow Coma Scale Scores]
Even though this is continuous data, it is being treated as “nominal”
as it is broken down into groups or categories.
Tip: It is usually better to collect continuous data and then break it
down into categories for data analysis as opposed to collecting data
that fits into preconceived categories.
Ordinal Level Data
Ordinal data is a categorical, statistical data type where the
variables have natural, ordered categories and the distances
between the categories are not known.
[Figure: bar chart of responses by category (Strongly Agree, Agree,
Disagree, Strongly Disagree); vertical axis from 0 to 60]
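A minimal sketch of how ordinal responses might be tabulated: because the categories have a natural order, the frequency table can be listed in that order and a cumulative percentage added. The responses below are hypothetical and are not the data behind the chart.

```python
from collections import Counter

# Ordered categories, from most to least agreement.
LEVELS = ["Strongly Agree", "Agree", "Disagree", "Strongly Disagree"]

# Hypothetical survey responses (made-up values for illustration only).
responses = (["Strongly Agree"] * 3 + ["Agree"] * 5
             + ["Disagree"] * 1 + ["Strongly Disagree"] * 1)

counts = Counter(responses)
total = len(responses)

# Build rows of (level, frequency, percent, cumulative percent) in category order.
table = []
cumulative = 0
for level in LEVELS:
    n = counts[level]
    cumulative += n
    table.append((level, n, 100 * n / total, 100 * cumulative / total))

for level, n, pct, cum_pct in table:
    print(f"{level:<18} n={n:<3} {pct:5.1f}%  cumulative {cum_pct:5.1f}%")
```

The cumulative column only makes sense because the categories are ordered; it would be meaningless for purely nominal data.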
Interval/Ratio Data
• Ratio data has a defined zero point.
• Interval data lacks the absolute zero point, which makes direct
comparisons of magnitude impossible (e.g. A is twice as large as
B).
We can compute frequencies and percentages for interval and ratio
level data as well
– Examples: Age, Temperature, Height, Weight, Many Clinical Serum Levels
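The difference between interval and ratio data can be illustrated with temperature and weight. In the sketch below, Celsius is taken as an interval scale (its zero is arbitrary) and Kelvin and kilograms as ratio scales; the specific readings are hypothetical.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvin (a scale with an absolute zero)."""
    return celsius + 273.15

# Interval scale: the "ratio" depends on where zero happens to sit.
a_c, b_c = 20.0, 10.0                       # hypothetical readings in Celsius
ratio_celsius = a_c / b_c                   # looks like "twice as warm"
ratio_kelvin = celsius_to_kelvin(a_c) / celsius_to_kelvin(b_c)

# Ratio scale: the ratio survives a change of units.
KG_TO_LB = 2.20462                          # approximate conversion factor
a_kg, b_kg = 80.0, 40.0                     # hypothetical weights in kilograms
ratio_kg = a_kg / b_kg
ratio_lb = (a_kg * KG_TO_LB) / (b_kg * KG_TO_LB)

print(f"Celsius ratio: {ratio_celsius:.3f}, Kelvin ratio: {ratio_kelvin:.3f}")
print(f"Kilogram ratio: {ratio_kg:.3f}, pound ratio: {ratio_lb:.3f}")
```

The Celsius ratio changes when the same temperatures are expressed in kelvin, so "A is twice as large as B" is not a meaningful statement on the Celsius scale, while the weight ratio is the same in either unit.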
[Figure: Distribution of Injury Severity Score in a population of
patients]
Interval/Ratio Distributions
The distribution of interval/ratio data often
forms a “bell-shaped” curve.
– Many phenomena in life are normally
distributed (age, height, weight, IQ).
Interval & Ratio Data
Measures of central tendency and measures of dispersion are often computed with
interval/ratio data
In research, means are usually presented along with standard deviations or standard
errors.
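As a small worked example, the mean, sample standard deviation, and standard error of the mean can be computed with Python's standard library. The ages are hypothetical, and the standard error is taken here as the sample standard deviation divided by the square root of the sample size.

```python
import math
import statistics

# Hypothetical ages in years (made-up values for illustration only).
ages = [23, 27, 31, 35, 39, 43, 47, 51]

mean = statistics.mean(ages)        # measure of central tendency
sd = statistics.stdev(ages)         # sample standard deviation (n - 1 denominator)
se = sd / math.sqrt(len(ages))      # standard error of the mean

print(f"mean = {mean:.2f}, SD = {sd:.2f}, SE = {se:.2f}")
```

The standard deviation describes the spread of the individual observations, while the standard error describes the uncertainty in the estimated mean, so a report should say which of the two accompanies the mean.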