EDA Unit-1
WHAT IS DATA?
TYPES OF DATA:
Structured Data:
Unstructured Data:
Semi-structured Data:
Structured Data-
Data that is to the point, factual, and highly organized is referred to as
structured data. It is quantitative in nature, i.e., it contains measurable
values such as numbers, dates, and times.
Structured data exists in a predefined format, which makes it easy to search
and analyze. A relational database consisting of tables with rows and columns
is one of the best examples of structured data. Structured data generally
lives in tables such as Excel files and Google Sheets spreadsheets. Because
it is highly organized, it is easy for machines to process.
Unstructured Data-
Unstructured files such as log files, audio files, and image files are
included in unstructured data. Some organizations have a lot of data
available but do not know how to derive value from it because the data is
raw. Unstructured data is data that lacks any predefined model or format. It
requires a lot of storage space, and it is hard to keep secure. It cannot be
represented in a data model or schema, which is why managing, analyzing, or
searching unstructured data is hard. It resides in many different formats,
such as text, images, audio, and video files. It is qualitative in nature and
is sometimes stored in a non-relational (NoSQL) database.
Semi-structured Data-
Semi-structured data refers to data that is not captured or formatted in
conventional ways. Semi-structured data does not follow the format of a tabular data
model or relational databases because it does not have a fixed schema. However, the
data is not completely raw or unstructured, and does contain some structural elements
such as tags and organizational metadata that make it easier to analyze. An advantage
of semi-structured data is that it is more flexible and simpler to scale than
structured data.
HTML code, graphs and tables, e-mails, and XML documents are examples of
semi-structured data.
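The three kinds of data described above can be sketched with a small Python example. This is only an illustration using the standard library: the CSV text, the JSON records, and the free-text note are made-up sample values, not data from these notes.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields (hypothetical data).
csv_text = "id,name,age\n1,Asha,21\n2,Ravi,23\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
ages = [int(row["age"]) for row in rows]  # easy to query by column name

# Semi-structured: keys act like tags, but records need not share one schema.
json_text = (
    '[{"id": 1, "name": "Asha", "tags": ["student"]},'
    ' {"id": 2, "email": "ravi@example.com"}]'
)
records = json.loads(json_text)
keys_per_record = [sorted(record.keys()) for record in records]

# Unstructured: free text with no predefined fields; only crude operations
# (such as counting words) are possible without further processing.
note = "Asha called on Monday to ask about the exam schedule."
word_count = len(note.split())

print(ages)
print(keys_per_record)
print(word_count)
```

Note how the two JSON records carry different keys, which a fixed table schema would not allow.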
Structured Data
When we talk about structured data, we are often talking about tabular
data (rectangular data), i.e., rows and columns from a database. These tables
mainly contain two types of structured data:
1. Numerical Data: Data that is expressed on a numeric scale.
Continuous: Data that can take any value in an interval. For example,
the speed of a car, heart rate, etc.
Discrete: Data that can take only integer values, such as counts. For
example, the number of heads in 20 flips of a coin.
2. Categorical Data: Data that can take only a specific set of values
representing possible categories. These are also called enums, enumerated,
factors, or nominal.
Binary: A special case of categorical data where the features are
dichotomous, i.e., can take only one of two values, such as 0/1 or True/False.
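A rough way to tell these types apart in code is sketched below. The function `classify_column` and the sample columns are hypothetical names and values invented for illustration, and the rules are simple heuristics, not a standard algorithm.

```python
def classify_column(values):
    """Guess the data type of a column with simple heuristics (illustrative)."""
    distinct = set(values)
    if distinct and distinct <= {0, 1}:
        # Also covers True/False, because True == 1 and False == 0 in Python.
        return "binary"
    if all(isinstance(v, str) for v in values):
        return "categorical"
    if all(isinstance(v, int) for v in values):
        return "discrete"
    if all(isinstance(v, (int, float)) for v in values):
        return "continuous"
    return "unknown"


# Hypothetical columns, one per data type described above.
columns = {
    "speed_kmh": [62.5, 80.1, 45.0],
    "heads_in_20_flips": [9, 11, 10],
    "fuel_type": ["petrol", "diesel", "petrol"],
    "is_automatic": [0, 1, 1],
}
column_types = {name: classify_column(vals) for name, vals in columns.items()}
print(column_types)
```

A real dataset needs more care: a two-valued text column such as yes/no is also binary, but this sketch would label it categorical.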
The next step is to dive deeper into structured data and how we can
use third-party packages and libraries to manipulate such structures.
We have mainly two types of structures or data storage models:
1. Rectangular
2. Non-Rectangular
Rectangular Data: Almost all analyses in data science are done with a
rectangular two-dimensional data object such as a dataframe, spreadsheet, CSV
file, or database table. It mainly consists of rows that represent records
(observations) and columns that represent features (variables). A dataframe
is a specific data structure with a tabular format that offers efficient
operations for manipulating the data. Dataframes are the most commonly used
data structures, so the terms above (records or observations for rows,
features or variables for columns) are worth knowing.
Non-Rectangular Data: Graph data structures are one example of
non-rectangular data. They are used to represent relationships: physical,
social, and abstract. For example, Facebook or Twitter represents connections
between people on the network as a graph of social relationships. Graph
structures are useful for certain types of problems, such as network
optimization and recommender systems. Each of these data types has its own
set of methods in data science. The focus of this series is on rectangular
data, which forms the foundational building block of predictive modeling.
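The difference between the two storage models can be sketched in plain Python. This is a minimal illustration with made-up users; a library dataframe (for example in pandas) would play the role of `table` in practice.

```python
# Rectangular data: each dict is one record (row); each key is a feature
# (column). All values here are hypothetical.
table = [
    {"user": "asha", "age": 21, "city": "Pune"},
    {"user": "ravi", "age": 23, "city": "Delhi"},
    {"user": "meena", "age": 22, "city": "Pune"},
]
columns = list(table[0].keys())
shape = (len(table), len(columns))       # (rows, columns)
ages = [row["age"] for row in table]     # selecting one column

# Graph data: relationships between the same users as an adjacency list.
friends = {
    "asha": {"ravi", "meena"},
    "ravi": {"asha"},
    "meena": {"asha"},
}
# Degree = number of direct connections of each user.
degree = {user: len(neighbours) for user, neighbours in friends.items()}

print(shape)
print(ages)
print(degree)
```

The table answers questions about attributes of each record, while the graph answers questions about connections between records.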
Classification Of Data:
1. Based on Observation-
Cross-Sectional Data: Cross-sectional data is collected in a single time
period and is characterized by individual units: people, companies,
countries, etc. Some examples include:
With cross-sectional data, the ordering of the data does not matter. In other
words, we can sort the data in ascending, descending, or even random order,
and this will not affect our modeling results.
Time Series Data: Data collected at a number of specific points in time is
called time series data. Examples include stock prices, interest rates,
exchange rates, product prices, GDP, etc. Time series data can be observed at
many different frequencies (hourly, daily, weekly, monthly, quarterly,
annually, etc.). Unlike cross-sectional data, the ordering of the data is
important in time series data. Each point represents the value at a specific
point in time. As such, time series data are typically presented in
chronological order. Changing the order of the data discards its time
dimension.
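The point about ordering can be checked with a small sketch. The heights and prices below are invented values, and `period_changes` is a hypothetical helper written for this illustration.

```python
from statistics import mean

# Cross-sectional data: five people measured once (hypothetical heights).
heights_cm = [158, 171, 165, 180, 174]
reordered_heights = sorted(heights_cm)
# Reordering the records does not change a summary such as the mean.
same_mean = mean(heights_cm) == mean(reordered_heights)


def period_changes(series):
    """Change from each time point to the next one."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Time series data: one price observed on five consecutive days (hypothetical).
prices = [100, 102, 101, 105, 110]
changes = period_changes(prices)
# Reordering the observations destroys the day-to-day pattern.
reordered_prices = sorted(prices, reverse=True)
changes_after_reordering = period_changes(reordered_prices)

print(same_mean)
print(changes)
print(changes_after_reordering)
```

The mean of the heights survives reordering, but the day-to-day price changes turn from a mostly rising pattern into a purely falling one.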
Panel Data: Panel data combines cross-sectional and time series data:
the same individuals (persons, firms, cities, etc.) are observed at several
points in time (days, years, before and after treatment, etc.). Panel data
allows you to control for variables you cannot observe or measure like:
2. Based on Measurement-
Nominal: If the values in a variable do not follow any particular order, we
call it nominal. Taking a mean or median is meaningless here. Note that
sorting the values of a nominal variable does not make any difference.
Ordinal: If the values in a variable follow a particular order, we call it
ordinal. This means a lower value in the feature carries less weight than a
higher value. Hence, sorting the values of an ordinal variable makes sense.
For example, survey ratings such as low, medium, and high follow a natural
order.
Interval:
In interval data, the differences between values are meaningful, but 0 does
not have a true meaning. A classic example of interval data is temperature
measured in Celsius or Fahrenheit: 0 does not mean no temperature. Instead,
it is simply another valid value.
Ratio: If 0 has a true meaning, we call it the ratio data type. For
example, in the case of length or income, a value of 0 means no length or no
income. These variables are of ratio type.
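The four measurement levels can be illustrated by the summaries that make sense for each. The sketch below uses made-up values (blood groups, shirt sizes, temperatures, incomes) chosen only for illustration.

```python
from statistics import mode

# Nominal: no order, so only counting (for example the mode) is meaningful.
blood_groups = ["A", "O", "B", "O", "AB"]
most_common_group = mode(blood_groups)

# Ordinal: an order exists, so a median position is meaningful.
size_order = ["small", "medium", "large"]           # assumed ordering
shirt_sizes = ["medium", "small", "large", "medium", "small"]
ranks = sorted(size_order.index(size) for size in shirt_sizes)
median_size = size_order[ranks[len(ranks) // 2]]    # middle of 5 values

# Interval: differences are meaningful, ratios are not, because 0 degrees
# Celsius is an arbitrary point and not "no temperature".
cold_c, warm_c = 10.0, 20.0
difference_c = warm_c - cold_c                      # meaningful: 10 degrees
naive_ratio_c = warm_c / cold_c                     # 2.0, but misleading
# On the Kelvin scale (true zero) the ratio is far from 2.
ratio_k = (warm_c + 273.15) / (cold_c + 273.15)

# Ratio: 0 means "none", so ratios are meaningful.
income_ratio = 40000 / 20000                        # twice the income

print(most_common_group, median_size)
print(difference_c, naive_ratio_c, round(ratio_k, 3))
print(income_ratio)
```

The temperature case shows why interval data has no true zero: 20 degrees Celsius is not "twice as hot" as 10 degrees Celsius.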
3. Based on Availability:
A sample is the specific group that you will collect data from. The size of
the sample is always less than the total size of the population.
Example: Collecting data from a sample. You want to study political attitudes
in young people. Your population is the 300,000 undergraduate students in the
Netherlands. Because it is not practical to collect data from all of them,
you use a sample of 300 undergraduate volunteers from three Dutch
universities. This is the group who will complete your online survey.
You can use estimation or hypothesis testing to estimate how likely it is
that a sample statistic differs from the population parameter.
Sampling error
A sampling error is the difference between a population parameter and a sample
statistic.
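Sampling error can be seen directly in a small simulation. This is a sketch under stated assumptions: the population and sample sizes are borrowed from the example above, but the "attitude scores" are synthetic numbers drawn from a normal distribution, and the sample is a simple random sample rather than a group of volunteers.

```python
import random
from statistics import fmean

rng = random.Random(42)          # fixed seed so the run is repeatable

POPULATION_SIZE = 300_000        # size taken from the example above
SAMPLE_SIZE = 300

# Synthetic scores (assumption: normally distributed, mean 5, spread 2).
population = [rng.gauss(5.0, 2.0) for _ in range(POPULATION_SIZE)]
sample = rng.sample(population, SAMPLE_SIZE)   # simple random sample

population_mean = fmean(population)   # population parameter
sample_mean = fmean(sample)           # sample statistic
sampling_error = sample_mean - population_mean

print(f"population mean: {population_mean:.3f}")
print(f"sample mean:     {sample_mean:.3f}")
print(f"sampling error:  {sampling_error:.3f}")
```

Drawing a different sample (for instance by changing the seed) gives a different sampling error, which is why inferential methods are needed to judge how far a sample statistic is likely to be from the population parameter.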
Statistics & Its Types:
Statistics, in its simplest sense, means numerical data. It is also the field
of mathematics that deals with the collection, tabulation, and interpretation
of numerical data. It is a form of mathematical analysis that uses
quantitative models to describe experimental data or real-life studies. As an
area of applied mathematics, it is concerned with data collection, analysis,
interpretation, and presentation. Statistics deals with how data can be used
to solve complex problems.
1. Descriptive Statistics :
Descriptive statistics describes a population or sample through numerical
calculations, graphs, or tables. It provides a numerical or graphical summary
of the data and is used purely for summarizing.
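A numerical summary of this kind can be produced with Python's standard `statistics` module. The scores below are a made-up dataset used only to show the calculations.

```python
from statistics import mean, median, mode, pstdev

# Hypothetical dataset: eight test scores.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "count": len(scores),
    "mean": mean(scores),        # arithmetic average
    "median": median(scores),    # middle value of the sorted data
    "mode": mode(scores),        # most frequent value
    "min": min(scores),
    "max": max(scores),
    "pstdev": pstdev(scores),    # population standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Here `pstdev` treats the eight scores as the whole population; `statistics.stdev` would be the choice if they were a sample from a larger group.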
2. Inferential Statistics :
Inferential statistics makes inferences and predictions about a population
based on a sample of data taken from that population. It generalizes from the
sample to a larger dataset and applies probability to draw a conclusion. It
is used to analyze and interpret results, to explain the meaning of
descriptive statistics, and to draw conclusions. Inferential statistics is
closely associated with hypothesis testing, where the goal is to decide
whether the sample provides enough evidence to reject the null hypothesis.
Hypothesis testing is an inferential procedure that uses sample data to
evaluate the credibility of a hypothesis about a population.
Inferential statistics is generally used to determine how strong a
relationship is within a sample. In practice, however, it is very difficult
to obtain a complete population list and draw a truly random sample from it.
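One simple hypothesis test can be sketched as follows. This is a large-sample z-test for a mean, chosen here only as an illustration; the function name `z_test_mean`, the synthetic sample, the null value of 50, and the 0.05 threshold are all assumptions, and for small samples a t-test would be more appropriate.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def z_test_mean(sample, mu0):
    """Two-sided large-sample z-test of H0: population mean == mu0."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    z = (mean(sample) - mu0) / standard_error
    # Probability of a result at least this extreme if H0 were true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Synthetic sample of 100 values (assumption: true mean 52, spread 10).
rng = random.Random(7)
sample = [rng.gauss(52.0, 10.0) for _ in range(100)]

z, p_value = z_test_mean(sample, mu0=50.0)
reject_null = p_value < 0.05     # conventional 5% significance level

print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_null}")
```

A small p-value means the sample would be unusual if the null hypothesis were true, which counts as evidence against it; a large p-value means the sample is compatible with the null hypothesis, not that the null hypothesis is proven.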
Application of Statistics:
Statistics is indispensable for decision-making in various sectors and verticals.
It is applied in marketing, e-commerce, banking, finance, human resource,
production, and information technology.