UNIT-I
Structured data and unstructured data are two primary types of data, distinguished by their
format, organization, and how easily they can be used by algorithms, databases, and other
analytical tools.
Structured Data:
Format: Structured data is organized according to a predefined schema, typically in rows and columns, which makes it easy to store in relational databases and spreadsheets.
Examples:
o Spreadsheets: Excel files, where data is stored in cells organized by rows and
columns.
o Sensor Data: Data from IoT devices, where readings are taken at regular
intervals and stored in a structured format.
Characteristics:
o Easily searchable: Due to its organization, querying structured data with tools
like SQL is straightforward.
o Predefined format: The schema is defined before data is input, making it more
rigid but easier to manage.
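For instance, the "easily searchable" point above can be illustrated with a short, purely illustrative Python sketch: an in-memory SQLite table (the table name, columns, and values are invented for this example) is defined with a fixed schema and then queried directly with SQL.

import sqlite3

# Structured data: the schema (columns and types) is defined before any data is added.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("South", 95.5), ("North", 80.0)],
)

# Because rows and columns are known in advance, querying is straightforward.
cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
print(cur.fetchall())   # e.g. [('North', 200.0), ('South', 95.5)]
conn.close()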
Unstructured Data:
Format: Unstructured data does not follow a specific format or structure, making it
more flexible but also more challenging to manage and analyse. It often includes
text-heavy data but can also contain multimedia elements.
Examples:
o Text Documents: Emails, Word documents, and PDFs, where data is presented
in natural language.
o Multimedia: Images, audio files, and videos, which do not have a predefined
data model.
o Social Media Content: Tweets, Facebook posts, and other user-generated
content.
o Web Pages: HTML content that includes text, images, and embedded media.
Characteristics:
o Difficult to search and query: Without a predefined schema, specialized tools and techniques are needed to search and analyse the content.
o No predefined format: Data can arrive in many forms, which makes it flexible but harder to manage.
Key Differences:
1. Organization:
o Structured data is organized into predefined rows and columns.
o Unstructured data lacks this organization and can come in various forms.
2. Ease of Analysis:
o Structured data can be analysed directly with standard tools such as SQL.
o Unstructured data requires more advanced tools and algorithms for analysis.
3. Flexibility:
o Structured data is rigid, because its schema is fixed before data is entered.
o Unstructured data is flexible and can accommodate many kinds of content.
Use Cases:
Structured Data: Ideal for use in scenarios where the data format is consistent and
predefined, such as in banking, inventory management, and CRM systems.
Unstructured Data: Suitable for data-rich scenarios such as social media monitoring, customer review analysis, and multimedia content analysis.
Understanding the differences between structured and unstructured data is crucial for
determining the appropriate tools and methodologies for data processing, storage, and analysis.
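To see why unstructured data needs extra processing before it can be analysed, consider the small Python sketch below (the example sentences are invented): raw text has no rows or columns to query, so some structure, here simple word counts, must first be extracted.

import re
from collections import Counter

# Two short, invented customer comments: free text with no predefined schema.
documents = [
    "Thanks for the quick delivery, the product works great!",
    "The delivery was late and the product arrived damaged.",
]

# Extract lowercase word tokens from each document, then count them.
tokens = []
for doc in documents:
    tokens.extend(re.findall(r"[a-z']+", doc.lower()))

word_counts = Counter(tokens)
print(word_counts.most_common(5))   # the most frequent words across the documents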
Quantitative and qualitative data are two primary types of data used in research, each serving
different purposes and providing different insights.
Quantitative Data:
Definition: Quantitative data refers to information that can be measured, counted, or
expressed numerically. It quantifies variables and typically involves statistical analysis
to identify patterns, relationships, or trends.
Examples:
o Heights, weights, test scores, sales figures, and counts of website visits.
Characteristics:
o Numerical: Expressed as numbers that can be counted or measured.
o Objective: Less open to interpretation and well suited to statistical analysis.
Use Cases:
o Measuring performance, testing hypotheses, and identifying trends, for example in sales forecasting or experiment analysis.
Qualitative Data:
Definition: Qualitative data refers to non-numerical information that describes
qualities, characteristics, or concepts. It is often used to explore complex phenomena
and gain deeper insights into people's experiences, behaviours, or perceptions.
Examples:
o Interview transcripts, open-ended survey responses, observation notes, images, and audio recordings.
Characteristics:
o Descriptive: Captures the richness and complexity of the subject matter, often
in the form of words, images, or objects.
o Contextual: Provides context and depth to understand the "why" and "how"
behind certain phenomena.
Use Cases:
o Exploring user experiences, understanding motivations behind behaviours, and generating hypotheses for later quantitative study.
Key Differences:
1. Nature of Data:
o Quantitative Data: Numerical, objective, and measurable.
o Qualitative Data: Descriptive, subjective, and non-numerical.
2. Analysis Methods:
o Quantitative Data: Statistical methods are used to analyse and interpret the data.
o Qualitative Data: Thematic or content analysis is used to identify patterns and meanings.
3. Purpose:
o Quantitative Data: To quantify variables, test hypotheses, and identify trends.
o Qualitative Data: To explore ideas and gain deeper understanding of experiences and perceptions.
4. Outcome:
o Quantitative Data: Results are often presented as graphs, tables, or charts.
o Qualitative Data: Results are often presented as narratives, themes, or case descriptions.
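As a minimal illustration of the difference in analysis methods, the Python sketch below (all values invented) summarises quantitative scores with simple statistics, while qualitative comments are assumed to have already been coded into themes that are then tallied.

from statistics import mean, stdev
from collections import Counter

# Quantitative: numeric satisfaction scores, summarised statistically (the "what").
scores = [4, 5, 3, 4, 2, 5, 4]
print("mean:", round(mean(scores), 2), "std dev:", round(stdev(scores), 2))

# Qualitative: open-ended comments coded into themes, then described (the "why").
coded_themes = ["price", "support", "price", "usability", "support", "price"]
print("theme frequencies:", Counter(coded_themes))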
Integration:
In many research projects, mixed methods are used, combining both quantitative and
qualitative data to provide a more comprehensive understanding of the research problem.
Quantitative data provides the "what," while qualitative data offers insights into the "why" and
"how."
1. Nominal Level:
Definition: Nominal data represents categories or groups without any inherent order or
ranking. It is the most basic level of measurement, where numbers or labels are used
solely for classification.
Characteristics:
o Categories: Data is grouped into mutually exclusive categories.
Examples:
o Eye colour, gender, and blood type.
Analysis:
o Frequency counts, mode, and chi-square tests are typical analyses.
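These analyses can be sketched in a few lines of Python; the eye-colour observations are invented and SciPy is assumed to be available.

from collections import Counter
from statistics import mode
from scipy.stats import chisquare

eye_colours = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "green"]

counts = Counter(eye_colours)        # frequency count for each category
print("counts:", counts)
print("mode:", mode(eye_colours))    # the most frequent category

# Chi-square goodness-of-fit test: are the categories observed equally often?
stat, p_value = chisquare(list(counts.values()))
print("chi-square:", round(stat, 2), "p-value:", round(p_value, 3))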
2. Ordinal Level:
Definition: Ordinal data represents categories with a meaningful order or ranking, but
the intervals between categories are not necessarily equal or known.
Characteristics:
o Ordered categories: Data can be ranked, but the distance between ranks is unknown or unequal.
Examples:
o Survey ratings (e.g., poor, fair, good, excellent) and education levels (e.g., primary, secondary, tertiary).
Analysis:
o Median, percentiles, and rank-based (non-parametric) tests are typical analyses.
3. Interval Level:
Definition: Interval data is numeric data with equal intervals between values, but it
lacks a true zero point. This means that while you can measure differences between
values, you cannot make statements about how many times greater one value is
compared to another.
Characteristics:
o Equal Intervals: The difference between any two consecutive values is the
same.
o No True Zero: Zero is arbitrary and does not indicate the absence of the
quantity being measured.
o Addition/Subtraction: You can add and subtract values, but multiplication and
division are not meaningful in terms of ratios.
Examples:
o Temperature in Celsius or Fahrenheit: The difference between 10°C and 20°C equals the difference between 20°C and 30°C, but 20°C is not "twice as hot" as 10°C, because 0°C does not mean the absence of temperature.
o Calendar Years: The difference between the years 2000 and 2010 is the same as between 2010 and 2020, but year 0 is an arbitrary reference point rather than a true beginning.
Usage in Data Science:
o Differences, means, and standard deviations are meaningful for interval features, but ratio-based statements such as "twice as large" are not.
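A small worked example (with illustrative temperatures) makes the "no true zero" point concrete: differences in Celsius are meaningful, but a naive ratio is not, as becomes clear when the same temperatures are expressed on an absolute scale.

# Interval data: Celsius has an arbitrary zero, so ratios are not meaningful.
t1_c, t2_c = 10.0, 20.0

print("difference:", t2_c - t1_c, "degrees C")   # meaningful: a 10-degree gap
print("naive ratio:", t2_c / t1_c)               # 2.0, but NOT "twice as hot"

# On an absolute (ratio) scale such as Kelvin, the true ratio is much smaller.
t1_k, t2_k = t1_c + 273.15, t2_c + 273.15
print("ratio on an absolute scale:", round(t2_k / t1_k, 3))   # about 1.035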
4. Ratio Level:
Definition: Ratio data is the highest level of measurement, featuring all the
characteristics of interval data, but with a meaningful zero point. This allows for the
full range of mathematical operations, including meaningful ratios.
Characteristics:
o True Zero: Zero indicates the absence of the quantity being measured, making
statements about "twice as much" meaningful.
o Equal Intervals: Like interval data, the difference between values is consistent.
Examples:
o Height, weight, income, age, and distance.
Usage in Data Science:
o Machine Learning: Ratio data is often the most informative type of data for training models, as it allows for a wide variety of transformations and analyses.
Summary:
Nominal: Categorization without order (e.g., eye colour, gender).
Ordinal: Ordered categories without equal intervals (e.g., survey ratings, education
levels).
Interval: Equal intervals, no true zero (e.g., temperature in Celsius, IQ scores).
Ratio: Equal intervals with a true zero, allowing for all mathematical operations (e.g.,
height, weight, income).
Understanding these levels is crucial for selecting the right methods in data preprocessing,
statistical analysis, and machine learning.
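As a rough illustration of how these levels influence preprocessing, the Python sketch below (the column names, category order, and values are invented, and pandas is assumed to be installed) one-hot encodes a nominal column, rank-encodes an ordinal column, and scales the interval and ratio columns.

import pandas as pd

df = pd.DataFrame({
    "eye_colour": ["brown", "blue", "green"],            # nominal
    "education":  ["primary", "secondary", "tertiary"],  # ordinal
    "temp_c":     [18.5, 21.0, 19.5],                    # interval
    "income":     [30000, 52000, 41000],                 # ratio
})

# Nominal: one-hot encode, because the categories have no inherent order.
nominal = pd.get_dummies(df["eye_colour"], prefix="eye")

# Ordinal: map categories to ranks that preserve their order.
ordinal = df["education"].map({"primary": 1, "secondary": 2, "tertiary": 3})

# Interval and ratio: numeric columns can be centred and scaled directly.
numeric = df[["temp_c", "income"]]
numeric = (numeric - numeric.mean()) / numeric.std()

print(pd.concat([nominal, ordinal, numeric], axis=1))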
The data science process typically involves five key steps that guide the transformation of raw
data into actionable insights. These steps form a structured approach to problem-solving in
data-driven projects.
Objective: Clearly define the problem you are trying to solve and understand the
business or research objectives.
Activities:
o Meet with stakeholders to understand requirements, frame the questions to be answered, and agree on scope and success criteria.
Output: A well-defined problem statement and a plan that outlines the approach, scope,
and expected outcomes.
Objective: Identify and gather the data needed to address the problem from relevant internal or external sources.
Output: A dataset or collection of datasets that are relevant, accessible, and sufficient for analysis.
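As a minimal sketch of this step, the example below uses a small built-in scikit-learn dataset as a stand-in for data that would normally be retrieved from files, databases, or APIs, and then checks what was collected.

from sklearn.datasets import load_iris

# A ready-made dataset stands in here for data gathered from real sources.
data = load_iris(as_frame=True)
df = data.frame

print(df.shape)      # is the dataset large enough for the analysis?
print(df.columns)    # which variables are available?
print(df.head())     # a first look at the raw records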
Objective: Prepare the data for analysis by cleaning and transforming it to a usable
format.
Activities:
o Handle missing values, remove duplicates, and correct inconsistent or erroneous entries.
o Transform variables (for example, converting data types or scaling values) into a format suitable for analysis.
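A minimal sketch of these cleaning activities, using an invented toy dataset and assuming pandas is installed:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, np.nan, 31, 31, 120],                     # a missing value and an outlier
    "city": ["Delhi", "delhi", "Mumbai", "Mumbai", "Pune"],
})

df["city"] = df["city"].str.title()                # fix inconsistent casing
df = df.drop_duplicates()                          # remove exact duplicate rows
df["age"] = df["age"].fillna(df["age"].median())   # impute the missing value
df = df[df["age"].between(0, 100)]                 # drop an implausible age

print(df)   # a cleaned dataset ready for exploration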
Objective: Explore the data to uncover patterns, trends, and relationships that will
inform the modelling phase.
Activities:
o Compute summary statistics and create visualizations to examine distributions, relationships, and outliers.
o Generate hypotheses and insights that can guide the model-building process.
Output: A set of findings, visualizations, and potential features that provide insights
into the data and inform the next steps.
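A minimal exploration sketch (with invented data, and pandas assumed to be installed) might compute the kind of summary statistics and correlations that feed into these findings:

import pandas as pd

df = pd.DataFrame({
    "hours_studied": [2, 5, 1, 7, 4, 6, 3],
    "exam_score":    [55, 78, 48, 92, 70, 85, 62],
})

print(df.describe())   # summary statistics for each variable
print(df.corr())       # how strongly the variables move together

# A pattern such as "more study hours, higher scores" becomes a hypothesis
# and a candidate feature for the modelling phase.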
Objective: Build and evaluate models that solve the problem defined in the first step.
Activities:
o Select appropriate algorithms and models based on the type of data and problem.
o Train models using the prepared data, and tune parameters to optimize performance.
o Evaluate model performance on held-out data using metrics appropriate to the problem.
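A minimal modelling sketch, assuming scikit-learn is installed and using a built-in dataset as a stand-in for the prepared project data:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out part of the data so the model is evaluated on records it has not seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)   # an algorithm chosen to suit the problem
model.fit(X_train, y_train)                 # train on the prepared data

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))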
Objective: Put the validated model into use so that its results can support decision-making.
Activities:
o Deploy the model using tools and platforms suited for production.
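One common deployment pattern, sketched below with joblib (the file name and model are purely illustrative), is to persist the trained model so that a separate production service can later load it and serve predictions:

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a model (standing in for the model selected in the previous step).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

joblib.dump(model, "model.joblib")    # save the trained model to disk

loaded = joblib.load("model.joblib")  # later, inside the production service
print(loaded.predict(X[:3]))          # serve predictions for incoming records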
These steps may be iterative, especially the data exploration, modelling, and evaluation phases,
where you may need to revisit earlier steps based on new findings or feedback. This structured
approach ensures a thorough and methodical process, leading to reliable and actionable
outcomes in data science projects.