
4.0 Introduction to Data

The document provides an overview of data, defining it as raw facts and figures that can be analyzed for various purposes. It categorizes data based on nature (qualitative vs. quantitative), measurement scale, structure, source, and usage in machine learning, as well as discussing big data and its characteristics. Additionally, it touches on granularity, sources of data, streaming data, and spatial data.
Copyright
© All Rights Reserved

Introduction to Data

Overview of nature and types of data


What is data
• Data in general refers to raw facts, figures, or statistics that are collected,
stored, and analyzed for various purposes. It is essentially any piece of information
that can be processed or used to draw conclusions, make decisions, or develop
insights. Data can take many forms, ranging from simple numbers and text to
more complex types such as images, audio, or video.

• `In common usage, data (/ˈdeɪtə/, also US: /ˈdætə/) is a collection of discrete or
continuous values that convey information, describing the quantity, quality, fact,
statistics, other basic units of meaning, or simply sequences of symbols that may
be further interpreted formally. A datum is an individual value in a collection of
data. Data are usually organized into structures such as tables that provide
additional context and meaning, and may themselves be used as data in larger
structures` -- Wikipedia
Types of Data in general
• Data can be classified in several ways depending on its nature, structure, and
how it’s processed. Let’s break it down into several key categories:

• Based on Nature (Qualitative vs. Quantitative Data)
• Based on Measurement Scale (Nominal, Ordinal, Interval, Ratio)
• Based on Data Structure (Structured vs. Unstructured Data)
• Based on Data Source (Primary vs. Secondary Data)
• Based on Usage in Machine Learning (Labeled vs. Unlabeled Data)
• Big Data
Based on Nature (Qualitative vs. Quantitative Data)
Qualitative Data (Categorical Data): Qualitative data describes qualities or
characteristics. It is descriptive, usually textual rather than numerical, and does
not involve measurements. Examples:
• Color of a car (red, blue, green)
• Type of product (clothing, electronics)
• Survey responses (satisfied, neutral, unsatisfied)
Types of Qualitative Data:
• Nominal Data: Data that represents categories with no inherent order (e.g.,
gender, nationality).
• Ordinal Data: Data that represents categories with a specific order or ranking, but
the difference between ranks is not measurable (e.g., rating scales like "good,
better, best").
Based on Nature (Qualitative vs. Quantitative Data)
Quantitative Data (Numerical Data): Quantitative data represents numerical
values obtained by counting or measuring quantities, and can be used in
mathematical operations. Examples:
• Height of a person (in centimeters)
• Sales amount ($500, $1000)
• Age (25 years, 30 years)
Types of Quantitative Data:
• Discrete Data: Data that consists of distinct, countable values (e.g., the number
of children in a family, the number of cars sold).
• Continuous Data: Data that can take any value within a range and is typically
measured (e.g., height, temperature, time).
Based on Measurement Scale
• Nominal Data: Data categorized without a specific order or ranking. It is purely
for labeling purposes. Examples: eye color (blue, brown, green), types of animals
(cat, dog, bird).
• Ordinal Data: Data that has a meaningful order or ranking but no measurable
distance between ranks. Examples: customer satisfaction (poor, average, excellent),
education level (high school, bachelor’s, master’s).
• Interval Data: Data with meaningful intervals between values, but no true zero
point. The difference between values is consistent, but ratios don’t make sense.
Examples: temperature in Celsius or Fahrenheit, IQ scores.
• Ratio Data: Similar to interval data but with a true zero point, allowing for
meaningful ratios and comparisons. Examples: weight, height, age, temperature in
Kelvin (where 0 K is absolute zero, the point of minimum thermal energy).
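
The difference between these scales can be illustrated with a minimal Python sketch. It is only an illustration: the names `SATISFACTION_ORDER`, `rank`, and `celsius_to_kelvin` are hypothetical helpers introduced here, and the sample values are made up. It shows that ordinal categories can be sorted by rank, and that a ratio computed on an interval scale (Celsius) differs from the same ratio on a ratio scale (Kelvin).

```python
# Ordinal scale: the categories have an order, but the gaps between
# ranks are not measurable. The ranking below is an illustrative assumption.
SATISFACTION_ORDER = ["poor", "average", "excellent"]


def rank(level: str) -> int:
    """Return the position of a satisfaction level in the assumed ranking."""
    return SATISFACTION_ORDER.index(level)


responses = ["excellent", "poor", "average"]
sorted_responses = sorted(responses, key=rank)


def celsius_to_kelvin(celsius: float) -> float:
    """Convert an interval-scale temperature to the ratio-scale Kelvin."""
    return celsius + 273.15


# Interval scale: 20 °C is not "twice as hot" as 10 °C, even though 20 / 10 == 2.
ratio_celsius = 20.0 / 10.0
# Ratio scale: Kelvin has a true zero, so this ratio is physically meaningful.
ratio_kelvin = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)

print(sorted_responses)
print(ratio_celsius, round(ratio_kelvin, 3))
```

The Celsius ratio comes out as 2.0 while the Kelvin ratio is only about 1.035, which is why ratios are said not to make sense for interval data.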
Based on Data Structure
a. Structured Data:
• Definition: Data that is highly organized and follows a specific format or schema. It is easy to store, search, and analyze using traditional database systems (like
relational databases).
• Examples:
• Data in tables or databases (e.g., an Excel spreadsheet with rows and columns)
• Customer records (name, age, address, order history)
b. Unstructured Data:
• Definition: Data that has no predefined structure and doesn’t fit neatly into traditional database formats.
• Examples:
• Text documents, emails, social media posts
• Images, videos, audio files
• Webpages, PDF files
c. Semi-Structured Data:
• Definition: Data that doesn’t have a rigid structure but contains tags or markers to separate elements, making it easier to organize than unstructured data.
• Examples:
• JSON and XML files
• Emails with metadata (like sender, recipient, timestamp)
• Log files from servers
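
As a rough sketch of how the three structures differ in practice, the Python snippet below reads one small sample of each using only the standard library. All sample values (names, ages, the email address, the review sentence) are invented for illustration.

```python
import csv
import io
import json

# Structured: every row follows the same schema (name, age, city).
structured_csv = "name,age,city\nAsha,31,Pune\nBen,27,Leeds\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))
ages = [int(row["age"]) for row in rows]

# Semi-structured: keys act as markers, but records need not share fields.
semi_structured_json = (
    '[{"name": "Asha", "tags": ["vip"]},'
    ' {"name": "Ben", "email": "ben@example.com"}]'
)
records = json.loads(semi_structured_json)
emails = [record.get("email") for record in records]  # None when the field is absent

# Unstructured: free text has no fields at all; only generic
# operations such as counting words apply without further processing.
unstructured_text = "Asha wrote that the delivery was late but the product was great."
word_count = len(unstructured_text.split())

print(ages, emails, word_count)
```

The structured rows can be queried by column name directly, the semi-structured records need a check for missing fields, and the free text offers no fields to query at all.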
Based on Data Source
• Primary Data: Data that is collected first-hand for a specific purpose
directly from the source.
• Survey responses
• Interviews
• Experiment results
• Secondary Data: Data that has been collected previously by someone
else for a different purpose.
• Data from research papers
• Government reports
• Existing databases (e.g., census data, financial reports)
Based on Usage in Machine Learning
a. Labeled Data:
• Definition: Data that has input features as well as associated labels or targets. It is primarily used in supervised learning tasks.
• Examples:
• A dataset of images where each image is labeled with the object it contains (e.g., cat, dog).
• A medical dataset with patient attributes and the diagnosis (cancer, no cancer).
b. Unlabeled Data:
• Definition: Data that has input features but no labels or targets. It is used in unsupervised learning tasks where the goal is to find patterns or structure in
the data.
• Examples:
• A dataset of customer purchasing behaviors with no predefined groups or outcomes.
• A set of sensor data without any predefined classifications.
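
The distinction can be sketched in a few lines of Python. The feature values and the 1-nearest-neighbour rule below are assumptions chosen for illustration, not a method taken from the slides; `squared_distance` and `predict_nearest` are hypothetical helper names.

```python
# Labeled data: each example pairs input features with a target label.
labeled = [
    ((1.0, 1.2), "cat"),
    ((0.9, 1.0), "cat"),
    ((3.0, 3.1), "dog"),
    ((3.2, 2.9), "dog"),
]

# Unlabeled data: input features only, with no targets attached.
unlabeled = [(1.1, 1.1), (3.1, 3.0)]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_nearest(point, examples):
    """Return the label of the closest labeled example (1-nearest neighbour)."""
    _, label = min(examples, key=lambda example: squared_distance(point, example[0]))
    return label


# Supervised use: the labels in `labeled` drive predictions for new points.
predictions = [predict_nearest(point, labeled) for point in unlabeled]
print(predictions)
```

Without the labels, the same points could only be grouped by similarity, which is the unsupervised setting described above.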
Big Data
Data that is extremely large in volume, high in velocity, and diverse in variety. It
cannot be easily processed or analysed using traditional data processing
techniques.

Characteristics: Often requires distributed computing systems like Hadoop or
Spark to manage.

Examples:
• Social media data (Twitter, Facebook)
• Sensor data from IoT devices
• Streaming data from real-time systems like financial markets or website logs
Nature of data according to Data science
In data science, the nature of data refers to the characteristics, types, and structures
of the data that are used for analysis, modeling, and decision-making. Data can vary
in its form, source, granularity, and structure, influencing how it is processed
and analyzed.

The data types described so far are all used in machine learning, but data can also
be characterized in other ways, for example by granularity, source, and whether it
is streaming or spatial.
Granularity of Data
Granularity refers to the level of detail in the data.
• Fine-grained data: Highly detailed data, such as individual transactions, which
provides more insight but can be difficult to aggregate and process. Example: sensor
data collected from a smart device every second.

• Coarse-grained data: Summarized or aggregated data, which is easier to
process but may lose some of the finer insights. Examples: daily sales totals, average
monthly temperatures.
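
Moving from fine-grained to coarse-grained data is typically an aggregation step. The sketch below, using made-up per-second temperature readings, groups them by day and averages them; the timestamps and values are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean

# Fine-grained: one (timestamp, temperature) reading per second (sample values).
readings = [
    ("2024-01-01T00:00:00", 20.0),
    ("2024-01-01T00:00:01", 20.4),
    ("2024-01-01T00:00:02", 20.2),
    ("2024-01-02T00:00:00", 18.0),
    ("2024-01-02T00:00:01", 18.4),
]

# Group the readings by calendar day (the part of the timestamp before "T").
by_day = defaultdict(list)
for timestamp, value in readings:
    day = timestamp.split("T")[0]
    by_day[day].append(value)

# Coarse-grained: one average per day. The individual readings are no longer
# recoverable from this summary, which is the loss of detail noted above.
daily_average = {day: mean(values) for day, values in by_day.items()}
print(daily_average)
```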
Sources of Data
• Generated Data
Sensors: Data from IoT devices, environmental sensors (e.g., weather stations), fitness
trackers.
Transactions: Data generated from financial transactions (e.g., payment records, online
purchases).
• Social and Web Data
Social Media: Posts, comments, likes, and shares from platforms like Twitter, Facebook,
and Instagram.
Web Analytics: Data from website interactions, such as clicks, page views, and time spent
on a site.
• Public and Open Data
Government datasets: Census data, weather reports, and other publicly available
information.
Open-source repositories: Datasets from sources like Kaggle, UCI Machine Learning
Repository.
• Survey Data: Collected from responses to questionnaires or polls, often used in
social science research. Examples: customer satisfaction surveys, demographic surveys.
Streaming Data
Data that is continuously generated and processed in real time. For example:
• Social media feeds: Real-time data from Twitter or Facebook.
• Sensor data: Data from IoT devices that continuously send signals, like smart
thermostats or fitness trackers.
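
A defining trait of streaming data is that each value is processed as it arrives rather than after the whole dataset has been collected. The following sketch simulates this with Python generators; `sensor_stream` is a hypothetical stand-in for a real device feed, and its readings are invented.

```python
from typing import Iterable, Iterator


def sensor_stream() -> Iterator[float]:
    """Simulated thermostat feed; a real stream would not have a fixed end."""
    for reading in [21.0, 21.5, 22.0, 21.5]:
        yield reading


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, updated per incoming value."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


# Each result is available as soon as its reading arrives; only the running
# count and total are kept in memory, never the full history.
means = list(running_mean(sensor_stream()))
print(means)
```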
Spatial Data
Data that includes information about locations, such as latitude and longitude
coordinates.

Examples:
• GIS (Geographic Information System) Data: Maps, satellite images, and
data about locations (e.g., elevation, land use).
• GPS Data: Location tracking data from smartphones or vehicles.
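
One common operation on latitude and longitude coordinates is computing the distance between two points. The sketch below uses the haversine formula, which is a choice made here for illustration and assumes a spherical Earth with a mean radius of 6371 km; `haversine_km` is a hypothetical helper name.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(delta_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator.
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 1))
```

Under these assumptions, one degree of longitude at the equator comes out at roughly 111 km.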
