Class2-3 Data DataCollection 13-16aug2021

The document discusses different types of data in data science. It defines data science as a multi-disciplinary field that uses methods and algorithms to extract knowledge and insights from structured and unstructured data. The central goal is gaining insights from data using machine learning. It then describes different types of data based on their organization (unstructured, structured, semi-structured) and based on the variables they contain (numerical, categorical, time series). Numerical data can be continuous or discrete, while categorical data can have ordinal, nominal or binary values. Time series data involves values over time.
8/16/2021

Data and Types of Data

Data Science
• Multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract
knowledge and insight from structured and
unstructured data
• Central concept is gaining insight from data
• Machine learning uses data to extract knowledge

[Figure: data science pipeline with the stages Data Collection, Data Preprocessing (Database, Data Cleaning and Cleansing, Feature Representation), Data Modeling (Machine Learning) and Inference]

Data Collection
• Data manifests itself in many different forms
• Different forms of data require different ways to
collect them and different storage solutions
• Collection of data may consist of sending out
surveys or polls, or running other experiments
• Data based on the way it is collected:
– Data that comes from surveys
• Usually textual form of data or mixed
– Data entered in a database as system entry
• E.g. Student information entered on academic automation
system etc.
– Data in the form of signals (comes from sensors)
• Speech/Audio, Images and videos, Temperature readings,
Humidity, Seismic data, EEG (all bio-type signals) etc.
• According to the objective of the task, the way the
data is collected will change

Types of Data: Based on Organization


1. Unstructured data:
– Rawest form of data
– Example: Any type of files like texts, images, sounds or
videos etc.
– This type of data is stored in a repository of files
• For example, well-organised directories on a computer's hard drive

2. Structured data:
– Tabular data (rows and columns) with a very
well-defined structure

– Stored in databases
• Spreadsheets [Comma Separated Value (CSV) format]
• Oracle
• DB2
• MySQL etc.
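
As a minimal sketch of structured data, the snippet below parses a small CSV text with Python's standard `csv` module. The column names and values are hypothetical and only illustrate that every row has the same well-defined columns.

```python
import csv
import io

# Hypothetical CSV content; columns and values are made up for illustration.
CSV_TEXT = """roll_no,name,department
101,Asha,SCEE
102,Ravi,SE
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(CSV_TEXT)
for row in rows:
    # Every row exposes the same columns, taken from the header line.
    print(row["roll_no"], row["name"], row["department"])
```

Note that `csv.DictReader` returns every value as a string; converting `roll_no` to a number would be a separate preprocessing step.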
3. Semi-Structured data:
– Anywhere between unstructured and structured data
– A consistent format is defined; however, the structure is not
strict, and parts of the data may be incomplete or of a
different type
– Example: Data in the form of XML and JSON
• Stored in document oriented databases
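
A small sketch of what "no strict structure" can look like in practice, using Python's standard `json` module. The records and field names are hypothetical: they share a general format, but one record lacks a field and another stores `id` with a different type.

```python
import json

# Hypothetical JSON records: same general format, but fields may be
# missing or hold a different type from one record to the next.
JSON_TEXT = """
[
  {"id": 1, "name": "Asha", "city": "Mandi"},
  {"id": 2, "name": "Ravi"},
  {"id": "3", "name": "Meena", "tags": ["hostel", "sports"]}
]
"""

records = json.loads(JSON_TEXT)


def field_summary(recs):
    """Map each field name to the set of Python type names seen for it."""
    summary = {}
    for rec in recs:
        for key, value in rec.items():
            summary.setdefault(key, set()).add(type(value).__name__)
    return summary


summary = field_summary(records)
missing_city = [r["name"] for r in records if "city" not in r]
print(summary)
print(missing_city)
```

The summary makes the looseness visible: `id` appears both as a number and as a string, and `city` is absent from some records.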

Type of Data: Based on Variables (Value) found in Data
• Mainly in Structured Data:
1. Numerical data:
– Data represented as numbers
– Data in which information is measurable
– This type of data is called quantitative data as it
describes a quantity
– Two types based on the values taken:
• Continuous valued data:
– Numbers do not have a logical end
– Range lies within the natural limits of what we are measuring
– Example: Cost of books, atmospheric temperature etc.
• Discrete valued data:
– Numbers have a logical end
– There is a specific limit on the range of the values
– Example: number of members in a family, number of days in a
month, number of colours in a flag etc.

2. Categorical data:
– Data that is not a number. It can be a string of text or
a date
– It assigns an item or event to one of a few different
categories
– Example: Ethnicity, gender, eye colour, etc.
– This type of data is called qualitative data as it
describes a quality
– Three types based on the values they hold:
• Ordinal values: Values that have a set order to them
– Example: Severity of an alarm as “Critical”, “Medium” and
“Low”; ranking in a running race as “First”, “Second”, “Third”
• Nominal values: Values that have no set order to them
– Example: Values for the variables “Marital Status”, “Country”,
“Eye Colour” etc.
• Binary values: Special type of categorical data
– Have only two values: “Yes” and “No” OR “True” and “False”
OR “1” and “0”
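
The three kinds of categorical values can be sketched in plain Python as follows. The function names are hypothetical, and the one-hot encoding used for nominal values is a common choice assumed here, not something the slides prescribe.

```python
# Ordinal values: the order is part of the meaning, so it is stated explicitly.
SEVERITY_ORDER = ["Low", "Medium", "Critical"]
SEVERITY_RANK = {label: rank for rank, label in enumerate(SEVERITY_ORDER)}


def sort_by_severity(alarms):
    """Sort alarm labels from least to most severe."""
    return sorted(alarms, key=SEVERITY_RANK.__getitem__)


def one_hot(value, categories):
    """Nominal values have no order, so each category gets its own 0/1 slot."""
    return [1 if value == category else 0 for category in categories]


def encode_binary(value):
    """Map common two-valued labels to 1 or 0."""
    mapping = {"yes": 1, "true": 1, "1": 1, "no": 0, "false": 0, "0": 0}
    return mapping[str(value).strip().lower()]


sorted_alarms = sort_by_severity(["Critical", "Low", "Medium", "Low"])
eye_colour = one_hot("Blue", ["Brown", "Blue", "Green"])
print(sorted_alarms, eye_colour, encode_binary("Yes"))
```

Sorting the alarm labels alphabetically would put “Critical” first, which is why the ordinal order has to be supplied rather than inferred from the text.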

3. Time series data:
– A series of data points. It involves time and some kind of value
– Example: Temperature at every hour
– It is clearly structured and numeric in nature
– Special case of numerical data
– This type of data is important because of IoT and
sensors
– Data from sensors are almost always time-series in
nature
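
A minimal sketch of a time series as a list of (timestamp, value) pairs, using only the standard library. The hourly temperature readings are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical hourly temperature readings in degrees Celsius.
start = datetime(2021, 8, 16, 0, 0)
values = [18.2, 17.9, 17.5, 17.8]
series = [(start + timedelta(hours=i), v) for i, v in enumerate(values)]


def is_regular(ts, step):
    """Return True if consecutive timestamps are exactly `step` apart."""
    return all(b[0] - a[0] == step for a, b in zip(ts, ts[1:]))


def differences(ts):
    """Change in value between consecutive time steps."""
    return [round(b[1] - a[1], 2) for a, b in zip(ts, ts[1:])]


regular = is_regular(series, timedelta(hours=1))
deltas = differences(series)
print(regular, deltas)
```

Keeping the timestamp next to each value is what distinguishes this from a plain list of numbers: gaps and irregular sampling can be detected before any analysis.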


Data, Types of Data and Data Collection using Sensors
Need for Data Preprocessing

Summary of Previous Class:

Types of Data: Based on Organization
1. Unstructured data
2. Structured data:
– Tabular data (rows and columns) with a very
well-defined structure
– Each row is a finite ordered list (sequence) of elements,
where each element in a column belongs to an
attribute of a specific type
– Example: Spreadsheets [Comma Separated Value
(CSV) format]
3. Semi-structured data

Summary of Previous Class:

Type of Data: Based on Variables (Value) found in Data
• Mainly in Structured Data:
1. Numerical data:
– Two types based on the values taken:
• Continuous valued data:
• Discrete valued data:
2. Categorical data:
– Three types values they hold:
• Ordinal values:
• Nominal values:
• Binary values:
3. Time series data:

Data Collection from Sensors


• Sensors are devices that respond to the
environment around them and convert physical
parameters into a signal (e.g., optical, electrical,
mechanical) suitable for processing

[Figure: surrounding environment → sensor → signal (electrical, optical or mechanical)]

• Example: a temperature sensor outputs an electrical
signal whose voltage or current can be used to
identify the temperature around it
• Sensors can be an electrical/mechanical component, a
module or a subsystem
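
The temperature-sensor example can be sketched as two small conversions. Everything numeric here is an assumption for illustration: a hypothetical linear sensor that outputs 10 mV per degree Celsius with 0 V at 0 °C, read through a 10-bit analogue-to-digital converter (ADC) with a 3.3 V reference. A real sensor's datasheet would give the actual relationship.

```python
def adc_to_voltage(count, v_ref=3.3, bits=10):
    """Convert a raw ADC count to volts (assumed reference and resolution)."""
    max_count = (1 << bits) - 1
    if not 0 <= count <= max_count:
        raise ValueError(f"count {count} outside 0..{max_count}")
    return count * v_ref / max_count


def voltage_to_celsius(volts, volts_per_degree=0.010, offset_volts=0.0):
    """Assumed linear model: volts = offset_volts + volts_per_degree * temperature."""
    return (volts - offset_volts) / volts_per_degree


temperature = voltage_to_celsius(0.25)
print(round(temperature, 1))
```

Separating the two steps keeps the hardware assumptions (reference voltage, resolution) apart from the sensor model, so either can be changed without touching the other.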


Different Types of Sensors


• Acoustic, sound sensors (e.g., microphone)
• Visual sensors (e.g., cameras)
• Environmental sensors (e.g., temperature, humidity, pressure etc.)
• Chemical sensors (e.g., Diesel Nitrogen Oxide (NOx) sensors to measure engine-out NOx gas concentration)
• Flow sensors (e.g., water flow sensors)
• Motion sensors (e.g., gyroscope)
• Proximity or presence sensors (e.g., Passive Infrared (PIR))
• Biosensors (e.g., glucose monitor)
• And many more …

IIT Mandi Weather Station: Environmental Data (Temperature, Humidity, Pressure etc.) Collection
High-Level Overview

[Figure: high-level overview of the weather station. Diagram labels: Environmental Sensor; Zigbee/802.15.4 network; SQL Database + PHP (web access), running inside a Raspberry Pi; WiFi/802.11b; IIT Mandi intranet]

Source: Dr. Siddhartha Sarma

Landslide Monitoring System (LMS)

• The LMS relies on the Internet of Things (IoT) and low-cost
Micro-Electro-Mechanical Systems (MEMS) sensors

Source: Dr. Varun Dutt (SCEE) and Dr. Uday Kala (SE)


Components of LMS
• The LMS monitors a number of weather and soil
parameters via sensors at the deployment location

[Figure: LMS components: GY-61 accelerometer sensor (with pin diagram); YL-69 soil moisture sensor; SIM 900A GSM module; force sensor; DHT 22 humidity sensor; BH-1750 light sensor; BMP-180 temperature and pressure sensor; tipping rain gauge]

Source: Dr. Varun Dutt (SCEE) and Dr. Uday Kala (SE)

Architecture and Features of LMS

• The LMS monitors a number of weather and soil
parameters via sensors at the deployment location
– Temperature and humidity: -40 °C to +80 °C and 0-100 %
– Barometric pressure: 300-1100 mb
– Rainfall intensity: in mm
– Light intensity: 0-65535 lux
– Soil force: 0-100 N
– Soil moisture: 0-100 %
– Soil movement: ±2000°/sec rotational and ±16g gravitational acceleration

Source: Dr. Varun Dutt (SCEE) and Dr. Uday Kala (SE)
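
The measurement ranges listed for the LMS sensors can double as a first sanity check on incoming readings. The sketch below assumes a reading arrives as a dictionary; the key names are hypothetical, while the numeric limits are the ones given on the slide. It is not a description of how the actual LMS validates data.

```python
# Ranges taken from the slide; the dictionary keys are hypothetical
# field names chosen for this sketch.
VALID_RANGES = {
    "temperature_c": (-40.0, 80.0),
    "humidity_pct": (0.0, 100.0),
    "pressure_mb": (300.0, 1100.0),
    "light_lux": (0.0, 65535.0),
    "soil_force_n": (0.0, 100.0),
    "soil_moisture_pct": (0.0, 100.0),
}


def out_of_range(reading):
    """Return the names of fields whose value lies outside the sensor's range."""
    bad = []
    for name, (low, high) in VALID_RANGES.items():
        value = reading.get(name)
        # Fields absent from the reading are skipped, not flagged.
        if value is not None and not (low <= value <= high):
            bad.append(name)
    return bad


reading = {"temperature_c": 21.5, "humidity_pct": 130.0, "pressure_mb": 912.0}
flagged = out_of_range(reading)
print(flagged)
```

A value outside the physical range of the sensor cannot be a genuine measurement, so flagging it is a cheap way to catch faulty readings early.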

[Figure: architecture diagram of LMS]

The LMS will alert people via traffic lights, SMSs, or smart-apps on mobile
phones about the danger of impending landslides

Source: Dr. Varun Dutt (SCEE) and Dr. Uday Kala (SE)

Data Preprocessing

Need for Data Preprocessing


• Real-world data tend to be incomplete, noisy and
inconsistent due to their huge size and their likely
origin from multiple heterogeneous sources
• Preprocessing is important to clean the data
• Low-quality data will lead to low-quality analysis
results
• If users believe the data is of low quality (dirty),
they are unlikely to trust the results of any data
analytics applied to it
• Low-quality data can confuse analytic procedures
that use machine learning techniques,
resulting in unreliable output
• Incomplete, noisy and inconsistent data are common
properties of large real-world databases

Tuple (Record) in Structured Data


• A tuple (record) is a finite ordered list (sequence) of
elements, where each element belongs to an
attribute

[Figure: example table with a tuple (record) marked]

• Each row is a tuple
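
One way to picture a tuple in code is Python's `typing.NamedTuple`, where each position in the ordered list is tied to a named attribute. The `StudentRecord` type and its attributes are hypothetical and only serve as an illustration.

```python
from typing import NamedTuple


class StudentRecord(NamedTuple):
    """One tuple (row); the attribute names here are hypothetical."""

    roll_no: int
    name: str
    department: str


# A table is then simply a list of such tuples, one per row.
table = [
    StudentRecord(101, "Asha", "SCEE"),
    StudentRecord(102, "Ravi", "SE"),
]

first = table[0]
# The same element can be reached by attribute name or by position.
print(first.name, first[1], len(first))
```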

Incomplete Data
• Many tuples (records) have no recorded value for
several attributes
• Reasons for incomplete data:
– User forgot to fill in a field
– User chose not to fill in the field as it was not
considered important at the time of entry
– Relevant data may not be recorded due to
malfunctioning of equipment
– Data might have been lost while being transferred from the
place of recording
– Data may not be recorded due to a programming error
– Data might not be recorded due to technology
limitations like limited memory
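
A first step in dealing with incomplete data is usually to measure how incomplete it is. The sketch below counts missing values per attribute; the records and the set of markers treated as "missing" (`None`, an empty string, `"NA"`) are assumptions made for this example.

```python
# Markers treated as "no recorded value" in this sketch (an assumption).
MISSING = (None, "", "NA")


def missing_counts(records, attributes):
    """Count, for each attribute, how many records have no recorded value."""
    counts = {attr: 0 for attr in attributes}
    for rec in records:
        for attr in attributes:
            # rec.get returns None when the attribute is absent altogether.
            if rec.get(attr) in MISSING:
                counts[attr] += 1
    return counts


# Hypothetical records with gaps of different kinds.
records = [
    {"roll_no": 101, "name": "Asha", "email": ""},
    {"roll_no": 102, "name": None, "email": "NA"},
    {"roll_no": 103, "name": "Meena"},
]
counts = missing_counts(records, ["roll_no", "name", "email"])
print(counts)
```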


Noisy Data
• Many tuples (records) have incorrect values for several
attributes
• Reasons for noisy data:
– There may be human or computer errors in
data entry
– The data collection instruments used may be faulty
– There may be errors in data transmission
– There may be technology limitations, such as a limited
buffer size for coordinating synchronised data transfer
and consumption
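
To illustrate what noise looks like in a sensor stream, the sketch below applies a three-point median filter to hypothetical temperature readings containing one spike. The median filter is one common smoothing technique chosen here for illustration; the slides do not name a specific method.

```python
from statistics import median


def median_filter3(values):
    """Replace each interior value by the median of itself and its two neighbours."""
    if len(values) < 3:
        return list(values)
    smoothed = [values[0]]  # end points are kept unchanged
    for i in range(1, len(values) - 1):
        smoothed.append(median(values[i - 1:i + 2]))
    smoothed.append(values[-1])
    return smoothed


# Hypothetical readings; 85.0 stands in for a faulty measurement.
noisy = [21.0, 21.2, 85.0, 21.1, 20.9]
clean = median_filter3(noisy)
print(clean)
```

An isolated spike never becomes the median of its window, so it is replaced by a neighbouring value, while the rest of the series changes very little.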

Inconsistent Data
• Data containing discrepancies in stored values for
some attributes
• Reasons for inconsistent data:
– It may result from inconsistencies in
• naming conventions
– Example: “Dept_ID” vs. “Department_ID”;
“Roll_No” vs. “Registration_No”
• data codes used (mismatch in writing values)
– Example: For department, “SCEE” vs. “School of Computing and
EE”
• formats of input fields such as dates
– Example: “dd-mm-yy”, “dd-mm-yyyy”, “mm/dd/yyyy”
– Inconsistency in naming conventions or formats of input
fields while integrating data
– Example: integrating temperature records from
different locations that use different naming conventions
– Inconsistent data may also be due to human or computer
error occurring in data entry
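
The three kinds of inconsistency above can each be handled by mapping variants onto one canonical form. The sketch below uses the examples from the slide; the alias tables and function names are hypothetical, and the date format of each source is assumed to be known in advance.

```python
from datetime import datetime

# Hypothetical alias tables mapping each variant onto one canonical form.
COLUMN_ALIASES = {"dept_id": "department_id", "department_id": "department_id"}
DEPARTMENT_CODES = {"school of computing and ee": "SCEE", "scee": "SCEE"}


def canonical_column(name):
    """Normalise a column name; unknown names are only lower-cased."""
    key = name.strip().lower()
    return COLUMN_ALIASES.get(key, key)


def canonical_department(value):
    """Normalise a department value; unknown values are only stripped."""
    return DEPARTMENT_CODES.get(value.strip().lower(), value.strip())


def to_iso_date(text, fmt):
    """Parse `text` using its known source format and return yyyy-mm-dd."""
    return datetime.strptime(text, fmt).date().isoformat()


print(canonical_column("Dept_ID"), canonical_department("School of Computing and EE"))
print(to_iso_date("16-08-21", "%d-%m-%y"), to_iso_date("08/16/2021", "%m/%d/%Y"))
```

The date format is passed in explicitly because it cannot be guessed reliably from the value: a string such as 03/04/2021 is valid under both day-first and month-first conventions.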
