Class2-3 Data DataCollection 13-16aug2021
Data Science
• Multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract
knowledge and insight from structured and
unstructured data
• Central concept is gaining insight from data
• Machine learning uses data to extract knowledge
[Figure: pipeline from Database, through Data Preprocessing (Data Cleaning and Cleansing), to Feature Representation]
8/16/2021
Data Collection
• Data manifests itself in many different forms
• Different forms of data require different ways to
collect them and different storage solutions
• Collection of data may consist of sending out
surveys, polls or running other experiments
• Types of data, based on the way it is collected:
– Data that comes from surveys
• Usually textual or mixed data
– Data entered in a database as system entries
• E.g. student information entered in an academic
automation system
– Data in the form of signals (comes from sensors)
• Speech/Audio, Images and videos, Temperature readings,
Humidity, Seismic data, EEG (all bio-type signals) etc.
• The way the data is collected depends on the
objective of the task
– Stored in databases
• Spreadsheets [Comma Separated Value (CSV) format]
• Oracle
• DB2
• MySQL etc.
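As an illustration of the CSV format listed above, here is a minimal sketch that reads comma-separated records with Python's standard `csv` module. The column names and values are hypothetical and chosen only for the example; a real file would be opened with `open(...)` instead of the in-memory string used here.

```python
import csv
import io

# Hypothetical CSV content; column names and values are made up for illustration.
CSV_TEXT = """roll_no,name,department
101,Asha,SCEE
102,Ravi,SE
"""

def read_records(text):
    """Parse CSV text into a list of dicts, one dict per row (record)."""
    return list(csv.DictReader(io.StringIO(text)))

records = read_records(CSV_TEXT)
for record in records:
    # Every value is read as a string; convert types explicitly if needed.
    print(record["roll_no"], record["name"], record["department"])
```

The first line of the CSV text supplies the attribute names, so each later line becomes one record keyed by those names.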
[Figure: a sensor takes input from the surrounding environment and outputs an electrical, optical or mechanical signal]
[Figure: sensor network architecture. Labels: Environmental Sensor; Zigbee/802.15.4 network; SQL Database + PHP (web access), running inside a Raspberry Pi; WiFi/802.11b; IIT Mandi intranet]
Source: Dr. Varun Dutt (SCEE) and Dr. Uday Kala (SE)
Components of LMS
• The LMS monitors a number of weather and soil
parameters via sensors at the deployment location
– F: Humidity Sensor (DHT 22)
– G: Light Sensor (BH-1750)
– H: Temperature and Pressure Sensor (BMP-180)
– I: Tipping Rain Gauge
The LMS will alert people via traffic lights, SMSs, or smart-apps on mobile
phones about the danger of impending landslides
Data Preprocessing
[Figure: table illustration; label: Tuple (record)]
Incomplete Data
• Many tuples (records) have no recorded value for
several attributes
• Reasons for incomplete data:
– User forgot to fill in a field
– User chose not to fill out the field as it was not
considered important at the time of the entry
– Relevant data may not be recorded due to
malfunctioning of equipment
– Data might have been lost in transfer from the place
where it was recorded
– Data may not be recorded due to programming error
– Data might not be recorded due to technology
limitations like limited memory
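The idea of incomplete data can be sketched in a few lines of Python: count, per attribute, how many records have no recorded value. The records and attribute names below are hypothetical, and treating `None` and empty strings as "missing" is an assumption; real datasets may use other markers.

```python
# Hypothetical records; None or an empty string marks a value that was not recorded.
records = [
    {"roll_no": "101", "name": "Asha", "phone": "", "temperature": 21.5},
    {"roll_no": "102", "name": "Ravi", "phone": "98765", "temperature": None},
    {"roll_no": "103", "name": "Meena", "phone": "91234", "temperature": 19.0},
]

def is_missing(value):
    """Treat None and blank strings as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, str) and value.strip() == "")

def missing_counts(rows):
    """Return {attribute: number of records with no recorded value}."""
    counts = {}
    for row in rows:
        for attribute, value in row.items():
            if is_missing(value):
                counts[attribute] = counts.get(attribute, 0) + 1
    return counts

def incomplete_rows(rows):
    """Return the records that have at least one missing attribute."""
    return [row for row in rows if any(is_missing(v) for v in row.values())]

print(missing_counts(records))
print(len(incomplete_rows(records)))
```

A count like this only locates the gaps; it does not say which of the reasons listed above caused them.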
Noisy Data
• Many tuples (records) have incorrect values for several
attributes
• Reasons for noisy data:
– There may be human or computer error occurring in
data entry
– The data collection instruments used may be faulty
– Errors in data transmission
– There may be technology limitations such as a limited
buffer size for coordinating synchronised data transfer
and consumption
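One simple way to surface some noisy values is a plausibility range check, sketched below. The attribute names and readings are hypothetical, and the temperature bounds are an assumed range chosen for illustration; only the 0 to 100 limit on relative humidity follows from the definition of the quantity.

```python
# Assumed plausibility ranges per attribute (hypothetical names).
PLAUSIBLE_RANGES = {
    "humidity_percent": (0.0, 100.0),  # relative humidity lies within 0-100 %
    "temperature_c": (-40.0, 80.0),    # assumed bounds, for illustration only
}

def flag_noisy(readings, ranges=PLAUSIBLE_RANGES):
    """Return (record index, attribute, value) for each out-of-range value."""
    flagged = []
    for i, reading in enumerate(readings):
        for attribute, (low, high) in ranges.items():
            value = reading.get(attribute)
            if value is not None and not (low <= value <= high):
                flagged.append((i, attribute, value))
    return flagged

# Hypothetical sensor readings; two of them contain implausible values.
readings = [
    {"humidity_percent": 45.2, "temperature_c": 21.0},
    {"humidity_percent": 130.0, "temperature_c": 22.1},
    {"humidity_percent": 47.9, "temperature_c": -999.0},
]

for index, attribute, value in flag_noisy(readings):
    print(f"record {index}: {attribute} = {value} is outside the plausible range")
```

A range check catches only implausible values; an incorrect reading that still falls inside the range passes unnoticed.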
Inconsistent Data
• Data containing discrepancies in stored values for
some attributes
• Reasons for inconsistent data:
– It may result from inconsistencies in
• naming conventions
– Example: “Dept_ID” vs. “Department_ID”;
“Roll_No” vs. “Registration_No”
• data codes used (mismatch in writing values)
– Example: for department, “SCEE” vs. “School of Computing and
EE”
• formats of input fields such as dates
– Example: “dd-mm-yy”, “dd-mm-yyyy”, “mm/dd/yyyy”
– Inconsistency in naming conventions or formats of input
fields when integrating data from several sources
– Example: integrating temperature records from different
locations that use different naming conventions
– Inconsistent data may also be due to human or computer
error occurring in data entry
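The three kinds of inconsistency above (attribute names, data codes, date formats) can be reconciled by mapping every variant to one canonical form. The sketch below assumes hypothetical column names "Department" and "Date", treats "Dept_ID" and "Department_ID" as the same attribute, and picks ISO dates as the target format; none of these choices comes from the slides.

```python
from datetime import datetime

# Assumed mappings from variant spellings to one canonical form.
COLUMN_ALIASES = {"Dept_ID": "Department_ID"}
DEPARTMENT_CODES = {"School of Computing and EE": "SCEE"}
# Formats tried in order: dd-mm-yyyy, dd-mm-yy, mm/dd/yyyy.
DATE_FORMATS = ("%d-%m-%Y", "%d-%m-%y", "%m/%d/%Y")

def normalize_date(text):
    """Convert a date in any known format to ISO form (yyyy-mm-dd)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

def normalize_record(record):
    """Return a copy of the record with canonical names, codes and dates."""
    clean = {}
    for name, value in record.items():
        name = COLUMN_ALIASES.get(name, name)
        if name == "Department":
            value = DEPARTMENT_CODES.get(value, value)
        elif name == "Date":
            value = normalize_date(value)
        clean[name] = value
    return clean

# Two hypothetical records that describe the same entry in different conventions.
a = normalize_record(
    {"Dept_ID": "7", "Department": "School of Computing and EE", "Date": "16-08-21"}
)
b = normalize_record(
    {"Department_ID": "7", "Department": "SCEE", "Date": "08/16/2021"}
)
print(a == b)
```

Here the separator ("-" or "/") tells the day-first and month-first formats apart; if two sources used the same separator with a different field order, the source of each record would have to decide the format instead.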