Unit II Notes
1. Introduction to Data Collection in IoT:-
Data collection is one of the most vital phases in the data science
process, especially within the context of the Internet of Things (IoT). In
IoT systems, various devices, sensors, and machines continuously
generate data. This data needs to be collected, stored, and pre-
processed before any meaningful analysis can be done. Without
accurate and sufficient data, the results of any data science project may
be misleading or incomplete. Thus, understanding the sources, types,
and methods of data collection is crucial.
4. Veracity
Data quality challenges due to:
1. Sensor malfunctions
2. Transmission errors
3. Environmental interference
Layers of an IoT Data Collection Architecture
Edge Layer:
Physical sensors/devices collecting raw data.
Examples: Temperature sensors, accelerometers, cameras
Gateway Layer:
Aggregates and preprocesses data
May perform initial filtering/compression
Cloud Layer:
Centralized storage and processing
Enables large-scale analytics
Key Challenges in IoT Data Collection
Challenge            Description                                Impact
Scalability          Handling billions of connected devices     Requires distributed architectures
Security             Vulnerabilities in device networks         Risk of data breaches/spoofing
Power Efficiency     Energy constraints on edge devices         Affects deployment longevity
Data Heterogeneity   Multiple formats/protocols                 Complicates integration
2. Getting to Know Your Data :-
It is essential to first understand the characteristics of the data
you are working with. This step, often referred to as data
understanding, is critical because it determines the kind of
analytical methods you can use, the preprocessing required,
and the assumptions you can safely make.
1. Structured Data
Structured data refers to data that is organized in a fixed
schema—typically rows and columns—making it easy to store
and query using relational databases (like MySQL, PostgreSQL).
Characteristics:
o Clearly defined fields (columns).
IoT Applications:
Industrial sensor readings (temperature, pressure).
Smart meter energy consumption data.
Inventory tracking in warehouses.
2. Unstructured Data
Unstructured data does not follow a predefined model or structure.
This makes it harder to store and analyze using traditional methods.
Characteristics:
o Cannot be stored in traditional rows/columns easily.
o Requires advanced processing (e.g., NLP for text,
computer vision for images).
IoT Applications:
Surveillance camera footage
Drone aerial imagery
Voice recordings from smart assistants
Maintenance technician notes
Processing Challenges:
Requires computer vision for image/video
NLP techniques for text analysis
Higher computational needs
3. Semi-Structured Data
Semi-structured data lies between structured and unstructured data.
It has a flexible format but still contains markers or tags to separate
elements.
Characteristics:
o Self-describing structure with tags or keys.
o Easier to parse programmatically.
IoT Applications:
Device configuration files
API communications
Event logs with metadata
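As a minimal sketch of handling semi-structured data in Python, the snippet below parses a hypothetical JSON device configuration; the field names are illustrative only.

import json

# Hypothetical device configuration (semi-structured: tagged keys, flexible nesting)
config_text = '{"device_id": "sensor-01", "type": "temperature", "limits": {"min": -10, "max": 60}}'

config = json.loads(config_text)      # parse the JSON string into a Python dict
print(config["device_id"])            # access a field by its key/tag
print(config["limits"]["max"])        # nested fields are reached the same way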
4. Time-Series Data in IoT
Characteristics:
Time-stamped values
High write throughput
Often append-only
Specialized Databases:
InfluxDB
TimescaleDB
Prometheus
IoT Applications:
Continuous monitoring (health vitals)
Predictive maintenance
Environmental sensing
Visualization Example:
Temperature Over Time
30°C | *
25°C | * *
20°C |* *
+-----------
Time
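A small sketch of loading time-series sensor data with pandas; the file name and column names ("temperature.csv", "timestamp", "temperature") are assumptions for illustration.

import pandas as pd
import matplotlib.pyplot as plt

# Assumed CSV layout: timestamp,temperature
df = pd.read_csv("temperature.csv", parse_dates=["timestamp"])
df = df.set_index("timestamp").sort_index()   # time-indexed, append-only style data

print(df["temperature"].head())                          # first few time-stamped values
df["temperature"].plot(title="Temperature Over Time")    # quick line plot of the series
plt.show()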
5. Spatial Data in IoT
Characteristics:
Geographic coordinates
Often combined with time data
Requires GIS processing
Common Formats:
GeoJSON
Shapefiles
KML (Keyhole Markup Language)
IoT Applications:
Fleet tracking
Smart city infrastructure
Agricultural monitoring
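A brief illustration of spatial IoT data as a GeoJSON feature, built and parsed with Python's json module; the coordinates and properties are made up for the example.

import json

# One fleet-tracking reading as a GeoJSON Feature (longitude first, then latitude)
reading = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [72.8777, 19.0760]},
    "properties": {"vehicle_id": "truck-07", "speed_kmh": 42},
}

geojson_text = json.dumps(reading)     # serialize for storage or transmission
lon, lat = json.loads(geojson_text)["geometry"]["coordinates"]
print(lon, lat)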
1. Data Types:
Understanding the data type of each field/attribute is fundamental
for selecting the correct analytical or preprocessing method.
Numerical: Integers (e.g., count of devices) and floating-point
numbers (e.g., temperature = 23.4°C).
Categorical: Represents labels or categories (e.g., status =
"active"/"inactive").
Boolean: True/False values, often used for binary conditions.
Date/Time: Timestamps that are important for trend analysis
and correlation in IoT.
Incorrect interpretation of data types can lead to errors in
calculations and visualizations.
2. Units of Measurement
Knowing the measurement units of your data is essential for
interpretation and consistency. For example:
Temperature: Celsius, Fahrenheit, Kelvin.
Distance: Meters, kilometers, miles.
Speed: km/h, m/s.
Inconsistent or incorrect units can lead to incorrect conclusions.
Always standardize units across datasets if merging from multiple
sources.
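For instance, a small helper (the function name and sample values are illustrative) that standardizes temperature readings to Celsius before datasets are merged:

def fahrenheit_to_celsius(f):
    """Convert a Fahrenheit reading to Celsius."""
    return (f - 32) * 5 / 9

# Standardize a batch of readings (assumed to be in Fahrenheit) to Celsius
readings_f = [68.0, 75.2, 98.6]
readings_c = [round(fahrenheit_to_celsius(f), 1) for f in readings_f]
print(readings_c)   # [20.0, 24.0, 37.0]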
3. Frequency of Data Collection
This refers to how often the data is being collected, which affects
storage needs and granularity of analysis.
High-frequency data: Collected every few milliseconds (e.g.,
vibration data in machinery monitoring).
Low-frequency data: Collected hourly, daily, or on-demand
(e.g., smart electricity meter data).
The frequency impacts:
The size of the dataset.
The ability to capture real-time trends.
Resource requirements for processing.
4. Missing and Anomalous Values
Missing values: Gaps in the data, for example due to sensor
failures, transmission errors, or non-response in surveys.
Anomalous values: Outliers or data points that don’t match
expected patterns (e.g., a temperature sensor reading -100°C in
a room).
Detecting and handling such values is critical:
Techniques: Imputation (filling in missing values), filtering out
outliers, or interpolating data for missing timestamps.
Tools: Python libraries like pandas, scikit-learn, and matplotlib
for detection and correction.
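A brief pandas sketch of these techniques using made-up readings: an impossible value is treated as missing, and gaps are filled by interpolation.

import pandas as pd

temps = pd.Series([22.1, 22.3, None, 22.6, -100.0])   # None = missing, -100 = sensor error

temps = temps.mask(temps < -50)   # treat impossible values as missing (outlier filtering)
temps = temps.interpolate()       # fill gaps from neighbouring readings (imputation)
print(temps)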
3. Types of Data :-
In data science and IoT systems, understanding the type of data
being collected is crucial. It influences everything from how data is
stored, processed, and analyzed to which algorithms and
visualization techniques are applicable. Broadly, data can be
categorized based on source and nature.
Secondary Data (data collected earlier by others or for a different purpose):
Examples:
o Public datasets (e.g., from government agencies or open
data platforms)
o Historical data logs stored in cloud services
o Research databases
o Organizational reports and records
Advantages:
o Saves time and effort in data collection.
o Often readily available and low-cost.
Disadvantages:
o May not perfectly match current objectives.
o May be outdated or not fully reliable.
Quantitative (Numerical) Data
Discrete Data:
o Countable values (e.g., number of IoT devices online).
o No intermediate values between data points.
Continuous Data:
o Can take any value within a range (e.g., temperature,
voltage).
o Supports mathematical operations like averaging,
standard deviation.
Examples in IoT:
o Temperature readings from a sensor.
o Voltage levels in a smart grid.
o Speed of a moving vehicle from a GPS module.
Qualitative (Categorical) Data
Types:
o Nominal Data: Categories with no natural order (e.g.,
device status: "on", "off", "standby").
o Ordinal Data: Categories with an implied order (e.g.,
feedback levels: "bad", "average", "good", "excellent").
Examples in IoT:
o Device status: "Active", "Idle", "Disconnected"
o Sensor health: "Good", "Warning", "Critical"
o User preference: "Indoor Mode", "Eco Mode", "Turbo
Mode"
Use:
o Important for classification models.
o Represented using bar charts, pie charts, and frequency
tables.
A. Manual Data Entry
Data entered into the system by human operators rather than
captured automatically.
Use Cases:
o Initial device registration.
o Logging maintenance events or technician observations.
o Emergency overrides or manual error corrections.
Limitations in IoT:
o Human Error: Typos, omissions, or incorrect values.
o Scalability Issues: Not feasible when dealing with
thousands of sensors.
o Latency: Time-consuming and not real-time.
Example: A technician manually entering sensor calibration
values after maintenance.
Conclusion: While important in specific situations, manual data
entry is generally discouraged in large-scale IoT applications
due to its lack of reliability and efficiency.
B. Sensor-Based Data Collection
Data captured automatically by sensors attached to devices or the
environment.
Advantages:
o Real-time and automated.
o High accuracy and reliability.
o Minimal human intervention.
Challenges:
o Device failure or battery depletion.
o Data overload if sampling rate is too high.
o Environmental interference.
Examples:
o A smart thermostat collecting room temperature every 5
seconds.
o Soil moisture sensors in precision agriculture.
o Accelerometers in smartwatches detecting movement.
C. API-Based Data Collection
Data retrieved from external services or platforms through APIs.
Use Cases:
o Weather data for smart agriculture.
o Location-based data from mapping services.
o Energy pricing data for smart grid systems.
Advantages:
o Easy integration with other platforms.
o Reduces development time.
o Reliable and consistent formats.
Challenges:
o Dependent on third-party uptime.
o Rate limits or data caps.
o Data licensing or usage fees.
Examples:
o Using OpenWeatherMap API to retrieve real-time
temperature and humidity data.
o Getting traffic data for smart transportation systems via
Google Maps API.
D. Log File Data Collection
Data recorded automatically by systems and devices in the form of
log files.
Sources of Logs:
o Operating systems
o Web servers
o Applications
o Network devices
o IoT platforms
Use Cases:
o System diagnostics and debugging.
o Intrusion detection in cybersecurity.
o Performance monitoring.
Advantages:
o Rich information source.
o Useful for understanding behavior over time.
o Can uncover hidden patterns through mining.
Examples:
o A smart factory logging machine operation times and
faults.
o A security system recording motion detection events.
E. Streaming Data Collection
Streaming data refers to continuous, real-time data flows from
devices. In IoT, this is especially important for applications that need
to react instantly.
Applications:
o Real-time traffic monitoring
o Live video surveillance
o Industrial automation systems
o Smart health monitoring (e.g., patient vitals)
Advantages:
o Instant processing and response.
o Scalability to handle millions of data points per second.
o Fault tolerance and high throughput.
Challenges:
o Requires robust infrastructure and high-speed networks.
o Complex system design.
o Need for data compression, buffering, and prioritization.
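A minimal streaming-collection sketch using MQTT, assuming a broker running on localhost and the paho-mqtt package (1.x constructor shown); the topic name is an assumption.

import paho.mqtt.client as mqtt   # assumes the paho-mqtt package is installed

def on_message(client, userdata, msg):
    # Called for every message that arrives on the subscribed topic
    print(msg.topic, msg.payload.decode())

client = mqtt.Client()                  # paho-mqtt 1.x style constructor
client.on_message = on_message
client.connect("localhost", 1883)       # assumed local broker
client.subscribe("sensors/temperature") # assumed topic name
client.loop_forever()                   # process the stream until interrupted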
1. Cleaning
Purpose: To remove errors, redundancies, and inconsistencies
in the dataset.
Cleaning Techniques:
o Dropping duplicates
Example:
o A temperature sensor logs: 22°C, 22°C, NaN, 99°C (error);
cleaning removes the duplicate reading, fills or drops the
NaN, and discards the erroneous 99°C value.
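A cleaning sketch of that example with pandas; the 50°C cut-off used to flag the error reading is an assumption.

import pandas as pd

df = pd.DataFrame({"temp_c": [22, 22, None, 99]})

df = df.drop_duplicates()                                  # remove the repeated 22°C reading
df = df[(df["temp_c"] < 50) | df["temp_c"].isna()]         # drop the 99°C error reading
df["temp_c"] = df["temp_c"].fillna(df["temp_c"].mean())    # impute the missing value
print(df)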
2. Filtering
Purpose: To remove irrelevant or noisy data that is not useful
for the specific analysis or application.
Filtering Approaches:
o Based on time windows (e.g., keeping only last 7 days of
data).
o Based on value thresholds (e.g., discard values below a
defined minimum).
o Based on events (e.g., keep data only when motion
is detected).
Benefits:
o Reduces data size and computational load.
Example:
o Discarding environmental sensor data when a device is
offline or undergoing maintenance.
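A filtering sketch with pandas; the 7-day window, the humidity column, and the threshold are illustrative values.

import pandas as pd

# Illustrative time-indexed readings (one per day over ten days)
idx = pd.date_range("2025-05-01", periods=10, freq="D")
df = pd.DataFrame({"humidity": [55, 60, 3, 58, 62, 61, 2, 59, 63, 60]}, index=idx)

recent = df[df.index >= df.index.max() - pd.Timedelta(days=7)]  # keep only the last 7 days
valid = recent[recent["humidity"] > 5]                          # drop noisy near-zero readings
print(valid)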
3. Transformation
Purpose: To convert raw data into a consistent, interpretable, and
analyzable format.
Common Transformations:
o Unit conversion: Fahrenheit to Celsius, meters to feet.
o Encoding categorical values for modeling (e.g., "on", "off"
→ 1, 0).
o Scaling or normalization: Bringing values into the same
range, e.g., [0,1].
o Date-time formatting for time-series analysis.
Example:
o Converting timestamps from “01-05-2025 10:00:00” to
ISO format: “2025-05-01T10:00:00Z”.
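A transformation sketch covering the operations above with pandas; the column names and the [0,1] scaling are assumptions for the example.

import pandas as pd

df = pd.DataFrame({
    "state": ["on", "off", "on"],
    "temp_f": [68.0, 75.2, 98.6],
    "ts": ["01-05-2025 10:00:00", "01-05-2025 10:05:00", "01-05-2025 10:10:00"],
})

df["state_code"] = df["state"].map({"on": 1, "off": 0})            # encode categorical values
df["temp_c"] = (df["temp_f"] - 32) * 5 / 9                         # unit conversion
df["temp_scaled"] = (df["temp_c"] - df["temp_c"].min()) / (
    df["temp_c"].max() - df["temp_c"].min())                       # scale into [0,1]
df["ts"] = pd.to_datetime(df["ts"], format="%d-%m-%Y %H:%M:%S")    # parse day-first timestamps
print(df)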
4. Aggregation
Purpose: To summarize and condense large datasets by combining
values over specified intervals.
Why Aggregate?:
o IoT devices can generate data every second; aggregation
reduces volume while retaining trends.
Techniques:
o Averaging: Mean temperature per hour.
o Summing: Total energy usage per day.
o Counting: Number of events per minute.
o Windowing: Time-based data grouping (sliding or fixed
windows).
Example:
o A smart meter records power consumption every 5
seconds. Aggregation produces average hourly usage for
billing or analysis.
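An aggregation sketch with pandas resampling; the 5-second smart-meter readings below are simulated.

import numpy as np
import pandas as pd

# Simulated smart-meter readings every 5 seconds for two hours
idx = pd.date_range("2025-05-01", periods=2 * 60 * 12, freq="5s")
power = pd.Series(np.random.uniform(0.8, 1.2, size=len(idx)), index=idx)

hourly_avg = power.resample("1h").mean()   # average consumption per hour
hourly_sum = power.resample("1h").sum()    # total consumption per hour
print(hourly_avg, hourly_sum)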
5. Integration
Purpose: To merge data from multiple sources into a unified dataset.
Challenges:
o Devices may use different formats or schemas.
o Timestamp alignment can vary by time zone or clock drift.
o Missing IDs or inconsistent labels.
Integration Techniques:
o Schema matching and merging.
o Time alignment using interpolation or buffers.
o Entity resolution (matching devices across systems).
Example:
o Combining GPS data from a vehicle, weather API data, and
fuel sensor data to analyze route efficiency.
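An integration sketch using pandas merge_asof, which aligns readings from two devices by nearest timestamp; the device columns, values, and tolerance are assumptions.

import pandas as pd

gps = pd.DataFrame({
    "ts": pd.to_datetime(["2025-05-01 10:00:00", "2025-05-01 10:00:10"]),
    "speed_kmh": [40, 45],
})
fuel = pd.DataFrame({
    "ts": pd.to_datetime(["2025-05-01 10:00:02", "2025-05-01 10:00:09"]),
    "fuel_lph": [6.1, 6.4],
})

# Both frames must be sorted by timestamp before merge_asof
merged = pd.merge_asof(fuel, gps, on="ts", direction="nearest",
                       tolerance=pd.Timedelta(seconds=5))
print(merged)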
1. Time Synchronization
IoT systems often involve multiple sensors and devices capturing data
at different intervals. Aligning their timestamps is crucial.
Problems if not synchronized:
o Inaccurate event correlation.
o Misleading analysis.
o Data mismatches.
Solutions:
o Use of NTP (Network Time Protocol).
o Timestamp correction based on drift detection.
o Central clock reference in distributed networks.
2. Data Integrity
Ensuring that the data has not been tampered with or
corrupted during transmission or storage is critical for trust and
reliability.
Integrity Techniques:
o Checksums and hashing.
o Secure transmission protocols (together with encryption).
o Redundancy and backups.
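A small sketch of checksum-based integrity verification using Python's hashlib; the payload is made up.

import hashlib

payload = b'{"device": "sensor-01", "temp_c": 22.4}'

checksum = hashlib.sha256(payload).hexdigest()   # computed by the sender

# Receiver recomputes the hash and compares it with the transmitted checksum
received_ok = hashlib.sha256(payload).hexdigest() == checksum
print(received_ok)   # True only if the payload was not altered in transit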
6. Exploratory Data Analysis (EDA):-
Once IoT data has been collected, cleaned, and pre-processed,
the next crucial step is Exploratory Data Analysis (EDA). This
process involves using statistical summaries and visualizations
to explore the dataset, identify patterns, understand
relationships, and detect anomalies.
EDA is not just about looking at numbers—it’s about asking
questions and discovering insights from the data that can guide
future modelling or decision-making. In the context of IoT, this
might include understanding sensor behaviour over time,
detecting system failures, or identifying usage patterns.
Objectives of EDA
Understand the underlying structure of the data.
Identify important variables and their characteristics.
Detect outliers, missing values, or errors.
Explore correlations and dependencies between
variables.
Generate hypotheses or assumptions for further analysis.
Assist in feature selection and model design.
Types of EDA
1. Univariate Analysis
This involves analyzing one variable at a time to
understand its distribution, range, and central tendencies.
Measures Used:
o Central tendency: Mean, median, mode.
o Spread: Variance, standard deviation, range, interquartile
range (IQR).
o Distribution Shape: Skewness (asymmetry), kurtosis
(peakedness).
Visualization Tools:
o Histogram: Shows frequency distribution.
Example in IoT:
o Examining the temperature readings of a sensor over a
day to understand its typical range and variability.
2. Bivariate Analysis
This involves analyzing two variables together to understand how
they relate to each other.
Purpose:
o To detect correlations, dependencies, or causality.
o To identify clusters, patterns, or anomalies.
Techniques:
o Scatter plots: Visualize relationships between two numeric
variables.
o Correlation coefficient (e.g., Pearson’s r): Measures the
strength and direction of linear relationships.
o Cross-tabulations and heatmaps: For categorical or mixed-
type variables.
o Pair plots: For viewing multiple variable relationships at
once.
Example in IoT:
o Investigating if an increase in room occupancy (sensor
count) correlates with rising temperature or humidity.
o Analyzing how vibration frequency and machine
temperature together influence motor wear.
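A bivariate sketch with pandas and matplotlib: Pearson correlation and a scatter plot for two simulated variables (the occupancy and temperature values are made up).

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "occupancy": [1, 2, 3, 4, 5, 6, 7, 8],
    "temp_c":    [21.0, 21.4, 21.9, 22.5, 23.0, 23.4, 24.1, 24.6],
})

print(df["occupancy"].corr(df["temp_c"]))   # Pearson's r (close to +1 for this data)

df.plot.scatter(x="occupancy", y="temp_c")  # visual check of the relationship
plt.show()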
No-code/Low-code Tools:
o Microsoft Excel / Google Sheets
o Tableau / Power BI
IoT Dashboards:
o Grafana, ThingsBoard, AWS IoT Analytics
These tools help visualize and interpret complex sensor data
efficiently, making EDA accessible even to non-programmers.
7. Descriptive Statistics:-
What Are Descriptive Statistics?
Descriptive statistics are mathematical tools used to summarize and
describe the essential features of a dataset. Rather than examining
each individual data point, descriptive statistics allow you to extract
key information about the distribution, central tendency, spread, and
shape of data.
In IoT systems, where sensors continuously collect huge volumes of
data (e.g., temperature, voltage, speed, humidity), descriptive
statistics help engineers quickly understand data behaviour and make
informed decisions for monitoring, troubleshooting, or further
analysis.
1. Mean
The arithmetic average: the sum of all readings divided by the
number of readings. Sensitive to outliers.
2. Median
The middle value when the data is sorted in ascending or
descending order.
Resistant to outliers—unlike mean.
Example: In readings [20, 21, 22, 80], the median is 21.5 while
the mean is skewed due to 80.
Use in IoT: Helps when sensor data has occasional spikes or
errors.
3. Mode
The most frequent value in the dataset.
Useful for categorical data or sensor states (e.g., ON/OFF, door
open/closed).
Use in IoT: Understanding most common status or operating
condition.
4. Range
Formula:
Range = Maximum Value − Minimum Value
Example: If humidity readings are from 40% to 90%, range =
50%.
Use in IoT: Detecting variability in sensor readings or
operational conditions.
5. Variance
Measures the average of the squared differences from the
Mean.
Formula (for population):
σ² = Σ (xᵢ − μ)² / n
where μ = mean.
Use in IoT: High variance might indicate unstable sensors or
conditions.
6. Standard Deviation (SD)
The square root of the variance. It tells us how much the data
deviates from the mean in the original units.
Formula:
σ = √(σ²)
Interpretation:
Low SD: Data points are close to the mean (stable).
High SD: Data is spread out (noisy or variable).
Use in IoT: Monitoring equipment health—e.g., increasing SD in
vibration may indicate machine wear.
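A descriptive-statistics sketch with pandas, reusing the readings from the median example above:

import pandas as pd

temps = pd.Series([20, 21, 22, 80])

print("mean:", temps.mean())              # 35.75, pulled upward by the 80 outlier
print("median:", temps.median())          # 21.5, resistant to the outlier
print("mode:", temps.mode().tolist())     # every value ties here, so all are returned
print("range:", temps.max() - temps.min())
print("variance:", temps.var(ddof=0))     # population variance (divide by n)
print("std dev:", temps.std(ddof=0))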
1. Individual Deviation
Definition-
Deviation refers to how far each individual data point is from
the mean of the dataset. It forms the basis of standard
deviation and variance.
Types of Deviation:
Positive Deviation: Data point is greater than the mean.
Negative Deviation: Data point is less than the mean.
Use in IoT:
Helps identify unstable readings from sensors.
Can detect unusual patterns or outliers, e.g., a sudden drop in
pressure.
Example:
In a dataset of temperatures where the mean is 25°C, a sensor
reading of 30°C has a deviation of +5°C.
2. Standard Deviation (SD)
A derived measure that shows the average deviation of data
points from the mean.
A higher standard deviation means the data is more spread out.
A lower standard deviation means the data is more consistent
or stable.
B. Skewness
Skewness measures the asymmetry of a data distribution. It tells us
whether the values in a dataset are evenly distributed around the
mean or not.
1. Types of Skewness
Type                Skewness Value   Description
Symmetric           0                Distribution is balanced on both sides
Positively Skewed   > 0              Long tail on the right side (more low values)
Negatively Skewed   < 0              Long tail on the left side (more high values)
Graphical View:
Positive Skew: the peak sits toward the left, with a long tail stretching to the right.
Negative Skew: the peak sits toward the right, with a long tail stretching to the left.
C. Kurtosis
Kurtosis describes the "tailedness" or peakedness of a data
distribution. It shows whether the data has extreme outliers or is
more uniformly distributed.
1. Types of Kurtosis
Type Kurtosis Value Description
Mesokurtic ≈ 3 Normal bell-shaped curve
Leptokurtic > 3 Sharp peak, heavy tails – many outliers
Platykurtic < 3 Flat curve, light tails – fewer outliers
Note: Some software tools report "excess kurtosis", where 0 is
normal, > 0 is leptokurtic, and < 0 is platykurtic.
Graphical View:
Leptokurtic (Peaked): Many data points cluster near the mean
with more extreme outliers.
Platykurtic (Flat): Data is spread more evenly with fewer
outliers.
Use in IoT Systems:
Leptokurtic behaviour may indicate a normally functioning
system with occasional sharp failures (e.g., power surges or
fault spikes).
Platykurtic data suggests the system operates with moderate
variations and few extremes, useful in controlled environments.
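A sketch computing skewness and kurtosis with pandas (note that pandas' kurt() reports excess kurtosis, so 0 is the "normal" reference); the vibration readings are made up.

import pandas as pd

vibration = pd.Series([0.9, 1.0, 1.0, 1.1, 1.0, 0.9, 1.1, 5.0])  # one sharp spike

print("skewness:", vibration.skew())   # strongly positive: long right tail from the spike
print("kurtosis:", vibration.kurt())   # positive excess kurtosis: heavy tail / outlier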
A. Python Libraries
Python has become the de facto language for data analysis and
scientific computing due to its rich ecosystem of libraries:
1. pandas
Used for data manipulation and analysis.
Allows loading data from CSV, JSON, Excel, SQL databases, etc.
Provides data structures like DataFrame for organizing data.
Functions for handling missing values, grouping, filtering, etc.
2. numpy
Core library for numerical computing.
Offers support for multidimensional arrays, matrix operations,
and linear algebra.
Often used in combination with pandas for performance
efficiency.
3. matplotlib
A basic data visualization library.
Enables creation of plots like line graphs, bar charts,
histograms, and scatter plots.
Useful for quick visual inspection of sensor data patterns.
4. seaborn
Built on top of matplotlib; provides enhanced visualizations
with minimal code.
Automatically handles themes, color palettes, and statistical
plotting (e.g., correlation heatmaps, box plots).
Use in IoT: Once data is collected from devices, Python libraries
are used for cleaning, transforming, analyzing, and visualizing it
for further modeling or reporting.
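Putting the libraries together in one short sketch (column names and values are made up): pandas holds the readings, numpy supplies summary math, and seaborn/matplotlib draw the plot.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "temp_c":   [21.0, 22.5, 23.1, 22.8, 24.0, 23.5],
    "humidity": [40, 42, 45, 44, 47, 46],
})

print(np.round(df.mean(), 2))        # quick numerical summary per column

sns.heatmap(df.corr(), annot=True)   # correlation heatmap between the two sensors
plt.show()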
Summary Table
Category                Tools/Technologies
Programming Libraries   pandas, numpy, matplotlib, seaborn
IoT Devices             Raspberry Pi, Arduino, ESP32
Cloud Platforms         AWS IoT Core, Google Cloud IoT, Azure IoT
Streaming Systems       MQTT, Apache Kafka
Dashboards/DBs          Grafana, InfluxDB, Node-RED