Unit II Notes

Data collection in IoT is crucial for effective data science, involving various devices generating vast amounts of data that must be accurately collected and processed. Key characteristics of IoT data include volume, velocity, variety, and veracity, with a typical architecture comprising edge, gateway, and cloud layers. Understanding data types, collection strategies, and challenges is essential for successful analysis and application in IoT systems.

Data Collection (Unit II)

1. Introduction to Data Collection in IoT:-
Data collection is one of the most vital phases in the data science
process, especially within the context of the Internet of Things (IoT). In
IoT systems, various devices, sensors, and machines continuously
generate data. This data needs to be collected, stored, and
preprocessed before any meaningful analysis can be done. Without
accurate and sufficient data, the results of any data science project may
be misleading or incomplete. Thus, understanding the sources, types,
and methods of data collection is crucial.

Reliable data collection supports applications such as:
 Real-time monitoring (e.g., industrial equipment, healthcare vitals)
 Predictive maintenance (e.g., detecting machine failures before they occur)
 Automated decision systems (e.g., smart traffic lights adjusting to
congestion)
 Business intelligence (e.g., retail analytics from customer foot traffic)

Key Characteristics of IoT Data Collection:-


1. Volume
IoT generates enormous data quantities - a single smart factory
can produce terabytes daily.
Example: A wind turbine with 100+ sensors generates
5 GB/hour.
2. Velocity
Data streams in real-time or near-real-time.
Example: Autonomous vehicles process lidar/camera data at
1-10 Gbps.
3. Variety
Includes structured (sensor readings), unstructured (video
feeds), and semi-structured (device logs) data.

4. Veracity
Data quality challenges due to:
1. Sensor malfunctions
2. Transmission errors
3. Environmental interference

IoT Data Collection Architecture:-


A typical pipeline involves three layers:
[Edge Layer] → [Gateway Layer] → [Cloud Layer]
(Sensors) (Preprocessing) (Storage/Analytics)

Edge Layer:
 Physical sensors/devices collecting raw data.
 Examples: Temperature sensors, accelerometers, cameras
Gateway Layer:
 Aggregates and preprocesses data
 May perform initial filtering/compression
Cloud Layer:
 Centralized storage and processing
 Enables large-scale analytics
Key Challenges in IoT Data Collection

Challenge | Description | Impact
Scalability | Handling billions of connected devices | Requires distributed architectures
Latency | Delay in time-sensitive applications | Critical for healthcare/autonomous systems
Security | Vulnerabilities in device networks | Risk of data breaches/spoofing
Energy Efficiency | Power constraints on edge devices | Affects deployment longevity
Data Heterogeneity | Multiple formats/protocols | Complicates integration

2. Getting to Know Your Data :-
It is essential to first understand the characteristics of the data
you are working with. This step, often referred to as data
understanding, is critical because it determines the kind of
analytical methods you can use, the preprocessing required,
and the assumptions you can safely make.

A. Types of Data Based on Structure


Understanding the structure of your data helps in choosing the
right tools and processing techniques:

1. Structured Data
Structured data refers to data that is organized in a fixed
schema—typically rows and columns—making it easy to store
and query using relational databases (like MySQL, PostgreSQL).

 Examples: Excel spreadsheets, SQL tables, CSV files.

 Characteristics:
o Clearly defined fields (columns).

o Easy to search and filter.

o Often numerical or categorical.

 Usage in IoT: Sensor logs stored in tabular format (e.g.,
timestamp, temperature, humidity).

IoT Applications:
 Industrial sensor readings (temperature, pressure).
 Smart meter energy consumption data.
 Inventory tracking in warehouses.
2. Unstructured Data
Unstructured data does not follow a predefined model or structure.
This makes it harder to store and analyze using traditional methods.

 Examples: Images, audio clips, videos, emails, PDF documents.

 Characteristics:
o Cannot be stored in traditional rows/columns easily.
o Requires advanced processing (e.g., NLP for text,
computer vision for images).

 Usage in IoT: Surveillance videos from security cameras, voice
commands in smart home systems.

IoT Applications:
 Surveillance camera footage
 Drone aerial imagery
 Voice recordings from smart assistants
 Maintenance technician notes

Processing Challenges:
 Requires computer vision for image/video
 NLP techniques for text analysis
 Higher computational needs

3. Semi-Structured Data
Semi-structured data lies between structured and unstructured data.
It has a flexible format but still contains markers or tags to separate
elements.

 Examples: JSON, XML, YAML files.

 Characteristics:
o Self-describing structure with tags or keys.
o Easier to parse programmatically.

 Usage in IoT: Data packets sent by IoT devices in JSON format
(e.g., { "device_id": "A101", "temp": 25.5 }).

IoT Applications:
 Device configuration files
 API communications
 Event logs with metadata
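
As a minimal sketch of how a semi-structured packet like the JSON example above can be turned into a structured row, the snippet below parses the payload with Python's standard json module. The helper name packet_to_row and the chosen column order are illustrative assumptions, not part of any particular IoT platform.

```python
import json


def packet_to_row(payload: str) -> tuple:
    """Parse a JSON device packet into a (device_id, temp) row.

    packet_to_row is a hypothetical helper; the field names follow
    the example packet in these notes.
    """
    record = json.loads(payload)  # the keys act as self-describing tags
    return (str(record["device_id"]), float(record["temp"]))


row = packet_to_row('{ "device_id": "A101", "temp": 25.5 }')
print(row)
```

Once packets are flattened this way, they can be stored in a tabular (structured) form alongside other sensor logs.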
4. Time-Series Data in IoT

Characteristics:
 Time-stamped values
 High write throughput
 Often append-only
Specialized Databases:
 InfluxDB
 TimescaleDB
 Prometheus

IoT Applications:
 Continuous monitoring (health vitals)
 Predictive maintenance
 Environmental sensing

Visualization Example:
Temperature Over Time
30°C | *
25°C | * *
20°C |* *
+-----------
Time
5. Spatial Data in IoT

Characteristics:
 Geographic coordinates
 Often combined with time data
 Requires GIS processing

Common Formats:
 GeoJSON
 Shapefiles
 KML (Keyhole Markup Language)

IoT Applications:
 Fleet tracking
 Smart city infrastructure
 Agricultural monitoring

Example Use Case:


Delivery Truck Route:
A→B→C
| | |
GPS GPS GPS
pings pings pings
B. Key Aspects to Understand About Your Data
Once the type and format of the data are identified, several other
critical aspects should be evaluated:

1. Data Types:
Understanding the data type of each field/attribute is fundamental
for selecting the correct analytical or preprocessing method.
 Numerical: Integers (e.g., count of devices) and floating-point
numbers (e.g., temperature = 23.4°C).
 Categorical: Represents labels or categories (e.g., status =
"active"/"inactive").
 Boolean: True/False values, often used for binary conditions.
 Date/Time: Timestamps that are important for trend analysis
and correlation in IoT.
Incorrect interpretation of data types can lead to errors in
calculations and visualizations.

2. Units of Measurement
Knowing the measurement units of your data is essential for
interpretation and consistency. For example:
 Temperature: Celsius, Fahrenheit, Kelvin.
 Distance: Meters, kilometers, miles.
 Speed: km/h, m/s.
Inconsistent or incorrect units can lead to incorrect conclusions.
Always standardize units across datasets if merging from multiple
sources.
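
The unit standardization described above can be sketched as follows. This is a minimal example that assumes every reading arrives as a (value, unit) pair and that Celsius is the target unit; the function name to_celsius is hypothetical.

```python
def to_celsius(value: float, unit: str) -> float:
    """Convert a temperature reading to Celsius (assumed target unit)."""
    unit = unit.strip().upper()
    if unit == "C":
        return float(value)
    if unit == "F":
        return (value - 32) * 5 / 9
    if unit == "K":
        return value - 273.15
    raise ValueError(f"Unknown temperature unit: {unit!r}")


# Hypothetical readings merged from three sources with different units.
readings = [(30, "C"), (86, "F"), (303.15, "K")]
standardized = [round(to_celsius(v, u), 2) for v, u in readings]
print(standardized)
```

Raising an error for unknown units, rather than passing the value through, keeps silently mislabeled data out of the merged dataset.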
3. Frequency of Data Collection
This refers to how often the data is being collected, which affects
storage needs and granularity of analysis.
 High-frequency data: Collected every few milliseconds (e.g.,
vibration data in machinery monitoring).
 Low-frequency data: Collected hourly, daily, or on-demand
(e.g., smart electricity meter data).
The frequency impacts:
 The size of the dataset.
 The ability to capture real-time trends.
 Resource requirements for processing.

4. Missing or Anomalous Values


IoT systems are prone to errors due to hardware issues, poor
connectivity, or power failures. As a result, you may find:
 Missing values: No data recorded for certain time intervals or
fields. Common causes:
o Technical issues during collection (e.g., sensors failing).
o Data entry errors.
o Non-response in surveys.
o Intentional skips based on prior answers (e.g., skip logic in forms).
 Anomalous values: Outliers or data points that don’t match
expected patterns (e.g., a temperature sensor reading -100°C in
a room).
Detecting and handling such values is critical:
 Techniques: Imputation (filling in missing values), filtering out
outliers, or interpolating data for missing timestamps.
 Tools: Python libraries like pandas, scikit-learn, and matplotlib
for detection and correction.
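
The detection and handling steps above can be sketched with pandas. This example assumes readings are expected once per minute and that the plausible range for a room sensor is -10°C to 50°C; both values, and the sample data, are illustrative assumptions.

```python
import pandas as pd

# Hypothetical room-temperature readings; 10:03 is missing and
# 10:02 holds an implausible value.
readings = pd.Series(
    [22.1, 22.3, -100.0, 22.4],
    index=pd.to_datetime(
        [
            "2025-05-01 10:00",
            "2025-05-01 10:01",
            "2025-05-01 10:02",
            "2025-05-01 10:04",
        ]
    ),
)

# Put the series on a regular 1-minute grid so gaps show up as NaN.
regular = readings.asfreq("1min")
missing = regular[regular.isna()].index

# Flag values outside the assumed plausible range.
lo, hi = -10.0, 50.0
out_of_range = (regular < lo) | (regular > hi)
anomalous = regular[out_of_range]

# Blank out anomalies, then fill every gap by time-based interpolation.
cleaned = regular.mask(out_of_range).interpolate(method="time")
print(cleaned)
```

Interpolation is only one option; for long gaps it may be safer to leave values missing than to invent a smooth curve.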

3. Types of Data :-
In data science and IoT systems, understanding the type of data
being collected is crucial. It influences everything from how data is
stored, processed, and analyzed to which algorithms and
visualization techniques are applicable. Broadly, data can be
categorized based on source and nature.

A. Based on Data Source


1. Primary Data
Primary data is the original data collected directly from the source. It
is gathered for a specific research or analysis purpose.

 In the IoT context, primary data is typically obtained from:


o Sensors (e.g., temperature, humidity, motion sensors)
o Actuators
o User interactions with devices (e.g., smart home
commands)
o Logs generated by IoT platforms
 Advantages:
o High accuracy and relevance for the specific problem.
o Up-to-date and tailored for the system's objectives.
 Disadvantages:
o Can be costly or time-consuming to collect.
o Requires devices, networks, and proper calibration.
 Example: A GPS tracker on a delivery truck providing real-time
location data.
2. Secondary Data
Secondary data refers to data that was collected previously for a
different purpose but can be reused for current analysis.

 Examples:
o Public datasets (e.g., from government agencies or open
data platforms)
o Historical data logs stored in cloud services
o Research databases
o Organizational reports and records

 Advantages:
o Saves time and effort in data collection.
o Often readily available and low-cost.
 Disadvantages:
o May not perfectly match current objectives.
o May be outdated or not fully reliable.

 Example: Using traffic volume data from the city’s transport
department to optimize smart traffic signals.

B. Based on Nature of Data

1. Quantitative Data (Numerical Data)


Quantitative data is data that can be measured and expressed in
numerical terms. It can be further divided into discrete and
continuous data.

 Discrete Data:
o Countable values (e.g., number of IoT devices online).
o No intermediate values between data points.

 Continuous Data:
o Can take any value within a range (e.g., temperature,
voltage).
o Supports mathematical operations like averaging,
standard deviation.

 Examples in IoT:
o Temperature readings from a sensor.
o Voltage levels in a smart grid.
o Speed of a moving vehicle from a GPS module.

 Use: Ideal for statistical analysis, time series analysis, and
predictive modeling.

2. Qualitative Data (Categorical Data)


Qualitative data represents non-numerical information and is used
to describe attributes, labels, or classifications.

 Types:
o Nominal Data: Categories with no natural order (e.g.,
device status: "on", "off", "standby").
o Ordinal Data: Categories with an implied order (e.g.,
feedback levels: "bad", "average", "good", "excellent").

 Examples in IoT:
o Device status: "Active", "Idle", "Disconnected"
o Sensor health: "Good", "Warning", "Critical"
o User preference: "Indoor Mode", "Eco Mode", "Turbo
Mode"

 Use:
o Important for classification models.
o Represented using bar charts, pie charts, and frequency
tables.

C. Importance of Understanding Data Types


Understanding the type of data is critical for:
 Choosing appropriate statistical methods:
o Mean, median for quantitative data.
o Frequency distribution for qualitative data.
 Selecting data visualization techniques:
o Line graphs for continuous data.
o Bar charts for categorical data.
 Designing machine learning models:
o Regression algorithms for numerical predictions.
o Classification algorithms for categorical outcomes.
 Performing data cleaning:
o Detecting outliers in numerical data.
o Identifying inconsistent labels in categorical data.

 Comparison Table

Data Type | Nature | Examples | Typical Use in IoT
Primary Data | Original | Real-time sensor readings | Monitoring, control, automation
Secondary Data | Pre-collected | Government reports, weather APIs | Context enrichment, historical analysis
Quantitative | Numerical | Temperature = 27.4°C, Humidity = 60% | Predictive modeling, trend analysis
Qualitative | Categorical | Device Status = “On”, User Review = “Bad” | Classification, status reporting

4. Data Collection Strategies :-

In IoT and data science, data collection is the foundational step
that fuels every analytical or machine learning task. In an IoT
ecosystem, data is generated continuously by numerous
interconnected devices. Choosing the right strategy for
collecting this data is vital, as it impacts the efficiency,
reliability, and accuracy of subsequent operations.

The nature of the data, the infrastructure, and the analytical
goals all influence which strategy is most suitable. Below are
the main strategies used in modern IoT-based systems:

A. Manual Data Entry


Manual data collection involves human input—people record or
enter data into a system using interfaces like spreadsheets, forms, or
terminals.

 Use Cases:
o Initial device registration.
o Logging maintenance events or technician observations.
o Emergency overrides or manual error corrections.

 Limitations in IoT:
o Human Error: Typos, omissions, or incorrect values.
o Scalability Issues: Not feasible when dealing with
thousands of sensors.
o Latency: Time-consuming and not real-time.
 Example: A technician manually entering sensor calibration
values after maintenance.
 Conclusion: While important in specific situations, manual data
entry is generally discouraged in large-scale IoT applications
due to its lack of reliability and efficiency.

B. Sensor-Based Data Collection


This is the most common and essential method in IoT environments.
Sensors are physical devices that detect and respond to
environmental inputs such as temperature, pressure, motion, light,
and humidity.
 How It Works:
o Sensors collect data at set intervals.
o Data is transmitted via wireless or wired networks to
gateways or cloud platforms.
o Data is logged, processed, or visualized in real time or
batch mode.

 Advantages:
o Real-time and automated.
o High accuracy and reliability.
o Minimal human intervention.

 Challenges:
o Device failure or battery depletion.
o Data overload if sampling rate is too high.
o Environmental interference.

 Examples:
o A smart thermostat collecting room temperature every 5
seconds.
o Soil moisture sensors in precision agriculture.
o Accelerometers in smartwatches detecting movement.

C. API Access (Application Programming Interfaces)


APIs allow applications to request data from external services
or devices. This is especially useful when integrating third-party
data sources.

 Use Cases:
o Weather data for smart agriculture.
o Location-based data from mapping services.
o Energy pricing data for smart grid systems.

 Advantages:
o Easy integration with other platforms.
o Reduces development time.
o Reliable and consistent formats.
 Challenges:
o Dependent on third-party uptime.
o Rate limits or data caps.
o Data licensing or usage fees.

 Examples:
o Using OpenWeatherMap API to retrieve real-time
temperature and humidity data.
o Getting traffic data for smart transportation systems via
Google Maps API.

D. Logs and Events


Modern devices and software systems generate logs—structured or
unstructured files that record events and activities.

 Sources of Logs:
o Operating systems
o Web servers
o Applications
o Network devices
o IoT platforms

 Use Cases:
o System diagnostics and debugging.
o Intrusion detection in cybersecurity.
o Performance monitoring.

 Advantages:
o Rich information source.
o Useful for understanding behavior over time.
o Can uncover hidden patterns through mining.

 Examples:
o A smart factory logging machine operation times and
faults.
o A security system recording motion detection events.
E. Streaming Data Collection
Streaming data refers to continuous, real-time data flows from
devices. In IoT, this is especially important for applications that need
to react instantly.

 Tools & Technologies:


o Apache Kafka
o MQTT (Message Queuing Telemetry Transport)
o Apache Flink, Spark Streaming

 Applications:
o Real-time traffic monitoring
o Live video surveillance
o Industrial automation systems
o Smart health monitoring (e.g., patient vitals)

 Advantages:
o Instant processing and response.
o Scalability to handle millions of data points per second.
o Fault tolerance and high throughput.
 Challenges:
o Requires robust infrastructure and high-speed networks.
o Complex system design.
o Need for data compression, buffering, and prioritization.

F. The 4Vs of Big Data in IoT Data Collection


When designing a data collection strategy in IoT, it’s essential to
consider the 4Vs of Big Data:

V | Description | Example in IoT
Volume | Large amounts of data generated by thousands of sensors. | A smart city collecting data from traffic lights, air quality monitors, and CCTV.
Velocity | Speed at which data is generated and must be processed. | Real-time fire detection from smoke sensors.
Variety | Different formats and types of data. | JSON data from sensors, images from drones, audio from smart assistants.
Veracity | Reliability and accuracy of data. | Eliminating false alerts from faulty devices or noisy signals.

Understanding these factors helps in designing a scalable, reliable,
and effective data collection system.

5. Data Preprocessing and Engineering for IoT Data:-
In IoT applications, raw data collected from sensors and devices
is often unstructured, inconsistent, and noisy. This raw data is
not suitable for direct use in analytics or machine learning
models. Therefore, data preprocessing and engineering are
essential to refine, clean, and structure the data so that it
becomes reliable, interpretable, and usable.
This step is crucial for the success of any data-driven IoT
system, whether it’s predictive maintenance in manufacturing,
smart healthcare monitoring, or real-time traffic analysis.

A. Data Preprocessing: Making Raw Data Usable

1. Cleaning
Purpose: To remove errors, redundancies, and inconsistencies
in the dataset.

 Common Issues in Raw IoT Data:


o Duplicate entries due to repeated transmissions.

o Missing values when sensors fail to capture data.

o Outliers or corrupted data caused by sensor malfunctions.

o Inconsistent units or formats, e.g., "30 C", "30°C", "86 F".

 Cleaning Techniques:
o Dropping duplicates

o Imputing missing values using:

 Mean/median (for numerical data)

 Most frequent value (for categorical data)

 Interpolation (for time-series)

o Standardizing formats and units

o Replacing or removing outliers

 Example:
o A temperature sensor logs: 22°C, 22°C, NaN, 99°C (error),
23°C → Cleaned to: 22, 22, 22.5, 23 (the 99°C outlier is removed
and the missing value is interpolated as 22.5).
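
The temperature example above can be reproduced with a short pandas sketch. The 50°C plausibility limit is an assumption chosen only to flag the 99°C reading; a real deployment would take the limit from the sensor's specification.

```python
import numpy as np
import pandas as pd

# Raw log from the example: one missing value and one corrupted reading.
raw = pd.Series([22.0, 22.0, np.nan, 99.0, 23.0])

MAX_PLAUSIBLE_C = 50.0  # assumed upper limit for this sensor

# Drop the outlier; NaN compares as False, so the gap is kept for now.
without_outliers = raw[~(raw > MAX_PLAUSIBLE_C)]

# Fill the remaining gap by linear interpolation between its neighbours.
cleaned = without_outliers.interpolate().tolist()
print(cleaned)
```

The two identical 22°C readings are kept because they are consecutive samples, not repeated transmissions of the same sample; true duplicates would normally be identified by timestamp or message ID.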

2. Filtering
Purpose: To remove irrelevant or noisy data that is not useful
for the specific analysis or application.
 Filtering Approaches:
o Based on time windows (e.g., keeping only last 7 days of
data).
o Based on value thresholds (e.g., discard values below a
sensor’s operational range).
o Based on event triggers (e.g., keep data only when motion
is detected).

 Benefits:
o Reduces data size and computational load.

o Improves focus and efficiency in analysis.

 Example:
o Discarding environmental sensor data when a device is
offline or in maintenance mode.
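
The three filtering approaches listed above can be sketched with pandas boolean masks. The column names, the reference time, and the -40°C to 85°C operational range are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sensor log with a timestamp, a reading, and a motion flag.
df = pd.DataFrame(
    {
        "ts": pd.to_datetime(
            ["2025-05-01", "2025-05-05", "2025-05-09", "2025-05-10"]
        ),
        "temp_c": [21.0, -80.0, 22.5, 23.0],
        "motion": [False, True, True, False],
    }
)

now = pd.Timestamp("2025-05-10")  # assumed reference time

# Time window: keep only the last 7 days.
recent = df[df["ts"] >= now - pd.Timedelta(days=7)]

# Value threshold: keep readings inside the assumed operational range.
in_range = recent[recent["temp_c"].between(-40.0, 85.0)]

# Event trigger: keep rows recorded while motion was detected.
triggered = in_range[in_range["motion"]]
print(triggered)
```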

3. Transformation
Purpose: To convert raw data into a consistent, interpretable, and
analyzable format.

 Common Transformations:
o Unit conversion: Fahrenheit to Celsius, meters to feet.
o Encoding categorical values for modeling (e.g., "on", "off"
→ 1, 0).
o Scaling or normalization: Bringing values into the same
range, e.g., [0,1].
o Date-time formatting for time-series analysis.
 Example:
o Converting timestamps from “01-05-2025 10:00:00”
(DD-MM-YYYY, assumed to be in UTC) to ISO 8601 format:
“2025-05-01T10:00:00Z”.
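
A minimal sketch of three of the transformations above (date-time formatting, categorical encoding, and min-max scaling), using only the standard library. It assumes the source timestamps are day-first and already in UTC; the helper names are hypothetical.

```python
from datetime import datetime, timezone


def to_iso_utc(stamp: str) -> str:
    """Convert 'DD-MM-YYYY HH:MM:SS' (assumed UTC) to ISO 8601."""
    parsed = datetime.strptime(stamp, "%d-%m-%Y %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def min_max_scale(values: list) -> list:
    """Scale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # avoid division by zero for constant data
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


STATE_CODES = {"on": 1, "off": 0}  # encoding from the notes

iso = to_iso_utc("01-05-2025 10:00:00")
scaled = min_max_scale([20.0, 25.0, 30.0])
encoded = [STATE_CODES[s] for s in ["on", "off", "on"]]
print(iso, scaled, encoded)
```

If the device clock is not in UTC, the timestamp should be localized to its real time zone and then converted, rather than simply relabeled.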

4. Aggregation
Purpose: To summarize and condense large datasets by combining
values over specified intervals.

 Why Aggregate?:
o IoT devices can generate data every second; aggregation
reduces volume while retaining trends.

 Techniques:
o Averaging: Mean temperature per hour.
o Summing: Total energy usage per day.
o Counting: Number of events per minute.
o Windowing: Time-based data grouping (sliding or fixed
windows).

 Example:
o A smart meter records power consumption every 5
seconds. Aggregation produces average hourly usage for
billing or analysis.
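
The smart meter example can be sketched with pandas resampling. The power values are made up for illustration: two hours of 5-second samples, with a constant 100 W in the first hour and 200 W in the second.

```python
import pandas as pd

# 1440 samples at 5-second spacing cover exactly two hours.
index = pd.date_range(
    "2025-05-01 00:00:00", periods=1440, freq=pd.Timedelta(seconds=5)
)
power_w = pd.Series([100.0] * 720 + [200.0] * 720, index=index)

# Fixed (non-overlapping) one-hour windows, averaged.
hourly_mean = power_w.resample(pd.Timedelta(hours=1)).mean()
print(hourly_mean)
```

Here 1440 raw points collapse into 2 hourly values, which shows how aggregation reduces volume while keeping the trend.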
5. Integration
Purpose: To merge data from multiple sources into a unified dataset.

 Challenges:
o Devices may use different formats or schemas.
o Timestamp alignment can vary by time zone or clock drift.
o Missing IDs or inconsistent labels.

 Integration Techniques:
o Schema matching and merging.
o Time alignment using interpolation or buffers.
o Entity resolution (matching devices across systems).

 Example:
o Combining GPS data from a vehicle, weather API data, and
fuel sensor data to analyze route efficiency.
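
One way to sketch the time-alignment step is pandas merge_asof, which pairs each row with the nearest-in-time row from another source. The column names, sample values, and 5-second tolerance below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical GPS pings and fuel readings with slightly offset clocks.
gps = pd.DataFrame(
    {
        "ts": pd.to_datetime(["2025-05-01 10:00:00", "2025-05-01 10:01:00"]),
        "vehicle_id": ["T1", "T1"],
        "lat": [18.52, 18.53],
    }
)
fuel = pd.DataFrame(
    {
        "ts": pd.to_datetime(["2025-05-01 10:00:02", "2025-05-01 10:00:58"]),
        "vehicle_id": ["T1", "T1"],
        "fuel_l": [40.0, 39.8],
    }
)

# Both frames must be sorted on the time key before an as-of merge.
merged = pd.merge_asof(
    gps.sort_values("ts"),
    fuel.sort_values("ts"),
    on="ts",
    by="vehicle_id",  # entity resolution: match rows of the same vehicle
    direction="nearest",
    tolerance=pd.Timedelta(seconds=5),  # assumed acceptable clock offset
)
print(merged)
```

Rows with no partner inside the tolerance get NaN instead of a distant, misleading match.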

B. Data Engineering: Ensuring Quality and Consistency


Beyond preprocessing, data engineering focuses on building the
architecture and pipelines that reliably transport, synchronize, and
validate IoT data.

1. Time Synchronization
IoT systems often involve multiple sensors and devices capturing data
at different intervals. Aligning their timestamps is crucial.
 Problems if not synchronized:
o Inaccurate event correlation.
o Misleading analysis.
o Data mismatches.
 Solutions:
o Use of NTP (Network Time Protocol).
o Timestamp correction based on drift detection.
o Central clock reference in distributed networks.

2. Data Integrity
Ensuring that the data has not been tampered with or
corrupted during transmission or storage is critical for trust and
reliability.

 Threats to Data Integrity:


o Network packet loss or delay.

o Malicious attacks (e.g., falsified sensor readings).

o Faulty storage or memory corruption.

 Integrity Techniques:
o Checksums and hashing.

o Secure transmission protocols (e.g., HTTPS, TLS, MQTT
with encryption).
o Redundancy and backups.
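
The checksum/hashing technique above can be sketched with Python's hashlib. The payload and helper names are hypothetical; the sender is assumed to transmit the digest alongside the message.

```python
import hashlib
import hmac


def checksum(payload: bytes) -> str:
    """Return the SHA-256 digest of a payload as a hex string."""
    return hashlib.sha256(payload).hexdigest()


def verify(payload: bytes, expected: str) -> bool:
    """Check a received payload against the digest sent with it."""
    return hmac.compare_digest(checksum(payload), expected)


sent = b'{"device_id": "A101", "temp": 25.5}'
digest = checksum(sent)

intact = verify(sent, digest)
corrupted = verify(b'{"device_id": "A101", "temp": 99.9}', digest)
print(intact, corrupted)
```

A plain hash detects accidental corruption. Against a malicious sender who can recompute the hash, a keyed scheme (for example, an HMAC with a shared secret) or a digital signature would be needed.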
6. Exploratory Data Analysis (EDA):-
Once IoT data has been collected, cleaned, and pre-processed,
the next crucial step is Exploratory Data Analysis (EDA). This
process involves using statistical summaries and visualizations
to explore the dataset, identify patterns, understand
relationships, and detect anomalies.
EDA is not just about looking at numbers—it’s about asking
questions and discovering insights from the data that can guide
future modelling or decision-making. In the context of IoT, this
might include understanding sensor behaviour over time,
detecting system failures, or identifying usage patterns.

Objectives of EDA
 Understand the underlying structure of the data.
 Identify important variables and their characteristics.
 Detect outliers, missing values, or errors.
 Explore correlations and dependencies between
variables.
 Generate hypotheses or assumptions for further analysis.
 Assist in feature selection and model design.

Types of EDA

1. Univariate Analysis
This involves analyzing one variable at a time to
understand its distribution, range, and central tendencies.

 Measures Used:
o Central tendency: Mean, median, mode.
o Spread: Variance, standard deviation, range, interquartile
range (IQR).
o Distribution Shape: Skewness (asymmetry), kurtosis
(tail heaviness, often described as peakedness).

 Visualization Tools:
o Histogram: Shows frequency distribution.

o Box plot: Highlights spread, median, and outliers.

o Bar chart: For categorical variables.

 Example in IoT:
o Examining the temperature readings of a sensor over a
day to see if they stay within normal limits.

2. Bivariate and Multivariate Analysis


These types of analysis are used to explore the relationship between
two (bivariate) or more (multivariate) variables.

 Purpose:
o To detect correlations and dependencies (correlation alone does
not establish causality).
o To identify clusters, patterns, or anomalies.

 Techniques:
o Scatter plots: Visualize relationships between two numeric
variables.
o Correlation coefficient (e.g., Pearson’s r): Measures the
strength and direction of linear relationships.
o Cross-tabulations and heatmaps: For categorical or mixed-
type variables.
o Pair plots: For viewing multiple variable relationships at
once.

 Example in IoT:
o Investigating if an increase in room occupancy (sensor
count) correlates with rising temperature or humidity.
o Analyzing how vibration frequency and machine
temperature together influence motor wear.
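
The occupancy-versus-temperature example can be sketched with pandas, whose corr method computes Pearson's r by default. The readings below are invented purely to illustrate the calculation.

```python
import pandas as pd

# Hypothetical paired readings from an occupancy counter and a
# temperature sensor in the same room.
df = pd.DataFrame(
    {
        "occupancy": [0, 2, 4, 6, 8],
        "temp_c": [21.0, 21.5, 22.1, 22.4, 23.0],
    }
)

# Pearson's r between two columns (bivariate).
r = df["occupancy"].corr(df["temp_c"])

# Full correlation matrix (multivariate); this is what a heatmap shows.
corr_matrix = df.corr()
print(round(r, 3))
print(corr_matrix)
```

A value of r near +1 indicates a strong positive linear relationship, but on its own it does not show that occupancy causes the temperature rise.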

Common EDA Visualizations


Visualization Type | Best For | IoT Example
Histogram | Distribution of a single variable | Temperature frequency across a week
Box Plot | Summary statistics and outliers | Pressure variation in a pipeline
Scatter Plot | Relationship between two numeric variables | Energy usage vs. appliance runtime
Line Plot | Time-series trends | Hourly light levels from a sensor
Heatmap | Correlation matrix | Interaction between multiple sensor readings
Bar Chart | Frequency of categorical variables | Status counts: “ON”, “OFF”, “Faulty”

Tools for EDA


 Programming Tools:
o Python (with libraries like Pandas, Matplotlib, Seaborn,
Plotly)
o R (with ggplot2, dplyr)

 No-code/Low-code Tools:
o Microsoft Excel / Google Sheets
o Tableau / Power BI

 IoT Dashboards:
o Grafana, ThingsBoard, AWS IoT Analytics
These tools help visualize and interpret complex sensor data
efficiently, making EDA accessible even to non-programmers.

Insights Derived from EDA


EDA can lead to:
 Discovery of missing data patterns (e.g., always missing at
night).
 Identification of outliers or faults (e.g., faulty sensor showing
999°C).
 Recognition of seasonal trends or daily cycles (e.g., traffic peaks
at 8 AM).
 Detection of data drifts (sensor readings changing slowly over
time).
 Foundation for predictive modeling or machine learning feature
selection.

Example Scenario in IoT EDA

Let’s consider a smart agriculture application where soil moisture,
temperature, and sunlight data are collected every 10 minutes.
 Univariate: Box plot shows temperature has a wide spread,
which may indicate sensor noise or drift.
 Bivariate: Scatter plot between soil moisture and sunlight
reveals inverse correlation.
 Multivariate: Correlation matrix shows that high temperature
and low moisture often occur together—potential stress
condition for plants.
These findings can guide the development of a predictive model for
automated irrigation.

6. Descriptive Statistics:-
What Are Descriptive Statistics?
Descriptive statistics are mathematical tools used to summarize and
describe the essential features of a dataset. Rather than examining
each individual data point, descriptive statistics allow you to extract
key information about the distribution, central tendency, spread, and
shape of data.
In IoT systems, where sensors continuously collect huge volumes of
data (e.g., temperature, voltage, speed, humidity), descriptive
statistics help engineers quickly understand data behaviour and make
informed decisions for monitoring, troubleshooting, or further
analysis.

Objectives of Descriptive Statistics


 To summarize large datasets with just a few numbers.
 To highlight important patterns, trends, or irregularities.
 To prepare data for further analysis, such as visualization or
modeling.
 To identify anomalies (e.g., faulty sensors or system errors).

Categories of Descriptive Statistics


Descriptive statistics are commonly divided into two main categories:
A. Measures of Central Tendency
These describe the central or typical value in a dataset.
1. Mean (Average)
 Formula:
$\text{Mean} = \frac{1}{n} \sum_{i=1}^{n} x_i$

where $x_i$ = individual values and $n$ = total number of values.


 Example: Average room temperature over 24 hours.
 Use in IoT: Helps summarize the average sensor reading over a
time period (e.g., average voltage per day).

2. Median
 The middle value when the data is sorted in ascending or
descending order.
 Resistant to outliers—unlike mean.
 Example: In readings [20, 21, 22, 80], the median is 21.5 while
the mean (35.75) is pulled upward by the outlier 80.
 Use in IoT: Helps when sensor data has occasional spikes or
errors.

3. Mode
 The most frequent value in the dataset.
 Useful for categorical data or sensor states (e.g., ON/OFF, door
open/closed).
 Use in IoT: Understanding most common status or operating
condition.

B. Measures of Dispersion (Spread)


These describe how much the data varies or is spread out.

4. Range
 Formula:
Range = Maximum Value − Minimum Value
 Example: If humidity readings are from 40% to 90%, range =
50%.
 Use in IoT: Detecting variability in sensor readings or
operational conditions.

5. Variance
 Measures the average of the squared differences from the
mean.
 Formula (for population):
$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$

where $\mu$ = mean.
 Use in IoT: High variance might indicate unstable sensors or
conditions.
6. Standard Deviation (SD)
 The square root of the variance. It tells us how much the data
deviates from the mean in the original units.
 Formula:
$\sigma = \sqrt{\sigma^2}$

 Interpretation:
 Low SD: Data points are close to the mean (stable).
 High SD: Data is spread out (noisy or variable).
 Use in IoT: Monitoring equipment health—e.g., increasing SD in
vibration may indicate machine wear.
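
The measures of central tendency and dispersion above can be checked with Python's standard statistics module. The sketch reuses the readings [20, 21, 22, 80] from the median example and the population formulas given for variance and standard deviation; the ON/OFF states are made up for the mode.

```python
import statistics

readings = [20, 21, 22, 80]  # sample from the median example

mean = statistics.mean(readings)
median = statistics.median(readings)
value_range = max(readings) - min(readings)

# Population variance and standard deviation (divide by n).
variance = statistics.pvariance(readings)
std_dev = statistics.pstdev(readings)

# Mode is most useful for categorical sensor states.
mode = statistics.mode(["ON", "ON", "OFF"])

print(mean, median, value_range, variance, round(std_dev, 2), mode)
```

The single spike of 80 moves the mean far from the median and inflates the variance, which is why the median is preferred when sensor data has occasional errors.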

C. Shape of the Distribution


Though not always mentioned, descriptive statistics can also include:
 Skewness: Indicates asymmetry of the distribution.
o Right-skewed: long tail to the right (e.g., occasional
spikes).
o Left-skewed: long tail to the left.
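 Kurtosis: Indicates how heavy the tails of the distribution are
(often loosely described as how peaked or flat it is).
o High kurtosis = sharp peak, heavier tails, more outliers.
o Low kurtosis = flatter curve, lighter tails, fewer outliers.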
 Use in IoT: Helps detect abnormal behavior patterns in data.
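
Skewness and kurtosis can be sketched from central moments. The functions below use the simple population (moment-based) definitions, which is an assumption; library implementations may apply small-sample corrections and give slightly different numbers.

```python
def central_moment(values: list, k: int) -> float:
    """k-th central moment: average of (x - mean) ** k."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** k for x in values) / len(values)


def skewness(values: list) -> float:
    """Moment-based skewness: m3 / m2 ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values: list) -> float:
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3 (0 for a normal curve)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]
right_tailed = [20, 21, 22, 80]  # one high spike creates a right tail

print(skewness(symmetric), skewness(right_tailed))
print(excess_kurtosis(symmetric))
```

The symmetric sample has zero skewness, while the sample with a single high spike has positive skewness, matching the right-skewed case described above.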

Example – Application in IoT Scenario


Let’s say you’re monitoring the temperature in a smart greenhouse
using 100 readings per day.
 Mean = 26.5°C → Average growing condition.
 Median = 26.3°C → Close to the mean, suggesting little skew.
 Mode = 26°C → Most frequently recorded temperature.
 Standard Deviation = 1.2°C → Low variation, good
environmental control.
 Range = 30°C − 24°C = 6°C → Acceptable operational window.

Why Descriptive Statistics Matter in Engineering and IoT


 Quick Diagnostics: Know if sensors are functioning within
expected parameters.
 Data Validation: Helps identify abnormal data before advanced
analysis.
 Baseline Understanding: Required before predictive analytics,
machine learning, or control automation.
 Decision Making: Engineers can set thresholds for alerts (e.g.,
trigger alarms if SD exceeds a limit).

7. Deviation, Skewness, and Kurtosis:-


In data analysis, it’s not enough to just know the average
(mean) or spread (standard deviation). To gain deep insights
into data distribution, especially for modeling and prediction
tasks in IoT systems, higher-level statistical measures such as
Deviation, Skewness, and Kurtosis are essential.
These metrics help analysts and engineers to understand the
shape, behavior, and reliability of datasets—especially when
working with time-series sensor data that might be noisy,
asymmetric, or anomalous.

A. Deviation
1. Individual Deviation
Definition:
Deviation refers to how far each individual data point is from
the mean of the dataset. It forms the basis of standard
deviation and variance.

Types of Deviation:
 Positive Deviation: Data point is greater than the mean.
 Negative Deviation: Data point is less than the mean.

Formula (Individual Deviation):

$\text{Deviation}_i = x_i - \mu$
 $x_i$ = individual data point
 $\mu$ = mean of the dataset

Use in IoT:
 Helps identify unstable readings from sensors.
 Can detect unusual patterns or outliers, e.g., a sudden drop in
pressure.

Example:
In a dataset of temperatures where the mean is 25°C, a sensor
reading of 30°C has a deviation of +5°C.
2. Standard Deviation (SD)
 A derived measure that shows the typical deviation of data
points from the mean, computed as the square root of the mean
squared deviation.
 A higher standard deviation means the data is more spread out.
 A lower standard deviation means the data is more consistent
or stable.
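The step from individual deviations to variance and standard deviation can be traced in a few lines. This is a small sketch with made-up temperature readings chosen so that the mean is 25°C, as in the example above; it uses the population form (dividing by N).

```python
import math

temps = [25.0, 24.0, 26.0, 30.0, 20.0]  # hypothetical readings in °C

mean = sum(temps) / len(temps)                            # μ
deviations = [x - mean for x in temps]                    # x_i − μ, one per reading
variance = sum(d ** 2 for d in deviations) / len(temps)   # σ², mean squared deviation
sd = math.sqrt(variance)                                  # σ, back in °C

for x, d in zip(temps, deviations):
    sign = "positive" if d > 0 else "negative" if d < 0 else "zero"
    print(f"{x:5.1f} °C  deviation {d:+.1f}  ({sign})")
print(f"variance = {variance:.2f}, sd = {sd:.2f}")
```

The deviations always sum to zero, which is why they are squared before averaging.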

B. Skewness
Skewness measures the asymmetry of a data distribution. It tells us
whether the values in a dataset are evenly distributed around the
mean or not.

1. Types of Skewness
Type                Skewness Value   Description
Symmetric           0                Distribution is balanced on both sides
Positively Skewed   > 0              Long tail on the right side (more low values)
Negatively Skewed   < 0              Long tail on the left side (more high values)

Graphical View:
 Positive Skew: the peak sits on the left and a long tail extends
to the right.
 Negative Skew: the peak sits on the right and a long tail extends
to the left.

Use in IoT Systems:


 A positively skewed distribution may occur when most sensors
report low values (e.g., low battery readings) but a few report
abnormally high values.
 A negatively skewed distribution might appear when most
devices operate near full capacity but a few sit on standby and
report much lower values.
Practical Example:

 In a smart building, temperature sensors might mostly read
between 20–22°C, but one faulty sensor occasionally spikes to
40°C. This creates a positive skew.
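The effect of a single faulty spike on skewness can be checked numerically. The sketch below uses the moment-based population skewness (the mean of the cubed z-scores); the readings and the function name `skewness` are illustrative assumptions, and statistical packages may apply a slightly different bias correction.

```python
import math

def skewness(values):
    """Population (moment-based) skewness: mean of cubed z-scores."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    if sd == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return sum(((x - mean) / sd) ** 3 for x in values) / n

normal_day = [20.0, 20.5, 21.0, 21.5, 22.0]   # hypothetical readings in °C
with_spike = normal_day + [40.0]              # same day plus one faulty 40 °C reading

print(f"without spike: {skewness(normal_day):+.2f}")
print(f"with spike:    {skewness(with_spike):+.2f}")
```

The symmetric readings give a skewness of zero, while the single high outlier pulls the value well above zero.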

C. Kurtosis
Kurtosis describes the "tailedness" or peakedness of a data
distribution. It shows whether the data has extreme outliers or is
more uniformly distributed.
1. Types of Kurtosis
Type Kurtosis Value Description
Mesokurtic ≈ 3 Normal bell-shaped curve
Leptokurtic > 3 Sharp peak, heavy tails – many outliers
Platykurtic < 3 Flat curve, light tails – fewer outliers
Note: Some software tools report "excess kurtosis", where 0 is
normal, > 0 is leptokurtic, and < 0 is platykurtic.

Graphical View:
 Leptokurtic (Peaked): Many data points cluster near the mean
with more extreme outliers.
 Platykurtic (Flat): Data is spread more evenly with fewer
outliers.
Use in IoT Systems:
 Leptokurtic behaviour may indicate a normally functioning
system with occasional sharp failures (e.g., power surges or
fault spikes).
 Platykurtic data suggests the system operates with moderate
variations and few extremes, useful in controlled environments.
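The two reporting conventions in the note above (kurtosis around 3 versus excess kurtosis around 0) differ only by a constant. A minimal sketch, assuming the population moment definition and made-up vibration data; `kurtosis`, `flat_spread`, and `spiky` are hypothetical names.

```python
def kurtosis(values, excess=False):
    """Population kurtosis: fourth central moment divided by squared variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    if variance == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    fourth_moment = sum((x - mean) ** 4 for x in values) / n
    k = fourth_moment / variance ** 2
    return k - 3.0 if excess else k   # excess kurtosis subtracts the normal value of 3

flat_spread = [1.0, 2.0, 3.0, 4.0, 5.0]   # evenly spread readings
spiky = [5.0] * 18 + [0.0, 10.0]          # mostly steady, two extreme readings

print(f"flat:  {kurtosis(flat_spread):.2f} (platykurtic, below 3)")
print(f"spiky: {kurtosis(spiky):.2f} (leptokurtic, above 3)")
print(f"spiky excess: {kurtosis(spiky, excess=True):.2f}")
```

When comparing numbers from different tools, it is worth checking which convention each one reports before labelling a dataset leptokurtic or platykurtic.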

Combined Interpretation Example – IoT Vibration Sensor


Let’s say we analyze vibration data from an industrial machine.
 Mean = 5 mm/s
 Standard Deviation = 1.2 mm/s → some variability
 Skewness = 1.5 → Positive skew (a few high vibration spikes)
 Kurtosis = 4.8 → Leptokurtic (indicating occasional extreme
events)
From this, we infer:
 The machine generally operates at normal levels.
 There are some rare spikes (probably during specific load cycles
or malfunctions).
 Preventive maintenance may be needed to avoid failures.

Why These Measures Matter in Data Science and Engineering


Metric      Purpose                          IoT Application
Deviation   Checks variability               Determines if sensor readings are stable
Skewness    Assesses symmetry                Identifies trends like battery drain or faulty spikes
Kurtosis    Identifies peakedness/outliers   Helps in anomaly detection and safety analysis

Understanding these measures guides feature selection, model
tuning, and system optimization in data-driven engineering projects.

8. Tools and Technologies:-


In the realm of IoT (Internet of Things) and data science,
collecting, managing, and preprocessing data effectively
requires the use of various hardware devices, software tools,
and platforms. These tools help to ensure that data flows
efficiently from sensors to analytical models or dashboards.

This section provides an overview of the most commonly used
tools and technologies in different stages of the IoT data
lifecycle: data acquisition, transmission, storage, and analysis.

A. Python Libraries
Python has become the de facto language for data analysis and
scientific computing due to its rich ecosystem of libraries:
1. pandas
 Used for data manipulation and analysis.
 Allows loading data from CSV, JSON, Excel, SQL databases, etc.
 Provides data structures like DataFrame for organizing data.
 Functions for handling missing values, grouping, filtering, etc.
2. numpy
 Core library for numerical computing.
 Offers support for multidimensional arrays, matrix operations,
and linear algebra.
 Often used in combination with pandas for performance
efficiency.
3. matplotlib
 A basic data visualization library.
 Enables creation of plots like line graphs, bar charts,
histograms, and scatter plots.
 Useful for quick visual inspection of sensor data patterns.
4. seaborn
 Built on top of matplotlib; provides enhanced visualizations
with minimal code.
 Automatically handles themes, color palettes, and statistical
plotting (e.g., correlation heatmaps, box plots).
Use in IoT: Once data is collected from devices, Python libraries
are used for cleaning, transforming, analyzing, and visualizing it
for further modeling or reporting.

B. IoT Devices for Data Collection


IoT data starts at the edge, i.e., on hardware devices connected
to sensors that capture physical world data.
1. Raspberry Pi
 A mini computer with USB, HDMI, and GPIO ports.
 Supports Linux OS and can run Python, making it great for data
collection and edge analytics.
 Supports camera modules, temperature sensors, motion
detectors, etc.
2. Arduino
 A microcontroller board with open-source hardware/software.
 Excellent for low-power, sensor-based applications.
 Commonly used in DIY electronics projects and smart device
prototyping.
3. ESP32
 A low-cost microcontroller with built-in Wi-Fi and Bluetooth.
 Ideal for wireless sensor networks and smart home devices.
 Supports various sensors and can publish data directly to cloud
via MQTT.
Note: These devices are programmed (often using C/C++ or
MicroPython) to read sensor data and transmit it to a server or
cloud.

C. Cloud Platforms for IoT


Once data is captured by edge devices, it’s transmitted to cloud
platforms for storage, processing, and remote control.
1. AWS IoT Core (Amazon Web Services)
 Connects billions of IoT devices to AWS cloud.
 Offers real-time message brokering, data ingestion, and rule
engines.
 Integrates with AWS services like Lambda (for processing) and
S3 (for storage).
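2. Google Cloud IoT Core
 Securely managed IoT devices and streamed data into Google
Cloud.
 Integrated with BigQuery, Dataflow, and AI/ML services for
advanced analysis.
 Note: Google retired this service in August 2023, so new
projects need a different route for getting device data into
Google Cloud.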
3. Microsoft Azure IoT Hub
 Provides bi-directional communication between IoT apps and
devices.
 Includes built-in support for security, diagnostics, and device
management.
 Supports machine learning pipelines via Azure ML.
Why Use Cloud?
IoT systems often involve large-scale deployments with
thousands of devices. Cloud platforms ensure scalability,
remote access, real-time alerts, and data visualization
dashboards.

D. Data Streaming Technologies


Many IoT applications—like smart factories, autonomous
vehicles, or real-time weather systems—require data to be
collected and processed in real-time.
1. MQTT (Message Queuing Telemetry Transport)
 A lightweight publish-subscribe protocol designed for IoT.
 Efficient for low-bandwidth or high-latency networks.
 Devices publish data to a broker (e.g., Mosquitto), and other
clients can subscribe to it.
2. Apache Kafka
 A distributed streaming platform used for handling large-scale,
real-time data pipelines.
 Data is stored in topics, and consumers read from these topics.
 Offers fault tolerance, scalability, and high throughput.
Use Case:
 A smart city may use MQTT for sensors sending real-time data
(e.g., traffic updates), while Kafka aggregates and sends this to
analytics dashboards and ML models.
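The publish-subscribe pattern described for MQTT can be illustrated without any network. The class below is a toy in-process stand-in, not a real MQTT client or broker; `ToyBroker`, the topic names, and the payloads are hypothetical, and matching is by exact topic only.

```python
from collections import defaultdict

class ToyBroker:
    """In-process stand-in for a message broker; not a real MQTT implementation."""

    def __init__(self):
        self._subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, payload):
        """Deliver payload to every subscriber of this exact topic; return the count."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, payload)
        return len(callbacks)

broker = ToyBroker()
received = []

# A dashboard subscribes to one traffic sensor's topic.
broker.subscribe("city/traffic/junction-12", lambda t, p: received.append((t, p)))

# Sensors publish without knowing who, if anyone, is listening.
delivered = broker.publish("city/traffic/junction-12", {"vehicles_per_min": 42})
ignored = broker.publish("city/air/station-3", {"pm25": 17})
print(delivered, ignored, received)
```

The publisher and subscriber never reference each other, only the topic. That decoupling is what lets a real deployment add consumers, such as a Kafka bridge, without touching the sensors.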

E. Other Notable Tools and Integrations


Tool       Purpose
Node-RED   Flow-based programming tool for wiring IoT devices and services
InfluxDB   Time-series database optimized for sensor data
Grafana    Visualization dashboard for monitoring sensor metrics
OpenCV     Computer vision library with Python bindings (e.g., motion detection via cameras)

Summary Table
Category                Tools/Technologies
Programming Libraries   pandas, numpy, matplotlib, seaborn
IoT Devices             Raspberry Pi, Arduino, ESP32
Cloud Platforms         AWS IoT Core, Google Cloud IoT, Azure IoT
Streaming Systems       MQTT, Apache Kafka
Dashboards/DBs          Grafana, InfluxDB, Node-RED

9. Challenges in IoT Data Collection:-


The Internet of Things (IoT) has revolutionized the way we
collect and analyze data by enabling a network of smart,
connected devices. However, collecting data in an IoT
environment is not without its technical and operational
challenges. Understanding these challenges is crucial for
building robust, secure, and efficient IoT data pipelines.

A. Data Quality Issues


1. Sensor Malfunction
 Sensors may fail temporarily or permanently, causing gaps or
corrupted readings.
 Calibration errors, battery failure, or environmental conditions
(e.g., humidity, heat) may distort sensor output.
2. Noisy or Inconsistent Data
 Due to electromagnetic interference, faulty wiring, or low-
quality sensors, data might include random errors, signal
fluctuations, or inconsistent formatting.
 This requires preprocessing steps like filtering and smoothing.
3. Missing Data
 Due to power loss, disconnections, or memory issues, some
devices might skip sending data for certain intervals.
 Missing data can severely affect time-series analysis and
forecasting models.
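One common way to handle short gaps before time-series analysis is linear interpolation between the nearest known readings. The sketch below assumes readings arrive at a fixed interval and that a missing reading is stored as `None`; the function name `fill_gaps` and the sample data are illustrative.

```python
def fill_gaps(readings):
    """Linearly interpolate None values that lie between two known readings."""
    filled = list(readings)
    known = [i for i, value in enumerate(filled) if value is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / span
            filled[i] = filled[left] + fraction * (filled[right] - filled[left])
    return filled   # gaps at the very start or end stay None

# Hypothetical temperature samples in °C with dropped readings.
samples = [24.0, None, None, 27.0, 27.5, None, 28.5, None]
filled = fill_gaps(samples)
print(filled)
```

Interpolation is only reasonable for short gaps in slowly changing signals. Long outages are usually better flagged than filled, and the trailing gap is deliberately left untouched here.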

B. High Volume and Velocity of Data


IoT systems can have thousands or millions of devices continuously
generating data. This introduces two significant problems:
1. Volume (Data Size)
 Devices such as video cameras, smart meters, or GPS trackers
generate huge volumes of data.
 Storing this data efficiently requires distributed storage systems
like Hadoop HDFS or cloud storage services.
2. Velocity (Speed of Incoming Data)
 Data arrives at high speed, especially in systems like smart
factories or autonomous vehicles.
 Requires real-time data processing engines such as Apache
Spark Streaming, Apache Flink, or Kafka Streams to keep up
with the data flow.
Challenge: Traditional relational databases often struggle with this
scale and speed. IoT requires scalable, high-throughput systems.

C. Network Limitations and Connectivity


1. Intermittent Connectivity
 Devices in remote or mobile locations may suffer from network
dropout, leading to delayed or lost data packets.
 Wireless protocols like Wi-Fi, ZigBee, LoRaWAN, or cellular
(4G/5G) have varying range, reliability, and data rates.
2. Latency
 In time-sensitive applications like remote healthcare or
autonomous driving, latency in data transmission can cause
system failures or delayed responses.
3. Bandwidth Constraints
 Transmitting high-resolution data (e.g., video feeds) can
overload network bandwidth, especially when many devices
communicate simultaneously.
Solution: Use of edge computing—processing data locally on the
device or gateway before sending it to the cloud.
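A simple form of edge computing is to summarize a window of readings on the device and send only the summary. This is a rough sketch assuming one reading per second and a 60-reading window; `summarize_window` and the JSON field names are hypothetical.

```python
import json
import statistics

def summarize_window(readings):
    """Reduce a window of raw readings to a small summary record."""
    return {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": round(statistics.mean(readings), 2),
    }

# Hypothetical minute of temperature readings in °C, one per second.
raw = [round(21.0 + 0.1 * (i % 5), 1) for i in range(60)]

raw_payload = json.dumps(raw)          # what would be sent without edge processing
summary = summarize_window(raw)
summary_payload = json.dumps(summary)  # what the edge device sends instead

print(f"raw: {len(raw_payload)} bytes, summary: {len(summary_payload)} bytes")
```

The saving grows with the window size, at the cost of losing the individual readings, so the window should match what the downstream analysis actually needs.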

D. Security and Privacy Concerns


As IoT devices often operate unattended and wirelessly, they are
susceptible to various security threats:
1. Data Tampering
 Attackers may intercept or alter data during transmission,
affecting the reliability of downstream analytics.
2. Unauthorized Access
 Weak authentication mechanisms can allow hackers to gain
control over IoT devices and misuse them.
3. Encryption Challenges
 Encrypting data in resource-constrained devices (like sensors or
microcontrollers) is difficult due to limited processing power
and memory.
4. Data Privacy
 Personal or sensitive data (e.g., location, health info) must
comply with regulations like GDPR or HIPAA.
Solutions include:
 Using TLS/SSL encryption
 Device-level authentication (e.g., certificates or tokens)
 Firewalls and network segmentation
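Data tampering can also be detected at the message level by attaching a keyed hash (HMAC) to each payload. This technique is not named in the list above and complements TLS rather than replacing it. The sketch uses Python's standard `hmac` module; the key, device name, and message layout are made up for illustration.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret; real deployments need secure key provisioning.
SHARED_KEY = b"demo-key-do-not-use-in-production"

def sign(payload, key):
    """Serialize the payload and attach an HMAC-SHA256 tag."""
    body = json.dumps(payload, sort_keys=True)
    tag = hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"body": body, "tag": tag}

def verify(message, key):
    """Recompute the tag and compare it in constant time."""
    expected = hmac.new(key, message["body"].encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, message["tag"])

message = sign({"device": "sensor-7", "temp_c": 25.3}, SHARED_KEY)

# An attacker changes the reading in transit but cannot recompute the tag.
tampered = dict(message, body=message["body"].replace("25.3", "35.3"))

print(verify(message, SHARED_KEY), verify(tampered, SHARED_KEY))
```

An HMAC protects integrity only. The reading is still sent in clear text, so confidentiality still depends on encryption such as TLS.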
E. Data Redundancy and Relevance
1. Redundant Data
 Devices often collect repetitive or unchanged data over time.
For instance, a temperature sensor reporting "25°C" every
second may be unnecessary.
2. Low-Value Data
 Some collected data might not contribute meaningfully to
analysis and only consumes bandwidth and storage.
3. Need for Smart Filtering
 Implementing logic to transmit only meaningful changes or
summarize the data at the edge (edge filtering) is critical to
optimize the pipeline.
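Edge filtering of this kind is often implemented as a deadband (report-by-exception) rule: a reading is transmitted only when it differs enough from the last one sent. A minimal sketch, assuming a threshold of 0.5°C; the function name and sample stream are illustrative.

```python
def report_by_exception(readings, threshold=0.5):
    """Keep (index, value) pairs that differ from the last sent value by >= threshold."""
    sent = []
    last_sent = None
    for index, value in enumerate(readings):
        if last_sent is None or abs(value - last_sent) >= threshold:
            sent.append((index, value))
            last_sent = value   # compare future readings against what was actually sent
    return sent

# Hypothetical temperature stream in °C, one reading per second.
stream = [25.0, 25.0, 25.1, 25.0, 25.6, 25.7, 25.6, 24.9, 25.0]
sent = report_by_exception(stream, threshold=0.5)
print(f"sent {len(sent)} of {len(stream)} readings: {sent}")
```

Comparing against the last transmitted value, rather than the previous raw reading, prevents a slow drift from slipping through unreported.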

F. Other Operational Challenges


Challenge              Description
Scalability            Managing increasing number of devices and maintaining consistent performance
Interoperability       Devices from different vendors may use different protocols/formats
Battery Life           Many sensors run on battery; excessive communication drains power
Time Synchronization   Ensuring that timestamps from different devices align for accurate temporal analysis
