Big Data Analytics Lecture 1

The document outlines the MSDA9215: Big Data Analytics course, focusing on IoT and mobile data applications, taught by Temitope Oguntade. It covers course logistics, learning outcomes, key topics, and assessment methods, emphasizing the use of technologies like Hadoop, Spark, and MQTT for real-time data management. The course aims to equip students with skills in analyzing, visualizing, and interpreting Big Data to address real-world challenges.

MSDA9215: Big Data Analytics

April/May 2024

Week 1: Introduction & Big Data Ecosystem


Temitope Oguntade
Agenda

● Logistics and Introductions


● Course Description
● Course Learning Outcomes
● Student Assessment and Grading
● Course Schedule
● Academic Integrity & Well-being
Logistics

● Instructor: Temitope Oguntade


● Email: [email protected]
● TAs:
● Class Time: 08:30 - 17:00 Mon, 18:00 - 20:45 Wed
● Course Credit:
● Prerequisite: As may be determined
● Code Source:
https://github.com/toguntad/AUCA_IOT/blob/9bd866ac48a6d3071a20007e9ac5a64b3e9020a1/AUCA_IOT.ipynb
Brief
Temitope Oguntade is the CEO and Founder of Spiral
Systems, a startup dedicated to building innovative, AI-driven
metering solutions specifically designed for micro-utilities in
Sub-Saharan Africa. These solutions are tailored to be
cost-effective, addressing unique regional challenges and
improving utility management. He holds an M.Sc. in
Information Technology from Carnegie Mellon University and
a B.Eng in Electrical and Computer Engineering from the
Federal University of Technology Minna. With a career
spanning over 15 years, Temitope has developed a deep
expertise in entrepreneurship, distributed computing, cloud
computing, and technology management, driving significant
advancements in the tech sector.
It is your turn

Let’s start with 10 people

● Who are you?


● What MSc program?
● What’s your background?
● What are your expectations for this course?
● Experience with data analytics?
Course Description

The course is designed to deepen learners' expertise in Big Data Analytics, with a focus
on IoT and mobile data applications.

● Leverages foundational platforms like Hadoop and Spark alongside advanced timeseries databases to enhance Big Data processing capabilities.
● Emphasizes real-time data management using Message Brokers and MQTT, addressing the specific needs of IoT device data streams.
● Equips students with skills to analyze, visualize, and interpret Big Data using scalable machine learning algorithms, preparing them to deliver actionable insights in practical settings.
● Focuses on applying these technologies to real-world challenges, enhancing analytical and decision-making skills in IoT and mobile data contexts.
Course Learning Outcomes
Upon completion of this course, students will:

● Grasp the broad implications of Big Data Analytics in various sectors, emphasizing its impact
on IoT and mobile data.
● Demonstrate expertise in fundamental Big Data platforms such as Hadoop and Spark,
ensuring the capability to process and manage large datasets.
● Apply knowledge of diverse data storage systems, including key-value (KV) stores, document
databases, graph databases, and timeseries databases, to optimize data structuring and
retrieval in different Big Data scenarios.
● Utilize Message Brokers and the MQTT protocol to effectively manage real-time data flows
from IoT devices and mobile sources, addressing the unique requirements of streaming
data.
● Employ scalable machine learning algorithms to conduct comprehensive analytics on
multi-structured data from various platforms, drawing actionable insights from complex
datasets.
● Develop sophisticated data visualization skills to clearly present and interpret analytical
findings, catering to both general and specialized audiences including mobile analytics.
Key Topics
Introduction and Big Data Ecosystem

● Overview of Big Data Analytics: Scope and applications.


● Incorporating Message Brokers for Big Data applications.
● Understanding and using MQTT in the context of IoT
● Understanding timeseries databases in Big Data.

Data Storage Methods and Real-Time Data Handling

● Introduction to foundational platforms: Hadoop Ecosystem and Spark.


● In-depth exploration of HDFS and its role in the Hadoop ecosystem.
● Introduction to HBase: Concepts, architecture, and how it integrates with
Hadoop.
● Discussion on the types and characteristics of databases: KV stores,
document databases, and graph databases.
Key Topics (2)

Big Data Processing and Analytics

● Big Data processing frameworks: MapReduce and beyond.


● Analytics algorithms: Understanding the basics and applications.
● Parallel processing and scalability concerns in Big Data.
● Special focus on analytics for mobile and IoT Big Data.

Visualization and Real-world Applications

● Principles of data visualization in Big Data Analytics.


● Mobile issues and solutions in Big Data contexts.
● Case studies: Real-world Big Data challenges and solutions.
Assessment

Component Weight

Final 40%

Midterm 30%

Assignments | Quizzes | Participation 30%


What is Big Data Analytics
Overview: Big Data Analytics involves examining large and varied data sets to uncover hidden
patterns, unknown correlations, market trends, customer preferences, and other useful
information.

Mobile Big Data as a Subset: Advances in mobile computing, mobile internet, IoT, and
crowdsensing have intensified the generation of Mobile Big Data (MBD). This data comes from a
vast array of mobile and wireless devices, capturing diverse information from sensors carried
by moving objects, people (e.g., wearables), or vehicles.

Significance: The analysis of MBD and other Big Data sources provides critical insights that can
drive decision-making and operational efficiencies across multiple sectors.

Applications: Big Data Analytics is pivotal in fields like healthcare, finance, retail, and urban
planning, where large-scale, real-time data analysis can lead to impactful outcomes.
Examples & Use Cases
MBD/IOT
Agenda

● IOT Data
● Big Data vs Mobile Big Data (MBD)
● Characteristics of MBD
● Applications of MBD
● Mobile Big Data Analytics
● Summary
Internet of Things (IoT): Definition

● “A global infrastructure for the information society, enabling advanced services by interconnecting (physical and virtual) things based on existing and evolving interoperable information and communication technologies.” - https://www.itu.int/en/ITU-T/gsi/iot/Pages/default.aspx

● Data is exchanged between systems and devices using the Internet or other communications networks.

Source: https://www.techtarget.com/iotagenda/definition/Internet-of-Things-IoT
IoT Data

● IoT generates large amounts of data from multiple components.


● Data should be analysed to enable action/decision making.
● Data management and data mining are key technical and managerial
challenges in IoT development.
● Intrinsic properties of IoT data contribute to the two challenges above.
Properties of IoT Data

● Categorized into data generation, data quality, and data interoperability


● Data generation properties:
○ Velocity- generated at different rates
○ Scalability- large scale
○ Dynamics- changing location & environments, intermittent
connections
○ Heterogeneity- different data generators, different data formats
Properties of IoT Data (2)

● Data quality properties:


○ Incompleteness- need to find data sources to address
incompleteness
○ Semantics- need to inject semantics
Properties of IoT Data (3)

● Data interoperability properties:


○ Uncertainty- originating from different sources
○ Redundancy- multiple measures of same thing/metric
○ Ambiguity- means different things
Data types

● Textual (un/semi structured)


● Time series
● Geospatial
● Numerical
● Categorical
● Multimodal (image/video/audio)
Sources of IoT Data

● Sensors and actuators
● Users/crowd
● Social media
● Web
● Documents
● Graphs/ontologies
● Databases
● Expert/knowledge bases
Big Data

● Big data: data that is too big (volume), too fast (velocity), and too diverse
(variety)
● Other characteristics: veracity, variability.
● Data can be streaming or historical.
● IoT is both a source and sink of Big Data
Big Data

● Convergence of:
○ Internet technologies,
○ Mobile Computing,
○ Cloud Computing,
○ Big Data,
○ Data Analytics, and
○ IoT
● There are challenges that are peculiar to IoT Big Data Analytics
Characteristics of Classical/Traditional Big Data
IoT Big Data vs Big Data
MBD Characteristics: Multi-sensory
MBD Characteristics: Multi-dimensional
MBD Characteristics: Personalized
MBD Characteristics: Real-time
MBD Characteristics: Spatio-temporal
IoT Analytics aka MBD Analytics
Issues with IoT Analytics (1/3)
Issues with IoT Analytics (2/3)
Issues with IoT Analytics (3/3)
PROJECT
Problem Statement 1

● You have just been hired by the World Bank to monitor 50 million
energy meters in Sub-Saharan Africa.
● Each energy meter sends consumption data (I, V, Hz, kW, kWh,
timestamp) every 15 minutes.
● Your first task is to capture and save these records for efficient
retrieval and analysis.
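Before picking technologies, it helps to size the workload. The sketch below is a back-of-envelope estimate in Python, under the simplifying assumption that meters report on an evenly spread schedule (real fleets tend to cluster around the quarter-hour marks) and a guessed record size of 64 bytes, which is not given in the problem statement.

```python
# Back-of-envelope sizing for the metering workload described above.
METERS = 50_000_000
REPORT_INTERVAL_S = 15 * 60   # one reading every 15 minutes
RECORD_BYTES = 64             # assumed size per record (I, V, Hz, kW, kWh, timestamp)

msgs_per_second = METERS / REPORT_INTERVAL_S
msgs_per_day = METERS * (24 * 3600 // REPORT_INTERVAL_S)
bytes_per_day = msgs_per_day * RECORD_BYTES

print(f"~{msgs_per_second:,.0f} messages/second on average")  # ~55,556
print(f"~{msgs_per_day:,} records/day")                       # ~4,800,000,000
print(f"~{bytes_per_day / 1e9:,.0f} GB/day")                  # ~307 GB
```

Even at a modest 64 bytes per record, the fleet produces on the order of hundreds of gigabytes per day, which is why ingestion, partitioning, and compression dominate the design discussion that follows.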
Challenges? Issues?

Five points for each.


Let’s go!
Technologies & Tools

● Protocol: Message Queuing Telemetry Transport (MQTT)


● MQTT Comm API: Paho
● MQTT Broker: Mosquitto (https://mosquitto.org/)
● Headend(or Server) System
● Timeseries Database: TimescaleDB
System Architecture

Sensors → Message Broker → Headend/Controller → TimescaleDB
Device Layer

● IoT Devices (sensors, actuators, gateways)


○ Communicate with the MQTT Broker using the Paho MQTT
Comm API
○ Publish telemetry data to MQTT topics
○ Subscribe to control commands from the Headend System
MQTT Broker

● Mosquitto
○ Receives and forwards MQTT messages between devices and
the Headend System
○ Handles device connections, authentication, and topic
subscriptions
Headend System

● Server-side application
○ Subscribes to MQTT topics for device telemetry data
○ Processes and analyzes device data
○ Sends control commands to devices via MQTT
○ Stores time-series data in TimescaleDB
Timeseries Database

● TimescaleDB
○ Stores and manages large amounts of time-series data from
IoT devices
○ Optimized for efficient storage and querying of time-series data
MQTT Crash Course 1/4
● Protocol Basics:
○ MQTT is a lightweight, publish-subscribe network protocol that
transports messages between devices.
○ It is designed for connections with remote locations where a "small
code footprint" is required or network bandwidth is limited

● Publish-Subscribe Model:
○ Unlike traditional client-server models, MQTT uses a broker-based
publish-subscribe pattern.
○ In this model, clients (publishers) do not send messages directly to
other clients (subscribers). Instead, they publish messages to a broker,
which then distributes these messages to interested subscribers
based on the topic of the messages.
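The decoupling described above can be sketched with a toy in-memory broker in pure Python. The `TinyBroker` class is hypothetical, for illustration only; it is not part of MQTT or Paho, but it shows why publishers and subscribers never need to know about each other.

```python
# A toy, in-memory sketch of the broker-based publish-subscribe pattern.
from collections import defaultdict

class TinyBroker:
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # The publisher never talks to subscribers directly; the broker
        # fans the message out to everyone subscribed to the topic.
        for callback in self.subscribers[topic]:
            callback(topic, message)

broker = TinyBroker()
received = []
broker.subscribe("meters/energy", lambda t, m: received.append((t, m)))
broker.publish("meters/energy", "kWh=12.5")
broker.publish("meters/other", "ignored")   # no subscriber on this topic
print(received)   # [('meters/energy', 'kWh=12.5')]
```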
MQTT Crash Course 2/4
● Quality of Service Levels:
○ MQTT supports three levels of quality of service (QoS) to deliver
messages:
■ QoS 0: At most once delivery (fire-and-forget).
■ QoS 1: At least once delivery (ensures the message is delivered at
least once).
■ QoS 2: Exactly once delivery (ensures the message is delivered
one time only).
○ These levels allow for message delivery guarantees according to the
requirements of different applications.
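A consequence of QoS 1 worth internalizing: "at least once" means duplicates are possible, because the sender retransmits until it sees an acknowledgement. The deterministic simulation below (a pure-Python sketch, not real MQTT traffic) shows how a single lost ACK produces a duplicate at the receiver.

```python
# Deterministic simulation of QoS 1 ("at least once") delivery:
# a lost ACK forces a retransmission, so the receiver sees the
# same message twice.
def qos1_send(message, drop_first_ack=True):
    delivered = []          # what the receiver actually sees
    ack_lost = drop_first_ack
    acked = False
    attempts = 0
    while not acked:
        attempts += 1
        delivered.append(message)   # message reaches the receiver
        if ack_lost:
            ack_lost = False        # the first ACK is lost in transit
        else:
            acked = True            # the retransmission's ACK gets through
    return delivered, attempts

delivered, attempts = qos1_send("reading-42")
print(delivered)   # ['reading-42', 'reading-42'] -- duplicate under QoS 1
print(attempts)    # 2
```

QoS 2 adds a second handshake precisely to eliminate such duplicates, at the cost of extra round trips.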
MQTT Crash Course 3/4
● Topics and Wildcards:
○ MQTT uses topics to filter messages for each connected client. Clients
subscribe to a topic or topics, and messages are sent to clients based
on their subscriptions.
○ MQTT also supports wildcards in topic subscription, allowing for
greater flexibility in message delivery to subscribers.
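The two wildcards are `+` (matches exactly one topic level) and `#` (matches the remainder of the topic). The simplified matcher below illustrates the rules in pure Python; it is a sketch that ignores edge cases such as `$SYS` topics, and in practice the Paho library ships its own matcher.

```python
# Simplified MQTT topic matching: '+' matches one level, '#' the rest.
def topic_matches(subscription, topic):
    sub_levels = subscription.split("/")
    top_levels = topic.split("/")
    for i, s in enumerate(sub_levels):
        if s == "#":
            return True                  # '#' swallows everything after it
        if i >= len(top_levels):
            return False                 # subscription is longer than topic
        if s != "+" and s != top_levels[i]:
            return False                 # literal level must match exactly
    return len(sub_levels) == len(top_levels)

print(topic_matches("meters/+/energy", "meters/m1/energy"))   # True
print(topic_matches("meters/#", "meters/m1/voltage"))         # True
print(topic_matches("meters/+/energy", "meters/m1/voltage"))  # False
```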

● Security Features:
○ Although MQTT itself does not provide intrinsic security features, it
supports secure transmission via SSL/TLS.
○ Additional security measures such as user name/password
authentication and access control can be implemented at the broker
level.
MQTT Crash Course 4/4
● Use Cases:
○ Ideal for IoT applications, telemetry in low-bandwidth scenarios, and
any application where minimal network overhead and low power
consumption are required.
○ Commonly used in real-time analytics, monitoring of remote sensors,
controlling devices over networks, and various M2M
(machine-to-machine) contexts.

● Last Will and Testament:


○ A unique feature where a "last will" message is defined in case of an
unexpected disconnection of the client. This message is sent by the
broker to notify other clients about the disconnection.
Broker Setup Information
● Broker URL: tcp://3.138.185.79:1883
● QoS: 2
● Clean Session: true
● Username: auca
● Password: gishushu

Setup Python Env: Install Paho


● pip install paho-mqtt==1.6.1
Python Code: Publish 1/3
import paho.mqtt.client as mqtt

# MQTT settings
broker_url = "3.138.185.79"
broker_port = 1883
username = "auca"
password = "gishushu"
topic = "auca_class"
client_id = "my_mqtt_client"

# Callback when the client receives a CONNACK response from the server
def on_connect(client, userdata, flags, rc):
    if rc == 0:
        print("Connected successfully to broker")
    else:
        print(f"Failed to connect, return code {rc}\n")
        # If the client fails to connect then we should stop the loop
        client.loop_stop()
Python Code: Publish 2/3
# Create a new instance of the MQTT client with a specific client ID
client = mqtt.Client(client_id, clean_session=True)
client.on_connect = on_connect              # attach the callback function to the client
client.username_pw_set(username, password)  # set username and password
client.connect(broker_url, broker_port, 60) # connect to the broker

# Start the network loop in a separate thread
client.loop_start()

try:
    while True:
        message = input("Enter message to publish or type 'exit' to quit: ")
        if message.lower() == 'exit':
            break
        client.publish(topic, message, qos=2)
except KeyboardInterrupt:
    print("Program interrupted by user, exiting...")

Python Code: Publish 3/3

# Stop the network loop and disconnect
client.loop_stop()
client.disconnect()
Python Code: Subscribe 1/2
import paho.mqtt.client as mqtt

# MQTT settings
broker_url = "3.138.185.79"
broker_port = 1883
username = "auca"
password = "gishushu"
topic = "auca_class"
client_id = "my_mqtt_client_subscriber"

# Callback when the client receives a CONNACK response from the server
def on_connect(client, userdata, flags, rc):
    if rc == 0:
        print("Connected successfully to broker")
        # Subscribe to the topic once connected
        client.subscribe(topic, qos=2)
    else:
        print(f"Failed to connect, return code {rc}\n")

Python Code: Subscribe 2/2
# Callback for when a PUBLISH message is received from the server
def on_message(client, userdata, msg):
    print(f"Message received on topic {msg.topic}: {msg.payload.decode()}")

# Create a new instance of the MQTT client with a specific client ID
client = mqtt.Client(client_id, clean_session=True)
client.on_connect = on_connect              # attach the connection callback function to the client
client.on_message = on_message              # attach the message callback function to the client
client.username_pw_set(username, password)  # set username and password
client.connect(broker_url, broker_port, 60) # connect to the broker

# Run the network loop in the main thread (blocks until disconnect)
client.loop_forever()
Hypertables

● Hypertables are PostgreSQL tables with special features that make it easy to handle time-series data. Anything you can do with regular PostgreSQL tables, you can do with hypertables. In addition, you get the benefits of improved performance and user experience for time-series data.
● They automatically partition your data by time.
● In Timescale, hypertables exist alongside regular PostgreSQL tables. Use hypertables to store time-series data. This gives you improved insert and query performance, and access to useful time-series features. Use regular PostgreSQL tables for other relational data.
Hypertable Partitioning

● When you create and use a hypertable, it automatically partitions data by time, and optionally by space.
● Each hypertable is made up of child tables called chunks.
● Each chunk is assigned a range of time, and only contains data from that range. If the hypertable is also partitioned by space, each chunk is also assigned a subset of the space values.
Time Partitioning
● Each chunk of a
hypertable only holds
data from a specific time
range.
● When you insert data
from a time range that
doesn't yet have a chunk,
Timescale automatically
creates a chunk to store
it.
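The bucketing idea can be simulated in a few lines of Python. This is an illustration of the partitioning concept only: TimescaleDB does this internally, and the fixed epoch-based bucketing here (with an arbitrary origin date) is a simplification of its actual chunk management.

```python
# Simulation of time partitioning: with a 1-week chunk interval, every
# timestamp maps to the integer index of the week-long window it falls in.
from datetime import datetime, timezone, timedelta

CHUNK_INTERVAL = timedelta(weeks=1)
EPOCH = datetime(2024, 1, 1, tzinfo=timezone.utc)   # arbitrary chunking origin

def chunk_for(ts):
    # Integer index of the 1-week window this timestamp falls into.
    return (ts - EPOCH) // CHUNK_INTERVAL

a = chunk_for(datetime(2024, 1, 2, tzinfo=timezone.utc))    # same week as the epoch
b = chunk_for(datetime(2024, 1, 6, tzinfo=timezone.utc))    # still the same week
c = chunk_for(datetime(2024, 1, 10, tzinfo=timezone.utc))   # next week -> new chunk
print(a, b, c)   # 0 0 1
```

Rows a and b would land in the same chunk; row c triggers (or reuses) the next week's chunk, which is what makes dropping or compressing old data a cheap per-chunk operation.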
Create a Hypertable 1/2
● To create a hypertable, you need to create a standard PostgreSQL
table, and then convert it into a hypertable.
● Create a standard PostgreSQL table.
CREATE TABLE sensor (
    time      TIMESTAMPTZ NOT NULL,
    location  TEXT NOT NULL,
    device    TEXT NOT NULL,
    voltage   DOUBLE PRECISION NOT NULL,
    current   DOUBLE PRECISION NOT NULL,
    frequency DOUBLE PRECISION NOT NULL,
    power     DOUBLE PRECISION NOT NULL,
    energy    DOUBLE PRECISION NOT NULL
);
Create a Hypertable 2/2
● Convert the table to a hypertable. Specify the name of the table
you want to convert, and the column that holds its time values.
SELECT create_hypertable('sensor', 'time', chunk_time_interval => interval '1 week');

● Some possible chunk intervals:
- '1 hour'
- '1 day'
- '1 month'
- '1 year'
Hypertable Auto Compression
● Enable Compression
ALTER TABLE sensor SET (timescaledb.compress, timescaledb.compress_orderby = 'time');

● Add compression Policy


SELECT add_compression_policy('sensor', INTERVAL '7 days');
Hypertable Manual Compression
● Find Chunks (all chunks older than x minutes)
SELECT show_chunks('sensor', older_than => INTERVAL '3 minutes');

● Compress all chunks older than 3 minutes


SELECT compress_chunk(i)
FROM show_chunks('sensor', older_than => INTERVAL '3 minutes') AS i;
Timescale: Essential Commands 1/2
● Disk Size of a hypertable; both compressed and uncompressed chunks
SELECT hypertable_size('sensor');

● Retrieves the name and size of each hypertable present in the database
SELECT hypertable_name, hypertable_size(format('%I.%I', hypertable_schema, hypertable_name)::regclass)
FROM timescaledb_information.hypertables;

● Detailed view of the disk space usage of a hypertable. If running on a distributed hypertable, ordering by node_name shows the size distribution across different data nodes.

SELECT * FROM hypertable_detailed_size('sensor') ORDER BY node_name;


Timescale: Essential Commands 2/2
● Manually compresses a specific chunk identified by its internal name.
SELECT compress_chunk( '_timescaledb_internal._hyper_1_1_chunk');

● Provides compression statistics for all chunks of the specified hypertable.
SELECT * FROM chunk_compression_stats('sensor');

● Attempts to compress chunks of the 'sensor' hypertable that were created between three weeks ago and one week ago.
SELECT compress_chunk(i) FROM show_chunks('sensor', now() - interval '1 week', now() - interval '3 weeks') i;
TimescaleDB Installation

https://docs.timescale.com/self-hosted/latest/install/
