Big Data
IOT
SOFTWARE ENGINEERING
[INDIVIDUAL ASSIGNMENT]
[FUNDAMENTAL BIG DATA]
[NAME: METASEBIA MERKIN]
[R/2079/13]
1. Discuss the types of data listed below using suitable examples.
1. Nominal Data
Definition: Nominal data represents categories without any inherent order or ranking. It is
qualitative in nature and is used for labeling variables without any quantitative value.
Examples:
Gender: Categories include Male, Female, Non-binary, etc. These categories cannot be ordered
in a meaningful way.
Colors: Examples include Red, Green, Blue, Yellow, etc. There is no ranking among these
colors; they are simply different categories.
Types of Pets: Categories might include Dog, Cat, Fish, Bird, etc. Again, there is no inherent
order among these categories.
2. Ordinal Data
Definition: Ordinal data represents categories with a meaningful order or ranking, but the
intervals between categories are not uniform or measurable. This type of data indicates a relative
position but does not quantify the differences between positions.
Examples:
Education Level: Categories such as High School < Bachelor’s < Master’s < PhD indicate a
clear ranking in terms of education attained, but the difference in years of education between
each level is not uniform.
Customer Satisfaction Ratings: A scale of Poor < Fair < Good < Excellent reflects a ranking of
satisfaction levels but does not quantify the exact difference in satisfaction.
Socioeconomic Status: Categories like Low Income < Middle Income < High Income provide
an order but do not specify the exact income ranges.
Use Cases: Ordinal data is commonly used in surveys where respondents rank their preferences
or experiences, such as customer satisfaction surveys or employee performance evaluations.
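A minimal sketch in Python (assuming the pandas library; the pet and education values are illustrative) of how the nominal and ordinal categories above can be encoded: pet types carry no order, while education levels are declared ordered so they can be ranked and compared.

import pandas as pd

# Nominal: labels only, no inherent order (illustrative pet data)
pets = pd.Categorical(["Dog", "Cat", "Fish", "Dog"], ordered=False)
print(pets.categories)  # categories as labels, no ranking among them

# Ordinal: categories with a meaningful order (illustrative education levels)
education = pd.Categorical(
    ["Bachelor's", "PhD", "High School", "Master's"],
    categories=["High School", "Bachelor's", "Master's", "PhD"],
    ordered=True,
)
print(education.min(), "<", education.max())  # High School < PhD
print(education > "Bachelor's")               # element-wise ordinal comparison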
3. Interval Data
Definition: Interval data contains meaningful intervals between values, allowing for the
measurement of differences, but it lacks a true zero point. This means that while you can add and
subtract values, you cannot meaningfully multiply or divide them.
Examples:
Temperature: Measured in Celsius or Fahrenheit (e.g., 20°C and 40°C). The difference between
these two temperatures (20°C) is meaningful; however, 0°C does not represent the absence of
temperature—it's just another point on the scale.
Calendar Years: The years 2000 and 2020 have a measurable difference of 20 years, but there is
no "true zero" year that signifies a complete absence of time.
IQ Scores: A person with an IQ score of 100 is not twice as intelligent as a person with a score of 50; the intervals are meaningful, but the scale has no true zero.
Use Cases: Interval data is often used in scientific measurements and social sciences where
differences matter, but ratios do not.
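A short plain-Python illustration (hypothetical temperature readings) of why differences are meaningful on an interval scale while ratios are not: converting the same readings to Fahrenheit preserves the physical size of the gap but changes the apparent ratio, because 0° is an arbitrary point on both scales.

# Two hypothetical readings on the Celsius (interval) scale
t1_c, t2_c = 20.0, 40.0

def to_f(c):
    # Same reading expressed on the Fahrenheit scale
    return c * 9 / 5 + 32

# Differences are meaningful on an interval scale
print(t2_c - t1_c)              # 20.0 degree gap in Celsius
print(to_f(t2_c) - to_f(t1_c))  # 36.0 degree gap in Fahrenheit (same physical gap)

# Ratios are NOT meaningful: 40 degC is not "twice as hot" as 20 degC,
# and the apparent ratio changes when the arbitrary zero point moves
print(t2_c / t1_c)              # 2.0 in Celsius
print(to_f(t2_c) / to_f(t1_c))  # about 1.53 in Fahrenheit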
4. Distance Data
Definition: Distance data measures the separation between two points using spatial or metric
systems. This type of data can be quantified in terms of physical distance or other measurable
separations.
Examples:
Geographic Distance: The distance between two cities (e.g., City A and City B) can be
measured in kilometers or miles (e.g., 200 km apart).
Travel Distance: The distance traveled during a trip can be quantified (e.g., 150 miles from New
York to Boston).
Network Latency: In computer networks, the time delay between two points can be measured
(e.g., latency of 50 ms between two servers).
Use Cases: Distance data is commonly used in geography, logistics, and network performance
analysis.
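A minimal Python sketch of quantifying geographic distance; the coordinates stand in for "City A" and "City B", and the haversine formula below assumes a spherical Earth with a mean radius of 6371 km.

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometres between two (latitude, longitude)
    # points, assuming a spherical Earth of mean radius 6371 km
    r = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Illustrative coordinates for "City A" (New York) and "City B" (Boston)
print(round(haversine_km(40.71, -74.01, 42.36, -71.06)), "km apart")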
5. Ratio Data
Definition: Ratio data is similar to interval data but possesses a meaningful zero point. This
allows for the calculation of ratios and comparisons between values.
Examples:
Weight: Measured in kilograms or pounds (e.g., 10 kg is twice as heavy as 5 kg). The zero point
(0 kg) indicates the absence of weight.
Height: Measured in centimeters or inches (e.g., a person who is 180 cm tall is twice as tall as
someone who is 90 cm).
Use Cases: Ratio data is widely used in fields such as economics, health sciences, and
engineering where absolute measurements are critical for analysis.
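A tiny plain-Python snippet (hypothetical weights) underlining that the true zero of a ratio scale makes both differences and ratios meaningful:

# Hypothetical weights in kilograms; 0 kg genuinely means "no weight"
weight_a, weight_b = 10.0, 5.0

print(weight_a - weight_b)  # 5.0 kg difference (meaningful, as on an interval scale)
print(weight_a / weight_b)  # 2.0 -> weight_a really is twice as heavy as weight_b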
2. List and discuss association rule mining algorithms and describe the method of
generating frequent item sets without candidate generation
Association rule mining is a fundamental technique used in data mining to discover interesting
relationships, patterns, or associations among a set of items in large datasets.
Here are some of the most common algorithms used for mining frequent item sets:
1. Apriori Algorithm
The Apriori algorithm is one of the earliest and most widely used algorithms for mining frequent item sets.
It relies on the principle that any subset of a frequent item set must also be a frequent item set.
How It Works:
The algorithm works in iterations, starting with single items and progressively combining them to form
larger item sets.
In each iteration, it generates candidate item sets from the previous iteration's frequent item sets and
then prunes these candidates based on a minimum support threshold.
This process continues until no more frequent item sets can be found.
Cons: Can be computationally expensive for large datasets due to candidate generation and multiple
database scans.
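A minimal, self-contained Python sketch of the iterative join-and-prune idea described above, using toy transactions and an arbitrary absolute support threshold (a full Apriori implementation would also discard candidates whose subsets are infrequent before counting support):

# Toy transaction database and an arbitrary absolute support threshold
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]
min_support = 2

def support(itemset):
    # Number of transactions containing every item of the candidate item set
    return sum(itemset <= t for t in transactions)

# Iteration 1: frequent single items
items = {i for t in transactions for i in t}
levels = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Iteration k: join the previous level's frequent item sets into k-item
# candidates, then prune them against the minimum support threshold
k = 2
while levels[-1]:
    candidates = {a | b for a in levels[-1] for b in levels[-1] if len(a | b) == k}
    levels.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level in levels:
    for itemset in level:
        print(sorted(itemset), "support =", support(itemset))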
2. ECLAT Algorithm
Overview: ECLAT is an efficient algorithm that employs a depth-first search strategy and represents data in a vertical format, which makes it suitable for dense datasets.
How It Works:
Instead of generating candidate item sets, ECLAT uses a vertical representation of the dataset where
each item is associated with a list of transactions (tidset) containing it.
The algorithm recursively intersects these tidsets to find frequent item sets.
Pros: Faster than Apriori for dense datasets due to its vertical representation; reduces the number of
database scans.
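A compact Python sketch of the vertical (tidset) representation and the recursive tidset intersections described above, again with toy transactions and an arbitrary absolute support threshold:

# Toy transactions; each item is mapped to the set of transaction ids containing it
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]
min_support = 2

tidsets = {}
for tid, t in enumerate(transactions):
    for item in t:
        tidsets.setdefault(item, set()).add(tid)

def eclat(prefix, items):
    # Depth-first search: extend `prefix` with each remaining item; the support
    # of the extended item set is the size of the intersected tidset
    for i, (item, tids) in enumerate(items):
        if len(tids) >= min_support:
            print(sorted(prefix | {item}), "support =", len(tids))
            suffix = [(other, tids & other_tids)
                      for other, other_tids in items[i + 1:]]
            eclat(prefix | {item}, suffix)

eclat(set(), sorted(tidsets.items()))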
3. FP-Growth Algorithm
Overview: FP-Growth is another powerful algorithm for mining frequent item sets that overcomes the limitations of Apriori by avoiding candidate generation altogether.
How It Works:
Step 1: Build an FP-Tree (Frequent Pattern Tree) that compresses the original transaction data while
retaining the item set association information.
The FP-Tree is constructed by first scanning the dataset to determine item frequencies and then scanning it again to build a tree structure in which each path represents a transaction.
Step 2: Use a divide-and-conquer approach to extract frequent patterns directly from the FP-Tree. This
involves recursively mining the tree by examining conditional patterns based on the items found in the
tree.
Pros: Significantly reduces computational cost and memory usage compared to Apriori; faster execution
time as it requires only two scans of the dataset.
Cons: More complex to implement and understand; may require substantial memory for storing the FP-Tree.
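For reference, a short usage sketch assuming the third-party mlxtend library (not mentioned above), which provides an fpgrowth function that operates on a one-hot encoded transaction table; the items and support threshold are illustrative.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Toy transactions; mlxtend expects a boolean one-hot encoded DataFrame
transactions = [["bread", "milk"], ["bread", "butter"],
                ["bread", "milk", "butter"], ["milk", "butter"]]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Mine frequent item sets with relative support >= 0.5, with no candidate generation
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))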
The FP-Growth algorithm exemplifies a method for generating frequent item sets without candidate
generation, which is a significant advantage over traditional methods like Apriori.
The first step involves scanning the dataset to determine the frequency of items. Items that meet the
minimum support threshold are retained, while those that do not are discarded.
The remaining items are then used to construct an FP-Tree, which organizes the transactions in a
compact form. Each node in the tree represents an item, and paths through the tree represent transactions.
Once the FP-Tree is constructed, the algorithm employs a divide-and-conquer strategy. It focuses on each item in the header table of the FP-Tree and constructs conditional pattern bases, which are sub-databases consisting of the prefix paths of the transactions that contain that specific item.
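A minimal Python sketch of the construction steps just described: one scan counts item frequencies, infrequent items are discarded, and each transaction's surviving items are inserted into a prefix tree in descending frequency order so that shared prefixes share a path. The recursive mining of conditional pattern bases is omitted to keep the sketch short.

from collections import Counter

# Toy transactions and an arbitrary absolute support threshold
transactions = [["bread", "milk"], ["bread", "butter"],
                ["bread", "milk", "butter"], ["milk", "butter"]]
min_support = 2

# Scan 1: count item frequencies and keep only frequent items
counts = Counter(item for t in transactions for item in t)
frequent = {item for item, c in counts.items() if c >= min_support}

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

root = Node(None)

# Scan 2: insert each transaction with its items ordered by descending frequency
# (ties broken alphabetically), so common prefixes collapse into shared paths
for t in transactions:
    ordered = sorted((i for i in t if i in frequent),
                     key=lambda i: (-counts[i], i))
    node = root
    for item in ordered:
        node = node.children.setdefault(item, Node(item))
        node.count += 1

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}: {child.count}")
        show(child, depth + 1)

show(root)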
Real-Time Analytics Platform (RTAP)
A Real-Time Analytics Platform (RTAP) is a system designed to process and analyze data streams as they are generated, enabling immediate insights and decision-making. These platforms are crucial for applications where timely data analysis is essential for operational efficiency and competitive advantage.
Features:
1. Processes Live Data: RTAPs continuously ingest and analyze data from various sources in real time,
allowing organizations to respond to events as they occur.
2. Scalability: RTAPs are designed to handle large volumes of data and can scale horizontally to accommodate increasing data loads without significant performance degradation.
3. Event-Driven Architecture: Many RTAPs utilize event-driven architectures that allow them to react to specific triggers or events in the data stream, enhancing responsiveness.
4. Integration with IoT and Streaming Data Sources: RTAPs often integrate seamlessly with Internet of Things (IoT) devices and other streaming data sources, enabling real-time monitoring and analytics.
5. Visualization Tools: Many RTAPs come equipped with dashboards and visualization tools that provide real-time insights into key performance indicators (KPIs) and metrics.
Examples:
Real-Time Stock Monitoring: Platforms that track stock prices and trading volumes in real time,
providing traders with immediate insights into market movements.
Fraud Detection in Financial Transactions: Systems that analyze transaction patterns in real time to
identify potentially fraudulent activities, allowing for immediate intervention.
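A small, self-contained Python sketch of the event-driven, sliding-window style of processing described above, flagging unusually large transactions in a simulated stream; the events, window length, and threshold are all illustrative, and a production RTAP would run this logic on a streaming engine (e.g., Kafka with Flink or Spark Streaming) rather than in a plain loop.

from collections import deque

# Simulated stream of (timestamp_seconds, account, amount) events - illustrative data
events = [(1, "acc-1", 40.0), (2, "acc-1", 55.0), (3, "acc-2", 30.0),
          (4, "acc-1", 900.0), (65, "acc-1", 35.0)]

WINDOW = 60         # sliding-window length in seconds (illustrative)
SPIKE_FACTOR = 5.0  # flag amounts this many times the recent average (illustrative)

recent = deque()    # (timestamp, amount) pairs currently inside the window

for ts, account, amount in events:
    # Evict events that have slid out of the window
    while recent and ts - recent[0][0] > WINDOW:
        recent.popleft()
    # React to each event as it arrives (event-driven rather than batch)
    if recent:
        avg = sum(a for _, a in recent) / len(recent)
        if amount > SPIKE_FACTOR * avg:
            print(f"ALERT t={ts}s {account}: {amount} vs recent average {avg:.1f}")
    recent.append((ts, amount))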
Online Analytical Processing (OLAP)
Online Analytical Processing (OLAP) is a category of software technology that enables analysts,
managers, and executives to gain insight into data through fast, consistent, interactive access in a variety
of ways. OLAP is primarily used for complex queries on historical data stored in data warehouses.
Features:
1. Slicing and Dicing Data: Users can "slice" the data to focus on specific dimensions or "dice" it to create a sub-cube of the data for more detailed analysis.
2. Aggregation and Summarization: OLAP systems aggregate data at various levels (e.g., daily, monthly, yearly), allowing users to view summary reports and detailed breakdowns.
3. Complex Calculations: OLAP supports complex calculations and data modeling, enabling users to derive insights through advanced analytical operations.
4. Pre-Calculated Data: OLAP systems often use pre-calculated aggregates and hierarchies to speed up query response times, making them efficient for analytical queries.
5. User-Friendly Interfaces: Many OLAP tools come with intuitive interfaces that allow users with minimal technical expertise to perform sophisticated analyses easily.
Examples:
Sales Analysis Over Time: A retail company using OLAP to analyze sales trends over different periods
(daily, weekly, monthly) across various regions and product lines.
Performance Comparison Across Regions: A multinational corporation utilizing OLAP tools to compare
performance metrics across different geographical regions or business units.
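A brief Python sketch, assuming pandas and a tiny hypothetical sales table, of the aggregation, slicing, and dicing operations described above; a real OLAP deployment would run equivalent operations against a cube or data warehouse rather than an in-memory DataFrame.

import pandas as pd

# Hypothetical sales fact table
sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "region":  ["East", "West", "East", "West", "East"],
    "product": ["A", "A", "B", "A", "B"],
    "revenue": [100, 150, 120, 90, 200],
})

# Aggregation / roll-up: total revenue per month and region
print(sales.pivot_table(values="revenue", index="month",
                        columns="region", aggfunc="sum"))

# Slice: fix one dimension (region == "East") and analyze the rest
print(sales[sales["region"] == "East"].groupby("month")["revenue"].sum())

# Dice: a sub-cube restricted on two dimensions (East region, February)
print(sales[(sales["region"] == "East") & (sales["month"] == "Feb")])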
4. Discuss the similarities and differences between the below-listed big data platforms.
Similarities and Differences Between Big Data Platforms
Similarities
1. Support for Big Data Processing: All listed platforms are designed to handle large datasets, enabling
organizations to process and analyze vast amounts of data efficiently.
2. Scalable Architectures: Each platform offers scalable solutions that can grow with the data needs of an
organization, allowing for increased storage and processing power as required.
3. Hadoop as a Base Technology: Most of these platforms incorporate Hadoop in some capacity.
Cloudera, Hortonworks, and IBM Open Platform are built on Hadoop, while AWS offers Hadoop
functionality through its EMR service, leveraging Hadoop's capabilities for distributed processing.
4. Distributed Storage and Processing: All platforms utilize distributed computing principles, enabling
parallel processing of data across multiple nodes to enhance performance and reliability.
5. Community and Ecosystem: Each platform has a robust ecosystem and community support, providing
users with resources, tools, and shared knowledge to maximize their use of big data technologies.
Differences
Feature | Hadoop | Cloudera | Amazon Web Services (AWS) | Hortonworks | IBM Open Platform