Big Data

The document discusses different types of data, including nominal, ordinal, interval, distance, and ratio data, providing definitions and examples for each. It also covers association rule mining algorithms such as Apriori, ECLAT, and FP-Growth, explaining their methods and pros and cons, particularly focusing on FP-Growth's approach to generating frequent item sets without candidate generation. Additionally, it outlines Real-Time Analytics Platforms (RTAP) and Online Analytical Processing (OLAP), highlighting their features, examples, and the similarities and differences between various big data platforms.

JIGJIGA UNIVERSITY

IOT
SOFTWARE ENGINEERING

[INDIVIDUAL ASSIGNMENT]
[FUNDAMENTAL BIG DATA]

[NAME: METASEBIA MERKIN]

[R/2079/13]

1. Discuss the types of data listed below, using suitable examples.

1. Nominal Data

Definition: Nominal data represents categories without any inherent order or ranking. It is
qualitative in nature and is used for labeling variables without any quantitative value.

Examples:

Gender: Categories include Male, Female, Non-binary, etc. These categories cannot be ordered
in a meaningful way.

Colors: Examples include Red, Green, Blue, Yellow, etc. There is no ranking among these
colors; they are simply different categories.

Types of Pets: Categories might include Dog, Cat, Fish, Bird, etc. Again, there is no inherent
order among these categories.

2. Ordinal Data

Definition: Ordinal data represents categories with a meaningful order or ranking, but the
intervals between categories are not uniform or measurable. This type of data indicates a relative
position but does not quantify the differences between positions.

Examples:

Education Level: Categories such as High School < Bachelor’s < Master’s < PhD indicate a
clear ranking in terms of education attained, but the difference in years of education between
each level is not uniform.

Customer Satisfaction Ratings: A scale of Poor < Fair < Good < Excellent reflects a ranking of
satisfaction levels but does not quantify the exact difference in satisfaction.

Socioeconomic Status: Categories like Low Income < Middle Income < High Income provide
an order but do not specify the exact income ranges.

Use Cases: Ordinal data is commonly used in surveys where respondents rank their preferences
or experiences, such as customer satisfaction surveys or employee performance evaluations.

3. Interval Data

Definition: Interval data contains meaningful intervals between values, allowing for the
measurement of differences, but it lacks a true zero point. This means that while you can add and
subtract values, you cannot meaningfully multiply or divide them.

Examples:

Temperature: Measured in Celsius or Fahrenheit (e.g., 20°C and 40°C). The difference between
these two temperatures (20°C) is meaningful; however, 0°C does not represent the absence of
temperature—it's just another point on the scale.

Calendar Years: The years 2000 and 2020 have a measurable difference of 20 years, but there is
no "true zero" year that signifies a complete absence of time.

IQ Scores: A person with an IQ of 100 is not twice as intelligent as a person with an IQ of 50; the intervals are meaningful, but the scale lacks a true zero.

Use Cases: Interval data is often used in scientific measurements and social sciences where
differences matter, but ratios do not.

4. Distance Data

Definition: Distance data measures the separation between two points using spatial or metric
systems. This type of data can be quantified in terms of physical distance or other measurable
separations.

Examples:

Geographic Distance: The distance between two cities (e.g., City A and City B) can be
measured in kilometers or miles (e.g., 200 km apart).

Travel Distance: The distance traveled during a trip can be quantified (e.g., roughly 190 miles in a straight line from New York to Boston).

Network Latency: In computer networks, the time delay between two points can be measured
(e.g., latency of 50 ms between two servers).

Use Cases: Distance data is commonly used in geography, logistics, and network performance
analysis.
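As an illustration, the geographic distance between two coordinates can be computed with the haversine (great-circle) formula. A minimal Python sketch; the city coordinates below are approximate:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius ~ 6371 km

# New York to Boston: roughly 306 km (about 190 miles) by great circle.
d = haversine_km(40.7128, -74.0060, 42.3601, -71.0589)
```

Network latency, by contrast, is simply a measured time delay, but the same "distance between two points" framing applies.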

5. Ratio Data

Definition: Ratio data is similar to interval data but possesses a meaningful zero point. This
allows for the calculation of ratios and comparisons between values.

Examples:

Weight: Measured in kilograms or pounds (e.g., 10 kg is twice as heavy as 5 kg). The zero point
(0 kg) indicates the absence of weight.

Height: Measured in centimeters or inches (e.g., a person who is 180 cm tall is twice as tall as
someone who is 90 cm).

Income: Measured in currency (e.g., $50,000 is twice as much as $25,000). A salary of $0 indicates no income.

Use Cases: Ratio data is widely used in fields such as economics, health sciences, and
engineering where absolute measurements are critical for analysis.
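The interval-versus-ratio distinction above can be checked with simple arithmetic. A small illustrative sketch: on an interval scale the ratio changes when the unit changes, while on a ratio scale it does not:

```python
# Interval scale (Celsius): differences are meaningful, ratios are not.
morning_c, noon_c = 10.0, 20.0
diff_c = noon_c - morning_c   # 10 degrees warmer: a valid statement
ratio_c = noon_c / morning_c  # 2.0, but "twice as hot" is NOT valid...
ratio_f = (noon_c * 9 / 5 + 32) / (morning_c * 9 / 5 + 32)
# ...because the ratio does not survive a unit change: 68°F / 50°F = 1.36.

# Ratio scale (weight): a true zero makes ratios unit-independent.
a_kg, b_kg = 10.0, 5.0
ratio_kg = a_kg / b_kg                           # 2.0
ratio_lb = (a_kg * 2.20462) / (b_kg * 2.20462)   # still 2.0
```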

2. List and discuss association rule mining algorithms, and describe the method of generating frequent item sets without candidate generation.

Common Association Rule Mining Algorithms

Association rule mining is a fundamental technique used in data mining to discover interesting
relationships, patterns, or associations among a set of items in large datasets.

Here are some of the most common algorithms used for mining frequent item sets:

1. Apriori Algorithm

The Apriori algorithm is one of the earliest and most widely used algorithms for mining frequent item sets.
It relies on the principle that any subset of a frequent item set must also be a frequent item set.

How It Works:

The algorithm works in iterations, starting with single items and progressively combining them to form
larger item sets.

In each iteration, it generates candidate item sets from the previous iteration's frequent item sets and
then prunes these candidates based on a minimum support threshold.

This process continues until no more frequent item sets can be found.

Pros and Cons:

Pros: Simple to understand and implement; effective for smaller datasets.

Cons: Can be computationally expensive for large datasets due to candidate generation and multiple
database scans.
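The iterative generate-prune-count loop described above can be sketched as follows. This is a minimal illustration on a toy market-basket dataset, not an optimized implementation:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise mining: generate candidates, prune, count, repeat."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent individual items.
    items = {i for t in transactions for i in t}
    frequent = {frozenset({i}): c for i in items
                if (c := support(frozenset({i}))) >= min_support}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into k-candidates, then
        # prune any whose (k-1)-subset is infrequent (the Apriori property).
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Count step: one pass over the data per level.
        frequent = {c: n for c in candidates if (n := support(c)) >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Toy market-basket data (illustrative).
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers"},
]
freq = apriori(baskets, min_support=3)
```

With a support threshold of 3, {bread, beer} (support 2) is pruned, and any 3-itemset containing it is never even counted.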

2. ECLAT (Equivalence Class Transformation)

ECLAT is an efficient algorithm that employs a depth-first search strategy and represents data in a
vertical format, which makes it suitable for dense datasets.

How It Works:

Instead of generating candidate item sets, ECLAT uses a vertical representation of the dataset where
each item is associated with a list of transactions (tidset) containing it.

The algorithm recursively intersects these tidsets to find frequent item sets.

Pros and Cons:

Pros: Faster than Apriori for dense datasets due to its vertical representation; reduces the number of
database scans.

Cons: May require more memory due to storage of transaction IDs.
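The tidset-intersection idea can be sketched as below (same toy dataset as used for Apriori above; illustrative only):

```python
def eclat(transactions, min_support):
    """Depth-first mining over vertical tidsets: supports come from
    intersecting transaction-ID sets, not from rescanning the data."""
    # Vertical format: item -> tidset (IDs of transactions containing it).
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)

    results = {}

    def dfs(prefix, prefix_tids, remaining):
        for i, (item, tids) in enumerate(remaining):
            new_tids = (prefix_tids & tids) if prefix else tids
            if len(new_tids) >= min_support:
                itemset = prefix | {item}
                results[frozenset(itemset)] = len(new_tids)
                # Extend only with later items so each itemset is visited once.
                dfs(itemset, new_tids, remaining[i + 1:])

    dfs(frozenset(), set(), sorted(tidsets.items()))
    return results

baskets = [
    {"milk", "bread"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers"},
]
freq = eclat(baskets, min_support=3)
```

Note the memory trade-off mentioned above: every item carries its full tidset, which is exactly what the horizontal (Apriori) layout avoids.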

3. FP-Growth (Frequent Pattern Growth)

FP-Growth is another powerful algorithm for mining frequent item sets that overcomes the
limitations of Apriori by avoiding candidate generation altogether.

How It Works:

Step 1: Build an FP-Tree (Frequent Pattern Tree) that compresses the original transaction data while
retaining the item set association information.

The FP-Tree is constructed by scanning the dataset once to determine the frequency of items and then
creating a tree structure where each path represents a transaction.

Step 2: Use a divide-and-conquer approach to extract frequent patterns directly from the FP-Tree. This
involves recursively mining the tree by examining conditional patterns based on the items found in the
tree.

Pros and Cons:

Pros: Significantly reduces computational cost and memory usage compared to Apriori; faster execution
time as it requires only two scans of the dataset.

Cons: More complex to implement and understand; may require substantial memory for storing the FP-
Tree.

Generating Frequent Item Sets Without Candidate Generation

The FP-Growth algorithm exemplifies a method for generating frequent item sets without candidate
generation, which is a significant advantage over traditional methods like Apriori.

Steps Involved in FP-Growth:

1. Building the FP-Tree:

The first step involves scanning the dataset to determine the frequency of items. Items that meet the
minimum support threshold are retained, while those that do not are discarded.

The remaining items are then used to construct an FP-Tree, which organizes the transactions in a
compact form. Each node in the tree represents an item, and paths through the tree represent transactions.

2. Mining Frequent Patterns:

Once the FP-Tree is constructed, the algorithm employs a divide-and-conquer strategy. For each item in the header table of the FP-Tree, it constructs a conditional pattern base: a sub-database consisting of the prefix paths of the transactions that contain that specific item. A conditional FP-Tree is then built from each pattern base and mined recursively, appending items to the current suffix until no further frequent extensions exist.
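A simplified sketch of this two-scan, conditional-pattern-base approach follows. For brevity it keeps frequency-ordered prefix paths in a plain list rather than a compressed tree, so it mirrors the mining logic of FP-Growth but not the memory savings of a real FP-Tree:

```python
from collections import defaultdict

def fp_growth(transactions, min_support):
    """Frequent item sets without candidate generation (FP-Growth sketch)."""
    # Scan 1: global item frequencies; discard infrequent items.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    keep = {i for i, c in counts.items() if c >= min_support}

    # Scan 2: rewrite each transaction with its frequent items only, ordered
    # by descending global frequency -- the order paths take in the FP-Tree.
    paths = []
    for t in transactions:
        p = sorted((i for i in t if i in keep), key=lambda i: (-counts[i], i))
        if p:
            paths.append((p, 1))

    results = {}

    def mine(paths, suffix):
        # Count items in this conditional database.
        sub = defaultdict(int)
        for p, n in paths:
            for item in p:
                sub[item] += n
        for item, c in sub.items():
            if c >= min_support:
                itemset = suffix | {item}
                results[frozenset(itemset)] = c
                # Conditional pattern base: prefixes ending just before item.
                cond = [(p[:p.index(item)], n) for p, n in paths if item in p]
                mine(cond, itemset)

    mine(paths, frozenset())
    return results

baskets = [
    {"milk", "bread"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers"},
]
freq = fp_growth(baskets, min_support=3)
```

Because each conditional pattern base contains only items ranked before the current one, the recursion terminates and every frequent item set is emitted exactly once, with no candidate generation step.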

3. What are Real-Time Analytics Platforms (RTAP) and Online Analytical Processing (OLAP)?

Real-Time Analytics Platform (RTAP)

A Real-Time Analytics Platform (RTAP) is a system designed to process and analyze data streams as they
are generated, enabling immediate insights and decision-making. These platforms are crucial for
applications where timely data analysis is essential for operational efficiency and competitive advantage.

Features:

1. Processes Live Data: RTAPs continuously ingest and analyze data from various sources in real time,
allowing organizations to respond to events as they occur.

2. Supports Dynamic Decision-Making: By providing immediate insights, RTAPs facilitate quick decision-making processes, enabling businesses to adapt to changing conditions or opportunities swiftly.

3. Scalability: RTAPs are designed to handle large volumes of data and can scale horizontally to
accommodate increasing data loads without significant performance degradation.

4. Event-Driven Architecture: Many RTAPs utilize event-driven architectures that allow them to react
to specific triggers or events in the data stream, enhancing responsiveness.

5. Integration with IoT and Streaming Data Sources: RTAPs often integrate seamlessly with Internet of
Things (IoT) devices and other streaming data sources, enabling real-time monitoring and analytics.

6. Visualization Tools: Many RTAPs come equipped with dashboards and visualization tools that
provide real-time insights into key performance indicators (KPIs) and metrics.

Examples:

Real-Time Stock Monitoring: Platforms that track stock prices and trading volumes in real time,
providing traders with immediate insights into market movements.

Fraud Detection in Financial Transactions: Systems that analyze transaction patterns in real time to
identify potentially fraudulent activities, allowing for immediate intervention.
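A toy illustration of the real-time idea: a sliding window over a stream of transaction amounts, flagging any value far above the recent average. The window size and threshold here are illustrative assumptions, not a production fraud rule:

```python
from collections import deque

def detect_spikes(stream, window_size=3, factor=5):
    """Flag stream values far above the rolling average of recent events
    (a toy stand-in for real-time fraud detection)."""
    window = deque(maxlen=window_size)
    alerts = []
    for value in stream:
        # Compare each new event against the average of the current window.
        if len(window) == window_size and value > factor * (sum(window) / window_size):
            alerts.append(value)
        window.append(value)  # the window slides as new events arrive
    return alerts

# A $500 charge amid ~$20 transactions is flagged the moment it arrives.
alerts = detect_spikes([20, 25, 22, 24, 500, 23])
```

Real RTAPs (e.g., stream processors consuming from message queues) apply the same pattern at scale: state is kept per window, and each event is evaluated as it is ingested rather than in a later batch job.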

Online Analytical Processing (OLAP)

Online Analytical Processing (OLAP) is a category of software technology that enables analysts,
managers, and executives to gain insight into data through fast, consistent, interactive access in a variety
of ways. OLAP is primarily used for complex queries on historical data stored in data warehouses.

Features:

1. Multi-Dimensional Analysis: OLAP allows users to perform multi-dimensional analysis of business data, enabling them to view data from different perspectives (e.g., by time, geography, product).

2. Slicing and Dicing Data: Users can "slice" the data to focus on specific dimensions or "dice" it to
create a sub-cube of the data for more detailed analysis.

3. Aggregation and Summarization: OLAP systems aggregate data at various levels (e.g., daily,
monthly, yearly), allowing users to view summary reports and detailed breakdowns.

4. Complex Calculations: OLAP supports complex calculations and data modeling, enabling users to
derive insights through advanced analytical operations.

5. Pre-Calculated Data: OLAP systems often use pre-calculated aggregates and hierarchies to speed up
query response times, making them efficient for analytical queries.

6. User-Friendly Interfaces: Many OLAP tools come with intuitive interfaces that allow users with
minimal technical expertise to perform sophisticated analyses easily.

Examples:

Sales Analysis Over Time: A retail company using OLAP to analyze sales trends over different periods
(daily, weekly, monthly) across various regions and product lines.

Performance Comparison Across Regions: A multinational corporation utilizing OLAP tools to compare
performance metrics across different geographical regions or business units.
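The roll-up, slice, and dice operations described above can be illustrated on a tiny in-memory fact table (the data is hypothetical):

```python
from collections import defaultdict

# Toy fact table: (year, region, product, sales).
facts = [
    (2023, "East", "Laptop", 120), (2023, "East", "Phone", 80),
    (2023, "West", "Laptop", 95),  (2024, "East", "Laptop", 140),
    (2024, "West", "Phone", 60),   (2024, "West", "Laptop", 110),
]

def roll_up(facts, dims):
    """Aggregate (roll up) total sales over the chosen dimensions."""
    names = {"year": 0, "region": 1, "product": 2}
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[names[d]] for d in dims)
        totals[key] += row[3]
    return dict(totals)

# Slice: fix one dimension (year == 2024), then summarize by region.
sales_2024 = roll_up([r for r in facts if r[0] == 2024], ["region"])
# Dice: a sub-cube over two dimensions (year x product).
by_year_product = roll_up(facts, ["year", "product"])
```

Production OLAP engines differ mainly in that such aggregates are pre-computed and stored in cube or columnar structures, so these queries return in milliseconds over billions of rows.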

4. Discuss the similarities and differences between the big data platforms listed below.
Similarities and Differences Between Big Data Platforms

Similarities

1. Support for Big Data Processing: All listed platforms are designed to handle large datasets, enabling
organizations to process and analyze vast amounts of data efficiently.

2. Scalable Architectures: Each platform offers scalable solutions that can grow with the data needs of an
organization, allowing for increased storage and processing power as required.

3. Hadoop as a Base Technology: Most of these platforms incorporate Hadoop in some capacity.
Cloudera, Hortonworks, and IBM Open Platform are built on Hadoop, while AWS offers Hadoop
functionality through its EMR service, leveraging Hadoop's capabilities for distributed processing.

4. Distributed Storage and Processing: All platforms utilize distributed computing principles, enabling
parallel processing of data across multiple nodes to enhance performance and reliability.

5. Community and Ecosystem: Each platform has a robust ecosystem and community support, providing
users with resources, tools, and shared knowledge to maximize their use of big data technologies.

Differences

Definition
- Hadoop: an open-source framework for distributed storage (HDFS) and processing (MapReduce) of large datasets.
- Cloudera: a commercial distribution of Hadoop offering enterprise-grade tools.
- Amazon Web Services (AWS): a cloud platform offering services like EC2, S3, and EMR (Elastic MapReduce).
- Hortonworks: another Hadoop distribution, now merged with Cloudera.
- IBM Open Platform: IBM's Hadoop-based offering, focusing on enterprise data integration.

Core Focus
- Hadoop: batch processing with a focus on large-scale data storage and computation.
- Cloudera: enterprise-grade tools for security, performance, and management.
- AWS: cloud-based services with a focus on flexibility and integration with other AWS services.
- Hortonworks: community-driven development with an emphasis on open-source solutions.
- IBM Open Platform: enterprise-grade security and advanced analytics, with IBM Watson integration.

Support Model
- Hadoop: community support through open-source channels; no formal support structure.
- Cloudera: paid support options with comprehensive management interfaces.
- AWS: pay-as-you-go model with extensive documentation and community support.
- Hortonworks: community-driven support; merged with Cloudera for enhanced offerings.
- IBM Open Platform: enterprise support from IBM, including dedicated resources and consulting services.

Security Features
- Hadoop: basic security features; additional configuration required for enhanced security.
- Cloudera: advanced security features, including Kerberos authentication and role-based access control.
- AWS: security features vary by service; AWS provides various compliance certifications.
- Hortonworks: focused on community-driven security enhancements; security features merged from Cloudera.
- IBM Open Platform: strong enterprise security features integrated into the platform, with compliance support.
