Big Data
IOT
SOFTWARE ENGINEERING
[INDIVIDUAL ASSIGNMENT]
[FUNDAMENTAL BIG DATA]
[NAME: METASEBIA MERKIN]
[R/2079/13]
1. Discuss the types of data listed below using suitable examples.
1. Nominal Data
Definition: Nominal data represents categories without any inherent order or ranking. It is
qualitative in nature and is used for labeling variables without any quantitative value.
Examples:
Gender: Categories include Male, Female, Non-binary, etc. These categories cannot be ordered
in a meaningful way.
Colors: Examples include Red, Green, Blue, Yellow, etc. There is no ranking among these
colors; they are simply different categories.
Types of Pets: Categories might include Dog, Cat, Fish, Bird, etc. Again, there is no inherent
order among these categories.
2. Ordinal Data
Definition: Ordinal data represents categories with a meaningful order or ranking, but the
intervals between categories are not uniform or measurable. This type of data indicates a relative
position but does not quantify the differences between positions.
Examples:
Education Level: Categories such as High School < Bachelor’s < Master’s < PhD indicate a
clear ranking in terms of education attained, but the difference in years of education between
each level is not uniform.
Customer Satisfaction Ratings: A scale of Poor < Fair < Good < Excellent reflects a ranking of
satisfaction levels but does not quantify the exact difference in satisfaction.
Socioeconomic Status: Categories like Low Income < Middle Income < High Income provide
an order but do not specify the exact income ranges.
Use Cases: Ordinal data is commonly used in surveys where respondents rank their preferences
or experiences, such as customer satisfaction surveys or employee performance evaluations.
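A minimal sketch in Python (assuming the pandas library; the pet and education values are illustrative) of how the nominal and ordinal categories above can be encoded: pet types carry no order, while education levels are declared ordered so they can be ranked and compared.

import pandas as pd

# Nominal: labels only, no inherent order (illustrative pet data)
pets = pd.Categorical(["Dog", "Cat", "Fish", "Dog"], ordered=False)
print(pets.categories)  # categories as labels, no ranking among them

# Ordinal: categories with a meaningful order (illustrative education levels)
education = pd.Categorical(
    ["Bachelor's", "PhD", "High School", "Master's"],
    categories=["High School", "Bachelor's", "Master's", "PhD"],
    ordered=True,
)
print(education.min(), "<", education.max())  # High School < PhD
print(education > "Bachelor's")               # element-wise ordinal comparison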
3. Interval Data
Definition: Interval data contains meaningful intervals between values, allowing for the
measurement of differences, but it lacks a true zero point. This means that while you can add and
subtract values, you cannot meaningfully multiply or divide them.
Examples:
Temperature: Measured in Celsius or Fahrenheit (e.g., 20°C and 40°C). The difference between
these two temperatures (20°C) is meaningful; however, 0°C does not represent the absence of
temperature—it's just another point on the scale.
Calendar Years: The years 2000 and 2020 have a measurable difference of 20 years, but there is
no "true zero" year that signifies a complete absence of time.
IQ Scores: A person with an IQ score of 100 is not twice as intelligent as a person with a score of 50; the intervals are meaningful, but the scale has no true zero.
Use Cases: Interval data is often used in scientific measurements and social sciences where
differences matter, but ratios do not.
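A short plain-Python illustration (hypothetical temperature readings) of why differences are meaningful on an interval scale while ratios are not: converting the same readings to Fahrenheit preserves the physical size of the gap but changes the apparent ratio, because 0° is an arbitrary point on both scales.

# Two hypothetical readings on the Celsius (interval) scale
t1_c, t2_c = 20.0, 40.0

def to_f(c):
    # Same reading expressed on the Fahrenheit scale
    return c * 9 / 5 + 32

# Differences are meaningful on an interval scale
print(t2_c - t1_c)              # 20.0 degree gap in Celsius
print(to_f(t2_c) - to_f(t1_c))  # 36.0 degree gap in Fahrenheit (same physical gap)

# Ratios are NOT meaningful: 40 degC is not "twice as hot" as 20 degC,
# and the apparent ratio changes when the arbitrary zero point moves
print(t2_c / t1_c)              # 2.0 in Celsius
print(to_f(t2_c) / to_f(t1_c))  # about 1.53 in Fahrenheit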
4. Distance Data
Definition: Distance data measures the separation between two points using spatial or metric
systems. This type of data can be quantified in terms of physical distance or other measurable
separations.
Examples:
Geographic Distance: The distance between two cities (e.g., City A and City B) can be
measured in kilometers or miles (e.g., 200 km apart).
Travel Distance: The distance traveled during a trip can be quantified (e.g., 150 miles from New
York to Boston).
Network Latency: In computer networks, the time delay between two points can be measured
(e.g., latency of 50 ms between two servers).
Use Cases: Distance data is commonly used in geography, logistics, and network performance
analysis.
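A minimal Python sketch of quantifying geographic distance; the coordinates stand in for "City A" and "City B", and the haversine formula below assumes a spherical Earth with a mean radius of 6371 km.

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometres between two (latitude, longitude)
    # points, assuming a spherical Earth of mean radius 6371 km
    r = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Illustrative coordinates for "City A" (New York) and "City B" (Boston)
print(round(haversine_km(40.71, -74.01, 42.36, -71.06)), "km apart")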
5. Ratio Data
Definition: Ratio data is similar to interval data but possesses a meaningful zero point. This
allows for the calculation of ratios and comparisons between values.
Examples:
Weight: Measured in kilograms or pounds (e.g., 10 kg is twice as heavy as 5 kg). The zero point
(0 kg) indicates the absence of weight.
Height: Measured in centimeters or inches (e.g., a person who is 180 cm tall is twice as tall as
someone who is 90 cm).
Use Cases: Ratio data is widely used in fields such as economics, health sciences, and
engineering where absolute measurements are critical for analysis.
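A tiny plain-Python snippet (hypothetical weights) underlining that the true zero of a ratio scale makes both differences and ratios meaningful:

# Hypothetical weights in kilograms; 0 kg genuinely means "no weight"
weight_a, weight_b = 10.0, 5.0

print(weight_a - weight_b)  # 5.0 kg difference (meaningful, as on an interval scale)
print(weight_a / weight_b)  # 2.0 -> weight_a really is twice as heavy as weight_b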
2. List and discuss association rule mining algorithms and describe the method of
generating frequent item sets without candidate generation
Association rule mining is a fundamental technique used in data mining to discover interesting
relationships, patterns, or associations among a set of items in large datasets.
Here are some of the most common algorithms used for mining frequent item sets:
1. Apriori Algorithm
The Apriori algorithm is one of the earliest and most widely used algorithms for mining frequent item sets.
It relies on the principle that any subset of a frequent item set must also be a frequent item set.
How It Works:
The algorithm works in iterations, starting with single items and progressively combining them to form
larger item sets.
In each iteration, it generates candidate item sets from the previous iteration's frequent item sets and
then prunes these candidates based on a minimum support threshold.
This process continues until no more frequent item sets can be found.
Cons: Can be computationally expensive for large datasets due to candidate generation and multiple
database scans.
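A minimal, self-contained Python sketch of the iterative join-and-prune idea described above, using toy transactions and an arbitrary absolute support threshold (a full Apriori implementation would also discard candidates whose subsets are infrequent before counting support):

# Toy transaction database and an arbitrary absolute support threshold
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]
min_support = 2

def support(itemset):
    # Number of transactions containing every item of the candidate item set
    return sum(itemset <= t for t in transactions)

# Iteration 1: frequent single items
items = {i for t in transactions for i in t}
levels = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Iteration k: join the previous level's frequent item sets into k-item
# candidates, then prune them against the minimum support threshold
k = 2
while levels[-1]:
    candidates = {a | b for a in levels[-1] for b in levels[-1] if len(a | b) == k}
    levels.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level in levels:
    for itemset in level:
        print(sorted(itemset), "support =", support(itemset))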
2. ECLAT Algorithm
Overview: ECLAT is an efficient algorithm that employs a depth-first search strategy and represents data in a vertical format, which makes it suitable for dense datasets.
How It Works:
Instead of generating candidate item sets, ECLAT uses a vertical representation of the dataset where
each item is associated with a list of transactions (tidset) containing it.
The algorithm recursively intersects these tidsets to find frequent item sets.
Pros: Faster than Apriori for dense datasets due to its vertical representation; reduces the number of
database scans.
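A compact Python sketch of the vertical (tidset) representation and the recursive tidset intersections described above, again with toy transactions and an arbitrary absolute support threshold:

# Toy transactions; each item is mapped to the set of transaction ids containing it
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]
min_support = 2

tidsets = {}
for tid, t in enumerate(transactions):
    for item in t:
        tidsets.setdefault(item, set()).add(tid)

def eclat(prefix, items):
    # Depth-first search: extend `prefix` with each remaining item; the support
    # of the extended item set is the size of the intersected tidset
    for i, (item, tids) in enumerate(items):
        if len(tids) >= min_support:
            print(sorted(prefix | {item}), "support =", len(tids))
            suffix = [(other, tids & other_tids)
                      for other, other_tids in items[i + 1:]]
            eclat(prefix | {item}, suffix)

eclat(set(), sorted(tidsets.items()))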
3. FP-Growth Algorithm
Overview: FP-Growth is another powerful algorithm for mining frequent item sets that overcomes the limitations of Apriori by avoiding candidate generation altogether.
How It Works:
Step 1: Build an FP-Tree (Frequent Pattern Tree) that compresses the original transaction data while
retaining the item set association information.
The FP-Tree is constructed by first scanning the dataset to determine item frequencies and then scanning it again to build a tree structure in which each path represents a transaction.
Step 2: Use a divide-and-conquer approach to extract frequent patterns directly from the FP-Tree. This
involves recursively mining the tree by examining conditional patterns based on the items found in the
tree.
Pros: Significantly reduces computational cost and memory usage compared to Apriori; faster execution
time as it requires only two scans of the dataset.
Cons: More complex to implement and understand; may require substantial memory for storing the FP-Tree.
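For reference, a short usage sketch assuming the third-party mlxtend library (not mentioned above), which provides an fpgrowth function that operates on a one-hot encoded transaction table; the items and support threshold are illustrative.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Toy transactions; mlxtend expects a boolean one-hot encoded DataFrame
transactions = [["bread", "milk"], ["bread", "butter"],
                ["bread", "milk", "butter"], ["milk", "butter"]]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Mine frequent item sets with relative support >= 0.5, with no candidate generation
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))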
The FP-Growth algorithm exemplifies a method for generating frequent item sets without candidate
generation, which is a significant advantage over traditional methods like Apriori.
The first step involves scanning the dataset to determine the frequency of items. Items that meet the
minimum support threshold are retained, while those that do not are discarded.
The remaining items are then used to construct an FP-Tree, which organizes the transactions in a
compact form. Each node in the tree represents an item, and paths through the tree represent transactions.
Once the FP-Tree is constructed, the algorithm employs a divide-and-conquer strategy. It focuses on each item in the header table of the FP-Tree and constructs conditional pattern bases, which are sub-databases consisting of the prefix paths of the transactions that contain that specific item.
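A minimal Python sketch of the construction steps just described: one scan counts item frequencies, infrequent items are discarded, and each transaction's surviving items are inserted into a prefix tree in descending frequency order so that shared prefixes share a path. The recursive mining of conditional pattern bases is omitted to keep the sketch short.

from collections import Counter

# Toy transactions and an arbitrary absolute support threshold
transactions = [["bread", "milk"], ["bread", "butter"],
                ["bread", "milk", "butter"], ["milk", "butter"]]
min_support = 2

# Scan 1: count item frequencies and keep only frequent items
counts = Counter(item for t in transactions for item in t)
frequent = {item for item, c in counts.items() if c >= min_support}

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

root = Node(None)

# Scan 2: insert each transaction with its items ordered by descending frequency
# (ties broken alphabetically), so common prefixes collapse into shared paths
for t in transactions:
    ordered = sorted((i for i in t if i in frequent),
                     key=lambda i: (-counts[i], i))
    node = root
    for item in ordered:
        node = node.children.setdefault(item, Node(item))
        node.count += 1

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}: {child.count}")
        show(child, depth + 1)

show(root)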
Real-Time Analytics Platform (RTAP)
A Real-Time Analytics Platform (RTAP) is a system designed to process and analyze data streams as they are generated, enabling immediate insights and decision-making. These platforms are crucial for applications where timely data analysis is essential for operational efficiency and competitive advantage.
Features:
1. Processes Live Data: RTAPs continuously ingest and analyze data from various sources in real time,
allowing organizations to respond to events as they occur.
2. Scalability: RTAPs are designed to handle large volumes of data and can scale horizontally to accommodate increasing data loads without significant performance degradation.
3. Event-Driven Architecture: Many RTAPs utilize event-driven architectures that allow them to react to specific triggers or events in the data stream, enhancing responsiveness.
4. Integration with IoT and Streaming Data Sources: RTAPs often integrate seamlessly with Internet of Things (IoT) devices and other streaming data sources, enabling real-time monitoring and analytics.
5. Visualization Tools: Many RTAPs come equipped with dashboards and visualization tools that provide real-time insights into key performance indicators (KPIs) and metrics.
Examples:
Real-Time Stock Monitoring: Platforms that track stock prices and trading volumes in real time,
providing traders with immediate insights into market movements.
Fraud Detection in Financial Transactions: Systems that analyze transaction patterns in real time to
identify potentially fraudulent activities, allowing for immediate intervention.
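A small, self-contained Python sketch of the event-driven, sliding-window style of processing described above, flagging unusually large transactions in a simulated stream; the events, window length, and threshold are all illustrative, and a production RTAP would run this logic on a streaming engine (e.g., Kafka with Flink or Spark Streaming) rather than in a plain loop.

from collections import deque

# Simulated stream of (timestamp_seconds, account, amount) events - illustrative data
events = [(1, "acc-1", 40.0), (2, "acc-1", 55.0), (3, "acc-2", 30.0),
          (4, "acc-1", 900.0), (65, "acc-1", 35.0)]

WINDOW = 60         # sliding-window length in seconds (illustrative)
SPIKE_FACTOR = 5.0  # flag amounts this many times the recent average (illustrative)

recent = deque()    # (timestamp, amount) pairs currently inside the window

for ts, account, amount in events:
    # Evict events that have slid out of the window
    while recent and ts - recent[0][0] > WINDOW:
        recent.popleft()
    # React to each event as it arrives (event-driven rather than batch)
    if recent:
        avg = sum(a for _, a in recent) / len(recent)
        if amount > SPIKE_FACTOR * avg:
            print(f"ALERT t={ts}s {account}: {amount} vs recent average {avg:.1f}")
    recent.append((ts, amount))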
Online Analytical Processing (OLAP)
Online Analytical Processing (OLAP) is a category of software technology that enables analysts,
managers, and executives to gain insight into data through fast, consistent, interactive access in a variety
of ways. OLAP is primarily used for complex queries on historical data stored in data warehouses.
Features:
1. Slicing and Dicing Data: Users can "slice" the data to focus on specific dimensions or "dice" it to create a sub-cube of the data for more detailed analysis.
2. Aggregation and Summarization: OLAP systems aggregate data at various levels (e.g., daily, monthly, yearly), allowing users to view summary reports and detailed breakdowns.
3. Complex Calculations: OLAP supports complex calculations and data modeling, enabling users to derive insights through advanced analytical operations.
4. Pre-Calculated Data: OLAP systems often use pre-calculated aggregates and hierarchies to speed up query response times, making them efficient for analytical queries.
5. User-Friendly Interfaces: Many OLAP tools come with intuitive interfaces that allow users with minimal technical expertise to perform sophisticated analyses easily.
Examples:
Sales Analysis Over Time: A retail company using OLAP to analyze sales trends over different periods
(daily, weekly, monthly) across various regions and product lines.
Performance Comparison Across Regions: A multinational corporation utilizing OLAP tools to compare
performance metrics across different geographical regions or business units.
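A brief Python sketch, assuming pandas and a tiny hypothetical sales table, of the aggregation, slicing, and dicing operations described above; a real OLAP deployment would run equivalent operations against a cube or data warehouse rather than an in-memory DataFrame.

import pandas as pd

# Hypothetical sales fact table
sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "region":  ["East", "West", "East", "West", "East"],
    "product": ["A", "A", "B", "A", "B"],
    "revenue": [100, 150, 120, 90, 200],
})

# Aggregation / roll-up: total revenue per month and region
print(sales.pivot_table(values="revenue", index="month",
                        columns="region", aggfunc="sum"))

# Slice: fix one dimension (region == "East") and analyze the rest
print(sales[sales["region"] == "East"].groupby("month")["revenue"].sum())

# Dice: a sub-cube restricted on two dimensions (East region, February)
print(sales[(sales["region"] == "East") & (sales["month"] == "Feb")])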
4. Discuss the similarities and differences between the below-listed big data platforms.
Similarities and Differences Between Big Data Platforms
Similarities
1. Support for Big Data Processing: All listed platforms are designed to handle large datasets, enabling
organizations to process and analyze vast amounts of data efficiently.
2. Scalable Architectures: Each platform offers scalable solutions that can grow with the data needs of an
organization, allowing for increased storage and processing power as required.
3. Hadoop as a Base Technology: Most of these platforms incorporate Hadoop in some capacity.
Cloudera, Hortonworks, and IBM Open Platform are built on Hadoop, while AWS offers Hadoop
functionality through its EMR service, leveraging Hadoop's capabilities for distributed processing.
4. Distributed Storage and Processing: All platforms utilize distributed computing principles, enabling
parallel processing of data across multiple nodes to enhance performance and reliability.
5. Community and Ecosystem: Each platform has a robust ecosystem and community support, providing
users with resources, tools, and shared knowledge to maximize their use of big data technologies.
Differences
Feature | Hadoop | Cloudera | Amazon Web Services (AWS) | Hortonworks | IBM Open Platform