Q_Solve_Bigdata
2. (a) Using the table provided, find all frequent item sets with a minimum
support threshold of 60% using the Apriori mining algorithm.
Answer:
To solve this, list all unique items in the transactions and determine their frequency.
Then, identify item sets that meet or exceed the 60% support threshold. For
example, calculate the support for each item set like {a}, {b}, etc., and filter those
with support ≥ 60%.
3. (a) Find the mean, median, and a boxplot of the data provided.
Given data (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33,
32, 35, 35, 35, 36, 40, 45, 46, 52, 70.
Answer:
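The mean and median asked for here can be checked with a short script. This is only a sketch using Python's standard `statistics` module; the variable names are illustrative, and the values are copied exactly as printed in the question (they are sorted in code because the printed pair 33, 32 is not in increasing order).

```python
from statistics import mean, median

# Values exactly as listed in the question; sorted() guards against the
# one out-of-order pair (33, 32) in the printed list.
data = sorted([13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
               33, 32, 35, 35, 35, 36, 40, 45, 46, 52, 70])

n = len(data)               # number of observations
data_mean = mean(data)      # sum of values divided by n
data_median = median(data)  # average of the two middle values when n is even

print(f"n = {n}, mean = {data_mean:.2f}, median = {data_median}")
```

With 26 values, the median is the average of the 13th and 14th sorted values.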
Question_2_Solved
Question 1
(a) What are the fundamental data types associated with big data? Explain with example.
(4 marks)
(c) As a data scientist, what qualities are crucial? Explain clearly. (4 marks)
Question 2
(a) State the names of the commonly used measures of central tendency. How is the Median
Class defined in a grouped data distribution? (2 marks)
The commonly used measures of central tendency are the mean, median, and mode. In grouped
data, the Median Class is the class interval where the cumulative frequency first reaches or
exceeds half of the total frequency.
Frequency 2 12 22 8 6
(c) Suppose you have the data as follows: ... Find the boxplot. (4 marks)
To construct a boxplot:
1. Find the minimum, first quartile, median, third quartile, and maximum.
2. Plot these points on a number line and draw the box with whiskers.
1. The Box:
○ The box in the middle shows the "middle half" of the data, from the 25th
percentile (lower edge) to the 75th percentile (upper edge). This range is called
the Interquartile Range (IQR).
○ The line inside the box is the median, or the middle value of the data. Half of the
data points are above this line, and half are below it.
2. Whiskers:
○ The "whiskers" are the lines that extend out from the box. They show the range
of most of the data (anything within 1.5 times the IQR from the box).
○ The whiskers go to the lowest and highest values in this range.
3. Outliers:
○ Any data points that fall outside the whiskers are called outliers. In this plot,
there’s one outlier on the right side. It’s a value that’s quite a bit higher than the
rest of the data.
4. Overall Picture:
○ This boxplot gives a quick view of how spread out the data is, where most values
fall, and if there are any unusually high or low values (outliers).
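The construction steps above can be sketched in code. Because the data for this part is elided in the question, the sketch borrows the 26 values listed under question 3(a) earlier in these notes; that choice is an assumption. Quartiles are computed as medians of the lower and upper halves, which is one common textbook convention (other quartile conventions give slightly different values).

```python
from statistics import median

# ASSUMPTION: data borrowed from question 3(a) earlier in these notes.
data = sorted([13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
               33, 32, 35, 35, 35, 36, 40, 45, 46, 52, 70])

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    values = sorted(values)
    n = len(values)
    half = n // 2
    lower = values[:half]          # values below the median position
    upper = values[half + n % 2:]  # values above the median position
    return values[0], median(lower), median(values), median(upper), values[-1]

minimum, q1, q2, q3, maximum = five_number_summary(data)
iqr = q3 - q1

# Whiskers reach the most extreme values within 1.5 * IQR of the box.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inside = [v for v in data if low_fence <= v <= high_fence]
whiskers = (inside[0], inside[-1])
outliers = [v for v in data if v < low_fence or v > high_fence]

print("Q1, median, Q3:", q1, q2, q3)
print("whiskers:", whiskers, "outliers:", outliers)
```

For this data the only value beyond the upper fence is 70, which matches the single right-hand outlier described above.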
Question 3
(a) Find all frequent item sets with minimum support threshold value of 2 using Apriori
mining algorithm for TABLE-1. (7 marks)
1. Generate Item Sets - List all single items and count their occurrences.
2. Filter Item Sets - Keep items with occurrences ≥ 2.
3. Generate Pairs - Repeat for pairs, triples, and so on.
4. Frequent Item Sets - The item sets with support ≥ 2.
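The four steps above can be sketched as a small level-wise routine. TABLE-1 is not reproduced in this answer, so the transactions below are hypothetical toy data, and the function name `apriori` and its parameters are illustrative rather than taken from any library.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # Step 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1

    # Step 2: keep items whose count meets the threshold.
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    # Step 3: grow to pairs, triples, and so on.
    k = 2
    while current:
        # Join: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t)
            if support >= min_count:
                current[c] = support
        frequent.update(current)  # Step 4: collect the frequent itemsets.
        k += 1
    return frequent

# Hypothetical toy transactions (NOT the exam's TABLE-1).
toy = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "C"}]
frequent = apriori(toy, min_count=2)
for itemset, support in sorted(frequent.items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

In the toy data, {A, B, C} is never counted because its subset {A, B} is not frequent; that pruning is what keeps the candidate lists small.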
Question 4(a)
Mr. Hasan is a shopkeeper... Determine the relationship between temperature and the
amount of cold drinks purchased. (5 marks)
Day  Temperature (°C)  Cold drinks (liters)
1    20                15
2    25                17
3    18                16
4    28                20
5    31                28
6    15                9
7    26                20
8    10                2
Analysis:
● From the table, it appears that as the temperature rises, the quantity of cold drinks sold
also increases.
● For instance, at lower temperatures (e.g., 10°C), only 2 liters of cold drinks are sold,
whereas at higher temperatures (e.g., 31°C), the sales go up to 28 liters.
● This indicates a positive correlation between temperature and the consumption of cold
drinks. In simpler terms, hotter days tend to come with higher sales of cold drinks.
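The trend described in this analysis can be quantified with the Pearson correlation coefficient. The sketch below computes it by hand from the eight rows of the table; the function name `pearson_r` and the variable names are illustrative, and the second column is taken to be temperature in °C and the third to be liters sold, as the analysis implies.

```python
from math import sqrt

temperature = [20, 25, 18, 28, 31, 15, 26, 10]  # degrees Celsius
cold_drinks = [15, 17, 16, 20, 28, 9, 20, 2]    # liters sold

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(temperature, cold_drinks)
print(f"r = {r:.3f}")
```

A value of r close to +1 supports the reading that sales rise with temperature; it does not by itself show that temperature causes the sales.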
Diagram:
The relationship can be visually represented using a scatter plot to show how cold drink
consumption changes with temperature.
Here is the scatter plot for Question 4(a), showing the relationship between temperature and the
amount of cold drinks purchased:
TID Transaction
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, K, I, E}
● M: 3
● O: 3
● N: 2
● K: 5
● E: 4
● Y: 3
● D: 1
● A: 1
● U: 1
● C: 2
● I: 1
With a minimum support of 60%, an item needs to appear in at least 3 transactions (since 60%
of 5 transactions = 3). The frequent items are therefore K (5), E (4), M (3), O (3), and Y (3);
N, C, D, A, U, and I are dropped. Each transaction is then rewritten with only its frequent
items, ordered by descending count (ties kept in the order M, O, Y):
T100 K, E, M, O, Y
T200 K, E, O, Y
T300 K, E, M
T400 K, M, Y
T500 K, E, O
FP Tree Illustration
(indentation shows parent-child links; the number in parentheses is the node count)
Root
  K (5)
    E (4)
      M (2)
        O (1)
          Y (1)
      O (2)
        Y (1)
    M (1)
      Y (1)
Explanation of FP Tree Structure:
1. The root is an empty (null) node. K appears in all five transactions, so it is the only
child of the root, with count 5.
2. The four transactions that contain E (T100, T200, T300, T500) share the branch K → E, so
E has count 4.
3. Under E, T100 and T300 continue through M (count 2), while T200 and T500 continue
through O (count 2).
4. T100 extends the M branch with O → Y, T200 extends the O branch with Y, and T400, which
has no E, forms the separate branch K → M → Y.
This tree shows the frequency pattern in a compressed form, representing the common patterns
across transactions.
End
Question_3_Solved
Question 1:
1. Define Big Data Analytics. Discuss veracity, value, and visibility of big data.
Answer:
Big Data Analytics is the process of examining large and complex data sets to uncover
hidden patterns, unknown correlations, market trends, and other useful information.
○ Veracity: This refers to the quality or trustworthiness of the data. Big data can
come with inconsistencies, so verifying accuracy is important.
○ Value: This is about extracting useful insights from the data. The real worth of
big data lies in the information it provides to improve decision-making.
○ Visibility: This means how easy it is to access and view relevant data. Good
visibility helps in timely decision-making.
2. Draw the figure of data analytics lifecycle.
Answer:
The data analytics lifecycle typically includes the following stages:
○ Data Collection: Gathering data from various sources.
○ Data Preparation: Cleaning and organizing data.
○ Data Analysis: Analyzing the data to find insights.
○ Data Visualization: Presenting the data findings visually.
○ Decision Making: Using insights for decision-making.
3. What are the sources of big data? Describe with example.
Answer:
Big data comes from many sources:
○ Social Media: Platforms like Facebook and Twitter generate a lot of data through
user interactions.
○ E-commerce: Sites like Amazon track user purchases, reviews, and browsing
behavior.
○ Sensors and IoT: Devices in smart homes and cities generate data on
temperature, energy usage, etc.
Question 2:
1. State the names of the commonly used measures of central tendency. How is the Median
defined in a grouped data distribution?
Answer:
The commonly used measures of central tendency are Mean, Median, and Mode.
○ Mean: The average of a data set.
○ Median: The middle value when data is arranged in order. For grouped data, the
median lies in the median class (the interval where the cumulative frequency first
reaches half of the total) and is estimated by interpolating within that class.
○ Mode: The most frequently occurring value in a data set.
2. Explain Data Analysis pipeline with proper diagram.
Answer:
A typical data analysis pipeline includes:
○ Data Collection
○ Data Cleaning
○ Data Transformation
○ Data Modeling
○ Data Visualization
Each stage processes data to make it ready for the next stage, with the goal of deriving
useful insights.
3. What is a boxplot? How can it help to represent information?
Answer:
A boxplot is a graphical representation that shows the distribution of a data set. It
includes:
○ Minimum
○ First Quartile (Q1)
○ Median (Q2)
○ Third Quartile (Q3)
○ Maximum
Boxplots help to see the spread and skewness of data, as well as any outliers.
4. Why is data preprocessing important? Explain.
Answer:
Data preprocessing is essential to prepare raw data for analysis. It involves cleaning,
transforming, and normalizing data, which removes errors, fills missing values, and
standardizes the data, making it ready for accurate analysis.
Question 3:
1. Define correlation. Suppose that 5 students were asked their SSC GPA and their
HSC GPA, with the answers as follows in Table I. Is SSC GPA and HSC GPA
related according to this data, and if they are related, what type of correlation
exists between these two data?
Answer:
Correlation is a statistical measure that describes the strength and direction of a
relationship between two variables. It shows whether an increase or decrease in one
variable is associated with an increase or decrease in another variable.
To determine if SSC GPA and HSC GPA are related, we can calculate the correlation
coefficient based on the data provided in Table I:
○ If the correlation coefficient is close to +1, there is a strong positive
correlation, meaning as SSC GPA increases, HSC GPA also increases.
○ If the correlation coefficient is close to -1, there is a strong negative
correlation, meaning as SSC GPA increases, HSC GPA decreases.
○ If the correlation coefficient is close to 0, there is little or no linear relationship
between SSC GPA and HSC GPA.
2. After calculating, if we find a positive correlation coefficient, we can conclude that SSC
GPA and HSC GPA have a positive correlation, meaning students with higher SSC GPA
tend to have higher HSC GPA as well.
3. Distinguish between BI and Big Data Analytics.
Answer:
○ Business Intelligence (BI) involves using past and current data to make well-
informed business decisions. It focuses on descriptive analytics, which helps in
understanding historical trends.
○ Big Data Analytics goes beyond just past and present data and includes
predictive and prescriptive analytics. It handles large, complex data sets from
various sources to find hidden patterns, forecast future trends, and provide
deeper insights.
4. Suppose you want to use min-max normalization with new min = 1 and new max = 5 to
normalize data whose original range is 700 to 2000. Find the normalized value of 850.
Answer:
$$X' = \frac{X - \text{min original}}{\text{max original} - \text{min original}} \times (\text{max new} - \text{min new}) + \text{min new}$$
Where:
● $X = 850$
● $\text{min original} = 700$
● $\text{max original} = 2000$
● $\text{min new} = 1$
● $\text{max new} = 5$
Substituting:
$$X' = \frac{850 - 700}{2000 - 700} \times (5 - 1) + 1 = \frac{150}{1300} \times 4 + 1 \approx 1.46$$
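Min-max normalization linearly rescales a value from its original range to a new range. A minimal sketch for this question's numbers follows; the function name `min_max_normalize` and its parameter names are illustrative, not from any particular library.

```python
def min_max_normalize(x, old_min, old_max, new_min, new_max):
    """Linearly rescale x from [old_min, old_max] to [new_min, new_max]."""
    if old_max == old_min:
        raise ValueError("original range must have non-zero width")
    fraction = (x - old_min) / (old_max - old_min)  # position within old range
    return fraction * (new_max - new_min) + new_min

# Values from the question: original range 700..2000, new range 1..5.
normalized = min_max_normalize(850, 700, 2000, 1, 5)
print(round(normalized, 2))
```

As a sanity check, the endpoints of the original range (700 and 2000) should map to 1 and 5 respectively.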
Answer:
Properties of a normal distribution curve include:
Question 4:
1. What are Multi-level Association rules? Give examples.
Answer:
Multi-level association rules are used to find patterns in data at different levels of
abstraction. For example, in a supermarket, a rule might show that "beverages" are often
purchased with "snacks" (higher level), and at a lower level, it may show that "soda" is
bought with "chips".
2. Consider the following database containing five transactions in Table II. Assume
min_sup = 60%. Draw the FP tree.
Answer:
To create the FP (Frequent Pattern) tree:
○ Identify items with frequency greater than or equal to 60% of transactions (which
is 3 out of 5 transactions in this case).
○ Organize the frequent items in a tree structure based on transactions in
descending order of frequency.
3. Since drawing the FP tree requires visual representation, you can list items appearing at
least 3 times, then arrange them in a structured tree with paths representing
transactions.
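The procedure described above (count items, drop those below the threshold, reorder each transaction by descending count, then insert it along a shared path) can be sketched as follows. Table II is not reproduced in this answer, so the sketch assumes it is the five-transaction table (T100 to T500) shown in the Question 2 solutions earlier in these notes. The `Node` class and helper names are illustrative, and breaking count ties alphabetically is a choice made for this sketch.

```python
from collections import Counter

# ASSUMPTION: Table II is the five-transaction table from earlier in the notes.
transactions = [
    ["M", "O", "N", "K", "E", "Y"],  # T100
    ["D", "O", "N", "K", "E", "Y"],  # T200
    ["M", "A", "K", "E"],            # T300
    ["M", "U", "C", "K", "Y"],       # T400
    ["C", "O", "K", "I", "E"],       # T500
]
min_count = 3  # 60% of 5 transactions

# Count how many transactions contain each item, then keep frequent ones.
item_counts = Counter(item for t in transactions for item in set(t))
frequent = {item: c for item, c in item_counts.items() if c >= min_count}

class Node:
    """One FP-tree node: an item, its count, and its child nodes."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def ordered(transaction):
    """Frequent items only, by descending count (ties alphabetical)."""
    kept = [i for i in set(transaction) if i in frequent]
    return sorted(kept, key=lambda i: (-frequent[i], i))

root = Node(None)  # the root is an empty (null) node
for t in transactions:
    node = root
    for item in ordered(t):
        node = node.children.setdefault(item, Node(item))
        node.count += 1  # shared prefixes accumulate counts

def render(node, depth=0):
    """Return the tree as indented text lines."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item} ({child.count})")
        lines.extend(render(child, depth + 1))
    return lines

print("\n".join(render(root)))
```

Transactions that share a prefix of frequent items reuse the same nodes, which is where the compression in an FP tree comes from.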
Question 4
1. What are Multi-level Association Rules? Give examples.
Answer:
Multi-level association rules help in finding relationships or patterns in data across
different levels of detail. For instance:
○ At a high level, a rule could show that "beverages" are often bought with
"snacks".
○ At a more detailed level, it could show that "soda" is bought with "chips".
2. These rules are useful in retail and e-commerce to understand buying behaviors at both
a broad and detailed level.
3. Consider the following database containing five transactions in Table II. Assume
the min_sup = 60%. Draw the FP tree.
Answer:
To draw an FP (Frequent Pattern) tree, follow these steps:
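Step 1: Count the frequency of each item
Count the number of transactions that contain each item. Table II is taken here to be the
five-transaction table used in the Question 2 solutions: T100 {M, O, N, K, E, Y},
T200 {D, O, N, K, E, Y}, T300 {M, A, K, E}, T400 {M, U, C, K, Y}, T500 {C, O, K, I, E}.
Item Frequency
M 3
O 3
N 2
K 5
E 4
Y 3
D 1
A 1
U 1
C 2
I 1
Step 2: Remove infrequent items
Since min_sup is 60%, we need items that appear in at least 3 out of 5 transactions.
Therefore, items with a frequency of 2 or less are removed.
The frequent items are: K, E, M, O, Y.
Step 3: Sort items by frequency and arrange transactions
Keep only the frequent items in each transaction and order them by descending frequency
(ties in the order M, O, Y). A transaction is not skipped just because it also contains an
infrequent item; only the infrequent items are dropped.
○ T100: K, E, M, O, Y
○ T200: K, E, O, Y
○ T300: K, E, M
○ T400: K, M, Y
○ T500: K, E, O
Step 4: Insert the ordered transactions into the tree one at a time
(indentation shows parent-child links; the number in parentheses is the node count)
Root
  K (5)
    E (4)
      M (2)
        O (1)
          Y (1)
      O (2)
        Y (1)
    M (1)
      Y (1)
Explanation:
○ The root is an empty (null) node. K is in every transaction, so it is the root's only
child. E follows K in four transactions; below E the paths split into M (T100, T300)
and O (T200, T500). T400 has no E, so it forms the separate branch K → M → Y.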
FP Tree:
The FP tree is a way to represent frequent patterns in data visually. By structuring the data into
a tree based on frequent items, you can quickly see relationships and co-occurrences of items,
which is especially useful in market basket analysis.
Question_4_Solved
1 (a) What are the fundamental data types associated with big data?
Explain with example.
Answer:
In big data, the fundamental data types can be categorized into the following types:
1. Structured Data: Data that is organized into a predefined structure, like rows and
columns in a database. It is easily searchable by algorithms.
○ Example: Data in an SQL database table, such as customer information (Name,
Address, Phone Number).
2. Unstructured Data: Data that does not have a predefined structure and is often in the
form of multimedia or text.
○ Example: Text documents, images, videos, social media posts.
3. Semi-Structured Data: Data that does not fit into traditional databases but contains
some organizational properties, such as tags and markers.
○ Example: XML and JSON files, emails (having metadata and message body).
4. Quasi-Structured Data: Data that has some structure but is irregular and not
completely organized.
○ Example: Web server logs (contains a pattern but not a strict structure).
2 (a) State the names of the commonly used measures of central tendency. How is
the Median Class defined in a grouped data distribution?
Answer:
1. Mean: This is the average of all values in a dataset. It is calculated by summing up all
the values and dividing by the number of values.
2. Median: The middle value in a sorted dataset. For grouped data, it is the middle value
based on cumulative frequency.
3. Mode: The value that appears most frequently in the dataset.
In a grouped frequency distribution, the median class is the class interval where the median
value lies. It is identified by finding the class interval with cumulative frequency just above half of
the total frequency. Once the median class is determined, the median value is calculated using
a formula for grouped data.
Frequency 2 12 22 8 6
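The grouped-data calculation can be sketched in code. The class intervals for these frequencies are not reproduced in the notes, so the sketch assumes equal-width classes 0-10, 10-20, 20-30, 30-40, 40-50. That assumption agrees with the minimum of 0, maximum of 50, and median of 25 quoted in the boxplot summary that follows, but it remains an assumption. The interpolation used is the usual grouped-data formula, Median = L + ((N/2 − CF) / f) × h, where L is the lower boundary of the median class, CF the cumulative frequency before it, f its frequency, and h the class width. The function name `grouped_quantile` is illustrative.

```python
def grouped_quantile(lower_bounds, width, frequencies, p):
    """Interpolated quantile (0 < p < 1) for equal-width grouped data."""
    total = sum(frequencies)
    target = p * total  # e.g. N/2 for the median
    cumulative = 0      # cumulative frequency before the current class
    for lower, freq in zip(lower_bounds, frequencies):
        if freq > 0 and cumulative + freq >= target:
            # This is the class containing the quantile: interpolate inside it.
            return lower + (target - cumulative) / freq * width
        cumulative += freq
    raise ValueError("p must be between 0 and 1")

frequencies = [2, 12, 22, 8, 6]
lower_bounds = [0, 10, 20, 30, 40]  # ASSUMED classes 0-10, 10-20, ..., 40-50
width = 10

grouped_median = grouped_quantile(lower_bounds, width, frequencies, 0.50)
grouped_q1 = grouped_quantile(lower_bounds, width, frequencies, 0.25)
grouped_q3 = grouped_quantile(lower_bounds, width, frequencies, 0.75)
print(grouped_q1, grouped_median, grouped_q3)
```

Under the assumed classes, the cumulative frequencies are 2, 14, 36, 44, 50, so the median class is 20-30, the first class whose cumulative frequency reaches N/2 = 25.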
Drawing a Boxplot
1. Minimum = 0
2. First Quartile (Q1) = 10-20 range
3. Median (Q2) = 25 (as calculated above)
4. Third Quartile (Q3) = 30-40 range
5. Maximum = 50
Here is the boxplot based on the given class frequency data.
● The boxplot displays the distribution of values across the class intervals.
● The median (Q2) is represented at approximately 25, which we calculated previously.
● The range extends from the minimum to the maximum class boundaries in the dataset.
3 (a) Find all frequent item sets with minimum support threshold value of 2
using Apriori mining algorithm for the following TABLE-I.
Given Table:
T200 I2, I4
T300 I2, I3
T600 I2, I3
T700 I1, I3
Answer:
3 (b) What are Multi-Dimensional Association Rules? Give an example.
Answer:
Multi-Dimensional Association Rules are association rules that involve multiple dimensions
or attributes in the data. Unlike traditional association rules that focus on items within a single
transaction, multi-dimensional association rules look at associations across different attributes
or dimensions of the dataset.
Example:
● "Customers in New York are more likely to buy winter clothing on weekends in
December."
Here, the rule links the dimensions of Location (New York), Time (Weekends, December),
and Product (Winter Clothing), showing an association across multiple dimensions.
4 (a) Mr. Hasan is a shopkeeper who sells different daily necessary items such as rice,
oil, soap, biscuits, and cold drinks. Cold drinks are normally sold more when the
temperature is higher. It is reported by Mr. Hasan that the hotter the day, the more the
cold drinks are purchased by the customers. He maintains the following database to
keep record of cold drinks in liters against the temperature of a day to predict his storage
of cold drinks that he should keep in his shop based on the temperature of a day shown
in Table II. Determine the relationship between temperature and the amount of cold
drinks purchased.
Day  Temperature (°C)  Cold drinks (liters)
1    25                15
2    28                17
3    18                16
4    22                20
5    25                28
6    21                23
7    30                29
8    10                2
Answer:
To determine the relationship between temperature and the amount of cold drinks purchased,
we can use a scatter plot to visualize any correlation and calculate the correlation coefficient.
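The Pearson correlation coefficient is:
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
Where:
● x: Temperature values
● y: Cold Drinks values
● n: number of observations (8 here)
A positive $r$ value indicates a positive correlation (as temperature increases, cold drink sales
increase).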
Let’s create the scatter plot and calculate the correlation coefficient to confirm this relationship.
Scatter Plot and Correlation Calculation
The scatter plot above shows the relationship between temperature (°C) and cold drinks
purchased (liters). The correlation coefficient calculated is approximately 0.78, indicating a
strong positive correlation. This suggests that as the temperature increases, the sales of cold
drinks also increase.
TID Transaction
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E}
T300 {M, U, K, E}
T400 {C, O, K, I, E}
T500 {C, O, K, E}
Solution:
This tree structure shows the frequent patterns identified, with each node's count representing
the number of times that item (or itemset) appears in the transactions. This visualization makes
it easier to see the relationships and frequency of each item in the dataset based on the
minimum support threshold.
End