Q_Solve_Bigdata

Question _1_Solved:
1. (a) What is big data? Explain the 4V of big data.

Answer:
Big data refers to vast amounts of structured and unstructured data generated at high velocity
from various sources. This data is complex and requires advanced tools and technologies to
process and analyze.
The 4Vs of Big Data are:
● Volume: The large scale of data generated.

● Velocity: The speed at which data is created and processed.
● Variety: Different types of data (text, images, audio, etc.).
● Veracity: The reliability and accuracy of the data.
1. (b) What are the tasks of data preprocessing?

Answer:
Data preprocessing involves preparing raw data for analysis by cleaning and transforming it.
Major tasks include:
● Data Cleaning: Removing noise and handling missing values.

● Data Integration: Combining data from multiple sources.
● Data Transformation: Converting data into an appropriate format.
● Data Reduction: Reducing data volume while retaining essential information.
2. (a) Using the table provided, find all frequent item sets with a minimum
support threshold of 60% using the Apriori mining algorithm.
Answer:
To solve this, list all unique items in the transactions and determine their frequency.
Then, identify item sets that meet or exceed the 60% support threshold. For
example, calculate the support for each item set like {a}, {b}, etc., and filter those
with support ≥ 60%.
Due to space, here is an example of steps:
1. Count occurrences of each item.

2. Calculate support for each item.
3. Only keep item sets with support ≥ 60%.
2. (b) State the properties of the normal distribution curve.
Answer:
The properties of a normal distribution curve include:
● Symmetry: It is symmetrical about the mean.

● Bell-shaped: Peaks at the mean and tapers off at both ends.
● Mean, Median, and Mode: All are equal and located at the center.
● 68-95-99.7 Rule: Approximately 68% of data falls within 1 standard deviation, 95%
within 2, and 99.7% within 3.
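The 68-95-99.7 rule can be checked numerically. The following is a minimal sketch using Python's standard-library `NormalDist`; the helper name `share_within` is an illustrative assumption, not something from the question paper.

```python
from statistics import NormalDist

def share_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

# Print the share of data within 1, 2, and 3 standard deviations, in percent.
for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 2))
```

The result does not depend on the chosen mean or standard deviation, which is why the rule applies to every normal curve.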
3. (a) Find the mean, median, and a boxplot of the data provided.
Given data (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33,
32, 35, 35, 35, 36, 40, 45, 46, 52, 70.
Answer:
● Mean: Sum all values and divide by the count.

● Median: The middle value in the ordered data (if odd, middle; if even, average of two
middle values).
● Boxplot: Visual representation showing the minimum, Q1, median, Q3, and maximum.
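As a rough sketch, the values can be computed as follows for the 26 numbers listed in the question. The quartile convention (median of the lower and upper halves) is an assumption; other textbooks and libraries interpolate differently and may give slightly different Q1 and Q3. The function name `five_number_summary` is illustrative.

```python
from statistics import mean, median

# Data exactly as listed in the question (sorted again below, since 33 and 32 appear out of order).
data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33,
        32, 35, 35, 35, 36, 40, 45, 46, 52, 70]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                                   # values below the middle
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]  # values above the middle
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

print("mean:", round(mean(data), 2))
print("min, Q1, median, Q3, max:", five_number_summary(data))
```

Under this convention the box of the boxplot spans Q1 to Q3, with the median drawn as a line inside it.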
3. (b) Describe methods for handling missing data in real-world datasets.

Answer:
Common methods include:
● Deletion: Removing records with missing values.

● Imputation: Filling missing values with mean, median, or mode.
● Prediction: Using algorithms to predict missing values based on other data.
● Using Algorithms that Support Missing Values: Some algorithms can handle missing
data inherently.
3. (c) State the equation of correlation analysis between two variables A and B.
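No answer is recorded for this part. A likely intended answer, assumed here, is the Pearson correlation coefficient, where n is the number of (a_i, b_i) pairs, \bar{A} and \bar{B} are the means of A and B, and \sigma_A and \sigma_B are their (population) standard deviations:

```latex
r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A\,\sigma_B}
        = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\,\bar{B}}{n\,\sigma_A\,\sigma_B},
\qquad -1 \le r_{A,B} \le 1
```

A value above 0 indicates positive correlation, a value below 0 indicates negative correlation, and a value of 0 indicates no linear correlation.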
4. (a) Describe briefly the Data Analytics Lifecycle.
Answer:
The Data Analytics Lifecycle includes:
1. Discovery: Understanding the objectives and requirements.

2. Data Preparation: Gathering, cleaning, and transforming data.
3. Model Planning: Selecting techniques and planning analysis.
4. Model Building: Creating and testing models.
5. Results Communication: Interpreting and presenting findings.
6. Operationalization: Deploying the model in real-world applications.
4. (b) Differentiate between BI and Data Analytics.

Answer:
● Business Intelligence (BI): Focuses on descriptive analytics, summarizing historical
data to inform decision-making.
● Data Analytics: Encompasses a broader scope, including predictive and prescriptive
analytics, often using complex algorithms for deeper insights.
Question _2_Solved:
Question 1
(a) What are the fundamental data types associated with big data? Explain with example.
(4 marks)
1. Structured Data - Organized in a specific format, typically in rows and columns.

Example: SQL databases containing customer information.
2. Unstructured Data - Lacks a predefined structure and is typically textual. Example:
Social media posts, emails.
3. Semi-Structured Data - Partially structured with tags or markers but not organized into
a strict table format. Example: JSON or XML files.
4. Metadata - Data about other data. Example: Document creation dates, author
information in files.
(b) Draw the figure of data analytics lifecycle. (2 marks)
1. Data Collection - Gathering data from various sources.

2. Data Cleaning - Removing inconsistencies or errors in the data.
3. Data Analysis - Applying analytical techniques to find patterns.
4. Data Visualization - Representing insights through graphs and charts.
5. Decision-Making - Using insights for strategic decisions.
(c) As a data scientist, what qualities are crucial? Explain clearly. (4 marks)
1. Analytical Thinking - The ability to interpret and analyze data effectively.

2. Programming Skills - Knowledge in languages like Python and R.
3. Communication Skills - Presenting complex data insights simply.
4. Curiosity - Eagerness to find hidden patterns and insights.
Question 2
(a) State the name of commonly used measures of central tendency. How the Median
Class is defined in group data distribution. (2 marks)
1. Mean - The average value.

2. Median - The middle value in sorted data.
3. Mode - The most frequently occurring value.
The Median Class in grouped data is the class interval where the cumulative frequency
reaches or exceeds half of the total frequency.
(b) Find the median of the following data: (4 marks)
Classes 0-10 10-20 20-30 30-40 40-50
Frequency 2 12 22 8 6
To find the median:
1. Calculate cumulative frequency.

2. Determine N/2 (where N is the total frequency).
3. Identify the median class based on the cumulative frequency crossing N/2.
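The steps above can be sketched in code. This is an illustrative helper (the name `grouped_median` and its parameters are assumptions) that applies the usual grouped-data interpolation, median = L + ((N/2 − CF) / f) × h, to the table in the question.

```python
def grouped_median(lower_bounds, width, freqs):
    """Estimate the median of grouped data with equal class widths."""
    total = sum(freqs)
    if total == 0:
        raise ValueError("empty distribution")
    half = total / 2
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_bounds, freqs):
        if cumulative + f >= half:
            # This is the median class: interpolate inside it.
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("frequencies and class bounds do not match")

# Classes 0-10, 10-20, 20-30, 30-40, 40-50 with frequencies 2, 12, 22, 8, 6.
median_value = grouped_median([0, 10, 20, 30, 40], 10, [2, 12, 22, 8, 6])
print(median_value)
```

Here N = 50, so N/2 = 25; the cumulative frequencies are 2, 14, 36, 44, 50, which makes 20-30 the median class and gives 20 + ((25 − 14) / 22) × 10 = 25.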
(c) Suppose you have the data as follows: ... Find the boxplot. (4 marks)
To construct a boxplot:
1. Find the minimum, first quartile, median, third quartile, and maximum.
2. Plot these points on a number line and draw the box with whiskers.
Understanding the Boxplot

A boxplot (or box-and-whisker plot) is a graphical representation of a dataset that shows its
central tendency and spread, as well as any outliers.
Here’s a simple explanation of the boxplot:
1. The Box:
○ The box in the middle shows the "middle half" of the data, from the 25th
percentile (lower edge) to the 75th percentile (upper edge). This range is called
the Interquartile Range (IQR).
○ The line inside the box is the median, or the middle value of the data. Half of the
data points are above this line, and half are below it.
2. Whiskers:
○ The "whiskers" are the lines that extend out from the box. They show the range
of most of the data (anything within 1.5 times the IQR from the box).
○ The whiskers go to the lowest and highest values in this range.
3. Outliers:
○ Any data points that fall outside the whiskers are called outliers. In this plot,
there’s one outlier on the right side. It’s a value that’s quite a bit higher than the
rest of the data.
4. Overall Picture:
○ This boxplot gives a quick view of how spread out the data is, where most values
fall, and if there are any unusually high or low values (outliers).
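The box, whisker, and outlier rules described above can be sketched as follows. The sample values are hypothetical (the question's data is not reproduced in these notes), and the quartiles use the median-of-halves convention, which is an assumption.

```python
from statistics import median

def boxplot_stats(values):
    """Compute the box, whisker ends, and outliers using the 1.5 x IQR rule."""
    xs = sorted(values)
    half = len(xs) // 2
    q1 = median(xs[:half])    # median of the lower half
    q3 = median(xs[-half:])   # median of the upper half
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    outliers = [x for x in xs if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median(xs),
        "q3": q3,
        "whiskers": (inside[0], inside[-1]),  # lowest and highest non-outlier values
        "outliers": outliers,
    }

# Hypothetical sample with one unusually high value.
sample = [2, 4, 5, 7, 8, 9, 10, 30]
stats = boxplot_stats(sample)
print(stats)
```

For this sample the IQR is 5, so anything above 9.5 + 7.5 = 17 counts as an outlier, and the upper whisker stops at 10.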
Question 3
(a) Find all frequent item sets with minimum support threshold value of 2 using Apriori
mining algorithm for TABLE-1. (7 marks)
Using TABLE-1 with a minimum support threshold of 2:
1. Generate Item Sets - List all single items and count their occurrences.
2. Filter Item Sets - Keep items with occurrences ≥ 2.
3. Generate Pairs - Repeat for pairs, triples, and so on.
4. Frequent Item Sets - The item sets with support ≥ 2.
(b) What is Multi-Dimensional Association Rules? Give example. (3 marks)
● Multi-Dimensional Association Rules: These rules involve multiple dimensions or
attributes.
● Example: "Customers who buy milk and bread on weekends are likely to also buy
eggs."
Question 4(a)
Mr. Hasan is a shopkeeper... Determine the relationship between temperature and the
amount of cold drinks purchased. (5 marks)
Based on the data provided in Table-II:

ID Temperature (°C) Cold Drinks (Liter)
1 20 15
2 25 17
3 18 16
4 28 20
5 31 28
6 15 9
7 26 20
8 10 2
Analysis:
● From the table, it appears that as the temperature rises, the quantity of cold drinks sold
also increases.
● For instance, at lower temperatures (e.g., 10°C), only 2 liters of cold drinks are sold,
whereas at higher temperatures (e.g., 31°C), the sales go up to 28 liters.
● This indicates a positive correlation between temperature and the consumption of cold
drinks. In simpler terms, higher temperatures lead to higher sales of cold drinks.
Diagram:
The relationship can be visually represented using a scatter plot to show how cold drink
consumption changes with temperature.
Here is the scatter plot for Question 4(a), showing the relationship between temperature and the
amount of cold drinks purchased:
● The x-axis represents the temperature in degrees Celsius.

● The y-axis represents the amount of cold drinks purchased in liters.
● Each point represents the temperature and corresponding cold drink sales.
● The upward trend indicates that as temperature increases, the amount of cold drinks
sold also rises.
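To put a number on the upward trend, the Pearson correlation coefficient can be computed from the eight rows of Table-II. This is a minimal sketch; the function name `pearson_r` is an illustrative choice.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Rows 1-8 of Table-II.
temperature = [20, 25, 18, 28, 31, 15, 26, 10]
cold_drinks = [15, 17, 16, 20, 28, 9, 20, 2]

r = pearson_r(temperature, cold_drinks)
print(round(r, 2))
```

The value comes out close to +1, which supports the conclusion of a strong positive correlation. Correlation alone does not prove that temperature causes the sales, although that explanation is plausible here.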
4b) Step 1: Calculate Frequency of Each Item

Based on the transactions in TABLE-III:
TID Transaction
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, K, I, E}
Let’s calculate the frequency of each item.
● M: 3
● O: 3
● N: 2
● K: 5
● E: 4
● Y: 3
● D: 1
● A: 1
● U: 1
● C: 2
● I: 1
Filtering Based on Minimum Support
With a minimum support of 60%, an item needs to appear in at least 3 transactions (since 60%
of 5 transactions = 3). Let’s filter out items that don’t meet this requirement.
Frequent Items (Support ≥ 3):
● K (5), E (4), M (3), O (3), Y (3)
We discard the other items (N, C, D, A, U, I).
Step 2: Order Frequent Items in Each Transaction
Now, we keep only the frequent items in each transaction and sort them in descending order of
frequency (K > E > M > O > Y). M, O, and Y tie at 3, so the tie is broken here by order of first
appearance.
TID Filtered and Ordered Transaction
T100 K, E, M, O, Y
T200 K, E, O, Y
T300 K, E, M
T400 K, M, Y
T500 K, E, O
Step 3: Construct the FP Tree
Using the filtered transactions, we will build the FP Tree.
1. Start with an empty root.
2. Insert each transaction into the tree, creating nodes if they do not exist and
incrementing counts as necessary.
FP Tree Illustration
Below is the FP Tree based on the above transactions.
Root (null)
└─ K (5)
   ├─ E (4)
   │  ├─ M (2)
   │  │  └─ O (1)
   │  │     └─ Y (1)
   │  └─ O (2)
   │     └─ Y (1)
   └─ M (1)
      └─ Y (1)
Explanation of FP Tree Structure:
1. The root is an empty (null) node. K appears in all five transactions, so it is the root's only
child, with a count of 5.
2. The four transactions that contain E (T100, T200, T300, T500) share the branch K → E, so E
has a count of 4.
3. Under E, T100 and T300 continue through M (count 2), and T100 goes on through O (1) and
Y (1). T200 and T500 skip M and continue through a separate O node (count 2); T200 ends at
Y (1).
4. T400 contains K, M, and Y but not E, so it creates a new branch K → M (1) → Y (1).
This tree shows the frequency pattern in a compressed form, representing the common patterns
across transactions.
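The construction can also be sketched in code. This is a minimal illustration rather than a full FP-growth implementation: it only builds the tree (no header table or mining step), and the names `Node`, `build_fp_tree`, and `show` are assumptions. Ties in frequency are broken by order of first appearance, which is one of several valid choices.

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, its count, and its children keyed by item."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Support count = number of transactions containing the item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Tie-break rule (an assumption): earlier first appearance ranks higher.
    first_seen = {}
    for t in transactions:
        for item in t:
            first_seen.setdefault(item, len(first_seen))
    order = sorted(frequent, key=lambda i: (-frequent[i], first_seen[i]))
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None)  # the null root
    for t in transactions:
        node = root
        # Keep only frequent items, sorted by global frequency rank.
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item)
            node = node.children[item]
            node.count += 1
    return root, order

def show(node, depth=0):
    """Print the tree with indentation showing parent-child links."""
    for child in node.children.values():
        print("  " * depth + f"{child.item} ({child.count})")
        show(child, depth + 1)

# TABLE-III as listed above; min_sup = 60% of 5 transactions = 3.
table_iii = [
    ["M", "O", "N", "K", "E", "Y"],
    ["D", "O", "N", "K", "E", "Y"],
    ["M", "A", "K", "E"],
    ["M", "U", "C", "K", "Y"],
    ["C", "O", "K", "I", "E"],
]
root, order = build_fp_tree(table_iii, 3)
show(root)
```

Counting directly from TABLE-III, M appears in T100, T300, and T400, so it reaches the support threshold of 3 and takes part in the tree.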
End
Question_3_Solved
Question 1:
1. Define Big Data Analytics. Discuss veracity, value, and visibility of big data.
Answer:
Big Data Analytics is the process of examining large and complex data sets to uncover
hidden patterns, unknown correlations, market trends, and other useful information.
○ Veracity: This refers to the quality or trustworthiness of the data. Big data can
come with inconsistencies, so verifying accuracy is important.
○ Value: This is about extracting useful insights from the data. The real worth of
big data lies in the information it provides to improve decision-making.
○ Visibility: This means how easy it is to access and view relevant data. Good
visibility helps in timely decision-making.
2. Draw the figure of data analytics lifecycle.
Answer:
The data analytics lifecycle typically includes the following stages:
○ Data Collection: Gathering data from various sources.
○ Data Preparation: Cleaning and organizing data.
○ Data Analysis: Analyzing the data to find insights.
○ Data Visualization: Presenting the data findings visually.
○ Decision Making: Using insights for decision-making.
3. What are the sources of big data? Describe with example.
Answer:
Big data comes from many sources:
○ Social Media: Platforms like Facebook and Twitter generate a lot of data through
user interactions.
○ E-commerce: Sites like Amazon track user purchases, reviews, and browsing
behavior.
○ Sensors and IoT: Devices in smart homes and cities generate data on
temperature, energy usage, etc.
Question 2:
1. State the name of commonly used measures of central tendency. How the Median
is defined in group data distribution.
Answer:
The commonly used measures of central tendency are Mean, Median, and Mode.
○ Mean: The average of a data set.
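○ Median: The middle value when data is arranged in order. For grouped data, the
median lies in the median class (the first class whose cumulative frequency reaches
N/2) and is estimated as L + ((N/2 − CF) / f) × h, where L is the lower boundary of
that class, CF is the cumulative frequency before it, f is its frequency, and h is the
class width.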
○ Mode: The most frequently occurring value in a data set.
2. Explain Data Analysis pipeline with proper diagram.
Answer:
A typical data analysis pipeline includes:
○ Data Collection
○ Data Cleaning
○ Data Transformation
○ Data Modeling
○ Data Visualization
Each stage processes data to make it ready for the next stage, with the goal of deriving
useful insights.
3. What is a boxplot? How it can help to represent information?
Answer:
A boxplot is a graphical representation that shows the distribution of a data set. It
includes:
○ Minimum
○ First Quartile (Q1)
○ Median (Q2)
○ Third Quartile (Q3)
○ Maximum
Boxplots help to see the spread and skewness of data, as well as any outliers.
4. Why data preprocessing is important? Explain.
Answer:
Data preprocessing is essential to prepare raw data for analysis. It involves cleaning,
transforming, and normalizing data, which removes errors, fills missing values, and
standardizes the data, making it ready for accurate analysis.
Question 3:
1. Define correlation. Suppose that 5 students were asked their SSC GPA and their
HSC GPA, with the answers as follows in Table I. Is SSC GPA and HSC GPA
related according to this data, and if they are related, what type of correlation
exists between these two data?
Answer:
Correlation is a statistical measure that describes the strength and direction of a
relationship between two variables. It shows whether an increase or decrease in one
variable is associated with an increase or decrease in another variable.
To determine if SSC GPA and HSC GPA are related, we can calculate the correlation
coefficient based on the data provided in Table I:
○ If the correlation coefficient is close to +1, there is a strong positive
correlation, meaning as SSC GPA increases, HSC GPA also increases.
○ If the correlation coefficient is close to -1, there is a strong negative
correlation, meaning as SSC GPA increases, HSC GPA decreases.
○ If the correlation coefficient is close to 0, there is no significant relationship
between SSC GPA and HSC GPA.
2. After calculating, if we find a positive correlation coefficient, we can conclude that SSC
GPA and HSC GPA have a positive correlation, meaning students with higher SSC GPA
tend to have higher HSC GPA as well.
3. Distinguish between BI and Big Data Analytics.
Answer:
○ Business Intelligence (BI) involves using past and current data to make well-
informed business decisions. It focuses on descriptive analytics, which helps in
understanding historical trends.
○ Big Data Analytics goes beyond just past and present data and includes
predictive and prescriptive analytics. It handles large, complex data sets from
various sources to find hidden patterns, forecast future trends, and provide
deeper insights.
4. Suppose you want to use the min-max normalization by setting min = 1 and max =
5 to normalize the range is 700 to 2000. Find the normalized value of 850.
Answer:
The min-max normalization formula is:
X' = (X − min original) / (max original − min original) × (max new − min new) + min new
Where:
● X = 850
● min original = 700
● max original = 2000
● min new = 1
● max new = 5
Plugging these values into the formula, we get:
X' = (850 − 700) / (2000 − 700) × (5 − 1) + 1 = (150 / 1300) × 4 + 1 ≈ 1.46
So 850 maps to approximately 1.46 on the new scale of 1 to 5.
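The calculation can be reproduced with a short function. This sketch assumes the standard min-max formula, (x − old_min) / (old_max − old_min) × (new_max − new_min) + new_min; the function and parameter names are illustrative.

```python
def min_max_normalize(x, old_min, old_max, new_min, new_max):
    """Linearly rescale x from [old_min, old_max] to [new_min, new_max]."""
    return (x - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# Original range 700 to 2000, new range 1 to 5, value 850.
normalized = min_max_normalize(850, 700, 2000, 1, 5)
print(round(normalized, 4))
```

As a sanity check, the endpoints 700 and 2000 map to 1 and 5, and 850 lands near the low end because it is close to the original minimum.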
4. State the properties of normal distributed curves.
Answer:
Properties of a normal distribution curve include:
● It is symmetric around the mean.

● The mean, median, and mode are all equal.
● Approximately 68% of data falls within one standard deviation, 95% within two, and
99.7% within three.
● It has a bell-shaped curve with tails approaching the x-axis but never touching it.
Question 4:
1. What are Multi-level Association rules? Give examples.
Answer:
Multi-level association rules are used to find patterns in data at different levels of
abstraction. For example, in a supermarket, a rule might show that "beverages" are often
purchased with "snacks" (higher level), and at a lower level, it may show that "soda" is
bought with "chips".
2. Consider the following database containing five transactions in Table II. Let
assume the min_sup = 60%. Draw the FP tree.
Answer:
To create the FP (Frequent Pattern) tree:
○ Identify items with frequency greater than or equal to 60% of transactions (which
is 3 out of 5 transactions in this case).
○ Organize the frequent items in a tree structure based on transactions in
descending order of frequency.
3. Since drawing the FP tree requires visual representation, you can list items appearing at
least 3 times, then arrange them in a structured tree with paths representing
transactions.
Question 4
1. What are Multi-level Association Rules? Give examples.
Answer:
Multi-level association rules help in finding relationships or patterns in data across
different levels of detail. For instance:
○ At a high level, a rule could show that "beverages" are often bought with
"snacks".
○ At a more detailed level, it could show that "soda" is bought with "chips".
2. These rules are useful in retail and e-commerce to understand buying behaviors at both
a broad and detailed level.
3. Consider the following database containing five transactions in Table II. Assume
the min_sup = 60%. Draw the FP tree.
Answer:
To draw an FP (Frequent Pattern) tree, follow these steps:
Step 1: Count the frequency of each item
Calculate the frequency of each item in the transactions. According to Table II:
Item Frequency
M 2
O 4
N 2
K 4
E 3
Y 3
A 1
U 1
C 1
I 1
Step 2: Remove infrequent items
Since min_sup is 60%, we need items that appear in at least 3 out of 5 transactions.
Therefore, items with a frequency of 2 or less are removed.
The frequent items are: O, K, E, Y.
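Step 3: Sort items by frequency and arrange transactions
Within each transaction, drop the infrequent items and sort the remaining frequent items in
descending order of frequency (O, K, E, Y). A transaction is not skipped just because it
contains an infrequent item; only the infrequent items are dropped.
○ T100: O, K, E, Y
○ T200: O, K, E, Y
○ T300: drop "A" and any other infrequent item; keep the frequent items that remain
○ T400: drop "U" and "C" and any other infrequent item; keep the frequent items that remain
○ T500: drop "C" and "I", leaving O, K, E
Step 4: Draw the FP Tree

Insert the ordered transactions one at a time under an empty (null) root, sharing common
prefixes and incrementing the count of every node a transaction passes through. T100, T200,
and T500 share the path:
Root → (O) → (K) → (E) → (Y)
Explanation:
○ The root is an empty (null) node. Its child "O" leads to "K", then "E", and finally "Y".
T100 and T200 follow the whole path, while T500 stops at "E". T300 and T400 either
increment counts on this path or start new branches, depending on which frequent
items they contain.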
Additional Explanation for Each Part in Simple Terms

Multi-level Association Rules:
These rules allow analysis at various detail levels, helping businesses to find general trends and
more specific patterns in purchasing behavior. For example, finding that beverages are
generally bought with snacks is a high-level pattern, while discovering that sodas are often
bought with chips is a more specific rule.
FP Tree:
The FP tree is a way to represent frequent patterns in data visually. By structuring the data into
a tree based on frequent items, you can quickly see relationships and co-occurrences of items,
which is especially useful in market basket analysis.
Question_4_Solved
1 (a) What are the fundamental data types associated with big data?
Explain with example.
Answer:
In big data, the fundamental data types can be categorized into the following types:
1. Structured Data: Data that is organized into a predefined structure, like rows and
columns in a database. It is easily searchable by algorithms.
○ Example: Data in an SQL database table, such as customer information (Name,
Address, Phone Number).
2. Unstructured Data: Data that does not have a predefined structure and is often in the
form of multimedia or text.
○ Example: Text documents, images, videos, social media posts.
3. Semi-Structured Data: Data that does not fit into traditional databases but contains
some organizational properties, such as tags and markers.
○ Example: XML and JSON files, emails (having metadata and message body).
4. Quasi-Structured Data: Data that has some structure but is irregular and not
completely organized.
○ Example: Web server logs (contains a pattern but not a strict structure).
(b) Draw the figure of data analytics lifecycle.

Answer with Diagram:
The data analytics lifecycle typically involves the following stages:
1. Data Collection: Gathering data from various sources.

2. Data Processing: Cleaning and transforming raw data.
3. Data Analysis: Using statistical and machine learning methods to analyze the data.
4. Data Visualization: Representing the findings in visual formats for easier interpretation.
5. Data Interpretation and Decision-Making: Making decisions based on the analysis.
[Diagram of the Data Analytics Lifecycle]

(c) As a data scientist, what qualities are crucial? Explain clearly.

Answer:
The crucial qualities for a data scientist include:
1. Analytical Skills: Ability to analyze and interpret complex datasets.

2. Programming Knowledge: Proficiency in programming languages such as Python, R,
and SQL.
3. Statistical Knowledge: Understanding of statistical concepts to validate models.
4. Problem-Solving Skills: Ability to identify and solve real-world business problems.
5. Communication Skills: Ability to communicate findings effectively to non-technical
stakeholders.
6. Curiosity and Adaptability: Eagerness to learn new tools and adapt to changes in
technology.
2 (a) State the name of commonly used measures of central tendency. How
the Median Class is defined in group data distribution.
Answer:
The commonly used measures of central tendency are:
1. Mean: This is the average of all values in a dataset. It is calculated by summing up all
the values and dividing by the number of values.
2. Median: The middle value in a sorted dataset. For grouped data, it is the middle value
based on cumulative frequency.
3. Mode: The value that appears most frequently in the dataset.
Median Class in Grouped Data Distribution:
In a grouped frequency distribution, the median class is the class interval in which the median
lies. It is the first class whose cumulative frequency reaches or exceeds half of the total
frequency (N/2). Once the median class is determined, the median is calculated as
Median = L + ((N/2 − CF) / f) × h
where L is the lower boundary of the median class, CF is the cumulative frequency of the
classes before it, f is the frequency of the median class, and h is the class width.
2 (b) Find the median of the following data:

Given Data Table:
Classes 0-10 10-20 20-30 30-40 40-50
Frequency 2 12 22 8 6
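The worked answer is not recorded at this point, so the following is a reconstruction using the usual grouped-data median formula. The symbols L (lower boundary of the median class), CF (cumulative frequency before it), f (its frequency), and h (class width) are the conventional ones and are assumed here.

```latex
N = 2 + 12 + 22 + 8 + 6 = 50, \qquad \frac{N}{2} = 25 \\
\text{Cumulative frequencies: } 2,\ 14,\ 36,\ 44,\ 50
  \;\Rightarrow\; \text{median class} = 20\text{-}30 \\
\text{Median} = L + \frac{\frac{N}{2} - CF}{f} \times h
              = 20 + \frac{25 - 14}{22} \times 10
              = 20 + 5 = 25
```

The median class is 20-30 because 36 is the first cumulative frequency that reaches 25.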
2 (c) Draw the boxplot for the given data.

A boxplot represents the minimum, first quartile (Q1), median, third quartile (Q3), and maximum
of a dataset. Here, we'll plot it based on class intervals.
Drawing a Boxplot
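1. Minimum = 0
2. First Quartile (Q1) = 18.75 (N/4 = 12.5 falls in the 10-20 class: 10 + ((12.5 − 2) / 12) × 10)
3. Median (Q2) = 25 (as calculated above)
4. Third Quartile (Q3) ≈ 31.88 (3N/4 = 37.5 falls in the 30-40 class: 30 + ((37.5 − 36) / 8) × 10)
5. Maximum = 50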
Here is the boxplot based on the given class frequency data.
● The boxplot displays the distribution of values across the class intervals.
● The median (Q2) is represented at approximately 25, which we calculated previously.
● The range extends from the minimum to the maximum class boundaries in the dataset.
3 (a) Find all frequent item sets with minimum support threshold value of 2
using Apriori mining algorithm for the following TABLE-I.
Given Table:
TID List of Item IDs
T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3
Answer:
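No worked answer is recorded here, so the following is a sketch of how the Apriori steps (count, filter, join, prune, repeat) could be run on TABLE-I with a minimum support count of 2. The function name `apriori` and its structure are illustrative assumptions, not a reference implementation.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count support of the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent

# TABLE-I, transactions T100 to T900.
transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
frequent_itemsets = apriori(transactions, 2)
for itemset, count in sorted(frequent_itemsets.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Counting by hand gives the same result: the frequent 1-itemsets are I1 (6), I2 (7), I3 (6), I4 (2), I5 (2); the frequent 2-itemsets are {I1, I2} (4), {I1, I3} (4), {I1, I5} (2), {I2, I3} (4), {I2, I4} (2), {I2, I5} (2); and the frequent 3-itemsets are {I1, I2, I3} (2) and {I1, I2, I5} (2). No 4-itemset reaches a support of 2.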
3 (b) What is Multi-Dimensional Association Rules? Give example.
Answer:
Multi-Dimensional Association Rules are association rules that involve multiple dimensions
or attributes in the data. Unlike traditional association rules that focus on items within a single
transaction, multi-dimensional association rules look at associations across different attributes
or dimensions of the dataset.
Example:
In a retail store, consider a dataset with multiple dimensions, such as:
● Location: (e.g., City or Region)

● Time: (e.g., Day of the week or Month)
● Product: (e.g., Type of item purchased)
A multi-dimensional association rule could look like:
● "Customers in New York are more likely to buy winter clothing on weekends in
December."
Here, the rule links the dimensions of Location (New York), Time (Weekends, December),
and Product (Winter Clothing), showing an association across multiple dimensions.
4 (a) Mr. Hasan is a shopkeeper who sells different daily necessary items such as rice,
oil, soap, biscuits, and cold drinks. Cold drinks are normally sold more when the
temperature is higher. It is reported by Mr. Hasan that the hotter the day, the more the
cold drinks are purchased by the customers. He maintains the following database to
keep record of cold drinks in liters against the temperature of a day to predict his storage
of cold drinks that he should keep in his shop based on the temperature of a day shown
in Table II. Determine the relationship between temperature and the amount of cold
drinks purchased.
Given Table (Table II):
ID Temperature (°C) Cold Drinks (Liters)
1 25 15
2 28 17
3 18 16
4 22 20
5 25 28
6 21 23
7 30 29
8 10 2
Answer:
To determine the relationship between temperature and the amount of cold drinks purchased,
we can use a scatter plot to visualize any correlation and calculate the correlation coefficient.
1. Step 1: Plotting the Scatter Plot

By plotting Temperature (°C) on the x-axis and Cold Drinks (Liters) on the y-axis, we can
observe the trend between these two variables.
2. Step 2: Calculate the Correlation Coefficient
The correlation coefficient r measures the strength and direction of a linear relationship
between two variables:
r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² × Σ(y − ȳ)²)
● x: Temperature values
● y: Cold Drinks values
A positive r value indicates a positive correlation (as temperature increases, cold drink sales
increase).
Scatter Plot and Correlation Calculation
[Scatter plot: Temperature (°C) on the x-axis, Cold Drinks (Liters) on the y-axis]
For the data in Table II, the correlation coefficient is approximately 0.78, indicating a
strong positive correlation. This suggests that as the temperature increases, the sales of cold
drinks also increase.
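The reported coefficient can be checked with a few lines of code. This sketch uses the covariance form of the Pearson coefficient, r = cov(A, B) / (σ_A σ_B), with population standard deviations; the function name `correlation` is an illustrative choice.

```python
from statistics import mean, pstdev

def correlation(a, b):
    """Pearson coefficient as population covariance over the product of population std devs."""
    n = len(a)
    mean_a, mean_b = mean(a), mean(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    return covariance / (pstdev(a) * pstdev(b))

# Rows 1-8 of Table II as given in this question.
temperature = [25, 28, 18, 22, 25, 21, 30, 10]
cold_drinks = [15, 17, 16, 20, 28, 23, 29, 2]

r = correlation(temperature, cold_drinks)
print(round(r, 2))
```

The intermediate sums are Σ(x − x̄)(y − ȳ) = 295.75, Σ(x − x̄)² = 277.875, and Σ(y − ȳ)² = 515.5, which gives r ≈ 0.78.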
4 (b) Consider the following database containing five transactions in TABLE-III. Assume the
minimum support min_sup = 60%. Draw the FP tree.
Given Table (Table III):
TID Transaction
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E}
T300 {M, U, K, E}
T400 {C, O, K, I, E}
T500 {C, O, K, E}
Solution:
To draw the FP tree, we follow these steps:
1. Step 1: Calculate Support for Each Item

○ Determine the frequency of each item across all transactions.
2. Step 2: Discard Items Below Minimum Support
○ With min_sup = 60%, each item needs to appear in at least 3 transactions
(since 60% of 5 transactions = 3 transactions).
3. Step 3: Order Items by Frequency
○ Arrange items in descending order of their frequency.
4. Step 4: Construct the FP Tree
○ Add each transaction to the FP tree, following the ordered items based on
frequency.
Let's perform these steps and construct the FP tree.
The frequent items with a minimum support of 3 (60%) are:
● K: Appears in all 5 transactions
● E: Appears in all 5 transactions
● O: Appears in 4 transactions
Based on these frequencies, we will order the items as K > E > O.
Constructing the FP Tree:
Starting from an empty (null) root, we add each transaction with its frequent items in the order
K, E, O.
1. Transaction T100: {K, E, O}
○ Path: K → E → O, created with a count of 1 at each node.
2. Transaction T200: {K, E, O}
○ Path: Reuses K → E → O, incrementing counts along this path.
3. Transaction T300: {K, E}
○ Path: K → E, incrementing counts at K and E.
4. Transactions T400 and T500: {K, E, O}
○ Path: Both reuse K → E → O, incrementing counts along this path.
Here is the FP tree based on the given transactions:
● The root is an empty (null) node whose only child is K, with a count of 5, branching to E.
● E has a count of 5 and branches to O.
● O has a count of 4.
This tree structure shows the frequent patterns identified, with each node's count representing
the number of times that item (or itemset) appears in the transactions. This visualization makes
it easier to see the relationships and frequency of each item in the dataset based on the
minimum support threshold.
End
