
Data Mining and Analytics

1-mark questions:

1. Full form of KDD.

Ans. Knowledge discovery in database.

2. What is Data?

Ans. Data is a raw fact that can be stored.

3. What is Information?

Ans. Information is processed data that carries some meaningful sense to us.

4. What is pattern in data mining?

Ans. Patterns are regularities or rules that describe specific structures within the data.

5. Which term is used to represent left hand side of an association rule?

Ans. Antecedent.

6. Which term is used to represent right hand side of an association rule?

Ans. Consequent.

7. Apriori algorithm follows the downward closure property. True or False?

Ans. True. (The Apriori property states that every non-empty subset of a frequent itemset must also be frequent.)

8. Association rule expresses relationship in the form of: if-then. The statement is True or False.

Ans. True.

9. Association rule expresses relationship in the form of: but-yes. The statement is True or False.

Ans. False.
10.Which algorithm is used for pattern mining without candidate generation?

Ans. Frequent Pattern Growth Algorithm.


11.Which data structure is followed by FP-growth algorithm to generate
frequent pattern?

Ans. extended prefix-tree.

12. Name the criteria used to judge the interestingness of patterns.

Ans. A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test data with some degree of certainty, (3) potentially useful, and (4) novel.

13. ……… is an essential process where intelligent methods are applied to extract data patterns.

Ans. Data Mining.

14. The …………. data is stored in a data warehouse.

Ans. Metadata.

15.ETL stands for …………….

Ans. extract, transform, and load

16. The learning which is used to find hidden patterns from labelled data is called ………

Ans. Supervised

17.Which classification algorithm is called Lazy learner?

Ans. KNN

18. Full form of k-NN is …………

Ans. k-Nearest Neighbors.

19. Name of a binary classification algorithm is ………….

Ans. Logistic Regression, k-NN, Decision Trees.

20. Which clustering algorithm does not need any dendrogram?

Ans. K-means clustering.


21. Which clustering algorithm needs a dendrogram?

Ans. Hierarchical clustering.

22. Big data generally contains data of size …………

Ans. From petabyte (10^15) to exabyte (10^18).

23. Which technique helps in the discovery of patterns and valuable information from the database?

Ans. Data mining.

24. How many V's are there in big data?

Ans. 5 (volume, velocity, variety, veracity, and value).

25. Which ones are the features of cloud computing? A. Security B. Availability C. Large Network Access D. All of the above.

Ans. D.

26. Which one is not a data type in big data? A. Structured B. Unstructured C. Semi-structured D. Processed.

Ans. D.

27. Which of the following is an example of cloud? A. AWS B. Dropbox C. Cisco WebEx D. All of the above.

Ans. D.

Module 1 & Module 2

1. What is Data Mining?

Ans. Data mining is the process of sorting through large data sets to identify
patterns and relationships that can help solve business problems through data
analysis.
2. Define data warehouse. What is the purpose of it?

Ans. A data warehouse is a central repository that collects and manages data from different sources to support meaningful business decisions. It contains historical data.

Goals of Data Warehousing

 To support reporting as well as analysis.
 Maintain the organization's historical information.
 Be the foundation for decision making.

Need for Data Warehouse

 Business users: Business users require a data warehouse to view summarized data from the past.
 Store historical data: A data warehouse is required to store time-variant data from the past.
 Make strategic decisions: Some strategies may depend upon the data in the data warehouse.
 For data consistency and quality: By bringing data from different sources to a common place, the user can effectively achieve uniformity and consistency in the data.
 High response time: A data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and quick response time.

3.What are the key elements of a data warehouse? Explain each of them.

4. Describe the key steps in the data mining process. Why is it important to follow these steps?

The data mining process is divided into two parts, i.e. data preprocessing and data mining. Data preprocessing involves data cleaning, data integration, data reduction, and data transformation. The data mining part performs data mining, pattern evaluation and knowledge representation of data.

It is important because:

Data Cleaning: fills in missing data and removes noisy data.

Data Integration: improves the accuracy and speed of the data mining process.

Data Reduction: obtains relevant data for analysis from the collection of data.

Data Transformation: data is consolidated so that the mining process is more efficient and the patterns are easier to understand.

Data Mining: intelligent methods are applied to extract data patterns.

Pattern Evaluation: identifies the interesting patterns representing knowledge.

Knowledge Representation: data visualization and knowledge representation tools are used to present the mined data.

6. Why is data cleaning so important?

Ans. It is important because dirty data, if used directly in mining, can confuse the procedures and produce inaccurate results.

Basically, this step involves the removal of noisy or incomplete data from the collection. Several methods that clean data automatically are available, but they are not robust.

This step carries out the routine cleaning work by:

(i) Fill The Missing Data:

Missing data can be filled by methods such as:

 Ignoring the tuple.


 Filling the missing value manually.
 Using a measure of central tendency (the mean or median), or
 Filling in the most probable value.

(ii) Remove The Noisy Data: Noisy data is data containing random error or variance.

Methods to remove noise are:

Binning: Binning methods sort values into buckets or bins and smooth them by consulting the neighboring values. In smoothing by bin means, each value in a bin is replaced by the bin mean; in smoothing by bin medians, each value is replaced by the bin median; in smoothing by bin boundaries, the minimum and maximum values in a bin are taken as the bin boundaries and each value is replaced by the closest boundary value.

 Identifying the Outliers


 Resolving Inconsistencies

7. Define support, confidence and lift in Association rule mining. What are
the demerits of Apriori Algorithm?

Support refers to how often a given itemset or rule appears in the database being mined.

Confidence refers to the number of times a given rule turns out to be true in practice.

Lift is the ratio of the rule's confidence to the support of its consequent; it measures how much more often the antecedent and consequent occur together than would be expected if they were independent (lift > 1 indicates a positive association).


Demerits of the Apriori Algorithm:

 Apriori is an expensive method to find support, since the calculation has to pass through the whole database.
 Sometimes a huge number of candidate rules is needed, so it becomes computationally more expensive.

8. What is the importance of Association Rules in Data Mining?

Association rules are useful for analyzing and predicting customer behavior.
They play an important part in customer analytics, market basket analysis,
product clustering, catalog design and store layout. Programmers use
association rules to build programs capable of machine learning.

9. Find the cosine similarity and the dissimilarity between the 2 vectors- ‘X’ &
‘Y’ . X= {3, 2, 0, 5} and Y = {1, 0, 0, 0}
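No worked answer is given, so here is a quick numpy check, taking dissimilarity as 1 minus the cosine similarity (a common convention):

# Cosine similarity = (X·Y) / (|X| |Y|); dissimilarity = 1 - similarity.
import numpy as np

X = np.array([3, 2, 0, 5])
Y = np.array([1, 0, 0, 0])

cos_sim = X.dot(Y) / (np.linalg.norm(X) * np.linalg.norm(Y))  # 3/sqrt(38)
print(round(cos_sim, 3), round(1 - cos_sim, 3))  # ~0.487 and ~0.513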
10. For the following given Transaction Data set, generate rules using Apriori
Algorithm. Consider the values of support = 22% & Confidence = 70%.
11. Explain each step of KDD process in detail.

KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful, previously unknown, and potentially valuable information from large datasets. The KDD process in data mining typically involves the following steps:

 Selection: Select a relevant subset of the data for analysis.
 Pre-processing: Clean and transform the data to make it ready for analysis. This may include tasks such as data normalization, missing value handling, and data integration.
 Transformation: Transform the data into a format suitable for data mining, such as a matrix or a graph.
 Data Mining: Apply data mining techniques and algorithms to the data to extract useful information and insights. This may include tasks such as clustering, classification, association rule mining, and anomaly detection.
 Interpretation: Interpret the results and extract knowledge from the data. This may include tasks such as visualizing the results, evaluating the quality of the discovered patterns, and identifying relationships and associations among the data.
 Evaluation: Evaluate the results to ensure that the extracted knowledge is useful, accurate, and meaningful.
 Deployment: Use the discovered knowledge to solve the business problem and make decisions.
12. Differentiate among Enterprise Warehouse, Data mart and Virtual
warehouse.

13. Distinguish between OLTP and OLAP systems.


14. How is a data warehouse different from a database?

15. Explain Metadata in brief.

Metadata is data about the data, or documentation about the information, which is required by the users. In data warehousing, metadata is one of the essential aspects.

16. Define Data Lake. What is a Data Mart?

A data lake is a centralized repository that allows you to store all your
structured and unstructured data at any scale. You can store your data as-is,
without having to first structure the data, and run different types of analytics.

A data mart is a simple form of data warehouse focused on a single subject or line of business. With a data mart, teams can access data and gain insights faster, because they don't have to spend time searching within a more complex data warehouse or manually aggregating data from different sources.

17. Discuss the steps of the Apriori Algorithm for mining frequent itemsets.

Follow the steps shown in problem 10 and construct the answer accordingly.

18. Generate FP-Tree for the following Transaction dataset. [Min. Support
Count= 3]. Show the Conditional Pattern Base, Conditional FP-Tree and
Frequent Item set.

Class Notes

19. Define with suitable examples of each of the following data mining
functionalities: data characterization, data association and data
discrimination. Explain the architecture of a typical data mining system.

Data Characterization: the summarization of the general features of a target class of data. Example: summarizing the characteristics of customers who spend more than $5,000 a year.

Data Association: the discovery of rules showing attribute-value conditions that frequently occur together. Example: buys(bread) => buys(butter).

Data Discrimination: the comparison of the general features of a target class with those of one or more contrasting classes. Example: comparing customers who buy computer products regularly with those who rarely do.

1. What is ETL? Explain each of the terms clearly.

ETL stands for Extract, Transform, Load and it is a process used in data
warehousing to extract data from various sources, transform it into a format
suitable for loading into a data warehouse, and then load it into the
warehouse. The process of ETL can be broken down into the following three
stages:

 Extract: The first stage in the ETL process is to extract data from various
sources such as transactional systems, spreadsheets, and flat files. This step
involves reading data from the source systems and storing it in a staging
area.
 Transform: In this stage, the extracted data is transformed into a format
that is suitable for loading into the data warehouse. This may involve
cleaning and validating the data, converting data types, combining data
from multiple sources, and creating new data fields.
 Load: After the data is transformed, it is loaded into the data warehouse.
This step involves creating the physical data structures and loading the data
into the warehouse.
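The three stages can be sketched in a few lines of Python with pandas and SQLite; the file, table, and column names below (sales_raw.csv, warehouse.db, sales_fact, amount, date) are hypothetical:

# Minimal ETL sketch: extract from a CSV, transform with pandas,
# load into a SQLite 'warehouse' table. All names are hypothetical.
import sqlite3
import pandas as pd

# Extract: read raw data from the source into a staging DataFrame.
df = pd.read_csv("sales_raw.csv")

# Transform: clean, validate types, and derive a new field.
df = df.dropna(subset=["amount"])                # drop rows missing 'amount'
df["amount"] = df["amount"].astype(float)        # enforce a consistent type
df["year"] = pd.to_datetime(df["date"]).dt.year  # derive a new column

# Load: write the transformed data into the warehouse.
with sqlite3.connect("warehouse.db") as con:
    df.to_sql("sales_fact", con, if_exists="replace", index=False)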

2. Discuss the different phases of FP-tree growth algorithm.

The FP-Growth Algorithm is an alternative way to find frequent itemsets without candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The core of this method is the use of a special data structure named the frequent-pattern tree (FP-tree).

 First, it compresses the input database, creating an FP-tree instance to represent frequent items.
 After this first step, it divides the compressed database into a set of conditional databases, each associated with one frequent pattern.
 Finally, each such database is mined separately.
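As a sketch of how this looks in practice, the third-party mlxtend library (assuming it is installed, e.g. via pip install mlxtend) can mine frequent itemsets with FP-Growth directly, with no candidate generation step; the transaction list is hypothetical:

# FP-Growth on a tiny hypothetical transaction list using mlxtend.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["milk", "bread"], ["bread", "butter"],
                ["milk", "bread", "butter"], ["milk", "butter"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Mine all itemsets with support >= 0.5 from the internal FP-tree.
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))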

3. Explain Jaccard similarity index. Find the Jaccard similarity index and
Jaccard distance for the following data:
A = {0, 1, 2, 5, 6} B = {0, 2, 3, 4, 5, 7, 9}

The Jaccard Similarity Index is a measure of the similarity between two sets of
data.

The Jaccard similarity index is calculated as:

Jaccard Similarity = (number of observations in both sets) / (number of observations in either set)

Or, written in notation form:

J(A, B) = |A∩B| / |A∪B|

If two datasets share the exact same members, their Jaccard Similarity Index
will be 1. Conversely, if they have no members in common then their similarity
will be 0.
Observations in both sets (A∩B): {0, 2, 5}

Observations in either set (A∪B): {0, 1, 2, 3, 4, 5, 6, 7, 9}

Jaccard Similarity: J(A, B) = 3 / 9 ≈ 0.33

Jaccard Distance: 1 − 0.33 ≈ 0.67
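The same answer can be checked with Python sets:

# Jaccard similarity and distance for the sets from the question.
A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 4, 5, 7, 9}

jaccard = len(A & B) / len(A | B)  # |A∩B| / |A∪B| = 3/9
print(round(jaccard, 2), round(1 - jaccard, 2))  # 0.33 and 0.67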

6. Generate all Frequent Itemsets from the following transaction data given
minimum support = 0.3.

Find the Association Rules from the above frequent sets at minimum 50%
confidence.
Module 3 & Module 4:

1. Define decision tree.

A decision tree is a flowchart-like tree structure where each internal node denotes a test on a feature, each branch denotes a decision rule, and each leaf node denotes the outcome. It is a versatile supervised machine-learning algorithm used for both classification and regression problems, and it is one of the most powerful algorithms. It is also used in Random Forests, where trees are trained on different subsets of the training data, which makes Random Forest one of the most powerful algorithms in machine learning.

2. What are the advantages and disadvantages of the decision tree approach
over other approaches for data mining?

Advantages:

 Compared to other algorithms, decision trees require less effort for data preparation during pre-processing.
 A decision tree does not require normalization of data.
 A decision tree does not require scaling of data as well.
 Missing values in the data also do NOT affect the process of building a
decision tree to any considerable extent.
 A Decision tree model is very intuitive and easy to explain to technical
teams as well as stakeholders.

Disadvantages:

 A small change in the data can cause a large change in the structure of the decision tree, causing instability.
 For a decision tree, calculations can sometimes become far more complex compared to other algorithms.
 Decision trees often take more time to train the model.
 Decision tree training is relatively expensive, as the complexity and time taken are greater.
 The decision tree algorithm is inadequate for regression, i.e. predicting continuous values.

3. What is clustering? What are the different clustering techniques? Write some applications of cluster analysis.

Clustering is a method of data mining that groups similar data points together. The goal of cluster analysis is to divide a dataset into groups (or clusters) such that the data points within each group are more similar to each other than to data points in other groups.

Types of Clustering:

 Centroid-based Clustering.
 Density-based Clustering.
 Distribution-based Clustering.
 Hierarchical Clustering

Applications Of Cluster Analysis:

 It is widely used in image processing, data analysis, and pattern recognition.
 It helps marketers to find the distinct groups in their customer base and to characterize their customer groups using purchasing patterns.
 It can be used in the field of biology, by deriving animal and plant
taxonomies and identifying genes with the same capabilities.
 It also helps in information discovery by classifying documents on the web.

4. Define Entropy and Information Gain with suitable examples.

Entropy is the uncertainty or randomness in the data; the greater the randomness, the higher the entropy. Information gain uses entropy to make decisions: the lower the entropy, the more informative the data.

Information gain is used in decision trees and random forests to decide the best split. The greater the information gain, the better the split, which also means the lower the entropy after the split.

The entropy of a dataset before and after a split is used to calculate information gain.

Entropy is the measure of uncertainty in the data. The effort is to reduce the
entropy and maximize the information gain. The feature having the most
information is considered important by the algorithm and is used for training
the model.

By using Information gain you are actually using entropy.
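As a minimal illustration (the class counts below are hypothetical), entropy and the information gain of one binary split can be computed as follows:

# Entropy of a label list and information gain of a hypothetical split.
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

parent = ["yes"] * 9 + ["no"] * 5   # the dataset D
left   = ["yes"] * 6 + ["no"] * 2   # subset where attribute = value1
right  = ["yes"] * 3 + ["no"] * 3   # subset where attribute = value2

# gain = E(parent) - weighted average entropy of the subsets
gain = entropy(parent) \
     - (len(left) / len(parent)) * entropy(left) \
     - (len(right) / len(parent)) * entropy(right)
print(round(entropy(parent), 3), round(gain, 3))  # ~0.940 and ~0.048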

5. Define Classification and Prediction.

Classification is the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known. Its accuracy depends on finding the class label correctly.

Prediction is the process of identifying the missing or unavailable numerical data for a new observation. Its accuracy depends on how well a given predictor can guess the value of the predicted attribute for new data.

6. Describe K-medoids algorithm in brief.

Input: K, the number of medoids, and D, the dataset of n data objects.
Output: K clusters.
Step-1: Select any k objects from dataset D as the initial medoids.
Step-2: Assign each remaining object to the cluster with the nearest representative object (medoid).
Step-3: Randomly select a non-representative object O_random from the dataset.
Step-4: Compute the total cost (TC) of swapping a representative object O_j with O_random.
Step-5: If TC < 0, swap O_j with O_random to form the new set of representative objects.

7. Using the K-means clustering algorithm, determine 3 clusters for the following eight data points: A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9). The distance function is Euclidean distance. Do it for 3 iterations.

Repeat the procedure two more times.


8. Define Jaccard coefficient.

It's a measure of similarity for the two sets of data, with a range from 0% to
100%. The higher the percentage, the more similar the two populations.

9. Apply the K-means clustering for the following dataset for two clusters.
Consider data point S1 and S2 are the initial centroid of the respective
clusters. Continue the procedure for three iterations.

Same as problem 7.

10. What do you mean by attribute selection measure with respect to decision tree induction?

An attribute selection measure is a heuristic for choosing the splitting test that "best" separates a given data partition, D, of class-labeled training tuples into individual classes.

11. Write down the algorithm for K-means clustering.

Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids (they may be points other than those in the input dataset).
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid in each cluster.
Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, go to Step-4; else go to FINISH.
Step-7: The model is ready.
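These steps translate almost line for line into numpy; the sketch below (which, being a sketch, does not handle the edge case of an empty cluster) runs on the eight points from problem 7:

# Minimal k-means sketch following the steps above.
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]      # Step-2
    for _ in range(iters):
        # Step-3/5: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of its cluster
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# The eight points from problem 7:
X = np.array([[2,10],[2,5],[8,4],[5,8],[7,5],[6,4],[1,2],[4,9]])
labels, centroids = kmeans(X, k=3)
print(labels)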

12. What is hierarchical clustering technique?

A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then, it repeatedly executes the following steps:

 Identify the two clusters which are closest together, and
 Merge the two most similar (closest) clusters. These steps are continued until all the clusters are merged together.

13. Distinguish between partitional clustering and hierarchical clustering.

Partitional (non-hierarchical) clustering divides the dataset into a fixed number of non-overlapping clusters at a single level, typically requiring the number of clusters K in advance, whereas hierarchical clustering builds a nested tree of clusters (a dendrogram) and does not require K up front.

14. What is Classification and Clustering? Explain the key differences
between them.

In the case of Classification, there are predefined labels assigned to each input
instance according to their properties whereas in clustering those labels are
missing.

15. What is a classification problem? What is the difference between Supervised and Unsupervised Learning?

Classification problems are problems in which an object is to be classified into one of n classes based on the similarity of its features to those of each class.

In supervised learning the model is trained on labelled data, whereas in unsupervised learning the model must find hidden patterns in unlabelled data on its own.
16. Differentiate agglomerative hierarchical clustering and divisive
hierarchical clustering.
Agglomerative clustering works bottom-up (each point starts as its own cluster and clusters are merged), whereas divisive clustering works top-down (all points start in one cluster, which is split recursively).

Typical applications of agglomerative clustering: image segmentation, customer segmentation, social network analysis, etc.

Typical applications of divisive clustering: market segmentation, anomaly detection, biological classification, etc.

17. Explain the ID3 algorithm for Decision Trees.

1. The entropy of the dataset D having class labels yes or no is calculated as: E(D) = -[P(yes) log2 P(yes) + P(no) log2 P(no)]
2. Entropy is then calculated for each attribute. Assume dataset D has two attributes A1 and A2, and attribute A1 has two categories, part1 and part2.
3. E(A1=part1) = -[P(yes|part1) log2 P(yes|part1) + P(no|part1) log2 P(no|part1)]; E(A1=part2) is calculated similarly.
E(D, A1) = (|part1|/|D|)*E(A1=part1) + (|part2|/|D|)*E(A1=part2)
Gain(D, A1) = E(D) - E(D, A1)
The attribute with the highest gain becomes the decision node.
4. Database D is divided into smaller subsets according to the categories of the decision node.

Repeat steps 1 to 4 until every record is classified.

18. What is a dendrogram? Explain it with the help of an example.

A dendrogram is a tree or branch diagram that visually shows the relationship between similar objects. Each branch of the tree represents a category or class, while the entire tree diagram shows the hierarchical relationship between all the classes or branches. The objects in a certain category or class share similar features or characteristics and are referred to as clusters. A cluster is a group of objects that have something in common with each other. The process of organizing these objects into classes or clusters is called clustering.
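As an example, a dendrogram can be produced with scipy's hierarchical clustering utilities; the five 2-D points below are hypothetical:

# Build and plot a single-linkage dendrogram for a few toy points.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 2], [2, 2], [8, 8], [8, 9], [5, 5]])
Z = linkage(X, method="single")  # agglomerative, single linkage
dendrogram(Z)
plt.show()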
19. Define Euclidean and Manhattan distance metric.

Euclidean distance is the length of the shortest path between the source and the destination, i.e. a straight line.
Manhattan distance is the sum of the absolute differences of the coordinates between the source (s) and the destination (d); it corresponds to travelling along axis-parallel (grid-like) straight segments.
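For two example points, both metrics are one line of numpy each:

# Euclidean vs. Manhattan distance for the points (1, 2) and (4, 6).
import numpy as np

x, y = np.array([1, 2]), np.array([4, 6])
euclidean = np.sqrt(((x - y) ** 2).sum())  # straight-line distance: 5.0
manhattan = np.abs(x - y).sum()            # sum of absolute differences: 7
print(euclidean, manhattan)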

20. What is a centroid point in K-means clustering?

In K-means, each cluster is represented by its center (called a "centroid"), which corresponds to the arithmetic mean of the data points assigned to the cluster. A centroid is a data point that represents the center of the cluster (the mean), and it might not necessarily be a member of the dataset.

22. Use single and complete linkage agglomerative clustering to group the
data described by the following distance matrix. Show the dendrograms.
23. How does agglomerative hierarchical clustering work?

It is a bottom-up approach.

1. Consider every data point as an individual cluster.
2. Calculate the similarity of one cluster w.r.t. all other clusters.
3. Merge the clusters with the highest similarity.
4. Recalculate the similarity for each cluster once again.
5. Repeat steps 3 & 4 until a single cluster is obtained.

There are 3 methods of the agglomerative approach:

i. Single linkage method: the distance between one cluster and another cluster is taken to be equal to the shortest distance from any member of one cluster to any member of the other cluster.
ii. Complete linkage method: the distance between one cluster and another cluster is taken to be equal to the greatest distance from any member of one cluster to any member of the other cluster.
iii. Average linkage method: the distance between one cluster and another cluster is taken to be equal to the average distance from any member of one cluster to any member of the other cluster.
24. How does divisive hierarchical clustering works?

24. How does divisive hierarchical clustering work?

The divisive clustering algorithm is a top-down clustering approach: initially, all the points in the dataset belong to one cluster, and splits are performed recursively as one moves down the hierarchy.

Steps of Divisive Clustering:

1. Initially, all points in the dataset belong to one single cluster.
2. Partition the cluster into the two least similar clusters.
3. Proceed recursively to form new clusters until the desired number of clusters is obtained.

25. What is a regression model?

A regression model provides a function that describes the relationship between one or more independent variables and a response, dependent, or target variable.

26. What are the different types of regression?

1. Linear: A linear regression is a model where the relationship between inputs and outputs is a straight line.
2. Multiple: Multiple regression indicates that there is more than one input variable that may affect the outcome, or target variable.
3. Non-Linear: The relationship between inputs and outputs is a curve rather than a straight line. For example, a model of responses to emails may initially show a positive relationship, but as the number of emails increases, the curve flattens out and becomes almost constant.

27.Explain simple linear regression.

A linear regression is a model where the relationship between inputs and outputs is a straight line.

One example may be the number of responses to a marketing campaign. If we send 1,000 emails, we may get five responses. If this relationship can be modeled using a linear regression, we would expect to get ten responses when we send 2,000 emails. Your chart may vary, but the general idea is that we associate a predictor and a target, and we assume a relationship between the two.

In other words, if the linear model fits our observations well enough, then we
can estimate that the more emails we send, the more responses we will get.
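A least-squares fit of this example can be sketched with numpy's polyfit; the two extra data points below simply extend the stated five-responses-per-1,000-emails proportion and are assumptions:

# Fit responses = slope * emails + intercept and predict a new value.
import numpy as np

emails    = np.array([1000, 2000, 3000, 4000])
responses = np.array([5, 10, 15, 20])

slope, intercept = np.polyfit(emails, responses, 1)  # degree-1 fit
print(slope * 2500 + intercept)  # predicted responses for 2,500 emails: 12.5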

28. Explain multiple linear regression.

Multiple regression indicates that there is more than one input variable that may affect the outcome, or target variable. For our email campaign example, you may include an additional variable with the number of emails sent in the last month.

By looking at both input variables, a clearer picture starts to emerge about what drives users to respond to a campaign and how to optimize email timing and frequency. While conceptualizing the model becomes more complex with more inputs, the relationship may continue to be linear.

For these models, it is important to understand exactly what effect each input
has and how they combine to produce the final target variable results.
29. Using the data given in the dataset shown below, create a regression model to predict the Test2 score from the Test1 score. Then predict the score for the one who got a 46 in Test1.
30. Marks obtained by 12 students in the college test (x) and the university
test (y) are as follows:

Construct the regression line that approximates the data set. What is your
estimate of the marks a student could have in the university test if he
obtained 60 marks in the college test but was ill at the time of the university
test?

31. List down the advantages of the Decision Trees.

 Decision trees are able to generate understandable rules.
 Decision trees perform classification without requiring much computation.
 Decision trees are able to handle both continuous and categorical variables.
 Decision trees provide a clear indication of which fields are most important for prediction or classification.
 Ease of use.
 Scalability: Decision trees can handle large datasets and can be easily
parallelized to improve processing time.
 Missing value tolerance
 Handling non-linear relationships
 Ability to handle imbalanced data: Decision trees can handle imbalanced
datasets, where one class is heavily represented compared to the others.

32. List down the disadvantages of the Decision Trees.

 Decision trees are less appropriate for estimation tasks where the goal is to
predict the value of a continuous attribute.
 Decision trees are prone to errors in classification problems with many
classes and a relatively small number of training examples.
 Decision trees can be computationally expensive to train.
 Decision trees are prone to overfitting the training data, particularly when
the tree is very deep or complex.
 Small variations in the training data can result in different decision trees
being generated.
 Many decision tree algorithms do not handle missing data well, and require
imputation or deletion of records with missing values.
 The initial splitting criteria used in decision tree algorithms can lead to
biased trees, particularly when dealing with unbalanced datasets or rare
classes.
 Decision trees are limited in their ability to represent complex relationships
between variables.

33. Create a decision tree for the following data given below. The objective is
to predict the class category (Play Tennis or not?).

34. Write down the kNN Algorithm.

 Select the value of k.
 Calculate the distance between the new instance and all the stored samples.
 Take the k nearest neighbors as per the calculated distances.
 Among these k neighbors, count the number of data points in each category.
 Assign the new data point to the category with the maximum number of neighbors.
 The model is ready.
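The algorithm above fits in a few lines of numpy; the training points and the query point below are hypothetical:

# Minimal kNN classifier following the steps above.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)  # distance to all samples
    nearest = np.argsort(dists)[:k]                  # indices of k nearest
    # Majority vote among the k neighbors' categories.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))  # -> A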

35. Why is kNN known as a lazy learning and non-parametric algorithm?

Because it does no training at all when you supply the training data. At training
time, all it is doing is storing the complete data set but it does not do any
calculations at this point. Neither does it try to derive a more compact model
from the data which it could use for scoring. Therefore, we call this algorithm
lazy.

kNN is non-parametric because the number of model parameters effectively grows with the training set: you can imagine each training instance as a "parameter" of the model, because the training instances are what the model uses during prediction.

36. List down the advantages and disadvantages of kNN algorithm.

Advantages:

 It's easy to understand and simple to implement.
 It can be used for both classification and regression problems.
 It's ideal for non-linear data since there's no assumption about the underlying data.
 It can naturally handle multi-class cases.
 It can perform well with enough representative data.

Disadvantages:

 The associated computation cost is high, as it stores all the training data.
 Requires high memory storage.
 Need to determine the value of k.
 Prediction is slow if the value of N (the dataset size) is high.
 Sensitive to irrelevant features.

39. Apply the data set of question 33 for the Naïve Bayes Classification also.

40. What is SVM? Applications of SVM.

SVM is a supervised machine learning algorithm used for both classification and regression, though it is best suited for classification. The objective of the SVM algorithm is to find a hyperplane in an N-dimensional space that distinctly classifies the data points. The dimension of the hyperplane depends upon the number of features: if the number of input features is two, the hyperplane is just a line; if the number of input features is three, the hyperplane becomes a two-dimensional plane. Typical applications of SVM include face detection, text and image classification, and handwriting recognition.
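As a sketch (assuming scikit-learn is installed, and using a hypothetical toy dataset), a linear SVM is fitted like this:

# Linear SVM on a toy 2-D dataset; in 2-D the hyperplane is a line.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [6, 5], [7, 7], [6, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")  # finds the maximum-margin separating line
clf.fit(X, y)
print(clf.predict([[2, 2], [6, 7]]))  # -> [0 1]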
Module 5

1. What do you understand by 'Secular Trend' in the analysis of a time series? Explain with examples.

Secular Trend: A secular trend reflects changes that occur as a result of the general tendency of the data to increase or decrease over a long period. For example, the sales record of a product may increase or decrease following the general tendency of common people; this long-term movement is known as the secular trend.
Time series relating to economics, business, and commerce may show an upward or increasing tendency, whereas time series relating to death rates, birth rates, share prices, etc. may show a downward or decreasing tendency.

2. What is time series data? Explain with an example. Explain the different components of time series data.

A time series is a set of observations taken at specific times, usually at equal intervals.

Planning for the future is an important aspect of any working organization, and it can be done by analyzing time series data. The long run of any organization depends on how well the business manager can predict or forecast the future trend, and the future trend can be predicted by analyzing time series data.

Different components of time series data:

1. Secular Trend: A secular trend reflects changes that occur as a result of the general tendency of the data to increase or decrease over a long period. The sales record of any product may increase or decrease due to the general tendency of common people, which is known as the secular trend.

2. Seasonal Movement: The seasonal movement represents a type of periodic movement where the period is not longer than one year. In seasonal movement, changes that take place during a period of one year may be the result of changes in climate.

3. Cyclic Movement: Cyclic fluctuation is another type of periodic movement where the period considered is more than a year. Such movements are quite regular in any kind of business or industry. A cycle has 4 stages: prosperity, decline, depression, and recovery.

4. Irregular Movement: In irregular movement, fluctuations may happen due to some unpredictable reason such as an earthquake, a strike or lockout, a flood, or COVID-19.

3. Mention the merits and demerits of Moving Average Method & Semi
Average Method.

Advantages of moving averages:

 Moving averages provide more perspective on prices.


 Moving averages smooth out the noise.
 They can help during volatile markets.

Disadvantages of moving averages:

 Requires maintaining a history of different time periods for each forecasted period.
 Often overlooks complex relationships present in the data.
 Does not respond to fluctuations that take place for a reason, for example cyclic and seasonal impacts.

Advantages of the semi-average method:

 This method is simple to understand as compared to other methods of measuring secular trends.
 Everyone who applies this method will get the same result.

Disadvantages of the semi-average method:

 The method assumes a straight-line relationship between the plotted points, without considering whether that relationship actually exists.
 If we add more data to the original data, then we have to repeat the complete process for the new data to get the trend values, and the trend line also changes.
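A three-period (e.g. three-year) moving average of the kind asked for in problem 5 can be sketched with pandas; the series values below are hypothetical:

# Centered three-period moving average of a toy production series.
import pandas as pd

production = pd.Series([21, 22, 23, 25, 24, 22, 25, 26, 27])
trend = production.rolling(window=3, center=True).mean()
print(trend)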
4. Distinguish between ‘seasonal’ and ‘cyclical’ fluctuations in time series
data.

If the fluctuations are not of a fixed frequency then they are cyclic; if the
frequency is unchanging and associated with some aspect of the calendar, then
the pattern is seasonal.

5. Find the trend for the following series using a three-year moving average.
7. Fit a straight-line trend equation by the method of least squares from the
following data and then estimate the trend value for the year 2025.
Example problem with a different dataset.
8. Assuming a four-yearly cycle, calculate the trend by the method of moving
averages from the following data relating to the production of tea in India:

9. Using 1964 as the origin, obtain a straight-line trend equation by the method of least squares. Find the trend value of the missing year 1961.

Same as problem 7; just put x = 1961 in the final step and calculate the answer.

10. What is machine learning? Name some applications of machine learning techniques. What are the different machine learning techniques available?

In the real world, we are surrounded by humans who can learn everything from their experiences with their learning capability, and we have computers or machines which work on our instructions. But can a machine also learn from experiences or past data like a human does? Here comes the role of Machine Learning. Machine Learning is a subset of artificial intelligence.
Applications:

Image recognition, speech recognition, traffic prediction, product recommendations, self-driving cars, email spam and malware filtering, online fraud detection, virtual personal assistants, stock market trading.

ML techniques: supervised, unsupervised, and reinforcement learning.

11. What is cloud computing? What are the benefits of cloud computing?
What are the different layers in cloud computing?

The term cloud refers to a network or the internet. It is a technology that uses
remote servers on the internet to store, manage, and access data online rather
than local drives. The data can be anything such as files, images, documents,
audio, video, and more.

Benefits:

Agility, high availability and reliability, high scalability, multi-sharing, device and location independence, maintenance, low cost, and service in a pay-per-use mode.

The cloud computing layers that are available: infrastructure as a service (IaaS),
platform as a service (PaaS), and software as a service (SaaS).
