Q (1-8)

The document outlines a series of tasks involving data classification and clustering using various algorithms on randomly generated and existing datasets. It includes generating random values, classifying them using KNN, WKNN, and radius-based NNC, as well as clustering using Leader clustering and calculating purity values. Additionally, it covers tasks involving the Digits and Olivetti Face datasets, focusing on classification accuracy and dimensionality reduction techniques.


Q1. Randomly generate 100 values of x in the range [0,1]. Let them be x1, x2, ..., x100. Perform the following based on the data set generated.
a. Label the first 50 points {x1, ..., x50} as follows: if xi < 0.5, then xi ∈ Class1, else xi ∈ Class2.
b. Classification.
i. Classify the remaining points, x51, ..., x100, using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30.
ii. Classify the remaining points, x51, ..., x100, using WKNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30.
iii. Classify the remaining points, x51, ..., x100, using radius-based NNC. Perform this for k = 1, 2, 3, 4, 5, 20, 30.
c. Compute the classification accuracy in all three cases and report. [Note: Classification accuracy = nc/n, where n = 50 and nc = the number of points correctly classified.]
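A minimal sketch of one way to set this up with scikit-learn is given below. The seed, the weighting scheme used for WKNN (weights='distance'), and the radius values for part iii are assumptions; the question lists k values for the radius-based case, but sklearn's RadiusNeighborsClassifier is parameterised by a radius rather than k.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

rng = np.random.default_rng(0)                     # fixed seed, assumed for reproducibility
x = rng.uniform(0, 1, 100).reshape(-1, 1)          # x1, ..., x100
X_train, X_test = x[:50], x[50:]
y_train = np.where(X_train.ravel() < 0.5, 1, 2)    # Class1 / Class2 labels for x1..x50
y_true = np.where(X_test.ravel() < 0.5, 1, 2)      # ground truth for x51..x100

for k in [1, 2, 3, 4, 5, 20, 30]:
    for name, clf in [('KNN', KNeighborsClassifier(n_neighbors=k)),
                      ('WKNN', KNeighborsClassifier(n_neighbors=k, weights='distance'))]:
        clf.fit(X_train, y_train)
        acc = (clf.predict(X_test) == y_true).mean()   # nc/n with n = 50
        print(f'{name} k={k}: accuracy = {acc:.2f}')

for r in [0.05, 0.1, 0.2]:                         # radius values are an assumption
    rnn = RadiusNeighborsClassifier(radius=r, outlier_label='most_frequent')
    rnn.fit(X_train, y_train)
    print(f'radius-NNC r={r}: accuracy = {(rnn.predict(X_test) == y_true).mean():.2f}')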
Output:
Q2. Cluster the entire set of 100 points (as mentioned in Q1) using Leader clustering. Choose
different values for threshold (T) and carry out the clustering.
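Leader clustering is not available in scikit-learn, so a small hand-rolled sketch is shown below; the threshold values tried are assumptions.

import numpy as np

def leader_clustering(points, threshold):
    # Single pass: assign each point to the first leader within `threshold`;
    # otherwise the point becomes a new leader.
    leaders, labels = [], []
    for p in points:
        for idx, ld in enumerate(leaders):
            if abs(p - ld) <= threshold:
                labels.append(idx)
                break
        else:
            leaders.append(p)
            labels.append(len(leaders) - 1)
    return leaders, labels

x = np.random.default_rng(0).uniform(0, 1, 100)    # the 100 points from Q1
for T in [0.05, 0.1, 0.2, 0.3]:                    # assumed threshold values
    leaders, labels = leader_clustering(x, T)
    print(f'T={T}: {len(leaders)} clusters')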

Output:
Q3. Let the clustering obtained using some threshold T_i be C_i = {Cluster_1, Cluster_2, ..., Cluster_{l_i}}. Compute the purity value for each clustering, which is given by

$\mathrm{Purity}(C_i) = \sum_{j=1}^{l_i} \max\left(|\mathrm{Cluster}_j \cap \mathrm{Class}_1|,\; |\mathrm{Cluster}_j \cap \mathrm{Class}_2|\right)$

where Class1 and Class2 are the labels assigned in Q1.
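A small helper that computes the purity as defined above might look as follows; it assumes cluster labels from Q2 and the two class labels from Q1, and the variable names are hypothetical.

import numpy as np

def purity(cluster_labels, class_labels):
    # Sum, over clusters, of the size of the largest class inside each cluster.
    # Divide by len(class_labels) if a normalised purity in [0, 1] is wanted.
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    return sum(np.bincount(class_labels[cluster_labels == c]).max()
               for c in np.unique(cluster_labels))

# e.g. with x and labels from Q1/Q2:
# print(purity(labels, np.where(x < 0.5, 1, 2)))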

Output:
Q4. Use the Digits data set available under sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html. Consider 10% of the data for training (179 samples). Each pattern is an 8 x 8-sized character where each value is an integer in the range 0 to 16. Convert it into binary form by replacing a value below 8 by 0 and other values (≥ 8) by 1.
a. Use these 179 patterns with labels and the remaining without labels for this subtask. Use KNN
and label the patterns without labels. Obtain the % classification accuracy. Perform this task with
k values from the set {1, 3, 5, 10, 20}.
b. Obtain the frequent itemsets for these 179 patterns using FP-growth by viewing each binary
pattern as a transaction of 64 items. Repeat this task with different minsup values from {0.1, 0.3,
0.5, 0.7}.
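One possible sketch for both subtasks follows. It assumes the mlxtend package for FP-growth (scikit-learn has no FP-growth implementation), and the stratified split is also an assumption.

import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.frequent_patterns import fpgrowth     # assumes mlxtend is installed

X, y = load_digits(return_X_y=True)
X_bin = (X >= 8).astype(int)                       # binarise: < 8 -> 0, >= 8 -> 1
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bin, y, train_size=179, stratify=y, random_state=0)

# (a) KNN on the binary patterns
for k in [1, 3, 5, 10, 20]:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f'k={k}: {100 * clf.score(X_te, y_te):.1f}% accuracy')

# (b) FP-growth, viewing each binary pattern as a transaction of 64 items
df = pd.DataFrame(X_tr.astype(bool), columns=[f'pix{i}' for i in range(64)])
for minsup in [0.1, 0.3, 0.5, 0.7]:
    print(f'minsup={minsup}: {len(fpgrowth(df, min_support=minsup))} frequent itemsets')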
Output:
Q5. Download the Olivetti Face data set. There are 40 classes (corresponding to 40 people), each class having 10 faces of the individual; so there are 400 images in total. Here, each face is viewed as an image of size 64 x 64 (= 4096) pixels; each pixel has values 0 to 255, which are ultimately converted into floating-point numbers in the range [0,1]. Visit https://scikit-learn.org/0.19/datasets/olivetti_faces.html for more details. Your Tasks: There are three tasks. For all the tasks, split the data set into train and test parts. Carry out this splitting randomly 10 times and report the average accuracy. You may vary the test and train data set sizes. The tasks are:
a. Task 1: Build a decision tree using the training data. Tune the parameters corresponding to pruning the decision tree. Use the best decision tree to classify the test data set and obtain the accuracy. Use both Gini and entropy impurities.
b. Task 2: Build a random forest classifier using the training data set. Use RF with 50 decision trees. Obtain the classification accuracy on the test data with the number of features as 20%, 40% and 60% of the given set of features.
c. Task 3: Use the XGBoost classifier to classify by viewing the entire data set as the training data set. Find out the accuracy on the data set using 50 and 100 trees.
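A condensed sketch of the three tasks follows. The 75/25 split ratio, the ccp_alpha grid used for pruning, and evaluating XGBoost on its own training data (a literal reading of Task 3) are assumptions.

import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier                  # assumes xgboost is installed

X, y = fetch_olivetti_faces(return_X_y=True)       # 400 x 4096, labels 0..39
tree_acc, rf_acc = [], []
for seed in range(10):                             # 10 random splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)

    # Task 1: tune pruning (cost-complexity alpha) with both impurities
    grid = GridSearchCV(DecisionTreeClassifier(random_state=seed),
                        {'criterion': ['gini', 'entropy'],
                         'ccp_alpha': [0.0, 0.001, 0.01]}, cv=3)
    tree_acc.append(grid.fit(X_tr, y_tr).score(X_te, y_te))

    # Task 2: RF with 50 trees; repeat with max_features = 0.2, 0.4, 0.6
    rf = RandomForestClassifier(n_estimators=50, max_features=0.2,
                                random_state=seed).fit(X_tr, y_tr)
    rf_acc.append(rf.score(X_te, y_te))

print('tree:', np.mean(tree_acc), 'rf:', np.mean(rf_acc))

# Task 3: XGBoost with the entire data set as the training data
for n in [50, 100]:
    xgb = XGBClassifier(n_estimators=n).fit(X, y)
    print(n, (xgb.predict(X) == y).mean())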
Output:
Q6. Download the Olivetti Face data set. There are 40 classes (corresponding to 40 people), each class having 10 faces of the individual; so there are a total of 400 images. Here, each face is viewed as an image of size 64 x 64 (= 4096) pixels; each pixel has values 0 to 255, which are ultimately converted into floating-point numbers in the range [0,1]. Visit https://scikit-learn.org/0.19/datasets/olivetti_faces.html for more details. Split the data set into train and test parts. Perform this splitting randomly 10 times and report the average accuracy. You may vary the test and train data set sizes. Use NBC (the naive Bayes classifier) to classify the test data set. Obtain the accuracy on the test data.
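A minimal sketch, assuming Gaussian naive Bayes (the question says only "NBC") and a 75/25 split:

import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB         # Gaussian NB is an assumption

X, y = fetch_olivetti_faces(return_X_y=True)
accs = []
for seed in range(10):                             # 10 random splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)
    accs.append(GaussianNB().fit(X_tr, y_tr).score(X_te, y_te))
print(f'average accuracy over 10 splits: {np.mean(accs):.3f}')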

Output:
Q7. Use the Wisconsin Breast Cancer data set available under sklearn. There are 569 samples corresponding to two classes. Each is a 30-dimensional vector. For more details, visit https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html. There are two tasks. For both tasks, split the data set into train and test parts using train_size = 0.8. Perform this splitting randomly 10 times and report the average accuracy. The tasks are:
a. Task 1: Here, you are supposed to reduce the dimensionality of the data set by clustering the 30 features into 12, 20 and 30 clusters obtained using the k-means algorithm. Note that the resulting feature values are obtained by the centroids of the K clusters in each case. Compute the percentage accuracy using a Gaussian naïve Bayes classifier on the test data. So, the resulting training data set is of size 455 × K and the test data set is of size 114 × K.
b. Task 2: Repeat the task in (a) using k-means++ in place of the k-means algorithm.
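One way to read Task 1 is: treat each of the 30 feature columns as a point (in the 455-dimensional training space), cluster those points with k-means, and replace each group of features by its centroid, i.e. the mean of the member columns. A sketch under that interpretation:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

def group_features(M, labels, K):
    # New feature j = mean of the original columns assigned to cluster j
    return np.column_stack([M[:, labels == j].mean(axis=1) for j in range(K)])

for K in [12, 20, 30]:
    accs = []
    for seed in range(10):                         # 10 random 80/20 splits
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=0.8, random_state=seed)
        km = KMeans(n_clusters=K, init='random', n_init=10,
                    random_state=seed).fit(X_tr.T)   # cluster the 30 columns
        Z_tr = group_features(X_tr, km.labels_, K)   # 455 x K
        Z_te = group_features(X_te, km.labels_, K)   # 114 x K
        accs.append(GaussianNB().fit(Z_tr, y_tr).score(Z_te, y_te))
    print(f'K={K}: {100 * np.mean(accs):.1f}%')

For Task 2, the same call with init='k-means++' (scikit-learn's default initialisation) replaces init='random'.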
Output:
Q8. Download the Olivetti Face data set. There are 40 classes (corresponding to 40 people), each class having 10 faces of the individual; so there are 400 images in total. Here, each face is viewed as an image of size 64 x 64 (= 4096) pixels; each pixel has values 0 to 255, which are ultimately converted into floating-point numbers in the range [0,1]. Visit https://scikit-learn.org/0.19/datasets/olivetti_faces.html for more details. Your Tasks: There are two subtasks. For both subtasks, split the data set into train and test parts. Vary the test size. The tasks are:
a. Task 1: Reduce the dimensionality of the data set from 4096 to 400 using PCA and classify the 400-dimensional data set using perceptron, SVM, logistic regression and MLP.
b. Task 2: Repeat Task 1 using SVs (singular vectors) instead of PCs (principal components).
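A sketch of Task 1 follows. Note that PCA cannot produce more components than training samples, so with a 320-image training set the 400 requested components are capped; the classifier hyperparameters are assumptions. For Task 2, TruncatedSVD(n_components=p) can be swapped in for PCA to project onto singular vectors.

import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = fetch_olivetti_faces(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)   # vary test_size as asked

p = min(400, X_tr.shape[0])              # PCA rank is limited by the sample count
reducer = PCA(n_components=p).fit(X_tr)  # Task 2: TruncatedSVD(n_components=p)
Z_tr, Z_te = reducer.transform(X_tr), reducer.transform(X_te)

classifiers = {
    'perceptron': Perceptron(),
    'svm': SVC(kernel='linear'),                   # linear kernel is an assumption
    'logreg': LogisticRegression(max_iter=1000),
    'mlp': MLPClassifier(hidden_layer_sizes=(100,), max_iter=500),
}
for name, clf in classifiers.items():
    print(f'{name}: {100 * clf.fit(Z_tr, y_tr).score(Z_te, y_te):.1f}%')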
Output:
