Assignment 3

This document provides instructions for Assignment 3, which involves exploring a dataset using clustering and outlier detection techniques. Students are to: 1) Select a dataset from two options and describe it in 10 sentences. 2) Perform data preprocessing, visualization, and analysis including k-Means clustering and farthestFirst clustering. 3) Detect outliers using Local Outlier Factor and Isolation Forest, compare results, and discuss findings. The assignment must be submitted as a zip folder containing required files and following a specified naming convention. It will be graded on completeness, explanations, and professional report structure and style.

Uploaded by

Erick Menjivar

CST8390 Assignment 3

Due: July 15, 2022 at 11:59 PM Sharp!!!


(Late submissions will not be accepted)

Goal: The goal of this lab is to explore and analyze one dataset from the given list,
perform clustering using k-Means and farthestFirst, and perform outlier detection
using Local Outlier Factor and Isolation Forest.
Steps:
Select one dataset from the list below:

• Dataset 1 – Glass
o https://archive.ics.uci.edu/ml/datasets/Glass+Identification
o http://odds.cs.stonybrook.edu/glass-data/
o This dataset contains attributes describing several glass types (multi-class).
Here, class 6 is a clear minority class, so points of class 6 should be
marked as outliers, while all other points are inliers. For outlier detection,
you need to create a column named Outlier and mark class 6 instances as
Yes and all other instances as No. After this, remove the class attribute.
• Dataset 2 – Lymphography
o https://archive.ics.uci.edu/ml/datasets/Lymphography
o http://odds.cs.stonybrook.edu/lympho/
o It is a multi-class dataset with four classes, but two of them are quite small (2 and 4
data records). Therefore, those two small classes should be merged and considered
as outliers compared to the other two large classes (81 and 61 data records). For outlier
detection, you need to create a column named Outlier and mark instances of the smaller
classes as Yes and all other instances as No. After this, remove the class attribute.
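The relabelling described for both datasets can be illustrated with a short Python sketch. This is only one possible way to do it, not a required part of the assignment: the row layout, the column name `class`, the helper name `add_outlier_column`, and the toy values are all assumptions made for the example. The same step can be done by hand in a spreadsheet or with Weka filters.

```python
def add_outlier_column(rows, class_key, outlier_classes):
    """Return new rows with an 'Outlier' column and without the class column."""
    relabelled = []
    for row in rows:
        # Copy every attribute except the class label.
        new_row = {k: v for k, v in row.items() if k != class_key}
        # Minority-class instances become "Yes", everything else "No".
        new_row["Outlier"] = "Yes" if row[class_key] in outlier_classes else "No"
        relabelled.append(new_row)
    return relabelled


# Toy rows with made-up values (not taken from the real Glass data).
sample = [
    {"RI": 1.52, "Na": 13.6, "class": 1},
    {"RI": 1.51, "Na": 14.9, "class": 6},
    {"RI": 1.52, "Na": 13.0, "class": 2},
]
glass_rows = add_outlier_column(sample, class_key="class", outlier_classes={6})
for row in glass_rows:
    print(row)
```

For Lymphography, the same hypothetical helper would be called with the labels of the two small classes in `outlier_classes`.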

Data Understanding
You have to include a brief description (10 sentences) of the selected dataset. From the
papers given with the dataset, you may be able to find the performance of some clustering
and outlier detection methods applied to those datasets. If so, include that in the
description as well. Analyze your data thoroughly so that you clearly understand the
attributes and their types. Tabulate the attributes, their descriptions (if available), and their data types.
Data Preprocessing
Load your file into Weka. Double check the types of your attributes in Weka. If they are not as
expected, apply filters to convert them to the right types. Tabulate statistics and counts
(whichever apply) for each attribute. Provide that information in one table. Perform data cleaning:
remove duplicates, handle missing values, etc. Specify which filters you applied and the
reason for each. Now, navigate to the Visualize tab to visualize your data. Include 3 interesting
charts in your submission. You need to explain why those charts are interesting (for example, the
data form clusters, the classes are separable, or the classes overlap heavily). You need
to compare the attributes on your x and y axes and their impact on the class attribute.
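The assignment expects these preprocessing steps to be carried out in Weka (for example with its RemoveDuplicates and ReplaceMissingValues filters). Purely as an illustration of what such cleaning and per-attribute statistics amount to, here is a minimal Python sketch. It assumes all attributes are numeric, that missing values are represented as `None`, and that mean imputation is an acceptable way to handle them; the function names and the toy table are hypothetical.

```python
from statistics import mean


def remove_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_mean(rows):
    """Replace None in each numeric column with that column's mean."""
    columns = list(zip(*rows))
    means = [mean(v for v in col if v is not None) for col in columns]
    return [
        [means[i] if v is None else v for i, v in enumerate(row)]
        for row in rows
    ]


def summarize(rows, names):
    """Return {attribute: (min, max, mean)} for a table of numeric rows."""
    return {
        name: (min(col), max(col), mean(col))
        for name, col in zip(names, zip(*rows))
    }


raw = [
    [1.0, 10.0],
    [1.0, 10.0],  # exact duplicate of the first row
    [3.0, None],  # missing value in the second attribute
    [5.0, 30.0],
]
clean = impute_mean(remove_duplicates(raw))
stats = summarize(clean, ["a", "b"])
for name, (lo, hi, avg) in stats.items():
    print(f"{name}: min={lo} max={hi} mean={avg}")
```

Whatever cleaning is chosen, the report should still name the Weka filter used and the reason for it.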

Data Analysis
Clustering: Now perform clustering using k-Means for different values of k (values that make sense
for your dataset) and tabulate the results. (Hint: if you have 3 class labels, then 3 and above may be
good values for k. You need to run with at least 5 different values of k.) Highlight the row with the
best k. You have to create a single table with the results. Scanned images and separate tables are not
acceptable. Next, perform clustering using the farthestFirst method and tabulate the results.
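To make the two clustering methods concrete, the sketch below runs a plain k-means (Lloyd's algorithm) for several values of k and prints the within-cluster sum of squared errors for each, then picks centers with a farthest-first traversal. It is a simplified illustration on made-up 2-D points, not Weka's implementation: Weka's clusterers differ in details such as attribute normalization and initialization, so the numbers will not match Weka's output. The function names, the seed, and the toy data are assumptions.

```python
import math
import random


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def farthest_first_centers(points, k, seed=0):
    """Pick a random first center, then repeatedly add the point that is
    farthest from all centers chosen so far."""
    rng = random.Random(seed)
    centers = [points[rng.randrange(len(points))]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def k_means(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm; returns (centers, within-cluster SSE)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # New center = mean of the cluster; keep the old one if it is empty.
        new_centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(math.dist(p, centers[nearest(p, centers)]) ** 2 for p in points)
    return centers, sse


# Three well-separated toy groups (made-up values).
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]
results = {k: k_means(points, k)[1] for k in range(1, 6)}
for k, sse in results.items():
    print(f"k={k}  SSE={sse:.2f}")

ff_centers = farthest_first_centers(points, 3)
print("farthest-first centers:", ff_centers)
```

A table of k against the error reported for each run is the kind of single results table the assignment asks for; choosing the best k then comes down to where the error stops dropping sharply.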
Outlier detection: Based on the class attribute, you have to create a new column named “Outlier”.
Based on the description of the dataset, type “Yes” for outlier instances and “No” for other
instances. Once the “Outlier” column is filled in, remove the class column. Perform Outlier Detection
using the Local Outlier Factor method (for LOF, use 10-fold cross validation). Open “Visualize
classifier errors” and save the file as datasetName_LOF.arff. Open datasetName_LOF.arff and
select predicted Outlier in the attributes list. Take a screenshot and paste it in the report. Find how
many of the actual outliers are predicted as outliers. If the result is not close enough, repeat the
process with only selected attributes. Give a detailed explanation of your findings.
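Counting how many actual outliers were predicted as outliers is a simple comparison of two label columns. The sketch below shows the idea on hand-written lists; it assumes the actual and predicted labels have already been read out of the saved ARFF file as "Yes"/"No" strings, and the function name and the toy labels are hypothetical.

```python
def outlier_hits(actual, predicted):
    """Count actual outliers, predicted outliers, and their overlap."""
    pairs = list(zip(actual, predicted))
    return {
        "actual_outliers": sum(1 for a, _ in pairs if a == "Yes"),
        "predicted_outliers": sum(1 for _, p in pairs if p == "Yes"),
        "correctly_detected": sum(
            1 for a, p in pairs if a == "Yes" and p == "Yes"
        ),
    }


# Made-up labels for six instances.
actual = ["No", "Yes", "No", "Yes", "No", "No"]
predicted = ["No", "Yes", "Yes", "No", "No", "No"]
summary = outlier_hits(actual, predicted)
print(summary)
```

The same count applies unchanged to the Isolation Forest predictions.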

Now, perform Outlier Detection using the Isolation Forest method on the original dataset.
Open “Visualize classifier errors” and save the file as datasetName_ISF.arff. Open
datasetName_ISF.arff and select predicted Outlier in the attributes list. Take a screenshot
and paste it in the report. Find how many of the actual outliers are predicted as outliers.
Discussion of Results: Combine the results from LOF and ISF by creating an Excel file named
combinedResults_datasetname and find the ensemble results. Paste a screenshot of the final
results (as we did in the Outlier detection lab). Also, include the Excel sheet in the zipped folder.
Provide a discussion comparing the clustering results and the outlier detection results.
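The combination rule used in the Outlier detection lab is not restated in this handout, so the following Python sketch is only an assumption about what an ensemble of two detectors can look like. It shows two common rules: flag an instance only when both methods agree, or flag it when either method does. The function name and the toy predictions are hypothetical; use the rule from the lab for the actual submission.

```python
def ensemble(lof, isf, rule="both"):
    """Combine two Yes/No prediction lists.

    rule="both"   -> outlier only if LOF and ISF both say Yes
    rule="either" -> outlier if at least one method says Yes
    """
    if len(lof) != len(isf):
        raise ValueError("prediction lists must have the same length")
    if rule not in ("both", "either"):
        raise ValueError("rule must be 'both' or 'either'")
    combine = all if rule == "both" else any
    return [
        "Yes" if combine(p == "Yes" for p in pair) else "No"
        for pair in zip(lof, isf)
    ]


# Made-up predictions for four instances.
lof_pred = ["Yes", "No", "Yes", "No"]
isf_pred = ["Yes", "Yes", "No", "No"]
print(ensemble(lof_pred, isf_pred, "both"))
print(ensemble(lof_pred, isf_pred, "either"))
```

Requiring agreement tends to flag fewer instances, while accepting either method flags more; which trade-off is better depends on how the two detectors behave on the chosen dataset.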

Submission Details:
This is a partner assignment. The assignment should have a cover page with the names (last name,
first name) and student numbers of both students. Create a zipped folder named
LastNameFirstStud_FirstNameFirstStud_LastNameSecondStud_FirstNameSecondStud.zip with
the report, datasetName_LOF.arff, datasetName_ISF.arff and combinedResults_datasetname.xls,
and the model files of LOF, ISF, kMeans and FarthestFirst. Marks will be deducted if the folder name
does not match the requirements. Upload the zipped folder to Brightspace.
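Because a wrong folder name costs marks, it can help to generate the names rather than type them. The sketch below builds the zip name and the three data file names from the convention above; the student names and the dataset name are placeholders, and the helper names are hypothetical.

```python
def zip_name(first_student, second_student):
    """Each student is a (last_name, first_name) pair."""
    parts = [*first_student, *second_student]
    return "_".join(part.strip() for part in parts) + ".zip"


def required_data_files(dataset_name):
    """Data files named in the submission details.

    The report and the four model files are also required but are not
    covered by a naming rule in the handout, so they are not listed here.
    """
    return [
        f"{dataset_name}_LOF.arff",
        f"{dataset_name}_ISF.arff",
        f"combinedResults_{dataset_name}.xls",
    ]


# Placeholder names, not real students.
archive = zip_name(("Doe", "Jane"), ("Smith", "Alex"))
files = required_data_files("glass")
print(archive)
print(files)
```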

Marks:
This assignment is worth a total of 40 marks. Each step is important. Every step/question should
be answered with an explanation. The assignment should look like a professional report. You should
have a cover page, a table of contents, a list of figures, and the report should have sections such as
Introduction, Data Understanding, Data Preprocessing, Data Analysis, Discussion of results,
Comparison of results, Conclusion, and References. Marks will be deducted if the explanation
for any step is missing. Deductions will also be applied if the report is not professional.
