Assignment 3

This document provides instructions for Assignment 3, which involves exploring a dataset using clustering and outlier detection techniques. Students are to: 1) Select a dataset from two options and describe it in 10 sentences. 2) Perform data preprocessing, visualization, and analysis including k-Means clustering and farthestFirst clustering. 3) Detect outliers using Local Outlier Factor and Isolation Forest, compare results, and discuss findings. The assignment must be submitted as a zip folder containing required files and following a specified naming convention. It will be graded on completeness, explanations, and professional report structure and style.

Uploaded by

Erick Menjivar


CST8390 Assignment 3

Due: July 15, 2022 at 11:59 PM Sharp!!!

(Late submissions will not be accepted)

Goal: The goal of this lab is to explore and analyze one dataset from the given list
and perform clustering using kMeans and farthestFirst and outlier detection
using Local Outlier Factor and Isolation Forest.
Steps:
Select one dataset from the list below:

• Dataset 1 – Glass
o https://archive.ics.uci.edu/ml/datasets/Glass+Identification
o http://odds.cs.stonybrook.edu/glass-data/
o This dataset contains attributes regarding several glass types (multi-class).
Here, class 6 is a clear minority class, as such points of class 6 should be
marked as outliers, while all other points are inliers. For outlier detection,
you need to create a column named Outlier and mark class 6 instances as
Yes and all other instances as No. After this, remove the class attribute.
• Dataset 2 – Lymphography
o https://archive.ics.uci.edu/ml/datasets/Lymphography
o http://odds.cs.stonybrook.edu/lympho/
o It is a multi-class dataset having four classes, but two of them are quite small (2 and 4
data records). Therefore, those two small classes should be merged and considered
as outliers compared to other two large classes (81 and 61 data records). For outlier
detection, you need to create a column named Outlier and mark instances of the smaller
classes as Yes and all other instances as No. After this, remove the class attribute.
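
The labelling step described for both datasets can be sketched outside Weka as follows. This is only an illustration in Python: the function name `add_outlier_column`, the column key `class`, and the two sample rows are assumptions made for the example. Only the `Outlier` column name and the Yes/No values come from the assignment text.

```python
def add_outlier_column(rows, class_key, outlier_classes):
    """Return copies of rows with an Outlier column and without the class column."""
    labelled = []
    for row in rows:
        # Copy every attribute except the class attribute.
        new_row = {k: v for k, v in row.items() if k != class_key}
        # Mark the minority class(es) as outliers, everything else as inliers.
        new_row["Outlier"] = "Yes" if row[class_key] in outlier_classes else "No"
        labelled.append(new_row)
    return labelled


# Hypothetical rows; real values would come from the chosen dataset file.
sample = [
    {"RI": 1.52, "Na": 13.6, "class": 1},
    {"RI": 1.51, "Na": 14.9, "class": 6},
]
glass_rows = add_outlier_column(sample, "class", {6})
```

For Lymphography, the same call would take the labels of the two small classes as `outlier_classes`; which label values those are should be checked against the dataset description.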

Data Understanding
You have to include a brief description (10 sentences) of the selected dataset. From the
papers given with the dataset, you may be able to find the performance of some clustering
and outlier detection methods applied to those datasets. If so, include that in the
description as well. Thoroughly analyze your data so that you clearly understand the
attributes and their types. Tabulate each attribute, its description (if available), and its data type.
Data Preprocessing
Load your file into Weka. Double-check the types of your attributes in Weka. If they are not as
expected, apply filters to convert them to the right types. Tabulate statistics and counts
(whichever apply) for each attribute. Provide that information in one table. Perform data cleaning:
remove duplicates, handle missing values, etc. Specify which filters you applied and the
reason for each. Now, navigate to the Visualize tab to visualize your data. Include 3 interesting
charts in your submission. You need to explain why those charts are interesting (for example, the
data form clusters, the classes are separable, or the classes overlap heavily). You need
to compare the attributes on your x and y axes and their impact on the class attribute.
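
As a cross-check on the per-attribute table built in Weka, the same statistics and the duplicate removal can be sketched in plain Python. The helper names `numeric_summary` and `drop_duplicates` are hypothetical, and missing values are assumed to be represented as `None`.

```python
import statistics


def numeric_summary(values):
    """Summarise one numeric attribute; None stands for a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        # Sample standard deviation needs at least two values.
        "stdev": statistics.stdev(present) if len(present) > 1 else 0.0,
    }


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


summary = numeric_summary([1.0, 2.0, None, 3.0])
deduplicated = drop_duplicates([(1, 2), (1, 2), (3, 4)])
```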

Data Analysis
Clustering: Now perform clustering using k-Means for different values of k (values which make sense
for your dataset) and tabulate those results. (Hint: if you have 3 class labels, then 3 and above may be a
good value for k. You need to run with at least 5 different values of k.) Highlight the row with the
best k. You have to create a single table with the results. Scanned images and multiple separate tables
are not acceptable. Next, perform clustering using the farthestFirst method and tabulate the results.
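
To make the farthest-first idea concrete, here is a minimal sketch of farthest-first traversal in plain Python. It is not Weka's implementation: the fixed `start` index, the function names, and the toy points are assumptions chosen so the result is reproducible. The within-cluster sum of squared errors computed at the end is one possible figure to tabulate when comparing runs; the assignment itself does not prescribe a criterion.

```python
import math


def farthest_first_centers(points, k, start=0):
    """Pick k centers: begin at points[start], then repeatedly take the
    point that is farthest from its nearest already-chosen center."""
    centers = [points[start]]
    # Distance from every point to its nearest chosen center so far.
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=lambda i: nearest[i])
        centers.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return centers


def assign_clusters(points, centers):
    """Label each point with the index of its closest center."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def within_cluster_sse(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[c]) ** 2 for p, c in zip(points, labels))


# Toy 2-D points, purely illustrative.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
centers = farthest_first_centers(points, k=3)
labels = assign_clusters(points, centers)
sse = within_cluster_sse(points, centers, labels)
```

Unlike k-Means, this procedure never moves a center after choosing it, so the centers are always actual data points.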
Outlier detection: Based on the class attribute, create a new column named “Outlier”. Based on
the description of the dataset, enter “Yes” for outlier instances and “No” for all other instances.
Once the “Outlier” column is filled in, remove the class column. Perform outlier detection using
the Local Outlier Factor (LOF) method with 10-fold cross-validation. Open “Visualize
classifier errors” and save the file as datasetName_LOF.arff. Open datasetName_LOF.arff and
select the predicted Outlier attribute in the attributes list. Take a screenshot and paste it in the
report. Find how many of the actual outliers are predicted as outliers. If the result is not close
enough, repeat the process with only selected attributes. Give a detailed explanation of your findings.

Now, perform outlier detection using the Isolation Forest (ISF) method on the original dataset.
Open “Visualize classifier errors” and save the file as datasetName_ISF.arff. Open
datasetName_ISF.arff and select the predicted Outlier attribute in the attributes list. Take a
screenshot and paste it in the report. Find how many of the actual outliers are predicted as outliers.
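
The count asked for here (actual outliers that are also predicted as outliers) can be checked for either detector with a few lines of Python. This is a sketch under assumptions: the function name `outlier_hits` is hypothetical, and the two short lists stand in for the actual and predicted Outlier columns read from the saved .arff file.

```python
def outlier_hits(actual, predicted, positive="Yes"):
    """Count actual outliers, predicted outliers, and instances in both groups."""
    actual_pos = sum(1 for a in actual if a == positive)
    predicted_pos = sum(1 for p in predicted if p == positive)
    hits = sum(1 for a, p in zip(actual, predicted)
               if a == positive and p == positive)
    return {"actual": actual_pos, "predicted": predicted_pos, "hits": hits}


# Hypothetical labels, not results from either dataset.
actual = ["No", "Yes", "No", "Yes", "No"]
predicted = ["No", "Yes", "Yes", "No", "No"]
counts = outlier_hits(actual, predicted)
```

Reporting `hits` alongside both totals shows missed outliers as well as inliers that were flagged by mistake.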
Discussion of Results: Combine the results from LOF and ISF by creating an Excel file named
combinedResults_datasetname and find the ensemble results. Paste the screenshot of the final
results (as we did in the Outlier detection lab). Also, include the Excel sheet in the zipped folder.
Provide a discussion comparing the clustering results and the outlier detection results.
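
The combination rule used in the Outlier detection lab is not restated in this handout, so the sketch below is an assumption rather than the required method. It shows two common ways of combining two detectors: flag an instance when both agree (`"both"`) or when at least one flags it (`"either"`). The function name and sample labels are hypothetical.

```python
def combine_predictions(lof, isf, rule="both"):
    """Combine two Yes/No prediction lists into one ensemble prediction."""
    if len(lof) != len(isf):
        raise ValueError("prediction lists must have the same length")
    if rule not in ("both", "either"):
        raise ValueError("rule must be 'both' or 'either'")
    needed = 2 if rule == "both" else 1
    combined = []
    for a, b in zip(lof, isf):
        votes = (a == "Yes") + (b == "Yes")  # number of detectors voting outlier
        combined.append("Yes" if votes >= needed else "No")
    return combined


# Hypothetical predictions, not results from either dataset.
lof = ["Yes", "No", "Yes", "No"]
isf = ["Yes", "Yes", "No", "No"]
agree = combine_predictions(lof, isf, rule="both")
any_flag = combine_predictions(lof, isf, rule="either")
```

The `"both"` rule tends to flag fewer instances, while `"either"` flags more; comparing each against the actual Outlier column makes that trade-off visible in the discussion.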

Submission Details:
This is a partner assignment. The assignment should have a cover page with the names (last name,
first name of both students) and student numbers. Create a zipped folder named
LastNameFirstStud_FirstNameFirstStud_LastNameSecondStud_FirstNameSecondStud.zip with
the report, datasetName_LOF.arff, datasetName_ISF.arff and combinedResults_datasetname.xls,
and the model files of LOF, ISF, kMeans and FarthestFirst. Marks will be deducted if the folder name
does not match the requirements. Upload the zipped folder to Brightspace.
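
Because the folder name is graded, it can help to generate it rather than type it. The sketch below builds the name and the list of required data files from the pattern above; the helper names and the placeholder student names are hypothetical, and the report and model files are left out because the handout does not fix their names.

```python
def zip_name(first_student, second_student):
    """Build the folder name; each student is a (last_name, first_name) pair."""
    parts = [*first_student, *second_student]
    return "_".join(parts) + ".zip"


def required_data_files(dataset_name):
    """Data files whose names are fixed by the assignment text."""
    return [
        f"{dataset_name}_LOF.arff",
        f"{dataset_name}_ISF.arff",
        f"combinedResults_{dataset_name}.xls",
    ]


# Placeholder names for illustration only.
folder = zip_name(("Doe", "Jane"), ("Roe", "Sam"))
files = required_data_files("glass")
```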

Marks:
This assignment will have a total of 40 marks. Each step is important. Every step/question should
be answered with an explanation. The assignment should look like a professional report. You should
have a cover page, a table of contents, a list of figures, and the report should have sections like
Introduction, Data Understanding, Data Preprocessing, Data Analysis, Discussion of Results,
Comparison of Results, Conclusion, and References. Marks will be deducted if you omit the
explanation for any of the steps. Deductions will also be applied if the report is not professional.
