Project Assignment

Uploaded by

Farah Jahangir

“Data Mining”

Clustering and outlier detection

Learning Objectives:
1. Learn to use popular clustering algorithms, namely K-means and DBSCAN, and to detect
outliers
2. Learn how to summarize and interpret clustering results
3. Learn to write analysis and evaluation functions that operate on top of clustering
algorithms and clustering results
4. Learn how to interpret unsupervised data mining results

Task-1
In this project you will use the Houston Weather Dataset, or HWD for short. The first and
last attributes of the HWD should be ignored when clustering this data set; the last
attribute denotes a class variable which will be used in the post-analysis of the clusters
generated by running K-means and DBSCAN.

The Houston Weather Dataset has the following attributes (name / type / unit / description):
DATE / nominal / Each record has a date from 01/01/2006 to
12/31/2021. You may download this dataset from
https://www.kaggle.com/datasets/alejandrochapa/houston-weather-data.
cloud_cover / categorical / % / 0 to 16, each number represents a category
rainfall / continuous / inch / Amount of rainfall of the day
min_temp / continuous / Fahrenheit / Minimum temperature of the day
max_temp / continuous / Fahrenheit / Maximum temperature of the day
wind_speed / continuous / miles per hour / Wind speed at 3pm
pressure / continuous / pai / Atmospheric pressure at 3pm
humidity / continuous / % / Relative humidity at 3pm
class / categorical / % / H, M, L representing High, Medium and Low humidity

Examples in the Weather Prediction Dataset:

Date      min_temp  max_temp  rainfall  wind_speed  humidity  pressure  cloud  Class
1/1/2021  41        55        0         8           51        29.95     4      M
1/2/2021  41        59        0         7           42        30.09     3      L
1/3/2021  43        68        0         13          37        30.01     3      L
1/4/2021  49        75        0         3           43        29.99     0      L

a. Run K-means for k=3 (see footnote 1) on the HWD dataset, excluding the Date and
Class attributes. Using a purity function that you develop, compute the purity of the
obtained clustering result. Next, create box plots of the attributes max_temp,
min_temp, rainfall, humidity, wind_speed, and pressure for the 3 obtained clusters,
and report the cluster centroids (means). Then summarize, based on the box plots and
the centroids/cluster means, what kind of objects each of the 3 clusters contains
(you need to compare the attributes across clusters). Finally, report the purity of
the clustering result and interpret it. ***

b. Try to obtain a DBSCAN clustering for the HWD dataset, excluding the Date and Class
attributes, that has between 2 and 15 clusters and less than 20% outliers. Report its
purity score. Compare the result with the K-means result you obtained in subtask a. ***
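
One way to compute purity, sketched here in Python under stated assumptions: each cluster is credited with its most frequent class, outliers are left out of the purity ratio and reported separately, and outliers carry the label -1 (the convention used by scikit-learn's DBSCAN, not something this assignment specifies). The names `purity` and `meets_dbscan_target` and the toy labels are illustrative only.

```python
from collections import Counter

NOISE = -1  # assumed outlier label; the assignment does not prescribe one


def purity(cluster_labels, class_labels, noise_label=NOISE):
    """Return (purity, outlier_fraction) for one clustering.

    Purity is the share of clustered objects that belong to the
    majority class of their cluster; outliers are excluded from it.
    """
    if len(cluster_labels) != len(class_labels):
        raise ValueError("label sequences must have the same length")
    if not cluster_labels:
        return 0.0, 0.0
    class_counts = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        if cluster != noise_label:
            class_counts.setdefault(cluster, Counter())[cls] += 1
    clustered = sum(sum(counts.values()) for counts in class_counts.values())
    majority = sum(max(counts.values()) for counts in class_counts.values())
    n_outliers = len(cluster_labels) - clustered
    purity_value = majority / clustered if clustered else 0.0
    return purity_value, n_outliers / len(cluster_labels)


def meets_dbscan_target(cluster_labels, noise_label=NOISE):
    """Check subtask b's requirement: 2 to 15 clusters, under 20% outliers."""
    if not cluster_labels:
        return False
    n_clusters = len({c for c in cluster_labels if c != noise_label})
    n_outliers = sum(1 for c in cluster_labels if c == noise_label)
    return 2 <= n_clusters <= 15 and n_outliers / len(cluster_labels) < 0.20


# Toy labels (made up): three clusters plus one outlier.
clusters = [0, 0, 0, 1, 1, 1, 1, -1, 2, 2]
classes = ["M", "M", "L", "L", "L", "L", "H", "H", "H", "H"]
score, outlier_fraction = purity(clusters, classes)
print(f"purity={score:.3f}, outliers={outlier_fraction:.0%}")
```

Reporting the outlier fraction next to the purity keeps a DBSCAN result comparable with the K-means result, since a high purity achieved by discarding many points as noise would otherwise look better than it is.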

Deliverables:
A. A report (see footnote 2) which contains all deliverables for the subtasks of Task 1.
B. Properly commented software/code you developed as part of Task 1.
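
Footnote 1 asks for ten K-means runs, of which only the one with the lowest SSE is analyzed further. A minimal pure-Python sketch of that restart loop follows. The helper names, the initialization from randomly chosen data points, and the toy points are assumptions; a library implementation would likely be used in practice.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, rng, max_iter=100):
    """Run Lloyd's algorithm once; return (sse, labels, centroids)."""
    centroids = rng.sample(points, k)  # k distinct data points as seeds
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(
                    sum(column) / len(members) for column in zip(*members)
                )
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return sse, labels, centroids


def best_of_runs(points, k=3, runs=10, seed=0):
    """Repeat K-means `runs` times and keep the run with the lowest SSE."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(runs)), key=lambda r: r[0])


# Toy 2-D points (made up); real input would be the HWD rows without Date and Class.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
best_sse, best_labels, best_centroids = best_of_runs(points)
print(f"lowest SSE over 10 runs: {best_sse:.3f}")
```

Because the HWD attributes have very different ranges (pressure varies by fractions of a unit, humidity by tens), scaling the attributes before clustering is worth considering; otherwise the SSE is dominated by the wide-range attributes.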

Task-2
In this task you will develop outlier detection techniques for the HWD dataset
provided in Task-1; the objective is to find “unusual weather days” in this dataset.

Footnotes:
1. Actually run it 10 times, but then further analyze only the (single) clustering with the lowest SSE.
2. Single-spaced; please use an 11-point or 12-point font!

A day can be unusual if it's much hotter or colder than usual (temperature), windier or
calmer than usual (wind speed), more humid or less humid than usual (humidity), or wetter
or drier than usual (rainfall). Each of these things can affect our daily lives. For example,
a very hot day in winter or a very cold day in summer would be unusual. Or, if it rains a
lot more or a lot less than normal, that could also be unusual. To know if a day is unusual,
we need to compare it to what's typical for the location.
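
The comparison with "what's typical" can be made concrete by scoring each day against the other days of the same month, so that a hot January day stands out even when the same reading is ordinary in July. The sketch below assumes dates in month/day/year form, as in the sample rows; the function name and the temperature values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def monthly_z_scores(records, attribute):
    """Z-score of `attribute` for each record, relative to its own month."""
    values_by_month = defaultdict(list)
    for record in records:
        month = int(record["date"].split("/")[0])
        values_by_month[month].append(record[attribute])
    # Per-month mean and population standard deviation.
    stats = {m: (mean(v), pstdev(v)) for m, v in values_by_month.items()}
    scores = []
    for record in records:
        mu, sigma = stats[int(record["date"].split("/")[0])]
        scores.append(0.0 if sigma == 0 else (record[attribute] - mu) / sigma)
    return scores


# Hypothetical readings: 88 F occurs once in January and once in July.
records = [
    {"date": "1/1/2021", "max_temp": 55},
    {"date": "1/2/2021", "max_temp": 59},
    {"date": "1/3/2021", "max_temp": 68},
    {"date": "1/20/2021", "max_temp": 88},
    {"date": "7/1/2021", "max_temp": 88},
    {"date": "7/2/2021", "max_temp": 90},
    {"date": "7/3/2021", "max_temp": 86},
    {"date": "7/4/2021", "max_temp": 92},
]
scores = monthly_z_scores(records, "max_temp")
for record, z in zip(records, scores):
    print(record["date"], record["max_temp"], round(z, 2))
```

In this toy data the January 88 F day lies well above its month's mean, while the July 88 F day lies slightly below its own, which is the kind of context-dependence the paragraph describes.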

In this task, you will use a dataset called the HWD dataset. It contains daily weather data
for Houston in the year 2021, with attributes like date, min_temp, max_temp, rainfall,
wind_speed9am, wind_speed3pm, humidity9am, humidity3pm, pressure9am,
pressure3pm, cloud9am, cloud3pm, temp9am, temp3pm, rain_today, and rain_tomorrow.
However, for this task, we will focus on a subset of the dataset called RHOUSTONW. This
subset includes the following attributes: Date, min_temp, max_temp, rainfall, wind_speed,
humidity, and cloud. In the dataset, wind_speed and humidity refer to wind_speed3pm and
humidity3pm, while cloud is the numerical conversion of cloud3pm from the original
dataset.
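
The projection from the full record to RHOUSTONW can be sketched as below. The lower-case key names follow the attribute list in the paragraph above. The two cloud codes are read off by matching the 1/1/2021 and 1/2/2021 sample rows in the two example tables of this assignment (cloud 4 alongside "Mostly Cloudy", cloud 3 alongside "Fair"). The rest of the mapping is not given here, and the 9am values in the example record are invented.

```python
# Partial category-to-number mapping inferred from the sample rows only;
# the full 17-category conversion is not specified in this assignment.
CLOUD_CODES = {"Fair": 3, "Mostly Cloudy": 4}

KEPT_AS_IS = ["date", "min_temp", "max_temp", "rainfall"]


def to_rhoustonw(record, cloud_codes=CLOUD_CODES):
    """Reduce a full HWD record to the RHOUSTONW attributes."""
    reduced = {name: record[name] for name in KEPT_AS_IS}
    reduced["wind_speed"] = record["wind_speed3pm"]
    reduced["humidity"] = record["humidity3pm"]
    reduced["cloud"] = cloud_codes[record["cloud3pm"]]
    return reduced


full_record = {
    "date": "1/1/2021", "min_temp": 41, "max_temp": 55, "rainfall": 0,
    "wind_speed9am": 5, "wind_speed3pm": 8,      # 9am value is hypothetical
    "humidity9am": 70, "humidity3pm": 51,        # 9am value is hypothetical
    "cloud9am": "Fair", "cloud3pm": "Mostly Cloudy",
}
reduced_record = to_rhoustonw(full_record)
print(reduced_record)
```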
The Houston Weather Dataset has the following attributes (name / type / unit / description / range):
DATE / nominal / Each record has a date from 01/01/2021 to 12/31/2021
cloud / categorical / % / 17 different types of cloud cover. Categories
are "Fair / Windy", "Partly Cloudy", "Partly Cloudy / Windy", "Cloudy",
"Cloudy / Windy", "Mostly Cloudy", "Mostly Cloudy / Windy", "Fog", "Haze",
"Light Rain", "Light Rain with Thunder", "Thunder", "Rain",
"Thunder / Windy", "Heavy T-Storm", "Thunder in the Vicinity", "T-Storm"
rainfall / continuous / inch / Amount of rainfall of the day / from 0 to 5
min_temp / continuous / Fahrenheit / Minimum temperature at 3pm / from 34 to 83
max_temp / continuous / Fahrenheit / Maximum temperature at 3pm / from 46 to 98
wind_speed / continuous / miles per hour / Wind speed at 3pm / from 0 to 29
humidity / continuous / % / Humidity at 3pm / from 0 to 100

3 examples in the Weather Dataset:

date      min_temp  max_temp  rainfall  wind_speed  humidity  cloud
1/1/2021  41        55        0         8           51        Mostly Cloudy
1/2/2021  41        59        0         7           42        Fair
1/3/2021  43        68        0         13          37        Fair

Subtasks:
1) Design and implement a distance-based and a model/density-based outlier
detection technique for the Houston Weather Dataset. Each technique, when applied to
the Houston Weather Dataset, should add a column named OLS (Outlier Score) to the
examples in the dataset; it contains a single number that measures the strength of
our belief that the particular example is an outlier. The challenge for the first
technique will be the development of a “good” distance function for the RHOUSTONW
dataset; the challenge for the second technique will be to develop a “good” density
function for the RHOUSTONW dataset. ***********
a) You must design a multivariate distance function and a multivariate density
function that have been tailored to the dataset. You may also use clustering
algorithms, but in that case the marks for the density function and the distance
function will be zero.

b) Please provide clear definitions of the distance and density functions you designed,
and describe and justify your design choices.

2) Apply the two outlier detection techniques to the RHOUSTONW dataset; if your
methods involve hyperparameters, apply each method 3 times to the dataset using 3
different hyperparameter settings. ****
3) Sort the obtained augmented RHOUSTONW datasets by the OLS attribute. Discuss
the top 3 examples of each augmented dataset; explain why you believe these
examples were viewed as likely outliers. Also discuss the bottom example in each
augmented dataset: try to explain why it was rated to be “most normal”. ****
4) Based on the results you obtained in the previous steps, evaluate and compare the two
outlier detection techniques you developed. **
5) If necessary, enhance your two outlier detection techniques and redo steps 2, 3, and 4!
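
As an illustration of subtask 1, one possible (not prescribed) distance design is a Gower-style average: each numeric attribute contributes its absolute difference divided by the attribute range listed in the attribute description, and cloud contributes 0 for equal categories and 1 otherwise. The OLS of a day is then its mean distance to its k nearest neighbours. The function names, the equal default weights, and the fourth record are assumptions.

```python
# Attribute ranges copied from the RHOUSTONW attribute description.
NUMERIC_RANGES = {
    "min_temp": (34, 83), "max_temp": (46, 98), "rainfall": (0, 5),
    "wind_speed": (0, 29), "humidity": (0, 100),
}


def weather_distance(a, b, weights=None):
    """Weighted average of per-attribute dissimilarities, each in [0, 1]."""
    weights = weights or {}
    total, weight_sum = 0.0, 0.0
    for name, (low, high) in NUMERIC_RANGES.items():
        w = weights.get(name, 1.0)
        total += w * abs(a[name] - b[name]) / (high - low)
        weight_sum += w
    w = weights.get("cloud", 1.0)
    total += w * (0.0 if a["cloud"] == b["cloud"] else 1.0)  # nominal mismatch
    weight_sum += w
    return total / weight_sum


def knn_outlier_scores(records, k=2):
    """OLS = mean distance to the k nearest other records (needs >= 2 records)."""
    scores = []
    for i, record in enumerate(records):
        distances = sorted(
            weather_distance(record, other)
            for j, other in enumerate(records) if j != i
        )
        nearest = distances[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores


records = [
    {"min_temp": 41, "max_temp": 55, "rainfall": 0, "wind_speed": 8, "humidity": 51, "cloud": "Mostly Cloudy"},
    {"min_temp": 41, "max_temp": 59, "rainfall": 0, "wind_speed": 7, "humidity": 42, "cloud": "Fair"},
    {"min_temp": 43, "max_temp": 68, "rainfall": 0, "wind_speed": 13, "humidity": 37, "cloud": "Fair"},
    # Hypothetical stormy day, added only to give the sketch an outlier:
    {"min_temp": 70, "max_temp": 80, "rainfall": 4.5, "wind_speed": 27, "humidity": 95, "cloud": "Heavy T-Storm"},
]
ols = knn_outlier_scores(records, k=2)
for record, score in zip(records, ols):
    print(record["cloud"], round(score, 3))
```

Dividing by the range keeps a 5-inch rainfall span from being drowned out by a 100-point humidity span; the `weights` argument leaves room to argue that, say, rainfall should count more than cloud.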

Deliverable:

a) Properly commented code. [Add comments above each block. Make variable
and function names descriptive enough to convey their purpose. Add a doc
section at the beginning of each module describing its inputs and outputs,
and briefly mention what it does and how.]
b) Explanation containing

i. Algorithm/pseudocode that explains your detection mechanism

ii. Explanation of how the algorithm works
iii. Example input and output, and a discussion of the input/output
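
For the density-based technique of subtask 1, a leave-one-out Gaussian kernel density estimate is one option, sketched here under stated assumptions: only the numeric attributes are used (cloud is left out for brevity), each is scaled by the range given in the attribute description, and the bandwidth is a hyperparameter. OLS is defined as one minus the density relative to the densest day, which is a choice made for this sketch rather than a requirement of the assignment. The fourth record is invented.

```python
import math

# Attribute ranges copied from the RHOUSTONW attribute description.
NUMERIC_RANGES = {
    "min_temp": (34, 83), "max_temp": (46, 98), "rainfall": (0, 5),
    "wind_speed": (0, 29), "humidity": (0, 100),
}


def normalise(record):
    """Scale each numeric attribute to [0, 1] using its listed range."""
    return [(record[n] - low) / (high - low) for n, (low, high) in NUMERIC_RANGES.items()]


def density(point, others, bandwidth):
    """Gaussian kernel density at `point`, up to a constant factor."""
    if not others:
        return 0.0
    total = 0.0
    for other in others:
        squared = sum((x - y) ** 2 for x, y in zip(point, other))
        total += math.exp(-squared / (2 * bandwidth ** 2))
    return total / len(others)


def density_outlier_scores(records, bandwidth=0.2):
    """OLS in [0, 1]: 0 for the densest record, near 1 for isolated ones."""
    points = [normalise(r) for r in records]
    densities = [
        density(p, points[:i] + points[i + 1:], bandwidth)  # leave-one-out
        for i, p in enumerate(points)
    ]
    top = max(densities)
    return [1.0 - d / top if top > 0 else 0.0 for d in densities]


records = [
    {"min_temp": 41, "max_temp": 55, "rainfall": 0, "wind_speed": 8, "humidity": 51},
    {"min_temp": 41, "max_temp": 59, "rainfall": 0, "wind_speed": 7, "humidity": 42},
    {"min_temp": 43, "max_temp": 68, "rainfall": 0, "wind_speed": 13, "humidity": 37},
    # Hypothetical stormy day, added only to give the sketch an outlier:
    {"min_temp": 70, "max_temp": 80, "rainfall": 4.5, "wind_speed": 27, "humidity": 95},
]
density_ols = density_outlier_scores(records, bandwidth=0.2)
print([round(score, 3) for score in density_ols])
```

The kernel's normalising constant is omitted because it cancels when densities are divided by the maximum.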

Rubric (Level 0 to Level 3; weight in parentheses):

Quality of the distance function (weight 4)
  Level 0: No distance function is presented.
  Level 1: The distance function is not very sophisticated/incorrect and will produce wrong outputs in most cases.
  Level 2: The distance function is modestly sophisticated/incorrect and will produce wrong outputs in some cases.
  Level 3: The distance function is very good.

Quality of the distance-based outlier detection technique (weight 4)
  Level 0: No distance-based outlier detection technique is presented.
  Level 1: The distance-based outlier detection technique is not very sophisticated/incorrect and will produce wrong outputs in most cases.
  Level 2: The distance-based outlier detection technique is modestly sophisticated/incorrect and will produce wrong outputs in some cases.
  Level 3: The distance-based outlier detection technique is very good.

Quality of the density function (weight 4)
  Level 0: No density function is presented.
  Level 1: The density function is not very sophisticated/incorrect and will produce wrong outputs in most cases.
  Level 2: The density function is modestly sophisticated/incorrect and will produce wrong outputs in some cases.
  Level 3: The density function is very good.

Quality of the model/density-based outlier detection technique (weight 4)
  Level 0: No model/density-based outlier detection technique is presented.
  Level 1: The model/density-based outlier detection technique is not very sophisticated/incorrect and will produce wrong outputs in most cases.
  Level 2: The model/density-based outlier detection technique is modestly sophisticated/incorrect and will produce wrong outputs in some cases.
  Level 3: The model/density-based outlier detection technique is very good.

Apply the two outlier detection techniques to the RHOUSTONW dataset; if your
methods involve hyperparameters, apply each method 3 times to the dataset using 3
different hyperparameter settings.
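
A minimal sketch of running one detector under three hyperparameter settings and keeping the results side by side. It assumes a k-nearest-neighbour score with plain Euclidean distance on already normalised points; the detector, the values of k, and the five points are placeholders for whatever technique and data are actually used.

```python
import math


def knn_scores(points, k):
    """Outlier score = mean Euclidean distance to the k nearest other points."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(distances[:k]) / k)
    return scores


# Hypothetical normalised (humidity, wind_speed) pairs; the last one is isolated.
points = [(0.10, 0.10), (0.12, 0.10), (0.50, 0.50), (0.52, 0.50), (0.90, 0.10)]

# One run per hyperparameter setting, stored by setting for later comparison.
runs = {k: knn_scores(points, k) for k in (1, 2, 3)}

for k, scores in runs.items():
    top = max(range(len(points)), key=lambda i: scores[i])
    print(f"k={k}: highest OLS at index {top}, scores={[round(s, 3) for s in scores]}")
```

Keeping the scores of all three runs makes it easy to discuss whether the same days stay at the top as the setting changes or whether the ranking is sensitive to it.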
Deliverable:

1. Properly commented code. [Add comments above each block. Make variable
and function names descriptive enough to convey their purpose. Add a doc
section at the beginning of each module describing its inputs and outputs,
and briefly mention what it does and how.]
2. Explanation containing

a) Example input and output of each iteration

b) Discussion of the input/output of each iteration

Rubric (Level 0 to Level 3; weight in parentheses):

Quality of the input/output from the three runs (weight 3)
  Level 0: Input, outputs and their discussions are not written in the report.
  Level 1: One out of three runs is done and/or input, outputs and their discussions are poorly written in the report and have many mistakes.
  Level 2: Two out of three runs are done and/or input, outputs and their discussions are modestly written in the report and have some mistakes.
  Level 3: All runs are done properly; input, outputs and their discussions are very good.

Sort the obtained augmented RHOUSTONW datasets by the OLS attribute. Discuss the
top 3 examples of each augmented dataset; explain why you believe these examples
were viewed as likely outliers. Also discuss the bottom example in each augmented dataset:
try to explain why it was rated to be “most normal”.
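
Sorting by OLS and picking the examples to discuss can be sketched as follows. The records and scores are hypothetical, and higher OLS is assumed to mean "more likely an outlier".

```python
def rank_by_ols(records, scores):
    """Attach an OLS column to each record and sort from most to least unusual."""
    augmented = [dict(record, OLS=score) for record, score in zip(records, scores)]
    return sorted(augmented, key=lambda record: record["OLS"], reverse=True)


# Hypothetical records and scores for illustration only.
records = [
    {"date": "1/1/2021", "max_temp": 55},
    {"date": "1/2/2021", "max_temp": 59},
    {"date": "1/3/2021", "max_temp": 68},
    {"date": "1/4/2021", "max_temp": 75},
    {"date": "1/5/2021", "max_temp": 88},
]
scores = [0.23, 0.14, 0.17, 0.30, 0.65]

ranked = rank_by_ols(records, scores)
top_three = ranked[:3]      # candidates to discuss as likely outliers
most_normal = ranked[-1]    # lowest OLS: the "most normal" example

print("top 3:", [r["date"] for r in top_three])
print("most normal:", most_normal["date"])
```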
Deliverable:

3. Code showing the sort by the OLS attribute

4. A report containing

iv. The top 3 examples of each augmented dataset

v. Discussion of why they were viewed as likely outlier candidates
vi. The bottom example of each augmented dataset
vii. Discussion of why it was rated to be “most normal”

Rubric (Level 0 to Level 3; weight in parentheses):

Presentation of the first 3 and bottom 1 samples (weight 3)
  Level 0: No samples are presented.
  Level 1: Presented samples from both sides are wrong.
  Level 2: Presented samples from at least one side are wrong.
  Level 3: Presented samples from both sides are correct.

Discussion of the samples (weight 4)
  Level 0: No discussion given.
  Level 1: Discussion is wrong with lots of erroneous claims.
  Level 2: Discussion is modest with some erroneous claims.
  Level 3: Discussion is very good.

Based on the results you obtained in the previous steps evaluate and compare the two
outlier detection techniques you developed.
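
Two simple quantities that can support such a comparison are the agreement of the two rankings (Spearman's rank correlation) and the overlap of their top-k sets. The sketch below assumes that both techniques scored the same records in the same order and that there are no tied scores; the score values are made up.

```python
def ranks(scores):
    """Rank 1 = highest score. Assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(scores_a, scores_b):
    """Spearman rank correlation via the no-ties formula."""
    n = len(scores_a)
    squared_diffs = sum((x - y) ** 2 for x, y in zip(ranks(scores_a), ranks(scores_b)))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


def top_k_overlap(scores_a, scores_b, k):
    """Jaccard overlap of the k highest-scored indices under each technique."""
    def top(scores):
        return set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    a, b = top(scores_a), top(scores_b)
    return len(a & b) / len(a | b)


# Hypothetical OLS values from the two techniques for the same five days.
distance_ols = [0.23, 0.14, 0.17, 0.65, 0.30]
density_ols = [0.12, 0.00, 0.48, 1.00, 0.40]

rho = spearman(distance_ols, density_ols)
overlap = top_k_overlap(distance_ols, density_ols, k=3)
print(f"Spearman rho = {rho:.2f}, top-3 overlap = {overlap:.2f}")
```

A high correlation with a low top-k overlap would suggest that the techniques agree on ordinary days but disagree on which days are most unusual, which is the part worth discussing in the report.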

Deliverable:

A report containing the discussion

Rubric (Level 0 to Level 3; weight in parentheses):

Comparison of the two outlier detection techniques (weight 4)
  Level 0: No discussion given.
  Level 1: Discussion is wrong with lots of erroneous claims.
  Level 2: Discussion is modest with some erroneous claims.
  Level 3: Discussion is very good.

Report quality (weight 2)
  Level 0: No report is given.
  Level 1: The report is poorly written with lots of mistakes and contains many redundant comments and bad organization.
  Level 2: The report quality is moderate with some mistakes and contains a few redundant comments and okay organization.
  Level 3: The report is very well written with no redundancy and good organization.
