Project Assignment

Uploaded by

Farah Jahangir

“Data Mining”

Clustering and outlier detection


Learning Objectives:
1. Learn to use popular clustering algorithms, namely K-means and DBSCAN, and to detect
outliers
2. Learn how to summarize and interpret clustering results
3. Learn to write analysis and evaluation functions which operate on top of clustering
algorithms and clustering results
4. Learn how to interpret unsupervised data mining results

Task-1
In this project you will use the Houston Weather Dataset, or HWD for short. The first and
last attributes of the HWD (Date and Class) should be ignored when clustering this dataset; the last
attribute denotes a class variable which will be used in the post-analysis of the clusters
generated by running K-means and DBSCAN.

The Houston Weather Dataset has the following attributes (name / type / unit / description):
DATE / nominal / Each record has a date, from 01/01/2006 to 12/31/2021. You may
download this dataset from
https://www.kaggle.com/datasets/alejandrochapa/houston-weather-data.
cloud_cover / categorical / 0 to 16; each number represents a category
rainfall / continuous / inch / Amount of rainfall of the day
min_temp / continuous / Fahrenheit / Minimum temperature of the day
max_temp / continuous / Fahrenheit / Maximum temperature of the day
wind_speed / continuous / miles per hour / Wind speed at 3pm
pressure / continuous / "pai" (the sample values near 30 suggest inches of mercury) / Atmospheric pressure at 3pm
humidity / continuous / % / Relative humidity at 3pm
class / categorical / H, M, L, representing High, Medium and Low humidity

Examples in the Weather Prediction Dataset:

Date      min_temp  max_temp  rainfall  wind_speed  humidity  pressure  cloud  Class
1/1/2021  41        55        0         8           51        29.95     4      M
1/2/2021  41        59        0         7           42        30.09     3      L
1/3/2021  43        68        0         13          37        30.01     3      L
1/4/2021  49        75        0         3           43        29.99     0      L

a. Run K-means for k=3 (see footnote 1) on the HWD dataset, excluding the Date and
Class attributes. Using the purity function you developed, compute the purity of the
obtained clustering result; next, create box plots of the attributes max_temp,
min_temp, rainfall, humidity, wind_speed and pressure for each of the 3 obtained
clusters and report the cluster centroids/means. Then summarize, based on the
obtained box plots and centroids/cluster means, what kind of objects each of the 3 clusters
contains (compare how each attribute is distributed across the clusters). Finally, report
the purity of the clustering result and interpret it. ***

b. Try to obtain a DBSCAN clustering for the HWD dataset, excluding the Date and Class
attributes, that has between 2 and 15 clusters and less than 20% outliers. Report its
purity score. Compare the result with the K-means result you obtained in subtask a. ***
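
Subtask a and footnote 1 together describe a restart scheme: run K-means several times and analyze only the run with the lowest SSE. The sketch below illustrates that scheme in plain Python. The function names (`kmeans_once`, `best_of_n_kmeans`), the random initialization and the fixed seed are illustrative assumptions, not part of the assignment; whether and how to rescale the attributes before clustering is also left open here.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run from k randomly chosen points; returns (labels, centroids, sse)."""
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse

def best_of_n_kmeans(points, k=3, n_runs=10, seed=0):
    """Run K-means n_runs times and keep only the run with the lowest SSE."""
    rng = random.Random(seed)  # assumption: fixed seed for repeatable reports
    runs = [kmeans_once(points, k, rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: run[2])
```

`points` is expected to be a list of numeric tuples, one per day, with the Date and Class attributes already removed. A library implementation such as scikit-learn's `KMeans`, whose `n_init` parameter controls the number of restarts, would normally replace the hand-written loop.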

Deliverables:
A. A report (see footnote 2) which contains all deliverables for the subtasks of Task 1.
B. Properly commented software/code you developed as part of Task 1.
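
Both subtasks of Task 1 ask for a purity score, and subtask b additionally limits the share of outliers. The following sketch shows one way to compute both from a list of cluster labels and the Class attribute. It assumes that noise points carry the label -1 (the convention of scikit-learn's `DBSCAN`) and that outliers are left out of the purity computation; the course may define purity differently, so treat both choices as assumptions.

```python
from collections import Counter

NOISE_LABEL = -1  # assumption: noise marker, as in scikit-learn's DBSCAN labels_

def purity(cluster_labels, class_labels, noise_label=NOISE_LABEL):
    """Share of clustered (non-noise) examples that belong to the majority class of their cluster."""
    class_counts = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        if cluster != noise_label:
            class_counts.setdefault(cluster, Counter())[cls] += 1
    clustered = sum(sum(counts.values()) for counts in class_counts.values())
    if clustered == 0:
        return 0.0  # every example was labelled as noise
    majority = sum(max(counts.values()) for counts in class_counts.values())
    return majority / clustered

def outlier_fraction(cluster_labels, noise_label=NOISE_LABEL):
    """Share of examples labelled as noise; subtask b asks for less than 0.2."""
    return sum(1 for c in cluster_labels if c == noise_label) / len(cluster_labels)
```

For K-means no example is labelled as noise, so the same `purity` function covers subtask a unchanged.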

Task-2
In this task you will develop outlier detection techniques for the HWD dataset
provided in Task-1; the objective is to find “unusual weather days” in this dataset.

Footnote 1: Actually run it 10 times, but then further analyze only the (single) clustering with the lowest SSE.
Footnote 2: Single-spaced; please use an 11-point or 12-point font!

A day can be unusual if it's much hotter or colder than usual (temperature), windier or
calmer than usual (wind speed), more humid or less humid than usual (humidity), or wetter
or drier than usual (rainfall). Each of these things can affect our daily lives. For example,
a very hot day in winter or a very cold day in summer would be unusual. Or, if it rains a
lot more or a lot less than normal, that could also be unusual. To know if a day is unusual,
we need to compare it to what's typical for the location.

In this task, you will use a dataset called the HWD dataset. It contains daily weather data
for Houston in the year 2021, with attributes like date, min_temp, max_temp, rainfall,
wind_speed9am, wind_speed3pm, humidity9am, humidity3pm, pressure9am,
pressure3pm, cloud9am, cloud3pm, temp9am, temp3pm, rain_today, and rain_tomorrow.
However, for this task, we will focus on a subset of the dataset called RHOUSTONW. This
subset includes the following attributes: Date, min_temp, max_temp, rainfall, wind_speed,
humidity, and cloud. In the dataset, wind_speed and humidity refer to wind_speed3pm and
humidity3pm, while cloud is the numerical conversion of cloud3pm from the original
dataset.
The Houston Weather Dataset has the following attributes (name / type / unit / description / range):
DATE / nominal / Each record has a date, from 01/01/2021 to 12/31/2021
cloud / categorical / 17 different types of cloud cover. The categories are
"Fair / Windy", "Partly Cloudy", "Partly Cloudy / Windy", "Cloudy", "Cloudy / Windy",
"Mostly Cloudy", "Mostly Cloudy / Windy", "Fog", "Haze", "Light Rain",
"Light Rain with Thunder", "Thunder", "Rain", "Thunder / Windy", "Heavy T-Storm",
"Thunder in the Vicinity", "T-Storm"
rainfall / continuous / inch / Amount of rainfall of the day / from 0 to 5
min_temp / continuous / Fahrenheit / Minimum temperature at 3pm / from 34 to 83
max_temp / continuous / Fahrenheit / Maximum temperature at 3pm / from 46 to 98
wind_speed / continuous / miles per hour / Wind speed at 3pm / from 0 to 29
humidity / continuous / % / Humidity at 3pm / from 0 to 100

3 Examples in the Weather Dataset:


date      min_temp  max_temp  rainfall  wind_speed  humidity  cloud
1/1/2021  41        55        0         8           51        Mostly Cloudy
1/2/2021  41        59        0         7           42        Fair
1/3/2021  43        68        0         13          37        Fair

Subtasks:
1) Design and implement a distance-based and a model/density-based outlier
detection technique for the Houston Weather Dataset. Each technique, when applied to the
Houston Weather Dataset, should add a column named OLS (Outlier Score) to the examples
in the dataset; OLS is a single number which measures the strength of
our belief that the particular example is an outlier. The challenge for the distance-based
technique will be the development of a “good” distance function for the RHOUSTONW dataset; the
challenge for the density-based technique will be to develop a “good” density function for the
RHOUSTONW dataset. ***********
a) You must design a multivariate distance function and a multivariate density
function that have been tailored to the dataset. You may also use clustering
algorithms, but in that case the marks for the density function and the distance function
will be zero.

b) Please provide a clear definition of the distance and density functions you designed,
and describe and justify your design choices.

2) Apply the two outlier detection techniques to the RHOUSTONW dataset; if your
methods involve hyperparameters, apply the methods 3 times to the dataset using 3
different hyperparameter settings. ****
3) Sort the obtained augmented RHOUSTONW datasets by the OLS attribute. Discuss
the top 3 examples of each augmented dataset; explain why you believe these
examples were viewed as likely outliers. Also discuss the bottom example in each
augmented dataset: try to explain why it was rated to be “most normal”. ****
4) Based on the results you obtained in the previous steps, evaluate and compare the two
outlier detection techniques you developed. **
5) If necessary, enhance your two outlier detection techniques and redo steps 2, 3, and 4!
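
To make subtask 1 more concrete, here is a minimal sketch of one possible distance function and a distance-based outlier score built on it. It is not the expected solution: the Gower-style averaging, the equal attribute weights, the 0/1 mismatch for `cloud`, and the choice of "distance to the k-th nearest neighbour" as OLS are all assumptions made for illustration. Only the value ranges are taken from the attribute list above, and the names `weather_distance` and `knn_outlier_scores` are hypothetical.

```python
# Value ranges quoted in the attribute list of the assignment.
NUMERIC_RANGES = {
    "min_temp": (34, 83),
    "max_temp": (46, 98),
    "rainfall": (0, 5),
    "wind_speed": (0, 29),
    "humidity": (0, 100),
}

def weather_distance(day_a, day_b):
    """Gower-style distance: mean of per-attribute dissimilarities.

    Stays within [0, 1] as long as all values lie inside the quoted ranges.
    """
    parts = []
    for name, (low, high) in NUMERIC_RANGES.items():
        parts.append(abs(day_a[name] - day_b[name]) / (high - low))
    # Assumption: cloud categories are either identical (0) or different (1).
    parts.append(0.0 if day_a["cloud"] == day_b["cloud"] else 1.0)
    return sum(parts) / len(parts)

def knn_outlier_scores(days, k=2):
    """OLS of each day = distance to its k-th nearest neighbour (larger = more unusual)."""
    if len(days) < 2:
        raise ValueError("at least two days are needed to compute neighbour distances")
    scores = []
    for i, day in enumerate(days):
        distances = sorted(
            weather_distance(day, other) for j, other in enumerate(days) if j != i
        )
        scores.append(distances[min(k, len(distances)) - 1])
    return scores
```

Each day is a dictionary keyed by attribute name, for example `{"min_temp": 41, "max_temp": 55, "rainfall": 0, "wind_speed": 8, "humidity": 51, "cloud": "Mostly Cloudy"}`. A tailored design could weight attributes differently, treat related cloud categories as closer to each other, or compare each day with what is typical for its season; the report should justify whichever choices are made.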

Deliverable:

a) Properly commented code. [Add comments above each block. Make variable
and function names descriptive enough to convey their purpose. Add a doc
section at the beginning of each module describing its inputs and outputs, and
briefly mention what it does and how.]
b) Explanation containing
i. Algorithm/pseudocode that explains your detection mechanism
ii. Explanation of how the algorithm works
iii. Example input and output, and a discussion of the input/output

Quality of the Distance function (Weight: 4)
Level 0: No Distance function is presented
Level 1: The Distance function is not very sophisticated/incorrect and will produce wrong outputs in most cases
Level 2: The Distance function is modestly sophisticated/incorrect and will produce wrong outputs in some cases
Level 3: The Distance function is very good

Distance-based outlier detection technique Quality (Weight: 4)
Level 0: No distance-based outlier detection technique is presented
Level 1: The distance-based outlier detection technique is not very sophisticated/incorrect and will produce wrong outputs in most cases
Level 2: The distance-based outlier detection technique is modestly sophisticated/incorrect and will produce wrong outputs in some cases
Level 3: The distance-based outlier detection technique is very good

Quality of the Density function (Weight: 4)
Level 0: No Density function is presented
Level 1: The Density function is not very sophisticated/incorrect and will produce wrong outputs in most cases
Level 2: The Density function is modestly sophisticated/incorrect and will produce wrong outputs in some cases
Level 3: The Density function is very good

Model/density-based outlier detection technique Quality (Weight: 4)
Level 0: No Model/density-based outlier detection technique is presented
Level 1: The Model/density-based outlier detection technique is not very sophisticated/incorrect and will produce wrong outputs in most cases
Level 2: The Model/density-based outlier detection technique is modestly sophisticated/incorrect and will produce wrong outputs in some cases
Level 3: The Model/density-based outlier detection technique is very good

Apply the two outlier detection techniques to the RHOUSTONW dataset; if your
methods involve hyperparameters, apply the methods 3 times to the dataset using 3
different hyperparameter settings.
Deliverable:

1. Properly commented code. [Add comments above each block. Make
variable and function names descriptive enough to convey their purpose.
Add a doc section at the beginning of each module describing its inputs and
outputs, and briefly mention what it does and how.]
2. Explanation containing
a) Example input and output of each iteration
b) Discussion of input/output of each iteration

Input/Output from the three runs Quality (Weight: 3)
Level 0: Inputs, outputs and their discussions are not written in the report
Level 1: One out of three runs is done and/or inputs, outputs and their discussions are poorly written in the report and have many mistakes
Level 2: Two out of three runs are done and/or inputs, outputs and their discussions are modestly written in the report and have some mistakes
Level 3: All runs are done properly; inputs, outputs and their discussions are very good
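
The effect of a hyperparameter is easiest to see on a tiny example. The sketch below runs a one-dimensional "distance to the k-th nearest neighbour" score with three settings of k and collects the input (k) and output (scores) of every run. The humidity readings are made up, and the function names `kth_neighbour_scores` and `run_settings` are hypothetical.

```python
def kth_neighbour_scores(values, k):
    """OLS of each value = absolute difference to its k-th nearest other value."""
    scores = []
    for i, value in enumerate(values):
        gaps = sorted(abs(value - other) for j, other in enumerate(values) if j != i)
        scores.append(gaps[k - 1])  # requires 1 <= k <= len(values) - 1
    return scores

def run_settings(values, settings=(1, 2, 3)):
    """Return {k: scores} so the input and output of every run can be reported."""
    return {k: kth_neighbour_scores(values, k) for k in settings}

# Made-up humidity readings: three similar days and a pair of much more humid days.
humidity = [50, 51, 52, 90, 91]
for k, scores in run_settings(humidity).items():
    print(f"k={k}: {scores}")
```

With k=1 the two humid days shield each other and every day scores 1; with k=2 they stand out with scores of 38 and 39; with k=3 all five days score between 38 and 40 and the ranking is no longer informative. Observations of this kind are what the discussion of each iteration can build on.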

Sort the obtained augmented RHOUSTONW datasets by the OLS attribute. Discuss the
top 3 examples of each augmented dataset; explain why you believe these examples
were viewed as likely outliers. Also discuss the bottom example in each augmented dataset:
try to explain why it was rated to be “most normal”.
Deliverable:

1. Code showing the sorts by the OLS attribute
2. A report containing
i. The top 3 examples of each augmented dataset
ii. Discussion of why they are viewed as likely outlier candidates
iii. The bottom example of each augmented dataset
iv. Discussion of why it was rated to be “most normal”

Presentation of first 3 and bottom 1 samples (Weight: 3)
Level 0: No samples are presented
Level 1: Presented samples from both sides are wrong
Level 2: Presented samples from at least one side are wrong
Level 3: Presented samples from both sides are correct

Discussion of the samples (Weight: 4)
Level 0: No discussion given
Level 1: Discussion is wrong, with lots of erroneous claims
Level 2: Discussion is modest, with some erroneous claims
Level 3: Discussion is very good
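
Sorting by OLS and picking the examples to discuss takes only a few lines. The sketch below assumes the augmented dataset is held as a list of dictionaries with an `OLS` key; the dates come from the sample tables, while the OLS values are made up for illustration and `rank_by_ols` is a hypothetical helper name.

```python
def rank_by_ols(rows, top_n=3):
    """Sort rows by descending OLS; return (top_n most outlying rows, most normal row)."""
    ordered = sorted(rows, key=lambda row: row["OLS"], reverse=True)
    return ordered[:top_n], ordered[-1]

# Hypothetical augmented rows: real sample dates, made-up OLS values.
augmented = [
    {"date": "1/1/2021", "OLS": 0.27},
    {"date": "1/2/2021", "OLS": 0.20},
    {"date": "1/3/2021", "OLS": 0.31},
    {"date": "1/4/2021", "OLS": 0.77},
]
top_three, most_normal = rank_by_ols(augmented)
print([row["date"] for row in top_three], most_normal["date"])
```

If the dataset is kept in a pandas DataFrame instead, `sort_values("OLS", ascending=False)` gives the same ordering.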

Based on the results you obtained in the previous steps evaluate and compare the two
outlier detection techniques you developed.

Deliverable:

A report containing the discussion

Comparison of the two outlier detection techniques (Weight: 4)
Level 0: No discussion given
Level 1: Discussion is wrong, with lots of erroneous claims
Level 2: Discussion is modest, with some erroneous claims
Level 3: Discussion is very good

Report Quality (Weight: 2)
Level 0: No report is given
Level 1: The report is poorly written, with lots of mistakes, and contains many redundant comments and bad organization
Level 2: The report quality is moderate, with some mistakes, and contains a few redundant comments and okay organization
Level 3: The report is very well written, with no redundancy and good organization
