Project Assignment
Task-1
In this project you will use the Houston Weather Dataset, or HWD for short. The first and
last attributes of the HWD should be ignored when clustering this data set; the last
attribute denotes a class variable which will be used in the post-analysis of the clusters
generated by running K-means and DBSCAN.
Houston_Weather Dataset has the following attributes:
DATE / nominal / Each record has a date from 01/01/2006 to
12/31/2021. You may download this dataset from
https://www.kaggle.com/datasets/alejandrochapa/houston-weather-data.
cloud_cover / categorical / % / 0 to 16, each number represents a category
rainfall / continuous / inch / Amount of rainfall of the day
min_temp / continuous / Fahrenheit / Minimum temperature of the day
max_temp / continuous / Fahrenheit / Maximum temperature of the day
wind_speed / continuous / mile per hour / Wind speed at 3pm
pressure / continuous / pai / Atmospheric pressure at 3pm
humidity / continuous / % / Relative humidity at 3pm
class / categorical / % / H, M, L representing High, Medium and Low
humidity
a. Run K-means for k=3 (see footnote 1) on the HWD dataset, excluding the Date and
Class attributes. Using the function you developed in step a, compute the purity of the
obtained clustering result; next, create box plots for the attributes max_temp,
min_temp, rainfall, humidity, wind_speed, and pressure of the 3 clusters obtained
and report their centroids/means. Then summarize, based on the
obtained box plots and centroids/cluster means, what kind of objects each of the 3 clusters
contains (you need to compare the attributes across clusters). Finally, report
the purity of the clustering result and interpret it. ***
b. Try to obtain a DBSCAN clustering for the HWD dataset, excluding the Date and Class
attributes, that has between 2 and 15 clusters and less than 20% outliers. Report its
purity score. Compare the result with the K-means result you obtained in Task 1. ***
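The K-means protocol from footnote 1 (run 10 times, keep the clustering with the lowest SSE) and the purity computation can be sketched as follows. This is only a minimal, dependency-free illustration, not the required solution: the helper names (kmeans_once, kmeans_best_of, purity, noise_fraction) are hypothetical, the toy points and class labels are made up, and leaving DBSCAN noise points (assumed here to carry the label -1) out of the purity computation is an assumption, not something the assignment prescribes.

```python
import random
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run from random initial centroids; returns (labels, centroids, sse)."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse


def kmeans_best_of(points, k, n_runs=10, seed=0):
    """Run K-means n_runs times and keep the single run with the lowest SSE."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: run[2])


def purity(cluster_labels, class_labels, noise_label=None):
    """Fraction of clustered points that belong to the majority class of their cluster."""
    per_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        if cluster == noise_label:  # assumption: noise points are left out
            continue
        per_cluster.setdefault(cluster, Counter())[cls] += 1
    total = sum(sum(counts.values()) for counts in per_cluster.values())
    majority = sum(max(counts.values()) for counts in per_cluster.values())
    return majority / total if total else 0.0


def noise_fraction(cluster_labels, noise_label=-1):
    """Share of points flagged as noise, e.g. to check the 'less than 20%' constraint."""
    return sum(1 for lab in cluster_labels if lab == noise_label) / len(cluster_labels)


# Made-up toy data: three well-separated pairs of 2-D points.
toy_points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9), (9.0, 1.0), (9.1, 1.2)]
toy_classes = ["L", "L", "M", "M", "H", "H"]
labels, centroids, sse = kmeans_best_of(toy_points, k=3)
print("SSE:", round(sse, 3), "purity:", purity(labels, toy_classes))
```

On the real data, the attributes have very different ranges, so rescaling them before clustering is worth considering; whether to do so is a design decision to justify in the report.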
Deliverables:
A. A report (see footnote 2) which contains all deliverables for the subtasks of Task 1.
B. Properly commented software/code you developed as part of Task 1.
Task-2
In this task you will develop outlier detection techniques for the HWD dataset
provided in Task-1; the objective is to find “unusual weather days” in this dataset.
Footnote 1: Actually run it 10 times, but then further analyze only the (single) clustering with the lowest SSE.
Footnote 2: Single-spaced; please use an 11-point or 12-point font!
A day can be unusual if it's much hotter or colder than usual (temperature), windier or
calmer than usual (wind speed), more humid or less humid than usual (humidity), or wetter
or drier than usual (rainfall). Each of these things can affect our daily lives. For example,
a very hot day in winter or a very cold day in summer would be unusual. Or, if it rains a
lot more or a lot less than normal, that could also be unusual. To know if a day is unusual,
we need to compare it to what's typical for the location.
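One way to make "what's typical" concrete is to compare each day with other days from the same part of the year. The sketch below computes a z-score for one attribute relative to the mean and standard deviation of the same month. It is only an illustration under assumptions: grouping by calendar month, the function name monthly_z_scores, the "month" key, and the toy records are all hypothetical choices, not part of the assignment.

```python
from collections import defaultdict
from statistics import mean, pstdev


def monthly_z_scores(records, attribute):
    """Z-score of `attribute` for each record, relative to its own month."""
    by_month = defaultdict(list)
    for record in records:
        by_month[record["month"]].append(record[attribute])
    # Per-month (mean, population standard deviation).
    stats = {m: (mean(values), pstdev(values)) for m, values in by_month.items()}
    scores = []
    for record in records:
        mu, sigma = stats[record["month"]]
        # A month with no variation gives no evidence of unusualness.
        scores.append((record[attribute] - mu) / sigma if sigma > 0 else 0.0)
    return scores


# Made-up toy records, not taken from the dataset.
toy_records = [
    {"month": 1, "max_temp": 50},
    {"month": 1, "max_temp": 60},
    {"month": 1, "max_temp": 70},
    {"month": 7, "max_temp": 95},
    {"month": 7, "max_temp": 95},
]
z = monthly_z_scores(toy_records, "max_temp")
print([round(value, 2) for value in z])
```

A 70-degree day is far above the January mean in this toy data, while the same value would be unremarkable or even cold in July; a global mean would hide that difference.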
In this task, you will use a dataset called the HWD dataset. It contains daily weather data
for Houston in the year 2021, with attributes like date, min_temp, max_temp, rainfall,
wind_speed9am, wind_speed3pm, humidity9am, humidity3pm, pressure9am,
pressure3pm, cloud9am, cloud3pm, temp9am, temp3pm, rain_today, and rain_tomorrow.
However, for this task, we will focus on a subset of the dataset called RHOUSTONW. This
subset includes the following attributes: Date, min_temp, max_temp, rainfall, wind_speed,
humidity, and cloud. In the dataset, wind_speed and humidity refer to wind_speed3pm and
humidity3pm, while cloud is the numerical conversion of cloud3pm from the original
dataset.
Houston_Weather Dataset has the following attributes:
DATE / nominal / Each record has a date from 01/01/2021 to
12/31/2021
cloud / categorical / % / 17 different types of cloud cover. Categories
are "Fair / Windy", "Partly Cloudy", "Partly Cloudy / Windy", "Cloudy",
"Cloudy / Windy", "Mostly Cloudy", "Mostly Cloudy / Windy", "Fog", "Haze",
"Light Rain", "Light Rain with Thunder", "Thunder", "Rain",
"Thunder / Windy", "Heavy T-Storm", "Thunder in the Vicinity", "T-Storm"
rainfall / continuous / inch / Amount of rainfall of the day / from 0 to 5
min_temp / continuous / Fahrenheit / Minimum temperature at 3pm / from 34 to 83
max_temp / continuous / Fahrenheit / Maximum temperature at 3pm / from 46 to 98
wind_speed / continuous / mile per hour / Wind speed at 3pm / from 0 to 29
humidity / continuous / % / Humidity at 3pm / from 0 to 100
Subtasks:
1) Design and implement a distance-based and a model/density-based object outlier
detection technique for the Houston Weather Dataset. Each technique, when applied to the
Houston Weather Dataset, should add a column named OLS (Outlier Score) to the examples
in the dataset; this column contains a single number which measures the strength of
our belief that the particular example is an outlier. The challenge for the first technique will
be the development of a “good” distance function for the RHOUSTONW dataset; the
challenge for the second technique will be to develop a “good” density function for the
RHOUSTONW dataset. ***********
a) You must design a multivariate distance function and a multivariate density
function that have been tailored to the dataset. You may also use clustering
algorithms, but in that case the marks for the density function and the distance function
will be zero.
b) Please provide a clear definition of the distance and density functions you designed,
and describe and justify your design choices.
2) Apply the two outlier detection techniques to the RHOUSTONW dataset; if your
methods involve hyperparameters, apply the methods 3 times to the dataset using 3
different hyperparameter settings. ****
3) Sort the obtained augmented RHOUSTONW datasets by the OLS attribute. Discuss
the top 3 examples of each augmented dataset; explain why you believe these
examples were viewed as likely outliers. Also discuss the bottom example in each
augmented dataset: try to explain why it was rated “most normal”. ****
4) Based on the results you obtained in the previous steps, evaluate and compare the two
outlier detection techniques you developed. **
5) If necessary, enhance your two outlier detection techniques and redo steps d, e, and f!
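To illustrate what a distance-based OLS in subtask 1 might look like, the sketch below scores each day by its average distance to its k nearest neighbours. Everything here is an assumption made for illustration and would need to be justified or replaced in an actual submission: the range-normalized weighted Euclidean distance, the 0/1 mismatch penalty for the categorical cloud attribute, the function names, the default weights, and the made-up toy rows.

```python
NUMERIC = ("min_temp", "max_temp", "rainfall", "wind_speed", "humidity")


def attribute_ranges(rows):
    """Observed range of each numeric attribute (1.0 if the attribute is constant)."""
    return {
        a: (max(r[a] for r in rows) - min(r[a] for r in rows)) or 1.0
        for a in NUMERIC
    }


def weather_distance(x, y, ranges, weights=None):
    """Weighted, range-normalized Euclidean distance plus a cloud mismatch penalty."""
    weights = weights or {}
    total = 0.0
    for a in NUMERIC:
        diff = (x[a] - y[a]) / ranges[a]  # each numeric term lies in [-1, 1]
        total += weights.get(a, 1.0) * diff * diff
    if x["cloud"] != y["cloud"]:  # categorical attribute: 0 if equal, 1 otherwise
        total += weights.get("cloud", 1.0)
    return total ** 0.5


def knn_outlier_scores(rows, k=3, weights=None):
    """OLS = mean distance to the k nearest other rows (larger = more unusual)."""
    ranges = attribute_ranges(rows)
    scores = []
    for i, x in enumerate(rows):
        distances = sorted(
            weather_distance(x, y, ranges, weights)
            for j, y in enumerate(rows)
            if j != i
        )
        nearest = distances[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores


def make_day(min_temp, max_temp, rainfall, wind_speed, humidity, cloud):
    return {"min_temp": min_temp, "max_temp": max_temp, "rainfall": rainfall,
            "wind_speed": wind_speed, "humidity": humidity, "cloud": cloud}


# Made-up toy rows: four similar days and one stormy, cold day.
toy_days = [
    make_day(60, 80, 0.0, 8, 55, "Partly Cloudy"),
    make_day(61, 81, 0.0, 9, 57, "Partly Cloudy"),
    make_day(59, 79, 0.1, 7, 54, "Partly Cloudy"),
    make_day(60, 82, 0.0, 8, 56, "Partly Cloudy"),
    make_day(34, 46, 5.0, 29, 100, "Heavy T-Storm"),
]
distance_ols = knn_outlier_scores(toy_days, k=2)
for day, score in zip(toy_days, distance_ols):
    day["OLS"] = score  # the added Outlier Score column
print([round(score, 3) for score in distance_ols])
```

Here k and the attribute weights are the hyperparameters; the per-attribute weights are one place where knowledge about the dataset (for example, that most days have zero rainfall) could be built into the distance function.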
Deliverable:
a) Properly commented code. [Add comments above each block. Make variable
and function names descriptive enough to convey their purpose. Add a doc
section at the beginning of each module describing its inputs and outputs, and
briefly mention what it does and how it does it.]
b) Explanation containing
[...] outlier detection technique quality | [...] outlier detection technique is presented | [...] detection technique is not very sophisticated/incorrect and will produce wrong outputs in most cases | [...] detection technique is modestly sophisticated/incorrect and will produce wrong outputs in some cases | [...] outlier detection technique is very good
Quality of the density function | No density function is presented | The density function is not very sophisticated/incorrect and will produce wrong outputs in most cases | The density function is modestly sophisticated/incorrect and will produce wrong outputs in some cases | The density function is very good | 4
Model/density-based outlier detection technique quality | No model/density-based outlier detection technique is presented | The model/density-based outlier detection technique is not very sophisticated/incorrect and will produce wrong outputs in most cases | The model/density-based outlier detection technique is modestly sophisticated/incorrect and will produce wrong outputs in some cases | The model/density-based outlier detection technique is very good | 4
Apply the two outlier detection techniques to the RHOUSTONW dataset; if your
methods involve hyperparameters, apply the methods 3 times to the dataset using 3
different hyperparameter settings.
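As an illustration of running a density-based technique under three hyperparameter settings, the sketch below uses a leave-one-out Gaussian kernel density estimate on standardized numeric attributes and turns low density into a high OLS. This is a generic baseline under assumptions, not a tailored density function: the kernel, the standardization, the three bandwidth values, the function names, and the made-up toy rows are all hypothetical.

```python
import math


def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        (sum((v - m) ** 2 for v in col) / len(col)) ** 0.5 or 1.0
        for col, m in zip(columns, means)
    ]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]


def kernel_density_scores(rows, bandwidth):
    """OLS = negative log of a leave-one-out Gaussian kernel density estimate."""
    points = standardize(rows)
    scores = []
    for i, x in enumerate(points):
        density = sum(
            math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * bandwidth ** 2))
            for j, y in enumerate(points)
            if j != i
        ) / (len(points) - 1)
        # Small constant avoids log(0) for points with no close neighbours.
        scores.append(-math.log(density + 1e-12))
    return scores


# Made-up toy rows: (min_temp, max_temp, rainfall, wind_speed, humidity).
toy_rows = [
    (60, 80, 0.0, 8, 55),
    (61, 81, 0.0, 9, 57),
    (59, 79, 0.1, 7, 54),
    (60, 82, 0.0, 8, 56),
    (34, 46, 5.0, 29, 100),
]
density_runs = {}
for bandwidth in (0.25, 0.5, 1.0):  # three hyperparameter settings
    density_runs[bandwidth] = kernel_density_scores(toy_rows, bandwidth)
    print(bandwidth, [round(s, 2) for s in density_runs[bandwidth]])
```

Comparing how the ranking changes across the three bandwidths is one way to discuss the sensitivity of the method: a small bandwidth makes almost every isolated day look extreme, while a large one smooths real outliers away.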
Deliverable:
[...] from the three runs: quality | [...] discussions are not written in the report | Input, outputs and their discussions are poorly written in the report and have many mistakes | Input, outputs and their discussions are modestly written in the report and have some mistakes | Input, outputs and their discussions are very good
Sort the obtained augmented RHOUSTONW datasets by the OLS attribute. Discuss the
top 3 examples of each augmented dataset; explain why you believe these examples
were viewed as likely outliers. Also discuss the bottom example in each augmented dataset:
try to explain why it was rated “most normal”.
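The sorting step, and a simple aid for explaining why an example scored high, might be sketched as follows. The helper names (rank_by_ols, largest_deviations), the use of global z-scores as an explanation aid, and the toy rows with their OLS values are assumptions made up for illustration; they are not results from the dataset.

```python
from statistics import mean, pstdev


def rank_by_ols(rows, top_n=3):
    """Sort by OLS (highest first); return the top_n rows and the bottom row."""
    ordered = sorted(rows, key=lambda r: r["OLS"], reverse=True)
    return ordered[:top_n], ordered[-1]


def largest_deviations(row, rows, attributes):
    """Attributes of `row` ordered by how far they lie from the dataset mean (|z|)."""
    result = []
    for a in attributes:
        values = [r[a] for r in rows]
        sigma = pstdev(values)
        z = (row[a] - mean(values)) / sigma if sigma > 0 else 0.0
        result.append((a, round(z, 2)))
    return sorted(result, key=lambda item: abs(item[1]), reverse=True)


# Made-up augmented rows (labels, values, and OLS are invented).
augmented = [
    {"Date": "toy-day-1", "max_temp": 46, "rainfall": 0.0, "OLS": 2.9},
    {"Date": "toy-day-2", "max_temp": 82, "rainfall": 0.0, "OLS": 0.3},
    {"Date": "toy-day-3", "max_temp": 93, "rainfall": 0.1, "OLS": 0.2},
    {"Date": "toy-day-4", "max_temp": 80, "rainfall": 5.0, "OLS": 2.5},
    {"Date": "toy-day-5", "max_temp": 84, "rainfall": 0.0, "OLS": 0.1},
]
top3, most_normal = rank_by_ols(augmented)
for row in top3:
    print(row["Date"], row["OLS"], largest_deviations(row, augmented, ["max_temp", "rainfall"]))
print("most normal:", most_normal["Date"])
```

Listing the attributes with the largest deviations gives a starting point for the written discussion, but it does not replace an argument about why the combination of values is unusual.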
Deliverable:
Based on the results you obtained in the previous steps, evaluate and compare the two
outlier detection techniques you developed.
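Two simple, quantitative ways to support such a comparison are the overlap between the top-k lists of the two techniques and the rank correlation of their OLS values. The sketch below is one possible approach under assumptions: the function names are hypothetical, the Spearman formula used is the textbook one that assumes no tied scores, and the example scores are made up.

```python
def ranks(scores):
    """Rank 1 = highest score. Ties are broken by position (no tie correction)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    rank = [0] * len(scores)
    for position, index in enumerate(order):
        rank[index] = position + 1
    return rank


def spearman(scores_a, scores_b):
    """Spearman rank correlation, 1 - 6*sum(d^2) / (n*(n^2-1)); assumes no ties."""
    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    n = len(ranks_a)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def top_k_overlap(scores_a, scores_b, k):
    """Fraction of the k highest-scored examples shared by both techniques."""
    top_a = set(sorted(range(len(scores_a)), key=lambda i: scores_a[i], reverse=True)[:k])
    top_b = set(sorted(range(len(scores_b)), key=lambda i: scores_b[i], reverse=True)[:k])
    return len(top_a & top_b) / k


# Made-up OLS values from two hypothetical techniques for the same four days.
distance_scores = [0.9, 0.1, 0.8, 0.2]
density_scores = [0.7, 0.9, 0.8, 0.1]
print("Spearman:", spearman(distance_scores, density_scores))
print("Top-2 overlap:", top_k_overlap(distance_scores, density_scores, 2))
```

Agreement measures only show where the two techniques differ; the days on which they disagree are the ones worth examining in the written evaluation.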
Deliverable:
Comparison of the two outlier detection techniques | No discussion given | Discussion is wrong, with lots of erroneous claims | Discussion is modest, with some erroneous claims | Discussion is very good | 4