DM Lab 1
arff
Aim: This experiment illustrates some of the basic data preprocessing operations that can be
performed using WEKA-Explorer. The sample dataset used for this example is the student
data available in arff format.
EXCEL sheet
Step1: Take the existing Student data set and save it as CSV (Macintosh). Open the WEKA tool and
click on Tools - ArffViewer. Open the Student file; the converted ARFF file is as follows:
CSV (Comma Separated Values)
Open the saved Student.xls and save it as CSV.
This generates the Student.csv file.
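The CSV-to-ARFF conversion that ArffViewer performs can be illustrated outside WEKA with a minimal sketch. This is not WEKA's converter: the function name csv_to_arff, the rule for guessing numeric versus nominal columns, and the sample rows are all assumptions made for illustration, not the lab's actual Student data.

```python
import csv
import io


def csv_to_arff(csv_text, relation="student"):
    """Convert CSV text (header row first) to ARFF text.

    Columns whose values all parse as numbers become numeric attributes;
    every other column becomes a nominal attribute listing its distinct values.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        if all(is_number(v) for v in column):
            lines.append("@attribute " + name + " numeric")
        else:
            values = ",".join(sorted(set(column)))
            lines.append("@attribute " + name + " {" + values + "}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines)


# Hypothetical sample rows, not the lab's actual Student data.
sample_csv = "age,income,student\n25,high,no\n32,low,yes\n41,medium,yes\n"
arff_text = csv_to_arff(sample_csv)
print(arff_text)
```

The output has the same three parts as the ARFF file shown by WEKA: an @relation line, one @attribute line per column, and the rows under @data.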
Step2: Loading the data. We can load the dataset into WEKA by clicking on the Open file button in
the Preprocess interface and selecting the appropriate file.
Step3: Once the data is loaded, WEKA recognizes the attributes, and during the scan of the
data it computes some basic statistics on each attribute. The left panel in the above
figure shows the list of recognized attributes, while the top panel indicates the names of the
base relation or table and the current working relation (which are the same initially).
Step4: Clicking on an attribute in the left panel shows the basic statistics for that attribute.
For categorical attributes, the frequency of each attribute value is shown, while for
continuous attributes we can obtain the min, max, mean, standard deviation, etc.
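The statistics described in Step 4 can be reproduced by hand with a short sketch. The function name summarize and the sample values are hypothetical, and the sample standard deviation (dividing by n - 1) is an assumption about the formula, not a statement of how WEKA computes it.

```python
from collections import Counter
import statistics


def summarize(values):
    """Return value frequencies for nominal data, or min/max/mean/stdev for numeric data."""
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        # At least one value is not a number: treat the attribute as categorical.
        return dict(Counter(values))
    return {
        "min": min(numbers),
        "max": max(numbers),
        "mean": statistics.mean(numbers),
        "stdev": statistics.stdev(numbers),  # sample standard deviation (assumed)
    }


# Hypothetical attribute columns, not the lab's actual Student data.
ages = ["25", "32", "41", "32"]
incomes = ["high", "low", "medium", "low"]

age_stats = summarize(ages)
income_counts = summarize(incomes)
print(age_stats)
print(income_counts)
```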
Step5: The visualization in the bottom-right panel shows a cross-tabulation across two
attributes.
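A cross-tabulation simply counts how often each pair of values from two attributes occurs together. A minimal sketch, with a hypothetical function name cross_tab and hypothetical sample columns:

```python
from collections import Counter


def cross_tab(xs, ys):
    """Count how often each (x, y) pair of attribute values occurs together."""
    return Counter(zip(xs, ys))


# Hypothetical attribute columns, not the lab's actual Student data.
incomes = ["high", "low", "medium", "low"]
student = ["no", "yes", "yes", "yes"]

table = cross_tab(incomes, student)
for (income, is_student), count in sorted(table.items()):
    print(income, is_student, count)
```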
Dataset Student.arff file opened with Notepad:
1.Discretize:
Association rule mining can often be performed only on categorical data. This requires
performing discretization on numeric or continuous attributes.
In the following example, let us discretize the age attribute.
Select the path as follows: “choose-filters-unsupervised-attribute-Discretize”.
To change the defaults for the filter, click on the box immediately to the right of the
Choose button.
Enter the index of the attribute to be discretized. In this case the attribute is age, so
we must enter ‘1’, corresponding to the age attribute.
Enter ‘3’ as the number of bins. Leave the remaining field values as they are.
Click the OK button.
Click Apply in the filter panel. This results in a new working relation with the
selected attribute partitioned into 3 bins.
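With the remaining fields left at their defaults, the filter splits the attribute's range into bins of equal width. The idea can be sketched as follows. The function name equal_width_bins and the sample ages are hypothetical, and the handling of values that fall exactly on a bin boundary is an assumption that may differ from WEKA's.

```python
def equal_width_bins(values, bins=3):
    """Assign each value a bin index 0..bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    indices = []
    for v in values:
        if width == 0:
            # All values are identical, so everything falls in the first bin.
            index = 0
        else:
            # Clamp so that the maximum value lands in the last bin.
            index = min(int((v - lo) / width), bins - 1)
        indices.append(index)
    return indices


# Hypothetical ages, not the lab's actual Student data.
ages = [18, 22, 25, 31, 36, 45, 48]
age_bins = equal_width_bins(ages, bins=3)
print(age_bins)
```

Here the range 18 to 48 is divided into three intervals of width 10, so each age is replaced by the index of the interval it falls in.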
2.ReplaceWithMissingValue:
Select the path as follows: “choose-filters-unsupervised-attribute-ReplaceWithMissingValue”.
On applying that filter, values in the current data are replaced with missing values with the
probability set in the filter options.
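The effect of this filter can be sketched as a biased coin flip per value. This is an illustration, not WEKA's implementation: the function name replace_with_missing, the seed parameter, and the sample values are hypothetical. The ‘?’ marker is the symbol ARFF uses for a missing value.

```python
import random


def replace_with_missing(values, probability, seed=1, missing="?"):
    """Replace each value with the missing marker with the given probability."""
    rng = random.Random(seed)  # fixed seed so that repeated runs give the same result
    return [missing if rng.random() < probability else v for v in values]


# Hypothetical attribute column, not the lab's actual Student data.
ages = ["25", "32", "41", "32"]

none_replaced = replace_with_missing(ages, probability=0.0)
all_replaced = replace_with_missing(ages, probability=1.0)
some_replaced = replace_with_missing(ages, probability=0.5)
print(some_replaced)
```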
3.ReplaceMissingWithUserConstant:
Select the path as follows:
“choose-filters-unsupervised-attribute-ReplaceMissingWithUserConstant”.
On applying that filter, the missing values in the current data are replaced with the
constants given by the user.
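The replacement itself is a simple substitution, sketched below. The function name replace_missing, the chosen constant, and the sample values are hypothetical. ‘?’ is the ARFF marker for a missing value.

```python
def replace_missing(values, constant, missing="?"):
    """Replace every missing marker in the column with a user-supplied constant."""
    return [constant if v == missing else v for v in values]


# Hypothetical attribute column with one missing value.
ages = ["25", "?", "41"]
filled = replace_missing(ages, constant="0")
print(filled)
```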