Data Analytics 02: Drag Connect It Change Remove Cabin, Life Boat, Name, and Ticket Number
NOTE: The result will be a data set only containing those columns we believe will contribute well
to our outlier detection. We will use a distance-based outlier detection algorithm which
calculates the Euclidean distance between the data points and marks those points which are
farthest away from other data points as outliers. The Euclidean distance is computed from the
differences between two data points in each individual attribute. Think about it: what is the effect on the
distance if the attributes have different value ranges (one attribute between 0 and 5 and
another attribute between 1 and 1000)? Attributes with larger values will contribute much more
than those with smaller values. For this reason we should ensure that all attributes are using
similar value ranges. This transformation is called Normalization.
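To make the effect of different value ranges concrete, here is a minimal sketch in Python (not part of the RapidMiner process). The three points and their values are hypothetical, chosen only to mirror the two ranges mentioned above: the first attribute lies between 0 and 5, the second between 1 and 1000.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical points: (attribute in 0..5, attribute in 1..1000)
p = (0.0, 100.0)
q = (5.0, 100.0)  # differs from p by the full range of the small attribute
r = (0.0, 150.0)  # differs from p by about 5% of the large attribute's range

d_pq = euclidean(p, q)
d_pr = euclidean(p, r)
print(d_pq, d_pr)
```

Although q differs from p as much as the first attribute allows, r ends up ten times farther away from p, simply because the second attribute uses larger numbers.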
NOTE: In general, you should always normalize your data before you apply distance-based
algorithms like outlier detection or k-Means clustering. Using the default parameters, the
Normalize operator will perform a z-Transformation (also known as Standardization) which
results in a mean value of 0 and a standard deviation of 1 for each attribute. In other words, all
of the attributes are on the same scale after normalization and can be compared with one
another.
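The z-Transformation described above can be sketched in a few lines of Python. This is an illustration only, not the Normalize operator's implementation: the function name z_transform and the sample values are hypothetical, and the use of the population standard deviation is an assumption (the operator may use the sample standard deviation instead).

```python
import statistics

def z_transform(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    # Assumption: population standard deviation; a tool may use the sample version.
    std = statistics.pstdev(values)
    if std == 0:
        # A constant attribute carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical attributes with very different value ranges
small = [0, 1, 2, 3, 4, 5]
large = [1, 200, 400, 600, 800, 1000]

z_small = z_transform(small)
z_large = z_transform(large)
print(z_small)
print(z_large)
```

After the transformation both attributes have a mean of 0 and a standard deviation of 1, so neither dominates a distance calculation.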
DETECT OUTLIERS.
1. Search for the operator Detect Outlier (Distances), add it, and connect it to Normalize.
NOTE: With its default parameters, this operator will identify the 10 examples which are farthest away from all others and
mark them as outliers. It creates a new column named outlier with true as the value for the
outliers and false for all other examples.
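The idea behind a distance-based outlier detection can be sketched as follows. This is one common formulation, in which each example is scored by the distance to its k-th nearest neighbor and the highest-scoring examples are flagged; whether the operator computes its scores in exactly this way is an assumption. The function detect_outliers, its parameters n_outliers and k, and the sample points are hypothetical.

```python
import math

def detect_outliers(points, n_outliers, k):
    """Return one boolean per point; True marks the n_outliers points whose
    distance to their k-th nearest neighbor is largest."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    # Indices ordered from highest to lowest score
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_outliers])
    return [i in flagged for i in range(len(points))]

# Hypothetical, already normalized data: four close points and one far away
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.1, 0.1), (5.0, 5.0)]
outlier = detect_outliers(points, n_outliers=1, k=2)
print(outlier)
```

The returned list plays the role of the new outlier column: the last point is flagged with true, all others with false.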
Dr. Stephan Kupsch
1. Add Filter Examples to the process and connect it to the previous operator and also to
the result port on the right.
2. In its Parameters, add a new filter with outlier, equals, and false as values.
NOTE: The process might run for some time but will switch to the Results view automatically
when it is finished. You will notice that the result is a data set with 1299 examples - the 10
outliers have successfully been removed.
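The filter condition and the resulting row count can be checked with a short Python sketch. The rows below are placeholder data, not the actual data set: they only reproduce the counts from this exercise (1309 examples, 10 of them flagged), and the id field is hypothetical.

```python
# Placeholder rows: 1309 examples, of which the first 10 are flagged as outliers
rows = [{"id": i, "outlier": i < 10} for i in range(1309)]

# Same condition as the filter: keep rows where outlier equals false
kept = [row for row in rows if row["outlier"] is False]

print(len(rows), len(kept))
```

The filtered list contains 1299 rows, which matches the size of the result shown in the Results view.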
USE BREAKPOINTS TO SEE INTERMEDIATE RESULTS.
NOTE: The process is paused at the breakpoint and the intermediate results are shown. This
makes breakpoints a useful tool for debugging your process! All 1309 examples are still in the
data set at this point of the process. Click the Resume icon in the toolbar to continue the process
and see the final result.
By now, you have found 10 outliers in your data and successfully removed them! This cleansing step
can often improve the quality of your models.
TASKS:
How would you change the process so it finds 20 outliers instead of 10?
How can you change the process to only show outliers instead of removing them?
Replace the outlier detection operator with Detect Outlier (LOF) and add a breakpoint
after this operator before you execute the process. How does the result differ from before?
How do you now need to change the filter to only keep the top outliers?