Unit 3 - Part 2
Every business sector recognizes that the volumes of diverse data collected about customers, competitors and operations are important resources with significant potential value in the digital economy, helping to resolve uncertainty and inherent risk.
In a digital economy, where the internal and external business environment changes rapidly, mistakes in strategic decisions can have dire consequences for a company. This creates a growing need for business data analytics and visualization technologies that help companies manage and analyze voluminous data, supporting and improving management decision-making.
The cluster analysis tool is mainly used to segregate customers into different categories, identify the target audience and potential leads, and understand customer traits. We can also understand cluster analysis as an automated segmentation technique that divides data into different groups based on their characteristics. It is an unsupervised technique, commonly applied to big data.
Example: Let's understand the clustering technique with the real-world example of a shopping mall. When we visit a mall, we can observe that items with similar uses are grouped together: t-shirts in one section, trousers in another; likewise, in the produce section, apples, bananas, mangoes, and so on are kept in separate areas, so that we can easily find things. The clustering technique works in the same way. Another example of clustering is grouping documents by topic.
The clustering technique can be widely used in various tasks. Some most common uses of this
technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general uses, Amazon applies clustering in its recommendation system to suggest products based on a user's past searches. Netflix likewise uses the technique to recommend movies and web series to its users based on their watch history.
The diagram below illustrates the working of a clustering algorithm: different fruits are divided into several groups with similar properties.
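The grouping idea described above can be sketched with k-means, one common clustering algorithm. The following is a minimal plain-Python illustration; the two-dimensional "customer" points (e.g. spend vs. visits) are invented for the example:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch: assign each point to its nearest centre,
    then recompute each centre as the mean of its assigned points."""
    random.seed(seed)
    centres = random.sample(points, k)          # pick k initial centres
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x, y in points:
            # nearest centre by squared Euclidean distance
            i = min(range(k),
                    key=lambda c: (x - centres[c][0]) ** 2
                                + (y - centres[c][1]) ** 2)
            clusters[i].append((x, y))
        for i, members in enumerate(clusters):
            if members:                          # empty clusters keep their centre
                centres[i] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centres, clusters

# Two invented groups of "customers" (e.g. monthly spend vs. store visits)
data = [(1, 1), (1, 2), (2, 1), (2, 2),   # low spenders
        (8, 8), (8, 9), (9, 8), (9, 9)]   # high spenders
centres, clusters = kmeans(data, k=2)
```

With well-separated groups like these, the algorithm settles on one centre per group after a few iterations, mirroring how the mall example separates t-shirts from trousers.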
Fuzzy Clustering
Fuzzy clustering is a type of soft clustering method in which a data object may belong to more than one group or cluster. Each data point has a set of membership coefficients that indicate its degree of membership in each cluster. The fuzzy c-means algorithm is an example of this type of clustering; it is sometimes also known as the fuzzy k-means algorithm.
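The membership coefficients mentioned above can be sketched for the one-dimensional case as follows. The cluster centres are held fixed here purely for illustration; a full fuzzy c-means run would also update the centres iteratively:

```python
def fuzzy_memberships(x, centres, m=2.0):
    """Membership of point x in each cluster, per the fuzzy c-means rule:
    u_j = 1 / sum_k (d_j / d_k) ** (2 / (m - 1)), with d_j = |x - centre_j|."""
    d = [abs(x - c) for c in centres]
    if 0.0 in d:                      # point sits exactly on a centre
        return [1.0 if dj == 0.0 else 0.0 for dj in d]
    p = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** p for dk in d) for dj in d]

u = fuzzy_memberships(2.0, [0.0, 10.0])   # closer to the first centre
# u[0] > u[1], and the coefficients always sum to 1
```

Unlike hard clustering, the point is not forced into one group: it belongs mostly, but not entirely, to the nearer cluster.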
Benefits of Cluster Analysis
Here are two of the most significant benefits of cluster analysis:
Undirected Data Mining Technique:- Cluster analysis is an undirected, or exploratory, data mining technique. This means one cannot form a hypothesis or predict the result of cluster analysis in advance; instead, it surfaces hidden patterns and structures in the data. In simple terms, while performing cluster analysis, one does not have a target variable in mind, and the results can be unexpected.
Arranged Data for Other Algorithms:- Businesses use various analytics and machine learning tools, but some of these tools only work on structured input. We can use cluster analysis to arrange data into a meaningful form for analysis by such machine learning software.
Outlier Analysis
“Outlier Analysis is a process that involves identifying the anomalous observation in the
dataset.”
Let us first understand what outliers are. An outlier is an extreme value that deviates from the other observations in the dataset; in other words, a data point that lies outside of what is expected.
When analyzing a data set, you will always have some assumptions about how the data was generated. Data points that are likely to contain some form of error are outliers, and depending on the context, you will want to correct or account for those errors. Outliers can cause anomalies in the results obtained, so they require special attention and, in some cases, need to be removed in order to analyze the data effectively.
There are two main reasons why giving outliers special attention is a necessary aspect of the
data analytics process:
o Outliers may have a negative effect on the result of an analysis
o Outliers, or their behavior, may be exactly the information a data analyst requires from the analysis
Outliers are caused by incorrect entry, computational errors, sampling errors, and so on. For example, a person's weight displayed as 1000 kg could be caused by a program's default setting for an unrecorded weight. Alternatively, outliers may be a result of inherent variability in the data.
In another real-world example, the average height of a giraffe is about 16 feet. However, there have been recent discoveries of two giraffes standing 9 feet and 8.5 feet tall, respectively. These two giraffes would be considered outliers in comparison to the general giraffe population.
The process in which the behaviour of the outliers is identified in a dataset is called outlier
analysis.
Many algorithms are used to minimize the effect of outliers or to eliminate them. Outlier analysis has various applications, such as fraud detection (unusual usage of credit cards or telecommunication services) and identifying customers' spending patterns in marketing.
Let’s see how we will view the analysis problem -
1. In a given data set, define what data could be considered as inconsistent.
2. Find an efficient method to extract the outliers so defined.
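One simple way to carry out both steps is the z-score rule: define "inconsistent" as lying more than a chosen number of standard deviations from the mean, then extract every point that violates it. The weights below are invented, and the 3-sigma threshold is a common convention rather than the only choice:

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Step 1: define 'inconsistent' as lying more than `threshold`
    standard deviations from the mean. Step 2: extract those points."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# 20 plausible body weights (kg) plus one 1000 kg data entry error
weights = [62, 65, 70, 68, 74, 71, 66, 69, 63, 72,
           67, 70, 64, 73, 68, 66, 71, 69, 65, 72, 1000]
outliers = zscore_outliers(weights)   # flags the 1000 kg entry
```

Note that on very small samples a single extreme value inflates the standard deviation itself, which caps the achievable z-score, so the threshold should be chosen with the sample size in mind.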
Types of Outliers
Outliers are divided into three different types
1. Global or point outliers
2. Collective outliers
3. Contextual or conditional outliers
Global Outliers
A global (or point) outlier is a single data object that deviates significantly from the rest of the data set, regardless of any context; for example, one bank transaction whose amount is far larger than every other transaction.
Collective Outliers
A collective outlier is a subset of data objects that, as a group, deviates significantly from the whole data set, even though the individual objects may not be outliers on their own; for example, a sudden burst of near-identical requests arriving at a web server.
Contextual Outliers
As the name suggests, a contextual outlier is an outlier within a particular context; for example, in speech recognition, a single burst of background noise. Contextual outliers are also known as conditional outliers. These outliers occur when a data object deviates from the other data points because of a specific condition in a given data set. As we know, data objects have two types of attributes: contextual attributes and behavioral attributes. Contextual outlier analysis enables users to examine outliers in different contexts and conditions, which can be useful in various applications. For example, a temperature reading of 45 degrees Celsius may behave as an outlier in the rainy season, yet it will behave like a normal data point in the context of the summer season. In the given diagram, a green dot representing a low temperature value in June is a contextual outlier, since the same value in December is not an outlier.
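The temperature example can be sketched as a tiny context check. The monthly ranges below are invented for illustration; a real contextual detector would learn such ranges from historical data:

```python
# Invented typical monthly temperature ranges (degrees C) for one city;
# a real system would estimate these from historical observations.
TYPICAL = {"June": (20, 35), "December": (-5, 12)}

def is_contextual_outlier(month, temp):
    """A reading is an outlier only relative to its context (the month)."""
    low, high = TYPICAL[month]
    return not (low <= temp <= high)

is_contextual_outlier("June", 8)       # True: too cold for June
is_contextual_outlier("December", 8)   # False: normal for December
```

The same value, 8 degrees, is flagged in one context and accepted in the other, which is exactly what distinguishes contextual outliers from global ones.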
Causes of Outliers
Data Entry Errors:- Human errors, such as errors made during data collection, recording, or entry, can cause outliers in data. For example, a customer's annual income is $100,000, but the data entry operator accidentally puts an additional zero in the figure. The income becomes $1,000,000, which is 10 times higher; evidently, this will be an outlier value when compared with the rest of the population.
Measurement Error: This is the most common source of outliers and occurs when the measurement instrument turns out to be faulty. For example, suppose there are 10 weighing machines, 9 of which are correct and 1 faulty. Weights measured on the faulty machine will be higher or lower than those of the rest of the group and can lead to outliers.
Experimental Error: Another cause of outliers is experimental error. For example, in a 100 m sprint of 7 runners, one runner missed the 'Go' call and started late. His run time was therefore longer than the other runners', and his total run time can be an outlier.
Intentional Outlier: This is commonly found in self-reported measures that involve sensitive data. For example, teens typically underreport the amount of alcohol they consume, and only a fraction report the actual value. The accurate values may look like outliers because the rest of the teens are underreporting their consumption.
Data Processing Error: Whenever we perform data mining, we extract data from
multiple sources. It is possible that some manipulation or extraction errors may lead to
outliers in the dataset.
Sampling error: For instance, suppose we have to measure the height of athletes, but by mistake we include a few basketball players in the sample. This inclusion is likely to cause outliers in the dataset.
Natural Outlier: When an outlier is not artificial (i.e., not due to error), it is a natural outlier. For instance, in a renowned insurance company, the performance of the top 50 financial advisors may be far higher than that of the rest of the population. Surprisingly, this is not due to any error; hence, whenever we perform any data mining activity with advisors, we treat this segment separately.
Example:
As you can see, a data set with an outlier has a significantly different mean and standard deviation from the same data without it. In the first scenario (without the outlier), the average is 5.45; with the outlier, the average soars to 30. This would change the estimate completely.
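The original example's data is not reproduced here, but the effect is easy to demonstrate with a small invented sample:

```python
from statistics import mean, stdev

values = [4, 5, 5, 6, 6, 7, 5, 6]     # invented sample
with_outlier = values + [300]          # one extreme entry

print(mean(values))         # 5.5
print(mean(with_outlier))   # ~38.2: a single point drags the mean up
print(stdev(values))        # ~0.93
print(stdev(with_outlier))  # ~98: the spread estimate explodes
```

One extreme value shifts the mean by roughly a factor of seven here and inflates the standard deviation by two orders of magnitude, which is why estimates based on contaminated data can be so misleading.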
2. Visualizations
In data analytics, analysts create data visualizations to present data graphically in a meaningful
and impactful way, in order to present their findings to relevant stakeholders. These
visualizations can easily show trends, patterns, and outliers from a large set of data in the form
of maps, graphs and charts.
(i) Identifying outliers with box plots
Visualizing data as a box plot makes it very easy to spot outliers. A box plot will show
the “box” which indicates the interquartile range (from the lower quartile to the upper
quartile, with the middle indicating the median data value) and any outliers will be shown
outside of the “whiskers” of the plot, each side representing the minimum and maximum
values of the dataset, respectively.
If the box skews closer to the upper whisker, a prominent outlier is more likely to appear at the low end; likewise, if the box skews closer to the lower whisker, a prominent outlier is more likely at the high end. Box plots can be produced easily using Excel or in Python, using a module such as Plotly.
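The outlier rule a box plot draws can also be computed directly. The sketch below uses the common 1.5 times IQR whisker convention (the default in most plotting tools); the data is invented:

```python
from statistics import quantiles

def boxplot_outliers(values):
    """Flag points beyond the box-plot whiskers using the common
    1.5 * IQR rule (the default convention in most plotting tools)."""
    q1, _, q3 = quantiles(values, n=4)          # lower and upper quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # whisker limits
    return [v for v in values if v < lo or v > hi]

data = [12, 13, 14, 14, 15, 15, 16, 17, 40]    # invented sample
boxplot_outliers(data)                          # [40]
```

Any point the function returns is exactly a point that a box plot of the same data would draw beyond its whiskers.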
3. Statistical methods
Here, we’ll describe some commonly-used statistical methods for finding outliers. A data
analyst may use a statistical method to assist with machine learning modeling, which can be
improved by identifying, understanding, and—in some cases—removing outliers.
Here, we’ll discuss two algorithms commonly used to identify outliers, but there are many
more that may be more or less useful to your analyses.
(i) Identifying outliers with DBSCAN
DBSCAN (Density Based Spatial Clustering of Applications with Noise) is a clustering
method that’s used in machine learning and data analytics applications. Relationships
between trends, features, and populations in a dataset are graphically represented by
DBSCAN, which can also be applied to detect outliers.
DBSCAN is a density-based clustering non-parametric algorithm, focused on finding and
grouping together neighbours that are closely packed together. Outliers are marked as
points that lie alone in low-density regions, far away from other neighbours.
Fig: Illustration of a DBSCAN cluster analysis. Points around A are core points. Points
B and C are not core points, but are density-connected via the cluster of A (and thus
belong to this cluster). Point N is Noise, since it is neither a core point nor reachable
from a core point.
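The noise rule described above can be sketched in plain Python. This is a simplified illustration only, not the full DBSCAN algorithm (which also grows clusters from core points); the eps and min_pts values and the points are invented:

```python
def dbscan_noise(points, eps=1.5, min_pts=3):
    """Simplified sketch of DBSCAN's noise rule: a point is noise if it is
    neither a core point (at least min_pts neighbours within eps, counting
    itself) nor within eps of any core point."""
    def near(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 <= eps ** 2
    core = [p for p in points if sum(near(p, q) for q in points) >= min_pts]
    return [p for p in points
            if p not in core and not any(near(p, c) for c in core)]

dense = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5)]
dbscan_noise(dense + [(9, 9)])   # [(9, 9)]: the isolated point is noise
```

The tightly packed points are all core points (like the points around A in the figure), while the lone distant point has no core point within eps and is therefore marked as noise (like point N).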