
UNIT 3: Efficiency measures

Every business sector recognizes that the volumes of diverse data collected about customers,
competitors and operations are important resources, with significant potential value in the
digital economy for resolving uncertainty and inherent risk.
In a digital economy where the internal and external business environment changes rapidly,
mistakes in strategic decisions can have dire consequences for a company. This creates an
increasing need for business data analytics and visualization technologies that help companies
manage and analyse voluminous data to support and improve management decision-making.

Data envelopment analysis


The purpose of data envelopment analysis (DEA) is to compare the operating performance of
a set of units such as companies, university departments, hospitals, bank branch offices,
production plants, or transportation systems.
In order for the comparison to be meaningful, the units being investigated must be
homogeneous. The performance of a unit can be measured on several dimensions.
For example, to evaluate the activity of a production plant one may use quality indicators,
which estimate the rate of rejects resulting from manufacturing a set of products, and also
flexibility indicators, which measure the ability of a system to react to changes in the
requirements with quick response times and low costs.
Data envelopment analysis relies on a productivity indicator that provides a measure of the
efficiency that characterizes the operating activity of the units being compared. This measure
is based on the results obtained by each unit, which will be referred to as outputs, and on the
resources utilized to achieve these results, which will be generically designated as inputs or
production factors.
If the units represent bank branches, the outputs may consist of the number of active bank
accounts, checks cashed or loans raised; the inputs may be the number of cashiers, managers
or rooms used at each branch. If the units are university departments, it is possible to consider
as outputs the number of active teaching courses and scientific publications produced by the
members of each department; the inputs may include the amount of financing received by each
department, the cost of teaching, the administrative staff and the availability of offices and
laboratories.
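As a rough illustration of this productivity-ratio idea, the Python sketch below computes a simple weighted-output to weighted-input efficiency score for a few hypothetical bank branches. The branch figures and the fixed weights are invented; a full DEA model (for example, the CCR formulation) would instead solve a linear programme that lets each unit choose its most favourable weights.

# Hypothetical branch data: outputs = (active accounts, loans), inputs = (cashiers, managers)
branches = {
    "Branch A": {"outputs": [1200, 300], "inputs": [8, 2]},
    "Branch B": {"outputs": [900, 450],  "inputs": [6, 3]},
    "Branch C": {"outputs": [600, 150],  "inputs": [7, 2]},
}

output_weights = [1.0, 2.0]   # assumed relative importance of each output
input_weights = [1.0, 3.0]    # assumed relative cost of each input

def efficiency(unit):
    weighted_out = sum(w * y for w, y in zip(output_weights, unit["outputs"]))
    weighted_in = sum(w * x for w, x in zip(input_weights, unit["inputs"]))
    return weighted_out / weighted_in

scores = {name: efficiency(u) for name, u in branches.items()}
best = max(scores.values())
for name, s in scores.items():
    # Normalise so the best-performing branch gets a relative efficiency of 1.0
    print(f"{name}: relative efficiency = {s / best:.2f}")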
Pattern Matching
Pattern matching is an algorithmic task that finds pre-determined patterns among sequences of
raw data or processed tokens. In contrast to pattern recognition, this task can only make exact
matches from an existing database and won’t discover new patterns. Pattern matching isn’t a
deep learning technique, but rather a basic tool used in programming, parsing and error-
checking algorithms and data sets.
Pattern matching is one of the primary reasons experienced people tend to make better
decisions than inexperienced people: they have learned more accurate patterns through their
experience. Having a larger mental database to draw from is what gives experts their expertise.
Pattern matching is one of the foundational capabilities of our mind and how it works. The
more accurate patterns you have stored in your memory, the more quickly and accurately you
can respond to whatever life throws at you.
Applications of pattern matching include identifying phrases within a larger text, finding
specific shapes within an image, distinguishing sound patterns in spoken language, searching
for all instances of a particular gene within a larger set of DNA sequences, and so on.
Take the example of searching for a specific word in a book: pattern matching can be used to
find all instances of that word in the book.

How does Pattern Matching Work?


Since pattern matching is essentially just filtering and/or replacing data, practically anything
can be used as a pattern, including complex strings with wildcards inside, not just discrete
variables. The exact process of finding patterns varies depending on what type of data is being
searched. In most cases, either regular expressions or tree patterns (strings) are checked and
then matched with a process-of-elimination approach, such as backtracking, rather than a
complete “brute force” search of all the data.
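As a small illustration, the Python sketch below contrasts a literal string search with a regular-expression search using the standard re module. The sentence and the pattern are invented for this example.

import re

text = "Pattern matching finds patterns; pattern recognition may discover new patterns."

# Literal matching: count exact, case-sensitive occurrences of one word
print(text.count("pattern"))

# Regular-expression matching: a character class and an optional group act like wildcards,
# so 'Pattern', 'pattern' and the plural forms are all matched in one pass.
matches = re.findall(r"[Pp]attern(?:s)?", text)
print(matches)

# Internally the regex engine tries candidate positions and backtracks on partial matches
# instead of comparing the pattern against every substring by brute force.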
Our brains are Pattern Matching machines, constantly trying to find patterns and associating
them with previous patterns.
This happens unconsciously; your brain does it simply by paying attention to the world.
Humans learn patterns primarily through experimentation.
One of the most interesting things about the brain is its ability to automatically learn and
recognize patterns.
Patterns get stored in our memory, waiting to be recalled. This process is optimized for speed
to help you remember things quickly, not accurately. The more accurate patterns you’ve
learned, the more options you have when solving a problem.
Long before you knew what gravity was, you knew that a ball would move towards the ground
if you released it. The first few times you let go of a ball, it would always fall to the ground. It
didn’t take many such experiences for you to learn that any object you released would fall.
Gravity is just a name for something your brain learned by itself.
If a small child wants to be held by its mother, it does not take long to try several different
approaches and learn which response typically produces the desired result: “If I cry, mom is
going to pick me up and hold me.” From then on, the child will rely on that pattern whenever
it desires the result.
You can think of your memory as the database of patterns you’ve learned via past
experience. Patterns get stored in our long-term memory, waiting to be used to determine
responses to new or uncommon situations.

Steps for pattern matching:


1. Defining the pattern - Use regular expressions, keywords, phrases, or other pattern
definitions to define the sequence you want to search for.
2. Selecting the dataset - Identify where you will search for your sequence. It could be a
text file, an image, a DNA sequence, or any other type of data.
3. Applying the pattern matching algorithm - Depending on the pattern you're searching
for, and the characteristics of the dataset, apply the relevant pattern matching algorithm.
4. Analyzing the results - Depending on the application, you need to analyze the results
that are achieved, and process them to extract meaningful insights or information.
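As a minimal sketch, the following Python snippet walks through these four steps for the DNA-sequence application mentioned earlier. The motif and the sequence are made up for illustration.

import re

# Step 1: defining the pattern - a made-up gene motif written as a regular expression
pattern = re.compile(r"ATG[ACGT]{3}TAA")

# Step 2: selecting the dataset - here, a short invented DNA sequence
dna = "CCATGGCATAAGGTATGTTCTAAACCATGAAATAA"

# Step 3: applying the pattern matching algorithm
hits = [(m.start(), m.group()) for m in pattern.finditer(dna)]

# Step 4: analyzing the results - report where the motif occurs and how often
print(f"{len(hits)} match(es) found:")
for position, match in hits:
    print(f"  position {position}: {match}")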

What is pattern matching in ML?


With regard to machine learning, pattern matching is generally used in developing predictive
models that are able to make accurate predictions based on the input data.
Common pattern matching approaches in machine learning are - Supervised learning,
Unsupervised learning, & Reinforcement learning.
1. Supervised learning - The machine learning algorithm is trained on a labelled dataset, where
each data point is associated with a specific label or output.
2. Unsupervised learning - The machine learning algorithm is trained on an unlabelled dataset,
where no output labels are provided.
3. Reinforcement learning - The machine learning algorithm learns through trial and error.
The algorithm makes a prediction, receives feedback, and then uses that feedback to improve
its next prediction.
Using pattern matching in machine learning, we can develop predictive models that can be
used in image recognition, image analysis, natural language processing, fraud detection,
predictive maintenance, etc.
Cluster Analysis
Businesses have a lot of unstructured data. By some estimates, almost 80% of companies’
data is unstructured, and it grows at roughly 55-65% per year. Since this data cannot be
arranged into a tabular form, it is difficult for enterprises, especially small businesses, to use
it. This is why business analytics tools are becoming widely popular. Cluster analysis is a
business analytics tool that helps companies sort unstructured data and use it to their
maximum advantage.

What is Cluster Analysis?


A cluster is an arrangement or grouping of similar items. Therefore, as the name suggests,
cluster analysis is a statistical tool that groups similar objects into separate clusters. Objects
within a cluster have similar properties, whereas objects in two separate clusters are
dissimilar. Cluster analysis serves as a data mining or exploratory data tool in business
analytics. It is used to identify similar patterns or trends and to compare one set of data with
another.
It is an unsupervised learning method: no supervision is provided to the algorithm, and it
works with unlabelled data.
After applying a clustering technique, each cluster or group is given a cluster ID. An ML
system can use this ID to simplify the processing of large and complex datasets.
The clustering technique is commonly used for statistical data analysis.
(Note: Clustering is somewhere similar to the classification algorithm, but the difference is the
type of dataset that we are using. In classification, we work with the labelled data set, whereas
in clustering, we work with the unlabelled dataset.)

The cluster analysis tool is mainly used to segregate customers into different categories, figure
out the target audience and potential leads, and understand customer traits. We can also
understand cluster analysis as an automated segmentation technique that divides data into
different groups based on their characteristics. It comes under the broad category of big data.
Example: Let’s understand the clustering technique with the real-world example of a
shopping mall. When we visit any shopping mall, we can observe that items with similar uses
are grouped together: t-shirts are in one section and trousers in another, and in the produce
section apples, bananas, mangoes, etc., are kept in separate groups so that we can easily find
what we need. The clustering technique works in the same way. Another example of
clustering is grouping documents according to topic.
The clustering technique can be widely used in various tasks. Some most common uses of this
technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general uses, clustering is used by Amazon in its recommendation system
to suggest products based on a user’s past searches. Netflix also uses this technique to
recommend movies and web series to its users based on their watch history.
A typical illustration of a clustering algorithm shows a mixed set of fruits divided into several
groups with similar properties.

Types of Clustering Models/ Methods


Clustering methods are broadly divided into two types:
 Hard clustering: each data point belongs definitely to only one cluster/group.
 Soft clustering: a data point can belong to more than one group. Data points are assigned
on the basis of probability, so one data point can fit into different clusters.
Beyond this distinction, various other clustering approaches exist. The following are the most
popular in business analytics:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitioning clustering is the K-
Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where k defines the number of
pre-defined groups. Cluster centres are created in such a way that the distance between the
data points and the centroid of their own cluster is minimal compared with the distance to
other cluster centroids. The K-means cluster analysis model uses a predefined number of
clusters and converges to a local optimum; the algorithm keeps recalculating the centroids in
each iteration until they no longer change.
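A minimal sketch of partitioning clustering with scikit-learn's KMeans is shown below. The customer figures and the choice of three clusters are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

# Made-up customer data: annual spend (in $1000s) and number of store visits per year
customers = np.array([
    [12, 4], [15, 5], [14, 6],      # low-spend, infrequent visitors
    [60, 30], [65, 28], [58, 35],   # high-spend, frequent visitors
    [35, 15], [38, 18], [33, 14],   # mid-range customers
])

# k is fixed in advance; here we assume three customer segments
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(customers)

print("Cluster assignments:", labels)
print("Cluster centroids:\n", kmeans.cluster_centers_)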
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and
arbitrarily shaped clusters are formed as long as the dense regions can be connected. The
algorithm works by identifying the different dense regions in the dataset and joining areas of
high density into clusters; the dense areas in the data space are separated from each other by
sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has varying
densities and high dimensionality.

Distribution Model-Based Clustering


In the distribution model-based clustering method, the data is divided based on the
probability that each data point belongs to a particular distribution. It typically assumes
normal (Gaussian) distributions to compute the probability that data points belong to a given
cluster, and the data points are assigned to clusters according to these probabilities. However,
this approach is prone to overfitting, which means that some constraints need to be placed on
the model when it is used.
An example of this type is the Expectation-Maximization clustering algorithm, which uses
Gaussian Mixture Models (GMM).
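The sketch below shows this idea using scikit-learn's GaussianMixture, which fits a mixture of Gaussians with the Expectation-Maximization algorithm. The two groups of points are synthetic.

import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic groups of points drawn from different normal distributions
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[4, 4], scale=1.0, size=(50, 2))
data = np.vstack([group_a, group_b])

# Fit a two-component Gaussian mixture via Expectation-Maximization
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

labels = gmm.predict(data)               # hard cluster assignment
probabilities = gmm.predict_proba(data)  # probability of belonging to each cluster
print(labels[:5])
print(probabilities[:5].round(3))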
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no
requirement to pre-specify the number of clusters to be created. The hierarchical clustering
algorithm arranges the clusters in a hierarchy, creating a tree of clusters: the two closest
clusters are merged into one, and the merged clusters are in turn combined with others.
For example, if there are eight clusters, the two clusters with the most similar characteristics
are merged first to form one branch. Merging the most similar clusters in the same way
reduces the eight clusters to four, then to two, and finally the remaining two clusters are
merged into a single head cluster. The resulting tree of clusters has the shape of a pyramid.
Hierarchical clustering is further divided into two different categories: agglomerative and
divisive clustering. Agglomerative clustering, also called AGNES (Agglomerative Nesting),
merges the two most similar clusters at every step until one combined cluster is left. Divisive
hierarchical clustering, also called DIANA (Divisive Analysis), works in the opposite
direction: it starts from a single cluster containing all the data and repeatedly divides a cluster
into two.
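A minimal sketch of agglomerative (AGNES-style) clustering using SciPy is shown below. The eight points are invented to mirror the eight-cluster example above.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Eight made-up 2-D points, mirroring the eight-cluster example above
points = np.array([
    [1, 1], [1.2, 0.9], [5, 5], [5.1, 5.3],
    [9, 1], [9.2, 1.1], [4, 8], [4.2, 8.1],
])

# Agglomerative clustering: repeatedly merge the two closest clusters;
# the merge history forms a tree of clusters (a dendrogram)
tree = linkage(points, method="ward")

# Cut the tree to obtain, say, four clusters; no cluster count was needed up front
labels = fcluster(tree, t=4, criterion="maxclust")
print(labels)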

Fuzzy Clustering
Fuzzy clustering is a type of soft clustering in which a data object may belong to more than
one group or cluster. Each data point has a set of membership coefficients that express its
degree of membership in each cluster. The Fuzzy C-means algorithm is an example of this
type of clustering; it is sometimes also known as the Fuzzy K-means algorithm.
Benefits of Cluster Analysis
Here are the two most significant benefits of cluster analysis!
 Undirected Data Mining Technique:- Cluster analysis is an undirected or exploratory
data mining technique. This means that one does not form a hypothesis or predict the result
of cluster analysis in advance. Instead, it reveals hidden patterns and structures in
unstructured data. In simple terms, while performing cluster analysis, one does not have a
target variable in mind, so it can produce unexpected results.
 Arranged Data for Other Algorithms:- Businesses use various analytics and machine
learning tools. However, some analytics tools can only work if we provide structured
data. We can use cluster analysis tools to arrange data into a meaningful form for
analysis by machine learning software.

Cluster Analysis Applications


Businesses can use cluster analysis for the following purposes:
 Market Segmentation: - Cluster analysis helps businesses in market segmentation by
creating groups of homogeneous customers with similar behaviours. It is especially beneficial
for businesses that have a wide range of products and services and cater to a large audience.
Cluster analysis helps businesses determine customer response to their products and services
by placing customers with the same attributes in one cluster. This allows businesses to
organize their services and offer specific products to different groups.
 Understanding Consumer’s Behaviour: - Cluster analysis is beneficial for companies
to understand consumer behaviour like their preferences, response to products or
services, and purchasing patterns. This helps businesses to decide their marketing and
sales strategies.
 Figuring Out New Market Opportunities: - Businesses can also use cluster analysis
to understand new trends in the market by analyzing consumer behaviour. It can help
them expand their business and explore new products and services. Cluster analysis can
also help businesses figure out their own strengths and weaknesses and those of their
competitors.
 Reduction of Data: - It is difficult for businesses to manage and store tons of data.
Cluster analysis helps businesses segregate valuable information into different clusters,
making it easier for companies to differentiate between valuable and redundant data
that can be discarded.
 In Land Use: The clustering technique is used to identify areas of similar land use in a
GIS database. This can be very useful for determining the purpose for which a particular
piece of land is most suitable.

How to perform Cluster Analysis?


Each cluster analysis model requires a different strategy. However, the following steps can be
used for all cluster analysis techniques.
 Collect Unstructured Data:- You can perform cluster analysis on existing customer
data. However, you will need to collect fresh information if you wish to understand
recent trends or consumer traits. You can conduct a survey to learn about new market
developments.
 Selecting the right variable:- We begin cluster analysis by choosing a variable or a
property based on which we can segregate one data point from another. It helps narrow
down the property based on which clusters will be formed.
 Data scaling:- The next step is to scale the data so that the selected variables are
measured on comparable ranges; the data is then organized according to those variables.
 Distance Calculation:- The last step of cluster analysis is calculating the distance
between variables. Since the data points are arranged into clusters with different factors,
we need to prepare an equation considering all the variables. One of the simplest ways
is to calculate the distance between the centers of two clusters.
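As a minimal illustration of the last step, the sketch below computes the Euclidean distance between the centres of two hypothetical clusters; the centre coordinates are invented.

import numpy as np

# Hypothetical centres of two customer clusters (e.g. average spend, average visits)
centre_a = np.array([14.0, 5.0])
centre_b = np.array([61.0, 31.0])

# Euclidean distance between the two cluster centres: one simple way
# of measuring how far apart the clusters are
distance = np.linalg.norm(centre_a - centre_b)
print(round(distance, 2))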

Outlier Analysis
“Outlier Analysis is a process that involves identifying the anomalous observation in the
dataset.”
Let us first understand what outliers are. Outliers are extreme values that deviate from the
other observations in the dataset; they are data points that lie outside of what is expected.
When analyzing a data set, you will always have some assumptions about how the data was
generated. Data points that are likely to contain some form of error are clear outliers, and
depending on the context, you will want to correct or account for those errors. Outliers can
cause anomalies in the results obtained. This means that they require special attention and, in
some cases, need to be removed in order to analyze the data effectively.
There are two main reasons why giving outliers special attention is a necessary aspect of the
data analytics process:
 Outliers may have a negative effect on the result of an analysis
 Outliers—or their behavior—may be the information that a data analyst requires from
the analysis
Outliers can be caused by incorrect data entry, computational errors, sampling errors, and so
on. For example, a person’s weight being displayed as 1000 kg could be caused by a
program’s default setting for an unrecorded weight. Alternatively, outliers may be a result of
the inherent variability of the data.
In another real-world example, the average height of a giraffe is about 16 feet tall. However,
there have been recent discoveries of two giraffes that stand at 9 feet and 8.5 feet, respectively.
These two giraffes would be considered outliers in comparison to the general giraffe
population.
The process in which the behaviour of the outliers is identified in a dataset is called outlier
analysis.

Many algorithms are used to minimize the effect of outliers or eliminate them. Outlier Analysis
has various applications in fraud detection, such as unusual usage of credit card or
telecommunication services, to identify the spending nature of the customers in marketing.
Let’s see how we will view the analysis problem -
1. In a given data set, define what data could be considered as inconsistent.
2. Find an efficient method to extract the outliers so defined.

Types of Outliers
Outliers are divided into three different types
1. Global or point outliers
2. Collective outliers
3. Contextual or conditional outliers

Global Outliers

Global outliers are also called point outliers.


Global outliers are considered the simplest form of outliers. When a data point deviates from
all the rest of the data points in a given data set, it is known as a global outlier. In most cases,
outlier detection procedures are aimed at determining global outliers. In the accompanying
figure, the green data point is the global outlier.

Collective Outliers

In a given set of data, when a group of data points deviates from the rest of the data set, it is
called a collective outlier. Here, the individual data objects may not be outliers on their own,
but when you consider them as a whole, they may behave as outliers. To identify these types
of outliers, you need background information about the relationship between the behaviour of
the different data objects. In the accompanying figure, the green data points as a whole
represent the collective outlier.
Contextual Outliers

As the name suggests, “contextual” means that the outlier appears within a particular
context; for example, in speech recognition, a single burst of background noise. Contextual
outliers are also known as conditional outliers. These outliers occur when a data object
deviates from the other data points because of a specific condition in a given data set. As we
know, data objects have two types of attributes: contextual attributes and behavioural
attributes. Contextual outlier analysis enables users to examine outliers in different contexts
and conditions, which can be useful in various applications. For example, a temperature
reading of 45 degrees Celsius may behave as an outlier in the rainy season, but it will behave
like a normal data point in the context of the summer season. In the accompanying diagram, a
green dot representing a low-temperature value in June is a contextual outlier, since the same
value in December would not be an outlier.

Besides the above distinction, outliers can also be categorized as:


 A univariate outlier is an extreme value that relates to just one variable. For
example, Sultan is currently the tallest man alive, with a height of 8ft, 2.8 inches
(251cm). This case would be considered a univariate outlier as it’s an extreme case of
just one factor: height.
 A multivariate outlier is a combination of unusual or extreme values for at least two
variables. For example, if you’re looking at both the height and weight of a group of
adults, you might observe that one person in your dataset is 5ft 9 inches tall—a
measurement that would fall within the normal range for this particular variable. You
may also observe that this person weighs 110lbs. Again, this observation alone falls
within the normal range for the variable of interest: weight. However, when you
consider these two observations in conjunction, you have an adult who is 5ft 9 inches
and weighs 110lbs—a surprising combination. That’s a multivariate outlier.

How do outliers end up in datasets?


Now that we’ve learned about what outliers are and how to identify them, it’s worthwhile
asking: how do outliers end up in datasets in the first place?
Here are some of the more common causes of outliers in datasets:

 Data Entry Errors:- Human errors such as errors caused during data collection,
recording, or entry can cause outliers in data. For example: Annual income of a
customer is $100,000. Accidentally, the data entry operator puts an additional zero in
the figure. Now the income becomes $1,000,000, which is 10 times higher. Evidently,
this will be an outlier value when compared with the rest of the population.
 Measurement Error: It is the most common source of outliers. This is caused when
the measurement instrument used turns out to be faulty. For example: There are 10
weighing machines. 9 of them are correct, 1 is faulty. The weight measured by people on
the faulty machine will be higher or lower than that of the rest of the people in the group.
The weights measured on the faulty machine can lead to outliers.
 Experimental Error: Another cause of outliers is experimental error. For example: In
a 100m sprint of 7 runners, one runner missed the ‘Go’ call, which caused him to start
late. This made the runner’s run time longer than that of the other runners, so his total
run time can be an outlier.
 Intentional Outlier: This is commonly found in self-reported measures that involve
sensitive data. For example: Teens typically under-report the amount of alcohol they
consume, and only a fraction of them report the actual value. Here the actual values might
look like outliers because the rest of the teens are under-reporting their consumption.
 Data Processing Error: Whenever we perform data mining, we extract data from
multiple sources. It is possible that some manipulation or extraction errors may lead to
outliers in the dataset.
 Sampling error: For instance, we have to measure the height of athletes. By mistake,
we include a few basketball players in the sample. This inclusion is likely to cause
outliers in the dataset.
 Natural Outlier: When an outlier is not artificial (due to error), it is a natural outlier.
For instance: In a renowned insurance company, it may be noticed that the performance
of the top 50 financial advisors is far higher than that of the rest of the population.
Surprisingly, this is not due to any error. Hence, whenever we perform any data mining
activity with advisors, we treat this segment separately.

Impact of Outliers on a dataset


Outliers can drastically change the results of the data analysis and statistical modeling. There
are numerous unfavourable impacts of outliers in the data set:
 It increases the error variance and reduces the power of statistical tests
 If the outliers are non-randomly distributed, they can decrease normality
 They can bias or influence estimates that may be of substantive interest
 They can also violate the basic assumptions of regression, ANOVA and other statistical
models.
To understand the impact more deeply, let’s take an example and check what happens to a
data set with and without an outlier.

Example:
A data set with an outlier can have a significantly different mean and standard deviation.
Without the outlier, the average might be 5.45; with a single extreme value included, the
average soars to 30. This would change the estimate completely.
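The short sketch below illustrates this effect numerically. The values are hypothetical and were chosen so that the averages roughly match the figures quoted above.

import numpy as np

# Hypothetical sample whose mean is 5.45 before any outlier is added
data = np.array([2, 3, 4, 5, 5, 6, 6, 7, 8, 8.5])
print(data.mean(), data.std())                   # mean = 5.45

# Adding a single extreme value drags the mean up dramatically
with_outlier = np.append(data, 275.5)
print(with_outlier.mean(), with_outlier.std())   # mean = 30.0, much larger spread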

Outlier Analysis Techniques


There are a variety of ways to find outliers. All of these methods use different approaches to
find values that are unusual compared with the rest of the dataset. Here we’ll look at just a
few of these techniques:
1. Sorting
Sorting is the easiest technique for outlier analysis. Load your dataset into any kind of data
manipulation tool, such as a spreadsheet, and sort the values by their magnitude. Then, look at
the range of values of various data points. If any data points are significantly higher or lower
than others in the dataset, they may be treated as outliers.
Let’s look at a concrete example of sorting. Suppose the CEO of a company has a salary that
is many times that of the other employees. Before entering the data analysis phase, the analyst
should check whether any outliers are present in the dataset. By sorting the salaries from
highest to lowest, they will be able to identify unusually high observations; compared with
the average salary, the CEO’s salary stands out as an outlier.
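A minimal sketch of this approach with pandas is shown below; the names and salary figures are made up.

import pandas as pd

# Made-up salary data; the CEO's salary is far above the rest
salaries = pd.DataFrame({
    "employee": ["Asha", "Ben", "Chen", "Dev", "CEO"],
    "salary": [42_000, 45_000, 47_000, 50_000, 400_000],
})

# Sort by magnitude and inspect the extremes at the top of the list
print(salaries.sort_values("salary", ascending=False))
print("Mean salary:", salaries["salary"].mean())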

2. Visualizations
In data analytics, analysts create data visualizations to present data graphically in a meaningful
and impactful way, in order to present their findings to relevant stakeholders. These
visualizations can easily show trends, patterns, and outliers from a large set of data in the form
of maps, graphs and charts.
(i) Identifying outliers with box plots
Visualizing data as a box plot makes it very easy to spot outliers. A box plot shows the
“box”, which indicates the interquartile range (from the lower quartile to the upper quartile,
with the line in the middle indicating the median value), and any outliers are shown beyond
the “whiskers” of the plot, which mark the minimum and maximum of the remaining values
on each side.
If the box skews closer to the maximum whisker, the prominent outlier would be the
minimum value. Likewise, if the box skews closer to the minimum-valued whisker, the
prominent outlier would then be the maximum value. Box plots can be produced easily
using Excel or in Python, using a module such as Plotly.

Fig: Elements of a boxplot, showing outliers (to the left)
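A minimal sketch of producing such a box plot in Python is shown below; it uses matplotlib rather than the Plotly option mentioned above, and the measurement values are invented.

import matplotlib.pyplot as plt

# Made-up measurements with one extreme value
values = [54, 55, 56, 57, 57, 58, 59, 60, 61, 95]

plt.boxplot(values)   # the point at 95 is drawn beyond the whisker as a lone marker
plt.ylabel("measurement")
plt.title("Box plot: the isolated point is a potential outlier")
plt.show()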

(ii) Identifying outliers with scatter plots


As the name suggests, scatter plots show the values of a dataset “scattered” on an axis
for two variables. The visualization of the scatter will show outliers easily—these will be
the data points shown furthest away from the regression line (a single line that best fits
the data). As with box plots, these types of visualizations are also easily produced
using Excel or in Python.

3. Statistical methods
Here, we’ll describe some commonly-used statistical methods for finding outliers. A data
analyst may use a statistical method to assist with machine learning modeling, which can be
improved by identifying, understanding, and—in some cases—removing outliers.
Here, we’ll discuss two algorithms commonly used to identify outliers, but there are many
more that may be more or less useful to your analyses.
(i) Identifying outliers with DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering
method used in machine learning and data analytics applications. It can reveal relationships
between trends, features, and populations in a dataset, and it can also be applied to detect
outliers.
DBSCAN is a density-based, non-parametric clustering algorithm, focused on finding and
grouping together neighbours that are closely packed. Outliers are marked as points that lie
alone in low-density regions, far away from other neighbours.

Fig: Illustration of a DBSCAN cluster analysis. Points around A are core points. Points
B and C are not core points, but are density-connected via the cluster of A (and thus
belong to this cluster). Point N is Noise, since it is neither a core point nor reachable
from a core point.
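A minimal sketch of using scikit-learn's DBSCAN to flag outliers is shown below. The two dense groups, the lone point, and the eps and min_samples values are all invented for illustration.

import numpy as np
from sklearn.cluster import DBSCAN

# Two made-up dense groups of points plus one isolated point
rng = np.random.default_rng(1)
cluster_1 = rng.normal(loc=[0, 0], scale=0.3, size=(30, 2))
cluster_2 = rng.normal(loc=[5, 5], scale=0.3, size=(30, 2))
lone_point = np.array([[10.0, -3.0]])
data = np.vstack([cluster_1, cluster_2, lone_point])

# eps and min_samples control what counts as a "dense" neighbourhood
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(data)

# DBSCAN labels noise points (outlier candidates) as -1
outliers = data[labels == -1]
print("Points flagged as noise/outliers:\n", outliers)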

(ii) Identifying outliers by finding the Z-Score


Z-score—sometimes called the standard score—is defined as “the number of standard
deviations by which the value of a raw score (i.e., an observed value or data point) is
above or below the mean value of what is being observed or measured.”
Computing a z-score describes any data point by placing it in relation to the mean and
standard deviation of the whole group of data points. Positive standard scores correspond to
raw scores above the mean, whereas negative standard scores correspond to raw scores below
the mean. After standardization, the scores have a mean of 0 and a standard deviation of 1.
Outliers are found from z-score calculations by observing the data points that are too far
from 0 (mean). In many cases, the “too far” threshold will be +3 to -3, where anything
above +3 or below -3 respectively will be considered outliers.
Z-scores are often used in stock market data. Z-scores can be calculated
using Excel, R, or the Quick Z-Score Calculator.
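In Python, a z-score check can be sketched as follows; the price series and the spike at 180 are made up, and the |z| > 3 cut-off is the common rule of thumb mentioned above.

import numpy as np
from scipy import stats

# Made-up daily closing prices: 30 ordinary values plus one suspicious spike
rng = np.random.default_rng(0)
prices = np.append(rng.normal(loc=100, scale=2, size=30), 180.0)

z_scores = stats.zscore(prices)          # (x - mean) / standard deviation
outliers = prices[np.abs(z_scores) > 3]  # flag anything more than 3 standard deviations away

print("Potential outliers:", outliers.round(1))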

4. Isolation Forest algorithm


Isolation Forest, otherwise known as iForest, is another anomaly detection algorithm. The
founders of the algorithm used two quantitative features of anomalous data points, namely
that they are “few” in quantity and have attribute-values that are “different” from those of
normal instances, to isolate outliers from normal data points in a dataset.
To show these outliers, the Isolation Forest will build “Isolation Trees” from the set of data,
and outliers will be shown as the points that have shorter average path lengths than the rest of
the branches.
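A minimal sketch of using scikit-learn's IsolationForest is shown below. The data, the assumed contamination rate, and the random seed are all hypothetical.

import numpy as np
from sklearn.ensemble import IsolationForest

# Made-up two-dimensional data: many ordinary points and a few distant anomalies
rng = np.random.default_rng(42)
normal_points = rng.normal(loc=0, scale=1, size=(200, 2))
anomalies = rng.uniform(low=6, high=9, size=(5, 2))
data = np.vstack([normal_points, anomalies])

# contamination is the assumed share of anomalous points in the data
forest = IsolationForest(contamination=0.03, random_state=42)
labels = forest.fit_predict(data)   # -1 = anomaly (short average path), 1 = normal

print("Number of points flagged as anomalies:", (labels == -1).sum())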

When should you remove outliers?


It may seem natural to want to remove outliers as part of the data cleaning process. But in
reality, sometimes it’s best—even absolutely necessary—to keep outliers in your dataset.
Removing outliers solely due to their place in the extremes of your dataset may create
inconsistencies in your results, which would be counterproductive to your goals as a data
analyst. These inconsistencies may lead to reduced statistical significance in an analysis.
But what do we mean by statistical significance? Let’s take a look.

A quick introduction to hypothesis testing and statistical significance (p-value)


When you collect and analyze data, you’re looking to draw conclusions about a wider
population based on your sample of data. For example, if you’re interested in the eating habits
of the New York City population, you’ll gather data on a sample of that population (say, 1000
people). When you analyze this data, you want to determine if your findings can be applied to
the wider population, or if they just occurred within this particular sample by chance (or due to
another influencing factor). You do this by calculating the statistical significance of your
findings.
This is part of hypothesis testing.
With hypothesis testing, you start with two hypotheses: the null hypothesis and the alternative
hypothesis. Based on your findings and the statistical significance (or insignificance) of these
findings, you’ll accept one of your hypotheses and reject the other. The null hypothesis states
that there is no statistical significance between the two variables you’re looking at. The
alternative hypothesis states the opposite.
Let’s explain with an example. Imagine you’re looking at the relationship between people’s
self-esteem (measured as a score out of 100) and their coffee consumption (measured in terms
of cups per day). These are your two variables: self-esteem and coffee consumption. When
analyzing your data, you find that there does indeed appear to be a correlation (or a relationship)
between self-esteem and coffee consumption. For instance, higher coffee consumption
correlates with a higher self-esteem score. Is this a fluke finding? Or do people who drink more
coffee really tend to have higher self-esteem?
To evaluate the strength of your findings, you’ll need to determine if the relationship between
the two variables is statistically significant. There are several different tests used to calculate
statistical significance, depending on the type of data you have. You run the appropriate
significance test in order to find the p-value.
The p-value is a measure of probability, and it tells you how likely it is that your findings
occurred by chance. A p-value of less than 0.05 indicates strong evidence against the null
hypothesis; in other words, there is less than a 5% probability that the results occurred by
chance. In this case, your findings can be deemed statistically significant. If, on the other
hand, your statistical significance test finds a p-value greater than 0.05, your findings are
deemed statistically insignificant. They may have just occurred by chance.
Removing outliers without good reason can skew your results in a way that impacts the p-
value, thus making your findings unreliable. So: it’s essential to think carefully before simply
removing outliers from your dataset!
While evaluating potential outliers to remove from your dataset, consider the following:
 Is the outlier a measurement error or data entry error? If so, correct it manually where
possible. If it’s unable to be corrected, it should be considered incorrect, and thus
legitimately removed from the dataset.
 Is the outlier a natural part of the data population being analyzed? If not, you should
remove it.
 Can you explain your reasoning for removing an outlier? If not, you should not remove
it. When removing outliers, you should provide documentation of the excluded data
points, giving reasoning for your choices.
If there is disagreement within your group about the removal of an outlier (or a group of
outliers), it may be useful to perform two analyses: the first with the dataset intact, and the
second with the outliers removed. Compare the results and see which one has provided the
most useful and realistic insights.
