Detect and Remove the Outliers using Python
Last Updated: 26 Jul, 2025
Outliers are data points that deviate significantly from other data points in a dataset. They can arise from a variety of factors such as measurement errors, rare events or natural variations in the data. If left unchecked, they can distort data analysis, skew statistical results and impact machine learning model performance. In this article, we'll see how to detect and handle outliers in Python using various techniques to improve the quality and reliability of our data.
Common Causes of Outliers
Understanding the causes of outliers helps in finding the best approach to handle them. Some common causes include:
- Measurement errors: Errors during data collection or from instruments can result in extreme values that don't reflect the underlying data distribution.
- Sampling errors: Outliers can arise if the sample we collected isn’t representative of the population we're studying.
- Natural variability: Certain data points naturally fall outside the expected range especially in datasets with inherently high variability.
- Data entry errors: Mistakes made during manual data entry such as incorrect values or typos can create outliers.
- Experimental errors: Outliers can occur due to equipment malfunctions, environmental factors or unaccounted variables in experiments.
- Sampling from multiple populations: Combining data from distinct populations with different characteristics can create outliers if researchers don't properly segment the datasets.
- Intentional outliers: Sometimes outliers are deliberately introduced into datasets for testing purposes to evaluate the robustness of models or algorithms.
Need for Outlier Removal
Outliers can create significant issues in data analysis and machine learning which makes their removal important:
- Skewed Statistical Measures: Outliers can distort the mean, standard deviation and correlation values. For example, a single extreme value can make the mean unrepresentative of the actual data, leading to incorrect conclusions.
- Reduced Model Accuracy: Outliers can influence machine learning models, especially those sensitive to extreme values like linear regression. They may cause the model to focus too much on these rare events, reducing its ability to generalize to new, unseen data.
- Misleading Visualizations: Outliers can stretch the scale of charts and graphs, making it difficult to interpret the main data trends. For example, when visualizing a dataset with a few extreme values, meaningful patterns in the majority of the data can be hard to spot.
By removing or handling outliers, we prevent these issues and ensure more accurate analysis and predictions.
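To make the first point concrete, here is a small self-contained sketch (with made-up numbers) showing how a single extreme value drags the mean and inflates the standard deviation:
Python
import numpy as np

values = np.array([10, 12, 11, 13, 12])
with_outlier = np.append(values, 100)  # one deliberate extreme value

print("Mean without outlier:", values.mean())              # 11.6
print("Mean with outlier:", with_outlier.mean())           # ~26.33
print("Std without outlier:", round(values.std(), 2))      # ~1.02
print("Std with outlier:", round(with_outlier.std(), 2))   # ~32.96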
Methods for Detecting and Removing Outliers
There are several ways to detect and handle outliers in Python. We can use visualization techniques or statistical methods depending on the nature of our data. Each method serves a different purpose and is suited to specific types of data. Here we will use the Pandas, Seaborn and Matplotlib libraries on the Diabetes dataset, which is preloaded in the Scikit-learn library.
1. Visualizing and Removing Outliers Using Box Plots
A boxplot is an effective way to visualize the distribution of data using quartiles; points outside the "whiskers" of the plot are considered outliers. Boxplots provide a quick way to see where the data is concentrated and where potential outliers lie.
Python
from sklearn.datasets import load_diabetes
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the diabetes dataset into a DataFrame
diabetes = load_diabetes()
df_diabetics = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Boxplot of 'bmi'; points beyond the whiskers are potential outliers
sns.boxplot(df_diabetics['bmi'])
plt.title('Boxplot of BMI')
plt.show()
Output:
In the boxplot, outliers appear as points outside the whiskers. These values are much higher or lower than the rest of the data. For example, bmi values above 0.12 could be identified as outliers.
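Rather than reading the cutoff off the plot, we can also compute the whisker fences directly. A minimal sketch, reusing df_diabetics from above (Q1 - 1.5*IQR and Q3 + 1.5*IQR is the default whisker rule in Matplotlib and Seaborn):
Python
# Whisker fences for 'bmi'
q1 = df_diabetics['bmi'].quantile(0.25)
q3 = df_diabetics['bmi'].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the points drawn beyond the whiskers
mask = (df_diabetics['bmi'] < lower_fence) | (df_diabetics['bmi'] > upper_fence)
print("Fences:", lower_fence, upper_fence)
print(df_diabetics.loc[mask, 'bmi'])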
To remove outliers, we can define a threshold value and filter the data.
Python
def removal_box_plot(df, column, threshold):
    # Keep only rows at or below the threshold, then re-plot
    removed_outliers = df[df[column] <= threshold]
    sns.boxplot(removed_outliers[column])
    plt.title(f'Box Plot without Outliers of {column}')
    plt.show()
    return removed_outliers

threshold_value = 0.12
no_outliers = removal_box_plot(df_diabetics, 'bmi', threshold_value)
Output:
2. Visualizing and Removing Outliers Using Scatter Plots
Scatter plots help visualize relationships between two variables. They are used when we have paired numerical data, particularly when the dependent variable takes multiple values for each reading of the independent variable. Outliers appear as points far from the main cluster of data.
Python
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df_diabetics['bmi'], df_diabetics['bp'])
ax.set_xlabel('BMI')
ax.set_ylabel('Blood Pressure')
plt.title('Scatter Plot of BMI vs Blood Pressure')
plt.show()
Output:
Looking at the graph, we can see that most of the data points lie in the bottom-left corner, while a few sit in the opposite, top-right corner. Those points in the top-right corner can be regarded as outliers.
Here’s how we can remove the outliers identified visually from the scatter plot.
- np.where(): Used to find the positions (indices) where the condition is true in the DataFrame.
- (df_diabetics['bmi'] > 0.12) & (df_diabetics['bp'] < 0.8): Checks for outliers where 'bmi' is greater than 0.12 and 'bp' is less than 0.8.
Python
import numpy as np
import matplotlib.pyplot as plt

# Row positions matching the visually identified outlier condition
outlier_indices = np.where((df_diabetics['bmi'] > 0.12) & (df_diabetics['bp'] < 0.8))
no_outliers = df_diabetics.drop(outlier_indices[0])

fig, ax_no_outliers = plt.subplots(figsize=(6, 4))
ax_no_outliers.scatter(no_outliers['bmi'], no_outliers['bp'])
ax_no_outliers.set_xlabel('BMI (body mass index)')
ax_no_outliers.set_ylabel('Blood Pressure')
plt.show()
Output:
This removes rows where BMI > 0.12 and BP < 0.8, conditions derived from visual inspection.
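An equivalent, arguably more idiomatic pandas version keeps rows with a boolean mask instead of dropping by index. A small sketch using the same visually chosen thresholds:
Python
# Same filter via boolean masking: keep rows that do NOT match the condition
mask = (df_diabetics['bmi'] > 0.12) & (df_diabetics['bp'] < 0.8)
no_outliers = df_diabetics[~mask]
print("Rows removed:", mask.sum())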
3. Z-Score Method for Outlier Detection
Z-score is also called a standard score. This score measures how far a data point is from the mean, in terms of standard deviations. If the Z-score exceeds a given threshold (commonly 3), the data point is considered an outlier.
Z-score = \frac{x - \mu}{\sigma}
Where:
- x = data point
- μ = mean
- σ = standard deviation
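Before using SciPy, it may help to see the formula computed directly with NumPy. A minimal sketch that should match stats.zscore used below (zscore defaults to ddof=0, the population standard deviation):
Python
import numpy as np

# Manual z-score: (x - mean) / std, using the population std (ddof=0)
age = df_diabetics['age']
z_manual = np.abs((age - age.mean()) / age.std(ddof=0))
print(z_manual.head())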
Here we are calculating the Z-scores for the 'age' column in the DataFrame df_diabetics using the zscore function from the SciPy stats module. The resulting array z contains the absolute Z-scores for each data point in the 'age' column, showing how many standard deviations each value is from the mean.
Python
from scipy import stats
import numpy as np

# Absolute z-scores for the 'age' column
z = np.abs(stats.zscore(df_diabetics['age']))
print(z)
Output:
Now a threshold value is chosen to define an outlier, generally 3.0, since 99.7% of the data points lie within \pm3 standard deviations under a Gaussian distribution.
Let's remove rows where the Z-score is greater than 2.
- np.where(): Used to find the positions (indices) in the Z-score array where the condition is true.
- z > threshold_z: Checks for outliers in the 'age' column where the absolute Z-score exceeds the defined threshold (typically 2 or 3).
- threshold_z = 2: A cutoff value used to identify outliers; data points with a Z-score greater than 2 are considered outliers.
Python
import numpy as np

threshold_z = 2

# Positions where the absolute z-score exceeds the threshold
outlier_indices = np.where(z > threshold_z)[0]
no_outliers = df_diabetics.drop(outlier_indices)

print("Original DataFrame Shape:", df_diabetics.shape)
print("DataFrame Shape after Removing Outliers:", no_outliers.shape)
Output:
Original DataFrame Shape: (442, 10)
DataFrame Shape after Removing Outliers: (426, 10)
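The same idea extends to every column at once: compute Z-scores for the whole DataFrame and keep only rows whose values all stay within the threshold. This is a sketch of ours, not part of the original walkthrough:
Python
from scipy import stats
import numpy as np

# Absolute z-scores for all feature columns at once (column-wise)
z_all = np.abs(stats.zscore(df_diabetics))

# Keep rows where every column's z-score is below the threshold
no_outliers_all = df_diabetics[(z_all < 3).all(axis=1)]
print("Shape after filtering on all columns:", no_outliers_all.shape)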
4. Interquartile Range (IQR) Method
The IQR (Interquartile Range) method is a widely used and reliable technique for detecting outliers. It is robust to skewed data, identifies extreme values based on quartiles and is one of the most trusted approaches in research. The IQR is calculated as the difference between the third quartile (Q3) and the first quartile (Q1):
IQR = Q3 - Q1
Syntax : numpy.percentile(arr, n, axis=None, out=None)
Parameters:
- arr: Input array.
- n: Percentile value.
Here we are calculating the interquartile range (IQR) for the 'bmi' column in the DataFrame df_diabetics. It first finds the first quartile (Q1) and third quartile (Q3) using the midpoint method, then calculates the IQR as the difference between Q3 and Q1, providing a measure of the spread of the middle 50% of the data in the 'bmi' column.
Python
import numpy as np

# First and third quartiles of 'bmi' using the midpoint interpolation method
Q1 = np.percentile(df_diabetics['bmi'], 25, method='midpoint')
Q3 = np.percentile(df_diabetics['bmi'], 75, method='midpoint')
IQR = Q3 - Q1
print(IQR)
Output:
0.06520763046978838
To flag outliers, upper and lower bounds are defined above and below the dataset's normal range, using a 1.5*IQR margin:
- upper = Q3 + 1.5*IQR
- lower = Q1 - 1.5*IQR
The factor 1.5 is chosen because, for normally distributed data, IQR ≈ 1.35σ, so Q3 + 1.5*IQR sits at roughly 2.7 standard deviations from the mean, covering about 99.3% of a Gaussian distribution.
Python
# Count how many points fall above the upper fence
upper = Q3 + 1.5 * IQR
upper_array = np.array(df_diabetics['bmi'] >= upper)
print("Upper Bound:", upper)
print(upper_array.sum())

# Count how many points fall below the lower fence
lower = Q1 - 1.5 * IQR
lower_array = np.array(df_diabetics['bmi'] <= lower)
print("Lower Bound:", lower)
print(lower_array.sum())
Output:
Now let's detect and remove outliers using the interquartile range (IQR).
Here we use the interquartile range (IQR) method to detect and remove outliers in the 'bmi' column of the diabetes dataset. It calculates the upper and lower limits based on the IQR, identifies outlier indices using Boolean arrays and then removes the corresponding rows from the DataFrame, resulting in a new DataFrame with outliers excluded. The before and after shapes of the DataFrame are printed for comparison.
Python
from sklearn.datasets import load_diabetes
import numpy as np
import pandas as pd

# Reload the dataset fresh
diabetes = load_diabetes()
df_diabetes = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
print("Old Shape: ", df_diabetes.shape)

# IQR fences for 'bmi'
Q1 = df_diabetes['bmi'].quantile(0.25)
Q3 = df_diabetes['bmi'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Row positions outside the fences
upper_array = np.where(df_diabetes['bmi'] >= upper)[0]
lower_array = np.where(df_diabetes['bmi'] <= lower)[0]

# Drop the outlier rows
df_diabetes.drop(index=upper_array, inplace=True)
df_diabetes.drop(index=lower_array, inplace=True)
print("New Shape: ", df_diabetes.shape)
Output:
Old Shape: (442, 10)
New Shape: (439, 10)
With outlier detection and removal, we ensure that our data is clean, reliable and ready to provide valuable insights, setting the foundation for robust analysis and accurate models.
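Finally, the IQR logic above can be packaged into a small reusable helper for any numeric column. A minimal sketch (remove_outliers_iqr is our own name, not a library function):
Python
import pandas as pd

def remove_outliers_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return a copy of df keeping rows within [Q1 - k*IQR, Q3 + k*IQR] on `column`."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

# Example usage on the diabetes data
cleaned = remove_outliers_iqr(df_diabetics, 'bmi')
print("Before:", df_diabetics.shape, "After:", cleaned.shape)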