Search Queries Anomaly Detection Using Python
Aman Kharwal
Search Queries Anomaly Detection means identifying queries that are outliers according to their performance metrics. Spotting these outliers is valuable for businesses because they can signal potential issues or opportunities, such as unexpectedly high or low CTRs (click-through rates). If you want to learn how to detect anomalies in search queries, this article is for you. In this article, I'll take you through the task of Search Queries Anomaly Detection with Machine Learning using Python. Below is the process we can follow for this task:
1. Gather historical search query data from the source, such as a search engine or a
website’s search functionality.
2. Conduct an initial analysis to understand the distribution of search queries, their
frequency, and any noticeable patterns or trends.
3. Create relevant features or attributes from the search query data that can aid in anomaly
detection.
4. Choose an appropriate anomaly detection algorithm. Common methods include statistical
approaches like Z-score analysis and machine learning algorithms like Isolation Forests
or One-Class SVM.
5. Train the selected model on the prepared data.
6. Apply the trained model to the search query data to identify anomalies or outliers.
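Before training a full model, the Z-score approach mentioned in step 4 can serve as a quick statistical baseline. Here is a minimal sketch on made-up CTR values (the query names and numbers are illustrative, not from the dataset used below): a query is flagged when its CTR sits more than two standard deviations from the mean.

```python
import pandas as pd

# Toy data standing in for real search-query metrics (values are illustrative)
df = pd.DataFrame({
    "Top queries": [f"q{i}" for i in range(1, 11)],
    "CTR": [0.03, 0.04, 0.05, 0.04, 0.03, 0.05, 0.04, 0.03, 0.04, 0.45],
})

# Z-score: how many standard deviations each query's CTR sits from the mean
df["ctr_z"] = (df["CTR"] - df["CTR"].mean()) / df["CTR"].std()

# Flag queries more than 2 standard deviations from the mean
outliers = df[df["ctr_z"].abs() > 2]
print(outliers["Top queries"].tolist())  # ['q10']
```

Z-scores are simple and interpretable, but they assume roughly normal data and a single feature at a time, which is why the article moves to Isolation Forest later on.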
So, the process starts with collecting a dataset of search queries. I found an ideal dataset for this task. You can download the dataset from here. Now, let's import the necessary Python libraries and load the dataset:
```python
# Importing the necessary Python libraries
import re
from collections import Counter

import pandas as pd
import plotly.express as px

queries_df = pd.read_csv("Queries.csv")
print(queries_df.head())
```
```
                                 Top queries  Clicks  Impressions     CTR  Position
0                number guessing game python    5223        14578  35.83%      1.61
1                        thecleverprogrammer    2809         3456  81.28%      1.02
2           python projects with source code    2077        73380   2.83%      5.94
3  classification report in machine learning    2012         4959  40.57%      1.28
4                      the clever programmer    1931         2528  76.38%      1.09
```
Now, let's have a look at the column information of the data:

```python
print(queries_df.info())
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Top queries  1000 non-null   object
 1   Clicks       1000 non-null   int64
 2   Impressions  1000 non-null   int64
 3   CTR          1000 non-null   object
 4   Position     1000 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 39.2+ KB
None
```
Now, let’s convert the CTR column from a percentage string to a float:
```python
# Cleaning CTR column
queries_df['CTR'] = queries_df['CTR'].str.rstrip('%').astype('float') / 100
```
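As a quick sanity check on this conversion, here is what the same `str.rstrip`/`astype` chain does to a few made-up CTR strings in the dataset's "NN.NN%" format:

```python
import pandas as pd

# Illustrative CTR strings in the same "NN.NN%" format as the dataset
ctr = pd.Series(["35.83%", "81.28%", "2.83%"])

# Same transformation as above: drop the '%' and rescale to the 0-1 range
ctr_clean = ctr.str.rstrip("%").astype("float") / 100
print([round(v, 4) for v in ctr_clean])  # [0.3583, 0.8128, 0.0283]
```

With the CTR column as a float between 0 and 1, it can participate in correlations and model features alongside the other numeric columns.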
Now, let's look at the most common words in the search queries:

```python
# Function to clean and split the queries into words
def clean_and_split(query):
    words = re.findall(r'\b[a-zA-Z]+\b', query.lower())
    return words

# Split each query into words and count the frequency of each word
word_counts = Counter()
for query in queries_df['Top queries']:
    word_counts.update(clean_and_split(query))

word_freq_df = pd.DataFrame(word_counts.most_common(20),
                            columns=['Word', 'Frequency'])

# Plotting the word frequencies
fig = px.bar(word_freq_df, x='Word', y='Frequency',
             title='Top 20 Most Common Words in Search Queries')
fig.show()
```
Now, let’s have a look at the top queries by clicks and impressions:
```python
# Top queries by Clicks and Impressions
top_queries_clicks_vis = queries_df.nlargest(10, 'Clicks')[['Top queries', 'Clicks']]
top_queries_impressions_vis = queries_df.nlargest(10, 'Impressions')[['Top queries', 'Impressions']]

# Plotting
fig_clicks = px.bar(top_queries_clicks_vis, x='Top queries', y='Clicks',
                    title='Top Queries by Clicks')
fig_impressions = px.bar(top_queries_impressions_vis, x='Top queries',
                         y='Impressions', title='Top Queries by Impressions')
fig_clicks.show()
fig_impressions.show()
```
Now, let’s analyze the queries with the highest and lowest CTRs:
```python
# Queries with highest and lowest CTR
top_ctr_vis = queries_df.nlargest(10, 'CTR')[['Top queries', 'CTR']]
bottom_ctr_vis = queries_df.nsmallest(10, 'CTR')[['Top queries', 'CTR']]

# Plotting
fig_top_ctr = px.bar(top_ctr_vis, x='Top queries', y='CTR',
                     title='Top Queries by CTR')
fig_bottom_ctr = px.bar(bottom_ctr_vis, x='Top queries', y='CTR',
                        title='Bottom Queries by CTR')
fig_top_ctr.show()
fig_bottom_ctr.show()
```
Now, let’s have a look at the correlation between different metrics:
```python
# Correlation matrix visualization
correlation_matrix = queries_df[['Clicks', 'Impressions', 'CTR', 'Position']].corr()
fig_corr = px.imshow(correlation_matrix, text_auto=True, title='Correlation Matrix')
fig_corr.show()
```
Here's what the correlation matrix tells us (keep in mind that in Search Console, a lower Position value means a better ranking):

1. Clicks and Impressions are positively correlated: queries with more Impressions tend to receive more Clicks.
2. Clicks and CTR have a weak positive correlation: queries that earn more Clicks also tend to have a slightly higher Click-Through Rate.
3. Clicks and Position are weakly negatively correlated: queries ranking closer to the top (lower Position values) tend to receive more Clicks.
4. Impressions and CTR are negatively correlated: queries with many Impressions tend to have a lower Click-Through Rate.
5. Impressions and Position are positively correlated: queries with larger Position values (lower rankings) tend to accumulate more Impressions.
6. CTR and Position have a strong negative correlation: the further a query ranks from the top, the lower its Click-Through Rate.
Now, let’s detect anomalies in search queries. You can use various techniques for anomaly
detection. A simple and effective method is the Isolation Forest algorithm, which works well
with different data distributions and is efficient with large datasets:
```python
from sklearn.ensemble import IsolationForest

# Selecting relevant features
features = queries_df[['Clicks', 'Impressions', 'CTR', 'Position']]

# Initializing Isolation Forest
# contamination is the expected proportion of outliers in the data
iso_forest = IsolationForest(n_estimators=100, contamination=0.01)

# Fitting the model
iso_forest.fit(features)

# Predicting anomalies (-1 = anomaly, 1 = normal)
queries_df['anomaly'] = iso_forest.predict(features)

# Filtering out the anomalies
anomalies = queries_df[queries_df['anomaly'] == -1]
```
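Beyond the binary labels from `predict`, Isolation Forest also exposes a continuous anomaly score through `decision_function` (the lower the score, the more anomalous the point), which is handy for ranking flagged queries rather than just labelling them. Here is a self-contained sketch on synthetic data; with the real data, `X` would simply be the `features` DataFrame from above:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic stand-in for (Clicks, Impressions) pairs: 200 normal rows...
X = rng.normal(loc=[100, 2000], scale=[10, 100], size=(200, 2))
# ...plus one extreme row playing the role of an anomalous query
X = np.vstack([X, [5000, 70000]])

iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
iso.fit(X)

# decision_function gives a continuous score: the lower, the more anomalous,
# so sorting by it ranks queries from most to least suspicious
scores = iso.decision_function(X)
print(int(np.argmin(scores)))  # 200 -- the injected outlier scores as most anomalous
```

Sorting `anomalies` by this score would surface the most extreme queries first, which is useful when you can only review a handful manually.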
Here’s how to analyze the detected anomalies to understand their nature and whether they
represent true outliers or data errors:
```python
print(anomalies[['Top queries', 'Clicks', 'Impressions', 'CTR', 'Position']])
```

```
                          Top queries  Clicks  Impressions     CTR  Position
0         number guessing game python    5223        14578  0.3583      1.61
1                 thecleverprogrammer    2809         3456  0.8128      1.02
2    python projects with source code    2077        73380  0.0283      5.94
4               the clever programmer    1931         2528  0.7638      1.09
15         rock paper scissors python    1111        35824  0.0310      7.19
21              classification report     933        39896  0.0234      7.53
34           machine learning roadmap     708        42715  0.0166      8.97
82                           r2 score     367        56322  0.0065      9.33
167               text to handwriting     222        11283  0.0197     28.52
929                     python turtle      52        18228  0.0029     18.75
```
The anomalies in our search query data are not just outliers: they point to potential areas for growth, optimization, and strategic focus. Some of them may reflect emerging trends or areas of growing interest, and staying responsive to these trends will help in maintaining and growing the website's relevance and user engagement.
Summary
So, Search Queries Anomaly Detection means identifying queries that are outliers according to
their performance metrics. It is valuable for businesses to spot potential issues or opportunities,
such as unexpectedly high or low CTRs. I hope you liked this article on Search Queries Anomaly
Detection with Machine Learning using Python. Feel free to ask valuable questions in the
comments section below.
https://thecleverprogrammer.com/2023/11/20/search-queries-anomaly-detection-using-python/