Data-Driven SEO with Python: Solve SEO Challenges with Data Science Using Python
Andreas Voniatis, Surrey, UK
Foreword by Will Critchlow, Founder and CEO, SearchPilot
Table of Contents
Acknowledgments
Foreword
Chapter 1: Introduction
    The Inexact (Data) Science of SEO
        Noisy Feedback Loop
        Diminishing Value of the Channel
        Making Ads Look More like Organic Listings
        Lack of Sample Data
        Things That Can't Be Measured
        High Costs
    Why You Should Turn to Data Science for SEO
        SEO Is Data Rich
        SEO Is Automatable
        Data Science Is Cheap
    Summary
Chapter 3: Technical
    Where Data Science Fits In
    Modeling Page Authority
        Filtering in Web Pages
        Examine the Distribution of Authority Before Optimization
    Diagnostics
    Road Map
    Summary
Index
About the Author
Andreas Voniatis is the founder of Artios and an SEO consultant with over 20 years' experience working with ad agencies (PHD, Havas, Universal McCann, Mindshare, and iProspect) and brands (Amazon EU, Lyst, Trivago, GameSys). Andreas founded Artios in 2015 to apply an advanced mathematical approach and cloud AI/machine learning to SEO.
With a background in SEO, data science and cloud engineering, Andreas has helped
companies gain an edge through data science and automation. His work has been
featured in publications worldwide including The Independent, PR Week, Search Engine
Watch, Search Engine Journal and Search Engine Land.
Andreas is a qualified accountant, holds a degree in Economics from Leeds University, and has specialised in SEO science for over a decade. Through his firm Artios, Andreas helps grow startups, providing ROI guarantees, and trains enterprise SEO teams on data-driven SEO.
About the Contributing Editor
Simon Dance is the Chief Commercial Officer at Lyst.com, a fashion shopping platform serving over 200M users a year; an angel investor; and an experienced SEO, having spent a 15-year career in senior leadership positions, including Head of SEO for Amazon's UK and European marketplaces, senior SEO roles at large-scale marketplaces in the flights and vacation rental space, and consulting for venture-backed companies including Depop, Carwow, and HealthUnlocked. Simon has worn multiple hats over his career, from building links, manually auditing vast backlink profiles, carrying out comprehensive bodies of keyword research, and writing technical audit documents spanning hundreds of pages, to building, mentoring, and leading teams who have unlocked significant improvements in SEO performance, generating hundreds of millions of dollars of incremental revenue. Simon met Andreas in 2015 when he had just built a rudimentary set of Python scripts designed to vastly increase the scale, speed, and accuracy of carrying out detailed keyword research and classification. They have worked together almost continuously ever since.
About the Technical Reviewer
Joos Korstanje is a data scientist with over five years
of industry experience in developing machine learning
tools. He has a double MSc in Applied Data Science and
in Environmental Science and has extensive experience
working with geodata use cases. He has worked at a number of large companies in the Netherlands and France, developing a variety of machine learning tools.
Acknowledgments
It’s my first book and it wouldn’t have been possible without the help of a few people. I’d
like to thank Simon Dance, my contributing editor, who has asked questions and made
suggested edits using his experience as an SEO expert and commercial director. I’d also
like to thank all of the people at Springer Nature and Apress for their help and support.
Wendy for helping me navigate the commercial seas of getting published. Will Critchlow
for providing the foreword to this book. All of my colleagues, clients, and industry peers
including SEOs, data scientists, and cloud engineers that I have had the pleasure of
working with. Finally, my family, Petra and Julia.
Why I Wrote This Book
Since 2003, when I first got into SEO (by accident), much has changed in the practice of SEO. Back then, the ingredients were less well known, even though much of the focus was on getting backlinks, be they reciprocal, one-way, or from private networks (which are still being used in the gaming space). Other ingredients include transitioning to becoming a recognized brand, producing high-quality content which is valuable to users, a delightful user experience, producing and organizing content by search intent, and, for now and tomorrow, optimizing the search journey.
Many of the ingredients are now well known, and they have become more complicated with the advent of mobile, social media, and voice search, and with the increasing sophistication of search engines.
Now more than ever, the devil is in the details. There is more data being generated
than ever before from ever more third-party data sources and tools. Spreadsheets alone
won’t hack it. You need a sharper blade, and data science (combined with your SEO
knowledge) is your best friend.
I created this book for you, to make your SEO data driven and therefore the best
it can be.
And why now in 2023? Because COVID-19 happened, which gave me time to think about how I could add value to the world, and in particular to the niche world of SEO. More pertinently, there are lots of conversations on Twitter and LinkedIn about SEOs and the use of Python in SEO. So we felt the timing was right: the SEO industry has the appetite, and we have knowledge to share.
I wish you the very best in your new adventure as a data-driven SEO specialist!
• Code to get going: The best way to learn naturally is by doing. While
there are many courses in SEO, the most committed students of SEO
will build their own websites and test SEO ideas and practices. Data
science for SEO is no different if you want to make your SEO data
driven. So, you'll be provided with starter scripts in Python to try your hand at clustering pages and content and analyzing ranking factors. There will be code for most things but not for everything, as
not everything has been coded for (yet). The code is there to get you
started and can always be improved upon.
Beyond the Scope
While this book promises and delivers on making your SEO data driven, there are a
number of things that are better covered by other books out there, such as
• How to become an SEO specialist: What this book won’t cover is how
to become an SEO expert although you’ll certainly come away with a
lot of knowledge on how to be a better SEO specialist. There are some
fundamentals that are beyond the scope of this book.
For example, we don’t get into how a search engine works, what a content
management system is, how it works, and how to read and code HTML and CSS. We also
don’t expose all of the ranking factors that a search engine might use to rank websites or
how to perform a site relaunch or site migration.
This book assumes you have a rudimentary knowledge of how SEO works and what
SEO is. We will give a data-driven view of the many aspects of SEO, and that is to reframe
the SEO challenge from a data science perspective so that you have a useful construct to
begin with.
• How to become a data scientist: This book will certainly expose the
data science techniques to solve SEO challenges. What it won’t do is
teach you to become a data scientist or teach you how to program in
the Python computing language.
1. Keyword research
2. Technical
3. Content and UX
4. Authority
5. Competitor analysis
6. Experiments
7. Dashboards
9. Google updates
• Data sources
• Data structures
• Models
• Activation suggestions
I’ve tried to apply data science to as many SEO processes as possible in the areas
identified earlier. Naturally, there will be some areas that could be applied that have not.
However, technology is changing, and Google is already releasing updates to combat AI-
written content. So I’d imagine in the very near future, more and more areas of SEO will
be subject to data science.
Foreword
The data we have access to as SEOs has changed a lot during my 17 years in the industry. Although we lost analytics-level keyword data, and Yahoo! Site Explorer, we gained
a wealth of opportunity in big data, proprietary metrics, and even some from the horse’s
mouth in Google Search Console.
You don’t have to be able to code to be an effective SEO. But there is a certain kind of
approach and a certain kind of mindset that benefits from wrangling data in all its forms.
If that’s how you prefer to work, you will very quickly hit the limits of spreadsheets and
text editors. When you do, you’ll do well to turn to more powerful tools to help you scale
what you’re capable of, get things done that you wouldn’t even have been able to do
without a computer helping, and speed up every step of the process.
There are a lot of programming languages, and a lot of ways of learning them. Some
people will tell you there is only one right way. I’m not one of those people, but my
personal first choice has been Python for years now. I liked it initially for its relative
simplicity and ease of getting started, and very quickly fell for the magic of being able to
import phenomenal power written by others with a single line of code. As I got to know
the language more deeply and began to get some sense of the “pythonic” way of doing
things, I came to appreciate the brevity and clarity of the language. I am no expert, and
I’m certainly not a professional software engineer, but I hope that makes me a powerful
advocate for the approach outlined in this book - because I have been the target market.
When I was at university, I studied neural networks among many other things. At the
time, they were fairly abstract concepts in operations research. At that point in the late
90s, there wasn’t the readily available computing power plus huge data sets needed to
realise the machine learning capabilities hidden in those nodes, edges, and statistical
relationships. I’ve remained fascinated by what is possible and with the help of magical
import statements and remarkably mature frameworks, I have even been able to build
and train my own neural networks in Python. As a stats geek, I love that it’s all stats under
the hood, but at the same time, I appreciate the beauty in a computer being able to do
something a person can’t.
A couple of years after university, I founded the SEO agency Distilled with my co-
founder Duncan Morris, and one of the things that we encouraged among our SEO
consultants was taking advantage of the data and tools at their disposal. This led to
fun innovation - both decentralised, in individual consultants building scripts and
notebooks to help them scale their work, do it faster, or be more effective, and centrally
in our R&D team.
That R&D team would be the group who built the platform that would become
SearchPilot and launched the latest stage of my career where we are very much leading
the charge for data aware decisions in SEO. We are building the enterprise SEO A/B
testing platform to help the world’s largest websites prove the value of their on-site SEO
initiatives. All of this uses similar techniques to those outlined in the pages that follow to
decide how to implement tests, to consume data from a variety of APIs, and to analyse
their results with neural networks.
I believe that as Google implements more and more of their own machine learning into their ranking algorithms, SEO becomes fundamentally harder as the system becomes harder to predict and has a greater variance across sites, keywords, and topics.
It’s for this reason that I am investing so much time, energy, and the next phase of my
career into our corner of data driven SEO. I hope that this book can set a whole new
cohort of SEOs on a similar path.
I first met Andreas over a decade ago in London. I’ve seen some of the things he
has been able to build over the years, and I’m sure he is going to be an incredible
guide through the intricacies of wrangling data to your benefit in the world of
SEO. Happy coding!
Will Critchlow, CEO, SearchPilot
September 2022
CHAPTER 1
Introduction
Before the Google Search Essentials (formerly Webmaster Guidelines), there was an
unspoken contract between SEOs and search engines which promised traffic in return
for helping search engines extract and index website content. This chapter introduces you to the challenges of applying data science to SEO and why you should turn to data science anyway.
These challenges include:
• Noisy feedback loop
• Diminishing value of the channel
• Making ads look more like organic listings
• Lack of sample data
• Things that can't be measured
• High costs
within their systems before it gets reflected in the search engine results (which may or
may not result in a change of ranking position).
Because of this variable and unpredictable time lag, it is rather difficult to undertake cause-and-effect analysis and learn from SEO experiments.
To have a dataset that truly represents the SEO reality, it must have multiple audit measurements which allow for statistics such as averages and standard deviations per day of
• Duplicate content
• Missing titles
With this type of data, data scientists are able to do meaningful SEO science work
and track these to rankings and UX outcomes.
• Search query: Google, for some time, has been hiding the search query detail of organic traffic, so the keyword detail in Google Analytics is shown as "Not Provided." Naturally, this would be useful because there is a many-keywords-to-one-URL relationship, so getting the breakdown would be crucial for attribution modeling of outcomes such as leads, orders, and revenue.
• Search volume: Google Ads does not fully disclose search volume per
search query. The search volume data for long tail phrases provided
by Ads is reallocated to broader matches because it’s profitable for
Google to encourage users to bid on these terms as there are more
bidders in the auction. Google Search Console (GSC) is a good
substitute, but is first-party data and is highly dependent on your
site’s presence for your hypothesis keyword.
• Segment: This would tell us who is searching, not just the keyword,
which of course would in most cases vastly affect the outcomes of
any machine-learned SEO analysis because a millionaire searching
for “mens jeans” would expect different results to another user of
more modest means. After all, Google is serving personalized results.
Not knowing the segment simply adds noise to any SERPs model or
otherwise.
High Costs
Can you imagine running a large enterprise crawling technology like Botify daily? Most
brands run a crawl once a month because it’s cost prohibitive, and not just on your site.
To get a complete dataset, you’d need to run it on your competitors, and that’s only one
type of SEO data.
Cost won’t matter as much to the ad agency data scientist, but it will affect whether
they will get access to the data because the agency may decide the budget isn’t
worthwhile.
SEO Is Automatable
At least in certain aspects. We're not saying that robots will take over your career. And yet, we believe there is a case that a computer can do some aspects of your job as an SEO instead. After all, computers are extremely good at doing repetitive tasks; they don't get tired or bored, can "see" beyond three dimensions, and only live on electricity.
Andreas has taken over teams where certain members spent time constantly copying
and pasting information from one document to another (the agency and individual will
remain unnamed to spare their blushes).
Doing repetitive work that can be easily done by a computer is not value adding,
emotionally engaging, nor good for your mental health. The point is we as humans are
at our best when we’re thinking and synthesizing information about a client’s SEO; that’s
when our best work gets done.
Summary
This brief introductory chapter has covered the following:
• The inexact science of SEO
CHAPTER 2
Keyword Research
Behind every query a user enters within a search engine is a word or series of words.
For instance, a user may be looking for a “hotel” or perhaps a “hotel in New York City.”
In search engine optimization (SEO), keywords are invariably the target, providing a
helpful way of understanding demand for said queries and helping to more effectively
understand various ways that users search for products, services, organizations, and,
ultimately, answers.
Just as SEO starts with keywords, it also tends to end with them, as an SEO campaign may be evaluated on the value of each keyword's contribution. Even though this information is hidden from us by Google, a number of SEO tools have attempted to infer the keywords users searched to reach a website.
In this chapter, we will give you data-driven methods for finding valuable keywords
for your website (to enable you to have a much richer understanding of user demand).
It’s also worth noting that given keyword rank tracking comes at a cost (usually
charged per keyword tracked or capped at a total number of keywords), it makes sense to
know which keywords are worth the tracking cost.
Data Sources
There are a number of data sources when it comes to keyword research, which we’ll list
as follows:
• Google Search Console
• Competitor Analytics
• SERPs
• Google Trends
• Google Ads
• Google Suggest
We’ll cover the ones highlighted in bold as they are not only the more informative of
the data sources, they also scale as data science methods go. Google Ads data would only
be so appealing if it were based on actual impression data.
We will also show you how to make forecasts of keyword data, both in terms of the number of impressions you would get if you achieved a ranking on page 1 (within positions 1 to 10) and what this impact would be over a six-month horizon.
Armed with a detailed understanding of how customers search, you’re in a much
stronger position to benchmark where you index vs. this demand (in order to understand
the available opportunity you can lean into), as well as be much more customer focused
when orienting your website and SEO activity to target that demand.
Let’s get started.
¹ In 2006, AOL shared click-through rate data based upon over 35 million search queries, and since then it has inspired numerous models that try to estimate the click-through rate (CTR) by search engine ranking position. That is, for every 100 people searching for "hotels in New York," 30% (for example) click on the position 1 ranking, with just 16% clicking on position 2 - hence the importance of achieving the top ranked position in order to, effectively, double your traffic for that keyword.
There is no hard and fast rule. Two sigmas simply mean that there’s a less than 5%
chance that the search query is actually less like the average search query, so a lower
significance threshold like one sigma could easily suffice.
The data are several exports from Google Search Console (GSC) of the top 1000
rows based on a number of filters. The API could be used, and some code is provided in
Chapter 10 showing how to do so.
For now, we’re reading multiple GSC export files stored in a local folder.
Set the path to read the files:
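# A sketch of this step (the folder name and file pattern are hypothetical); these
# imports aren't shown in the original but are needed by the loop below
import os
import pandas as pd
from glob import glob

gsc_path = 'data/gsc_exports/'
gsc_csvs = glob(os.path.join(gsc_path, '*_queries.csv'))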
Initialize an empty list that will store the data being read in:
gsc_li = []
The for loop iterates through each export file and takes the filename as the modifier
used to filter the results and then appends it to the preceding list:
for cf in gsc_csvs:
    df = pd.read_csv(cf, index_col=None, header=0)
    df['modifier'] = os.path.basename(cf)
    df.modifier = df.modifier.str.replace('_queries.csv', '')
    gsc_li.append(df)
Once the list is populated with the export data, it’s combined into a single dataframe:
gsc_raw_df = pd.DataFrame()
gsc_raw_df = pd.concat(gsc_li, axis=0, ignore_index=True)
gsc_raw_df.columns = gsc_raw_df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')
gsc_raw_df.head()
With the data imported, we’ll want to format the column values to be capable of
being summarized. For example, we’ll remove the percent signs in the ctr column and
convert it to a numeric format:
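The cleaning code itself isn't shown here; a minimal sketch, assuming the column is named ctr and holds strings like "12.5%" (the dataframe name gsc_clean_ctr_df matches its use in the next step):

gsc_clean_ctr_df = gsc_raw_df.copy()
gsc_clean_ctr_df['ctr'] = pd.to_numeric(gsc_clean_ctr_df['ctr'].str.replace('%', '')) / 100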
GSC data contains a funny character “<” in the impressions and clicks columns for
values less than 10; our job is to clean this up by removing them and then arranging
impressions in descending order. In Python, this would look like
gsc_clean_ctr_df['impressions'] = gsc_clean_ctr_df.impressions.str.replace('<', '')
gsc_clean_ctr_df['impressions'] = pd.to_numeric(gsc_clean_ctr_df.impressions)
# sort impressions in descending order, then deduplicate queries
gsc_clean_ctr_df = gsc_clean_ctr_df.sort_values('impressions', ascending=False)
gsc_dedupe_df = gsc_clean_ctr_df.drop_duplicates(subset='top_queries', keep="first")
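The segmentation step starts by copying the deduplicated dataframe and defining lists of terms to detect in each query. The lists below are hypothetical placeholders for a gaming retailer like the one in this dataset; substitute terms relevant to your own site:

# rename assumed so the column matches the 'query' references below
gsc_segment_strdetect = gsc_dedupe_df.rename(columns={'top_queries': 'query'})

# Hypothetical segment term lists (placeholders only)
retail_vex = ['buy', 'cheap', 'deal', 'price']
platform_vex = ['ps4', 'ps5', 'xbox', 'nintendo switch']
title_vex = ['fifa', 'call of duty', 'minecraft']
network_vex = ['psn', 'xbox live', 'nintendo online']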
query_conds = [
    gsc_segment_strdetect['query'].str.contains('|'.join(retail_vex)),
    gsc_segment_strdetect['query'].str.contains('|'.join(platform_vex)),
    gsc_segment_strdetect['query'].str.contains('|'.join(title_vex)),
    gsc_segment_strdetect['query'].str.contains('|'.join(network_vex))
]
Create a new column and use np.select to assign values to it using our lists as
arguments:
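A minimal sketch of that step; the segment labels are hypothetical placeholders and must align with the order of query_conds:

import numpy as np

segment_vals = ['Retail', 'Platform', 'Title', 'Network']
gsc_segment_strdetect['segment'] = np.select(query_conds, segment_vals, default='Other')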
gsc_segment_strdetect
Next, round the ranking position to create a rank bracket, so queries can be aggregated by ranking position:

gsc_segment_strdetect['rank_bracket'] = gsc_segment_strdetect.position.round(0)
gsc_segment_strdetect
Define a function that computes impression statistics for each rank bracket:

def imp_aggregator(col):
    d = {}
    d['avg_imps'] = col['impressions'].mean()
    d['imps_median'] = col['impressions'].quantile(0.5)
    d['imps_lq'] = col['impressions'].quantile(0.25)
    d['imps_uq'] = col['impressions'].quantile(0.95)
    d['n_count'] = col['impressions'].count()
    return pd.Series(d)
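The grouping step itself isn't shown above; a minimal sketch, grouping the queries by the rank bracket created earlier:

group_by_rank_bracket = gsc_segment_strdetect.groupby('rank_bracket')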
overall_rankimps_agg = group_by_rank_bracket.apply(imp_aggregator).reset_index()
overall_rankimps_agg
In this case, we went with the 25th and 95th percentiles. The lower percentile number doesn't matter as much as we're far more interested in finding queries with averages beyond the 95th percentile. If we can do that, we have a juicy keyword. Quick note: in data science, percentiles are a particular type of quantile.
Could we make a table for each and every segment? For example, show the statistics
for impressions by ranking position by section. Yes, of course, you could, and in theory,
it would provide a more contextual analysis of queries performed vs. their segment
average. The deciding factor on whether to do so or not depends on how many data
points (i.e., ranked queries) you have for each rank bracket to make it worthwhile (i.e.,
statistically robust). You’d want at least 30 data points in each to go that far.
query_quantile_stats = gsc_segment_strdetect.merge(overall_rankimps_agg, on=['rank_bracket'], how='left')
query_quantile_stats
Explore the Data
Now you might be wondering, how many keywords are punching above and below their
weight (i.e., above and below their quantile limits relative to their ranking position) and
what are those keywords?
Get the number of keywords with high volumes of impressions:
query_stats_uq = query_quantile_stats.loc[query_quantile_stats.impressions
> query_quantile_stats.imps_uq]
query_stats_uq['query'].count()
8390
Get the number of keywords with impressions and ranking beyond page 1:
query_stats_uq_p2b = query_quantile_stats.loc[(query_quantile_stats.impressions > query_quantile_stats.imps_uq) & (query_quantile_stats.rank_bracket > 10)]
query_stats_uq_p2b['query'].count()

2510
Depending on your resources, you may wish to track all 8390 keywords or just the
2510. Let’s see how the distribution of impressions looks visually across the range of
ranking positions:
sns.set(rc={'figure.figsize':(15, 6)})
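# The plot construction isn't shown in the original; a sketch assuming seaborn is
# imported as sns, reshaping the quantile columns computed earlier into long format
imprank_long = query_quantile_stats.melt(id_vars='rank_bracket', value_vars=['avg_imps', 'imps_lq', 'imps_uq'], var_name='quantile', value_name='impressions_stat')
imprank_plt = sns.lineplot(data=imprank_long, x='rank_bracket', y='impressions_stat', hue='quantile').get_figure()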
imprank_plt.savefig("images/imprank_plt.png")
What’s interesting is the upper quantile impression keywords are not all in the top
10, but many are on pages 2, 4, and 6 of the SERP results (Figure 2-1). This indicates
that the site is either targeting the high-volume keywords but not doing a good job of
achieving a high ranking position or not targeting these high-volume phrases.
Figure 2-1. Line chart showing GSC impressions per ranking position bracket for
each distribution quantile
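The construction of the segment-faceted version isn't shown above; a minimal sketch using seaborn's relplot, assuming the segment column created earlier:

imprank_seg_long = query_quantile_stats.melt(id_vars=['rank_bracket', 'segment'], value_vars=['avg_imps', 'imps_lq', 'imps_uq'], var_name='quantile', value_name='impressions_stat')
imprank_seg = sns.relplot(data=imprank_seg_long, x='rank_bracket', y='impressions_stat', hue='quantile', col='segment', col_wrap=3, kind='line')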
imprank_seg.savefig("images/imprank_seg.png")
Most of the high impression keywords are in Accessories, Console, and of course Top
1000 (Figure 2-2).
Figure 2-2. Line chart showing GSC impressions per ranking position bracket for
each distribution quantile faceted by segment
Export the keywords worth tracking so they can be added to your rank tracking tool:

query_stats_uq_p2b.to_csv('exports/query_stats_uq_p2b_TOTRACK.csv')
Activation
Now that you’ve identified high impression value keywords, you can
• Think about how to integrate these new targets into your strategy
Obviously, the preceding list is reductionist; at the very least, though, you now have better nonbrand targets to serve your SEO campaign.
Google Trends
Google Trends is another (free) third-party data source, which shows time series data
(data points over time) up to the last five years for any search phrase that has demand.
Google Trends can also help you compare whether a search is on the rise (or decline)
while comparing it to other search phrases. It can be highly useful for forecasting.
Although no official Google Trends API exists, there are packages in Python (e.g., pytrends) that can automate the extraction of this data, as we'll see in the following:
import pandas as pd
from pytrends.request import TrendReq
import time
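The connection setup isn't shown above; a minimal sketch (the locale and timezone arguments are assumptions):

pytrends = TrendReq(hl='en-GB', tz=0)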
Single Keyword
Now that you’ve identified high impression value keywords, you can see how they’ve
trended over the last five years:
kw_list = ["Blockchain"]
pytrends.build_payload(kw_list, cat=0, timeframe='today 5-y', geo='GB', gprop='')
pytrends.interest_over_time()
Multiple Keywords
As you can see earlier, you get a dataframe with the date, the keyword, and the number of hits (scaled from 0 to 100), which is great. But what if you had 10,000 keywords that you wanted trends for?
In that case, you’d want a for loop to query the search phrases one by one and stick
them all into a dataframe like so:
Read in your target keyword data:
csv_raw = pd.read_csv('data/your_keyword_file.csv')
keywords_df = csv_raw[['query']]
keywords_list = keywords_df['query'].values.tolist()
keywords_list
['nintendo switch',
'ps4',
'xbox one controller',
'xbox one',
'xbox controller',
'ps4 vr',
'Ps5' ...]
Let’s now get Google Trends data for all of your keywords in one dataframe:
dataset = []
exceptions = []

for q in keywords_list:
    q_lst = [q]
    try:
        pytrends.build_payload(kw_list=q_lst, timeframe='today 5-y', geo='GB', gprop='')
        data = pytrends.interest_over_time()
        data = data.drop(labels=['isPartial'], axis='columns')
        dataset.append(data)
        time.sleep(3)
    except:
        exceptions.append(q_lst)
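The step that combines the loop output into a single long-format table isn't shown above; a minimal sketch (the export filename matches the file read back in later; the column names are assumptions):

keyword_gtrends_df = pd.concat(dataset, axis=1)
keyword_gtrends_df = keyword_gtrends_df.reset_index().melt(id_vars='date', var_name='keyword', value_name='hits').dropna()
keyword_gtrends_df.to_csv('exports/keyword_gtrends_df.csv')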
Looking at the raw Google Trends data, we now have data in long format showing
• Date
• Keyword
• Hits
Let’s visualize some of these over time. We start by subsetting the dataframe:
sns.set(rc={'figure.figsize':(15, 6)})
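# The subsetting and plot construction aren't shown in the original; a sketch assuming
# the long-format dataframe created earlier and a selection of console-related queries
gtrends_subset = keyword_gtrends_df.loc[keyword_gtrends_df['keyword'].isin(['ps4', 'ps5', 'xbox series x', 'nintendo switch'])]
keyword_gtrends_plt = sns.lineplot(data=gtrends_subset, x='date', y='hits', hue='keyword')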
keyword_gtrends_plt.figure.savefig("images/keyword_gtrends.png")
keyword_gtrends_plt
Here, we can see that "ps5" and "xbox series x" show near-identical trends which ramp up significantly, while other models are fairly stable and seasonal until the arrival of the new models.
df = pd.read_csv("exports/keyword_gtrends_df.csv", index_col=0)
df.head()
As we’d expect, the data from Google Trends is a very simple time series with date,
query, and hits spanning a five-year period. Time to format the dataframe to go from
long to wide:
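The reshaping code isn't shown here; a minimal sketch using a pivot, assuming the column names date, keyword, and hits, and keeping just the "ps4" and "ps5" queries examined below:

ps_unstacked = df.loc[df['keyword'].isin(['ps4', 'ps5'])].pivot(index='date', columns='keyword', values='hits').reset_index()
ps_unstacked.columns.name = None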
We no longer have a hits column as these are the values of the queries in their
respective columns. This format is not only useful for SARIMA² (which we will be exploring here) but also for neural networks such as long short-term memory (LSTM). Let's plot the data:
ps_unstacked.plot(figsize=(10,5))
From the plot (Figure 2-4), you’ll note that the profiles of both “PS4” and “PS5” are
different.
² Seasonal Autoregressive Integrated Moving Average
For the nongamers among you, “PS4” is the fourth generation of the Sony
PlayStation console, and “PS5” the fifth. “PS4” searches are highly seasonal and have
a regular pattern apart from the end when the “PS5” emerged. The “PS5” didn’t exist
five years ago, which would explain the absence of trend in the first four years of the
preceding plot.
Decomposing the Trend
Let’s now decompose the seasonal (or nonseasonal) characteristics of each trend:
from statsmodels.tsa.seasonal import seasonal_decompose

ps_unstacked.set_index("date", inplace=True)
ps_unstacked.index = pd.to_datetime(ps_unstacked.index)

query_col = 'ps5'
a = seasonal_decompose(ps_unstacked[query_col], model = "add")
a.plot();
Figure 2-5 shows the time series data and the overall smoothed trend, which rises from 2020.
The seasonal trend box shows repeated peaks which indicates that there is
seasonality from 2016, although it doesn’t seem particularly reliable given how flat the
time series is from 2016 until 2020. Also suspicious is the lack of noise as the seasonal
plot shows a virtually uniform pattern repeating periodically.
The Resid (which stands for “Residual”) shows any pattern of what’s left of the time
series data after accounting for seasonality and trend, which in effect is nothing until
2020 as it’s at zero most of the time.
For “ps4,” see Figure 2-6.
We can see fluctuation over the short term (Seasonality) and long term (Trend), with
some noise (Resid). The next step is to use the augmented Dickey-Fuller method (ADF)
to statistically test whether a given time series is stationary or not:
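The test code isn't shown above; a minimal sketch using statsmodels' adfuller on each column:

from statsmodels.tsa.stattools import adfuller

for col in ['ps4', 'ps5']:
    adf_p_value = adfuller(ps_unstacked[col].dropna())[1]
    print(col, 'ADF p-value:', adf_p_value)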
We can see that the p-value of "PS5" shown earlier is more than 0.05, which means that the time series data is not stationary and therefore needs differencing. "PS4" on the other hand is less than 0.05 at 0.01, meaning it's stationary and doesn't require differencing.
The point of all this is to understand the parameters that would be used if we were
manually building a model to forecast Google searches.
We use pmdarima's auto_arima to search for the best-fitting parameters for each series. The call below is for "ps5"; the same call is run on the "ps4" column to produce ps4_s:

from pmdarima import auto_arima

ps5_s = auto_arima(ps_unstacked['ps5'],
                   trace=True,
                   m=52,  # there are 52 periods per season (weekly data)
                   start_p=0,
                   start_d=0,
                   start_q=0,
                   seasonal=False)
The preceding printout shows that the parameters that get the best results are
PS4: ARIMA(4,0,3)(0,0,0)
PS5: ARIMA(3,1,3)(0,0,0)
The PS5 estimate is further detailed when printing out the model summary:
ps5_s.summary()
By minimizing AIC and BIC, we get the best estimated parameters for p and q.
Test the Model
Now that we have the parameters, we can now start making forecasts for both products:
ps4_order = ps4_s.get_params()['order']
ps4_seasorder = ps4_s.get_params()['seasonal_order']
ps5_order = ps5_s.get_params()['order']
ps5_seasorder = ps5_s.get_params()['seasonal_order']
params = {
"ps4": {"order": ps4_order, "seasonal_order": ps4_seasorder},
"ps5": {"order": ps5_order, "seasonal_order": ps5_seasorder}
}
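# Hold-out split (not shown in the original); a hypothetical sketch keeping the
# last 26 weeks aside as a test set
X = ps_unstacked.iloc[:-26]
test = ps_unstacked.iloc[-26:]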
results = []
fig, axs = plt.subplots(len(X.columns), 1, figsize=(24, 12))
Make forecasts:
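The fitting and prediction code isn't shown here; a minimal sketch, mirroring the SARIMAX pattern used later for the out-of-sample forecasts and assuming the hold-out split above:

from statsmodels.tsa.statespace.sarimax import SARIMAX

for col in X.columns:
    arima_model = SARIMAX(X[col], order=params[col]['order'], seasonal_order=params[col]['seasonal_order'])
    arima_result = arima_model.fit()
    forecast = arima_result.forecast(steps=len(test))
    results.append(forecast)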
Plot predictions:
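A sketch of one way to plot the actuals against the forecasts on the subplots created earlier:

for i, col in enumerate(X.columns):
    axs[i].plot(ps_unstacked[col], label='actual')
    axs[i].plot(results[i], label='forecast')
    axs[i].set_title(col)
    axs[i].legend()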
For ps4, the forecasts are pretty accurate from the beginning until March when the
search values start to diverge (Figure 2-7), while the ps5 forecasts don’t appear to be very
good at all, which is unsurprising.
Figure 2-7. Time series line plots comparing forecasts and actual data for both
ps4 and ps5
The forecasts show the models are good when there is enough history, until the patterns suddenly change as they have for PS4 from March onward. For PS5, the models are hopeless virtually from the get-go. We know this because the Root Mean Squared Error (RMSE) is 8.62 for PS4, around a third of the PS5 RMSE of 27.5, which, given Google Trends values range from 0 to 100, is roughly a 27% margin of error.
Forecast the Future
At this point, we’ll now make the foolhardy attempt to forecast the future based on the
data we have to date:
oos_train_data = ps_unstacked
oos_train_data.tail()
As you can see from the preceding table extract, we’re now using all available data.
Now we shall predict the next six months (defined as 26 weeks) in the following code:
oos_results = []
weeks_to_predict = 26
fig, axs = plt.subplots(len(ps_unstacked.columns), 1, figsize=(24, 12))
Again, iterate through the columns to fit the best model each time:
# The loop header and the per-column auto_arima call aren't shown in the original; both are assumed here
for col in oos_train_data.columns:
    s = auto_arima(oos_train_data[col], trace=True, m=52, seasonal=False)
    oos_arima_model = SARIMAX(oos_train_data[col],
                              order = s.get_params()['order'],
                              seasonal_order = s.get_params()['seasonal_order'])
    oos_arima_result = oos_arima_model.fit()
Make forecasts:
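    # Continuing inside the loop above (a sketch): forecast the next 26 weeks and store the result
    oos_forecast = oos_arima_result.forecast(steps=weeks_to_predict)
    oos_results.append(oos_forecast)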
Plot predictions:
Best model: ARIMA(3,1,3)(0,0,0)[0]
Total fit time: 7.954 seconds
Column: ps5 - Mean: 3.973076923076923
This time, we automated the finding of the best-fitting parameters and fed that
directly into the model.
The forecasts don’t look great (Figure 2-8) because there’s been a lot of change in the
last few weeks of the data; however, that’s in the case of those two keywords.
Figure 2-8. Out-of-sample forecasts of Google Trends for ps4 and ps5
The forecast quality will be dependent on how stable the historic patterns are and
will obviously not account for unforeseeable events like COVID-19.
Export your forecasts:
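A minimal sketch of one way to do this (the filename is hypothetical):

oos_results_df = pd.concat(oos_results, axis=1)
oos_results_df.columns = oos_train_data.columns
oos_results_df.to_csv('exports/keyword_gtrends_forecasts.csv')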
What we learn here is where forecasting using statistical models is useful or likely to add value, particularly in automated systems like dashboards: that is, when there's stable historical data, and not when there is a sudden spike, as with PS5.
Consider these three search phrases:
"Trench coats"
"Ladies trench coats"
"Life insurance"
“Trench coats” will share the same search intent as “Ladies trench coats” but won’t
share the same intent as “Life insurance.” To work this out, a simple comparison of the
top 10 ranking sites for both search phrases in Google will offer a strong suggestion of
what Google thinks of the search intent between the two phrases.
It’s not a perfect method, but it works well because you’re using the ranking results
which are a distillation of everything Google has learned to date on what content
satisfies the search intent of the search query (based upon the trillions of global searches
per year). Therefore, it’s reasonable to surmise that if two search queries have similar
enough SERPs, then the search intent is shared between keywords.
This is useful for a number of reasons:
• Paid search ads: Good keyword content mappings also mean you
can improve the account structure and resulting quality score of your
paid search activity.
Starting Point
Okay, time to cluster. We'll assume you already have the top 100 SERPs³ results for each
of your keywords stored as a Python dataframe “serps_input.” The data is easily obtained
from a rank tracking tool, especially if they have an API:
serps_input
Here, we're using DataForSEO's SERP API,⁴ and we have renamed the column from "rank_absolute" to "rank."
³ Search Engine Results Pages (SERP)
⁴ Available at https://dataforseo.com/apis/serp-api/
Here it goes:
Split:

serps_grpby_keyword = serps_input.groupby("keyword")

def filter_twenty_urls(group_df):
    filtered_df = group_df.loc[group_df['url'].notnull()]
    filtered_df = filtered_df.loc[filtered_df['rank'] <= 20]
    return filtered_df

filtered_serps = serps_grpby_keyword.apply(filter_twenty_urls)
normed = normed.add_prefix('normed_')
filtered_serps_df = pd.concat([filtered_serps],axis=0)
filtserps_grpby_keyword = filtered_serps_df.groupby("keyword")

def string_serps(df):
    df['serp_string'] = ''.join(df['url'])
    return df

Combine:

strung_serps = filtserps_grpby_keyword.apply(string_serps)
strung_serps = pd.concat([strung_serps],axis=0)
strung_serps = strung_serps[['keyword', 'serp_string']]#.head(30)
strung_serps = strung_serps.drop_duplicates()
strung_serps
Now that we have a table showing the keywords and their SERP strings, we're ready to compare SERPs. Here's an example of the SERP string for "fifa 19 ps4":
'https://www.amazon.co.uk/Electronic-Arts-221545-FIFA-PS4/dp/
B07DLXBGN8https://www.amazon.co.uk/FIFA-19-GAMES/dp/B07DL2SY2Bhttps://
www.game.co.uk/en/fifa-19-2380636https://www.ebay.co.uk/b/FIFA-19-Sony-
PlayStation-4-Video-Games/139973/bn_7115134270https://www.pricerunner.com/
pl/1422-4602670/PlayStation-4-Games/FIFA-19-Compare-Priceshttps://pricespy.
co.uk/games-consoles/computer-video-games/ps4/fifa-19-ps4--p4766432https://
store.playstation.com/en-gb/search/fifa%2019https://www.amazon.com/FIFA-19-
Standard-PlayStation-4/dp/B07DL2SY2Bhttps://www.tesco.com/groceries/
en-GB/products/301926084https://groceries.asda.com/product/ps-4-games/
ps-4-fifa-19/1000076097883https://uk.webuy.com/product-detail/?id=503094
5121916&categoryName=playstation4-software&superCatName=gaming&title=fi
fa-19https://www.pushsquare.com/reviews/ps4/fifa_19https://en.wikipedia.
org/wiki/FIFA_19https://www.amazon.in/Electronic-Arts-Fifa19SEPS4-Fifa-
PS4/dp/B07DVWWF44https://www.vgchartz.com/game/222165/fifa-19/https://www.
metacritic.com/game/playstation-4/fifa-19https://www.johnlewis.com/fifa-19-
ps4/p3755803https://www.ebay.com/p/22045274968'
Notice that the URLs above have been joined with no separator. To keep the individual URLs distinct as tokens, we redefine the function to join them with a space and rerun it:

filtserps_grpby_keyword = filtered_serps_df.groupby("keyword")

def string_serps(df):
    df['serp_string'] = ' '.join(df['url'])
    return df

strung_serps = filtserps_grpby_keyword.apply(string_serps)
strung_serps = pd.concat([strung_serps],axis=0)
Here, we now have the keywords and their respective SERPs all converted into
a string which fits into a single cell. For example, the search result for “beige trench
coats” is
'https://www.zalando.co.uk/womens-clothing-coats-trench-coats/_beige/
https://www.asos.com/women/coats-jackets/trench-coats/cat/?cid=15143
https://uk.burberry.com/womens-trench-coats/beige/ https://www2.hm.com/
en_gb/productpage.0751992002.html https://www.hobbs.com/clothing/
coats-jackets/trench/beige/ https://www.zara.com/uk/en/woman-outerwear-
trench-l1202.html https://www.ebay.co.uk/b/Beige-Trench-Coats-for-
Women/63862/bn_7028370345 https://www.johnlewis.com/browse/women/womens-
coats-jackets/trench-coats/_/N-flvZ1z0rnyl https://www.elle.com/uk/fashion/
what-to-wear/articles/g30975/best-trench-coats-beige-navy-black/'
Time to put these side by side. What we're effectively doing here is taking the product of the column with itself, that is, squaring it, so that we get all possible combinations of SERPs to compare.
Add a function to align SERPs:
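The function definition isn't shown here; a minimal sketch of one way to write it, cross-joining a given keyword's SERP string against every other keyword's (the column names are assumptions based on their later use):

def serps_align(k, df):
    prime_df = df.loc[df.keyword == k]
    prime_df = prime_df.rename(columns = {"serp_string": "serp_string_a", "keyword": "keyword_a"})
    comp_df = df.loc[df.keyword != k].reset_index(drop=True)
    prime_df = prime_df.loc[prime_df.index.repeat(len(comp_df.index))].reset_index(drop=True)
    prime_df = pd.concat([prime_df, comp_df], axis=1)
    prime_df = prime_df.rename(columns = {"serp_string": "serps_string_b", "keyword": "keyword_b", "serp_string_a": "serp_string", "keyword_a": "keyword"})
    return prime_df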
serps_align('ps4', strung_serps)

# queries and matched_serps are initialized here as assumptions (not shown in the original)
queries = strung_serps.keyword.to_list()
matched_serps = pd.DataFrame()
for q in queries:
    temp_df = serps_align(q, strung_serps)
    matched_serps = matched_serps.append(temp_df)
The preceding result shows all of the keywords with SERPs compared side by
side with other keywords and their SERPs. Next, we’ll infer keyword intent similarity
by comparing serp_strings, but first here’s a note on the methods like Levenshtein,
Jaccard, etc.
Levenshtein distance is edit based, meaning the number of edits required to
transform one string (in our case, serp_string) into the other string (serps_string_b).
This doesn’t work very well because the websites within the SERP strings are individual
tokens, that is, not a single continuous string.
Sorensen-Dice is better because it is token based, that is, it treats the individual
websites as individual items or tokens. Using set similarity methods, the logic is to
find the common tokens and divide them by the total number of tokens present by
combining both sets. It doesn’t take the order into account, so we must go one better.
The M Measure looks at both the token overlap and the order of the tokens, that is, weighting tokens that appear earlier (i.e., the higher-ranking sites) more than later tokens. There is no off-the-shelf package for this unfortunately, so we wrote the function for you here:
import py_stringmatching as sm
ws_tok = sm.WhitespaceTokenizer()
# The function signature below is assumed; it was not shown in the original
def serps_similarity(serps_str1, serps_str2, k=15):
    ws_tok = sm.WhitespaceTokenizer()

    # keep only first k URLs
    serps_1 = ws_tok.tokenize(serps_str1)[:k]
    serps_2 = ws_tok.tokenize(serps_str2)[:k]

    # get positions of matches
    match = lambda a, b: [b.index(x)+1 if x in b else None for x in a]

    # positions intersections of form [(pos_1, pos_2), ...]
    pos_intersections = [(i+1,j) for i,j in enumerate(match(serps_1, serps_2)) if j is not None]
    pos_in1_not_in2 = [i+1 for i,j in enumerate(match(serps_1, serps_2)) if j is None]
    pos_in2_not_in1 = [i+1 for i,j in enumerate(match(serps_2, serps_1)) if j is None]
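    # The remainder of the scoring isn't shown in the original; the completion below is a sketch
    # of one way to finish it, weighting earlier (higher-ranking) positions more heavily
    denom = k + 1
    a_sum = sum(abs(1/i - 1/j) for i, j in pos_intersections)
    b_sum = sum(abs(1/i - 1/denom) for i in pos_in1_not_in2)
    c_sum = sum(abs(1/i - 1/denom) for i in pos_in2_not_in1)
    norm = sum(2 * (1/i - 1/denom) for i in range(1, denom))
    return 1 - (a_sum + b_sum + c_sum) / norm if norm else 0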
Before sorting the keywords into topic groups, let’s add search volumes for each. This
could be an imported table like the following one called “keysv_df”:
keysv_df
Let’s now join the data. What we’re doing here is giving Python the ability to group
keywords according to SERP similarity and name the topic groups according to the
keyword with the highest search volume.
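Before grouping, the SERP similarity of each keyword pair needs to be scored and the pair columns renamed; that intermediate step isn't shown here, so the following is a minimal sketch, assuming the serps_similarity function above:

matched_serps['si_simi'] = matched_serps.apply(lambda x: serps_similarity(x.serp_string, x.serps_string_b), axis=1)
keywords_crossed_vols = matched_serps.rename(columns = {'keyword': 'topic', 'keyword_b': 'keyword'})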
Group keywords by search intent according to a similarity limit. In this case, keyword search results must be 40% or more similar. This number is based on trial and error; the right value can vary by search space, language, or other factors.

simi_lim = 0.4

keywords_crossed_vols = keywords_crossed_vols.merge(keysv_df, on = 'keyword', how = 'left')

Simulate si_simi:

#keywords_crossed_vols['si_simi'] = np.random.rand(len(keywords_crossed_vols.index))
keywords_crossed_vols.sort_values('topic_volume', ascending = False)
keywords_filtered_nonnan = keywords_crossed_vols.dropna()
We now have the potential topic name, keyword SERP similarity, and search volumes
of each. You’ll note the keyword and keyword_b have been renamed to topic and
keyword, respectively. Now we’re going to iterate over the columns in the dataframe
using list comprehensions.
List comprehension is a technique for looping over lists. We applied it to the Pandas
dataframe because it’s much quicker than the .iterrows() function. Here it goes.
Add a dictionary comprehension to create numbered topic groups from keywords_
filtered_nonnan:
# {1: [k1, k2, ..., kn], 2: [k1, k2, ..., kn], ..., n: [k1, k2, ..., kn]}
queries_in_df = list(set(keywords_filtered_nonnan.topic.to_list()))
topic_groups_numbered = {}
topics_added = []

def latest_index(dicto):
    if topic_groups_numbered == {}:
        i = 0
    else:
        i = list(topic_groups_numbered)[-1]
    return i
The list comprehension will now apply the function to group keywords into clusters:
The preceding results are statements printing out what keywords are in which topic
group. We do this to make sure we don’t have duplicates or errors, which is crucial for
the next step to perform properly. Now we’re going to convert the dictionary into a
dataframe so you can see all of your keywords grouped by search intent:
topic_groups_lst = []

for k, l in topic_groups_numbered.items():
    for v in l:
        topic_groups_lst.append([k, v])
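The conversion itself isn't shown above; a minimal sketch (the column names are assumed, joining the search volumes back on):

topic_groups_vols = pd.DataFrame(topic_groups_lst, columns = ['topic_group_no', 'keyword'])
topic_groups_vols = topic_groups_vols.merge(keysv_df, on = 'keyword', how = 'left')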
As you can see, the keywords are grouped intelligently, much like a human SEO
analyst would group these, except these have been done at scale using the wisdom of
Google which is distilled from its vast number of users. Name the clusters:
def highest_demand(df):
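    # The body isn't shown in the original; a sketch keeping the highest-volume keyword(s) per group
    max_sv = df.search_volume.max()
    df = df.loc[df.search_volume == max_sv]
    return df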
topic_groups_vols_keywgrp = topic_groups_vols.groupby('topic_group_no')
topic_groups_vols_keywgrp.get_group(1)

high_demand_topics = topic_groups_vols_keywgrp.apply(highest_demand).reset_index()
del high_demand_topics['level_1']
high_demand_topics = high_demand_topics.rename(columns = {'keyword': 'topic'})

def shortest_name(df):
    df['k_len'] = df.topic.str.len()
    min_kl = df.k_len.min()
    df = df.loc[df.k_len == min_kl]
    del df['topic_group_no']
    del df['k_len']
    del df['search_volume']
    return df

high_demand_topics_spl = high_demand_topics.groupby('topic_group_no')
named_topics = high_demand_topics_spl.apply(shortest_name).reset_index()
del named_topics['level_1']
The resulting table shows that we now have keywords clustered by topic:
This is really starting to take shape, and you can quickly see opportunities emerging.
niche) is so new in terms of what it offers that there’s insufficient demand (that has yet
to be created by advertising and PR to generate nonbrand searches), then these external
tools won’t be as valuable. So, our approach will be to
1. Crawl your own website
2. Filter and clean the data for sections covering only what you sell
import pandas as pd
import numpy as np
crawl_import_df = pd.read_csv('data/crawler-filename.csv')
crawl_import_df
⁵ Like Screaming Frog, OnCrawl, or Botify, for instance
The preceding result shows the dataframe of the crawl data we've just imported. We're most interested in live indexable⁶ URLs, so let's filter and select the page_title and URL columns:
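The filtering code isn't shown here; a minimal sketch with hypothetical column names (adjust to your crawler's export):

titles_urls_df = crawl_import_df.loc[(crawl_import_df['indexability'] == 'Indexable') & (crawl_import_df['status_code'] == 200)].copy()
titles_urls_df = titles_urls_df[['page_title', 'url']]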
Now we’re going to clean the title tags to make these nonbranded, that is, remove the
site name and the magazine section.
titles_urls_df['page_title'] = titles_urls_df.page_title.str.replace(' - Saga', '')
titles_urls_df = titles_urls_df.loc[~titles_urls_df.url.str.contains('/magazine/')]
titles_urls_df
⁶ That is, pages with a 200 HTTP response that do not block search indexing with "noindex"
We now have 349 rows, so we will query some of the keywords to illustrate the
process.
pd.set_option('display.max_rows', 1000)
serps_ngrammed = filtered_serps_df.set_index(["keyword", "rank_absolute"])\
    .apply(lambda x: x.str.split('[-,|?()&:;\[\]=]').explode())\
    .dropna()\
    .reset_index()
serps_ngrammed.head(10)
Courtesy of the explode function, the dataframe has been unnested so that we can see the keyword rows expanded for the different pieces of text that were previously within the same title, joined by punctuation marks.
Et voilà, the preceding result shows a dataframe of keywords obtained from the SERPs. Most of it makes sense and can now be added to your list of keywords for serious consideration and tracking.
Summary
This chapter has covered data-driven keyword research, enabling you to find high-value keywords, forecast their demand, and cluster them by search intent.
In the next chapter, we will cover the mapping of those keywords to URLs.
CHAPTER 3
Technical
Technical SEO mainly concerns the interaction of search engines and websites such that
• Website content is made discoverable by search engines.
• The content meaning is extracted from those URLs for search result inclusion (known as indexing).
In this chapter, we'll look at how a data-driven approach can be taken toward improving technical SEO in the following manner:
• Modeling page authority: This is useful for helping fellow SEO and
non-SEOs understand the impact of technical SEO changes.
• Core Web Vitals (CWV): While the benefits to the UX are often
lauded, there are ranking boost benefits to an improved CWV
because of the conserved search engine resources used to extract
content from a web page.
By no means will we claim that this is the final word on data-driven SEO from a
technical perspective. What we will do is expose data-driven ways of solving technical
SEO issues using some data science such as distribution analysis.
Ultimately, the preceding list will help you build better cases for getting technical
recommendations implemented.
import re
import time
import random
import pandas as pd
import numpy as np
import datetime
import requests
import json
from datetime import timedelta
from glob import glob
import os
from client import RestClient # If using the Data For SEO API
from textdistance import sorensen_dice
from plotnine import *
import matplotlib.pyplot as plt
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
import uritools
from urllib.parse import urlparse
import tldextract
pd.set_option('display.max_colwidth', None)
%matplotlib inline
Set variables:
root_domain = 'boundlesshq.com'
hostdomain = 'www.boundlesshq.com'
hostname = 'boundlesshq'
full_domain = 'https://www.boundlesshq.com'
client_name = 'Boundless'
audit_monthyear = 'jul_2022'
Import the crawl data from the Sitebulb desktop crawler. Screaming Frog or any
other site crawling software can be used; however, the column names may differ:
crawl_csv = pd.read_csv('data/boundlesshq_com_all_urls__excluding_uncrawled__filtered.csv')
crawl_csv.columns = [col.lower().replace('.','').replace('(','').replace(')','').replace(' ','_')
                     for col in crawl_csv.columns]
crawl_csv
The data is loaded into a Pandas dataframe. The most important fields are as follows:
• url: To detect patterns for noindexing and canonicalizing
Filter the crawl down to HTML URLs on the root domain that pass PageRank:

crawl_html = crawl_csv.copy()
crawl_html = crawl_html.loc[crawl_html['content_type'] == 'HTML']
crawl_html = crawl_html.loc[crawl_html['host'] == root_domain]
crawl_html = crawl_html.loc[crawl_html['passes_pagerank'] == 'Yes']
crawl_html
The dataframe has been reduced to 309 rows. For ease of data handling, we’ll select
some columns:
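The selection itself isn't shown here; a minimal sketch with assumed column names:

crawl_select = crawl_html[['url', 'indexable', 'passes_pagerank', 'crawl_depth', 'first_parent_url', 'ur']].copy()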
crawl_select['project'] = client_name
crawl_select['count'] = 1
print(crawl_select['ur'].sum(), crawl_select['ur'].sum()/crawl_select.shape[0])
10993 35.57605177993528
URLs on this site have an average page authority (measured as UR, URL Rating) of around 36. Let's look at some further stats for indexable and nonindexable pages. We'll dimension on (I) indexable and (II) passes pagerank to sum the number of URLs and UR:
overall_pagerank_agg = crawl_select.groupby(['indexable', 'passes_pagerank']).agg({'count': 'sum', 'ur': 'sum'}).reset_index()
Then we derive the page authority per URL by dividing the total UR by the total
number of URLs:
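A minimal sketch of that derivation, mirroring the pattern used for the site-depth aggregation below:

overall_pagerank_agg['PA'] = overall_pagerank_agg['ur'] / overall_pagerank_agg['count']
overall_pagerank_agg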
We see that there are 32 nonindexable URLs with a total authority of 929 that could be consolidated to the indexable URLs. There are some more stats, this time analyzed by site depth (crawl depth), purely out of curiosity:
site_pagerank_agg = crawl_select.groupby(['indexable', 'crawl_depth']).agg({'count': 'sum', 'ur': 'sum'}).reset_index()
site_pagerank_agg['PA'] = site_pagerank_agg['ur'] / site_pagerank_agg['count']
site_pagerank_agg
Most of the URLs that have the authority for reallocation are four clicks away from
the home page.
Let’s visualize the distribution of the authority preoptimization, using the geom_
histogram function:
pageauth_dist_plt = (
    ggplot(crawl_select, aes(x = 'ur')) +
    geom_histogram(alpha = 0.7, fill = 'blue', bins = 20) +
    labs(x = 'Page Authority', y = 'URL Count') +
    theme(legend_position = 'none',
          axis_text_x=element_text(rotation=0, hjust=1, size = 12))
)

pageauth_dist_plt.save(filename = 'images/1_pageauth_dist_plt.png', height=5, width=8, units = 'in', dpi=1000)
pageauth_dist_plt
As we’d expect from looking at the stats computed previously, most of the pages have
between 25 and 50 UR, with the rest spread out (Figure 3-1).
Figure 3-1. Histogram plot showing URL count of URL Page Authority scores
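The construction of the parent URL mapping isn't shown here; a minimal sketch of one way to build it, assuming the first_parent_url column from the crawl:

parent_pa_map = crawl_select[['url', 'first_parent_url']].drop_duplicates()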
parent_pa_map
The table shows all the parent URLs and their mapping.
The next step is to mark pages that will be noindexed, so we can reallocate their
authority:
crawl_optimised = crawl_select.copy()

reallocate_conds = [
    crawl_optimised['url'].str.contains('/page/[0-9]/'),
    crawl_optimised['url'].str.contains('/country/')
]
reallocate_vals = [1, 1]
The reallocate column uses the np.select function to mark URLs for noindex. Any
URLs not for noindex are marked as “0,” using the default parameter:
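The line itself isn't shown here; a minimal sketch matching that description:

crawl_optimised['reallocate'] = np.select(reallocate_conds, reallocate_vals, default=0)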
crawl_optimised
The reallocate column is added so we can start seeing the effect of the reallocation,
that is, the potential upside of technical optimization.
As usual, a groupby operation by reallocate and the average PA are calculated:
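The aggregation isn't shown here; a minimal sketch following the earlier pattern:

realloc_pagerank_agg = crawl_optimised.groupby('reallocate').agg({'count': 'sum', 'ur': 'sum'}).reset_index()
realloc_pagerank_agg['PA'] = realloc_pagerank_agg['ur'] / realloc_pagerank_agg['count']
realloc_pagerank_agg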
So we'll actually be reallocating 681 UR from the noindexed URLs to the 285 indexable URLs. These noindexed URLs have an average UR of 28.
We filter the URLs just for the ones that will be noindexed to help us determine what the extra page authority will be:
no_indexed = crawl_optimised.loc[crawl_optimised['reallocate'] == 1]
We aggregate by the first parent URL (the parent node) for the total URLs within and their UR, because the UR is likely to be reallocated to the remaining indexable URLs that share the same parent node:

no_indexed_map = no_indexed.groupby('first_parent_url').agg({'count': 'sum', 'ur': sum}).reset_index()
add_ur is a new column created representing the additional authority as a result of the
optimization. This is the total UR divided by the number of URLs:
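A minimal sketch of that calculation:

no_indexed_map['add_ur'] = no_indexed_map['ur'] / no_indexed_map['count']
no_indexed_map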
The preceding table will be merged into the indexable URLs by the first parent URL.
Filter the URLs just for the indexable ones and add the extra authority reallocated from the noindexed URLs:
crawl_new = crawl_optimised.copy()
crawl_new = crawl_new.loc[crawl_new['reallocate'] == 0]
Often, when joining data, there will be null values for first parent URLs not in the
mapping. np.where() is used to replace those null values with zeros. This enables further
data manipulation to take place as you’ll see shortly.
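A sketch of the join and cleanup; the column names add_ur and new_ur follow the surrounding code:

crawl_new = crawl_new.merge(no_indexed_map[['first_parent_url', 'add_ur']],
                            on = 'first_parent_url', how = 'left')
crawl_new['add_ur'] = np.where(crawl_new['add_ur'].isnull(), 0, crawl_new['add_ur'])
crawl_new['new_ur'] = crawl_new['ur'] + crawl_new['add_ur']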
crawl_new
The indexable URLs now have their authority scores post optimization, which we’ll
visualize as follows:
pageauth_newdist_plt = (
    ggplot(crawl_new, aes(x = 'new_ur')) +
    geom_histogram(alpha = 0.7, fill = 'lightgreen', bins = 20) +
    labs(x = 'Page Authority', y = 'URL Count') +
    theme(legend_position = 'none', axis_text_x=element_text(rotation=0,
                                                             hjust=1, size = 12))
)

pageauth_newdist_plt.save(filename = 'images/2_pageauth_newdist_plt.png',
                          height=5, width=8, units = 'in', dpi=1000)
pageauth_newdist_plt
The impact is noticeable, as we see most pages are above 60 UR post optimization,
should the implementation move forward.
There are some quick stats to confirm:
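A sketch of the aggregation comparing pre- and post-optimization authority (the grouping is assumed):

new_pagerank_agg = crawl_new.groupby('indexable').agg({'count': 'sum',
                                                       'ur': 'sum',
                                                       'new_ur': 'sum'}).reset_index()
new_pagerank_agg['PA'] = new_pagerank_agg['ur'] / new_pagerank_agg['count']
new_pagerank_agg['new_PA'] = new_pagerank_agg['new_ur'] / new_pagerank_agg['count']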
print(new_pagerank_agg)
The average page authority is now 57 vs. 36, which is a significant improvement.
While this method is not an exact science, it could help you to build a case for getting
your change requests for technical SEO fixes implemented.
3. Anchor text
import pandas as pd
import numpy as np
from textdistance import sorensen_dice
from plotnine import *
import matplotlib.pyplot as plt
from mizani.formatters import comma_format
target_name = 'ON24'
target_filename = 'on24'
website = 'www.on24.com'
The link data is sourced from the Sitebulb auditing software. We import it and make the
column names easier to work with:
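A sketch of the import (the filename is hypothetical):

link_data = pd.read_csv('data/on24_internal_links.csv')  # filename hypothetical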
link_data.columns = [col.lower().replace('.','').replace('(','').replace(')','').replace(' ','_')
                     for col in link_data.columns]
link_data
• Referring URL Rank UR: The page authority of the referring page
• Target URL Rank UR: The page authority of the target page
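The crawl data is imported the same way (the filename is hypothetical):

crawl_data = pd.read_csv('data/on24_crawl.csv')  # filename hypothetical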
crawl_data.columns = [col.lower().replace('.','').replace('(','').replace(')','').replace(' ','_')
                      for col in crawl_data.columns]
crawl_data
So we have the usual list of URLs and how they were found (crawl source) with other
features spanning over 100 columns.
As you’d expect, the number of rows in the link data far exceeds the crawl dataframe
as there are many more links than pages!
Import the external inbound link data:
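A sketch of the import (the filename is hypothetical):

ahrefs_raw = pd.read_csv('data/on24_ahrefs_backlinks.csv')  # filename hypothetical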
ahrefs_raw.columns = [col.lower().replace('.','').replace('(','').replace(')','').replace(' ','_')
                      for col in ahrefs_raw.columns]
ahrefs_raw
There are over 210,000 URLs with backlinks, which is very nice! There’s quite a bit of
data, so let’s simplify a little by removing columns and renaming some columns so we
can join the data later:
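A sketch of the simplification; the AHREFs export column names are assumed:

ahrefs_df = ahrefs_raw[['page_url', 'url_rating_desc', 'referring_domains']].copy()
ahrefs_df = ahrefs_df.rename(columns = {'page_url': 'url',
                                        'url_rating_desc': 'page_authority',
                                        'referring_domains': 'ref_domains'})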
Now we have the data in its simplified form which is important because we’re not
interested in the detail of the links but rather the estimated page-level authority that they
import into the target website.
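We'll take the redirected and live URLs from the crawl forward as redir_live_urls. A minimal sketch, with the Sitebulb status column name assumed:

redir_live_urls = crawl_data.copy()
# Keep live (200) URLs and redirect targets; the column name is an assumption
redir_live_urls = redir_live_urls.loc[redir_live_urls['http_status_code'].isin([200, 301])]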
redir_live_urls.groupby(['crawl_depth']).size()
crawl_depth
0 1
1 70
10 5
11 1
12 1
13 2
14 1
2 303
3 378
4 347
5 253
6 194
7 96
8 33
9 19
Not Set 2351
dtype: int64
We can see how Python is treating the crawl depth as a string character rather than a
numbered category, which we can fix shortly.
Most of the site URLs can be found at site depths of 2 to 6. There are 2,351
orphaned URLs, which means these won't inherit any authority unless they have
backlinks.
Having filtered for redirected and live links, crawl depth is set as a category and
ordered so that Python treats the column as an ordered category as opposed to a plain
string type:
redir_live_urls['crawl_depth'] = redir_live_urls['crawl_depth'].astype('category')
redir_live_urls['crawl_depth'] = redir_live_urls['crawl_depth'].cat.reorder_categories(
    ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'Not Set'])
redir_live_urls = redir_live_urls.loc[redir_live_urls.host == website]
redir_live_urls.drop('host', axis = 1, inplace = True)
redir_live_urls
redir_live_urls.groupby(['crawl_depth']).size()
crawl_depth
0 1
1 66
2 169
3 280
4 253
5 201
6 122
7 64
8 17
9 6
10 1
Not Set 2303
dtype: int64
Note how the size has dropped slightly to 2303 URLs. The 48 nonindexable URLs
were probably paginated pages.
Let’s visualize the distribution:
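A sketch of the histogram (aesthetics assumed):

ove_intlink_dist_plt = (
    ggplot(redir_live_urls, aes(x = 'no_internal_links_to_url')) +
    geom_histogram(alpha = 0.6, fill = 'blue', bins = 20) +
    labs(y = 'URL Count', x = 'Internal Links to URL') +
    theme_classic() +
    theme(legend_position = 'none')
)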
ove_intlink_dist_plt.save(filename = 'images/1_overall_intlink_dist_plt.png',
                          height=5, width=5, units = 'in', dpi=1000)
ove_intlink_dist_plt
The distribution is heavily skewed, with most pages having close to zero links.
This would be of some concern to an SEO manager.
While the overall distribution gives one view, it would be good to deep dive into the
distribution of internal links by crawl depth:
redir_live_urls.groupby('crawl_depth').agg(
    {'no_internal_links_to_url': ['describe']}).sort_values('crawl_depth')
The table describes the distribution of internal links by crawl depth or site level. Any
URL that is 3+ clicks away from the home page can expect two internal links on average.
This is probably the blog content as the marketing team produces a lot of it.
To visualize it graphically:
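A minimal sketch consistent with Figure 3-4 (aesthetics assumed):

intlink_dist_plt = (
    ggplot(redir_live_urls, aes(x = 'crawl_depth', y = 'no_internal_links_to_url')) +
    geom_boxplot(fill = 'blue', alpha = 0.6) +
    labs(y = 'Internal Links to URL', x = 'Site Level') +
    theme_classic() +
    theme(legend_position = 'none')
)
intlink_dist_plt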
Figure 3-4. Box plot distributions of the number of internal links to a URL by
site level
As suspected, the most variation is in the first level directly below the home page,
with very little variation beyond.
However, we can compare the variation between site levels for content in level 2 and
beyond. For a quick peek, we’ll use a logarithmic scale for the number of internal links
to a URL:
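The same box plot with a log scale added (a sketch):

intlink_dist_plt = (
    ggplot(redir_live_urls, aes(x = 'crawl_depth', y = 'no_internal_links_to_url')) +
    geom_boxplot(fill = 'blue', alpha = 0.6) +
    labs(y = 'Internal Links to URL', x = 'Site Level') +
    scale_y_log10() +
    theme_classic() +
    theme(legend_position = 'none')
)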
intlink_dist_plt.save(filename = 'images/1_log_intlink_dist_plt.png',
                      height=5, width=5, units = 'in', dpi=1000)
intlink_dist_plt
The picture is clearer and more insightful, as we can see how the distributions of the
lower site levels compare to each other (Figure 3-5).
Figure 3-5. Box plot distribution of the number of internal links by site level with
log-transformed vertical axis
For example, it’s much more obvious that the median number of inbound internal
links for pages on site level 2 is much higher than the lower levels.
It’s also very obvious that the variation in internal inbound links for pages in site
levels 3 and 4 is higher than those in levels 5 and 6.
Remember, though, the preceding view was achieved using a log scale on the same
input variable. What we've learned here is that a new variable taking the log of the
internal link count would yield a more helpful picture for comparing levels 2 to 10.
We’ll achieve this by creating a new column variable “log_intlinks” which is a log of
the internal link count. To avoid negative infinity values from taking a log of zero, we’ll
add 0.01 to the calculation:
redir_live_urls['log_intlinks'] = np.log2(redir_live_urls['no_internal_links_to_url'] + .01)
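A sketch of the box plot on the new log column:

intlink_dist_plt = (
    ggplot(redir_live_urls, aes(x = 'crawl_depth', y = 'log_intlinks')) +
    geom_boxplot(fill = 'blue', alpha = 0.6) +
    labs(y = 'Log Internal Links to URL', x = 'Site Level') +
    theme_classic() +
    theme(legend_position = 'none')
)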
intlink_dist_plt.save(filename = 'images/1c_loglinks_dist_plt.png',
                      height=5, width=5, units = 'in', dpi=1000)
intlink_dist_plt
The intlink_dist_plt plot (Figure 3-6) is quite similar to the log-scaled version, only this
time the numbers are easier to read because the vertical axis uses a normal scale. The
comparative averages and variations are easier to compare.
Figure 3-6. Box plot distributions of log-transformed internal links by site level
intlink_dist = redir_live_urls.groupby('crawl_depth').agg(
    {'no_internal_links_to_url': ['mean'],
     'log_intlinks': ['mean']}).reset_index()
intlink_dist.columns = ['_'.join(col) for col in intlink_dist.columns.values]
intlink_dist = intlink_dist.rename(columns = {'no_internal_links_to_url_mean': 'avg_int_links',
                                              'log_intlinks_mean': 'logavg_int_links'})
intlink_dist
The averages are in place by site level. Notice how the log column helps make the
range of values between crawl depths less extreme and skewed, that is, 4239 to 0.06
for the average vs. 12 to –6.39 for the log average, which makes it easier to normalize
the data.
Now we'll set the lower quantile at 35% for all site levels. This will use a custom
function, quantile_lower:
def quantile_lower(x):
    return x.quantile(.35).round(0)

quantiled_intlinks = redir_live_urls.groupby('crawl_depth').agg(
    {'log_intlinks': [quantile_lower]}).reset_index()
quantiled_intlinks.columns = ['_'.join(col) for col in quantiled_intlinks.columns.values]
quantiled_intlinks = quantiled_intlinks.rename(columns = {'crawl_depth_': 'crawl_depth',
                                                          'log_intlinks_quantile_lower': 'sd_intlink_lowqua'})
quantiled_intlinks
The lower quantile stats are set. Quartiles are limited to 25% increments, whereas
a quantile's lower limit can be set to any percentile, such as the 11th, 18th, 24th, etc.,
which is why we use quantiles instead of quartiles. The next steps are to join the data to
the main dataframe; then we'll apply a function to mark URLs that are underlinked for
their given site level:
redir_live_urls_underidx = redir_live_urls.merge(quantiled_intlinks, on = 'crawl_depth', how = 'left')
The following function assesses whether the URL has fewer links than the lower
quantile. If yes, then the value of "sd_int_uidx" is 1, otherwise 0:
def sd_intlinkscount_underover(row):
    if row['sd_intlink_lowqua'] > row['log_intlinks']:
        val = 1
    else:
        val = 0
    return val

redir_live_urls_underidx['sd_int_uidx'] = redir_live_urls_underidx.apply(
    sd_intlinkscount_underover, axis=1)
There’s some code to account for “Not Set” which are effectively orphaned URLs. In
this instance, we set these to 1 – meaning they’re underlinked:
redir_live_urls_underidx['sd_int_uidx'] = np.where(
    redir_live_urls_underidx['crawl_depth'] == 'Not Set', 1,
    redir_live_urls_underidx['sd_int_uidx'])
redir_live_urls_underidx
The dataframe shows that the column is in place marking underlinked URLs as 1.
With the URLs marked, we’re ready to get an overview of how under-linked the URLs are,
which will be achieved by aggregating by crawl depth and summing the total number of
underlinked URLs:
intlinks_agged = redir_live_urls_underidx.groupby('crawl_depth').agg(
    {'sd_int_uidx': ['sum', 'count']}).reset_index()
The following line tidies up the column names by inserting an underscore using a list
comprehension:
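intlinks_agged.columns = ['_'.join(col) for col in intlinks_agged.columns.values]
intlinks_agged = intlinks_agged.rename(columns = {'crawl_depth_': 'crawl_depth'})  # rename assumed, matching the printed output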
To get a proportion (or percentage), we divide the sum by the count and
multiply by 100:
intlinks_agged['sd_uidx_prop'] = intlinks_agged.sd_int_uidx_sum / intlinks_agged.sd_int_uidx_count * 100
print(intlinks_agged)
crawl_depth sd_int_uidx_sum sd_int_uidx_count sd_uidx_prop
0 0 0 1 0.000000
1 1 38 66 57.575758
2 2 67 169 39.644970
3 3 75 280 26.785714
4 4 57 253 22.529644
5 5 31 201 15.422886
6 6 9 122 7.377049
7 7 9 64 14.062500
8 8 3 17 17.647059
9 9 2 6 33.333333
10 10 0 1 0.000000
11 Not Set 2303 2303 100.000000
So even though the content in levels 1 and 2 has more links than any of the lower
levels, those levels have a higher proportion of underlinked URLs than any other site level
(apart from the orphans in Not Set, of course).
For example, 57% of pages just below the home page are underlinked.
Let’s visualize:
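A sketch of the count chart in Figure 3-7 (aesthetics assumed):

depth_uidx_plt = (
    ggplot(intlinks_agged, aes(x = 'crawl_depth', y = 'sd_int_uidx_sum')) +
    geom_bar(stat = 'identity', fill = 'blue', alpha = 0.6) +
    labs(y = '# URLs Under Linked', x = 'Site Level') +
    theme_classic() +
    theme(legend_position = 'none')
)
depth_uidx_plt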
It’s good to visualize using depth_uidx_plt because we can also see (Figure 3-7) that
levels 2, 3, and 4 have the most underlinked URLs by volume.
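The proportion chart (Figure 3-8) can be sketched similarly:

depth_uidx_prop_plt = (
    ggplot(intlinks_agged, aes(x = 'crawl_depth', y = 'sd_uidx_prop')) +
    geom_bar(stat = 'identity', fill = 'blue', alpha = 0.6) +
    labs(y = '% URLs Under Linked', x = 'Site Level') +
    theme_classic() +
    theme(legend_position = 'none')
)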
depth_uidx_prop_plt.save(filename = 'images/1_depth_uidx_prop_plt.png',
                         height=5, width=5, units = 'in', dpi=1000)
depth_uidx_prop_plt
Figure 3-8. Column chart of the proportion of under internally linked URLs by
site level
Underlinked URLs at a given site level aren't necessarily a problem; they may be that
way by design. However, they are worth reviewing to decide whether they really belong at
that site level or whether they deserve more internal links after all.
The following code exports the underlinked URLs to a CSV which can be viewed in
Microsoft Excel:
underlinked_urls = redir_live_urls_underidx.loc[redir_live_urls_underidx.sd_int_uidx == 1]
underlinked_urls = underlinked_urls.sort_values(['crawl_depth', 'no_internal_links_to_url'])
underlinked_urls.to_csv('exports/underlinked_urls.csv')
Given that not all pages earn inbound links, it is normally desired by SEOs to have
pages without backlinks crawled more often. So it would make sense to analyze and
explore opportunities to redistribute this PageRank to other pages within the website.
We’ll start by tacking on the AHREFs data to the main dataframe so we can see
internal links by page authority.
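A sketch of the join; ahrefs_df is the simplified backlink data from earlier:

intlinks_pageauth = redir_live_urls_underidx.merge(ahrefs_df, on = 'url', how = 'left')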
We now have page authority and referring domains at the URL level. Predictably,
the home page has a lot of referring domains (over 3000) and the most page-level
authority at 81.
As usual, we’ll perform some aggregations and explore the distribution of the
PageRank (interchangeable with page authority).
First, we’ll clean up the data to make sure we replace null values with zero:
intlinks_pageauth['page_authority'] = np.where(intlinks_pageauth['page_authority'].isnull(),
                                               0, intlinks_pageauth['page_authority'])
Aggregate by page authority:
intlinks_pageauth.groupby('page_authority').agg({'no_internal_links_to_url': ['describe']})
The preceding table shows the distribution of internal links by different levels of page
authority.
At the lower levels, most URLs have around two internal links.
A graph will give us the full picture:
# distribution of page_authority
page_authority_dist_plt = (ggplot(intlinks_pageauth, aes(x = 'page_authority')) +
                           geom_histogram(fill = 'blue', alpha = 0.6, bins = 30) +
                           labs(y = '# URLs', x = 'Page Authority') +
                           #scale_y_log10() +
                           theme_classic() +
                           theme(legend_position = 'none')
                           )
page_authority_dist_plt.save(filename = 'images/2_page_authority_dist_plt.png',
                             height=5, width=5, units = 'in', dpi=1000)
page_authority_dist_plt
Using the log scale, we can see how the higher levels of authority compare:
# distribution of page_authority
page_authority_dist_plt = (ggplot(intlinks_pageauth, aes(x = 'page_authority')) +
                           geom_histogram(fill = 'blue', alpha = 0.6, bins = 30) +
                           labs(y = '# URLs (Log)', x = 'Page Authority') +
                           scale_y_log10() +
                           theme_classic() +
                           theme(legend_position = 'none')
                           )
page_authority_dist_plt.save(filename = 'images/2_page_authority_dist_log_plt.png',
                             height=5, width=5, units = 'in', dpi=1000)
page_authority_dist_plt
Given this more insightful view, taking a log of “page_authority” to form a new
column variable “log_pa” is justified:
intlinks_pageauth['page_authority'] = np.where(intlinks_pageauth['page_authority'] == 0,
                                               .1, intlinks_pageauth['page_authority'])
intlinks_pageauth['log_pa'] = np.log2(intlinks_pageauth.page_authority)
intlinks_pageauth.head()
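A sketch of the transformed distribution plot (aesthetics assumed):

page_authority_trans_dist_plt = (
    ggplot(intlinks_pageauth, aes(x = 'log_pa')) +
    geom_histogram(fill = 'blue', alpha = 0.6, bins = 30) +
    labs(y = '# URLs', x = 'Log Page Authority') +
    theme_classic() +
    theme(legend_position = 'none')
)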
page_authority_trans_dist_plt.save(filename = 'images/2_page_authority_trans_dist_plt.png',
                                   height=5, width=5, units = 'in', dpi=1000)
page_authority_trans_dist_plt
The log values will be floored to make the 3,000+ URLs easier to band into categories:
intlinks_pageauth['pa_band'] = intlinks_pageauth['log_pa'].apply(np.floor)
def quantile_lower(x):
    return x.quantile(.4).round(0)

quantiled_pageau = intlinks_pageauth.groupby('pa_band').agg(
    {'no_internal_links_to_url': [quantile_lower]}).reset_index()
quantiled_pageau.columns = ['_'.join(col) for col in quantiled_pageau.columns.values]
quantiled_pageau = quantiled_pageau.rename(columns = {'pa_band_': 'pa_band',
                                                      'no_internal_links_to_url_quantile_lower': 'pa_intlink_lowqua'})
quantiled_pageau
Going by PageRank, we now have the minimum threshold of inbound internal links
we would expect. Time to join the data and mark the URLs that are underlinked for their
authority level:
intlinks_pageauth_underidx = intlinks_pageauth.merge(quantiled_pageau, on = 'pa_band', how = 'left')

def pa_intlinkscount_underover(row):
    if row['pa_intlink_lowqua'] > row['no_internal_links_to_url']:
        val = 1
    else:
        val = 0
    return val

intlinks_pageauth_underidx['pa_int_uidx'] = intlinks_pageauth_underidx.apply(
    pa_intlinkscount_underover, axis=1)
With the URLs marked, we can aggregate to see how many URLs there
are at each PageRank band and how many are underlinked:
pageauth_agged = intlinks_pageauth_underidx.groupby('pa_band').agg(
    {'pa_int_uidx': ['sum', 'count']}).reset_index()
pageauth_agged.columns = ['_'.join(col) for col in pageauth_agged.columns.values]
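The proportion column shown in the output is derived as before (a sketch):

pageauth_agged['uidx_prop'] = pageauth_agged['pa_int_uidx_sum'] / pageauth_agged['pa_int_uidx_count'] * 100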
print(pageauth_agged)
pa_band_ pa_int_uidx_sum pa_int_uidx_count uidx_prop
0 -4.0 0 1320 0.000000
1 3.0 0 1950 0.000000
2 4.0 77 203 37.931034
3 5.0 4 9 44.444444
4 6.0 0 1 0.000000
Most of the underlinked content appears to be those that have the highest page
authority, which is slightly contrary to what the site-level approach suggests (that pages
lower down are underlinked). That’s assuming most of the high authority pages are
closer to the home page.
What is the right answer? It depends on what we’re trying to achieve. Let’s continue
with more analysis for now and visualize the authority stats:
# distribution of page_authority
pageauth_agged_plt = (ggplot(intlinks_pageauth_underidx.loc[intlinks_pageauth_underidx['pa_int_uidx'] == 1],
                             aes(x = 'pa_band')) +
                      geom_histogram(fill = 'blue', alpha = 0.6, bins = 10) +
                      labs(y = '# URLs Under Linked', x = 'Page Authority Level') +
                      theme_classic() +
                      theme(legend_position = 'none')
                      )

pageauth_agged_plt.save(filename = 'images/2_pageauth_agged_hist.png',
                        height=5, width=5, units = 'in', dpi=1000)
pageauth_agged_plt
Figure 3-12. Distribution of under internally linked URLs by page authority level
Content Type
Perhaps it would be more useful to analyze this by content type, with a "quick and
dirty" analysis using the first subdirectory:
intlinks_content_underidx = intlinks_depthauth_underidx.copy()
To get the first subfolder, we’ll define a function that allows the operation to
continue in case of a fail (which would happen for the home page URL because
there is no subfolder). The k parameter specifies the number of slashes in the URL to
find the desired folder and parse the subdirectory name:
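A sketch of what get_folder might look like (the default k assumes URLs of the form https://host/folder/...):

def get_folder(url, k = 3):
    # Index 3 of the '/' split holds the first subfolder for https://host/folder/...
    try:
        return url.split('/')[k]
    except IndexError:
        return 'home'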
intlinks_content_underidx['content'] = intlinks_content_underidx['url'].apply(lambda x: get_folder(x))
intlinks_content_underidx.groupby('content').agg({'no_internal_links_to_url': ['describe']})
Wow, 183 subfolders! That's far too many for categorical analysis. We could break
it down and aggregate it into fewer categories using the ngram techniques described in
Chapter 9; feel free to try.
In any case, it looks like the site architecture is too flat and could be better structured
to be more hierarchical, that is, more pyramid like.
Also, many of the content folders only have one inbound internal link, so even
without the benefit of data science, it’s obvious these require SEO attention.
Next, both underlinked measures are combined, flagging URLs that are underlinked for
both their site level and their authority band:

intlinks_depthauth_underidx = intlinks_pageauth_underidx.copy()
intlinks_depthauth_underidx['depthauth_uidx'] = np.where(
    (intlinks_depthauth_underidx['sd_int_uidx'] +
     intlinks_depthauth_underidx['pa_int_uidx'] == 2), 1, 0)

'''intlinks_depthauth_underidx['depthauth_uidx'] = np.where(
    (intlinks_depthauth_underidx['sd_int_uidx'] == 1) &
    (intlinks_depthauth_underidx['pa_int_uidx'] == 1), 1, 0)'''
depthauth_uidx = intlinks_depthauth_underidx.groupby(['crawl_depth', 'pa_band']).agg(
    {'depthauth_uidx': 'sum'}).reset_index()
depthauth_urls = intlinks_depthauth_underidx.groupby(['crawl_depth', 'pa_band']).agg(
    {'url': 'count'}).reset_index()
depthauth_stats = depthauth_uidx.merge(depthauth_urls,
                                       on = ['crawl_depth', 'pa_band'], how = 'left')
depthauth_stats['depthauth_uidx_prop'] = (depthauth_stats['depthauth_uidx'] /
                                          depthauth_stats['url']).round(2)
depthauth_stats.sort_values('depthauth_uidx', ascending = False)
Most of the underlinked URLs are orphaned and have page authority (probably from
backlinks).
Visualize to get a fuller picture:
depthauth_stats_plt = (
    ggplot(depthauth_stats,
           aes(x = 'pa_band', y = 'crawl_depth', fill = 'depthauth_uidx')) +
    geom_tile(stat = 'identity', alpha = 0.6) +
    labs(y = '', x = '') +
    theme_classic() +
    theme(legend_position = 'right')
)

depthauth_stats_plt.save(filename = 'images/3_depthauth_stats_plt.png',
                         height=5, width=10, units = 'in', dpi=1000)
depthauth_stats_plt
There we have it, depthauth_stats_plt (Figure 3-13) shows most of the focus should
go into the orphaned URLs (which they should anyway), but more importantly we know
which orphaned URLs to prioritize over others.
Figure 3-13. Heatmap of page authority level, site level, and underlinked URLs
We can also see the extent of the issue. The second highest priority group of
underlinked URLs is at site levels 2, 3, and 4.
Anchor Texts
If the count and their distribution represent the quantitative aspect of internal links, then
the anchor texts could be said to represent their quality.
Anchor texts signal to search engines and users what content to expect after
accessing the hyperlink. This makes anchor texts an important signal and one worth
optimizing.
We'll start by aggregating the crawl data from Sitebulb to get an overview of
the issues (only the tail of this aggregation is certain, so the variable name and the
absence of a grouping are assumptions):

anchor_issues = crawl_data.agg({
    'no_anchors_with_empty_href': ['sum'],
    'no_anchors_with_leading_or_trailing_whitespace_in_href': ['sum'],
    'no_anchors_with_local_file': ['sum'],
    'no_anchors_with_localhost': ['sum'],
    'no_anchors_with_malformed_href': ['sum'],
    'no_anchors_with_no_text': ['sum'],
    'no_anchors_with_non_descriptive_text': ['sum'],
    'no_anchors_with_non-http_protocol_in_href': ['sum'],
    'no_anchors_with_url_in_onclick': ['sum'],
    'no_anchors_with_username_and_password_in_href': ['sum'],
    'no_image_anchors_with_no_alt_text': ['sum']
}).reset_index()
Over 4000 links with no descriptive anchor text jump out as the most common issue,
not to mention the 19 anchors with empty HREF (albeit very low in number).
To visualize:
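A sketch of the bar chart, melting the aggregate into long format first (names assumed):

anchor_issues_long = anchor_issues.melt(id_vars = 'index', var_name = 'issues',
                                        value_name = 'instances')
anchor_issues_count_plt = (
    ggplot(anchor_issues_long, aes(x = 'issues', y = 'instances')) +
    geom_bar(stat = 'identity', fill = 'blue', alpha = 0.6) +
    coord_flip() +
    labs(y = 'Instances', x = '') +
    theme_classic() +
    theme(legend_position = 'none')
)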
anchor_issues_count_plt.save(filename = 'images/4_anchor_issues_count_plt.png',
                             height=5, width=5, units = 'in', dpi=1000)
anchor_issues_count_plt
anchor_issues_levels = crawl_data.groupby('crawl_depth').agg({
    'no_anchors_with_empty_href': ['sum'],
    'no_anchors_with_leading_or_trailing_whitespace_in_href': ['sum'],
    'no_anchors_with_local_file': ['sum'],
    'no_anchors_with_localhost': ['sum'],
    'no_anchors_with_malformed_href': ['sum'],
    'no_anchors_with_no_text': ['sum'],
    'no_anchors_with_non_descriptive_text': ['sum'],
    'no_anchors_with_non-http_protocol_in_href': ['sum'],
    'no_anchors_with_url_in_onclick': ['sum'],
    'no_anchors_with_username_and_password_in_href': ['sum'],
    'no_image_anchors_with_no_alt_text': ['sum']
}).reset_index()
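The printed output below is in long format, so the aggregate is presumably flattened and melted first. A sketch (names assumed):

anchor_issues_levels.columns = ['_'.join(col).rstrip('_') for col in anchor_issues_levels.columns.values]
anchor_issues_levels = anchor_issues_levels.melt(id_vars = 'crawl_depth', var_name = 'issues',
                                                 value_name = 'instances')
anchor_issues_levels['issues'] = (anchor_issues_levels['issues']
                                  .str.replace('no_anchors_with_', '')
                                  .str.replace('_sum', ''))
anchor_issues_levels = anchor_issues_levels.sort_values('instances', ascending = False)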
print(anchor_issues_levels)
    crawl_depth                                  issues  instances
111     Not Set                    non_descriptive_text       2458
31      Not Set  leading_or_trailing_whitespace_in_href       2295
104           3                    non_descriptive_text        350
24            3  leading_or_trailing_whitespace_in_href        328
105           4                    non_descriptive_text        307
..          ...                                     ...        ...
85           13                                 no_text          0
84           12                                 no_text          0
83           11                                 no_text          0
82           10                                 no_text          0
0             0                              empty_href          0
Most of the issues are on orphaned pages, followed by URLs three to four levels deep.
To visualize:
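A sketch of the heatmap in Figure 3-15 (aesthetics assumed):

anchor_levels_issues_count_plt = (
    ggplot(anchor_issues_levels, aes(x = 'issues', y = 'crawl_depth', fill = 'instances')) +
    geom_tile(stat = 'identity', alpha = 0.6) +
    labs(y = 'Site Level', x = '') +
    theme_classic() +
    theme(legend_position = 'right',
          axis_text_x = element_text(rotation=90, hjust=1, size = 7))
)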
anchor_levels_issues_count_plt.save(filename = 'images/4_anchor_levels_issues_count_plt.png',
                                    height=5, width=5, units = 'in', dpi=1000)
anchor_levels_issues_count_plt
Figure 3-15. Heatmap of site level, anchor text issues, and instances
Merge with the crawl data using the URL as the primary key and then filter for
indexable URLs only:
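A sketch of the merge; the link-data column holding the destination URL and the indexable value are assumed:

anchor_merge = link_data.merge(crawl_data, left_on = 'target_url', right_on = 'url',
                               how = 'left')  # column names assumed
anchor_merge = anchor_merge.loc[anchor_merge['indexable'] == 'Yes']  # value assumed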
anchor_merge['crawl_depth'] = anchor_merge['crawl_depth'].astype('category')
anchor_merge['crawl_depth'] = anchor_merge['crawl_depth'].cat.reorder_categories(
    ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'Not Set'])
Then we compare the string similarity of the anchor text and title tag of the
destination URLs:
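A sketch, assuming the anchor text and destination title columns are named anchor_text and title:

anchor_merge['simi'] = anchor_merge.loc[:, ['anchor_text', 'title']].apply(
    lambda x: sorensen_dice(*x), axis=1)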
Any URLs with less than a 70% relevance score will be marked as irrelevant with a 1
in the new column "irrel_anchors".
Why 70%? This is from experience, and you're more than welcome to try different
thresholds.
With Sorensen-Dice, which is not only fast but meets SEO needs for measuring
relevance, 70% seems to be the right limit between relevance and irrelevance, especially
when accounting for the site markers in the title tag string:
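A sketch of the marking step:

anchor_merge['irrel_anchors'] = np.where(anchor_merge['simi'] < 0.7, 1, 0)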
anchor_merge['project'] = target_name
anchor_merge
anchor_rel_stats_site_agg = anchor_merge.groupby('project').agg(
    {'irrel_anchors': 'sum'}).reset_index()
anchor_rel_stats_site_agg['total_urls'] = anchor_merge.shape[0]
anchor_rel_stats_site_agg['irrel_anchors_prop'] = (anchor_rel_stats_site_agg['irrel_anchors'] /
                                                   anchor_rel_stats_site_agg['total_urls'])
print(anchor_rel_stats_site_agg)
project irrel_anchors total_urls irrel_anchors_prop
0 ON24 333946 350643 0.952382
About 95% of anchor texts on this site are irrelevant. How does this compare to their
competitors? That’s your homework.
Let’s go slightly deeper and analyze this by site depth:
anchor_rel_depth_irrels = anchor_merge.groupby(['crawl_depth']).agg(
    {'irrel_anchors': 'sum'}).reset_index()
anchor_rel_depth_urls = anchor_merge.groupby(['crawl_depth']).agg(
    {'project': 'count'}).reset_index()
anchor_rel_depth_stats = anchor_rel_depth_irrels.merge(anchor_rel_depth_urls,
                                                       on = 'crawl_depth', how = 'left')
anchor_rel_depth_stats['irrel_anchors_prop'] = (anchor_rel_depth_stats['irrel_anchors'] /
                                                anchor_rel_depth_stats['project'])
anchor_rel_depth_stats
Virtually all content at all site levels, with the exception of pages three clicks away
from the home page (probably blog posts), has irrelevant anchors.
Let's visualize:
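A sketch of the chart, plotting the irrelevant anchor proportion by site level (the variable name follows the save call below):

anchor_rel_stats_site_agg_plt = (
    ggplot(anchor_rel_depth_stats, aes(x = 'crawl_depth', y = 'irrel_anchors_prop')) +
    geom_bar(stat = 'identity', fill = 'blue', alpha = 0.6) +
    labs(y = 'Irrelevant Anchors Proportion', x = 'Site Level') +
    theme_classic() +
    theme(legend_position = 'none')
)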
anchor_rel_stats_site_agg_plt.save(filename = 'images/3_anchor_rel_stats_site_agg_plt.png',
                                   height=5, width=5, units = 'in', dpi=1000)
anchor_rel_stats_site_agg_plt
Location
More insight could be gained by looking at the location of the anchors:
anchor_rel_locat_irrels = anchor_merge.groupby(['location']).agg(
    {'irrel_anchors': 'sum'}).reset_index()
anchor_rel_locat_urls = anchor_merge.groupby(['location']).agg(
    {'project': 'count'}).reset_index()
anchor_rel_locat_stats = anchor_rel_locat_irrels.merge(anchor_rel_locat_urls,
                                                       on = 'location', how = 'left')
anchor_rel_locat_stats['irrel_anchors_prop'] = (anchor_rel_locat_stats['irrel_anchors'] /
                                                anchor_rel_locat_stats['project'])
anchor_rel_locat_stats
The irrelevant anchors are within the header or footer, which makes them relatively
easy to fix. Next, we count the most common anchor texts:
anchor_count = anchor_merge[['anchor_text']].copy()
anchor_count['count'] = 1
anchor_count_agg = anchor_count.groupby('anchor_text').agg({'count':
'sum'}).reset_index()
anchor_count_agg = anchor_count_agg.sort_values('count', ascending = False)
anchor_count_agg
There are 1,808 variations of anchor text, of which "Contact Us" is the most
popular, along with "Live Demo" and "Resources."
Let’s visualize using a word cloud. We’ll have to import the WordCloud package and
convert the dataframe into a dictionary:
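The WordCloud import isn't in the import block above; it comes from the wordcloud package:

from wordcloud import WordCloud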
data = anchor_count_agg.set_index('anchor_text').to_dict()['count']
data
wc = WordCloud(background_color='white',
               width=800, height=400,
               max_words=30).generate_from_frequencies(data)  # expects the dictionary built earlier
plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')

# Save image
wc.to_file("images/wordcloud.png")
plt.show()
Figure 3-17. Word cloud of the most commonly used anchor texts
The action from this point would be to find semiautomated rules to improve the
relevance of anchor texts, which is made easier by the fact that these sit within the
header or footer.
Landscape
import re
import time
import random
import pandas as pd
import numpy as np
import requests
import json
import plotnine
import tldextract
from plotnine import *
from mizani.transforms import trans
from client import RestClient
target_bu = 'boundless'
target_site = 'https://boundlesshq.com/'
target_name = target_bu
We start by obtaining the SERPs for your target keywords using the pandas read_csv
function. We’re interested in the URL which will form the input for querying the Google
PageSpeed API which gives us the CWV metric values:
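A sketch; the SERPs export filename is hypothetical:

desktop_serps_df = pd.read_csv('data/boundless_desktop_serps.csv')  # filename hypothetical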
The SERPs data can get a bit noisy, and ultimately the business is only interested in
their direct competitors, so we’ll create a list of them to filter the SERPs accordingly:
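A sketch of the competitor list, inferred from the URLs shown further below:

selected_sites = ['boundlesshq.com', 'papayaglobal.com', 'omnipresent.com',
                  'airswift.com', 'letsdeel.com', 'shieldgeo.com', 'remote.com']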
desktop_serps_select = desktop_serps_df[~desktop_serps_df['url'].isnull()].copy()
desktop_serps_select = desktop_serps_select[desktop_serps_select['url'].str.contains(
    '|'.join(selected_sites))]
desktop_serps_select
There are far fewer rows as a result, which means fewer API queries and less time
required to get the data.
Note the data is just for desktop, so this process would need to be repeated for
mobile SERPs also.
To query the PageSpeed API efficiently and avoid duplicate requests, we want a
unique set of URLs. We achieve this by exporting the URL column to a list and
deduplicating it:
desktop_serps_urls = desktop_serps_select['url'].to_list()
desktop_serps_urls = list(dict.fromkeys(desktop_serps_urls))
desktop_serps_urls
['https://papayaglobal.com/blog/how-to-avoid-permanent-establishment-risk/',
 'https://www.omnipresent.com/resources/permanent-establishment-risk-a-remote-workforce',
 'https://www.airswift.com/blog/permanent-establishment-risks',
 'https://www.letsdeel.com/blog/permanent-establishment-risk',
 'https://shieldgeo.com/ultimate-guide-permanent-establishment/',
 'https://remote.com/blog/what-is-permanent-establishment',
 'https://remote.com/lp/global-payroll',
 'https://remote.com/services/global-payroll?nextInternalLocale=en-us', . . . ]
With the list, we query the API, starting by setting the parameters for the API itself,
the device, and the API key (obtained by getting a Google Cloud Platform account which
is free):
base_url = 'https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url='
strategy = '&strategy=desktop'
api_key = '&key=[Your PageSpeed API key]'
Initialize an empty dictionary and a counter i, which will be used to help
us keep track of how many API calls have been made and how many are left:
desktop_cwv = {}
i = 1
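A sketch of the request loop (error handling omitted; the pacing is assumed):

for url in desktop_serps_urls:
    request_url = base_url + url + strategy + api_key
    response = requests.get(request_url)
    desktop_cwv[url] = response.json()
    print(i, 'of', len(desktop_serps_urls))
    i += 1
    time.sleep(random.uniform(1, 2))  # space out calls to the API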
The result is a dictionary containing the API response. To get this output into a
usable format, we iterate through the dictionary to pull out the actual CWV scores as
the API has a lot of other micro measurement data which doesn’t serve our immediate
objectives.
Initialize an empty list which will store the API response data:
desktop_psi_lst = []
Loop through the API output which is a JSON dictionary, so we need to pull out the
relevant “keys” and add them to the list initialized earlier:
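A sketch of the parsing loop; the JSON paths follow the Lighthouse audit names returned by the PageSpeed API, and the exact fields extracted are assumptions:

for url, result in desktop_cwv.items():
    try:
        audits = result['lighthouseResult']['audits']
        desktop_psi_lst.append({
            'url': url,
            'FCP': audits['first-contentful-paint']['numericValue'],
            'LCP': audits['largest-contentful-paint']['numericValue'],
            'CLS': audits['cumulative-layout-shift']['numericValue'],
            'FID': audits['max-potential-fid']['numericValue'],
            'SIS': audits['speed-index']['score'] * 100  # scaled 0-100
        })
    except KeyError:
        continue

desktop_psi_df = pd.DataFrame(desktop_psi_lst)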
The PageSpeed data on all of the ranking URLs is in a dataframe with all of the CWV
metrics:
• FCP: First Contentful Paint
• LCP: Largest Contentful Paint
• CLS: Cumulative Layout Shift
• FID: First Input Delay
• SIS: Speed Index Score
To show the relevance of the ranking (and hopefully the benefit to ranking by
improving CWV), we want to merge this with the rank data:
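A sketch of the merge on URL (dataframe names assumed):

desktop_psi_serps = desktop_serps_select.merge(desktop_psi_df, on = 'url', how = 'left')
desktop_psi_serps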
The dataframe is complete with the keyword, its rank, URL, device, and CWV
metrics.
At this point, rather than repeat near identical code for mobile, you can assume we
have the data for mobile which we have combined into a single dataframe using the
pandas concat function (same headings).
To add some additional features, we have added another column is_target indicating
whether the ranking URL is the client or not:
overall_psi_serps_bu['is_target'] = np.where(overall_psi_serps_bu['url'].str.contains(target_site),
                                             '1', '0')
overall_psi_serps_bu['site'] = overall_psi_serps_bu['url'].apply(lambda url: tldextract.extract(url).domain)
overall_psi_serps_bu['count'] = 1
The aggregation will be executed at the site level so we can compare how each site
scores on average for their CWV metrics and correlate that with performance:
overall_psi_serps_agg = overall_psi_serps_bu.groupby('site').agg({'LCP': 'mean',
                                                                  'FCP': 'mean',
                                                                  'CLS': 'mean',
                                                                  'FID': 'mean',
                                                                  'SIS': 'mean',
                                                                  'rank_absolute': 'mean',
                                                                  'count': 'sum'}).reset_index()
overall_psi_serps_agg = overall_psi_serps_agg.rename(columns = {'count': 'reach'})
Here are some operations to make the site names shorter for the graphs later:
overall_psi_serps_agg['site'] = np.where(overall_psi_serps_agg['site'] == 'papayaglobal',
                                         'papaya', overall_psi_serps_agg['site'])
overall_psi_serps_agg['site'] = np.where(overall_psi_serps_agg['site'] == 'boundlesshq',
                                         'boundless', overall_psi_serps_agg['site'])
overall_psi_serps_agg
That's the summary, although trends are not so easy to discern from a table, so now
we're ready to plot the data, starting with the overall speed index. The Speed Index Score
(SIS) is scaled between 0 and 100, 100 being the fastest and therefore best.
Note that in all of the charts that will compare Google rank with the individual CWV
metrics, the vertical axis will be inverted such that the higher the position, the higher the
ranking. This is to make the charts more intuitive and easier to understand.
SIS_cwv_landscape_plt = (
    ggplot(overall_psi_serps_agg,
           aes(x = 'SIS', y = 'rank_absolute', fill = 'site', colour = 'site',
               size = 'reach')) +
    geom_point(alpha = 0.8) +
    geom_text(overall_psi_serps_agg, aes(label = 'site'),
              position=position_stack(vjust=-0.08)) +
    labs(y = 'Google Rank', x = 'Speed Score') +
    scale_y_reverse() +
    scale_size_continuous(range = [7, 17]) +
    theme(legend_position = 'none', axis_text_x=element_text(rotation=0,
                                                             hjust=1, size = 12))
)

SIS_cwv_landscape_plt.save(filename = 'images/0_SIS_cwv_landscape.png',
                           height=5, width=8, units = 'in', dpi=1000)
SIS_cwv_landscape_plt
Figure 3-18. Scatterplot comparing speed scores and Google rank of different
websites
Boundless in this instance are doing relatively well. Although they don't rank the
highest, this could indicate that some aspects of CWV are not being attended to, that
something non-CWV related is at play, or, more likely, a combination of both.
LCP_cwv_landscape_plt = (
    ggplot(overall_psi_serps_agg,
           aes(x = 'LCP', y = 'rank_absolute', fill = 'site', colour = 'site',
               size = 'reach')) +
    geom_point(alpha = 0.8) +
    geom_text(overall_psi_serps_agg, aes(label = 'site'),
              position=position_stack(vjust=-0.08)) +
    labs(y = 'Google Rank', x = 'Largest Contentful Paint') +
    scale_y_reverse() +
    scale_size_continuous(range = [7, 17]) +
    theme(legend_position = 'none', axis_text_x=element_text(rotation=0,
                                                             hjust=1, size = 12))
)

LCP_cwv_landscape_plt.save(filename = 'images/0_LCP_cwv_landscape.png',
                           height=5, width=8, units = 'in', dpi=1000)
LCP_cwv_landscape_plt
The LCP_cwv_landscape_plt plot (Figure 3-19) shows that Papaya and Remote look
like outliers; in any case, the trend does indicate that the less time it takes to load the
largest content element, the higher the rank.
Figure 3-19. Scatterplot comparing Largest Contentful Paint (LCP) and Google
rank by website
FID_cwv_landscape_plt = (
    ggplot(overall_psi_serps_agg,
           aes(x = 'FID', y = 'rank_absolute', fill = 'site', colour = 'site',
               size = 'reach')) +
    geom_point(alpha = 0.8) +
    geom_text(overall_psi_serps_agg, aes(label = 'site'),
              position=position_stack(vjust=-0.08)) +
    labs(y = 'Google Rank', x = 'First Input Delay') +
    scale_y_reverse() +
    scale_x_log10() +
    scale_size_continuous(range = [7, 17]) +
    theme(legend_position = 'none', axis_text_x=element_text(rotation=0,
                                                             hjust=1, size = 12))
)

FID_cwv_landscape_plt.save(filename = 'images/0_FID_cwv_landscape.png',
                           height=5, width=8, units = 'in', dpi=1000)
FID_cwv_landscape_plt
Figure 3-20. Scatterplot comparing First Input Delay (FID) and Google rank
by website
The trend indicates that the less time it takes to make the page interactive for users,
the higher the rank.
Boundless are doing well in this respect.
CLS_cwv_landscape_plt = (
    ggplot(overall_psi_serps_agg,
           aes(x = 'CLS', y = 'rank_absolute', fill = 'site', colour = 'site',
               size = 'reach')) +
    geom_point(alpha = 0.8) +
    geom_text(overall_psi_serps_agg, aes(label = 'site'),
              position=position_stack(vjust=-0.08)) +
    labs(y = 'Google Rank', x = 'Cumulative Layout Shift') +
    scale_y_reverse() +
    scale_size_continuous(range = [7, 17]) +
    # theme reconstructed to match the sibling plots
    theme(legend_position = 'none', axis_text_x=element_text(rotation=0,
                                                             hjust=1, size = 12))
)

CLS_cwv_landscape_plt.save(filename = 'images/0_CLS_cwv_landscape.png',
                           height=5, width=8, units = 'in', dpi=1000)
CLS_cwv_landscape_plt
Figure 3-21. Scatterplot comparing Cumulative Layout Shift (CLS) and Google
rank by website
FCP_cwv_landscape_plt = (
    ggplot(overall_psi_serps_agg,
           aes(x = 'FCP', y = 'rank_absolute', fill = 'site', colour = 'site',
               size = 'reach')) +
    geom_point(alpha = 0.8) +
    geom_text(overall_psi_serps_agg, aes(label = 'site'),
              position=position_stack(vjust=-0.08)) +
    # tail reconstructed to match the sibling plots
    labs(y = 'Google Rank', x = 'First Contentful Paint') +
    scale_y_reverse() +
    scale_size_continuous(range = [7, 17]) +
    theme(legend_position = 'none', axis_text_x=element_text(rotation=0,
                                                             hjust=1, size = 12))
)

FCP_cwv_landscape_plt.save(filename = 'images/0_FCP_cwv_landscape.png',
                           height=5, width=8, units = 'in', dpi=1000)
FCP_cwv_landscape_plt
Figure 3-22. Scatterplot comparing First Contentful Paint (FCP) and Google rank
by website
That's the deep dive into the overall scores. The preceding example can be repeated
for both desktop and mobile scores to drill down into which specific CWV
metrics should be prioritized. Overall, for Boundless, CLS appears to be its weakest point.
In the following, we’ll summarize the analysis on a single chart by pivoting the data
in a format that can be used to power the single chart:
overall_psi_serps_long = overall_psi_serps_agg.copy()
overall_psi_serps_long = overall_psi_serps_long.melt(id_vars=['site'],
                                                     value_vars=['LCP', 'FCP', 'CLS', 'FID', 'SIS'],
                                                     var_name='Metric',
                                                     value_name='Index')
overall_psi_serps_long['x_axis'] = overall_psi_serps_long['Metric']
overall_psi_serps_long['site'] = np.where(overall_psi_serps_long['site'] == 'papayaglobal',
                                          'papaya', overall_psi_serps_long['site'])
overall_psi_serps_long['site'] = np.where(overall_psi_serps_long['site'] == 'boundlesshq',
                                          'boundless', overall_psi_serps_long['site'])
overall_psi_serps_long
speed_ex_plt = (
    ggplot(overall_psi_serps_long,
           aes(x = 'site', y = 'Index', fill = 'site')) +
    geom_bar(stat = 'identity', alpha = 0.8) +
    labs(y = '', x = '') +
    theme(legend_position = 'right',
          axis_text_x = element_text(rotation=90, hjust=1, size = 12),
          legend_title = element_blank()
          ) +
    facet_grid('Metric ~ .', scales = 'free')
)

speed_ex_plt.save(filename = 'images/0_CWV_Metrics_plt.png',
                  height=5, width=8, units = 'in', dpi=1000)
speed_ex_plt
The speed_ex_plt chart (Figure 3-23) shows the competitors being compared for
each metric. Remote seem to perform the worst on average, so their prominent rankings
are probably due to non-CWV factors.
Onsite CWV
The purpose of the landscape was to use data to motivate the client, colleagues, and
stakeholders of the SEO benefits that would follow CWV improvement. In this section,
we’re going to drill into the site itself to see where the improvements could be made.
We’ll start by importing the data and cleaning up the columns as usual:
target_crawl_raw = pd.read_csv('data/boundlesshq_com_all_urls__excluding_uncrawled__filtered_20220427203402.csv')
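The usual column cleanup:

target_crawl_raw.columns = [col.lower().replace('.','').replace('(','').replace(')','').replace(' ','_')
                            for col in target_crawl_raw.columns]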
We’re using Sitebulb crawl data, and we want to only include onsite indexable URLs
since those are the ones that rank, which we will filter as follows:
target_crawl_raw = target_crawl_raw.loc[target_crawl_raw['host'] == target_host]
target_crawl_raw = target_crawl_raw.loc[target_crawl_raw['indexable_status'] == 'Indexable']
target_crawl_raw = target_crawl_raw.loc[target_crawl_raw['content_type'] == 'HTML']
target_crawl_raw
With 279 rows, it’s a small website. The next step is to select the desired columns
which will comprise the CWV measures and anything that could possibly explain it:
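A sketch of the column selection; the book keeps 29 columns, and only those used below are shown here:

target_speedDist_df = target_crawl_raw[['url', 'crawl_depth', 'performance_score',
                                        'first_contentful_paint', 'largest_contentful_paint',
                                        'cumulative_layout_shift', 'time_to_interactive']].copy()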
target_speedDist_df
The dataframe columns have reduced from 71 to 29, and the CWV scores are more
apparent.
Attempting to analyze the site at the URL level will not be terribly useful, so to make
pattern identification easier, we will classify the content by folder location:
section_conds = [
target_speedDist_df['url'] == 'https://boundlesshq.com/',
target_speedDist_df['url'].str.contains('/guides/'),
target_speedDist_df['url'].str.contains('/how-it-works/')
]
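The labels matched to the conditions and the list of numeric columns aren't shown above; a sketch (values assumed):

section_vals = ['home', 'guides', 'how-it-works']  # labels assumed
target_speedDist_df['content'] = np.select(section_conds, section_vals, default = 'blog')

# Columns holding the CWV measures, to be coerced to numeric (list assumed)
cols = ['performance_score', 'first_contentful_paint', 'largest_contentful_paint',
        'cumulative_layout_shift', 'time_to_interactive']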
target_speedDist_df[cols] = pd.to_numeric(target_speedDist_df[cols].stack(),
                                          errors='coerce').unstack()
target_speedDist_df
A new column has been created in which each indexable URL is labeled by their
content category.
Time for some aggregation using groupby on “content”:
speed_dist_agg = target_speedDist_df.groupby('content').agg(
    {'url': 'count', 'performance_score': 'mean'}).reset_index()  # aggregation for performance_score assumed to be 'mean'
speed_dist_agg
Most of the content are guides followed by blog posts with three offer pages.
To visualize, we’re going to use a histogram showing the distribution of the overall
performance score and color code the URLs in the score columns by their segment.
The home page and the guides are by far the fastest.
target_speedDist_plt = (
    ggplot(target_speedDist_df,
           aes(x = 'performance_score', fill = 'content')) +
    geom_histogram(alpha = 0.8, bins = 20) +
    labs(y = 'Page Count', x = '\nSpeed Score') +
    # theme reconstructed to match the sibling plots
    theme(legend_position = 'right',
          axis_text_x = element_text(rotation=90, hjust=1, size = 7))
)

target_speedDist_plt.save(filename = 'images/3_target_speedDist_plt.png',
                          height=5, width=8, units = 'in', dpi=1000)
target_speedDist_plt
The target_speedDist_plt plot (Figure 3-24) shows the home page (in purple)
performs reasonably well with a speed score of 84. The guides vary, but most of these
have a speed above 80, and the majority of blog posts are in the 70s.
target_CLS_plt = (
    ggplot(target_speedDist_df,
           aes(x = 'cumulative_layout_shift', fill = 'content')) +
    geom_histogram(alpha = 0.8, bins = 20) +
    labs(y = 'Page Count', x = '\ncumulative_layout_shift') +
    # theme reconstructed to match the sibling plots
    theme(legend_position = 'right',
          axis_text_x = element_text(rotation=90, hjust=1, size = 7))
)

target_CLS_plt.save(filename = 'images/3_target_CLS_plt.png',
                    height=5, width=8, units = 'in', dpi=1000)
target_CLS_plt
So we now know which content templates to focus our CLS development efforts on.
target_FCP_plt = (
    ggplot(target_speedDist_df,
           aes(x = 'first_contentful_paint', fill = 'content')) +
    geom_histogram(alpha = 0.8, bins = 30) +
    labs(y = 'Page Count', x = '\nContentful paint') +
    theme(legend_position = 'right',
          axis_text_x = element_text(rotation=90, hjust=1, size = 7))
)

target_FCP_plt.save(filename = 'images/3_target_FCP_plt.png',
                    height=5, width=8, units = 'in', dpi=1000)
target_FCP_plt
target_LCP_plt = (
    ggplot(target_speedDist_df,
           aes(x = 'largest_contentful_paint', fill = 'content')) +
    geom_histogram(alpha = 0.8, bins = 20) +
    labs(y = 'Page Count', x = '\nlargest_contentful_paint') +
    theme(legend_position = 'right',
          axis_text_x = element_text(rotation=90, hjust=1, size = 7))
)

target_LCP_plt.save(filename = 'images/3_target_LCP_plt.png',
                    height=5, width=8, units = 'in', dpi=1000)
target_LCP_plt
target_LCP_plt (Figure 3-27) shows most guides and some blogs have the fastest LCP
scores; in any case, the blog template and the rogue guides would be the areas of focus.
target_FID_plt = (
    ggplot(target_speedDist_df,
           aes(x = 'time_to_interactive', fill = 'content')) +
    geom_histogram(alpha = 0.8, bins = 20) +
    labs(y = 'Page Count', x = '\ntime_to_interactive') +
    theme(legend_position = 'right',
          axis_text_x = element_text(rotation=90, hjust=1, size = 7))
)

target_FID_plt.save(filename = 'images/3_target_FID_plt.png',
                    height=5, width=8, units = 'in', dpi=1000)
target_FID_plt
The majority of the site appears in target_FID_plt (Figure 3-28) to enjoy fast FID
times, so this would be the least priority for CWV improvement.
Summary
In this chapter, we covered how a data-driven approach can be taken toward technical
SEO by way of
• Modeling page authority to estimate the benefit of technical SEO
recommendations to colleagues and clients
The next chapter will focus on using data to improve content and UX.
CHAPTER 4
Content and UX
Content and UX for SEO is about the quality of the experience you’re delivering to your
website users, especially when they are referred from search engines. This means a
number of things including but not limited to
By no means do we claim that this is the final word on data-driven SEO from a
content and UX perspective. What we will do is expose data-driven ways of solving the
most important SEO challenges using data science techniques, though not all of them
require data science.
For example, getting scientific evidence that fast page speeds are indicative of higher
ranked pages uses similar code from Chapter 6. Our focus will be on the various flavors
of content that best satisfies the user query: keyword mapping, content gap analysis, and
content creation.
• Decide what content to create for target keywords that will satisfy
users searching for them
Data Sources
Your most likely data sources will be a combination of
Keyword Mapping
While there is much to be gained from creating value-adding content, there is also
much to be gained from retiring or consolidating content. Consolidation means merging
one piece of content with another on the basis that they share the same search intent.
Assuming the keywords have been grouped together by search intent, the next stage is to
map them.
Keyword mapping is the process of mapping target keywords to pages and then
optimizing the page toward these – as a result, maximizing a site’s rank position potential
in the search result. There are a number of approaches to achieve this:
• TF-IDF
• String matching
We recommend string matching as it’s fast, reasonably accurate, and the easiest
to deploy.
String Matching
String matching works to see how many strings overlap and is used in DNA sequencing.
String matching can work in two ways, which are to either treat strings as one object or
strings made up of tokens (i.e., words within a string). We’re opting for the latter because
words mean something to humans and are not serial numbers. For that reason, we’ll be
using Sorensen-Dice which is fast and accurate compared to others we’ve tested.
The following code extract shows how we use string distance to map keywords to
content by seeking the most similar URL titles to the target keyword. Let’s go, importing
libraries:
import requests
from requests.exceptions import ReadTimeout
from json.decoder import JSONDecodeError
import re
import time
import random
import pandas as pd
import numpy as np
import datetime
from client import RestClient
import json
import py_stringmatching as sm
from textdistance import sorensen_dice
from plotnine import *
import matplotlib.pyplot as plt
target = 'wella'
We’ll start by importing the crawl data, which is a CSV export of website auditing
software, in this case from “Sitebulb”:
crawl_raw = pd.read_csv('data/www_wella_com_internal_html_urls_by_indexable_status_filtered_20220629220833.csv')
crawl_raw.columns = [col.lower().replace('(','').replace(')','').replace('%','').replace(' ', '_')
                     for col in crawl_raw.columns]
crawl_df = crawl_raw.copy()
We’re only interested in indexable pages as those are the URLs available for
mapping:
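A sketch of the filter, assuming the Sitebulb indexable_status column:

crawl_df = crawl_df.loc[crawl_df['indexable_status'] == 'Indexable']  # column and value assumed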
The crawl import is complete. However, we’re only interested in the URL and title as
that’s all we need for mapping keywords to URLs. Still it’s good to import the whole file to
visually inspect it, to be more familiar with the data.
The dataframe is showing the URLs and titles. Let’s load the keywords we want to
map that have been clustered using techniques in Chapter 2:
keyword_discovery = pd.read_csv('data/keyword_discovery.csv')
The dataframe shows the topics, keywords, number of search engine results for the
keywords, topic web search results, and the topic group. Note these were clustered using
the methods disclosed in Chapter 2.
We’ll map the topic as this is the central keyword that would also rank for their topic
group keywords. This means we only require the topic column.
total_mapping_simi = keyword_discovery[['topic']].copy().drop_duplicates()
We want all the combinations of topics and URL titles before we can test each
combination for string similarity. We achieve this using the cross-product merge:
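A sketch of the cross merge (the crawl column names url and title are assumed):

total_mapping_simi = total_mapping_simi.merge(crawl_df[['url', 'title']], how = 'cross')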
A new column “test” is created which will be formatted to remove boilerplate brand
strings and force lowercase. This will make the string matching values more accurate.
total_mapping_simi['test'] = total_mapping_simi['title']
total_mapping_simi['test'] = total_mapping_simi['test'].str.lower()
total_mapping_simi['test'] = total_mapping_simi['test'].str.replace(' \| wella', '')
total_mapping_simi
Now we’re ready to compare strings by creating a new column “simi,” meaning
string similarity. The scores will take the topic and test columns as inputs and feed the
sorensen_dice function imported earlier:
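A sketch:

total_mapping_simi['simi'] = total_mapping_simi.loc[:, ['topic', 'test']].apply(
    lambda x: sorensen_dice(*x), axis=1)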
The simi column has been added complete with scores. A score of 1 is identical, and
0 is completely dissimilar. The next stage is to select the closest matching URLs to topic
keywords:
keyword_mapping_grp = total_mapping_simi.copy()
The dataframe is first sorted by similarity score and topic in descending order so that
the first row by topic is the closest matching:
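A sketch of the sort:

keyword_mapping_grp = keyword_mapping_grp.sort_values(['topic', 'simi'], ascending = False)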
After sorting, we use the first() function to select the top matching URL for each topic
using the groupby() function:
keyword_mapping_grp = keyword_mapping_grp.groupby('topic').first().reset_index()
keyword_mapping_grp
Each topic now has its closest matching URL. The next stage is to decide whether
these matches are good enough or not:
At this point, we eyeball the data to see what threshold number is good enough. I’ve
gone with 0.7 or 70% as it seems to do the job mostly correctly, which is to act as the
natural threshold for matching test content to URLs.
Using np.where(), which is equivalent to Excel’s IF formula, we’ll make any rows
exceeding 0.7 as “mapped” and the rest as “unmatched”:
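A sketch, with the threshold per the text:

keyword_mapping = keyword_mapping_grp.copy()
keyword_mapping['mapped'] = np.where(keyword_mapping['simi'] > 0.7, 'mapped', 'unmatched')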
keyword_mapping
Finally, we have keywords mapped to URLs and some stats on the overall exercise.
keyword_mapping_aggs = keyword_mapping.copy()
keyword_mapping_aggs = keyword_mapping_aggs.groupby('mapped').count().reset_index()
keyword_mapping_aggs
• Content gaps: The extent to which the brand is not visible for
keywords that form the content set
Without this analysis, your site risks being left behind in terms of audience reach and
also appearing less authoritative because your site appears less knowledgeable about the
topics covered by your existing content. This is particularly important when considering
the buying cycle. Let’s imagine you’re booking a holiday, and now imagine the variety
of search queries that you might use as you carry out that search, perhaps searching
by destination (“beach holidays to Spain”), perhaps refining by a specific requirement
(“family beach holidays in Spain”), and then more specific including a destination
(Majorca), and perhaps (“family holidays with pool in Majorca”). Savvy SEOs think
deeply about mapping customer demand (right across the search journey) to compelling
landing page (and website) experiences that can satisfy this demand. Data science
enables you to manage this opportunity at a significant scale.
Warnings and motivations over, let’s roll starting with the usual package loading:
import re
import time
import random
import pandas as pd
import numpy as np
OS and Glob allow the environment to read the SEMRush files from a folder:
import os
import glob
pd.set_option('display.max_colwidth', None)
These variables are set in advance so that when copying this script over for another
site, the script can be run with minimal changes to the code:
root_domain = 'wella.com'
hostdomain = 'www.wella.com'
hostname = 'wella'
full_domain = 'https://www.wella.com'
target_name = 'Wella'
With the variables set, we’re now ready to start importing data.
Getting the Data
We set the directory path where all of the SEMRush files are stored:
data_dir = os.path.join('data/semrush/')
Glob reads all of the files in the folder, and we store the output in a variable
“semrush_csvs”:
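A sketch:

semrush_csvs = glob.glob(os.path.join(data_dir, '*.csv'))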
['data/hair.com-organic.Positions-uk-20220704-2022-07-05T14_04_59Z.csv',
 'data/johnfrieda.com-organic.Positions-uk-20220704-2022-07-05T13_29_57Z.csv',
 'data/madison-reed.com-organic.Positions-uk-20220704-2022-07-05T13_38_32Z.csv',
 'data/sebastianprofessional.com-organic.Positions-uk-20220704-2022-07-05T13_39_13Z.csv',
 'data/matrix.com-organic.Positions-uk-20220704-2022-07-05T14_04_12Z.csv',
 'data/wella.com-organic.Positions-uk-20220704-2022-07-05T13_30_29Z.csv',
 'data/redken.com-organic.Positions-uk-20220704-2022-07-05T13_37_31Z.csv',
 'data/schwarzkopf.com-organic.Positions-uk-20220704-2022-07-05T13_29_03Z.csv',
 'data/garnier.co.uk-organic.Positions-uk-20220704-2022-07-05T14_07_16Z.csv']
Initialize the final dataframe where we’ll be storing the imported SEMRush data:
semrush_raw_df = pd.DataFrame()
semrush_li = []
The for loop uses the pandas read_csv() function to read the SEMRush CSV file and
extract the filename which is put into a new column “filename.” A bit superfluous to
requirements but it will help us know where the data came from.
Once the data is read, it is added to the semrush_li list we initialized earlier:
for cf in semrush_csvs:
    df = pd.read_csv(cf, index_col=None, header=0)
    df['filename'] = os.path.basename(cf)
    df['filename'] = df['filename'].str.replace('.csv', '')
    df['filename'] = df['filename'].str.replace('_', '.')
    semrush_li.append(df)
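The list of dataframes is then concatenated into the final dataframe (a sketch):

semrush_raw_df = pd.concat(semrush_li, axis=0, ignore_index=True)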
semrush_raw_df.columns = (semrush_raw_df.columns.str.strip().str.lower()
                          .str.replace(' ', '_').str.replace('(', '').str.replace(')', ''))
A site column is created so we know which content the site belongs to. Here, we used
regex on the filename column, but we could have easily derived this from the URL also:
semrush_raw_df['site'] = semrush_raw_df['filename'].str.extract('(.*?)\-')
semrush_raw_df.head()
That’s the dataframe, although we’re more interested in the keywords and the site it
belongs to.
semrush_raw_presect = semrush_raw_sited.copy()
semrush_raw_presect = semrush_raw_presect[['keyword', 'site']]
semrush_raw_presect
The aim of the exercise is to find keywords to two or more competitors which will
define the core content set.
To achieve this, we will use a list comprehension to split the semrush_raw_presect
dataframe by site into unnamed dataframes:
df1, df2, df3, df4, df5, df6, df7, df8, df9 = [
    x for _, x in semrush_raw_presect.groupby(semrush_raw_presect['site'])]
Now that each dataframe has the site and keywords, we can dispense with the site
column as we’re only interested in the keywords and not where they come from.
We start by defining a list of dataframes, df_list:
df_list = [df1, df2, df3, df4, df5, df6, df7, df8, df9]
df1
keywords_lists = []
A list comprehension goes through all of the keyword sets in df_list, converting each to a
list, to get a list of keyword lists:
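keywords_lists = [df['keyword'].tolist() for df in df_list]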
The lists within the list of lists are too long to print here; however, the double bracket
at the beginning should show this is indeed a list of lists.
keywords_lists
[['garnier',
  'hair colour',
  'garnier.co.uk',
  'garnier hair color',
  'garnier hair colour',
  'garnier micellar water',
  'garnier hair food',
  'garnier bb cream',
  'garnier face mask',
  'bb cream from garnier',
  'garnier hair mask',
  'garnier shampoo',
  'hair dye', ...
lst_1
['garnier',
 'hair colour',
 'garnier.co.uk',
 'garnier hair color',
 'garnier hair colour',
 'garnier micellar water',
 'garnier hair food',
 'garnier bb cream',
 'garnier face mask',
 'bb cream from garnier',
 'garnier hair mask',
 'garnier shampoo',
 'hair dye',
 'garnier hair dye',
 'garnier shampoo bar',
 'garnier vitamin c serum', ...
Now we want to generate combinations of lists so we can control how each of the
site’s keywords get intersected:
The dictionary comprehension will append each list into a dictionary we create
called keywords_dict, where the key (index) is the number of the list:
keywords_dict.keys()
We get the list numbers. The reason it goes from 0 to 8 and not 1 to 9 is because
Python uses zero indexing, which means it starts from zero:
dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8])
Now we’ll convert the keys to a list for ease of manipulation shortly:
keys_list = list(keywords_dict.keys())
keys_list
[0, 1, 2, 3, 4, 5, 6, 7, 8]
With the list, we can construct combinations of the site's keywords to intersect.
The intersection of the website keyword lists will be the words that are common to the
websites.
Creating the Combinations
Initialize list_combos which will be a list of the combinations generated:
list_combos = []
A list comprehension uses the combinations() function to take four of the site keyword lists at a time and store each combination in list_combos using the append() function. Each combination is converted into a list so that list_combos will be a list of lists:
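A minimal sketch of that comprehension, assuming combinations() comes from itertools:
from itertools import combinations
[list_combos.append(list(combo)) for combo in combinations(keys_list, 4)]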
list_combos
[[0, 1, 2, 3],
[0, 1, 2, 4],
[0, 1, 2, 5],
[0, 1, 2, 6],
[0, 1, 2, 7],
[0, 1, 2, 8],
[0, 1, 3, 4],
[0, 1, 3, 5],
[0, 1, 3, 6], ...
With the list of lists, we’re ready to start intersecting the keyword lists to build the
core content (keyword) set.
keywords_intersected = []
Define the multi_intersect function which takes a list of dictionaries and their keys,
then finds the common keywords (i.e., intersection), and adds it to the keywords_
intersected list.
The function can be adapted to compare just two sites, three sites, and so on. Just ensure you rerun the combinations function with the desired number of lists and edit the function accordingly.
Using the list comprehension, we loop through the list of combinations list_combos
to run the multi_intersect function which takes the dictionary containing all the site
keywords (keywords_dict), pulls the appropriate keywords, and finds the common ones,
before adding to keywords_intersected:
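A minimal sketch of the function and the comprehension that calls it; the exact implementation isn't shown in this excerpt:
def multi_intersect(kw_dict, combo):
    # convert each site's keyword list in the combination to a set and intersect them all
    keyword_sets = [set(kw_dict[key]) for key in combo]
    return list(set.intersection(*keyword_sets))

keywords_intersected = [multi_intersect(keywords_dict, combo) for combo in list_combos]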
And we get a list of lists, because each list is an iteration of the function for each
combination:
keywords_intersected
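The flattening step isn't shown here; a minimal sketch that unpacks the list of lists into one flat list:
flat_keywords_intersected = [kw for combo in keywords_intersected for kw in combo]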
unique_keywords_intersected = list(set(flat_keywords_intersected))
print(len(flat_keywords_intersected), len(unique_keywords_intersected))
87031 8380
There were 87K keywords originally and 8380 keywords post deduplication.
unique_keywords_intersected
That's the list, but we're not done yet, as we still need to establish the gap, which is what we really want to know.
Establishing Gap
The question is which keywords is "Wella" not targeting, and how many are there? We'll start by filtering the SEMRush data for the target site, Wella.com:
target_semrush = semrush_raw_sited.loc[semrush_raw_sited['site'] ==
root_domain]
And then we include only the keywords in the core content set:
target_on = target_semrush.loc[target_semrush['keyword'].isin(unique_keywords_intersected)]
target_on
Let’s get some stats starting with the number of keywords in the preceding dataframe
and the number of keywords in the core content set:
print(target_on[['keyword']].drop_duplicates().shape[0], len(unique_keywords_intersected))
6936 8380
So just under 70% of Wella’s keyword content is in the core content set, which is
about 1.4K keywords short.
To find the 6.9K intersect keywords, we can use the list and set functions:
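A minimal sketch of that intersection, using Python's built-in set operations:
target_intersect = list(set(target_on['keyword']) & set(unique_keywords_intersected))
len(target_intersect)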
To find the keywords that are not in the core content set, that is, the content gap, we’ll
remove the target SEMRush keywords from the core content set:
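A minimal sketch of the set difference; target_gap is the name the filtering code below expects:
target_gap = list(set(unique_keywords_intersected) - set(target_semrush['keyword']))
len(target_gap)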
Now that we know what these gap keywords are, we can filter the dataframe by listing
keywords:
cga_semrush = semrush_raw_sited.loc[semrush_raw_sited['keyword'].isin(target_gap)]
cga_semrush
We only want the highest ranked target URLs per keyword, which we’ll achieve with
a combination of sort_values(), groupby(), and first():
cga_unique = cga_semrush.sort_values('position').groupby('keyword').first().reset_index()
cga_unique['project'] = target_name
Ready to export:
cga_unique.to_csv('exports/cga_unique.csv')
cga_unique
3. Check the search results for each heading as writers can phrase
the intent differently
This strategy won’t work for all verticals as there’s a lot of noise in some market
sectors compared to others. For example, with hair styling articles, a lot of the headings
(and their sections) are celebrity names which will not have the same detectable search
intent as another celebrity.
In contrast, in other verticals this method works really well because there aren’t
endless lists with the same HTML heading tags shared with related article titles (e.g.,
“Drew Barrymore” and “54 ways to wear the modern Marilyn”).
Instead, the headings are fewer in number and have a meaning in common, for
example, “What is account-based marketing?” and “Defining ABM,” which is something
Google is likely to understand.
With those caveats in mind, let’s go.
import requests
from requests.exceptions import ReadTimeout
from json.decoder import JSONDecodeError
import re
import time
import random
import pandas as pd
import numpy as np
import datetime
import json
from tldextract import extract  # used below to parse hostnames and domains from URLs
target = 'on24'
These are the keywords the target website wants to rank for. There are only eight keywords, but as you'll see, this process generates a lot of noisy data, which will need cleaning up:
serps_input
The extract function from the TLD extract package is useful for extracting the
hostname and domain name from URLs:
serps_input_clean = serps_input.copy()
serps_input_clean['url'] = serps_input_clean['url'].astype(str)
serps_input_clean['host'] = serps_input_clean['url'].apply(lambda x: extract(x))
Extract the hostname by taking the penultimate list element from the list using the
string get method:
serps_input_clean['host'] = serps_input_clean['host'].str.get(-2)
serps_input_clean['site'] = serps_input_clean['url'].apply(lambda x: extract(x))
serps_input_clean['site'] = [list(lst) for lst in serps_input_clean['site']]
Only this time, we want both the hostname and the top-level domain (TLD) which
we will join to form the site or domain name:
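The join into a single site string isn't shown; a minimal sketch, assuming the last two elements of the extracted list (domain and suffix) are joined with a dot:
serps_input_clean['site'] = serps_input_clean['site'].str[-2:].str.join('.')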
serps_input_clean
The augmented dataframe shows the host and site columns added.
This line allows the column values to be read by setting the column widths to their
maximum value:
pd.set_option('display.max_colwidth', None)
Crawling the Content
The next step is to get a list of top ranking URLs that we’ll crawl for their content sections:
serps_to_crawl_df = serps_input_clean.copy()
There are some sites not worth crawling because they won’t let you, which are
defined in the following list:
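The dont_crawl list isn't shown in this excerpt; the hostnames below are purely hypothetical examples to illustrate the format:
# hypothetical hostnames - substitute the sites that block your crawler
dont_crawl = ['youtube', 'facebook', 'linkedin', 'twitter', 'quora']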
serps_to_crawl_df = serps_to_crawl_df.loc[~serps_to_crawl_df['host'].isin(dont_crawl)]
We’ll also remove nulls and sites outside the top 10:
serps_to_crawl_df = serps_to_crawl_df.loc[~serps_to_crawl_df['domain'].isnull()]
serps_to_crawl_df = serps_to_crawl_df.loc[serps_to_crawl_df['rank'] < 10]
serps_to_crawl_df.head(10)
With the dataframe filtered, we just want the URLs to export to our desktop crawler.
Some URLs may rank for multiple search phrases. To avoid crawling the same URL
multiple times, we’ll use drop_duplicates() to make the URL list unique:
serps_to_crawl_upload = serps_to_crawl_df[['url']].drop_duplicates()
serps_to_crawl_upload.to_csv('data/serps_to_crawl_upload.csv', index=False)
serps_to_crawl_upload
Now we have a list of 62 URLs to crawl, which cover the eight target keywords.
Let’s import the results of the crawl:
crawl_raw = pd.read_csv('data/all_inlinks.csv')
pd.set_option('display.max_columns', None)
Using a list comprehension, we’ll clean up the column names to make it easier to
work with:
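The comprehension itself isn't shown; a minimal sketch, assuming lowercasing and underscores (the exact replacements are an assumption):
crawl_raw.columns = [col.lower().replace(' ', '_') for col in crawl_raw.columns]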
Print out the column names to see how many extractor fields were extracted:
print(crawl_raw.columns)
There are 6 primary headings (H1 in HTML) and 65 H2 headings altogether. These
will form the basis of our content sections which tell us what content should be on
those pages.
crawl_raw
crawl_headings = crawl_raw.loc[crawl_raw['link_position'] ==
'Content'].copy()
The dataframe also contains columns that are superfluous to our requirements, such as link_position and link_origin. We can remove these by listing the columns by position (which saves space and avoids typing out the many column names!).
drop_cols = [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20]
Using the .drop() method, we can drop multiple columns in place (i.e., without
having to copy the result onto itself ):
crawl_headings.drop(crawl_headings.columns[drop_cols], axis = 1,
inplace = True)
Rename the columns from source to URL, which will be useful for joining later:
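The rename isn't shown in this excerpt; a minimal sketch, assuming the crawler's source column simply becomes url:
crawl_headings = crawl_headings.rename(columns = {'source': 'url'})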
crawl_headings
With the desired columns of URL and their content section columns, these need to
be converted to long format, where all of the sections will be in a single column called
“heading”:
crawl_headings_long = crawl_headings.copy()
We’ll want a list of the extractor column names (again to save typing) by subsetting
the dataframe from the second column onward using .iloc and extracting the column
names (.columns.values):
Using the .melt() function, we’ll pivot the dataframe to reshape the content sections
into a single column “heading” using the preceding list:
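A minimal sketch of those two steps; the id and value column names are assumptions:
heading_cols = crawl_headings_long.iloc[:, 1:].columns.values.tolist()
crawl_headings_long = crawl_headings_long.melt(id_vars = 'url', value_vars = heading_cols,
                                               var_name = 'position', value_name = 'heading')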
crawl_headings_long = crawl_headings_long.loc[~crawl_headings_long['heading'].isnull()]
crawl_headings_long = crawl_headings_long.drop_duplicates()
crawl_headings_long
The resulting dataframe shows the URL, the heading, and the position where the first
number denotes whether it was an h1 or h2 and the second number indicates the order
of the heading on the page. The heading is the text value.
You may observe that the heading contains some values that are not strictly content
but boilerplate content that is sitewide, such as Company, Resources, etc. These will
require removal at some point.
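The join of the crawled headings onto the SERPs data isn't shown in this excerpt; a minimal sketch, assuming a left join on the URL before the copy that follows:
serps_to_crawl_df = serps_to_crawl_df.merge(crawl_headings_long, on = 'url', how = 'left')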
serps_headings = serps_to_crawl_df.copy()
serps_headings['heading'] = np.where(serps_headings['heading'].isnull(),
'', serps_headings['heading'])
serps_headings['project'] = 'target'
serps_headings
With the data joined, we’ll take the domain, heading, and the position:
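A minimal sketch of that selection; headings_tosum is the name the code below expects, and the exact columns are assumptions:
headings_tosum = serps_headings[['domain', 'heading', 'position']].copy()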
Split position by underscore and extract the last number in the list (using -1) to get
the order the heading appears on the page:
headings_tosum['pos_n'] = headings_tosum['position'].str.split('_').str[-1]
headings_tosum['pos_n'] = headings_tosum['pos_n'].astype(float)
headings_tosum['count'] = 1
headings_tosum
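The aggregations that feed the next two steps fall on a page not reproduced here. A minimal sketch, assuming a count of each heading per domain and a count of each heading overall (the column choices are assumptions):
domsheadings_tosum_agg = (headings_tosum.groupby(['domain', 'heading'])
                          .agg({'count': 'sum', 'pos_n': 'mean'}).reset_index())
headings_tosum_agg = (headings_tosum.groupby('heading')
                      .agg({'count': 'sum', 'pos_n': 'mean'}).reset_index())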
stop_headings = domsheadings_tosum_agg.loc[domsheadings_tosum_agg['count'] > 1]
stop_headings = stop_headings.loc[stop_headings['heading'].str.contains('\n')]
stop_headings = stop_headings['heading'].tolist()
stop_headings
We’ll now analyze the headings per se, starting by counting the number of headings:
headings_tosum_agg = headings_tosum_agg.loc[~headings_tosum_agg['heading'].isin(stop_headings)]
headings_tosum_agg = headings_tosum_agg.loc[headings_tosum_agg['heading'] != '']
headings_tosum_agg.head(10)
The dataframe looks to contain more sensible content headings with the exception
of “company,” which also is much further down the order of the page at 25.
headings_tosum_filtered = headings_tosum_agg.copy()
headings_tosum_filtered = headings_tosum_filtered.loc[headings_tosum_filtered['count'] < 10]
headings_tosum_filtered['tokens'] = headings_tosum_filtered['heading'].str.count(' ') + 1
headings_tosum_filtered['heading'] = headings_tosum_filtered['heading'].str.strip()
Split heading using colons as a punctuation mark and extract the right-hand side of the colon:
headings_tosum_filtered['heading'] = headings_tosum_filtered['heading'].str.split(':').str[-1]
headings_tosum_filtered['heading'] = headings_tosum_filtered['heading'].str.split('.').str[-1]
headings_tosum_filtered = headings_tosum_filtered.loc[~headings_tosum_filtered['heading'].str.contains('[0-9] of [0-9]', regex = True)]
Remove headings that are less than 5 words long or more than 12:
headings_tosum_filtered = headings_tosum_filtered.loc[headings_tosum_filtered['tokens'].between(5, 12)]
headings_tosum_filtered = headings_tosum_filtered.sort_values('count', ascending = False)
headings_tosum_filtered = headings_tosum_filtered.loc[headings_tosum_filtered['heading'] != '']
headings_tosum_filtered.head(10)
Now we have headings that look more like actual content sections. These are now
ready for clustering.
Cluster Headings
The reason for clustering is that writers will describe the same section heading using
different words and deliberately so as to avoid copyright infringement and plagiarism.
However, Google is smart enough to know that “webinar best practices” and “best
practices for webinars” are the same.
To make use of Google’s knowledge, we’ll make use of the SERPs to see if the search
results of each heading are similar enough to know if they mean the same thing or not
(i.e., whether the underlying meaning or intent is the same).
We’ll create a list and use the search intent clustering code (see Chapter 2) to
categorize the headings into topics:
headings_to_cluster = headings_tosum_filtered[['heading']].drop_duplicates()
headings_to_cluster = headings_to_cluster.loc[~headings_to_cluster['heading'].isnull()]
headings_to_cluster = headings_to_cluster.rename(columns = {'heading': 'keyword'})
headings_to_cluster
With the headings clustered by search intent, we’ll import the results:
topic_keyw_map = pd.read_csv('data/topic_keyw_map.csv')
Let’s rename the keyword column to heading, which we can use to join to the SERP
dataframe later:
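A minimal sketch of the rename:
topic_keyw_map = topic_keyw_map.rename(columns = {'keyword': 'heading'})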
topic_keyw_map
The dataframe shows the heading and the meaning of the heading as “topic.” The
next stage is to get some statistics and see how many headings constitute a topic. As the
topics are the central meaning of the headings, this will form the core content sections
per target keyword.
topic_keyw_map_agg = topic_keyw_map.copy()
topic_keyw_map_agg['count'] = 1
topic_keyw_map_agg = topic_keyw_map_agg.groupby('topic').agg({'count':
'sum'}).reset_index()
topic_keyw_map_agg = topic_keyw_map_agg.sort_values('count',
ascending = False)
topic_keyw_map_agg
serps_topics_merge = serps_headings.copy()
serps_topics_merge['heading'] = serps_topics_merge['heading'].str.lower()
serps_topics_merge = serps_topics_merge.merge(topic_keyw_map, on =
'heading', how = 'left')
serps_topics_merge
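The construction of keyword_topics_summary isn't shown in this excerpt; a minimal sketch, assuming deduplicated keyword and topic pairs with empty topics removed:
keyword_topics_summary = serps_topics_merge[['keyword', 'topic']].drop_duplicates()
keyword_topics_summary = keyword_topics_summary.loc[~keyword_topics_summary['topic'].isnull()]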
The count will be reset to 1, so we can count the number of suggested content
sections per target keyword:
keyword_topics_summary['count'] = 1
keyword_topics_summary
The preceding dataframe shows the content sections (topic) that should be written
for each target keyword.
keyword_topics_summary.groupby(['keyword']).agg({'count': 'sum'}).reset_index()
Webinar best practices will have the most content, while other target keywords will
have around two core content sections on average.
Reflections
For B2B marketing, this works really well because it automates a manual process most SEOs go through (i.e., seeing what content the top 10 ranking pages cover), especially when you have a lot of keywords to create content for.
We used the H1 and H2 because using even more copy from the body (such as H3 or
<p> paragraphs even after filtering out stop words) would introduce more noise into the
string distance calculations.
Sometimes, you get suggestions that are actually quite good; however, the output should be reviewed first before raising content requests with your creative team or agency.
Summary
There are many aspects of SEO that go into delivering content and UX better than your
competitors. This chapter focused on
The next chapter deals with the third major pillar of SEO: authority.
CHAPTER 5
Authority
Authority is arguably 50% of the Google algorithm. You could optimize your site to your heart's content by creating the perfect content, delivering it with the perfect UX, and hosting it on a site with the most perfect information architecture, only to find it's nowhere in Google's search results when you search by the title of the page (assuming it's not a unique search phrase). So what gives?
You’ll find out about this and the following in this chapter:
The first thing is that their algorithm ranked pages based on their authority, in other
words, how trustworthy the document (or website) was, as opposed to only matching
a document on keyword relevance. Authority in those days was measured by Google
as the amount of links from other sites linking to your site. This was much in the same
way as citations in a doctoral dissertation. The more links (or citations), the higher the
probability a random surfer on the Web would find your content. This made SEO harder
to game and the results (temporarily yet significantly) more reliable relative to the
competition.
The second thing they did was partner with Yahoo! which openly credited Google for
powering their search results. So what happened next? Instead of using Yahoo!, people
went straight to Google, bypassing the intermediary Yahoo! Search engine, and the rest is
history – or not quite.
Figure 5-1 is just one example of many showing a positive relationship between
rankings and authority. In this case, the authority is the product of nonsearch
advertising. And why is that? It’s because good links and effective advertising drive brand
impressions, which are also positively linked.
What we will set out to do is show how data science can help you:
While most of the analysis can be done on a spreadsheet, Python has certain advantages. Beyond the sheer number of rows it can handle, it also lets you look at the statistical side more readily, such as distributions.
import re
import time
import random
import pandas as pd
import numpy as np
import datetime
from datetime import timedelta
from plotnine import *
import matplotlib.pyplot as plt
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
import uritools
pd.set_option('display.max_colwidth', None)
%matplotlib inline
root_domain = 'johnsankey.co.uk'
hostdomain = 'www.johnsankey.co.uk'
hostname = 'johnsankey'
full_domain = 'https://www.johnsankey.co.uk'
target_name = 'John Sankey'
We start by importing the data and cleaning up the column names to make it easier
to handle and quicker to type, for the later stages.
target_ahrefs_raw = pd.read_csv(
'data/johnsankey.co.uk-refdomains-subdomains__2022-03-18_15-15-47.csv')
List comprehensions are a powerful and concise way to clean up the column names. The list comprehension instructs Python to convert the column name to lowercase for each column ("col") in the dataframe columns.
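The comprehension itself isn't shown here; a minimal sketch, mirroring the competitor version used later in this chapter:
target_ahrefs_raw.columns = [col.lower().replace(' ', '_').replace('.', '_').replace('__', '_')
                             .replace('(', '').replace(')', '').replace('%', '')
                             for col in target_ahrefs_raw.columns]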
Though not strictly necessary, I like having a count column as standard for
aggregations and a single value column “project” should I need to group the entire table:
target_ahrefs_raw['rd_count'] = 1
target_ahrefs_raw['project'] = target_name
target_ahrefs_raw
Now we have a dataframe with clean column names. The next step is to clean the
actual table values and make them more useful for analysis.
Make a copy of the previous dataframe and give it a new name:
target_ahrefs_clean_dtypes = target_ahrefs_raw.copy()
Clean the dofollow_ref_domains column, which tells us how many referring domains the linking site has. In this case, we'll convert the dashes to zeros and then cast the whole column as a whole number.
Start with referring domains:
target_ahrefs_clean_dtypes['dofollow_ref_domains'] = np.where(
    target_ahrefs_clean_dtypes['dofollow_ref_domains'] == '-',
    0, target_ahrefs_clean_dtypes['dofollow_ref_domains'])
target_ahrefs_clean_dtypes['dofollow_ref_domains'] = target_ahrefs_clean_dtypes['dofollow_ref_domains'].astype(int)
target_ahrefs_clean_dtypes['dofollow_linked_domains'] = np.where(
    target_ahrefs_clean_dtypes['dofollow_linked_domains'] == '-',
    0, target_ahrefs_clean_dtypes['dofollow_linked_domains'])
target_ahrefs_clean_dtypes['dofollow_linked_domains'] = target_ahrefs_clean_dtypes['dofollow_linked_domains'].astype(int)
“First seen” tells us the date when the link was first found (i.e., discovered and then
added to the index of ahrefs). We’ll convert the string to a date format that Python can
process and then use this to derive the age of the links later on:
target_ahrefs_clean_dtypes['first_seen'] = pd.to_datetime(target_ahrefs_clean_dtypes['first_seen'], format='%d/%m/%Y %H:%M')
target_ahrefs_clean_dtypes['month_year'] = target_ahrefs_clean_dtypes['first_seen'].dt.to_period('M')
The link age is calculated by taking today's date and subtracting the first seen date. The difference is then converted to a number format (nanoseconds) and divided by the number of nanoseconds in a day to get the number of days:
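The calculation isn't shown at this point; a minimal sketch, mirroring the competitor version later in the chapter:
target_ahrefs_clean_dtypes['link_age'] = datetime.datetime.now() - target_ahrefs_clean_dtypes['first_seen']
target_ahrefs_clean_dtypes['link_age'] = (target_ahrefs_clean_dtypes['link_age'].astype(int) /
                                          (3600 * 24 * 1000000000)).round(0)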
target_ahrefs_clean_dtypes
With the data types cleaned, and some new data features created (note columns
added earlier), the fun can begin.
target_ahrefs_analysis = target_ahrefs_clean_dtypes
target_ahrefs_analysis.describe()
So from the preceding table, we can see the average (mean), the number of referring
domains (107), and the variation (the 25th percentiles and so on).
The average domain rating (equivalent to Moz’s Domain Authority) of referring
domains is 27. Is that a good thing? In the absence of competitor data to compare in this
market sector, it’s hard to know, which is where your experience as an SEO practitioner
comes in. However, I’m certain we could all agree that it could be much higher – given
that it falls on a scale between 0 and 100. How much higher to make a shift is another
question.
The preceding table can be a bit dry and hard to visualize, so we’ll plot a histogram to
get more of an intuitive understanding of the referring domain authority:
dr_dist_plt = (
ggplot(target_ahrefs_analysis,
aes(x = 'dr')) +
geom_histogram(alpha = 0.6, fill = 'blue', bins = 100) +
scale_y_continuous() +
theme(legend_position = 'right'))
dr_dist_plt
The distribution is heavily skewed, showing that most of the referring domains have
an authority rating of zero (Figure 5-2). Beyond zero, the distribution looks fairly uniform
with an equal amount of domains across different levels of authority.
dr_firstseen_plt = (
ggplot(target_ahrefs_analysis, aes(x = 'first_seen', y = 'dr',
group = 1)) +
geom_line(alpha = 0.6, colour = 'blue', size = 2) +
labs(y = 'Domain Rating', x = 'Month Year') +
scale_y_continuous() +
scale_x_date() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation=90, hjust=1)
)
)
dr_firstseen_plt.save(filename = 'images/1_dr_firstseen_plt.png',
height=5, width=10, units = 'in', dpi=1000)
dr_firstseen_plt
The plot looks very noisy as you’d expect and only really shows you what the DR
(domain rating) of a referring domain was at a point in time (Figure 5-3). The utility of
this chart is that if you have a team tasked with acquiring links, you can monitor the link
quality over time in general.
dr_firstseen_smooth_plt = (
ggplot(target_ahrefs_analysis, aes(x = 'first_seen', y = 'dr',
group = 1)) +
geom_smooth(alpha = 0.6, colour = 'blue', size = 3, se = False) +
labs(y = 'Domain Rating', x = 'Month Year') +
scale_y_continuous() +
scale_x_date() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation=90, hjust=1)
))
dr_firstseen_smooth_plt.save(filename = 'images/1_dr_firstseen_smooth_plt.png',
                             height=5, width=10, units = 'in', dpi=1000)
dr_firstseen_smooth_plt
The use of geom_smooth() gives a somewhat less noisy view and shows the
variability of the domain rating over time to show how consistent the quality is
(Figure 5-4). Again, this correlates to the quality of the links being acquired.
What this doesn’t quite describe is the overall site authority over time, because the
value of links acquired is retained over time; therefore, a different math approach is
required.
To see the site’s authority over time, we will calculate a running average of the
domain rating by month of the year. Note the use of the expanding() function which
instructs Pandas to include all previous rows with each new row:
target_rd_cummean_df = target_ahrefs_analysis
target_rd_mean_df = target_rd_cummean_df.groupby(['month_year'])['dr'].sum().reset_index()
target_rd_mean_df['dr_runavg'] = target_rd_mean_df['dr'].expanding().mean()
target_rd_mean_df.head(10)
We now have a table which we can use to feed the graph and visualize.
dr_cummean_smooth_plt = (
ggplot(target_rd_mean_df, aes(x = 'month_year', y = 'dr_runavg',
group = 1)) +
geom_line(alpha = 0.6, colour = 'blue', size = 2) +
#labs(y = 'GA Sessions', x = 'Date') +
scale_y_continuous() +
scale_x_date() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation=90, hjust=1)
))
dr_cummean_smooth_plt
So the target site started with high authority links (which may have been a PR
campaign announcing the business brand), which faded soon after for four years and
then rebooted with new acquisition of high authority links again (Figure 5-5).
Most importantly, we can see the site’s general authority over time, which is how a
search engine like Google may see it too.
A really good extension to this analysis would be to regenerate the dataframe so that
we would plot the distribution over time on a cumulative basis. Then we could not only
see the median quality but also the variation over time too.
That’s the link quality, what about quantity?
target_count_cumsum_df = target_ahrefs_analysis
print(target_count_cumsum_df.columns)
target_count_cumsum_df = target_count_cumsum_df.groupby(['month_year'])['rd_count'].sum().reset_index()
target_count_cumsum_df['count_runsum'] = target_count_cumsum_df['rd_count'].expanding().sum()
target_count_cumsum_df['link_velocity'] = target_count_cumsum_df['rd_count'].diff()
target_count_cumsum_df
target_count_plt = (
ggplot(target_count_cumsum_df, aes(x = 'month_year', y = 'rd_count',
group = 1)) +
geom_line(alpha = 0.6, colour = 'blue', size = 2) +
labs(y = 'Count of Referring Domains', x = 'Month Year') +
scale_y_continuous() +
scale_x_date() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation=90, hjust=1)
))
target_count_plt.save(filename = 'images/3_target_count_plt.png',
height=5, width=10, units = 'in', dpi=1000)
target_count_plt
But perhaps it is not as useful for how a search engine would view the overall
number of referring domains a site has.
target_count_cumsum_plt = (
    ggplot(target_count_cumsum_df, aes(x = 'month_year', y = 'count_runsum', group = 1)) +
geom_line(alpha = 0.6, colour = 'blue', size = 2) +
scale_y_continuous() +
scale_x_date() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation=90, hjust=1)
))
target_count_cumsum_plt
The cumulative view shows us the total number of referring domains (Figure 5-7).
Naturally, this isn’t the entirely accurate picture as some referring domains may have
been lost, but it’s good enough to get the gist of where the site is at.
We see that links were steadily added from 2017 for the next four years before
accelerating again around March 2021. This is consistent with what we have seen with
domain rating over time.
A useful extension to correlate that with performance may be to layer in
import re
import os  # needed for os.listdir() below
import time
import random
import pandas as pd
import numpy as np
import datetime as dt  # later code references dt.datetime.now()
from datetime import timedelta
from plotnine import *
import matplotlib.pyplot as plt
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
import uritools
pd.set_option('display.max_colwidth', None)
%matplotlib inline
root_domain = 'johnsankey.co.uk'
hostdomain = 'www.johnsankey.co.uk'
hostname = 'johnsankey'
full_domain = 'https://www.johnsankey.co.uk'
target_name = 'John Sankey'
ahrefs_path = 'data/'
The listdir() function from the OS module allows us to list all of the files in a
subdirectory:
ahrefs_filenames = os.listdir(ahrefs_path)
ahrefs_filenames
['www.davidsonlondon.com--refdomains-subdomain__2022-03-13_23-37-29.csv',
'www.stephenclasper.co.uk--refdomains-subdoma__2022-03-13_23-47-28.csv',
'www.touchedinteriors.co.uk--refdomains-subdo__2022-03-13_23-42-05.csv',
'www.lushinteriors.co--refdomains-subdomains__2022-03-13_23-44-34.csv',
'www.kassavello.com--refdomains-subdomains__2022-03-13_23-43-19.csv',
'www.tulipinterior.co.uk--refdomains-subdomai__2022-03-13_23-41-04.csv',
'www.tgosling.com--refdomains-subdomains__2022-03-13_23-38-44.csv',
'www.onlybespoke.com--refdomains-subdomains__2022-03-13_23-45-28.csv',
'www.williamgarvey.co.uk--refdomains-subdomai__2022-03-13_23-43-45.csv',
'www.hadleyrose.co.uk--refdomains-subdomains__2022-03-13_23-39-31.csv',
'www.davidlinley.com--refdomains-subdomains__2022-03-13_23-40-25.csv',
'johnsankey.co.uk-refdomains-subdomains__2022-03-18_15-15-47.csv']
With the files listed, we’ll now read each one individually using a for loop and add
these to a dataframe. While reading in the file, we’ll use some string manipulation to
create a new column with the site name of the data we’re importing:
ahrefs_df_lst = list()
ahrefs_colnames = list()
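The loop itself isn't shown in this excerpt; a minimal sketch, where the site column is derived from the export filename (the exact string manipulation is an assumption):
for filename in ahrefs_filenames:
    df = pd.read_csv(ahrefs_path + filename)
    # strip the export suffix and the www prefix to leave the bare site name
    df['site'] = filename.split('--')[0].split('-refdomains')[0].replace('www.', '')
    ahrefs_df_lst.append(df)
    ahrefs_colnames.append(df.columns)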
comp_ahrefs_df_raw = pd.concat(ahrefs_df_lst)
comp_ahrefs_df_raw
Now that we have the raw data from each site in a single dataframe, the next step is to tidy up the column names and make them a bit friendlier to work with. A custom function could be used, but we'll just chain the function calls with a list comprehension:
competitor_ahrefs_cleancols = comp_ahrefs_df_raw.copy()
competitor_ahrefs_cleancols.columns = [col.lower().replace(' ','_').
replace('.','_').replace('__','_').replace('(','')
.replace(')','').replace('%','')
for col in competitor_ahrefs_cleancols.columns]
Having a count column and a single value column (“project”) is useful for groupby
and aggregation operations:
competitor_ahrefs_cleancols['rd_count'] = 1
competitor_ahrefs_cleancols['project'] = target_name
competitor_ahrefs_cleancols
The columns are now cleaned up, so we’ll now clean up the row data:
competitor_ahrefs_clean_dtypes = competitor_ahrefs_cleancols
For referring domains, we’re replacing hyphens with zero and setting the data type as
an integer (i.e., whole number). This will be repeated for linked domains, also:
competitor_ahrefs_clean_dtypes['dofollow_ref_domains'] = np.where(
    competitor_ahrefs_clean_dtypes['dofollow_ref_domains'] == '-',
    0, competitor_ahrefs_clean_dtypes['dofollow_ref_domains'])
competitor_ahrefs_clean_dtypes['dofollow_ref_domains'] = competitor_ahrefs_clean_dtypes['dofollow_ref_domains'].astype(int)
# linked_domains
competitor_ahrefs_clean_dtypes['dofollow_linked_domains'] = np.where(
    competitor_ahrefs_clean_dtypes['dofollow_linked_domains'] == '-',
    0, competitor_ahrefs_clean_dtypes['dofollow_linked_domains'])
competitor_ahrefs_clean_dtypes['dofollow_linked_domains'] = competitor_ahrefs_clean_dtypes['dofollow_linked_domains'].astype(int)
First seen gives us a date point at which links were found, which we can use for
time series plotting and deriving the link age. We’ll convert to date format using the
to_datetime function:
competitor_ahrefs_clean_dtypes['first_seen'] = pd.to_datetime(
    competitor_ahrefs_clean_dtypes['first_seen'], format='%d/%m/%Y %H:%M')
competitor_ahrefs_clean_dtypes['first_seen'] = competitor_ahrefs_clean_dtypes['first_seen'].dt.normalize()
competitor_ahrefs_clean_dtypes['month_year'] = competitor_ahrefs_clean_dtypes['first_seen'].dt.to_period('M')
To calculate the link_age, we’ll simply deduct the first seen date from today’s date
and convert the difference into a number:
competitor_ahrefs_clean_dtypes['link_age'] = dt.datetime.now() - competitor_ahrefs_clean_dtypes['first_seen']
competitor_ahrefs_clean_dtypes['link_age'] = competitor_ahrefs_clean_dtypes['link_age'].astype(int)
competitor_ahrefs_clean_dtypes['link_age'] = (competitor_ahrefs_clean_dtypes['link_age']/(3600 * 24 * 1000000000)).round(0)
The target column helps us distinguish the “client” site vs. competitors, which is
useful for visualization later:
competitor_ahrefs_clean_dtypes['target'] = np.where(
    competitor_ahrefs_clean_dtypes['site'].str.contains('johns'), 1, 0)
competitor_ahrefs_clean_dtypes['target'] = competitor_ahrefs_clean_dtypes['target'].astype('category')
competitor_ahrefs_clean_dtypes
Now that the data is cleaned up both in terms of column titles and row values, we’re
ready to set forth and start analyzing.
competitor_ahrefs_aggs = competitor_ahrefs_analysis.groupby('site').agg(
    {'link_age': 'mean', 'dofollow_links': 'mean', 'domain': 'count', 'dr': 'mean',
     'dofollow_ref_domains': 'mean', 'traffic_': 'mean', 'dofollow_linked_domains': 'mean',
     'links_to_target': 'mean', 'new_links': 'mean', 'lost_links': 'mean'}).reset_index()
competitor_ahrefs_aggs
The resulting table shows us aggregated statistics for each of the link features. Next, append the SEMRush domain-level visibility data, which, given the small number of sites, was simply typed in manually:
semrush_viz = [10100, 2300, 931, 2400, 911, 2100, 1800, 136, 838, 428,
1100, 1700]
competitor_ahrefs_aggs['semrush_viz'] = semrush_viz
competitor_ahrefs_aggs
The SEMRush visibility data has now been appended, so we're ready to compute the r-squared, known as the coefficient of determination, which will tell us which link feature best explains the variation in SEMRush visibility:
competitor_ahrefs_r2 = competitor_ahrefs_aggs.corr() ** 2
competitor_ahrefs_r2 = competitor_ahrefs_r2[['semrush_viz']].reset_index()
competitor_ahrefs_r2 = competitor_ahrefs_r2.sort_values('semrush_viz',
ascending = False)
competitor_ahrefs_r2
Naturally, we'd expect the semrush_viz to correlate perfectly with itself. DR (domain rating) surprisingly doesn't explain the difference in SEMRush visibility very well, with an r-squared of 21%.
On the other hand, "traffic_", which is the referring domain's traffic value, correlates better. From this alone, we're prepared to disregard "dr." Let's inspect this visually:
comp_correl_trafficviz_plt = (
ggplot(competitor_ahrefs_aggs,
aes(x = 'traffic_', y = 'semrush_viz')) +
geom_point(alpha = 0.4, colour = 'blue', size = 2) +
comp_correl_trafficviz_plt.save(filename = 'images/2_comp_correl_trafficviz_plt.png',
                                height=5, width=10, units = 'in', dpi=1000)
comp_correl_trafficviz_plt
This is not terribly convincing (Figure 5-8), due to the lack of referring domains
beyond 2,000,000. Does this mean we should disregard traffic_ as a measure?
Figure 5-8. Scatterplot of the SEMRush visibility (semrush_viz) vs. the total
AHREFs backlink traffic (traffic_) of the site’s backlinks
Not necessarily. The outlier data point with 10,000 visibility isn’t necessarily
incorrect. The site does have superior visibility and more referring traffic in the real
world, so it doesn’t mean the site’s data should be removed.
If anything, more data should be gathered with more domains in the same sector.
Alternatively, pursuing a more thorough treatment would involve obtaining SEMRush
visibility data at the page level and correlating this with page-level link feature metrics.
Going forward, we will use traffic_ as our measure of quality.
Link Quality
We start with link quality, which we've very recently discovered should be measured by "traffic_" as opposed to the industry-accepted domain rating.
Let’s start by inspecting the distributive properties of each link feature using the
describe() function:
competitor_ahrefs_analysis = competitor_ahrefs_clean_dtypes
competitor_ahrefs_analysis[['traffic_']].describe()
The resulting table shows some basic statistics including the mean, standard
deviation (std), and interquartile metrics (25th, 50th, and 75th percentiles), which give
you a good idea of where most referring domains fall in terms of referring domain traffic.
comp_dr_dist_box_plt = (
ggplot(competitor_ahrefs_analysis, #.loc[competitor_ahrefs_
analysis['dr'] > 0],
aes(x = 'reorder(site, traffic_)', y = 'traffic_',
colour = 'target')) +
geom_boxplot(alpha = 0.6) +
scale_y_log10() +
theme(legend_position = 'none',
axis_text_x=element_text(rotation=90, hjust=1)
))
comp_dr_dist_box_plt.save(filename = 'images/4_comp_traffic_dist_box_plt.png',
                          height=5, width=10, units = 'in', dpi=1000)
comp_dr_dist_box_plt
The interquartile range is the range of data between its 25th percentile and 75th
percentile. The purpose is to tell us
• How much of the data is away from the median (the center)
In this case, the IQR is quantifying how much traffic each site’s referring domains get
and its variability.
We also see that “John Sankey” has the third highest median referring domain traffic
which compares well in terms of link quality against their competitors. The size of the
box (its IQR) is not the longest (quite consistent around its median) but not as short
as Stephen Clasper (more consistent, with a higher median and more backlinks from
referring domain sites higher than the median).
“Touched Interiors” has the most diverse range of DR compared with other domains,
which could indicate an ever so slightly more relaxed criteria for link acquisition. Or is it
the case that as your brand becomes more well known and visible online, this brand has
naturally attracted more links from zero traffic referring domains? Maybe both.
Let’s plot the domain quality over time for each competitor:
comp_traf_timeseries_plt = (
ggplot(competitor_ahrefs_analysis,
aes(x = 'first_seen', y = 'traffic_',
group = 'site', colour = 'site')) +
geom_smooth(alpha = 0.4, size = 2, se = False,
method='loess'
) +
scale_x_date() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation=90, hjust=1)
)
)
comp_traf_timeseries_plt.save(filename = 'images/4_comp_traffic_timeseries_plt.png',
                              height=5, width=10, units = 'in', dpi=1000)
comp_traf_timeseries_plt
Figure 5-10. Time series plot showing the amount of traffic each referring domain
has over time for each website
The remaining sites are more or less flat in terms of their link acquisition
performance. David Linley started big, then dive-bombed in terms of link quality before
improving again in 2020 and 2021.
Now that we have some concept of how the different sites perform, what we really
want is a cumulative link quality by month_year as this is likely to be additive in the way
search engines evaluate the authority of websites.
We’ll use our trusted groupby() and expanding().mean() functions to compute the
cumulative stats we want:
competitor_traffic_cummean_df = competitor_ahrefs_analysis.copy()
competitor_traffic_cummean_df = competitor_traffic_cummean_df.groupby(['site', 'month_year'])['traffic_'].sum().reset_index()
competitor_traffic_cummean_df['traffic_runavg'] = competitor_traffic_cummean_df['traffic_'].expanding().mean()
competitor_traffic_cummean_df
Scientific formatted numbers aren’t terribly helpful, nor is a table for that matter, but
at least the dataframe is in a ready format to power the following chart:
competitor_traffic_cummean_plt = (
ggplot(competitor_traffic_cummean_df, aes(x = 'month_year', y =
'traffic_runavg', group = 'site', colour = 'site')) +
geom_line(alpha = 0.6, size = 2) +
labs(y = 'Cumu Avg of traffic_', x = 'Month Year') +
scale_y_continuous() +
scale_x_date() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation=90, hjust=1)
))
competitor_traffic_cummean_plt.save(filename = 'images/4_competitor_traffic_cummean_plt.png',
                                    height=5, width=10, units = 'in', dpi=1000)
competitor_traffic_cummean_plt
The code is color coding the sites to make it easier to see which site is which.
So as we might expect, David Linley’s link acquisition team has done well as their
authority has made leaps and bounds over all of the competitors over time (Figure 5-11).
Figure 5-11. Time series plot of the cumulative average backlink traffic for
each website
All of the other competitors have pretty much flatlined. This is reflected in David
Linley’s superior SEMRush visibility (Figure 5-12).
Figure 5-12. Column chart showing the SEMRush visibility for each website
What can we learn? So far in our limited data research, we can see that slow and
steady does not win the day. By contrast, sites need to be going after links from high
traffic sites in a big way.
Link Volumes
That’s quality analyzed; what about the volume of links from referring domains?
Our approach will be to compute a cumulative sum of referring domains using the
groupby() function:
competitor_count_cumsum_df = competitor_ahrefs_analysis
competitor_count_cumsum_df = competitor_count_cumsum_df.groupby(['site',
'month_year'])['rd_count'].sum().reset_index()
The expanding function allows the calculation window to grow with the number of
rows, which is how we achieve our cumulative sum:
competitor_count_cumsum_df['count_runsum'] = competitor_count_cumsum_df['rd_count'].expanding().sum()
competitor_count_cumsum_df
The result is a dataframe with the site, month_year, and count_runsum (the running
sum), which is in the perfect format to feed the graph – which we will now run as follows:
competitor_count_cumsum_plt = (
ggplot(competitor_count_cumsum_df, aes(x = 'month_year', y =
'count_runsum',
group = 'site', colour = 'site')) +
geom_line(alpha = 0.6, size = 2) +
labs(y = 'Running Sum of Referring Domains', x = 'Month Year') +
scale_y_continuous() +
scale_x_date() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation=90, hjust=1)
))
competitor_count_cumsum_plt.save(filename = 'images/5_count_cumsum_smooth_plt.png',
                                 height=5, width=10, units = 'in', dpi=1000)
competitor_count_cumsum_plt
Figure 5-13. Time series plot of the running sum of referring domains for
each website
For example, William Garvey started with over 5000 domains. I’d love to know who
their digital PR team is.
We can also see the rate of growth, for example, although Hadley Rose started link
acquisition in 2018, things really took off around mid-2021.
Link Velocity
Let’s take a look at link velocity:
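The link_velocity column isn't defined for the competitor dataframe in this excerpt; a minimal sketch, assuming a per-site month-on-month difference in new referring domains:
competitor_count_cumsum_df['link_velocity'] = competitor_count_cumsum_df.groupby('site')['rd_count'].diff()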
competitor_velocity_cumsum_plt = (
    ggplot(competitor_count_cumsum_df, aes(x = 'month_year', y = 'link_velocity',
                                           group = 'site', colour = 'site')) +
    geom_line(alpha = 0.6, size = 2) +
    labs(y = 'Link Velocity', x = 'Month Year') +
    scale_y_log10() +
    scale_x_date() +
    theme(legend_position = 'right',
          axis_text_x=element_text(rotation=90, hjust=1)
))
competitor_velocity_cumsum_plt.save(filename = 'images/5_competitor_velocity_cumsum_plt.png',
                                    height=5, width=10, units = 'in', dpi=1000)
competitor_velocity_cumsum_plt
The view shows the relative speed at which the sites are acquiring links (Figure 5-14).
This is an unusual but useful view as for any given month you can see which site is
acquiring the most links by virtue of the height of their lines.
Figure 5-14. Time series plot showing the link velocity of each website
David Linley was winning the contest throughout the years until Hadley Rose
came along.
Link Capital
Like most things that are measured in life, the ultimate value is determined by the
product of their rate and volume. So we will apply the same principle to determine the
overall value of a site’s authority and call it “link capital.”
We’ll start by merging the running average stats for both link volume and average
traffic (as our measure of authority):
competitor_capital_cumu_df = competitor_count_cumsum_df.merge(competitor_traffic_cummean_df,
                                                               on = ['site', 'month_year'], how = 'left')
competitor_capital_cumu_df['auth_cap'] = (competitor_capital_cumu_df['count_runsum'] *
                                          competitor_capital_cumu_df['traffic_runavg']).round(1)*0.001
competitor_capital_cumu_df['auth_velocity'] = competitor_capital_cumu_df['auth_cap'].diff()
competitor_capital_cumu_df
The merged table is produced with new columns auth_cap (measuring overall
authority) and auth_velocity (the rate at which authority is being added).
Let’s see how the competitors compare in terms of total authority over time in
Figure 5-15.
Figure 5-15. Time series plot of authority capital over time by website
The plot shows the link capital of several sites over time. What's quite interesting is how Hadley Rose emerged as the most authoritative: it has the third most consistently high-trafficked backlinking sites combined with a ramp-up in volume in less than a year. This has allowed them to overtake all of their competitors in the same time period (growing volume while maintaining quality).
What about the velocity in which authority has been added? In the following, we’ll
plot the authority velocity over time for each website:
competitor_capital_veloc_plt = (
    ggplot(competitor_capital_cumu_df, aes(x = 'month_year', y = 'auth_velocity',
                                           group = 'site', colour = 'site')) +
    geom_line(alpha = 0.6, size = 2) +
    labs(y = 'Authority Velocity', x = 'Month Year') +
    scale_y_continuous() +
    scale_x_date() +
    theme(legend_position = 'right',
          axis_text_x=element_text(rotation=90, hjust=1)
))
competitor_capital_veloc_plt.save(filename = 'images/6_auth_veloc_smooth_plt.png',
                                  height=5, width=10, units = 'in', dpi=1000)
competitor_capital_veloc_plt
The only standouts are David Linley and Hadley Rose (Figure 5-16). Should David
Linley maintain the quality and the velocity of its link acquisition program?
We’re in no doubt that it will catch up and even surpass Hadley Rose, all other things
being equal.
To achieve this, we’ll group the referring domains and their traffic levels to calculate
the number of sites:
power_doms_strata = competitor_ahrefs_analysis.groupby(['domain',
'traffic_']).agg({'rd_count': 'count'})
power_doms_strata = power_doms_strata.reset_index().sort_values('traffic_',
ascending = False)
A referring domain can only be considered a hub or power domain if it links to more
than two domains, so we’ll filter out those that don’t meet the criteria. Why three or
more? Because one is random, two is a coincidence, and three is directed.
power_doms_strata = power_doms_strata.loc[power_doms_strata['rd_count'] > 2]
power_doms_strata
The table shows referring domains, their traffic, and the number of (our furniture)
sites that these backlinking domains are linking to.
Being data driven, we’re not satisfied with a list, so we’ll use statistics to help
understand the distribution of power before filtering the list further:
pd.set_option('display.float_format', str)
power_doms_stats = power_doms_strata.describe()
power_doms_stats
We see the distribution is heavily positively skewed where most of the highly
trafficked referring domains are in the 75th percentile or higher. Those are the ones we
want. Let’s visualize:
power_doms_stats_plt = (
ggplot(power_doms_strata, aes(x = 'traffic_')) +
geom_histogram(alpha = 0.6, binwidth = 10) +
labs(y = 'Power Domains Count', x = 'traffic_') +
scale_y_continuous() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation = 90, hjust=1)
))
power_doms_stats_plt.save(filename = 'images/7_power_doms_stats_plt.png',
height=5, width=10, units = 'in', dpi=1000)
power_doms_stats_plt
Although we’re interested in hubs, we’re sorting the dataframe by traffic as these
have the most authority:
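The power_doms dataframe isn't constructed in this excerpt; a minimal sketch, assuming we keep referring domains at or above the 75th percentile of traffic and sort descending:
power_doms = power_doms_strata.loc[power_doms_strata['traffic_'] >= power_doms_strata['traffic_'].quantile(0.75)]
power_doms = power_doms.sort_values('traffic_', ascending = False)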
power_doms
By far, the most powerful is the Daily Mail, so in this case start budgeting for a good
digital PR consultant or full-time employee. There are also other publisher sites like the
Evening Standard (standard.co.uk) and The Times.
Some links are easier and quicker to get such as the yell.com and Thomson local
directories.
Then there are more market-specific publishers such as the Ideal Home, Homes and
Gardens, Livingetc, and House and Garden.
• Checking the relevance of the backlink page (or home page) to see if
it impacts visibility and filtering for relevance
Taking It Further
Of course, the preceding discussion is just the tip of the iceberg, as it’s a simple
exploration of one site so it’s very difficult to infer anything useful for improving rankings
in competitive search spaces.
The following are some areas for further data exploration and analysis:
• Adding search volume data on the hostnames to see how many brand
searches the referring domains receive as an alternative measure of
authority
• Content relevance
Naturally, the preceding ideas aren’t exhaustive. Some modeling extensions would
require an application of the machine learning techniques outlined in Chapter 6.
Summary
Backlinks, the expression of website authority for search engines, are incredibly
influential to search result positions for any website. In this chapter, you have
learned about
In the next chapter, we will use data science to analyze keyword search result
competitors.
CHAPTER 6
Competitors
What self-respecting SEO doesn’t do competitor analysis to find out what they’re missing?
Back in 2007, Andreas recalls using spreadsheets collecting data on SERPs with columns
representing aspects of the competition, such as the number of links to the home page,
number of pages, word counts, etc. In hindsight, the idea was right, but the execution was near hopeless because of the difficulty of performing a statistically robust analysis in Excel in the short time required – something you will shortly learn to do using machine learning.
Defining the Problem
Before we rush in, let’s think about the challenge. With over 10,000 ranking factors, there
isn’t enough time nor budget to learn and optimize for the high-priority SEO items.
We propose to find the ranking factors that will make the decisive difference to your
SEO campaign by cutting through the noise and using machine learning on competitor
data to discover
Outcome Metric
Let it be written that the outcome variable should be search engine ranking in Google.
This approach can be adapted for any other search engine (be it Bing, Yandex, Baidu,
etc.) as long as you can get the data from a reliable rank checking tool.
Why Ranking?
Because unlike user sessions, it doesn’t vary according to the weather, the time of year,
and so on – Query Freshness excepted. It’s probably the cleanest metric. In any case,
the ranking represents the order in which content of the ranking URL best satisfies the
search query – the point of RankBrain, come to think of it. So in effect, we are working
out how to optimize for any Google update informed by RankBrain.
From a data perspective, the ranking position must be a floating numeric data type
known as a “float” in Python (“double” in R).
Features
Now that we have established the outcome metric, we must now determine the
independent variables, the model inputs also known as features. The data types for the
feature will vary, for example:
Data Strategy
Now that we know the outcome and features, what to do? Given that rankings are
numeric and that we want to explain the difference in rank which is a continuous
variable (i.e., one flows into two, then into three, etc.), then competitor analysis in this
instance is a regression problem. This means in mathematical terms
Figure 6-1. Scatterplot of title tags branded (as a proportion of total characters)
and average Google rank position
In terms of what to collect features on, search engines rank URLs, not domains, and
therefore we will focus on the former. This will save you money and time in terms of data
collection costs as well as putting the actual data together. However, the downside is that
you won’t be able to include domain-wide features such as ranking URL distance from
the home page.
To that end, we will be using a decision tree–based algorithm known as “random
forest.” There are other algorithms such as decision tree, better ones like AdaBoost, and
XGBoost. A data scientist will typically experiment with a number of models and then
pick the best one in terms of model stability and predictive accuracy.
However, we’re here to get you started; the models are likely to produce similar
results, and so we’re saving you time while delivering you, most importantly, the
intuition behind the machine learning technique for analyzing your SERP competitors
for SEO.
Data Sources
Although we’re not privy to Google’s internal data (unless you work for Google), we rely
heavily on third-party tools to provide the data. The reason the tools are third party and
not first party is that the data for all websites in the data study must be comparable to
each other – unless in the unlikely event you have access to your competitors’ data. No?
Moving on.
Your data sources will depend on the type of features you wish to test for modeling
the SERP competitors.
Rank: This will come from your rank checking tool and not Google Search Console. So that's getSTAT, SEO Monitor, Rank Ranger, or the DataForSEO SERPs API. There are others, although we have no direct experience of testing their APIs and thus they cannot be mentioned. Why these tools? Because they all allow you to export the top 100 URLs for every keyword you're including in your research. This is important because from the outset we don't want to assume who your SERPs competitors are. We just want to extract the data and interrogate it.
For the features:
Onsite: To test whether onsite factors can explain the differences in rank for your
keyword set, use features like title tag length, page speed, number of words, reading ease,
and anything your rank checking tool can tell you about a URL. You can also derive your
own features such as title relevance by calculating the string distance between the title
tag and the target keyword. Rest assured, we’ll show you how later.
For less competitive industries and small datasets (less than 250,000 URLs per
site), a tool like Screaming Frog or Sitebulb will do. For large datasets and competitive
industries, it’s most likely that your competitors will block desktop site auditors, so
you will have to resort to an enterprise-grade tool that crawls from the cloud and has
an API. We have personally found Botify not only to have both but also to work well when it comes to crawling, because most enterprise brands use it and so it won't get blocked!
Offsite: To test the impact of offsite factors, choose a reliable source with a good
API. In our experience, AHREFs and BuzzSumo work well, yielding metrics such as the
domain rating, number of social shares by platform, and number of internal links on the
backlinking URLs. Both have APIs which allow you to automate the collection of offsite
data into your Python workspace.
• Examine the quality of data: Are there too many NAs? Single-level
factors?
The idea is to improve the quality of the data you’re going to feed into your model by
discarding features and rows as not all of them will be informative or useful. Exploring
the data will also help you understand the limits of your model for explaining the
ranking factors in your search query space.
Before joining onto the SERPs data, let’s explore.
To summarize the overall approach
8. Median impute
Naturally, there is a lot to cover, so we will explain each of these briefly and go into
more detail over the more interesting secrets ML can uncover on your competitors.
import re
import time
import random
import requests
import json
import pandas as pd
import numpy as np
import datetime
from datetime import timedelta
String methods used to compute the overlap between two text strings:
import uritools
from tldextract import extract
Some variables are initiated at the start, so that when you reuse the script on another
client or site, you simply overwrite the following variable values:
root_url = 'https://www.johnsankey.co.uk'
target_site = 'www.johnsankey.co.uk'
root_domain = 'johnsankey.co.uk'
hostname = 'johnsankey'
target_name = 'sankey'
geo_market = 'uk'
Start with the Keywords
As with virtually all things in SEO, start with the end in mind. That usually means the
target keywords you want your site to rank for and therefore their SERPs which we will
load into a Pandas dataframe:
serps_raw = pd.read_csv('data/keywords_serps.csv')
To make the dataframe easier to handle, we’ll use a list comprehension to turn
the column names into lowercase and replace punctuation marks and spaces with
underscores:
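The exact column names depend on your rank tracking tool's export; a minimal sketch:
serps_raw.columns = [col.lower().replace(' ', '_').replace('.', '_') for col in serps_raw.columns]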
The rank_absolute column is replaced by the more simplified and familiar “rank”:
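Assuming the export names the column rank_absolute:
serps_raw = serps_raw.rename(columns = {'rank_absolute': 'rank'})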
serps_raw
The serps_raw dataframe has over 25,000 rows of SERPs data with 6 columns,
covering all of the keywords:
serps_df = serps_raw.copy()
serps_df['url'] = serps_df['url'].astype(str)
The first manipulation is to extract the domain for the "site" column. The site column will apply the uritools API function to strip off the slug and then split the URL into a list of its components using a list comprehension:
Once split, we will extract everything in the list, taking the last three components:
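The extraction code isn't shown in this excerpt; a sketch using the imports above, assuming the url column holds full URLs:
serps_df['site'] = [uritools.urisplit(url).authority for url in serps_df['url']]
serps_df['site'] = ['.'.join(site.split('.')[-3:]) if site else site for site in serps_df['site']]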
Next, we want to profile the rank into strata, so that we can have rank categories.
While this may not be used in this particular exercise, it’s standard practice when
working with SERPs data.
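The strata code isn't shown here; a sketch with assumed category names, using np.select():
rank_conds = [serps_df['rank'] <= 3, serps_df['rank'] <= 10, serps_df['rank'] <= 20]
rank_cats = ['top_3', 'page_1', 'page_2']
serps_df['rank_profile'] = np.select(rank_conds, rank_cats, default = 'page_3_plus')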
Rather than have zero search result counts, we'll set these to one to avoid divide-by-zero errors using np.where():
serps_df['se_results_count'] = np.where(serps_df['se_results_count'] == 0,
1, serps_df['se_results_count'])
serps_df['count'] = 1
We’ll also count the number of keywords in a search string, known in the data
science world as tokens:
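A sketch, splitting the keyword on whitespace:
serps_df['token_count'] = serps_df['keyword'].str.split().str.len()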
These will then be categorized into head, middle, and tail, based on the token length:
before_length_conds = [
    serps_df['token_count'] == 1,
    serps_df['token_count'] == 2,
    serps_df['token_count'] > 2]
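The matching category assignment isn't shown; a sketch using np.select() (the token_size column name is taken from the categorical columns listed later):
before_length_vals = ['head', 'middle', 'tail']
serps_df['token_size'] = np.select(before_length_conds, before_length_vals)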
serps_df
Focus on the Competitors
The SERPs data effectively tells us what content is being rewarded by Google where the
rank is the outcome metric. However, much of this data is likely to be noisy, and a few
of the columns are likely to have ranking factors that explain the difference in ranking
between content.
The data is noisy because SERPs are likely to contain content from noncommercial sites (such as reviews and references) which will prove very difficult for the commercial sites to learn from. Ultimately, when conducting this exercise, an SEO is primarily interested in outranking competitor sites before these other sites become a consideration.
So, you'll want to select your competitors to make your study more meaningful. For example, if your client or your brand is in the webinar technology space, it won't make sense to include Wikipedia.org or Amazon.com in your dataset as they don't directly compete with your brand.
What you really want are near-direct to direct competitors, that is, doppelgangers, so
that you can compare what it is they do or don’t do to rank higher or lower than you.
The downside of this approach is that you don’t get to appreciate what Google wants
from the SERPs by stripping out noncompetitors. That’s because the SERPs need to be
analyzed as a whole, which is covered to an extent in Chapter 10. However, this chapter
is about competitor analysis, so we shall proceed.
To find the competitors, we'll have to perform some aggregations, starting with calculating reach (i.e., the number of URLs with positions in the top 10):
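The filtering step isn't shown in this excerpt; a sketch:
major_players_reach = serps_df.loc[serps_df['rank'] <= 10]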
With the SERPs filtered or limited to the top 10, we’ll aggregate the total number of
top 10s by site, using groupby site and summing the count column:
major_players_reach = major_players_reach.groupby('site').agg({'count': sum}).reset_index()
The reach metric is most of the story by giving us the volume, but we also want the
rank which is calculated by taking the median. This will help order sites with comparable
levels of reach.
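The rank aggregation isn't shown; a sketch consistent with the merge that follows:
major_players_rank = serps_df.groupby('site').agg({'rank': 'median'}).reset_index()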
major_players_searches = serps_df.groupby('site').agg({'se_results_count': 'mean'}).reset_index()
major_players_searches = major_players_searches.sort_values('se_results_count')
The rank and search result aggregations are joined onto the reach data to form one
table using the merge() function. This is equivalent to a vlookup using the site column as
the basis of the merge:
major_players_stats = major_players_reach.merge(major_players_rank, on = 'site', how = 'left')
major_players_stats = major_players_stats.merge(major_players_searches, on = 'site', how = 'left')
Using all the data, we’ll compute an overall visibility metric which divides the reach
squared by the rank. The reach is squared to avoid sites with a few top 10s and very high
rankings appearing at the top of the list.
The rank is the divisor because the higher the rank, the lower the number; therefore,
dividing by a lower number will increase the value of the site’s visibility should it
rank higher:
major_players_stats['visibility'] = ((major_players_stats['reach'] ** 2) / major_players_stats['rank']).round()
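The social_sites list isn't shown in this excerpt; a hypothetical example:
social_sites = ['facebook.com', 'twitter.com', 'instagram.com', 'pinterest.com', 'youtube.com']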
major_players_stats = major_players_stats.loc[~major_players_stats['site'].str.contains('|'.join(social_sites))]
major_players_stats = major_players_stats.loc[major_players_stats['site'] != 'nan']
major_players_stats.head(10)
The dataframe shows the top sites we would expect to see dominating the SERPs. A few of the top sites are not direct competitors (and will probably be uncrawlable anyway!), so these will be removed, as we're interested in learning from the most direct competitors to see their most effective SEO.
As a result, we will select the most direct competitors and store these in a list, player_sites_lst:
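The list itself isn't shown in this excerpt; a sketch inferred from the crawl export filenames that appear later (the exact strings must match the values in the site column):
player_sites_lst = ['darlingsofchelsea.co.uk', 'sofasandstuff.com', 'loaf.com',
                    'johnsankey.co.uk', 'sofology.co.uk', 'arloandjacob.com',
                    'designersofas4u.co.uk', 'theenglishsofacompany.co.uk',
                    'willowandhall.co.uk']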
The list will be used to filter the SERPs to contain only content from these direct
competitors:
direct_players_stats = major_players_stats.loc[major_players_stats['site'].isin(player_sites_lst)]
direct_players_stats
The dataframe shows that Darlings of Chelsea is the leading site to “beat” with the
most reach and the highest rank on average.
Let’s visualize this:
major_players_stats_plt = (
    ggplot(direct_players_stats,
           aes(x = 'reach', y = 'rank', fill = 'site', colour = 'site',
               size = 'se_results_count')) +
    geom_point(alpha = 0.8) +
    geom_text(direct_players_stats, aes(label = 'site'),
              position=position_stack(vjust=-0.08)) +
    labs(y = 'Google Rank', x = 'Google Reach') +
    scale_y_reverse() +
    scale_size_continuous(range = [5, 20]) +
    theme(legend_position = 'none',
          axis_text_x=element_text(rotation=0, hjust=1, size = 12))
)
major_players_stats_plt.save(filename = 'images/1_major_players_stats_plt.png',
                             height=5, width=8, units = 'in', dpi=1000)
major_players_stats_plt
Although Darlings of Chelsea leads the luxury sector, Made.com has the most
presence on the more highly competitive keywords as signified by the size of their data
point (se_results_count) (Figure 6-2).
Figure 6-2. Bubble chart comparing Google’s top 10s (reach) of each website and
their average Google ranking position. Circle size represents the number of search
results of the queries each site appears for
John Sankey on the other hand is the lowest ranking and has the least reach.
Filtering the SERPs data for just the direct competitors will make the data less noisy:
player_serps = serps_df[serps_df['site'].isin(player_sites_lst)]
player_serps
We end with a dataframe with far fewer rows, down from over 25,000 to 1,294. Although machine learning algorithms generally work best with 10,000 rows or more, the methods are still superior (in terms of insight speed and consistency) to working manually in a spreadsheet.
We're analyzing the most relevant sites, and we can proceed to collect data on those sites. These will form our hypotheses, that is, the possible ranking factors that could explain the differences in ranking between sites.
Site crawls provide a rich source of data as they contain information about the
content and technical SEO characteristics of the ranking pages, which will be our
starting point.
We’ll start by defining a function to export the URLs for site crawling to a CSV file:
def export_crawl(df):
    dom_name = df.domain.iloc[0]
    df = df[['url']].drop_duplicates()
    df.to_csv('data/1_to_crawl/' + dom_name + '_crawl_urls.csv', index=False)
The function is applied to the filtered SERPs dataframe using the groupby() function:
player_serps.groupby('site').apply(export_crawl)
Once the data is crawled, we can store the exports in a folder and read them in,
one by one.
In this instance, we set the file path as a variable named “crawl_path”:
crawl_path = 'data/2_crawled/'
crawl_filenames = os.listdir(crawl_path)
crawl_filenames
['www_johnsankey_co_uk.csv',
'www_sofasandstuff_com.csv',
'loaf_com.csv',
'www_designersofas4u_co_uk.csv',
'www_theenglishsofacompany_co_uk.csv',
'www_willowandhall_co_uk.csv',
'www_darlingsofchelsea_co_uk.csv',
'www_sofology_co_uk.csv',
'www_arloandjacob_com.csv']
crawl_df_lst = list()
crawl_colnames = list()
A for loop is used to go through the list of website auditor CSV exports and read the
data into a list:
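A sketch of the loop, assuming each export can be read directly with read_csv():
for filename in crawl_filenames:
    crawl_csv = pd.read_csv(crawl_path + filename)
    crawl_df_lst.append(crawl_csv)
    crawl_colnames.append(crawl_csv.columns)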
crawl_raw = pd.concat(crawl_df_lst)
The column names are made more data-friendly by removing formatting and
converting the column names to lowercase:
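A sketch of the cleanup, assuming spaces, parentheses, and colons are the characters that need substituting:
crawl_raw.columns = [col.lower().replace(' ', '_').replace('(', '').replace(')', '_').replace(':', '_')
                     for col in crawl_raw.columns]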
crawl_raw
crawl_df = crawl_raw.copy()
Getting the site name will help us aggregate the data by site. Using a list
comprehension, we’ll loop through the dataframe URL column and apply the urisplit()
function:
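A sketch, assuming the cleaned crawl export has a url column:
crawl_df['site'] = [uritools.urisplit(url).authority for url in crawl_df['url']]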
crawl_df
Printing the data types using the .dtypes property helps us see which columns
require potential conversion into more usable data types:
print(crawl_df.dtypes)
We can see from the printed list that there are numeric variables in object format that should be float64 and will therefore require conversion.
We’ll create a copy of the dataframe to create a new one that will have converted
columns:
cleaner_crawl = crawl_df.copy()
Starting with reading time, we'll replace "No Data" values with a zero time in the same mm:ss format using np.where():
cleaner_crawl['reading_time_mm_ss_'] = np.where(cleaner_crawl['reading_time_mm_ss_'] == 'No Data', '00:00', cleaner_crawl['reading_time_mm_ss_'])
cleaner_crawl['reading_time_mm_ss_'] = '00:' + cleaner_crawl['reading_time_mm_ss_']
cleaner_crawl['reading_time_mm_ss_'] = pd.to_timedelta(cleaner_crawl['reading_time_mm_ss_']).dt.total_seconds()
We’ll convert other string format columns to float, by first defining a list of columns
to be converted:
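The full list depends on your crawler's export; an illustrative subset:
float_cols = ['no_internal_links_to_url', 'no_canonical_links', 'url_rank']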
Using the list, we'll use the .apply() method with the to_numeric() function to convert the columns:
cleaner_crawl[float_cols] = cleaner_crawl[float_cols].apply(pd.to_numeric, errors='coerce')
The columns are correctly formatted. For more advanced features, you may want to
try segmenting the different types of content according to format, such as blogs, guides,
categories, subcategories, items, etc.
For further features, we shall import backlink authority data. First, we’ll import the
data by reading all of the AHREFs data in the folder:
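A sketch, assuming the exports sit in a hypothetical data/3_authority/ folder:
auth_path = 'data/3_authority/'
authority_filenames = os.listdir(auth_path)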
authority_filenames
['darlingsofchelsea.co.uk-best-pages-by-links-subdomains-12-Sep-2022_19-17-56.csv',
 'sofology.co.uk-best-pages-by-links-subdomains-12-Sep-2022_19-24-00.csv',
 'sofasandstuff.com-best-pages-by-links-subdomains-12-Sep-2022_19-23-42.csv',
 'willowandhall.co.uk-best-pages-by-links-subdomains-12-Sep-2022_19-17-03.csv',
 'theenglishsofacompany.co.uk-best-pages-by-links-subdomains-12-Sep-2022_19-24-53.csv',
 'arloandjacob.com-best-pages-by-links-subdomains-12-Sep-2022_19-23-19.csv',
 'designersofas4u.co.uk-best-pages-by-links-subdomains-12-Sep-2022_19-16-38.csv',
 'johnsankey.co.uk-best-pages-by-links-subdomains-12-Sep-2022_19-16-04.csv',
 'heals.com-best-pages-by-links-subdomains-12-Sep-2022_19-17-32.csv',
 'loaf.com-best-pages-by-links-subdomains-12-Sep-2022_19-25-17.csv']
auth_df_lst = list()
auth_colnames = list()
The for loop reads in the data using the read_csv function, stores the filename as a
column (so we know which file the data comes from), cleans up the column names, and
adds the data to the lists created earlier:
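A sketch of that loop:
for filename in authority_filenames:
    auth_csv = pd.read_csv(auth_path + filename)
    auth_csv['sitefile'] = filename
    auth_csv.columns = [col.lower().replace(' ', '_') for col in auth_csv.columns]
    auth_df_lst.append(auth_csv)
    auth_colnames.append(auth_csv.columns)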
Once the loop has run, the lists are combined into a single dataframe using the
concat() function:
auth_df_raw = pd.concat(auth_df_lst)
auth_df_raw = auth_df_raw.rename(columns = {'page_url': 'url'})
auth_df_raw.drop(['#', 'size', 'code', 'crawl_date', 'language', 'page_title', 'first_seen'],
                 axis = 1, inplace = True)
auth_df_raw
The resulting auth_df_raw dataframe is shown as follows with the site pages and
their backlink metrics.
Join the Data
Now that the data from the respective tool sources is imported, it is ready to be joined. Usually, the common column (known as the "primary key") between datasets is the URL, as that is what search engines rank.
We’ll start by joining the SERPs data to the crawl data. Before we do, we only require
the SERPs containing the competitor sites.
player_serps = serps_df[serps_df['site'].isin(player_sites_lst)]
The vlookup to join competitor SERPs and the crawl data of the ranking URLs is
achieved by using the .merge() function:
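A sketch of the join, dropping the crawl's own site column first to avoid duplicate columns:
player_serps_crawl = player_serps.merge(
    cleaner_crawl.drop(columns = ['site'], errors = 'ignore'), on = 'url', how = 'left')
player_serps_crawl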
The next step is to join the backlink authority data to the dataset containing SERPs
and crawl metrics, again using the merge() function:
player_serps_crawl_auth = player_serps_crawl.copy()
player_serps_crawl_auth = player_serps_crawl_auth.merge(auth_df_raw, on = ['url'], how = 'left')
player_serps_crawl_auth.drop(['sitefile'], axis = 1, inplace = True)
player_serps_crawl_auth
The data has now been joined such that each SERP URL has its onsite and offsite
SEO data in a single dataframe:
hypo_serps_features = player_serps_crawl_auth.copy()
Add regional_tld which denotes whether the ranking URL is regionalized or not:
regional_tlds = ['.uk']
hypo_serps_features['regional_tld'] = np.where(
    hypo_serps_features['site'].str.contains('|'.join(regional_tlds)), 1, 0)
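The sorensen_dice() helper used in the following isn't defined in this excerpt; a minimal sketch computing the Sørensen–Dice similarity on word tokens:
def sorensen_dice(text_a, text_b):
    # assumed implementation: overlap of word tokens between the two strings
    tokens_a = set(str(text_a).lower().split())
    tokens_b = set(str(text_b).lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return 2 * len(tokens_a & tokens_b) / (len(tokens_a) + len(tokens_b))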
Add a metric for measuring how much of the target keyword is used in the title tag
using the sorensen_dice() function:
hypo_serps_features['title'] = hypo_serps_features['title'].astype(str)
hypo_serps_features['title_relevance'] = hypo_serps_features.loc[:, ['title', 'keyword']].apply(
    lambda x: sorensen_dice(*x), axis=1)
We're also interested in measuring the extent to which title tag and H1 heading consistency is influential:
hypo_serps_features['h1'] = hypo_serps_features['h1'].astype(str)
hypo_serps_features['title_h1'] = hypo_serps_features.loc[:, ['title', 'h1']].apply(
    lambda x: sorensen_dice(*x), axis=1)
Does having a brand in your title tag matter? Let’s find out:
hypo_serps_features['site'] = hypo_serps_features['site'].astype(str)
hypo_serps_features['hostname'] = hypo_serps_features['site'].apply(lambda x: extract(x))
hypo_serps_features['hostname'] = hypo_serps_features['hostname'].str.get(1)
hypo_serps_features['title'] = hypo_serps_features['title'].str.lower()
hypo_serps_features['title_branded'] = hypo_serps_features.loc[:, ['title', 'hostname']].apply(
    lambda x: sorensen_dice(*x), axis=1)
Another useful feature is URL parameters, that is, question marks in the
ranking URL:
hypo_serps_features['url_params'] = np.where(hypo_serps_features['url'].str.contains(r'\?'), '1', '0')
hypo_serps_features['url_params'] = hypo_serps_features['url_params'].astype('category')
Another test is whether the ranking URL has Google Analytics code. It’s unlikely to
amount to anything, but if the data is available, why not?
hypo_serps_features['google_analytics_code'] = np.where(
    hypo_serps_features['google_analytics_code'].str.contains('UA'), '1', '0')
hypo_serps_features['google_analytics_code'] = hypo_serps_features['google_analytics_code'].astype('category')
hypo_serps_features['google_tag_manager_code'] = np.where(
    hypo_serps_features['google_tag_manager_code'].str.contains('GTM'), '1', '0')
hypo_serps_features['google_tag_manager_code'] = hypo_serps_features['google_tag_manager_code'].astype('category')
hypo_serps_features['google_tag_manager_code_second_'] = np.where(
    hypo_serps_features['google_tag_manager_code_second_'].str.contains('GTM'), '1', '0')
hypo_serps_features['google_tag_manager_code_second_'] = hypo_serps_features['google_tag_manager_code_second_'].astype('category')
A test for cache control is added to check whether it's private, public, or other, and it is converted to a category:
hypo_serps_features['cache_privacy'] = np.where(
    hypo_serps_features['cache-control'].str.contains('private'), 'private', '0')
hypo_serps_features['cache_privacy'] = np.where(
    hypo_serps_features['cache-control'].str.contains('public'), 'public',
    hypo_serps_features['cache_privacy'])
hypo_serps_features['cache_privacy'] = np.where(
    hypo_serps_features['cache-control'].str.contains('0'), 'other',
    hypo_serps_features['cache_privacy'])
hypo_serps_features['cache_privacy'] = hypo_serps_features['cache_privacy'].astype('category')
A cache age has also been added by extracting the numerical component of the
cache-control string. This is achieved by splitting the string on the “=” sign and then
using the .get() function, before converting to a numerical float data type:
hypo_serps_features['cache_age'] = hypo_serps_features['cache-control'].str.split('=')
hypo_serps_features['cache_age'] = hypo_serps_features['cache_age'].str.get(-1)
hypo_serps_features['cache_age'] = np.where(hypo_serps_features['cache_age'].isnull(), 0,
                                            hypo_serps_features['cache_age'])
hypo_serps_features['cache_age'] = np.where(hypo_serps_features['cache_age'].str.contains('[a-z]'), 0,
                                            hypo_serps_features['cache_age'])
hypo_serps_features['cache_age'] = hypo_serps_features['cache_age'].astype(float)
hypo_serps_features['self_canonicalised'] = np.where(
    hypo_serps_features['canonical_url'] == hypo_serps_features['url'], 1, 0)
We drop identifiers such as the canonical URL, as these values are unique to a single row and will add nothing to the analysis. We're interested in characteristics that generalize across rows, not values unique to one record.
We also drop hypotheses which are likely to be redundant or that we're not interested in testing, such as the HTTP protocol. This relies on your own SEO experience and judgment.
Once done, we’ll create another copy of the dataframe and export as CSV in
preparation for machine learning, starting with single-level factors:
hypo_serps_pre_slf = hypo_serps_features.copy()
hypo_serps_pre_slf
To remove SLFs, we'll iterate through the dataframe column by column to identify any column where 70% or more of the rows share the same value, storing those column names in a list. 70% is an arbitrary threshold; you could choose 80% or 90%, for example. A lower threshold comes with the risk of removing some insightful ranking factors, even those that apply to only a small number of URLs, which might ironically be the top-ranking URLs.
slf_cols = []
slf_limit = .7
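A sketch of the iteration:
for col in hypo_serps_pre_slf.columns:
    # share of rows taken up by the single most common value in the column
    top_share = hypo_serps_pre_slf[col].value_counts(normalize = True, dropna = False).max()
    if top_share >= slf_limit:
        slf_cols.append(col)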
slf_cols
The columns with 70% or more identical data are printed as follows and will be removed from the dataset:
['location_code',
'language_code',
'rank_profile',
'branded',
'count',
'crawl_depth',
'is_subdomain',
'no_query_string_keys',
'query_string_contains_filtered_parameters',
'query_string_contains_more_than_three_keys',
'query_string_contains_paginated_parameters',
'query_string_contains_repetitive_parameters',
'query_string_contains_sorted_parameters',
'scheme',...
Let's examine a few of these SLF ranking factors using the groupby() function. Starting with branded, we can see all of the ranking URL titles fall into a single level (generic), so this column can be removed:
hypo_serps_pre_slf.groupby('branded')['count'].sum().sort_values()
branded
generic 1294
Name: count, dtype: int64
hypo_serps_pre_slf.groupby('url_params')['count'].sum().sort_values()
Parameterized URLs also appear to be redundant with only 17 URLs that are
parameterized. However, these may still provide insight in unexpected ways.
Having identified the SLFs, we’ll process these in a new dataframe where these will
be removed using a list comprehension:
hypo_serps_pre_mlfs = hypo_serps_pre_slf.copy()
The list of columns to be removed is nuanced further as we’d like to keep url_params
as mentioned earlier and the count column for further aggregation in future processes:
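A sketch, with the keep columns mentioned above:
keep_cols = ['url_params', 'count']
drop_cols = [col for col in slf_cols if col not in keep_cols]
hypo_serps_pre_mlfs = hypo_serps_pre_mlfs[[col for col in hypo_serps_pre_mlfs.columns
                                           if col not in drop_cols]]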
hypo_serps_pre_mlfs
hypo_serps_preml_prescale = hypo_serps_pre_mlfs.copy()
Separate the columns into numeric and nonnumeric, so we can rescale the numeric columns, using select_dtypes():
hypo_serps_preml_num = hypo_serps_preml_prescale.select_dtypes(include=np.number)
hypo_serps_preml_num_colnames = hypo_serps_preml_num.columns
Nonnumeric columns are saved into a separate dataframe, which will be used for joining later:
hypo_serps_preml_nonnum = hypo_serps_preml_prescale.select_dtypes(exclude=np.number)
hypo_serps_preml_num
We’ll make use of the MinMaxScaler() from the preprocessing functions of the
sklearn API:
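The import comes from scikit-learn:
from sklearn import preprocessing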
Convert the column values into a numpy array and then use the MinMaxScaler()
function to rescale the data:
x = hypo_serps_preml_num.values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
hypo_serps_preml_num_scaled = pd.DataFrame(x_scaled, index=hypo_serps_preml_num.index,
                                           columns = hypo_serps_preml_num_colnames)
hypo_serps_preml_num_scaled
variance = hypo_serps_preml_num_scaled.var()
columns = hypo_serps_preml_num_scaled.columns
Save the names of variables having variance more than a threshold value:
highvar_variables = []
nz_variables = []
We’ll iterate through the numeric columns setting the threshold at 7% such that
there must be at least 7% variation in the data to remain in the dataset. Again, 7% is an
arbitrary choice. The high variation columns are stored in the list we created earlier
called highvar_variables:
for i in range(0, len(variance)):
    if variance[i] >= 0.07:
        highvar_variables.append(columns[i])
    else:
        nz_variables.append(columns[i])
highvar_variables
['rank',
'no_canonical_links',
'total_canonicals',
'no_internal_followed_linking_urls',
'no_internal_followed_links',
'no_internal_linking_urls',
'no_internal_links_to_url',
'url_rank', ...]
nz_variables
['se_results_count',
'count',
'token_count',
'expires_date',
'no_cookies',
'file_size_kib_',
'total_page_size_kib_', ...]
The NZVs identified and stored in nz_variables are shown above. We can see, for example, that most web pages have highly similar numbers of keywords in the search query ("token_count") and HTML page sizes ("total_page_size_kib_"), so we'll be happy to remove these.
Here’s a quick sanity check to ensure there are no columns that are listed as both
high variation and NZV:
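A sketch using a list comprehension:
[col for col in highvar_variables if col in nz_variables]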
[]
For example, the title_relevance column shows healthy variation across its range:
hypo_serps_preml_num_scaled['title_relevance'].describe()
count 1294.000000
mean 0.478477
std 0.199106
min 0.000000
25% 0.323529
50% 0.512712
75% 0.622487
max 1.000000
Name: title_relevance, dtype: float64
The scaled_images column, on the other hand, is NZV, as shown in the following, where most values are zero up to the 75th percentile of 0.18, showing very little variation; it should therefore be excluded:
hypo_serps_preml_num_scaled['scaled_images'].describe()
count 977.000000
mean 0.114348
std 0.163530
min 0.000000
25% 0.000000
50% 0.000000
75% 0.179487
max 1.000000
Name: scaled_images, dtype: float64
We’ll redefine the highvar_variables list to include some NZVs we think should
remain in the dataset:
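The exact picks are a judgment call; an illustrative sketch:
highvar_variables = highvar_variables + ['se_results_count', 'count']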
hypo_serps_preml_num_highvar = hypo_serps_preml_num_scaled[highvar_variables]
hypo_serps_preml_num_highvar
Next, we’ll also remove ranking factors that are highly correlated to each other
(known as multicollinearity), using the variance_inflation_factor() function from the
statsmodels API to detect large variance inflation factors (VIF).
Multicollinearity is an issue because it reduces the statistical significance of the
ranking features used to model the search result rankings.
A large variance inflation factor (VIF) on a ranking feature or any modeling variable
hints at a highly correlated relationship to other ranking factors. Removing those
variables will improve the model’s predictive consistency, that is, more stable and less
degree of error when making forecasts.
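The function needs importing from statsmodels, assuming it wasn't imported earlier in the notebook:
from statsmodels.stats.outliers_influence import variance_inflation_factor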
Remove rows with missing values (np.nan) and infinite values (np.inf, -np.inf), and store the result in X_variables:
vif_input = hypo_serps_preml_num_highvar[~hypo_serps_preml_num_highvar.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
X_variables = vif_input
vif_data = pd.DataFrame()
vif_data["feature"] = X_variables.columns
vif_data["vif"] = [variance_inflation_factor(X_variables.values, i)
                   for i in range(len(X_variables.columns))]
vif_data.sort_values('vif')
The VIF data distribution is printed using the describe() function to get an idea of the level of intercolumn correlation, which will act as our threshold for rejecting columns:
vif_data['vif'].describe()
count 38.000000
mean inf
std NaN
min 3.254763
25% 26.605281
50% 76.669063
75% 3504.833113
max inf
Name: vif, dtype: float64
Having determined the VIF range, we'll discard any ranking factor with a VIF above the median. Technically, best practice says a VIF of five or above indicates high correlation; however, in this case, we're just looking to remove the most excessively correlated ranking factors, which is still an improvement:
hypo_serps_preml_lowvif = hypo_serps_preml_num_highvar.copy()
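The filtering step isn't shown; a sketch that keeps features at or below the median VIF:
low_vif_feats = vif_data.loc[vif_data['vif'] <= vif_data['vif'].median(), 'feature'].tolist()
hypo_serps_preml_lowvif = hypo_serps_preml_lowvif[low_vif_feats]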
hypo_serps_preml_lowvif
We've now gone from 38 to 19 columns. As you may come to appreciate by now, machine learning is not simply a case of plugging in the numbers to get a result; much work must be done to get the numbers into a usable format.
Median Impute
We want to retain as many rows of data as possible as any rows with missing values in
any column will have to be removed.
One technique is to use median impute where the median value for a given column
of data will be estimated to replace the missing value.
Of course, the median is likely to be more meaningful if it is calculated at the domain
level rather than an entire column, as we’re pitting sites against each other. So where
possible, we will use median impute at the domain level, otherwise at the column level.
Import the data type detection helpers, which the for loop will use to skip columns that are not numeric during median imputation:
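A sketch of the import:
from pandas.api.types import is_numeric_dtype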
hypo_serps_preml_median = hypo_serps_preml_lowvif.copy()
Variables are set so that the for loop can groupby() the entire column and at the
domain level (“site”):
hypo_serps_preml_median['site'] = hypo_serps_preml_prescale['site']
hypo_serps_preml_median['project'] = 'competitors'
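The impute loop itself isn't shown; a sketch that imputes at the site level first, then falls back to the whole column:
for col in hypo_serps_preml_median.columns:
    if not is_numeric_dtype(hypo_serps_preml_median[col]):
        continue
    hypo_serps_preml_median[col] = hypo_serps_preml_median.groupby('site')[col].transform(
        lambda s: s.fillna(s.median()))
    hypo_serps_preml_median[col] = hypo_serps_preml_median[col].fillna(
        hypo_serps_preml_median[col].median())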
hypo_serps_preml_median
The result is a dataframe with fewer missing values, improving data retention.
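The hypo_serps_preml_cat dataframe isn't defined in this excerpt; we assume it starts from the nonnumeric columns saved earlier:
hypo_serps_preml_cat = hypo_serps_preml_nonnum.copy()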
categorical_cols = hypo_serps_preml_cat.columns.tolist()
Use a list comprehension to update the categorical_cols list and ensure the stop
columns are not in there:
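A sketch, with hypothetical stop columns (identifiers that shouldn't be encoded):
stop_cols = ['site', 'project', 'url', 'keyword']
categorical_cols = [col for col in categorical_cols if col not in stop_cols]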
categorical_cols
The following are the categorical columns that will now be one hot encoded:
['token_size',
'compression',
'connection',
'charset',
'canonical_status',
'canonical_url_render_status',
'flesch_kincaid_reading_ease',
'sentiment',
'contains_paginated_html', ... ]
hypo_serps_preml_cat
The following is the dataframe with only the OHE columns selected. The get_dummies() function will be used to create the OHE columns for each categorical ranking factor:
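A sketch:
hypo_serps_preml_ohe = pd.get_dummies(hypo_serps_preml_cat[categorical_cols])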
hypo_serps_preml_ohe
With OHE, the category columns have now expanded from 38 to 95 columns.
For example, the compression column has been replaced by two new columns
compression_Brotli and compression_Gzipped, as there were only two values for that
ranking factor.
Eliminate NAs
With the numeric and category data columns cleaned and transformed, we’re now ready
to combine the data and eliminate the missing values.
Combine the dataframes using concat():
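A sketch, joining on the shared row index:
hypo_serps_preml_ready = pd.concat([hypo_serps_preml_median, hypo_serps_preml_ohe], axis = 1)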
hypo_serps_preml_ready
The next preparation step is to eliminate “NA” values as ML algorithms don’t cope
very well with cell values that have “not available” as a value.
First of all, check which columns have a proportion of NAs, by taking the sum of null
values in each column and dividing by the total number of rows:
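A sketch:
na_props = hypo_serps_preml_ready.isnull().sum() / len(hypo_serps_preml_ready)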
We put our calculations of missing data into a separate dataframe and then
sort values:
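A sketch:
missing_data_df = pd.DataFrame({'column': na_props.index, 'na_prop': na_props.values})
missing_data_df.sort_values('na_prop', ascending = False)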
We can see that there are no columns with missing values, which is great news; on to the next stage. If there were missing values, the columns would be removed, as we've done what we can to improve the data to get to this point.
Modeling the SERPs
A quick reminder: modeling the SERPs means deriving a formula that will predict rank based on the SEO features. The steps are as follows:
• Split the dataset into test (20%) and train (80%). The model will learn from most of the dataset (train) and will be applied to the test dataset (data the model has not seen before). We do this to see how the model really performs in a real-world situation.
Import the relevant APIs and libraries which are mostly from scikit-learn, a free
machine learning software library for Python:
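The exact import list isn't shown in this excerpt; a sketch covering the steps that follow, assuming a random forest regressor:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import category_encoders as ce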
hypo_serps_ml = hypo_serps_preml_ready.copy()
encoder = ce.HashingEncoder()
serps_features_ml_encoded = encoder.fit_transform(hypo_serps_ml)
serps_features_ml_encoded
Set the target variable as rank, which is the outcome we’re looking to explain and
guide our SEO recommendations:
target_var = 'rank'
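The split and model setup aren't shown; a sketch, again assuming a random forest regressor (any scikit-learn regressor exposing feature importances would fit the later steps):
X = serps_features_ml_encoded.drop(target_var, axis = 1)
y = serps_features_ml_encoded[target_var]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1231)
regressor = RandomForestRegressor(n_estimators = 100, random_state = 1231)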
regressor.fit(X_train, y_train)
Test the model on the test dataset and store the forecasts into y_pred:
y_pred = regressor.predict(X_test)
This is done by feeding the test dataset to the trained model's predict() method.
Given the modeling and prediction of rank is a regression problem, we use RMSE
and r-squared as evaluation metrics. So what do they tell us?
The RMSE tells us what the average margin of error is for a predicted rank. For
example, an RMSE of 5 would tell us that the model will predict ranking positions + or – 5
from the true value on average.
The r-squared has the formal title of "coefficient of determination." What does that mean? In practical terms, the r-squared represents the proportion of the variation in the outcome (here, rank) that can be explained by the model. It is computed by taking the square of the correlation coefficient (r), hence r-squared. An r-squared of 0.4 means that 40% of the variation in the data can be explained by the model.
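A sketch of computing both metrics with scikit-learn:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r_squared = r2_score(y_test, y_pred)
print(rmse, r_squared)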
Beware of models with an r-squared of 1 or anything remotely close, especially in
SEO. The chances are there’s an error in your code or it’s overfitting or you work for
Google. Either way, you need to debug.
You might be wondering what good or reasonable values for each of these metrics
are. The truth is it depends on how good you need your model to be and how you intend
to use it.
If you intend to use your model as part of an automated SEO system that will directly
make changes to your content management system (CMS), then the RMSE needs to be
really accurate, so perhaps no more than five ranking positions. Even then, that depends
on the starting position of your rankings, as five is a significant difference for a page
already ranking on page 1 compared to a ranking on page 3!
If the intended use case for the model is simply to gain insight into what is driving
the rankings and what you should prioritize for split A/B testing or optimization, then an
RMSE of 20 or less is acceptable.
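The construction of df_imp isn't shown in this excerpt; a sketch assuming the random forest above:
df_imp = pd.DataFrame({'feature': X_train.columns,
                       'importance': regressor.feature_importances_})
df_imp = df_imp.sort_values('importance', ascending = False)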
df_imp
The following dataframe result shows the most influential SERP features or ranking
factors in descending order of importance.
Plot the importance data in a bar chart using the plotnine library:
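A sketch of the chart:
df_imp_plt_data = df_imp.head(20).copy()
df_imp_plt_data['feature'] = pd.Categorical(
    df_imp_plt_data['feature'],
    categories = df_imp_plt_data.sort_values('importance')['feature'],
    ordered = True)
df_imp_plt = (
    ggplot(df_imp_plt_data, aes(x = 'feature', y = 'importance')) +
    geom_col(alpha = 0.8) +
    coord_flip() +
    labs(y = 'Importance', x = '')
)
df_imp_plt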
In this particular case, Figure 6-3 shows that the most important factor was “title_
relevance” which measures the string distance between the title tag and the target
keyword. This is measured by the string overlap, that is, how much of the title tag string is
taken up by the target keyword.
Figure 6-3. Variable importance chart showing the most influential ranking
factors identified by the machine learning algorithm
No surprise there for the SEO practitioner; however, the value here is providing empirical evidence to the nonexpert business audience that doesn't understand the need to optimize title tags. Data like this can also be used to secure buy-in from non-SEO colleagues such as developers to prioritize SEO change requests.
Other factors of note in this industry are as follows:
Every market or industry is different, so the preceding text is not a general result for
the whole of SEO!
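The influencers list isn't defined in this excerpt; we assume it holds the ranked feature names from df_imp:
influencers = df_imp['feature'].tolist()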
i = 3
Calculate the stats to average the site CWV performance and Google rank:
num_factor_agg = hypo_serps_features.groupby(['site']).agg(
    {str(influencers[i]): 'mean', 'rank': 'mean', 'se_results_count': 'sum',
     'count': 'sum'}).reset_index()
num_factor_agg = num_factor_agg.sort_values(str(influencers[i]))
To show the client in a different color to the competitors, we’ll create a new column
“target,” such that if the website is the client, then it’s 1, otherwise 0:
num_factor_agg['target'] = np.where(num_factor_agg['site'].str.contains(hostname), 1, 0)
num_factor_agg
The following is the dataframe that will be used to power the chart in Figure 6-4.
The geom_smooth() function in the following code returns a polynomial line of best fit, according to whether you'd like it straight (degree 1) or curved (2 or more degrees):
num_factor_viz_plt = (
    ggplot(num_factor_agg,
           aes(x = str(influencers[i]), y = 'rank', fill = 'target', colour = 'target',
               size = 'se_results_count')) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = 'lm', se = False, formula = 'y ~ poly(x, degree=1)',
                colour = 'blue', size = 1.5) +
    labs(y = 'Google Rank', x = str(influencers[i])) +
    scale_y_reverse() +
    scale_size_continuous(range = [5, 20]) +
    theme(legend_position = 'none',
          axis_text_x=element_text(rotation=0, hjust=1, size = 12))
)
num_factor_viz_plt
Plotting the average Core Web Vitals (CWV) vs. average Google rank by website, which also includes a line of best fit (Figure 6-4), we can estimate the ranking impact per unit improvement in CWV. In this case, it is about a 0.5 rank position gain per 1 unit improvement in CWV.
Figure 6-4. Bubble chart of websites comparing Google rank and CWV
performance score
Activation
With your model outputs, you’re now ready to make some decisions on your SEO
strategy in terms of
• Changes you’d like to make sitewide because they’re a “no-brainer,”
such as site speed or increasing brand searches (either through
programmatic advertising, content marketing, or both)
• Split A/B testing of factors included in your model
• Further research into the ranking factor itself to guide your
recommendations
Why? Because the ML analysis is just a snapshot of the SERPs at a single point in time. Having a continuous stream of data collection and analysis means you get a truer picture of what is really happening in the SERPs for your industry.
This is where SEO purpose–built data warehouse and dashboard systems come in, and these products are available today. What these systems do is automate the continuous collection, analysis, and visualization of your SERPs and site data.
To build your own automated system, you would deploy into a cloud infrastructure
like Amazon Web Services (AWS) or Google Cloud Platform (GCP) what is called
ETL, that is, extract, transform, and load, so that your data collection, analysis, and
visualization are automated in one place. This is explained more fully in Chapter 8.
Summary
In this chapter, you learned the following.
Competitor research and analysis in SEO is hard because there are so many possible ranking factors to gather and control for. Spreadsheet tools are not up to the task due to the amounts of data involved, let alone the statistical capabilities that data science languages like Python offer.
When conducting SEO competitor analysis using machine learning (ML), it’s
important to understand that this is a regression problem, the target variable is Google
rank, and the hypotheses are the ranking factors.
In Chapter 7, we will cover experiments which are something that would naturally
follow the outputs of competitor statistical analysis.
CHAPTER 7
Experiments
It’s quite exciting to unearth insights from data or your own thought experiments that
could be implemented on your site and drive real, significant organic improvements.
With the rise of split testing platforms such as Distilled ODN and RankScience, it’s of no
surprise that experimentation is playing an ever-increasing role in SEO.
If you’re running a small site where the change leading to a negative impact is
inconsequential or the change is seemingly an obvious SEO best practice, then you may
forgo formal experimentation and simply focus on shipping the changes you believe are
required.
On the other hand, if you’re working on a large enterprise website, be it in-house or
as part of an agency, then any changes will be highly consequential, and you’ll want to
make sure you test these changes in order to both understand the impact (both positive
and negative) as well as help shape your understanding to help inform new hypotheses
to test.
1. Hypothesis generation
2. Experiment design
3. Running the experiment
4. Evaluation
5. Implementation
Generating Hypotheses
Before any experiment starts, you need to base it around your hypothesis, that is, a belief
in what you believe will significantly change your target variable or outcome metric, for
example, organic impressions or URLs crawled.
This step is crucially important because without clear hypotheses, you won't begin to know what it is you will be testing to influence your outcome metric. So think about what you want to learn that could help you improve your SEO performance.
There’s a number of areas to source hypotheses from:
• Competitor analysis
I usually like to use the format "We believe that Google will give a greater weighting
to URLs linked from by other prominent pages of a website." This statement is then
expanded to consider what you’re proposing to test and how you’ll measure (i.e., “We’ll
know if the hypothesis is valid when…”).
Competitor Analysis
The competitor analysis that you carried out (explained in Chapter 6) will be a natural
source of ideas to test because they have some statistical foundation to them, surfacing
things that your competitors are doing or not doing to benefit from superior organic
performance. These hypotheses have the added advantage of knowing what the metric is
that you’ll be testing from the outset. After all, you had to get the data into your analysis
in the first place.
If they made the news, they most probably merit testing. As an aside, in the early days of our SEO careers, before data science was actually a thing, the best way to really know your data was to test everything we read about SEO online, on forums such as Webmaster World and BlackHatWorld, and see what worked and what didn't work. If you didn't have sites banned from the index in Google, you weren't respected as an SEO, or perhaps you were just not being bold with your experiments. The very essence was "optimizing" for search engines.
Naturally, things have moved on, and most of us in SEO are working for brands and
established businesses. So some of the creative wild experiments would be inappropriate
or rather career limiting, which we’re not advocating to do.
Experiment Design
Having decided on what hypotheses you’re going to test, you’re now ready to design your
experiment. In the following code example, we will be designing an experiment to see
the impact of a test item (it could be anything, say a paragraph of text beneath the main
heading) on organic impressions at the URL level.
Let’s start by importing the APIs:
import re
import time
import random
import pandas as pd
import numpy as np
import datetime
import requests
import json
from datetime import timedelta
from glob import glob
import os
from plotnine import *
import matplotlib.pyplot as plt
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
from datetime import datetime, timedelta
These are required for carrying out the actual split test:
pd.set_option('display.max_colwidth', None)
%matplotlib inline
Because we're using website analytics data (popular packages include Google Analytics, Looker, and Adobe Analytics), this is easily exported from a landing page report by date. If this is a large site, then you may need to use an API (see Chapter 8 for Google Analytics).
Depending on your website analytics package, the column names may vary;
however, if you’re looking to test the difference in impressions between URL groups over
time, you will require
• Landing page
• Date
• Sessions
Outcomes other than sessions can also be tested, such as impressions, CTR, and position.
Assuming you have a CSV export, read the data from your website analytics package
using pd.read_csv:
analytics_raw = pd.read_csv("data/expandable-content.csv")
Print the data types of the columns to check for numerical data that might be
imported as a character string which would need changing:
print(analytics_raw.dtypes)
landing_page object
date object
sessions float64
dtype: object
The session data is a numerical float which is fine, but the date column is classed as "object," which means it will need converting to a date format.
We’ll make a copy of the dataframe using the .copy() method. This is so that any
changes we make won’t affect the original imported table. That way, if we’re not happy
with the change, we don’t have to go all the way to the top and reimport.
analytics_clean = analytics_raw.copy()
The date column uses the to_datetime() function which takes the column and the
format the date is in and is then normalized to convert the string into a date format.
This will be important for filling in missing dates and plotting data over time later:
analytics_clean['date'] = pd.to_datetime(analytics_clean['date'], format='%d/%m/%Y').dt.normalize()
analytics_clean
The Pandas dataframe below shows ‘analytics_clean’ which now has the data in a
usable format for further manipulation such as graphing sessions over time.
analytics_raw.describe()
We can see that the average (mean) number of sessions per URL on any given date is about 2, which varies wildly, as shown by the standard deviation (sd) value of 4. Given the mean is 2 and a landing page can't have –2 sessions (mean of 2 less standard deviation of 4), this implies that some outlier pages are getting extremely high sessions, which explains the variation.
Now look at the dates:
analytics_clean['date'].describe()
count 157678
unique 28
top 2019-08-18 00:00:00
freq 6351
first 2019-08-06 00:00:00
last 2019-09-02 00:00:00
Name: date, dtype: object
There’s not much to infer other than the data’s date range of about a month
in August.
Zero Inflation
Web analytics typically only logs data against a web page when there is an impression.
What about the days when the page doesn’t receive an impression? What then?
Zero inflation is where we add null records for pages that didn’t record an organic
impression on a given day. If we didn’t zero-inflate, then there would be a distortion
of the mean of the data for a given web page, let alone for a group of pages, namely,
A and B.
For example, URL X may have had 90 sessions logged in analytics across 10 days within a given 30-day period, which would suggest an average of 9 sessions per day. However, because those sessions happened on only 10 of the 30 days, your calculations would mislead you into thinking the URL is better than it is. By zero-inflating the data, that is, adding null rows to the dataset for the days in the 30-day period when URL X didn't have any organic sessions, the average calculated is restored to the expected 3 per day.
Zero inflation also gives us another useful property to work from, and that is the
Poisson distribution.
It’s beyond the scope of this book to explain the Poisson distribution. Still, what you
need to know is that the Poisson distribution is common for rare events when we test for
the difference between groups A and B.
Any statistically significant difference between the two groups will hopefully show
that the test group B had significantly less zeros than A. Enough science, let’s go.
There is a much shorter (though less comprehensible) way to fill in missing dates. Both methods are given in this chapter, starting with the longer yet easier-to-read version.
Here, we use the function date_range() to set the date range from the minimum and
maximum dates found in the analytics dataframe, with an interval of one day. This is
saved into a variable object called “datelist”:
datelist = pd.date_range(start=analytics_clean['date'].min(),
                         end=analytics_clean['date'].max(), freq='1d')
nd is the length of days in the date range, and nu is the unique list of landing pages
we want the dates for:
nd = len(datelist)
nu = len(analytics_clean['landing_page'].unique())
Here, we create a dataframe with all the possible landing page and date
combinations by a cross-product of the landing page and the dates:
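A sketch using nd and nu:
analytics_expanded = pd.DataFrame({
    'landing_page': np.repeat(analytics_clean['landing_page'].unique(), nd),
    'date': np.tile(datelist, nu)})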
Then we look up which dates and landing pages have sessions logged against them:
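A sketch of the lookup:
analytics_expanded = analytics_expanded.merge(analytics_clean, how = 'left',
                                              on = ['landing_page', 'date'])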
Any that are unmatched (and thus null) are filled with zeros:
analytics_expanded[['date', 'sessions']] = analytics_expanded.groupby('landing_page')[
    ['date', 'sessions']].fillna(0)
analytics_expanded['sessions'] = analytics_expanded['sessions'].astype('int64')
analytics_expanded
Let’s explore the data using the .describe() which will tell us how the distribution of
sessions has changed having been zero-inflated:
analytics_expanded.describe()
The following screenshot shows the distribution of sessions following zero inflation.
And what a difference! The mean has shrunk by over 75% from 2.16 to 0.42, and
hardly any pages get over one session just under a month.
For example, we take a sample of URLs and benchmark the performance before
implementing a test on said URL sample to see if a significant impact results or not.
The main motivation for us to conduct A/A testing is to determine whether the
A/B test design is reliable enough to proceed with or not. What we’re looking for are no
differences between A before and A after.
We'll test a period of 13 days, assuming no changes have been made, although in a real setting, you would check that nothing has changed before testing.
Why 13 days? This is an arbitrary number; however, methods are given later for determining the sample size needed for a robust A/B test to ensure any differences detected are significant. The same methods could be applied here.
This A/A test is just an illustration of how to create the data structures and test. So if
you wanted to conduct a “SearchPilot” style of split testing, then sample size and testing
period determination aside, the following code would help you run it:
aa_test_period = 13
Set the cutoff date to be the latest date less the test period:
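A sketch:
cutoff_date = analytics_expanded['date'].max() - timedelta(days = aa_test_period)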
analytics_phased = analytics_expanded.copy()
Set the A/A group based on the date before or after the cutoff:
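A sketch, with assumed group labels pretest and test_period:
analytics_phased['aa_group'] = np.where(analytics_phased['date'] > cutoff_date,
                                        'test_period', 'pretest')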
analytics_phased
Before testing, let’s determine analytically the statistical properties of both A/A
groups, which is indicative of what the actual split A/A test result might be.
The first function is day_range() which returns the number of days in the date range
which is the latest date less the earliest date:
def day_range(date):
    return (max(date) - min(date)).days
First, we calculate the means by filtering the data for nonzero sessions and then
aggregate the date range and average by A/A group:
aa_means = (
analytics_phased.loc[analytics_phased["sessions"] != 0]
.groupby(["aa_group"])
.agg({"date": ["min", "max", day_range], "sessions": "mean"})
)
aa_means
We can see that the day ranges and the date range are correct, and the averages per
group are roughly the same when rounded to whole numbers.
aa_means = analytics_phased.loc[analytics_phased['sessions'] != 0]
aa_zeros = analytics_phased.copy()
aa_zeros['zeros'] = np.where(aa_zeros['sessions'] == 0, 1, 0)
aa_zeros['rows'] = 1
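The aggregation isn't shown; a sketch that sums the zero flags and row counts by group:
aa_means_sigmas = aa_zeros.groupby('aa_group').agg({'zeros': 'sum',
                                                    'rows': 'sum'}).reset_index()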
Calculate the variation “sigma” which is 99.5% of the ratio of zeros to the total
possible sessions:
aa_means_sigmas['sigma'] = aa_means_sigmas['zeros'] / aa_means_sigmas['rows'] * 0.995
aa_means_sigmas
We can see the variation is very similar before and after the cutoff, so that gives us
some confidence that the URLs are stable enough for A/B testing.
Put it together using the .merge() function (the Python equivalent of Excel’s
vlookup):
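A sketch, merging the nonzero session averages onto the sigma table:
aa_means_sigmas = aa_means_sigmas.merge(
    analytics_phased.loc[analytics_phased['sessions'] != 0]
    .groupby('aa_group')
    .agg({'sessions': 'mean'})
    .reset_index(),
    on = 'aa_group', how = 'left')
aa_means_sigmas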
If you were conducting an A/A test to see the effect of an optimization, then you’d
want to see the test_period group report a higher session rate with the same sigma or
lower. That would indicate your optimization succeeded in increasing SEO traffic.
Let’s visualize the distributions using the histogram plotting capabilities of plotnine:
aa_test_plt = (
ggplot(analytics_phased,
aes(x = 'sessions', fill = 'aa_group')) +
geom_histogram(alpha = 0.8, bins = 30) +
labs(y = 'Count', x = '') +
theme(legend_position = 'none',
axis_text_y =element_text(rotation=0, hjust=1, size = 12),
legend_title = element_blank()
) +
facet_wrap('aa_group')
)
aa_test_plt.save(filename = 'images/2_aa_test_plt.png',
                 height=5, width=8, units = 'in', dpi=1000)
aa_test_plt
Figure 7-2. Histogram plots of pretest and test period A/A group data
The box plot, which will now be used, gives more visual detail of the two groups' distributions:
aa_test_box_plt = (
ggplot(analytics_phased,
aes(x = 'aa_group', y = 'sessions',
fill = 'aa_group', colour = 'aa_group')) +
geom_boxplot(alpha = 0.8) +
labs(y = 'Count', x = '') +
theme(legend_position = 'none',
axis_text_y =element_text(rotation=0, hjust=1, size = 12),
legend_title = element_blank()
)
)
aa_test_box_plt.save(filename = 'images/2_aa_test_box_plt.png',
                     height=5, width=8, units = 'in', dpi=1000)
aa_test_box_plt
Figure 7-3 shows again in aa_test_box_plt that there is no difference between the
groups other than the pretest group having a larger number of higher value outliers.
Let’s perform the actual A/A test using a statistical model. We’ll create an array,
which is a list of numbers marking data points as either 0 (for pretest) or 1 (test_period),
which will then be assigned to X:
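A sketch using statsmodels' add_constant():
import statsmodels.api as sm
X = sm.add_constant((analytics_phased['aa_group'] == 'test_period').astype(float).values)
X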
array([[1., 0.],
[1., 0.],
[1., 0.],
...,
[1., 1.],
[1., 1.],
[1., 1.]])
X is used to feed the NegativeBinomial() model which will be used to test the
difference in the number of sessions between the two A/A groups.
The arguments are the outcome metric (sessions) and the independent variable
(aa_group):
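A sketch of the fit:
aa_model = sm.NegativeBinomial(analytics_phased['sessions'], X).fit()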
Then we’ll see the model results using the .summary() attribute:
aa_model.summary()
The printout shows that p-value (LLR p-value) is zero, which means there
is significance. However, the x1 is –0.31, indicating there is a small difference
between groups.
Is that enough of a difference to stop the A/B test? That's a business question. In this case, it isn't; however, this is subjective, and the graphs and analytical tables would support the claim that there is no real difference. Onward.
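The get_sigma() helper isn't shown in this excerpt; a sketch consistent with the sigma calculation used for the A/A groups above:
def get_sigma(sessions):
    # assumed implementation: share of zero-session rows, scaled by 0.995
    return (sessions == 0).sum() / len(sessions) * 0.995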
num_rows = analytics_phased["sessions"].count()
mu = analytics_phased[analytics_phased["sessions"] != 0].agg({"sessions": "mean"})
sigma = get_sigma(analytics_phased["sessions"])
print(num_rows, mu, sigma)
807212 sessions    2.155976
dtype: float64 0.8006401416232662
With the parameters set, these will feed the following functions. python_rzip will
generate and return a random Poisson distribution based on the parameters:
simulate_sample uses the python_rzip function to return a split test between two
groups of data assuming there is a difference of 20% or more:
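Sketches of both functions, assuming a Mann-Whitney U test for the group comparison:
from scipy.stats import mannwhitneyu

def python_rzip(n, mu, sigma):
    # zero-inflated Poisson sampler: with probability sigma return 0,
    # otherwise draw from a Poisson distribution with mean mu
    return [0 if random.random() < sigma else np.random.poisson(mu) for _ in range(n)]

def simulate_sample(n, mu, sigma, uplift = 0.2):
    # simulate a control group and a test group carrying a 20% uplift
    control = python_rzip(n, mu, sigma)
    test = python_rzip(n, mu * (1 + uplift), sigma)
    return mannwhitneyu(control, test)[1] < 0.05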
With the three functions defined, we can test for significance at varying levels of
traffic. If you fancy a challenge to avoid repetitive code and stretch your Python skills, try
implementing the run_simulations function as part of a list comprehension:
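A sketch of run_simulations and the repeated calls:
def run_simulations(n, mu, sigma, n_sims = 100):
    # the share of simulated experiments that detect the uplift, i.e., statistical power
    detections = [simulate_sample(n, mu, sigma) for _ in range(n_sims)]
    return sum(detections) / n_sims

for n in [100, 1000, 10000, 15000, 16000, 18000, 20000, 25000, 30000, 50000]:
    print(run_simulations(n, mu['sessions'], sigma), ':', n)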
0.04 : 100
0.08 : 1000
0.74 : 10000
0.85 : 15000
0.86 : 16000
0.9 : 18000
0.96 : 20000
0.97 : 25000
1.0 : 30000
1.0 : 50000
The preceding output shows the levels of significance (p-value) achieved at different
sample size levels, which in our case are the required number of sessions. If we would be
happy with a 90% (or higher) chance that a 20% difference would be observed, then we’d
require 18,000 sessions per group or more.
So we’ll set the experiment sample size as appropriate:
exp_sample_size = 18000
The testing_days, which is the maximum number of days to run the test, will be set at
30, which is an arbitrary number set by the business. This of course can be lower.
testing_days = 30
With the max period of testing days set, we'll need the minimum number of URLs for the test group to hit the required number of user sessions in that time period. Dividing the sample session size of 18,000 by the number of testing days will give us that approximate number.
Bear in mind that it can take up to two weeks (and sometimes longer) for Google to register the site changes and reflect these in the search results (i.e., they have to crawl, index, and rerank their results). So to limit the risk of ending the experiment early, we'll double the minimum URLs required for testing, in order to increase the likelihood of Google, in the first instance, crawling the test URLs (those with the change(s)):
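A sketch:
url_sample_size = round(exp_sample_size / testing_days) * 2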
print(url_sample_size)
1200
1200 URLs are required for the test group. Note that it's implicitly assumed that the test group is much smaller than the control, such that there are plenty of URLs in the control group to hit the minimum sessions during the testing period.
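The aggregation producing urls_agg isn't shown; a sketch:
urls_agg = (analytics_expanded.loc[analytics_expanded['sessions'] != 0]
            .groupby('landing_page')
            .agg({'sessions': 'mean', 'date': 'count'})
            .reset_index())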
urls_agg
Our resulting dataframe shows each landing page and their average sessions and
number of days where sessions are generated. Some URLs get more than one day of
sessions as shown by the date column.
We will now sample the dataframe based on the required 1200 URLs and assign
these to the “test” group:
urls_test = urls_agg.sample(url_sample_size).assign(ab_group="test")
Drop the sessions and date column as we only need the URLs to send to the web
developer team for allocation:
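A sketch:
urls_test = urls_test.drop(['sessions', 'date'], axis = 1)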
urls_test
urls_test_list = urls_test["landing_page"]
urls_test_list
7129 https://www.next.com/shop/henrik-vibskov-coats/
16561 https://www.next.com/shop/mens-le-mont-st-michel-jackets/
13017 https://www.next.com/shop/mens-cartier-belts/
16169 https://www.next.com/shop/mens-jw-anderson-t-shirts/
8949 https://www.next.com/shop/kensie-totes/
...
9813 https://www.next.com/shop/lizzie-fortunato-shoulder-bags/
12857 https://www.next.com/shop/mens-buscemi-sneakers/
18681 https://www.next.com/shop/mens-raey-jackets/
6000 https://www.next.com/shop/forever-21-knitwear/
20060 https://www.next.com/shop/mens-the-quiet-life-hats/
Name: landing_page, Length: 1200, dtype: object
Test landing pages are converted to a list, which will be used to mark the other (non-test-allocated) URLs as control:
urls_control = urls_agg[~urls_agg["landing_page"].isin(urls_test_list.values)].assign(
    ab_group="control")
urls_control
Both test and control groups will now be combined into a single dataframe showing
which URLs are test and control:
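A sketch using concat():
split_ab_dev = pd.concat([urls_test, urls_control])
split_ab_dev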
split_ab_dev.to_csv("data/split_ab_developers.csv")
The final dataframe is combined and exported into a CSV for the software
development team’s reference.
What can and does happen is that the test could regress back to similar levels of
performance as the control group after outperforming control. However, if you end the
experiment prematurely, you won’t know and therefore end up wasting your time and
company resources on an invalid experiment.
So if your experiment requires 20,000 pageviews, make sure your experiment reaches
20,000 pageviews for both groups.
test_analytics = pd.read_csv('data/sim_split_ab_data.csv')
test_analytics["date"] = pd.to_datetime(test_analytics["date"], format="%Y/%m/%d")
test_analytics
You’ll see that you now have a dataframe with all the URLs by date and outcomes
labeled as test and control. As is the nature of analytics data, some dates are missing:
Add the missing dates, as some URLs from either group will not have logged a pageview on every date. We'll use a list comprehension to create a row for every unique landing page and date combination, so that no date is missing:
test_analytics_expand = pd.DataFrame(
    [(x, y)
     for x in test_analytics['landing_page'].unique()
     for y in test_analytics['date'].unique()],
    columns=("landing_page", "date"),
)
test_analytics_expand
Note that there are more rows than before because of the added missing dates. These
will need to have session data added, which will be achieved by merging the original
analytics data:
test_analytics_expanded = test_analytics_expand.merge(
    split_ab_dev, how="left", on=['landing_page'])
test_analytics_expanded = test_analytics_expanded.merge(
    test_analytics, how="left", on=["date", "landing_page", 'ab_group'])
Post merge, any landing pages with missing dates will have NaN (not a number) values, which we deal with by filling them with zeros and converting the data type to an integer:
test_analytics_expanded['sessions'] = test_analytics_expanded['sessions'].fillna(0).astype(int)
test_analytics_expanded
Our dataset is ready for some data exploration before finally testing. We explore the
data to observe the distribution of sessions, which helps with our model selection.
# day_range and get_sigma are helper functions (defined earlier in the chapter)
# for the span of days and the standard deviation, respectively
ab_means = (
    test_analytics_expand[test_analytics_expand["sessions"] != 0]
    .groupby(["ab_group"])
    .agg({"date": ["min", "max", day_range], "sessions": "mean"})
)
ab_sigmas = test_analytics_expand.groupby(["ab_group"]).agg(
    {"sessions": [get_sigma]})
The dataframe shows that the minimum sample sessions were comfortably hit, and it looks like the test group has made a difference, that is, a higher average number of sessions; whether that difference is statistically significant is what we'll test formally below.
Let’s plot the data by test and control groups to explore it more, starting with an
overall time trend. The data will be aggregated by date and ab_group with the sessions
averaged:
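A sketch of that aggregation (the dataframe name simul_abgroup_trend is taken from the plotting code that follows):
simul_abgroup_trend = (
    test_analytics_expanded
    .groupby(['date', 'ab_group'])
    .agg({'sessions': 'mean'})
    .reset_index()
)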
simul_abgroup_trend_plt = (
    ggplot(simul_abgroup_trend,
           aes(x = 'date', y = 'sessions', colour = 'ab_group',
               group = 'ab_group')) +
    geom_line(alpha = 0.6, size = 3) +
    labs(y = 'Count', x = '') +
    theme(legend_position = 'right',
          axis_text_y = element_text(rotation=0, hjust=1, size = 12),
          legend_title = element_blank()
          )
)
simul_abgroup_trend_plt.save(filename = 'images/3_simul_abgroup_trend_plt.png',
                             height=5, width=8, units = 'in', dpi=1000)
simul_abgroup_trend_plt
Figure 7-4 shows the resulting time series plot of simul_abgroup_trend_plt. Both
groups experienced dips during that period; however, the test group has outperformed
the control group.
Figure 7-4. Time series plot of both test and control group sessions over time
Next, we’ll inspect the distribution of sessions overall, starting with a histogram:
ab_assign_plt = (
    ggplot(test_analytics_expanded,
           aes(x = 'sessions', fill = 'ab_group')) +
    geom_histogram(alpha = 0.6, bins = 30)
)
ab_assign_plt.save(filename = 'images/4_ab_test_plt.png',
                   height=5, width=8, units = 'in', dpi=1000)
ab_assign_plt
The box plot method will be used to contrast the distributions further:
ab_assign_box_plt = (
    ggplot(test_analytics_expanded,
           aes(x = 'ab_group', y = 'sessions',
               fill = 'ab_group', colour = 'ab_group')) +
    geom_boxplot(alpha = 0.8) +
    labs(y = 'Count', x = '') +
    theme(legend_position = 'none',
          axis_text_y = element_text(rotation=0, hjust=1, size = 12),
          legend_title = element_blank()
          )
)
ab_assign_box_plt.save(filename = 'images/4_ab_test_box_plt.png',
                       height=5, width=8, units = 'in', dpi=1000)
ab_assign_box_plt
Figure 7-6 shows ab_assign_box_plt, which is a box plot comparison of the control
and test groups.
The control group has many more outliers, but the test group has far fewer zeros than the control group.
The scales make this hard to distinguish, so we’ll take a logarithm of the session scale
to visualize this further:
ab_assign_log_box_plt = (
    ggplot(test_analytics_expanded,
           aes(x = 'ab_group', y = 'sessions',
               fill = 'ab_group', colour = 'ab_group')) +
    geom_boxplot(alpha = 0.6) +
    labs(y = 'Count', x = '') +
    scale_y_log10() +
    theme(legend_position = 'none',
          axis_text_y = element_text(rotation=0, hjust=1, size = 12),
          legend_title = element_blank()
          )
)
ab_assign_log_box_plt.save(filename = 'images/4_ab_assign_log_box_plt.png',
                           height=5, width=8, units = 'in', dpi=1000)
ab_assign_log_box_plt
In all cases, we can see that the average sessions are close to zero, and on any given day there are many landing pages with zero sessions, which indicates that sessions are a rare event. This type of count distribution is known as "Poisson." Because counts like these also tend to be overdispersed (the variance is much larger than the mean, as the many outliers suggest), we'll use a negative binomial distribution, a generalization of the Poisson, to test the differences between test and control for significance.
First, we’ll mark up the data as being test (1.0) or control (0.0), then convert it to
an array:
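A minimal sketch of that markup step (the column name test and the use of statsmodels' add_constant are assumptions, consistent with the two-column array shown below):
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.discrete_model import NegativeBinomial

# assumed markup: 1.0 for test, 0.0 for control
test_analytics_expand['test'] = np.where(
    test_analytics_expand['ab_group'] == 'test', 1.0, 0.0)
# add_constant prepends the intercept column of 1s
X = np.asarray(sm.add_constant(test_analytics_expand['test']))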
array([[1., 0.],
[1., 0.],
[1., 0.],
...,
[1., 1.],
[1., 1.],
[1., 1.]])
ab_model = NegativeBinomial(test_analytics_expand['sessions'], X).fit()
ab_model.summary()
From the preceding result, we can conclude that the change was indeed significant. The test group (shown by x1) exhibited 2.24 more pageviews on average compared to the control.
In terms of significance, the LLR p-value is zero, so the chances of the difference occurring due to random noise are incredibly slim.
Interestingly, the pseudo R-squared, which measures the extent to which ab_group alone can explain sessions, is very low at 0.029. The model is therefore very noisy, and many more factors beyond ab_group would be needed to predict traffic levels.
• You need a different time period: despite meeting the sample size requirements, the result could be down to seasonal effects such as the time of year, the sample requirement being fulfilled before a full week has run, or Google not being given a chance to process the changes (see the previous discussion).
By setting your hypothesis in the first instance, regardless of the outcome, you will
have learned something, and you will be able to move forward with a sensible plan, be it
your next test or a sitewide implementation of your test.
Summary
Experiments have always been a part of the SEO expert’s skill in determining what tactics
are likely to work, even if sometimes the scientific rigor is missing. In the enterprise
setting, a rigorous experiment design is essential due to the impact on revenue and the
need to prove recommendations are beneficial, before rolling out changes sitewide.
While there are tools that assist in this area, it is also useful to understand the data
science behind SEO split tests and the considerations that must be borne in mind. In this
chapter, we covered
• Generating hypotheses
• Experiment design
In the next chapter, we will cover SEO reporting in the form of dashboards.
CHAPTER 8
Dashboards
Although a performance dashboard system doesn't solve SEO problems directly in itself, the underlying infrastructure provides a very useful repository for data to support SEO science, as well as visuals that communicate useful trends, changes, threats, and opportunities.
Even more importantly, SEO is data rich: there are numerous data sources and a great many things you could possibly measure, so the picture can look very noisy and, at worst, be useless if you can't clearly see and get to the signal.
Having a performance reporting system that uses well-designed and well-thought-
out dashboards will help highlight the most important trends from the noisy data. It will
also be easier to identify causal effects.
We will be supplying some code, written in SQL, to help you understand how to
achieve some of the most valuable visuals.
Data Sources
The types of data sources you would want for your dashboard will be anything that (a) offers an API and (b) adds information that helps you understand your SEO performance more effectively. These may include (and this is by no means exhaustive)
• Social: BuzzSumo
Once you have your data sources, you'll move them through an extract, transform, and load (ETL) process:
• Extract is where you pull the data from its source, usually via API.
• Transform is where you clean and summarize the data.
• Load is the part where you load the transformed data into the data warehouse.
There are numerous configurations you can pursue depending on which cloud stack
you go with, your team’s cloud engineering skills, and your budget.
Extracting Data
The extract process will usually be automated: your APIs get queried on a daily basis (known as "polling") by a virtual machine running the script. The data gets stored either in cloud storage or a data warehouse.
If your cloud engineering skills are nonexistent, you can still upload data as CSV files to the data warehouse.
The following is some code to extract data from a number of APIs, covering the main Google products and some (though not all) of the more well-known SEO data sources:
• Google Analytics
• Google Search Console
• SERPs (via DataForSEO)
• PageSpeed
We’ll now provide Python code for you to connect to these APIs not just for reporting
purposes as this code can be adapted to support other SEO science activities covered in
other chapters.
Google Analytics
Traffic remains a key lever of growth, and Google Analytics is a widely used web analytics package. More organic search traffic will not always correlate directly with more revenue, but it may indicate engagement through other means.
The following code extracts data from the most well-known and widely used website analytics API, Google Analytics version 4 (GA4).
Import the API libraries:
import pandas as pd
from pathlib import Path
import os
from datetime import date, timedelta
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    RunReportRequest, DateRange, Dimension, Metric)
Set the file path of the credential keys which is a JSON file and obtainable from your
Google Cloud Platform account under API Libraries ➤ Credentials:
credentials_path = Path("keys/xxxxx.json")
credentials_path_str = str(credentials_path.absolute())
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_path_str
client = BetaAnalyticsDataClient()
Define a function to run an aggregated report which will require a date range and the
property ID of the GA4 account.
In this function, we query the API using the inputs to build the request which
includes the metrics we want and store the API response.
We’ve set the dimension as landingPage because that’s how we want the traffic
numbers broken down. Other dimensions may be used which are listed here (https://
developers.google.com/analytics/devguides/reporting/data/v1/api-schema).
def aggregated_run_report(client, property_id=property_id, date_ranges=date_ranges):
    # property_id and date_ranges defaults are set earlier in the script
    request = RunReportRequest(
        property=f"properties/{property_id}",
        dimensions=[Dimension(name="landingPage")],
        metrics=[
            Metric(name="activeUsers"),
            Metric(name="screenPageViewsPerSession"),
            Metric(name="bounceRate"),
            Metric(name="averageSessionDuration"),
            Metric(name="userConversionRate"),
            Metric(name="ecommercePurchases"),
        ],
        date_ranges=date_ranges,
    )
    response = client.run_report(request)
    return response

response = aggregated_run_report(client)
print("Report result:")
for row in response.rows:
print(row.dimension_values[0].value, row.metric_values[0].value)
Report result:
/ 11347
/blog/sell-airtime-over-charged-your-line-dont-panic 8423
/faq 4870
/blog/sell-airtime-over-charged-your-line-dont-panic 2355
/privacy 1338
The next function uses the API response result rows and packages it into a single
dataframe:
def ga4_response_to_df(response):
    dim_len = len(response.dimension_headers)
    metric_len = len(response.metric_headers)
    all_data = []
    for row in response.rows:
        row_data = {}
        for i in range(dim_len):
            row_data.update({response.dimension_headers[i].name:
                             row.dimension_values[i].value})
        for i in range(metric_len):
            row_data.update({response.metric_headers[i].name:
                             row.metric_values[i].value})
        all_data.append(row_data)
    df = pd.DataFrame(all_data)
    return df
df = ga4_response_to_df(response)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ ------------ -----
0 landingPage 418 non-null object
1 dateRange 418 non-null object
2 activeUsers 418 non-null object
3 screenPageViewsPerSession 418 non-null object
4 bounceRate 418 non-null object
5 averageSessionDuration 418 non-null object
6 userConversionRate 418 non-null object
7 ecommercePurchases 418 non-null object
dtypes: object(8)
memory usage: 26.2+ KB
Printing the dataframe’s properties via df.info() tells us the data types which all
appear to be strings, which is okay for the landing page but not for metrics, such as
activeUsers, which should be converted to numeric before the data can be processed
further.
df.head()
The following resulting dataframe shows the dimensions “landingPage” along with
the metrics. The data is aggregated across the entire date range.
But suppose you wanted the data broken down by date as well.
The following function will do just that, with the default number-of-days parameter set to two years:
def dated_run_report_to_df(client, property_id=property_id, n_days=730):
    date_ranges = []
    count = 0
    df_output = pd.DataFrame()
    for i in range(n_days):
        day = date.today() - timedelta(days=i + 1)
        # the GA4 API accepts up to four date ranges per request,
        # so single-day ranges are batched in fours
        date_ranges.append(DateRange(start_date=str(day), end_date=str(day)))
        count += 1
        if count == 4:
            response = aggregated_run_report(client,
                property_id=property_id, date_ranges=date_ranges)
            df = ga4_response_to_df(response)
            df_output = pd.concat([df_output, df], ignore_index=True)
            # Re-initialize
            count = 0
            date_ranges = []
    return df_output
Run the function; in this case, we’ll extract the last 90 days:
df = dated_run_report_to_df(client, n_days=90)
The following function converts the column data formats from str to their
appropriate formats which are mostly numeric:
def format_df(df):
    df["dateRange"] = pd.to_datetime(df["dateRange"])
    df["activeUsers"] = df["activeUsers"].astype("float")
    df["screenPageViewsPerSession"] = df["screenPageViewsPerSession"].astype("float")
    df["bounceRate"] = df["bounceRate"].astype("float")
    df["averageSessionDuration"] = df["averageSessionDuration"].astype("float")
    df["userConversionRate"] = df["userConversionRate"].astype("float")
    df["ecommercePurchases"] = df["ecommercePurchases"].astype("float")
    return df

df = format_df(df)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 376 entries, 0 to 375
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 landingPage 376 non-null object
1 dateRange 376 non-null datetime64[ns]
2 activeUsers 376 non-null float64
3 screenPageViewsPerSession 376 non-null float64
4 bounceRate 376 non-null float64
5 averageSessionDuration 376 non-null float64
6 userConversionRate 376 non-null float64
7 ecommercePurchases 376 non-null float64
df
SERPs
The next data source is the search results themselves, starting from a list of target keywords:
keywords_lst
['airtime app',
'airtime to cash app',
'airtime transfer',
'app to sell airtime',
'app to transfer airtime from one network to another',
'bet with airtime and win cash',
'buy airtime online',
'buy airtime with discount',
'buy recharge card online',
'buy recharge card online with debit card',
With this API, you’ll need your DataForSEO client file which resides in the same
folder as the Jupyter notebook script file running this code:
The API will need to know which country you’d like to see the search results for, the
device, and the language, which are defined as follows. The countries list may be found
here (https://docs.dataforseo.com/v3/serp/google/locations/?bash).
location = 2826
language = "en"
device_input = 'mobile'
The following are functions to query the API. set_post_data will set the parameters
for the search:
def set_post_data(search_query):
    post_data = dict()
    post_data[len(post_data)] = dict(
        language_code = language,
        location_code = location,
        device = device_input,
        keyword = search_query,
        calculate_rectangles = True)
    return post_data
The function get_api_result uses the preceding function to structure the input which
will be used to request the search results. The API result is stored in a variable named
“response.”
There is a try/except block in place so that, should there be an issue with the API call, the function carries on to the next keyword rather than holding up or stalling the entire operation:
def get_api_result(search_query):
    post_data = set_post_data(search_query)
    try:
        response = client.post("/v3/serp/google/organic/live/advanced",
                               post_data)
        return response
    except Exception as e:
        print(e)
        return None
With multiple keywords to be queried, we’ll want to call the function multiple times,
so we’ll do that using a for loop.
Initialize an empty dictionary to store the individual API results:
desktop_serps_returned = {}
Add a for loop to query the API for each and every keyword in the list:
i = 0
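A sketch of that loop (assumed; it calls get_api_result for each keyword and keeps the non-empty responses):
for kw in keywords_lst:
    i += 1
    result = get_api_result(kw)
    if result is not None:
        desktop_serps_returned[kw] = result
    print(i, 'of', len(keywords_lst), ':', kw)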
Printing the entire output in this book and in the Jupyter notebook would be impractical. Instead, we'll print the keys of the dictionary where the data is stored, which shows the keywords that have API data:
desktop_serps_returned.keys()
With the data stored, the dictionary requires unpacking into a dataframe format,
which will be carried out as follows.
Initialize an empty list:
desktop_serps_flat_df = []
Using a for loop, we’ll iterate through the dictionary keys which will be used to select
parts of the dictionary by keyword. Then we loop through the contents of the dictionary
data for that keyword and add these to the empty list initialized earlier:
# The enclosing loop headers below are a reconstruction, following the
# DataForSEO response layout of tasks -> result -> items:
for keyword in desktop_serps_returned.keys():
    serp_response = desktop_serps_returned[keyword]
    cost = serp_response['cost']
    for task in serp_response['tasks']:
        task_id = task['id']
        se = task['data']['se']
        device = task['data']['device']
        os = task['data']['os']
        for res in task['result']:
            for item in res['items']:
                desktop_serps_flat_df.append(
                    (cost, task_id, se, device, os, res['keyword'],
                     res['location_code'], res['language_code'],
                     res['se_results_count'], res['type'], res['se_domain'],
                     res['check_url'], item['rank_group'], item['rank_absolute'],
                     item.get('url', None), item.get('domain'), item.get('is_image'),
                     item.get('is_featured_snippet'), item.get('is_video'),
                     item.get('is_malicious'), item.get('is_web_story'),
                     item.get('description'), item.get('pre_snippet'),
                     item.get('amp_version'), item.get('rating'), item.get('price'),
                     item.get('highlighted'), item.get('links'), item.get('faq'),
                     item.get('extended_people_also_search'),
                     item.get('timestamp'), item.get('rectangle'),
                     res['datetime'], item.get('title'), item.get('cache_url'))
                )
Once the list has all the added keyword SERP data, it is converted into a dataframe:
desktop_full_df = pd.DataFrame(
    desktop_serps_flat_df,
    columns=[
        'cost', 'task_id', 'se', 'device', 'os', 'keyword',
        'location_code', 'language_code', 'se_results_count',
        'type', 'se_domain', 'check_url', 'rank_group',
        'rank_absolute', 'url', 'domain', 'is_image', 'is_featured_snippet',
        'is_video', 'is_malicious', 'is_web_story',
        'description', 'pre_snippet', 'amp_version',
        'rating', 'price', 'highlighted', 'links',
        'faq', 'extended_people_also_search', 'timestamp', 'rectangle',
        'datetime', 'title', 'cache_url'
    ]
)
desktop_full_df.head(2)
The result is the API data in a dataframe which is ready for reporting or prereporting
transformation.
Google Search Console
import datetime
import httplib2
import re
import time
import pandas as pd
import numpy as np
from collections import defaultdict
from googleapiclient.errors import HttpError
The script is constructed to allow you to query multiple domains, which could be useful for an agency reporting system where you look after more than one client or, if you're client side, multiple sites:
site_list = ['https://www.babywishiest.com']
site = 'https://www.babywishiest.com'
client_name = 'babywishiest'
The dimensions will give a breakdown of the data, while no dimensions will return
summary data for the date range:
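For example (an assumed setting; any of the GSC dimensions such as page, query, date, device, or country can be listed):
dimensions = ['page']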
To filter to a device, enter MOBILE, DESKTOP, or TABLET or leave it blank for all
devices:
device_filter = ''
To filter to a search type, enter WEB, IMAGE, VIDEO, or discover. This defaults to
WEB if left blank:
search_filter = ''
To filter to a specific country, use its three-digit country code (e.g., FRA). A list of country codes is available here (https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3). If left blank, the API will default to all:
country_filter = ''
To filter pages which contain a string, you can use operators such as “equals,”
“contains,” “notContains,” or “notEquals”:
page_filter_string = ''
page_filter_operator = 'equals'
query_filter_string = ''
query_filter_operator = ''
State your date range for the query, which will be converted into datetime format:
start_date = '2022-08-01'
end_date = '2022-11-30'
Enter a date grouping to break down the data. Use D (day), W (week), M (month), or
A (all):
date_grouping = 'A'
Enter API credentials obtainable from the APIs section of your GCP project:
CLIENT_ID = 'xxxxxxx'
CLIENT_SECRET = 'xxxxxx'
Add sleep time between requests. Increase this if you are hitting limits or
getting errors:
sleep_time = 10
2022-08-01 2022-11-30
With the parameters specified, the next block deals with authentication using the
OAuth method:
OAUTH_SCOPE = 'https://www.googleapis.com/auth/webmasters.readonly'
REDIRECT_URI = 'urn:ietf:wg:oauth:2.0:oob'
http = httplib2.Http()
http = credentials.authorize(http)
Once authenticated, set the custom number of rows to retrieve from the API per
request:
row_limit = 25000
output = pd.DataFrame()
request = {
    'rowLimit': row_limit,
    'startRow': 0
}
if dimensions:
    request['dimensions'] = dimensions
if search_filter:
    request['searchFilterGroups'] = [{'filters': [{'dimension': 'search',
                                                   'expression': search_filter}]}]
dimension_filters = []
if device_filter:
    dimension_filters.append({'dimension': 'device',
                              'expression': device_filter})
if country_filter:
    dimension_filters.append({'dimension': 'country',
                              'expression': country_filter})
if page_filter_string:
    dimension_filters.append({'dimension': 'page',
                              'expression': page_filter_string,
                              'operator': page_filter_operator})
if query_filter_string:
    dimension_filters.append({'dimension': 'query',
                              'expression': query_filter_string,
                              'operator': query_filter_operator})
request['dimensionFilterGroups'] = [{'filters': dimension_filters}]
print(f'Filter: {dimension_filters}')
Loop through all the dates from start to end, inclusive and populate the request start
and end dates with the date from the loop:
request['startDate'] = f"{single_date[0].strftime('%Y')}-{single_date[0].strftime('%m')}-{single_date[0].strftime('%d')}"
request['endDate'] = f"{single_date[1].strftime('%Y')}-{single_date[1].strftime('%m')}-{single_date[1].strftime('%d')}"

run = True
rowstart = 0
request['startRow'] = rowstart
while run:
    try:
        response_page = execute_request(webmasters_service, site, request)
        scDict_results = defaultdict(list)
        for row in response_page['rows']:
            if dimensions:
                for i, dimension in enumerate(dimensions):
                    scDict_results[dimension].append(row['keys'][i] or 0)
            scDict_results['clicks'].append(row['clicks'] or 0)
            scDict_results['ctr'].append(row['ctr'] or 0)
            scDict_results['impressions'].append(row['impressions'] or 0)
            scDict_results['position'].append(row['position'] or 0)
        time.sleep(sleep_time)
    except HttpError:
        print('Got an error. Retrying in 1m.')
        time.sleep(60)
Filter: []
https://www.babywishiest.com - 2022-08-01 to 2022-11-30
1672 results
output
Although it’s a large block of code, the API can be used to extract 100,000 rows of
data if not much more.
PageSpeed
Next, we'll take the top-ranking URLs from the SERP data and query the PageSpeed API for their Core Web Vitals (CWV):
desktop_serps_urls = ['https://pay.jumia.com.ng/services/airtime',
'https://pay.jumia.com.ng/',
'https://vtpass.com/',
'https://www.gloverapp.co/products/airtime-to-cash',
'https://www.zoranga.com/',
'https://airtimeflip.com/',
'https://www.tingtel.com/blog/sell-airtime-over-charged-your-line-
dont-panic',
'https://vtpass.com/payment',
'https://pay.jumia.com.ng/services/mobile-data/mtn-mobile-data', ...]
"https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url=[test-url]&
key=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
base_url = 'https://www.googleapis.com/pagespeedonline/v5/
runPagespeed?url='
strategy = '&strategy=desktop'
api_url = '&key=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
Initialize an empty dictionary to store the data and a counter to keep track of the
number of URLs being queried:
desktop_cwv = {}
i = 0
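A sketch of the query loop (assumed; the requests library calls the endpoint assembled from the three parts above):
import requests

for url in desktop_serps_urls:
    i += 1
    api_request = base_url + url + strategy + api_url
    desktop_cwv[url] = requests.get(api_request).json()
    print(i, 'of', len(desktop_serps_urls), url)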
desktop_cwv.keys()
dict_keys(['https://pay.jumia.com.ng/services/airtime', 'https://pay.jumia.com.ng/', 'https://vtpass.com/', 'https://www.tingtel.com/blog/buy-airtime-get-discount-on-every-airtime-recharge', 'https://www.gloverapp.co/products/airtime-to-cash', 'https://www.zoranga.com/', 'https://airtimeflip.com/', 'https://www.tingtel.com/blog/sell-airtime-over-charged-your-line-dont-panic', 'https://www.tingtel.com/', ...])
Iterate through the PageSpeed API JSON Response dictionary, starting with an
empty list:
desktop_psi_lst = []
Loop through the dictionary by key to extract the different CWV metrics and store
them in the list:
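A sketch of that unpacking (the three metrics picked out here are assumptions; the paths follow the PageSpeed v5 response layout of lighthouseResult ➤ audits):
for url, psi in desktop_cwv.items():
    audits = psi['lighthouseResult']['audits']
    desktop_psi_lst.append(
        (url,
         audits['largest-contentful-paint']['numericValue'],
         audits['cumulative-layout-shift']['numericValue'],
         audits['total-blocking-time']['numericValue']))

desktop_psi_df = pd.DataFrame(desktop_psi_lst,
                              columns=['url', 'lcp', 'cls', 'tbt'])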
The result is a dataframe showing all the CWV scores for each URL.
Transforming Data
The purpose of transforming the data, extracted via API or other means from your data source, is to
• Clean it up and derive further calculated metrics
The code is going to continue from the Google Analytics (GA) data extracted earlier
where we will cover the preceding points.
We start by copying the GA dataframe:
df_clean = df.copy()
df_clean['averageSessionDuration'] = (df_clean['averageSessionDuration'] / 60).round(1)
Create new columns for easier transformation based on time and calendar
date units:
df_clean['month'] = df_clean['dateRange'].dt.strftime('%m')
df_clean['year'] = df_clean['dateRange'].dt.strftime('%Y')
df_clean['month_year'] = df_clean['dateRange'].dt.strftime('%Y-%m')
df_clean
With the data formatted, we can start transforming to derive new columns of trend
data such as
• Averages
Let’s make a copy and rename it to reflect that we’re aggregating by landing page and
by month:
ga4_lp_agg_month = df_clean.copy()
We’ll create some basic summary statistics which will be the average (“mean”) and
total (“sum”) of various GA metrics using the groupby() and agg() functions:
ga4_lp_agg_month_basic = ga4_lp_agg_month.groupby(['landingPage', 'month_year']).agg({
    'activeUsers': 'sum',
    'screenPageViewsPerSession': 'mean',
    'bounceRate': 'mean',
    'averageSessionDuration': 'mean',
    'userConversionRate': 'mean',
    'ecommercePurchases': 'sum'
}).reset_index()
ga4_lp_agg_month_basic
The metrics shown earlier are summarized by landing page and month_year, which can be used to feed a basic SEO dashboard reporting system.
In almost all cases, we track the mean and the standard deviation. The average gives us a useful indicator of where a channel is in terms of performance, as it indicates where most data points were (or will be) for a given category of data at a given point in time.
Averages, as every statistician (and anyone statistically aware) will tell you, can be dangerous on their own when making inferences or even decisions. This is why we also track the standard deviation: it tells us something about the variation of a given metric, that is, how consistent it is.
In practical terms, the standard deviation tells us how close the data points are to the average. And what can we deduce from this?
We can deduce which averages we're more likely to trust or rely on when comparing between months. So the standard deviation tells us a bit about the quality of the averages, for the purposes of confidence in the data and of comparisons, and also how the metric we're tracking behaves over time.
For example, you might find that the standard deviation is increasing or decreasing and should therefore try to understand the reason behind it. Could it be
• Changes in user behavior, the search intent of the query, your brand positioning, or the market?
Tracking the standard deviation could help you see whether something is afoot for
the better or worse:
ga4_lp_agg_month = df_clean.copy()
ga4_lp_agg_month_mean = ga4_lp_agg_month.groupby(['landingPage', 'month_year']).agg({
    'activeUsers': 'mean',
}).reset_index()
ga4_lp_agg_month_mean = ga4_lp_agg_month_mean.rename(
    columns = {'activeUsers': 'activeUsers_avg'})
While “mean” calculates the average, “std” calculates the standard deviation:
ga4_lp_agg_month_std = ga4_lp_agg_month.groupby(['landingPage', 'month_year']).agg({
    'activeUsers': 'std',
    'bounceRate': 'std',
}).reset_index()
ga4_lp_agg_month_std = ga4_lp_agg_month_std.rename(
    columns = {'activeUsers': 'activeUsers_std',
               'bounceRate': 'bounceRate_std'})
ga4_lp_agg_month_stats = ga4_lp_agg_month_basic.merge(ga4_lp_agg_month_mean,
    on = ['landingPage', 'month_year'], how = 'left')
ga4_lp_agg_month_stats = ga4_lp_agg_month_stats.merge(ga4_lp_agg_month_std,
    on = ['landingPage', 'month_year'], how = 'left')
ga4_lp_agg_month_stats.head()
ga4_lp_agg_month_moms = ga4_lp_agg_month_stats.sort_values(['landingPage', 'month_year'])
We’re just calculating the monthly stats for activeUsers and bounceRate; however,
you can use the same methods on all of the other metric columns.
First, we start by calculating the absolute change from the current row to the
previous row (the previous month) using the shift() function.
Note that “1” was entered as a parameter to the shift() function, which means 1 row.
If you wanted to calculate the year-on-year difference, then you would enter “12” (i.e.,
.shift(12)), which would look at the value 12 rows before.
# Note: the rows are sorted by landingPage then month_year, so with multiple
# landing pages the shift should be applied within each landingPage group
# (e.g., .groupby('landingPage')['activeUsers'].shift(1)) to avoid comparing
# across different pages at the group boundaries.
ga4_lp_agg_month_moms['activeUsers_delta'] = (ga4_lp_agg_month_moms['activeUsers']
    - ga4_lp_agg_month_moms['activeUsers'].shift(1))
ga4_lp_agg_month_moms['activeUsers_mom'] = ((ga4_lp_agg_month_moms['activeUsers_delta']
    / ga4_lp_agg_month_moms['activeUsers'].shift(1)) * 100).round(1)
ga4_lp_agg_month_moms['bounceRate_delta'] = (ga4_lp_agg_month_moms['bounceRate']
    - ga4_lp_agg_month_moms['bounceRate'].shift(1))
ga4_lp_agg_month_moms['bounceRate_mom'] = ((ga4_lp_agg_month_moms['bounceRate_delta']
    / ga4_lp_agg_month_moms['bounceRate'].shift(1)) * 100).round(1)
ga4_lp_agg_month_moms
The delta and month-on-month columns are added. Note the NaN in the first row, where no previous row existed for the shift() function to reference. To overwrite NaNs with zero, you could use the np.where() function together with .isnull().
An alternative approach would be to use a custom function that avoids having to order the rows. However, this could be computationally more expensive to run in the cloud if you're planning to automate this as an all-encompassing SEO data warehouse dashboard reporting system.
Once done, you’re ready to upload to your data warehouse of choice.
Loading Data
As mentioned earlier in the chapter, loading involves moving the transformed data into
the data warehouse. Once uploaded, it’s a good idea to check your data schema and
preview what you’ve uploaded.
The following SQL will produce user trends by month and channel:
select yearMonth, channel, sum(users) as users
from (
select yearMonth, 'seo' as channel, users_sum as users
from google_analytics.multichannel_ga_monthly
where channel = 'Organic Traffic'
and
DATE_DIFF(CURRENT_DATE(), PARSE_DATE('%Y-%m-%d',
CONCAT(SUBSTR(CAST(yearMonth AS STRING), 1, 4),"-",
SUBSTR(CAST(yearMonth AS STRING), 5, 2),"-",'01')), MONTH) <= 12
union all
select yearMonth, 'non_seo' as channel, all_users - organic as users
from (
SELECT yearMonth
, MAX(IF(channel = 'Organic Traffic', users_sum, 0)) organic
, MAX(IF(channel = 'All Users', users_sum, 0)) all_users
from google_analytics.multichannel_ga_monthly
where channel in ('Organic Traffic', 'All Users')
group by yearMonth
)
)
where
DATE_DIFF(CURRENT_DATE(), PARSE_DATE('%Y-%m-%d',
CONCAT(SUBSTR(CAST(yearMonth AS STRING), 1, 4),"-",
SUBSTR(CAST(yearMonth AS STRING), 5, 2),"-",'01')), MONTH) <= 12
group by yearMonth, channel
order by yearMonth LIMIT 100;
The following SQL will produce user traffic stats by month with year-on-year:
select
yearMonth,
CASE
WHEN channel = 'All Users' THEN "non_seo"
ELSE channel
END as channel,
users_yoy
from google_analytics.multichannel_ga_monthly
where
channel in ("All Users", "Organic Traffic")
and
DATE_DIFF(CURRENT_DATE(), PARSE_DATE('%Y-%m-%d',
CONCAT(SUBSTR(CAST(yearMonth AS STRING), 1, 4),"-", SUBSTR(CAST(yearMonth
AS STRING), 5, 2),"-",'01')), MONTH) <= 12
order by yearMonth desc LIMIT 100;
The following SQL will produce year-on-year user traffic stats by month with this
year vs. last year, for the months year to date:
SELECT yearMonth
, year
, SUBSTR(CAST(yearMonth AS STRING), 5, 2) as mon_x
, users_sum
from google_analytics.multichannel_ga_monthly
where
channel in ("Organic Traffic")
order by mon_x, year desc;
Visualization
If you’re satisfied with the SQL results, you can use the same queries and visualize these
in your front end such as Looker Studio, Tableau, etc., as shown in the following. How
does organic search compare to other channels? By volume (top) and YoY (bottom) over
time, shown in the Looker Studio graph (Figure 8-2).
Figure 8-2. Looker Studio graph showing organic vs. non-SEO channels over the
last 12 months
How does organic search compare to other channels year-on-year? The Looker
Studio graph in Figure 8-3 visualizes the SQL statement which calculates the year-on-
year traffic numbers for both organic and non-SEO channels. This is useful for seeing
how well the SEO is performing for the time of the year (i.e., independent of seasonality).
It also gives some measure of how SEO has performed relative to non-SEO channels for
the same period.
Figure 8-3. Looker Studio bar chart of year-on-year traffic numbers for both
organic and non-SEO channels by month
How do organic search users compare to last year? The Looker Studio graph in
Figure 8-4 shows this year and last year traffic numbers for organic traffic only. This is
useful for comparing this year’s SEO performance vs. last year’s SEO performance in
isolation.
Figure 8-4. Looker Studio graph showing this year and last year traffic numbers
for organic traffic only
Automation
Naturally, this can all be automated; you just need a team of competent cloud software engineers to automate the extract, transform, and load steps end to end. The result is simple, although the execution in reality is far more complicated and relies on cloud engineering skills.
Summary
When putting dashboards together, it’s important to begin with the end in mind
and think about what the purpose of the dashboard is and who it is for (that is your
audience). Once you know the outputs, then work backward.
Dashboards are driven by the data, so you'll need to consider which data sources you'll need. It's seldom a good idea to plug raw data straight into a front end like Looker Studio, as it's likely to overwhelm the front end and load slowly or crash. Instead, you'll want to summarize the data into meaningful trends.
Extract, transform, and load (ETL) is the process of automating the data collection,
summarizing the data, and then loading it into a system. We provided code to help you
• See what the data could look like when loaded into Looker Studio
CHAPTER 9
Site Migration Planning
• String manipulation
While these techniques will speed up the processing of data for a site migration, they
can easily be applied to other use cases.
import re
import time
import random
import pandas as pd
import numpy as np
import datetime
from textdistance import sorensen_dice
from textdistance import jaccard
pd.set_option('display.max_colwidth', None)
The data comes from a spreadsheet which is a representation of the site taxonomy or
hierarchy, that is, folders and subfolders with the site levels organized in columns:
hierarchy_raw = pd.read_csv('data/saga_hierarchy.csv')
hierarchy_raw
In the preceding table, we can see how the spreadsheet looks: the numbers across the top denote the site levels, and each row names a page (we'll call them nodes) alongside its immediate parent.
Let’s get the site levels for each of the parent nodes:
site_levels = pd.DataFrame(hierarchy_raw.unstack())
site_levels = site_levels.dropna().drop_duplicates()
With the site nodes defined, which will come in handy later, we’re going to find the
pairs of parent and child nodes.
Once done, the “pairs” list will be put into a new dataframe “parent_child_map.”
Let’s iterate by row to pick pairs:
pairs = []

def parent_child_nodes(row):
    data = row.values.tolist()
    data = [e for e in data if str(e) not in ('nan')]
    print(data)
    pairs.append(data)

def apply_pcn(df):
    return df.apply(
        lambda row: parent_child_nodes(row),
        axis=1)

apply_pcn(hierarchy_raw)
Of course, if we want the full URL path, we need to process the data further using a
copy of hierarchy_raw. Start with a downward fill of the first column for the home page
and then populate the cell should the adjacent right cell not be blank (checked using the
function has_data_right):
hierarchy_fp = hierarchy_raw.copy()
Here’s the function to check for cells on the right to see if populated with data
or NANs:
def has_data_right(idx):
    return hierarchy_fp[hierarchy_fp.columns[idx:]].notnull().apply(any, axis=1)

for c in hierarchy_fp.columns[1:]:
    hierarchy_fp.loc[has_data_right(int(c)), c] = hierarchy_fp.loc[
        has_data_right(int(c)), c].ffill()

hierarchy_fp
The following shows the resulting hierarchy_fp dataframe with all the folder names
needed to construct a full path to the URL:
With this in mind, we can now iterate row by row in the dataframe to remove blanks
(NaNs) and join them with a forward slash (/):
min_fp_nonnan = hierarchy_fp
full_paths = []

def find_full_paths(row):
    data = row.values.tolist()
    data = [e for e in data if str(e) not in ('nan')]
    data = '/'.join(data)
    print(data)
    full_paths.append(data)

def apply_ffp(df):
    return df.apply(
        lambda row: find_full_paths(row),
        axis=1)

apply_ffp(min_fp_nonnan)
full_paths
Now that we have the full folder names joined, a bit of string formatting is required to
get them to resemble URL paths. This is what we’ll do here:
#full_paths
full_path_df = pd.DataFrame(full_paths, columns=['full_path'])
full_path_df['full_path'] = full_path_df.full_path.str.replace('Homepage/', '')
full_path_df['full_path'] = full_path_df.full_path.str.replace('Homepage', '')
full_path_df['full_path'] = full_path_df.full_path.str.replace(' ', '-')
full_path_df['full_path'] = full_path_df.full_path.str.replace('&', 'and')
full_path_df['full_path'] = full_path_df.full_path.str.lower()
full_path_df['full_path'] = target_root_url + full_path_df.full_path
full_path_df
The full URL path has now been constructed and pushed into a dataframe, so we can
now add this to the parent_child_map dataframe created earlier:
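A sketch of that step (assumed; parent_child_map holds the node pairs built from the pairs list, and a column-wise concat lines each pair up with its full path):
full_node_map = pd.concat([parent_child_map, full_path_df], axis=1)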
Now we have a table with the nodes and the full path:
target_node_map = full_node_map

def append_target_pairs(row):
    data = row.values.tolist()
    [target_roots.append(data[1]) for e in data if str(e) in target_roots]

def apply_atp(df):
    return df.apply(
        lambda row: append_target_pairs(row),
        axis=1)

apply_atp(target_node_map)
target_roots = list(set(target_roots))
target_roots
Having extracted the Travel nodes in target_roots, we can now filter for Travel URLs
only, starting with child nodes:
# Target children
stop_strings = ['insurance', 'breakdown-cover']
target_parent_nodes = target_node_map[~target_node_map.full_path.str.contains(
    '|'.join(stop_strings))]
target_parent_nodes
target_parent_nodes = target_node_map[target_node_map.child.isin(first_gen)]
target_parent_nodes
With the Travel site URLs successfully filtered, we will now join the site levels:
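A minimal sketch of that join (the left_on/right_on column names here are assumptions):
nodes_levelled = target_parent_nodes.merge(
    site_levels, left_on='child', right_on='node', how='left')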
The site levels were joined using Pandas merge, which is equivalent to Microsoft Excel's VLOOKUP function. Because the column names were different in the two tables, they had to be specified under left_on and right_on, as shown above.
Combining the site name with the parent and child nodes gives us the search strings we'll use to get SERP data:
serptool_queries = nodes_levelled['search_query'].to_list()
serptool_queries
Using the preceding data, these could be checked using your favorite SEO rank
checking tool API and then loaded into the notebook:
saga_serps = pd.read_csv('client_serps.csv')
saga_serps
Here, we have the SERPs loaded into the notebook, showing the keyword, rank position, and URL.
It is now time to extract the top-ranking URL by grouping the SERPs dataframe by keyword and then selecting the top-ranked URL (if it hasn't already been selected). The reason is that a single URL cannot be redirected to two different URLs simultaneously; hence, the "while" clause in the Python code checks whether the URL has already been used for a previous keyword:
serps_grp = saga_serps.groupby('keyword')
current_allocated = []

def filter_top_serp(df):
    del df['keyword']
    i = 0
    # walk down the ranked results until we find a URL not already allocated
    while i < len(df):
        if df.iloc[i]['url'] not in current_allocated:
            current_allocated.append(df.iloc[i]['url'])
            return df.iloc[i]
        i += 1

current_map = serps_grp.apply(filter_top_serp)
current_map_df = pd.concat([current_map], axis=0).reset_index()
del current_map_df['rank_absolute']
current_map_df['current_alloc'] = pd.DataFrame({'current_alloc': current_allocated})
current_map_df['current_url'] = np.where(current_map_df.current_url.isnull(),
                                         current_map_df.current_alloc,
                                         current_map_df.current_url)
del current_map_df['current_alloc']
current_map_df['search_query'] = current_map_df.search_query.str.lower()
current_map_df
The result is a dataframe with the current URLs which we can now join to the
main table:
nodes_levelled['search_query'] = nodes_levelled.search_query.str.lower()
ia_current_mapping = pd.merge(nodes_levelled, current_map_df,
                              on = 'search_query', how = 'left')
ia_current_mapping = ia_current_mapping[['parent', 'child', 'level',
                                         'current_url', 'full_path']]
ia_current_mapping
Here, we can see that neither this method nor Google is perfect. Nonetheless, it’s a
good start and saves a lot of manual work.
Let’s tidy the table up by renaming a few columns and replacing NaNs with blanks:
# rearrange columns
ia_current_mapping = ia_current_mapping[['parent', 'child', 'level',
                                         'search_query', 'current_url', 'full_path']]
ia_current_mapping = ia_current_mapping.rename(
    columns = {'full_path': 'migration_url'})
ia_current_mapping['current_url'] = np.where(ia_current_mapping.current_url.isnull(),
                                             '', ia_current_mapping.current_url)
ia_current_mapping
With the table tidied, we will use NLP methods to compare the string similarity of the current URL and the proposed migration URL:
ia_current_simi = ia_current_mapping
ia_current_simi = ia_current_simi.drop_duplicates()
ia_current_simi['simi'] = ia_current_simi.loc[
    :, ['current_url', 'migration_url']].apply(
        lambda x: sorensen_dice(*x), axis=1)
ia_current_simi
In the resulting table, you can see that the first five lines have a simi value of zero because the current URLs are blank, so of course there is zero string similarity between the proposed migration URL and the current URL.
The string similarity is helpful because when we review the migration URLs in a
spreadsheet app like Microsoft Excel, we can filter for URLs that are not very similar, for
instance, less than 0.9, which shows us current URLs that might not be a good match for
the migration URLs. Rows with missing current URLs will need to be manually fixed, and
the ones deduced from the SERPs will require a review.
import re
import time
import random
import pandas as pd
import numpy as np
import datetime
from textdistance import sorensen_dice
pd.set_option('display.max_colwidth', None)
import os.path
target_site_search = 'saga'
target_bu = 'travel'
target_roots = ['Holidays', 'Cruises', 'Travel Updates', 'Accessibility and
Support', 'Brochure Request', 'My Travel', 'Trade']
source_root_url = 'https://travel.saga.co.uk/'
migration_root_url = 'https://www.saga.co.uk/'
source_hostname = 'travel.saga.co.uk'
file_path = 'cases/'+ target_site_search + '/'
The imported mapping is an edited Excel file reflecting the business and operational requirements that weren't accounted for in the previous section.
With the URL structures set for the category and subcategory URLs, we’re now
going to break down the current URLs and the migration URLs, so that we can create a
mapping formula.
When we import the rest of the site URLs, the script will use their folder structure to
convert them to the new migration URL structure:
latest_mapping_full_branch['current_url'] = np.where(
    latest_mapping_full_branch.current_url.isnull(), '',
    latest_mapping_full_branch.current_url)
To create the new URL structures, create a template variable called “new_branch”;
we simply take the migration URLs and grab the folders between the root domain and
the web page URL string.
latest_mapping_full_branch['new_branch'] = latest_mapping_full_branch[
    'migration_url'].str.replace(migration_root_url, '', regex = False)
latest_mapping_full_branch['new_branch'] = latest_mapping_full_branch[
    'new_branch'].str.split('/').str[:-1]
Similar principles are applied to the following old_branch, which is the URL
structure for the current URLs:
latest_mapping_full_branch['old_branch'] = latest_mapping_full_branch[
    'current_url'].str.replace(migration_root_url, '', regex = False)
To make the new URLs more evergreen and thus without dates, we stick all of the
undesirables into a list and tell Python to remove everything from the list:
latest_mapping_full_branch['old_branch'] = latest_mapping_full_branch[
    'old_branch'].str.split('/').str[:-1]
latest_mapping_full_branch['old_branch'] = latest_mapping_full_branch[
    'old_branch'].apply(lambda x: '/'.join(map(str, x)))
The node is the URL string that is specific to the page itself, which we’re extracting as
follows:
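A sketch of that extraction, mirroring the crawl-mutation step later in the chapter:
latest_mapping_full_branch['node'] = latest_mapping_full_branch[
    'current_url'].str.split('/').str[-1]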
If any old_branch values are empty, then we just substitute the node. np.where is the
Pandas equivalent of Excel’s if statement:
latest_mapping_full_branch['old_branch'] = np.where(
    latest_mapping_full_branch['old_branch'] == '',
    latest_mapping_full_branch.node,
    latest_mapping_full_branch['old_branch'])
latest_mapping_full_branch
With URL structures broken down and reconstituted, we’re ready to put a mapping
together. A bit of cleanup happens as follows, where we drop duplicate rows and remove
blank rows with empty current_url values. We don’t expect any anomalies at this stage,
but just in case.
Importing the URLs
With the mapping in place, we’re ready to import the URLs and fit them to the new
migration structure:
We’re removing URLs and subfolders that won’t move as part of the site relaunch:
target_crawl_urls = target_crawl_raw
stop_folders = ['/membership/', '/magazine/', '/saga-charities/', '/legal/',
                '/money/', '/my/', '/care/', '/magazine-subscriptions/',
                '/membership', '/magazine', '/saga-charities', '/legal',
                '/money', '/my', '/care', '/magazine-subscriptions',
                'boardbasis', '/antiquity', '/pharaoh', '/orca',
                '/italy-splendour', '/walking', '/archaeology',
                '/gardens', '/music', '/MyS', '/404', '/contentli']
target_crawl_urls = target_crawl_urls[~target_crawl_urls['URL'].str.contains(
    '|'.join(stop_folders))]
target_crawl_urls = target_crawl_urls[target_crawl_urls['Host'].str.contains(
    'www.saga.co.uk')]
target_crawl_urls
The crawl data is in and now subsetted for the URLs we want to migrate. Next, we’re
sticking these into a list to ensure they are unique:
current_url_lst = latest_mapping_full_branch.current_url.to_list()
print(len(current_url_lst))
mapped_url_lst = list(set(current_url_lst))
print(len(mapped_url_lst))
469
258
target_crawl_unmigrated = target_crawl_urls[~target_crawl_urls['URL'].isin(mapped_url_lst)]
target_crawl_unmigrated
At this stage, it’s sensible to check if we have any redirects (300 responses) and other
non-“200” server status URLs:
We can see that all of the filtered URLs we’ve yet to migrate all serve live pages
(returning a 200 response). If we did have 301s, we could use the following code to
inspect those 301 URLs:
To handle redirecting URLs, we want to ensure they are included in the mapping so
that we can avoid redirect chains when migrating the site URLs. A redirect chain is when
there are multiple redirects between the initial URL requested and the final destination
URL. We’ll achieve this by ensuring these are listed as current URLs.
Mutate the old_branch:
target_crawl_mutate = target_crawl_unmigrated
target_crawl_mutate = target_crawl_mutate.rename(columns = {'url': 'current_url'})
redirect_conds = [
    target_crawl_mutate['http_status_code'].isin(['200', '204', '404', '410', '500']),
    target_crawl_mutate['http_status_code'].isin(['301', '302', '307', '308'])
]
desturl_values = [target_crawl_mutate['current_url'],
                  target_crawl_mutate['redirected_to_url'],
                  ]
Create a new column and use np.select to assign values to it using our lists as
arguments:
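A one-line sketch of that step (assigning back to current_url is an assumption):
target_crawl_mutate['current_url'] = np.select(redirect_conds, desturl_values)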
With the redirects now dealt with, the following code will break the URLs down into structures ready for mapping, using the table created in earlier steps:
target_crawl_mutate['old_branch'] = target_crawl_mutate['current_url'].str.replace(
    source_root_url, '', regex = False)
target_crawl_mutate['node'] = target_crawl_mutate['current_url'].str.split('/').str[-1]  # node only
target_crawl_mutate['node'] = target_crawl_mutate['node'].str.replace(
    '.aspx', '', regex = False)
target_crawl_mutate['node'] = target_crawl_mutate['node'].str.replace(
    ' ', '-', regex = False)
target_crawl_mutate['old_branch'] = target_crawl_mutate['old_branch'].str.split('/').str[:-1]
target_crawl_mutate['old_branch'] = target_crawl_mutate.old_branch.apply(
    lambda x: '/'.join(map(str, x)))
target_crawl_mutate
The following crawl data “target_crawl_mutate” now has old branches, which means
after a bit of cleanup, removing unnecessary column names, we can merge these with
the branch map created earlier to help formulate the migration URLs.
Let’s now look up a new branch as there may be old branches not quite covered in
the remaining URLs, that is, unmatched exceptions:
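A sketch of the lookup (assumed), left-joining the branch map built earlier onto the crawl data:
unallocated_branch = target_crawl_mutate.merge(
    latest_mapping_full_branch[['old_branch', 'new_branch']].drop_duplicates(),
    on='old_branch', how='left')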
The unmigrated URLs now have the suggested URL structure which can be used to
create a new column forming the suggested migration URL. We will start with a bit of
cleanup to handle blank new branch values:
allocated_fillnb = unallocated_branch
allocated_fillnb['new_branch'] = np.where(allocated_fillnb.new_branch.isnull(),
                                          '',
                                          allocated_fillnb.new_branch)
allocated_fillnb['new_branch'] = np.where(allocated_fillnb.new_branch == '',
                                          allocated_fillnb.old_branch,
                                          allocated_fillnb.new_branch)
allocated_fillnb = allocated_fillnb[allocated_fillnb.new_branch != '']
allocated_fillnb.sort_values('new_branch')
More cleanup ensues to handle URL nodes that contain parameter characters such
as “?” and “=”. Then we attempt to create columns showing their Parent and Child URL
node folders based on the text position within the overall URL string:
allocated_draft = allocated_fillnb
allocated_draft['node'] = np.where(allocated_draft['node'].str.contains('(\?|=)'),
                                   '',
                                   allocated_draft['node'])
allocated_draft['Parent'] = allocated_draft['new_branch']
allocated_draft['Parent'] = allocated_draft['Parent'].str.split('/').str[0]
allocated_draft['Child'] = allocated_draft['new_branch'].str.split('/').str[1]
allocated_draft['Child'] = np.where(allocated_draft['Child'].isnull(),
                                    allocated_draft['node'],
                                    allocated_draft['Child'])
allocated_draft
The preceding table now has the Parent and Child folders. At this stage, we're looking to ensure the new URL structure (new_branch) falls into one of the major sections of the new travel site before putting together the migration URLs.
Convert the root parent folder names to lowercase:
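A sketch of that conversion (assumed), URL-ifying the target_roots list defined at the top of the script:
target_roots_urled = [r.lower().replace(' ', '-') for r in target_roots]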
sorted_branches = []

def change_urls(row):
    data = row.values.tolist()
    #print(data)
    if not data[-2] in target_roots_urled:
        data[-3] = 'holidays/' + str(data[-3])
    sorted_branches.append(data)

def apply_cip(df):
    return df.apply(lambda row: change_urls(row), axis=1)

apply_cip(allocated_draft)
Any folders that didn’t have a parent node in the list printed earlier are allocated to
holidays. This should be right 90% of the time. Time to form the draft migration URL:
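A sketch of that step (assumed): rebuild the dataframe from the sorted_branches list, then concatenate domain, branch, and node:
allocated_drafted = pd.DataFrame(sorted_branches, columns=allocated_draft.columns)
allocated_drafted['migration_url'] = (migration_root_url
    + allocated_drafted['new_branch'].astype(str) + '/'
    + allocated_drafted['node'].astype(str))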
By concatenating the domain, new branch, and node, the migration URLs are now fully formed. A little more cleanup trims any trailing slashes and tidies the Parent and Child labels:
allocated_drafted['migration_url'] = np.where(
    allocated_drafted['migration_url'].str.endswith('/'),
    allocated_drafted['migration_url'].str[:-1],
    allocated_drafted['migration_url'])
allocated_drafted['Parent'] = allocated_drafted['Parent'].str.replace('-', ' ')
allocated_drafted['Parent'] = allocated_drafted['Parent'].str.title()
allocated_drafted['Child'] = allocated_drafted['Child'].str.replace('-', ' ')
allocated_drafted['Child'] = allocated_drafted['Child'].str.title()
Set to lowercase:
allocated_drafted['migration_url'] = allocated_drafted['migration_url'].str.lower()
allocated_drafted.columns = allocated_drafted.columns.str.lower()
pd.set_option('display.max_colwidth', 25)
allocated_drafted
To prepare the combining of the remaining URLs to the original latest mapping, we
need to add site levels and some basic checks such as removing duplicate current URLs
(after all, the same URL can’t be redirected to two or more different URLs).
The site level is calculated by counting the number of slashes in the migration URL
and subtracting one from it. This means the home page is one, and all other pages are
referenced from there.
allocated_distinct = allocated_drafted
allocated_distinct = allocated_distinct.drop_duplicates(subset = 'current_url')
allocated_distinct['migration_url'] = allocated_distinct['migration_url'].str.replace(
    '/holidays/cruises/', '/cruises/')
allocated_distinct['migration_url'] = allocated_distinct['migration_url'].str.replace(
    ' ', '-')
allocated_level = allocated_distinct
allocated_level['level'] = allocated_level.migration_url.str.count('/') - 1
allocated_level = allocated_level[['parent', 'child', 'level',
                                   'current_url', 'migration_url']]
pd.set_option('display.max_colwidth', 65)
allocated_level
With the columns now matching the original imported travel mapping, we’re ready
to combine:
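A sketch of the combine (the name latest_mapping for the originally imported travel mapping is an assumption):
total_mapping_simi = pd.concat([latest_mapping, allocated_level],
                               ignore_index=True)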
The rows are now combined. Now we will drop duplicate rows and calculate the
string similarity between the current URL and the migration URL, which will help us in
the manual review of the export file:
total_mapping_simi['simi'] = total_mapping_simi.loc[
    :, ['current_url', 'migration_url']].apply(
        lambda x: sorensen_dice(*x), axis=1)
total_mapping_simi
We now have the migration mapping ready to review in Excel. Note that the row count has reduced, as duplicate current URLs have been eliminated. We've also added a new column, simi, to help flag any URLs whose migration URL is less than 75% similar to its current URL counterpart. Although not foolproof, this provides a quick way of finding and sorting anomalies.
Migration planning can inspire dread in a lot of SEOs. AI and data science have
yet to advance far enough to semiautomate, let alone fully automate, most of the
site migration planning process.
Much of that advance will depend on the NLP models available to the SEO industry
being able to reliably understand, reduce, and map existing content URLs to
new URLs.
The next section will now address troubleshooting traffic losses following a site
migration.
Migration Forensics
At this point, we’re here to work out what changed and which content was affected
following a migration. We’ll be taking the following steps:
1. Traffic trends
6. Diagnose
As usual, we start by importing our libraries. You'll notice that these include
string distance functions from textdistance and timedelta to help us work with
time series data, as well as the plotting and change point detection packages
used later in this section:

import pandas as pd
import numpy as np
from textdistance import sorensen_dice
from datetime import timedelta
from plotnine import *
import matplotlib.pyplot as plt
import ruptures as rpt

root_url = 'https://www.saasforecom.com'
root_domain = 'saasforecom.com'
hostname = 'saasforecom'
Traffic Trends
With the libraries imported and the variables set, we’ll import the data from Google
Analytics (GA). We use GA because it gives us a breakdown by date that is not easily
found in Google Search Console (GSC) without substantial sampling.
Here, we’re getting rid of the rows which are not part of the main table that is default
in GA tabular exports:
ga_orgdatelp_raw.columns = ga_orgdatelp_raw.columns.str.lower()
ga_orgdatelp_raw.columns = ga_orgdatelp_raw.columns.str.replace('/', '', regex = False)
ga_orgdatelp_raw.columns = ga_orgdatelp_raw.columns.str.replace('.', '', regex = False)
ga_orgdatelp_raw.columns = ga_orgdatelp_raw.columns.str.replace('% ', '', regex = False)
ga_orgdatelp_raw.columns = ga_orgdatelp_raw.columns.str.replace('  ', ' ', regex = False)
ga_orgdatelp_raw.columns = ga_orgdatelp_raw.columns.str.replace(' ', '_', regex = False)
With the data imported and the column names cleaned and nicely formatted, we’ll
get to work on cleaning the actual data inside the columns themselves.
This again will make use of string operations to remove special characters and cast
the data type as a number as opposed to a string.
Clean the GA data:
ga_clean = ga_orgdatelp_raw
ga_clean['ecommerce_conversion_rate'] = ga_clean.ecommerce_conversion_rate.astype(float)
ga_clean['revenue'] = ga_clean.revenue.str.replace('$', '', regex = False)
ga_clean['revenue'] = ga_clean.revenue.astype(float)
ga_clean['avg_session_duration'] = ga_clean.avg_session_duration.str.replace('<', '', regex = False)
ga_clean['avg_session_duration'] = pd.to_timedelta(ga_clean.avg_session_duration).astype(int) / 1e9
ga_clean

ga_stats = ga_clean
We select the columns we actually want. You may have noticed that some columns
were cleaned up and ended up not being used. This may seem like a waste of effort;
however, you don’t always know what you will need or for what purpose. So cleaning
columns is a good standard practice so that the data is ready should you discover that
you need it later on.
Import GSC Pages data to grab all of the unique URLs for crawling:
all_gsc_raw = pd.read_csv('data/throughout_Pages.csv')
all_gsc_raw.columns = all_gsc_raw.columns.str.lower()
all_gsc_raw.columns = all_gsc_raw.columns.str.replace('/', '', regex = False)
all_gsc_raw.columns = all_gsc_raw.columns.str.replace('.', '', regex = False)
all_gsc_raw.columns = all_gsc_raw.columns.str.replace('% ', '', regex = False)
all_gsc_raw.columns = all_gsc_raw.columns.str.replace('  ', ' ', regex = False)
all_gsc_raw.columns = all_gsc_raw.columns.str.replace(' ', '_', regex = False)
all_gsc_raw['ctr'] = all_gsc_raw.ctr.str.replace('%', '', regex = False)
print(all_gsc_raw.head())

all_gsc_urls = all_gsc_raw[['top_pages']]
all_gsc_urls = all_gsc_urls.rename(columns = {'top_pages': 'url'}).drop_duplicates()
all_gsc_urls
The GA URLs will also be extracted and joined to the domain ready for crawling:
ga_raw_urls = ga_raw_comb[['landing_page']]
ga_raw_urls = ga_raw_urls.rename(columns = {'landing_page': 'url'}).drop_duplicates()
ga_raw_urls['url'] = root_url + ga_raw_urls['url']
ga_raw_urls
Combine the GA and GSC URLs, dropping duplicates, ready for crawling:
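The combining code isn't printed in this excerpt; a minimal sketch, assuming the all_gsc_urls and ga_raw_urls dataframes created earlier (crawl_list is a hypothetical name for the export fed to the crawler):

# Stack the GSC and GA URL frames, dedupe, and export for crawling
crawl_list = pd.concat([all_gsc_urls, ga_raw_urls]).drop_duplicates()
crawl_list.to_csv('data/crawl_list.csv', index = False)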
With the crawl completed, we’re ready to import the data, clean the columns, and
view the raw data:
audit_urls_raw = pd.read_csv('data/all_urls__excluding_uncrawled__filtered_20210803163126.csv')
audit_urls_raw.columns = audit_urls_raw.columns.str.lower()
audit_urls_raw.columns = audit_urls_raw.columns.str.replace('/', '', regex = False)
audit_urls_raw.columns = audit_urls_raw.columns.str.replace('.', '', regex = False)
audit_urls_raw.columns = audit_urls_raw.columns.str.replace('% ', '', regex = False)
audit_urls_raw.columns = audit_urls_raw.columns.str.replace('  ', ' ', regex = False)
We can see that most of the server status codes have not been extracted. This is
likely a bug in the crawling software. The best thing to do is take it up with the
software vendor and recrawl with a longer timeout setting and at a slower pace to
improve the numbers.
audit_urls_raw.groupby('final_redirect_url_status_code').size()

final_redirect_url_status_code
200        452
404          1
Not Set    731
dtype: int64
After our recrawl, 452 URLs is the best we could come up with. Next, we’re ensuring
any rows with duplicate URLs are dropped:
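The dedupe itself isn't shown; a minimal sketch, assuming audit_urls is the working copy used in the steps that follow:

# Drop rows whose URL already appeared earlier in the crawl export
audit_urls = audit_urls_raw.drop_duplicates(subset = 'url')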
The row count has now dropped from 1122 to 922 rows. Next, we’ll find the final
redirect URL so we can see where the URLs map to. Again, this seemingly unnecessary
step is taken to overcome any glitches produced by the audit software.
Prepare the columns for content evaluation:
This function will take a row, turn it into a list, and take the last value that isn’t equal
to “No Data” and stick the URL in the list ult_dest_url created earlier:
def find_ult_dest_url(row):
    data = row.values.tolist()
    # keep only values that aren't the 'No Data' placeholder
    data = [e for e in data if str(e) != 'No Data']
    data = data[-1]
    #print(data)
    ult_dest_url.append(data)
The preceding function is applied by calling the following function to take the
dataframe row by row, which is considered to be a less computationally intensive way to
iterate over a dataframe, certainly faster than iterrows:
def apply_fudu(df):
    return df.apply(lambda row: find_ult_dest_url(row), axis=1)

apply_fudu(audit_urls)
audit_urls_map_prep = audit_urls_map.join(ult_dest_url_df)
audit_urls_map_prep
With the ultimate destination URLs found, we need a simple way to test how similar
they are. We can do this by measuring the string distance between the URL and the
redirect URL. We’ll use Sorensen-Dice which is fast and effective for SEO purposes:
audit_urls_map = audit_urls_map_prep
audit_urls_map
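The similarity code isn't printed here; a minimal sketch mirroring the earlier sorensen_dice pattern, assuming the joined columns are named url and ult_dest_url:

# Score how closely each redirect destination resembles its source URL
audit_urls_map['simi'] = audit_urls_map.loc[
    :, ['url', 'ult_dest_url']].apply(
    lambda x: sorensen_dice(*x), axis=1)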
Segmenting URLs
With all of these audit URLs, we want to make sense of them so we can discern trends
by content type. Since we don't have a trained neural network at hand, we're going to
use a crude yet useful method of grouping URLs by the words in their addresses.
This method is not only fast, it's also cheap in that you won’t require a million
content documents to train an AI to categorize web documents by content type.
We’ll start by extracting the URLs and ensuring they are unique before sticking them
into a dataframe:
crawled_urls_unq = audit_urls_raw.url.drop_duplicates().to_frame()
crawled_urls_unq
This code cleans up the URLs ready for some text processing so we can start
grouping the URLs. We’ll start with removing the domain portion of the URL as that is
constant throughout the URLs:
classified_start = all_urls[['url']]
classified_start['slug'] = classified_start.url.str.replace(root_url, '', regex = True)
The following will deal with dates which won’t add value to the classification:
classified_start['slug'] = classified_start.slug.str.replace(
    r"\d{4}-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])", '', regex = True)
classified_start['slug'] = classified_start.slug.str.replace(r"[^\w\s]", " ", regex = True)
classified_start['slug'] = classified_start.slug.str.strip()
classified_start = classified_start.reset_index()
del classified_start['index']
classified_start.head(10)
The result is the URL words without all of the punctuation, that is, the slug.
We'll want to apply some numbers to get a sense of priority, so we'll use GSC
traffic data to weight the slugs.
Get the GSC traffic data:
Remove URLs with no clicks. Choose your GSC date range wisely here. If you just
went for 28 days, then there’s the risk of seasonal bias as some content may not receive
traffic at certain times of the year. Our recommendation is to select 16 months, the
maximum possible for extraction from GSC.
We’re going to explode the slug column into unigrams. That means taking the slug
and expanding the column into several rows such that each word in the slug has its own
row as one column:
bigrams = classified_stats['slug']
bigrams = bigrams.str.split(' ').explode().to_frame()
bigrams = bigrams.rename(columns = {'slug': 'ngram'})
bigrams.head(10)
With the slugs “exploded” into ngrams, this will be mapped to their original URL and
traffic stats table:
bigrams_df = classified_stats.join(bigrams)
bigrams_df.head(20)
With the data merged, we'll want to remove rows containing stop words and other
unhelpful words that shouldn't be used when creating group names.
A note of warning: the code is deliberately repetitive to give you practice and
build your muscle memory, even though there are smarter ways to do the entire block
in two lines – think list and '|'.join(list), as sketched after the block:
bigrams_df = bigrams_df[~bigrams_df.ngram.str.contains(r'\ban\b', regex = True)]
bigrams_df = bigrams_df[~bigrams_df.ngram.str.contains(r'\bin\b', regex = True)]
bigrams_df = bigrams_df[~bigrams_df.ngram.str.contains(r'\bcom\b', regex = True)]
bigrams_df = bigrams_df[~bigrams_df.ngram.str.contains(r'\bwww\b', regex = True)]
bigrams_df = bigrams_df[~bigrams_df.ngram.str.contains(r'\bthe\b', regex = True)]
bigrams_df = bigrams_df[~bigrams_df.ngram.str.contains(r'\busing\b', regex = True)]
bigrams_df = bigrams_df[~bigrams_df.ngram.str.contains(r'\bwith\b', regex = True)]
bigrams_df = bigrams_df[~bigrams_df.ngram.str.contains(r'\b(http|https)\b', regex = True)]
bigrams_df['ngram'] = bigrams_df.ngram.str.strip()
bigrams_df = bigrams_df[~bigrams_df.ngram.isnull()]
bigrams_df
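As hinted earlier, the same filtering can be condensed; a minimal sketch of the two-line approach:

# Join the stop words into one alternation pattern and filter once
stop_words = ['an', 'in', 'com', 'www', 'the', 'using', 'with', 'http', 'https']
bigrams_df = bigrams_df[~bigrams_df.ngram.str.contains(
    r'\b(' + '|'.join(stop_words) + r')\b', regex = True)]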
The table now has ngrams with more sensible labels, which can be sum aggregated to
pick the most common labels per URL. The idea here is to create an index based on
traffic and the number of instances of the ngram label:
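The aggregation itself is elided; a minimal sketch, assuming bigrams_df carries clicks and url columns and that g_score combines traffic with frequency:

# Sum clicks and count occurrences per ngram, then score
bigram_stats = (bigrams_df.groupby('ngram')
                .agg(clicks = ('clicks', 'sum'), count = ('url', 'size'))
                .reset_index())
bigram_stats['g_score'] = bigram_stats['clicks'] * bigram_stats['count']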
We now have a table with ngrams with their stats and their ultimate score. The
following function will select the highest score per ngram:
ngram_stats_map = bigram_stats.groupby('ngram').apply(lambda x:
    filter_highest_stat(x, 'ngram', 'g_score')).reset_index()
ngram_stats_map = ngram_stats_map.sort_values('g_score', ascending = False).reset_index()
del ngram_stats_map['index']
ngram_stats_map.head(10)
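Note that filter_highest_stat isn't defined in this excerpt; a minimal sketch consistent with how it's called here and later:

# Keep the row(s) holding the highest stat value within each group
def filter_highest_stat(df, group_col, stat_col):
    return df[df[stat_col] == df[stat_col].max()]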
The result is a prioritized table showing the most common ngrams that can be used
to categorize URLs as segments.
Using the scores, we’ll create two levels of segments, taking the most popular ngrams
as labels while classifying the rest as “other.” We’re creating two levels so that we have a
more high-level and a more detailed view to hand.
We’ll join the segment labels to the dataset so that all URLs are now classified by
segment label:
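The labeling and join aren't printed here; a minimal sketch, where the head() cutoffs are purely illustrative:

# Top-scoring ngrams become labels; everything else is 'other'
top_one = ngram_stats_map.head(10)['ngram']
ngram_stats_map['segment_one'] = np.where(
    ngram_stats_map['ngram'].isin(top_one), ngram_stats_map['ngram'], 'other')
top_two = ngram_stats_map.head(25)['ngram']
ngram_stats_map['segment_two'] = np.where(
    ngram_stats_map['ngram'].isin(top_two), ngram_stats_map['ngram'], 'other')

# Join labels (and scores) back to the URL-level ngram table
urls_grams_stats = bigrams_df.merge(
    ngram_stats_map[['ngram', 'g_score', 'segment_one', 'segment_two']],
    on = 'ngram', how = 'left')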
There are multiple rows per URL; however, we only want the top result, so we’ll apply
a function to filter for the row with the highest g_score:
urls_stats_grams_map = urls_grams_stats.groupby('url').apply(lambda x:
    filter_highest_stat(x, 'url', 'g_score')).reset_index()

pd.set_option('display.max_colwidth', None)
urls_grams_map = urls_stats_grams_map
urls_grams_map = urls_grams_map.drop_duplicates()
del urls_grams_map['clicks']
del urls_grams_map['g_score']
#urls_grams_map.iloc[0, 'ngram'] = 'home'
urls_grams_map['subpath'] = urls_grams_map.url.str.replace(
    r'(http|https)://www.saasforecom.com', '', regex = True)
urls_grams_map
All the preceding URLs have a unique row and are categorized by segment label.
Let’s summarize the data by segment to see the distribution of content:
urls_grams_map.groupby('segment_one').count().reset_index()
The next step is merging the performance data from GA with the segment labels and
dropping duplicate URL combinations:
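The merge isn't printed here; a minimal sketch whose join keys are assumptions based on the columns created above (GA landing pages are site-relative, like the subpath column):

# Attach segment labels to GA performance rows
ga_segmented = ga_stats.merge(urls_grams_map, left_on = 'landing_page',
                              right_on = 'subpath', how = 'left')
ga_segmented = ga_segmented.drop_duplicates(subset = ['date', 'landing_page'])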
Clean up the data such that null sessions are zero and new_sessions are treated as
whole numbers (i.e., integers):
ga_segmented['new_sessions'] = np.where(ga_segmented.new_sessions.isnull(),
                                        0, ga_segmented.new_sessions)
ga_segmented['new_sessions'] = ga_segmented['new_sessions'].astype(int)
ga_segmented
The result is a dataset ready for time series analysis that can be broken down by
segment.
There is a bit of a limitation in that it's quite difficult (though not impossible)
to get time series data from Google Search Console (GSC). For Google Analytics (GA),
getting time series data at the URL level isolated to organic search is also very
difficult.
Time series data can also be quite noisy by nature due to the way it cycles over the
week such that there are peaks and troughs. To tease out a trend, we'll need to
dampen the noise, which we will achieve by computing a moving average.
We start by grouping sessions by date:
time_trends = ga_segmented.groupby('date')['new_sessions'].sum().to_frame().reset_index()
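The smoothing step itself is elided; a minimal sketch, assuming a 7-day window to dampen the weekly cycle:

# Rolling weekly mean of daily sessions
time_trends['avg_sess'] = time_trends['new_sessions'].rolling(window = 7).mean()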
Let’s visualize:
pre_time_trends_plt = (
    ggplot(time_trends, aes(x = 'date', y = 'avg_sess', group = 1)) +
    geom_line(alpha = 0.6, colour = 'blue', size = 3) +
    labs(y = 'GA Sessions', x = 'Date') +
    scale_y_continuous() +
    scale_x_date() +
    theme(legend_position = 'right',
          axis_text_x=element_text(rotation=90, hjust=1))
)

pre_time_trends_plt
The shift is quite evident in Figure 9-1 with the better half of the traffic trend
switching over to the worse half at around the 20th of December 2020.
Using change point analysis, let’s confirm this analytically using the ruptures
package:
# overall_trends is the overall rolling-average frame (its construction
# mirrors time_trends above)
points = np.array(overall_trends['avg_sess'])

model = "rbf"
algo = rpt.Pelt(model=model).fit(points)
result = algo.predict(pen=6)

rpt.display(points, result, figsize=(10, 6))
plt.title('Change Point Detection using Pelt')
plt.show()
The change point analysis in Figure 9-2 confirms that on the 20th of December,
there’s a shift downward in traffic.
Figure 9-2. Time series of analytics visits with estimated change point between
before (blue shaded area) and after (red)
Yes, it could be coinciding with the Christmas holidays, but unfortunately for
this particular company, we don’t have the data for the previous year to confirm how
much of the downward change is attributable to seasonality vs. the new site relaunch
migration.
# sessseg_trends_roll is the (elided) rolling window over sessions
# grouped by date and segment
sessseg_trends_mean = sessseg_trends_roll.mean()
segmented_trends['avg_sess'] = sessseg_trends_mean
segmented_trends
The data is in long format with the rolling averages calculated ready for visualization:
ga_seg_trends_plt = (
    ggplot(time_trends_segmented, aes(x = 'date', y = 'avg_sess',
                                      group = 'segment_one', colour = 'segment_one')) +
    geom_line(alpha = 0.7, size = 2) +
    labs(y = 'GA Sessions', x = 'Date') +
    scale_y_continuous() +
    scale_x_date() +
    theme(legend_position = 'right',
          axis_text_x=element_text(rotation=90, hjust=1))
)

ga_seg_trends_plt.save(filename = 'images/1_ga_seg_trends_plt.png',
                       height=5, width=15, units = 'in', dpi=1000)
ga_seg_trends_plt
No obvious trends are apparent in Figure 9-3, as most (if not all) of the content
segments appear to move in the same direction over time. It's not as if a couple of
segments decreased while others increased or were unchanged.
Analysis Impact
With the time trends dissected, we turn our attention to analyzing the before and after
impact of the migration to hopefully generate recommendations or areas for further
research.
We’ll use GSC data at the page level which can be segmented by merging the map
created earlier:
gsc_before = pd.read_csv('data/gsc_before.csv')
gsc_before.columns = gsc_before.columns.str.lower()
gsc_before.columns = gsc_before.columns.str.replace('/', '', regex = False)
gsc_before.columns = gsc_before.columns.str.replace('.', '', regex = False)
gsc_before.columns = gsc_before.columns.str.replace('% ', '', regex = False)
gsc_before.columns = gsc_before.columns.str.replace('  ', ' ', regex = False)
gsc_before.columns = gsc_before.columns.str.replace(' ', '_', regex = False)
gsc_before['ctr'] = gsc_before.ctr.str.replace('%', '', regex = False)
We'll add a phase column just so we know which phase of the migration this data refers to:
gsc_before['phase'] = 'before'
So we have the before GSC data at the page level which is now segmented. The
operations are repeated for the phase post migration, known as “after”:
gsc_after = pd.read_csv('data/gsc_after.csv')
gsc_after.columns = gsc_after.columns.str.lower()
gsc_after.columns = gsc_after.columns.str.replace('/', '', regex = False)
gsc_after.columns = gsc_after.columns.str.replace('.', '', regex = False)
gsc_after.columns = gsc_after.columns.str.replace('% ', '', regex = False)
gsc_after.columns = gsc_after.columns.str.replace('  ', ' ', regex = False)
gsc_after.columns = gsc_after.columns.str.replace(' ', '_', regex = False)
gsc_after['ctr'] = gsc_after.ctr.str.replace('%', '', regex = False)
gsc_after['phase'] = 'after'
gsc_after = gsc_after.rename(columns = {'top_pages': 'url'})
gsc_after = gsc_after.merge(urls_grams_map, on = 'url', how = 'left')
gsc_after['count'] = 1
gsc_after['ngram'] = np.where(gsc_after['ngram'].isnull(), 'other',
                              gsc_after['ngram'])
gsc_after['segment_one'] = np.where(gsc_after['segment_one'].isnull(),
                                    'other', gsc_after['segment_one'])
gsc_after['segment_two'] = np.where(gsc_after['segment_two'].isnull(),
                                    'other', gsc_after['segment_two'])
gsc_after
With both datasets imported and cleaned, we’re ready to start analyzing using
aggregations, starting with weighted average rank positions by phase.
The weighted average rank position function (wavg_rank_imps) takes a dataframe
group, uses two of its columns (position and impressions), and returns the result
in a column named "wavg_rank":
def wavg_rank_imps(x):
    names = {'wavg_rank': (x['position'] * x['impressions']).sum() /
             (x['impressions']).sum()}
    return pd.Series(names, index=['wavg_rank']).round(1)
We’ll make a copy of the “before” dataset before applying the function:
gsc_before_agg = gsc_before
gsc_before_wavg = gsc_before_agg.groupby('phase').apply(wavg_rank_imps).reset_index()
In addition to the weighted average ranking positions, we’re also interested in the
total number of URLs and the total number of clicks (organic search traffic):
The index is a ratio of clicks to the weighted average rank, which gives us some
sense of proportion:
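The aggregation and merge producing gsc_before_stats are elided; a minimal sketch mirroring the after phase shown shortly (and assuming the before data was given a count column, as the after data is):

gsc_before_sum = gsc_before_agg.groupby('phase').agg({'count': 'sum',
                                                      'clicks': 'sum'}).reset_index()
gsc_before_stats = gsc_before_wavg.merge(gsc_before_sum, on = 'phase', how = 'left')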
gsc_before_stats['index'] = gsc_before_stats['clicks'] / gsc_before_stats['wavg_rank']
gsc_before_stats.sort_values('index', ascending = False)
That’s the stats before the migration. Now let’s look at the stats after the migration,
applying the same methods used earlier to data post migration:
gsc_after_agg = gsc_after
gsc_after_wavg = gsc_after_agg.groupby('phase').apply(wavg_rank_imps).reset_index()
gsc_after_sum = gsc_after_agg.groupby('phase').agg({'count': 'sum',
                                                    'clicks': 'sum'}).reset_index()
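The merge into gsc_after_stats is likewise elided; a sketch consistent with the concatenation below:

gsc_after_stats = gsc_after_wavg.merge(gsc_after_sum, on = 'phase', how = 'left')
gsc_after_stats['index'] = gsc_after_stats['clicks'] / gsc_after_stats['wavg_rank']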
With both datasets aggregated, we can concatenate them into a single table to
compare directly:
pd.concat([gsc_before_stats, gsc_after_stats])
So the average rank doesn't appear to have changed that much, which implies the
dramatic change could be more seasonal. However, as we'll see later, averages can
often mask what's really happening.
The number of pages receiving traffic has decreased by roughly 20%, which is
telling, as that appears to be migration related.
We’ll start visualizing some data to help us investigate deeper:
overall_clicks_plt = (
    ggplot(pd.concat([gsc_before_stats, gsc_after_stats]),
           aes(x = 'reorder(phase, -clicks)', y = 'clicks', fill = 'phase')) +
    geom_bar(stat = 'identity', alpha = 0.6, position = 'dodge') +
    labs(y = 'GSC Clicks', x = 'phase') +
    theme(legend_position = 'right')
)

overall_clicks_plt.save(filename = 'images/2_overall_clicks_plt.png',
                        height=5, width=10, units = 'in', dpi=1000)
overall_clicks_plt
Figure 9-4. Column chart of before and after Google Search Console (GSC) clicks
gsc_before_seg_agg = gsc_before
gsc_before_seg_wavg = gsc_before_seg_agg.groupby(['segment_two', 'phase']).apply(wavg_rank_imps).reset_index()
gsc_before_seg_sum = gsc_before_seg_agg.groupby(['segment_two', 'phase']).agg(
    {'count': 'sum', 'clicks': 'sum'}).reset_index()
gsc_before_seg_stats = gsc_before_seg_wavg.merge(gsc_before_seg_sum,
                                                 on = ['segment_two', 'phase'], how = 'left')
gsc_before_seg_stats['index'] = gsc_before_seg_stats['clicks'] / gsc_before_seg_stats['wavg_rank']
gsc_before_seg_stats.sort_values('index', ascending = False)
gsc_after_seg_agg = gsc_after
gsc_after_seg_wavg = gsc_after_seg_agg.groupby(['segment_two', 'phase']).apply(wavg_rank_imps).reset_index()
gsc_after_seg_sum = gsc_after_seg_agg.groupby(['segment_two', 'phase']).agg(
    {'count': 'sum', 'clicks': 'sum'}).reset_index()
gsc_after_seg_stats = gsc_after_seg_wavg.merge(gsc_after_seg_sum,
                                               on = ['segment_two', 'phase'], how = 'left')
gsc_after_seg_stats['index'] = gsc_after_seg_stats['clicks'] / gsc_after_seg_stats['wavg_rank']
gsc_after_seg_stats.sort_values('index', ascending = False)
Curiously, “management” has 10% more URLs ranking than premigration with no
real change in ranking. “Documentation” has lost virtually all of its clicks and half of
its URLs.
To visualize this, the dataframes will need to be concatenated in long format to feed
the graphics code:
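The concatenation isn't shown; a minimal sketch, casting phase to a category so it can be reordered below:

gsc_long_seg_stats = pd.concat([gsc_before_seg_stats, gsc_after_seg_stats])
gsc_long_seg_stats['phase'] = gsc_long_seg_stats['phase'].astype('category')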
gsc_long_seg_stats['phase'].cat.reorder_categories(['before', 'after'],
                                                   inplace=True)

segment_clicks_plt = (
    ggplot(gsc_long_seg_stats,
           aes(x = 'reorder(segment_two, -clicks)', y = 'clicks', fill = 'phase')) +
    geom_bar(stat = 'identity', alpha = 0.6, position = 'dodge') +
    labs(y = 'GSC Clicks', x = '') +
    theme(legend_position = 'right',
          axis_text_x=element_text(rotation=90, hjust=1))
)

segment_clicks_plt.save(filename = 'images/2_segment_clicks_plt.png',
                        height=5, width=10, units = 'in', dpi=1000)
segment_clicks_plt
As shown in Figure 9-5, most of the click losses appear to have happened at the
home page.
Figure 9-5. Column chart of before and after Google Search Console (GSC) clicks
by content segment
segment_urls_plt = (
    ggplot(gsc_long_seg_stats,
           aes(x = 'reorder(segment_two, -count)', y = 'count', fill = 'phase')) +
    geom_bar(stat = 'identity', alpha = 0.6, position = 'dodge') +
    labs(y = 'GSC URL Count', x = '') +
    theme(legend_position = 'right',
          axis_text_x=element_text(rotation=90, hjust=1))
)
So there are more management URLs receiving traffic post migration (Figure 9-6).
Figure 9-6. Column chart of before and after Google Search Console (GSC) URL
counts by content segment
However, there is much less in “documentation” and “other” and a bit less in “help,”
“sales,” and “stock.” What about Google rank positions?
segment_rank_plt = (
    ggplot(gsc_long_seg_stats,
           aes(x = 'reorder(segment_two, -wavg_rank)', y = 'wavg_rank', fill = 'phase')) +
    geom_bar(stat = 'identity', alpha = 0.6, position = 'dodge') +
    #geom_text(dd_factor_df, aes(label = 'serps_name'), position=position_stack(vjust=0.01)) +
    labs(y = 'GSC Avg Rank', x = '') +
    scale_y_reverse() +
    #theme_classic() +
    theme(legend_position = 'right',
          axis_text_x=element_text(rotation=90, hjust=1))
)
Rankings fell for the inventory, sales, wholesalers, and stock classifications
(Figure 9-7).
Figure 9-7. Column chart of before and after Google Search Console (GSC) rank
position averages by content segment
So there is some correlation between the losses in traffic and rankings. As a general
conclusion, some of the downshift in organic performance, as initially suspected, is a
mixture of seasonality and site migration.
Diagnostics
So we see that rankings fell for inventory and others, but why?
To understand what happened, we're now going to merge performance data with
crawl data to help us diagnose what went wrong. We'll also append the segment names
so we can diagnose by content area.
Select the clicks and rank columns we want before merging:
Because the before and after dataframes share the same column names, Pandas
assumes the merged columns are different and adds the suffixes _x and _y to tell
them apart. So we're renaming them to be more user-friendly:
After joining, we’d expect to see some rows where they have null clicks before
or (more likely) after the migration. So we’re cleaning up the data to replace “not a
number” (NaNs) values with 100 for rankings and 0 for clicks:
gsc_ba_diag['rank_before'] = np.where(gsc_ba_diag['rank_before'].isnull(),
                                      100, gsc_ba_diag['rank_before'])
gsc_ba_diag['rank_after'] = np.where(gsc_ba_diag['rank_after'].isnull(),
                                     100, gsc_ba_diag['rank_after'])
gsc_ba_diag['clicks_before'] = np.where(gsc_ba_diag['clicks_before'].isnull(),
                                        0, gsc_ba_diag['clicks_before'])
gsc_ba_diag['clicks_after'] = np.where(gsc_ba_diag['clicks_after'].isnull(),
                                       0, gsc_ba_diag['clicks_after'])
With the data in wide format and the null values cleaned up, we can compute the
differences in clicks and rankings before and after, which we will now do:
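The delta lines themselves are elided; a minimal sketch consistent with the clicks_delta column used later:

gsc_ba_diag['clicks_delta'] = gsc_ba_diag['clicks_after'] - gsc_ba_diag['clicks_before']
gsc_ba_diag['rank_delta'] = gsc_ba_diag['rank_after'] - gsc_ba_diag['rank_before']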
gsc_ba_diag
The performance deltas are now in place, so we can merge the crawl data with
performance data into a new dataframe aptly named "perf_crawl."
Since we have all the URLs we want and there are a lot of unwanted URLs in the
crawl data, we'll take the desired URLs (perf_crawl) and join the crawl data
specified in the merge function, which will be set to "left." This is equivalent
to an Excel vlookup, which will only join the desired crawl URLs.
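A minimal sketch of that merge (crawl_data is a hypothetical name for the imported crawl dataframe):

perf_crawl = gsc_ba_diag.merge(crawl_data, on = 'url', how = 'left')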
perf_crawl
More fun awaits us as we now get to diagnose the URLs. To do this, we’re going to use
a set of conditions in the data, such that when they are met, they will be given a diagnosis
value. This is where your SEO experience comes in, because your ability to spot patterns
dictates the conditions you will set as follows:
perf_diags = perf_crawl.copy()

modifier_conds = [
    (perf_crawl['http_status_code'] == '200') & (perf_crawl['crawl_source'] != 'Crawler'),
    (perf_crawl['redirect_url_status_code'] == '301'),
    (perf_crawl['http_status_code'].isnull()),
    perf_crawl['http_status_code'].isin(['400', '403', '404']),
    (perf_diags['canonical_status'] != 'Missing') & (perf_diags['indexable_status'] == 'Noindex'),
    perf_diags['content_simi'] < 1
]
Create a new column and use np.select to assign values to it using our lists as
arguments:
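The np.select call is elided; a minimal sketch where the mapping of conditions to labels is illustrative, inferred from the diagnosis names used later in the chapter:

diagnosis_vals = ['outside_ia', 'redirect_chain', 'error',
                  'error', 'robots_conflict', 'lost_content']
perf_diags['diagnosis'] = np.select(modifier_conds, diagnosis_vals,
                                    default = 'other')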
perf_diags
A new column "diagnosis" has been added based on the rules we just created,
helping us to make sense, at the URL level, of what has happened.
With each URL labeled, we can start to quantify the diagnosis:
diagnosis_clicks = perf_diags.groupby('diagnosis').agg({'clicks_delta': 'sum'}).reset_index()
diagnosis_urls = perf_diags.groupby('diagnosis').agg({'url': 'count'}).reset_index()
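The merge into diagnosis_stats isn't shown; a minimal sketch:

diagnosis_stats = diagnosis_clicks.merge(diagnosis_urls, on = 'diagnosis', how = 'left')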
diagnosis_stats['clicks_pURL'] = (diagnosis_stats['clicks_delta'] /
                                  diagnosis_stats['url']).round(2)
diagnosis_stats
According to the analysis, around 25% of the total loss of clicks is down to error
codes (HTTP server status 4XX) and lost content (URLs redirected to a parent folder).
Most of the URLs affected are the 373 redirected ones, which cover most of the
website. Other (with no URLs) implies the traffic loss would be seasonal and/or an
indirect effect of the migration errors.
If you want to share what you found, you could visualize this for your colleagues
using the following code:
diagnosis_plt = (
    ggplot(diagnosis_stats,
           aes(x = 'reorder(diagnosis, -clicks_delta)', y = 'clicks_delta')) +
    geom_bar(stat = 'identity', alpha = 0.6, position = 'dodge', fill = 'blue') +
    labs(y = 'GSC Clicks Impact', x = '') +
    coord_flip() +
    theme(legend_position = 'right',
          axis_text_x=element_text(rotation=90, hjust=1))
)
“Other” (probably seasonality) was the major reason for the click losses (Figure 9-8),
followed by lost_content (i.e., URLs redirected).
Figure 9-8. Bar chart of Google Search Console (GSC) click impact by tech SEO
diagnosis
diagnosis_count_dat = perf_diags.groupby('diagnosis').agg({'url': 'count'}).reset_index()
print(diagnosis_count_dat)

diagnosis_urlcount_plt = (
    ggplot(diagnosis_count_dat,
           aes(x = 'reorder(diagnosis, url)', y = 'url')) +
    geom_bar(stat = 'identity', alpha = 0.6, position = 'dodge', fill = 'blue') +
    labs(y = 'URL Count', x = '') +
    coord_flip() +
    theme(legend_position = 'right',
          axis_text_x=element_text(rotation=90, hjust=1))
)

diagnosis_urlcount_plt.save(filename = 'images/3_diagnosis_urlcount_plt.png',
                            height=5, width=15, units = 'in', dpi=1000)
diagnosis_urlcount_plt
Despite “Other” losing the most clicks, it was “lost content” that impacted the most
URLs (Figure 9-9).
Figure 9-9. Column chart of Google Search Console (GSC) URLs affected counts
by tech SEO diagnosis
That’s the overview done; let’s break it down by content type. We’ll use the content
segment labels to get click impact stats:
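The aggregations feeding the merge below are elided; a minimal sketch, assuming a segment column exists on perf_diags (it's the join key used next):

diagnosis_seg_clicks = perf_diags.groupby(['diagnosis', 'segment']).agg(
    {'clicks_delta': 'sum'}).reset_index()
diagnosis_seg_urls = perf_diags.groupby(['diagnosis', 'segment']).agg(
    {'url': 'count'}).reset_index()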
diagnosis_seg_stats = diagnosis_seg_clicks.merge(diagnosis_seg_urls,
                                                 on = ['diagnosis', 'segment'], how = 'left')
diagnosis_seg_stats['clicks_p_url'] = (diagnosis_seg_stats['clicks_delta'] /
                                       diagnosis_seg_stats['url']).round(2)
diagnosis_seg_stats.sort_values('clicks_p_url')
So not only is “Other” the biggest reason for the click losses, most of it impacted
the home page. This would be consistent with the idea of seasonality, that is, a
characteristically quiet December.
While tables are useful, we’ll make use of data visualization to see the overall picture
more easily:
diagnosis_seg_clicks_plt = (
    ggplot(diagnosis_seg_stats,
           aes(x = 'diagnosis', y = 'segment', fill = 'clicks_delta')) +
    geom_tile(stat = 'identity', alpha = 0.6) +
    labs(y = '', x = '') +
    theme_classic() +
    theme(legend_position = 'right')
)
diagnosis_seg_clicks_plt.save(filename = 'images/5_diagnosis_seg_clicks_plt.png',
                              height=5, width=10, units = 'in', dpi=1000)
diagnosis_seg_clicks_plt
The home page followed by “management content” is the most affected within
“other” (Figure 9-10).
Figure 9-10. Heatmap chart of clicks delta by content type and SEO diagnosis
In terms of lost content, these are mostly “documentation” and “other.” This is quite
useful for deciding where to focus our attention.
Although “other” as a reason isn’t overly helpful for fixing a site post migration, we
can still explain where some of the migration errors occurred, start labeling URLs for
recommended actions, and visualize. This is what we’re doing next.
Road Map
We’ll start with our dataframe “perf_diags” and copy it into “perf_recs” before creating
the recommendations based on the errors found:
perf_recs = perf_diags
The aptly named diag_conds is a list of conditions based on the value of the
diagnosis column in the perf_recs dataframe. The np.select function (used shortly)
will draw from this list to assign a recommendation.
diag_conds = [
perf_recs['diagnosis'] == 'outside_ia',
perf_recs['diagnosis'] == 'redirect_chain',
perf_recs['diagnosis'] == 'error',
perf_recs['diagnosis'] == 'robots_conflict',
perf_recs['diagnosis'] == 'lost_content',
perf_recs['diagnosis'] == 'other'
]
With the lists in place, we can now match them when we create a new column and
use np.select to assign values to it using our lists as arguments:
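The recommendation values themselves are elided; a minimal sketch where only 'no further action' is taken from the table discussed below and the other labels are illustrative:

rec_vals = ['migrate into the site architecture', 'point redirect at final target',
            'fix error', 'resolve canonical/noindex conflict',
            'restore content', 'no further action']
perf_recs['recommendation'] = np.select(diag_conds, rec_vals,
                                        default = 'no further action')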
perf_recs
You’ll now see the perf_recs dataframe updated with a new column to match the
diagnosis.
Of course, we’ll now want to quantify all of this for our presentation decks to our
colleagues, using the hopefully familiar groupby() function:
recs_clicksurl = perf_recs.groupby('recommendation')['clicks_delta'].agg(['sum', 'count']).reset_index()
recs_clicksurl['recovery_clicks_url'] = np.abs(recs_clicksurl['sum'] /
                                               recs_clicksurl['count'])
We’re taking the absolute as we want to put a positive slant on the presentation of the
numbers:
recs_clicksurl['sum'] = np.abs(recs_clicksurl['sum'])
recs_clicksurl
The preceding table shows the recommendation with clicks to be recovered (sum),
URL count (count), and the potential recovery clicks per URL. Although it may seem
strange to recover 297 clicks per month through “no further action,” some may well be
recovered by fixing the other issues.
Time to visualize:
recs_clicks_plt = (
    ggplot(recs_clicksurl,
           aes(x = 'reorder(recommendation, sum)', y = 'sum')) +
    geom_bar(stat = 'identity', alpha = 0.6, position = 'dodge', fill = 'blue') +
    labs(y = 'Recovery Clicks Available', x = '') +
    coord_flip() +
    theme(legend_position = 'right',
          axis_text_x=element_text(rotation=90, hjust=1))
)

recs_clicks_plt.save(filename = 'images/8_recs_clicks_plt.png',
                     height=5, width=10, units = 'in', dpi=1000)
recs_clicks_plt
Figure 9-11. Bar chart of estimated recovery clicks available by tech SEO diagnosis
Summary
This chapter covered site migration mapping so that you could set the structure of
your new site and semiautomate the formation of your migration URLs. Some of the
techniques used are as follows:
• String manipulation
While these techniques were applied to speed up the processing of data for a site
migration, they can easily be applied to other use cases. In the next chapter, we will show
how algorithm updates can be better understood using data.
CHAPTER 10
Google Updates
Just as death and taxes are the certainties of life, algorithm updates are a certainty for any
SEO career. That’s right, Google frequently introduces changes to its ranking algorithm,
which means your website (and many others) may experience fluctuations in rankings
and, by extension, traffic. These changes may be positive or negative, and in some cases,
you’ll discern no impact at all.
To compound matters, Google in particular gives rather vague information as to
what the algorithm updates are about and how business and SEO professionals should
respond. Naturally, the lack of prescriptive advice from Google other than delivering
“a great user experience” and “creating compelling content” means SEOs must find
answers using various analysis tools. Fortunately, for the SEO, the SERP data itself
holds those answers, and in this chapter, we'll analyze an update across:
• Domains
• Result types
• Cannibalization
• Keywords
• Segmented SERPs
Algo Updates
The general approach here is to compare performance between the before and after
phases of the Google algorithm update. In this case, we’ll focus on a newly listed webinar
company known as ON24. ON24 suffered from the December 2019 core update.
With some analysis and visualization, we can get an idea of what’s going on with the
update. As well as the usual libraries, we’ll be importing SERPs data from getSTAT (an
enterprise-level rank tracking platform, available at getstat.com):
import os
import re
import time
import random
import datetime
import requests
import json
from datetime import timedelta
from glob import glob

import pandas as pd
import numpy as np
from textdistance import sorensen_dice
from plotnine import *
import matplotlib.pyplot as plt
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
import uritools

pd.set_option('display.max_colwidth', None)
%matplotlib inline

root_domain = 'on24.com'
hostdomain = 'www.on24.com'
hostname = 'on24'
full_domain = 'https://www.on24.com'
target_name = 'ON24'

getstat_raw.head()
getstat_cleancols = getstat_raw
Convert to lowercase:
Given ON24 is a global brand and using a single website to capture all English
language searches worldwide, we’re using the global monthly search volume instead of
the usual regional (country level) numbers. Hence, we’re renaming the global volumes
as the search volume:
We filter out rows for brand searches as we would expect ON24 to rank well for its
own brand and we’re more interested in the general core update:
getstat_cleancols = getstat_cleancols[~getstat_cleancols['keyword'].str.contains('24')]
getstat_cleancols
To make the calculations easier, we’ll split the dataframe column-wise into before
and after. The splits will be aggregated and then compared to each other.
We’ll start with the before dataframe, selecting the before columns:
Change the values of the URL column such that if there are any blanks (null values),
we replace them with '' as opposed to a NaN (not a number). This helps avoid any
errors when aggregating later on.
We’ll derive site names using the urisplit function (embedded inside a list
comprehension) to extract the domain name. This will be useful to summarize
performance at the site level.
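A minimal sketch of that step, using uritools' urisplit (the authority component carries the hostname); strip_subdomains isn't defined in this excerpt, so the patterns below are hypothetical examples:

# Pull the hostname out of each ranking URL
getstat_before['site'] = [uritools.urisplit(url).authority
                          for url in getstat_before['url']]
strip_subdomains = [r'www\.', r'info\.']  # hypothetical patterns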
We change the site field to replace any strip_subdomains strings found in the site
column and replace with nothing:
getstat_before['site'] = getstat_before['site'].str.replace('|'.join(strip_subdomains), '')
getstat_before['phase'] = 'before'
Stratifying the ranking position data helps us perform more detailed aggregations so
that we can break down performance into Google’s top 3, page 1, etc. This uses np.where
which is the Python equivalent of Excel’s if function:
# The np.where lines were elided here, leaving only a closing fragment;
# the rank bands below are assumptions (top 3, page 1, page 2)
getstat_before['rank_profile'] = np.where(getstat_before['rank'] < 4,
                                          'top_3', 'page_2')
getstat_before['rank_profile'] = np.where(getstat_before['rank'].between(4, 10),
                                          'page_1', getstat_before['rank_profile'])
Here, we’ll rename some columns as we don’t need the month year in the
column title:
Column selection is not absolutely necessary, but it does help clean up the
dataframe and remind us of what we’re working with:
We’ll set zero search volumes to one so that we don’t get “divide by zero errors” later
on when deriving calculations:
getstat_before['search_volume'] = np.where(getstat_before['search_volume'] == 0,
                                           1, getstat_before['search_volume'])
Initialize a new count column, which also comes in handy for aggregations:
getstat_before['count'] = 1
Sometimes, you’ll want to dissect the SERPs by head, middle, and long tail. To
make this possible, we’ll initialize a column called “token_count” which counts the
amount of gaps between the words (and add 1) to extract the query word count in the
“keyword” column:
getstat_before['token_count'] = getstat_before['keyword'].str.count(' ') + 1
Thanks to the word count, we use the np.select() function to classify the
query length:
before_length_conds = [
    getstat_before['token_count'] == 1,
    getstat_before['token_count'] == 2,
    getstat_before['token_count'] > 2]
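The np.select call itself is elided; a minimal sketch (the column name query_length is an assumption, and the labels follow the head/middle/long framing above):

getstat_before['query_length'] = np.select(before_length_conds,
                                           ['head', 'middle', 'long'])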
getstat_before
Here is the before dataset with additional features to make the analysis more useful.
Let's repeat the data transformation steps for the after dataset:
getstat_after['site'] = getstat_after['site'].str.replace('|'.join(strip_subdomains), '')
getstat_after['phase'] = 'after'
getstat_after['search_volume'] = np.where(getstat_after['search_volume'] == 0,
                                          1, getstat_after['search_volume'])
getstat_after['count'] = 1
# token_count is needed for the length conditions below
getstat_after['token_count'] = getstat_after['keyword'].str.count(' ') + 1

after_length_conds = [
    getstat_after['token_count'] == 1,
    getstat_after['token_count'] == 2,
    getstat_after['token_count'] > 2,
]

getstat_after
Dedupe
The reason for deduplication is that the search engines often rank multiple URLs
from the same site in a SERP. This is fine if you want to evaluate SERP share or
rates of cannibalization (i.e., multiple URLs from the same domain competing for
the same ranking and ultimately constraining the maximum ranking achieved). However,
in our use case of just seeing which sites come first, in what rank order, and how
often, deduplication is key.
Using the transformed datasets, we will group by site, selecting and keeping the
highest ranked URL in the unique (deduplicated) dataset:
getstat_bef_unique = getstat_before.sort_values('rank').groupby(['site', 'device', 'keyword']).first()
getstat_bef_unique = getstat_bef_unique.reset_index()
getstat_bef_unique = getstat_bef_unique[getstat_bef_unique['site'] != '']
getstat_bef_unique = getstat_bef_unique.sort_values(['keyword', 'device', 'rank'])
getstat_bef_unique
The dataset has been reduced noticeably from 27,000 to 23,600 rows. We’ll repeat the
same operation for the after dataset:
getstat_aft_unique = getstat_after.sort_values('rank').groupby(['site', 'device', 'keyword']).first()
getstat_aft_unique = getstat_aft_unique.reset_index()
getstat_aft_unique
Domains
One of the most common questions of any algo update is which sites gained and which
ones lost. We will start by filtering for those in the top 10 to calculate the “reach” and sum
these by site:
before_unq_reach = getstat_bef_unique
before_unq_reach = before_unq_reach[before_unq_reach['rank'] < 11]
before_unq_reach = before_unq_reach.groupby(['site']).agg({'count': 'sum'}).reset_index()
# rename count to reach, mirroring the after version below
before_unq_reach = before_unq_reach.rename(columns = {'count': 'reach'})
before_unq_reach['reach'] = np.where(before_unq_reach['reach'].isnull(), 0,
                                     before_unq_reach['reach'])
before_unq_reach.sort_values('reach', ascending = False).head(10)
Unsurprisingly, Google has the most keyword presence of any site. After that, it’s
HubSpot, then ON24, our site of interest. Note that this is before the Google update.
We’ll repeat the domain reach aggregation for after the update:
after_unq_reach = getstat_aft_unique
after_unq_reach = after_unq_reach[after_unq_reach['rank'] < 11]
after_unq_reach = after_unq_reach.groupby(['site']).agg({'count': 'sum'}).reset_index()
after_unq_reach = after_unq_reach.rename(columns = {'count': 'reach'})
after_unq_reach['reach'] = np.where(after_unq_reach['reach'].isnull(), 0,
                                    after_unq_reach['reach'])
Google is an even bigger winner post its own update. HubSpot has lost out slightly,
and ON24 is virtually unchanged. Or so it appears on the surface as we’ll see later on
when we get deeper into the analysis.
Rather than eyeballing two separate dataframes, we’ll join them together for a side-
by-side comparison:
compare_reach_loser = before_unq_reach.merge(after_unq_reach, on = ['site'], how = 'outer')
compare_reach_loser = compare_reach_loser.rename(columns = {'reach_x': 'before_reach',
                                                            'reach_y': 'after_reach'})

Swap null values with zero to prevent errors for the next step:

compare_reach_loser['before_reach'] = np.where(compare_reach_loser['before_reach'].isnull(),
                                               0, compare_reach_loser['before_reach'])
compare_reach_loser['after_reach'] = np.where(compare_reach_loser['after_reach'].isnull(),
                                              0, compare_reach_loser['after_reach'])
Create new columns to quantify the difference in reach between before and after:
compare_reach_loser['delta_reach'] = compare_reach_loser['after_reach'] - compare_reach_loser['before_reach']
compare_reach_loser = compare_reach_loser.sort_values('delta_reach')
compare_reach_loser = compare_reach_loser[['site', 'before_reach', 'after_reach', 'delta_reach']]
compare_reach_loser.head(10)
The biggest loser by far appears to be WorkCast, a major player in the webinar
software space, followed by HubSpot. As you’ll realize, having the tables aggregated
separately and then joined makes the comparison much easier. Let’s repeat to find the
winners:
compare_reach_winners = compare_reach_loser.sort_values('delta_reach', ascending = False)
compare_reach_winners.head(10)
Interesting, so WorkCast lost, yet its subdomain gained. A few publishers like
Medium and blogs from indirect B2B software competitors also gain. Intuitively, this
looks like blogs and guides have been favored.
Time to visualize. We’ll convert to long format which is the data structure of choice
for data visualization graphing packages (think pivot tables):
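The melt is elided here; a minimal sketch mirroring the strata conversion shown later (the head(28) cutoff matches the one used there):

compare_reach_losers_long = compare_reach_loser.head(28).melt(
    id_vars = ['site'], value_vars = ['before_reach', 'after_reach'],
    var_name = 'Phase', value_name = 'Reach')
compare_reach_losers_long['Phase'] = compare_reach_losers_long['Phase'].str.replace('_reach', '')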
compare_reach_losers_long.head(10)
#VIZ
compare_reach_losers_plt = (
    ggplot(compare_reach_losers_long, aes(x = 'reorder(site, Reach)',
                                          y = 'Reach', fill = 'Phase')) +
    geom_bar(stat = 'identity', position = 'dodge', alpha = 0.8) +
    #geom_text(dd_factor_df, aes(label = 'market_name'), position=position_stack(vjust=0.01)) +
    labs(y = 'Reach', x = ' ') +
    #scale_y_reverse() +
    coord_flip() +
    theme(legend_position = 'right', axis_text_x=element_text(rotation=0,
                                                              hjust=1, size = 12)) +
    facet_wrap('device')
)

compare_reach_losers_plt.save(filename = 'images/1_compare_reach_losers_plt.png',
                              height=5, width=10, units = 'in', dpi=1000)
compare_reach_losers_plt
It seems not all websites had a consistent presence across both device search result
types (Figure 10-1).
Figure 10-1. Website top 10 ranking counts (reach) before and after by
browser device
Reach Stratified
Reach is helpful, but as always the devil is in the detail, and no doubt you and your
colleagues will want to drill down further by rank strata, that is, rankings in the top 3
positions or perhaps only rankings on page 1 of Google, etc. Let’s aggregate only this
time with reach strata starting with the before dataset:
before_unq_reachstrata = getstat_bef_unique
before_unq_reachstrata = before_unq_reachstrata.groupby(['site', 'rank_profile']).agg({'count': 'sum'}).reset_index()
before_unq_reachstrata = before_unq_reachstrata.rename(columns = {'count': 'reach'})
before_unq_reachstrata = before_unq_reachstrata[['site', 'rank_profile', 'reach']]
before_unq_reachstrata.sort_values('reach', ascending = False).head(10)
Now we have a dataframe ordered by reach, this time split by rank_profile, thus
stratifying the reach metric. For example, we see HubSpot has twice as many keywords
on page 1 of Google search results compared to page 2, whereas with ON24, it's more
or less equal.
Repeat the operation for the after dataset:
after_unq_reachstrata = getstat_aft_unique
after_unq_reachstrata = after_unq_reachstrata.groupby(['site', 'rank_profile']).agg({'count': 'sum'}).reset_index()
after_unq_reachstrata = after_unq_reachstrata.rename(columns = {'count': 'reach'})
after_unq_reachstrata = after_unq_reachstrata[['site', 'rank_profile', 'reach']]
after_unq_reachstrata.sort_values('reach', ascending = False).head(10)
As you can imagine, it’s less easy to see who won and lost by eyeballing the separate
dataframes, so we will merge as usual:
compare_strata_loser = before_unq_reachstrata.merge(after_unq_reachstrata,
                                                    on = ['site', 'rank_profile'], how = 'outer')
compare_strata_loser = compare_strata_loser.rename(columns = {'reach_x': 'before_reach',
                                                              'reach_y': 'after_reach'})
compare_strata_loser['before_reach'] = np.where(compare_strata_loser['before_reach'].isnull(),
                                                0, compare_strata_loser['before_reach'])
compare_strata_loser['after_reach'] = np.where(compare_strata_loser['after_reach'].isnull(),
                                               0, compare_strata_loser['after_reach'])
compare_strata_loser['delta_reach'] = compare_strata_loser['after_reach'] - compare_strata_loser['before_reach']
compare_strata_loser = compare_strata_loser.sort_values('delta_reach')
compare_strata_loser.head(10)
This dataframe merge makes things much clearer as we can now see ON24 lost most
of its rankings on page 1, whereas WorkCast has lost everywhere.
We’ll turn our attention to the reach winners stratified by rank profile:
compare_strata_winners = before_unq_reachstrata.merge(after_unq_reachstrata,
                                                      on = ['site', 'rank_profile'], how = 'outer')
compare_strata_winners = compare_strata_winners.rename(columns = {'reach_x': 'before_reach',
                                                                  'reach_y': 'after_reach'})
compare_strata_winners['before_reach'] = np.where(compare_strata_winners['before_reach'].isnull(),
                                                  0, compare_strata_winners['before_reach'])
compare_strata_winners['after_reach'] = np.where(compare_strata_winners['after_reach'].isnull(),
                                                 0, compare_strata_winners['after_reach'])
compare_strata_winners['delta_reach'] = compare_strata_winners['after_reach'] - compare_strata_winners['before_reach']
compare_strata_winners = compare_strata_winners.sort_values('delta_reach', ascending = False)
compare_strata_winners.head(10)
Although WorkCast's info subdomain gained 40 positions overall, their main site
lost 69 positions, so it's a net loss. Time to visualize; we'll take the top 28 sites
using the head() function:
The melt() function helps reshape the data from wide format (as per the preceding
dataframe) to long format (where the column names are now in a single column
as rows):
compare_strata_losers_long = compare_strata_losers_long.melt(id_vars = ['site', 'rank_profile'],
                                                             var_name='Phase', value_name='Reach')
compare_strata_losers_long['Phase'] = compare_strata_losers_long['Phase'].str.replace('_reach', '')
compare_strata_losers_long['Phase'] = compare_strata_losers_long['Phase'].astype('category')
With Phase now set as a category, we can now set the order:
compare_strata_losers_long['Phase'] = compare_strata_losers_long['Phase'].cat.reorder_categories(['before', 'after'])
The same applies to rank profile: top 3 is obviously better than page 1, which is
better than page 2.
compare_strata_losers_long['rank_profile'] = compare_strata_losers_long['rank_profile'].astype('category')
compare_strata_losers_long['rank_profile'] = compare_strata_losers_long['rank_profile'].cat.reorder_categories(['top_3', 'page_1', 'page_2'])
The stop_doms list is used to weed out domains from our analysis that the audience
wouldn’t be interested in:
With the stop_doms list, we can filter the dataframe of these undesirable domain
names by negating any sites that are in (using the isin() function) the stop_doms list:
compare_strata_losers_long = compare_strata_losers_long[~compare_strata_losers_long['site'].isin(stop_doms)]
compare_strata_losers_long.head(10)
The data is now in long format with the Phase extracted from the before_reach and
after_reach columns and pushed into a column called Phase. The values of the two
columns sit under a new single column Reach. Let's visualize:
compare_strata_losers_plt = (
    ggplot(compare_strata_losers_long, aes(x = 'reorder(site, Reach)',
                                           y = 'Reach', fill = 'rank_profile')) +
    geom_bar(stat = 'identity', position = 'fill', alpha = 0.8) +
    labs(y = 'Reach', x = ' ') +
    coord_flip() +
    theme(legend_position = 'right', axis_text_x=element_text(rotation=0,
                                                              hjust=1, size = 12)) +
    facet_wrap('Phase')
)

compare_strata_losers_plt.save(filename = 'images/1_compare_strata_losers_plt.png',
                               height=5, width=10, units = 'in', dpi=1000)
compare_strata_losers_plt
We see the proportions of keywords in their rank profile, which are much easier to
see thanks to the fixed lengths (Figure 10-2).
Figure 10-2. Website Google rank proportions (reach) by top 3, page 1, and page
2 before and after
For example, WorkCast had a mixture of top 3 and page 1 rankings which are now all
on page 2. FounderJar had page 2 listings, which are now nowhere to be found.
The fixed lengths are set in the geom_bar() function using the parameter position
set to “fill.” Despite following best practice data visualization as shown earlier, you may
have to acquiesce to your business audience who may want multilength bars as well as
proportions (even if it’s much harder to infer from the chart). So instead of the position
set to fill, we will set it to “stack”:
compare_strata_losers_plt = (
    ggplot(compare_strata_losers_long, aes(x = 'reorder(site, Reach)',
                                           y = 'Reach', fill = 'rank_profile')) +
    geom_bar(stat = 'identity', position = 'stack', alpha = 0.8) +
    labs(y = 'Reach', x = ' ') +
    coord_flip() +
    theme(legend_position = 'right', axis_text_x=element_text(rotation=0,
                                                              hjust=1, size = 12)) +
    facet_wrap('Phase')
)

compare_strata_losers_plt.save(filename = 'images/1_compare_strata_losers_stack_plt.png',
                               height=5, width=10, units = 'in', dpi=1000)
compare_strata_losers_plt
Admittedly, in cases like ON24, the differences that were not as obvious in the
fixed bar length chart (Figure 10-2) become clearer here (Figure 10-3).
Figure 10-3. Website Google rank counts (reach) by top 3, page 1, and page 2
before and after
In contrast, with the free length bars, we can see that ON24 lost at least 10% of
their reach.
Rankings
While reach is nice, as a single metric on its own it is not enough. If you consider the
overall value of your organic presence as a function of price and volume, then reach is
the volume (which we have just addressed). And now we must come to the price, which
in the organic value paradigm is ranking positions.
We’ll aggregate rankings by site for both before and after the core update, starting
with the before dataset:
before_unq_ranks = getstat_bef_unique
Unlike reach where we took the sum of keyword search results, in this case, we’re
taking the average (also known as the mean):
before_unq_ranks = before_unq_ranks.groupby(['site']).agg({'rank': 'mean'}).reset_index()
before_unq_ranks = before_unq_ranks[['site', 'rank']]
before_unq_ranks.sort_values('rank').head(10)
The table shows the average rank by site. As you may infer, the rank per se is quite
meaningless because

• Some keywords have higher search volumes than others.
• The average rank is not zero inflated for keywords the sites don't rank for. For
example, qlik.com's average rank of 1 may be just on one keyword.
Instead of going through the motions, repeating code to calculate and visualize the
rankings for the after dataset and then comparing, we’ll move on to a search volume
weighted average ranking.
before_unq_svranks = getstat_bef_unique
Define the function that takes the dataframe and uses the rank column. The
weighted average is calculated by multiplying the rank by the search volume and then
dividing by the total weight (the search volume sum):
def wavg_rank_sv(x):
    names = {'wavg_rank': (x['rank'] * x['search_volume']).sum() /
             (x['search_volume']).sum()}
    return pd.Series(names, index=['wavg_rank']).round(1)
With the function in place, we'll now use the apply() function to apply the
wavg_rank_sv() function just defined:

before_unq_svranks = before_unq_svranks.groupby(['site']).apply(wavg_rank_sv).reset_index()
before_unq_svranks.sort_values('wavg_rank').head(10)
We can see already that the list of sites have changed due to the search volume
weighting. Even though the weighted average rankings don’t add much value from a
business insight perspective, this is an improvement. However, what we really need is
the full picture being the overall visibility.
Visibility
The visibility will be our index metric for evaluating the value of a site’s organic search
presence taking both reach and ranking into account.
Merge the search volume weighted rank data with reach:
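The merge is elided; a minimal sketch using the names that follow:

before_unq_visi = before_unq_svranks.merge(before_unq_reach, on = 'site', how = 'outer')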
before_unq_visi['reach'] = np.where(before_unq_visi['reach'].isnull(), 0,
                                    before_unq_visi['reach'])
before_unq_visi['wavg_rank'] = np.where(before_unq_visi['wavg_rank'].isnull(), 100,
                                        before_unq_visi['wavg_rank'])
The visibility index is derived by dividing reach by the weighted average rank. The smaller the weighted average rank, the more visible the site, which is why rank is the divisor. Conversely, reach is the numerator because the higher the reach, the higher the visibility.
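A minimal sketch of that computation, assuming the visibility column is named visi:

before_unq_visi['visi'] = before_unq_visi['reach'] / before_unq_visi['wavg_rank']
before_unq_visi = before_unq_visi.sort_values('visi', ascending = False)
before_unq_visi.head(10)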
The results are looking a lot more sensible and reflect what we would expect to see
in the webinar software space. We can also see that gotomeeting.com, despite having
less reach, has a higher visibility score by virtue of ranking higher on more sought-after
search terms. We can thus conclude the visibility score works.
Compute the same for the after dataset:
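A sketch mirroring the before steps, where after_unq_reach is an assumed name for the after per-site reach aggregate:

after_unq_svranks = getstat_aft_unique.groupby(['site']).apply(wavg_rank_sv).reset_index()
after_unq_visi = after_unq_svranks.merge(after_unq_reach, on = 'site', how = 'outer')
after_unq_visi['reach'] = np.where(after_unq_visi['reach'].isnull(), 0, after_unq_visi['reach'])
after_unq_visi['wavg_rank'] = np.where(after_unq_visi['wavg_rank'].isnull(), 100, after_unq_visi['wavg_rank'])
after_unq_visi['visi'] = after_unq_visi['reach'] / after_unq_visi['wavg_rank']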
compare_visi_losers['before_visi'] = np.where(compare_visi_losers['before_visi'].isnull(), 0, compare_visi_losers['before_visi'])
compare_visi_losers['after_visi'] = np.where(compare_visi_losers['after_visi'].isnull(), 0, compare_visi_losers['after_visi'])
compare_visi_losers['delta_visi'] = compare_visi_losers['after_visi'] - compare_visi_losers['before_visi']
compare_visi_losers = compare_visi_losers.sort_values('delta_visi')
compare_visi_losers.head(10)
The comparison view is much clearer: ON24 and WorkCast are the biggest losers of the 2019 core update from Google.
Let’s see the winners:
compare_visi_winners['before_visi'] = np.where(compare_visi_winners['before_visi'].isnull(), 0, compare_visi_winners['before_visi'])
compare_visi_winners['after_visi'] = np.where(compare_visi_winners['after_visi'].isnull(), 0, compare_visi_winners['after_visi'])
compare_visi_winners['delta_visi'] = compare_visi_winners['after_visi'] - compare_visi_winners['before_visi']
compare_visi_winners = compare_visi_winners.sort_values('delta_visi', ascending = False)
compare_visi_winners.head(10)
The biggest winners are publishers, which include nonindustry players like PCMag and Medium.
Here’s some code to convert to long format for visualization:
compare_visi_losers_long['Phase'] = compare_visi_losers_long['Phase'].astype('category')
compare_visi_losers_long['Phase'] = compare_visi_losers_long['Phase'].cat.reorder_categories(['before', 'after'])
compare_visi_losers_long.head(10)
The preceding data is in long format. This will now feed the following graphics code:
compare_visi_losers_plt = (
    ggplot(compare_visi_losers_long, aes(x = 'reorder(site, Visi)', y = 'Visi', fill = 'Phase')) +
    geom_bar(stat = 'identity', position = 'dodge', alpha = 0.8) +
    labs(y = 'Visibility', x = ' ') +
    coord_flip() +
    theme(legend_position = 'right', axis_text_x=element_text(rotation=0, hjust=1, size = 12)) +
    facet_wrap('Phase')
)
compare_visi_losers_plt.save(filename = 'images/1_compare_visi_losers_plt.png', height=5, width=10, units = 'in', dpi=1000)
compare_visi_losers_plt
The separate panels are achieved by using the facet_wrap() function, where we instruct plotnine (the graphics package) to separate panels by Phase (Figure 10-4).
compare_visi_winners_long['Phase'] = compare_visi_winners_long['Phase'].astype('category')
compare_visi_winners_long['Phase'] = compare_visi_winners_long['Phase'].cat.reorder_categories(['before', 'after'])
compare_visi_winners_long.head(10)
compare_visi_winners_plt = (
    ggplot(compare_visi_winners_long, aes(x = 'reorder(site, Visi)', y = 'Visi', fill = 'Phase')) +
    geom_bar(stat = 'identity', position = 'dodge', alpha = 0.8) +
    labs(y = 'Visibility', x = ' ') +
    coord_flip() +
    theme(legend_position = 'right', axis_text_x=element_text(rotation=0, hjust=1, size = 12))
)
compare_visi_winners_plt.save(filename = 'images/1_compare_visi_winners_plt.png', height=5, width=10, units = 'in', dpi=1000)
compare_visi_winners_plt
This time, we’re not using the facet_wrap() function which puts both before and after
bars on the same panel (Figure 10-5).
This makes it easier to compare directly and even get a better sense of the difference
for each site before and after.
Result Types
With the overall performance in hand, we’ll drill down further, starting with result types.
By result types, we mean the format in which the ranking is displayed. This could be, for example
• Image
• News
As usual, we’ll perform aggregations on both before and after datasets. Only this
time, we’ll group by the snippets column:
before_unq_snippets = getstat_bef_unique
We’re aggregating by counting the number of keyword search results the snippet
appears in, which is a form of reach. Most snippets rank in the top 5 positions of the
Search Engine Results Pages, so we won’t bother with snippet rankings.
before_unq_snippets = before_unq_snippets.groupby(['snippets']).agg({'count': 'sum'}).reset_index()
before_unq_snippets = before_unq_snippets[['snippets', 'count']]
before_unq_snippets = before_unq_snippets.rename(columns = {'count': 'reach'})
before_unq_snippets.sort_values('reach', ascending = False).head(10)
Organic predictably has the most reach followed by images and AMP (accelerated
mobile pages).
Repeat the process for the after dataset:
after_unq_snippets = getstat_aft_unique
after_unq_snippets = after_unq_snippets.groupby(['snippets']).agg({'count': 'sum'}).reset_index()
after_unq_snippets = after_unq_snippets[['snippets', 'count']]
after_unq_snippets = after_unq_snippets.rename(columns = {'count': 'reach'})
after_unq_snippets.sort_values('reach', ascending = False).head(10)
Organic has gone down, implying that there could be more diversification of search results. Join the datasets to facilitate an easier comparison:
compare_snippets = before_unq_snippets.merge(after_unq_snippets, on = ['snippets'], how = 'outer')
compare_snippets = compare_snippets.rename(columns = {'reach_x': 'before_reach', 'reach_y': 'after_reach'})
compare_snippets['before_reach'] = np.where(compare_snippets['before_reach'].isnull(), 0, compare_snippets['before_reach'])
compare_snippets['after_reach'] = np.where(compare_snippets['after_reach'].isnull(), 0, compare_snippets['after_reach'])
compare_snippets['delta_reach'] = compare_snippets['after_reach'] - compare_snippets['before_reach']
compare_snippets_losers = compare_snippets.sort_values('delta_reach')
compare_snippets_losers.head(10)
The table confirms that organic sitelinks listings have fallen, followed by places, videos, and related searches. What does this mean? It means that Google is diversifying its results, but not in the direction of videos or local business results. Also, the fall in sitelinks implies the searches are less navigational, which possibly means more opportunity to rank for search phrases that were previously the preserve of certain brands.
compare_snippets_winners = compare_snippets.sort_values('delta_reach', ascending = False)
compare_snippets_winners.head(10)
Comparing the winners, we see that images and pure organic have increased, as has People Also Ask. So the high-level takeaway here is that content should be more FAQ driven and tagged with schema markup, with more use of images in the content. Let's visualize by reformatting the data and feeding it into plotnine:
compare_snippets_losers_long = compare_snippets_losers[['snippets', 'before_reach', 'after_reach']].head(10)
compare_snippets_losers_long = compare_snippets_losers_long.melt(id_vars = ['snippets'], var_name='Phase', value_name='Reach')
compare_snippets_losers_long['Phase'] = compare_snippets_losers_long['Phase'].str.replace('_reach', '')
compare_snippets_losers_long['Phase'] = compare_snippets_losers_long['Phase'].astype('category')
compare_snippets_losers_long['Phase'] = compare_snippets_losers_long['Phase'].cat.reorder_categories(['after', 'before'])
compare_snippets_losers_long = compare_snippets_losers_long[compare_snippets_losers_long['snippets'] != 'organic']
compare_snippets_losers_long.head(10)
compare_snippets_losers_plt = (
    ggplot(compare_snippets_losers_long, aes(x = 'reorder(snippets, Reach)', y = 'Reach', fill = 'Phase')) +
    geom_bar(stat = 'identity', position = 'dodge', alpha = 0.8) +
    labs(y = 'Reach', x = ' ') +
    coord_flip() +
    theme(legend_position = 'right', axis_text_x=element_text(rotation=0, hjust=1, size = 12))
)
compare_snippets_losers_plt.save(filename = 'images/2_compare_snippets_losers_plt.png', height=5, width=10, units = 'in', dpi=1000)
compare_snippets_losers_plt
The great thing about charts like Figure 10-6 is that you get an instant sense of proportion. For example, it's much easier to spot that there are more carousel videos than organic sitelinks post update.
compare_snippets_winners_long = compare_snippets_winners[['snippets', 'before_reach', 'after_reach']].head(10)
compare_snippets_winners_long = compare_snippets_winners_long.melt(id_vars = ['snippets'], var_name='Phase', value_name='Reach')
compare_snippets_winners_long['Phase'] = compare_snippets_winners_long['Phase'].str.replace('_reach', '')
compare_snippets_winners_long['Phase'] = compare_snippets_winners_long['Phase'].astype('category')
compare_snippets_winners_long['Phase'] = compare_snippets_winners_long['Phase'].cat.reorder_categories(['after', 'before'])
compare_snippets_winners_long = compare_snippets_winners_long[compare_snippets_winners_long['snippets'] != 'organic']
compare_snippets_winners_long.head(10)
compare_snippets_winners_plt = (
    ggplot(compare_snippets_winners_long, aes(x = 'reorder(snippets, Reach)', y = 'Reach', fill = 'Phase')) +
    geom_bar(stat = 'identity', position = 'dodge', alpha = 0.8) +
    labs(y = 'Reach', x = ' ') +
    coord_flip() +
    theme(legend_position = 'right', axis_text_x=element_text(rotation=0, hjust=1, size = 12))
)
compare_snippets_winners_plt.save(filename = 'images/1_compare_snippets_winners_plt.png', height=5, width=10, units = 'in', dpi=1000)
compare_snippets_winners_plt
Other than there simply being more of each result type, the increases across all types look relatively uniform (Figure 10-7).
Figure 10-7. Google’s top 10 count reach by result type before and after
Cannibalization
With performance determined, our attention turns to the potential drivers of performance, such as cannibals.
Cannibals occur when a single site has multiple URLs ranking in the search results for one keyword.
We’ll start by using the duplicated SERPs datasets and counting the number of URLs
from the same site per keyword. This will involve a groupby() function on the keyword
and site:
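A sketch of that aggregation, where getstat_before and the url column are assumed names:

cannibals_before_agg = getstat_before[getstat_before['device'] == 'desktop']
cannibals_before_agg = cannibals_before_agg.groupby(['keyword', 'site']).agg({'url': 'count'}).reset_index()
cannibals_before_agg = cannibals_before_agg.rename(columns = {'url': 'count'})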
At this stage, we want to isolate the SERPs rows that are cannibalized, that is, URLs that have other URLs from the same site appearing in the same keyword results.
cannibals_before_agg = cannibals_before_agg[cannibals_before_agg['count'] > 1]
cannibals_before_agg['count'] = 1
cannibals_before_agg = cannibals_before_agg.groupby(['keyword']).agg({'count': 'sum'}).reset_index()
cannibals_before_agg = cannibals_before_agg.rename(columns = {'count': 'before_cannibals'})
cannibals_before_agg
You could argue that these numbers include one URL per site that is not strictly a cannibal. However, this looser calculation is simple and does a robust enough job of capturing the cannibalization trend.
Let’s see how cannibalized the SERPs were following the update:
The preceding preview hints that not much has changed; however, this is hard to tell from one table alone. So let's merge the two and get a side-by-side comparison:
compare_cannibals = cannibals_before_agg.merge(cannibals_after_agg, on = 'keyword', how = 'left')
compare_cannibals['before_cannibals'] = np.where(compare_cannibals['before_cannibals'].isnull(), 0, compare_cannibals['before_cannibals'])
compare_cannibals['after_cannibals'] = np.where(compare_cannibals['after_cannibals'].isnull(), 0, compare_cannibals['after_cannibals'])
compare_cannibals['delta_cannibals'] = compare_cannibals['after_cannibals'] - compare_cannibals['before_cannibals']
compare_cannibals = compare_cannibals.sort_values('delta_cannibals')
compare_cannibals
The table shows at the keyword level that there are fewer cannibals for “webcast guidelines” but more for “enterprise training platform.” But what was the overall trend?
cannibals_trend = compare_cannibals
cannibals_trend['project'] = target_name
cannibals_trend = cannibals_trend.groupby('project').agg({'before_cannibals': 'sum', 'after_cannibals': 'sum', 'delta_cannibals': 'sum'}).reset_index()
cannibals_trend
So there were fewer cannibals overall, by just over 13%, following the core update, as we would expect.
Let’s convert to format before graphing the top cannibals for both SERPs that gained
and lost cannibals:
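A minimal sketch of the reshape, assuming the keywords that lost the most cannibals sit at the top of the sorted table (compare_cannibals_more would be built the same way from the tail of the table):

compare_cannibals_less = compare_cannibals[['keyword', 'before_cannibals', 'after_cannibals']].head(10)
compare_cannibals_less = compare_cannibals_less.melt(id_vars = ['keyword'], var_name='Phase', value_name='cannibals')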
compare_cannibals_less['Phase'] = compare_cannibals_less['Phase'].str.replace('_cannibals', '')
compare_cannibals_less['Phase'] = compare_cannibals_less['Phase'].astype('category')
compare_cannibals_less['Phase'] = compare_cannibals_less['Phase'].cat.reorder_categories(['after', 'before'])
compare_cannibals_less
compare_cannibals_less_plt = (
    ggplot(compare_cannibals_less, aes(x = 'keyword', y = 'cannibals', fill = 'Phase')) +
    geom_bar(stat = 'identity', position = 'dodge', alpha = 0.8) +
    labs(y = '# Cannibals in SERP', x = ' ') +
    coord_flip() +
    theme(legend_position = 'right', axis_text_x=element_text(rotation=0, hjust=1, size = 12))
)
compare_cannibals_less_plt.save(filename = 'images/5_compare_cannibals_less_plt.png', height=5, width=10, units = 'in', dpi=1000)
compare_cannibals_less_plt
Figure 10-8 shows keywords that lost their cannibalizing URLs or had fewer cannibals. The most dramatic loss appears to be “live streaming software,” going from 4 to 1. All of the phrases appear to be quite generic apart from “act on webinars,” which appears to be a brand term for act-on.com.
compare_cannibals_more['Phase'] = compare_cannibals_more['Phase'].str.replace('_cannibals', '')
compare_cannibals_more['Phase'] = compare_cannibals_more['Phase'].astype('category')
compare_cannibals_more['Phase'] = compare_cannibals_more['Phase'].cat.reorder_categories(['after', 'before'])
compare_cannibals_more
compare_cannibals_more_plt = (
    ggplot(compare_cannibals_more, aes(x = 'keyword', y = 'cannibals', fill = 'Phase')) +
    geom_bar(stat = 'identity', position = 'dodge', alpha = 0.8) +
    labs(y = '# Cannibals in SERP', x = ' ') +
    coord_flip() +
    theme(legend_position = 'right', axis_text_x=element_text(rotation=0, hjust=1, size = 12))
)
compare_cannibals_more_plt.save(filename = 'images/5_compare_cannibals_more_plt.png', height=5, width=10, units = 'in', dpi=1000)
compare_cannibals_more_plt
Nothing obvious explains why these keywords gained cannibals while the others lost theirs, as both groups contained a mixture of generic and brand hybrid keywords (Figure 10-9).
Keywords
Let’s establish a general trend before moving the analysis toward the site in question.
Token Length
Perhaps there are keyword patterns, such as token length, that could explain the gains and losses following the core update. Token length measures the number of words in a search query.
Metrics such as search volume are not available in the getSTAT data for both before and after the update. We're interested in how many unique sites were present at each token length, for a general trend view.
We’ll analyze the SERPs for desktop devices; however, the code can easily be adapted
for other devices such as mobiles:
tokensite_before = getstat_bef_unique[getstat_bef_unique['device'] ==
'desktop']
tokensite_after = getstat_aft_unique[getstat_aft_unique['device'] ==
'desktop']
tokensite_after.sort_values(['keyword', 'rank'])
The first step is to aggregate both datasets by token size and phase for both before
and after. We only want the top 10 sites; hence, the filter rank is less than 11. We start by
aggregating at the keyword level within the token size and phase to sum the number of
sites. Then aggregate again by token size and phase to get the overall number of sites
ranking in the top 10 for the token size.
The two-step aggregation was made necessary because of the filtering for the top
10 sites within the keyword; otherwise, we would have aggregated within the phase and
token size in one line.
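A sketch of the two-step aggregation (column names assumed):

tokensite_before_agg = tokensite_before[tokensite_before['rank'] < 11]
tokensite_before_agg = tokensite_before_agg.groupby(['token_size', 'keyword']).agg({'site': 'nunique'}).reset_index()
tokensite_before_agg = tokensite_before_agg.groupby(['token_size']).agg({'site': 'sum'}).reset_index()
tokensite_before_agg = tokensite_before_agg.rename(columns = {'site': 'site_count'})
tokensite_after_agg = tokensite_after[tokensite_after['rank'] < 11]
tokensite_after_agg = tokensite_after_agg.groupby(['token_size', 'keyword']).agg({'site': 'nunique'}).reset_index()
tokensite_after_agg = tokensite_after_agg.groupby(['token_size']).agg({'site': 'sum'}).reset_index()
tokensite_after_agg = tokensite_after_agg.rename(columns = {'site': 'site_count'})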
tokensite_before_agg
With both phases aggregated by site count, we’ll merge these for a side-by-side
comparison:
tokensite_token_deltas = tokensite_before_agg.merge(tokensite_after_agg, on = ['token_size'], how = 'left')
tokensite_token_deltas['sites_delta'] = (tokensite_token_deltas['site_count_y'] - tokensite_token_deltas['site_count_x'])
Cast the token size as a category data type so that we can order these for the table
and the graphs later:
tokensite_token_deltas['token_size'] = tokensite_token_deltas['token_size'].astype('category')
tokensite_token_deltas['token_size'] = tokensite_token_deltas['token_size'].cat.reorder_categories(['head', 'middle', 'long'])
tokensite_token_deltas = tokensite_token_deltas.sort_values('token_size')
tokensite_token_deltas
Let’s visualize:
targetsite_token_viz
targetsite_token_sites_plt = (
ggplot(tokensite_token_viz,
aes(x = 'token_size', y = 'site_count', fill = 'phase')) +
geom_bar(stat = 'identity', position = 'dodge', alpha = 0.8) +
position=position_stack(vjust=0.01)) +
labs(y = 'Unique Site Count', x = 'Query Length') +
coord_flip() +
theme(legend_position = 'right',
axis_text_y =element_text(rotation=0, hjust=1, size = 12),
legend_title = element_blank()
)
)
targetsite_token_sites_plt.save(filename = 'images/8_targetsite_token_
sites_plt.png', height=5, width=10, units = 'in', dpi=1000)
targetsite_token_sites_plt
So that’s the general trend graphed for our PowerPoint deck (Figure 10-10). The
question is which sites gained and lost?
The data is now filtered for the desktop and long tail, making it ready for analysis
using aggregation:
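A hedged sketch of that filter and aggregation (names assumed):

longs_before = getstat_bef_unique[(getstat_bef_unique['device'] == 'desktop') & (getstat_bef_unique['token_size'] == 'long')]
longs_before_agg = longs_before.groupby('site').agg(before_rank = ('rank', 'mean'), before_reach = ('count', 'sum')).reset_index()
longs_before_agg.sort_values('before_rank').head()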
So far, we see that Qualtrics ranked around 1.2 (on average) on 112 long-tail
keywords on desktop searches. We’ll repeat the aggregation for the after data:
Following the core update, HubSpot and Sitecore have moved ahead of Qualtrics
within the top 5 in the long tail. Medium has moved out of the top 5. Let’s make this
comparison easier:
compare_longs['before_reach'] = np.where(compare_longs['before_reach'].isnull(), 0, compare_longs['before_reach'])
compare_longs['after_reach'] = np.where(compare_longs['after_reach'].isnull(), 0, compare_longs['after_reach'])
compare_longs['before_rank'] = np.where(compare_longs['before_rank'].isnull(), 100, compare_longs['before_rank'])
compare_longs['after_rank'] = np.where(compare_longs['after_rank'].isnull(), 100, compare_longs['after_rank'])
compare_longs['before_visi'] = np.where(compare_longs['before_visi'].isnull(), 0, compare_longs['before_visi'])
compare_longs['after_visi'] = np.where(compare_longs['after_visi'].isnull(), 0, compare_longs['after_visi'])
compare_longs.sort_values('delta_visi').head(12)
As confirmed, Qualtrics lost the most in the long tail. Let’s visualize, starting with
the losers:
longs_reach_losers_long = compare_longs.sort_values('delta_visi')
longs_reach_losers_long = longs_reach_losers_long[['site', 'before_visi', 'after_visi']]
longs_reach_losers_long = longs_reach_losers_long[~longs_reach_losers_long['site'].isin(['google.co.uk', 'youtube.com'])]
longs_reach_losers_long = longs_reach_losers_long.head(10)
longs_reach_losers_long
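The reshape to long format for plotting is assumed to follow the same pattern as before; a minimal sketch:

longs_reach_losers_long = longs_reach_losers_long.melt(id_vars = ['site'], var_name='phase', value_name='visi')
longs_reach_losers_long['phase'] = longs_reach_losers_long['phase'].str.replace('_visi', '')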
longs_reach_losers_plt = (
    ggplot(longs_reach_losers_long, aes(x = 'reorder(site, visi)', y = 'visi', fill = 'phase')) +
    geom_bar(stat = 'identity', position = 'dodge', alpha = 0.8) +
    labs(y = 'Visibility', x = '') +
    coord_flip() +
    theme(legend_position = 'right',
          axis_text_y = element_text(rotation=0, hjust=1, size = 12),
          legend_title = element_blank())
)
longs_reach_losers_plt.save(filename = 'images/10_longs_visi_losers_plt.png', height=5, width=10, units = 'in', dpi=1000)
longs_reach_losers_plt
longs_reach_winners_long = compare_longs.sort_values('delta_visi')
longs_reach_winners_long = longs_reach_winners_long[['site', 'before_visi', 'after_visi']]
We’ll also remove Google and YouTube as Google may have biased their owned
properties in search results following the algorithm update:
longs_reach_winners_long = longs_reach_winners_long[~longs_reach_winners_long['site'].isin(['google.co.uk', 'google.com', 'youtube.com'])]
Taking the tail rather than the head selects the winners, because the table is sorted in ascending order of visibility change, from the sites that lost the most visibility at the top down to the sites that gained the most:
longs_reach_winners_long = longs_reach_winners_long.tail(10)
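As with the losers, a reshape to long format is assumed before plotting:

longs_reach_winners_long = longs_reach_winners_long.melt(id_vars = ['site'], var_name='phase', value_name='visi')
longs_reach_winners_long['phase'] = longs_reach_winners_long['phase'].str.replace('_visi', '')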
longs_reach_winners_plt = (
    ggplot(longs_reach_winners_long, aes(x = 'reorder(site, visi)', y = 'visi', fill = 'phase')) +
    geom_bar(stat = 'identity', position = 'dodge', alpha = 0.8) +
    labs(y = 'Google Visibility', x = '') +
    coord_flip() +
    theme(legend_position = 'right',
          axis_text_y = element_text(rotation=0, hjust=1, size = 12),
          legend_title = element_blank())
)
longs_reach_winners_plt.save(filename = 'images/10_longs_visi_winners_plt.png', height=5, width=10, units = 'in', dpi=1000)
longs_reach_winners_plt
In the long-tail space, HubSpot and Sitecore are the clear winners (Figure 10-12).
Target Level
With the general trends established, it’s time to get into the details. Naturally, SEO
practitioners and marketers want to know the performance by keywords and pages in
terms of top gainers and losers. We’ll split the analysis between keywords and pages.
Keywords
To achieve this, we'll filter for the target site ON24 for both before and after the core update. The weighted average rank doesn't apply here because we're aggregating at the keyword level, where there is only one rank value per keyword:
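A minimal sketch of those steps, with target_bef and target_aft as assumed names for the filtered frames:

target_bef = getstat_bef_unique[getstat_bef_unique['site'] == 'on24.com']
target_aft = getstat_aft_unique[getstat_aft_unique['site'] == 'on24.com']
before_site_ranks = target_bef.groupby('keyword').agg({'rank': 'mean'}).reset_index()
after_site_ranks = target_aft.groupby('keyword').agg({'rank': 'mean'}).reset_index()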
With the two datasets in hand, we’ll merge them to get a side-by-side comparison:
compare_site_ranks['delta_rank'] = compare_site_ranks['before_rank'] - compare_site_ranks['after_rank']
compare_site_ranks
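The reshape for plotting is assumed to mirror the earlier ones; a sketch:

compare_site_ranks_long = compare_site_ranks.melt(id_vars = ['keyword'], value_vars = ['before_rank', 'after_rank'], var_name='Phase', value_name='rank')
compare_site_ranks_long['Phase'] = compare_site_ranks_long['Phase'].str.replace('_rank', '')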
compare_site_ranks_long
compare_keywords_rank_plt = (
    ggplot(compare_site_ranks_long, aes(x = 'keyword', y = 'rank', fill = 'Phase')) +
    geom_bar(stat = 'identity', position = 'dodge', alpha = 0.8) +
    labs(y = 'Google Rank', x = ' ') +
    scale_y_reverse() +
    coord_flip() +
    theme(legend_position = 'right', axis_text_x=element_text(rotation=0, hjust=1, size = 12))
)
compare_keywords_rank_plt.save(filename = 'images/6_compare_keywords_rank_plt.png', height=5, width=10, units = 'in', dpi=1000)
compare_keywords_rank_plt
“Salesforce webinars” and “online webinars” really fell by the wayside going from the
top 10 to nowhere (Figure 10-13).
Figure 10-13. Average rank positions by keyword for ON24 before and after
By contrast, “webinar events” and “live webinar” gained. Knowing this would help us prioritize keywords for further analysis to recover traffic. For example, the SEO in charge of ON24 might want to analyze the top 20 ranking competitors for “webinar” to generate recovery recommendations.
Pages
Use the target keyword dataset, which has been prefiltered to include the target site ON24:
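A sketch of the aggregation by URL (aggregate names assumed, reusing the prefiltered target_bef and target_aft frames from the keyword analysis):

targetURLs_before_agg = target_bef.groupby('url').agg({'count': 'sum', 'search_volume': 'sum', 'rank': 'mean'}).reset_index()
targetURLs_before_agg = targetURLs_before_agg.rename(columns = {'count': 'reach'})
targetURLs_after_agg = target_aft.groupby('url').agg({'count': 'sum', 'search_volume': 'sum', 'rank': 'mean'}).reset_index()
targetURLs_after_agg = targetURLs_after_agg.rename(columns = {'count': 'reach'})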
target_urls_deltas = targetURLs_before_agg.merge(targetURLs_after_agg, on = ['url'], how = 'left')
target_urls_deltas = target_urls_deltas.rename(columns = {'reach_x': 'before_reach', 'reach_y': 'after_reach',
                                                          'search_volume_x': 'before_sv', 'search_volume_y': 'after_sv',
                                                          'rank_x': 'before_rank', 'rank_y': 'after_rank'})
target_urls_deltas['after_reach'] = np.where(target_urls_deltas['after_reach'].isnull(), 0, target_urls_deltas['after_reach'])
target_urls_deltas['after_sv'] = np.where(target_urls_deltas['after_sv'].isnull(), target_urls_deltas['before_sv'], target_urls_deltas['after_sv'])
target_urls_deltas['after_rank'] = np.where(target_urls_deltas['after_rank'].isnull(), 100, target_urls_deltas['after_rank'])
target_urls_deltas['after_visi'] = np.where(target_urls_deltas['after_visi'].isnull(), 0, target_urls_deltas['after_visi'])
target_urls_deltas = target_urls_deltas.sort_values(['visi_delta'], ascending = False)
target_urls_deltas
winning_urls = target_urls_deltas['url'].head(10).tolist()
target_url_winners['phase'] = target_url_winners['phase'].astype('category')
target_url_winners['phase'] = target_url_winners['phase'].cat.reorder_categories(['after', 'before'])
target_url_winners
target_url_winners_plt = (
    ggplot(target_url_winners, aes(x = 'reorder(url, visi)', y = 'visi', fill = 'phase')) +
    geom_bar(stat = 'identity', position = 'dodge', alpha = 0.8) +
    labs(y = 'Visi', x = '') +
    coord_flip() +
    theme(legend_position = 'right',
          axis_text_y = element_text(rotation=0, hjust=1, size = 12),
          legend_title = element_blank())
)
target_url_winners_plt.save(filename = 'images/8_target_url_winners_plt.png', height=5, width=10, units = 'in', dpi=1000)
target_url_winners_plt
The Live Webcast Elite page gained the most visibility, due to gaining positions on searches for “webcast,” as seen earlier (Figure 10-14).
Figure 10-14. URL visibility gainers for ON24 before and after
If we had website analytics data such as Google Analytics, we could merge it with the URLs to get an idea of how much traffic the rankings were worth and how closely it correlates with search volumes.
Let’s take a look at the losing URLs:
losing_urls = target_urls_deltas['url'].tail(10).tolist()
print(losing_urls)
target_url_losers = target_url_losers[target_url_losers['url'].isin(losing_urls)]
target_url_losers['phase'] = target_url_losers['phase'].astype('category')
target_url_losers['phase'] = target_url_losers['phase'].cat.reorder_categories(['after', 'before'])
target_url_losers
target_url_losers_plt = (
    ggplot(target_url_losers, aes(x = 'reorder(url, visi)', y = 'visi', fill = 'phase')) +
    geom_bar(stat = 'identity', position = 'dodge', alpha = 0.8) +
    labs(y = 'Visi', x = '') +
    coord_flip() +
    theme(legend_position = 'right',
          axis_text_y = element_text(rotation=0, hjust=1, size = 12),
          legend_title = element_blank())
)
target_url_losers_plt.save(filename = 'images/8_target_url_losers_plt.png', height=5, width=10, units = 'in', dpi=1000)
target_url_losers_plt
“How webinars work” and “upcoming webinars” were the biggest losing URLs
(Figure 10-15).
Figure 10-15. URL visibility losers for ON24 before and after
The URL https://www.on24.com/blog/how-webinars-work/#:~:text=Let's%20start%20with%20a%20simple,using%20other%20available%20interactive%20tools seems like it wasn't canonicalized (i.e., there was no rel="canonical" URL defined to consolidate any URL variant or duplicate).
One possible follow-up would be to use Google Search Console data to extract the search queries for each URL and see if the URLs have the content to satisfy the queries generating the impressions.
Another possible follow-up would be to segment the URLs and keywords according
to their content type. This could help determine if there were any general patterns that
could explain and speed up the recovery process.
Segments
We return to the SERPs to analyze how different site types fared in the Google update. The general approach will be to work out the most visible sites before using the np.select() function to categorize and label them.
Top Competitors
To find the top competitor sites, we’ll aggregate both before and after datasets to work
out the visibility index derived from the reach and search volume weighted rank average:
players_before = getstat_bef_unique
print(players_before.columns)
players_before_rank = players_before.groupby('site').apply(wavg_rank_sv).reset_index()
players_before_reach = players_before.groupby('site').agg({'count': 'sum'}).sort_values('count', ascending = False).reset_index()
players_before_agg = players_before_rank.merge(players_before_reach, on = 'site', how = 'left')
players_before_agg['visi'] = players_before_agg['count'] / players_before_agg['wavg_rank']
players_before_agg = players_before_agg.sort_values('visi', ascending = False)
players_before_agg
players_after = getstat_aft_unique
print(players_after.columns)
players_after_rank = players_after.groupby('site').apply(wavg_rank_sv).reset_index()
players_after_reach = players_after.groupby('site').agg({'count': 'sum'}).sort_values('count', ascending = False).reset_index()
players_after_agg = players_after_rank.merge(players_after_reach, on = 'site', how = 'left')
players_after_agg['visi'] = players_after_agg['count'] / players_after_agg['wavg_rank']
players_after_agg = players_after_agg.sort_values('visi', ascending = False)
players_after_agg
To put the data aggregation together, we take the before dataset and exclude any sites appearing in the after dataset. The purpose is to perform an outer join with the after dataset to capture every possible site:
players_agg = players_before_agg[~players_before_agg['site'].isin(players_after_agg['site'])]
players_agg = players_agg.merge(players_after_agg, how='outer', indicator=True)
players_agg = players_agg.sort_values('visi', ascending = False)
players_agg.head(50)
Now that we have all of the sites in descending order of visibility, we can start categorizing the domains by site type. Using the hopefully now familiar np.select() function, we will categorize the sites manually, creating a list of conditions that match lists of sites and then mapping these to a separate list of category names:
site_conds = [
    players_agg['site'].str.contains('|'.join(['google.com', 'youtube.com'])),
    players_agg['site'].str.contains('|'.join(['wikipedia.org'])),
    players_agg['site'].str.contains('|'.join(['medium.com', 'forbes.com', 'hbr.org', 'smartinsights.com',
                                               'mckinsey.com', 'techradar.com', 'searchenginejournal.com',
                                               'cmswire.com', 'entrepreneur.com', 'pcmag.com',
                                               'elearningindustry.com', 'businessnewsdaily.com'])),
    players_agg['site'].isin(['on24.com', 'gotomeeting.com', 'marketo.com', 'zoom.us', 'livestorm.co',
                              'hubspot.com', 'drift.com', 'salesforce.com', 'clickmeeting.com', 'liferay.com',
                              'qualtrics.com', 'workcast.com', 'livewebinar.com', 'getresponse.com',
                              'brightwork.com', 'superoffice.com', 'myownconference.com', 'info.workcast.com',
                              'tallyfy.com', 'readytalk.com', 'eventbrite.com', 'sitecore.com', 'pgi.com',
                              '3cx.com', 'walkme.com', 'venngage.com', 'tableau.com', 'netsuite.com',
                              'zoominfo.com', 'sproutsocial.com']),
    players_agg['site'].isin(['neilpatel.com', 'ventureharbour.com', 'wordstream.com', 'business.tutsplus.com',
                              'convinceandconvert.com', 'growthmarketingpro.com', 'marketinginsidergroup.com',
                              'adamenfroy.com', 'danielwaas.com', 'newbreedmarketing.com']),
    players_agg['site'].str.contains('|'.join(['trustradius.com', 'g2.com', 'capterra.com', 'softwareadvice.com'])),
    players_agg['site'].str.contains('|'.join(['facebook.com', 'linkedin.com', 'business.linkedin.com'])),
    players_agg['site'].str.contains('|'.join(['.edu', '.ac.uk']))
]
Create a list of the values we want to assign for each condition; the categories in this case are based on business model or site purpose. Then create a new column and use np.select() to assign values to it, using our lists as arguments:
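A sketch of both steps; the segment labels here are assumed, inferred from the analysis that follows:

segment_values = ['google', 'reference', 'publisher', 'martech', 'marketing blog', 'reviews', 'social media', 'education']
players_agg['segment'] = np.select(site_conds, segment_values, default = 'other')
players_agg_map = players_agg[['site', 'segment']]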
The sites are categorized. We'll now look at the sites classed as other. This is useful because if we spot any sites important enough not to be “other,” we can update the conditions above.
players_agg[players_agg['segment'] == 'other'].head(20)
There you have a mapping dataframe which will be used to give segmented SERPs
insights, starting with visibility.
Visibility
With the sites categorized, we can now compare performance by site type before and
after the update.
As usual, we’ll aggregate the before and after datasets. Only this time, we will also
merge the site type labels.
Start with the before dataset:
before_sector_unq_reach = getstat_bef_unique.merge(players_agg_map, on = 'site', how = 'left')
We filter for the top 10 to calculate our reach statistics, which we’ll need for our
visibility calculations later on:
before_sector_unq_reach = before_sector_unq_reach[before_sector_unq_reach['rank'] < 11]
before_sector_agg_reach = before_sector_unq_reach.groupby(['segment']).agg({'count': 'sum'}).reset_index()
before_sector_agg_reach = before_sector_agg_reach.rename(columns = {'count': 'reach'})
before_sector_agg_reach = before_sector_agg_reach[['segment', 'reach']]
before_sector_agg_reach['reach'] = np.where(before_sector_agg_reach['reach'].isnull(), 0, before_sector_agg_reach['reach'])
after_sector_unq_reach = getstat_aft_unique.merge(players_agg_map, on = 'site', how = 'left')
after_sector_unq_reach = after_sector_unq_reach[after_sector_unq_reach['rank'] < 11]
after_sector_agg_reach = after_sector_unq_reach.groupby(['segment']).agg({'count': 'sum'}).reset_index()
after_sector_agg_reach = after_sector_agg_reach.rename(columns = {'count': 'reach'})
after_sector_agg_reach['reach'] = np.where(after_sector_agg_reach['reach'].isnull(), 0, after_sector_agg_reach['reach'])
“Other” as a site type segment dominates the reach statistics; we may want to filter this out later on. Now for the weighted average rankings by search volume, which reuse the wavg_rank_sv() function defined earlier.
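A sketch of those aggregations (names assumed):

before_sector_unq_svranks = before_sector_unq_reach.groupby(['segment']).apply(wavg_rank_sv).reset_index()
after_sector_unq_svranks = after_sector_unq_reach.groupby(['segment']).apply(wavg_rank_sv).reset_index()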
before_sector_unq_visi = before_sector_unq_svranks.merge(before_sector_agg_reach, on = 'segment', how = 'left')
before_sector_unq_visi['reach'] = np.where(before_sector_unq_visi['reach'].isnull(), 0, before_sector_unq_visi['reach'])
before_sector_unq_visi['wavg_rank'] = np.where(before_sector_unq_visi['wavg_rank'].isnull(), 100, before_sector_unq_visi['wavg_rank'])
before_sector_unq_visi['visibility'] = before_sector_unq_visi['reach'] / before_sector_unq_visi['wavg_rank']
before_sector_unq_visi = before_sector_unq_visi.sort_values('visibility', ascending = False)
after_sector_unq_visi = after_sector_unq_svranks.merge(after_sector_agg_reach, on = 'segment', how = 'left')
after_sector_unq_visi['reach'] = np.where(after_sector_unq_visi['reach'].isnull(), 0, after_sector_unq_visi['reach'])
after_sector_unq_visi['wavg_rank'] = np.where(after_sector_unq_visi['wavg_rank'].isnull(), 100, after_sector_unq_visi['wavg_rank'])
after_sector_unq_visi['visibility'] = after_sector_unq_visi['reach'] / after_sector_unq_visi['wavg_rank']
after_sector_unq_visi = after_sector_unq_visi.sort_values('visibility', ascending = False)
after_sector_unq_visi
As well as reach, “other” performs well in the search volume weighted rank stakes and therefore in overall visibility. With the before and after segmented datasets aggregated, we can now join them:
compare_sector_visi_players = before_sector_unq_visi.merge(after_sector_unq_visi, on = ['segment'], how = 'outer')
compare_sector_visi_players = compare_sector_visi_players.rename(columns = {'wavg_rank_x': 'before_rank', 'wavg_rank_y': 'after_rank',
                                                                            'reach_x': 'before_reach', 'reach_y': 'after_reach',
                                                                            'visibility_x': 'before_visi', 'visibility_y': 'after_visi'})
compare_sector_visi_players['before_visi'] = np.where(compare_sector_visi_players['before_visi'].isnull(), 0, compare_sector_visi_players['before_visi'])
compare_sector_visi_players['after_visi'] = np.where(compare_sector_visi_players['after_visi'].isnull(), 0, compare_sector_visi_players['after_visi'])
compare_sector_visi_players['delta_visi'] = compare_sector_visi_players['after_visi'] - compare_sector_visi_players['before_visi']
compare_sector_visi_players = compare_sector_visi_players.sort_values('delta_visi')
compare_sector_visi_players.head(10)
The only site group that lost were reference sites like Wikipedia, dictionaries, and so on. Their reach increased by 11%, but their rankings declined by almost two places on average. This could be because nonreference sites are churning out more value-adding articles, crowding out generic sites like Wikipedia that have no expertise in those areas.
compare_sector_visi_players_long = compare_sector_visi_players[['segment', 'before_visi', 'after_visi']]
compare_sector_visi_players_long = compare_sector_visi_players_long.melt(id_vars = ['segment'], var_name='Phase', value_name='Visi')
compare_sector_visi_players_long['Phase'] = compare_sector_visi_players_long['Phase'].str.replace('_visi', '')
compare_sector_visi_players_long['Phase'] = compare_sector_visi_players_long['Phase'].astype('category')
compare_sector_visi_players_long['Phase'] = compare_sector_visi_players_long['Phase'].cat.reorder_categories(['after', 'before'])
compare_sector_visi_players_long.head(10)
compare_sector_visi_players_long_plt = (
    ggplot(compare_sector_visi_players_long, aes(x = 'reorder(segment, Visi)', y = 'Visi', fill = 'Phase')) +
    geom_bar(stat = 'identity', position = 'dodge', alpha = 0.8) +
    labs(y = 'Visibility', x = ' ') +
    coord_flip() +
    theme(legend_position = 'right', axis_text_x=element_text(rotation=0, hjust=1, size = 12))
)
compare_sector_visi_players_long_plt.save(filename = 'images/11_compare_sector_visi_players_long_plt.png', height=5, width=10, units = 'in', dpi=1000)
compare_sector_visi_players_long_plt
So, other than reference sites, every category gained, with martech and publishers gaining the most (Figure 10-16).
Snippets
In addition to visibility, we can dissect result types by segment too. Although there are
many visualizations that can be done by segment, we’ve chosen snippets so that we can
introduce a heatmap visualization technique.
This time, we’ll aggregate on snippets and segments, having performed the join for
both before and after datasets:
before_sector_unq_snippets = getstat_bef_unique.merge(players_agg_map, on = 'site', how = 'left')
before_sector_agg_snippets = before_sector_unq_snippets.groupby(['snippets', 'segment']).agg({'count': 'sum'}).reset_index()
before_sector_agg_snippets = before_sector_agg_snippets[['snippets', 'segment', 'count']]
before_sector_agg_snippets = before_sector_agg_snippets.rename(columns = {'count': 'reach'})
after_sector_unq_snippets = getstat_aft_unique.merge(players_agg_map, on = 'site', how = 'left')
after_sector_agg_snippets = after_sector_unq_snippets.groupby(['snippets', 'segment']).agg({'count': 'sum'}).reset_index()
after_sector_agg_snippets = after_sector_agg_snippets[['snippets', 'segment', 'count']]
after_sector_agg_snippets = after_sector_agg_snippets.rename(columns = {'count': 'reach'})
after_sector_agg_snippets.sort_values('reach', ascending = False).head(10)
Post update, we can see that much of other's reach comes from organic, images, and AMP results. How does that compare with before the update?
compare_sector_snippets = before_sector_agg_snippets.merge(after_sector_agg_snippets, on = ['snippets', 'segment'], how = 'outer')
compare_sector_snippets = compare_sector_snippets.rename(columns = {'reach_x': 'before_reach', 'reach_y': 'after_reach'})
compare_sector_snippets['before_reach'] = np.where(compare_sector_snippets['before_reach'].isnull(), 0, compare_sector_snippets['before_reach'])
compare_sector_snippets['after_reach'] = np.where(compare_sector_snippets['after_reach'].isnull(), 0, compare_sector_snippets['after_reach'])
compare_sector_snippets['delta_reach'] = compare_sector_snippets['after_reach'] - compare_sector_snippets['before_reach']
compare_sector_snippets = compare_sector_snippets.sort_values('delta_reach')
compare_sector_snippets.head(10)
Review sites lost the most reach in the organic listings and Google images. Martech
lost out on AMP results.
compare_sector_snippets.tail(10)
By contrast, publisher sites appear to have displaced review sites on images and organic results. Since we're more interested in result types other than the regular organic links, we'll strip these out before visualizing. Otherwise, we'd end up with charts that show a massive weight for organic links, dwarfing the rest of the result types.
compare_sector_snippets_graphdf = compare_sector_snippets[compare_sector_snippets['snippets'] != 'organic']
compare_sector_snippets_plt = (
    ggplot(compare_sector_snippets_graphdf, aes(x = 'segment', y = 'snippets', fill = 'delta_reach')) +
    geom_tile(stat = 'identity', alpha = 0.6) +
    labs(y = '', x = '') +
    theme_classic() +
    theme(legend_position = 'right', axis_text_x=element_text(rotation=90, hjust=1))
)
compare_sector_snippets_plt.save(filename = 'images/12_compare_sector_snippets_plt.png', height=5, width=10, units = 'in', dpi=1000)
compare_sector_snippets_plt
The heatmap in Figure 10-17 uses color as a third dimension to display where the major changes in reach were for the different site segments (bottom) and result types (vertical).
Figure 10-17. Heatmap showing the difference in top 10 counts by site type and result type
The major change that stands out is Google image results for the other segment.
The rest appears inconsequential by contrast. The heatmap is an example of how three-
dimensional categorical data can be visualized.
Summary
We focused on analyzing performance following algorithm updates, with a view to explaining what happened and extracting insights for further research and recommendation generation.
Not only did we evaluate methods for establishing visibility changes, but our general approach was to analyze general SERP trends before segmenting by result types and cannibals. Then we looked at the target site level, examining the changes by keyword, query length, and URL.
We also evaluated general SERP trends by grouping sites into category segments, giving a richer analysis of the SERPs by visibility and snippets. While visualizing the data before and after a core update doesn't always reveal the causes of the algorithm update, some patterns can be learned to inform further areas of research for recommendation generation. The data can always be joined with other data sources, using the techniques outlined in competitor analysis, to uncover potential ranking factor hypotheses for split testing.
The next chapter discusses the future of SEO.
CHAPTER 11
The Future of SEO
Aggregation
Aggregation is a data analysis technique where data is summarized at a group level, for
example, the average number of clicks by content group. In Microsoft Excel, this would
be achieved using pivot tables.
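For instance, a minimal pandas sketch with hypothetical column names and data:

import pandas as pd

# hypothetical example: average clicks by content group
df = pd.DataFrame({'content_group': ['blog', 'blog', 'product', 'product'],
                   'clicks': [120, 80, 300, 260]})
avg_clicks = df.groupby('content_group').agg({'clicks': 'mean'}).reset_index()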
Aggregation is something that can and should be automated. It has been applied throughout this book to help us understand more areas of SEO than can be listed here. With the code supplied, it could be integrated into cloud-based data pipelines for SEO applications that load data warehouses and power dashboards.
Aggregation for most use cases is good enough. However, for scientific exploration
of the data and hypothesis testing, we need to consider the distribution of the data, its
variation, and other statistical properties.
Aggregation for reporting and many analytical aspects of SEO consulting can certainly be automated and carried out in the cloud. This is pretty straightforward to do given the Apache data pipeline technology that's already in place.
Distributions
The statistical distribution has power because we're able to understand what is normal and therefore identify anything that is significantly above or below normal performance.
We used distributions to find growth keywords where keywords in Google Search
Console (GSC) had impressions above the 95th percentile for their ranking position.
We also used distributions to identify content that lacked internal links in a website
architecture and hierarchy.
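A minimal sketch of the growth-keyword idea with hypothetical data, flagging impressions above the 95th percentile for their ranking position:

import pandas as pd

# hypothetical GSC export: ranking position and impressions per keyword
gsc = pd.DataFrame({'position': [1, 1, 1, 2, 2, 2],
                    'impressions': [900, 1200, 5000, 400, 450, 2000]})
p95 = gsc.groupby('position')['impressions'].quantile(0.95).rename('p95')
gsc = gsc.join(p95, on = 'position')
growth_keywords = gsc[gsc['impressions'] > gsc['p95']]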
This can also be automated in the cloud, which could lead to applications being released for the SEO industry that automate the identification of internal link opportunities. There is a slight risk, of course, that internal links are added on a purely distributional basis without factoring in content not intended for search channels; this would need to be part of the software design.
String Matching
String matching using the Sorensen-Dice algorithm has been applied to help match
content to URLs for use cases such as keyword mapping, migration planning, and others.
The results are decent as it’s relatively quick and scales well, but it relies on
descriptive URLs and title tags in the first instance. It also relies on preprocessing such as
removing the website domain portion of the URL before applying string matching, which
is easily automated. Less easy to work around is the human judgment of what is similar
enough for the title and URL slug to be a good match. Should the threshold be 70%,
83%, or 92%?
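To make this concrete, here is one common variant of the algorithm using character-bigram sets (a sketch, not necessarily the exact implementation used earlier in the book):

def sorensen_dice(a: str, b: str) -> float:
    # similarity of two strings via overlapping character bigrams
    bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    a_bi, b_bi = bigrams(a.lower()), bigrams(b.lower())
    if not a_bi or not b_bi:
        return 0.0
    return 2 * len(a_bi & b_bi) / (len(a_bi) + len(b_bi))

sorensen_dice('how webinars work', 'how-webinars-work')  # roughly 0.8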
That is less easy and would probably require some self-learning in the form of an AI, more specifically a recurrent neural network (RNN). It's not impossible, of course, though you would need to determine what a good and less good outcome metric is in order to train a model. Plus, you'd need at least a million data points for training.
A key question for keyword mapping will be “what is the metric that shows which
URL is the best for users searching for X keyword.” An RNN could be good here as it
could learn from top ranking SERP content, reduce it to an object, and then compare site
content against that object to map the keyword to.
For redirecting expired URLs (with backlinks) to live URLs with a 200 HTTP response, it could be more straightforward and not require an AI. You might use a decision tree-based algorithm driven by user behavior to inform the best replacement URL; that is, users on URL A would always go to URL X out of the other available URL choices.
A non-AI-based solution doesn't rely on millions of SERP or Internet data points and would therefore be (relatively) inexpensive to construct in-house. The AI-based solutions, on the other hand, are likely to be built either as a SaaS or by a mega enterprise brand that relies on organic traffic, like Amazon, Quora, or Tripadvisor.
Clustering
In this book, clustering has been applied to determine search intent by comparing search results at scale. The principles are based on comparing distances between data points; wherever a distance is relatively small, a cluster exists. Word stemming hasn't been applied in this book as it lacks the required precision despite its speed.
Clustering is useful not only for understanding keywords but also for grouping
keywords for reporting performance and grouping content to set the website hierarchy.
Your imagination as an SEO expert is the limit.
Applications already exist in the SEO marketplace for clustering keywords according
to search intent by comparing search results, so this can and already has been automated
in the cloud by Artios, Keyword Insights, Keyword Cupid, SEO Scout and others.
Set Theory
Set theory is where we compare sets (think lists) of data, such as one website's keywords against another's. This can be used to see the difference between two datasets. It was used for content gap analysis to find common keywords (i.e., where the keywords of two websites intersect) and to find the gap between the target site and the core keyword set.
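A minimal sketch with hypothetical keyword sets:

site_a = {'webinar software', 'virtual events', 'webcast tips'}
site_b = {'webinar software', 'online meetings', 'webcast tips'}
common_keywords = site_a & site_b  # intersection: keywords both sites rank for
content_gap = site_b - site_a      # keywords site_b has that site_a doesn't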
This is pretty straightforward and can easily be automated using tools like SEMRush
and AHREFs. So why do it in Python? Because it’s free and it gives you more control over
the degree of intersection required.
The perfect degree of intersection is less clear, because working out the required overlap would take research and development, and it will depend in part on the number of sites being intersected.
However, the skill is knowing which competitors to include in the set in the first place, which may not be so easy for a machine to discern.
hypotheses to test and create SEO solutions. Spending less time collecting and analyzing
data and more time responding to the data is the order of the day.
You’ll also be in a far better position to work with software engineers when it comes
to specifying cloud apps, be it reporting, automation, or otherwise.
Of course, creativity comes from knowledge, so the more you know about SEO, the
better the questions you will be able to ask of the data, producing better insights as a
consequence and much more targeted recommendations.
Summary
In this chapter, we consolidated at a very high level the ideas and techniques used to
make SEO data driven:
• Aggregation
• Distributions
• String matching
• Clustering
• Set theory
We also examined what computers can and can’t do and provided a reminder why
SEO experts should turn to data science.
Here’s to the future of SEO.