Data-Driven SEO with Python: Solve SEO Challenges with Data Science Using Python
Andreas Voniatis
Surrey, UK
Foreword by Will Critchlow, Founder and CEO, SearchPilot
© Andreas Voniatis 2023
Table of Contents

Acknowledgments .... xix
Foreword .... xxv
Chapter 1: Introduction .... 1
    The Inexact (Data) Science of SEO .... 1
        Noisy Feedback Loop .... 1
        Diminishing Value of the Channel .... 2
        Making Ads Look More like Organic Listings .... 2
        Lack of Sample Data .... 2
        Things That Can't Be Measured .... 3
        High Costs .... 4
    Why You Should Turn to Data Science for SEO .... 4
        SEO Is Data Rich .... 4
        SEO Is Automatable .... 5
        Data Science Is Cheap .... 5
    Summary .... 5
Chapter 3: Technical .... 63
    Where Data Science Fits In .... 64
    Modeling Page Authority .... 64
        Filtering in Web Pages .... 66
        Examine the Distribution of Authority Before Optimization .... 67
Diagnostics .... 454
Road Map .... 463
Summary .... 467
Index .... 569
About the Author
Andreas Voniatis is the founder of Artios and an SEO consultant with over 20 years' experience working with ad agencies (PHD, Havas, Universal McCann, Mindshare, and iProspect) and brands (Amazon EU, Lyst, Trivago, GameSys). Andreas founded Artios in 2015 to apply an advanced mathematical approach and cloud AI/machine learning to SEO.
With a background in SEO, data science and cloud engineering, Andreas has helped
companies gain an edge through data science and automation. His work has been
featured in publications worldwide including The Independent, PR Week, Search Engine
Watch, Search Engine Journal and Search Engine Land.
Andreas is a qualified accountant, holds a degree in Economics from Leeds University, and has specialised in SEO science for over a decade. Through his firm Artios, Andreas helps grow startups, providing ROI guarantees, and trains enterprise SEO teams on data-driven SEO.
About the Contributing Editor
Simon Dance is the Chief Commercial Officer at Lyst.com, a fashion shopping platform serving over 200M users a year; an angel investor; and an experienced SEO, having spent a 15-year career in senior leadership positions, including Head of SEO for Amazon's UK and European marketplaces and senior SEO roles at large-scale marketplaces in the flights and vacation rental space, as well as consulting for venture-backed companies including Depop, Carwow, and HealthUnlocked. Simon has worn multiple hats over his career, from building links, manually auditing vast backlink profiles, carrying out comprehensive bodies of keyword research, and writing technical audit documents spanning hundreds of pages to building, mentoring, and leading teams who have unlocked significant improvements in SEO performance, generating hundreds of millions of dollars of incremental revenue. Simon met Andreas in 2015, when Andreas had just built a rudimentary set of Python scripts designed to vastly increase the scale, speed, and accuracy of carrying out detailed keyword research and classification. They have worked together almost ever since.
About the Technical Reviewer
Joos Korstanje is a data scientist with over five years
of industry experience in developing machine learning
tools. He has a double MSc in Applied Data Science and
in Environmental Science and has extensive experience
working with geodata use cases. He has worked at a
number of large companies in the Netherlands and France,
developing machine learning for a variety of tools.
Acknowledgments
It’s my first book and it wouldn’t have been possible without the help of a few people. I’d
like to thank Simon Dance, my contributing editor, who has asked questions and made
suggested edits using his experience as an SEO expert and commercial director. I’d also
like to thank all of the people at Springer Nature and Apress for their help and support.
Wendy for helping me navigate the commercial seas of getting published. Will Critchlow
for providing the foreword to this book. All of my colleagues, clients, and industry peers
including SEOs, data scientists, and cloud engineers that I have had the pleasure of
working with. Finally, my family, Petra and Julia.
Why I Wrote This Book
Since 2003, when I first got into SEO (by accident), much has changed in the practice of
SEO. The ingredients were lesser known even though much of the focus was on getting
backlinks, be they reciprocal, one-way links or from private networks (which are still
being used in the gaming space). Other ingredients include transitioning to becoming a
recognized brand, producing high-quality content which is valuable to users, a delightful
user experience, producing and organizing content by search intent, and, for now and
tomorrow, optimizing the search journey.
Many of the ingredients are now well known, and they have become more complicated with the advent of mobile, social media, and voice and with the increasing sophistication of search engines.
Now more than ever, the devil is in the details. There is more data being generated
than ever before from ever more third-party data sources and tools. Spreadsheets alone
won’t hack it. You need a sharper blade, and data science (combined with your SEO
knowledge) is your best friend.
I created this book for you, to make your SEO data driven and therefore the best
it can be.
And why now in 2023? Because COVID-19 happened, which gave me time to think about how I could add value to the world, and in particular to the niche world of SEO. Even more pertinently, there are lots of conversations on Twitter and LinkedIn about SEOs and the use of Python in SEO. So we felt the timing was right: the SEO industry has the appetite, and we have knowledge to share.
I wish you the very best in your new adventure as a data-driven SEO specialist!
• Code to get going: The best way to learn naturally is by doing. While
there are many courses in SEO, the most committed students of SEO
will build their own websites and test SEO ideas and practices. Data
science for SEO is no different if you want to make your SEO data
driven. So, you’ll be provided with starter scripts in Python to try
your own hand in clustering pages and content, analyzing ranking
factors. There will be code for most things but not for everything, as
not everything has been coded for (yet). The code is there to get you
started and can always be improved upon.
• How to become an SEO specialist: What this book won’t cover is how
to become an SEO expert although you’ll certainly come away with a
lot of knowledge on how to be a better SEO specialist. There are some
fundamentals that are beyond the scope of this book.
For example, we don’t get into how a search engine works, what a content
management system is, how it works, and how to read and code HTML and CSS. We also
don’t expose all of the ranking factors that a search engine might use to rank websites or
how to perform a site relaunch or site migration.
This book assumes you have a rudimentary knowledge of how SEO works and what
SEO is. We will give a data-driven view of the many aspects of SEO, and that is to reframe
the SEO challenge from a data science perspective so that you have a useful construct to
begin with.
• How to become a data scientist: This book will certainly expose the
data science techniques to solve SEO challenges. What it won’t do is
teach you to become a data scientist or teach you how to program in
the Python computing language.
1. Keyword research
2. Technical
3. Content and UX
4. Authority
5. Competitor analysis
6. Experiments
7. Dashboards
9. Google updates
• Data sources
• Data structures
• Models
• Activation suggestions
I've tried to apply data science to as many SEO processes as possible in the areas identified earlier. Naturally, there will be some areas where it could be applied but has not been yet. However, technology is changing, and Google is already releasing updates to combat AI-written content. So I'd imagine in the very near future, more and more areas of SEO will be subject to data science.
Foreword
The data we have access to as SEOs has changed a lot during my 17 years in the industry. Although we lost analytics-level keyword data and Yahoo! Site Explorer, we gained
a wealth of opportunity in big data, proprietary metrics, and even some from the horse’s
mouth in Google Search Console.
You don’t have to be able to code to be an effective SEO. But there is a certain kind of
approach and a certain kind of mindset that benefits from wrangling data in all its forms.
If that’s how you prefer to work, you will very quickly hit the limits of spreadsheets and
text editors. When you do, you’ll do well to turn to more powerful tools to help you scale
what you’re capable of, get things done that you wouldn’t even have been able to do
without a computer helping, and speed up every step of the process.
There are a lot of programming languages, and a lot of ways of learning them. Some
people will tell you there is only one right way. I’m not one of those people, but my
personal first choice has been Python for years now. I liked it initially for its relative
simplicity and ease of getting started, and very quickly fell for the magic of being able to
import phenomenal power written by others with a single line of code. As I got to know
the language more deeply and began to get some sense of the “pythonic” way of doing
things, I came to appreciate the brevity and clarity of the language. I am no expert, and
I’m certainly not a professional software engineer, but I hope that makes me a powerful
advocate for the approach outlined in this book - because I have been the target market.
When I was at university, I studied neural networks among many other things. At the
time, they were fairly abstract concepts in operations research. At that point in the late
90s, there wasn’t the readily available computing power plus huge data sets needed to
realise the machine learning capabilities hidden in those nodes, edges, and statistical
relationships. I’ve remained fascinated by what is possible and with the help of magical
import statements and remarkably mature frameworks, I have even been able to build
and train my own neural networks in Python. As a stats geek, I love that it’s all stats under
the hood, but at the same time, I appreciate the beauty in a computer being able to do
something a person can’t.
A couple of years after university, I founded the SEO agency Distilled with my co-founder Duncan Morris, and one of the things that we encouraged among our SEO
consultants was taking advantage of the data and tools at their disposal. This led to
fun innovation - both decentralised, in individual consultants building scripts and
notebooks to help them scale their work, do it faster, or be more effective, and centrally
in our R&D team.
That R&D team would be the group who built the platform that would become
SearchPilot and launched the latest stage of my career where we are very much leading
the charge for data aware decisions in SEO. We are building the enterprise SEO A/B
testing platform to help the world’s largest websites prove the value of their on-site SEO
initiatives. All of this uses similar techniques to those outlined in the pages that follow to
decide how to implement tests, to consume data from a variety of APIs, and to analyse
their results with neural networks.
I believe that as Google implements more and more of their own machine learning into their ranking algorithms, SEO becomes fundamentally harder as the system becomes harder to predict and has a greater variance across sites, keywords, and topics.
It’s for this reason that I am investing so much time, energy, and the next phase of my
career into our corner of data-driven SEO. I hope that this book can set a whole new
cohort of SEOs on a similar path.
I first met Andreas over a decade ago in London. I’ve seen some of the things he
has been able to build over the years, and I’m sure he is going to be an incredible
guide through the intricacies of wrangling data to your benefit in the world of
SEO. Happy coding!
Will Critchlow, CEO, SearchPilot
September 2022
CHAPTER 1
Introduction
Before the Google Search Essentials (formerly Webmaster Guidelines), there was an
unspoken contract between SEOs and search engines which promised traffic in return
for helping search engines extract and index website content. This chapter introduces
you to the challenges of applying data science to SEO and why you should use data.
• High costs
within their systems before it gets reflected in the search engine results (which may or
may not result in a change of ranking position).
Because of this variable and unpredictable time lag, it is rather difficult to undertake cause-and-effect analysis to learn from SEO experiments.
To have a dataset that truly represents the SEO reality, it must contain multiple audit measurements which allow for statistics such as averages and standard deviations per day of
• Duplicate content
• Missing titles
With this type of data, data scientists are able to do meaningful SEO science work
and track these to rankings and UX outcomes.
• Search query: Google, for some time, has been hiding the search
query detail of organic traffic, of which the keyword detail in Google
Analytics is shown as “Not Provided.” Naturally, this would be useful
as there are many keywords to one URL relationship, so getting the
breakdown would be crucial for attribution modeling outcomes, such
as leads, orders, and revenue.
• Search volume: Google Ads does not fully disclose search volume per
search query. The search volume data for long tail phrases provided
by Ads is reallocated to broader matches because it’s profitable for
Google to encourage users to bid on these terms as there are more
bidders in the auction. Google Search Console (GSC) is a good
substitute, but is first-party data and is highly dependent on your
site’s presence for your hypothesis keyword.
• Segment: This would tell us who is searching, not just the keyword,
which of course would in most cases vastly affect the outcomes of
any machine-learned SEO analysis because a millionaire searching for "mens jeans" would expect different results from another user of more modest means. After all, Google is serving personalized results.
Not knowing the segment simply adds noise to any SERPs model or
otherwise.
High Costs
Can you imagine running a large enterprise crawling technology like Botify daily? Most
brands run a crawl once a month because it's cost prohibitive, and that's just on your own site. To get a complete dataset, you'd need to run it on your competitors too, and that's only one type of SEO data.
Cost won’t matter as much to the ad agency data scientist, but it will affect whether
they will get access to the data because the agency may decide the budget isn’t
worthwhile.
SEO Is Automatable
At least in certain aspects. We're not saying that robots will take over your career. And yet, we believe there is a case that a computer can do some aspects of your job as an SEO instead. After all, computers are extremely good at doing repetitive tasks: they don't get tired or bored, can "see" beyond three dimensions, and only live on electricity.
Andreas has taken over teams where certain members spent time constantly copying
and pasting information from one document to another (the agency and individual will
remain unnamed to spare their blushes).
Doing repetitive work that can be easily done by a computer is not value adding, emotionally engaging, or good for your mental health. The point is we as humans are
at our best when we’re thinking and synthesizing information about a client’s SEO; that’s
when our best work gets done.
Summary
This brief introductory chapter has covered the following:
• The inexact science of SEO
CHAPTER 2
Keyword Research
Behind every query a user enters within a search engine is a word or series of words.
For instance, a user may be looking for a “hotel” or perhaps a “hotel in New York City.”
In search engine optimization (SEO), keywords are invariably the target, providing a
helpful way of understanding demand for said queries and helping to more effectively
understand various ways that users search for products, services, organizations, and,
ultimately, answers.
As well as SEO starting from keywords, it also tends to end with the keyword as an
SEO campaign may be evaluated on the value of the keyword’s contribution. Even if this
information is hidden from us by Google, attempts have been made by a number of SEO
tools to infer the keyword used by users to reach a website.
In this chapter, we will give you data-driven methods for finding valuable keywords
for your website (to enable you to have a much richer understanding of user demand).
It’s also worth noting that given keyword rank tracking comes at a cost (usually
charged per keyword tracked or capped at a total number of keywords), it makes sense to
know which keywords are worth the tracking cost.
Data Sources
There are a number of data sources when it comes to keyword research, which we’ll list
as follows:
• Google Search Console
• Competitor Analytics
• SERPs
• Google Trends
• Google Ads
• Google Suggest
We'll cover the ones highlighted in bold, as they are not only the more informative of the data sources but also scale well as data science methods go. Google Ads data would only be as appealing if it were based on actual impression data.
We will also show you how to make forecasts of keyword data both in terms of the
amount of impressions you get if you achieve a ranking on page 1 (within positions 1 to
10) and what this impact would be over a six-month horizon.
Armed with a detailed understanding of how customers search, you’re in a much
stronger position to benchmark where you index vs. this demand (in order to understand
the available opportunity you can lean into), as well as be much more customer focused
when orienting your website and SEO activity to target that demand.
Let’s get started.
1. In 2006, AOL shared click-through rate data based upon over 35 million search queries, and since then it has inspired numerous models that try to estimate the click-through rate (CTR) by search engine ranking position. That is, for every 100 people searching for "hotels in New York," 30% (for example) click on the position 1 ranking, with just 16% clicking on position 2 (hence the importance of achieving the top ranked position in order to, effectively, double your traffic for that keyword).
There is no hard and fast rule. Two sigmas simply means that there's a less than 5% chance that the search query is actually no different from the average search query, so a lower significance threshold like one sigma could easily suffice.
The data are several exports from Google Search Console (GSC) of the top 1000
rows based on a number of filters. The API could be used, and some code is provided in
Chapter 10 showing how to do so.
For now, we’re reading multiple GSC export files stored in a local folder.
Set the path to read the files:
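A minimal sketch of this step, assuming the exports sit in a local folder (the folder name is a placeholder; the filename pattern matches the "_queries.csv" suffix stripped in the loop below):

import os
from glob import glob

gsc_path = 'data/gsc_exports/'  # placeholder folder holding the GSC CSV exports
gsc_csvs = glob(os.path.join(gsc_path, '*_queries.csv'))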
Initialize an empty list that will store the data being read in:
gsc_li = []
The for loop iterates through each export file and takes the filename as the modifier
used to filter the results and then appends it to the preceding list:
for cf in gsc_csvs:
    df = pd.read_csv(cf, index_col=None, header=0)
    df['modifier'] = os.path.basename(cf)
    df.modifier = df.modifier.str.replace('_queries.csv', '')
    gsc_li.append(df)
Once the list is populated with the export data, it’s combined into a single dataframe:
gsc_raw_df = pd.DataFrame()
gsc_raw_df = pd.concat(gsc_li, axis=0, ignore_index=True)
gsc_raw_df.columns = gsc_raw_df.columns.str.strip().str.lower()\
    .str.replace(' ', '_').str.replace('(', '').str.replace(')', '')
gsc_raw_df.head()
With the data imported, we’ll want to format the column values to be capable of
being summarized. For example, we’ll remove the percent signs in the ctr column and
convert it to a numeric format:
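The cleaning of the ctr column isn't shown here; a minimal sketch, assuming the dataframe name used in the next block (gsc_clean_ctr_df):

gsc_clean_ctr_df = gsc_raw_df.copy()
gsc_clean_ctr_df['ctr'] = gsc_clean_ctr_df['ctr'].str.replace('%', '')
gsc_clean_ctr_df['ctr'] = pd.to_numeric(gsc_clean_ctr_df['ctr'])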
GSC data contains a funny character “<” in the impressions and clicks columns for
values less than 10; our job is to clean this up by removing them and then arranging
impressions in descending order. In Python, this would look like
gsc_clean_ctr_df['impressions'] = gsc_clean_ctr_df.impressions.str.replace('<', '')
gsc_clean_ctr_df['impressions'] = pd.to_numeric(gsc_clean_ctr_df.impressions)  # assign the numeric conversion
gsc_dedupe_df = gsc_clean_ctr_df.drop_duplicates(subset='top_queries', keep="first")
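The segment modifier lists used below are simply lists of keyword patterns, one per segment; the values here are illustrative placeholders only, as is the rename from top_queries to query:

import numpy as np

retail_vex = ['buy', 'cheap', 'deal', 'price']          # illustrative values only
platform_vex = ['ps4', 'ps5', 'xbox', 'nintendo']       # illustrative values only
title_vex = ['fifa', 'call of duty', 'minecraft']       # illustrative values only
network_vex = ['psn', 'xbox live', 'nintendo online']   # illustrative values only

gsc_segment_strdetect = gsc_dedupe_df.rename(columns={'top_queries': 'query'})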
query_conds = [
    gsc_segment_strdetect['query'].str.contains('|'.join(retail_vex)),
    gsc_segment_strdetect['query'].str.contains('|'.join(platform_vex)),
    gsc_segment_strdetect['query'].str.contains('|'.join(title_vex)),
    gsc_segment_strdetect['query'].str.contains('|'.join(network_vex))
]
Create a new column and use np.select to assign values to it using our lists as
arguments:
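A minimal sketch of that step, with illustrative segment labels:

segment_values = ['Retail', 'Platform', 'Title', 'Network']  # illustrative labels only
gsc_segment_strdetect['segment'] = np.select(query_conds, segment_values, default='Other')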
gsc_segment_strdetect
gsc_segment_strdetect['rank_bracket'] = gsc_segment_strdetect.position.round(0)
gsc_segment_strdetect
def imp_aggregator(col):
    d = {}
    d['avg_imps'] = col['impressions'].mean()
    d['imps_median'] = col['impressions'].quantile(0.5)
    d['imps_lq'] = col['impressions'].quantile(0.25)
    d['imps_uq'] = col['impressions'].quantile(0.95)
    d['n_count'] = col['impressions'].count()
    return pd.Series(d)  # closing return assumed; it is cut off by the page layout in the original
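The grouping that feeds this aggregator isn't shown; presumably it is a simple groupby on the rank bracket:

group_by_rank_bracket = gsc_segment_strdetect.groupby('rank_bracket')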
overall_rankimps_agg = group_by_rank_bracket.apply(imp_aggregator)
overall_rankimps_agg
In this case, we went with the 25th and 95th percentiles. The lower percentile number doesn't matter as much because we're far more interested in finding queries with averages beyond the 95th percentile. If we can do that, we have a juicy keyword. Quick note: in data science, percentiles are a type of quantile.
Could we make a table for each and every segment? For example, show the statistics
for impressions by ranking position by section. Yes, of course, you could, and in theory,
it would provide a more contextual analysis of queries performed vs. their segment
average. The deciding factor on whether to do so or not depends on how many data
points (i.e., ranked queries) you have for each rank bracket to make it worthwhile (i.e.,
statistically robust). You’d want at least 30 data points in each to go that far.
query_quantile_stats = gsc_segment_strdetect.merge(overall_rankimps_agg, on=['rank_bracket'], how='left')
query_quantile_stats

query_stats_uq = query_quantile_stats.loc[query_quantile_stats.impressions > query_quantile_stats.imps_uq]
query_stats_uq['query'].count()
8390
Get the number of keywords with impressions and ranking beyond page 1:
query_stats_uq_p2b = query_quantile_stats.loc[(query_quantile_stats.impressions > query_quantile_stats.imps_uq) &
                                              (query_quantile_stats.rank_bracket > 10)]
query_stats_uq_p2b['query'].count()
2510
Depending on your resources, you may wish to track all 8390 keywords or just the
2510. Let’s see how the distribution of impressions looks visually across the range of
ranking positions:
sns.set(rc={'figure.figsize':(15, 6)})
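# A sketch (assumed, not the book's exact plotting code): reshape the aggregated
# stats to long form and draw one line per quantile statistic across rank brackets
imprank_long = overall_rankimps_agg.reset_index().melt(
    id_vars='rank_bracket',
    value_vars=['avg_imps', 'imps_median', 'imps_lq', 'imps_uq'],
    var_name='quantile', value_name='impressions')
imprank_plt = sns.relplot(data=imprank_long, x='rank_bracket', y='impressions',
                          hue='quantile', kind='line', height=6, aspect=2.5)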
imprank_plt.savefig("images/imprank_plt.png")
What’s interesting is the upper quantile impression keywords are not all in the top
10, but many are on pages 2, 4, and 6 of the SERP results (Figure 2-1). This indicates
that the site is either targeting the high-volume keywords but not doing a good job of
achieving a high ranking position or not targeting these high-volume phrases.
Figure 2-1. Line chart showing GSC impressions per ranking position bracket for
each distribution quantile
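The code that builds the segment-faceted figure isn't shown; a sketch of how it might be produced, reusing the aggregator per segment (the variable names here are assumptions):

segment_rankimps_agg = gsc_segment_strdetect.groupby(['segment', 'rank_bracket'])\
    .apply(imp_aggregator).reset_index()
segment_long = segment_rankimps_agg.melt(
    id_vars=['segment', 'rank_bracket'],
    value_vars=['avg_imps', 'imps_median', 'imps_lq', 'imps_uq'],
    var_name='quantile', value_name='impressions')
imprank_seg = sns.relplot(data=segment_long, x='rank_bracket', y='impressions',
                          hue='quantile', col='segment', col_wrap=3, kind='line')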
imprank_seg.savefig("images/imprank_seg.png")
Most of the high impression keywords are in Accessories, Console, and of course Top
1000 (Figure 2-2).
Figure 2-2. Line chart showing GSC impressions per ranking position bracket for
each distribution quantile faceted by segment
query_stats_uq_p2b.to_csv('exports/query_stats_uq_p2b_TOTRACK.csv')
Activation
Now that you’ve identified high impression value keywords, you can
• Think about how to integrate these new targets into your strategy
Obviously, the preceding list is reductionist, and yet as a minimum, you have better
nonbrand targets to better serve your SEO campaign.
Google Trends
Google Trends is another (free) third-party data source, which shows time series data
(data points over time) up to the last five years for any search phrase that has demand.
Google Trends can also help you compare whether a search is on the rise (or decline)
while comparing it to other search phrases. It can be highly useful for forecasting.
Although no Google Trends API exists, there are packages in Python (i.e., pytrends)
that can automate the extraction of this data as we’ll see as follows:
import pandas as pd
from pytrends.request import TrendReq
import time

pytrends = TrendReq(hl='en-GB', tz=360)  # initialize the pytrends session (this setup line is assumed)
Single Keyword
Now that you’ve identified high impression value keywords, you can see how they’ve
trended over the last five years:
kw_list = ["Blockchain"]
pytrends.build_payload(kw_list, cat=0, timeframe='today 5-y', geo='GB', gprop='')
pytrends.interest_over_time()
Multiple Keywords
As you can see earlier, you get a dataframe with the date, the keyword, and the number of hits (scaled from 0 to 100), which is great. But what if you had 10,000 keywords that you wanted trends for?
In that case, you’d want a for loop to query the search phrases one by one and stick
them all into a dataframe like so:
Read in your target keyword data:
csv_raw = pd.read_csv('data/your_keyword_file.csv')
keywords_df = csv_raw[['query']]
keywords_list = keywords_df['query'].values.tolist()
keywords_list
['nintendo switch',
'ps4',
'xbox one controller',
'xbox one',
'xbox controller',
'ps4 vr',
'Ps5' ...]
Let’s now get Google Trends data for all of your keywords in one dataframe:
dataset = []
exceptions = []

for q in keywords_list:
    q_lst = [q]
    try:
        pytrends.build_payload(kw_list=q_lst, timeframe='today 5-y', geo='GB', gprop='')
        data = pytrends.interest_over_time()
        data = data.drop(labels=['isPartial'], axis='columns')
        dataset.append(data)
        time.sleep(3)
    except:
        exceptions.append(q_lst)
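The step that combines the collected frames isn't shown; a minimal sketch, reshaping to the long format (date, keyword, hits) described next and exporting it for reuse later in the chapter:

keyword_gtrends_df = pd.concat(dataset, axis=1)
keyword_gtrends_df = keyword_gtrends_df.reset_index().melt(
    id_vars='date', var_name='keyword', value_name='hits')
keyword_gtrends_df.to_csv('exports/keyword_gtrends_df.csv')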
Looking at Google Trends raw, we now have data in long format showing
• Date
• Keyword
• Hits
Let’s visualize some of these over time. We start by subsetting the dataframe:
sns.set(rc={'figure.figsize':(15, 6)})
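# A sketch (assumed): subset a few keywords and draw a seaborn line plot of hits
# over time; the keyword selection and column names are assumptions
gtrends_subset = keyword_gtrends_df.loc[keyword_gtrends_df['keyword'].isin(
    ['ps4', 'ps5', 'xbox one', 'xbox series x'])].copy()
gtrends_subset['date'] = pd.to_datetime(gtrends_subset['date'])
keyword_gtrends_plt = sns.lineplot(data=gtrends_subset, x='date', y='hits', hue='keyword')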
keyword_gtrends_plt.figure.savefig("images/keyword_gtrends.png")
keyword_gtrends_plt
Here, we can see that "ps5" and "xbox series x" show a near identical trend which ramps up significantly, while the other models are fairly stable and seasonal until the arrival of the new models.
df = pd.read_csv("exports/keyword_gtrends_df.csv", index_col=0)
df.head()
As we'd expect, the data from Google Trends is a very simple time series with date, keyword, and hits spanning a five-year period. Time to format the dataframe to go from long to wide:
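The reshaping code isn't shown; one way to do it, keeping only the two PlayStation queries used in the rest of this section (column names follow the long-format sketch above):

ps_unstacked = df.loc[df['keyword'].isin(['ps4', 'ps5'])]\
    .pivot(index='date', columns='keyword', values='hits')\
    .reset_index()
ps_unstacked.head()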
We no longer have a hits column as these are the values of the queries in their
respective columns. This format is not only useful for SARIMA2 (which we will be
exploring here) but also neural networks such as long short-term memory (LSTM). Let’s
plot the data:
ps_unstacked.plot(figsize=(10,5))
From the plot (Figure 2-4), you’ll note that the profiles of both “PS4” and “PS5” are
different.
2. Seasonal Autoregressive Integrated Moving Average
For the nongamers among you, “PS4” is the fourth generation of the Sony
PlayStation console, and “PS5” the fifth. “PS4” searches are highly seasonal and have
a regular pattern apart from the end when the “PS5” emerged. The “PS5” didn’t exist
five years ago, which would explain the absence of trend in the first four years of the
preceding plot.
from statsmodels.tsa.seasonal import seasonal_decompose  # import assumed; not shown in the original

ps_unstacked.set_index("date", inplace=True)
ps_unstacked.index = pd.to_datetime(ps_unstacked.index)
query_col = 'ps5'

a = seasonal_decompose(ps_unstacked[query_col], model="add")
a.plot();
Figure 2-5 shows the time series data and the overall smoothed trend showing it rises
from 2020.
The seasonal trend box shows repeated peaks which indicates that there is
seasonality from 2016, although it doesn’t seem particularly reliable given how flat the
time series is from 2016 until 2020. Also suspicious is the lack of noise as the seasonal
plot shows a virtually uniform pattern repeating periodically.
The Resid (which stands for “Residual”) shows any pattern of what’s left of the time
series data after accounting for seasonality and trend, which in effect is nothing until
2020 as it’s at zero most of the time.
For “ps4,” see Figure 2-6.
We can see fluctuation over the short term (Seasonality) and long term (Trend), with
some noise (Resid). The next step is to use the augmented Dickey-Fuller method (ADF)
to statistically test whether a given time series is stationary or not:
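The test itself isn't shown; a minimal sketch using statsmodels' adfuller:

from statsmodels.tsa.stattools import adfuller

# run the augmented Dickey-Fuller test on each series and report the p-value
for col in ps_unstacked.columns:
    adf_stat, p_value = adfuller(ps_unstacked[col].dropna())[:2]
    print(col, 'ADF p-value:', p_value)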
We can see that the p-value of "PS5" shown earlier is more than 0.05, which means that the time series data is not stationary and therefore needs differencing. "PS4," on the other hand, is less than 0.05 at 0.01, meaning it's stationary and doesn't require differencing.
The point of all this is to understand the parameters that would be used if we were
manually building a model to forecast Google searches.
from pmdarima import auto_arima  # import assumed; not shown in the original

ps4_s = auto_arima(ps_unstacked['ps4'],
                   trace=True,
                   m=52,  # there are 52 periods per season (weekly data)
                   start_p=0,
                   start_d=0,
                   start_q=0,
                   seasonal=False)

# an equivalent call on ps_unstacked['ps5'] produces ps5_s (not shown here)
The preceding printout shows that the parameters that get the best results are
PS4: ARIMA(4,0,3)(0,0,0)
PS5: ARIMA(3,1,3)(0,0,0)
The PS5 estimate is further detailed when printing out the model summary:
ps5_s.summary()
By minimizing AIC and BIC, we get the best estimated parameters for p and q.
ps4_order = ps4_s.get_params()['order']
ps4_seasorder = ps4_s.get_params()['seasonal_order']
ps5_order = ps5_s.get_params()['order']
ps5_seasorder = ps5_s.get_params()['seasonal_order']
params = {
    "ps4": {"order": ps4_order, "seasonal_order": ps4_seasorder},
    "ps5": {"order": ps5_order, "seasonal_order": ps5_seasorder}
}
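The training split (X) that the next lines refer to isn't shown in the text; a minimal assumed version holds out the final 26 weeks as a test window:

X = ps_unstacked.iloc[:-26]     # training window (assumed split)
test = ps_unstacked.iloc[-26:]  # hold-out window for validation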
results = []
fig, axs = plt.subplots(len(X.columns), 1, figsize=(24, 12))
Make forecasts and plot the predictions:
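The loop that fits each series, forecasts over the hold-out window, and plots the result isn't shown here; a minimal sketch of what it might look like, using the statsmodels SARIMAX API and the parameters found earlier (the plotting details and error metric are assumptions, not the book's exact code):

from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error

for index, col in enumerate(X.columns):
    # fit a SARIMAX model using the orders found by auto_arima
    arima_model = SARIMAX(X[col],
                          order=params[col]['order'],
                          seasonal_order=params[col]['seasonal_order'])
    arima_result = arima_model.fit()

    # forecast over the hold-out period and measure the error
    forecast = arima_result.get_forecast(steps=len(test)).predicted_mean
    rmse = mean_squared_error(test[col], forecast, squared=False)
    results.append({'column': col, 'rmse': rmse})

    # plot actuals vs. forecasts on this column's axis
    axs[index].plot(X[col], label='train')
    axs[index].plot(test[col], label='actual')
    axs[index].plot(forecast, label='forecast')
    axs[index].set_title(f"{col} (RMSE: {rmse:.2f})")
    axs[index].legend()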
For ps4, the forecasts are pretty accurate from the beginning until March when the
search values start to diverge (Figure 2-7), while the ps5 forecasts don’t appear to be very
good at all, which is unsurprising.
Figure 2-7. Time series line plots comparing forecasts and actual data for both
ps4 and ps5
The forecasts show the models are good when there is enough history, until the patterns suddenly change as they did for PS4 from March onward. For PS5, the models are hopeless virtually from the get-go. We know this because the Root Mean Squared Error (RMSE) is 8.62 for PS4, around a third of the PS5 RMSE of 27.5, which, given Google Trends varies from 0 to 100, is a 27% margin of error.
oos_train_data = ps_unstacked
oos_train_data.tail()
As you can see from the preceding table extract, we’re now using all available data.
Now we shall predict the next six months (defined as 26 weeks) in the following code:
oos_results = []
weeks_to_predict = 26
fig, axs = plt.subplots(len(ps_unstacked.columns), 1, figsize=(24, 12))
Again, iterate through the columns to fit the best model each time:
for index, col in enumerate(oos_train_data.columns):
    # refit auto_arima on the full series each time to get the best parameters
    # (this loop header and the auto_arima call are assumed from the surrounding text)
    s = auto_arima(oos_train_data[col], trace=True, m=52, seasonal=False)

    oos_arima_model = SARIMAX(oos_train_data[col],
                              order=s.get_params()['order'],
                              seasonal_order=s.get_params()['seasonal_order'])
    oos_arima_result = oos_arima_model.fit()
Make forecasts and plot the predictions:
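Continuing inside the loop above, a sketch (assumed details) of the forecasting and plotting step:

    # forecast the next 26 weeks for this column; get_forecast is the statsmodels API,
    # while the print and plotting details below are assumptions
    oos_forecast = oos_arima_result.get_forecast(steps=weeks_to_predict)
    forecast_mean = oos_forecast.predicted_mean
    print('Column:', col, '- Mean:', oos_train_data[col].mean())
    oos_results.append(forecast_mean.rename(col))

    # plot history and forecast on this column's axis
    axs[index].plot(oos_train_data[col], label='actual')
    axs[index].plot(forecast_mean, label='forecast')
    axs[index].set_title(col)
    axs[index].legend()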
Best model: ARIMA(3,1,3)(0,0,0)[0]
Total fit time: 7.954 seconds
Column: ps5 - Mean: 3.973076923076923
This time, we automated the finding of the best-fitting parameters and fed that
directly into the model.
The forecasts don't look great (Figure 2-8) because there's been a lot of change in the last few weeks of the data; however, that is specific to these two keywords.
Figure 2-8. Out-of-sample forecasts of Google Trends for ps4 and ps5
The forecast quality will be dependent on how stable the historic patterns are and
will obviously not account for unforeseeable events like COVID-19.
Export your forecasts:
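For example, if the forecasts were collected as named series in oos_results (as in the sketch above), a simple export might be (the filename is a placeholder):

pd.concat(oos_results, axis=1).to_csv('exports/keyword_forecasts.csv')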
What we learn here is where forecasting using statistical models is useful or likely to add value, particularly in automated systems like dashboards: that is, when there is stable historical data and not when there is a sudden spike, as with PS5.
Take three example search phrases:
“Trench coats”
“Ladies trench coats”
“Life insurance”
“Trench coats” will share the same search intent as “Ladies trench coats” but won’t
share the same intent as “Life insurance.” To work this out, a simple comparison of the
top 10 ranking sites for both search phrases in Google will offer a strong suggestion of
what Google thinks of the search intent between the two phrases.
It’s not a perfect method, but it works well because you’re using the ranking results
which are a distillation of everything Google has learned to date on what content
satisfies the search intent of the search query (based upon the trillions of global searches
per year). Therefore, it’s reasonable to surmise that if two search queries have similar
enough SERPs, then the search intent is shared between keywords.
This is useful for a number of reasons:
• Paid search ads: Good keyword content mappings also mean you
can improve the account structure and resulting quality score of your
paid search activity.
Starting Point
Okay, time to cluster. We’ll assume you already have the top 100 SERPs3 results for each
of your keywords stored as a Python dataframe “serps_input.” The data is easily obtained
from a rank tracking tool, especially if they have an API:
serps_input
Here, we’re using DataForSEO’s SERP API,4 and we have renamed the column from
“rank_absolute” to “rank.”
3. Search Engine Results Pages (SERP)
4. Available at https://dataforseo.com/apis/serp-api/
Here it goes:
Split:
serps_grpby_keyword = serps_input.groupby("keyword")
def filter_twenty_urls(group_df):
    filtered_df = group_df.loc[group_df['url'].notnull()]
    filtered_df = filtered_df.loc[filtered_df['rank'] <= 20]
    return filtered_df

filtered_serps = serps_grpby_keyword.apply(filter_twenty_urls)
normed = normed.add_prefix('normed_')
filtered_serps_df = pd.concat([filtered_serps],axis=0)
filtserps_grpby_keyword = filtered_serps_df.groupby("keyword")

def string_serps(df):
    df['serp_string'] = ''.join(df['url'])
    return df
Combine
strung_serps = filtserps_grpby_keyword.apply(string_serps)
strung_serps = pd.concat([strung_serps],axis=0)
strung_serps = strung_serps[['keyword', 'serp_string']]#.head(30)
strung_serps = strung_serps.drop_duplicates()
strung_serps
Now that we have a table showing each keyword and its SERP string, we're ready to compare SERPs. Here's an example of the SERP string for "fifa 19 ps4":
strung_serps.loc[1, 'serp_string']
'https://www.amazon.co.uk/Electronic-Arts-221545-FIFA-PS4/dp/
B07DLXBGN8https://www.amazon.co.uk/FIFA-19-GAMES/dp/B07DL2SY2Bhttps://
www.game.co.uk/en/fifa-19-2380636https://www.ebay.co.uk/b/FIFA-19-Sony-
PlayStation-4-Video-Games/139973/bn_7115134270https://www.pricerunner.com/
pl/1422-4602670/PlayStation-4-Games/FIFA-19-Compare-Priceshttps://pricespy.
co.uk/games-consoles/computer-video-games/ps4/fifa-19-ps4--p4766432https://
store.playstation.com/en-gb/search/fifa%2019https://www.amazon.com/FIFA-19-
Standard-PlayStation-4/dp/B07DL2SY2Bhttps://www.tesco.com/groceries/
en-GB/products/301926084https://groceries.asda.com/product/ps-4-games/
ps-4-fifa-19/1000076097883https://uk.webuy.com/product-detail/?id=503094
5121916&categoryName=playstation4-software&superCatName=gaming&title=fi
fa-19https://www.pushsquare.com/reviews/ps4/fifa_19https://en.wikipedia.
org/wiki/FIFA_19https://www.amazon.in/Electronic-Arts-Fifa19SEPS4-Fifa-
PS4/dp/B07DVWWF44https://www.vgchartz.com/game/222165/fifa-19/https://www.
metacritic.com/game/playstation-4/fifa-19https://www.johnlewis.com/fifa-19-
ps4/p3755803https://www.ebay.com/p/22045274968'
Because the URLs need to be treated as separate tokens later, we rebuild the SERP string with a space separator:

filtserps_grpby_keyword = filtered_serps_df.groupby("keyword")

def string_serps(df):
    df['serp_string'] = ' '.join(df['url'])
    return df

strung_serps = filtserps_grpby_keyword.apply(string_serps)
strung_serps = pd.concat([strung_serps], axis=0)
Here, we now have the keywords and their respective SERPs all converted into
a string which fits into a single cell. For example, the search result for “beige trench
coats” is
'https://www.zalando.co.uk/womens-clothing-coats-trench-coats/_beige/
https://www.asos.com/women/coats-jackets/trench-coats/cat/?cid=15143
https://uk.burberry.com/womens-trench-coats/beige/ https://www2.hm.com/
en_gb/productpage.0751992002.html https://www.hobbs.com/clothing/
coats-jackets/trench/beige/ https://www.zara.com/uk/en/woman-outerwear-
trench-l1202.html https://www.ebay.co.uk/b/Beige-Trench-Coats-for-
Women/63862/bn_7028370345 https://www.johnlewis.com/browse/women/womens-
coats-jackets/trench-coats/_/N-flvZ1z0rnyl https://www.elle.com/uk/fashion/
what-to-wear/articles/g30975/best-trench-coats-beige-navy-black/'
Time to put these side by side. What we're effectively doing here is taking the product of the column with itself, that is, squaring it, so that we get all the possible SERP combinations to compare side by side.
Add a function to align SERPs:
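The function itself isn't shown in this extract; a minimal sketch that is consistent with how it is used below, pairing one keyword's SERP string with every other keyword's (producing the keyword_b and serp_string_b columns referenced later):

def serps_align(k, df):
    # keep keyword k's row and pair it with every other keyword's SERP string
    prime_df = df.loc[df.keyword == k].reset_index(drop=True)
    comp_df = df.loc[df.keyword != k].reset_index(drop=True)
    comp_df = comp_df.rename(columns={'keyword': 'keyword_b', 'serp_string': 'serp_string_b'})
    prime_df = prime_df.loc[prime_df.index.repeat(len(comp_df.index))].reset_index(drop=True)
    return pd.concat([prime_df, comp_df], axis=1)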
serps_align('ps4', strung_serps)
queries = strung_serps.keyword.to_list()   # assumed setup: the list of keywords to iterate over
matched_serps = pd.DataFrame()             # assumed setup: an empty frame to collect the aligned pairs

for q in queries:
    temp_df = serps_align(q, strung_serps)
    matched_serps = matched_serps.append(temp_df)
The preceding result shows all of the keywords with SERPs compared side by
side with other keywords and their SERPs. Next, we’ll infer keyword intent similarity
by comparing serp_strings, but first here’s a note on the methods like Levenshtein,
Jaccard, etc.
Levenshtein distance is edit based, meaning it counts the number of edits required to transform one string (in our case, serp_string) into the other string (serp_string_b). This doesn't work very well because the websites within the SERP strings are individual tokens, that is, not a single continuous string.
Sorensen-Dice is better because it is token based, that is, it treats the individual websites as individual items or tokens. Using set similarity methods, the logic is to find the common tokens and divide them by the total number of tokens present by combining both sets. It doesn't take the order into account, so we must go one better.
The M Measure looks at both the token overlap and the order of the tokens, that is, it weights tokens that appear earlier (i.e., the higher-ranking sites) more than the later tokens. There is no API for this unfortunately, so we wrote the function for you here:
import py_stringmatching as sm
ws_tok = sm.WhitespaceTokenizer()

# the function signature below is assumed from the body; the name and the k default are placeholders
def serps_similarity(serps_str1, serps_str2, k=15):
    # keep only first k URLs
    serps_1 = ws_tok.tokenize(serps_str1)[:k]
    serps_2 = ws_tok.tokenize(serps_str2)[:k]

    # get positions of matches
    match = lambda a, b: [b.index(x) + 1 if x in b else None for x in a]

    # position intersections of form [(pos_1, pos_2), ...]
    pos_intersections = [(i + 1, j) for i, j in enumerate(match(serps_1, serps_2)) if j is not None]
    pos_in1_not_in2 = [i + 1 for i, j in enumerate(match(serps_1, serps_2)) if j is None]
    pos_in2_not_in1 = [i + 1 for i, j in enumerate(match(serps_2, serps_1)) if j is None]

    # the rest of the original function is cut off here; the lines below are a simple
    # stand-in completion (not the book's exact weighting) returning a 0-1 score that
    # rewards matches in earlier (higher-ranking) positions
    all_positions = [min(i, j) for i, j in pos_intersections] + pos_in1_not_in2 + pos_in2_not_in1
    if not all_positions:
        return 0
    match_weight = sum(1 / min(i, j) for i, j in pos_intersections)
    total_weight = sum(1 / p for p in all_positions)
    return match_weight / total_weight
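The step that actually scores each keyword pair isn't shown in this extract; a sketch of what it might look like, applying the similarity function to the aligned SERP strings and renaming the columns as described below (topic, keyword):

keywords_crossed_vols = matched_serps.copy()
keywords_crossed_vols['si_simi'] = keywords_crossed_vols.apply(
    lambda row: serps_similarity(row.serp_string, row.serp_string_b), axis=1)
keywords_crossed_vols = keywords_crossed_vols.rename(
    columns={'keyword': 'topic', 'keyword_b': 'keyword'})[['topic', 'keyword', 'si_simi']]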
Before sorting the keywords into topic groups, let’s add search volumes for each. This
could be an imported table like the following one called “keysv_df”:
keysv_df
Let’s now join the data. What we’re doing here is giving Python the ability to group
keywords according to SERP similarity and name the topic groups according to the
keyword with the highest search volume.
Group keywords by search intent according to a similarity limit. In this case, keyword search results must be 40% or more similar. This threshold is based on trial and error, and the right number can vary by the search space, language, or other factors.
simi_lim = 0.4

keywords_crossed_vols = keywords_crossed_vols.merge(keysv_df, on='keyword', how='left')
Simulate si_simi:
#keywords_crossed_vols['si_simi'] = np.random.rand(len(keywords_crossed_vols.index))
keywords_crossed_vols.sort_values('topic_volume', ascending=False)
keywords_filtered_nonnan = keywords_crossed_vols.dropna()
We now have the potential topic name, keyword SERP similarity, and search volumes of each. You'll note that keyword and keyword_b have been renamed to topic and keyword, respectively. Now we're going to iterate over the rows in the dataframe using list comprehensions.
List comprehension is a technique for looping over lists. We applied it to the Pandas
dataframe because it’s much quicker than the .iterrows() function. Here it goes.
Add a dictionary comprehension to create numbered topic groups from keywords_
filtered_nonnan:
# {1: [k1, k2, ..., kn], 2: [k1, k2, ..., kn], ..., n: [k1, k2, ..., kn]}
queries_in_df = list(set(keywords_filtered_nonnan.topic.to_list()))
topic_groups_numbered = {}
topics_added = []

def latest_index(dicto):
    if topic_groups_numbered == {}:
        i = 0
    else:
        i = list(topic_groups_numbered)[-1]
    return i
The list comprehension will now apply the function to group keywords into clusters:
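The grouping function and the list comprehension that applies it are not shown in this extract; the following is a simple stand-in (not the book's exact implementation) that adds each sufficiently similar pair either to an existing group or to a new numbered one:

def group_keywords(topic, keyword):
    # find an existing group that already contains either keyword
    existing = [k for k, kws in topic_groups_numbered.items() if topic in kws or keyword in kws]
    i = existing[0] if existing else latest_index(topic_groups_numbered) + 1
    topic_groups_numbered.setdefault(i, [])
    for kw in (topic, keyword):
        if kw not in topic_groups_numbered[i]:
            topic_groups_numbered[i].append(kw)
            topics_added.append(kw)
            print(kw, 'added to topic group', i)

[group_keywords(row.topic, row.keyword)
 for row in keywords_filtered_nonnan.loc[keywords_filtered_nonnan.si_simi >= simi_lim].itertuples()]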
The preceding results are statements printing out what keywords are in which topic
group. We do this to make sure we don’t have duplicates or errors, which is crucial for
the next step to perform properly. Now we’re going to convert the dictionary into a
dataframe so you can see all of your keywords grouped by search intent:
topic_groups_lst = []

for k, l in topic_groups_numbered.items():
    for v in l:
        topic_groups_lst.append([k, v])
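The conversion itself and the join with the search volumes aren't shown; presumably something like the following, which produces the topic_groups_vols dataframe used next:

topic_groups_dictdf = pd.DataFrame(topic_groups_lst, columns=['topic_group_no', 'keyword'])
topic_groups_vols = topic_groups_dictdf.merge(keysv_df, on='keyword', how='left')
topic_groups_vols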
As you can see, the keywords are grouped intelligently, much like a human SEO
analyst would group these, except these have been done at scale using the wisdom of
Google which is distilled from its vast number of users. Name the clusters:
def highest_demand(df):
    # body assumed (it is cut off in the original layout): keep the keyword(s)
    # with the highest search volume within the topic group
    max_sv = df.search_volume.max()
    df = df.loc[df.search_volume == max_sv]
    return df
topic_groups_vols_keywgrp = topic_groups_vols.groupby('topic_group_no')
topic_groups_vols_keywgrp.get_group(1)

high_demand_topics = topic_groups_vols_keywgrp.apply(highest_demand).reset_index()
del high_demand_topics['level_1']
high_demand_topics = high_demand_topics.rename(columns = {'keyword': 'topic'})

def shortest_name(df):
    df['k_len'] = df.topic.str.len()
    min_kl = df.k_len.min()
    df = df.loc[df.k_len == min_kl]
    del df['topic_group_no']
    del df['k_len']
    del df['search_volume']
    return df

high_demand_topics_spl = high_demand_topics.groupby('topic_group_no')
named_topics = high_demand_topics_spl.apply(shortest_name).reset_index()
del named_topics['level_1']
The resulting table shows that we now have keywords clustered by topic:
This is really starting to take shape, and you can quickly see opportunities emerging.
niche) is so new in terms of what it offers that there’s insufficient demand (that has yet
to be created by advertising and PR to generate nonbrand searches), then these external
tools won’t be as valuable. So, our approach will be to
1. Crawl your own website
2. Filter and clean the data for sections covering only what you sell
Filter and Clean the Data for Sections Covering Only What You Sell
The required data for this exercise is obtained by literally taking a site auditor5 and crawling your website. Let's assume you've exported the crawl data with just two columns, URL and title tag; we'll import and clean it:
import pandas as pd
import numpy as np
crawl_import_df = pd.read_csv('data/crawler-filename.csv')
crawl_import_df
5. Like Screaming Frog, OnCrawl, or Botify, for instance
The preceding result shows the dataframe of the crawl data we’ve just imported.
We’re most interested in live indexable6 URLs, so let’s filter and select the page_title and
URL columns:
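The filtering code isn't shown; a minimal sketch, in which the status and indexability column names are assumptions that depend on your crawler's export:

titles_urls_df = crawl_import_df.copy()
titles_urls_df = titles_urls_df.loc[(titles_urls_df['http_status_code'] == 200) &
                                    (titles_urls_df['indexable'] == 'Yes')]
titles_urls_df = titles_urls_df[['url', 'page_title']]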
Now we’re going to clean the title tags to make these nonbranded, that is, remove the
site name and the magazine section.
titles_urls_df['page_title'] = titles_urls_df.page_title.str.replace(' - Saga', '')
titles_urls_df = titles_urls_df.loc[~titles_urls_df.url.str.contains('/magazine/')]
titles_urls_df
6. That is, pages with a 200 HTTP response that do not block search indexing with "noindex"
We now have 349 rows, so we will query some of the keywords to illustrate the
process.
pd.set_option('display.max_rows', 1000)

serps_ngrammed = filtered_serps_df.set_index(["keyword", "rank_absolute"])\
    .apply(lambda x: x.str.split('[-,|?()&:;\[\]=]').explode())\
    .dropna()\
    .reset_index()
serps_ngrammed.head(10)
Courtesy of the explode function, the dataframe has been unnested so that each keyword row is expanded into the separate pieces of text that were previously within the same title, joined by punctuation marks.
Et voilà, the preceding result shows a dataframe of keywords obtained from the SERPs. Most of it makes sense and can now be added to your list of keywords for serious consideration and tracking.
Summary
This chapter has covered data-driven keyword research: finding high-value keywords to target and track, forecasting keyword trends, clustering keywords by search intent, and mining the SERPs for new keyword ideas.
In the next chapter, we will cover the mapping of those keywords to URLs.
CHAPTER 3
Technical
Technical SEO mainly concerns the interaction of search engines and websites such that
• Website content is made discoverable by search engines.
• The content meaning is extracted from those URLs for search result inclusion (known as indexing).
In this chapter, we'll look at how a data-driven approach can be taken toward
improving technical SEO in the following manner:
• Modeling page authority: This is useful for helping fellow SEO and
non-SEOs understand the impact of technical SEO changes.
• Core Web Vitals (CWV): While the benefits to the UX are often lauded, there are also ranking boost benefits to an improved CWV because it conserves the search engine resources used to extract content from a web page.
By no means will we claim that this is the final word on data-driven SEO from a
technical perspective. What we will do is expose data-driven ways of solving technical
SEO issues using some data science such as distribution analysis.
Ultimately, the preceding list will help you build better cases for getting technical
recommendations implemented.
import re
import time
import random
import pandas as pd
import numpy as np
import datetime
import requests
import json
from datetime import timedelta
from glob import glob
import os
from client import RestClient # If using the Data For SEO API
from textdistance import sorensen_dice
from plotnine import *
import matplotlib.pyplot as plt
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
import uritools
from urllib.parse import urlparse
import tldextract
pd.set_option('display.max_colwidth', None)
%matplotlib inline
Set variables:
root_domain = 'boundlesshq.com'
hostdomain = 'www.boundlesshq.com'
hostname = 'boundlesshq'
full_domain = 'https://www.boundlesshq.com'
client_name = 'Boundless'
audit_monthyear = 'jul_2022'
Import the crawl data from the Sitebulb desktop crawler. Screaming Frog or any
other site crawling software can be used; however, the column names may differ:
crawl_csv = pd.read_csv('data/boundlesshq_com_all_urls__excluding_uncrawled__filtered.csv')

crawl_csv.columns = [col.lower().replace('.','').replace('(','').replace(')','').replace(' ','_')
                     for col in crawl_csv.columns]
crawl_csv
The dataframe is loaded into a Pandas dataframe. The most important fields are as
follows:
• url: To detect patterns for noindexing and canonicalizing
crawl_html = crawl_csv.copy()
crawl_html = crawl_html.loc[crawl_html['content_type'] == 'HTML']
crawl_html = crawl_html.loc[crawl_html['host'] == root_domain]
crawl_html = crawl_html.loc[crawl_html['passes_pagerank'] == 'Yes']
crawl_html
The dataframe has been reduced to 309 rows. For ease of data handling, we’ll select
some columns:
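The selection itself isn't shown; a minimal sketch keeping the columns used in the rest of this section:

crawl_select = crawl_html[['url', 'ur', 'indexable', 'passes_pagerank',
                           'crawl_depth', 'first_parent_url']].copy()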
crawl_select['project'] = client_name
crawl_select['count'] = 1
print(crawl_select['ur'].sum(), crawl_select['ur'].sum()/crawl_select.shape[0])
10993 35.57605177993528
URLs on this site have an average page authority (measured as UR) of around 36. Let's look at some further stats for indexable and nonindexable pages. We'll dimension on (1) indexable and (2) passes pagerank to sum the number of URLs and UR (URL Rating):
overall_pagerank_agg = crawl_select.groupby(['indexable', 'passes_pagerank']).agg(
    {'count': 'sum', 'ur': 'sum'}).reset_index()
Then we derive the page authority per URL by dividing the total UR by the total
number of URLs:
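Mirroring the site-level calculation shown later, that is presumably:

overall_pagerank_agg['PA'] = overall_pagerank_agg['ur'] / overall_pagerank_agg['count']
overall_pagerank_agg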
We see that there are 32 nonindexable URLs with a total authority of 929 that could
be consolidated into the indexable URLs.
There are some more stats, this time analyzed by site level (crawl depth), purely out of curiosity:
site_pagerank_agg = crawl_select.groupby(['indexable', 'crawl_depth']).agg(
    {'count': 'sum', 'ur': 'sum'}).reset_index()
site_pagerank_agg['PA'] = site_pagerank_agg['ur'] / site_pagerank_agg['count']
site_pagerank_agg
Most of the URLs that have the authority for reallocation are four clicks away from
the home page.
Let’s visualize the distribution of the authority preoptimization, using the geom_
histogram function:
pageauth_dist_plt = (
ggplot(crawl_select, aes(x = 'ur')) +
geom_histogram(alpha = 0.7, fill = 'blue', bins = 20) +
labs(x = 'Page Authority', y = 'URL Count') +
theme(legend_position = 'none', axis_text_x=element_text(rotation=0,
hjust=1, size = 12))
)
pageauth_dist_plt.save(filename = 'images/1_pageauth_dist_plt.png',
height=5, width=8, units = 'in', dpi=1000)
pageauth_dist_plt
As we’d expect from looking at the stats computed previously, most of the pages have
between 25 and 50 UR, with the rest spread out (Figure 3-1).
Figure 3-1. Histogram plot showing URL count of URL Page Authority scores
parent_pa_map
The table shows all the parent URLs and their mapping.
The next step is to mark pages that will be noindexed, so we can reallocate their
authority:
crawl_optimised = crawl_select.copy()

reallocate_conds = [
    crawl_optimised['url'].str.contains('/page/[0-9]/'),
    crawl_optimised['url'].str.contains('/country/')
]
reallocate_vals = [1, 1]
The reallocate column uses the np.select function to mark URLs for noindex. Any
URLs not for noindex are marked as “0,” using the default parameter:
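That is, presumably a single np.select call:

crawl_optimised['reallocate'] = np.select(reallocate_conds, reallocate_vals, default=0)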
crawl_optimised
The reallocate column is added so we can start seeing the effect of the reallocation,
that is, the potential upside of technical optimization.
As usual, a groupby operation by reallocate and the average PA are calculated:
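That calculation isn't shown; a sketch following the same aggregation pattern as before (the variable name is an assumption):

realloc_pagerank_agg = crawl_optimised.groupby('reallocate').agg(
    {'count': 'sum', 'ur': 'sum'}).reset_index()
realloc_pagerank_agg['PA'] = realloc_pagerank_agg['ur'] / realloc_pagerank_agg['count']
realloc_pagerank_agg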
So we’ll be actually reallocating 681 UR from the noindex URLs to the 285 indexable
URLs. These noindex URLs have an average UR of 28.
We filter the URLs just for the ones that will be noindexed to help us in determining
what the extra page authority will be:
no_indexed = crawl_optimised.loc[crawl_optimised['reallocate'] == 1]
We aggregate by the first parent URL (the parent node) for the total URLs within and
their URL, because the UR is likely to be reallocated to the remaining indexable URLs
that share the same parent node:
no_indexed_map = no_indexed.groupby('first_parent_url').agg({'count': 'sum', 'ur': sum}).reset_index()
add_ur is a new column representing the additional authority resulting from the optimization. It is the total UR divided by the number of URLs:
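In code, that is presumably:

no_indexed_map['add_ur'] = no_indexed_map['ur'] / no_indexed_map['count']
no_indexed_map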
The preceding table will be merged into the indexable URLs by the first parent URL.
Filter the URLs for just the indexable ones and add the extra authority reallocated from the noindexed URLs:
crawl_new = crawl_optimised.copy()
crawl_new = crawl_new.loc[crawl_new['reallocate'] == 0]
Often, when joining data, there will be null values for first parent URLs not in the
mapping. np.where() is used to replace those null values with zeros. This enables further
data manipulation to take place as you’ll see shortly.
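The join and replacement described here aren't shown; a sketch, with new_ur as the resulting post-optimization authority column used below:

crawl_new = crawl_new.merge(no_indexed_map[['first_parent_url', 'add_ur']],
                            on='first_parent_url', how='left')
crawl_new['add_ur'] = np.where(crawl_new['add_ur'].isnull(), 0, crawl_new['add_ur'])
crawl_new['new_ur'] = crawl_new['ur'] + crawl_new['add_ur']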
crawl_new
The indexable URLs now have their authority scores post optimization, which we’ll
visualize as follows:
pageauth_newdist_plt = (
ggplot(crawl_new, aes(x = 'new_ur')) +
geom_histogram(alpha = 0.7, fill = 'lightgreen', bins = 20) +
labs(x = 'Page Authority', y = 'URL Count') +
theme(legend_position = 'none', axis_text_x=element_text(rotation=0,
hjust=1, size = 12))
)
pageauth_newdist_plt.save(filename = 'images/2_pageauth_newdist_plt.png',
height=5, width=8, units = 'in', dpi=1000)
pageauth_newdist_plt
The impact is noticeable, as we see most pages are above 60 UR post optimization,
should the implementation move forward.
There are some quick stats to confirm:
print(new_pagerank_agg)
The average page authority is now 57 vs. 36, which is a significant improvement.
While this method is not an exact science, it could help you to build a case for getting
your change requests for technical SEO fixes implemented.
3. Anchor text
import pandas as pd
import numpy as np
from textdistance import sorensen_dice
from plotnine import *
import matplotlib.pyplot as plt
from mizani.formatters import comma_format
target_name = 'ON24'
target_filename = 'on24'
website = 'www.on24.com'
The link data is sourced from the Sitebulb auditing software. We import it and make the column names easier to work with:
link_data.columns = [col.lower().replace('.','').replace('(','').replace(')','').replace(' ','_')
                     for col in link_data.columns]
link_data
• Referring URL Rank UR: The page authority of the referring page
• Target URL Rank UR: The page authority of the target page
crawl_data.columns = [col.lower().replace('.','').replace('(','').replace(')','').replace(' ','_')
                      for col in crawl_data.columns]
crawl_data
So we have the usual list of URLs and how they were found (crawl source) with other
features spanning over 100 columns.
As you’d expect, the number of rows in the link data far exceeds the crawl dataframe
as there are many more links than pages!
Import the external inbound link data:
ahrefs_raw.columns = [col.lower().replace('.','').replace('(','').replace(')','').replace(' ','_')
                      for col in ahrefs_raw.columns]
ahrefs_raw
There are over 210,000 URLs with backlinks, which is very nice! There’s quite a bit of
data, so let’s simplify a little by removing columns and renaming some columns so we
can join the data later:
Now we have the data in its simplified form which is important because we’re not
interested in the detail of the links but rather the estimated page-level authority that they
import into the target website.
redir_live_urls.groupby(['crawl_depth']).size()
crawl_depth
0 1
1 70
10 5
11 1
12 1
13 2
14 1
2 303
3 378
4 347
5 253
6 194
7 96
8 33
9 19
Not Set 2351
dtype: int64
We can see that Python is treating the crawl depth as a string rather than a numbered category, which we'll fix shortly.
Most of the site URLs can be found at site depths of 2 to 6. There are 2,351 orphaned URLs, which means these won't inherit any authority unless they have backlinks.
We’ll now filter for redirected and live links:
Crawl depth is set as a category and ordered so that Python treats the column
variable as a number as opposed to a string character type:
redir_live_urls['crawl_depth'] = redir_live_urls['crawl_depth'].astype('category')
redir_live_urls['crawl_depth'] = redir_live_urls['crawl_depth'].cat.reorder_categories(
    ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'Not Set'])
redir_live_urls = redir_live_urls.loc[redir_live_urls.host == website]
redir_live_urls.drop('host', axis = 1, inplace = True)
redir_live_urls
redir_live_urls.groupby(['crawl_depth']).size()
crawl_depth
0 1
1 66
2 169
3 280
4 253
5 201
6 122
7 64
8 17
9 6
10 1
Not Set 2303
dtype: int64
Note how the size has dropped slightly to 2303 URLs. The 48 nonindexable URLs
were probably paginated pages.
Let’s visualize the distribution:
ove_intlink_dist_plt.save(filename = 'images/1_overall_intlink_dist_plt.png',
                          height=5, width=5, units = 'in', dpi=1000)
ove_intlink_dist_plt
The distribution is heavily right-skewed, such that most pages have close to zero internal links. This would be of some concern to an SEO manager.
While the overall distribution gives one view, it would be good to deep dive into the
distribution of internal links by crawl depth:
redir_live_urls.groupby('crawl_depth').agg({'no_internal_links_to_url':
['describe']}).sort_values('crawl_depth')
The table describes the distribution of internal links by crawl depth or site level. Any
URL that is 3+ clicks away from the home page can expect two internal links on average.
This is probably the blog content as the marketing team produces a lot of it.
To visualize it graphically:

intlink_dist_plt = (
    ggplot(redir_live_urls, aes(x = 'crawl_depth', y = 'no_internal_links_to_url')) +
    geom_boxplot(fill = 'blue', alpha = 0.8) +   # assumed geometry and aesthetics; Figure 3-4 shows box plots by site level
    labs(y = 'Internal Links to URL', x = 'Site Level') +
    theme_classic() +
    theme(legend_position = 'none')
)
The plot intlink_dist_plt in Figure 3-4 shows box plots of the number of internal links to a URL by site level.
Figure 3-4. Box plot distributions of the number of internal links to a URL by
site level
As suspected, the most variation is in the first level directly below the home page,
with very little variation beyond.
However, we can compare the variation between site levels for content in level 2 and
beyond. For a quick peek, we’ll use a logarithmic scale for the number of internal links
to a URL:
intlink_dist_plt.save(filename = 'images/1_log_intlink_dist_plt.png',
height=5, width=5, units = 'in', dpi=1000)
intlink_dist_plt
The picture is clearer and more insightful, as we can now see how the distributions of the lower site levels vary compared to one another (Figure 3-5).
Figure 3-5. Box plot distribution of the number of internal links by site level with a logarithmic vertical axis
For example, it’s much more obvious that the median number of inbound internal
links for pages on site level 2 is much higher than the lower levels.
It’s also very obvious that the variation in internal inbound links for pages in site
levels 3 and 4 is higher than those in levels 5 and 6.
Remember though the preceding example was achieved using a log scale of the same
input variable.
What we’ve learned here is that having a new variable which is taking a log of the
internal links would yield a more helpful picture to compare levels from 2 to 10.
We’ll achieve this by creating a new column variable “log_intlinks” which is a log of
the internal link count. To avoid negative infinity values from taking a log of zero, we’ll
add 0.01 to the calculation:
redir_live_urls['log_intlinks'] = np.log2(redir_live_urls['no_internal_links_to_url'] + .01)
intlink_dist_plt.save(filename = 'images/1c_loglinks_dist_plt.png',
height=5, width=5, units = 'in', dpi=1000)
intlink_dist_plt
The intlink_dist_plt plot (Figure 3-6) is quite similar to the logarized scale, only this
time the numbers are easier to read because we’re using normal scales for the vertical
axis. The comparative averages and variations are easier to compare.
Figure 3-6. Box plot distributions of logarized internal links by site level
intlink_dist = redir_live_urls.groupby('crawl_depth').agg({'no_internal_links_to_url': ['mean'],
                                                           'log_intlinks': ['mean']}).reset_index()
intlink_dist.columns = ['_'.join(col) for col in intlink_dist.columns.values]
intlink_dist = intlink_dist.rename(columns = {'no_internal_links_to_url_mean': 'avg_int_links',
                                              'log_intlinks_mean': 'logavg_int_links'})
intlink_dist
The averages are in place by site level. Notice how the log column helps make the
range of values between crawl depths less extreme and skewed, that is, 4239 to 0.06
for the average vs. 12 to –6.39 for the log average, which makes it easier to normalize
the data.
Now we’ll set the lower quantile at 35% for all site levels. This will use a customer
function quantile_lower:
def quantile_lower(x):
    return x.quantile(.35).round(0)

quantiled_intlinks = redir_live_urls.groupby('crawl_depth').agg({'log_intlinks': [quantile_lower]}).reset_index()
quantiled_intlinks.columns = ['_'.join(col) for col in quantiled_intlinks.columns.values]
quantiled_intlinks = quantiled_intlinks.rename(columns = {'crawl_depth_': 'crawl_depth',
                                                          'log_intlinks_quantile_lower': 'sd_intlink_lowqua'})
quantiled_intlinks
The lower quantile stats are set. Quartiles are fixed at the 25th, 50th, and 75th percentiles, whereas a quantile can be set at any cut-off, such as the 11th, 18th, or 24th percentile, which is why we use quantiles instead of quartiles. The next step is to join the data to the main dataframe; then we'll apply a function to mark URLs that are underlinked for their given site level:
redir_live_urls_underidx = redir_live_urls.merge(quantiled_intlinks, on =
'crawl_depth', how = 'left')
The following function assesses whether the URL has less links than the lower
quantile. If yes, then the value of “sd_int_uidx” is 1, otherwise 0:
def sd_intlinkscount_underover(row):
    if row['sd_intlink_lowqua'] > row['log_intlinks']:
        val = 1
    else:
        val = 0
    return val

redir_live_urls_underidx['sd_int_uidx'] = redir_live_urls_underidx.apply(sd_intlinkscount_underover, axis=1)
There’s some code to account for “Not Set” which are effectively orphaned URLs. In
this instance, we set these to 1 – meaning they’re underlinked:
redir_live_urls_underidx['sd_int_uidx'] = np.where(redir_live_urls_underidx['crawl_depth'] == 'Not Set', 1,
                                                   redir_live_urls_underidx['sd_int_uidx'])
redir_live_urls_underidx
The dataframe shows that the column is in place marking underlinked URLs as 1.
With the URLs marked, we’re ready to get an overview of how under-linked the URLs are,
which will be achieved by aggregating by crawl depth and summing the total number of
underlinked URLs:
intlinks_agged = redir_live_urls_underidx.groupby('crawl_depth').agg({'sd_int_uidx': ['sum', 'count']}).reset_index()
The following line tidies up the column names by inserting an underscore using a list
comprehension:
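It follows the same pattern used earlier:

intlinks_agged.columns = ['_'.join(col) for col in intlinks_agged.columns.values]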
To get a proportion (or percentage), we divide the sum by the count and
multiply by 100:
intlinks_agged['sd_uidx_prop'] = (intlinks_agged.sd_int_uidx_sum / intlinks_agged.sd_int_uidx_count) * 100
print(intlinks_agged)
crawl_depth sd_int_uidx_sum sd_int_uidx_count sd_uidx_prop
0 0 0 1 0.000000
1 1 38 66 57.575758
2 2 67 169 39.644970
3 3 75 280 26.785714
4 4 57 253 22.529644
5 5 31 201 15.422886
6 6 9 122 7.377049
7 7 9 64 14.062500
8 8 3 17 17.647059
9 9 2 6 33.333333
10 10 0 1 0.000000
11 Not Set 2303 2303 100.000000
So even though the content in levels 1 and 2 has more links than any of the lower levels, it has a higher proportion of underlinked URLs than any other site level (apart from the orphans in Not Set, of course).
For example, 57% of pages just below the home page are underlinked.
Let’s visualize:
It’s good to visualize using depth_uidx_plt because we can also see (Figure 3-7) that
levels 2, 3, and 4 have the most underlinked URLs by volume.
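A minimal sketch of the proportion chart saved below, assuming a plotnine column chart over the aggregated stats:

depth_uidx_prop_plt = (
    ggplot(intlinks_agged, aes(x = 'crawl_depth', y = 'sd_uidx_prop')) +   # column names assumed from the printed table above
    geom_bar(stat = 'identity', fill = 'blue', alpha = 0.8) +
    labs(y = 'Underlinked URLs (%)', x = 'Site Level') +
    theme_classic() +
    theme(legend_position = 'none')
)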
depth_uidx_prop_plt.save(filename = 'images/1_depth_uidx_prop_plt.png',
height=5, width=5, units = 'in', dpi=1000)
depth_uidx_prop_plt
Figure 3-8. Column chart of the proportion of under internally linked URLs by
site level
It’s not a given that URLs in the site level that are underlinked are a problem or
perhaps more so by design. However, they are worth reviewing as perhaps they should
be at that site level or they do deserve more internal links after all.
The following code exports the underlinked URLs to a CSV which can be viewed in
Microsoft Excel:
underlinked_urls = redir_live_urls_underidx.loc[redir_live_urls_underidx.sd_int_uidx == 1]
underlinked_urls = underlinked_urls.sort_values(['crawl_depth', 'no_internal_links_to_url'])
underlinked_urls.to_csv('exports/underlinked_urls.csv')
Given that not all pages earn inbound links, it is normally desired by SEOs to have
pages without backlinks crawled more often. So it would make sense to analyze and
explore opportunities to redistribute this PageRank to other pages within the website.
We’ll start by tacking on the AHREFs data to the main dataframe so we can see
internal links by page authority.
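A minimal sketch of that join, assuming the Ahrefs export has been reduced to a URL key plus page authority and referring domains (ahrefs_df is an illustrative name):

intlinks_pageauth = redir_live_urls_underidx.merge(ahrefs_df, on = 'url', how = 'left')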
We now have page authority and referring domains at the URL level. Predictably,
the home page has a lot of referring domains (over 3000) and the most page-level
authority at 81.
As usual, we’ll perform some aggregations and explore the distribution of the
PageRank (interchangeable with page authority).
First, we’ll clean up the data to make sure we replace null values with zero:
intlinks_pageauth['page_authority'] = np.where(intlinks_pageauth['page_authority'].isnull(),
                                               0, intlinks_pageauth['page_authority'])
Aggregate by page authority:
intlinks_pageauth.groupby('page_authority').agg({'no_internal_links_to_url': ['describe']})
The preceding table shows the distribution of internal links by different levels of page
authority.
At the lower levels, most URLs have around two internal links.
A graph will give us the full picture:
# distribution of page_authority
page_authority_dist_plt = (ggplot(intlinks_pageauth, aes(x = 'page_authority')) +
                           geom_histogram(fill = 'blue', alpha = 0.6, bins = 30) +
                           labs(y = '# URLs', x = 'Page Authority') +
                           #scale_y_log10() +
                           theme_classic() +
                           theme(legend_position = 'none')
)
page_authority_dist_plt.save(filename = 'images/2_page_authority_dist_plt.png',
                             height=5, width=5, units = 'in', dpi=1000)
page_authority_dist_plt
Using the log scale, we can see how the higher levels of authority compare:
# distribution of page_authority
page_authority_dist_plt = (ggplot(intlinks_pageauth, aes(x = 'page_authority')) +
                           geom_histogram(fill = 'blue', alpha = 0.6, bins = 30) +
                           labs(y = '# URLs (Log)', x = 'Page Authority') +
                           scale_y_log10() +
                           theme_classic() +
                           theme(legend_position = 'none')
)
page_authority_dist_plt.save(filename = 'images/2_page_authority_dist_log_plt.png',
                             height=5, width=5, units = 'in', dpi=1000)
page_authority_dist_plt
Given this more insightful view, taking a log of “page_authority” to form a new
column variable “log_pa” is justified:
intlinks_pageauth['page_authority'] = np.where(intlinks_pageauth['page_authority'] == 0, .1,
                                               intlinks_pageauth['page_authority'])
intlinks_pageauth['log_pa'] = np.log2(intlinks_pageauth.page_authority)
intlinks_pageauth.head()
page_authority_trans_dist_plt.save(filename = 'images/2_page_authority_trans_dist_plt.png',
                                   height=5, width=5, units = 'in', dpi=1000)
page_authority_trans_dist_plt
The log scores will be rounded down (floored) into whole-number bands to make the 3,000+ URLs easier to categorize:
intlinks_pageauth['pa_band'] = intlinks_pageauth['log_pa'].apply(np.floor)
def quantile_lower(x):
    return x.quantile(.4).round(0)

quantiled_pageau = intlinks_pageauth.groupby('pa_band').agg({'no_internal_links_to_url': [quantile_lower]}).reset_index()
quantiled_pageau.columns = ['_'.join(col) for col in quantiled_pageau.columns.values]
quantiled_pageau = quantiled_pageau.rename(columns = {'pa_band_': 'pa_band',
                                                      'no_internal_links_to_url_quantile_lower': 'pa_intlink_lowqua'})
quantiled_pageau
Going by PageRank, we now have the minimum threshold of inbound internal links
we would expect. Time to join the data and mark the URLs that are underlinked for their
authority level:
intlinks_pageauth_underidx = intlinks_pageauth.merge(quantiled_pageau, on = 'pa_band', how = 'left')

def pa_intlinkscount_underover(row):
    if row['pa_intlink_lowqua'] > row['no_internal_links_to_url']:
        val = 1
    else:
        val = 0
    return val

intlinks_pageauth_underidx['pa_int_uidx'] = intlinks_pageauth_underidx.apply(pa_intlinkscount_underover, axis=1)
This function will allow us to make some aggregations to see how many URLs there
are at each PageRank band and how many are under-linked:
pageauth_agged = intlinks_pageauth_underidx.groupby('pa_band').agg({'pa_int_uidx': ['sum', 'count']}).reset_index()
pageauth_agged.columns = ['_'.join(col) for col in pageauth_agged.columns.values]
print(pageauth_agged)
pa_band_ pa_int_uidx_sum pa_int_uidx_count uidx_prop
0 -4.0 0 1320 0.000000
1 3.0 0 1950 0.000000
2 4.0 77 203 37.931034
3 5.0 4 9 44.444444
4 6.0 0 1 0.000000
Most of the underlinked content appears to be the content with the highest page authority, which is slightly contrary to what the site-level approach suggests (that pages lower down are underlinked). That's assuming most of the high-authority pages are closer to the home page.
What is the right answer? It depends on what we’re trying to achieve. Let’s continue
with more analysis for now and visualize the authority stats:
# distribution of page_authority
pageauth_agged_plt = (ggplot(intlinks_pageauth_underidx.loc[intlinks_pageauth_underidx['pa_int_uidx'] == 1],
                             aes(x = 'pa_band')) +
                      geom_histogram(fill = 'blue', alpha = 0.6, bins = 10) +
                      labs(y = '# URLs Under Linked', x = 'Page Authority Level') +
                      theme_classic() +
                      theme(legend_position = 'none')
)
pageauth_agged_plt.save(filename = 'images/2_pageauth_agged_hist.png',
                        height=5, width=5, units = 'in', dpi=1000)
pageauth_agged_plt
Figure 3-12. Distribution of under internally linked URLs by page authority level
Content Type
Perhaps it would be more useful to visualize this by content type just by a “quick and
dirty” analysis using the first subdirectory:
intlinks_content_underidx = intlinks_depthauth_underidx.copy()
To get the first subfolder, we’ll define a function that allows the operation to
continue in case of a fail (which would happen for the home page URL because
there is no subfolder). The k parameter specifies the number of slashes in the URL to
find the desired folder and parse the subdirectory name:
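A minimal sketch of the helper described above (the fallback label is an assumption):

def get_folder(url, k = 3):
    try:
        # the path segment after the kth slash is the first subfolder
        folder = url.split('/')[k]
        return folder if folder else 'home'
    except IndexError:
        return 'home'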
intlinks_content_underidx['content'] = intlinks_content_underidx['url'].apply(lambda x: get_folder(x))
intlinks_content_underidx.groupby('content').agg({'no_internal_links_to_url': ['describe']})
Wow, 183 subfolders! That’s way too much for categorical analysis. We could break
it down and aggregate it into fewer categories using the ngram techniques described in
Chapter 9; feel free to try.
In any case, it looks like the site architecture is too flat and could be better structured
to be more hierarchical, that is, more pyramid like.
Also, many of the content folders only have one inbound internal link, so even
without the benefit of data science, it’s obvious these require SEO attention.
intlinks_depthauth_underidx = intlinks_pageauth_underidx.copy()
intlinks_depthauth_underidx['depthauth_uidx'] = np.where((intlinks_depthauth_underidx['sd_int_uidx'] +
                                                          intlinks_depthauth_underidx['pa_int_uidx'] == 2), 1, 0)

'''intlinks_depthauth_underidx['depthauth_uidx'] = np.where((intlinks_depthauth_underidx['sd_int_uidx'] == 1) &
                                                            (intlinks_depthauth_underidx['pa_int_uidx'] == 1), 1, 0)'''

depthauth_uidx = intlinks_depthauth_underidx.groupby(['crawl_depth', 'pa_band']).agg({'depthauth_uidx': 'sum'}).reset_index()
depthauth_urls = intlinks_depthauth_underidx.groupby(['crawl_depth', 'pa_band']).agg({'url': 'count'}).reset_index()
depthauth_stats = depthauth_uidx.merge(depthauth_urls, on = ['crawl_depth', 'pa_band'], how = 'left')
depthauth_stats['depthauth_uidx_prop'] = (depthauth_stats['depthauth_uidx'] / depthauth_stats['url']).round(2)
depthauth_stats.sort_values('depthauth_uidx', ascending = False)
Most of the underlinked URLs are orphaned and have page authority (probably from
backlinks).
Visualize to get a fuller picture:
depthauth_stats_plt = (
    ggplot(depthauth_stats,
           aes(x = 'pa_band', y = 'crawl_depth', fill = 'depthauth_uidx')) +
    geom_tile(stat = 'identity', alpha = 0.6) +
    labs(y = '', x = '') +
    theme_classic() +
    theme(legend_position = 'right')
)
depthauth_stats_plt.save(filename = 'images/3_depthauth_stats_plt.png',
                         height=5, width=10, units = 'in', dpi=1000)
depthauth_stats_plt
There we have it, depthauth_stats_plt (Figure 3-13) shows most of the focus should
go into the orphaned URLs (which they should anyway), but more importantly we know
which orphaned URLs to prioritize over others.
Figure 3-13. Heatmap of page authority level, site level, and underlinked URLs
We can also see the extent of the issue. The second highest priority group of underlinked URLs is at site levels 2, 3, and 4.
Anchor Texts
If the count and their distribution represent the quantitative aspect of internal links, then
the anchor texts could be said to represent their quality.
Anchor texts signal to search engines and users what content to expect after
accessing the hyperlink. This makes anchor texts an important signal and one worth
optimizing.
We’ll start by aggregating the crawl data from Sitebulb to get an overview of
the issues:
'no_anchors_with_url_in_onclick': ['sum'],
'no_anchors_with_username_and_password_in_href': ['sum'],
'no_image_anchors_with_no_alt_text': ['sum']
}).reset_index()
Over 4,000 links with no descriptive anchor text jump out as the most common issue, not to mention the 19 anchors with an empty HREF (albeit very low in number).
To visualize:
anchor_issues_count_plt.save(filename = 'images/4_anchor_issues_count_plt.png',
                             height=5, width=5, units = 'in', dpi=1000)
anchor_issues_count_plt
anchor_issues_levels = crawl_data.groupby('crawl_depth').agg({'no_anchors_with_empty_href': ['sum'],
    'no_anchors_with_leading_or_trailing_whitespace_in_href': ['sum'],
    'no_anchors_with_local_file': ['sum'],
    'no_anchors_with_localhost': ['sum'],
    'no_anchors_with_malformed_href': ['sum'],
    'no_anchors_with_no_text': ['sum'],
    'no_anchors_with_non_descriptive_text': ['sum'],
    'no_anchors_with_non-http_protocol_in_href': ['sum'],
    'no_anchors_with_url_in_onclick': ['sum'],
    'no_anchors_with_username_and_password_in_href': ['sum'],
    'no_image_anchors_with_no_alt_text': ['sum']
}).reset_index()
print(anchor_issues_levels)
crawl_depth issues instances
111 Not Set non_descriptive_text 2458
31 Not Set leading_or_trailing_whitespace_in_href 2295
104 3 non_descriptive_text 350
24 3 leading_or_trailing_whitespace_in_href 328
105 4 non_descriptive_text 307
.. ... ... ...
85 13 no_text 0
84 12 no_text 0
83 11 no_text 0
82 10 no_text 0
0 0 empty_href 0
Most of the issues are on orphaned pages, followed by URLs three to four levels deep.
To visualize:
anchor_levels_issues_count_plt.save(filename = 'images/4_anchor_levels_issues_count_plt.png',
                                    height=5, width=5, units = 'in', dpi=1000)
anchor_levels_issues_count_plt
Figure 3-15. Heatmap of site level, anchor text issues, and instances
Merge with the crawl data using the URL as the primary key and then filter for
indexable URLs only:
anchor_merge['crawl_depth'] = anchor_merge['crawl_depth'].astype('category')
anchor_merge['crawl_depth'] = anchor_merge['crawl_depth'].cat.reorder_categories(
    ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'Not Set'])
Then we compare the string similarity of the anchor text and title tag of the
destination URLs:
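A minimal sketch of the scoring step; the destination title column name is an assumption:

anchor_merge['simi'] = anchor_merge.apply(
    lambda row: sorensen_dice(str(row['anchor_text']), str(row['title'])), axis = 1)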
And any URLs with less than 70% relevance score will be marked as irrelevant under
the new column “irrel_anchors” as a 1.
Why 70%? This is from experience, and you’re more than welcome to try different
thresholds.
With Sorensen-Dice, which is not only fast but meets SEO needs for measuring
relevance, 70% seems to be the right limit between relevance and irrelevance, especially
when accounting for the site markers in the title tag string:
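Continuing the sketch above:

anchor_merge['irrel_anchors'] = np.where(anchor_merge['simi'] < 0.7, 1, 0)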
anchor_merge['project'] = target_name
anchor_merge
anchor_rel_stats_site_agg = anchor_merge.groupby('project').agg({'irrel_anchors': 'sum'}).reset_index()
anchor_rel_stats_site_agg['total_urls'] = anchor_merge.shape[0]
anchor_rel_stats_site_agg['irrel_anchors_prop'] = anchor_rel_stats_site_agg['irrel_anchors'] / anchor_rel_stats_site_agg['total_urls']
print(anchor_rel_stats_site_agg)
project irrel_anchors total_urls irrel_anchors_prop
0 ON24 333946 350643 0.952382
About 95% of anchor texts on this site are irrelevant. How does this compare to their
competitors? That’s your homework.
Let’s go slightly deeper and analyze this by site depth:
anchor_rel_depth_irrels = anchor_merge.groupby(['crawl_depth']).agg({'irrel_anchors': 'sum'}).reset_index()
anchor_rel_depth_urls = anchor_merge.groupby(['crawl_depth']).agg({'project': 'count'}).reset_index()
anchor_rel_depth_stats = anchor_rel_depth_irrels.merge(anchor_rel_depth_urls, on = 'crawl_depth', how = 'left')
anchor_rel_depth_stats['irrel_anchors_prop'] = anchor_rel_depth_stats['irrel_anchors'] / anchor_rel_depth_stats['project']
anchor_rel_depth_stats
Virtually all content at all site levels, with the exception of pages three clicks away from the home page (probably blog posts), has irrelevant anchors.
Let’s visualize:
120
Chapter 3 Technical
theme(legend_position = 'none')
)
anchor_rel_stats_site_agg_plt.save(filename = 'images/3_anchor_rel_stats_
site_agg_plt.png',
height=5, width=5, units = 'in', dpi=1000)
anchor_rel_stats_site_agg_plt
Location
More insight could be gained by looking at the location of the anchors:
anchor_rel_locat_irrels = anchor_merge.groupby(['location']).agg({'irrel_anchors': 'sum'}).reset_index()
anchor_rel_locat_urls = anchor_merge.groupby(['location']).agg({'project': 'count'}).reset_index()
anchor_rel_locat_stats = anchor_rel_locat_irrels.merge(anchor_rel_locat_urls, on = 'location', how = 'left')
anchor_rel_locat_stats['irrel_anchors_prop'] = anchor_rel_locat_stats['irrel_anchors'] / anchor_rel_locat_stats['project']
anchor_rel_locat_stats
The irrelevant anchors are within the header or footer, which makes them relatively easy to solve.
anchor_count = anchor_merge[['anchor_text']].copy()
anchor_count['count'] = 1
anchor_count_agg = anchor_count.groupby('anchor_text').agg({'count':
'sum'}).reset_index()
anchor_count_agg = anchor_count_agg.sort_values('count', ascending = False)
anchor_count_agg
There are over 1,800 variations of anchor text, of which "Contact Us" is the most popular, along with "Live Demo" and "Resources."
Let’s visualize using a word cloud. We’ll have to import the WordCloud package and
convert the dataframe into a dictionary:
data = anchor_count_agg.set_index('anchor_text').to_dict()['count']
data
from wordcloud import WordCloud

wc = WordCloud(background_color='white',
               width=800, height=400,
               max_words=30).generate_from_frequencies(data)
plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')

# Save image
wc.to_file("images/wordcloud.png")
plt.show()
The word cloud (Figure 3-17) could be used in a management presentation. There
are some pretty long anchors there!
Figure 3-17. Word cloud of the most commonly used anchor texts
The activation from this point would be to see about finding semiautomated rules
to improve the relevance of anchor texts, which is made easier by virtue of the fact that
these are within the header or footer.
Landscape
import re
import time
import random
import pandas as pd
import numpy as np
import requests
import json
import plotnine
import tldextract
from plotnine import *
from mizani.transforms import trans
from client import RestClient
target_bu = 'boundless'
target_site = 'https://boundlesshq.com/'
target_name = target_bu
We start by obtaining the SERPs for your target keywords using the pandas read_csv
function. We’re interested in the URL which will form the input for querying the Google
PageSpeed API which gives us the CWV metric values:
The SERPs data can get a bit noisy, and ultimately the business is only interested in
their direct competitors, so we’ll create a list of them to filter the SERPs accordingly:
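For illustration, a hypothetical competitor list (the domains are taken from the ranking URLs shown further down):

selected_sites = ['boundlesshq.com', 'papayaglobal.com', 'remote.com', 'omnipresent.com',
                  'airswift.com', 'letsdeel.com', 'shieldgeo.com']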
desktop_serps_select = desktop_serps_df[~desktop_serps_df['url'].isnull()].copy()
desktop_serps_select = desktop_serps_select[desktop_serps_select['url'].str.contains('|'.join(selected_sites))]
desktop_serps_select
There are far fewer rows as a result, which means fewer API queries and less time required to get the data.
Note the data is just for desktop, so this process would need to be repeated for mobile SERPs also.
To query the PageSpeed API efficiently and avoid duplicate requests, we want a unique set of URLs. We achieve this by exporting the URL column to a list and then deduplicating it:
desktop_serps_urls = desktop_serps_select['url'].to_list()
desktop_serps_urls = list(dict.fromkeys(desktop_serps_urls))
desktop_serps_urls
['https://papayaglobal.com/blog/how-to-avoid-permanent-
establishment-risk/',
'https://www.omnipresent.com/resources/permanent-establishment-risk-a-
remote-workforce',
'https://www.airswift.com/blog/permanent-establishment-risks',
'https://www.letsdeel.com/blog/permanent-establishment-risk',
'https://shieldgeo.com/ultimate-guide-permanent-establishment/',
'https://remote.com/blog/what-is-permanent-establishment',
'https://remote.com/lp/global-payroll',
'https://remote.com/services/global-payroll?nextInternalLocale=
en-us', . . . ]
With the list, we query the API, starting by setting the parameters for the API itself,
the device, and the API key (obtained by getting a Google Cloud Platform account which
is free):
base_url = 'https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url='
strategy = '&strategy=desktop'
api_key = '&key=[Your PageSpeed API key]'
Initialize an empty dictionary and set a counter i, which will help us keep track of how many API calls have been made and how many are to go:
desktop_cwv = {}
i = 1
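A minimal sketch of the request loop, assuming the parameters set above are simply concatenated into the request URL:

for url in desktop_serps_urls:
    response = requests.get(base_url + url + strategy + api_key)
    desktop_cwv[url] = response.json()
    # report progress and pause politely between calls
    print(i, 'of', len(desktop_serps_urls))
    i += 1
    time.sleep(random.uniform(1, 2))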
The result is a dictionary containing the API response. To get this output into a
usable format, we iterate through the dictionary to pull out the actual CWV scores as
the API has a lot of other micro measurement data which doesn’t serve our immediate
objectives.
Initialize an empty list which will store the API response data:
desktop_psi_lst = []
Loop through the API output which is a JSON dictionary, so we need to pull out the
relevant “keys” and add them to the list initialized earlier:
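A minimal sketch of the extraction, assuming the standard Lighthouse audit names in the PageSpeed v5 response (the SIS and FID mappings in particular are assumptions):

for url, result in desktop_cwv.items():
    try:
        audits = result['lighthouseResult']['audits']
        desktop_psi_lst.append({
            'url': url,
            'FCP': audits['first-contentful-paint']['numericValue'],
            'LCP': audits['largest-contentful-paint']['numericValue'],
            'CLS': audits['cumulative-layout-shift']['numericValue'],
            'FID': audits['max-potential-fid']['numericValue'],
            'SIS': result['lighthouseResult']['categories']['performance']['score'] * 100
        })
    except KeyError:
        continue

desktop_psi_df = pd.DataFrame(desktop_psi_lst)   # illustrative name for the resulting dataframe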
The PageSpeed data on all of the ranking URLs is in a dataframe with all of the CWV
metrics:
• FCP: First Contentful Paint
• LCP: Largest Contentful Paint
• CLS: Cumulative Layout Shift
• FID: First Input Delay
• SIS: Speed Index Score
To show the relevance of the ranking (and hopefully the benefit to ranking by
improving CWV), we want to merge this with the rank data:
The dataframe is complete with the keyword, its rank, URL, device, and CWV
metrics.
At this point, rather than repeat near identical code for mobile, you can assume we
have the data for mobile which we have combined into a single dataframe using the
pandas concat function (same headings).
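For reference, that combination is a one-liner (desktop_psi_serps and mobile_psi_serps are illustrative names):

overall_psi_serps_bu = pd.concat([desktop_psi_serps, mobile_psi_serps], ignore_index = True)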
To add some additional features, we have added another column is_target indicating
whether the ranking URL is the client or not:
overall_psi_serps_bu['is_target'] = np.where(overall_psi_serps_bu['url'].
str.contains(target_site), '1', '0')
overall_psi_serps_bu['site'] = overall_psi_serps_bu['url'].apply(lambda
url: tldextract.extract(url).domain)
overall_psi_serps_bu['count'] = 1
The aggregation will be executed at the site level so we can compare how each site
scores on average for their CWV metrics and correlate that with performance:
overall_psi_serps_agg = overall_psi_serps_bu.groupby('site').agg({'LCP': 'mean',
                                                                  'FCP': 'mean',
                                                                  'CLS': 'mean',
                                                                  'FID': 'mean',
                                                                  'SIS': 'mean',
                                                                  'rank_absolute': 'mean',
                                                                  'count': 'sum'}).reset_index()
overall_psi_serps_agg = overall_psi_serps_agg.rename(columns = {'count': 'reach'})
Here are some operations to make the site names shorter for the graphs later:
overall_psi_serps_agg['site'] = np.where(overall_psi_serps_agg['site'] ==
'papayaglobal', 'papaya',
overall_psi_serps_agg['site'])
overall_psi_serps_agg['site'] = np.where(overall_psi_serps_agg['site'] ==
'boundlesshq', 'boundless',
overall_psi_serps_agg['site'])
overall_psi_serps_agg
That’s the summary which is not so easy to discern trends, and now we’re ready to
plot the data, starting with the overall speed index. The Speed Index Score (SIS) is scaled
between 0 and 100, 100 being the fastest and therefore best.
Note that in all of the charts that will compare Google rank with the individual CWV
metrics, the vertical axis will be inverted such that the higher the position, the higher the
ranking. This is to make the charts more intuitive and easier to understand.
SIS_cwv_landscape_plt = (
ggplot(overall_psi_serps_agg,
aes(x = 'SIS', y = 'rank_absolute', fill = 'site', colour = 'site',
size = 'reach')) +
geom_point(alpha = 0.8) +
geom_text(overall_psi_serps_agg, aes(label = 'site'),
position=position_stack(vjust=-0.08)) +
labs(y = 'Google Rank', x = 'Speed Score') +
scale_y_reverse() +
scale_size_continuous(range = [7, 17]) +
theme(legend_position = 'none', axis_text_x=element_text(rotation=0,
hjust=1, size = 12))
)
SIS_cwv_landscape_plt.save(filename = 'images/0_SIS_cwv_landscape.png',
height=5, width=8, units = 'in', dpi=1000)
SIS_cwv_landscape_plt
Already we can see in SIS_cwv_landscape_plt (Figure 3-18) that the higher your
speed score, the higher you rank in general which is a nice easy sell to the stakeholders,
acting as motivation to invest resources into improving CWV.
Figure 3-18. Scatterplot comparing speed scores and Google rank of different
websites
Boundless in this instance are doing relatively well. Although they don’t rank the
highest, this could indicate that either some aspects of CWV are not being attended to or
something non-CWV related or more likely a combination of both.
LCP_cwv_landscape_plt = (
ggplot(overall_psi_serps_agg,
aes(x = 'LCP', y = 'rank_absolute', fill = 'site', colour
= 'site',
size = 'reach')) +
geom_point(alpha = 0.8) +
geom_text(overall_psi_serps_agg, aes(label = 'site'),
position=position_stack(vjust=-0.08)) +
labs(y = 'Google Rank', x = 'Largest Contentful Paint') +
scale_y_reverse() +
scale_size_continuous(range = [7, 17]) +
theme(legend_position = 'none', axis_text_x=element_text(rotation=0,
hjust=1, size = 12))
)
LCP_cwv_landscape_plt.save(filename = 'images/0_LCP_cwv_landscape.png',
height=5, width=8, units = 'in', dpi=1000)
LCP_cwv_landscape_plt
The LCP_cwv_landscape_plt plot (Figure 3-19) shows that Papaya and Remote look
like outliers; in any case, the trend does indicate that the less time it takes to load the
largest content element, the higher the rank.
Figure 3-19. Scatterplot comparing Largest Contentful Paint (LCP) and Google
rank by website
FID_cwv_landscape_plt = (
ggplot(overall_psi_serps_agg,
aes(x = 'FID', y = 'rank_absolute', fill = 'site', colour
= 'site',
size = 'reach')) +
geom_point(alpha = 0.8) +
geom_text(overall_psi_serps_agg, aes(label = 'site'),
position=position_stack(vjust=-0.08)) +
labs(y = 'Google Rank', x = 'First Input Delay') +
scale_y_reverse() +
scale_x_log10() +
scale_size_continuous(range = [7, 17]) +
theme(legend_position = 'none', axis_text_x=element_text(rotation=0,
hjust=1, size = 12))
)
FID_cwv_landscape_plt.save(filename = 'images/0_FID_cwv_landscape.png',
height=5, width=8, units = 'in', dpi=1000)
FID_cwv_landscape_plt
Figure 3-20. Scatterplot comparing First Input Delay (FID) and Google rank
by website
The trend indicates that the less time it takes to make the page interactive for users,
the higher the rank.
Boundless are doing well in this respect.
CLS_cwv_landscape_plt = (
    ggplot(overall_psi_serps_agg,
           aes(x = 'CLS', y = 'rank_absolute', fill = 'site', colour = 'site',
               size = 'reach')) +
    geom_point(alpha = 0.8) +
    geom_text(overall_psi_serps_agg, aes(label = 'site'),
              position=position_stack(vjust=-0.08)) +
    labs(y = 'Google Rank', x = 'Cumulative Layout Shift') +
    scale_y_reverse() +
    scale_size_continuous(range = [7, 17]) +
    theme(legend_position = 'none', axis_text_x=element_text(rotation=0,
          hjust=1, size = 12))   # closing theme assumed to match the other landscape plots
)
CLS_cwv_landscape_plt.save(filename = 'images/0_CLS_cwv_landscape.png',
                           height=5, width=8, units = 'in', dpi=1000)
CLS_cwv_landscape_plt
Figure 3-21. Scatterplot comparing Cumulative Layout Shift (CLS) and Google
rank by website
FCP_cwv_landscape_plt = (
    ggplot(overall_psi_serps_agg,
           aes(x = 'FCP', y = 'rank_absolute', fill = 'site', colour = 'site',
               size = 'reach')) +
    geom_point(alpha = 0.8) +
    geom_text(overall_psi_serps_agg, aes(label = 'site'),
              position=position_stack(vjust=-0.08)) +
    labs(y = 'Google Rank', x = 'First Contentful Paint') +   # remaining layers assumed to match the other landscape plots
    scale_y_reverse() +
    scale_size_continuous(range = [7, 17]) +
    theme(legend_position = 'none', axis_text_x=element_text(rotation=0,
          hjust=1, size = 12))
)
FCP_cwv_landscape_plt.save(filename = 'images/0_FCP_cwv_landscape.png',
                           height=5, width=8, units = 'in', dpi=1000)
FCP_cwv_landscape_plt
Figure 3-22. Scatterplot comparing First Contentful Paint (FCP) and Google rank
by website
That’s the deep dive into the overall scores. The preceding example can be repeated
for both desktop and mobile scores to drill down into, showing which specific CWV
metrics should be prioritized. Overall, for boundless, CLS appears to be its weakest point.
In the following, we’ll summarize the analysis on a single chart by pivoting the data
in a format that can be used to power the single chart:
overall_psi_serps_long = overall_psi_serps_agg.copy()
overall_psi_serps_long = overall_psi_serps_long.melt(id_vars=['site'],
value_vars=['LCP',
'FCP', 'CLS', 'FID', 'SIS'],
var_name='Metric',
value_name='Index')
overall_psi_serps_long['x_axis'] = overall_psi_serps_long['Metric']
overall_psi_serps_long['site'] = np.where(overall_psi_serps_long['site'] ==
'papayaglobal', 'papaya',
overall_psi_serps_long['site'])
overall_psi_serps_long['site'] = np.where(overall_psi_serps_long['site'] ==
'boundlesshq', 'boundless',
overall_psi_serps_long['site'])
overall_psi_serps_long
speed_ex_plt = (
ggplot(overall_psi_serps_long,
aes(x = 'site', y = 'Index', fill = 'site')) +
geom_bar(stat = 'identity', alpha = 0.8) +
labs(y = '', x = '') +
theme(legend_position = 'right',
axis_text_x =element_text(rotation=90, hjust=1, size = 12),
legend_title = element_blank()
) +
facet_grid('Metric ~ .', scales = 'free')
)
speed_ex_plt.save(filename = 'images/0_CWV_Metrics_plt.png',
height=5, width=8, units = 'in', dpi=1000)
speed_ex_plt
The speed_ex_plt chart (Figure 3-23) shows the competitors being compared for
each metric. Remote seem to perform the worst on average, so their prominent rankings
are probably due to non-CWV factors.
Onsite CWV
The purpose of the landscape analysis was to use data to convince the client, colleagues, and stakeholders of the SEO benefits that would follow CWV improvement. In this section, we're going to drill into the site itself to see where the improvements could be made.
We'll start by importing the data and cleaning up the columns as usual:
target_crawl_raw = pd.read_csv('data/boundlesshq_com_all_urls__excluding_uncrawled__filtered_20220427203402.csv')
We’re using Sitebulb crawl data, and we want to only include onsite indexable URLs
since those are the ones that rank, which we will filter as follows:
target_crawl_raw = target_crawl_raw.loc[target_crawl_raw['host'] == target_host]
target_crawl_raw = target_crawl_raw.loc[target_crawl_raw['indexable_status'] == 'Indexable']
target_crawl_raw = target_crawl_raw.loc[target_crawl_raw['content_type'] == 'HTML']
target_crawl_raw
With 279 rows, it’s a small website. The next step is to select the desired columns
which will comprise the CWV measures and anything that could possibly explain it:
target_speedDist_df
The dataframe columns have reduced from 71 to 29, and the CWV scores are more
apparent.
Attempting to analyze the sites at the URL will not be terribly useful, so to make
pattern identification easier, we will classify the content by folder location:
section_conds = [
target_speedDist_df['url'] == 'https://boundlesshq.com/',
target_speedDist_df['url'].str.contains('/guides/'),
target_speedDist_df['url'].str.contains('/how-it-works/')
]
target_speedDist_df[cols] = pd.to_numeric(target_speedDist_df[cols].
stack(), errors='coerce').unstack()
target_speedDist_df
A new column has been created in which each indexable URL is labeled by their
content category.
Time for some aggregation using groupby on “content”:
speed_dist_agg = target_speedDist_df.groupby('content').agg({'url': 'count',
    'performance_score': 'mean'}).reset_index()   # aggregation for performance_score assumed to be the mean
speed_dist_agg
Most of the content is guides, followed by blog posts, with three offer pages.
To visualize, we're going to use a histogram showing the distribution of the overall performance score and color code the URLs by their content segment.
The home page and the guides are by far the fastest.
target_speedDist_plt = (
    ggplot(target_speedDist_df,
           aes(x = 'performance_score', fill = 'content')) +
    geom_histogram(alpha = 0.8, bins = 20) +
    labs(y = 'Page Count', x = '\nSpeed Score') +
    theme(legend_position = 'right',
          axis_text_x = element_text(rotation=90, hjust=1, size = 7))   # theme assumed to match the other onsite plots
)
target_speedDist_plt.save(filename = 'images/3_target_speedDist_plt.png',
                          height=5, width=8, units = 'in', dpi=1000)
target_speedDist_plt
The target_speedDist_plt plot (Figure 3-24) shows the home page (in purple)
performs reasonably well with a speed score of 84. The guides vary, but most of these
have a speed above 80, and the majority of blog posts are in the 70s.
target_CLS_plt = (
    ggplot(target_speedDist_df,
           aes(x = 'cumulative_layout_shift', fill = 'content')) +
    geom_histogram(alpha = 0.8, bins = 20) +
    labs(y = 'Page Count', x = '\ncumulative_layout_shift') +
    theme(legend_position = 'right',
          axis_text_x = element_text(rotation=90, hjust=1, size = 7))   # theme assumed to match the other onsite plots
)
target_CLS_plt.save(filename = 'images/3_target_CLS_plt.png',
                    height=5, width=8, units = 'in', dpi=1000)
target_CLS_plt
As shown in target_CLS_plt (Figure 3-25), guides have the least amount of shifting during browser rendering, whereas the blogs and the home page shift the most.
So we now know which content templates to focus our CLS development efforts on.
target_FCP_plt = (
    ggplot(target_speedDist_df,
           aes(x = 'first_contentful_paint', fill = 'content')) +
    geom_histogram(alpha = 0.8, bins = 30) +
    labs(y = 'Page Count', x = '\nContentful paint') +
    theme(legend_position = 'right',
          axis_text_x = element_text(rotation=90, hjust=1, size = 7))   # axis text settings assumed to match the other onsite plots
)
target_FCP_plt.save(filename = 'images/3_target_FCP_plt.png',
                    height=5, width=8, units = 'in', dpi=1000)
target_FCP_plt
target_FCP_plt (Figure 3-26) shows no discernible trends in this area, which indicates it's a site-wide problem. So digging into the Chrome Developer Tools and looking into the network logs would be the obvious next step.
target_LCP_plt = (
ggplot(target_speedDist_df,
aes(x = 'largest_contentful_paint', fill = 'content')) +
geom_histogram(alpha = 0.8, bins = 20) +
labs(y = 'Page Count', x = '\nlargest_contentful_paint') +
theme(legend_position = 'right',
axis_text_x = element_text(rotation=90, hjust=1, size = 7))
)
target_LCP_plt.save(filename = 'images/3_target_LCP_plt.png',
height=5, width=8, units = 'in', dpi=1000)
target_LCP_plt
target_LCP_plt (Figure 3-27) shows most guides and some blogs have the fastest LCP
scores; in any case, the blog template and the rogue guides would be the areas of focus.
target_FID_plt = (
ggplot(target_speedDist_df,
aes(x = 'time_to_interactive', fill = 'content')) +
geom_histogram(alpha = 0.8, bins = 20) +
labs(y = 'Page Count', x = '\ntime_to_interactive') +
theme(legend_position = 'right',
axis_text_x = element_text(rotation=90, hjust=1, size = 7))
)
target_FID_plt.save(filename = 'images/3_target_FID_plt.png',
height=5, width=8, units = 'in', dpi=1000)
target_FID_plt
The majority of the site appears in target_FID_plt (Figure 3-28) to enjoy fast FID
times, so this would be the least priority for CWV improvement.
Summary
In this chapter, we covered how a data-driven approach can be taken toward technical SEO by way of
• Modeling page authority to estimate the benefit of technical SEO recommendations to colleagues and clients
• Analyzing the distribution of internal links by site level and page authority to find underlinked URLs
• Evaluating anchor text relevance and locating where the irrelevant anchors sit
• Benchmarking Core Web Vitals against the competitor landscape and drilling into onsite CWV by content template
The next chapter will focus on using data to improve content and UX.
CHAPTER 4
Content and UX
Content and UX for SEO is about the quality of the experience you’re delivering to your
website users, especially when they are referred from search engines. This means a
number of things including but not limited to
By no means do we claim that this is the final word on data-driven SEO from a
content and UX perspective. What we will do is expose data-driven ways of solving the
most important SEO challenge using data science techniques, as not all require data
science.
For example, getting scientific evidence that fast page speeds are indicative of higher
ranked pages uses similar code from Chapter 6. Our focus will be on the various flavors
of content that best satisfies the user query: keyword mapping, content gap analysis, and
content creation.
• Decide what content to create for target keywords that will satisfy
users searching for them
Data Sources
Your most likely data sources will be a combination of
Keyword Mapping
While there is so much to be gained from creating value-adding content, there is also
much to be gained from retiring or consolidating content. This is achieved by merging it
with another on the basis that they share the same search intent. Assuming the keywords
have been grouped together by search intent, the next stage is to map them.
Keyword mapping is the process of mapping target keywords to pages and then
optimizing the page toward these – as a result, maximizing a site’s rank position potential
in the search result. There are a number of approaches to achieve this:
• TF-IDF
• String matching
We recommend string matching as it’s fast, reasonably accurate, and the easiest
to deploy.
String Matching
String matching measures how much two strings overlap and is used in fields such as DNA sequencing. It can work in two ways: either treat each string as a single object or treat strings as being made up of tokens (i.e., the words within a string). We're opting for the latter because words mean something to humans and are not serial numbers. For that reason, we'll be using Sorensen-Dice, which is fast and accurate compared to others we've tested.
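To make that concrete, here's a quick illustration of the function we'll use; scores run from 0 (no overlap) to 1 (identical strings):

from textdistance import sorensen_dice

sorensen_dice('balayage hair colour', 'balayage hair colour ideas')  # near-identical strings score close to 1
sorensen_dice('balayage hair colour', 'contact us')                  # little overlap, so a much lower score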
The following code extract shows how we use string distance to map keywords to
content by seeking the most similar URL titles to the target keyword. Let’s go, importing
libraries:
import requests
from requests.exceptions import ReadTimeout
from json.decoder import JSONDecodeError
import re
import time
import random
import pandas as pd
import numpy as np
import datetime
from client import RestClient
import json
import py_stringmatching as sm
from textdistance import sorensen_dice
from plotnine import *
import matplotlib.pyplot as plt
target = 'wella'
We’ll start by importing the crawl data, which is a CSV export of website auditing
software, in this case from “Sitebulb”:
crawl_raw = pd.read_csv('data/www_wella_com_internal_html_urls_by_indexable_status_filtered_20220629220833.csv')
crawl_raw.columns = [col.lower().replace('(','').replace(')','').replace('%','').replace(' ', '_')
                     for col in crawl_raw.columns]
crawl_df = crawl_raw.copy()
We’re only interested in indexable pages as those are the URLs available for
mapping:
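A minimal sketch of the filter, assuming the cleaned Sitebulb column name:

crawl_df = crawl_df.loc[crawl_df['indexable_status'] == 'Indexable']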
The crawl import is complete. However, we’re only interested in the URL and title as
that’s all we need for mapping keywords to URLs. Still it’s good to import the whole file to
visually inspect it, to be more familiar with the data.
The dataframe is showing the URLs and titles. Let’s load the keywords we want to
map that have been clustered using techniques in Chapter 2:
keyword_discovery = pd.read_csv('data/keyword_discovery.csv')
The dataframe shows the topics, keywords, number of search engine results for the
keywords, topic web search results, and the topic group. Note these were clustered using
the methods disclosed in Chapter 2.
We’ll map the topic as this is the central keyword that would also rank for their topic
group keywords. This means we only require the topic column.
total_mapping_simi = keyword_discovery[['topic']].copy().drop_duplicates()
We want all the combinations of topics and URL titles before we can test each
combination for string similarity. We achieve this using the cross-product merge:
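A minimal sketch of the cross join, pairing every topic with every URL and title:

total_mapping_simi = total_mapping_simi.merge(crawl_df[['url', 'title']], how = 'cross')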
A new column “test” is created which will be formatted to remove boilerplate brand
strings and force lowercase. This will make the string matching values more accurate.
total_mapping_simi['test'] = total_mapping_simi['title']
total_mapping_simi['test'] = total_mapping_simi['test'].str.lower()
total_mapping_simi['test'] = total_mapping_simi['test'].str.replace(' \| wella', '')
total_mapping_simi
Now we’re ready to compare strings by creating a new column “simi,” meaning
string similarity. The scores will take the topic and test columns as inputs and feed the
sorensen_dice function imported earlier:
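A minimal sketch of the scoring:

total_mapping_simi['simi'] = total_mapping_simi.apply(
    lambda row: sorensen_dice(row['topic'], row['test']), axis = 1)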
The simi column has been added complete with scores. A score of 1 is identical, and
0 is completely dissimilar. The next stage is to select the closest matching URLs to topic
keywords:
keyword_mapping_grp = total_mapping_simi.copy()
The dataframe is first sorted by similarity score and topic in descending order so that
the first row by topic is the closest matching:
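A minimal sketch of the sort:

keyword_mapping_grp = keyword_mapping_grp.sort_values(['topic', 'simi'], ascending = False)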
After sorting, we use the first() function to select the top matching URL for each topic
using the groupby() function:
keyword_mapping_grp = keyword_mapping_grp.groupby('topic').first().reset_index()
keyword_mapping_grp
Each topic now has its closest matching URL. The next stage is to decide whether
these matches are good enough or not:
At this point, we eyeball the data to see what threshold number is good enough. I've gone with 0.7 or 70% as it seems to do the job mostly correctly, acting as the natural threshold for matching content to URLs.
Using np.where(), which is equivalent to Excel's IF formula, we'll mark any rows exceeding 0.7 as "mapped" and the rest as "unmatched":
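A minimal sketch of the thresholding:

keyword_mapping = keyword_mapping_grp.copy()
keyword_mapping['mapped'] = np.where(keyword_mapping['simi'] >= 0.7, 'mapped', 'unmatched')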
keyword_mapping
Finally, we have keywords mapped to URLs and some stats on the overall exercise.
keyword_mapping_aggs = keyword_mapping.copy()
keyword_mapping_aggs = keyword_mapping_aggs.groupby('mapped').count().reset_index()
keyword_mapping_aggs
• Content gaps: The extent to which the brand is not visible for
keywords that form the content set
Without this analysis, your site risks being left behind in terms of audience reach and
also appearing less authoritative because your site appears less knowledgeable about the
topics covered by your existing content. This is particularly important when considering
the buying cycle. Let’s imagine you’re booking a holiday, and now imagine the variety
of search queries that you might use as you carry out that search, perhaps searching
by destination (“beach holidays to Spain”), perhaps refining by a specific requirement
(“family beach holidays in Spain”), and then more specific including a destination
(Majorca), and perhaps (“family holidays with pool in Majorca”). Savvy SEOs think
deeply about mapping customer demand (right across the search journey) to compelling
landing page (and website) experiences that can satisfy this demand. Data science
enables you to manage this opportunity at a significant scale.
Warnings and motivations over, let’s roll starting with the usual package loading:
import re
import time
import random
import pandas as pd
import numpy as np
OS and Glob allow the environment to read the SEMRush files from a folder:
import os
import glob
pd.set_option('display.max_colwidth', None)
These variables are set in advance so that when copying this script over for another
site, the script can be run with minimal changes to the code:
root_domain = 'wella.com'
hostdomain = 'www.wella.com'
hostname = 'wella'
full_domain = 'https://www.wella.com'
target_name = 'Wella'
With the variables set, we’re now ready to start importing data.
data_dir = os.path.join('data/semrush/')
Glob reads all of the files in the folder, and we store the output in a variable
“semrush_csvs”:
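A minimal sketch:

semrush_csvs = glob.glob(os.path.join(data_dir, '*.csv'))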
['data/hair.com-organic.Positions-uk-20220704-2022-07-05T14_04_59Z.csv',
'data/johnfrieda.com-organic.Positions-
uk-20220704-2022-07-05T13_29_57Z.csv',
'data/madison-reed.com-organic.Positions-
uk-20220704-2022-07-05T13_38_32Z.csv',
'data/sebastianprofessional.com-organic.Positions-
uk-20220704-2022-07-05T13_39_13Z.csv',
'data/matrix.com-organic.Positions-uk-20220704-2022-07-05T14_04_12Z.csv',
'data/wella.com-organic.Positions-uk-20220704-2022-07-05T13_30_29Z.csv',
'data/redken.com-organic.Positions-uk-20220704-2022-07-05T13_37_31Z.csv',
'data/schwarzkopf.com-organic.Positions-
uk-20220704-2022-07-05T13_29_03Z.csv',
'data/garnier.co.uk-organic.Positions-
uk-20220704-2022-07-05T14_07_16Z.csv']
Initialize the final dataframe where we’ll be storing the imported SEMRush data:
semrush_raw_df = pd.DataFrame()
semrush_li = []
The for loop uses the pandas read_csv() function to read the SEMRush CSV file and
extract the filename which is put into a new column “filename.” A bit superfluous to
requirements but it will help us know where the data came from.
Once the data is read, it is added to the semrush_li list we initialized earlier:
for cf in semrush_csvs:
    df = pd.read_csv(cf, index_col=None, header=0)
    df['filename'] = os.path.basename(cf)
    df['filename'] = df['filename'].str.replace('.csv', '')
    df['filename'] = df['filename'].str.replace('_', '.')
    semrush_li.append(df)
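The list of dataframes is then combined into the raw dataframe initialized earlier (a minimal sketch):

semrush_raw_df = pd.concat(semrush_li, axis = 0, ignore_index = True)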
semrush_raw_df.columns = semrush_raw_df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')
A site column is created so we know which content the site belongs to. Here, we used
regex on the filename column, but we could have easily derived this from the URL also:
semrush_raw_df['site'] = semrush_raw_df['filename'].str.extract('(.*?)\-')
semrush_raw_df.head()
That’s the dataframe, although we’re more interested in the keywords and the site it
belongs to.
semrush_raw_presect = semrush_raw_sited.copy()
semrush_raw_presect = semrush_raw_presect[['keyword', 'site']]
semrush_raw_presect
The aim of the exercise is to find keywords to two or more competitors which will
define the core content set.
To achieve this, we will use a list comprehension to split the semrush_raw_presect
dataframe by site into unnamed dataframes:
df1, df2, df3, df4, df5, df6, df7, df8, df9 = [x for _, x in semrush_raw_presect.groupby(semrush_raw_presect['site'])]
Now that each dataframe has the site and keywords, we can dispense with the site
column as we’re only interested in the keywords and not where they come from.
We start by defining a list of dataframes, df_list:
df_list = [df1, df2, df3, df4, df5, df6, df7, df8, df9]
df1
keywords_lists = []
A list comprehension will go through all of the keyword sets in df_list and convert each to a list, giving us a list of keyword lists.
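A minimal sketch of that comprehension, assuming the keyword column name from the SEMRush export:

keywords_lists = [df['keyword'].to_list() for df in df_list]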
The lists within the list of lists are too long to print here; however, the double bracket
at the beginning should show this is indeed a list of lists.
keywords_lists
[['garnier',
'hair colour',
'garnier.co.uk',
'garnier hair color',
'garnier hair colour',
'garnier micellar water',
'garnier hair food',
'garnier bb cream',
'garnier face mask',
'bb cream from garnier',
'garnier hair mask',
'garnier shampoo',
'hair dye',
lst_1
['garnier',
'hair colour',
'garnier.co.uk',
'garnier hair color',
'garnier hair colour',
'garnier micellar water',
'garnier hair food',
'garnier bb cream',
'garnier face mask',
'bb cream from garnier',
'garnier hair mask',
'garnier shampoo',
'hair dye',
'garnier hair dye',
'garnier shampoo bar',
'garnier vitamin c serum',
Now we want to generate combinations of lists so we can control how each of the
site’s keywords get intersected:
The dictionary comprehension will append each list into a dictionary we create
called keywords_dict, where the key (index) is the number of the list:
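A minimal sketch:

keywords_dict = {i: kw_list for i, kw_list in enumerate(keywords_lists)}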
keywords_dict.keys()
Checking the keys, we get the list numbers. The reason they go from 0 to 8 and not 1 to 9 is that Python uses zero-based indexing, which means it starts from zero:
dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8])
Now we’ll convert the keys to a list for ease of manipulation shortly:
keys_list = list(keywords_dict.keys())
keys_list
[0, 1, 2, 3, 4, 5, 6, 7, 8]
With the list, we can construct combinations of the site's keywords to intersect.
The intersection of the website keyword lists will be the words that are common to the
websites.
list_combos = []
A list comprehension using the combinations function picks four of the site keyword lists at a time and stores each combination in list_combos using the append() function, converting each combination into a list so that list_combos is a list of lists:
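The combination-building code isn't shown here; a sketch using the combinations function from itertools, taking the site lists four at a time:
from itertools import combinations

list_combos = []
for combo in combinations(keys_list, 4):
    list_combos.append(list(combo))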
list_combos
[[0, 1, 2, 3],
[0, 1, 2, 4],
[0, 1, 2, 5],
[0, 1, 2, 6],
[0, 1, 2, 7],
[0, 1, 2, 8],
[0, 1, 3, 4],
[0, 1, 3, 5],
[0, 1, 3, 6], ...
With the list of lists, we’re ready to start intersecting the keyword lists to build the
core content (keyword) set.
keywords_intersected = []
Define the multi_intersect function, which takes the dictionary of keyword lists and a combination of its keys, finds the common keywords (i.e., the intersection), and adds them to the keywords_intersected list.
The function can be adapted to compare just two sites, three sites, and so on. Just ensure you rerun the combinations function with the number of lists desired and edit the function accordingly:
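The function itself isn't reproduced here; a minimal sketch that intersects however many keyword lists the combination references:
def multi_intersect(kw_dict, combo):
    # Pull the keyword list for each site index in the combination
    keyword_sets = [set(kw_dict[key]) for key in combo]
    # Return the keywords common to all of them
    return list(set.intersection(*keyword_sets))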
Using the list comprehension, we loop through the list of combinations list_combos
to run the multi_intersect function which takes the dictionary containing all the site
keywords (keywords_dict), pulls the appropriate keywords, and finds the common ones,
before adding to keywords_intersected:
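The comprehension isn't shown here; a one-line sketch:
keywords_intersected = [multi_intersect(keywords_dict, combo) for combo in list_combos]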
And we get a list of lists, because each list is an iteration of the function for each
combination:
keywords_intersected
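The flattening step isn't shown here; a sketch that collapses the list of lists into flat_keywords_intersected before deduplication:
flat_keywords_intersected = [keyword for sublist in keywords_intersected for keyword in sublist]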
unique_keywords_intersected = list(set(flat_keywords_intersected))
print(len(flat_keywords_intersected), len(unique_keywords_intersected))
87031 8380
There were 87K keywords originally and 8380 keywords post deduplication.
unique_keywords_intersected
That's the core content list, but it's not over yet, as we still need to establish the gap, which is what we really want to know.
Establishing Gap
The question is: which keywords is "Wella" not targeting, and how many are there?
We’ll start by filtering the SEMRush site for the target site Wella.com:
target_semrush = semrush_raw_sited.loc[semrush_raw_sited['site'] ==
root_domain]
And then we include only the keywords in the core content set:
target_on = target_semrush.loc[target_semrush['keyword'].isin(unique_keywords_intersected)]
target_on
Let’s get some stats starting with the number of keywords in the preceding dataframe
and the number of keywords in the core content set:
print(target_on['keyword'].drop_duplicates().shape[0], len(unique_keywords_intersected))
6936 8380
So just under 70% of Wella’s keyword content is in the core content set, which is
about 1.4K keywords short.
To find the 6.9K intersect keywords, we can use the list and set functions:
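The code isn't reproduced here; a sketch, intersecting the target's unique keywords with the core content set:
target_keywords = target_on['keyword'].drop_duplicates().tolist()
target_intersect = list(set(target_keywords) & set(unique_keywords_intersected))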
To find the keywords in the core content set that the target is not visible for, that is, the content gap, we'll remove the target's SEMRush keywords from the core content set:
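A sketch of the set difference; target_gap is the name used in the filtering step that follows, and target_keywords is the target keyword list from the previous step:
target_gap = list(set(unique_keywords_intersected) - set(target_keywords))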
Now that we know what these gap keywords are, we can filter the SEMRush dataframe down to those keywords:
cga_semrush = semrush_raw_sited.loc[semrush_raw_sited['keyword'].
isin(target_gap)]
cga_semrush
We only want the highest ranked target URLs per keyword, which we’ll achieve with
a combination of sort_values(), groupby(), and first():
cga_unique = cga_semrush.sort_values('position').groupby('keyword').first().reset_index()
cga_unique['project'] = target_name
Ready to export:
cga_unique.to_csv('exports/cga_unique.csv')
cga_unique
3. Check the search results for each heading as writers can phrase
the intent differently
This strategy won't work for all verticals, as some market sectors are much noisier than others. For example, with hair styling articles, a lot of the headings (and their sections) are celebrity names, which don't share a detectable search intent from one celebrity to the next.
In contrast, in other verticals this method works really well, because there aren't endless listicles sharing the same HTML heading tags with related article titles (e.g., "Drew Barrymore" and "54 ways to wear the modern Marilyn").
Instead, the headings are fewer in number and have a meaning in common, for example, "What is account-based marketing?" and "Defining ABM," which is something Google is likely to understand.
With those caveats in mind, let’s go.
import requests
from requests.exceptions import ReadTimeout
from json.decoder import JSONDecodeError
import re
import time
import random
import pandas as pd
import numpy as np
import datetime
import json
from tldextract import extract
target = 'on24'
These are the keywords the target website wants to rank for. There are only eight keywords but, as you'll see, this process generates a lot of noisy data, which will need cleaning up:
serps_input
The extract function from the tldextract package is useful for extracting the hostname and domain name from URLs:
serps_input_clean = serps_input.copy()
serps_input_clean['url'] = serps_input_clean['url'].astype(str)
serps_input_clean['host'] = serps_input_clean['url'].apply(lambda x:
extract(x))
Extract the hostname by taking the penultimate element of the extracted result using the string get method:
serps_input_clean['host'] = serps_input_clean['host'].str.get(-2)
serps_input_clean['site'] = serps_input_clean['url'].apply(lambda x:
extract(x))
serps_input_clean['site'] = [list(lst) for lst in serps_input_clean['site']]
Only this time, we want both the hostname and the top-level domain (TLD) which
we will join to form the site or domain name:
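The joining step isn't reproduced here; a sketch, assuming the last two elements (domain and suffix) are joined with a dot, and that the domain column filtered on later is derived the same way:
serps_input_clean['site'] = serps_input_clean['site'].apply(lambda lst: '.'.join(lst[-2:]))
serps_input_clean['domain'] = serps_input_clean['site']  # hypothetical alias; 'domain' is filtered on later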
serps_input_clean
The augmented dataframe shows the host and site columns added.
This line allows the column values to be read by setting the column widths to their
maximum value:
pd.set_option('display.max_colwidth', None)
serps_to_crawl_df = serps_input_clean.copy()
There are some sites not worth crawling because they won’t let you, which are
defined in the following list:
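The dont_crawl list itself isn't shown; the hostnames below are hypothetical placeholders for sites known to block crawlers:
dont_crawl = ['linkedin', 'youtube', 'facebook', 'twitter', 'pinterest']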
serps_to_crawl_df = serps_to_crawl_df.loc[~serps_to_crawl_df['host'].
isin(dont_crawl)]
We’ll also remove nulls and sites outside the top 10:
serps_to_crawl_df = serps_to_crawl_df.loc[~serps_to_crawl_df['domain'].
isnull()]
serps_to_crawl_df = serps_to_crawl_df.loc[serps_to_crawl_df['rank'] < 10]
serps_to_crawl_df.head(10)
With the dataframe filtered, we just want the URLs to export to our desktop crawler.
Some URLs may rank for multiple search phrases. To avoid crawling the same URL
multiple times, we’ll use drop_duplicates() to make the URL list unique:
serps_to_crawl_upload = serps_to_crawl_df[['url']].drop_duplicates()
serps_to_crawl_upload.to_csv('data/serps_to_crawl_upload.csv', index=False)
serps_to_crawl_upload
Now we have a list of 62 URLs to crawl, which cover the eight target keywords.
Let’s import the results of the crawl:
crawl_raw = pd.read_csv('data/all_inlinks.csv')
pd.set_option('display.max_columns', None)
Using a list comprehension, we’ll clean up the column names to make it easier to
work with:
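The comprehension isn't reproduced here; a sketch that mirrors the column-cleaning pattern used elsewhere in the book:
crawl_raw.columns = [col.lower().replace(' ', '_') for col in crawl_raw.columns]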
Print out the column names to see how many extractor fields were extracted:
print(crawl_raw.columns)
There are 6 primary headings (H1 in HTML) and 65 H2 headings altogether. These
will form the basis of our content sections which tell us what content should be on
those pages.
crawl_raw
crawl_headings = crawl_raw.loc[crawl_raw['link_position'] ==
'Content'].copy()
The dataframe also contains columns that are superfluous to our requirements, such as link_position and link_origin. We can remove these by listing the columns by position (which saves space and typing, given how many column names there are).
drop_cols = [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20]
Using the .drop() method, we can drop multiple columns in place (i.e., without
having to copy the result onto itself ):
crawl_headings.drop(crawl_headings.columns[drop_cols], axis = 1,
inplace = True)
Rename the columns from source to URL, which will be useful for joining later:
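The rename isn't shown here; a minimal sketch, assuming the crawler exports the linking page in a column called source:
crawl_headings = crawl_headings.rename(columns = {'source': 'url'})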
crawl_headings
With the URL and its content section columns selected, the dataframe needs to be converted to long format, where all of the sections sit in a single column called "heading":
crawl_headings_long = crawl_headings.copy()
We’ll want a list of the extractor column names (again to save typing) by subsetting
the dataframe from the second column onward using .iloc and extracting the column
names (.columns.values):
Using the .melt() function, we’ll pivot the dataframe to reshape the content sections
into a single column “heading” using the preceding list:
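Neither the column list nor the melt() call is reproduced here; a sketch, assuming the first column is the URL and the remaining columns are the extracted headings:
heading_cols = crawl_headings_long.iloc[:, 1:].columns.values.tolist()
crawl_headings_long = crawl_headings_long.melt(id_vars = 'url', value_vars = heading_cols,
                                               var_name = 'position', value_name = 'heading')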
crawl_headings_long = crawl_headings_long.loc[~crawl_headings_long['heading'].isnull()]
crawl_headings_long = crawl_headings_long.drop_duplicates()
crawl_headings_long
The resulting dataframe shows the URL, the heading, and the position where the first
number denotes whether it was an h1 or h2 and the second number indicates the order
of the heading on the page. The heading is the text value.
You may observe that the heading contains some values that are not strictly content
but boilerplate content that is sitewide, such as Company, Resources, etc. These will
require removal at some point.
serps_headings = serps_to_crawl_df.copy()
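The heading column used below comes from the crawl data; a minimal sketch of the implied join, assuming a left merge on url:
serps_headings = serps_headings.merge(crawl_headings_long, on = 'url', how = 'left')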
serps_headings['heading'] = np.where(serps_headings['heading'].isnull(),
'', serps_headings['heading'])
serps_headings['project'] = 'target'
serps_headings
With the data joined, we’ll take the domain, heading, and the position:
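The column selection isn't reproduced here; a sketch, assuming headings_tosum keeps the domain, heading, and position columns:
headings_tosum = serps_headings[['domain', 'heading', 'position']].copy()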
Split position by underscore and extract the last number in the list (using -1) to get
the order the heading appears on the page:
headings_tosum['pos_n'] = headings_tosum['position'].str.split('_').str[-1]
headings_tosum['pos_n'] = headings_tosum['pos_n'].astype(float)
headings_tosum['count'] = 1
headings_tosum
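The aggregation steps that produce domsheadings_tosum_agg and headings_tosum_agg are not reproduced here. A rough sketch under two assumptions: the first aggregation counts how many domains each heading appears on (to surface sitewide boilerplate), and the second sums the count and averages the position per heading:
domsheadings_tosum_agg = (headings_tosum.drop_duplicates(['domain', 'heading'])
                          .groupby('heading').agg({'count': 'sum'}).reset_index())
headings_tosum_agg = (headings_tosum.groupby('heading')
                      .agg({'count': 'sum', 'pos_n': 'mean'}).reset_index())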
stop_headings = domsheadings_tosum_agg.loc[domsheadings_tosum_agg['count'] > 1]
stop_headings = stop_headings.loc[stop_headings['heading'].str.contains('\n')]
stop_headings = stop_headings['heading'].tolist()
stop_headings
We'll now analyze the headings themselves, first removing the boilerplate stop headings and empty values:
headings_tosum_agg = headings_tosum_agg.loc[~headings_tosum_agg['heading'].isin(stop_headings)]
headings_tosum_agg = headings_tosum_agg.loc[headings_tosum_agg['heading'] != '']
headings_tosum_agg.head(10)
The dataframe now appears to contain more sensible content headings, with the exception of "company," which also sits much further down the page at an average position of 25.
headings_tosum_filtered = headings_tosum_agg.copy()
headings_tosum_filtered = headings_tosum_filtered.loc[headings_tosum_filtered['count'] < 10]
headings_tosum_filtered['tokens'] = headings_tosum_filtered['heading'].str.count(' ') + 1
headings_tosum_filtered['heading'] = headings_tosum_filtered['heading'].str.strip()
Split the heading on colons and full stops, keeping the text to the right of the punctuation mark:
headings_tosum_filtered['heading'] = headings_tosum_filtered['heading'].str.split(':').str[-1]
headings_tosum_filtered['heading'] = headings_tosum_filtered['heading'].str.split('.').str[-1]
headings_tosum_filtered = headings_tosum_filtered.loc[~headings_tosum_filtered['heading'].str.contains('[0-9] of [0-9]', regex = True)]
Remove headings that are fewer than 5 words long or more than 12:
headings_tosum_filtered = headings_tosum_filtered.loc[headings_tosum_filtered['tokens'].between(5, 12)]
headings_tosum_filtered = headings_tosum_filtered.sort_values('count', ascending = False)
headings_tosum_filtered = headings_tosum_filtered.loc[headings_tosum_filtered['heading'] != '']
headings_tosum_filtered.head(10)
Now we have headings that look more like actual content sections. These are now
ready for clustering.
Cluster Headings
The reason for clustering is that writers will describe the same section heading using different words, often deliberately, to avoid copyright infringement and plagiarism.
However, Google is smart enough to know that “webinar best practices” and “best
practices for webinars” are the same.
To make use of Google’s knowledge, we’ll make use of the SERPs to see if the search
results of each heading are similar enough to know if they mean the same thing or not
(i.e., whether the underlying meaning or intent is the same).
We’ll create a list and use the search intent clustering code (see Chapter 2) to
categorize the headings into topics:
headings_to_cluster = headings_tosum_filtered[['heading']].drop_duplicates()
headings_to_cluster = headings_to_cluster.loc[~headings_to_cluster['heading'].isnull()]
headings_to_cluster = headings_to_cluster.rename(columns = {'heading': 'keyword'})
headings_to_cluster
With the headings clustered by search intent, we’ll import the results:
topic_keyw_map = pd.read_csv('data/topic_keyw_map.csv')
Let’s rename the keyword column to heading, which we can use to join to the SERP
dataframe later:
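The rename isn't reproduced here; a one-line sketch:
topic_keyw_map = topic_keyw_map.rename(columns = {'keyword': 'heading'})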
topic_keyw_map
The dataframe shows the heading and the meaning of the heading as “topic.” The
next stage is to get some statistics and see how many headings constitute a topic. As the
topics are the central meaning of the headings, this will form the core content sections
per target keyword.
topic_keyw_map_agg = topic_keyw_map.copy()
topic_keyw_map_agg['count'] = 1
topic_keyw_map_agg = topic_keyw_map_agg.groupby('topic').agg({'count':
'sum'}).reset_index()
topic_keyw_map_agg = topic_keyw_map_agg.sort_values('count',
ascending = False)
topic_keyw_map_agg
serps_topics_merge = serps_headings.copy()
serps_topics_merge['heading'] = serps_topics_merge['heading'].str.lower()
serps_topics_merge = serps_topics_merge.merge(topic_keyw_map, on =
'heading', how = 'left')
serps_topics_merge
The count will be reset to 1, so we can count the number of suggested content
sections per target keyword:
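The construction of keyword_topics_summary isn't shown here; a sketch, assuming it deduplicates the keyword and topic pairs from the merged SERP data and drops unmatched headings:
keyword_topics_summary = serps_topics_merge[['keyword', 'topic']].drop_duplicates()
keyword_topics_summary = keyword_topics_summary.loc[~keyword_topics_summary['topic'].isnull()]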
keyword_topics_summary['count'] = 1
keyword_topics_summary
The preceding dataframe shows the content sections (topic) that should be written
for each target keyword.
keyword_topics_summary.groupby(['keyword']).agg({'count': 'sum'}).reset_index()
Webinar best practices will have the most content, while the other target keywords will have around two core content sections each on average.
Reflections
For B2B marketing, this works really well: it automates a manual process most SEOs go through (i.e., seeing what content the top 10 ranking pages cover), which is especially valuable when you have a lot of keywords to create content for.
We used the H1 and H2 because using even more copy from the body (such as H3 or
<p> paragraphs even after filtering out stop words) would introduce more noise into the
string distance calculations.
Sometimes you get suggestions that are genuinely good; however, the output should be reviewed before raising content requests with your creative team or agency.
Summary
There are many aspects of SEO that go into delivering content and UX better than your competitors. This chapter focused on using competitor keyword data to define a core content set, establishing the content gap for a target site, and generating suggested content sections by clustering competitor headings according to search intent.
The next chapter deals with the third major pillar of SEO: authority.
CHAPTER 5
Authority
Authority is arguably 50% of the Google algorithm. You could optimize your site to your heart's content, creating the perfect content and delivering it with the perfect UX on a site with the most perfect information architecture, only to find it's nowhere in Google's search results when searching by the title of the page (assuming it's not a unique search phrase). So what gives?
You'll find out about this, among other things, in this chapter.
The first thing Google did was rank pages based on their authority, in other words, how trustworthy the document (or website) was, as opposed to only matching a document on keyword relevance. Authority in those days was measured by Google as the number of links from other sites pointing to your site, much in the same way as citations in a doctoral dissertation. The more links (or citations), the higher the probability a random surfer on the Web would find your content. This made SEO harder to game and the results (temporarily yet significantly) more reliable relative to the competition.
The second thing they did was partner with Yahoo! which openly credited Google for
powering their search results. So what happened next? Instead of using Yahoo!, people
went straight to Google, bypassing the intermediary Yahoo! Search engine, and the rest is
history – or not quite.
Figure 5-1 is just one example of many showing a positive relationship between
rankings and authority. In this case, the authority is the product of nonsearch
advertising. And why is that? It’s because good links and effective advertising drive brand
impressions, which are also positively linked.
What we will set out to do is show how data science can help you:
While most of the analysis could be done in a spreadsheet, Python has certain advantages. Beyond the sheer number of rows it can handle, it also lets you look at the statistical side more readily, such as distributions.
import re
import time
import random
import pandas as pd
import numpy as np
import datetime
from datetime import timedelta
from plotnine import *
import matplotlib.pyplot as plt
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
import uritools
pd.set_option('display.max_colwidth', None)
%matplotlib inline
root_domain = 'johnsankey.co.uk'
hostdomain = 'www.johnsankey.co.uk'
hostname = 'johnsankey'
full_domain = 'https://www.johnsankey.co.uk'
target_name = 'John Sankey'
We start by importing the data and cleaning up the column names to make it easier
to handle and quicker to type, for the later stages.
target_ahrefs_raw = pd.read_csv(
'data/johnsankey.co.uk-refdomains-subdomains__2022-03-18_15-15-47.csv')
List comprehensions are a powerful and less intensive way to clean up the column names.
The list comprehension instructs Python to convert the column name to lowercase
for each column (“col”) in the dataframe columns.
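The comprehension itself isn't reproduced here; a sketch that mirrors the competitor version shown later in the chapter:
target_ahrefs_raw.columns = [col.lower().replace(' ','_').replace('.','_').replace('__','_')
                             .replace('(','').replace(')','').replace('%','')
                             for col in target_ahrefs_raw.columns]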
Though not strictly necessary, I like having a count column as standard for
aggregations and a single value column “project” should I need to group the entire table:
target_ahrefs_raw['rd_count'] = 1
target_ahrefs_raw['project'] = target_name
target_ahrefs_raw
Now we have a dataframe with clean column names. The next step is to clean the
actual table values and make them more useful for analysis.
Make a copy of the previous dataframe and give it a new name:
target_ahrefs_clean_dtypes = target_ahrefs_raw.copy()
Clean the dofollow_ref_domains column, which tells us how many referring domains the linking site has. In this case, we'll convert the dashes to zeros and then cast the whole column as a whole number.
Start with referring domains:
target_ahrefs_clean_dtypes['dofollow_ref_domains'] = np.where(target_ahrefs_clean_dtypes['dofollow_ref_domains'] == '-',
                                                              0, target_ahrefs_clean_dtypes['dofollow_ref_domains'])
target_ahrefs_clean_dtypes['dofollow_ref_domains'] = target_ahrefs_clean_dtypes['dofollow_ref_domains'].astype(int)
target_ahrefs_clean_dtypes['dofollow_linked_domains'] = np.where(target_ahrefs_clean_dtypes['dofollow_linked_domains'] == '-',
                                                                 0, target_ahrefs_clean_dtypes['dofollow_linked_domains'])
target_ahrefs_clean_dtypes['dofollow_linked_domains'] = target_ahrefs_clean_dtypes['dofollow_linked_domains'].astype(int)
“First seen” tells us the date when the link was first found (i.e., discovered and then
added to the index of ahrefs). We’ll convert the string to a date format that Python can
process and then use this to derive the age of the links later on:
target_ahrefs_clean_dtypes['first_seen'] = pd.to_datetime(target_ahrefs_clean_dtypes['first_seen'], format='%d/%m/%Y %H:%M')
target_ahrefs_clean_dtypes['month_year'] = target_ahrefs_clean_dtypes['first_seen'].dt.to_period('M')
The link age is calculated by taking today's date and subtracting the first seen date. The difference is then converted to a number and divided by the number of nanoseconds in a day (pandas stores the difference in nanoseconds) to get the age in days:
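The calculation itself isn't reproduced here; a sketch that mirrors the competitor version shown later in the chapter:
target_ahrefs_clean_dtypes['link_age'] = datetime.datetime.now() - target_ahrefs_clean_dtypes['first_seen']
target_ahrefs_clean_dtypes['link_age'] = (target_ahrefs_clean_dtypes['link_age'].astype(int)/(3600 * 24 * 1000000000)).round(0)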
target_ahrefs_clean_dtypes
With the data types cleaned, and some new data features created (note columns
added earlier), the fun can begin.
target_ahrefs_analysis = target_ahrefs_clean_dtypes
target_ahrefs_analysis.describe()
So from the preceding table, we can see the count of referring domains (107), the average (mean) values, and the variation (the 25th percentile and so on).
The average domain rating (equivalent to Moz’s Domain Authority) of referring
domains is 27. Is that a good thing? In the absence of competitor data to compare in this
market sector, it’s hard to know, which is where your experience as an SEO practitioner
comes in. However, I’m certain we could all agree that it could be much higher – given
that it falls on a scale between 0 and 100. How much higher to make a shift is another
question.
The preceding table can be a bit dry and hard to visualize, so we’ll plot a histogram to
get more of an intuitive understanding of the referring domain authority:
dr_dist_plt = (
ggplot(target_ahrefs_analysis,
aes(x = 'dr')) +
geom_histogram(alpha = 0.6, fill = 'blue', bins = 100) +
scale_y_continuous() +
theme(legend_position = 'right'))
dr_dist_plt
The distribution is heavily skewed, showing that most of the referring domains have
an authority rating of zero (Figure 5-2). Beyond zero, the distribution looks fairly uniform
with an equal amount of domains across different levels of authority.
dr_firstseen_plt = (
ggplot(target_ahrefs_analysis, aes(x = 'first_seen', y = 'dr',
group = 1)) +
geom_line(alpha = 0.6, colour = 'blue', size = 2) +
labs(y = 'Domain Rating', x = 'Month Year') +
scale_y_continuous() +
scale_x_date() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation=90, hjust=1)
)
)
dr_firstseen_plt.save(filename = 'images/1_dr_firstseen_plt.png',
height=5, width=10, units = 'in', dpi=1000)
dr_firstseen_plt
The plot looks very noisy as you’d expect and only really shows you what the DR
(domain rating) of a referring domain was at a point in time (Figure 5-3). The utility of
this chart is that if you have a team tasked with acquiring links, you can monitor the link
quality over time in general.
dr_firstseen_smooth_plt = (
ggplot(target_ahrefs_analysis, aes(x = 'first_seen', y = 'dr',
group = 1)) +
geom_smooth(alpha = 0.6, colour = 'blue', size = 3, se = False) +
labs(y = 'Domain Rating', x = 'Month Year') +
scale_y_continuous() +
scale_x_date() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation=90, hjust=1)
))
dr_firstseen_smooth_plt.save(filename = 'images/1_dr_firstseen_smooth_plt.png', height=5, width=10, units = 'in', dpi=1000)
dr_firstseen_smooth_plt
The use of geom_smooth() gives a somewhat less noisy view and shows the
variability of the domain rating over time to show how consistent the quality is
(Figure 5-4). Again, this correlates to the quality of the links being acquired.
What this doesn’t quite describe is the overall site authority over time, because the
value of links acquired is retained over time; therefore, a different math approach is
required.
To see the site’s authority over time, we will calculate a running average of the
domain rating by month of the year. Note the use of the expanding() function which
instructs Pandas to include all previous rows with each new row:
target_rd_cummean_df = target_ahrefs_analysis
target_rd_mean_df = target_rd_cummean_df.groupby(['month_year'])['dr'].sum().reset_index()
target_rd_mean_df['dr_runavg'] = target_rd_mean_df['dr'].expanding().mean()
target_rd_mean_df.head(10)
We now have a table which we can use to feed the graph and visualize.
dr_cummean_smooth_plt = (
ggplot(target_rd_mean_df, aes(x = 'month_year', y = 'dr_runavg',
group = 1)) +
geom_line(alpha = 0.6, colour = 'blue', size = 2) +
#labs(y = 'GA Sessions', x = 'Date') +
scale_y_continuous() +
scale_x_date() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation=90, hjust=1)
))
dr_cummean_smooth_plt
So the target site started with high authority links (which may have been a PR campaign announcing the business brand), which faded soon after and stayed subdued for around four years, before being rebooted by the acquisition of new high authority links (Figure 5-5).
Most importantly, we can see the site’s general authority over time, which is how a
search engine like Google may see it too.
A really good extension to this analysis would be to regenerate the dataframe so that
we would plot the distribution over time on a cumulative basis. Then we could not only
see the median quality but also the variation over time too.
That’s the link quality, what about quantity?
target_count_cumsum_df = target_ahrefs_analysis
print(target_count_cumsum_df.columns)
target_count_cumsum_df = target_count_cumsum_df.groupby(['month_year'])['rd_count'].sum().reset_index()
target_count_cumsum_df['count_runsum'] = target_count_cumsum_df['rd_count'].expanding().sum()
target_count_cumsum_df['link_velocity'] = target_count_cumsum_df['rd_count'].diff()
target_count_cumsum_df
target_count_plt = (
ggplot(target_count_cumsum_df, aes(x = 'month_year', y = 'rd_count',
group = 1)) +
geom_line(alpha = 0.6, colour = 'blue', size = 2) +
labs(y = 'Count of Referring Domains', x = 'Month Year') +
scale_y_continuous() +
scale_x_date() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation=90, hjust=1)
))
target_count_plt.save(filename = 'images/3_target_count_plt.png',
height=5, width=10, units = 'in', dpi=1000)
target_count_plt
The monthly counts are informative, but perhaps not as useful for understanding how a search engine would view the overall number of referring domains a site has.
target_count_cumsum_plt = (
ggplot(target_count_cumsum_df, aes(x = 'month_year', y = 'count_runsum', group = 1)) +
geom_line(alpha = 0.6, colour = 'blue', size = 2) +
scale_y_continuous() +
scale_x_date() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation=90, hjust=1)
))
target_count_cumsum_plt
The cumulative view shows us the total number of referring domains (Figure 5-7).
Naturally, this isn’t the entirely accurate picture as some referring domains may have
been lost, but it’s good enough to get the gist of where the site is at.
We see that links were steadily added from 2017 for the next four years before
accelerating again around March 2021. This is consistent with what we have seen with
domain rating over time.
A useful extension to correlate that with performance may be to layer in
import os
import re
import time
import random
import pandas as pd
import numpy as np
import datetime as dt
from datetime import timedelta
from plotnine import *
import matplotlib.pyplot as plt
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
import uritools
pd.set_option('display.max_colwidth', None)
%matplotlib inline
root_domain = 'johnsankey.co.uk'
hostdomain = 'www.johnsankey.co.uk'
hostname = 'johnsankey'
full_domain = 'https://www.johnsankey.co.uk'
target_name = 'John Sankey'
ahrefs_path = 'data/'
The listdir() function from the OS module allows us to list all of the files in a
subdirectory:
ahrefs_filenames = os.listdir(ahrefs_path)
ahrefs_filenames
['www.davidsonlondon.com--refdomains-subdomain__2022-03-13_23-37-29.csv',
'www.stephenclasper.co.uk--refdomains-subdoma__2022-03-13_23-47-28.csv',
'www.touchedinteriors.co.uk--refdomains-subdo__2022-03-13_23-42-05.csv',
'www.lushinteriors.co--refdomains-subdomains__2022-03-13_23-44-34.csv',
'www.kassavello.com--refdomains-subdomains__2022-03-13_23-43-19.csv',
'www.tulipinterior.co.uk--refdomains-subdomai__2022-03-13_23-41-04.csv',
'www.tgosling.com--refdomains-subdomains__2022-03-13_23-38-44.csv',
'www.onlybespoke.com--refdomains-subdomains__2022-03-13_23-45-28.csv',
'www.williamgarvey.co.uk--refdomains-subdomai__2022-03-13_23-43-45.csv',
'www.hadleyrose.co.uk--refdomains-subdomains__2022-03-13_23-39-31.csv',
'www.davidlinley.com--refdomains-subdomains__2022-03-13_23-40-25.csv',
'johnsankey.co.uk-refdomains-subdomains__2022-03-18_15-15-47.csv']
With the files listed, we’ll now read each one individually using a for loop and add
these to a dataframe. While reading in the file, we’ll use some string manipulation to
create a new column with the site name of the data we’re importing:
ahrefs_df_lst = list()
ahrefs_colnames = list()
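The for loop itself isn't reproduced here; a sketch, assuming the site name is derived from the filename (the ahrefs_colnames list is kept purely for inspecting each file's columns):
for filename in ahrefs_filenames:
    df = pd.read_csv(ahrefs_path + filename)
    # Derive the site name from the filename, e.g. 'www.davidsonlondon.com'
    df['site'] = filename.split('--')[0].split('-refdomains')[0]
    ahrefs_df_lst.append(df)
    ahrefs_colnames.append(df.columns)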
comp_ahrefs_df_raw = pd.concat(ahrefs_df_lst)
comp_ahrefs_df_raw
Now we have the raw data from each site in a single dataframe, the next step is to
tidy up the column names and make them a bit more friendlier to work with. A custom
function could be used, but we’ll just chain the function calls with a list comprehension:
competitor_ahrefs_cleancols = comp_ahrefs_df_raw.copy()
competitor_ahrefs_cleancols.columns = [col.lower().replace(' ','_').
replace('.','_').replace('__','_').replace('(','')
.replace(')','').replace('%','')
for col in competitor_ahrefs_cleancols.columns]
Having a count column and a single value column (“project”) is useful for groupby
and aggregation operations:
competitor_ahrefs_cleancols['rd_count'] = 1
competitor_ahrefs_cleancols['project'] = target_name
competitor_ahrefs_cleancols
The columns are now cleaned up, so we’ll now clean up the row data:
competitor_ahrefs_clean_dtypes = competitor_ahrefs_cleancols
For referring domains, we’re replacing hyphens with zero and setting the data type as
an integer (i.e., whole number). This will be repeated for linked domains, also:
competitor_ahrefs_clean_dtypes['dofollow_ref_domains'] = np.where(competitor_ahrefs_clean_dtypes['dofollow_ref_domains'] == '-',
                                                                  0, competitor_ahrefs_clean_dtypes['dofollow_ref_domains'])
competitor_ahrefs_clean_dtypes['dofollow_ref_domains'] = competitor_ahrefs_clean_dtypes['dofollow_ref_domains'].astype(int)
# linked_domains
competitor_ahrefs_clean_dtypes['dofollow_linked_domains'] = np.where(competitor_ahrefs_clean_dtypes['dofollow_linked_domains'] == '-',
                                                                     0, competitor_ahrefs_clean_dtypes['dofollow_linked_domains'])
competitor_ahrefs_clean_dtypes['dofollow_linked_domains'] = competitor_ahrefs_clean_dtypes['dofollow_linked_domains'].astype(int)
First seen gives us a date point at which links were found, which we can use for
time series plotting and deriving the link age. We’ll convert to date format using the
to_datetime function:
competitor_ahrefs_clean_dtypes['first_seen'] = pd.to_datetime(competitor_ahrefs_clean_dtypes['first_seen'],
                                                              format='%d/%m/%Y %H:%M')
competitor_ahrefs_clean_dtypes['first_seen'] = competitor_ahrefs_clean_dtypes['first_seen'].dt.normalize()
competitor_ahrefs_clean_dtypes['month_year'] = competitor_ahrefs_clean_dtypes['first_seen'].dt.to_period('M')
To calculate the link_age, we’ll simply deduct the first seen date from today’s date
and convert the difference into a number:
competitor_ahrefs_clean_dtypes['link_age'] = dt.datetime.now() - competitor_ahrefs_clean_dtypes['first_seen']
competitor_ahrefs_clean_dtypes['link_age'] = competitor_ahrefs_clean_dtypes['link_age'].astype(int)
competitor_ahrefs_clean_dtypes['link_age'] = (competitor_ahrefs_clean_dtypes['link_age']/(3600 * 24 * 1000000000)).round(0)
The target column helps us distinguish the “client” site vs. competitors, which is
useful for visualization later:
competitor_ahrefs_clean_dtypes['target'] = np.where(competitor_ahrefs_clean_dtypes['site'].str.contains('johns'),
                                                    1, 0)
competitor_ahrefs_clean_dtypes['target'] = competitor_ahrefs_clean_dtypes['target'].astype('category')
competitor_ahrefs_clean_dtypes
Now that the data is cleaned up both in terms of column titles and row values, we’re
ready to set forth and start analyzing.
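The aggregations that follow use competitor_ahrefs_analysis. This is assumed to be simply the cleaned dataframe under its analysis-stage name, an assignment that also appears a little later in the chapter:
competitor_ahrefs_analysis = competitor_ahrefs_clean_dtypes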
competitor_ahrefs_aggs = competitor_ahrefs_analysis.groupby('site').agg({'link_age': 'mean', 'dofollow_links': 'mean',
    'domain': 'count', 'dr': 'mean', 'dofollow_ref_domains': 'mean', 'traffic_': 'mean',
    'dofollow_linked_domains': 'mean', 'links_to_target': 'mean', 'new_links': 'mean',
    'lost_links': 'mean'}).reset_index()
competitor_ahrefs_aggs
The resulting table shows us aggregated statistics for each of the link features. Next, read in the SEMRush domain-level visibility data (which, since there are only a dozen sites, was literally typed in by hand):
semrush_viz = [10100, 2300, 931, 2400, 911, 2100, 1800, 136, 838, 428,
1100, 1700]
competitor_ahrefs_aggs['semrush_viz'] = semrush_viz
competitor_ahrefs_aggs
The SEMRush visibility data has now been appended, so we're ready to compute R-squared, known as the coefficient of determination, which will tell us which link feature best explains the variation in SEMRush visibility:
competitor_ahrefs_r2 = competitor_ahrefs_aggs.corr() ** 2
competitor_ahrefs_r2 = competitor_ahrefs_r2[['semrush_viz']].reset_index()
competitor_ahrefs_r2 = competitor_ahrefs_r2.sort_values('semrush_viz',
ascending = False)
competitor_ahrefs_r2
Naturally, we'd expect semrush_viz to correlate perfectly with itself. DR (domain rating) surprisingly doesn't explain the variation in SEMRush visibility very well, with an R-squared of 21%.
On the other hand, "traffic_", which is the referring domain's traffic value, correlates much better. On this basis alone, we're prepared to disregard "dr." Let's inspect this visually:
comp_correl_trafficviz_plt = (
    ggplot(competitor_ahrefs_aggs,
           aes(x = 'traffic_', y = 'semrush_viz')) +
    geom_point(alpha = 0.4, colour = 'blue', size = 2))
comp_correl_trafficviz_plt.save(filename = 'images/2_comp_correl_trafficviz_plt.png',
                                height=5, width=10, units = 'in', dpi=1000)
comp_correl_trafficviz_plt
This is not terribly convincing (Figure 5-8), due to the lack of data points with referring domain traffic beyond 2,000,000. Does this mean we should disregard traffic_ as a measure?
Figure 5-8. Scatterplot of the SEMRush visibility (semrush_viz) vs. the total
AHREFs backlink traffic (traffic_) of the site’s backlinks
Not necessarily. The outlier data point with 10,000 visibility isn’t necessarily
incorrect. The site does have superior visibility and more referring traffic in the real
world, so it doesn’t mean the site’s data should be removed.
If anything, more data should be gathered with more domains in the same sector.
Alternatively, pursuing a more thorough treatment would involve obtaining SEMRush
visibility data at the page level and correlating this with page-level link feature metrics.
Going forward, we will use traffic_ as our measure of quality.
Link Quality
We start with link quality, which we've very recently discovered should be measured by "traffic_" as opposed to the industry-accepted domain rating.
Let’s start by inspecting the distributive properties of each link feature using the
describe() function:
competitor_ahrefs_analysis = competitor_ahrefs_clean_dtypes
competitor_ahrefs_analysis[['traffic_']].describe()
The resulting table shows some basic statistics including the mean, standard
deviation (std), and interquartile metrics (25th, 50th, and 75th percentiles), which give
you a good idea of where most referring domains fall in terms of referring domain traffic.
comp_dr_dist_box_plt = (
    ggplot(competitor_ahrefs_analysis,  # .loc[competitor_ahrefs_analysis['dr'] > 0],
           aes(x = 'reorder(site, traffic_)', y = 'traffic_',
               colour = 'target')) +
geom_boxplot(alpha = 0.6) +
scale_y_log10() +
theme(legend_position = 'none',
axis_text_x=element_text(rotation=90, hjust=1)
))
comp_dr_dist_box_plt.save(filename = 'images/4_comp_traffic_dist_box_plt.png',
                          height=5, width=10, units = 'in', dpi=1000)
comp_dr_dist_box_plt
The interquartile range is the range of data between its 25th percentile and 75th
percentile. The purpose is to tell us
• How much of the data is away from the median (the center)
In this case, the IQR is quantifying how much traffic each site’s referring domains get
and its variability.
We also see that "John Sankey" has the third highest median referring domain traffic, which compares well in terms of link quality against its competitors. The size of its box (its IQR) is not the longest, so it is reasonably consistent around its median, though not as tight as Stephen Clasper's, which is more consistent, has a higher median, and has more backlinks from referring domains above that median.
"Touched Interiors" has the most diverse range of referring domain traffic compared with the other domains, which could indicate an ever so slightly more relaxed criterion for link acquisition. Or is it the case that, as a brand becomes more well known and visible online, it naturally attracts more links from zero traffic referring domains? Maybe both.
Let’s plot the domain quality over time for each competitor:
comp_traf_timeseries_plt = (
ggplot(competitor_ahrefs_analysis,
aes(x = 'first_seen', y = 'traffic_',
group = 'site', colour = 'site')) +
geom_smooth(alpha = 0.4, size = 2, se = False,
method='loess'
) +
scale_x_date() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation=90, hjust=1)
)
)
comp_traf_timeseries_plt.save(filename = 'images/4_comp_traffic_timeseries_plt.png',
                              height=5, width=10, units = 'in', dpi=1000)
comp_traf_timeseries_plt
Figure 5-10. Time series plot showing the amount of traffic each referring domain
has over time for each website
The remaining sites are more or less flat in terms of their link acquisition
performance. David Linley started big, then dive-bombed in terms of link quality before
improving again in 2020 and 2021.
Now that we have some concept of how the different sites perform, what we really
want is a cumulative link quality by month_year as this is likely to be additive in the way
search engines evaluate the authority of websites.
We’ll use our trusted groupby() and expanding().mean() functions to compute the
cumulative stats we want:
competitor_traffic_cummean_df = competitor_ahrefs_analysis.copy()
competitor_traffic_cummean_df = competitor_traffic_cummean_df.groupby(['site', 'month_year'])['traffic_'].sum().reset_index()
competitor_traffic_cummean_df['traffic_runavg'] = competitor_traffic_cummean_df['traffic_'].expanding().mean()
competitor_traffic_cummean_df
Scientific formatted numbers aren’t terribly helpful, nor is a table for that matter, but
at least the dataframe is in a ready format to power the following chart:
competitor_traffic_cummean_plt = (
ggplot(competitor_traffic_cummean_df, aes(x = 'month_year', y =
'traffic_runavg', group = 'site', colour = 'site')) +
geom_line(alpha = 0.6, size = 2) +
labs(y = 'Cumu Avg of traffic_', x = 'Month Year') +
scale_y_continuous() +
scale_x_date() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation=90, hjust=1)
))
competitor_traffic_cummean_plt.save(filename = 'images/4_competitor_traffic_cummean_plt.png', height=5, width=10, units = 'in', dpi=1000)
competitor_traffic_cummean_plt
The code is color coding the sites to make it easier to see which site is which.
So as we might expect, David Linley’s link acquisition team has done well as their
authority has made leaps and bounds over all of the competitors over time (Figure 5-11).
Figure 5-11. Time series plot of the cumulative average backlink traffic for
each website
All of the other competitors have pretty much flatlined. This is reflected in David
Linley’s superior SEMRush visibility (Figure 5-12).
Figure 5-12. Column chart showing the SEMRush visibility for each website
What can we learn? So far in our limited data research, we can see that slow and
steady does not win the day. By contrast, sites need to be going after links from high
traffic sites in a big way.
Link Volumes
That’s quality analyzed; what about the volume of links from referring domains?
Our approach will be to compute a cumulative sum of referring domains using the
groupby() function:
competitor_count_cumsum_df = competitor_ahrefs_analysis
competitor_count_cumsum_df = competitor_count_cumsum_df.groupby(['site',
'month_year'])['rd_count'].sum().reset_index()
The expanding function allows the calculation window to grow with the number of
rows, which is how we achieve our cumulative sum:
competitor_count_cumsum_df['count_runsum'] = competitor_count_cumsum_df['rd_count'].expanding().sum()
competitor_count_cumsum_df
The result is a dataframe with the site, month_year, and count_runsum (the running
sum), which is in the perfect format to feed the graph – which we will now run as follows:
competitor_count_cumsum_plt = (
ggplot(competitor_count_cumsum_df, aes(x = 'month_year', y =
'count_runsum',
group = 'site', colour = 'site')) +
geom_line(alpha = 0.6, size = 2) +
labs(y = 'Running Sum of Referring Domains', x = 'Month Year') +
scale_y_continuous() +
scale_x_date() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation=90, hjust=1)
))
competitor_count_cumsum_plt.save(filename = 'images/5_count_cumsum_smooth_plt.png', height=5, width=10, units = 'in', dpi=1000)
competitor_count_cumsum_plt
Figure 5-13. Time series plot of the running sum of referring domains for
each website
For example, William Garvey started with over 5000 domains. I’d love to know who
their digital PR team is.
We can also see the rate of growth, for example, although Hadley Rose started link
acquisition in 2018, things really took off around mid-2021.
Link Velocity
Let’s take a look at link velocity:
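The link_velocity column for the competitor dataframe isn't shown being computed here; a sketch, assuming it is the month-on-month difference in referring domain counts per site:
competitor_count_cumsum_df['link_velocity'] = competitor_count_cumsum_df.groupby('site')['rd_count'].diff()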
competitor_velocity_cumsum_plt = (
    ggplot(competitor_count_cumsum_df, aes(x = 'month_year', y = 'link_velocity',
group = 'site', colour = 'site')) +
geom_line(alpha = 0.6, size = 2) +
labs(y = 'Running Sum of Referring Domains', x = 'Month Year') +
scale_y_log10() +
scale_x_date() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation=90, hjust=1)
))
competitor_velocity_cumsum_plt.save(filename = 'images/5_competitor_velocity_cumsum_plt.png',
                                    height=5, width=10, units = 'in', dpi=1000)
competitor_velocity_cumsum_plt
The view shows the relative speed at which the sites are acquiring links (Figure 5-14).
This is an unusual but useful view as for any given month you can see which site is
acquiring the most links by virtue of the height of their lines.
Figure 5-14. Time series plot showing the link velocity of each website
David Linley was winning the contest throughout the years until Hadley Rose
came along.
Link Capital
Like most things that are measured in life, the ultimate value is determined by the product of quality and volume. So we will apply the same principle to determine the overall value of a site's authority and call it "link capital."
We’ll start by merging the running average stats for both link volume and average
traffic (as our measure of authority):
competitor_capital_cumu_df = competitor_count_cumsum_df.merge(competitor_traffic_cummean_df,
                                                              on = ['site', 'month_year'], how = 'left')
competitor_capital_cumu_df['auth_cap'] = (competitor_capital_cumu_df['count_runsum'] *
                                          competitor_capital_cumu_df['traffic_runavg']).round(1)*0.001
competitor_capital_cumu_df['auth_velocity'] = competitor_capital_cumu_df['auth_cap'].diff()
competitor_capital_cumu_df
The merged table is produced with new columns auth_cap (measuring overall
authority) and auth_velocity (the rate at which authority is being added).
Let’s see how the competitors compare in terms of total authority over time in
Figure 5-15.
Figure 5-15. Time series plot of authority capital over time by website
The plot shows the link capital of the sites over time. What's quite interesting is how Hadley Rose emerged as the most authoritative: it has the third most consistently high trafficked backlinking sites, combined with a ramp-up in volume in under a year. This has allowed it to overtake all of its competitors in the same period, adding volume while maintaining quality.
What about the velocity in which authority has been added? In the following, we’ll
plot the authority velocity over time for each website:
competitor_capital_veloc_plt = (
ggplot(competitor_capital_cumu_df, aes(x = 'month_year', y =
'auth_velocity',
group = 'site', colour = 'site')) +
geom_line(alpha = 0.6, size = 2) +
labs(y = 'Authority Capital', x = 'Month Year') +
scale_y_continuous() +
scale_x_date() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation=90, hjust=1)
))
competitor_capital_veloc_plt.save(filename = 'images/6_auth_veloc_smooth_plt.png',
                                  height=5, width=10, units = 'in', dpi=1000)
competitor_capital_veloc_plt
The only standouts are David Linley and Hadley Rose (Figure 5-16). Should David Linley maintain the quality and the velocity of its link acquisition program,
we're in no doubt that it will catch up with and even surpass Hadley Rose, all other things being equal.
To find these hub or "power" domains, we'll group the referring domains and their traffic levels to calculate the number of sites each one links to:
power_doms_strata = competitor_ahrefs_analysis.groupby(['domain',
'traffic_']).agg({'rd_count': 'count'})
power_doms_strata = power_doms_strata.reset_index().sort_values('traffic_',
ascending = False)
A referring domain can only be considered a hub or power domain if it links to more
than two domains, so we’ll filter out those that don’t meet the criteria. Why three or
more? Because one is random, two is a coincidence, and three is directed.
power_doms_strata = power_doms_strata.loc[power_doms_strata['rd_count'] > 2]
power_doms_strata
The table shows referring domains, their traffic, and the number of (our furniture)
sites that these backlinking domains are linking to.
Being data driven, we’re not satisfied with a list, so we’ll use statistics to help
understand the distribution of power before filtering the list further:
pd.set_option('display.float_format', str)
power_doms_stats = power_doms_strata.describe()
power_doms_stats
We see the distribution is heavily positively skewed where most of the highly
trafficked referring domains are in the 75th percentile or higher. Those are the ones we
want. Let’s visualize:
power_doms_stats_plt = (
ggplot(power_doms_strata, aes(x = 'traffic_')) +
geom_histogram(alpha = 0.6, binwidth = 10) +
labs(y = 'Power Domains Count', x = 'traffic_') +
scale_y_continuous() +
theme(legend_position = 'right',
axis_text_x=element_text(rotation = 90, hjust=1)
))
power_doms_stats_plt.save(filename = 'images/7_power_doms_stats_plt.png',
height=5, width=10, units = 'in', dpi=1000)
power_doms_stats_plt
Although we’re interested in hubs, we’re sorting the dataframe by traffic as these
have the most authority:
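The construction of power_doms isn't reproduced here; a minimal sketch, sorting the hub domains by traffic (optionally also filtering to the 75th percentile identified above):
power_doms = power_doms_strata.sort_values('traffic_', ascending = False)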
power_doms
By far the most powerful is the Daily Mail, so in this case start budgeting for a good digital PR consultant or full-time employee. There are also other publisher sites like the Evening Standard (standard.co.uk) and The Times.
Some links are easier and quicker to get, such as the Yell.com and Thomson Local directories.
Then there are more market-specific publishers such as the Ideal Home, Homes and
Gardens, Livingetc, and House and Garden.
• Checking the relevance of the backlink page (or home page) to see if
it impacts visibility and filtering for relevance
Taking It Further
Of course, the preceding discussion is just the tip of the iceberg; it's a simple exploration of a single site and its competitors, so it's difficult to infer anything universal about improving rankings in competitive search spaces.
The following are some areas for further data exploration and analysis:
• Adding search volume data on the hostnames to see how many brand
searches the referring domains receive as an alternative measure of
authority
• Content relevance
Naturally, the preceding ideas aren’t exhaustive. Some modeling extensions would
require an application of the machine learning techniques outlined in Chapter 6.
Summary
Backlinks, the expression of website authority for search engines, are incredibly influential on search result positions for any website. In this chapter, you have learned how to measure and compare link quality, link volumes, link velocity, and overall link capital against competitors, and how to identify the power domains worth targeting.
In the next chapter, we will use data science to analyze keyword search result
competitors.