Keyword Clustering
Introduction
Clustering of SEO keywords has been a part of research and content planning approaches for many years, often incorporating sentiment analysis and a couple of other fun areas to help teams really target their content.
Requirements
Some keyword data in CSV format: It doesn’t need to be a lot, and it doesn’t really matter where you got it from; you’ll just need to be aware of the column headers and edit the code accordingly. In this case, I used my Google Search Console data.
R: The open-source statistical language.
RStudio: The best IDE for R and a lot of other languages too.
The TM package for R: Packages are like plugins for the language which contain
a lot of pre-built functions for specific tasks. The TM package is the best for text
mining, which we’ll need to do during this process.
The Tidyverse package for R: The Tidyverse package is the definitive collection
of other packages to make working with data and visualizing it a lot more effective
and actually fun.
The first thing to be done when working with any data in R is to read the data into our environment. After you’ve created your RStudio project, get your dataset in CSV format into your working directory and use the following command:
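Something along these lines should do it – the file name queries.csv is just an example, and the column names used later assume a Google Search Console export, so adjust both to match your own file:

# Read the keyword export into a data frame (file name is illustrative)
queryData <- read.csv("queries.csv", stringsAsFactors = FALSE)
# Check the column headers so you can adapt the later code to match
head(queryData)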
Clustering is primarily a numerical function, so we’re going to need to make our text workable in
a numerical world. The way we’ll do this is to turn our keywords into a Document Term
Matrix using the Corpus function, and then we’ll clean that corpus up in line with best practice
for text analysis.
Firstly, after reading in our data, we want to make sure that we’ve got our packages installed and
live in our R environment. You can do this with the following commands:
install.packages("tm")
install.packages("wordcloud")
install.packages("tidyverse")
library(tm)
library(wordcloud)
library(tidyverse)
If you want to cut down on the amount of code you’re using when installing packages, you can
use the combine and lapply functions like so:
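A minimal sketch of that shortcut, using the same three packages as above:

# Install and load everything in one pass
packages <- c("tm", "wordcloud", "tidyverse")
install.packages(packages)
lapply(packages, library, character.only = TRUE)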
Cleaning a Corpus
When you’re working with text data in R, there are a few steps that you should take as standard,
in order to ensure that you’re working with the most important words and also eliminating
possible duplication due to capitalisation and punctuation.
Change all text to lower case: This brings consistency, rather than including
duplicates caused by capitalisation
Turn it to a plain-text document: This eliminates any possible rogue characters
Remove punctuation: Eliminates duplication caused by punctuation
Stem your words: By cutting extensions from the words in your corpus, you
eliminate the duplication caused by adding “S” to some words, for example
In a lot of cases, I’d usually remove stopwords (terms such as “And”), but in this case, it’s worth
keeping them since we’re going to be working with full queries rather than fragments.
Functions in R
There are many ways to make particular commands or pieces of code reproducible. Rather than
entering a series of commands every time you need to use them, you can just wrap them into a
function and use that function every time. They’re a time-saver and a great way to ensure that
your code is easier to work with, as well as easier to share.
Based on the criteria above, here’s an R function to help you clean your keyword corpus up to
prepare it for clustering your keywords:
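The sketch below is one way of doing it – the name cleanCorpus is my own, it leans on the tm package, and stemming needs the SnowballC package installed. It works through the steps above while deliberately leaving stopwords alone:

cleanCorpus <- function(corpus){
  # Lower-case everything to avoid duplicates caused by capitalisation
  corpus <- tm_map(corpus, content_transformer(tolower))
  # A corpus built from a character vector is already plain text,
  # so no explicit plain-text conversion is needed here
  # Remove punctuation to avoid duplication
  corpus <- tm_map(corpus, removePunctuation)
  # Stem words to collapse plurals and other extensions (needs SnowballC)
  corpus <- tm_map(corpus, stemDocument)
  return(corpus)
}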
Paste this function into your R console and then we need to actually run it on our corpus. This is
really easy:
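Assuming your keyword column is called Query (swap in your own header), something like this builds the corpus, cleans it and converts it into the Document Term Matrix described earlier:

# Build a corpus from the keyword column and clean it with our function
queryCorpus <- Corpus(VectorSource(queryData$Query))
queryCorpus <- cleanCorpus(queryCorpus)
# Turn the cleaned corpus into a Document Term Matrix, then a plain matrix
queryDTM <- DocumentTermMatrix(queryCorpus)
queryMatrix <- as.matrix(queryDTM)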
There are loads of clustering models out there that can be used for keywords – some work better
than others, but the best one to start with is always the tried and trusted k-means clustering. Once
you’ve done a little bit of work with this, you can explore other models, but for me, this is the
best place to start.
K-means clustering is one of the most popular unsupervised machine learning models and it
works by calculating the distance between different numerical vectors and grouping them
accordingly.
Now that we’ve cleaned our corpus and gotten it into a state where the k-means algorithm can tokenise the terms and match them up to numeric values, we’re ready to start clustering.
One of the biggest errors seen when people try to apply data analytics techniques to digital
marketing and SEO is that they never actually make the analysis useful, they just make a pretty
graph for a pitch and the actual output is never usable. That’s certainly a possible failing of topic
and keyword clustering if you’re not smart about it, which is why we’re going to run through
how to identify the optimal number of clusters.
There are lots of different ways that you can identify the optimal number of clusters – I’m partial to the Bayesian information criterion, myself, although good luck getting that to run quickly in R.
Since today we’re just doing an introduction, I’m going to take you through the most commonly-
used way to identify the best number of topic clusters for your keywords: the Elbow Method.
The Elbow Method is probably the easiest way to find the optimal number of clusters (or k), and
it’s certainly the fastest way to process it in R, but that still doesn’t mean it’s particularly quick.
Essentially, the Elbow Method computes the variance between the different terms and sees how
many different clusters these could be put in up to the point that adding another cluster doesn’t
provide better modelling of the data. In other words, we use this model to identify the point at
which adding extra clusters becomes a waste of time. After all, if this approach doesn’t become
efficient, no one will use it.
But how do we make the Elbow Method work? How do we use it to identify our target number
of clusters, our k? The easiest way to do that is to visualize our clusters and take a judgement
from there, which is why we installed the Tidyverse (and with it ggplot2) earlier.
Firstly, we need to create an empty data frame to put our cluster information into. That’s easy,
we’ll just use the following command:
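Here’s a sketch of that frame, plus the loop that fills it – it runs k-means for a range of candidate k values and records the within-cluster sum of squares for each. The 1:20 range, the nstart value and the object names are all illustrative, and queryMatrix is the matrix built earlier:

# Empty data frame to hold the fit for each candidate number of clusters
clusterData <- data.frame(k = integer(), withinss = numeric())
# Run k-means for each k and record the total within-cluster sum of squares
for(i in 1:20){
  kmeansResult <- kmeans(queryMatrix, centers = i, nstart = 10)
  clusterData <- rbind(clusterData, data.frame(k = i, withinss = kmeansResult$tot.withinss))
}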
It may take a while to run this if you’ve got quite a lot of keywords, but once it’s done, paste the
following:
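A minimal ggplot2 sketch of the elbow chart, built from the clusterData frame above:

# Plot k against the within-cluster sum of squares and look for the elbow
ggplot(clusterData, aes(x = k, y = withinss)) +
  geom_line() +
  geom_point()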
There’s still a certain amount of manual work involved in this keyword & topic clustering, and a
big chunk of that is around finding k and then identifying what the clusters actually are.
Fortunately, it’s not actually a lot of manual work and it will really help with your SEO and
content targeting, so it’s really worth taking the time.
Now we’ve got our graph, we need to use a bit of human intelligence to identify our number of
clusters. It’s not perfect, and that’s why other models exist, but I hope this is a good start for you.
When we look at these charts with the Elbow Method, we’re looking for the point where the sharp drop in the curve levels off: the point at which additional clusters become less useful. Looking at the chart below, built from my dataset, we can see that seven clusters is the point at which we should stop adding extra clusters.
Now that we’ve identified k and broken our terms out into clusters, we need to match them back to our original dataset and name our topics.
From the piece of analysis above, we can see that the optimal number of clusters on this dataset,
our k, is seven. Now we need to run the following commands to match them back to our original
dataset:
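A sketch of those commands, reusing the object names from earlier (the nstart value is illustrative):

# Re-run k-means with our chosen k of seven
finalClusters <- kmeans(queryMatrix, centers = 7, nstart = 25)
# Pair each original keyword with its assigned cluster
queryClusters <- data.frame(Query = queryData$Query, Cluster = finalClusters$cluster)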
Now your original keywords are in a data frame with their assigned cluster in the next column
and the columns are named “Query” and “Cluster”, keeping them consistent with our main
dataset.
From here, we’ll want to get those clusters assigned to our main dataset. Dplyr from the
Tidyverse has a really handy left_join function which works a bit like Excel’s Index Match and
will let you match this easily.
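For example, assuming the frames above:

# Join the cluster assignments onto the main dataset by keyword
queryData <- left_join(queryData, queryClusters, by = "Query")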
The easiest way to get to grips with what’s contained in each cluster is to subset and explore
accordingly.
Subsetting is one of the most essential elements of working with large datasets, so it’s definitely
worth getting to grips with. Fortunately, R has a number of base functions to let you do just that.
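For example, to pull the first cluster out into its own data frame (the clusterOne name is just illustrative, and it’s the one used below):

# Keep only the rows assigned to cluster one
clusterOne <- subset(queryData, Cluster == 1)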
Let’s explore it a little. First, we want to see how many observations (keywords, in this case) we have in this cluster. Nice and easy – in fact, in RStudio, you can see this at a glance in the Data pane.
But let’s do it with some code anyway. The nrow function in base R will do that for you.
nrow(clusterOne)
How about if we want to see the number of impressions and clicks from that cluster? We can use
the Tidyverse’s summarise function for that:
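A quick sketch, assuming your columns are named Impressions and Clicks as in a Search Console export:

# Total impressions and clicks for the keywords in cluster one
clusterOne %>%
  summarise(Impressions = sum(Impressions), Clicks = sum(Clicks))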
Exploring your data in subsets will give you a much greater understanding of what each cluster
contains, so it’s well worth doing.
Wordclouds are something that always tend to go over well in client presentations, and, although
a lot of designers hate them, they’re often a fantastic way to see the most common keywords and
terms in your dataset. By doing this by cluster, we’ve got a great way to dig into what each
cluster is discussing.
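A minimal sketch using the wordcloud package on a single cluster – the max.words value is just a sensible default:

# Most common terms in cluster one, sized by frequency
wordcloud(clusterOne$Query, max.words = 50)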
The final idea we’re going to run through is to plot a combo chart where we look at impressions and
clicks by cluster. The analyst in me is not a big fan of combo charts since they’re rather flawed,
but they are a great way to quickly identify opportunities to improve SEO performance in your
keyword clusters. I’m also not very good at making R graphs look pretty, so sorry for that!
Again, we can adapt this to look at a wide variety of different metrics, and using the ggplot2 package from the Tidyverse gives us plenty of graphical avenues to explore.
Firstly, we want to create a dataframe containing a summary of the dataset. You can do that like
so with a Dplyr function:
querySummary <- queryData %>%
  group_by(Cluster) %>%
  # Column names are assumed from a Search Console export – adjust to your own headers
  summarise(Impressions = sum(Impressions), Clicks = sum(Clicks),
            Avg.Position = mean(Position), Avg.CTR = mean(CTR))
Now we have a frame which contains all our keyword clusters with the total impressions, total
clicks and the average position and CTR in place, which will be useful for a wide variety of
visualisations, not just this example.
The code below will create a column graph with the impressions by cluster and the clicks on a
line chart with a secondary axis scaled by 20 to accommodate the differences between the two
variables. You can absolutely feel free to change the colours to make it look better.
ggplot(querySummary) +
  geom_col(aes(x = Cluster, y = Impressions), fill = "steelblue") +
  # Clicks scaled up by 20 to share the axis with impressions
  geom_line(aes(x = Cluster, y = Clicks * 20), colour = "red") +
  scale_y_continuous(sec.axis = sec_axis(~ . / 20, name = "Clicks"))
The Full Keyword And Topic Clustering For SEO R Script
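Here’s the whole process in one place, condensed from the steps above. As ever, the object and column names (queries.csv, Query, Impressions, Clicks, Position, CTR) assume a Google Search Console export, so tweak them to match your own data.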
## Install Packages
install.packages(c("tm", "wordcloud", "tidyverse"))
lapply(c("tm", "wordcloud", "tidyverse"), library, character.only = TRUE)
options(warn = -1); set.seed(12)
memory.limit(1600000000) # Windows-only; safe to skip on other systems
## Read Data
queryData <- read.csv("queries.csv", stringsAsFactors = FALSE)
## Prepare Text
queryCorpus <- cleanCorpus(Corpus(VectorSource(queryData$Query))) # cleanCorpus defined above
queryMatrix <- as.matrix(DocumentTermMatrix(queryCorpus))
## Identify Clusters
clusterData <- data.frame(k = integer(), withinss = numeric())
for(i in 1:100){
  clusterData <- rbind(clusterData, data.frame(k = i, withinss = kmeans(queryMatrix, centers = i, nstart = 10)$tot.withinss))
}
ggplot(clusterData, aes(x = k, y = withinss)) + geom_line() + geom_point()
finalClusters <- kmeans(queryMatrix, centers = 7, nstart = 25) # k = 7 from the elbow chart
queryData$Cluster <- finalClusters$cluster
## Wordclouds
wordcloud(subset(queryData, Cluster == 1)$Query, max.words = 50)
querySummary <- queryData %>% group_by(Cluster) %>% summarise(Impressions = sum(Impressions), Clicks = sum(Clicks), Avg.Position = mean(Position), Avg.CTR = mean(CTR))
ggplot(querySummary) + geom_col(aes(Cluster, Impressions)) + geom_line(aes(Cluster, Clicks * 20)) + scale_y_continuous(sec.axis = sec_axis(~ . / 20, name = "Clicks"))