
AUTOMATIC KNOWLEDGE CLASSIFICATION

INTRODUCTION

Keyword clustering for SEO has been used in research and content planning approaches for many
years, often alongside sentiment analysis and a few other related techniques, to help teams
target their content more effectively.

Requirements

 Some keyword data in CSV format: It doesn't need to be a lot, and it doesn't really
matter where you got it from; you'll just need to be aware of the column headers
and edit the code accordingly. In this case, we used Google Search Console data.
 R: The open-source statistical language.
 RStudio: The best IDE for R, and for a lot of other languages too.
 The TM package for R: Packages are like plugins for the language which contain
a lot of pre-built functions for specific tasks. The TM package is the best for text
mining, which we'll need to do during this process.
 The Tidyverse package for R: The definitive collection of packages that makes
working with data and visualising it a lot more effective, and actually fun.

That’s it. You have spent precisely zero pennies to do this!

Read In Your Data

The first thing to do when working with any data in R is to read the data into our
environment. After you've created your RStudio project, put your dataset, in CSV format, into
your working directory and use the following command:

queries <- read.csv("Queries.csv", stringsAsFactors = FALSE)


Change "Queries.csv" to whatever you've called the file with your keyword data.
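
If you're not sure what your export contains, it's worth a quick check that the column headers match what the rest of the code expects. A couple of optional base R commands, assuming a standard Search Console export with Query, Clicks, Impressions, CTR and Position columns:

str(queries)    # column names and types
head(queries)   # the first few rows of the keyword data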

Preparing Your Text Data For Clustering

Clustering is primarily a numerical function, so we're going to need to make our text workable in
a numerical world. The way we'll do this is to turn our keywords into a corpus using the Corpus
function, clean that corpus up in line with best practice for text analysis, and then convert it
into a Document Term Matrix.

Firstly, after reading in our data, we want to make sure that we’ve got our packages installed and
live in our R environment. You can do this with the following commands:
install.packages("tm")
install.packages("wordcloud")
install.packages("tidyverse")
library(tm)
library(wordcloud)
library(tidyverse)
If you want to cut down on the amount of code you're using when loading packages, you can
use the combine and lapply functions like so:

instPacks <- c("tidyverse", "tm", "wordcloud")


lapply(instPacks, require, character.only = TRUE)
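
One caveat: require only loads packages that are already installed; it won't install anything for you. If you want the same shorthand approach to cover installation as well, a minimal sketch that checks for anything missing before the lapply call above could look like this:

# Install any of the packages in instPacks that aren't already installed
newPacks <- instPacks[!instPacks %in% installed.packages()[, "Package"]]
if(length(newPacks) > 0) install.packages(newPacks)
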
Now we've got our packages in place, we need to create that corpus from our text. We do that
with the following command:

dtm <- Corpus(VectorSource(queries$Query))


By doing this, we've turned our Search Console queries into a corpus in our R environment, and
it's ready to work with. Again, adapt "queries$Query" to match whatever your keyword column is
called.

Cleaning a Corpus

When you’re working with text data in R, there are a few steps that you should take as standard,
in order to ensure that you’re working with the most important words and also eliminating
possible duplication due to capitalisation and punctuation.

I generally recommend doing the following to every corpus:

 Change all text to lower case: This brings consistency, rather than including
duplicates caused by capitalisation
 Turn it to a plain-text document: This eliminates any possible rogue characters
 Remove punctuation: Eliminates duplication caused by punctuation
 Stem your words: By cutting extensions from the words in your corpus, you
eliminate the duplication caused by adding “S” to some words, for example

In a lot of cases, I’d usually remove stopwords (terms such as “And”), but in this case, it’s worth
keeping them since we’re going to be working with full queries rather than fragments.
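
If you do want to strip stopwords for another project, the TM package's removeWords transformation and built-in English stopword list will handle it. A quick example, not used in this walkthrough:

noStopwords <- tm_map(dtm, removeWords, stopwords("english"))   # optional: drop common English stopwords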

Functions in R 
There are many ways to make particular commands or pieces of code reproducible. Rather than
entering a series of commands every time you need to use them, you can just wrap them into a
function and use that function every time. They’re a time-saver and a great way to ensure that
your code is easier to work with, as well as easier to share.

Based on the criteria above, here’s an R function to help you clean your keyword corpus up to
prepare it for clustering your keywords:

A Corpus-Cleaning Function For R

Here’s how to clean that corpus up.

corpusClean <- function(x){
  lowCase <- tm_map(x, tolower)                     # lower-case everything
  plainText <- tm_map(lowCase, PlainTextDocument)   # coerce to plain-text documents
  remPunc <- tm_map(plainText, removePunctuation)   # strip punctuation
  stemDoc <- tm_map(remPunc, stemDocument)          # stem the terms
  output <- DocumentTermMatrix(stemDoc)             # convert to a document term matrix
  output
}
Here, we're using the TM package's built-in functionality to transform our corpus in the ways
described above. Again, I'm not going to go into the minutiae of how this works – I don't think
I've got another 11k+ word post in me today – but I hope the notation is reasonably clear.

Paste this function into your R console and then we need to actually run it on our corpus. This is
really easy:

corpusCleaned <- corpusClean(dtm)


We’ve created a new variable called corpusCleaned and run our function on the previous dtm
variable, so we have a cleaned-up version of our original corpus. Now we’re ready to start
clustering them into topics.
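
Before clustering, it can be worth a quick look at what the cleaned-up document term matrix actually contains. A couple of optional checks using base R and TM's helpers (the frequency threshold of 10 is arbitrary):

dim(corpusCleaned)                           # number of documents and terms
findFreqTerms(corpusCleaned, lowfreq = 10)   # terms that appear at least ten times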

Using K-Means Clustering For Keywords In R

There are loads of clustering models out there that can be used for keywords – some work better
than others, but the best one to start with is always the tried and trusted k-means clustering. Once
you’ve done a little bit of work with this, you can explore other models, but for me, this is the
best place to start.

K-means clustering is one of the most popular unsupervised machine learning models and it
works by calculating the distance between different numerical vectors and grouping them
accordingly.
Now that we've cleaned our corpus and got it into a state where the k-means algorithm can
tokenise the terms and match them up to numeric values, we're ready to start clustering.
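
If you've not used k-means before, here's a tiny, self-contained illustration on made-up numeric data (nothing to do with our keyword dataset) just to show the shape of the inputs and outputs:

set.seed(1)
toyData <- matrix(rnorm(20), ncol = 2)   # ten made-up points in two dimensions
toyK <- kmeans(toyData, centers = 2)     # group them into two clusters
toyK$cluster                             # which cluster each point was assigned to
toyK$tot.withinss                        # total within-cluster variance, which the Elbow Method uses below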

Finding The Optimal Number Of Clusters

One of the biggest errors seen when people try to apply data analytics techniques to digital
marketing and SEO is that they never actually make the analysis useful; they just make a pretty
graph for a pitch and the actual output is never usable. That's certainly a possible failing of topic
and keyword clustering if you're not smart about it, which is why we're going to run through
how to identify the optimal number of clusters.

There are lots of different ways that you can identify the optimal number of clusters – I’m partial
to a Bayesian inference criterion, myself, although good luck getting that to run quickly in R.
Since today we’re just doing an introduction, I’m going to take you through the most commonly-
used way to identify the best number of topic clusters for your keywords: the Elbow Method.

The Elbow Method

The Elbow Method is probably the easiest way to find the optimal number of clusters (or k), and
it’s certainly the fastest way to process it in R, but that still doesn’t mean it’s particularly quick.

Essentially, the Elbow Method computes the within-cluster variance for an increasing number of
clusters and looks for the point at which adding another cluster no longer provides meaningfully
better modelling of the data. In other words, we use it to identify the point at which adding
extra clusters becomes a waste of time. After all, if this approach isn't efficient, no one will
use it.

But how do we make the Elbow Method work? How do we use it to identify our target number
of clusters, our k? The easiest way to do that is to visualize our clusters and take a judgement
from there, hence why we installed the Tidyverse (which includes ggplot2) earlier.

Running The Elbow Method In R

Firstly, we need to create an empty data frame to put our cluster information into. That’s easy,
we’ll just use the following command:

kFrame <- data.frame()


Now we have to use a for loop to run the clustering algorithm and put it into that data frame.
This is where the processing time comes in, and I know it’s not the cleanest way to run it in R,
but it’s the way that I’ve found it to work the best.
for(i in 1:100){
  k <- kmeans(corpusCleaned, centers = i, iter.max = 100)
  kFrame <- rbind(kFrame, cbind(i, k$tot.withinss))
}
This loop will take our tidied-up corpus (our corpusCleaned variable), use the k-means
algorithm to break it out into as many relevant clusters as it can, up to 100, and then put that data
into our empty data frame. Obviously we don't want 100 clusters – no one's going to work with
that. What we want to find here is the break point: the number at which we get diminishing
returns from adding new clusters.

It may take a while to run this if you’ve got quite a lot of keywords, but once it’s done, paste the
following:

names(kFrame) <- c("cluster", "total")


All we’re doing here is naming our column headers, but it’ll be important for our next stage of
finding k.

Visualizing The Elbow Method Using GGPlot2

There’s still a certain amount of manual work involved in this keyword & topic clustering, and a
big chunk of that is around finding k and then identifying what the clusters actually are.

Fortunately, it’s not actually a lot of manual work and it will really help with your SEO and
content targeting, so it’s really worth taking the time.

Use the following command to create a plot of your clusters:

ggplot(data = kFrame, aes(x = cluster, y = total, group = 1)) +
  theme_bw(base_family = "Arial") +
  geom_line(colour = "darkgreen") +
  scale_x_continuous(breaks = seq(from = 0, to = 100, by = 5))
Certain elements, such as the colour, font and your chosen dataset can be switched up, obviously.

Using the Search Console dataset, we get the following result:


Using The Elbow Method To Identify The Optimal Number Of Clusters

Now we’ve got our graph, we need to use a bit of human intelligence to identify our number of
clusters. It’s not perfect, and that’s why other models exist, but I hope this is a good start for you.

When we look at these charts with the Elbow Method, we're looking for the point at which the
curve's initial sharp drop levels off: the point at which additional clusters become less useful.
Looking at the chart produced from my dataset, we can see that seven clusters is the point at
which we should stop adding extra clusters.
Now that we've identified k and broken our terms out into clusters, we need to match them back
to our original dataset and name our topics.

From the piece of analysis above, we can see that the optimal number of clusters on this dataset,
our k, is seven. Now we need to run the following commands to match them back to our original
dataset:

kmeans7 <- kmeans(corpusCleaned, 7)


This is fairly self-explanatory, I hope, but essentially what we're doing here is creating a new
variable called kmeans7 (change it to whatever you like) and telling R to run its base kmeans
command on our corpusCleaned variable with the number of clusters we've identified.
You can obviously adapt this to whatever number of SEO keyword or topic clusters your
analysis identified using your own datasets.
Finally, you’ll want to turn this into a clean data frame. You can do that like so:

kwClusters <- as.data.frame(cbind(queries$Query, kmeans7$cluster))


names(kwClusters) <- c("Query", "Cluster")

Now your original keywords are in a data frame with their assigned cluster in the next column
and the columns are named “Query” and “Cluster”, keeping them consistent with our main
dataset.
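
One small thing to be aware of: cbind converts everything to text, so the Cluster column comes through as characters rather than numbers. The subsets and joins below still work, but if you'd rather keep the cluster numbers numeric, an alternative is to build the data frame directly:

kwClusters <- data.frame(Query = queries$Query,
                         Cluster = kmeans7$cluster,   # keeps the cluster numbers numeric
                         stringsAsFactors = FALSE)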

From here, we’ll want to get those clusters assigned to our main dataset. Dplyr from the
Tidyverse has a really handy left_join function which works a bit like Excel’s Index Match and
will let you match this easily.

queries <- left_join(queries, kwClusters, by = "Query")


Again, you'll need to adapt your variables to fit your datasets, but this is how it works with
my particular example.

Explore Your Clusters By Subsetting

The easiest way to get to grips with what’s contained in each cluster is to subset and explore
accordingly.

Subsetting is one of the most essential elements of working with large datasets, so it’s definitely
worth getting to grips with. Fortunately, R has a number of base functions to let you do just that.

Let’s take a look at our first cluster in isolation:

clusterOne <- subset(queries, Cluster == 1)


Here, we've cut down our dataset to only look at cluster one. When you're subsetting or using
other exact matches in R, you need to use the double equals sign (==), otherwise things can get a
bit skewed.

Let’s explore it a little. First, we want to see how many observations (keywords in this case), we
have in this cluster. Nice and easy – in fact, in RStudio, you can just look in the Data pane like
so:
But let’s do it with some code anyway. The nrow function in base R will do that for you.

nrow(clusterOne)
How about if we want to see the number of impressions and clicks from that cluster? We can use
the Tidyverse’s summarise function for that:

clusterOne %>% summarise(Impressions = sum(Impressions), Clicks = sum(Clicks))


This will give us an output of the total number of impressions and clicks from that cluster, which
can be useful when identifying the opportunity available. Obviously, you can use this for other
elements as well. Perhaps you’ve merged some search volume data in, for example, or you want
to see what your average position is for this cluster.

Exploring your data in subsets will give you a much greater understanding of what each cluster
contains, so it’s well worth doing.
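
As a quick example of that kind of exploration, here's a Dplyr sketch that pulls out the ten queries with the most impressions in cluster one (assuming your dataset has Impressions and Clicks columns, as the Search Console export does):

clusterOne %>%
  arrange(desc(Impressions)) %>%         # sort by impressions, largest first
  select(Query, Impressions, Clicks) %>%
  head(10)                               # top ten queries in this cluster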

Create Wordclouds By Cluster

Wordclouds are something that always tend to go over well in client presentations, and, although
a lot of designers hate them, they’re often a fantastic way to see the most common keywords and
terms in your dataset. By doing this by cluster, we’ve got a great way to dig into what each
cluster is discussing.

The Wordcloud package for R has everything you need to do this.

Let’s take a look at the queries in my first cluster:

wordcloud(clusterOne$Query, scale = c(5, 0.5), max.words = 250, random.order = FALSE,
          rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
This will create a wordcloud from that keyword cluster’s query column, give it some pretty
colours and present it in the RStudio plots pane. In this cluster’s case, the wordcloud looks like
this:
As you can see from here, cluster one of my Search Console data is all about tracking email in
Google Analytics. Word clouds are a really quick and visual way to explore and identify the
topics covered in your keyword clusters, and using R instead of a separate tool lets you do
that nice and easily without needing to leave RStudio.
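
If you want a wordcloud for every cluster rather than just the first one, a simple loop over the cluster numbers will do it. A minimal sketch, assuming seven clusters as identified in this example:

for(i in 1:7){
  clusterQueries <- subset(queries, Cluster == i)    # keywords assigned to cluster i
  wordcloud(clusterQueries$Query, scale = c(5, 0.5), max.words = 250,
            random.order = FALSE, colors = brewer.pal(8, "Dark2"))
}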

Plotting A Combo Chart With R & GG Plot

The final idea we're going to run through is plotting a combo chart where we look at impressions
and clicks by cluster. The analyst in me is not a big fan of combo charts since they're rather
flawed, but they are a great way to quickly identify opportunities to improve SEO performance in
your keyword clusters. I'm also not very good at making R graphs look pretty, so sorry for that!

Again, we can adapt this to look at a wide variety of different metrics, and using the GGPlot
package from the Tidyverse gives us plenty of graphical avenues to explore.
Firstly, we want to create a dataframe containing a summary of the dataset. You can do that like
so with a Dplyr function:

querySummary <- queries %>%
  group_by(Cluster) %>%
  summarise(Impressions = sum(Impressions),
            Clicks = sum(Clicks),
            Avg.Position = mean(Position),
            Avg.CTR = mean(CTR))

Now we have a frame which contains all our keyword clusters with the total impressions, total
clicks and the average position and CTR in place, which will be useful for a wide variety of
visualisations, not just this example.

The code below will create a column graph with the impressions by cluster and the clicks on a
line chart with a secondary axis scaled by 20 to accommodate the differences between the two
variables. You can absolutely feel free to change the colours to make it look better.

ggplot(querySummary) +
  geom_col(aes(Cluster, Impressions), size = 1, colour = "black", fill = "darkgray") +
  geom_line(aes(Cluster, 20 * Clicks), size = 1, colour = "darkgreen", group = 1) +
  scale_y_continuous(sec.axis = sec_axis(~ . / 20, name = "Clicks"))

This will give the following graph:


And there you have it – a really simple introduction to keyword clustering for SEO and some
things you can do with it before you start creating your content accordingly.
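
If you want to share the clustered keywords with someone who doesn't use R, you can also export the final data frame back out to CSV; the filename below is just a placeholder:

write.csv(queries, "clustered_queries.csv", row.names = FALSE)   # export the keywords with their clusters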

The Full Keyword And Automatic Knowledge Clustering For SEO R Script

## Install Packages

options(warn=-1)

set.seed(12)

memory.limit(1600000000)

instPacks <- c("tidyverse", "tm", "wordcloud")


lapply(instPacks, require, character.only = TRUE)

## Read Data

queries <- read.csv("Queries.csv", stringsAsFactors = FALSE)

## Prepare Text

dtm <- Corpus(VectorSource(queries$Query))

corpusClean <- function(x){
  lowCase <- tm_map(x, tolower)
  plainText <- tm_map(lowCase, PlainTextDocument)
  remPunc <- tm_map(plainText, removePunctuation)
  stemDoc <- tm_map(remPunc, stemDocument)
  output <- DocumentTermMatrix(stemDoc)
  output
}

corpusCleaned <- corpusClean(dtm)

## Elbow Method To Find K


kFrame <- data.frame()

for(i in 1:100){
  k <- kmeans(corpusCleaned, centers = i, iter.max = 100)
  kFrame <- rbind(kFrame, cbind(i, k$tot.withinss))
}

names(kFrame) <- c("cluster", "total")

## Visualise Elbow To Find K

ggplot(data = kFrame, aes(x = cluster, y = total, group = 1)) +
  theme_bw(base_family = "Arial") +
  geom_line(colour = "darkgreen") +
  scale_x_continuous(breaks = seq(from = 0, to = 100, by = 5))

## Identify Clusters

kmeans7 <- kmeans(corpusCleaned, 7)

kwClusters <- as.data.frame(cbind(queries$Query, kmeans7$cluster))

names(kwClusters) <- c("Query", "Cluster")

## Merge Clusters To Query Data

queries <- left_join(queries, kwClusters, by = "Query")

## Explore Clusters With Subsets

clusterOne <- subset(queries, Cluster == 1)

## Wordclouds

wordcloud(clusterOne$Query, scale = c(5, 0.5), max.words = 250, random.order = FALSE,
          rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))

## Impressions/ Clicks By Cluster

querySummary <- queries %>%
  group_by(Cluster) %>%
  summarise(Impressions = sum(Impressions),
            Clicks = sum(Clicks),
            Avg.Position = mean(Position),
            Avg.CTR = mean(CTR))

ggplot(querySummary) +
  geom_col(aes(Cluster, Impressions), size = 1, colour = "black", fill = "darkgray") +
  geom_line(aes(Cluster, 20 * Clicks), size = 1, colour = "darkgreen", group = 1) +
  scale_y_continuous(sec.axis = sec_axis(~ . / 20, name = "Clicks"))
