Semantic Keyword Clustering in Python
Now that we have covered everything about keyword clustering (what it is, how it works, and the best keyword clustering tools, including advanced-level tools that are 100% free to use), it’s time to talk about semantic keyword clustering with Python: what it is and how it works. Fasten your seatbelts, SEO folks! We’re about to embark on an exhilarating exploration of semantic keyword clustering in Python. We’ll not only delve into the core concepts but also dissect two distinct approaches, the traditional and the innovative, leaving no stone unturned.
What Exactly is Semantic Keyword Clustering?
At its core, semantic keyword clustering is the art of grouping keywords not merely by their superficial similarities but by the deeper semantic connections between them. It’s akin to organizing a vast collection of ideas, where each keyword represents a concept, and clusters embody interconnected themes.
Imagine a constellation of stars, where each star signifies a keyword. Semantic keyword clustering strives to identify and connect stars that belong to the same constellation, revealing the underlying patterns and relationships that might not be apparent at first glance. It transcends the limitations of simple keyword matching and delves into the intricate web of meanings, contexts, and associations that words carry. By understanding the semantic landscape of keywords, we can gain valuable insights into user intent, search behavior, and the broader context in which words are used. This empowers us to create content that resonates deeply with our target audience, optimize our search engine visibility, and foster meaningful connections in the digital realm.
In essence, semantic keyword clustering bridges the surface level of words and the profound depths of meaning. It’s a tool that allows us to navigate the complexities of language and unlock the hidden treasures of communication.
Why Does it Matter?
You might wonder, “Why should I care about grouping keywords? Isn’t that just an SEO trick?” It’s not something you can simply skip over in your SEO practice. Semantic keyword clustering is far more than a mere optimization tactic. It’s about understanding the very pulse of your audience’s language and intent, and surfacing its subtle nuances and patterns. When you grasp how people search, you can craft content that resonates deeply, forging connections that transcend mere clicks and conversions.
Semantic SEO in Python
With its elegant syntax and powerful libraries, Python emerges as the conductor’s baton in this symphony of semantic keyword clustering exploration.
Libraries like NLTK, spaCy, and Gensim equip us with a formidable set of tools to tackle the complexities of natural language processing (NLP). Python’s versatility and user-friendliness make it the instrument of choice for data scientists, SEO virtuosos, and content maestros alike.
Google SERP Results and Discovering the Semantics
While not entirely transparent, Google’s sophisticated natural language processing (NLP) models provide a wealth of linguistic insights that can be leveraged for various applications. One such application is semantic keyword clustering.
Rather than developing complex NLP models in-house, we can utilize Google’s search engine results pages (SERPs) to identify semantically related keywords. This process involves:
- Keyword Identification: Generate a list of keywords relevant to the target topic.
- SERP Data Acquisition: Utilize web scraping techniques to collect SERP data for each keyword.
- Relationship Visualization: Construct a graph representing the connections between ranking pages and keywords.
- Semantic Clustering: The fundamental principle is that keywords consistently appearing on the same high-ranking pages are likely semantically related. These interconnected keywords form the basis of the semantic clusters.
This approach provides an efficient and practical method for leveraging Google’s extensive language understanding to identify semantically related keywords. This information can be invaluable for content creation, search engine optimization, and other data-driven marketing strategies.
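The four steps above can be sketched in a few lines of standard-library Python. The keyword-to-ranking-URL data below is invented for illustration; in practice it would come from scraped SERPs, and the full script later in this article does the clustering with networkx instead of this simple union-find.

```python
# Minimal sketch of the SERP-overlap idea: keywords whose result pages
# share enough URLs end up in the same cluster.
from itertools import combinations

# Hypothetical scraped data: keyword -> set of ranking URLs
serp_results = {
    "running shoes": {"siteA.com/shoes", "siteB.com/run", "siteC.com/gear"},
    "best running shoes": {"siteA.com/shoes", "siteB.com/run"},
    "marathon training": {"siteD.com/plan", "siteE.com/train"},
    "marathon training plan": {"siteD.com/plan", "siteE.com/train"},
}

def cluster_by_shared_urls(serps, min_shared=2):
    """Group keywords whose SERPs share at least `min_shared` URLs (union-find)."""
    parent = {k: k for k in serps}

    def find(k):
        # follow parent pointers to the cluster representative
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    # merge every keyword pair with enough SERP overlap
    for a, b in combinations(serps, 2):
        if len(serps[a] & serps[b]) >= min_shared:
            parent[find(a)] = find(b)

    # collect keywords by their representative
    clusters = {}
    for k in serps:
        clusters.setdefault(find(k), []).append(k)
    return list(clusters.values())

print(cluster_by_shared_urls(serp_results))
```

With this toy data, the two "running shoes" keywords share two URLs and merge into one cluster, and the two "marathon" keywords form the other.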
The Python Script
The Python Script offers the following functions:
- Using Google’s Custom Search Engine, it downloads the SERPs for the keyword list and saves the data to an SQLite database. You need to set up a Custom Search API first.
- Make use of the free quota of 100 requests per day. Google also offers a paid plan at $5 per 1,000 queries if you want faster results or have big datasets.
- If you aren’t in a hurry, it’s better to stick with the SQLite solution: SERP results are appended to the table on each run. (Simply take a new batch of 100 keywords the next day, when the quota resets.)
- You also need to set up these variables in the Python script:
- CSV_FILE="raw_keywords.csv" => store your keywords here
- LANGUAGE = "en"
- COUNTRY = "en"
- API_KEY="xxxxxxx"
- CSE_ID="xxxxxxx"
- Running getSearchResult(CSV_FILE,LANGUAGE,COUNTRY,API_KEY,CSE_ID,DATABASE,SERP_TABLE) writes the SERP results to the database.
- The networkx library and the community detection module handle the clustering. The data is fetched from the SQLite database, and clustering is triggered with getCluster(DATABASE,SERP_TABLE,CLUSTER_TABLE,TIMESTAMP).
- The clustering results can be found in the SQLite cluster table; unless you change it, its name is "keyword_clusters" by default.
Below, you’ll see the complete code:
# Semantic Keyword Clustering by Pemavor.com
# Author: Stefan Neefischer ([email protected])

from googleapiclient.discovery import build
import pandas as pd
import Levenshtein
from datetime import datetime
from fuzzywuzzy import fuzz
from urllib.parse import urlparse
from tld import get_tld
import langid
import json
import numpy as np
import networkx as nx
import community
import sqlite3
import math
import io
from collections import defaultdict
def cluster_return(searchTerm, partition):
    return partition[searchTerm]

def language_detection(str_lan):
    lan = langid.classify(str_lan)
    return lan[0]

def extract_domain(url, remove_http=True):
    uri = urlparse(url)
    if remove_http:
        domain_name = f"{uri.netloc}"
    else:
        domain_name = f"{uri.scheme}://{uri.netloc}"
    return domain_name

def extract_mainDomain(url):
    res = get_tld(url, as_object=True)
    return res.fld

def fuzzy_ratio(str1, str2):
    return fuzz.ratio(str1, str2)

def fuzzy_token_set_ratio(str1, str2):
    return fuzz.token_set_ratio(str1, str2)
def google_search(search_term, api_key, cse_id, hl, gl, **kwargs):
    try:
        service = build("customsearch", "v1", developerKey=api_key, cache_discovery=False)
        res = service.cse().list(q=search_term, hl=hl, gl=gl,
                                 fields='queries(request(totalResults,searchTerms,hl,gl)),items(title,displayLink,link,snippet)',
                                 num=10, cx=cse_id, **kwargs).execute()
        return res
    except Exception as e:
        print(e)
        return e

def google_search_default_language(search_term, api_key, cse_id, gl, **kwargs):
    try:
        service = build("customsearch", "v1", developerKey=api_key, cache_discovery=False)
        res = service.cse().list(q=search_term, gl=gl,
                                 fields='queries(request(totalResults,searchTerms,hl,gl)),items(title,displayLink,link,snippet)',
                                 num=10, cx=cse_id, **kwargs).execute()
        return res
    except Exception as e:
        print(e)
        return e
def getCluster(DATABASE, SERP_TABLE, CLUSTER_TABLE, TIMESTAMP="max"):
    dateTimeObj = datetime.now()
    connection = sqlite3.connect(DATABASE)
    if TIMESTAMP == "max":
        df = pd.read_sql(f'select * from {SERP_TABLE} where requestTimestamp=(select max(requestTimestamp) from {SERP_TABLE})', connection)
    else:
        df = pd.read_sql(f'select * from {SERP_TABLE} where requestTimestamp="{TIMESTAMP}"', connection)
    G = nx.Graph()
    # add graph nodes from dataframe column
    G.add_nodes_from(df['searchTerms'])
    # add edges between graph nodes:
    for index, row in df.iterrows():
        df_link = df[df["link"] == row["link"]]
        for index1, row1 in df_link.iterrows():
            G.add_edge(row["searchTerms"], row1['searchTerms'])
    # compute the best partition for communities (clusters)
    partition = community.best_partition(G)
    cluster_df = pd.DataFrame(columns=["cluster", "searchTerms"])
    cluster_df["searchTerms"] = list(df["searchTerms"].unique())
    cluster_df["cluster"] = cluster_df.apply(lambda row: cluster_return(row["searchTerms"], partition), axis=1)
    aggregations = defaultdict()
    aggregations["searchTerms"] = ' | '.join
    clusters_grouped = cluster_df.groupby("cluster").agg(aggregations).reset_index()
    clusters_grouped["requestTimestamp"] = dateTimeObj
    clusters_grouped = clusters_grouped[["requestTimestamp", "cluster", "searchTerms"]]
    # save to sqlite cluster table
    connection = sqlite3.connect(DATABASE)
    clusters_grouped.to_sql(name=CLUSTER_TABLE, index=False, if_exists="append", dtype={"requestTimestamp": "DateTime"}, con=connection)
def getSearchResult(filename, hl, gl, my_api_key, my_cse_id, DATABASE, TABLE):
    dateTimeObj = datetime.now()
    rows_to_insert = []
    keyword_df = pd.read_csv(filename)
    keywords = keyword_df.iloc[:, 0].tolist()
    for query in keywords:
        if hl == "default":
            result = google_search_default_language(query, my_api_key, my_cse_id, gl)
        else:
            result = google_search(query, my_api_key, my_cse_id, hl, gl)
        if "items" in result and "queries" in result:
            for position in range(0, len(result["items"])):
                result["items"][position]["position"] = position + 1
                result["items"][position]["main_domain"] = extract_mainDomain(result["items"][position]["link"])
                result["items"][position]["title_matchScore_token"] = fuzzy_token_set_ratio(result["items"][position]["title"], query)
                result["items"][position]["snippet_matchScore_token"] = fuzzy_token_set_ratio(result["items"][position]["snippet"], query)
                result["items"][position]["title_matchScore_order"] = fuzzy_ratio(result["items"][position]["title"], query)
                result["items"][position]["snippet_matchScore_order"] = fuzzy_ratio(result["items"][position]["snippet"], query)
                result["items"][position]["snipped_language"] = language_detection(result["items"][position]["snippet"])
            for position in range(0, len(result["items"])):
                rows_to_insert.append({"requestTimestamp": dateTimeObj, "searchTerms": query, "gl": gl, "hl": hl,
                                       "totalResults": result["queries"]["request"][0]["totalResults"], "link": result["items"][position]["link"],
                                       "displayLink": result["items"][position]["displayLink"], "main_domain": result["items"][position]["main_domain"],
                                       "position": result["items"][position]["position"], "snippet": result["items"][position]["snippet"],
                                       "snipped_language": result["items"][position]["snipped_language"], "snippet_matchScore_order": result["items"][position]["snippet_matchScore_order"],
                                       "snippet_matchScore_token": result["items"][position]["snippet_matchScore_token"], "title": result["items"][position]["title"],
                                       "title_matchScore_order": result["items"][position]["title_matchScore_order"], "title_matchScore_token": result["items"][position]["title_matchScore_token"],
                                       })
    df = pd.DataFrame(rows_to_insert)
    # save serp results to sqlite database
    connection = sqlite3.connect(DATABASE)
    df.to_sql(name=TABLE, index=False, if_exists="append", dtype={"requestTimestamp": "DateTime"}, con=connection)
##############################################################################
# Read Me:
#
# 1- You need to set up a Google custom search engine.
#    Please provide the API key and the search engine ID.
#    Also set the country and language where you want to monitor SERP results.
#    If you don't have an API key and search engine ID yet, you can follow
#    the steps under the Prerequisites section on this page:
#    https://developers.google.com/custom-search/v1/overview#prerequisites
#
# 2- You also need to enter the database, serp table and cluster table names
#    to be used for saving results.
#
# 3- Enter the csv file name or full path that contains the keywords that
#    will be used for the serp download.
#
# 4- For keyword clustering, enter the timestamp of the serp results that
#    will be used for clustering. If you want to cluster the latest serp
#    results, enter "max" for the timestamp; or you can enter a specific
#    timestamp like "2021-02-18 17:18:05.195321".
#
# 5- Browse the results with the DB Browser for SQLite program.
##############################################################################
# csv file name that holds the keywords for the serp download
CSV_FILE="keywords.csv"
# determine language
LANGUAGE = "en"
# determine country
COUNTRY = "en"
# google custom search json api key
API_KEY="ENTER KEY HERE"
# search engine ID
CSE_ID="ENTER ID HERE"
# sqlite database name
DATABASE="keywords.db"
# table name the serp results are saved to
SERP_TABLE="keywords_serps"
# run serp download for the keywords
getSearchResult(CSV_FILE,LANGUAGE,COUNTRY,API_KEY,CSE_ID,DATABASE,SERP_TABLE)
# table name the cluster results are saved to
CLUSTER_TABLE="keyword_clusters"
# Please enter a timestamp if you want to build clusters for a specific timestamp.
# If you want to build clusters for the last serp result, keep the "max" value.
#TIMESTAMP="2021-02-18 17:18:05.195321"
TIMESTAMP="max"
# run keyword clustering with the networkx and community detection algorithms
getCluster(DATABASE,SERP_TABLE,CLUSTER_TABLE,TIMESTAMP)
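Once the script has run, the saved clusters can be inspected with a few lines of pandas. The rows below are written by hand so the snippet is self-contained and uses an in-memory database; in a real run they would come from getCluster() and live in keywords.db under the keyword_clusters table.

```python
# Read the latest clustering run back out of SQLite.
import sqlite3
import pandas as pd

connection = sqlite3.connect(":memory:")  # use "keywords.db" for real runs

# Hand-made demo rows in the same shape getCluster() writes
demo = pd.DataFrame({
    "requestTimestamp": ["2021-02-18 17:18:05.195321"] * 2,
    "cluster": [0, 1],
    "searchTerms": ["running shoes | best running shoes",
                    "marathon training | marathon training plan"],
})
demo.to_sql("keyword_clusters", connection, index=False)

# latest run only: one row per cluster, keywords joined with " | "
clusters = pd.read_sql(
    "select * from keyword_clusters "
    "where requestTimestamp = (select max(requestTimestamp) from keyword_clusters)",
    connection,
)
print(clusters[["cluster", "searchTerms"]])
```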
The Traditional Path
Let’s discuss the well-trodden path of the traditional approach. This method hinges on creating word embeddings, intricate mathematical representations that capture the semantic essence of words. Think of them as coordinates on a multidimensional map, where words that share similar meanings cluster together.
Once these embeddings are crafted, we employ a metric known as Word Mover’s Distance (WMD) to gauge the semantic proximity between keywords. WMD, in essence, calculates the minimum “effort” required to transform one keyword’s embedding into another, akin to measuring the distance between two points on our semantic map. The closer the distance, the stronger the semantic bond.
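To make the idea concrete without the heavy machinery: real WMD solves an optimal-transport problem over trained embeddings (gensim’s wmdistance, for example), but the toy sketch below uses hand-made 3-dimensional vectors and a cheaper proxy, cosine distance between averaged phrase embeddings, just to show the mechanics of measuring semantic proximity on the "map".

```python
# Toy embedding-similarity sketch. The vectors are invented for
# illustration; learned embeddings are typically 100+ dimensional.
import numpy as np

vectors = {
    "cheap":  np.array([0.9, 0.1, 0.0]),
    "budget": np.array([0.8, 0.2, 0.1]),
    "hotel":  np.array([0.1, 0.9, 0.2]),
    "flight": np.array([0.0, 0.2, 0.9]),
}

def phrase_vector(phrase):
    # average the word vectors to get one point per phrase
    return np.mean([vectors[w] for w in phrase.split()], axis=0)

def cosine_distance(p1, p2):
    v1, v2 = phrase_vector(p1), phrase_vector(p2)
    return 1 - np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

d_near = cosine_distance("cheap hotel", "budget hotel")
d_far = cosine_distance("cheap hotel", "budget flight")
print(d_near, d_far)  # the near-synonym pair has the smaller distance
```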
While this approach boasts remarkable accuracy, it comes with a caveat: it demands significant computational resources and expertise. Building and training robust word embedding models can be time-consuming, requiring access to vast datasets and specialized knowledge.
The Innovative Trail
Now, let’s venture off the beaten path and explore a more recent and ingenious approach that leverages the collective intelligence of Google Search. Google, with its relentless pursuit of understanding human language, has amassed an unparalleled wealth of knowledge about how words and phrases connect. By tapping into this reservoir of wisdom, we can uncover hidden semantic relationships that might elude even the most sophisticated language models.
The process involves scraping Google search results for a given set of keywords and analyzing the overlap in the ranking pages. The underlying assumption is that if two keywords frequently appear on the same search result pages, they likely share some semantic affinity.
We then construct a graph where nodes represent keywords and edges signify co-occurrence on search result pages. By applying community detection algorithms to this graph, we can identify clusters of tightly interconnected keywords, revealing the underlying semantic landscape.
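This graph-plus-community-detection step can be shown compactly with networkx’s built-in greedy modularity algorithm (the full script above uses python-louvain’s community.best_partition instead; both find densely connected groups). The edges are invented example data, each one meaning "these two keywords shared a ranking URL":

```python
# Keywords become nodes; co-occurrence on a SERP becomes an edge.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

edges = [
    ("running shoes", "best running shoes"),
    ("running shoes", "trail running shoes"),
    ("best running shoes", "trail running shoes"),
    ("marathon training", "marathon training plan"),
    ("marathon training", "16 week marathon plan"),
    ("marathon training plan", "16 week marathon plan"),
]

G = nx.Graph(edges)
# community detection: groups of tightly interconnected keywords
communities = greedy_modularity_communities(G)
for i, group in enumerate(communities):
    print(i, sorted(group))
```

The two triangles in the edge list come out as two communities, i.e. two semantic clusters.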
While computationally less demanding than the traditional method, this approach offers a fresh perspective on semantic keyword clustering. It harnesses the real-world usage patterns reflected in Google Search, providing a glimpse into language’s dynamic and ever-evolving nature.
Choosing Your Path: A Matter of Priorities
So, which path should you choose? The traditional route, with its meticulous precision, or the innovative trail, with its agility and real-world insights? The answer, as with most things in life, depends on your specific needs and constraints.
If you seek the utmost accuracy and have the resources to invest in building and training complex language models, the traditional approach might be your ideal companion. However, if you’re looking for a quicker, more accessible solution that taps into Google Search’s collective wisdom, the innovative trail beckons.
In the ever-evolving landscape of SEO and content marketing, semantic keyword clustering emerges as a guiding light, illuminating the path to deeper audience understanding and more impactful communication. Whether you choose the traditional path or the innovative trail, the journey promises to be both enlightening and rewarding.
So, grab your Python toolkit, embrace the power of semantics, and embark on your own adventure into the captivating world of keyword clustering. Happy exploring!