Semantic Analysis for SEO: Advancing Beyond LDA

Not too long ago, the SEO community was abuzz with excitement over a new and intriguing term: LDA (Latent Dirichlet Allocation). Remember that? It all began with the poorly named SEOmoz tool and has since ushered in a new age of three-letter-initialism snake oil.

However, I assure you this can still be a positive development for search engine optimization enthusiasts. While it has left many people with new jargon to bandy about, it has also achieved something far more significant: it has prompted SEOs to engage in more discussions about the field of information retrieval (IR).

A Closer Look at Semantic Analysis

LDA is just one of many approaches to semantic analysis. Instead of trying to figure out which specific methods Google/Bing might be using (as they could be employing multiple methods in conjunction), let’s delve into the essence of the process itself.

Contrary to what many in the industry believe, semantic analysis (SA) is not merely about synonyms and plurals (stemming). This is the most common misconception we encounter.

Concepts and themes are the real focus. The fundamental challenge in determining on-page relevance lies in the fact that computers don’t truly comprehend language very well (their understanding is often pegged at roughly a 6th-grade level). That’s where SA comes in, helping them better grasp the subject matter of a page. Take the search query jaguar, for instance. It could refer to a car, a large feline, an operating system, a football team, and so on.


To gain a clearer understanding of the page’s topic, search engines look for terms/phrases on the page to categorize it. In the case of the car, we might find terms/phrases like auto mechanic and engine; for the animal, terms like short hair and hunts prey would be present. Or consider a search for White House. This doesn’t automatically imply the US presidential residence; it could simply refer to a “white” “house.” Consequently, the system would search for related terms such as President of the United States, Barack Obama, and so on … you get the drift.
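To make the idea concrete, here’s a minimal sketch of co-occurrence-based disambiguation. The category term lists, the scoring, and the function names are all illustrative assumptions, not how any search engine actually implements this:

```python
# Toy disambiguation: guess which sense of an ambiguous query a page is
# about by counting how many of each sense's related terms appear on it.
# The category term lists below are invented for illustration.

SENSES = {
    "jaguar (car)":    {"auto mechanic", "engine", "sedan", "horsepower"},
    "jaguar (animal)": {"short hair", "hunts prey", "rainforest", "feline"},
}

def disambiguate(page_text: str) -> dict[str, int]:
    """Score each sense by how many of its related terms occur on the page."""
    text = page_text.lower()
    return {sense: sum(term in text for term in terms)
            for sense, terms in SENSES.items()}

page = "The jaguar is a large feline that hunts prey across the rainforest."
print(disambiguate(page))
# {'jaguar (car)': 0, 'jaguar (animal)': 3}
```

A real system would weight terms rather than simply count them, but even this crude overlap is enough to separate the two senses.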

Unveiling Insights

Now, let’s explore the value aspect. Within each category and its sub-categories, there’s an occurrence rate: how frequently a particular phrase/term/concept appears across other documents, along with associated correlation rates. This same process can be used to identify other common terms that the search engine would “expect” to find. This goes beyond a simple keyword density approach, as there would be an expectation not only for phrase density but also for the presence of related phrases. This is all based on an initial set of documents and is constantly refined over time (machine learning) through click and query analysis data. Merely repeating the same term at high density, without the expected related phrases, would be a futile exercise.
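Here’s a rough sketch of that expectation idea. The phrases and their occurrence rates are made up for illustration; real systems would derive them from large corpora and refine them with click/query data:

```python
# Toy "expected related phrases" check for a topic. The expectation rates
# below are invented; real systems would learn them from large corpora
# and continually refine them via click and query analysis.

EXPECTED = {  # phrase -> assumed fraction of relevant docs containing it
    "president of the united states": 0.8,
    "oval office": 0.6,
    "west wing": 0.4,
}

def coverage_score(page_text: str) -> float:
    """Fraction of expected-phrase weight actually present on the page."""
    text = page_text.lower()
    found = sum(rate for phrase, rate in EXPECTED.items() if phrase in text)
    return found / sum(EXPECTED.values())

dense_but_empty = "white house " * 50  # high keyword density, no support
supported = ("The White House, home of the President of the United States, "
             "sits steps from the West Wing and the Oval Office.")

print(coverage_score(dense_but_empty))  # 0.0 -- density alone earns nothing
print(coverage_score(supported))        # 1.0 -- expected phrases all present
```

This is exactly why stuffing a page with the target term fails: the density is there, but the expected supporting vocabulary isn’t.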

Implementing Semantic Analysis Concepts in SEO

Now, let’s examine how this applies to the world of SEO. One might assume it’s solely about on-page contextual elements, but that would be inaccurate. One of the reasons I lean towards phrase-based approaches over LDA (though both can be employed) is that they extend beyond mere content analysis. Various patent filings illustrate the system’s use for a range of elements, including:

  • On-page content analysis: This one’s obvious. It involves analyzing the content (including the TITLE) to identify concepts.
  • Links: This is interesting because they consider: (a) the anchor text of inbound links; (b) the semantic relevance of the linking page; (c) the TITLE of the linking page; and (d) the relevance of the inbound link to the linking page itself.
  • Duplicate content: There are even patents on using this SA approach for duplicate content detection.
  • Spam: Naturally, they can utilize it for spam detection in various ways (alongside other methods).
  • Personalization: Interestingly, they discuss personalizing such a system, building a model of “good phrases” for different user types. They also mention “Personalized Topic-Based Document Descriptions” that adapt snippets using this system.

This versatility makes the phrase-based IR approach highly valuable. Whether used alone or with other SA methods, it appears to be a potent tool with implications that extend far beyond what most SEOs seem to grasp. For our purposes, we’re primarily interested in on-page content and, of course, inbound links. I’d even venture to include page naming conventions and outbound links, though I haven’t personally come across those in the documentation. In practice, that translates into the following steps:
  1. Compile a list of supporting concepts to use in your arguments.
  2. Craft a TITLE incorporating related terms.
  3. Assess the relevance of pages you hope to acquire links from.
  4. Enhance relevance through outbound links.
  5. Employ relevant page naming conventions.
  6. Apply the same principles to any content additions.

Identifying these phrases/concepts would be part of the keyword research phase. We wouldn’t just focus on primary, secondary, and modifier terms but also on creating a list of “related phrases” (a rough sketch of that mining step follows below).
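As a minimal sketch of that mining step, assume you already have the raw text of a handful of pages known to rank for your target term. Fetching, parsing, and stopword filtering are all left out; this is an illustration, not a production miner:

```python
# Toy related-phrase miner: rank two-word phrases that co-occur with a
# target term across a small seed corpus. Real use would add fetching,
# HTML parsing, and stopword filtering.

from collections import Counter

def bigrams(words: list[str]):
    """Yield every adjacent two-word phrase in a token list."""
    return (" ".join(words[i:i + 2]) for i in range(len(words) - 1))

def related_phrases(docs: list[str], target: str, top: int = 10):
    """Rank bigrams from documents that mention the target term."""
    counts = Counter()
    for doc in docs:
        words = doc.lower().split()
        if target in words:  # only learn from docs about the target
            counts.update(bigrams(words))
    return counts.most_common(top)

corpus = [
    "the white house press briefing covered the oval office meeting",
    "tour the white house and the oval office this spring",
]
print(related_phrases(corpus, "house"))  # 'oval office' surfaces near the top
```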

Essential Tools

This is where the real challenge lies. There aren’t any dedicated tools for this purpose. The poorly named SEOmoz tool is a starting point, but it falls short of our requirements. We need a tool that can:

  • Analyze the top 10-20 ranked pages for the most prevalent phrases/terms/concepts.
  • Generate a list of these terms.
  • Analyze our target page, compare it to the list, and provide suggestions.

Unfortunately, such a tool doesn’t exist yet. I’ve been in discussions with the nexus-security team about developing something similar; only time will tell. Currently, there’s one tool I’ve used that comes close: the equally poorly named LSI Tool from the folks at NicheBot, which searches and categorizes terms into semantic baskets.

Now, it’s not perfect. As mentioned earlier, I’d personally design a significantly different tool that incorporates this approach but also includes SERP analysis (a rough sketch of what such a tool might look like appears at the end of this section). However, it’s a start. Perhaps our friends at nexus-security would be interested in working on it? To illustrate, here’s the main screen for a given project:
[Screenshot: NicheBot main project screen]

And here’s the “LSI View”:

[Screenshot: NicheBot “LSI View”]

You would then start refining the lists based on the specific goals of your page/site/project. Once you have your optimized list, export it. You can then use it for content creation and share it with writers, with the link-building team (to enhance the semantic diversity of inbound anchor text), and with those handling on-site elements like TITLE tags and page naming conventions… basically, the entire team.

Naturally, there’s an art to this. Given the lack of a tool to properly analyze top query spaces, you’ll need to conduct some independent research and rely on your “instincts” as well. Other tools worth considering are listed at the bottom of this post (though none truly fit the bill).
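And for what it’s worth, here’s a bare-bones sketch of the tool described above: pull the top-ranked pages for a query, tally the terms prevalent across them, and flag the ones a target page is missing. The fetch_top_pages stub is a placeholder assumption; a real tool needs a SERP data source and proper HTML parsing:

```python
# Bare-bones sketch of the wished-for tool: compare a target page's
# vocabulary against terms prevalent across the top-ranked competitors.
# fetch_top_pages() is a stub; a real tool needs a SERP source + parser.

from collections import Counter

def fetch_top_pages(query: str, n: int = 10) -> list[str]:
    """Stub: return the visible text of the top-n ranked pages."""
    raise NotImplementedError("plug in a SERP source and an HTML parser")

def prevalent_terms(pages: list[str], min_pages: int = 3) -> set[str]:
    """Terms appearing on at least min_pages of the competing pages."""
    doc_freq = Counter()
    for page in pages:
        doc_freq.update(set(page.lower().split()))
    return {term for term, df in doc_freq.items() if df >= min_pages}

def suggestions(target_text: str, pages: list[str]) -> set[str]:
    """Prevalent competitor terms missing from the target page."""
    return prevalent_terms(pages) - set(target_text.lower().split())

# Usage, once fetch_top_pages() is implemented:
#   missing = suggestions(my_page_text, fetch_top_pages("white house tours"))
#   print(sorted(missing))
```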

The Significance of Understanding Semantic Analysis in SEO

So, the question arises: what’s the value for the average SEO professional? Why bother with all this?

Here’s how I see it: link building is arguably the most challenging and often the most expensive aspect of SEO. Therefore, we want to optimize as many on-site factors as possible, and semantic analysis plays a crucial role in this. As search engines become more sophisticated, there will be more opportunities for improvement (less noise), likely increasing its weight. As we’ve seen, semantic analysis can be integrated throughout the SEO process, from site development and meta data to on-page content and even link building. Adopting this approach can yield significant long-term benefits.

Does it work? I firmly believe so. My SEO approach has always considered the role of semantic analysis. Knowing the specific method used is less critical than understanding how to incorporate it effectively into your SEO strategy. By consciously working within semantic best practices, you can undoubtedly enhance your value proposition now and in the future (in my opinion).

Here’s some additional reading material:

Licensed under CC BY-NC-SA 4.0