
Beyond Link Juice: Co-Citation and Natural Language Processing

Search engine optimization used to be straightforward. Machines could match a search query to a string of words in a web document, and a user would see those documents and choose the one that best answered the query.

Incoming links helped search engines (such as Google) rank the documents that all contained that string of words.

It was simple.

Search has gotten much more sophisticated over the years. Search engines now use co-citation and Natural Language Processing (semantic analysis) to better understand the context and purpose of a query.

Here’s what you should know:

What is Co-Citation?

Co-citation is a method of relating web documents based on how often they are referenced together.

In other words, if C links to both A and B, A and B are likely to be related even if they are not directly linking to each other. The more pages that link to both, the stronger the relationship.

Co-citation analysis is a concept used in bibliometrics that’s also made its way into information retrieval since there’s a lot of overlap between the fields and the problems they seek to solve. It’s a clustering method used to identify similar content through shared citations.
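To make the idea concrete, here is a minimal sketch that counts co-citations in the bibliometric sense: for each pair of pages, how many third pages link to both. The link graph and page names are made up for illustration, and nothing here is meant to describe how Google actually computes it.

```python
from collections import Counter
from itertools import combinations

# Hypothetical link graph: each key is a page, each value is the set of
# pages it links out to. The page names are invented for this example.
outlinks = {
    "C": {"A", "B"},
    "D": {"A", "B", "E"},
    "F": {"B", "E"},
    "A": set(),
    "B": set(),
    "E": set(),
}


def cocitation_counts(outlinks):
    """Count, for every pair of pages, how many third pages link to both."""
    counts = Counter()
    for source, targets in outlinks.items():
        # Ignore self-links so a page never co-cites itself.
        cited = sorted(t for t in targets if t != source)
        # Every pair of pages linked from the same source is co-cited once.
        for pair in combinations(cited, 2):
            counts[pair] += 1
    return counts


counts = cocitation_counts(outlinks)
for (left, right), n in counts.most_common():
    print(f"{left} and {right} are co-cited by {n} page(s)")
```

Pairs with a high count are candidates for being "related", even though A, B, and E never link to one another in this toy graph.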

A quick takeaway is that SEO professionals have actually known about co-citation analysis for a while now; it was probably deployed by Google to create the results returned by the “related:” advanced operator (which is now deprecated).

Not the Only Signal

There are plenty of content-based signals that play into the ranking algorithm already, and Google is continuing to refine its understanding of what separates a quality page from spam, which includes catching link spam and other attempts at gaming the algorithm.

Due to this ongoing emphasis on content quality, link quality, and other qualitative ranking factors at Google, pure link juice shouldn’t explain this behavior… especially since algorithm updates like Penguin included ways of detecting and ignoring low-quality links (like links from irrelevant and/or non-authoritative pages) and attempts at ranking through “Google-Bombing” via exact-match anchor text links.

Some links just aren’t as juicy as they used to be, and now there are even types of links that can hurt your efforts to rank.

Natural Language Processing

Natural Language Processing, especially as it relates to Entity Disambiguation and the creation of Taxonomies for the purpose of building an Ontology, is a continually evolving field and an active area of research at many of the top universities and companies in the United States.

Natural Language Processing techniques already exist for Named Entity Recognition and building Taxonomies, and Google has huge corpora providing more than enough data to learn from if they wanted to break new ground in the field.

It’s well within the realm of plausibility that they have the capabilities to do it… but is there any indication they’re actually doing it?

Google employs some of the brightest minds in the world for the purpose of “organizing the world’s information and making it universally accessible and useful.”

Organizing information is pretty much the primary function of concepts such as Taxonomy and Ontology in Information Retrieval. As if this weren’t suggestive enough, Google has a few patent filings that are relevant to the field… here are a few of the most relevant ones:

  • Automatic taxonomy generation in search results using phrases (US Patent No. 7,426,507)
  • Determining query term synonyms within query context (US Patent No. 7,636,714)
  • Categorizing objects, such as documents and/or clusters, with respect to a taxonomy and data structures derived from such categorization (US Patent No. 8,918,395)
  • System and method for determining a composite score for categorized search results (US Patent No. 7,814,085)
  • Identifying Query Aspects (US Patent Application No. 20100198837)

These patents show a continued trend by Google to examine features within a document in an attempt to derive context well beyond simple term frequency in page content or incoming link anchor text, paving the way for technologies such as Entity Extraction and Co-Occurrence to potentially factor into document ranking algorithms in new and exciting ways.

If you need further proof, please allow me to introduce the Google Knowledge Graph… an Ontological Information Retrieval system.

Not only is it possible for Google to be working with this technology… it seems Google has taken major strides towards implementing the necessary infrastructure to do it.

We are looking at a new ranking signal: Topical PageRank, combined with the devaluing of irrelevant, spammy, or weak links by algorithms like Penguin. This may actually be confirmation of the scope of Penguin’s effect on the SERPs. We’re not seeing rankings based on something Google just noticed; we’re seeing rankings based on what Google is ignoring.
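Google has not published how (or whether) it computes anything like this. Purely as an illustration, one well-known formulation of the idea is topic-sensitive PageRank, where the random surfer only ever "teleports" to pages known to be on-topic. The sketch below assumes that formulation; the graph, the page names, and the `topical_pagerank` function are all hypothetical.

```python
def topical_pagerank(outlinks, topic_pages, damping=0.85, iterations=50):
    """Power iteration in which teleportation lands only on topic_pages.

    Assumes every link target also appears as a key in outlinks.
    """
    pages = sorted(outlinks)
    # Teleport distribution: uniform over on-topic pages, zero elsewhere.
    teleport = {
        p: (1.0 / len(topic_pages) if p in topic_pages else 0.0) for p in pages
    }
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) * teleport[p] for p in pages}
        for page in pages:
            targets = outlinks[page]
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    nxt[target] += share
            else:
                # Dangling page: hand its rank back to the topic pages.
                for p in pages:
                    nxt[p] += damping * rank[page] * teleport[p]
        rank = nxt
    return rank


# Hypothetical graph: three SEO-related pages and two off-topic ones.
outlinks = {
    "seo-guide": ["link-building", "nlp-primer"],
    "link-building": ["seo-guide"],
    "nlp-primer": ["seo-guide"],
    "cat-photos": ["seo-guide", "recipes"],
    "recipes": ["cat-photos"],
}
scores = topical_pagerank(outlinks, topic_pages={"seo-guide", "link-building"})
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page:15s} {score:.3f}")
```

In this toy graph the off-topic pages never accumulate any topical rank, so the link from "cat-photos" to "seo-guide" passes nothing along. That is the "rankings based on what Google is ignoring" effect in miniature.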

Still a work in progress

  1. The algorithm is still driven by words that are in the content or anchor text, meaning it’s just business as usual for Google. Links are still very important to ranking; Google has just changed the way it looks at links.
  2. Many of the secondary factors, such as Lexical Co-Occurrence and Synonymy, have been in place in the algorithm for quite some time.
  3. A lot of the technology used to make it happen is still out of reach for the average SEO professional. We don’t all have access to web-scale data to try and compute the transfer of Topical Authority.
  4. We’re still in theory land; more testing is going to be needed to get to the bottom of this.

Why it is so exciting…

  1. It’s a continuation of the shift away from ranking signals that are easily spammed. Google’s algorithm is getting better, and we’ve learned that it’s Context that is truly King.
  2. It shows an increased ability for Google to disambiguate concepts within phrases and provide contextual results on the basis of similar or related terminology.
  3. It appears to be an example of the impact the Knowledge Graph may have on the SERPs in the future.
  4. Lexical co-occurrence is at work in the examples because they’re oddball queries in their own right. They co-occur a lot less in the web corpora than some of their variant forms… or they don’t receive a lot of mentions, so Google is estimating/approximating.
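For readers who want to see what lexical co-occurrence looks like in practice, here is a small sketch that counts how often two terms appear in the same document and scores the pair with pointwise mutual information (PMI). The five-line corpus is invented, and PMI is just one common way to score co-occurrence; it is an assumption here, not something Google has confirmed using.

```python
import math
from collections import Counter
from itertools import combinations

# Tiny made-up corpus; a real analysis would need web-scale data.
corpus = [
    "cell phone reviews and cell phone plans",
    "mobile phone plans compared",
    "cheap cell phone plans",
    "mobile photography tips",
    "phone plans for families",
]


def cooccurrence(corpus):
    """Return per-term and per-pair document frequencies."""
    term_docs = Counter()
    pair_docs = Counter()
    for text in corpus:
        # Use the set of terms so each document counts once per term.
        terms = sorted(set(text.lower().split()))
        term_docs.update(terms)
        pair_docs.update(combinations(terms, 2))
    return term_docs, pair_docs


def pmi(a, b, term_docs, pair_docs, n_docs):
    """Pointwise mutual information of two terms, in bits."""
    joint = pair_docs[tuple(sorted((a, b)))]
    if joint == 0:
        return float("-inf")  # the terms never appear together
    return math.log2(joint * n_docs / (term_docs[a] * term_docs[b]))


term_docs, pair_docs = cooccurrence(corpus)
for a, b in [("cell", "phone"), ("mobile", "phone")]:
    score = pmi(a, b, term_docs, pair_docs, len(corpus))
    print(f"{a} + {b}: PMI = {score:.2f}")
```

A variant that rarely shows up alongside the rest of the phrase gets a lower score, which is the kind of gap a search engine would have to estimate around for oddball queries.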

Search engine algorithms have undergone a remarkable transformation, moving far beyond the simplistic reliance on incoming links and basic keyword matching. The advent of co-citation analysis and Natural Language Processing (NLP) has ushered in a new era of search engine sophistication, enabling search engines to comprehend the context, purpose, and meaning behind user queries with greater accuracy.