The Goal of Search Engines & How They Work
Search Engine Relevancy
Many people think search engines have a hidden agenda. This simply is not true. The goal of the search engine is to provide high quality content to people searching the Internet.
Search engines with the broadest distribution network sell the most advertising space. Currently, Google is considered the search engine with the best relevancy. Their technologies power the bulk of web search.
The Problem Listing a New Site
The biggest problem new web sites have is that search engines have no idea they exist. Even when a search engine finds a new document, it has a hard time determining its quality. Search engines rely on links to help determine the quality of a document. Some engines, such as Google, also trust web sites more as they age.
The following bits may contain a few advanced search topics. It is fine if you do not necessarily understand them right away. The average webmaster does not need to know search technology in depth but some might be interested in it.
Gerard Salton
The phrase vector space model, which search algorithms still heavily rely upon today, goes back to the 1970s. Gerard Salton was a well-known expert in the field of information retrieval who pioneered many of today's modern methods.
If you are interested in learning more about early information retrieval systems, you may want to read A Theory of Indexing, a quick 50-page book by Salton that describes many of the common terms and concepts in the information retrieval field.
Mike Grehan's book, Search Engine Marketing: The Essential Best Practices Guide, also discusses some of the technical aspects of information retrieval in more detail than this book. My book was created to be a current how-to guide, while his is geared more toward explaining how information retrieval works.
Parts of a Search Engine
While there are different ways to organize web content, every crawling search engine has the same basic parts. Each consists of:
- a crawler;
- an index (or catalog);
- and a search interface.
Crawler (or Spider)
The crawler does just what its name implies. It scours the web following links, updating pages, and adding new pages when it comes across them. Each search engine has periods of deep crawling and periods of shallow crawling. There is also a scheduler mechanism to prevent a spider from overloading servers and to tell the spider what documents to crawl next and how frequently to crawl them.
Rapidly changing or highly important documents are more likely to get crawled frequently. The frequency of crawl should typically have little effect on search relevancy; it simply helps the search engines keep fresh content in their index. The home page of CNN.com might get crawled once every ten minutes. A popular rapidly growing forum might get crawled a few dozen times each day. A static site with little link popularity and rarely changing content might only get crawled once or twice a month.
The best benefit of having a frequently crawled page is that you can get your new sites, pages, or projects crawled quickly by linking to them from a powerful or frequently changing page.
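The scheduling idea described above can be sketched with a priority queue. This is a minimal illustration, not how any particular search engine works: the page names and recrawl intervals in `RECRAWL_MINUTES` are hypothetical values loosely based on the examples in the text, and `crawl_order` is an invented helper.

```python
import heapq

# Hypothetical recrawl intervals in minutes (assumed values, not real data).
RECRAWL_MINUTES = {
    "news-homepage": 10,          # rapidly changing, highly important
    "busy-forum": 60,             # a couple dozen crawls per day
    "static-site": 60 * 24 * 20,  # roughly once or twice a month
}

def crawl_order(pages, horizon_minutes):
    """Return (time, url) visits in the order a simple scheduler makes them.

    pages maps url -> recrawl interval in minutes.
    """
    # Every page is due at time 0; the heap always yields the next due page.
    queue = [(0, url) for url in sorted(pages)]
    heapq.heapify(queue)
    visits = []
    while queue:
        due, url = heapq.heappop(queue)
        if due > horizon_minutes:
            break  # everything left in the heap is due even later
        visits.append((due, url))
        # Reschedule the page one interval after this visit.
        heapq.heappush(queue, (due + pages[url], url))
    return visits

visits = crawl_order(RECRAWL_MINUTES, horizon_minutes=60)
```

Over the first hour the frequently changing page is visited many times while the static site is visited once. A real scheduler would also throttle requests per server, which this sketch leaves out.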
The Index
The index is where the spider-collected data are stored. When you perform a search on a major search engine, you are not searching the web, but the cache of the web provided by that search engine's index.
Reverse Index
Search engines organize their content in what is called a reverse index (more commonly known as an inverted index). It sorts web documents by words. When you search Google and it displays 1-10 out of 143,000 web sites, it means that there are approximately 143,000 web pages which either have the words from your search on them or have inbound links containing those words.
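To make the idea concrete, here is a minimal sketch of an index that maps words to documents. The function names `build_index` and `search` and the three sample documents are made up for illustration, and the sketch only looks at page text, not at inbound link text.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def search(index, query):
    """Return ids of documents containing every word in the query."""
    postings = [set(index.get(w, [])) for w in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "the cat in the hat",
    2: "a hat for a cat",
    3: "dogs chase cats",
}
index = build_index(docs)
matches = search(index, "cat hat")  # documents 1 and 2 contain both words
```

Looking up a word is a single dictionary access, which is why a search does not need to scan every document at query time.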
Storing Attributes
Since search engines view pages from their source code in a linear format, it is best to move JavaScript and other extraneous code to external files to help move the page copy higher in the source code.
Some people also use Cascading Style Sheets (CSS) or a blank table cell to place the page content ahead of the navigation. Search engines judge which words come first by the order in which they appear in the source code. I have not done significant testing to determine if it is worth the effort to make your unique page copy appear ahead of the navigation, but if it does not take much additional effort, it is probably worth doing. Link analysis (discussed in depth later) is far more important than page copy to most search algorithms, but every little bit can help.
Google has also hired some people from Mozilla and is likely working on helping their spider understand how browsers render pages. Microsoft has published visual segmentation research which may help them understand what page content is most important.
As well as storing the position of a word, search engines can also store how the data are marked up. For example: Is the term in the page title? Is it a heading? What type of heading? Is it bold? Is it emphasized? Is it in part of a list? Is it in link text?
Words which are in a heading or are set apart from normal text in other ways may be given additional weighting in many search algorithms.
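One way to picture this is an index entry that records where a word occurs and how it is marked up. The sketch below is an assumption-heavy illustration: the `Posting` record, the context labels, and the numbers in `CONTEXT_WEIGHT` are all hypothetical, since real engines do not publish their weights.

```python
from dataclasses import dataclass

@dataclass
class Posting:
    doc_id: int
    position: int  # word offset within the document
    context: str   # "title", "heading", "bold", "link", or "body"

# Hypothetical weights chosen only to illustrate the idea.
CONTEXT_WEIGHT = {"title": 3.0, "heading": 2.0, "bold": 1.5, "link": 1.5, "body": 1.0}

def weighted_count(postings, doc_id):
    """Sum the context weights of a term's occurrences in one document."""
    return sum(CONTEXT_WEIGHT[p.context] for p in postings if p.doc_id == doc_id)

# Occurrences of one term across two documents.
swim_postings = [
    Posting(doc_id=1, position=0, context="title"),
    Posting(doc_id=1, position=12, context="body"),
    Posting(doc_id=2, position=40, context="body"),
]
doc1_score = weighted_count(swim_postings, 1)  # title + body occurrence
doc2_score = weighted_count(swim_postings, 2)  # body occurrence only
```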
Keep in mind, it may be an unnatural pattern for your keyword phrases to appear many times in bold and headings without occurring in any of the regular textual body copy.
If a page looks like it is aligned too perfectly with a topic (i.e., overly-focused so as to have an abnormally high keyword density), then that page may get a lower relevancy score than a page with a lower keyword density and more natural page copy.
Proximity
By storing where the terms occur, search engines can understand how close one term is to another. Generally, the closer the terms are together, the more likely the page with matching terms will satisfy your query.
If you only use an important group of words on the page once, try to make sure they are close together or right next to each other. If words also occur naturally sprinkled throughout the copy many times, you do not need to try to rewrite the content to always have the words next to one another. Natural sounding content is best.
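Because word positions are stored, the distance between two terms is cheap to compute. The following sketch assumes simple whitespace tokenization; `positions` and `min_distance` are illustrative helpers, and the two sample sentences are invented.

```python
def positions(text, term):
    """Word offsets at which a term occurs in the text."""
    return [i for i, word in enumerate(text.lower().split()) if word == term]

def min_distance(positions_a, positions_b):
    """Smallest gap, in words, between any occurrences of two terms.

    Returns None when either term is missing from the document.
    """
    if not positions_a or not positions_b:
        return None
    return min(abs(a - b) for a in positions_a for b in positions_b)

tight = "cheap flights to paris from london"
loose = "cheap hotels near the airport and flights that are late"

tight_gap = min_distance(positions(tight, "cheap"), positions(tight, "flights"))
loose_gap = min_distance(positions(loose, "cheap"), positions(loose, "flights"))
```

For the query "cheap flights", the first sentence has the terms adjacent while the second has them several words apart, so a proximity-aware algorithm would likely favor the first.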
Stop Words
Words which are common do not help search engines understand documents. Exceptionally common terms, such as "the", are called stop words. While search engines index stop words, they are not typically used or weighted heavily to determine relevancy in search algorithms. If I search for "the Cat in the Hat", search engines may insert wildcards for the words "the" and "in", so my search will look like '*cat**hat'.
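The wildcard substitution can be sketched in a few lines. The `STOP_WORDS` set here is a tiny illustrative list, not any engine's actual list, and `to_pattern` is a made-up helper that separates tokens with spaces for readability.

```python
STOP_WORDS = {"the", "in", "a", "an", "and", "or", "of"}  # small illustrative list

def to_pattern(query):
    """Replace stop words with '*' wildcards, keeping the other terms."""
    words = query.lower().split()
    return " ".join("*" if w in STOP_WORDS else w for w in words)

pattern = to_pattern("the Cat in the Hat")  # "* cat * * hat"
```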
Index Normalization
Each page is standardized to a size. This prevents longer pages from having an unfair advantage by using a term many more times throughout long page copy.
This also prevents short pages from scoring arbitrarily high by having a high percentage of their page copy composed of a few keyword phrases. Thus, there is no magical page copy length which is best for all search engines.
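As a rough illustration of length normalization, the sketch below scales a raw term count by how long the page is relative to the average page. The formula is borrowed from the BM25 family of ranking functions; whether any given search engine uses this exact form is an assumption, and the parameter value `b=0.75` is a conventional default rather than a known setting.

```python
def normalized_tf(count, doc_len, avg_len, b=0.75):
    """Scale a raw term count by document length relative to the average.

    b controls how strongly length is penalized (0 = not at all, 1 = fully).
    """
    return count / (1 - b + b * doc_len / avg_len)

AVG_LEN = 500  # assumed average page length in words

# A page four times longer uses the term four times as often...
average_page = normalized_tf(count=5, doc_len=500, avg_len=AVG_LEN)
long_page = normalized_tf(count=20, doc_len=2000, avg_len=AVG_LEN)
# ...but after normalization its advantage is far smaller than 4x.
advantage = long_page / average_page
```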
The uniqueness of page content is far more important than the length. The three best purposes for page copy are the following:
- To be unique enough to get indexed and ranked in the search results
- To create content that people find interesting enough to want to link to
- To convert site visitors to subscribers, buyers, or people who click on ads
Not every page is going to make sales or be compelling enough to link to, but if, in aggregate, many of your pages are of high quality over time, it will help boost the rankings of nearly every page on your site.
Keyword Density, Term Frequency & Term Weight
Term frequency (TF) is a weighted measure of how often a term appears in a document. Terms which are frequently occurring within a document are thought to be some of the more important terms for that document.
If a word appears in every (or many) documents, then it tells you little about how to discern between documents. Words which appear often have little to no discrimination value, which is why many search engines ignore common stop words (like "the", "and", and "or").
Rare terms, which only appear in a few or limited number of documents, have a much higher signal-to-noise ratio. They are much more likely to tell you what a document is about.
Inverse document frequency (IDF) can be used to further discriminate the value of term frequency to account for how common terms are across a corpus of documents. Terms which are in a limited number of documents will likely tell you more about those documents than terms which are scattered throughout many documents.
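A small worked example may help show how TF and IDF combine. This sketch uses one common textbook formulation (term count divided by document length, multiplied by the log of total documents over documents containing the term); real engines use more elaborate variants, and the three sample documents are invented. It assumes every document contains at least one word.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for words in tokenized for term in set(words))
    weights = []
    for words in tokenized:
        counts = Counter(words)
        weights.append({
            term: (count / len(words)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
weights = tf_idf(docs)
```

In the first document, "the" occurs in every document and so gets a weight of zero, while "mat" occurs only there and outweighs "cat", which also appears in another document.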
When people measure keyword density, they are generally missing some other important factors in information retrieval such as IDF, index normalization, word proximity, and how search engines account for the various element types. (Is the term bolded, in a header, or in a link?)
Search engines may also use technologies like latent semantic indexing to mathematically model the concepts of related pages. Google is scanning millions of books from university libraries. As much as that process is about helping people find information, it is also used to help Google understand linguistic patterns.
If you artificially write a page stuffed with one keyword or keyword phrase without adding many of the phrases that occur in similar natural documents you may not show up for many of the related searches, and some algorithms may see your document as being less relevant.
The key is to write naturally (using various related terms) and structure the page well.
Multiple Reverse Indexes
Search engines may use multiple reverse indexes for different content. Most current search algorithms tend to give more weight to page title and link text than page copy.
For common broad queries, search engines may be able to find enough quality matching documents using link text and page title without needing to spend the additional time searching through the larger index of page content. Anything that saves computer cycles without sacrificing much relevancy is something you can count on search engines doing.
After the most relevant documents are collected, they may be re-sorted based on interconnectivity or other factors.
Around 50% of search queries are unique, and with longer unique queries, there is greater need for search engines to also use page copy to find enough relevant matching documents (since there may be inadequate anchor text to display enough matching documents).
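The tiered lookup described in this section might look something like the sketch below. It is only an illustration of the idea: the index contents, the `.example` host names, and the `min_results` cutoff are all hypothetical.

```python
def tiered_search(term, title_index, body_index, min_results=2):
    """Check the small title/link-text index first; fall back to page copy."""
    hits = list(title_index.get(term, []))
    if len(hits) < min_results:
        # Not enough matches in the cheap index, so consult the larger one.
        hits += [d for d in body_index.get(term, []) if d not in hits]
    return hits

# Small index built from page titles and link text.
title_index = {"hotels": ["a.example", "b.example", "c.example"]}
# Larger index built from page copy.
body_index = {
    "hotels": ["a.example", "d.example"],
    "bedbugs": ["d.example", "e.example"],
}

broad = tiered_search("hotels", title_index, body_index)
rare = tiered_search("bedbugs", title_index, body_index)
```

The broad query is answered entirely from the small index, while the rarer query has to fall through to the page copy index.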
Search Interface
The search algorithm and search interface are used to find the most relevant documents in the index based on the search query. First the search engine tries to determine user intent by looking at the words the searcher typed in.
These terms can be stripped down to their root level (e.g., dropping "-ing" and other suffixes) and checked against a lexical database to see what concepts they represent. WordNet is the most popular lexical database. Terms which are a near match will help you rank for other similarly related terms. For example, using the word "swims" could help you rank well for "swim" or "swimming".
Search engines can try to match keyword vectors with each of the specific terms in a query. If the search terms occur near each other frequently, the search engine may understand the phrase as a single unit and return documents related to that phrase. At the end of this chapter there is a link to a Porter Stemmer tool if you need help conceptualizing how stemming works.
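To get a feel for stemming without a tool, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which has many more rules; `naive_stem` is an invented helper that handles only a few suffixes and will mangle plenty of English words.

```python
def naive_stem(word):
    """Strip a few common English suffixes. This is NOT the Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant: "swimm" -> "swim".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiouls":
        word = word[:-1]
    return word

stems = {w: naive_stem(w) for w in ("swim", "swims", "swimming")}
```

All three words reduce to the same root, so a page using one form can match a query using another.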
Searcher Feedback
Some search engines, such as Google and Yahoo!, have toolbars and systems like Google Search History and My Yahoo! which collect information about a user. Search engines can also look at recent searches or what the search process was for similar users to help determine what concepts a searcher is looking for and what documents are most relevant for the user's need.
As people use such a system it takes time to build up a search query history and a click-through profile. That profile could eventually be trusted and used to:
- aid in search personalization;
- collect user feedback to determine how well an algorithm is working;
- and help search engines determine if a document is of decent quality (e.g., if many users visit a document and then immediately hit the back button, the search engines may not continue to score that document well for that query).
If a high-ranked page never gets clicked on, or if people typically quickly press the back button, that page may get demoted in the search results for that query (and possibly related search queries). In some cases that may also flag a page or website for manual review.
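The quick-back-button signal could be approximated along these lines. Everything here is an assumption made for illustration: the function name `should_demote`, the five-second cutoff, the minimum visit count, and the 80% threshold are invented, and no search engine has published such values.

```python
def should_demote(dwell_seconds, min_visits=20, quick_back=5, threshold=0.8):
    """Flag a result when most searchers return to the results page quickly.

    dwell_seconds holds, per visit, how long the searcher stayed on the page.
    """
    if len(dwell_seconds) < min_visits:
        return False  # too little data to trust the signal
    quick = sum(1 for s in dwell_seconds if s < quick_back)
    return quick / len(dwell_seconds) >= threshold

mostly_bounces = [2] * 18 + [60] * 2  # 90% of visitors leave within seconds
mixed = [2] * 10 + [60] * 10          # half of visitors stay
too_few = [2] * 5                     # not enough visits to judge

flags = (should_demote(mostly_bounces), should_demote(mixed), should_demote(too_few))
```

The minimum visit count reflects the point made above that a click-through profile takes time to build before it can be trusted.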
As people give search engines more feedback and as search engines collect a larger corpus of data, it will become much harder to rank well using only links. The more satisfied users are with your site, the better your site will do as search algorithms continue to advance.
Real-Time versus Prior-to-Query Calculations
In most major search engines, a portion of the relevancy calculations are stored ahead of time. Some of them are calculated in real time.
Some things which are computationally expensive and slow processes, such as calculating overall inter-connectivity (Google calls this PageRank), are done ahead of time.
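To show why this kind of calculation is done ahead of time, here is a minimal sketch in the spirit of the published PageRank formulation. It repeatedly redistributes scores across a link graph until they settle. The four-page graph, the damping value of 0.85, and the iteration count are assumptions for illustration; Google's production system is not public.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outbound links spreads its rank evenly.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: "c" has the most inbound links, "d" has none.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

Each pass touches every link in the graph, and many passes are needed. On a web-scale graph that is far too slow to run per query, which is why the scores are computed in advance and stored.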
Many search engines have different data centers, and when updates occur, they roll from one data center to the next. Data centers are placed throughout the world to minimize network lag time. Assuming they are not overloaded or down for maintenance, you will usually get search results from the data centers nearest you. If those data centers are down or if they are experiencing heavy load, your search query might be routed to a different data center.
Search Algorithm Shifts
Search engines such as Google and Yahoo! may update their algorithm dozens of times per month. When you see rapid changes in your rankings it is usually due to an algorithmic shift, a search index update, or something else outside of your control. SEO is a marathon, not a sprint, and some of the effects take a while to kick in.
Usually, if you change something on a page, it is not reflected in the search results that same day. Linkage data also may take a while to have an effect on search relevancy as search engines need to find the new links before they can evaluate them, and some search algorithms may trust links more as the links age.
The key to SEO is to remember that rankings are always changing, but the more you build legitimate signals of trust and quality the more often you will come out on top.
Relevancy Wins Distribution!
The more times a search leads to desired content, the more likely a person is to use that search engine again. If a search engine works well, a person does not just come back, they also tell their friends about it, and they may even download the associated toolbar. The goal of all major search engines is to be relevant. If they are not, they will fade (as many already have).
Search Engine Business Model
Search engines make money when people click on the sponsored advertisements. In the search result below you will notice that both Viagra and Levitra are bidding on the term Viagra. The area off to the right displays sponsored advertisements for the term Viagra. Google gets paid whenever a searcher clicks on any of the sponsored listings.
The white area off to the left displays the organic (free) search results. Google does not get paid when people click on these. Google hopes to make it hard for search engine optimizers to manipulate these results to:
- keep relevancy as high as possible;
- and encourage people to buy ads.