How search engines work

This section deals with background details only and does not give any practical tips on how to optimise your pages for search engines.

For that, see Search engines - getting your content found.


What is a search engine?

A search engine is a coordinated set of programs (a short code sketch of all three parts follows the list) that includes:

  • A spider (also called a "crawler" or a "bot") that visits every page, or representative pages, on every website that wants to be searchable and reads them, using the hypertext links on each page to discover and read the site's other pages
  • A program that creates a huge index (sometimes called a "catalogue") from the pages that have been read
  • A program that receives your search request, compares it to the entries in the index, and returns results to you
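
How these three parts fit together can be shown in miniature. The Python sketch below uses an invented in-memory "web" in place of real sites: the spider reads pages and follows their links, the indexer builds a word-to-pages catalogue, and the search function matches a request against that catalogue. All page names and text are made up for illustration.

    # 1. The "web": each page has some text and links to other pages.
    PAGES = {
        "home": {"text": "welcome to our site about gardening", "links": ["roses", "tools"]},
        "roses": {"text": "growing roses in clay soil", "links": ["tools"]},
        "tools": {"text": "essential gardening tools and equipment", "links": []},
    }

    def spider(start_page):
        """Visit a page, read it, then follow its links to discover other pages."""
        seen, queue, found = set(), [start_page], {}
        while queue:
            page = queue.pop(0)
            if page in seen:
                continue
            seen.add(page)
            found[page] = PAGES[page]["text"]      # "read" the page
            queue.extend(PAGES[page]["links"])     # follow the hypertext links
        return found

    def build_index(found_pages):
        """Build the index (catalogue): a map from each word to the pages containing it."""
        index = {}
        for page, text in found_pages.items():
            for word in text.split():
                index.setdefault(word, set()).add(page)
        return index

    def search(index, query):
        """Compare the search request to the index entries and return matching pages."""
        results = None
        for word in query.lower().split():
            matches = index.get(word, set())
            results = matches if results is None else results & matches
        return sorted(results or [])

    index = build_index(spider("home"))
    print(search(index, "gardening tools"))   # -> ['tools']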

An alternative to using a search engine is to explore a structured directory of topics. Yahoo, which also lets you use its search engine, is the most widely used directory on the web. A number of web portal sites offer both the search engine and directory approaches to finding information.


Crawler-based search engines

Crawler-based search engines, such as Google, create their listings automatically. They "crawl" or "spider" the web, then people search through what they have found.

If you change your web pages, crawler-based search engines eventually find these changes, and that can affect how you are listed. Page titles, body copy and other elements all play a role.

Crawler-based search engines have three major elements. First is the spider, also called the crawler. The spider visits a web page, reads it, and then follows links to other pages within the site. This is what it means when someone refers to a site being "spidered" or "crawled." The spider returns to the site on a regular basis, such as every month or two, to look for changes.
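
As a rough illustration of this step, here is a minimal single-site spider in Python using only the standard library. The start address is just a placeholder, and a real spider would also honour robots.txt, rate limits and a revisit schedule.

    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collect the href targets of <a> tags as a page is parsed."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl_site(start_url, max_pages=10):
        """Visit pages breadth-first, staying within the start page's site."""
        host = urlparse(start_url).netloc
        seen, queue, pages = set(), [start_url], {}
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            if url in seen or urlparse(url).netloc != host:
                continue
            seen.add(url)
            html = urlopen(url).read().decode("utf-8", errors="replace")
            pages[url] = html                  # keep a copy for the indexer
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:          # follow links to other pages
                queue.append(urljoin(url, link))
        return pages

    pages = crawl_site("https://example.com/")   # placeholder address
    print(len(pages), "pages crawled")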

Everything the spider finds goes into the second part of the search engine, the index. The index, sometimes called the catalogue, is like a giant book containing a copy of every web page that the spider finds. If a web page changes, then this book is updated with new information.
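
The "giant book" can be modelled as a simple data structure. In the Python sketch below, the index keeps a copy of each page alongside a word-to-pages lookup, and re-indexing a changed page replaces its old entries, just as the book is updated with new information. The page name and text are invented.

    class Index:
        def __init__(self):
            self.copies = {}   # url -> stored copy of the page text
            self.words = {}    # word -> set of urls containing that word

        def add(self, url, text):
            """Add a page, first removing stale entries from any earlier version."""
            if url in self.copies:
                for word in self.copies[url].split():
                    self.words[word].discard(url)
            self.copies[url] = text
            for word in text.split():
                self.words.setdefault(word, set()).add(url)

    index = Index()
    index.add("page1", "roses need full sun")
    index.add("page1", "roses need partial shade")   # the page changed
    print(index.words.get("sun"))     # set() - the stale entry is gone
    print(index.words.get("shade"))   # {'page1'}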

Sometimes it can take a while for new pages or changes that the spider finds to be added to the index. Thus, a web page may have been "spidered" but not yet "indexed." Until it is indexed -- added to the index -- it is not available to those searching with the search engine.


Human-powered directories

A human-powered directory, such as the Open Directory, depends on humans for its listings. You submit a short description to the directory for your entire site, or editors write one for sites they review. A search looks for matches only in the descriptions submitted.
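
Because matches are sought only in the submitted descriptions, directory search can be sketched very simply. The entries below are invented; the point is that the pages themselves are never consulted.

    # A directory maps each site to the short description submitted for it.
    directory = {
        "https://example-garden.test": "Tips and guides for home gardeners.",
        "https://example-roses.test": "A society for rose growers, with care guides.",
    }

    def directory_search(query):
        """Return sites whose description contains every word of the query."""
        words = query.lower().split()
        return [site for site, desc in directory.items()
                if all(word in desc.lower() for word in words)]

    print(directory_search("rose care"))   # -> ['https://example-roses.test']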


"Hybrid search engines" or mixed results

In the web's early days, a search engine typically presented either crawler-based results or human-powered listings. Today it is extremely common for both types of results to be presented. Usually, a hybrid search engine will favour one type of listing over the other. For example, MSN Search is more likely to present human-powered listings from LookSmart. However, it does also present crawler-based results (as provided by Inktomi), especially for more obscure queries.
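
Real blending policies vary and are not published, but one plausible reading of the behaviour described above is sketched below: directory listings are shown first, and crawler results fill in the rest, which means crawler results dominate whenever the directory has little or nothing for an obscure query.

    def hybrid_results(directory_results, crawler_results, limit=10):
        """Favour directory listings, then pad with crawler results not already shown."""
        merged = list(directory_results)
        for url in crawler_results:
            if url not in merged:
                merged.append(url)
        return merged[:limit]

    # An invented example: one directory listing, two crawler results.
    print(hybrid_results(["https://dir.example/gardening"],
                         ["https://crawl.example/a", "https://dir.example/gardening"]))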


Different search engine approaches

  • Major search engines index the content of a large portion of the web and provide results that can run to many pages - and consequently overwhelm the user.
  • Specialised content search engines are selective about what part of the web is crawled and indexed. They provide a shorter but more focused list of results.
  • 'Ask' provides a general search of the web but allows you to enter a search request in natural language, such as "What's the weather in Seattle today?"
  • Special tools and some major websites such as Yahoo let you use a number of search engines at the same time and compile the results for you in a single list (sketched below).
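
That last approach, often called meta-search, can be sketched as follows. The two "engines" here are stand-ins that return canned lists, since each real engine has its own interface; the compiler simply merges their results and drops duplicates.

    def meta_search(query, engines):
        """Ask every engine for results and compile them into a single list."""
        compiled, seen = [], set()
        for engine in engines:
            for url in engine(query):      # each engine returns a ranked list
                if url not in seen:        # drop duplicates across engines
                    seen.add(url)
                    compiled.append(url)
        return compiled

    # Two invented engines returning canned results for illustration.
    engine_a = lambda q: ["https://a.example/1", "https://shared.example/x"]
    engine_b = lambda q: ["https://shared.example/x", "https://b.example/2"]
    print(meta_search("weather in Seattle", [engine_a, engine_b]))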
