How Search Engines Work

Friday, December 10, 2010

As we said earlier, search engines use so-called "spiders." When the Google spider, for example, looks at an HTML page, it takes note of two things: the words on the page and where those words are found.

The web page is normally divided into sections: the title, subtitles, meta tags, and other positions of relative importance, which are noted for special consideration during a subsequent user search. The Google spider was built to index every significant word on a page, leaving out the articles "a," "an," and "the." Other spiders take different approaches.
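The idea of recording not just each word but where it was found can be sketched with Python's standard html.parser. This is a minimal illustration of the technique, not any real engine's code; the choice of sections and the sample page are invented for the example.

```python
# Minimal sketch: pair each word with the section it was found in.
from html.parser import HTMLParser

class SectionIndexer(HTMLParser):
    """Records each word together with the section it was found in."""
    SECTIONS = {"title", "h1", "h2", "h3", "a"}

    def __init__(self):
        super().__init__()
        self.current = "body"   # default section for ordinary text
        self.words = []         # list of (word, section) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.SECTIONS:
            self.current = tag
        elif tag == "meta":
            # meta keywords live in attributes, not in text content
            d = dict(attrs)
            if d.get("name", "").lower() == "keywords":
                for w in d.get("content", "").replace(",", " ").split():
                    self.words.append((w.lower(), "meta"))

    def handle_endtag(self, tag):
        if tag in self.SECTIONS:
            self.current = "body"

    def handle_data(self, data):
        for w in data.split():
            self.words.append((w.lower(), self.current))

page = ("<html><head><title>Planets</title></head>"
        "<body><h2>Mars</h2><p>Mars is red.</p></body></html>")
p = SectionIndexer()
p.feed(page)
# p.words now pairs each word with its section, e.g. ('planets', 'title')
```

A real spider would handle nested tags, encodings, and malformed HTML far more carefully; the point here is only that position information travels along with each word.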

These different approaches usually attempt to make the spider operate faster, allow users to search more efficiently, or both. For example, some spiders will keep track of the words in the title, sub-headings, and links, along with the 100 most frequently used words on the page and each word in the first 20 lines of text. Lycos is said to use this approach to spidering the Web.
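The selection rule described above can be sketched in a few lines. This is a hedged illustration of the idea, not Lycos's actual algorithm; the plain-text input and the helper name are assumptions for the example.

```python
# Sketch: keep the top-N most frequent words plus every word in the
# first few lines of text (the Lycos-style selection described above).
from collections import Counter

def select_words(text, top_n=100, first_lines=20):
    words = text.lower().split()
    # the N most frequent words anywhere on the page
    top = {w for w, _ in Counter(words).most_common(top_n)}
    # every word in the first `first_lines` lines
    head = set(" ".join(text.lower().splitlines()[:first_lines]).split())
    return top | head

sample = "mars mars venus\nearth orbits the sun\n"
chosen = select_words(sample, top_n=2)
```

The trade-off is clear even in the sketch: storing fewer words makes the index smaller and faster, at the cost of possibly missing a term a user later searches for.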

Other search engines, like AltaVista, go in the opposite direction, indexing every single word on a page, including "a," "an," "the," and other "insignificant" words. The push toward completeness in this approach is matched by other systems in the attention given to the unseen portion of the Web page: the meta tags, which allow the owner of a page to specify the keywords and concepts under which the page will be indexed.

This can be helpful, especially in cases where the words on the page might have double or multiple meanings; the meta tags can guide the search engine in choosing which of the several possible meanings is correct.

There is, however, a danger in over-reliance on meta tags, because a careless or dishonest page owner might add meta tags that fit very popular topics but have nothing to do with the actual contents of the page.

To protect against this, spiders correlate meta tags with page content, rejecting any meta tags that don't match the words on the page.
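The correlation check described above can be as simple as testing whether each claimed keyword actually occurs in the visible text. This is a minimal sketch under that assumption; real engines use far subtler signals.

```python
# Sketch: accept a meta keyword only if it appears in the page text.
def filter_meta_keywords(meta_keywords, page_text):
    page_words = set(page_text.lower().split())
    return [kw for kw in meta_keywords if kw.lower() in page_words]

kept = filter_meta_keywords(
    ["recipes", "casino"],
    "easy dinner recipes for busy weeknights",
)
# "casino" is rejected because it never appears in the page text
```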

So far, we have assumed that the owner of a page actually wants it to be included in the results of a search engine's activities. Often, though, the page's owner doesn't want it showing up on a major search engine, or doesn't want a spider accessing the page at all; consider, for example, a game that builds new, active pages each time sections of the page are displayed or new links are followed.

If a Web spider accesses one of these pages, and begins following all of the links for new pages, the game could mistake the activity for a high-speed human player and spin out of control. To avoid situations like this, the robot exclusion protocol was developed.

This protocol, implemented in the meta-tag section at the beginning of a Web page, tells a spider to leave the page alone: to neither index the words on the page nor try to follow its links.
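A polite spider can honor the meta-tag form of the protocol by scanning the page head for a robots tag before indexing. The sketch below checks only for `<meta name="robots" ...>` directives; a full crawler would also consult the site's robots.txt file, which this example leaves out.

```python
# Sketch: read "noindex" / "nofollow" directives from a robots meta tag.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "meta" and d.get("name", "").lower() == "robots":
            for part in d.get("content", "").split(","):
                self.directives.add(part.strip().lower())

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
p = RobotsMetaParser()
p.feed(page)
may_index = "noindex" not in p.directives
may_follow = "nofollow" not in p.directives
```

With this page, both flags come back false, so a well-behaved spider would neither store the page's words nor walk its links.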

The lists the spiders compile are built into the search engine's index, which allows users to look for words or combinations of words. Early search engines held an index of a few hundred thousand pages and documents, and received maybe one or two thousand inquiries each day.

Today, a top search engine will index hundreds of millions of pages, and respond to tens of millions of queries per day.

Once the spiders have completed the task of finding information on Web pages, the search engine must store the information in a way that makes it useful.

There are two key components involved in making the gathered data accessible to users: the information stored with the data, and the method by which the information is indexed. In the simplest case, a search engine could just store each word and the URL where it was found.

In reality, this would make for an engine of limited use, since there would be no way of telling whether the word was used in an important or a trivial way on the page, whether the word was used once or many times, or whether the page contained links to other pages containing the word.

In other words, there would be no way of building the ranking list that tries to present the most useful pages at the top of the list of search results.

To make for more useful results, most search engines store more than just the word and URL. An engine might store the number of times that the word appears on a page. The engine might assign a weight to each entry, with increasing values assigned to words as they appear near the top of the document, in sub-headings, in links, in the meta tags or in the title of the page.
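The weighting scheme described above can be sketched as an inverted index whose entries carry a count and a weight. The section weights below are invented for illustration; no real engine's formula is public, and each engine tunes its own.

```python
# Sketch: an inverted index entry stores the URL, how often the word
# appears, and a weight that grows when the word sits in an
# "important" section. Section weights here are made up for the demo.
from collections import defaultdict

SECTION_WEIGHT = {"title": 5.0, "meta": 3.0, "h2": 2.0, "link": 2.0, "body": 1.0}

def index_page(index, url, words_with_sections):
    """words_with_sections: iterable of (word, section) pairs."""
    for word, section in words_with_sections:
        entry = index[word].setdefault(url, {"count": 0, "weight": 0.0})
        entry["count"] += 1
        entry["weight"] += SECTION_WEIGHT.get(section, 1.0)

index = defaultdict(dict)
index_page(index, "http://example.com/mars",
           [("mars", "title"), ("mars", "body"), ("red", "body")])
# "mars" appears twice, once in the title, so its weight (6.0)
# outscores "red" (1.0) even before any link analysis
```

Ranking a query then becomes a matter of sorting the URLs under each query word by their accumulated weights, which is exactly why the same search produces different orderings on different engines.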

Each search engine has a different formula for assigning weight to the words in its index. This is one of the reasons that a search for the same word on different search engines will produce different lists, with the pages presented in different orders. 

eddie@afrowebs.com