
Sunday, November 28, 2010

"How Search Engines Work”

The basic fact about the Internet and its front end, the World Wide Web (www), is that there are hundreds of millions of web pages available, waiting to present information on a wide variety of topics.

The bad news is that most of those pages are titled according to the whim of their writers, and almost all of them sit on servers with obscure names.

When you need to know about a particular subject, how do you know which pages to read? If you’re like most people, you visit an Internet search engine.

Internet search engines are special sites on the Web that are designed to help people find information stored on other sites. There are differences in the ways various search engines work, but they all operate in a broadly similar way.

When we talk about Internet search engines, we actually mean the World Wide Web search engines. Before the Web became the most visible part of the
Internet, there were already search engines in place to help people find information on the Net.

Programs with names like “Gopher” and “Archie” kept indexes of files stored on servers connected to the Internet and dramatically reduced the amount of time required to find programs and documents.

In the early days of the Internet, getting serious value from it meant knowing how to use Gopher, Archie, Veronica and the like.

Today, however, most Internet users limit their searches to the Web, so I am going to concentrate on search engines that focus on the contents of Web pages.

Before a search engine can tell you where a file or document is, it must first find it.

To find information on the hundreds of millions of Web pages that exist, a search engine employs special software robots, called spiders, to build lists of the words found on Web sites. When a spider is building its lists, the process is called Web crawling.
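To make that concrete, here is a rough Python sketch of the word-listing step: fetch a page, strip out the HTML tags, and keep the words that are left. The URL is a made-up example, and a real spider would also respect robots.txt, handle encodings properly and cope with many more error cases.

import re
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of a page, ignoring script and style blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def words_on_page(url):
    """Fetch a page and return the list of lower-cased words found on it."""
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.parts)
    return re.findall(r"[a-z0-9]+", text.lower())

# Example (hypothetical URL):
# print(words_on_page("https://example.com/")[:20])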

In order to build and maintain a useful list of words, a search engine’s spiders have to look at a lot of pages.

So how does a spider start its travels over the Web? The usual starting points are lists of heavily used servers and very popular pages.

The spider will begin with a popular site, indexing the words on its pages and following every link found within the site.

In this way, the spidering system quickly begins to travel, spreading out across the most widely used portions of the Web, much as a real spider moves across its web.
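Here is a rough Python sketch of that crawling loop: start from a handful of seed pages, index each one, and follow every link found on it. The seed list, the page limit and the index_page stub are illustrative assumptions, not how any real search engine does it.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.request

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def index_page(url, html):
    """Placeholder for the word-listing step sketched earlier."""
    pass

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: visit the seeds first, then the pages they link to."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    crawled = 0
    while frontier and crawled < max_pages:
        url = frontier.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                        # unreachable pages are simply skipped
        crawled += 1
        index_page(url, html)               # record the words found on this page
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)   # resolve relative links against the page URL
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)

# Example (hypothetical seeds):
# crawl(["https://example.com/", "https://example.org/"], max_pages=50)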

These spiders take a Web page’s content and create key search words that enable online users to find the pages they’re looking for.
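One simple way to picture those “key search words” is an inverted index: a table that maps each word to the pages it appears on, so a query can be answered by looking the word up. The sample pages and words below are invented for illustration.

from collections import defaultdict

def build_index(pages):
    """pages: {url: list of words on that page} -> {word: set of urls}."""
    index = defaultdict(set)
    for url, words in pages.items():
        for word in words:
            index[word].add(url)
    return index

pages = {
    "https://example.com/spiders": ["search", "engine", "spider", "crawl"],
    "https://example.com/dns":     ["dns", "server", "name", "address"],
}
index = build_index(pages)
print(index["spider"])   # -> {'https://example.com/spiders'}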

Google began as an academic search engine. In the paper that describes how the system was built, Sergey Brin and Lawrence Page give an example of how quickly their spiders can work.

They built their initial system to use multiple spiders, usually three at one time.

Each spider could keep about 300 connections to Web pages open at a time. At its peak performance, using four spiders, their system could crawl over 100 pages per second, generating around 600 kilobytes of data per second.
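Those figures imply a few numbers worth working out: four spiders at 300 connections each means roughly 1,200 connections open at once, 600 kilobytes per second spread over 100 pages per second comes to about 6 kilobytes per page, and 100 pages per second is more than eight million pages a day. These are back-of-the-envelope calculations from the paper’s figures, not measurements of the real system.

spiders = 4
connections_per_spider = 300
pages_per_second = 100
data_per_second_kb = 600

open_connections = spiders * connections_per_spider       # 1200 connections
avg_page_size_kb = data_per_second_kb / pages_per_second   # ~6 KB per page
pages_per_day = pages_per_second * 60 * 60 * 24            # 8,640,000 pages/day

print(open_connections, avg_page_size_kb, pages_per_day)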

Keeping everything running quickly meant building a system to feed necessary information to the spiders.

The early Google system had a server dedicated to providing URLs to the spiders.
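In outline, that URL server is just a shared queue: one component hands out addresses, and each spider repeatedly takes the next one and crawls it. A rough Python sketch, with made-up seed URLs and a stand-in crawl_one function:

import queue
import threading

url_queue = queue.Queue()

def url_server(seed_urls):
    """Feed the spiders by putting URLs to crawl onto a shared queue."""
    for url in seed_urls:
        url_queue.put(url)

def crawl_one(url):
    print("crawling", url)       # stand-in for the real fetch-and-index steps

def spider_worker():
    """Each worker repeatedly asks the queue for the next URL and crawls it."""
    while True:
        url = url_queue.get()
        if url is None:          # a None entry tells the worker to stop
            break
        crawl_one(url)
        url_queue.task_done()

url_server(["https://example.com/", "https://example.org/"])
workers = [threading.Thread(target=spider_worker) for _ in range(3)]
for w in workers:
    w.start()
url_queue.join()                 # wait until every queued URL has been crawled
for _ in workers:
    url_queue.put(None)          # shut the workers down
for w in workers:
    w.join()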

Rather than depending on an Internet service provider for the domain name server (DNS) that translates a server’s name into an address, Google had its own DNS, in order to keep delays to a minimum.
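The benefit of keeping DNS close to the crawler is easy to see in miniature: look each hostname up once and reuse the answer, instead of paying for a fresh lookup on every request. Here is a small sketch using Python’s standard resolver; Google’s own arrangement was of course far more elaborate than this.

import socket
from urllib.parse import urlparse

_dns_cache = {}

def resolve(url):
    """Return the IP address for a URL's host, caching the lookup."""
    host = urlparse(url).hostname
    if host not in _dns_cache:
        _dns_cache[host] = socket.gethostbyname(host)   # one network round trip
    return _dns_cache[host]                             # later calls reuse the answer

# Example (hypothetical URLs): both calls resolve example.com only once.
# resolve("https://example.com/page1")
# resolve("https://example.com/page2")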

eddie@afrowebs.com