Web Information Retrieval, Search Engines and Web Crawlers

The role that accurate and factual information plays in the current age cannot be overemphasized. Proper and accurate information is central to every decision-making process, and it is no exaggeration to say that the survival of virtually every field depends on information.

However, with the vast amount of data readily available from many sources, the Internet hosting the largest share of it, there is a need to retrieve only the information that is actually required from everything that is available.

Information retrieval is thus the activity of obtaining information resources relevant (or satisfactory) to an information need from a collection of information resources. 

A vast collection of electronic information is now available on the Internet, published on different websites and pages, covering large geographical areas and offering ease of access, use and consistency. The World Wide Web (WWW), as a global distributed information repository, has become the largest data source in today’s world.

Web Information Retrieval is therefore a technology for helping users to accurately, quickly and easily find information on the web. An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. With the proliferation of huge amounts of (heterogeneous) data on the Web, the importance of information retrieval (IR) has grown considerably over the last few years. 

The Internet has over 90 million domains and over 70 million personal blogs, viewed by over 1 billion people around the world. Ironically, the very size of this collection has become an obstacle to easy information retrieval. As the Internet constantly expands, the amount of available online information expands as well, and a user has to sift through scores of pages to come upon the information he or she desires. In this vast space of information it is important to create order, even if not on a global scale. This may be done either by building a classification catalogue or by building a search engine, and both require a web crawling tool to ease the burden of manual data processing. The question of how to efficiently find, gather and retrieve this information has led to the research and development of systems and tools that attempt to provide a solution to this problem.

A common tool used on the Internet for information retrieval is the search engine. A search engine is an information retrieval system designed to help find information stored on a computer system. Its most visible form is the web search engine, which searches for information on the World Wide Web. Search engines provide an interface to a collection of items that enables the user to specify criteria about an item of interest and have the engine find the matching items. Search engines are, however, only able to do their job with the aid of web crawlers; web crawlers are the heart of a search engine.

A web crawler is a program or automated script that browses the World Wide Web in a methodical, automated manner, which makes it an essential component of web search engines. Crawlers are used to collect the corpus of web pages indexed by the search engine, and they are also used in many other applications that process large numbers of web pages, such as web data mining and comparison shopping engines.

Web crawlers start with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier.  In order to crawl a substantial fraction of the “surface web” in a reasonable amount of time, web crawlers must download thousands of pages per second, and are typically distributed over tens or hundreds of computers. Their two main data structures – the “frontier” set of yet-to-be-crawled URLs and the set of discovered URLs – typically do not fit into main memory, so efficient disk-based representations need to be used. Finally, the need to be “polite” to content providers and not to overload any particular web server, and a desire to prioritize the crawl towards high-quality pages and to maintain corpus freshness impose additional engineering challenges.  
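
As a concrete illustration of these ideas, the following is a minimal, single-threaded crawler sketch in Python (standard library only): it keeps a frontier of yet-to-be-crawled URLs seeded from a starting list, records every discovered URL, extracts the hyperlinks from each fetched page, and pauses between requests as a crude form of politeness. The seed URL is hypothetical, and a production crawler would additionally honour robots.txt, distribute the work across machines and keep the frontier on disk, as described above.

    import time
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen


    class LinkExtractor(HTMLParser):
        """Collects the href targets of <a> tags found in a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)


    def crawl(seeds, max_pages=50, delay=1.0):
        frontier = deque(seeds)      # yet-to-be-crawled URLs (the "crawl frontier")
        discovered = set(seeds)      # every URL seen so far
        pages = {}                   # url -> raw HTML of successfully fetched pages

        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except Exception:
                continue             # skip pages that fail to download
            pages[url] = html

            extractor = LinkExtractor()
            extractor.feed(html)
            for link in extractor.links:
                absolute = urljoin(url, link)        # resolve relative links
                if absolute.startswith("http") and absolute not in discovered:
                    discovered.add(absolute)
                    frontier.append(absolute)

            time.sleep(delay)        # crude politeness: pause between requests

        return pages

    # Hypothetical usage: pages = crawl(["https://example.com/"], max_pages=10)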

A crawler-based search engine performs two basic functions. First, it compiles an ongoing index of web addresses (URLs): it retrieves a document, analyzes the content of both its title and its full text, registers the relevant links it contains, and then stores this information in its database. When a user submits a query in the form of one or more keywords, the engine compares the query with the information in its index and reports back any matches. Its second function is to search the Internet in real time for sites that match a given query; it does this using much the same process as its first function, following links from one page to another.
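
A toy sketch of these two functions, assuming a dictionary of already-fetched pages (for example, the output of the crawl() sketch above), might look like this: an inverted index mapping each word to the set of URLs containing it, and a search function that answers a keyword query by intersecting those sets. Real search engines obviously use far more sophisticated parsing, ranking and storage.

    import re
    from collections import defaultdict


    def build_index(pages):
        """pages maps URL -> page text; returns an index mapping word -> set of URLs."""
        index = defaultdict(set)
        for url, text in pages.items():
            for word in re.findall(r"[a-z0-9]+", text.lower()):
                index[word].add(url)
        return index


    def search(index, query):
        """Return the URLs that contain every keyword in the query."""
        words = query.lower().split()
        if not words:
            return set()
        results = set(index.get(words[0], set()))
        for word in words[1:]:
            results &= index.get(word, set())   # keep only URLs containing every word
        return results

    # Hypothetical usage: matches = search(build_index(pages), "web crawler")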

Big players in the computer industry, such as Google, Microsoft and Yahoo!, are the primary contributors of technology for fast access to Web-based information, and search capabilities are now integrated into most information systems, ranging from business management software and customer relationship systems to social networks and mobile phone applications.


The first search engine, Archie, was created in 1989 by Alan Emtage. Archie helped solve the data-scatter problem by combining a script-based data gatherer with a regular-expression matcher that retrieved file names matching a user query; essentially, Archie became a searchable database of file names on FTP archives, which it matched against users’ queries. Just as Archie started to gain ground and popularity, Veronica was developed by the University of Nevada System Computing Services. Veronica served the same purpose as Archie, but it worked on plain-text files. Soon another tool, named Jughead, appeared for obtaining menu information from various Gopher servers, serving the same purpose as Veronica; both of these were used for files served via Gopher.

In 1993, Matthew Gray created what is considered the first web robot, the World Wide Web Wanderer. It was initially used to count web servers in order to measure the size of the Web, and it ran monthly from 1993 to 1995. Later, it was used to obtain URLs, forming the first database of websites, called Wandex. Also in 1993, Martijn Koster created ALIWEB (Archie-Like Indexing of the Web), a search engine for the Web based on automated meta-data collection that allowed users to submit their own pages to be indexed.

Brian Pinkerton of the University of Washington released WebCrawler on April 20, 1994, initially as a desktop application rather than as the web service it later became. It went live on the web with a database containing documents from over 6,000 web servers. It was the first crawler to index entire pages, at a time when other bots stored only a URL, a title and at most 100 words; in other words, it was the first full-text search engine on the Internet, indexing the entire text of each page for the first time. It soon became so popular that it could barely be used during daytime hours, averaging 15,000 hits a day. WebCrawler opened the door for many other services to follow suit; after its debut came Lycos, Infoseek and OpenText.

AltaVista also began in 1995. It was the first search engine to allow natural-language queries and advanced search techniques, and it also provided multimedia search for photos, music and videos. Inktomi started in 1996, and in June 1999 it introduced a directory search engine powered by "concept induction" technology, which, according to the company, "takes the experience of human analysis and applies the same habits to a computerized analysis of links, usage, and other patterns to determine which sites are most popular and the most productive." AskJeeves and Northern Light were both launched in 1997. Google was launched in 1997 by Sergey Brin and Larry Page as part of a research project at Stanford University; it uses inbound links to rank sites. In 1998, MSN Search and the Open Directory were also started.

Google came about as a result of its founders' attempt to organize the web into something searchable. Their early prototype was based upon a few basic principles, including:

  • The best pages tend to be the ones that people linked to the most.
  • The best description of a page is often derived from the anchor text associated with the links to a page.

Both of these principles were observed as structural features of the World Wide Web, and theories were developed to exploit them to optimize the task of retrieving the best documents for a user query.
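
The first of these principles is the intuition behind link-based ranking schemes such as PageRank. As a rough, simplified sketch (not Google's actual implementation), repeated redistribution of score along hyperlinks rewards pages that many other pages link to; the tiny link graph below is purely hypothetical, and 0.85 is the damping factor commonly quoted for PageRank.

    def link_rank(graph, damping=0.85, iterations=50):
        """graph maps each page to the list of pages it links to
        (every link target should also appear as a key)."""
        pages = list(graph)
        n = len(pages)
        rank = {page: 1.0 / n for page in pages}

        for _ in range(iterations):
            new_rank = {page: (1.0 - damping) / n for page in pages}
            for page, outlinks in graph.items():
                if not outlinks:
                    continue
                share = damping * rank[page] / len(outlinks)   # split score among outlinks
                for target in outlinks:
                    new_rank[target] += share
            rank = new_rank
        return rank

    # Hypothetical graph: A and C both link to B, so B ends up with the highest score.
    graph = {"A": ["B"], "B": ["C"], "C": ["A", "B"]}
    print(sorted(link_rank(graph).items(), key=lambda item: -item[1]))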
