Search engine

Internet search engines are special sites on the Web that are designed to help people find information stored on other sites. They accept queries from web users and return a list of resources that best fit each query. There are differences in the ways various search engines work, but they all perform three basic tasks:
  • They search the Internet -- or select pieces of the Internet -- based on important words.
  • They keep an index of the words they find, and where they find them.
  • They allow users to look for words or combinations of words found in that index.
Early search engines held an index of a few hundred thousand pages and documents, and received maybe one or two thousand inquiries each day. Today, a top search engine will index hundreds of millions of pages and respond to tens of millions of queries per day.
A search engine is a coordinated set of programs that includes:
  • A spider (also called a "crawler" or a "bot") that visits every page, or representative pages, on every Web site that wants to be searchable, reading each page and following its hypertext links to discover and read the site's other pages
  • A program that creates a huge index (sometimes called a "catalog") from the pages that have been read
  • A program that receives your search request, compares it to the entries in the index, and returns results to you
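
The following minimal Python sketch ties these three programs together. The PAGES dictionary is a hypothetical stand-in for the live Web; a real spider would fetch pages over HTTP rather than read them from memory.

```python
# A minimal, in-memory sketch of the three coordinated programs described
# above: a spider that reads pages, an indexer that records which words
# appear where, and a search program that answers requests from the index.

PAGES = {  # hypothetical "web": URL -> page text (an assumption for illustration)
    "a.html": "internet search engines index pages",
    "b.html": "engines answer user queries",
}

def spider(pages):
    """Visit every page and hand each word to the indexer."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)  # word -> where it was found
    return index

def search(index, query):
    """Compare the request to the index entries and return matching pages."""
    return sorted(index.get(query.lower(), set()))

index = spider(PAGES)
print(search(index, "queries"))  # -> ['b.html']
```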

Major components of crawler-based search engines
Crawler-based search engines have three major components.
1) The crawler: Also called the spider. The spider visits a web page, reads it, and then follows links to other pages within the site. The spider will return to the site on a regular basis, such as every month or every fifteen days, to look for changes.
2) The index: Everything the spider finds goes into the second part of the search engine, the index. The index will contain a copy of every web page that the spider finds. If a web page changes, then the index is updated with new information.
3) The search engine software: This is the program that accepts the user-entered query, interprets it, sifts through the millions of pages recorded in the index to find matches, ranks them in order of what it believes is most relevant, and presents them to the user in a customizable manner.
All crawler-based search engines have the basic parts described above, but there are differences in how these parts are tuned, which is why the same search on different search engines often produces different results. Any comparison between engines therefore comes down to differences in all three parts.
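
As one illustration of such tuning, here is a sketch of a very simple ranking scheme: scoring each page by raw term frequency. This is not any particular engine's method; real engines combine many more signals, and the point is only that different scoring choices over the same index yield different orderings.

```python
# Rank matching pages by how often the query terms occur in them.
# A sketch only; the pages dict is hypothetical illustration data.

from collections import Counter

def rank(pages, query):
    """Return page URLs ordered by total occurrences of the query terms."""
    terms = query.lower().split()
    scores = {}
    for url, text in pages.items():
        counts = Counter(text.lower().split())
        scores[url] = sum(counts[t] for t in terms)
    # Drop pages that match no term, then sort best-first.
    matches = {u: s for u, s in scores.items() if s > 0}
    return sorted(matches, key=matches.get, reverse=True)

pages = {
    "a.html": "search engines rank pages for search queries",
    "b.html": "a page about engines",
}
print(rank(pages, "search engines"))  # -> ['a.html', 'b.html']
```
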
All search engines contain the following main components:
  • Spider: a browser-like program that downloads web pages
  • Crawler: a program that automatically follows all of the links on each web page
  • Indexer: a program that analyzes web pages downloaded by the spider and the crawler
  • Database: storage for the downloaded and processed pages
  • Results engine: extracts search results from the database
  • Web server: a server that handles interaction between the user and the other search engine components
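
To make the spider and crawler roles concrete, here is a small sketch using Python's standard-library HTMLParser. The SITE dictionary is a hypothetical stand-in for pages a real crawler would download over the network.

```python
# Sketch of the spider/crawler pair: "download" a page, extract every link
# from its anchor tags, and follow links that have not been visited yet.

from html.parser import HTMLParser

SITE = {  # hypothetical site: URL -> HTML (assumption for illustration)
    "/index.html": '<a href="/about.html">About</a> <a href="/index.html">Home</a>',
    "/about.html": '<a href="/index.html">Home</a>',
}

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":  # collect href targets from anchor tags
            self.links += [value for name, value in attrs if name == "href"]

def crawl(start):
    visited, frontier = set(), [start]
    while frontier:
        url = frontier.pop()
        if url in visited or url not in SITE:
            continue
        visited.add(url)
        extractor = LinkExtractor()
        extractor.feed(SITE[url])         # "download" and parse the page
        frontier.extend(extractor.links)  # follow links to other pages
    return visited

print(crawl("/index.html"))  # -> {'/index.html', '/about.html'} (order may vary)
```
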
Different types of search engines
When people use the term "search engine", they often mean both crawler-based search engines and human-powered directories. However, these two types of search engines gather their listings in radically different ways and are therefore inherently different.
Crawler-based search engines, such as Google, AllTheWeb, and AltaVista, create their listings automatically: a piece of software "crawls" or "spiders" the web, and what it finds is indexed to build the search base. Changes to web pages are caught dynamically and affect how those pages get listed in the search results. The automated software agents (crawlers) visit a Web site, read the information on the actual site, read the site's meta tags, and follow the links that the site connects to, indexing all linked Web sites as well. The crawler returns all that information to a central repository, where the data is indexed, and periodically revisits the sites to check for any information that has changed. The frequency with which this happens is determined by the administrators of the search engine.
Crawler-based search engines are good when you have a specific search topic in mind and can be very efficient at finding relevant information in that situation. However, when the search topic is general, crawler-based search engines may return hundreds of thousands of irrelevant responses to simple search requests, including lengthy documents in which your keyword appears only once.
Human-powered directories, such as the Yahoo directory, Open Directory and LookSmart, depend on human editors to create their listings. Typically, webmasters submit a short description to the directory for their websites, or editors write one for the sites they review, and these manually edited descriptions will form the search base. Therefore, changes made to individual web pages will have no effect on how these pages get listed in the search results. Human-powered search engines rely on humans to submit information that is subsequently indexed and catalogued. Only information that is submitted is put into the index.
Human-powered directories are good when you are interested in a general search topic. In this situation, a directory can guide you and help narrow your search to get refined results. Search results found in a human-powered directory are therefore usually more relevant to the search topic and more accurate. However, this is not an efficient way to find information when you have a specific search topic in mind.
Meta-search engines, such as Dogpile, Mamma, and Metacrawler, transmit user-supplied keywords simultaneously to several individual search engines to actually carry out the search. Search results returned from all the search engines can be integrated, duplicates can be eliminated and additional features such as clustering by subjects within the search results can be implemented by meta-search engines.
Meta-search engines are good for saving time: you search in only one place and are spared the need to use and learn several separate search engines. But since meta-search engines do not allow for input of many search variables, their best use is to find hits on obscure items or to see if something can be found using the Internet.
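
Here is a sketch of the meta-search idea: send the same keywords to several engines, merge the result lists, and drop duplicates. The two engine_* functions are hypothetical stand-ins for calls to real search engines, not actual APIs.

```python
# Query several engines with the same keywords, then integrate the results
# and eliminate duplicates, as a meta-search engine does.

def engine_a(query):  # hypothetical engine returning canned results
    return ["https://example.com/1", "https://example.com/2"]

def engine_b(query):  # hypothetical engine returning canned results
    return ["https://example.com/2", "https://example.com/3"]

def metasearch(query, engines):
    seen, merged = set(), []
    for engine in engines:
        for url in engine(query):
            if url not in seen:  # eliminate duplicates across engines
                seen.add(url)
                merged.append(url)
    return merged

print(metasearch("search engines", [engine_a, engine_b]))
# -> ['https://example.com/1', 'https://example.com/2', 'https://example.com/3']
```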

Building a Search

Searching through an index involves building a query and submitting it through the search engine. The query can be quite simple, a single word at minimum. Building a more complex query requires Boolean operators, which let you refine and extend the terms of the search; a sketch of how the main operators work follows the list below.
The Boolean operators most often seen are:
  • AND - All the terms joined by "AND" must appear in the pages or documents. Some search engines substitute the operator "+" for the word AND.
  • OR - At least one of the terms joined by "OR" must appear in the pages or documents.
  • NOT - The term or terms following "NOT" must not appear in the pages or documents. Some search engines substitute the operator "-" for the word NOT.
  • FOLLOWED BY - One of the terms must be directly followed by the other.
  • NEAR - One of the terms must be within a specified number of words of the other.
  • Quotation Marks - The words between the quotation marks are treated as a phrase, and that phrase must be found within the document or file.
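
As promised above, here is a sketch of how the first three operators map onto set operations over a simplified inverted index (word -> set of documents containing it). Phrase and proximity operators (quotation marks, NEAR, FOLLOWED BY) additionally need word positions, which this toy index does not keep. The index contents are hypothetical.

```python
# Boolean operators as set operations over an inverted index.

INDEX = {  # hypothetical inverted index: word -> documents containing it
    "search": {"doc1", "doc2"},
    "engine": {"doc1", "doc3"},
    "spider": {"doc3"},
}

def AND(a, b):  # both terms must appear -> set intersection
    return INDEX.get(a, set()) & INDEX.get(b, set())

def OR(a, b):   # at least one term must appear -> set union
    return INDEX.get(a, set()) | INDEX.get(b, set())

def NOT(a, b):  # first term present, second absent -> set difference
    return INDEX.get(a, set()) - INDEX.get(b, set())

print(AND("search", "engine"))  # -> {'doc1'}
print(OR("search", "spider"))   # -> {'doc1', 'doc2', 'doc3'}
print(NOT("engine", "search"))  # -> {'doc3'}
```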

