Search Engines

The last two decades have witnessed many significant attempts to make this knowledge “discoverable”. These attempts broadly fall into two categories:

1. Classification of web pages into hierarchical categories (a directory structure), championed by the likes of Yahoo! and the Open Directory Project. This doesn't serve today's web well: with such a huge number of web pages, who in the world has time to annotate them all manually? Having said that, it works well in restricted domains!
2. Full-text index search engines such as Excite, AltaVista, and Google, which use a pre-computed index to algorithmically retrieve and rank web pages.

A statistical similarity measure has long been used in practice to assess the closeness of each document (web page) to the user's text (the query); the underlying principle is that the higher the similarity score, the greater the estimated likelihood that the document is relevant to the user. This similarity formulation is based on models of documents and queries, the most effective of which is the vector space model. The cosine measure has consistently been found to be the most successful similarity measure with this model: it represents documents and queries as term vectors and uses the cosine of the angle between each pair of vectors as the similarity function.
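To make the cosine measure concrete, here is a minimal Python sketch that scores a tokenized document against a query using raw term-frequency vectors (real engines weight the terms, as described next; the example document and query are just made up for illustration):

```python
import math
from collections import Counter

def cosine_similarity(doc_terms, query_terms):
    """Cosine of the angle between the term-frequency vectors of a
    document and a query. A minimal sketch: raw counts stand in for
    the weighted term scores a real engine would use."""
    doc_vec, query_vec = Counter(doc_terms), Counter(query_terms)
    shared = set(doc_vec) & set(query_vec)
    dot = sum(doc_vec[t] * query_vec[t] for t in shared)
    doc_norm = math.sqrt(sum(v * v for v in doc_vec.values()))
    query_norm = math.sqrt(sum(v * v for v in query_vec.values()))
    if doc_norm == 0 or query_norm == 0:
        return 0.0
    return dot / (doc_norm * query_norm)

# Example: a document and a query, both tokenized into lowercase terms
doc = "the quick brown fox jumps over the lazy dog".split()
query = "quick fox".split()
print(cosine_similarity(doc, query))  # higher score = more similar
```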

Each term's contribution is weighted so that terms that appear to be discriminatory are favored while the impact of more common terms is reduced. Most similarity measures are composed of a few statistical values: the frequency of a term t in a document d (term frequency, or TF), the frequency of a term t in the query, the number of documents containing a term t (document frequency, or DF), the number of terms in a document, the number of documents in the collection, and the number of terms in the collection.
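Putting those statistics together, here is a rough sketch of a basic TF-IDF style score. The exact formula varies from engine to engine, so treat the log-based IDF below as an illustrative assumption rather than any particular engine's weighting:

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_terms, all_docs):
    """Score one document against a query with a basic TF-IDF weighting.
    TF rewards terms that are frequent in the document; IDF (log N/DF)
    damps terms that appear in many documents. A sketch, not any
    specific engine's formula."""
    n_docs = len(all_docs)
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in all_docs if term in d)  # document frequency
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(n_docs / df)
        score += tf[term] * idf
    return score

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]
print(tf_idf_score("cat mat".split(), docs[0], docs))
```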

Now, as people understood how search engines worked and how they ranked results (which determines the order in which results are displayed), everyone wanted their page to come first. It's just like companies naming themselves with multiple A's so they would be listed on the early pages of the Yellow Pages.

To counter this came the concept of PageRank, essentially link analysis on the web. A page's PageRank can be interpreted as the fraction of time a random web surfer will spend on that page when following the out-links from each page on the web.
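Here is a toy power-iteration sketch of that random-surfer idea. The damping factor (the probability of following a link rather than jumping to a random page), the iteration count, and the tiny four-page "web" are assumptions made purely for illustration:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [outgoing links]}.
    Each score approximates the fraction of time a random surfer spends
    on the page, following an out-link with probability `damping` and
    jumping to a random page otherwise. (Toy sketch; parameters are
    illustrative assumptions.)"""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(web))  # page "c" collects the most link authority
```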

When ranking multi-term queries, one of the prominent signals used is the proximity of the query terms on a page. The goal is to prefer documents in which the query terms appear close together over those in which they are spread apart. Proximity is even more critical for phrase queries, where the relative position of each query term matters. Rather than simply checking whether the terms are present in a document, we also need to check that the positions at which they appear are compatible with the phrase query being evaluated.
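A positional index makes that compatibility check straightforward. The sketch below assumes we already have, for a single document, the sorted list of positions at which each term appears (the example sentence and its positions are made up):

```python
def phrase_match(positions_by_term, phrase):
    """Return True if the phrase occurs in the document, given a
    positional index {term: sorted positions where it appears}.
    A minimal sketch of the position-compatibility check."""
    first, *rest = phrase
    for start in positions_by_term.get(first, []):
        # The i-th remaining term must appear exactly i+1 positions later.
        if all(start + i + 1 in positions_by_term.get(term, [])
               for i, term in enumerate(rest)):
            return True
    return False

# Positional postings for one document: "new york is a big city"
doc_positions = {"new": [0], "york": [1], "is": [2], "a": [3],
                 "big": [4], "city": [5]}
print(phrase_match(doc_positions, ["new", "york"]))  # True
print(phrase_match(doc_positions, ["york", "new"]))  # False
```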

 