How To Use Web Search Engines
Tips on using internet search sites like
Google, alltheweb, and Yahoo.
Search Engine Ranking Algorithms
by Curt A. Monash
Page Rank
Search engine ranking algorithms are closely guarded secrets, for at least two reasons:
search engine companies want to protect their methods from their competitors, and they
also want to make it difficult for web site owners to manipulate their rankings.
That said, a specific page's relevance ranking for a specific query currently depends on
three factors:
- Its relevance to the words and concepts in the query
- Its overall link popularity
- Whether or not it is being penalized for excessive search engine optimization (SEO).
Examples of SEO abuse would be a lot of sites linked to each other in a circular scam, or
excessive and highly ungrammatical stuffing with keywords.
Factor #2 was pioneered by Google with PageRank. Essentially, the more incoming
links your page has, the better. But it is more complicated than that: indeed,
PageRank is a tricky concept because it is circular, as follows: Every page on
the Internet has a minimum PageRank score just for existing. 85% (at least,
that's the best known estimate, based on an early paper) of a page's PageRank is passed
along to the pages it links to, divided more or less equally among its outgoing links.
A page's PageRank is the sum of the minimum value plus all the PageRank passed to
it via incoming links.
Although this is circular, mathematical algorithms exist for calculating it iteratively.
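The iterative calculation can be sketched in a few lines of Python. This is only an illustration on a made-up four-page web, not Google's actual code. The 0.85 share comes from the estimate above; the minimum score of 0.15 (that is, 1 minus 0.85), the function name pagerank, and the fixed count of 50 passes are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate raw PageRank for a small link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    minimum = 1.0 - damping  # assumed score every page gets "just for existing"
    rank = {page: 1.0 for page in pages}  # arbitrary starting guess

    for _ in range(iterations):
        new_rank = {page: minimum for page in pages}
        for page, targets in links.items():
            if not targets:
                continue  # a page with no outgoing links passes nothing along
            # 85% of the page's score, split equally among its outgoing links
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: D links out, but nothing links to D.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Each pass recomputes every score from the previous pass's scores, so the circular definition settles toward stable values. In this toy web, page D ends up with only the minimum score because nothing links to it.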
In one final complication, what I just said applies to "raw PageRank."
Google actually reports PageRank scores of 0 to 10 that are believed to be based on the
logarithm of raw PageRank (they're reported as whole numbers). And the base of
that logarithm is believed to be approximately 6.
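As a rough illustration of what a logarithmic scale does, here is a sketch that squeezes raw scores into whole numbers from 0 to 10. The base of 6 is the estimate given above. Everything else is an assumption made for the example: the function name reported_pagerank, dividing by a minimum raw score of 0.15 so that the lowest pages land on 0, and rounding down.

```python
import math


def reported_pagerank(raw, minimum=0.15, base=6.0):
    """Map a raw PageRank onto an assumed 0-10 whole-number scale."""
    if raw <= minimum:
        return 0
    # Each extra point on the scale means roughly `base` times more raw PageRank.
    score = math.floor(math.log(raw / minimum, base))
    return max(0, min(10, score))


for raw in (0.15, 6.0, 250.0, 1e12):
    print(raw, "->", reported_pagerank(raw))
```

The point of the sketch is only that moving up one reported point gets harder by a constant factor each time, which is why so few sites reach the top of the scale.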
Anyhow, about 30 sites on the Web have PageRank 10, including Yahoo, Google,
Microsoft, Intel, and NASA. IBM, AOL, and CNN, by way of contrast, were only at
PageRank 9 as of early in 2004.
Further refinements in link popularity rankings are under development. Notably, link
popularity can be made specific to a subject or category; i.e., pages can have different
PageRanks for health vs. sports vs. computers vs. whatever. Supposedly,
AskJeeves/Teoma already works that way.
It is believed that Inktomi, Altavista, et al. use link popularity in their ranking
algorithms, but to a much lesser extent than Google. Yahoo, owner of Inktomi,
Altavista, and Alltheweb, is rolling out a new search engine, which reportedly includes a
feature called Web Rank. More on how that works soon.
Keyword Search
Most search engines handle words and simple phrases. In its simplest form, text
search looks for pages with lots of occurrences of each of the words in a query, stopwords
aside. The more common a word is on a page, compared with its frequency in the
overall language, the more likely that page will appear among the search results.
Hitting all the words in a query is a lot better than missing some.
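A minimal sketch of this kind of scoring follows. It is not any particular engine's formula. The stopword list, the language_freq table of background word frequencies, the default_freq fallback, and the choice to multiply by the fraction of query words found are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical stopword list; real engines use much longer ones.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "on"}


def keyword_score(query, page_text, language_freq, default_freq=1e-6):
    """Score a page by how unusually often it uses the query words."""
    words = page_text.lower().split()
    counts = Counter(words)
    total = len(words) or 1

    terms = [t for t in query.lower().split() if t not in STOPWORDS]
    score = 0.0
    matched = 0
    for term in terms:
        page_freq = counts[term] / total
        if page_freq > 0:
            matched += 1
            # Frequency on the page relative to frequency in the language
            score += page_freq / language_freq.get(term, default_freq)

    # Pages that hit every query word keep their full score;
    # pages that miss some are scaled down.
    coverage = matched / len(terms) if terms else 0.0
    return score * coverage


# Made-up background frequencies for two words.
language_freq = {"search": 0.001, "engine": 0.0005}
page_a = "search engine ranking explained search engine tips"
page_b = "search search search tips and tricks"
print(keyword_score("search engine", page_a, language_freq))
print(keyword_score("search engine", page_b, language_freq))
```

Here page_b repeats "search" more often than page_a does, but it never mentions "engine", so it scores lower for the two-word query.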
Search engines also make some efforts to understand what is meant by the query words.
For example, most search engines now offer optional spelling correction. And
increasingly they search not just on the words and phrases actually entered; they also
use stemming to search for alternate forms of the words (e.g., speak, speaker, speaking,
spoke). Teoma-based engines are also offering refinement by category, à la the
now-defunct Northern Light. However, Excite-like concept search has otherwise not made
a comeback yet, since the concept categories are too unstable.
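To show the idea of searching on alternate word forms, here is a small sketch that expands each query word into its family of forms. Note that it uses a hand-made lookup table rather than a real stemming algorithm, since an irregular form like "spoke" cannot be found by trimming suffixes. The table WORD_FORMS and the function expand_query are hypothetical.

```python
# Hypothetical table of word families, keyed by the base form.
WORD_FORMS = {
    "speak": {"speak", "speaker", "speaking", "spoke"},
}


def expand_query(query, word_forms=WORD_FORMS):
    """Return, for each query word, the sorted list of forms to search for."""
    # Reverse index: any known form leads back to its whole family.
    family_of = {}
    for forms in word_forms.values():
        for form in forms:
            family_of[form] = forms

    expanded = []
    for term in query.lower().split():
        # Unknown words are searched as entered.
        expanded.append(sorted(family_of.get(term, {term})))
    return expanded


print(expand_query("spoke loudly"))
```

A page would then match a query word if it contains any form in that word's list.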
When ranking results, search engines give special weight to keywords that appear:
- High up on the page
- In headings
- In BOLDFACE (at least in Inktomi)
- In the URL
- In the title (important)
- In the description
- In the ALT tags for graphics
- In the generic keywords metatags (only for Inktomi, and only a little bit even for them)
- In the link text for inbound links.
More weight is put on the factors that the site owner would find it awkward to fake,
such as inbound link text, page title (which shows up on the SERP -- Search Engine Results
Page), and description.
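The list above can be turned into a toy scoring function. The weights below are invented purely to illustrate the principle that harder-to-fake places count for more; the real values are not public. The field names, FIELD_WEIGHTS, and field_score are all hypothetical.

```python
import re

# Invented weights: harder-to-fake fields get more weight.
FIELD_WEIGHTS = {
    "link_text": 5.0,      # text of inbound links
    "title": 4.0,
    "description": 3.0,
    "url": 2.0,
    "headings": 2.0,
    "body_top": 1.0,       # text high up on the page
    "alt_text": 1.0,
    "bold": 0.5,
    "meta_keywords": 0.2,  # easy to stuff, so nearly ignored
}


def field_score(keyword, page_fields, weights=FIELD_WEIGHTS):
    """Add up the weights of every field in which the keyword appears."""
    keyword = keyword.lower()
    score = 0.0
    for field, text in page_fields.items():
        # Split on anything that is not a letter or digit, so URLs work too.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        if keyword in tokens:
            score += weights.get(field, 0.0)
    return score


page = {
    "title": "Spider Basics",
    "url": "http://example.com/spider-guide",
    "meta_keywords": "crawler",
}
print(field_score("spider", page))
print(field_score("crawler", page))
```

In this example a keyword that appears only in the keywords metatag barely registers, while one in the title and URL scores well.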
How sites get into search engines
The base case is that spiders crawl the entire Web, starting from known pages and
following all links, and also crawling pages that are hand-submitted. Google
is pretty much like that still. If a site has high PageRank, it is spidered
more often and more deeply.
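The base case can be sketched as a breadth-first walk over links. To keep the example self-contained, the "Web" here is just a dictionary of made-up page names, and nothing is fetched over the network. The function crawl and its parameters are hypothetical, and the sketch does not model how often or how deeply a site is revisited.

```python
from collections import deque


def crawl(link_graph, seeds, submitted=()):
    """Visit pages breadth-first from known and hand-submitted pages.

    link_graph maps each page to the list of pages it links to.
    Returns the pages in the order they were visited.
    """
    queue = deque()
    seen = set()
    for page in list(seeds) + list(submitted):
        if page not in seen:
            seen.add(page)
            queue.append(page)

    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in link_graph.get(page, []):
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return order


# Made-up site: nothing links to "orphan", so it is found only if submitted.
graph = {
    "home": ["about", "news"],
    "news": ["home", "story"],
    "orphan": ["story"],
}
print(crawl(graph, ["home"]))
print(crawl(graph, ["home"], submitted=["orphan"]))
```

The second call shows why hand submission matters: a page with no inbound links is invisible to a spider that only follows links.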
However, search engines are trying to encourage site owners to pay for the privilege of
having their pages spidered. Teoma's index is very hard to get into without
paying money, and Inktomi's isn't that easy either. And even if you do get
into Inktomi for free, they'll take a long time to respider, while if you pay they
respider constantly. One advantage of being respidered often is that you can tweak
your page to come up higher in their relevancy rankings, then see if your changes worked.
Finally, you can also pay to appear on a search page. That is, your link will
appear when someone searches on a specific keyword or keyphrase. Google does a good
job of making it pretty clear which results (at the top or on the right of the page) are
paid; others perhaps do not do such a good job.
Paid search results are typically pay-per-click, based on keyword. The
advertiser pays the search engine vendor a specific amount of money each time an ad is
clicked on, this fee having been determined by an auction of each keyword or keyphrase.
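Here is a simplified sketch of how such a keyword auction and the resulting charges could work. The exact auction rules vary by vendor and are not described here, so the sketch assumes the simplest possible version: the highest bidders win the available ad slots and each pays its own bid per click. The advertiser names, the bid amounts (in cents), and both function names are made up.

```python
def run_keyword_auction(bids, slots=3):
    """Rank advertisers by bid and return the winners of the ad slots.

    bids maps advertiser name to bid per click, in cents.
    Assumes each winner simply pays its own bid per click.
    """
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    return ranked[:slots]


def click_charges(placements, clicks):
    """Total owed by each placed advertiser: fee per click times clicks."""
    return {name: price * clicks.get(name, 0) for name, price in placements}


# Made-up bids for one keyphrase, in cents per click.
bids = {"acme": 50, "bolt": 75, "corp": 20, "dyne": 60}
placements = run_keyword_auction(bids)
print(placements)
print(click_charges(placements, {"bolt": 10, "acme": 4}))
```

Note that an advertiser whose ad is shown but never clicked owes nothing, which is the whole point of pay-per-click.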
The Spider's Apprentice was conceived and
written by Linda Barlow, who maintains this site for Monash
Information Services. Copyright, 1996-2004. All rights reserved.
Updated: 05/11/04