The Search Technology Behind Ask.com

The History

Teoma has been at the very core of Ask search technology since the early 2000s. Teoma’s algorithm, which is now known as ExpertRank, is what allows Ask to be one of the most powerful search engines in the world.

Two important events led to the development of this unique technology. The first occurred in 1999, when Ask acquired Direct Hit, a Massachusetts-based company that helped develop the first “click popularity” search engine technology. This technology was licensed to several important search engines, most notably MSN and Lycos.

The second major event took place in 2001, when Ask acquired Teoma, which at the time was only a 10-person start-up based out of Rutgers University. The acquisition was important because Teoma was the first company to develop search technology built on the clustering concept of subject-specific relevance/popularity. This technology became known as ExpertRank, and its acquisition meant that Ask now had access to index and search relevance technology.

How Does It Work?

Ask’s ExpertRank algorithms and search parameters are designed to identify the most authoritative sites on the internet. In other words, the algorithm looks for the most reliable sites and sources, as opposed to the most popular. This means that ExpertRank doesn’t rank a page based on the volume of links that point towards it. Instead, it focuses on how the page is regarded within its subject area. This process is known as subject-specific popularity, and it is designed to do the following:

Identify relevant topics (sometimes called clusters)
Identify the experts on said topics
Identify the popularity of certain pages among those experts

This process takes place whenever you run a search query through ExpertRank, and it usually involves additional calculations that are not present in other search engine algorithms. The outcome is a set of highly relevant results that often carries a little editorial flavor.
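The three steps above can be illustrated with a toy sketch. This is not Ask’s actual algorithm, which the article does not describe in detail. The function name, the `min_topic_links` threshold, the rule for deciding who counts as an expert, and the sample pages are all assumptions made for illustration.

```python
from collections import defaultdict


def subject_specific_popularity(links, topics, query_topic, min_topic_links=2):
    """Score pages by how many topic experts link to them (toy heuristic).

    links:  dict mapping each page to the list of pages it links to.
    topics: dict mapping each page to its topic label (its "cluster").
    """
    # Step 1: the cluster is every page labeled with the query topic.
    cluster = {page for page, topic in topics.items() if topic == query_topic}

    # Step 2: assume a page is an "expert" if it links to at least
    # min_topic_links distinct pages inside the cluster.
    experts = set()
    for page, targets in links.items():
        on_topic = {t for t in targets if t in cluster and t != page}
        if len(on_topic) >= min_topic_links:
            experts.add(page)

    # Step 3: a page's score is the number of experts that link to it.
    scores = defaultdict(int)
    for expert in experts:
        for target in set(links[expert]):
            if target in cluster and target != expert:
                scores[target] += 1
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical link graph: "shop" has the most inbound links overall,
# but only one of them comes from a page that qualifies as an expert.
links = {
    "hub-a": ["guide", "forum"],
    "hub-b": ["guide", "forum", "shop"],
    "spam-1": ["shop"],
    "spam-2": ["shop"],
    "spam-3": ["shop"],
}
topics = {
    "hub-a": "cycling", "hub-b": "cycling", "guide": "cycling",
    "forum": "cycling", "shop": "cycling",
    "spam-1": "other", "spam-2": "other", "spam-3": "other",
}
ranking = subject_specific_popularity(links, topics, "cycling")
print(ranking)  # [('forum', 2), ('guide', 2), ('shop', 1)]
```

In this sample, "shop" receives four links in total but ranks last, because raw link volume is ignored and only links from on-topic experts are counted.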

Ask’s Web Crawler FAQ

The Ask Web Crawler is Ask’s web-indexing spider. It collects documents from all over the internet in order to build an ever-growing index, which provides advanced search functionality for Ask and for all other sites licensed to use the proprietary Ask Search Technology.

What makes Ask search technology different from other search engine algorithms is that it is designed to analyze the contents of the internet as they actually exist, which is to say in subject-specific communities. This whole process begins with the creation of a comprehensive index. Web crawlers are used extensively to build it, and they help to ensure that users receive the most up-to-date results for each search.

Below are answers to some of the most common questions that users ask regarding how the Ask Web Crawler works.

What is a Web Crawler/Web Spider?

A Web Crawler/Spider/Robot is any program designed to follow the hyperlinks found throughout a site, retrieving and indexing the contents of the linked pages for search-related purposes. A well-behaved crawler is designed to cause no harm to a site or its servers, so there’s nothing to worry about.

Why Do Search Engines Use Web Crawlers?

Search engines, like Ask, need to use Web Crawlers in order to gather data from the internet and use it to compile an ever-growing search index. Crawling also ensures that the results seen by the users are as recent and relevant as they can possibly be.

Ask’s crawlers are also well designed and professionally managed by experienced technicians in order to comply with the standards of the search industry.

How Do Ask’s Crawlers Work?

Ask’s Web Crawlers are designed to use the following search process:

– First of all, the Crawler requests a URL and downloads its HTML page.

– The Crawler then follows any hyperlinks found on the page (the URLs can either be on the same site or on external sites).

– The Crawler then adds any new URLs to its existing list of URLs to crawl. The system keeps repeating this process in order to maintain an accurate list of URLs on the internet.

– In some cases, the Crawler will exclude a URL, either because it has already downloaded a sufficient number of pages from that site or because the URL is a duplicate of one that the system has already downloaded.

– Finally, the crawled pages are compiled into a single search catalog, which is then used to display URLs in the site’s search results whenever there’s a relevant match.
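The steps above can be sketched as a simple breadth-first loop. This is a minimal illustration under stated assumptions, not Ask’s implementation: `fetch_links` is a hypothetical callable that stands in for downloading a page and extracting its links, `max_pages_per_site` is an assumed cap, and `WEB` is a made-up in-memory site map so that the example runs without network access.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl(seed_url, fetch_links, max_pages_per_site=2):
    """Breadth-first crawl; returns URLs in the order they were downloaded.

    fetch_links is a hypothetical callable: URL -> list of linked URLs.
    """
    frontier = deque([seed_url])  # URLs waiting to be crawled
    seen = {seed_url}             # guards against duplicate downloads
    per_site = {}                 # pages downloaded so far, per host
    catalog = []                  # stands in for the search catalog

    while frontier:
        url = frontier.popleft()
        site = urlsplit(url).netloc
        # Exclude the URL once enough pages have come from this site.
        if per_site.get(site, 0) >= max_pages_per_site:
            continue
        per_site[site] = per_site.get(site, 0) + 1
        catalog.append(url)
        # Add links not seen before to the list of URLs to crawl.
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return catalog


# Made-up link structure used instead of real HTTP requests.
WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/", "http://a.example/2"],
    "http://a.example/1": ["http://a.example/"],
    "http://a.example/2": [],
    "http://b.example/": ["http://a.example/1"],
}
crawled = crawl("http://a.example/", lambda url: WEB.get(url, []))
print(crawled)
# ['http://a.example/', 'http://a.example/1', 'http://b.example/']
```

Here `http://a.example/2` is skipped because two pages from that host were already downloaded, and no URL is downloaded twice.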

How Often Will the Ask Crawler Acquire Pages From A Particular Site?

The Crawler is designed to download only one page at a time from each site. Once it has received a page, it pauses for a while before downloading another page from the same site. This delay can range from a second to several hours, depending on the nature and settings of the crawler. Generally, however, the faster a site responds to a crawler’s request for a page, the shorter the delay.
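One way to picture this behavior is a delay that scales with the site’s response time and is clamped to a fixed range. The article does not give Ask’s formula, so the multiplier and the bounds below (one second to three hours) are assumptions chosen only to match the range described above.

```python
def next_delay(response_seconds, factor=10.0, min_delay=1.0, max_delay=3 * 3600.0):
    """Return the pause, in seconds, before the next request to the same site.

    factor, min_delay and max_delay are illustrative values, not Ask's settings.
    """
    # Slower responses lead to longer pauses, within the allowed range.
    return max(min_delay, min(max_delay, factor * response_seconds))


print(next_delay(0.05))  # 1.0  (fast site: shortest allowed pause)
print(next_delay(2.0))   # 20.0 (slower site: longer pause)
```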

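Is It Possible to Prevent Ask From Showing a Cached Copy of a Web Page?

Yes, it is possible. All you need to do is include the following tag in your HTML page (NOARCHIVE is the robots META directive conventionally used for this purpose):
<META NAME="ROBOTS" CONTENT="NOARCHIVE">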

Does Ask Comply With The Robot-Exclusion Standard?

Yes, Ask does comply with the Robots Exclusion Standard of 1994 along with all of the other guidelines of the Robot Exclusion Protocol.

Can I Prevent Crawlers From Acquiring Certain Parts of My Site?

Yes. Ask crawlers are designed to respect the wishes of any site owner who doesn’t want the site, or certain pages of it, crawled. You state those wishes in a robots.txt file.
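As a sketch of how such exclusions are expressed and checked, the example below parses a small robots.txt file with Python’s standard `urllib.robotparser` module. The user-agent token `Teoma` and the `/private/` path are assumptions made for illustration; check Ask’s own documentation for the token its crawler actually matches.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one crawler from /private/, allow everyone else.
ROBOTS_TXT = """\
User-agent: Teoma
Disallow: /private/

User-agent: *
Disallow:
"""

exclusion_rules = RobotFileParser()
exclusion_rules.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent, url):
    """Return True if the parsed rules let user_agent fetch url."""
    return exclusion_rules.can_fetch(user_agent, url)


print(allowed("Teoma", "http://www.samplesite.org/private/page.html"))  # False
print(allowed("Teoma", "http://www.samplesite.org/index.html"))         # True
```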

Where Can I Put My Robots.txt File?

You will need to put the file at the top level of your site. For example, let’s say that you have a site called www.samplesite.org. Your robots.txt file will have to be placed at http://www.samplesite.org/robots.txt

How Do I Know If The Ask Crawler Has Already Visited My URL?

To check whether or not the Ask Crawler has already visited your URL, simply check the server logs on your site.
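A short script can do this check for you. The sketch below assumes that your server writes the common "combined" access log format and that the crawler identifies itself with a user-agent string containing `Teoma`. Both are assumptions, and the sample log lines are made up.

```python
import re

# Combined log format: ip, identity, user, [time], "request", status, size,
# "referer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"\S+ (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def crawler_hits(log_lines, agent_token="Teoma"):
    """Return (client_ip, path) for requests whose user agent has agent_token."""
    hits = []
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match and agent_token.lower() in match.group("agent").lower():
            hits.append((match.group("ip"), match.group("path")))
    return hits


# Made-up sample entries; real logs would be read from a file.
SAMPLE_LOG = [
    '203.0.113.7 - - [10/Oct/2008:13:55:36 +0000] "GET /index.html HTTP/1.1" '
    '200 2326 "-" "Mozilla/2.0 (compatible; Ask Jeeves/Teoma)"',
    '198.51.100.2 - - [10/Oct/2008:13:56:01 +0000] "GET /about.html HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0"',
]
print(crawler_hits(SAMPLE_LOG))  # [('203.0.113.7', '/index.html')]
```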

How Do I Stop Ask Crawlers From Indexing My Site’s Pages or Following Their Links?

To prevent crawlers from indexing a page on your site, all you need to do is place the following tag in the head section of that HTML page:
<META NAME="ROBOTS" CONTENT="NOINDEX">

Furthermore, if you want the crawler to index your document but not follow the hyperlinks in it, replace “NOINDEX” with “NOFOLLOW.” Note that under the usual robots META convention, “NONE” is shorthand for “NOINDEX, NOFOLLOW”: it blocks both indexing and link following rather than lifting the restrictions. To lift all restrictions, simply remove the tag.
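To see how a crawler might read these directives, here is a minimal sketch built on Python’s standard `html.parser` module. The class and function names are hypothetical, and the sketch assumes the usual robots META convention in which NONE is shorthand for NOINDEX, NOFOLLOW.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().upper() for part in content.split(",") if part.strip()
            )


def robots_policy(html):
    """Return whether a page may be indexed and whether its links may be followed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    blocked_all = "NONE" in parser.directives  # assumed: NONE = NOINDEX, NOFOLLOW
    return {
        "index": not (blocked_all or "NOINDEX" in parser.directives),
        "follow": not (blocked_all or "NOFOLLOW" in parser.directives),
    }


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX"></head></html>'
print(robots_policy(page))  # {'index': False, 'follow': True}
```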

Why Does The Ask Crawler Keep Downloading the Same Page On My Site Several Times?

It’s important to remember that the Ask Crawler is designed to download only one copy of each file from any particular site during any given crawl. However, there are two cases in which this rule doesn’t apply. These exceptions happen:

– Whenever a page contains commands that cause the crawler to go to a different site or URL. A good example is when a page contains the following HTML tag: <META HTTP-EQUIV="REFRESH" CONTENT="0; URL=http://www.SamplePage.html">

– When the crawler encounters the HTTP status codes 301 and 302, in which case it downloads the second page instead of the first one.

It’s also worth mentioning that if multiple URLs redirect to the same page, the Crawler is likely to download that page several times unless it detects the duplicates. A similar situation arises with framesets. A frameset is an HTML page formed from several component pages called “frames.” When several framesets share the same frame pages as components, the Crawler downloads those frame pages once for each frameset, which also leads to duplicate copies.
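Duplicate detection of this kind can be sketched by resolving each URL to its final destination before deciding whether to download it. The article does not say how Ask detects duplicates, so this is only one plausible approach. The `redirects` mapping is a hypothetical stand-in for the 301/302 responses and META refresh targets a crawler would observe, and `max_hops` is an assumed safety limit against redirect loops.

```python
def resolve(url, redirects, max_hops=5):
    """Follow a redirect map (URL -> target URL) to the final URL."""
    hops = 0
    while url in redirects and hops < max_hops:
        url = redirects[url]
        hops += 1
    return url


def plan_downloads(urls, redirects):
    """Return the final URLs to download, keeping each one only once."""
    seen, plan = set(), []
    for url in urls:
        final = resolve(url, redirects)
        if final not in seen:
            seen.add(final)
            plan.append(final)
    return plan


# Made-up example: two old URLs both end up at the same new page.
redirects = {
    "http://example.org/old": "http://example.org/new",
    "http://example.org/older": "http://example.org/old",
}
urls = ["http://example.org/old", "http://example.org/older", "http://example.org/new"]
print(plan_downloads(urls, redirects))  # ['http://example.org/new']
```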

Why Does the Ask Crawler Keep Trying to Download Erroneous Links From a Server? Or Why Does It Keep Trying to Access a Server That Doesn’t Even Exist?

It’s inevitable for some links to become obsolete or outdated after a given amount of time. Whenever a web page contains links which are either obsolete, broken or linked to a site which no longer exists, Ask’s search algorithms will still try to visit the link in order to identify the pages that it references.

As a result, the crawler may request URLs that no longer exist or that never existed in the first place. It may also make HTTP requests to IP addresses that no longer belong to a web server, or never did. Keep in mind that this does not mean the Crawler is generating addresses at random. It is simply following links, which is why you may see crawler activity on a machine that is not even a web server.

How Does The Ask Web Crawler Find URLs?

The Ask Crawler finds pages by following links to them from other sites and pages. Whenever the crawler finds a page that contains frames, it downloads each component frame and treats its content as part of the original page. Take note, however, that the Ask Crawler does not index a component frame as a URL in its own right unless other pages link to it.

What Kind of Links Do Ask Crawlers Follow?

The Ask Crawler is designed to identify and follow HREF links, SRC links and redirects.
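The link types named above can be pulled out of a page with Python’s standard `html.parser` module. This is a minimal sketch rather than Ask’s extraction logic: the `LinkExtractor` class is hypothetical, and treating a META refresh tag as the redirect case is an assumption (HTTP 301 and 302 redirects arrive in response headers, not in the HTML).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from href, src and meta-refresh attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # HREF and SRC links, resolved against the page's own URL.
        for name in ("href", "src"):
            value = attributes.get(name)
            if value:
                self.links.append(urljoin(self.base_url, value))
        # Redirects expressed as <meta http-equiv="refresh" content="0; URL=...">.
        if tag == "meta" and (attributes.get("http-equiv") or "").lower() == "refresh":
            content = attributes.get("content") or ""
            position = content.lower().find("url=")
            if position != -1:
                target = content[position + 4:].strip()
                if target:
                    self.links.append(urljoin(self.base_url, target))


html = (
    '<a href="/about.html">About</a>'
    '<frame src="menu.html">'
    '<meta http-equiv="refresh" content="0; URL=http://www.samplesite.org/new.html">'
)
extractor = LinkExtractor("http://www.samplesite.org/docs/index.html")
extractor.feed(html)
print(extractor.links)
```

For the sample page, the extractor yields the absolute forms of the anchor link, the frame source and the refresh target, in that order.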

Can the Ask Crawler Include Dynamic URLs?

Ask allows a certain number of dynamic URLs in its index. However, these links are thoroughly screened before downloading in order to find and eliminate possible duplicates.

Why Does Ask Crawler Keep Ignoring My URL?

If the Ask Crawler is ignoring your URL, it’s likely because it hasn’t found any link to that URL on the other pages it has visited. In this case, check that at least one page the crawler can reach links to the URL.

Does the Ask Crawler Allow HTTP Compression?

It does. However, both the client and the server must support compression for it to work. When it is supported, the Ask Crawler allows servers to send content compressed with gzip and other formats instead of uncompressed documents.

This feature allows for considerable bandwidth savings for both the HTTP server and the client, with file sizes reduced by about 75%. There is a small CPU overhead for both parties due to encoding and decoding, but the bandwidth savings usually outweigh it.
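The negotiation can be sketched with Python’s standard `gzip` module. In HTTP, the client advertises support with an `Accept-Encoding` request header and the server marks a compressed body with a `Content-Encoding` response header. The `respond` and `read_body` helpers and the sample page are hypothetical, and headers are modeled as plain dictionaries rather than real HTTP messages.

```python
import gzip


def respond(request_headers, body):
    """Server side: compress only if the client advertised gzip support."""
    accepted = request_headers.get("Accept-Encoding", "")
    if "gzip" in accepted.lower():
        return {"Content-Encoding": "gzip"}, gzip.compress(body)
    return {}, body


def read_body(response_headers, payload):
    """Client side: decompress when the server says the body is gzip-encoded."""
    if response_headers.get("Content-Encoding") == "gzip":
        return gzip.decompress(payload)
    return payload


# Made-up page; its repetitive content compresses far better than a real page.
page = (
    "<html><body>"
    + "<p>Sample paragraph about search engines.</p>" * 200
    + "</body></html>"
).encode("utf-8")

headers, payload = respond({"Accept-Encoding": "gzip"}, page)
print(len(page), len(payload))             # original size vs. compressed size
print(read_body(headers, payload) == page)  # True
```

The actual saving depends on the content, so the ratio printed for this sample says nothing about typical pages.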

How Do Site Owners Register A URL with Ask for Indexing?

It’s worth pointing out that Ask no longer has a Paid-Site-Submission program, so you can’t really pay to have your URL indexed. However, thanks to the recent upgrades, Ask is now indexing more web pages than ever before, so you shouldn’t worry about your site not appearing on Ask’s search index.

Also, if you are a site owner or webmaster, you may want to look for additional online information that offers tips on how to set up and optimize your web server in order to make it “Web Crawler Friendly.”

Why Are Some of the Pages Indexed by Ask Crawlers Not Showing Up in Ask.com’s Search Results?

If some of your crawled pages don’t appear on Ask’s search results page, it’s probably because they are still being analyzed by Ask’s servers. Like all search engines, Ask analyzes the results of each crawl before processing them for inclusion in its database.

Can I Take Control of the Crawler Request Rate From the Ask Spider to My Site?

Yes. Ask fully supports the Crawl-Delay directive in robots.txt. With this directive, you can set the minimum delay between two successive requests from Ask’s Spider to your site.
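As an illustration, the snippet below shows a robots.txt entry with a crawl delay and reads it back with the `crawl_delay` method of Python’s standard `urllib.robotparser` module. The user-agent token `Teoma` and the value of 10 seconds are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking one crawler to wait 10 seconds between requests.
CRAWL_DELAY_ROBOTS_TXT = """\
User-agent: Teoma
Crawl-delay: 10
"""


def crawl_delay_for(robots_txt, user_agent):
    """Return the Crawl-delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


print(crawl_delay_for(CRAWL_DELAY_ROBOTS_TXT, "Teoma"))     # 10
print(crawl_delay_for(CRAWL_DELAY_ROBOTS_TXT, "OtherBot"))  # None
```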