Robot, Web Mining Spider - Online Reference Manual 
  info@semantic-knowledge.com 

Appendixes - Known problems

Although this software is designed to provide the best possible results, no tool can solve every technical problem that may be encountered on the Internet.

You may therefore encounter three main types of problems when collecting content on the Web:

  1. Firewalls. Because URLs can be downloaded very quickly, especially over a dedicated network connection or a fast ADSL line, some servers may put your IP address on a blacklist, and your access to particular Web sites may be blocked for a certain time. Although this software is designed to avoid "disturbing" target sites, some Web servers may mistake your computer for the source of a "Denial of service" attack because of the large number of Web pages fetched in a short period of time. You can reduce these problems by setting a very low [Maximum concurrent download] value and by respecting the [Robot policy] in the Advanced parameters dialog.

  2. Dynamic Web pages. Many servers rely on dynamic content generation (Java applets, PHP, etc.) from large databases, where Web pages are generated in real time from parameters supplied by a human user (for example, a keyword query, special parameters stored in cookies, a user name and password for the protected part of a site, or a credit card number required to access some content). These areas of the Internet are often called the "Invisible Web", because Web spiders and search engines can't index their content.

  3. Specific file formats. Some servers deliver content in proprietary file formats, which are readable only if you download or use a dedicated component (ActiveX, etc.) or buy specific software. This robot is not designed to convert proprietary file formats, because the associated licence costs can be quite high.
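
The advice given for problem 1 (few concurrent downloads, respect for the robot policy) can be illustrated with a short Python sketch. This is not the robot's own code: the constants MAX_CONCURRENT_DOWNLOADS, DELAY_SECONDS and USER_AGENT, and the functions build_robot_policy, allowed_urls and crawl, are hypothetical names chosen for the illustration. The sketch assumes that the site's robots.txt has already been retrieved as text and that the caller supplies the fetch function.

```python
import time
import urllib.robotparser
from concurrent.futures import ThreadPoolExecutor

# Hypothetical settings standing in for the [Maximum concurrent download]
# and [Robot policy] options; these names are not the product's own.
MAX_CONCURRENT_DOWNLOADS = 1
DELAY_SECONDS = 2.0
USER_AGENT = "ExampleSpider"


def build_robot_policy(robots_txt):
    """Parse the text of a robots.txt file into a policy object."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(urls, policy, user_agent=USER_AGENT):
    """Keep only the URLs that the robot policy allows us to fetch."""
    return [url for url in urls if policy.can_fetch(user_agent, url)]


def crawl(urls, policy, fetch, delay=DELAY_SECONDS,
          max_workers=MAX_CONCURRENT_DOWNLOADS):
    """Fetch the allowed URLs with few workers and a pause after each one."""

    def polite_fetch(url):
        result = fetch(url)
        time.sleep(delay)  # leave the target server some breathing room
        return result

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(polite_fetch, allowed_urls(urls, policy)))
```

With a single worker and a pause of a few seconds, the request rate stays far below what a server would normally treat as an attack, at the price of a slower collection. Any callable that takes a URL can serve as fetch, for example a thin wrapper around urllib.request.urlopen.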

For more information on "Denial of service", see:

http://directory.google.com/Top/Computers/Internet/Abuse/Denial_of_Service

For more information on the "Invisible Web" and dynamic servers, see:

http://searchenginewatch.com/sereport/00/08-deepweb.html or http://www.webliminal.com/essentialweb/invisible.html
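
As a rough illustration of the dynamic Web pages problem, the Python sketch below flags URLs that are likely to be generated on demand, so that they can be reviewed by hand rather than expected in the collected content. It is only a heuristic and not a feature of this robot: the function name looks_dynamic and the extension list DYNAMIC_EXTENSIONS are assumptions made for the example, and a page can be dynamic without showing either sign in its URL.

```python
from urllib.parse import urlsplit

# Assumed, non-exhaustive list of extensions that often indicate
# server-side page generation.
DYNAMIC_EXTENSIONS = (".php", ".asp", ".aspx", ".jsp", ".cgi")


def looks_dynamic(url):
    """Return True if the URL carries a query string or a script-like extension."""
    parts = urlsplit(url)
    if parts.query:
        # Parameters after "?" usually come from a form or a search box.
        return True
    return parts.path.lower().endswith(DYNAMIC_EXTENSIONS)
```

For example, looks_dynamic("http://example.com/search.php?q=spider") returns True, while looks_dynamic("http://example.com/index.html") returns False.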



Copyright Acetic and Semantic Knowledge, all rights reserved
www.semantic-knowledge.com