author    ale <ale@incal.net>  2018-09-02 11:17:06 +0100
committer ale <ale@incal.net>  2018-09-02 11:17:06 +0100
commit    59f3725ff8c81dca1f1305da34e877c0316d4152
tree      f88620a24d006a9672390d9c13807d0b70e86573
parent    66ce654d5be9c26ba69cc75ac12ff6662410c69d
Explicitly mention the crawler limitations
 README.md | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)
diff --git a/README.md b/README.md
index 0de9d15..3e4d973 100644
--- a/README.md
+++ b/README.md
@@ -29,8 +29,8 @@ as arguments on the command line:
    $ crawl http://example.com/
By default, the tool will store the output WARC file and its own
-database in the current directory. This can be controlled with the
-*--output* and *--state* command-line options.
+temporary crawl database in the current directory. This can be
+controlled with the *--output* and *--state* command-line options.
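+
+For instance, to write the WARC data to a specific file and keep the
+temporary crawl database in a separate path (the file names below are
+just placeholders):
+
+    $ crawl --output=example.warc.gz --state=crawl.db http://example.com/  # placeholder paths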
The crawling scope is controlled with a set of overlapping checks:
@@ -44,6 +44,29 @@ The crawling scope is controlled with a set of overlapping checks:
If the program is interrupted, running it again with the same command
line from the same directory will cause it to resume crawling from
-where it stopped. At the end of a successful crawl, the database will
-be removed (unless you specify the *--keep* option, for debugging
-purposes).
+where it stopped. At the end of a successful crawl, the temporary
+crawl database will be removed (unless you specify the *--keep*
+option, for debugging purposes).
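+
+For instance, a crawl interrupted with Ctrl-C can be resumed simply by
+running the same command again from the same directory:
+
+    $ crawl http://example.com/
+    ^C
+    $ crawl http://example.com/   # picks up from where it stopped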
+
+It is possible to tell the crawler to exclude URLs matching specific
+regex patterns by using the *--exclude* or *--exclude-from-file*
+options. These options may be repeated multiple times. The crawler
+comes with its own built-in set of URI regular expressions meant to
+avoid calendars, admin panels of common CMS applications, and other
+well-known pitfalls. This list is sourced from the
+[ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot) project.
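+
+As a quick sketch (the pattern and file name here are only examples),
+the following skips URLs ending in *.cgi* as well as anything matched
+by patterns read from a local file:
+
+    $ crawl --exclude='\.cgi$' --exclude-from-file=excludes.txt http://example.com/  # example pattern/file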
+
+## Limitations
+
+Like most crawlers, this one has a number of limitations:
+
+* it completely ignores *robots.txt*. You can make such policy
+  decisions yourself by turning the robots.txt into a list of patterns
+  to be used with *--exclude-from-file* (see the example after this
+  list).
+* it does not embed a JavaScript engine, so JavaScript-rendered
+  elements will not be detected.
+* CSS parsing is limited (uses regular expressions), so some *url()*
+ resources might not be detected.
+* it expects reasonably well-formed HTML, so it may fail to extract
+ links from particularly broken pages.
+* support for \<object\> and \<video\> tags is limited.
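+
+A rough sketch of the robots.txt-to-patterns idea mentioned above (the
+file names are placeholders; Disallow prefixes are reused verbatim as
+regular expressions, so entries containing regex special characters
+would need escaping first):
+
+    $ sed -n 's/^[Dd]isallow: *//p' robots.txt > robots-excludes.txt   # extract Disallow prefixes
+    $ crawl --exclude-from-file=robots-excludes.txt http://example.com/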