author    ale <ale@incal.net>  2018-09-02 11:17:06 +0100
committer ale <ale@incal.net>  2018-09-02 11:17:06 +0100
commit    59f3725ff8c81dca1f1305da34e877c0316d4152
tree      f88620a24d006a9672390d9c13807d0b70e86573
parent    66ce654d5be9c26ba69cc75ac12ff6662410c69d
Explicitly mention the crawler limitations
 README.md | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)
diff --git a/README.md b/README.md
index 0de9d15..3e4d973 100644
--- a/README.md
+++ b/README.md
@@ -29,8 +29,8 @@ as arguments on the command line:
    $ crawl http://example.com/
By default, the tool will store the output WARC file and its own
-database in the current directory. This can be controlled with the
-*--output* and *--state* command-line options.
+temporary crawl database in the current directory. This can be
+controlled with the *--output* and *--state* command-line options.
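+
+For instance, to write the WARC data to a specific file and keep the
+temporary crawl database in a separate path (the file names below are
+just placeholders):
+
+    $ crawl --output=example.warc.gz --state=crawl.db http://example.com/  # placeholder paths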
The crawling scope is controlled with a set of overlapping checks:
@@ -44,6 +44,29 @@ The crawling scope is controlled with a set of overlapping checks:
If the program is interrupted, running it again with the same command
line from the same directory will cause it to resume crawling from
-where it stopped. At the end of a successful crawl, the database will
-be removed (unless you specify the *--keep* option, for debugging
-purposes).
+where it stopped. At the end of a successful crawl, the temporary
+crawl database will be removed (unless you specify the *--keep*
+option, for debugging purposes).
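+
+For instance, a crawl interrupted with Ctrl-C can be resumed simply by
+running the same command again from the same directory:
+
+    $ crawl http://example.com/
+    ^C
+    $ crawl http://example.com/   # picks up from where it stopped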
+
+It is possible to tell the crawler to exclude URLs matching specific
+regex patterns by using the *--exclude* or *--exclude-from-file*
+options. These options may be repeated multiple times. The crawler
+comes with its own built-in set of URI regular expressions meant to
+avoid calendars, admin panels of common CMS applications, and other
+well-known pitfalls. This list is sourced from the
+[ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot) project.
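+
+As a quick sketch (the pattern and file name here are only examples),
+the following skips URLs ending in *.cgi* as well as anything matched
+by patterns read from a local file:
+
+    $ crawl --exclude='\.cgi$' --exclude-from-file=excludes.txt http://example.com/  # example pattern/file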
+
+## Limitations
+
+Like most crawlers, this one has a number of limitations:
+
+* it completely ignores *robots.txt*. You can make such policy
+  decisions yourself by turning the robots.txt into a list of patterns
+  to be used with *--exclude-from-file* (see the example after this
+  list).
+* it does not embed a JavaScript engine, so JavaScript-rendered
+  elements will not be detected.
+* CSS parsing is limited (uses regular expressions), so some *url()*
+ resources might not be detected.
+* it expects reasonably well-formed HTML, so it may fail to extract
+ links from particularly broken pages.
+* support for \<object\> and \<video\> tags is limited.
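+
+A rough sketch of the robots.txt-to-patterns idea mentioned above (the
+file names are placeholders; Disallow prefixes are reused verbatim as
+regular expressions, so entries containing regex special characters
+would need escaping first):
+
+    $ sed -n 's/^[Dd]isallow: *//p' robots.txt > robots-excludes.txt   # extract Disallow prefixes
+    $ crawl --exclude-from-file=robots-excludes.txt http://example.com/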