From 59f3725ff8c81dca1f1305da34e877c0316d4152 Mon Sep 17 00:00:00 2001
From: ale
Date: Sun, 2 Sep 2018 11:17:06 +0100
Subject: Explicitly mention the crawler limitations

---
 README.md | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 0de9d15..3e4d973 100644
--- a/README.md
+++ b/README.md
@@ -29,8 +29,8 @@ as arguments on the command line:
     $ crawl http://example.com/
 
 By default, the tool will store the output WARC file and its own
-database in the current directory. This can be controlled with the
-*--output* and *--state* command-line options.
+temporary crawl database in the current directory. This can be
+controlled with the *--output* and *--state* command-line options.
 
 The crawling scope is controlled with a set of overlapping checks:
 
@@ -44,6 +44,29 @@ The crawling scope is controlled with a set of overlapping checks:
 
 If the program is interrupted, running it again with the same command
 line from the same directory will cause it to resume crawling from
-where it stopped. At the end of a successful crawl, the database will
-be removed (unless you specify the *--keep* option, for debugging
-purposes).
+where it stopped. At the end of a successful crawl, the temporary
+crawl database will be removed (unless you specify the *--keep*
+option, for debugging purposes).
+
+It is possible to tell the crawler to exclude URLs matching specific
+regex patterns by using the *--exclude* or *--exclude-from-file*
+options. These options may be repeated multiple times. The crawler
+comes with its own built-in set of URI regular expressions meant to
+avoid calendars, admin panels of common CMS applications, and other
+well-known pitfalls. This list is sourced from the
+[ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot) project.
+
+## Limitations
+
+Like most crawlers, this one has a number of limitations:
+
+* it completely ignores *robots.txt*. You can make such policy
+  decisions yourself by turning the robots.txt into a list of patterns
+  to be used with *--exclude-from-file*.
+* it does not embed a Javascript engine, so Javascript-rendered
+  elements will not be detected.
+* CSS parsing is limited (uses regular expressions), so some *url()*
+  resources might not be detected.
+* it expects reasonably well-formed HTML, so it may fail to extract
+  links from particularly broken pages.
+* support for \ and \ tags is limited.
--
cgit v1.2.3-54-g00ecf
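
The *robots.txt* bullet in the patch above suggests turning a site's robots.txt into a list of patterns for the exclude options. Below is a minimal sketch of that conversion; it is not part of the crawler, the `robots2excludes` name is invented, and it assumes the exclude patterns are regular expressions matched against the full URL, so each *Disallow* path is escaped and wrapped in `.*`:

    // robots2excludes: read a robots.txt on stdin and print one regular
    // expression per line, intended (under the assumptions stated above)
    // to be fed to the crawler's exclude-patterns-from-file option.
    package main

    import (
        "bufio"
        "fmt"
        "os"
        "regexp"
        "strings"
    )

    func main() {
        scanner := bufio.NewScanner(os.Stdin)
        for scanner.Scan() {
            line := strings.TrimSpace(scanner.Text())
            // Only Disallow directives are considered; User-agent,
            // Allow and comments are ignored in this sketch.
            if !strings.HasPrefix(strings.ToLower(line), "disallow:") {
                continue
            }
            path := strings.TrimSpace(line[len("disallow:"):])
            if path == "" {
                continue // an empty Disallow allows everything
            }
            // Escape regexp metacharacters and match the path
            // anywhere in the URL.
            fmt.Printf(".*%s.*\n", regexp.QuoteMeta(path))
        }
        if err := scanner.Err(); err != nil {
            fmt.Fprintln(os.Stderr, "error reading robots.txt:", err)
            os.Exit(1)
        }
    }

Assuming the option spelling used in the README text above, the generated list could then be passed to the crawler roughly like this:

    $ curl -s http://example.com/robots.txt | go run robots2excludes.go > excludes.txt
    $ crawl --exclude-from-file excludes.txt http://example.com/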