author ale <ale@incal.net> 2019-01-02 09:53:42 +0000
committer ale <ale@incal.net> 2019-01-02 09:53:42 +0000
commit c5ec7eb826bfd08aa6e8dd880efa15930f78ba19
tree 7c7d5fcfc55922cf78a97001b7ca4b879b747d28
parent 3518feaf05fcb7f745975851c6684a63532ff19a
Add multi-file output
The output stage can now write to size-limited, rotating WARC files using a user-specified pattern, so that output files are always unique.
Diffstat (limited to 'README.md')
-rw-r--r-- README.md 9
1 files changed, 9 insertions, 0 deletions
diff --git a/README.md b/README.md
index 38f7bc3..b4d28e5 100644
--- a/README.md
+++ b/README.md
@@ -56,6 +56,15 @@ avoid calendars, admin panels of common CMS applications, and other
well-known pitfalls. This list is sourced from the
[ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot) project.
+If you're running a larger crawl, the tool can be told to rotate the
+output WARC files when they reach a certain size (100MB by default,
+controlled by the *--output-max-size* flag). To do so, make sure the
+*--output* option contains the literal token `%s` somewhere, which
+will be replaced by a unique identifier every time a new file is
+created, e.g.:
+
+ $ crawl --output=out-%s.warc.gz http://example.com/
+
## Limitations
Like most crawlers, this one has a number of limitations: