From c5ec7eb826bfd08aa6e8dd880efa15930f78ba19 Mon Sep 17 00:00:00 2001
From: ale
Date: Wed, 2 Jan 2019 09:53:42 +0000
Subject: Add multi-file output

The output stage can now write to size-limited, rotating WARC files
using a user-specified pattern, so that output files are always
unique.
---
 README.md | 9 +++++++++
 1 file changed, 9 insertions(+)

(limited to 'README.md')

diff --git a/README.md b/README.md
index 38f7bc3..b4d28e5 100644
--- a/README.md
+++ b/README.md
@@ -56,6 +56,15 @@ avoid calendars, admin panels of common CMS applications, and other
 well-known pitfalls. This list is sourced from the
 [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot) project.
 
+If you're running a larger crawl, the tool can be told to rotate the
+output WARC files when they reach a certain size (100MB by default,
+controlled by the *--output-max-size* flag). To do so, make sure the
+*--output* option contains somewhere the literal token `%s`, which
+will be replaced by a unique identifier every time a new file is
+created, e.g.:
+
+    $ crawl --output=out-%s.warc.gz http://example.com/
+
 ## Limitations
 
 Like most crawlers, this one has a number of limitations:
-- 
cgit v1.2.3-54-g00ecf
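
The rotation behaviour described in the README hunk can be illustrated with a small sketch. This is not the tool's implementation: the `RotatingWriter` class, the use of a random hex string as the unique identifier, and the rule of rotating only between records are all assumptions made for illustration. The sketch also writes raw bytes rather than real gzip-compressed WARC records.

```python
import os
import tempfile
import uuid


class RotatingWriter:
    """Hypothetical size-limited writer; not taken from the crawl source."""

    def __init__(self, pattern, max_size=100 * 1024 * 1024):
        # The pattern must carry the literal token so each file gets its own name.
        if "%s" not in pattern:
            raise ValueError("pattern must contain the literal token %s")
        self.pattern = pattern
        self.max_size = max_size
        self.paths = []  # every file created so far, in creation order
        self._fh = None
        self._written = 0

    def _open_next(self):
        if self._fh is not None:
            self._fh.close()
        # Assumption: a random hex string serves as the unique identifier.
        path = self.pattern.replace("%s", uuid.uuid4().hex, 1)
        self._fh = open(path, "xb")  # "x" mode fails if the file already exists
        self._written = 0
        self.paths.append(path)

    def write(self, record):
        # Rotate before writing once the current file has reached the limit.
        if self._fh is None or self._written >= self.max_size:
            self._open_next()
        self._fh.write(record)
        self._written += len(record)

    def close(self):
        if self._fh is not None:
            self._fh.close()
            self._fh = None


def demo():
    """Write three 10-byte records with a 15-byte limit; return the file sizes."""
    with tempfile.TemporaryDirectory() as tmp:
        writer = RotatingWriter(os.path.join(tmp, "out-%s.warc.gz"), max_size=15)
        for _ in range(3):
            writer.write(b"x" * 10)
        writer.close()
        return [os.path.getsize(p) for p in writer.paths]
```

In this sketch a record is never split across files, so a file may exceed the limit by up to one record before the next write triggers rotation. Whether the real tool behaves the same way is not stated in the patch.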