author | ale <ale@incal.net> | 2019-01-02 09:53:42 +0000 |
---|---|---|
committer | ale <ale@incal.net> | 2019-01-02 09:53:42 +0000 |
commit | c5ec7eb826bfd08aa6e8dd880efa15930f78ba19 (patch) | |
tree | 7c7d5fcfc55922cf78a97001b7ca4b879b747d28 /README.md | |
parent | 3518feaf05fcb7f745975851c6684a63532ff19a (diff) | |
Add multi-file output
The output stage can now write to size-limited, rotating WARC files
using a user-specified pattern, so that output files are always
unique.
Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 9 |
1 files changed, 9 insertions, 0 deletions
@@ -56,6 +56,15 @@ avoid calendars, admin panels of common CMS applications, and other
 well-known pitfalls. This list is sourced from the
 [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot) project.
 
+If you're running a larger crawl, the tool can be told to rotate the
+output WARC files when they reach a certain size (100MB by default,
+controlled by the *--output-max-size* flag. To do so, make sure the
+*--output* option contains somewhere the literal token `%s`, which
+will be replaced by a unique identifier every time a new file is
+created, e.g.:
+
+    $ crawl --output=out-%s.warc.gz http://example.com/
+
 ## Limitations
 
 Like most crawlers, this one has a number of limitations:
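
The rotation behaviour described in the added README text can be illustrated with a minimal Python sketch. This is not the tool's actual implementation: the `RotatingWriter` class, the use of a random UUID as the "unique identifier", and the choice to rotate just before the first write after the limit has been reached are all assumptions made for illustration. Only the `%s` token and the size limit come from the commit.

```python
import uuid


class RotatingWriter:
    """Hypothetical helper: writes records to size-limited, uniquely named files."""

    def __init__(self, pattern, max_size):
        # The commit requires the literal token %s in the output pattern.
        if "%s" not in pattern:
            raise ValueError("output pattern must contain the literal token %s")
        self.pattern = pattern
        self.max_size = max_size
        self.names = []      # every file name created so far
        self._file = None
        self._written = 0    # bytes written to the current file

    def _rotate(self):
        # Close the current file (if any) and open a fresh one.
        if self._file is not None:
            self._file.close()
        # Assumption: a random UUID stands in for the "unique identifier".
        name = self.pattern.replace("%s", uuid.uuid4().hex, 1)
        self._file = open(name, "wb")
        self._written = 0
        self.names.append(name)

    def write(self, record):
        # Open a new file on first use, or once the current one has reached
        # the limit. A file can therefore exceed max_size by up to one record.
        if self._file is None or self._written >= self.max_size:
            self._rotate()
        self._file.write(record)
        self._written += len(record)

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None
```

With a pattern such as `out-%s.warc.gz`, every rotation yields a new file name, so an earlier output file is never overwritten. Substituting the token with `str.replace` rather than `%` formatting avoids errors if the pattern contains other percent signs.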