author | Jordan <me@jordan.im> | 2022-02-10 20:19:27 -0700 |
---|---|---|
committer | Jordan <me@jordan.im> | 2022-02-10 20:19:27 -0700 |
commit | caadc00d8dfadc0c9e0237fc7377eb632f500926 (patch) | |
tree | 208a8ea6baac1caa3674ecdfcefa656c254212fd /README.md | |
parent | 9ff760bdc4b0d208b64ba33e3af13228f4aca58f (diff) | |
crawl, readme: max default WARC size 100 MB -> 5 GB
Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 1 |
1 files changed, 1 insertions, 0 deletions
@@ -13,6 +13,7 @@ Notable changes include:
 a browser
 * store crawl contents in a dated directory
 * update ignore regex set per updates to [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot)
+* max default WARC size 100 MB -> 5 GB
 
 This tool can crawl a bunch of URLs for HTML content, and save the
 results in a nice WARC file. It has little control over its traffic,