author | Jordan <me@jordan.im> | 2022-02-14 21:02:12 -0700 |
---|---|---|
committer | Jordan <me@jordan.im> | 2022-02-14 21:02:12 -0700 |
commit | a6a6fef1c7cc7d6878e8aa36541565fb3e0c9747 (patch) | |
tree | 7928f9229c26a12917a2303408dd6ce4fb691432 /README.md | |
parent | 13996013034f19d0d5ddf00a2926d2a117610170 (diff) | |
crawl, readme: record assembled seed URLs to seed_urls file
Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 1 |
1 files changed, 1 insertions, 0 deletions
@@ -15,6 +15,7 @@ Notable changes include:
 * update ignore regex set per updates to
   [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot)
 * max default WARC size 100 MB -> 5 GB
+* record assembled seed URLs to seed_urls file
 
 This tool can crawl a bunch of URLs for HTML content, and save the
 results in a nice WARC file. It has little control over its traffic,