| author | Jordan <me@jordan.im> | 2022-02-10 19:35:23 -0700 |
|---|---|---|
| committer | Jordan <me@jordan.im> | 2022-02-10 19:35:23 -0700 |
| commit | fd60e00118d107e1d53fb57acc64aceb29628760 | |
| tree | 06f66a75ead831a75f44c5f1dd56e6006ea6bb66 | |
| parent | 3897d5bbdcc9aa52d88b6602e3542e690ee74f6c | |
readme: document changes from upstream
-rw-r--r-- | README.md | 18 |
1 file changed, 16 insertions(+), 2 deletions(-)
```diff
@@ -1,6 +1,20 @@
 A very simple crawler
 =====================
 
+This is a fork of [crawl](https://git.autistici.org/ale/crawl) with
+changes which make crawl more amenable to serve as a drop-in
+replacement for [wpull](https://github.com/ArchiveTeam/wpull)/
+[grab-site](https://github.com/ArchiveTeam/grab-site). Notable changes
+include:
+
+* --bind, support making outbound requests from a particular interface
+* --resume, directory containing the crawl state to continue from
+* infinite recursion depth by default
+* set User-Agent fingerprint to Firefox on Windows to look more like
+  a browser
+* store crawl contents in a dated directory
+* update ignore regex set per updates to [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot)
+
 This tool can crawl a bunch of URLs for HTML content, and save the
 results in a nice WARC file. It has little control over its traffic,
 save for a limit on concurrent outbound requests. An external tool
@@ -17,7 +31,7 @@ interrupted and restarted without issues.
 Assuming you have a proper [Go](https://golang.org/) environment
 setup, you can install this package by running:
 
-    $ go get git.autistici.org/ale/crawl/cmd/crawl
+    $ go get git.jordan.im/crawl/cmd/crawl
 
 This should install the *crawl* binary in your $GOPATH/bin directory.
 
@@ -82,5 +96,5 @@ Like most crawlers, this one has a number of limitations:
 
 # Contact
 
-Send bugs and patches to ale@incal.net.
+Send bugs and patches to me@jordan.im.
```
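The `--bind` option added in this fork maps naturally onto Go's `net.Dialer`, which can pin outbound TCP connections to a local source address. The sketch below illustrates that mechanism only; it assumes the fork works this way, and `newBoundClient` is a hypothetical helper, not code from this repository.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// newBoundClient builds an http.Client whose outbound connections
// originate from localIP. Hypothetical helper: the fork's actual
// --bind implementation may differ.
func newBoundClient(localIP string) (*http.Client, error) {
	ip := net.ParseIP(localIP)
	if ip == nil {
		return nil, fmt.Errorf("invalid IP: %q", localIP)
	}
	dialer := &net.Dialer{
		Timeout: 30 * time.Second,
		// LocalAddr selects the source address (and thus the
		// interface) used for outgoing TCP connections.
		LocalAddr: &net.TCPAddr{IP: ip},
	}
	return &http.Client{
		Transport: &http.Transport{DialContext: dialer.DialContext},
	}, nil
}

func main() {
	// 192.0.2.10 is a placeholder (TEST-NET) address; substitute an
	// address assigned to the interface requests should leave from.
	client, err := newBoundClient("192.0.2.10")
	if err != nil {
		fmt.Println(err)
		return
	}
	resp, err := client.Get("https://example.com/")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

The Firefox-on-Windows User-Agent change listed in the diff, by contrast, is a per-request header (`req.Header.Set("User-Agent", ...)`) rather than a transport-level concern.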