author: Jordan <me@jordan.im> 2022-02-10 19:35:23 -0700
committer: Jordan <me@jordan.im> 2022-02-10 19:35:23 -0700
commit: fd60e00118d107e1d53fb57acc64aceb29628760 (patch)
tree: 06f66a75ead831a75f44c5f1dd56e6006ea6bb66
parent: 3897d5bbdcc9aa52d88b6602e3542e690ee74f6c (diff)
download: crawl-fd60e00118d107e1d53fb57acc64aceb29628760.tar.gz, crawl-fd60e00118d107e1d53fb57acc64aceb29628760.zip

readme: document changes from upstream
-rw-r--r--  README.md  18
1 files changed, 16 insertions, 2 deletions
diff --git a/README.md b/README.md
index bcf1bba..07720f2 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,20 @@
A very simple crawler
=====================
+This is a fork of [crawl](https://git.autistici.org/ale/crawl) with
+changes which make crawl more amenable to serve as a drop-in
+replacement for [wpull](https://github.com/ArchiveTeam/wpull)/
+[grab-site](https://github.com/ArchiveTeam/grab-site). Notable changes
+include:
+
+* --bind, support making outbound requests from a particular interface
+* --resume, directory containing the crawl state to continue from
+* infinite recursion depth by default
+* set User-Agent fingerprint to Firefox on Windows to look more like
+  a browser
+* store crawl contents in a dated directory
+* update ignore regex set per updates to [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot)
+
This tool can crawl a bunch of URLs for HTML content, and save the
results in a nice WARC file. It has little control over its traffic,
save for a limit on concurrent outbound requests. An external tool
@@ -17,7 +31,7 @@ interrupted and restarted without issues.
Assuming you have a proper [Go](https://golang.org/) environment setup,
you can install this package by running:
- $ go get git.autistici.org/ale/crawl/cmd/crawl
+ $ go get git.jordan.im/crawl/cmd/crawl
This should install the *crawl* binary in your $GOPATH/bin directory.
@@ -82,5 +96,5 @@ Like most crawlers, this one has a number of limitations:
# Contact
-Send bugs and patches to ale@incal.net.
+Send bugs and patches to me@jordan.im.
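
The `--bind` entry in the README changes above describes making outbound requests from a particular interface. The sketch below shows one common way to get that behaviour in Go, by setting `LocalAddr` on a `net.Dialer` and using it in an `http.Transport`. It is not taken from this repository: the helper name `newBoundClient` is hypothetical, and it assumes the interface is identified by a local IP address, which the README does not specify.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// newBoundClient is a hypothetical helper (not part of crawl). It returns an
// HTTP client whose outbound TCP connections originate from localIP.
func newBoundClient(localIP string) (*http.Client, error) {
	ip := net.ParseIP(localIP)
	if ip == nil {
		return nil, fmt.Errorf("invalid local IP address: %q", localIP)
	}
	dialer := &net.Dialer{
		// Port is left at 0 so the kernel picks an ephemeral source port.
		LocalAddr: &net.TCPAddr{IP: ip},
		Timeout:   30 * time.Second,
	}
	transport := &http.Transport{
		// Every connection the transport opens goes through the bound dialer.
		DialContext: dialer.DialContext,
	}
	return &http.Client{Transport: transport}, nil
}

func main() {
	// 127.0.0.1 is only a placeholder; use an address assigned to the
	// interface you want requests to leave from.
	client, err := newBoundClient("127.0.0.1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("client ready:", client.Transport != nil)
}
```

Binding at the dialer level means redirects and connection reuse all share the same source address, since every new connection is opened by the same `DialContext`.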