Diffstat (limited to 'README.md')
-rw-r--r--  README.md | 18
 1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/README.md b/README.md
index bcf1bba..07720f2 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,20 @@
A very simple crawler
=====================
+This is a fork of [crawl](https://git.autistici.org/ale/crawl) with
+changes that make it better suited as a drop-in replacement for
+[wpull](https://github.com/ArchiveTeam/wpull)/
+[grab-site](https://github.com/ArchiveTeam/grab-site). Notable changes
+include:
+
+* --bind: make outbound requests from a particular interface (see
+  the Go sketch below)
+* --resume: continue from the crawl state in the given directory
+* infinite recursion depth by default
+* User-Agent set to that of Firefox on Windows to better resemble a browser
+* crawl contents stored in a dated directory
+* ignore regex set kept in sync with [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot)
+
This tool can crawl a bunch of URLs for HTML content, and save the
results in a nice WARC file. It has little control over its traffic,
save for a limit on concurrent outbound requests. An external tool
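As an aside on the --bind change above: in Go, the project's language, binding outbound HTTP requests to a particular source address is typically done with a net.Dialer whose LocalAddr is set, wired into an http.Transport. The sketch below is a minimal, self-contained illustration of that technique, not the fork's actual implementation; the address is a placeholder and must belong to a local interface for the bind to succeed.

    package main

    import (
    	"log"
    	"net"
    	"net/http"
    	"time"
    )

    func main() {
    	// Placeholder source address (what a --bind flag would supply);
    	// port 0 lets the kernel choose an ephemeral port.
    	laddr, err := net.ResolveTCPAddr("tcp", "192.0.2.10:0")
    	if err != nil {
    		log.Fatal(err)
    	}

    	// A Dialer with LocalAddr set binds every outbound TCP
    	// connection to that source address before connecting.
    	dialer := &net.Dialer{
    		LocalAddr: laddr,
    		Timeout:   30 * time.Second,
    	}

    	client := &http.Client{
    		Transport: &http.Transport{DialContext: dialer.DialContext},
    	}

    	resp, err := client.Get("https://example.com/")
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer resp.Body.Close()
    	log.Println(resp.Status)
    }

Because LocalAddr applies at the dialer, every connection made through the client originates from the given address, which is the behavior a --bind style flag needs.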
@@ -17,7 +31,7 @@ interrupted and restarted without issues.
Assuming you have a proper [Go](https://golang.org/) environment setup,
you can install this package by running:
- $ go get git.autistici.org/ale/crawl/cmd/crawl
+ $ go get git.jordan.im/crawl/cmd/crawl
This should install the *crawl* binary in your $GOPATH/bin directory.
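A hypothetical invocation combining the fork's new flags might then look like the following; the argument syntax shown for --bind and --resume is an assumption for illustration, not taken from the fork's help output:

    $ crawl --bind 192.0.2.10 --resume ./crawl-20210101 https://example.com/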
@@ -82,5 +96,5 @@ Like most crawlers, this one has a number of limitations:
# Contact
-Send bugs and patches to ale@incal.net.
+Send bugs and patches to me@jordan.im.