From fd60e00118d107e1d53fb57acc64aceb29628760 Mon Sep 17 00:00:00 2001
From: Jordan
Date: Thu, 10 Feb 2022 19:35:23 -0700
Subject: readme: document changes from upstream

---
 README.md | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index bcf1bba..07720f2 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,20 @@
 A very simple crawler
 =====================
 
+This is a fork of [crawl](https://git.autistici.org/ale/crawl) with
+changes that make crawl better suited to serve as a drop-in
+replacement for [wpull](https://github.com/ArchiveTeam/wpull)/
+[grab-site](https://github.com/ArchiveTeam/grab-site). Notable changes
+include:
+
+* --bind, make outbound requests from a particular local interface
+* --resume, resume a crawl from a directory containing saved state
+* infinite recursion depth by default
+* set the User-Agent to a Firefox-on-Windows fingerprint to look
+  more like a browser
+* store crawl contents in a dated directory
+* update the ignore regex set to track changes to [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot)
+
 This tool can crawl a bunch of URLs for HTML content, and save the
 results in a nice WARC file. It has little control over its traffic,
 save for a limit on concurrent outbound requests. An external tool
@@ -17,7 +31,7 @@ interrupted and restarted without issues.
 Assuming you have a proper [Go](https://golang.org/) environment setup,
 you can install this package by running:
 
-    $ go get git.autistici.org/ale/crawl/cmd/crawl
+    $ go get git.jordan.im/crawl/cmd/crawl
 
 This should install the *crawl* binary in your $GOPATH/bin directory.
 
@@ -82,5 +96,5 @@ Like most crawlers, this one has a number of limitations:
 
 # Contact
 
-Send bugs and patches to ale@incal.net.
+Send bugs and patches to me@jordan.im.
 
-- 
cgit v1.2.3-54-g00ecf
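
As an aside on the --bind change documented above: in Go, binding
outbound requests to a particular local address is typically done by
setting LocalAddr on the net.Dialer that backs the HTTP transport. The
sketch below is illustrative only, not the fork's actual code; it
assumes --bind takes a local IP address (the fork may instead accept
an interface name), and the flag wiring here is hypothetical.

    package main

    import (
    	"flag"
    	"fmt"
    	"net"
    	"net/http"
    	"time"
    )

    func main() {
    	// Hypothetical flag wiring; the fork's real option parsing may differ.
    	bind := flag.String("bind", "", "local IP address to make outbound requests from")
    	flag.Parse()

    	dialer := &net.Dialer{Timeout: 30 * time.Second}
    	if *bind != "" {
    		// Bind outbound TCP connections to the given local address.
    		// Port 0 lets the kernel pick an ephemeral source port.
    		// (Error handling for an unparseable IP is omitted here.)
    		dialer.LocalAddr = &net.TCPAddr{IP: net.ParseIP(*bind), Port: 0}
    	}

    	// All of the crawler's fetches would go through this client, so
    	// every connection is dialed from the chosen source address.
    	client := &http.Client{
    		Transport: &http.Transport{DialContext: dialer.DialContext},
    	}

    	resp, err := client.Get("https://example.com/")
    	if err != nil {
    		fmt.Println("fetch failed:", err)
    		return
    	}
    	defer resp.Body.Close()
    	fmt.Println("status:", resp.Status)
    }

The key point is simply that the source address is fixed on the dialer
before any connection is made, so no per-request code needs to change.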