author     Jordan <me@jordan.im>  2022-02-11 09:46:05 -0700
committer  Jordan <me@jordan.im>  2022-02-11 09:46:05 -0700
commit     13996013034f19d0d5ddf00a2926d2a117610170 (patch)
tree       d3dbe007f27c245e321ce6944dce271ea35944cb
parent     429aa56ef914050931eb352365f26185ec07c193 (diff)
download   crawl-13996013034f19d0d5ddf00a2926d2a117610170.tar.gz
           crawl-13996013034f19d0d5ddf00a2926d2a117610170.zip
readme: go get -> go install (deprecated), misc updates
-rw-r--r--  README.md  22
1 file changed, 9 insertions, 13 deletions
diff --git a/README.md b/README.md
index c7124a0..5c740fc 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,8 @@ Notable changes include:
* set User-Agent fingerprint to Firefox on Windows to look more like
  a browser
* store crawl contents in a dated directory
-* update ignore regex set per updates to [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot)
+* update ignore regex set per updates to
+  [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot)
* max default WARC size 100 MB -> 5 GB

This tool can crawl a bunch of URLs for HTML content, and save the
@@ -31,7 +32,7 @@ interrupted and restarted without issues.
Assuming you have a proper [Go](https://golang.org/) environment setup,
you can install this package by running:

-    $ go get git.jordan.im/crawl/cmd/crawl
+    $ go install git.jordan.im/crawl/cmd/crawl@latest

This should install the *crawl* binary in your $GOPATH/bin directory.
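
A quick way to confirm the result of the install step above, assuming a
standard Go toolchain (`go env GOPATH` is standard; the binary name
follows from the package path quoted in the hunk):

    $ ls $(go env GOPATH)/bin/crawl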
@@ -43,8 +44,7 @@ as arguments on the command line:

    $ crawl http://example.com/

By default, the tool will store the output WARC file and its own
-temporary crawl database in the current directory. This can be
-controlled with the *--output* and *--state* command-line options.
+temporary crawl database in a newly-created directory.

The crawling scope is controlled with a set of overlapping checks:
@@ -58,9 +58,10 @@ The crawling scope is controlled with a set of overlapping checks:
If the program is interrupted, running it again with the same command
line from the same directory will cause it to resume crawling from
-where it stopped. At the end of a successful crawl, the temporary
-crawl database will be removed (unless you specify the *--keep*
-option, for debugging purposes).
+where it stopped when a previous crawl state directory is passed with
+*--resume*. At the end of a successful crawl, the temporary crawl
+database will be removed (unless you specify the *--keep* option, for
+debugging purposes).

It is possible to tell the crawler to exclude URLs matching specific
regex patterns by using the *--exclude* or *--exclude-from-file*
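
As an aside, a hedged sketch of combining the resume and exclude options
described in this hunk; the state directory name, the excludes file, and
the exact *--resume*=DIR argument form are assumptions rather than taken
from the source:

    $ crawl --resume=crawl-20220211/ --exclude-from-file=excludes.txt http://example.com/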
@@ -72,12 +73,7 @@ well-known pitfalls. This list is sourced from the

If you're running a larger crawl, the tool can be told to rotate the
output WARC files when they reach a certain size (100MB by default,
-controlled by the *--output-max-size* flag. To do so, make sure the
-*--output* option contains somewhere the literal token `%s`, which
-will be replaced by a unique identifier every time a new file is
-created, e.g.:
-
-    $ crawl --output=out-%s.warc.gz http://example.com/
+controlled by the *--output-max-size* flag).

## Limitations
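
Finally, a sketch of the rotation flag named in the last hunk; the value
format (a raw byte count, here 1 GiB) and the example URL are assumptions:

    $ crawl --output-max-size=1073741824 http://example.com/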