author     Jordan <me@jordan.im>  2022-02-11 09:46:05 -0700
committer  Jordan <me@jordan.im>  2022-02-11 09:46:05 -0700
commit     13996013034f19d0d5ddf00a2926d2a117610170 (patch)
tree       d3dbe007f27c245e321ce6944dce271ea35944cb
parent     429aa56ef914050931eb352365f26185ec07c193 (diff)
download   crawl-13996013034f19d0d5ddf00a2926d2a117610170.tar.gz
           crawl-13996013034f19d0d5ddf00a2926d2a117610170.zip
readme: go get -> go install (deprecated), misc updates
-rw-r--r--  README.md  22
1 file changed, 9 insertions, 13 deletions
diff --git a/README.md b/README.md
index c7124a0..5c740fc 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,8 @@ Notable changes include:
* set User-Agent fingerprint to Firefox on Windows to look more like
  a browser
* store crawl contents in a dated directory
-* update ignore regex set per updates to [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot)
+* update ignore regex set per updates to
+  [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot)
* max default WARC size 100 MB -> 5 GB

This tool can crawl a bunch of URLs for HTML content, and save the
@@ -31,7 +32,7 @@ interrupted and restarted without issues.
Assuming you have a proper [Go](https://golang.org/) environment setup,
you can install this package by running:

-    $ go get git.jordan.im/crawl/cmd/crawl
+    $ go install git.jordan.im/crawl/cmd/crawl@latest

This should install the *crawl* binary in your $GOPATH/bin directory.
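
A quick way to confirm the result of the install step above, assuming a
standard Go toolchain (`go env GOPATH` is standard; the binary name
follows from the package path quoted in the hunk):

    $ ls $(go env GOPATH)/bin/crawl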
@@ -43,8 +44,7 @@ as arguments on the command line:

    $ crawl http://example.com/

By default, the tool will store the output WARC file and its own
-temporary crawl database in the current directory. This can be
-controlled with the *--output* and *--state* command-line options.
+temporary crawl database in a newly-created directory.

The crawling scope is controlled with a set of overlapping checks:
@@ -58,9 +58,10 @@ The crawling scope is controlled with a set of overlapping checks:
If the program is interrupted, running it again with the same command
line from the same directory will cause it to resume crawling from
-where it stopped. At the end of a successful crawl, the temporary
-crawl database will be removed (unless you specify the *--keep*
-option, for debugging purposes).
+where it stopped when a previous crawl state directory is passed with
+*--resume*. At the end of a successful crawl, the temporary crawl
+database will be removed (unless you specify the *--keep* option, for
+debugging purposes).

It is possible to tell the crawler to exclude URLs matching specific
regex patterns by using the *--exclude* or *--exclude-from-file*
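
As an aside, a hedged sketch of combining the resume and exclude options
described in this hunk; the state directory name, the excludes file, and
the exact *--resume*=DIR argument form are assumptions rather than taken
from the source:

    $ crawl --resume=crawl-20220211/ --exclude-from-file=excludes.txt http://example.com/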
@@ -72,12 +73,7 @@ well-known pitfalls. This list is sourced from the

If you're running a larger crawl, the tool can be told to rotate the
output WARC files when they reach a certain size (100MB by default,
-controlled by the *--output-max-size* flag. To do so, make sure the
-*--output* option contains somewhere the literal token `%s`, which
-will be replaced by a unique identifier every time a new file is
-created, e.g.:
-
-    $ crawl --output=out-%s.warc.gz http://example.com/
+controlled by the *--output-max-size* flag).

## Limitations
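
Finally, a sketch of the rotation flag named in the last hunk; the value
format (a raw byte count, here 1 GiB) and the example URL are assumptions:

    $ crawl --output-max-size=1073741824 http://example.com/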