-rw-r--r-- | README.md | 22
1 files changed, 9 insertions, 13 deletions
@@ -12,7 +12,8 @@ Notable changes include:
 * set User-Agent fingerprint to Firefox on Windows to look more like a
   browser
 * store crawl contents in a dated directory
-* update ignore regex set per updates to [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot)
+* update ignore regex set per updates to
+  [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot)
 * max default WARC size 100 MB -> 5 GB
 
 This tool can crawl a bunch of URLs for HTML content, and save the
@@ -31,7 +32,7 @@ interrupted and restarted without issues.
 Assuming you have a proper [Go](https://golang.org/) environment setup,
 you can install this package by running:
 
-    $ go get git.jordan.im/crawl/cmd/crawl
+    $ go install git.jordan.im/crawl/cmd/crawl@latest
 
 This should install the *crawl* binary in your $GOPATH/bin directory.
 
@@ -43,8 +44,7 @@ as arguments on the command line:
     $ crawl http://example.com/
 
 By default, the tool will store the output WARC file and its own
-temporary crawl database in the current directory. This can be
-controlled with the *--output* and *--state* command-line options.
+temporary crawl database in a newly-created directory.
 
 The crawling scope is controlled with a set of overlapping checks:
 
@@ -58,9 +58,10 @@ The crawling scope is controlled with a set of overlapping checks:
 
 If the program is interrupted, running it again with the same command
 line from the same directory will cause it to resume crawling from
-where it stopped. At the end of a successful crawl, the temporary
-crawl database will be removed (unless you specify the *--keep*
-option, for debugging purposes).
+where it stopped when a previous crawl state directory is passed with
+*--resume*. At the end of a successful crawl, the temporary crawl
+database will be removed (unless you specify the *--keep* option, for
+debugging purposes).
 
 It is possible to tell the crawler to exclude URLs matching specific
 regex patterns by using the *--exclude* or *--exclude-from-file*
@@ -72,12 +73,7 @@ well-known pitfalls. This list is sourced from the
 
 If you're running a larger crawl, the tool can be told to rotate the
 output WARC files when they reach a certain size (100MB by default,
-controlled by the *--output-max-size* flag. To do so, make sure the
-*--output* option contains somewhere the literal token `%s`, which
-will be replaced by a unique identifier every time a new file is
-created, e.g.:
-
-    $ crawl --output=out-%s.warc.gz http://example.com/
+controlled by the *--output-max-size* flag.
 
 ## Limitations
 