From 03f5e29656ffdbcca651e9839bcfad6661e4c4e0 Mon Sep 17 00:00:00 2001
From: ale
Date: Tue, 19 Dec 2017 08:36:19 +0000
Subject: Add a README

---
 README.md | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..34360fa
--- /dev/null
+++ b/README.md
@@ -0,0 +1,44 @@
+A very simple crawler
+=====================
+
+This tool can crawl a bunch of URLs for HTML content, and save the
+results in a nice WARC file. It has little control over its traffic,
+save for a limit on concurrent outbound requests. Its main purpose is
+to quickly and efficiently save websites for archival purposes.
+
+The *crawl* tool saves its state in a database, so it can be safely
+interrupted and restarted without issues.
+
+# Installation
+
+From this source directory (checked out in the correct place in your
+GOPATH), run:
+
+    $ go install cmd/crawl
+
+# Usage
+
+Just run *crawl* by passing the URLs of the websites you want to crawl
+as arguments on the command line:
+
+    $ crawl http://example.com/
+
+By default, the tool will store the output WARC file and its own
+database in the current directory. This can be controlled with the
+*--output* and *--state* command-line options.
+
+The crawling scope is controlled with a set of overlapping checks:
+
+* URL scheme must be one of *http* or *https*
+* URL must have one of the seeds as a prefix (an eventual *www.*
+  prefix is implicitly ignored)
+* maximum crawling depth can be controlled with the *--depth* option
+* resources related to a page (CSS, JS, etc) will always be fetched,
+  even if on external domains, if the *--include-related* option is
+  specified
+
+If the program is interrupted, running it again with the same command
+line from the same directory will cause it to resume crawling from
+where it stopped. At the end of a successful crawl, the database will
+be removed (unless you specify the *--keep* option, for debugging
+purposes).
-- 
cgit v1.2.3-54-g00ecf
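
Note (not part of the patch): the scope checks listed in the README can be illustrated with a minimal Go sketch. This is not the tool's actual code. The names `normalize` and `inScope`, the choice to ignore the scheme when comparing against seeds, and the way depth is counted are all assumptions made for illustration. The *--include-related* behaviour is left out.

```go
package main

import (
	"net/url"
	"strings"
)

// normalize (hypothetical helper) reduces a URL to host + path, lowercasing
// the host and dropping a leading "www." so that www and non-www variants
// compare equal. Ignoring the scheme here is an assumption.
func normalize(u *url.URL) string {
	host := strings.TrimPrefix(strings.ToLower(u.Host), "www.")
	return host + u.EscapedPath()
}

// inScope (hypothetical) applies three of the checks described in the README:
// the scheme must be http or https, the depth must not exceed maxDepth, and
// the URL must have one of the seeds as a prefix.
func inScope(rawURL string, depth, maxDepth int, seeds []string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return false
	}
	if depth > maxDepth {
		return false
	}
	target := normalize(u)
	for _, s := range seeds {
		su, err := url.Parse(s)
		if err != nil {
			continue // skip seeds that do not parse
		}
		if strings.HasPrefix(target, normalize(su)) {
			return true
		}
	}
	return false
}
```

With the seed `http://example.com/`, this sketch accepts `https://www.example.com/about` and rejects `ftp://example.com/file`, `http://other.org/`, and any URL whose depth exceeds the limit.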