author    ale <ale@incal.net>  2017-12-19 08:36:19 +0000
committer ale <ale@incal.net>  2017-12-19 08:36:19 +0000
commit    03f5e29656ffdbcca651e9839bcfad6661e4c4e0 (patch)
tree      0fdef8d385ad632ec6a1dd167268be27f59daa75
parent    6f5bef5ffb58aab818cb46ad14310d2874cb1492 (diff)
Add a README
-rw-r--r--  README.md  44
1 file changed, 44 insertions(+), 0 deletions(-)
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..34360fa
--- /dev/null
+++ b/README.md
@@ -0,0 +1,44 @@
+A very simple crawler
+=====================
+
+This tool can crawl a bunch of URLs for HTML content, and save the
+results in a nice WARC file. It has little control over its traffic,
+save for a limit on concurrent outbound requests. Its main purpose is
+to quickly and efficiently save websites for archival purposes.
+
+The *crawl* tool saves its state in a database, so it can be safely
+interrupted and restarted without issues.
+
+# Installation
+
+From this source directory (checked out in the correct place in your
+GOPATH), run:
+
+ $ go install ./cmd/crawl
+
+# Usage
+
+Just run *crawl* by passing the URLs of the websites you want to crawl
+as arguments on the command line:
+
+ $ crawl http://example.com/
+
+By default, the tool will store the output WARC file and its own
+database in the current directory. This can be controlled with the
+*--output* and *--state* command-line options.
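+
+For example, assuming both options take a file path (an illustrative
+invocation; the paths here are just placeholders):
+
+ $ crawl --output=/archive/site.warc.gz --state=/archive/crawl.db http://example.com/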
+
+The crawling scope is controlled with a set of overlapping checks:
+
+* URL scheme must be one of *http* or *https*
+* URL must have one of the seeds as a prefix (a leading *www.*
+ prefix is implicitly ignored)
+* maximum crawling depth can be controlled with the *--depth* option
+* resources related to a page (CSS, JS, etc.) will always be fetched,
+ even if on external domains, if the *--include-related* option is
+ specified (see the example below)
+
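+As an illustrative example combining the options above (the depth
+value is arbitrary):
+
+ $ crawl --depth=2 --include-related http://example.com/
+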
+If the program is interrupted, running it again with the same command
+line from the same directory will cause it to resume crawling from
+where it stopped. At the end of a successful crawl, the database will
+be removed (unless you specify the *--keep* option, for debugging
+purposes).
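+
+For example, to keep the database around after the crawl finishes (an
+illustrative invocation):
+
+ $ crawl --keep http://example.com/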