Age | Commit message | Author | |
---|---|---|---|
2019-01-20 | Refactor Handlers in terms of a Publisher interface | ale | |
Introduce an interface to decouple the Enqueue functionality from the Crawler implementation. | |||
2019-01-19 | Replace URLInfo with a simple URL presence check | ale | |
The whole URLInfo structure, while neat, is unused except for the purpose of verifying if we have already seen a specific URL. The presence check is also now limited to Enqueue(). | |||
2019-01-02 | Add multi-file output | ale | |
The output stage can now write to size-limited, rotating WARC files using a user-specified pattern, so that output files are always unique. | |||
2018-12-28 | Updated dependencies | ale | |
2018-12-27 | Normalize URLs before checking if they are in scope | ale | |
2018-12-27 | Merge branch 'master' of git.autistici.org:ale/crawl | ale | |
2018-12-06 | Apply --excludes to related resources too | ale | |
2018-09-02 | Fix typo | ale | |
2018-09-02 | Explicitly mention the crawler limitations | ale | |
2018-09-02 | Add --exclude and --exclude-file options | ale | |
Allow users to add to the exclude regexp lists easily. | |||
2018-09-02 | Minimal support for <video> and <object> tags | ale | |
2018-08-31 | Do not drop /index.html at the end of URLs | ale | |
2018-08-31 | Add a simple test for the full WARC crawler | ale | |
2018-08-31 | Explicitly delegate retry logic to handlers | ale | |
Makes it possible to retry requests for temporary HTTP errors (429, 500, etc). | |||
2018-08-31 | Improve error handling, part two | ale | |
Handler errors are fatal, so that an error writing the WARC output will cause the crawl to abort. | |||
2018-08-31 | Use a buffered Writer for WARC output | ale | |
2018-08-31 | Improve error checking | ale | |
Detect write errors (both on the database and to the WARC output) and abort with an error message. Also fix a bunch of harmless lint warnings. | |||
2018-08-31 | Update dependencies | ale | |
2018-08-30 | Mention trickle as a possible bandwidth limiter | ale | |
Since crawl does not provide bandwidth limiting itself, point users to an existing solution. The notice can be removed if crawl ever implements it natively. | |||
2018-08-30 | Improve install instructions a bit more | ale | |
2018-08-30 | Update installation instructions | ale | |
2017-12-19 | Provide better defaults for command-line options | ale | |
Defaults that are more suitable to real-world site archiving. | |||
2017-12-19 | Merge branch 'master' of git.autistici.org:ale/crawl | ale | |
2017-12-19 | Exit gracefully on signals | ale | |
2017-12-19 | Add a README | ale | |
2017-12-19 | Use a global http.Client with sane settings | ale | |
2017-12-19 | Crawl IFRAMEs as related resources | ale | |
2017-12-19 | Simplify redirectHandler.Handle | ale | |
2017-12-19 | Add license | ale | |
2017-12-19 | Update cmd/links to new scope syntax | ale | |
2017-12-19 | Skip data: URLs | ale | |
2017-12-19 | Add tags (primary/related) to links | ale | |
This change allows more complex scope boundaries, including loosening edges a bit to include related resources of HTML pages (which makes for more complete archives if desired). | |||
2017-12-18 | Add CI configuration (test only) | ale | |
2017-12-18 | Add support for @import syntax in css | ale | |
2017-12-18 | Update location of the uuid package | ale | |
2017-12-18 | Add vendor deps | ale | |
2017-12-18 | Switch to github.com/syndtr/goleveldb | ale | |
The native Go implementation of LevelDB. | |||
2015-07-03 | minor golint fixes | ale | |
2015-06-29 | clean up the state directory when done | ale | |
2015-06-29 | improve queue code; golint fixes | ale | |
The queuing code now performs proper lease accounting, and it will not return a URL twice if the page load is slow. | |||
2015-06-28 | add ignore list from ArchiveBot | ale | |
2015-06-28 | fix timestamp format | ale | |
The WARC-Date fields are now UTC times in proper ISO 8601 format, which keeps pywb and other tools happy. | |||
2014-12-20 | move URLInfo logic into the Crawler itself | ale | |
2014-12-20 | add a prefix iterator to gobDb | ale | |
2014-12-20 | add tests to scope.go | ale | |
2014-12-20 | make Scope checking more modular | ale | |
2014-12-20 | relax the CSS url() regexp | ale | |
2014-12-20 | move link extraction to a common location | ale | |
2014-12-20 | move the WARC code into its own package | ale | |
Now generates well-formed, indexable WARC files. | |||
2014-12-19 | initial commit | ale | |
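
The 2019-01-20 and 2019-01-19 entries describe two related changes: handlers depend on a Publisher interface rather than on the Crawler itself, and the "already seen" check is a plain presence test performed only inside Enqueue(). The Go sketch below illustrates that shape under stated assumptions. The `Publisher` method signature, the `memCrawler` type, and the `handleLinks` helper are all hypothetical names chosen for illustration and are not taken from the crawl source.

```go
package main

import (
	"fmt"
	"sync"
)

// Publisher is a hypothetical stand-in for the interface named in the log:
// handlers depend on it instead of on a concrete crawler type.
type Publisher interface {
	Enqueue(url string, depth int) error
}

// memCrawler is an illustrative in-memory crawler, not crawl's actual type.
type memCrawler struct {
	mu    sync.Mutex
	seen  map[string]struct{} // presence set: a URL is either known or not
	queue []string
}

func newMemCrawler() *memCrawler {
	return &memCrawler{seen: make(map[string]struct{})}
}

// Enqueue queues url unless it has been seen before.
// The presence check happens here and nowhere else.
func (c *memCrawler) Enqueue(url string, depth int) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.seen[url]; ok {
		return nil // already known, silently skip
	}
	c.seen[url] = struct{}{}
	c.queue = append(c.queue, url)
	return nil
}

// handleLinks plays the role of a handler: it only knows about Publisher.
func handleLinks(p Publisher, links []string, depth int) error {
	for _, l := range links {
		if err := p.Enqueue(l, depth+1); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	c := newMemCrawler()
	links := []string{
		"https://example.com/",
		"https://example.com/a",
		"https://example.com/", // duplicate, dropped by Enqueue
	}
	if err := handleLinks(c, links, 0); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(c.queue)) // prints 2
}
```

Because `handleLinks` accepts any `Publisher`, a test can pass a stub that records the enqueued URLs without starting a real crawl, which is the kind of decoupling the commit message points at.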