Age | Commit message | Author | |
---|---|---|---|
2020-08-23 | Fix the crawl.go tests | ale | |
2020-08-23 | Add minimal Debian packaging | ale | |
2020-08-23 | Allow setting DNS overrides using the --resolve option | ale | |
2020-08-20 | Panic instead of just dying with fatal error | ale | |
2020-07-30 | Retry requests on transport-level errors | ale | |
2020-07-30 | Panic on fatal errors | ale | |
This allows users of crawl-as-a-library to recover from unexpected errors as a last resort. | |||
2020-02-17 | Fix the Handler in cmd/links | ale | |
2020-02-17 | Propagate the link tag through redirects | ale | |
In order to do this we have to plumb it through the queue and the Handler interface, but it should allow fetches of the resources associated with a page via the IncludeRelatedScope even if it's behind a redirect. | |||
2019-12-04 | Fix installation instructions | ale | |
2019-11-13 | Add contact email address | ale | |
2019-11-13 | Update dependencies (legacy and go.mod) | ale | |
2019-10-07 | Add a vendor dependency used for tests | ale | |
2019-10-07 | Parse links in inline style blocks | ale | |
2019-09-26 | Switch to latest Go image for CI test | ale | |
2019-09-26 | Add Go module support | ale | |
2019-09-26 | Update vendored dependencies | ale | |
2019-01-20 | Refactor Handlers in terms of a Publisher interface | ale | |
Introduce an interface to decouple the Enqueue functionality from the Crawler implementation. | |||
2019-01-19 | Replace URLInfo with a simple URL presence check | ale | |
The whole URLInfo structure, while neat, is unused except for the purpose of verifying if we have already seen a specific URL. The presence check is also now limited to Enqueue(). | |||
2019-01-02 | Add multi-file output | ale | |
The output stage can now write to size-limited, rotating WARC files using a user-specified pattern, so that output files are always unique. | |||
2018-12-28 | Updated dependencies | ale | |
2018-12-27 | Normalize URLs before checking if they are in scope | ale | |
2018-12-27 | Merge branch 'master' of git.autistici.org:ale/crawl | ale | |
2018-12-06 | Apply --excludes to related resources too | ale | |
2018-09-02 | Fix typo | ale | |
2018-09-02 | Explicitly mention the crawler limitations | ale | |
2018-09-02 | Add --exclude and --exclude-file options | ale | |
Allow users to add to the exclude regexp lists easily. | |||
2018-09-02 | Minimal support for <video> and <object> tags | ale | |
2018-08-31 | Do not drop /index.html at the end of URLs | ale | |
2018-08-31 | Add a simple test for the full WARC crawler | ale | |
2018-08-31 | Explicitly delegate retry logic to handlers | ale | |
Makes it possible to retry requests for temporary HTTP errors (429, 500, etc). | |||
2018-08-31 | Improve error handling, part two | ale | |
Handler errors are fatal, so that an error writing the WARC output will cause the crawl to abort. | |||
2018-08-31 | Use a buffered Writer for WARC output | ale | |
2018-08-31 | Improve error checking | ale | |
Detect write errors (both on the database and to the WARC output) and abort with an error message. Also fix a bunch of harmless lint warnings. | |||
2018-08-31 | Update dependencies | ale | |
2018-08-30 | Mention trickle as a possible bandwidth limiter | ale | |
Since such bandwidth limiting is not provided by crawl directly, tell users there is another solution. Once/if crawl implements that on its own, that notice could be removed. | |||
2018-08-30 | Improve install instructions a bit more | ale | |
2018-08-30 | Update installation instructions | ale | |
2017-12-19 | Provide better defaults for command-line options | ale | |
Defaults that are more suitable to real-world site archiving. | |||
2017-12-19 | Merge branch 'master' of git.autistici.org:ale/crawl | ale | |
2017-12-19 | Exit gracefully on signals | ale | |
2017-12-19 | Add a README | ale | |
2017-12-19 | Use a global http.Client with sane settings | ale | |
2017-12-19 | Crawl IFRAMEs as related resources | ale | |
2017-12-19 | Simplify redirectHandler.Handle | ale | |
2017-12-19 | Add license | ale | |
2017-12-19 | Update cmd/links to new scope syntax | ale | |
2017-12-19 | Skip data: URLs | ale | |
2017-12-19 | Add tags (primary/related) to links | ale | |
This change allows more complex scope boundaries, including loosening edges a bit to include related resources of HTML pages (which makes for more complete archives if desired). | |||
2017-12-18 | Add CI configuration (test only) | ale | |
2017-12-18 | Add support for @import syntax in css | ale | |
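
The 2020-07-30 commit "Panic on fatal errors" says that panicking lets users of crawl-as-a-library recover from unexpected errors as a last resort. The following is a minimal sketch of that pattern in plain Go, not code from the repository: `runCrawl` and `safeCrawl` are hypothetical names that stand in for a library call and a caller-side wrapper, and the error text is invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// runCrawl is a hypothetical stand-in for a library call that panics
// on a fatal error. It is NOT part of the crawl package's API.
func runCrawl(fail bool) {
	if fail {
		panic(errors.New("fatal: could not write output"))
	}
}

// safeCrawl (also hypothetical) calls runCrawl and converts any panic
// into an ordinary error, so the caller can decide how to proceed.
func safeCrawl(fail bool) (err error) {
	defer func() {
		// recover returns nil when no panic is in progress.
		if r := recover(); r != nil {
			err = fmt.Errorf("crawl aborted: %v", r)
		}
	}()
	runCrawl(fail)
	return nil
}

func main() {
	fmt.Println(safeCrawl(false)) // prints "<nil>"
	fmt.Println(safeCrawl(true))  // prints "crawl aborted: fatal: could not write output"
}
```

A fatal exit such as `log.Fatal` terminates the process and cannot be intercepted this way, which is presumably why the commit prefers a panic. The deferred `recover` must run in the same goroutine as the panic, so a wrapper like this only helps if the library panics on the caller's goroutine.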