Age  Commit message  Author
2020-08-23  Fix the crawl.go tests  ale
2020-08-23  Add minimal Debian packaging  ale
2020-08-23  Allow setting DNS overrides using the --resolve option  ale
2020-08-20  Panic instead of just dying with fatal error  ale
2020-07-30  Retry requests on transport-level errors  ale
2020-07-30  Panic on fatal errors  ale
This allows users of crawl-as-a-library to recover from unexpected errors as a last resort.
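A minimal sketch of what "recover as a last resort" could look like for a program that embeds the crawler. Here `runCrawl` is a hypothetical stand-in for a call into the library, and `safeCrawl` is an illustrative helper, not part of crawl itself. Note that `recover` only catches panics raised in the same goroutine.

```go
package main

import (
	"errors"
	"fmt"
)

// runCrawl is a hypothetical stand-in for a call into the crawler
// library that hits a fatal error and panics.
func runCrawl() {
	panic(errors.New("fatal crawler error"))
}

// safeCrawl runs fn and converts a panic raised in the same goroutine
// into an ordinary error, so the caller can decide what to do next.
func safeCrawl(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("crawl aborted: %v", r)
		}
	}()
	fn()
	return nil
}

func main() {
	if err := safeCrawl(runCrawl); err != nil {
		fmt.Println("recovered:", err)
	}
}
```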
2020-02-17  Fix the Handler in cmd/links  ale
2020-02-17  Propagate the link tag through redirects  ale
In order to do this we have to plumb it through the queue and the Handler interface, but it should allow fetches of the resources associated with a page via the IncludeRelatedScope even if it's behind a redirect.
2019-12-04  Fix installation instructions  ale
2019-11-13  Add contact email address  ale
2019-11-13  Update dependencies (legacy and go.mod)  ale
2019-10-07  Add a vendor dependency used for tests  ale
2019-10-07  Parse links in inline style blocks  ale
2019-09-26  Switch to latest Go image for CI test  ale
2019-09-26  Add Go module support  ale
2019-09-26  Update vendored dependencies  ale
2019-01-20  Refactor Handlers in terms of a Publisher interface  ale
Introduce an interface to decouple the Enqueue functionality from the Crawler implementation.
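The decoupling described above can be illustrated roughly as follows. This is a sketch under assumptions: the method signature, the `Handler` function type, and `recordingPublisher` are all hypothetical and may not match the actual definitions in the repository. The point is only that a handler depending on a small interface can be exercised without a full crawler.

```go
package main

import "fmt"

// Publisher is a hypothetical reconstruction of the interface: anything
// that can accept newly discovered links. The real signature may differ.
type Publisher interface {
	Enqueue(link string, depth int) error
}

// Handler processes a fetched page and publishes its outbound links.
type Handler func(p Publisher, pageURL string, depth int, links []string) error

// followLinks is a Handler that enqueues every link one level deeper.
func followLinks(p Publisher, pageURL string, depth int, links []string) error {
	for _, l := range links {
		if err := p.Enqueue(l, depth+1); err != nil {
			return fmt.Errorf("enqueue %s (found on %s): %w", l, pageURL, err)
		}
	}
	return nil
}

// recordingPublisher is a test double that just remembers what it was given.
type recordingPublisher struct{ got []string }

func (r *recordingPublisher) Enqueue(link string, depth int) error {
	r.got = append(r.got, fmt.Sprintf("%d %s", depth, link))
	return nil
}

func main() {
	p := &recordingPublisher{}
	var h Handler = followLinks
	links := []string{"https://example.com/a", "https://example.com/b"}
	if err := h(p, "https://example.com/", 0, links); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println(p.got)
}
```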
2019-01-19  Replace URLInfo with a simple URL presence check  ale
The whole URLInfo structure, while neat, is unused except for the purpose of verifying if we have already seen a specific URL. The presence check is also now limited to Enqueue().
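A presence check confined to the enqueue path might look like the sketch below. It uses an in-memory map purely for illustration; the actual crawler may well keep this state in its on-disk database, and the names `seenSet` and `crawler` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// seenSet records which URLs have already been enqueued.
type seenSet struct {
	mu   sync.Mutex
	urls map[string]struct{}
}

func newSeenSet() *seenSet {
	return &seenSet{urls: make(map[string]struct{})}
}

// add records u and reports whether it had not been seen before.
func (s *seenSet) add(u string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.urls[u]; ok {
		return false
	}
	s.urls[u] = struct{}{}
	return true
}

// crawler is a hypothetical, stripped-down queue owner.
type crawler struct {
	seen  *seenSet
	queue []string
}

// Enqueue is the only place where the presence check happens:
// a URL that was already seen is silently skipped.
func (c *crawler) Enqueue(u string) bool {
	if !c.seen.add(u) {
		return false
	}
	c.queue = append(c.queue, u)
	return true
}

func main() {
	c := &crawler{seen: newSeenSet()}
	for _, u := range []string{"https://example.com/", "https://example.com/a", "https://example.com/"} {
		fmt.Println(u, "queued:", c.Enqueue(u))
	}
	fmt.Println("queue length:", len(c.queue))
}
```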
2019-01-02  Add multi-file output  ale
The output stage can now write to size-limited, rotating WARC files using a user-specified pattern, so that output files are always unique.
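The idea of size-limited, rotating output can be sketched as a writer that opens a new file whenever the next record would push the current one over the limit. Everything here is an assumption for illustration: the numeric counter in the pattern, the `rotatingWriter` type, and the in-memory `memFile` are hypothetical, and the real tool may expand its pattern differently (for example with timestamps or unique identifiers).

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// rotatingWriter writes to a sequence of files named after a pattern,
// starting a new file once maxSize would be exceeded.
type rotatingWriter struct {
	pattern string // fmt pattern taking a sequence number (assumed format)
	maxSize int64
	open    func(name string) (io.WriteCloser, error)

	cur  io.WriteCloser
	size int64
	seq  int
}

// rotate closes the current file (if any) and opens the next one.
func (w *rotatingWriter) rotate() error {
	if w.cur != nil {
		if err := w.cur.Close(); err != nil {
			return err
		}
		w.cur = nil
	}
	f, err := w.open(fmt.Sprintf(w.pattern, w.seq))
	if err != nil {
		return err
	}
	w.seq++
	w.cur, w.size = f, 0
	return nil
}

// Write never splits a record across files; an oversized record
// still gets a file of its own.
func (w *rotatingWriter) Write(p []byte) (int, error) {
	full := w.size > 0 && w.size+int64(len(p)) > w.maxSize
	if w.cur == nil || full {
		if err := w.rotate(); err != nil {
			return 0, err
		}
	}
	n, err := w.cur.Write(p)
	w.size += int64(n)
	return n, err
}

func (w *rotatingWriter) Close() error {
	if w.cur == nil {
		return nil
	}
	err := w.cur.Close()
	w.cur = nil
	return err
}

// memFile is an in-memory stand-in for an output file.
type memFile struct{ bytes.Buffer }

func (*memFile) Close() error { return nil }

// demoRotation writes three records with an 8-byte limit and returns
// the resulting file contents keyed by name.
func demoRotation() map[string]string {
	files := map[string]*memFile{}
	w := &rotatingWriter{
		pattern: "out-%03d.warc",
		maxSize: 8,
		open: func(name string) (io.WriteCloser, error) {
			f := &memFile{}
			files[name] = f
			return f, nil
		},
	}
	for _, rec := range []string{"aaaa", "bbbb", "cc"} {
		if _, err := w.Write([]byte(rec)); err != nil {
			panic(err)
		}
	}
	if err := w.Close(); err != nil {
		panic(err)
	}
	out := map[string]string{}
	for name, f := range files {
		out[name] = f.String()
	}
	return out
}

func main() {
	fmt.Println(demoRotation())
}
```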
2018-12-28  Updated dependencies  ale
2018-12-27  Normalize URLs before checking if they are in scope  ale
2018-12-27  Merge branch 'master' of git.autistici.org:ale/crawl  ale
2018-12-06  Apply --excludes to related resources too  ale
2018-09-02  Fix typo  ale
2018-09-02  Explicitly mention the crawler limitations  ale
2018-09-02  Add --exclude and --exclude-file options  ale
Allow users to add to the exclude regexp lists easily.
2018-09-02  Minimal support for <video> and <object> tags  ale
2018-08-31  Do not drop /index.html at the end of URLs  ale
2018-08-31  Add a simple test for the full WARC crawler  ale
2018-08-31  Explicitly delegate retry logic to handlers  ale
Makes it possible to retry requests for temporary HTTP errors (429, 500, etc).
2018-08-31  Improve error handling, part two  ale
Handler errors are fatal, so that an error writing the WARC output will cause the crawl to abort.
2018-08-31  Use a buffered Writer for WARC output  ale
2018-08-31  Improve error checking  ale
Detect write errors (both on the database and to the WARC output) and abort with an error message. Also fix a bunch of harmless lint warnings.
2018-08-31  Update dependencies  ale
2018-08-30  Mention trickle as a possible bandwidth limiter  ale
Since such bandwidth limiting is not provided by crawl directly, tell users there is another solution. Once/if crawl implements that on its own, that notice could be removed.
2018-08-30  Improve install instructions a bit more  ale
2018-08-30  Update installation instructions  ale
2017-12-19  Provide better defaults for command-line options  ale
Defaults that are more suitable to real-world site archiving.
2017-12-19  Merge branch 'master' of git.autistici.org:ale/crawl  ale
2017-12-19  Exit gracefully on signals  ale
2017-12-19  Add a README  ale
2017-12-19  Use a global http.Client with sane settings  ale
2017-12-19  Crawl IFRAMEs as related resources  ale
2017-12-19  Simplify redirectHandler.Handle  ale
2017-12-19  Add license  ale
2017-12-19  Update cmd/links to new scope syntax  ale
2017-12-19  Skip data: URLs  ale
2017-12-19  Add tags (primary/related) to links  ale
This change allows more complex scope boundaries, including loosening edges a bit to include related resources of HTML pages (which makes for more complete archives if desired).
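The effect of tagging links can be sketched as a scope check that receives the tag along with the URL, so that a wrapper can let related resources through even when the base scope would reject them. The type and function names below (`linkTag`, `scope`, `sameHost`, `includeRelated`) are hypothetical and only approximate the idea; the repository's own scope types may be shaped differently.

```go
package main

import (
	"fmt"
	"net/url"
)

// linkTag distinguishes how a link was discovered.
type linkTag int

const (
	tagPrimary linkTag = iota // links a reader would navigate to
	tagRelated                // resources a page needs (images, CSS, scripts)
)

// scope decides whether a URL with a given tag should be crawled.
type scope func(u *url.URL, tag linkTag) bool

// sameHost keeps the crawl on a single host.
func sameHost(host string) scope {
	return func(u *url.URL, _ linkTag) bool {
		return u.Hostname() == host
	}
}

// includeRelated loosens a scope: related resources are always accepted,
// everything else is delegated to the inner scope.
func includeRelated(inner scope) scope {
	return func(u *url.URL, tag linkTag) bool {
		return tag == tagRelated || inner(u, tag)
	}
}

// mustParse parses a URL known to be valid in this example.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	s := includeRelated(sameHost("example.com"))
	fmt.Println(s(mustParse("https://example.com/page"), tagPrimary))
	fmt.Println(s(mustParse("https://cdn.example.net/style.css"), tagRelated))
	fmt.Println(s(mustParse("https://other.example.net/page"), tagPrimary))
}
```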
2017-12-18  Add CI configuration (test only)  ale
2017-12-18  Add support for @import syntax in css  ale