Age  Commit message  Author
2019-10-07  Add a vendor dependency used for tests  ale
2019-10-07  Parse links in inline style blocks  ale
2019-09-26  Switch to latest Go image for CI test  ale
2019-09-26  Add Go module support  ale
2019-09-26  Update vendored dependencies  ale
2019-01-20  Refactor Handlers in terms of a Publisher interface  ale
Introduce an interface to decouple the Enqueue functionality from the Crawler implementation.
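
The decoupling described above could look roughly like the sketch below. The `Enqueue` signature, `memoryPublisher`, and `handlePage` are hypothetical names chosen for illustration; the interface actually used in crawl may differ.

```go
package main

import "fmt"

// Publisher is a hypothetical interface: anything that can accept a
// discovered link. Handlers depend on it instead of the concrete crawler.
type Publisher interface {
	Enqueue(url string, depth int) error
}

// memoryPublisher is an illustrative implementation that just records links.
type memoryPublisher struct {
	urls []string
}

func (p *memoryPublisher) Enqueue(url string, depth int) error {
	p.urls = append(p.urls, url)
	return nil
}

// handlePage stands in for a handler: it only knows about Publisher.
func handlePage(p Publisher, links []string, depth int) error {
	for _, l := range links {
		if err := p.Enqueue(l, depth+1); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	p := &memoryPublisher{}
	links := []string{"https://example.com/a", "https://example.com/b"}
	if err := handlePage(p, links, 0); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(p.urls), "links enqueued")
}
```

With this shape a handler can be tested against a trivial in-memory publisher, without running a full crawler.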
2019-01-19  Replace URLInfo with a simple URL presence check  ale
The whole URLInfo structure, while neat, is unused except for the purpose of verifying if we have already seen a specific URL. The presence check is also now limited to Enqueue().
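
A presence check confined to the enqueue step can be sketched as follows. This is an in-memory illustration only: `seenSet`, `markSeen`, and `enqueue` are hypothetical names, and a map stands in here for the crawler's on-disk database.

```go
package main

import (
	"fmt"
	"sync"
)

// seenSet is a hypothetical in-memory stand-in for the crawler's database.
type seenSet struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newSeenSet() *seenSet {
	return &seenSet{seen: make(map[string]struct{})}
}

// markSeen records url and reports whether it was new.
func (s *seenSet) markSeen(url string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[url]; ok {
		return false
	}
	s.seen[url] = struct{}{}
	return true
}

// enqueue appends url to queue only the first time it is seen.
func enqueue(s *seenSet, queue []string, url string) []string {
	if !s.markSeen(url) {
		return queue
	}
	return append(queue, url)
}

func main() {
	s := newSeenSet()
	var queue []string
	for _, u := range []string{
		"https://example.com/",
		"https://example.com/a",
		"https://example.com/", // duplicate, skipped
	} {
		queue = enqueue(s, queue, u)
	}
	fmt.Println(queue)
}
```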
2019-01-02  Add multi-file output  ale
The output stage can now write to size-limited, rotating WARC files using a user-specified pattern, so that output files are always unique.
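
One way to get size-limited, uniquely named output files is sketched below. It is not crawl's implementation: `rotatingWriter`, the `%03d` pattern syntax, and `demoRotation` are assumptions made for the example. A real WARC writer would also rotate only at record boundaries, which this sketch approximates by checking the limit once per `Write` call.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// rotatingWriter (hypothetical) writes to files named by a fmt-style pattern
// with one %d verb, starting a new file when maxSize would be exceeded.
type rotatingWriter struct {
	pattern string
	maxSize int64
	index   int
	size    int64
	cur     *os.File
}

// rotate closes the current file, if any, and opens the next one.
func (w *rotatingWriter) rotate() error {
	if w.cur != nil {
		if err := w.cur.Close(); err != nil {
			return err
		}
	}
	name := fmt.Sprintf(w.pattern, w.index)
	// O_EXCL refuses to overwrite an existing file, so names stay unique.
	f, err := os.OpenFile(name, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
	if err != nil {
		return err
	}
	w.cur, w.size = f, 0
	w.index++
	return nil
}

// Write implements io.Writer; p is never split across two files.
func (w *rotatingWriter) Write(p []byte) (int, error) {
	if w.cur == nil || (w.size > 0 && w.size+int64(len(p)) > w.maxSize) {
		if err := w.rotate(); err != nil {
			return 0, err
		}
	}
	n, err := w.cur.Write(p)
	w.size += int64(n)
	return n, err
}

func (w *rotatingWriter) Close() error {
	if w.cur == nil {
		return nil
	}
	return w.cur.Close()
}

// demoRotation writes three 8-byte chunks with a 10-byte limit into a
// temporary directory and returns how many files were created.
func demoRotation() (int, error) {
	dir, err := os.MkdirTemp("", "rotate")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	w := &rotatingWriter{pattern: filepath.Join(dir, "out-%03d.warc"), maxSize: 10}
	for i := 0; i < 3; i++ {
		if _, err := w.Write([]byte("12345678")); err != nil {
			return 0, err
		}
	}
	if err := w.Close(); err != nil {
		return 0, err
	}
	files, err := filepath.Glob(filepath.Join(dir, "*.warc"))
	return len(files), err
}

func main() {
	n, err := demoRotation()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(n, "files written")
}
```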
2018-12-28  Updated dependencies  ale
2018-12-27  Normalize URLs before checking if they are in scope  ale
2018-12-27  Merge branch 'master' of git.autistici.org:ale/crawl  ale
2018-12-06  Apply --excludes to related resources too  ale
2018-09-02  Fix typo  ale
2018-09-02  Explicitly mention the crawler limitations  ale
2018-09-02  Add --exclude and --exclude-file options  ale
Allow users to add to the exclude regexp lists easily.
2018-09-02  Minimal support for <video> and <object> tags  ale
2018-08-31  Do not drop /index.html at the end of URLs  ale
2018-08-31  Add a simple test for the full WARC crawler  ale
2018-08-31  Explicitly delegate retry logic to handlers  ale
Makes it possible to retry requests for temporary HTTP errors (429, 500, etc).
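
A handler-side retry decision might look like the sketch below. The commit names only 429 and 500; the other status codes, the `errRetry` sentinel, and the function names are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// errRetry is a hypothetical sentinel a handler returns to ask the caller
// to schedule the request again later.
var errRetry = errors.New("temporary error, retry request")

// isTemporary reports whether an HTTP status looks transient.
// Only 429 and 500 come from the commit message; the rest are assumptions.
func isTemporary(status int) bool {
	switch status {
	case http.StatusTooManyRequests,
		http.StatusInternalServerError,
		http.StatusBadGateway,
		http.StatusServiceUnavailable,
		http.StatusGatewayTimeout:
		return true
	}
	return false
}

// handleStatus stands in for a handler: it owns the retry decision.
func handleStatus(status int) error {
	if isTemporary(status) {
		return errRetry
	}
	return nil
}

func main() {
	for _, status := range []int{200, 404, 429, 500} {
		fmt.Println(status, "retry:", handleStatus(status) == errRetry)
	}
}
```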
2018-08-31  Improve error handling, part two  ale
Handler errors are fatal, so that an error writing the WARC output will cause the crawl to abort.
2018-08-31  Use a buffered Writer for WARC output  ale
2018-08-31  Improve error checking  ale
Detect write errors (both on the database and to the WARC output) and abort with an error message. Also fix a bunch of harmless lint warnings.
2018-08-31  Update dependencies  ale
2018-08-30  Mention trickle as a possible bandwidth limiter  ale
Since such bandwidth limiting is not provided by crawl directly, tell users there is another solution. Once/if crawl implements that on its own, that notice could be removed.
2018-08-30  Improve install instructions a bit more  ale
2018-08-30  Update installation instructions  ale
2017-12-19  Provide better defaults for command-line options  ale
Defaults that are more suitable to real-world site archiving.
2017-12-19  Merge branch 'master' of git.autistici.org:ale/crawl  ale
2017-12-19  Exit gracefully on signals  ale
2017-12-19  Add a README  ale
2017-12-19  Use a global http.Client with sane settings  ale
2017-12-19  Crawl IFRAMEs as related resources  ale
2017-12-19  Simplify redirectHandler.Handle  ale
2017-12-19  Add license  ale
2017-12-19  Update cmd/links to new scope syntax  ale
2017-12-19  Skip data: URLs  ale
2017-12-19  Add tags (primary/related) to links  ale
This change allows more complex scope boundaries, including loosening edges a bit to include related resources of HTML pages (which makes for more complete archives if desired).
2017-12-18  Add CI configuration (test only)  ale
2017-12-18  Add support for @import syntax in css  ale
2017-12-18  Update location of the uuid package  ale
2017-12-18  Add vendor deps  ale
2017-12-18  Switch to github.com/syndtr/goleveldb  ale
The native Go implementation of LevelDB.
2015-07-03  minor golint fixes  ale
2015-06-29  clean up the state directory when done  ale
2015-06-29  improve queue code; golint fixes  ale
The queuing code now performs proper lease accounting, and it will not return a URL twice if the page load is slow.
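
The idea of lease accounting can be illustrated with the in-memory sketch below. The `leaseQueue` type and its methods are hypothetical, and returning a URL to the queue once its lease expires is an assumption about the policy; the actual queue is backed by the crawler's database.

```go
package main

import (
	"fmt"
	"time"
)

// leaseQueue (hypothetical) hands out each URL under a time-limited lease,
// so a slow page load does not cause the same URL to be returned twice.
type leaseQueue struct {
	pending []string
	leased  map[string]time.Time // url -> lease expiry
	ttl     time.Duration
}

func newLeaseQueue(ttl time.Duration) *leaseQueue {
	return &leaseQueue{leased: make(map[string]time.Time), ttl: ttl}
}

func (q *leaseQueue) Add(url string) {
	q.pending = append(q.pending, url)
}

// Next returns the next pending URL and leases it until now+ttl.
// URLs whose lease has expired are made pending again first (assumed policy).
func (q *leaseQueue) Next(now time.Time) (string, bool) {
	for u, exp := range q.leased {
		if !now.Before(exp) {
			delete(q.leased, u)
			q.pending = append(q.pending, u)
		}
	}
	if len(q.pending) == 0 {
		return "", false
	}
	u := q.pending[0]
	q.pending = q.pending[1:]
	q.leased[u] = now.Add(q.ttl)
	return u, true
}

// Done drops the lease once the URL has been processed successfully.
func (q *leaseQueue) Done(url string) {
	delete(q.leased, url)
}

// demo polls the queue at 0s, 10s and 40s with a 30s lease and one URL.
func demo() []string {
	q := newLeaseQueue(30 * time.Second)
	q.Add("https://example.com/")
	start := time.Unix(0, 0)
	var out []string
	for _, offset := range []time.Duration{0, 10 * time.Second, 40 * time.Second} {
		u, ok := q.Next(start.Add(offset))
		if !ok {
			u = "(nothing)"
		}
		out = append(out, u)
	}
	return out
}

func main() {
	for _, u := range demo() {
		fmt.Println(u)
	}
}
```

In the demo the second poll finds nothing because the only URL is still leased; the third poll, after the lease has expired without `Done` being called, hands the URL out again.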
2015-06-28  add ignore list from ArchiveBot  ale
2015-06-28  fix timestamp format  ale
The WARC-Date fields now are UTC times in proper ISO-8601 format. This makes pywb and other tools happy.
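
For reference, a UTC ISO-8601 timestamp of the kind described can be produced with Go's `time` package as sketched here. `warcDate` and `sampleTime` are illustrative names, not necessarily the ones used in crawl.

```go
package main

import (
	"fmt"
	"time"
)

// warcDate formats t as a UTC ISO-8601 timestamp with a literal "Z" suffix.
func warcDate(t time.Time) string {
	return t.UTC().Format("2006-01-02T15:04:05Z")
}

// sampleTime returns a fixed instant in a UTC+2 zone for the demo.
func sampleTime() time.Time {
	loc := time.FixedZone("CEST", 2*60*60)
	return time.Date(2015, time.June, 28, 14, 34, 56, 0, loc)
}

func main() {
	fmt.Println(warcDate(sampleTime())) // 2015-06-28T12:34:56Z
}
```

Converting to UTC before formatting matters: the layout ends in a literal `Z`, so formatting a local time without the conversion would label it as UTC incorrectly.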
2014-12-20  move URLInfo logic into the Crawler itself  ale
2014-12-20  add a prefix iterator to gobDb  ale
2014-12-20  add tests to scope.go  ale