path: root/crawler.go
Age | Commit message | Author
2022-03-24 | crawler: close temporary descriptor in advance of defer (performance) | Jordan
2022-03-24 | crawler: continue crawl when context deadline exceeded (timeout) | Jordan
2022-03-24 | crawler: rm temporary body store once processed in advance of defer | Jordan
2022-03-24 | misc: update handler signatures, tests, housekeeping | Jordan
2021-06-19 | Ignore URL decode errors | ale
             This is an internal inconsistency that should be investigated.
2020-08-20 | Panic instead of just dying with fatal error | ale
2020-07-30 | Panic on fatal errors | ale
             This allows users of crawl-as-a-library to recover from unexpected errors as a last resort.
2020-02-17 | Propagate the link tag through redirects | ale
             In order to do this we have to plumb it through the queue and the Handler interface, but it should allow fetches of the resources associated with a page via the IncludeRelatedScope even if it's behind a redirect.
2019-01-20 | Refactor Handlers in terms of a Publisher interface | ale
             Introduce an interface to decouple the Enqueue functionality from the Crawler implementation.
2019-01-19 | Replace URLInfo with a simple URL presence check | ale
             The whole URLInfo structure, while neat, is unused except for the purpose of verifying if we have already seen a specific URL. The presence check is also now limited to Enqueue().
2018-12-27 | Normalize URLs before checking if they are in scope | ale
2018-08-31 | Do not drop /index.html at the end of URLs | ale
2018-08-31 | Explicitly delegate retry logic to handlers | ale
             Makes it possible to retry requests for temporary HTTP errors (429, 500, etc).
2018-08-31 | Improve error handling, part two | ale
             Handler errors are fatal, so that an error writing the WARC output will cause the crawl to abort.
2018-08-31 | Improve error checking | ale
             Detect write errors (both on the database and to the WARC output) and abort with an error message. Also fix a bunch of harmless lint warnings.
2017-12-19 | Exit gracefully on signals | ale
2017-12-19 | Simplify redirectHandler.Handle | ale
2017-12-19 | Add tags (primary/related) to links | ale
             This change allows more complex scope boundaries, including loosening edges a bit to include related resources of HTML pages (which makes for more complete archives if desired).
2017-12-18 | Switch to github.com/syndtr/goleveldb | ale
             The native Go implementation of LevelDB.
2015-07-03 | minor golint fixes | ale
2015-06-29 | clean up the state directory when done | ale
2015-06-29 | improve queue code; golint fixes | ale
             The queuing code now performs proper lease accounting, and it will not return a URL twice if the page load is slow.
2014-12-20 | move URLInfo logic into the Crawler itself | ale
2014-12-20 | add a prefix iterator to gobDb | ale
2014-12-20 | make Scope checking more modular | ale
2014-12-20 | move the WARC code into its own package | ale
             Now generates well-formed, indexable WARC files.
2014-12-19 | initial commit | ale
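
Note on the 2020-07-30 entry ("Panic on fatal errors"): the commit body says that panicking lets users of crawl-as-a-library recover from unexpected errors as a last resort. The following is a minimal sketch of what that caller-side pattern could look like. It is not taken from crawler.go: runSafely and the simulated failure are hypothetical names chosen for illustration, and the real library entry points may differ.

```go
package main

import "fmt"

// runSafely calls fn and converts any panic it raises into an error.
// fn is a hypothetical stand-in for a call into the crawl library;
// the actual function the caller would wrap is an assumption here.
func runSafely(fn func()) (err error) {
	defer func() {
		// recover returns non-nil only if fn panicked.
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from crawl panic: %v", r)
		}
	}()
	fn()
	return nil
}

func main() {
	err := runSafely(func() {
		// Simulate a fatal error raised deep inside the crawler.
		panic("write to WARC output failed")
	})
	if err != nil {
		// The embedding program keeps running and can decide what to do.
		fmt.Println(err)
	}
}
```

A process that calls a fatal-exit function cannot be intercepted this way, which is presumably why the commit prefers panic for library use.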