crawl - A simple recursive web crawler which stores content in the WARC/1.0 format

Age	Commit message	Author
2022-03-24	crawler: close temporary descriptor in advance of defer (performance)	Jordan

2022-03-24	crawler: continue crawl when context deadline exceeded (timeout)	Jordan

2022-03-24	crawler: rm temporary body store once processed in advance of defer	Jordan

2022-03-24	misc: update handler signatures, tests, housekeeping	Jordan

2021-06-19	Ignore URL decode errors	ale
	This is an internal inconsistency that should be investigated.
2020-08-20	Panic instead of just dying with fatal error	ale

2020-07-30	Panic on fatal errors	ale
	This allows users of crawl-as-a-library to recover from unexpected errors as a last resort.
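	A minimal sketch of how a library caller might use this, assuming the fatal error surfaces as an ordinary Go panic. The names runCrawl and safeCrawl are hypothetical and are not part of the crawl API; runCrawl only simulates a failing call.

```go
package main

import (
	"errors"
	"fmt"
)

// runCrawl is a hypothetical stand-in for a call into the crawl library
// that panics on a fatal error. It is not the real API.
func runCrawl() {
	panic(errors.New("simulated fatal error"))
}

// safeCrawl runs fn and converts any panic it raises into an error,
// so the calling program can decide how to proceed.
func safeCrawl(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("crawl aborted: %v", r)
		}
	}()
	fn()
	return nil
}

func main() {
	if err := safeCrawl(runCrawl); err != nil {
		fmt.Println("recovered:", err)
	}
}
```

	Recovering like this is a last resort, as the commit message says: the crawler state after a panic may be incomplete.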
2020-02-17	Propagate the link tag through redirects	ale
	In order to do this we have to plumb it through the queue and the Handler interface, but it should allow fetches of the resources associated with a page via the IncludeRelatedScope even if it's behind a redirect.
2019-01-20	Refactor Handlers in terms of a Publisher interface	ale
	Introduce an interface to decouple the Enqueue functionality from the Crawler implementation.
2019-01-19	Replace URLInfo with a simple URL presence check	ale
	The whole URLInfo structure, while neat, is unused except for the purpose of verifying if we have already seen a specific URL. The presence check is also now limited to Enqueue().
2018-12-27	Normalize URLs before checking if they are in scope	ale

2018-08-31	Do not drop /index.html at the end of URLs	ale

2018-08-31	Explicitly delegate retry logic to handlers	ale
	Makes it possible to retry requests for temporary HTTP errors (429, 500, etc).
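	The idea can be sketched roughly as follows. This is an illustration, not the handler code from this repository: shouldRetry and fetchWithRetry are hypothetical names, and treating 429 plus every 5xx status as temporary is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldRetry reports whether a status code looks like a temporary
// failure. Assumption: 429 and all 5xx codes are worth retrying.
func shouldRetry(status int) bool {
	return status == http.StatusTooManyRequests || status >= 500
}

// fetchWithRetry calls fetch until it returns a non-temporary status
// or maxAttempts is reached. It returns the last status seen and the
// number of attempts made.
func fetchWithRetry(fetch func() int, maxAttempts int) (status, attempts int) {
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		status = fetch()
		if !shouldRetry(status) {
			return status, attempts
		}
	}
	return status, maxAttempts
}

func main() {
	// Simulated server: two temporary errors, then success.
	responses := []int{429, 500, 200}
	i := 0
	fetch := func() int {
		s := responses[i]
		i++
		return s
	}
	status, attempts := fetchWithRetry(fetch, 5)
	fmt.Println(status, attempts)
}
```

	A real handler would normally wait between attempts (for example, honoring a Retry-After header); the sketch omits the delay to stay short.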
2018-08-31	Improve error handling, part two	ale
	Handler errors are fatal, so that an error writing the WARC output will cause the crawl to abort.
2018-08-31	Improve error checking	ale
	Detect write errors (both on the database and to the WARC output) and abort with an error message. Also fix a bunch of harmless lint warnings.
2017-12-19	Exit gracefully on signals	ale

2017-12-19	Simplify redirectHandler.Handle	ale

2017-12-19	Add tags (primary/related) to links	ale
	This change allows more complex scope boundaries, including loosening edges a bit to include related resources of HTML pages (which makes for more complete archives if desired).
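	One way to picture a tag-aware scope check is sketched below. The types and names (Tag, Link, inScope) are hypothetical and chosen for illustration; the actual definitions in this repository may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Tag marks how a link was discovered. Hypothetical type for this sketch.
type Tag int

const (
	TagPrimary Tag = iota // a navigational link to another page
	TagRelated            // a resource the page needs (image, CSS, script)
)

// Link pairs a URL with the tag of the reference that produced it.
type Link struct {
	URL string
	Tag Tag
}

// inScope accepts primary links only under prefix, but loosens the
// boundary for related resources so that archived pages render fully.
func inScope(l Link, prefix string) bool {
	if l.Tag == TagRelated {
		return true
	}
	return strings.HasPrefix(l.URL, prefix)
}

func main() {
	prefix := "https://example.com/"
	fmt.Println(inScope(Link{URL: "https://example.com/about", Tag: TagPrimary}, prefix))
	fmt.Println(inScope(Link{URL: "https://cdn.example.net/site.css", Tag: TagRelated}, prefix))
	fmt.Println(inScope(Link{URL: "https://cdn.example.net/other", Tag: TagPrimary}, prefix))
}
```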
2017-12-18	Switch to github.com/syndtr/goleveldb	ale
	The native Go implementation of LevelDB.
2015-07-03	minor golint fixes	ale

2015-06-29	clean up the state directory when done	ale

2015-06-29	improve queue code; golint fixes	ale
	The queuing code now performs proper lease accounting, and it will not return a URL twice if the page load is slow.
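	Lease accounting can be illustrated with a small in-memory queue. This is a sketch under stated assumptions, not the queue in this repository: leaseQueue and its methods are hypothetical, time is passed in as plain Unix seconds, and the lease expiry behavior is an assumption.

```go
package main

import "fmt"

// leaseQueue hands out each URL to at most one worker at a time.
// A leased URL is hidden from Next until it is released or its
// lease expires. Hypothetical type for this sketch.
type leaseQueue struct {
	pending []string
	leased  map[string]int64 // URL -> lease expiry (Unix seconds)
	ttl     int64            // lease duration in seconds
}

func newLeaseQueue(ttl int64) *leaseQueue {
	return &leaseQueue{leased: make(map[string]int64), ttl: ttl}
}

// Add appends a URL to the pending list.
func (q *leaseQueue) Add(url string) {
	q.pending = append(q.pending, url)
}

// Next returns the next unleased URL and records a lease for it.
// Expired leases are moved back to the pending list first.
func (q *leaseQueue) Next(now int64) (string, bool) {
	for u, expiry := range q.leased {
		if now >= expiry {
			delete(q.leased, u)
			q.pending = append(q.pending, u)
		}
	}
	if len(q.pending) == 0 {
		return "", false
	}
	u := q.pending[0]
	q.pending = q.pending[1:]
	q.leased[u] = now + q.ttl
	return u, true
}

// Release drops the lease once the URL has been fully processed.
func (q *leaseQueue) Release(url string) {
	delete(q.leased, url)
}

func main() {
	q := newLeaseQueue(30)
	q.Add("https://example.com/")
	u, ok := q.Next(0)
	fmt.Println(u, ok)
	// A second worker asking while the page is still loading gets nothing.
	_, ok = q.Next(10)
	fmt.Println(ok)
	q.Release(u)
}
```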
2014-12-20	move URLInfo logic into the Crawler itself	ale

2014-12-20	add a prefix iterator to gobDb	ale

2014-12-20	make Scope checking more modular	ale

2014-12-20	move the WARC code into its own package	ale
	Now generates well-formed, indexable WARC files.
2014-12-19	initial commit	ale