Age | Commit message | Author | |
---|---|---|---|
2022-03-24 | crawler: close temporary descriptor in advance of defer (performance) | Jordan | |
2022-03-24 | crawler: continue crawl when context deadline exceeded (timeout) | Jordan | |
2022-03-24 | crawler: rm temporary body store once processed in advance of defer | Jordan | |
2022-03-24 | misc: update handler signatures, tests, housekeeping | Jordan | |
2021-06-19 | Ignore URL decode errors | ale | |
This is an internal inconsistency that should be investigated. | |||
2020-08-20 | Panic instead of just dying with fatal error | ale | |
2020-07-30 | Panic on fatal errors | ale | |
This allows users of crawl-as-a-library to recover from unexpected errors as a last resort. | |||
2020-02-17 | Propagate the link tag through redirects | ale | |
In order to do this we have to plumb it through the queue and the Handler interface, but it should allow fetches of the resources associated with a page via the IncludeRelatedScope even if it's behind a redirect. | |||
2019-01-20 | Refactor Handlers in terms of a Publisher interface | ale | |
Introduce an interface to decouple the Enqueue functionality from the Crawler implementation. | |||
2019-01-19 | Replace URLInfo with a simple URL presence check | ale | |
The whole URLInfo structure, while neat, is unused except for the purpose of verifying if we have already seen a specific URL. The presence check is also now limited to Enqueue(). | |||
2018-12-27 | Normalize URLs before checking if they are in scope | ale | |
2018-08-31 | Do not drop /index.html at the end of URLs | ale | |
2018-08-31 | Explicitly delegate retry logic to handlers | ale | |
Makes it possible to retry requests for temporary HTTP errors (429, 500, etc). | |||
2018-08-31 | Improve error handling, part two | ale | |
Handler errors are fatal, so that an error writing the WARC output will cause the crawl to abort. | |||
2018-08-31 | Improve error checking | ale | |
Detect write errors (both on the database and to the WARC output) and abort with an error message. Also fix a bunch of harmless lint warnings. | |||
2017-12-19 | Exit gracefully on signals | ale | |
2017-12-19 | Simplify redirectHandler.Handle | ale | |
2017-12-19 | Add tags (primary/related) to links | ale | |
This change allows more complex scope boundaries, including loosening edges a bit to include related resources of HTML pages (which makes for more complete archives if desired). | |||
2017-12-18 | Switch to github.com/syndtr/goleveldb | ale | |
The native Go implementation of LevelDB. | |||
2015-07-03 | minor golint fixes | ale | |
2015-06-29 | clean up the state directory when done | ale | |
2015-06-29 | improve queue code; golint fixes | ale | |
The queuing code now performs proper lease accounting, and it will not return a URL twice if the page load is slow. | |||
2014-12-20 | move URLInfo logic into the Crawler itself | ale | |
2014-12-20 | add a prefix iterator to gobDb | ale | |
2014-12-20 | make Scope checking more modular | ale | |
2014-12-20 | move the WARC code into its own package | ale | |
Now generates well-formed, indexable WARC files. | |||
2014-12-19 | initial commit | ale | |