Age | Commit message | Author |
|
This is an internal inconsistency that should be investigated.
|
This allows users of crawl-as-a-library to recover from unexpected
errors as a last resort.
|
In order to do this we have to plumb it through the queue and the
Handler interface, but it should allow fetches of the resources
associated with a page via the IncludeRelatedScope even if it's behind
a redirect.
|
Introduce an interface to decouple the Enqueue functionality from the
Crawler implementation.
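
A minimal sketch of what such a decoupling could look like. The names `Enqueuer`, `recorder` and `handleLinks` are hypothetical and chosen for illustration only; the identifiers and signatures in the actual codebase may differ.

```go
package main

import "fmt"

// Enqueuer is a hypothetical narrow interface: anything that can accept
// a URL for later fetching. Code that only needs to add URLs can depend
// on this instead of on a concrete crawler type.
type Enqueuer interface {
	Enqueue(url string, depth int) error
}

// recorder is a stand-in implementation that only remembers what it was
// given. A real crawler would write to its queue instead.
type recorder struct {
	urls []string
}

func (r *recorder) Enqueue(url string, depth int) error {
	r.urls = append(r.urls, url)
	return nil
}

// handleLinks plays the role of a page handler. It sees only the
// Enqueuer interface, so it can be exercised without a crawler.
func handleLinks(q Enqueuer, links []string, depth int) error {
	for _, l := range links {
		if err := q.Enqueue(l, depth+1); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	r := &recorder{}
	links := []string{"http://example.com/a", "http://example.com/b"}
	if err := handleLinks(r, links, 0); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(r.urls), "URLs enqueued")
}
```

One benefit of this shape is that handlers can be tested with a stand-in such as `recorder`, without starting a full crawl.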
|
The whole URLInfo structure, while neat, is unused except for the
purpose of verifying if we have already seen a specific URL.
The presence check is also now limited to Enqueue().
|
Makes it possible to retry requests for temporary HTTP errors (429,
500, etc.).
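
As an illustration of the idea, the sketch below classifies status codes as temporary and retries a bounded number of times. The set of retryable codes beyond 429 and 500, and the helper names `isTemporaryStatus` and `fetchWithRetry`, are assumptions rather than the crawler's actual logic. A real crawler would more likely put the URL back on its queue with a delay than loop in place.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

var errTooManyRetries = errors.New("too many temporary errors")

// isTemporaryStatus reports whether a response status is worth retrying.
// Only 429 and 500 come from the commit message; the other codes are an
// assumption about what "etc." might cover.
func isTemporaryStatus(code int) bool {
	switch code {
	case http.StatusTooManyRequests, // 429
		http.StatusInternalServerError, // 500
		http.StatusBadGateway,          // 502
		http.StatusServiceUnavailable,  // 503
		http.StatusGatewayTimeout:      // 504
		return true
	}
	return false
}

// fetchWithRetry calls fetch until it returns a non-temporary status,
// a transport error, or maxAttempts is exhausted.
func fetchWithRetry(fetch func() (int, error), maxAttempts int) (int, error) {
	code := 0
	for i := 0; i < maxAttempts; i++ {
		var err error
		code, err = fetch()
		if err != nil {
			return code, err
		}
		if !isTemporaryStatus(code) {
			return code, nil
		}
	}
	return code, errTooManyRetries
}

func main() {
	calls := 0
	// Simulated fetch: two 429 responses followed by a 200.
	fetch := func() (int, error) {
		calls++
		if calls < 3 {
			return http.StatusTooManyRequests, nil
		}
		return http.StatusOK, nil
	}
	code, err := fetchWithRetry(fetch, 5)
	fmt.Println(code, err, calls)
}
```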
|
Handler errors are fatal, so that an error writing the WARC output
will cause the crawl to abort.
|
Detect write errors (both on the database and to the WARC output) and
abort with an error message.
Also fix a bunch of harmless lint warnings.
|
This change allows more complex scope boundaries, including loosening
edges a bit to include related resources of HTML pages (which makes
for more complete archives if desired).
|
The native Go implementation of LevelDB.
|
The queuing code now performs proper lease accounting, and it will not
return a URL twice if the page load is slow.
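
A minimal in-memory sketch of lease accounting, assuming a lease is held until the worker explicitly releases it. The type `leaseQueue` and its methods are hypothetical; the actual queue is presumably backed by the database and may track leases differently.

```go
package main

import "fmt"

// leaseQueue hands each URL to at most one worker at a time.
type leaseQueue struct {
	pending []string
	leased  map[string]bool
}

func newLeaseQueue(urls ...string) *leaseQueue {
	return &leaseQueue{
		pending: append([]string(nil), urls...),
		leased:  make(map[string]bool),
	}
}

// Acquire returns the next pending URL and records a lease on it.
// A leased URL is not in pending, so it cannot be returned again,
// however long the page load takes.
func (q *leaseQueue) Acquire() (string, bool) {
	if len(q.pending) == 0 {
		return "", false
	}
	u := q.pending[0]
	q.pending = q.pending[1:]
	q.leased[u] = true
	return u, true
}

// Release ends the lease on u. If the fetch did not succeed, the URL
// goes back to the end of the pending list so it can be tried again.
func (q *leaseQueue) Release(u string, success bool) {
	if !q.leased[u] {
		return // not leased: nothing to account for
	}
	delete(q.leased, u)
	if !success {
		q.pending = append(q.pending, u)
	}
}

func main() {
	q := newLeaseQueue("http://example.com/")
	u, _ := q.Acquire()
	_, again := q.Acquire() // the only URL is still leased
	fmt.Println(u, again)
	q.Release(u, true)
}
```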
|
Now generates well-formed, indexable WARC files.
|