Age | Commit message | Author | |
---|---|---|---|
2022-02-14 | client, crawl: fix/simplify net.Dialer overrides | Jordan | |
2022-02-14 | crawl, readme: record assembled seed URLs to seed_urls file | Jordan | |
2022-02-10 | crawl, readme: max default WARC size 100 MB -> 5 GB | Jordan | |
2022-02-10 | misc: update crawl paths to reflect fork location | Jordan | |
2022-02-10 | client, crawl: --bind, support making outbound requests from a particular address | Jordan | |
2022-02-10 | crawl: set User-Agent header to appear like Firefox on Windows | Jordan | |
2022-02-10 | crawl: include crawl start date in directory name | Jordan | |
2022-02-10 | crawl: create new directory to store crawl contents, resume param | Jordan | |
2022-02-10 | crawl, scope: recurse infinitely by default | Jordan | |
2020-08-26 | Minor logging fixes | ale | |
2020-08-23 | Fix the crawl.go tests | ale | |
2020-08-23 | Allow setting DNS overrides using the --resolve option | ale | |
2020-07-30 | Retry requests on transport-level errors | ale | |
2020-02-17 | Fix the Handler in cmd/links | ale | |
2020-02-17 | Propagate the link tag through redirects | ale | |
In order to do this we have to plumb it through the queue and the Handler interface, but it should allow fetches of the resources associated with a page via the IncludeRelatedScope even if it's behind a redirect. | |||
2019-01-20 | Refactor Handlers in terms of a Publisher interface | ale | |
Introduce an interface to decouple the Enqueue functionality from the Crawler implementation. | |||
2019-01-02 | Add multi-file output | ale | |
The output stage can now write to size-limited, rotating WARC files using a user-specified pattern, so that output files are always unique. | |||
2018-12-06 | Apply --excludes to related resources too | ale | |
2018-09-02 | Add --exclude and --exclude-file options | ale | |
Allow users to add to the exclude regexp lists easily. | |||
2018-08-31 | Add a simple test for the full WARC crawler | ale | |
2018-08-31 | Explicitly delegate retry logic to handlers | ale | |
Makes it possible to retry requests for temporary HTTP errors (429, 500, etc). | |||
2018-08-31 | Improve error handling, part two | ale | |
Handler errors are fatal, so that an error writing the WARC output will cause the crawl to abort. | |||
2018-08-31 | Improve error checking | ale | |
Detect write errors (both on the database and to the WARC output) and abort with an error message. Also fix a bunch of harmless lint warnings. | |||
2017-12-19 | Provide better defaults for command-line options | ale | |
Defaults that are more suitable to real-world site archiving. | |||
2017-12-19 | Exit gracefully on signals | ale | |
2017-12-19 | Use a global http.Client with sane settings | ale | |
2017-12-19 | Update cmd/links to new scope syntax | ale | |
2017-12-19 | Add tags (primary/related) to links | ale | |
This change allows more complex scope boundaries, including loosening edges a bit to include related resources of HTML pages (which makes for more complete archives if desired). | |||
2015-07-03 | minor golint fixes | ale | |
2015-06-29 | clean up the state directory when done | ale | |
2015-06-28 | add ignore list from ArchiveBot | ale | |
2014-12-20 | move URLInfo logic into the Crawler itself | ale | |
2014-12-20 | make Scope checking more modular | ale | |
2014-12-20 | move link extraction to a common location | ale | |
2014-12-20 | move the WARC code into its own package | ale | |
Now generates well-formed, indexable WARC files. | |||
2014-12-19 | initial commit | ale | |