path: root/cmd
Age | Commit message | Author
2022-03-24 | misc: update handler signatures, tests, housekeeping | Jordan
2022-03-24 | links, crawl: dramatically reduce memory usage | Jordan
to prevent excessive memory usage and OOM crashes, rather than store and pass around response bodies in memory buffers, let's store them temporarily on the filesystem wget-style and delete them when processed
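The approach described in this commit can be sketched roughly as follows. This is a minimal illustration and not the code from the commit: `spoolToTemp`, `demo`, and the `crawl-body-*` file name pattern are hypothetical names chosen for the example.

```go
package main

import (
	"io"
	"os"
	"strings"
)

// spoolToTemp copies r into a temporary file instead of holding it in
// memory. It returns the file rewound to the start, plus a cleanup
// function that closes and deletes the file once processing is done.
func spoolToTemp(r io.Reader) (*os.File, func(), error) {
	f, err := os.CreateTemp("", "crawl-body-*") // hypothetical name pattern
	if err != nil {
		return nil, nil, err
	}
	cleanup := func() {
		f.Close()
		os.Remove(f.Name())
	}
	if _, err := io.Copy(f, r); err != nil {
		cleanup()
		return nil, nil, err
	}
	if _, err := f.Seek(0, io.SeekStart); err != nil {
		cleanup()
		return nil, nil, err
	}
	return f, cleanup, nil
}

// demo spools a small body to disk, reads it back, and removes the file.
func demo() (string, error) {
	f, cleanup, err := spoolToTemp(strings.NewReader("<html>hello</html>"))
	if err != nil {
		return "", err
	}
	defer cleanup()
	b, err := io.ReadAll(f)
	return string(b), err
}
```

In a crawler, the reader passed in would typically be an HTTP response body, and each handler would read from the returned file before `cleanup` runs.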
2022-02-14 | client, crawl: fix/simplify net.Dialer overrides | Jordan
2022-02-14 | crawl, readme: record assembled seed URLs to seed_urls file | Jordan
2022-02-10 | crawl, readme: max default WARC size 100 MB -> 5 GB | Jordan
2022-02-10 | misc: update crawl paths to reflect fork location | Jordan
2022-02-10 | client, crawl: --bind, support making outbound requests from a particular address | Jordan
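One common way to send outbound requests from a particular address in Go is to set `LocalAddr` on a `net.Dialer`. The sketch below assumes that approach. `newBoundClient` and the 30-second timeout are hypothetical, and the actual `--bind` implementation may differ.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// newBoundClient returns an HTTP client whose outgoing TCP connections
// originate from the given local IP address.
func newBoundClient(ip string) (*http.Client, error) {
	addr := net.ParseIP(ip)
	if addr == nil {
		return nil, fmt.Errorf("invalid bind address %q", ip)
	}
	dialer := &net.Dialer{
		Timeout: 30 * time.Second, // assumed value, for illustration only
		// Port 0 lets the operating system pick an ephemeral source port.
		LocalAddr: &net.TCPAddr{IP: addr},
	}
	transport := &http.Transport{DialContext: dialer.DialContext}
	return &http.Client{Transport: transport}, nil
}
```

The address has to be assigned to a local interface, otherwise connection attempts made through the client will fail at dial time.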
2022-02-10 | crawl: set User-Agent header to appear like Firefox on Windows | Jordan
2022-02-10 | crawl: include crawl start date in directory name | Jordan
2022-02-10 | crawl: create new directory to store crawl contents, resume param | Jordan
2022-02-10 | crawl, scope: recurse infinitely by default | Jordan
2020-08-26 | Minor logging fixes | ale
2020-08-23 | Fix the crawl.go tests | ale
2020-08-23 | Allow setting DNS overrides using the --resolve option | ale
2020-07-30 | Retry requests on transport-level errors | ale
2020-02-17 | Fix the Handler in cmd/links | ale
2020-02-17 | Propagate the link tag through redirects | ale
In order to do this we have to plumb it through the queue and the Handler interface, but it should allow fetches of the resources associated with a page via the IncludeRelatedScope even if it's behind a redirect.
2019-01-20 | Refactor Handlers in terms of a Publisher interface | ale
Introduce an interface to decouple the Enqueue functionality from the Crawler implementation.
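The decoupling described here might look roughly like the sketch below. The method signatures are simplified assumptions for illustration and are not the project's actual interfaces. `followLinks` and `memoryPublisher` are hypothetical.

```go
package main

// Publisher accepts newly discovered URLs. Handlers depend only on this
// interface, not on the concrete crawler type.
type Publisher interface {
	Enqueue(url string, depth int) error
}

// Handler processes one fetched page and may publish its outbound links.
type Handler interface {
	Handle(p Publisher, url string, depth int, links []string) error
}

// HandlerFunc adapts a plain function to the Handler interface.
type HandlerFunc func(p Publisher, url string, depth int, links []string) error

// Handle calls the wrapped function.
func (f HandlerFunc) Handle(p Publisher, url string, depth int, links []string) error {
	return f(p, url, depth, links)
}

// followLinks enqueues every link one level deeper than the current page.
func followLinks(p Publisher, url string, depth int, links []string) error {
	for _, l := range links {
		if err := p.Enqueue(l, depth+1); err != nil {
			return err
		}
	}
	return nil
}

// memoryPublisher records enqueued URLs in a slice, which lets a handler
// be exercised in a test without a running crawler.
type memoryPublisher struct {
	queued []string
}

// Enqueue appends the URL to the in-memory list.
func (m *memoryPublisher) Enqueue(url string, depth int) error {
	m.queued = append(m.queued, url)
	return nil
}
```

With this shape, a handler can be tested by passing it a `memoryPublisher` and inspecting `queued` afterwards.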
2019-01-02 | Add multi-file output | ale
The output stage can now write to size-limited, rotating WARC files using a user-specified pattern, so that output files are always unique.
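A size-limited, rotating writer of this kind could be sketched as follows. This is an illustration under assumptions, not the project's implementation: `rotatingWriter` is hypothetical, and the pattern here is a `fmt`-style format with a sequence number, whereas the real pattern syntax may differ.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// rotatingWriter starts a new file whenever the current one has reached
// maxSize. A single Write is never split across two files.
type rotatingWriter struct {
	pattern string // assumed format, e.g. "out-%05d.warc"
	maxSize int64
	seq     int
	size    int64
	f       *os.File
}

// rotate closes the current file, if any, and opens the next one.
func (w *rotatingWriter) rotate() error {
	if err := w.Close(); err != nil {
		return err
	}
	name := fmt.Sprintf(w.pattern, w.seq)
	w.seq++
	// O_EXCL makes creation fail instead of overwriting an existing file.
	f, err := os.OpenFile(name, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
	if err != nil {
		return err
	}
	w.f, w.size = f, 0
	return nil
}

// Write appends p to the current file, rotating first if the limit is reached.
func (w *rotatingWriter) Write(p []byte) (int, error) {
	if w.f == nil || w.size >= w.maxSize {
		if err := w.rotate(); err != nil {
			return 0, err
		}
	}
	n, err := w.f.Write(p)
	w.size += int64(n)
	return n, err
}

// Close closes the current file. It is safe to call more than once.
func (w *rotatingWriter) Close() error {
	if w.f == nil {
		return nil
	}
	err := w.f.Close()
	w.f = nil
	return err
}

// demoRotation writes three 10-byte records with a 10-byte limit into a
// temporary directory and returns how many files were created.
func demoRotation() (int, error) {
	dir, err := os.MkdirTemp("", "rotate-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	w := &rotatingWriter{pattern: filepath.Join(dir, "out-%05d.warc"), maxSize: 10}
	defer w.Close()
	for i := 0; i < 3; i++ {
		if _, err := w.Write([]byte("0123456789")); err != nil {
			return 0, err
		}
	}
	if err := w.Close(); err != nil {
		return 0, err
	}
	entries, err := os.ReadDir(dir)
	return len(entries), err
}
```

Because the size check happens before each write, a file can exceed the limit by at most one record, which keeps each record whole.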
2018-12-06 | Apply --excludes to related resources too | ale
2018-09-02 | Add --exclude and --exclude-file options | ale
Allow users to add to the exclude regexp lists easily.
2018-08-31 | Add a simple test for the full WARC crawler | ale
2018-08-31 | Explicitly delegate retry logic to handlers | ale
Makes it possible to retry requests for temporary HTTP errors (429, 500, etc).
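Delegating retries to handlers could be done with a wrapper along these lines. This is a hedged sketch: `errRetry`, `shouldRetry`, `withRetries`, and the simplified `handlerFunc` signature are hypothetical, and treating every 5xx status as temporary is a simplification.

```go
package main

import (
	"errors"
	"net/http"
)

// errRetry is a sentinel a handler returns to ask the crawler to put the
// request back on the queue instead of treating it as done.
var errRetry = errors.New("retry request")

// shouldRetry reports whether an HTTP status looks like a temporary error.
func shouldRetry(status int) bool {
	return status == http.StatusTooManyRequests || status >= 500
}

// handlerFunc is a simplified handler signature used only in this sketch.
type handlerFunc func(url string, status int) error

// withRetries wraps a handler so that temporary HTTP errors produce
// errRetry and never reach the wrapped handler.
func withRetries(next handlerFunc) handlerFunc {
	return func(url string, status int) error {
		if shouldRetry(status) {
			return errRetry
		}
		return next(url, status)
	}
}
```

The crawler loop would then compare a handler's returned error against the sentinel and reschedule the URL, while any other error stays a genuine failure.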
2018-08-31 | Improve error handling, part two | ale
Handler errors are fatal, so that an error writing the WARC output will cause the crawl to abort.
2018-08-31 | Improve error checking | ale
Detect write errors (both on the database and to the WARC output) and abort with an error message. Also fix a bunch of harmless lint warnings.
2017-12-19 | Provide better defaults for command-line options | ale
Defaults that are more suitable to real-world site archiving.
2017-12-19 | Exit gracefully on signals | ale
2017-12-19 | Use a global http.Client with sane settings | ale
2017-12-19 | Update cmd/links to new scope syntax | ale
2017-12-19 | Add tags (primary/related) to links | ale
This change allows more complex scope boundaries, including loosening edges a bit to include related resources of HTML pages (which makes for more complete archives if desired).
2015-07-03 | minor golint fixes | ale
2015-06-29 | clean up the state directory when done | ale
2015-06-28 | add ignore list from ArchiveBot | ale
2014-12-20 | move URLInfo logic into the Crawler itself | ale
2014-12-20 | make Scope checking more modular | ale
2014-12-20 | move link extraction to a common location | ale
2014-12-20 | move the WARC code into its own package | ale
Now generates well-formed, indexable WARC files.
2014-12-19 | initial commit | ale