Age  Commit message  Author
2023-01-24  ignore patterns: merge updates from upstream, regenerate (HEAD, master)  Jordan
2022-03-24  crawler: close temporary descriptor in advance of defer (performance)  Jordan
2022-03-24  crawler: continue crawl when context deadline exceeded (timeout)  Jordan
2022-03-24  crawler: rm temporary body store once processed in advance of defer  Jordan
2022-03-24  misc: update handler signatures, tests, housekeeping  Jordan
2022-03-24  links, crawl: dramatically reduce memory usage  Jordan
    to prevent excessive memory usage and OOM crashes, rather than store and pass around response bodies in memory buffers, let's store them temporarily on the filesystem wget-style and delete them when processed
2022-02-14  client, crawl: fix/simplify net.Dialer overrides  Jordan
2022-02-14  crawl, readme: record assembled seed URLs to seed_urls file  Jordan
2022-02-11  readme: go get -> go install (deprecated), misc updates  Jordan
2022-02-10  misc: add Makefile  Jordan
2022-02-10  crawl, readme: max default WARC size 100 MB -> 5 GB  Jordan
2022-02-10  license: add Jordan (me)  Jordan
2022-02-10  readme: typo correction, spacing  Jordan
2022-02-10  misc: update crawl paths to reflect fork location  Jordan
2022-02-10  readme: document changes from upstream  Jordan
2022-02-10  gen-ignores, ignore_patterns: update to exclude unsupported Perl syntax, backreferences  Jordan
2022-02-10  ignore_patterns: update to reflect current ArchiveBot ignore set  Jordan
2022-02-10  client, crawl: --bind, support making outbound requests from a particular address  Jordan
2022-02-10  crawl: set User-Agent header to appear like Firefox on Windows  Jordan
2022-02-10  crawl: include crawl start date in directory name  Jordan
2022-02-10  crawl: create new directory to store crawl contents, resume param  Jordan
2022-02-10  crawl, scope: recurse infinitely by default  Jordan
2021-07-12  Merge branch 'renovate/github.com-puerkitobio-goquery-1.x' into 'master'  ale
    Update module github.com/PuerkitoBio/goquery to v1.7.1
    See merge request ale/crawl!3
2021-07-11  Update module github.com/PuerkitoBio/goquery to v1.7.1  renovate
2021-06-19  Merge branch 'renovate/github.com-google-go-cmp-0.x' into 'master'  ale
    Update module github.com/google/go-cmp to v0.5.6
    See merge request ale/crawl!5
2021-06-19  Merge branch 'renovate/github.com-pborman-uuid-1.x' into 'master'  ale
    Update module github.com/pborman/uuid to v1.2.1
    See merge request ale/crawl!2
2021-06-19  Merge branch 'renovate/github.com-puerkitobio-purell-0.x' into 'master'  ale
    Update module github.com/PuerkitoBio/purell to v0.1.0
    See merge request ale/crawl!4
2021-06-19  Update module github.com/google/go-cmp to v0.5.6  renovate
2021-06-19  Update module github.com/PuerkitoBio/purell to v0.1.0  renovate
2021-06-19  Update module github.com/pborman/uuid to v1.2.1  renovate
2021-06-19  Merge branch 'renovate/configure' into 'master'  ale
    Configure Renovate
    See merge request ale/crawl!1
2021-06-19  Add renovate.json  renovate
2021-06-19  Ignore URL decode errors  ale
    This is an internal inconsistency that should be investigated.
2021-06-19  go mod vendor  ale
2020-08-26  Minor logging fixes  ale
2020-08-26  Rename the package to avoid conflicts  ale
2020-08-24  Fix typo in CI config  ale
2020-08-24  Build Debian packages via CI  ale
2020-08-23  Minor fixes to Debian packaging  ale
2020-08-23  Fix the crawl.go tests  ale
2020-08-23  Add minimal Debian packaging  ale
2020-08-23  Allow setting DNS overrides using the --resolve option  ale
2020-08-20  Panic instead of just dying with fatal error  ale
2020-07-30  Retry requests on transport-level errors  ale
2020-07-30  Panic on fatal errors  ale
    This allows users of crawl-as-a-library to recover from unexpected errors as a last resort.
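
The point of panicking rather than exiting is that a caller can intercept a panic with `recover`, whereas a call such as `log.Fatal` ends the process via `os.Exit` and cannot be intercepted. A minimal sketch of how a library user might do this follows; `runCrawl` and `safeCrawl` are hypothetical names used only for illustration and are not part of the crawl package's API.

```go
package main

import (
	"errors"
	"fmt"
)

// runCrawl is a hypothetical stand-in for a library entry point that panics
// on an unrecoverable error instead of exiting the process.
func runCrawl(fail bool) {
	if fail {
		panic(errors.New("fatal: database corrupted"))
	}
}

// safeCrawl converts a panic raised inside runCrawl into an ordinary error,
// which is the "last resort" recovery a library user can apply.
func safeCrawl(fail bool) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("crawl aborted: %v", r)
		}
	}()
	runCrawl(fail)
	return nil
}

func main() {
	fmt.Println(safeCrawl(false))
	fmt.Println(safeCrawl(true))
}
```

The named return value `err` is what lets the deferred function replace the result after the panic has unwound the call to `runCrawl`.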
2020-02-17  Fix the Handler in cmd/links  ale
2020-02-17  Propagate the link tag through redirects  ale
    In order to do this we have to plumb it through the queue and the Handler interface, but it should allow fetches of the resources associated with a page via the IncludeRelatedScope even if it's behind a redirect.
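
The idea in the commit message above can be illustrated with a toy model: if each queued item carries its tag, a redirect target can inherit the tag of the URL that redirected to it, so a scope check that admits "related" resources still applies after the redirect. The sketch below is illustrative only. The `item` type, the `tagPrimary`/`tagRelated` constants, and the `follow` function are assumptions invented for this example and do not reproduce the project's queue or Handler interface.

```go
package main

import "fmt"

// Hypothetical tag values: a primary link (e.g. a page) versus a related
// resource (e.g. an image or stylesheet referenced by a page).
const (
	tagPrimary = iota
	tagRelated
)

// item is a hypothetical queue entry that carries the tag with the URL.
type item struct {
	URL string
	Tag int
}

// follow walks a chain of redirects (modelled as a URL -> Location map) and
// copies the tag onto each redirect target, so the final fetch is still
// classified the same way as the original link.
func follow(start item, redirects map[string]string, maxHops int) item {
	cur := start
	for i := 0; i < maxHops; i++ {
		next, ok := redirects[cur.URL]
		if !ok {
			break
		}
		cur = item{URL: next, Tag: cur.Tag}
	}
	return cur
}

func main() {
	redirects := map[string]string{
		"http://example.com/logo":     "http://cdn.example.com/logo",
		"http://cdn.example.com/logo": "http://cdn.example.com/logo.png",
	}
	final := follow(item{URL: "http://example.com/logo", Tag: tagRelated}, redirects, 10)
	fmt.Println(final.URL, final.Tag == tagRelated)
}
```

Without the tag on the queue entry, the redirect target would be evaluated as a fresh link with no record that it was reached as a related resource.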
2019-12-04  Fix installation instructions  ale
2019-11-13  Add contact email address  ale
2019-11-13  Update dependencies (legacy and go.mod)  ale