crawl (branch: master)
A simple recursive web crawler which stores content in the WARC/1.0 format
Jordan
path: root/cmd
Age         Author  Commit message
2022-02-14  Jordan  client, crawl: fix/simplify net.Dialer overrides
2022-02-14  Jordan  crawl, readme: record assembled seed URLs to seed_urls file
2022-02-10  Jordan  crawl, readme: max default WARC size 100 MB -> 5 GB
2022-02-10  Jordan  misc: update crawl paths to reflect fork location
2022-02-10  Jordan  client, crawl: --bind, support making outbound requests from a particular add...
2022-02-10  Jordan  crawl: set User-Agent header to appear like Firefox on Windows
2022-02-10  Jordan  crawl: include crawl start date in directory name
2022-02-10  Jordan  crawl: create new directory to store crawl contents, resume param
2022-02-10  Jordan  crawl, scope: recurse infinitely by default
2020-08-26  ale     Minor logging fixes
2020-08-23  ale     Fix the crawl.go tests
2020-08-23  ale     Allow setting DNS overrides using the --resolve option
2020-07-30  ale     Retry requests on transport-level errors
2020-02-17  ale     Fix the Handler in cmd/links
2020-02-17  ale     Propagate the link tag through redirects
2019-01-20  ale     Refactor Handlers in terms of a Publisher interface
2019-01-02  ale     Add multi-file output
2018-12-06  ale     Apply --excludes to related resources too
2018-09-02  ale     Add --exclude and --exclude-file options
2018-08-31  ale     Add a simple test for the full WARC crawler
2018-08-31  ale     Explicitly delegate retry logic to handlers
2018-08-31  ale     Improve error handling, part two
2018-08-31  ale     Improve error checking
2017-12-19  ale     Provide better defaults for command-line options
2017-12-19  ale     Exit gracefully on signals
2017-12-19  ale     Use a global http.Client with sane settings
2017-12-19  ale     Update cmd/links to new scope syntax
2017-12-19  ale     Add tags (primary/related) to links
2015-07-03  ale     minor golint fixes
2015-06-29  ale     clean up the state directory when done
2015-06-28  ale     add ignore list from ArchiveBot
2014-12-20  ale     move URLInfo logic into the Crawler itself
2014-12-20  ale     make Scope checking more modular
2014-12-20  ale     move link extraction to a common location
2014-12-20  ale     move the WARC code into its own package
2014-12-19  ale     initial commit