path: root/analysis
Age  Commit message  Author
2022-03-24  misc: update handler signatures, tests, housekeeping  Jordan
2022-03-24  links, crawl: dramatically reduce memory usage  Jordan
To prevent excessive memory usage and OOM crashes, store response bodies temporarily on the filesystem, wget-style, and delete them once they have been processed, rather than storing and passing them around in memory buffers.
2022-02-10  misc: update crawl paths to reflect fork location  Jordan
2019-10-07  Parse links in inline style blocks  ale
2018-09-02  Minimal support for <video> and <object> tags  ale
2018-08-31  Improve error checking  ale
Detect write errors (both on the database and to the WARC output) and abort with an error message. Also fix a bunch of harmless lint warnings.
2017-12-19  Crawl IFRAMEs as related resources  ale
2017-12-19  Skip data: URLs  ale
2017-12-19  Add tags (primary/related) to links  ale
This change allows more complex scope boundaries, including loosening edges a bit to include related resources of HTML pages (which makes for more complete archives if desired).
2017-12-18  Add support for @import syntax in css  ale
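For illustration, CSS link extraction covering both `url()` references and quoted `@import` rules might look like the sketch below. The regular expressions and the `extractCSSLinks` name are assumptions made for this example, not the patterns used in the repository; a regexp-based approach is only an approximation of real CSS parsing.

```go
package main

import (
	"regexp"
	"strings"
)

var (
	// Matches url(x), url("x") and url( 'x' ), with optional whitespace.
	cssURLRx = regexp.MustCompile(`url\(\s*["']?([^"')]+)["']?\s*\)`)
	// Matches the quoted form: @import "x" or @import 'x'.
	// The @import url(...) form is already covered by cssURLRx.
	cssImportRx = regexp.MustCompile(`@import\s+["']([^"']+)["']`)
)

// extractCSSLinks (hypothetical helper) returns the link targets found in
// a stylesheet: url() matches first, then quoted @import matches.
// data: URLs are skipped because there is nothing to fetch.
func extractCSSLinks(css string) []string {
	var out []string
	for _, rx := range []*regexp.Regexp{cssURLRx, cssImportRx} {
		for _, m := range rx.FindAllStringSubmatch(css, -1) {
			u := strings.TrimSpace(m[1])
			if u == "" || strings.HasPrefix(u, "data:") {
				continue
			}
			out = append(out, u)
		}
	}
	return out
}
```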
2015-06-29  improve queue code; golint fixes  ale
The queuing code now performs proper lease accounting, and it will not return a URL twice if the page load is slow.
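The idea of lease accounting can be sketched with a small in-memory queue. This is only an illustration under assumptions: the `leaseQueue` type and its methods are hypothetical, and the actual queue in the repository may be persistent and structured differently.

```go
package main

import "sync"

// leaseQueue (hypothetical type) hands each URL to one worker at a time.
// A URL stays leased until the worker calls Done or Release, however long
// the page load takes.
type leaseQueue struct {
	mu      sync.Mutex
	pending []string
	leased  map[string]bool
}

func newLeaseQueue() *leaseQueue {
	return &leaseQueue{leased: make(map[string]bool)}
}

// Add appends a URL to the pending list.
func (q *leaseQueue) Add(u string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending = append(q.pending, u)
}

// Pop returns the next pending URL and records a lease for it.
// Leased URLs are no longer pending, so they cannot be returned again.
func (q *leaseQueue) Pop() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.pending) == 0 {
		return "", false
	}
	u := q.pending[0]
	q.pending = q.pending[1:]
	q.leased[u] = true
	return u, true
}

// Done drops the lease after a successful fetch.
func (q *leaseQueue) Done(u string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	delete(q.leased, u)
}

// Release drops the lease and puts the URL back for a retry.
// It does nothing if the URL is not currently leased.
func (q *leaseQueue) Release(u string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.leased[u] {
		delete(q.leased, u)
		q.pending = append(q.pending, u)
	}
}
```

The design choice here is that a lease ends only through an explicit `Done` or `Release`, never through a timeout, so a slow page load cannot cause the same URL to be handed out twice.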
2014-12-20  relax the CSS url() regexp  ale
2014-12-20  move link extraction to a common location  ale