The URLs-grab project at https://github.com/ArchiveTeam/urls-grab allows for URLs to be archived, alongside their page requisites, and optionally other found pages. This repository contains the lists of URLs to be periodically queued and instructions on how to structure the items.
There are two different types of items. The first type consists of the items in the txt files in this repository. These items are read and processed into items that can be queued to the tracker, which are named 'Tracker items'. The main difference between the two types is that 'Tracker items' use percent encoding, while the items in the txt files do not. This is done for simplicity.
warning: The URLs-grab project can easily overload websites if too many URLs are queued at once.
The repository contains txt files, which follow a pattern [0-9]+_STRING.txt for the filenames, where STRING is some string to identify the contents of the txt file, and [0-9]+ is the interval for how often the items in the txt file should be queued to the tracker. Multiple files with equal intervals and different names can be created. Lines can be added and removed from the txt files.
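As an illustration of the filename convention above, the following minimal sketch splits a filename into its interval and identifying string. The function name parse_list_filename is hypothetical and not part of the project, and the sketch makes no assumption about the unit of the interval.

```python
import re

# Pattern taken from the description above: [0-9]+_STRING.txt
FILENAME_RE = re.compile(r"^([0-9]+)_(.+)\.txt$")


def parse_list_filename(filename):
    """Return (interval, name) for a list file, or None if the name does not match."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


print(parse_list_filename("3600_EXAMPLE.txt"))
print(parse_list_filename("README.md"))
```

Files that do not match the pattern yield None, so a caller could simply skip them.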
Each txt file contains a list of items, written as parameters joined with ;. The URLs are not percent encoded, for simplicity. See the next section for the allowed parameters. A special case is the random parameter. If this parameter is specified with value RANDOM (as in our example file 3600_EXAMPLE.txt), a random value will be assigned automatically every time the custom item is queued.
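A rough sketch of how such a line could be read is shown below. The helper names parse_item_line and fill_random are hypothetical, and the length and alphabet of the random string are assumptions chosen for illustration; the project's own processing may differ. The sketch treats everything after url= as the URL, so a URL that itself contains ; is kept intact.

```python
import secrets
import string


def parse_item_line(line):
    """Split 'key=value;...;url=URL' into a dict, keeping the URL whole."""
    params = {}
    rest = line.strip()
    while rest:
        if rest.startswith("url="):
            # Everything after url= belongs to the URL, including any ';'.
            params["url"] = rest[len("url="):]
            break
        part, _, rest = rest.partition(";")
        key, _, value = part.partition("=")
        params[key] = value
    return params


def fill_random(params, length=10):
    """Return a copy with the RANDOM placeholder replaced by a fresh string.

    The length and alphabet are assumptions, not project requirements.
    """
    if params.get("random") != "RANDOM":
        return dict(params)
    alphabet = string.ascii_lowercase + string.digits
    value = "".join(secrets.choice(alphabet) for _ in range(length))
    return dict(params, random=value)


item = parse_item_line("all=1;random=RANDOM;depth=2;url=https://example.com/")
print(fill_random(item))
```

All values are kept as strings here; converting them to numbers is left to the caller.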
Custom URL items contain the URL to be archived and a number of parameters showing how to extract and queue subsequent URLs. These parameters are:
- url: The URL to be archived. This should be the last parameter.
- random: A random string. Items queued to URLs-grab are deduplicated through a bloom filter with items previously queued. This random parameter allows for URLs to be requeued.
- keep_random: The depth up to which the random string shall be preserved. If keep_random is larger than 0, any discovered URLs to be queued will be queued with parameter keep_random=keep_random-1, and have the random parameter copied over.
- all: Whether all extracted URLs from the same domains should be queued, or only the page requisites.
- keep_all: Similar to keep_random, but for all.
- depth: The depth up to which to queue custom items. If depth is larger than 0, any URLs found will be queued as custom item, else as regular URL item.
- deep_extract: If set to 1, patterns will be used to extract hardcoded URLs that are not extracted by Wget-Lua itself, for example from any scripts. This parameter is only kept on the initial queued URL, not any subsequently queued URLs. This should be used on for example RSS feeds.
- any_domain: Whether URLs from any domains should be queued, or only the current domain. all needs to be set in order for this to work.
Using the above instructions, a few example items are:
- all=1;deep_extract=1;url=https://example.com/
  This will archive https://example.com/, and queue all URLs (not limited to page requisites) that can be extracted from the webpage using both Wget-Lua extraction and patterns to extract hardcoded URLs. If this item was already queued before, it will be ignored now. Parameter depth is not specified, effectively setting it to 0.
- all=1;deep_extract=1;random=RANDOM;depth=2;keep_random=1;keep_all=2;url=https://example.com/
  This includes the random string, thus making sure it is queued even if a similar item was queued before. Before queuing to the tracker, RANDOM is replaced by a random string. depth is set to 2, so custom items will be queued for the found URLs which will all have parameter all, effectively allowing a recursive crawl up to depth 3. keep_random has value 1, so only the next queued custom items will have the random value copied over, and subsequently queued custom items will not. deep_extract is only kept for the very first item. keep_all is set to 2, which is equal to depth, so the all=1 parameter will be copied over for all depths.
  Any found URLs will be queued as all=1;random=RANDOM;depth=1;keep_random=0;keep_all=1;url=URL. Note that parameter deep_extract is removed, depth, keep_random, and keep_all are reduced by 1, and random is copied over.
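The propagation rules described for depth, keep_random, keep_all and deep_extract can be sketched as follows. This is only an illustration under the reading that each of the three counters drops by 1 per level; derive_child is a hypothetical name, the real logic lives in the project's grab scripts and may differ in detail, and any_domain is left out because its propagation is not described here.

```python
def derive_child(parent, found_url):
    """Build the parameters for a URL discovered while processing `parent`.

    Returns None when depth is 0 or absent, meaning the URL would be
    queued as a regular URL item rather than a custom item.
    """
    depth = int(parent.get("depth", 0))
    if depth <= 0:
        return None

    keep_all = int(parent.get("keep_all", 0))
    keep_random = int(parent.get("keep_random", 0))

    child = {}
    # all and random are copied only while their keep_* counter is above 0.
    if keep_all > 0 and "all" in parent:
        child["all"] = parent["all"]
    if keep_random > 0 and "random" in parent:
        child["random"] = parent["random"]

    # Each counter drops by one per level; deep_extract is never copied.
    child["depth"] = str(depth - 1)
    child["keep_random"] = str(max(keep_random - 1, 0))
    child["keep_all"] = str(max(keep_all - 1, 0))
    child["url"] = found_url
    return child


parent = {
    "all": "1", "deep_extract": "1", "random": "sa7ff8pjss",
    "depth": "2", "keep_random": "1", "keep_all": "2",
    "url": "https://example.com/",
}
print(derive_child(parent, "https://example.com/page"))
```

Calling derive_child again on the result gives a child with depth 0 and no random parameter, since keep_random has already reached 0 at that point.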
Tracker items are different from the items in the txt files in this repository. These items use the same parameters as the items in the txt files, but they are structured differently. They are formatted as custom:PARAMS where PARAMS is a URL-encoded set of parameters.
The previous examples can be formatted as items that go into the tracker. They give, respectively, the following items:
- custom:url=https%3A%2F%2Fexample.com%2F&all=1&deep_extract=1
  decodes to {'url': 'https://example.com/', 'all': 1, 'deep_extract': 1}.
- custom:url=https%3A%2F%2Fexample.com%2F&all=1&deep_extract=1&random=sa7ff8pjss&depth=2&keep_random=1&keep_all=2
  decodes to {'url': 'https://example.com/', 'all': 1, 'deep_extract': 1, 'random': 'sa7ff8pjss', 'depth': 2, 'keep_random': 1, 'keep_all': 2}.
  Here, RANDOM is replaced by sa7ff8pjss as new random string. The previous example noted that this random string sa7ff8pjss will also be copied over to any new items queued from this item. These new items are found and queued directly from the warrior.
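The custom:PARAMS format can be reproduced with Python's standard urllib.parse module, as in the minimal sketch below. The helper names to_tracker_item and from_tracker_item are hypothetical. Placing url first mirrors the examples above and is not a documented requirement, and urlencode encodes spaces as +, which may or may not match what the project itself emits.

```python
from urllib.parse import parse_qsl, urlencode

PREFIX = "custom:"


def to_tracker_item(params):
    """Encode a parameter dict as custom:PARAMS, with url placed first."""
    ordered = [("url", params["url"])]
    ordered += [(key, value) for key, value in params.items() if key != "url"]
    return PREFIX + urlencode(ordered)


def from_tracker_item(item):
    """Decode custom:PARAMS back into a dict of string values."""
    if not item.startswith(PREFIX):
        raise ValueError("not a custom item: " + item)
    return dict(parse_qsl(item[len(PREFIX):]))


item = to_tracker_item({"all": "1", "deep_extract": "1", "url": "https://example.com/"})
print(item)
print(from_tracker_item(item))
```

Decoded values remain strings in this sketch, whereas the decoded examples above show numbers; a caller would convert them as needed.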