Scrapy feed export storage backend for Azure Storage.
Requirements:

- Python 3.8+

Install with:

```shell
pip install git+https://github.com/scrapy-plugins/scrapy-feedexporter-azure-storage
```
Add this storage backend to the FEED_STORAGES Scrapy setting. For example:

```python
# settings.py
FEED_STORAGES = {'azure': 'scrapy_azure_exporter.AzureFeedStorage'}
```
Configure authentication via any of the following settings:
- `AZURE_CONNECTION_STRING`
- `AZURE_ACCOUNT_URL_WITH_SAS_TOKEN`
- `AZURE_ACCOUNT_URL` and `AZURE_ACCOUNT_KEY`. If using this method, specify both of them. For example:

```python
AZURE_ACCOUNT_URL = "https://<your-storage-account-name>.blob.core.windows.net/"
AZURE_ACCOUNT_KEY = "Account key for the Azure account"
```
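When authenticating with `AZURE_CONNECTION_STRING` instead, the account name and key travel together in one value. A minimal sketch (the string below is a placeholder pattern, not a working credential):

```python
# settings.py
# Placeholder connection string: substitute your storage account's real values.
AZURE_CONNECTION_STRING = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=<your-storage-account-name>;"
    "AccountKey=<your-account-key>;"
    "EndpointSuffix=core.windows.net"
)
```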
Configure in the FEEDS Scrapy setting the Azure URI where the feed needs to be exported:

```python
FEEDS = {
    "azure://<account_name>.blob.core.windows.net/<container_name>/<file_name.extension>": {
        "format": "json",
    }
}
```
The `overwrite` feed option is `False` by default when using this feed export storage backend.

An extra feed option is also provided, `blob_type`, which can be `"BlockBlob"` (default) or `"AppendBlob"`. See Understanding blob types.
The feed options `overwrite` and `blob_type` can be combined to set the write mode of the feed export:

- `overwrite=False` and `blob_type="BlockBlob"` create the blob if it does not exist, and fail if it exists.
- `overwrite=False` and `blob_type="AppendBlob"` append to the blob if it exists and it is an `AppendBlob`, and create it otherwise.
- `overwrite=True` overwrites the blob, even if it exists. The `blob_type` must match that of the target blob.
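For instance, to keep appending items to the same blob across runs, the two options above can be combined in the FEEDS setting. A sketch, reusing the URI placeholders from the earlier example (`items.jl` is an illustrative file name):

```python
# settings.py
FEEDS = {
    "azure://<account_name>.blob.core.windows.net/<container_name>/items.jl": {
        "format": "jsonlines",
        "overwrite": False,         # keep data from previous runs
        "blob_type": "AppendBlob",  # append to the blob on each run
    },
}
```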
To use Azure Blob Storage with Scrapy media pipelines, just add the pipeline to the ITEM_PIPELINES Scrapy setting:
```python
ITEM_PIPELINES = {
    "scrapy_azure_exporter.AzureFilesPipeline": 1,
}
```

You can use Azurite as a storage emulator for Azure Blob Storage and test your application locally. Just append or set the feed storage to `azurite`:
```python
# settings.py
FEED_STORAGES = {'azurite': 'scrapy_azure_exporter.AzureFeedStorage'}
```

And add the Azurite URI to the FEEDS setting:
```python
FEEDS = {
    "azurite://<ip>:<port>/<account_name>/<container_name>/[<file_name.extension>]": {
        # ...
    }
}
```

Finally, run your Scrapy project as you usually would with FilesPipeline or ImagesPipeline.
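The `azurite://` URI above packs host, port, account, container, and blob name into one string. As an illustration of that layout (not part of the plugin's API), Python's `urllib` can take such a URI apart; the example uses Azurite's default local blob endpoint and its well-known development account name:

```python
from urllib.parse import urlparse

def split_azurite_uri(uri):
    """Split an azurite:// feed URI into (host, port, account, container, blob)."""
    parsed = urlparse(uri)
    # Path holds "<account_name>/<container_name>/<file_name.extension>".
    account, container, blob = parsed.path.lstrip("/").split("/", 2)
    return parsed.hostname, parsed.port, account, container, blob

print(split_azurite_uri("azurite://127.0.0.1:10000/devstoreaccount1/feeds/items.json"))
# → ('127.0.0.1', 10000, 'devstoreaccount1', 'feeds', 'items.json')
```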