AWS SAM application and client for scraping website og meta tags and content via css selectors.
- Git
- AWS SAM Cli
$ git clone https://github.com/68publishers/web-scraper-lambda.git
$ cd web-scraper-lambda
$ sam build
$ sam deploy --guidedThe first option is to download the client as a module.
$ npm i --save @68publishers/web-scraper-client
# or
$ yarn add @68publishers/web-scraper-clientAnd import it in your project.
import WebScraperClient from '@68publishers/web-scraper-client';
// or
const WebScraperClient = require('@68publishers/web-scraper-client');Or you can import the client into the browser from the CDN
<script src="https://unpkg.com/@68publishers/web-scraper-client/dist/web-scraper-client.min.js"></script>The client must be initialized with the URL of your lambda function and an optional configuration object.
var client = new WebScraperClient(
'https://<gateway>.execute-api.<region>.amazonaws.com/<stage>/scrap',
{} // optional configuration
);Optional configuration values table:
| Option path | Type | Default | Description |
|---|---|---|---|
cache.storage |
null or storage |
null |
Pass localStorage or sessionStorage or any compatible storage for enabled caching. |
cache.ttl |
int |
3600 |
Cache expiration in seconds. |
cache.prefix |
string |
"web-scraper-cache:" |
Prefix for cache item keys. |
To scrap data from a web page, call the scrap method with the desired URL. You can use the second optional queries argument to retrieve additional data.
The value of the argument should be an object whose keys are arbitrary names and whose values are CSS selectors, such as #main > .header > .title.
If you need an attribute value, add @attributeName to the end of the selector, for example #gallery > img @src.
// get only og meta tags
client.scrap('https://wwww.website-to-scrap.com/test')
.then(response => {
// do anything with parsed response
})
.catch(e => {
// whoops
});
// get og meta tags and some additional data
client.scrap(
'https://wwww.website-to-scrap.com/test',
{
pageLinks: "a @href",
galleryImages: "#product_gallery img @src",
productName: "#main .product-card > .product-name",
}
).then(response => {
// do anything with parsed response
}).catch(e => {
// whoops
});The response object contains all parsed meta tags and "queries".
client.scrap(/*...*/).then(response => {
var url = response.requestUrl; // url from which the data was scraped
var allMeta = response.meta(); // returns all found og meta tags
var ogTitle = response.meta('ogTitle', ''); // return the specific meta tag, the second argument is the default value
var pageLinks = response.queryValues('pageLinks', []) // return all found page links
var galleryImages = response.queryValues('galleryImages', []) // return all gallery images
var productName = response.queryValue('productName', 'Unknown product'); // the method `queryValue` returns the first value in an array
var productNameError = response.queryError('productName'); // the method `queryError` returns an error message (for example if passed css selector is invalid) or false
});The cache must be enabled in the client configuration.
var client = new WebScraperClient(
'https://<gateway>.execute-api.<region>.amazonaws.com/<stage>/scrap',
{
cache: {
storage: window.sessionStorage, // or window.localStorage
ttl: 3600, // expiration in seconds
prefix: 'web-scraper-cache:', // prefix for cache keys
},
}
);