Commit 659c8f8

Implemented initial setup script and wrote README
1 parent 01e0907

2 files changed: +99 -0

.github/workflows/scrape.yml

Lines changed: 81 additions & 0 deletions

```yaml
name: Scrape

on:
  push:
  workflow_dispatch:
  schedule:
    # Daily at 6:23 AM UTC
    - cron: '23 6 * * *'
    # For hourly at 42 minutes past the hour: '42 * * * *'

permissions:
  contents: write

jobs:
  setup: # Delete this job after it first runs if you like
    runs-on: ubuntu-latest
    if: ${{ !github.event.repository.is_template }}
    steps:
      - uses: actions/checkout@v4
        if: ${{ always() && !hashFiles('scrape.sh') }}
      - name: Create scrape.sh (using github context)
        if: ${{ always() && !hashFiles('scrape.sh') }}
        run: |
          if [ ! -f "scrape.sh" ]; then
            echo '#!/bin/bash' > scrape.sh
            if [[ "$REPO_DESC" == http://* ]] || [[ "$REPO_DESC" == https://* ]]; then
              echo "wget $REPO_DESC" >> scrape.sh
            else
              echo '# wget https://www.example.com/' >> scrape.sh
            fi
            chmod +x scrape.sh
          fi
          # Now push that to git
          git config user.name "Automated"
          git config user.email "[email protected]"
          git add scrape.sh
          timestamp=$(date -u)
          git commit -m "${timestamp}" || exit 0
          git pull --rebase
          git push
        env:
          REPO_DESC: ${{ github.event.repository.description }}

  scrape:
    runs-on: ubuntu-latest
    if: ${{ !github.event.repository.is_template }}
    steps:
      - uses: actions/checkout@v4
      # Uncomment to use Python:
      # - name: Set up Python 3.13
      #   uses: actions/setup-python@v5
      #   with:
      #     python-version: "3.13"
      #     cache: "pip"
      # - name: Install dependencies
      #   run: |
      #     pip install -r requirements.txt
      # Uncomment to use Playwright via shot-scraper (put shot-scraper in requirements.txt):
      # - name: Cache Playwright browsers
      #   uses: actions/cache@v4
      #   with:
      #     path: ~/.cache/ms-playwright/
      #     key: ${{ runner.os }}-browsers
      # - name: Install Playwright dependencies
      #   run: |
      #     shot-scraper install
      - name: Run the scraper
        run: |
          if [ ! -x scrape.sh ]; then
            chmod 755 scrape.sh
          fi
          ./scrape.sh
      - name: Commit and push
        run: |-
          git config user.name "Automated"
          git config user.email "[email protected]"
          git add -A
          timestamp=$(date -u)
          git commit -m "${timestamp}" || exit 0
          git pull --rebase
          git push
```
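
For example, if the repository description were set to a hypothetical URL such as `https://www.example.com/data.json`, the setup job above would generate a `scrape.sh` along these lines:

```bash
#!/bin/bash
wget https://www.example.com/data.json  # example URL taken from the repository description
```

If the description is not a URL, the script is instead created with a commented-out placeholder `wget` line for you to edit.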

README.md

Lines changed: 18 additions & 0 deletions

# git-scraper-template

Template repository for setting up a new [git scraper](https://simonwillison.net/2020/Oct/9/git-scraping/) using GitHub Actions.

## How to use this

Visit https://github.com/simonw/git-scraper-template/generate

Pick a name for your new repository, then paste **the URL** of the page you would like to scrape into the **description field** (including the `http://` or `https://`). JSON works best, but any URL will be fetched and saved.

Then click **Create repository from template**.

Your new repository will be created, and a script will run which will do the following:

- Add a `scrape.sh` script to your repository which uses `wget` to fetch the URL you requested
- Run that `wget` command, write the result to the repository and commit it
- Configure a schedule to run this script once every 24 hours

You can edit `scrape.sh` to customize what is scraped, and you can edit `.github/workflows/scrape.yml` to change how often the scraping happens.
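
As an illustrative sketch (the URL and the use of `jq` are assumptions, not part of the template), a customized `scrape.sh` might fetch a JSON endpoint and pretty-print it so each run produces a readable, line-oriented diff:

```bash
#!/bin/bash
# Hypothetical customization: fetch a JSON API and pretty-print it
# with jq (preinstalled on GitHub's ubuntu-latest runners) so that
# changes show up as small, reviewable diffs in the commit history.
curl -s https://www.example.com/api/items.json | jq . > items.json
```
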
If you want to use Python in your scraper, you can uncomment the relevant block in `scrape.yml` and add a `requirements.txt` file to your repository containing any dependencies you need.
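
As a minimal sketch (assuming a hypothetical `scrape.py` that you write yourself), `scrape.sh` could then simply delegate to Python:

```bash
#!/bin/bash
# Hypothetical: hand the actual scraping off to your own Python script.
# List its dependencies (e.g. requests) in requirements.txt so the
# workflow's "Install dependencies" step installs them.
python scrape.py
```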
