Simple GitHub scraper

This is a template repository for a simple GitHub scraper, a technique pioneered by Simon Willison.

The 'simple' part

This template only supports very simple fetching and committing of a JSON data file from somewhere on the internet. For scraping more complex sites, try the better-scrape-template.

Replace https://www.example.com/data.json in the fetch.yaml file with the URL of the data you want to scrape.

Commit and push the repo to GitHub and you're ready to go.

By default the scraper will run once per week, but you can change the cron schedule in the fetch.yaml file.

Data is stored in data.json.
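
To make the fetch step concrete, here is a minimal Python sketch of what "fetch a JSON file and store it in data.json" amounts to. It is an illustration only, not the template's actual workflow, which may well use a different tool. The function name fetch_json, the timeout, and the choice to re-serialise the JSON with sorted keys (so that commits only show real changes) are all assumptions made for this example.

```python
import json
import urllib.request
from pathlib import Path


def fetch_json(url: str, dest: str = "data.json", timeout: float = 30.0) -> bool:
    """Download JSON from `url` and write it to `dest`.

    Returns True if the file on disk changed, False otherwise.
    Hypothetical helper for illustration; not part of the template.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()

    # Parse first so that an invalid response raises instead of being saved.
    data = json.loads(raw)

    # Assumption: normalise formatting so the git diff reflects data changes,
    # not changes in whitespace or key order.
    text = json.dumps(data, indent=2, sort_keys=True) + "\n"

    path = Path(dest)
    changed = not path.exists() or path.read_text(encoding="utf-8") != text
    if changed:
        path.write_text(text, encoding="utf-8")
    return changed
```

A scheduled job would call fetch_json with the data URL and commit data.json only when it returns True.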

You may need to update the permissions on the new repository so that workflows can commit to it: under Settings > Actions > General, set "Workflow permissions" to "Read and write permissions".

Using the scraped data

By default, the scraper updates the data as a JSON file in the repository. The repo therefore always contains the latest version of the data, while the repository history contains every version of the data since scraping began.

This makes a time series analysis of the data possible, though not exactly straightforward. The git-history tool can be used to extract the full history into an SQLite database.
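
For a quick look at the history without building a database, the versions can also be read directly from git. The sketch below is not the git-history tool; it is a simpler alternative that shells out to the standard git log and git show commands. The function name data_history is hypothetical, and the sketch assumes that git is installed, that repo_dir is the repository root, and that data.json exists in every commit that touches it.

```python
import json
import subprocess


def data_history(repo_dir: str = ".", filename: str = "data.json"):
    """Yield (commit_hash, iso_date, parsed_json) for each commit that
    changed `filename`, oldest first.

    Hypothetical helper for illustration; assumes `repo_dir` is the repo root.
    """
    # %H is the full commit hash, %cI the committer date in strict ISO 8601.
    log = subprocess.run(
        ["git", "log", "--reverse", "--format=%H %cI", "--", filename],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout

    for line in log.splitlines():
        commit, date = line.split(" ", 1)
        # `git show <commit>:<path>` prints the file as it was at that commit.
        blob = subprocess.run(
            ["git", "show", f"{commit}:{filename}"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        ).stdout
        yield commit, date, json.loads(blob)
```

Each yielded tuple pairs a snapshot of the data with its commit date, which is enough to build a simple time series in memory.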

drzax/simple-scrape-template
