Scheduled Scraping#

Data is only useful if it’s fresh.

GitHub Actions Cron#

You can run scrapers for free on a schedule using GitHub Actions.

name: Daily Scraper
on:
  schedule:
    - cron: '0 0 * * *' # Every day at midnight

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python scraper.py
      # Commit changes back to repo or push to DB

Deduplication#

When running daily, don’t save the same data twice.

  • Hash the content or use a unique ID (like an article URL).
  • Check against your database before inserting.