LLM Website Scraping#
LLM-powered website scraping represents a paradigm shift from traditional web scraping. Instead of writing explicit code to navigate DOM structures and CSS selectors, you describe what data you want and let the LLM figure out how to extract it.
This approach works for both simple public pages and complex scenarios requiring authentication, JavaScript rendering, and dynamic interactions. LLMs can write scraping code, control browsers, adapt to layout changes, and even handle captchas through browser automation.
In this module, you’ll learn:
- Chrome Remote Debugging: Control browsers programmatically for authenticated scraping
- Agentic scraping: Let LLMs write and execute scraping code autonomously
- Vibe scraping workflow: Describe desired data instead of writing selectors
- Handling dynamic content: Login walls, JavaScript rendering, hover elements
- Multi-strategy approaches: HTML, APIs, browser automation, and hybrid methods
Why LLMs for web scraping?#
Traditional web scraping is brittle and time-consuming:
- Selectors break when websites change layout
- Authentication requires complex session management
- Dynamic content needs headless browsers and wait strategies
- Maintenance consumes developer time as sites evolve
LLMs solve these problems by:
- Adapting to changes: Describe data semantically, not structurally
- Handling complexity: Write browser automation code on the fly
- Multiple strategies: Try HTML parsing, then APIs, then browser control
- Self-healing: Debug and fix scraping errors autonomously
Key insight: Don’t write scraping code—describe the data you want and let the LLM choose the best extraction strategy.
Vibe scraping workflow#
The core principle: outcomes over implementation.
Traditional scraping#
```python
# You write this explicitly
from bs4 import BeautifulSoup
import requests

response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.content, "html.parser")
products = soup.find_all("div", class_="product-card")

data = []
for product in products:
    title = product.find("h2", class_="title").text
    price = product.find("span", class_="price").text
    data.append({"title": title, "price": price})
```

Vibe scraping#
```
Search for "AI website traffic" and scrape the content.
Convert into usable data that I can visualize.
```

The LLM decides:
- Whether to use `requests` + BeautifulSoup, Playwright, or Chrome debugging
- Which selectors to target
- How to structure output (CSV, JSON, database)
- Error handling and retries
Chrome Remote Debugging setup#
Chrome Remote Debugging allows programmatic control of Chrome, enabling authenticated scraping without explicit credential handling.
Step 1: Launch Chrome in debug mode#
Windows:
```
"C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222 --user-data-dir="C:\chrome-debug-profile"
```

Mac:

```
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir="/tmp/chrome-debug-profile"
```

Linux:

```
google-chrome --remote-debugging-port=9222 --user-data-dir="/tmp/chrome-debug-profile"
```

This opens a new Chrome window with remote control enabled.
Step 2: Verify debugging works#
Open http://localhost:9222/json in a browser on the same machine. You should see a JSON list of open tabs, each with a `webSocketDebuggerUrl`.
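To see what a client does with that response, the sketch below (a hypothetical stdlib-only helper, not part of any agent) picks out the page targets and their WebSocket URLs from a `/json` payload:

```python
import json

def list_page_targets(devtools_json: str) -> list[dict]:
    """Return page targets (tabs) from a DevTools /json response."""
    targets = json.loads(devtools_json)
    return [
        {"title": t.get("title", ""), "ws_url": t.get("webSocketDebuggerUrl", "")}
        for t in targets
        if t.get("type") == "page"
    ]

# Sample payload in the shape Chrome returns from http://localhost:9222/json
sample = json.dumps([
    {"type": "page", "title": "Example",
     "webSocketDebuggerUrl": "ws://localhost:9222/devtools/page/ABC"},
    {"type": "service_worker", "title": "sw"},
])
print(list_page_targets(sample))
# → [{'title': 'Example', 'ws_url': 'ws://localhost:9222/devtools/page/ABC'}]
```

Non-page targets (service workers, extensions) are filtered out because only tabs are useful scraping entry points.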
Step 3: Give context to your coding agent#
In Codex, Claude Code, or Cursor, provide this context:
```
I have Chrome remote debugging running on localhost:9222.
The browser is already logged into [website name].
[Your scraping task here]
```

The agent will:
- Connect to Chrome via the Chrome DevTools Protocol
- Navigate to pages
- Extract data
- Handle JavaScript rendering automatically
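If you want to attach to the same debug port yourself rather than through an agent, Playwright can connect over CDP. A minimal sketch, assuming Playwright is installed and Chrome is running with `--remote-debugging-port=9222` (the function name is illustrative):

```python
def dump_title(cdp_url: str = "http://localhost:9222") -> str:
    """Attach to a running Chrome and return the first open tab's title."""
    # Imported inside the function so the module loads even without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(cdp_url)
        context = browser.contexts[0]  # the existing profile, cookies intact
        page = context.pages[0]        # first open tab
        return page.title()
```

Because `connect_over_cdp` reuses the running browser's profile, any session you are logged into is available without handling credentials in code.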
Authenticated website scraping#
One of the most powerful LLM scraping use cases: extracting data from sites behind login walls.
Example: LinkedIn invites analysis#
```
I have Chrome remote debugging running on localhost:9222.
The browser is already logged into LinkedIn.

Go to my LinkedIn invites page. Of the 30 most recent invites,
how many people are from the IT industry vs outside it?

Save the details in invites.md with:
- Name
- Title
- Company
- Industry classification
```

The LLM will:
- Navigate to the invites page
- Identify invite cards in the DOM
- Extract text content
- Classify industries (using its world knowledge)
- Structure data and save to markdown
Result: In ~2 minutes, you get analysis that would take 30+ minutes manually.
Example: Salesforce data extraction#
```
I have Chrome remote debugging running on localhost:9222.
The browser is logged into Salesforce.

Navigate to [specific Salesforce URL].
Extract all data for profile ID: [ID].
Include any hover-over tooltips or hidden content.
Save as salesforce_data.json.
```

The LLM can:
- Trigger hovers to reveal tooltips
- Wait for lazy-loaded content
- Handle dynamic JavaScript rendering
- Export structured JSON
Security considerations#
Pros:
- No credential storage in code
- Browser maintains session state
- Full control over what’s accessed
- Can review actions in real-time
Cons:
- Agent can access anything you’re logged into
- Potential for unintended actions
- May violate site terms of service
Best practices:
- Use dedicated Chrome profiles for scraping (don’t mix with personal browsing)
- Review prompts carefully before execution in sensitive environments
- Start with read-only operations before allowing writes
- Check site terms of service—some sites explicitly ban automated access
- Prefer official APIs when available to avoid account bans
Handling dynamic content#
Modern websites use JavaScript heavily. LLMs can handle this through browser automation.
Scenario 1: Hover-triggered content#
```
On this page, some data appears only when hovering over elements.
Scrape all visible data AND all hover-revealed content.
```

The LLM will:
- Identify hoverable elements
- Trigger hover events via JavaScript
- Wait for content to appear
- Extract revealed data
Scenario 2: Infinite scroll#
```
This page uses infinite scroll. Scrape all items, not just
the initially visible ones. Keep scrolling until no new items appear.
```

The LLM generates code like:
```python
import time

# "driver" is an already-initialized Selenium WebDriver
previous_height = 0
while True:
    # Scroll to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)  # Wait for load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == previous_height:
        break
    previous_height = new_height

# Now extract all items
```

Scenario 3: Lazy-loaded images#
```
This page lazy-loads images as you scroll. Scrape all product
images, ensuring they're fully loaded before extraction.
```

The LLM handles:
- Scrolling to trigger lazy load
- Waiting for `src` attributes to populate
- Distinguishing placeholder vs actual images
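That last distinction is often a simple heuristic. A sketch (the patterns below are common placeholder conventions, not an exhaustive list):

```python
def is_real_image(src: str) -> bool:
    """Heuristic: filter out unloaded or placeholder image sources."""
    if not src:
        return False                   # src not yet populated
    if src.startswith("data:"):
        return False                   # inline placeholder (e.g., 1x1 GIF)
    if "placeholder" in src or "spacer" in src:
        return False
    return src.startswith(("http://", "https://", "//"))

srcs = ["", "data:image/gif;base64,R0lGOD", "https://cdn.example.com/img/p1.jpg"]
print([s for s in srcs if is_real_image(s)])
# → ['https://cdn.example.com/img/p1.jpg']
```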
Multi-strategy scraping#
LLMs excel at trying multiple approaches when one fails.
Strategy 1: HTML parsing (fastest)#
```python
import requests
from bs4 import BeautifulSoup

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
```

Pros: Fast, no browser overhead
Cons: Fails with JavaScript-rendered content
Strategy 2: API discovery (most reliable)#
LLMs can analyze Network tab requests and find hidden APIs:
```
Open DevTools Network tab and navigate the site.
Find the API endpoint that loads this data.
Use that API directly instead of scraping HTML.
```

Pros: Stable, structured data, faster than HTML
Cons: APIs may be authenticated or undocumented
Strategy 3: Browser automation (most powerful)#
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    # Extract data
```

Pros: Handles JavaScript, authentication, complex interactions
Cons: Slower, resource-intensive
Let the LLM choose#
Instead of specifying the strategy upfront:
```
Scrape product prices from [website].
Try the fastest method first. If that fails, try progressively
more complex approaches. Report which method succeeded.
```

The LLM will:
- Attempt `requests` + BeautifulSoup
- If JavaScript-rendered, switch to Playwright
- If an API exists, use that instead
- Document the successful approach for future runs
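The fallback chain itself is simple to express. A sketch with stand-in strategy functions (real ones would wrap `requests`, an API client, and Playwright):

```python
def scrape_with_fallback(strategies):
    """Try (name, func) strategies in order; return the first success."""
    errors = {}
    for name, func in strategies:
        try:
            return name, func()
        except Exception as exc:  # a real scraper would catch narrower errors
            errors[name] = exc
    raise RuntimeError(f"All strategies failed: {errors}")

# Stand-ins: HTML parsing fails (JS-rendered page), the API works.
def html_strategy():
    raise ValueError("no product cards in static HTML")

def api_strategy():
    return [{"name": "Widget", "price_usd": 9.99}]

winner, data = scrape_with_fallback([("html", html_strategy), ("api", api_strategy)])
print(winner)  # → api
```

Recording `winner` is what lets the LLM "document the successful approach" for the next run.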
Extracting structured data#
LLMs understand semantic structure, not just HTML structure.
Example: Product listings#
Traditional scraping:
```python
products = soup.find_all("div", class_="product-grid-item")
```

Vibe scraping:
```
Extract all products on this page. For each, include:
- Product name
- Price (convert to USD if needed)
- Rating (as decimal, e.g., 4.5)
- Number of reviews
- Availability status

Output as products.json with this schema:
{
  "products": [
    {"name": "...", "price_usd": 0.00, "rating": 0.0, "reviews": 0, "available": true}
  ]
}
```

The LLM:
- Identifies products semantically (not by class name)
- Normalizes currencies
- Extracts nested data (ratings from star images)
- Validates schema compliance
Example: Tables with merged cells#
```
This page has a complex table with merged cells and nested headers.
Convert it to a flat CSV where each row is fully denormalized.
```

The LLM handles:
- Interpreting rowspan/colspan
- Carrying forward merged cell values
- Creating proper hierarchical headers
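Carrying forward merged values reduces to a forward fill once the table is parsed into a grid with gaps. A minimal sketch (rows use `None` where a cell was merged away):

```python
def forward_fill(rows):
    """Fill None cells with the value from the same column in the row above."""
    filled, previous = [], []
    for row in rows:
        new_row = [
            cell if cell is not None else (previous[i] if i < len(previous) else None)
            for i, cell in enumerate(row)
        ]
        filled.append(new_row)
        previous = new_row
    return filled

# A rowspan=2 "North" region flattened into two denormalized rows
rows = [["North", "Store A", 120], [None, "Store B", 95]]
print(forward_fill(rows))
# → [['North', 'Store A', 120], ['North', 'Store B', 95]]
```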
Making scraping repeatable#
Once a scraping session succeeds, convert it into reusable code.
Step 1: Successful scrape#
```
Scrape all job listings from this page, including:
- Job title
- Company
- Location
- Salary range
- Posted date
```

Step 2: Extract as script#
```
Great! Now create a standalone Python script scrape_jobs.py
that reproduces this scraping with:
- Proper error handling
- Retry logic with exponential backoff
- Logging to scrape.log
- Output to jobs.csv
- Command-line argument for URL

Include a README.md with usage instructions.
```

Step 3: Schedule execution#
```
Create a GitHub Actions workflow that:
- Runs this scraper daily at 9 AM UTC
- Commits new jobs to the repo
- Sends email if scraping fails
```

Result: One-time vibe scraping becomes a production data pipeline.
Error handling and debugging#
LLMs can debug scraping failures autonomously.
Scenario: Selector not found#
Error:

```
Element not found: .product-card
```

Prompt:
```
The scraper failed with "Element not found: .product-card".
Inspect the page HTML and find the correct selector.
Update the scraper and try again.
```

The LLM:
- Fetches current page HTML
- Identifies product elements by semantic meaning
- Updates selectors
- Re-runs scraper
Scenario: Rate limiting#
Error:

```
HTTP 429 Too Many Requests
```

Prompt:
```
Scraping failed due to rate limiting. Modify the scraper to:
- Add delays between requests (2-5 seconds)
- Rotate user agents
- Implement exponential backoff
- Resume from last successful page
```

The LLM adds resilience without you writing retry logic.
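It helps to recognize the backoff pattern when reviewing generated code. A sketch with a stand-in flaky function (a real scraper would wrap the HTTP request and actually sleep between attempts):

```python
import time

def retry_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Stand-in: fails twice (simulated 429), then succeeds
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("HTTP 429 Too Many Requests")
    return "page content"

print(retry_with_backoff(flaky_fetch, sleep=lambda s: None))  # → page content
```

Passing `sleep` as a parameter keeps the function testable without real delays.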
Scenario: CAPTCHA#
```
This site has CAPTCHAs. Since we're using Chrome remote debugging
with a logged-in session, navigate to a logged-in URL that bypasses
the CAPTCHA, then proceed with scraping.
```

LLMs can find workarounds (authenticated endpoints, API access) rather than attempting CAPTCHA solving.
Comparing data across pages#
LLMs excel at comparative scraping.
Example: Price comparison#
```
Scrape product prices for "wireless headphones" from:
- Amazon
- Best Buy
- Walmart

Create a comparison table with:
- Product name (normalized across sites)
- Price at each retailer
- Prime/shipping info
- Rating average

Highlight best deals (>20% cheaper than alternatives).
```

The LLM:
- Writes three separate scrapers
- Normalizes product names (e.g., “Sony WH-1000XM5” variations)
- Calculates savings
- Formats comparative output
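Name normalization is the fiddly part of cross-site comparison. A sketch of the kind of canonicalization involved (the noise-word list is illustrative, not a complete matcher):

```python
import re

def normalize_name(name: str) -> str:
    """Canonicalize a product title for cross-site matching."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9 ]", " ", name)  # drop punctuation
    noise = {"wireless", "headphones", "new", "2024", "black"}
    tokens = [t for t in name.split() if t not in noise]
    return " ".join(tokens)

listings = [
    "Sony WH-1000XM5 Wireless Headphones (Black)",
    "SONY WH-1000XM5 - Wireless Headphones, New",
]
print({normalize_name(n) for n in listings})  # → {'sony wh 1000xm5'}
```

Both listings collapse to the same key, which is what makes per-retailer price columns line up.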
Monitoring and change detection#
Set up scrapers that alert you to changes.
Example: Competitor pricing#
```
Create a scraper that:
1. Scrapes competitor prices daily
2. Compares to yesterday's prices
3. If any price drops >10%, send email alert
4. Maintains historical pricing in prices.db
```

The LLM generates:
- Scraper script
- SQLite database schema
- Email notification code
- GitHub Actions workflow for scheduling
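The comparison step is plain arithmetic. A sketch with prices keyed by product ID (a real version would read them from the SQLite database):

```python
def find_price_drops(yesterday: dict, today: dict, threshold: float = 0.10):
    """Return {product: (old, new)} for drops larger than threshold."""
    drops = {}
    for product, old_price in yesterday.items():
        new_price = today.get(product)
        if new_price is not None and new_price < old_price * (1 - threshold):
            drops[product] = (old_price, new_price)
    return drops

yesterday = {"widget": 100.0, "gadget": 50.0}
today = {"widget": 85.0, "gadget": 49.0}
print(find_price_drops(yesterday, today))  # → {'widget': (100.0, 85.0)}
```

The widget's 15% drop crosses the threshold and would trigger the alert; the gadget's 2% drop does not.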
Legal and ethical considerations#
LLM scraping is powerful but comes with responsibilities.
When scraping is appropriate#
- ✅ Public data for analysis, research, or journalism
- ✅ Your own data on third-party platforms (e.g., exporting your LinkedIn connections)
- ✅ Permitted under ToS or with explicit permission
- ✅ API alternatives don’t exist or are prohibitively expensive
When to avoid scraping#
- ❌ Explicit ToS prohibition (e.g., LinkedIn, Facebook prohibit automated access)
- ❌ Personal data without consent
- ❌ Bypassing paywalls for commercial advantage
- ❌ Overloading servers with aggressive requests
Best practices#
- Read the ToS before scraping any site
- Respect robots.txt even if not legally binding
- Rate limit requests (1-2 per second maximum, ideally slower)
- Identify your bot with a custom User-Agent
- Provide contact info in User-Agent for site owners to reach you
- Consider API alternatives first—they’re designed for access
- Don’t scrape personal data without explicit consent
- Monitor for bans and stop immediately if blocked
Example respectful User-Agent:
```python
headers = {"User-Agent": "ResearchBot/1.0 (+https://yoursite.com/bot; [email protected])"}
```

Tools and platforms#
Coding agents for scraping#
- Codex: Voice prompting, repository integration
- Claude Code: Strong at autonomous debugging
- Cursor: Desktop IDE with inline scraping generation
- Jules: Web-based, good for quick experiments
Supporting libraries#
Python scraping ecosystem (LLMs know these well):
```python
# HTML parsing
import requests
from bs4 import BeautifulSoup

# Browser automation
from playwright.sync_api import sync_playwright
from selenium import webdriver

# Chrome debugging protocol
import asyncio
from pyppeteer import connect

# Data extraction
import pandas as pd
import json
```

Debugging tools#
- Chrome DevTools: Inspect Network tab for API endpoints
- Postman: Test API requests before coding
- Insomnia: Alternative REST client
Complete example: Job listing aggregator#
Full workflow from vibe scraping to deployed pipeline.
Initial prompt#
```
I want to track data science job postings daily. Scrape:
- Indeed
- LinkedIn (if I'm logged in via Chrome debugging)
- Glassdoor

For each job, extract:
- Title
- Company
- Location (remote vs office)
- Salary range (normalize to yearly USD)
- Required skills (extract from description)
- Posted date

Output as jobs.csv with columns:
source, title, company, location, salary_min, salary_max, skills, posted_date

Handle pagination—get first 100 jobs from each site.
```

Refinement prompt#
```
Good start! Now enhance this scraper to:
1. Deduplicate across sources (same job, multiple sites)
2. Categorize location: Remote, Hybrid, On-site
3. Extract seniority level: Entry, Mid, Senior, Lead
4. Score "hotness" based on recency and salary
5. Save to SQLite database jobs.db
6. Generate a daily report: top_jobs.md showing 10 hottest jobs

Create scraper.py that does all of this.
```

Production prompt#
```
Now create a production-ready deployment:
1. Add comprehensive error handling and logging
2. Create config.yaml for URLs, selectors, etc.
3. Add retry logic with exponential backoff
4. Implement --test-mode that runs on sample data
5. Generate GitHub Actions workflow:
   - Runs daily at 9 AM UTC
   - Commits new jobs to repo
   - Posts top 5 jobs to Slack webhook (provide env var)
6. Write DEPLOYMENT.md with setup instructions

Organize as:
scraper/
  __init__.py
  scrape.py
  database.py
config.yaml
tests/
  test_scraper.py
.github/workflows/daily_scrape.yml
README.md
DEPLOYMENT.md
```

Result: A fully automated job tracking system built entirely through natural language prompting.
Best practices summary#
- Start with Chrome remote debugging for authenticated sites
- Use vibe prompts describing desired data, not implementation
- Let LLMs choose strategies (HTML → API → browser automation)
- Make scrapers repeatable by extracting standalone scripts
- Handle errors proactively with retry logic and logging
- Respect site ToS and rate limits
- Extract first, structure later—get data, then ask for formatting
- Version control prompts—they’re your source code now
- Monitor for changes—scraper breakage is inevitable
- Consider APIs first—they’re more reliable than scraping
Real-world use cases#
Data journalism#
```
Scrape all city council meeting minutes from [city website]
for the past year. Extract:
- Meeting date
- Agenda items
- Vote results (by council member)
- Attendance

Identify patterns: most contested topics, attendance rates,
voting blocs.
```

Competitive intelligence#
```
Monitor competitor pricing for our top 50 products daily.
Alert if:
- Any competitor drops price >5%
- New competitor enters market
- Product goes out of stock

Maintain historical pricing in Postgres.
```

Real estate analysis#
```
Scrape property listings in [city] with:
- Price, bedrooms, bathrooms, sqft
- Days on market
- Price history
- Neighborhood

Calculate median price per sqft by neighborhood.
Identify underpriced properties (>20% below neighborhood median).
```

Academic research#
```
Scrape publications from [academic database] for keyword
"climate change adaptation" from 2020-2024.
Extract:
- Title, authors, abstract, year
- Citation count
- Keywords

Analyze trending topics over time.
```

Troubleshooting common issues#
Issue 1: “Element not found”#
Cause: JavaScript-rendered content not loaded yet
Solution:
```
Add wait logic. After navigating to the page, wait for the
element with selector .product-card to appear before scraping.
```

Issue 2: “403 Forbidden”#
Cause: Site blocks bots via User-Agent
Solution:
Use realistic headers:

```
User-Agent: Mozilla/5.0...
Accept-Language: en-US,en;q=0.9
Referer: [previous page URL]
```

Issue 3: Scraped data is incomplete#
Cause: Data loads via AJAX after initial render
Solution:
```
Instead of scraping HTML, find the AJAX endpoint
in Network tab that loads this data. Call that API directly.
```

Issue 4: Scraper works once, then fails#
Cause: Site changes layout or implements anti-bot measures
Solution:
```
Modify scraper to:
1. Take screenshots on failure for debugging
2. Extract data semantically, not by class names
3. Implement fallback selectors
4. Add monitoring that alerts on failure
```

Further resources#
- Web Scraping with JavaScript tutorial
- Chrome DevTools guide
- Scheduled scraping with GitHub Actions
- Beautiful Soup documentation
- Playwright documentation
- Chrome DevTools Protocol
LLM website scraping transforms a brittle, code-intensive process into a flexible, natural-language-driven workflow. By describing desired data rather than extraction mechanics, you gain resilience to site changes and can tackle complex scenarios (authentication, JavaScript, dynamic content) that previously required significant engineering effort.
The key mindset shift: you’re not a coder writing scrapers—you’re a data curator directing an AI to extract information however it sees fit.
