DuckDB + Parquet

SQLite is a good fit for OLTP workloads (many small transactional reads and writes). DuckDB is built for OLAP: analytical queries that scan and aggregate millions of rows.

Parquet

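Parquet is a columnar storage format. It compresses well and is fast to read for analytical queries, because a reader can load only the columns a query needs. It also stores column types, which CSV does not. Prefer Parquet over CSV for large scraped datasets.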

DuckDB

DuckDB runs in-process (like SQLite), so there is no server to set up. It can execute SQL directly over Parquet files without importing them first, including files hosted on AWS S3 (through its httpfs extension).

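import duckdb

# Reading from S3 uses DuckDB's httpfs extension (recent versions can autoload it).
duckdb.execute("INSTALL httpfs")
duckdb.execute("LOAD httpfs")

# Query a remote Parquet file directly. The bucket name is a placeholder,
# and a private bucket also needs S3 credentials configured.
duckdb.sql("""
    SELECT category, count(*) AS n
    FROM 's3://my-bucket/scraped_data.parquet'
    GROUP BY category
""").show()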