Database: SQLite#
Relational databases are used to store data in a structured way. You’ll often access databases created by others for analysis.
PostgreSQL, MySQL, MS SQL, Oracle, etc. are popular databases. But the most installed database is SQLite. It’s embedded into many devices and apps (e.g. your phone, browser, etc.). It’s lightweight but very scalable and powerful.
Watch these introductory videos to understand SQLite and how it’s used in Python (34 min):
There are many non-relational databases (NoSQL) like ElasticSearch, MongoDB, Redis, etc. that you should know about and we may cover later.
Core Concepts:
-- Create a table
CREATE TABLE users (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
email TEXT UNIQUE,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
-- Insert data
INSERT INTO users (name, email) VALUES
('Alice', '[email protected]'),
('Bob', '[email protected]');
-- Query data
SELECT name, COUNT(*) as count
FROM users
GROUP BY name
HAVING count > 1;
-- Join tables
SELECT u.name, o.product
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE o.status = 'pending';Python Integration:
import sqlite3
from pathlib import Path
import pandas as pd
def query_database(db_path: Path, query: str) -> pd.DataFrame:
"""Execute SQL query and return results as DataFrame."""
with sqlite3.connect(db_path) as conn:
return pd.read_sql_query(query, conn)
# Example usage
db = Path("data.db")
df = query_database(
db,
"""
SELECT date, COUNT(*) as count
FROM events
GROUP BY date
""",
)Common Operations:
Database Management
-- Backup database .backup 'backup.db' -- Import CSV .mode csv .import data.csv table_name -- Export results .headers on .mode csv .output results.csv SELECT * FROM table;Performance Optimization
-- Create index CREATE INDEX idx_user_email ON users(email); -- Analyze query EXPLAIN QUERY PLAN SELECT * FROM users WHERE email LIKE '%@example.com'; -- Show indexes SELECT * FROM sqlite_master WHERE type='index';Data Analysis
-- Time series aggregation SELECT date(timestamp), COUNT(*) as events, AVG(duration) as avg_duration FROM events GROUP BY date(timestamp); -- Window functions SELECT *, AVG(amount) OVER ( PARTITION BY user_id ORDER BY date ROWS BETWEEN 3 PRECEDING AND CURRENT ROW ) as moving_avg FROM transactions;
Tools to work with SQLite:
- SQLiteStudio: Lightweight GUI
- DBeaver: Full-featured GUI
- sqlite-utils: CLI tool
- Datasette: Web interface

