Parsing JSON#

JSON is everywhere—APIs, logs, configuration files—and its nested or large structure can challenge memory and processing. In this tutorial, we’ll explore tools to flatten, stream, and query JSON data efficiently.

For example, we’ll often need to process a multi-gigabyte log file from a web service where each record is a JSON object.

JSON Parsing in Python#

Working with JSON often means handling complex nested structures, processing files too large to fit in memory, and extracting specific fields. Here are the key tools and techniques for efficient JSON parsing:

| Tool | Works on | Why |
| --- | --- | --- |
| jq | JSON in the shell | Quick data exploration and pipeline processing |
| JMESPath | JSON in Python | Handle complex queries with a clean syntax |
| ijson | JSON streams in Python | Parse streaming/large JSON files memory-efficiently |
| Pandas | JSON columns in Python | Fast analysis of structured data |
| SQL JSON | JSON in databases | Combine structured and semi-structured data |
| DuckDB | JSON anywhere | Fast analysis of JSON files/databases without loading everything into memory |

Examples:

  • Use Pandas when you need to transform API responses into a DataFrame for further analysis.
  • Leverage ijson when dealing with huge JSON logs where memory is at a premium.
  • Apply jq for quick, iterative exploration directly in your terminal.
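Before reaching for any of these tools, remember that the standard library's `json` module covers the basics. A minimal sketch (the payload and field names are illustrative):

```python
import json

# A nested API-style payload as a raw string
raw = '{"user": {"name": "Alice", "roles": ["admin", "dev"]}, "active": true}'

record = json.loads(raw)  # str -> dict
name = record["user"]["name"]
is_active = record["active"]

# Serialize back to a compact JSON string
compact = json.dumps(record, separators=(",", ":"))
```

The specialized tools below earn their keep when this approach breaks down: files too big for `json.loads`, deeply nested queries, or analysis across many records.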

Practice with these resources:

Command-line JSON Processing with jq#

jq is a versatile command-line tool for slicing, filtering, and transforming JSON. It excels in quick data exploration and can be integrated into shell scripts for automated data pipelines.

Example: Sifting through server logs in JSON Lines format to extract error messages or aggregate metrics without launching a full-scale ETL process.

# Extract specific fields from JSONL
jq -c 'select(.type == "user") | {id, name}' data.jsonl

# Transform JSON structure
jq '.items[] | {name: .name, count: .details.count}' data.json

# Filter and aggregate
jq -s 'group_by(.category) | map({category: .[0].category, count: length})' events.jsonl
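When jq isn't installed, the first filter above has a close stdlib equivalent. This sketch uses inline records standing in for the hypothetical `data.jsonl`:

```python
import json

# Inline stand-in for lines read from data.jsonl
lines = [
    '{"type": "user", "id": 1, "name": "Alice"}',
    '{"type": "event", "id": 2}',
]

# Equivalent of: select(.type == "user") | {id, name}
selected = [
    {"id": rec["id"], "name": rec["name"]}
    for rec in map(json.loads, lines)
    if rec.get("type") == "user"
]
```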

JMESPath Queries#

JMESPath offers a declarative query language to extract and transform data from nested JSON structures without needing verbose code. It’s a neat alternative when you want to quickly pull out specific values or filter collections based on conditions.

Example: Extracting user emails or filtering out inactive records from a complex JSON payload received from a cloud service.

import jmespath

# Example queries
data = {
    "locations": [
        {"name": "Seattle", "state": "WA", "info": {"population": 737015}},
        {"name": "Portland", "state": "OR", "info": {"population": 652503}},
    ]
}

# Find all cities with population > 700000
cities = jmespath.search("locations[?info.population > `700000`].name", data)

Streaming with ijson#

Loading huge JSON files all at once can quickly exhaust system memory. ijson lets you stream and process JSON incrementally. This method is ideal when your JSON file is too large or when you only need to work with part of the data.

Example: Processing a continuous feed from an API that returns a large JSON array, such as sensor data or event logs, while filtering on the fly.

import ijson


def process_large_json(filepath: str) -> list:
    """Process a large JSON file without loading it entirely into memory."""
    results = []

    with open(filepath, "rb") as file:
        # Stream objects under the 'items' key
        parser = ijson.items(file, "items.item")
        for item in parser:  # ijson yields items synchronously
            if item.get("value", 0) > 100:  # Process conditionally
                results.append(item)

    return results
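When the input is JSON Lines rather than one huge array, you don't need ijson at all: reading line by line already streams. A minimal sketch, mirroring the `value > 100` filter from the ijson example:

```python
import io
import json


def process_jsonl(stream) -> list:
    """Stream a JSON Lines source one record at a time."""
    results = []
    for line in stream:  # one JSON object per line
        item = json.loads(line)
        if item.get("value", 0) > 100:  # same conditional filter as above
            results.append(item)
    return results


# Works on any line-iterable: an open file, a socket, or a buffer
sample = io.StringIO('{"value": 150}\n{"value": 50}\n')
filtered = process_jsonl(sample)
```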

Pandas JSON Columns#

Pandas makes it easy to work with tabular data that includes JSON strings. When you receive API data where one column holds nested JSON, flattening these structures lets you analyze and visualize the data using familiar DataFrame operations.

Example: Flattening customer records stored as nested JSON in a CSV file to extract demographic details and spending patterns.

import json
import pandas as pd

# Parse JSON strings in a column, expand to columns
df = pd.DataFrame({"json_col": ['{"name": "Alice", "age": 30}', '{"name": "Bob", "age": 25}']})
expanded = pd.json_normalize(df["json_col"].apply(json.loads))

# Normalize nested JSON from another dataframe column
df = pd.read_csv("data.csv")
df_normalized = pd.json_normalize(
    df["nested_json"].apply(json.loads),
    record_path=["items"],  # List of nested objects to unpack
    meta=["id", "timestamp"],  # Keep these columns from parent
)
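Since the snippet above depends on a local `data.csv`, here is a self-contained illustration of how `record_path` and `meta` interact, using made-up order records:

```python
import pandas as pd

# Hypothetical parent records, each holding a list of nested items
records = [
    {"id": 1, "timestamp": "2024-01-01",
     "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"id": 2, "timestamp": "2024-01-02",
     "items": [{"sku": "C", "qty": 5}]},
]

# One output row per nested item; parent id/timestamp repeated alongside
flat = pd.json_normalize(records, record_path=["items"], meta=["id", "timestamp"])
```

The result has one row per item (three total here), with the unpacked item fields first and the `meta` columns appended after them.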

SQL JSON Functions#

Most popular databases, including SQLite, PostgreSQL, and MySQL, provide built-in JSON functions that let you query and manipulate JSON stored in relational tables. This is especially handy when you have a hybrid data model that combines structured tables with semi-structured JSON columns.

Example: An application that stores user settings or application logs as JSON in a SQLite database, enabling quick lookups and modifications without external JSON parsing libraries.

SELECT
    json_extract(data, '$.name') as name,
    json_extract(data, '$.details.age') as age
FROM users
WHERE json_extract(data, '$.active') = true
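Because SQLite ships with Python, the query above can be tried end to end with the standard library alone (assuming a SQLite build with JSON support, the default in recent versions). The `users` table and its fields are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (data TEXT)")
conn.executemany(
    "INSERT INTO users (data) VALUES (?)",
    [
        ('{"name": "Alice", "details": {"age": 30}, "active": true}',),
        ('{"name": "Bob", "details": {"age": 25}, "active": false}',),
    ],
)

rows = conn.execute(
    """
    SELECT json_extract(data, '$.name'),
           json_extract(data, '$.details.age')
    FROM users
    WHERE json_extract(data, '$.active') = 1  -- JSON true extracts as 1 in SQLite
    """
).fetchall()
```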

DuckDB JSON Processing#

DuckDB shines when analyzing JSON/JSONL files directly, making it a powerful tool for data analytics without the overhead of loading entire datasets into memory. Its plain SQL syntax simplifies exploratory analysis on nested data.

Example: Performing ad-hoc analytics on streaming JSON logs from a web service, such as calculating average response times or aggregating user behavior metrics.

-- read_json_auto infers nested objects as STRUCT columns,
-- so fields are addressed with dot notation rather than json_extract
SELECT
    t."user".name AS name,
    avg(t.metrics.value) AS avg_value
FROM read_json_auto('data/*.jsonl') AS t
GROUP BY 1
HAVING avg_value > 100;