Design: Source Diversity Limit (max articles per source)

Date: 2026-03-23

Scope: Limit the number of articles from the same website across all categories in a synthesis


Context

Generated syntheses can be dominated by a single source (e.g., 8 articles from openai.com across categories). Users want source diversity — at most N articles from the same website, with articles spread across categories rather than clustered in one.

Approach

Add a post-parse filter function that enforces a per-domain article limit after the LLM search pass and before scraping. A new user setting controls the limit.

New User Setting

  • Field: max_articles_per_source in UserSettings
  • Type: i32 (non-optional, matches max_items_per_category pattern)
  • Validation: 1-10
  • Migration: ALTER TABLE user_settings ADD COLUMN max_articles_per_source INTEGER NOT NULL DEFAULT 3
  • Frontend label: "Articles max par source"
  • Note: 10 effectively means "no practical limit for most use cases"
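
The 1-10 validation rule could be checked along these lines. This is a minimal sketch: the function name, the constants, and the error type are assumptions for illustration, not the existing validation code in backend/src/models/settings.rs.

```rust
/// Inclusive bounds taken from the "Validation: 1-10" rule above.
const MIN_ARTICLES_PER_SOURCE: i32 = 1;
const MAX_ARTICLES_PER_SOURCE: i32 = 10;

/// Hypothetical validator (name and error type are assumptions).
/// Returns the value unchanged when it is within bounds.
fn validate_max_articles_per_source(value: i32) -> Result<i32, String> {
    if (MIN_ARTICLES_PER_SOURCE..=MAX_ARTICLES_PER_SOURCE).contains(&value) {
        Ok(value)
    } else {
        Err(format!(
            "max_articles_per_source must be between {} and {}, got {}",
            MIN_ARTICLES_PER_SOURCE, MAX_ARTICLES_PER_SOURCE, value
        ))
    }
}
```

Rejecting out-of-range values (rather than clamping them) is a design choice of this sketch; the spec only states the allowed range.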

Filter Function

Name: limit_articles_per_source

Signature: fn limit_articles_per_source(parsed: Vec<(String, Vec<NewsItem>)>, max_per_source: i32) -> Vec<(String, Vec<NewsItem>)>

Pipeline position: after filter_homepage_urls, before scrape_articles

Domain extraction: Parse the URL with url::Url and extract the host via host_str() (e.g., https://openai.com/blog/post → openai.com). If the URL can't be parsed, keep the article (don't drop on parse failure).

Known limitation: Subdomains are treated as different sources (blog.example.com ≠ www.example.com). This is pragmatic for v1; registrable domain extraction (eTLD+1) can be added later if needed.
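
To illustrate the host-as-source-key behavior and the subdomain limitation, here is a dependency-free sketch. It is a simplified stand-in for the url::Url parse plus host_str() call the design names, not a replacement for it: the helper name source_key is hypothetical, and the hand-rolled parsing ignores cases a real URL parser handles (IPv6 literals, percent-encoding, scheme validation).

```rust
/// Std-only stand-in for parsing with `url::Url` and calling `host_str()`.
/// Returns `None` when no host can be found; per the design, callers treat
/// `None` as "keep the article".
fn source_key(url: &str) -> Option<String> {
    // Require an explicit scheme separator, as an absolute-URL parser would.
    let (_scheme, rest) = url.split_once("://")?;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest.split(|c: char| matches!(c, '/' | '?' | '#')).next()?;
    // Drop optional userinfo ("user:pass@") and port (":8080").
    let host = authority.rsplit('@').next()?;
    let host = host.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        // Hosts are case-insensitive, so normalise before counting.
        Some(host.to_ascii_lowercase())
    }
}
```

With this key, blog.example.com and www.example.com are counted separately, which is the v1 limitation described above.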

Algorithm:

  1. Pass 1 — spread: For each category (in order), keep at most 1 article per domain. Track the first occurrence of each domain's article; move remaining articles from that domain to a "dropped" list.
  2. Cap enforcement: If any domain exceeds max_per_source after pass 1 (possible when categories > limit), trim that domain's articles down to max_per_source, keeping them spread across categories in order.
  3. Pass 2 — fill: Iterate over dropped articles in their original order (categories in order, items within each category in order). Re-add each article to its original category if the domain is still under max_per_source.
  4. Return the filtered list (same category keys, fewer items per category).
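
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the final implementation: NewsItem is reduced to two hypothetical fields, host_of is a crude std-only stand-in for url::Url host extraction, cap enforcement is folded into pass 1 (equivalent to trimming afterwards, because categories are visited in order), and re-added articles return to their original position within the category.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical minimal item; the real NewsItem has more fields.
#[derive(Debug, Clone, PartialEq)]
struct NewsItem {
    title: String,
    url: String,
}

/// Crude stand-in for `url::Url` + `host_str()` (assumption for this sketch).
fn host_of(url: &str) -> Option<String> {
    let (_, rest) = url.split_once("://")?;
    let host = rest.split(|c: char| matches!(c, '/' | '?' | '#' | ':')).next()?;
    if host.is_empty() { None } else { Some(host.to_ascii_lowercase()) }
}

fn limit_articles_per_source(
    parsed: Vec<(String, Vec<NewsItem>)>,
    max_per_source: i32,
) -> Vec<(String, Vec<NewsItem>)> {
    let max = max_per_source.max(0) as usize;
    // Global number of kept articles per host.
    let mut counts: HashMap<String, usize> = HashMap::new();
    // keep[c][i] is true when item i of category c survives.
    let mut keep: Vec<Vec<bool>> =
        parsed.iter().map(|(_, items)| vec![false; items.len()]).collect();
    // (category index, item index, host), in original order.
    let mut dropped: Vec<(usize, usize, String)> = Vec::new();

    // Pass 1 (spread) with the cap applied on the fly: keep the first
    // article per host in each category while the host is under the cap.
    for (c, (_, items)) in parsed.iter().enumerate() {
        let mut seen_here: HashSet<String> = HashSet::new();
        for (i, item) in items.iter().enumerate() {
            match host_of(&item.url) {
                // Unparseable URL: keep the article, never count it.
                None => keep[c][i] = true,
                Some(host) => {
                    let first_in_category = seen_here.insert(host.clone());
                    let n = counts.entry(host.clone()).or_insert(0);
                    if first_in_category && *n < max {
                        *n += 1;
                        keep[c][i] = true;
                    } else {
                        dropped.push((c, i, host));
                    }
                }
            }
        }
    }

    // Pass 2 (fill): re-add dropped articles while their host has room.
    for (c, i, host) in dropped {
        let n = counts.entry(host).or_insert(0);
        if *n < max {
            *n += 1;
            keep[c][i] = true;
        }
    }

    // Rebuild the output with the same category keys and surviving items.
    parsed
        .into_iter()
        .zip(keep)
        .map(|((name, items), flags)| {
            let kept = items
                .into_iter()
                .zip(flags)
                .filter_map(|(item, keep_it)| keep_it.then_some(item))
                .collect();
            (name, kept)
        })
        .collect()
}
```

Tracking survivors with a boolean mask, rather than moving items between lists, keeps the original ordering inside each category without extra bookkeeping.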

Example with max_per_source = 3:

Before:

  • Category A: openai.com×3, techcrunch.com×1
  • Category B: openai.com×2, theverge.com×2

After pass 1 (1 per domain per category):

  • Category A: openai.com×1, techcrunch.com×1
  • Category B: openai.com×1, theverge.com×1
  • Dropped: openai.com×3 (2 from A, 1 from B), theverge.com×1
  • Global: openai=2, techcrunch=1, theverge=1

Cap enforcement: openai=2 ≤ 3, no trimming needed.

After pass 2 (fill up to max, dropped articles re-added to original category):

  • openai has 1 slot left → add 1 openai article back to Category A
  • theverge has 2 slots left → add 1 theverge article back to Category B
  • Final: 3 openai total, 1 techcrunch, 2 theverge

Edge case with max_per_source = 2 and 5 categories, each with 1 openai.com article:

  • After pass 1: 5 openai articles (1 per category), which exceeds the limit of 2.
  • Cap enforcement: trim to 2 openai articles, keeping those in categories A and B (the first two in order) and dropping those in C, D, and E.
  • Pass 2: openai is already at the limit, so no openai articles are re-added.

Integration

parse_llm_output → filter_homepage_urls → limit_articles_per_source → scrape_articles

Call site in run_generation_inner:

let parsed = limit_articles_per_source(parsed, settings.max_articles_per_source);

Files to modify

  • Create: migration 20260323000012_add_max_articles_per_source.sql
  • Modify: backend/src/models/settings.rs — add field to UserSettings, SettingsResponse, UpdateSettingsRequest + validation
  • Modify: backend/src/db/settings.rs — add column to all SQL queries + SettingsRow
  • Modify: backend/src/services/synthesis.rs — add filter function + call it
  • Modify: frontend/src/pages/Settings.tsx — add number input in the generation settings grid
  • Modify: frontend/src/i18n/fr.ts — add label translation
  • Modify: frontend/src/types.ts — add field to Settings type

Unit tests

In synthesis.rs tests:

  • 5 openai.com articles across 2 categories, max=3 → keeps 3, spread across categories
  • All articles from different domains → nothing dropped
  • max_per_source = 1 → at most 1 per domain total
  • More categories than max (5 categories, 1 openai each, max=2) → caps at 2
  • Empty input → empty output
  • Articles with unparseable URLs → kept

What does NOT change

  • LLM prompts — no instruction about source diversity
  • JSON schema — no changes
  • Scraper — no changes
  • Rewrite pass — operates on already-filtered articles