Design: Source Diversity Limit (max articles per source)
Date: 2026-03-23
Scope: Limit the number of articles from the same website across all categories in a synthesis
Context
Generated syntheses can be dominated by a single source (e.g., 8 articles from openai.com across categories). Users want source diversity — at most N articles from the same website, with articles spread across categories rather than clustered in one.
Approach
Add a post-parse filter function that enforces a per-domain article limit after the LLM search pass and before scraping. A new user setting controls the limit.
New User Setting
- Field: max_articles_per_source in UserSettings
- Type: i32 (non-optional, matches max_items_per_category pattern)
- Validation: 1-10
- Migration: ALTER TABLE user_settings ADD COLUMN max_articles_per_source INTEGER NOT NULL DEFAULT 3
- Frontend label: "Articles max par source"
- Note: 10 effectively means "no practical limit for most use cases"
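The 1-10 validation rule could be sketched as below. This is a minimal illustration only: the helper name validate_max_articles_per_source, the two constants, and the String error type are assumptions, not names taken from the existing validation code in settings.rs.

```rust
/// Inclusive bounds from the "Validation: 1-10" rule above.
/// (Constant names are hypothetical.)
const MIN_ARTICLES_PER_SOURCE: i32 = 1;
const MAX_ARTICLES_PER_SOURCE: i32 = 10;

/// Hypothetical helper: returns the value unchanged when it is in range,
/// or a human-readable error message otherwise.
fn validate_max_articles_per_source(value: i32) -> Result<i32, String> {
    if (MIN_ARTICLES_PER_SOURCE..=MAX_ARTICLES_PER_SOURCE).contains(&value) {
        Ok(value)
    } else {
        Err(format!(
            "max_articles_per_source must be between {} and {}, got {}",
            MIN_ARTICLES_PER_SOURCE, MAX_ARTICLES_PER_SOURCE, value
        ))
    }
}
```

With these bounds, the migration default of 3 passes validation, while 0 and 11 are rejected.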
Filter Function
Name: limit_articles_per_source
Signature: fn limit_articles_per_source(parsed: Vec<(String, Vec<NewsItem>)>, max_per_source: i32) -> Vec<(String, Vec<NewsItem>)>
Pipeline position: after filter_homepage_urls, before scrape_articles
Domain extraction: Parse URL with url::Url, extract via host_str() (e.g., https://openai.com/blog/post → openai.com). If URL can't be parsed, keep the article (don't drop on parse failure).
Known limitation: Subdomains are treated as different sources (blog.example.com ≠ www.example.com). This is pragmatic for v1; registrable domain extraction (eTLD+1) can be added later if needed.
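To illustrate the intended behavior (host as the source key, no drop on parse failure, subdomains kept distinct), here is a dependency-free sketch. It is a hand-rolled stand-in for url::Url parsing plus host_str(), not the planned implementation: the function name source_key is hypothetical, and the string splitting is much less strict than a real URL parser (for example, it does not handle IPv6 literals).

```rust
/// Hypothetical stand-in for `url::Url` parsing followed by `host_str()`.
/// Returns `None` when no host can be found; per the design, the caller
/// then keeps the article instead of dropping it.
fn source_key(raw_url: &str) -> Option<String> {
    // Require an explicit scheme separator such as "https://".
    let (_, rest) = raw_url.trim().split_once("://")?;
    // The authority part ends at the first '/', '?' or '#'.
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    // Drop optional "user:pass@" userinfo, then an optional ":port".
    let host_port = authority.rsplit('@').next()?;
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        // Hosts are case-insensitive, so normalize before comparing.
        Some(host.to_ascii_lowercase())
    }
}
```

Under this sketch, https://openai.com/blog/post maps to openai.com, a string without a scheme yields None, and blog.example.com and www.example.com produce different keys, which is the v1 limitation described above.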
Algorithm:
- Pass 1 (spread): For each category (in order), keep at most 1 article per domain. Track the first occurrence of each domain's article; move remaining articles from that domain to a "dropped" list.
- Cap enforcement: If any domain exceeds max_per_source after pass 1 (possible when categories > limit), trim that domain's articles down to max_per_source, keeping them spread across categories in order.
- Pass 2 (fill): Iterate over dropped articles in their original order (categories in order, items within each category in order). Re-add each article to its original category if the domain is still under max_per_source.
- Return the filtered list (same category keys, fewer items per category).
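The passes above can be sketched as follows. This is one possible reading under stated assumptions, not the final implementation: NewsItem is reduced to a hypothetical one-field struct, source_key is a simplified std-only stand-in for url::Url plus host_str(), cap enforcement is folded into pass 1 (a domain already at the limit is not kept in later categories, which gives the same result as trimming afterwards), and re-added articles return to their original position within the category.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in; the real NewsItem has more fields.
#[derive(Debug, Clone, PartialEq)]
struct NewsItem {
    url: String,
}

/// Simplified, std-only stand-in for `url::Url` + `host_str()`.
fn source_key(url: &str) -> Option<String> {
    let (_, rest) = url.split_once("://")?;
    let host = rest
        .split(|c: char| c == '/' || c == '?' || c == '#' || c == ':')
        .next()?;
    if host.is_empty() { None } else { Some(host.to_ascii_lowercase()) }
}

fn limit_articles_per_source(
    parsed: Vec<(String, Vec<NewsItem>)>,
    max_per_source: i32,
) -> Vec<(String, Vec<NewsItem>)> {
    let max = max_per_source.max(0) as usize;
    // Global number of kept articles per domain.
    let mut counts: HashMap<String, usize> = HashMap::new();
    // keep[c][i] is true when item i of category c survives the filter.
    let mut keep: Vec<Vec<bool>> = Vec::with_capacity(parsed.len());

    // Pass 1 (spread), with cap enforcement folded in:
    // at most one article per domain per category, never above the cap.
    for (_, items) in &parsed {
        let mut seen_here: HashSet<String> = HashSet::new();
        let mut flags = Vec::with_capacity(items.len());
        for item in items {
            let kept = match source_key(&item.url) {
                None => true, // unparseable URL: keep, never counted
                Some(domain) => {
                    let count = counts.entry(domain.clone()).or_insert(0);
                    if *count < max && seen_here.insert(domain) {
                        *count += 1;
                        true
                    } else {
                        false
                    }
                }
            };
            flags.push(kept);
        }
        keep.push(flags);
    }

    // Pass 2 (fill): revisit dropped articles in original order and
    // re-admit them while their domain is still under the cap.
    for (c, (_, items)) in parsed.iter().enumerate() {
        for (i, item) in items.iter().enumerate() {
            if keep[c][i] {
                continue;
            }
            if let Some(domain) = source_key(&item.url) {
                let count = counts.entry(domain).or_insert(0);
                if *count < max {
                    *count += 1;
                    keep[c][i] = true;
                }
            }
        }
    }

    // Rebuild the list: same category keys, only the kept items.
    parsed
        .into_iter()
        .zip(keep)
        .map(|((category, items), flags)| {
            let kept: Vec<NewsItem> = items
                .into_iter()
                .zip(flags)
                .filter_map(|(item, k)| if k { Some(item) } else { None })
                .collect();
            (category, kept)
        })
        .collect()
}
```

Using a keep-mask rather than physically moving articles to a "dropped" list keeps the original ordering for free; whether the real function should preserve in-category order this way is a design choice the text above leaves open.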
Example with max_per_source = 3:
Before:
- Category A: openai.com×3, techcrunch.com×1
- Category B: openai.com×2, theverge.com×2
After pass 1 (1 per domain per category):
- Category A: openai.com×1, techcrunch.com×1
- Category B: openai.com×1, theverge.com×1
- Dropped: openai.com×3 (2 from A, 1 from B), theverge.com×1
- Global: openai=2, techcrunch=1, theverge=1
Cap enforcement: openai=2 ≤ 3, no trimming needed.
After pass 2 (fill up to max, dropped articles re-added to original category):
- openai has 1 slot left → add 1 openai article back to Category A
- theverge has 2 slots left → add 1 theverge article back to Category B
- Final: 3 openai total, 1 techcrunch, 2 theverge
Edge case with max_per_source = 2, 5 categories all with 1 openai.com article:
After pass 1: 5 openai articles (1 per category) → exceeds limit of 2.
Cap enforcement: trim to 2 openai articles, keeping those in categories A and B (first two in order) and dropping those in C, D, and E.
Pass 2: openai.com is already at the limit, so none of the trimmed articles are re-added.
Integration
parse_llm_output → filter_homepage_urls → limit_articles_per_source → scrape_articles
Call site in run_generation_inner:
let parsed = limit_articles_per_source(parsed, settings.max_articles_per_source);
Files to modify
- Create: migration 20260323000012_add_max_articles_per_source.sql
- Modify: backend/src/models/settings.rs (add field to UserSettings, SettingsResponse, UpdateSettingsRequest, plus validation)
- Modify: backend/src/db/settings.rs (add column to all SQL queries and to SettingsRow)
- Modify: backend/src/services/synthesis.rs (add filter function and call it)
- Modify: frontend/src/pages/Settings.tsx (add number input in the generation settings grid)
- Modify: frontend/src/i18n/fr.ts (add label translation)
- Modify: frontend/src/types.ts (add field to Settings type)
Unit tests
In synthesis.rs tests:
- 5 openai.com articles across 2 categories, max=3 → keeps 3, spread across categories
- All articles from different domains → nothing dropped
- max_per_source = 1 → at most 1 per domain total
- More categories than max (5 categories, 1 openai each, max=2) → caps at 2
- Empty input → empty output
- Articles with unparseable URLs → kept
What does NOT change
- LLM prompts — no instruction about source diversity
- JSON schema — no changes
- Scraper — no changes
- Rewrite pass — operates on already-filtered articles