# Design: Source Diversity Limit (max articles per source)

**Date**: 2026-03-23

**Scope**: Limit the number of articles from the same website across all categories in a synthesis

---

## Context

Generated syntheses can be dominated by a single source (e.g., 8 articles from openai.com across categories). Users want source diversity: at most N articles from the same website, with articles spread across categories rather than clustered in one.

## Approach

Add a post-parse filter function that enforces a per-domain article limit after the LLM search pass and before scraping. A new user setting controls the limit.

## New User Setting

- **Field:** `max_articles_per_source` in `UserSettings`
- **Type:** `i32` (non-optional, matches `max_items_per_category` pattern)
- **Validation:** 1-10
- **Migration:** `ALTER TABLE user_settings ADD COLUMN max_articles_per_source INTEGER NOT NULL DEFAULT 3`
- **Frontend label:** "Articles max par source"
- **Note:** 10 effectively means "no practical limit for most use cases"

## Filter Function

**Name:** `limit_articles_per_source`

**Signature:** `fn limit_articles_per_source(parsed: Vec<(String, Vec)>, max_per_source: i32) -> Vec<(String, Vec)>`

**Pipeline position:** after `filter_homepage_urls`, before `scrape_articles`

**Domain extraction:** Parse the URL with `url::Url` and extract the host via `host_str()` (e.g., `https://openai.com/blog/post` → `openai.com`). If the URL can't be parsed, keep the article (don't drop on parse failure).

**Known limitation:** Subdomains are treated as different sources (`blog.example.com` ≠ `www.example.com`). This is pragmatic for v1; registrable domain extraction (eTLD+1) can be added later if needed.

**Algorithm:**

1. **Pass 1 (spread):** For each category (in order), keep at most 1 article per domain. Track the first occurrence of each domain's article; move remaining articles from that domain to a "dropped" list.
2.
   **Cap enforcement:** If any domain exceeds `max_per_source` after pass 1 (possible when the number of categories is greater than the limit), trim that domain's articles down to `max_per_source`, keeping them spread across categories in order.
3. **Pass 2 (fill):** Iterate over dropped articles in their original order (categories in order, items within each category in order). Re-add each article to its original category if the domain is still under `max_per_source`.
4. Return the filtered list (same category keys, fewer items per category).

**Example** with `max_per_source = 3`:

Before:

- Category A: openai.com×3, techcrunch.com×1
- Category B: openai.com×2, theverge.com×2

After pass 1 (1 per domain per category):

- Category A: openai.com×1, techcrunch.com×1
- Category B: openai.com×1, theverge.com×1
- Dropped: openai.com×3 (2 from A, 1 from B), theverge.com×1
- Global: openai=2, techcrunch=1, theverge=1

Cap enforcement: openai=2 ≤ 3, no trimming needed.

After pass 2 (fill up to max, dropped articles re-added to original category):

- openai has 1 slot left → add 1 openai article back to Category A
- theverge has 2 slots left → add 1 theverge article back to Category B
- Final: 3 openai total, 1 techcrunch, 2 theverge

**Edge case** with `max_per_source = 2`, 5 categories all with 1 openai.com article:

- After pass 1: 5 openai articles (1 per category) → exceeds limit of 2.
- Cap enforcement: trim to 2 openai articles, keeping categories A and B (first two in order), dropping C/D/E.
- Pass 2: openai is already at the limit, so none of the trimmed articles are re-added.
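The passes above can be sketched as follows. This is a minimal illustration under stated assumptions, not the final implementation: `Article` is a hypothetical item type (the real one is not shown here), and `host_of` is a dependency-free stand-in for `url::Url` + `host_str()` so the sketch compiles without external crates. Cap enforcement is folded into pass 1, which should give the same result because pass 1 already visits categories in order. Re-added articles return to their original position within the category, which is one possible reading of "re-add to its original category".

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical article type; the real item type is not shown in this document.
#[derive(Debug, Clone, PartialEq)]
pub struct Article {
    pub url: String,
}

/// Dependency-free stand-in for `url::Url::parse(..)` followed by `host_str()`.
/// Returns `None` when no `scheme://host` prefix can be found.
fn host_of(url: &str) -> Option<String> {
    let (_scheme, rest) = url.split_once("://")?;
    let authority = rest.split(|c: char| matches!(c, '/' | '?' | '#')).next()?;
    let host_port = authority.rsplit('@').next()?;
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}

pub fn limit_articles_per_source(
    parsed: Vec<(String, Vec<Article>)>,
    max_per_source: i32,
) -> Vec<(String, Vec<Article>)> {
    let max = max_per_source.max(0) as usize;
    // Global number of kept articles per domain.
    let mut counts: HashMap<String, usize> = HashMap::new();
    // keep[c][i] records whether article i of category c survives.
    let mut keep: Vec<Vec<bool>> =
        parsed.iter().map(|(_, a)| vec![false; a.len()]).collect();
    // Dropped articles as (category index, item index, domain), in original order.
    let mut dropped: Vec<(usize, usize, String)> = Vec::new();

    // Pass 1 (spread) with cap enforcement folded in: keep the first article
    // of each domain per category, as long as the domain is under the cap.
    for (c, (_, articles)) in parsed.iter().enumerate() {
        let mut seen_in_category: HashSet<String> = HashSet::new();
        for (i, article) in articles.iter().enumerate() {
            match host_of(&article.url) {
                // Unparseable URL: keep the article and do not count it.
                None => keep[c][i] = true,
                Some(domain) => {
                    let first_here = seen_in_category.insert(domain.clone());
                    let count = counts.entry(domain.clone()).or_insert(0);
                    if first_here && *count < max {
                        *count += 1;
                        keep[c][i] = true;
                    } else {
                        dropped.push((c, i, domain));
                    }
                }
            }
        }
    }

    // Pass 2 (fill): re-add dropped articles while their domain has slots left.
    for (c, i, domain) in dropped {
        let count = counts.entry(domain).or_insert(0);
        if *count < max {
            *count += 1;
            keep[c][i] = true;
        }
    }

    // Rebuild the output with the same category keys and original item order.
    parsed
        .into_iter()
        .zip(keep)
        .map(|((name, articles), flags)| {
            let kept = articles
                .into_iter()
                .zip(flags)
                .filter(|(_, k)| *k)
                .map(|(a, _)| a)
                .collect();
            (name, kept)
        })
        .collect()
}
```

Tracking survivors with a boolean grid rather than moving items between lists keeps the original ordering inside each category and means categories that lose all their articles still appear in the output with an empty list.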
## Integration

```
parse_llm_output → filter_homepage_urls → limit_articles_per_source → scrape_articles
```

Call site in `run_generation_inner`:

```rust
let parsed = limit_articles_per_source(parsed, settings.max_articles_per_source);
```

## Files to modify

- **Create:** migration `20260323000012_add_max_articles_per_source.sql`
- **Modify:** `backend/src/models/settings.rs`: add field to `UserSettings`, `SettingsResponse`, `UpdateSettingsRequest` + validation
- **Modify:** `backend/src/db/settings.rs`: add column to all SQL queries + `SettingsRow`
- **Modify:** `backend/src/services/synthesis.rs`: add filter function + call it
- **Modify:** `frontend/src/pages/Settings.tsx`: add number input in the generation settings grid
- **Modify:** `frontend/src/i18n/fr.ts`: add label translation
- **Modify:** `frontend/src/types.ts`: add field to Settings type

## Unit tests

In `synthesis.rs` tests:

- 5 openai.com articles across 2 categories, max=3 → keeps 3, spread across categories
- All articles from different domains → nothing dropped
- `max_per_source = 1` → at most 1 per domain total
- More categories than max (5 categories, 1 openai each, max=2) → caps at 2
- Empty input → empty output
- Articles with unparseable URLs → kept

## What does NOT change

- LLM prompts: no instruction about source diversity
- JSON schema: no changes
- Scraper: no changes
- Rewrite pass: operates on already-filtered articles
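
## Appendix: validation sketch

The 1-10 validation listed under New User Setting might look like the sketch below. The function name, the constants, and the `String` error type are all assumptions for illustration; the real validation would follow whatever pattern `max_items_per_category` already uses in `backend/src/models/settings.rs`.

```rust
/// Bounds taken from the "Validation: 1-10" rule; constant names are hypothetical.
const MIN_ARTICLES_PER_SOURCE: i32 = 1;
const MAX_ARTICLES_PER_SOURCE: i32 = 10;

/// Hypothetical validator: returns the value unchanged when it is in range,
/// or a human-readable error message otherwise.
fn validate_max_articles_per_source(value: i32) -> Result<i32, String> {
    if (MIN_ARTICLES_PER_SOURCE..=MAX_ARTICLES_PER_SOURCE).contains(&value) {
        Ok(value)
    } else {
        Err(format!(
            "max_articles_per_source must be between {} and {}, got {}",
            MIN_ARTICLES_PER_SOURCE, MAX_ARTICLES_PER_SOURCE, value
        ))
    }
}
```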