ai_synth/docs/superpowers/specs/2026-03-23-source-diversity...

# Design: Source Diversity Limit (max articles per source)

**Date**: 2026-03-23
**Scope**: Limit the number of articles from the same website across all categories in a synthesis

---

## Context

Generated syntheses can be dominated by a single source (e.g., 8 articles from openai.com across categories). Users want source diversity — at most N articles from the same website, with articles spread across categories rather than clustered in one.

## Approach

Add a post-parse filter function that enforces a per-domain article limit after the LLM search pass and before scraping. A new user setting controls the limit.

## New User Setting

- **Field:** `max_articles_per_source` in `UserSettings`
- **Type:** `i32` (non-optional, matches `max_items_per_category` pattern)
- **Validation:** 1 to 10 (inclusive)
- **Migration:** `ALTER TABLE user_settings ADD COLUMN max_articles_per_source INTEGER NOT NULL DEFAULT 3`
- **Frontend label:** "Articles max par source"
- **Note:** 10 effectively means "no practical limit for most use cases"
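
The range check could look roughly like the sketch below. It is only an illustration: the function name, the constants, and the `String` error type are hypothetical, and the real validation would live alongside the existing `UpdateSettingsRequest` checks in whatever style they already use.

```rust
// Inclusive bounds for `max_articles_per_source` (1 to 10 per this spec).
// Constant names are hypothetical.
const MIN_ARTICLES_PER_SOURCE: i32 = 1;
const MAX_ARTICLES_PER_SOURCE: i32 = 10;

/// Hypothetical validator: returns the value unchanged when it is in range,
/// otherwise an error message suitable for a 4xx response body.
fn validate_max_articles_per_source(value: i32) -> Result<i32, String> {
    if (MIN_ARTICLES_PER_SOURCE..=MAX_ARTICLES_PER_SOURCE).contains(&value) {
        Ok(value)
    } else {
        Err(format!(
            "max_articles_per_source must be between {} and {}, got {}",
            MIN_ARTICLES_PER_SOURCE, MAX_ARTICLES_PER_SOURCE, value
        ))
    }
}
```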

## Filter Function

**Name:** `limit_articles_per_source`

**Signature:** `fn limit_articles_per_source(parsed: Vec<(String, Vec<NewsItem>)>, max_per_source: i32) -> Vec<(String, Vec<NewsItem>)>`

**Pipeline position:** after `filter_homepage_urls`, before `scrape_articles`

**Domain extraction:** Parse URL with `url::Url`, extract via `host_str()` (e.g., `https://openai.com/blog/post` → `openai.com`). If URL can't be parsed, keep the article (don't drop on parse failure).

**Known limitation:** Subdomains are treated as different sources (`blog.example.com` ≠ `www.example.com`). This is pragmatic for v1; registrable domain extraction (eTLD+1) can be added later if needed.

**Algorithm:**
1. **Pass 1 (spread):** For each category (in order), keep at most 1 article per domain: the first article from each domain stays, and any further articles from that domain in the same category move to a "dropped" list.
2. **Cap enforcement:** If any domain exceeds `max_per_source` after pass 1 (possible when the number of categories exceeds the limit), trim that domain's articles down to `max_per_source`, keeping the ones in the earliest categories.
3. **Pass 2 (fill):** Iterate over dropped articles in their original order (categories in order, items within each category in order). Re-add each article to its original category if its domain is still under `max_per_source`.
4. Return the filtered list (same category keys, fewer items per category).
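
The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the final implementation: `NewsItem` is reduced to a hypothetical single `url` field, `host_of` is a simplified std-only stand-in for `url::Url::host_str()`, and pass 1 and cap enforcement are folded into one loop (a first-in-category article is kept only while its domain is still under the cap, which gives the same result as trimming afterwards). Re-added articles return to their original position within the category, which the spec does not prescribe.

```rust
use std::collections::{HashMap, HashSet};

/// Minimal stand-in for the real `NewsItem`; only `url` matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct NewsItem {
    pub url: String,
}

/// Simplified host extraction for plain `scheme://host[:port]/path` URLs.
/// Assumption: the real code would use `url::Url::host_str()` instead.
fn host_of(url: &str) -> Option<String> {
    let (_, rest) = url.split_once("://")?;
    let authority = rest.split(|c: char| c == '/' || c == '?' || c == '#').next()?;
    let host = authority.rsplit('@').next()?; // drop any userinfo
    let host = host.split(':').next()?; // drop any port
    if host.is_empty() { None } else { Some(host.to_ascii_lowercase()) }
}

pub fn limit_articles_per_source(
    parsed: Vec<(String, Vec<NewsItem>)>,
    max_per_source: i32,
) -> Vec<(String, Vec<NewsItem>)> {
    let max = max_per_source.max(0) as usize;
    let mut counts: HashMap<String, usize> = HashMap::new();
    // keep[c][i] == true means item i of category c survives.
    let mut keep: Vec<Vec<bool>> =
        parsed.iter().map(|(_, items)| vec![false; items.len()]).collect();
    // (category index, item index, domain), in original order.
    let mut dropped: Vec<(usize, usize, String)> = Vec::new();

    // Pass 1 + cap enforcement: first article per domain per category,
    // as long as the domain is still under the global cap.
    for (c, (_, items)) in parsed.iter().enumerate() {
        let mut seen_in_category: HashSet<String> = HashSet::new();
        for (i, item) in items.iter().enumerate() {
            let domain = match host_of(&item.url) {
                Some(d) => d,
                None => {
                    keep[c][i] = true; // unparseable URL: never dropped
                    continue;
                }
            };
            let first_in_category = seen_in_category.insert(domain.clone());
            let count = counts.entry(domain.clone()).or_insert(0);
            if first_in_category && *count < max {
                *count += 1;
                keep[c][i] = true;
            } else {
                dropped.push((c, i, domain));
            }
        }
    }

    // Pass 2: re-add dropped articles while their domain has slots left.
    for (c, i, domain) in dropped {
        let count = counts.entry(domain).or_insert(0);
        if *count < max {
            *count += 1;
            keep[c][i] = true;
        }
    }

    // Rebuild the output with the same category keys and fewer items.
    parsed
        .into_iter()
        .zip(keep)
        .map(|((name, items), flags)| {
            let kept: Vec<NewsItem> = items
                .into_iter()
                .zip(flags)
                .filter_map(|(item, k)| if k { Some(item) } else { None })
                .collect();
            (name, kept)
        })
        .collect()
}
```

Tracking survivors with boolean flags rather than moving items between lists keeps the original ordering intact and makes "re-add to its original category" a single flag flip.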

**Example** with `max_per_source = 3`:

Before:
- Category A: openai.com×3, techcrunch.com×1
- Category B: openai.com×2, theverge.com×2

After pass 1 (1 per domain per category):
- Category A: openai.com×1, techcrunch.com×1
- Category B: openai.com×1, theverge.com×1
- Dropped: openai.com×3 (2 from A, 1 from B), theverge.com×1
- Global: openai=2, techcrunch=1, theverge=1

Cap enforcement: openai=2 ≤ 3, no trimming needed.

After pass 2 (fill up to max, dropped articles re-added to original category):
- openai has 1 slot left → add 1 openai article back to Category A
- theverge has 2 slots left → add 1 theverge article back to Category B
- Final: 3 openai total, 1 techcrunch, 2 theverge

**Edge case** with `max_per_source = 2`, 5 categories all with 1 openai.com article:

After pass 1: 5 openai articles (1 per category) → exceeds limit of 2.
Cap enforcement: trim to 2 openai articles, keeping categories A and B (first two in order), dropping C/D/E.
Pass 2: nothing is re-added. Pass 1 dropped no articles, and openai.com is already at the limit, so the articles trimmed from C/D/E stay out.

## Integration

```
parse_llm_output → filter_homepage_urls → limit_articles_per_source → scrape_articles
```

Call site in `run_generation_inner`:
```rust
let parsed = limit_articles_per_source(parsed, settings.max_articles_per_source);
```

## Files to modify

- **Create:** migration `20260323000012_add_max_articles_per_source.sql`
- **Modify:** `backend/src/models/settings.rs` — add field to `UserSettings`, `SettingsResponse`, `UpdateSettingsRequest` + validation
- **Modify:** `backend/src/db/settings.rs` — add column to all SQL queries + `SettingsRow`
- **Modify:** `backend/src/services/synthesis.rs` — add filter function + call it
- **Modify:** `frontend/src/pages/Settings.tsx` — add number input in the generation settings grid
- **Modify:** `frontend/src/i18n/fr.ts` — add label translation
- **Modify:** `frontend/src/types.ts` — add field to Settings type

## Unit tests

In `synthesis.rs` tests:
- 5 openai.com articles across 2 categories, max=3 → keeps 3, spread across categories
- All articles from different domains → nothing dropped
- `max_per_source = 1` → at most 1 per domain total
- More categories than max (5 categories, 1 openai each, max=2) → caps at 2
- Empty input → empty output
- Articles with unparseable URLs → kept

## What does NOT change

- LLM prompts — no instruction about source diversity
- JSON schema — no changes
- Scraper — no changes
- Rewrite pass — operates on already-filtered articles