# Design: Source Diversity via Recent History

**Date**: 2026-03-23

**Scope**: Inject recently-used domains into the search prompt to encourage source diversity across syntheses

---

## Context

Users notice that successive syntheses reuse the same sources (TechCrunch, The Verge, etc.). Within a single synthesis, the `limit_articles_per_source` filter already caps the number of articles per domain. Across syntheses, however, the LLM gravitates toward the same popular domains. Telling the LLM which domains were used recently lets it prioritize different sources.

## New User Setting

- **Field:** `source_diversity_window` in `UserSettings`
- **Type:** `i32` (non-optional, matches existing pattern)
- **Default:** 3
- **Validation:** 0-10 (0 = disabled)
- **Migration:** `ALTER TABLE settings ADD COLUMN source_diversity_window INTEGER NOT NULL DEFAULT 3`
- **Frontend label:** "Syntheses a examiner pour diversite" (in English: "Syntheses to examine for diversity")

## Mechanism

1. At generation time, if `source_diversity_window > 0`, query the user's last N syntheses from the DB (ordered by `created_at DESC`, limit N), where N is `source_diversity_window`.
2. Parse the `sections` JSONB from each synthesis, extract all article URLs, and convert them to domains via `host_str()`.
3. Deduplicate the domain list.
4. Pass the domain list to `build_search_prompt`, which appends a soft instruction: "Evite si possible les sources deja utilisees dans les syntheses precedentes : domaine1.com, domaine2.com, ..."
5. The LLM treats this as guidance, not a hard constraint: if no alternative exists, it can still use those domains.

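
The bounds on `source_diversity_window` and the step 1 gate can be sketched as follows. This is a minimal illustration, not the project's code: the names `validate_source_diversity_window` and `history_limit` are hypothetical, and the `i64` limit type is an assumption about what `list_for_user` accepts.

```rust
/// Inclusive bounds from the design: 0 disables the feature, 10 is the largest window.
pub const MIN_WINDOW: i32 = 0;
pub const MAX_WINDOW: i32 = 10;

/// Hypothetical validator: accepts 0..=10 and rejects everything else.
pub fn validate_source_diversity_window(value: i32) -> Result<i32, String> {
    if (MIN_WINDOW..=MAX_WINDOW).contains(&value) {
        Ok(value)
    } else {
        Err(format!(
            "source_diversity_window must be between {} and {}, got {}",
            MIN_WINDOW, MAX_WINDOW, value
        ))
    }
}

/// Hypothetical step 1 gate: `None` means "skip the history lookup",
/// `Some(n)` is the number of recent syntheses to load.
/// The `i64` type is an assumption about the query's limit parameter.
pub fn history_limit(window: i32) -> Option<i64> {
    if window > 0 {
        Some(i64::from(window))
    } else {
        None
    }
}
```

With the default of 3, `history_limit` yields a limit of 3; with 0, the lookup is skipped entirely and the prompt is built as before.
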
## Files to modify

- **Create:** migration `20260323000013_add_source_diversity_window.sql`
- **Modify:** `backend/src/models/settings.rs`: add the field to `UserSettings`, `SettingsResponse`, and `UpdateSettingsRequest`, plus the `Default` impl and validation (0-10)
- **Modify:** `backend/src/db/settings.rs`: add the field to the `SettingsRow` struct, the `TryFrom` impl, and both SQL queries (`get_or_create_default` and `upsert`: INSERT columns, VALUES, RETURNING, ON CONFLICT SET, `.bind()`)
- **Modify:** `backend/src/services/synthesis.rs`: before calling `build_search_prompt`, load recent syntheses via the existing `db::syntheses::list_for_user`, extract domains using `extract_domain` (same module, private fn), and pass the domain list to the prompt builder
- **Modify:** `backend/src/services/prompts.rs`: add a `recent_domains: &[String]` parameter to `build_search_prompt` and append the soft avoidance instruction if the list is non-empty. Update the call site in `synthesis.rs` (~line 304) to pass the domain list as the 4th argument.
- **Modify:** `backend/src/services/prompts.rs` tests: add `source_diversity_window` to the test fixture, test with and without recent domains
- **Modify:** `frontend/src/types.ts`: add the field to `UserSettings` and `DEFAULT_SETTINGS`
- **Modify:** `frontend/src/i18n/fr.ts`: add the label
- **Modify:** `frontend/src/pages/Settings.tsx`: add a number input

**Note:** No new DB query function is needed: the existing `db::syntheses::list_for_user(pool, user_id, limit, offset)` already returns full `Synthesis` records with `sections` JSONB. For a window of 3-10 syntheses (15-150 KB of JSON), application-level domain extraction is pragmatically fine for a single-tenant deployment.

## Domain extraction from existing syntheses

The `sections` column is JSONB with structure:

```json
[
  {
    "title": "Category Name",
    "items": [
      {
        "title": "...",
        "url": "https://example.com/article",
        "summary": "..."
      }
    ]
  }
]
```

Extract domains by parsing each item's `url` with `url::Url::parse` and `host_str()`.
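
As a rough sketch of this traversal, the block below walks the `sections` shape and collects distinct domains in first-seen order. Everything in it is an assumption made for illustration: `Section`, `Item`, and `collect_recent_domains` are hypothetical names, the structs stand in for whatever the JSONB deserializes to, and `naive_domain` is a deliberately simplified, standard-library-only substitute for `url::Url::parse` plus `host_str()` so that the example has no external dependencies.

```rust
use std::collections::HashSet;

/// Mirrors one entry of `items` in the `sections` JSON (only the field used here).
pub struct Item {
    pub url: String,
}

/// Mirrors one entry of the top-level `sections` array.
pub struct Section {
    pub items: Vec<Item>,
}

/// Simplified stand-in for URL host extraction: handles plain
/// `scheme://host[:port]/path` URLs only. Not a full URL parser.
fn naive_domain(url: &str) -> Option<String> {
    let (_, rest) = url.split_once("://")?;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest
        .split(|c: char| matches!(c, '/' | '?' | '#'))
        .next()?;
    // Drop an optional ":port" suffix.
    let host = authority.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}

/// Collects the distinct domains found in the sections of several syntheses,
/// keeping first-seen order. URLs that cannot be parsed are skipped.
pub fn collect_recent_domains(syntheses: &[Vec<Section>]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut domains = Vec::new();
    for sections in syntheses {
        for section in sections {
            for item in &section.items {
                if let Some(domain) = naive_domain(&item.url) {
                    // `insert` returns false when the domain was already recorded.
                    if seen.insert(domain.clone()) {
                        domains.push(domain);
                    }
                }
            }
        }
    }
    domains
}
```

Keeping first-seen order is a choice of this sketch, not a requirement of the design; it only makes the resulting prompt text deterministic for a given history.
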
Reuse the existing `extract_domain` function in `synthesis.rs` (private fn, same module).

## Unit tests

- `build_search_prompt` with non-empty `recent_domains` → prompt contains avoidance instruction
- `build_search_prompt` with empty `recent_domains` → prompt unchanged
- Validation of `source_diversity_window` bounds (0 and 10 pass, -1 and 11 fail)

## Prompt modification

In `build_search_prompt`, add a parameter `recent_domains: &[String]`; an empty slice leaves the prompt unchanged. If the slice is non-empty, append to the user prompt:

```
Evite si possible les sources deja utilisees dans les syntheses precedentes : domaine1.com, domaine2.com, ...
```

In English: "If possible, avoid the sources already used in previous syntheses: domaine1.com, domaine2.com, ...". This is a soft instruction: the LLM can still use these domains if no alternatives are available.

## What does NOT change

- JSON schema: no changes
- Scraper: no changes
- Rewrite pass: no changes
- `limit_articles_per_source`: still enforces the hard cap within a single synthesis
- `dedup_by_url`: still deduplicates within a single synthesis
- No new database table: domains are extracted from the existing `syntheses.sections` JSONB
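
To make the prompt modification concrete, here is a minimal sketch of the append step in isolation. The helper name `with_recent_domains` and the blank-line separator are assumptions; the real change lives inside `build_search_prompt`, whose other parameters are not shown in this document.

```rust
/// Soft avoidance instruction, copied from the "Prompt modification" section.
const AVOIDANCE_PREFIX: &str =
    "Evite si possible les sources deja utilisees dans les syntheses precedentes : ";

/// Hypothetical helper: returns the user prompt unchanged when `recent_domains`
/// is empty, otherwise appends the avoidance instruction and the domain list.
pub fn with_recent_domains(user_prompt: &str, recent_domains: &[String]) -> String {
    if recent_domains.is_empty() {
        return user_prompt.to_string();
    }
    // The blank line between prompt and instruction is an assumption of this sketch.
    format!(
        "{}\n\n{}{}",
        user_prompt,
        AVOIDANCE_PREFIX,
        recent_domains.join(", ")
    )
}
```

The two branches correspond to the first two unit tests listed above: a non-empty list yields a prompt containing the instruction, and an empty list returns the prompt as-is.
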