Design: Source Diversity via Recent History
Date: 2026-03-23
Scope: Inject recently used domains into the search prompt to encourage source diversity across syntheses
Context
Users notice that successive syntheses reuse the same sources (TechCrunch, The Verge, etc.). Within a single synthesis, the limit_articles_per_source filter already caps the number of articles per domain. Across syntheses, however, the LLM gravitates toward the same popular domains. Telling the LLM which domains were recently used lets it prioritize different sources.
New User Setting
- Field: source_diversity_window in UserSettings
- Type: i32 (non-optional, matches existing pattern)
- Default: 3
- Validation: 0-10 (0 = disabled)
- Migration: ALTER TABLE settings ADD COLUMN source_diversity_window INTEGER NOT NULL DEFAULT 3
- Frontend label: "Syntheses a examiner pour diversite" (English: "Syntheses to examine for diversity")
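The validation rule can be sketched as below. This is a minimal illustration, not the project's actual validation code: the constant names and the functions validate_source_diversity_window and diversity_enabled are hypothetical, and only the bounds (0-10), the default (3), and the "0 = disabled" rule come from the setting description.

```rust
/// Hypothetical constants; the values come from the setting description.
pub const SOURCE_DIVERSITY_WINDOW_MIN: i32 = 0;
pub const SOURCE_DIVERSITY_WINDOW_MAX: i32 = 10;
pub const SOURCE_DIVERSITY_WINDOW_DEFAULT: i32 = 3;

/// Returns the value unchanged when it lies in 0..=10,
/// otherwise an error message describing the violation.
/// (Hypothetical helper; the real validation may be shaped differently.)
pub fn validate_source_diversity_window(value: i32) -> Result<i32, String> {
    if (SOURCE_DIVERSITY_WINDOW_MIN..=SOURCE_DIVERSITY_WINDOW_MAX).contains(&value) {
        Ok(value)
    } else {
        Err(format!(
            "source_diversity_window must be between {} and {}, got {}",
            SOURCE_DIVERSITY_WINDOW_MIN, SOURCE_DIVERSITY_WINDOW_MAX, value
        ))
    }
}

/// A window of 0 turns the feature off.
pub fn diversity_enabled(window: i32) -> bool {
    window > 0
}
```

Keeping the column as a non-optional i32 with 0 meaning "disabled" avoids an Option in the settings struct, at the cost of one sentinel value.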
Mechanism
- At generation time, if source_diversity_window > 0, query the user's last N syntheses from the DB (ordered by created_at DESC, limit N).
- Parse the sections JSONB from each synthesis, extract all article URLs, convert to domains via host_str().
- Deduplicate the domain list.
- Pass the domain list to build_search_prompt, which appends a soft instruction: "Evite si possible les sources deja utilisees recemment : domaine1.com, domaine2.com, ..." (English: "If possible, avoid sources already used recently: ...")
- The LLM treats this as guidance, not a hard constraint: if no alternative exists, it can still use those domains.
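The gating and deduplication steps above might look roughly like the following. This is a sketch under assumptions: the function name recent_domains and its signature are hypothetical, the input is assumed to be already reduced to one list of article URLs per synthesis (newest first), and the extract closure stands in for the existing extract_domain helper.

```rust
use std::collections::HashSet;

/// Collects the distinct domains used by the `window` most recent syntheses.
///
/// `syntheses_urls` is assumed to be ordered newest first, with one Vec of
/// article URLs per synthesis. `extract` is a stand-in for the existing
/// `extract_domain` helper; URLs it rejects are skipped.
pub fn recent_domains<F>(window: i32, syntheses_urls: &[Vec<String>], extract: F) -> Vec<String>
where
    F: Fn(&str) -> Option<String>,
{
    // A window of 0 (or any non-positive value) disables the feature.
    if window <= 0 {
        return Vec::new();
    }

    let mut seen = HashSet::new();
    let mut domains = Vec::new();

    for urls in syntheses_urls.iter().take(window as usize) {
        for url in urls {
            if let Some(domain) = extract(url) {
                // Keep first-seen order so the prompt text is deterministic.
                if seen.insert(domain.clone()) {
                    domains.push(domain);
                }
            }
        }
    }
    domains
}
```

Preserving first-seen order rather than collecting straight into a HashSet is a design choice for this sketch: it keeps the generated prompt stable between runs, which makes prompt tests easier to write.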
Files to modify
- Create: migration 20260323000013_add_source_diversity_window.sql
- Modify: backend/src/models/settings.rs to add the field to UserSettings, SettingsResponse, and UpdateSettingsRequest, plus the Default impl and validation (0-10)
- Modify: backend/src/db/settings.rs to add the field to the SettingsRow struct, the TryFrom<SettingsRow> impl, and both SQL queries (get_or_create_default + upsert: INSERT columns, VALUES, RETURNING, ON CONFLICT SET, .bind())
- Modify: backend/src/services/synthesis.rs so that, before calling build_search_prompt, it loads recent syntheses via the existing db::syntheses::list_for_user, extracts domains using extract_domain (same module, private fn), and passes the domain list to the prompt builder
- Modify: backend/src/services/prompts.rs to add a recent_domains: &[String] parameter to build_search_prompt and append the soft avoidance instruction if it is non-empty. Update the call site in synthesis.rs (~line 304) to pass the domain list as the 4th argument.
- Modify: backend/src/services/prompts.rs tests to add source_diversity_window to the test fixture and test with/without recent domains
- Modify: frontend/src/types.ts to add the field to UserSettings + DEFAULT_SETTINGS
- Modify: frontend/src/i18n/fr.ts to add the label
- Modify: frontend/src/pages/Settings.tsx to add a number input
Note: No new DB query function needed — the existing db::syntheses::list_for_user(pool, user_id, limit, offset) already returns full Synthesis records with sections JSONB. For a window of 3-10 syntheses (15-150 KB of JSON), application-level domain extraction is pragmatically fine for a single-tenant deployment.
Domain extraction from existing syntheses
The sections column is JSONB with structure:
[
  {
    "title": "Category Name",
    "items": [
      { "title": "...", "url": "https://example.com/article", "summary": "..." }
    ]
  }
]
Extract domains by parsing each item's url with url::Url::parse and host_str(). Reuse the existing extract_domain function in synthesis.rs (private fn, same module).
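As an illustration of walking that structure, here is a dependency-free sketch. Everything in it is an assumption made for the example: the Section and Item structs simply mirror the JSON shape shown, and host_of is a simplified stdlib-only stand-in for url::Url::parse followed by host_str(), written so the sketch compiles without external crates. It does not handle IPv6 literals or other edge cases that the url crate covers, and the real code would reuse extract_domain instead.

```rust
/// Mirrors one entry of "items" in the sections JSON (assumed shape).
pub struct Item {
    pub title: String,
    pub url: String,
    pub summary: String,
}

/// Mirrors one element of the top-level sections array (assumed shape).
pub struct Section {
    pub title: String,
    pub items: Vec<Item>,
}

/// Simplified stand-in for `url::Url::parse(..)` + `host_str()`.
/// Returns the lowercased host, or None if the input has no "scheme://" part.
/// Not a full URL parser: IPv6 literals and similar cases are not handled.
pub fn host_of(url: &str) -> Option<String> {
    let (_, rest) = url.split_once("://")?;
    // The authority ends at the first path, query, or fragment delimiter.
    let authority = rest.split(|c: char| c == '/' || c == '?' || c == '#').next()?;
    // Drop optional "user:password@" and ":port".
    let host_port = authority.rsplit('@').next()?;
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}

/// Collects the host of every item URL, skipping URLs that do not parse.
/// Duplicates are kept here; deduplication is a separate step.
pub fn domains_in_sections(sections: &[Section]) -> Vec<String> {
    sections
        .iter()
        .flat_map(|section| section.items.iter())
        .filter_map(|item| host_of(&item.url))
        .collect()
}
```

Skipping unparseable URLs silently, rather than failing the whole synthesis, fits the soft nature of the feature: a missing domain only makes the hint slightly less complete.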
Unit tests
- build_search_prompt with non-empty recent_domains → prompt contains avoidance instruction
- build_search_prompt with empty recent_domains → prompt unchanged
- Validation of source_diversity_window bounds (0 and 10 pass, -1 and 11 fail)
Prompt modification
In build_search_prompt, add a parameter recent_domains: &[String]; an empty slice leaves the prompt unchanged. If it is non-empty, append to the user prompt:
Evite si possible les sources deja utilisees dans les syntheses precedentes : domaine1.com, domaine2.com, ...
(English: "If possible, avoid sources already used in previous syntheses: ...") This is a soft instruction: the LLM can still use these domains if no alternatives are available.
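The append step could be sketched as follows. The helper name append_diversity_hint is hypothetical and exists only for this example; in the design it would be logic inside build_search_prompt, whose other parameters are not shown here. The blank-line separator before the instruction is also an assumption.

```rust
/// Hypothetical helper illustrating the append logic; in the design this
/// lives inside `build_search_prompt`, whose full signature is not shown.
pub fn append_diversity_hint(user_prompt: &str, recent_domains: &[String]) -> String {
    // Empty list: feature disabled or no history, so the prompt is unchanged.
    if recent_domains.is_empty() {
        return user_prompt.to_string();
    }

    // Soft instruction, worded as in the design; the separator is assumed.
    format!(
        "{}\n\nEvite si possible les sources deja utilisees dans les syntheses precedentes : {}",
        user_prompt,
        recent_domains.join(", ")
    )
}
```

Returning the prompt byte-for-byte unchanged for an empty list makes the "prompt unchanged" unit test a simple equality check.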
What does NOT change
- JSON schema: no changes
- Scraper: no changes
- Rewrite pass: no changes
- limit_articles_per_source: still enforces the hard cap within a single synthesis
- dedup_by_url: still deduplicates within a single synthesis
- No new database table: domains are extracted from the existing syntheses.sections JSONB