# Design: Source Diversity via Recent History

**Date**: 2026-03-23
**Scope**: Inject recently used domains into the search prompt to encourage source diversity across syntheses

---

## Context

Users notice that successive syntheses reuse the same sources (TechCrunch, The Verge, etc.). Within a single synthesis, the `limit_articles_per_source` filter already caps the number of articles per domain. Across syntheses over time, however, the LLM gravitates toward the same popular domains. Telling the LLM which domains were recently used lets it prioritize different sources.

## New User Setting

- **Field:** `source_diversity_window` in `UserSettings`
- **Type:** `i32` (non-optional, matches existing pattern)
- **Default:** 3
- **Validation:** 0-10 (0 = disabled)
- **Migration:** `ALTER TABLE settings ADD COLUMN source_diversity_window INTEGER NOT NULL DEFAULT 3`
- **Frontend label:** "Syntheses a examiner pour diversite" (French UI string; in English, "Syntheses to examine for diversity")

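
The default and the 0-10 bound above could be captured along these lines. This is a minimal sketch: the constant names and `validate_source_diversity_window` are hypothetical, and the real validation in `backend/src/models/settings.rs` may follow a different pattern or error type.

```rust
/// Default number of recent syntheses to inspect (value taken from this design).
pub const DEFAULT_SOURCE_DIVERSITY_WINDOW: i32 = 3;

/// Inclusive upper bound for the setting; 0 disables the feature.
pub const MAX_SOURCE_DIVERSITY_WINDOW: i32 = 10;

/// Hypothetical helper: returns the value unchanged when it lies in 0..=10,
/// otherwise an error message describing the accepted range.
pub fn validate_source_diversity_window(value: i32) -> Result<i32, String> {
    if (0..=MAX_SOURCE_DIVERSITY_WINDOW).contains(&value) {
        Ok(value)
    } else {
        Err(format!(
            "source_diversity_window must be between 0 and {}, got {}",
            MAX_SOURCE_DIVERSITY_WINDOW, value
        ))
    }
}
```

With this shape, 0 and 10 are accepted while -1 and 11 are rejected, which matches the bounds listed under "Unit tests".
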
## Mechanism

1. At generation time, if `source_diversity_window > 0`, query the user's last N syntheses from the DB, where N = `source_diversity_window` (ordered by `created_at DESC`, limit N).
2. Parse the `sections` JSONB from each synthesis, extract all article URLs, and convert them to domains via `host_str()`.
3. Deduplicate the domain list.
4. Pass the domain list to `build_search_prompt`, which appends a soft instruction:
   "Evite si possible les sources deja utilisees dans les syntheses precedentes : domaine1.com, domaine2.com, ..." (French for "If possible, avoid sources already used in previous syntheses: ...")
5. The LLM treats this as guidance, not a hard constraint: if no alternative exists, it can still use those domains.

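
Steps 1 to 3 can be sketched as a small pure function. The name `collect_recent_domains` and its input shape (one list of already extracted domains per synthesis, newest first) are assumptions made for illustration; the real code would obtain the syntheses from `db::syntheses::list_for_user`.

```rust
use std::collections::HashSet;

/// Hypothetical helper covering steps 1-3. `domains_per_synthesis` is assumed
/// to be ordered newest first, with one inner list per synthesis.
pub fn collect_recent_domains(window: i32, domains_per_synthesis: &[Vec<String>]) -> Vec<String> {
    // A window of 0 (or a negative value that slipped past validation)
    // disables the feature.
    if window <= 0 {
        return Vec::new();
    }

    let mut seen = HashSet::new();
    let mut unique = Vec::new();
    // Only the `window` most recent syntheses are inspected.
    for domain in domains_per_synthesis.iter().take(window as usize).flatten() {
        // `insert` returns false when the domain was already recorded.
        if seen.insert(domain.as_str()) {
            unique.push(domain.clone());
        }
    }
    unique
}
```

Deduplicating with a `HashSet` while pushing into a `Vec` keeps first-seen order, so domains from the most recent synthesis appear first in the prompt. Whether that ordering matters to the LLM is untested.
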
## Files to modify

- **Create:** migration `20260323000013_add_source_diversity_window.sql`
- **Modify:** `backend/src/models/settings.rs`: add the field to `UserSettings`, `SettingsResponse`, and `UpdateSettingsRequest`, plus the `Default` impl and validation (0-10)
- **Modify:** `backend/src/db/settings.rs`: add the field to the `SettingsRow` struct, the `TryFrom<SettingsRow>` impl, and both SQL queries (`get_or_create_default` + `upsert`: INSERT columns, VALUES, RETURNING, ON CONFLICT SET, .bind())
- **Modify:** `backend/src/services/synthesis.rs`: before calling `build_search_prompt`, load recent syntheses via the existing `db::syntheses::list_for_user`, extract domains using `extract_domain` (same module, private fn), and pass the domain list to the prompt builder
- **Modify:** `backend/src/services/prompts.rs`: add a `recent_domains: &[String]` parameter to `build_search_prompt` and append the soft avoidance instruction if it is non-empty. Update the call site in `synthesis.rs` (~line 304) to pass the domain list as the 4th argument.
- **Modify:** `backend/src/services/prompts.rs` tests: add `source_diversity_window` to the test fixture, and test with and without recent domains
- **Modify:** `frontend/src/types.ts`: add the field to `UserSettings` + `DEFAULT_SETTINGS`
- **Modify:** `frontend/src/i18n/fr.ts`: add the label
- **Modify:** `frontend/src/pages/Settings.tsx`: add a number input

**Note:** No new DB query function is needed: the existing `db::syntheses::list_for_user(pool, user_id, limit, offset)` already returns full `Synthesis` records with `sections` JSONB. For a window of 3-10 syntheses (15-150 KB of JSON), application-level domain extraction is fine in practice for a single-tenant deployment.

## Domain extraction from existing syntheses

The `sections` column is JSONB with structure:

```json
[
  {
    "title": "Category Name",
    "items": [
      { "title": "...", "url": "https://example.com/article", "summary": "..." }
    ]
  }
]
```

Extract domains by parsing each item's `url` with `url::Url::parse` and `host_str()`. Reuse the existing `extract_domain` function in `synthesis.rs` (private fn, same module).

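
The walk over `sections` might look roughly like the following. To keep the sketch dependency-free, `host_of` is a deliberately simplified stand-in for `url::Url::parse` plus `host_str()`: it handles only plain `http(s)` URLs and does not cover cases such as IPv6 literals. The names `Section`, `Item`, `host_of`, and `domains_in_sections` are hypothetical; the real code would deserialize the JSONB and call the existing `extract_domain`.

```rust
/// Mirrors one entry of the `sections` array shown above.
pub struct Section {
    pub title: String,
    pub items: Vec<Item>,
}

/// Mirrors one entry of a section's `items` array.
pub struct Item {
    pub title: String,
    pub url: String,
    pub summary: String,
}

/// Simplified stand-in for `Url::parse(..)` followed by `host_str()`.
/// Returns `None` for anything that is not a plain http(s) URL.
pub fn host_of(url: &str) -> Option<String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority component ends at the first '/', '?' or '#'.
    let authority = rest.split(|c| c == '/' || c == '?' || c == '#').next()?;
    // Drop optional userinfo ("user:pass@") and port (":8080").
    let host = authority.rsplit('@').next()?;
    let host = host.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}

/// Collects the host of every item URL, skipping URLs that do not parse.
pub fn domains_in_sections(sections: &[Section]) -> Vec<String> {
    sections
        .iter()
        .flat_map(|section| section.items.iter())
        .filter_map(|item| host_of(&item.url))
        .collect()
}
```

Unparseable URLs are skipped silently here, on the assumption that a missing entry in a soft avoidance list is harmless. The output is not deduplicated; that happens in step 3 of the mechanism.
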
## Unit tests

- `build_search_prompt` with non-empty `recent_domains` → prompt contains avoidance instruction
- `build_search_prompt` with empty `recent_domains` → prompt unchanged
- Validation of `source_diversity_window` bounds (0 and 10 pass, -1 and 11 fail)

## Prompt modification

In `build_search_prompt`, add a parameter `recent_domains: &[String]` (callers pass an empty slice when there is nothing to avoid). If it is non-empty, append to the user prompt:

```
Evite si possible les sources deja utilisees dans les syntheses precedentes : domaine1.com, domaine2.com, ...
```

This is a soft instruction: the LLM can still use these domains if no alternatives are available.

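
The append step could be isolated as below. This is a sketch under assumptions: `append_recent_domains` is a hypothetical helper, and the separator placed before the instruction (a blank line) is a guess, since the existing layout of the user prompt inside `build_search_prompt` is not shown here.

```rust
/// Hypothetical helper: appends the soft avoidance instruction to an already
/// built user prompt. An empty slice leaves the prompt untouched.
pub fn append_recent_domains(prompt: &mut String, recent_domains: &[String]) {
    if recent_domains.is_empty() {
        return;
    }
    // Assumed separator: a blank line between the existing prompt and the instruction.
    prompt.push_str(
        "\n\nEvite si possible les sources deja utilisees dans les syntheses precedentes : ",
    );
    prompt.push_str(&recent_domains.join(", "));
}
```

Returning early on an empty slice covers the "prompt unchanged" unit test case and also the disabled state (`source_diversity_window = 0`), where the caller passes no domains.
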
## What does NOT change

- JSON schema: no changes
- Scraper: no changes
- Rewrite pass: no changes
- `limit_articles_per_source`: still enforces the hard cap within a single synthesis
- `dedup_by_url`: still deduplicates within a single synthesis
- No new database table: domains are extracted from the existing `syntheses.sections` JSONB