ai_synth/docs/superpowers/specs/2026-03-23-source-diversity...

# Design: Source Diversity via Recent History

**Date**: 2026-03-23
**Scope**: Inject recently-used domains into the search prompt to encourage source diversity across syntheses

---

## Context

Users notice that successive syntheses reuse the same sources (TechCrunch, The Verge, etc.). Within a single synthesis, the `limit_articles_per_source` filter already caps per-domain articles. Across syntheses over time, however, the LLM gravitates toward the same popular domains. Telling the LLM which domains were recently used lets it prioritize different sources.

## New User Setting

- **Field:** `source_diversity_window` in `UserSettings`
- **Type:** `i32` (non-optional, matches existing pattern)
- **Default:** 3
- **Validation:** 0-10 (0 = disabled)
- **Migration:** `ALTER TABLE settings ADD COLUMN source_diversity_window INTEGER NOT NULL DEFAULT 3`
- **Frontend label:** "Syntheses a examiner pour diversite" (in English: "Syntheses to examine for diversity")
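
The 0-10 bound above could be checked with a small helper along these lines. This is a minimal sketch: the function name, the constants, and the `Result<(), String>` error type are assumptions for illustration, and the real validation in `backend/src/models/settings.rs` should follow whatever pattern the other settings fields already use.

```rust
/// Inclusive bounds for `source_diversity_window`; 0 disables the feature.
const MIN_WINDOW: i32 = 0;
const MAX_WINDOW: i32 = 10;

/// Hypothetical validation helper (name and error type are assumptions).
/// Returns `Ok(())` for values in 0..=10 and an error message otherwise.
pub fn validate_source_diversity_window(value: i32) -> Result<(), String> {
    if (MIN_WINDOW..=MAX_WINDOW).contains(&value) {
        Ok(())
    } else {
        Err(format!(
            "source_diversity_window must be between {} and {}, got {}",
            MIN_WINDOW, MAX_WINDOW, value
        ))
    }
}
```

Keeping the bounds as named constants lets the unit tests and the error message stay in sync if the range changes later.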

## Mechanism

1. At generation time, if `source_diversity_window > 0`, query the user's last N syntheses from the DB, where N is `source_diversity_window` (ordered by `created_at DESC`, limit N).
2. Parse the `sections` JSONB from each synthesis, extract all article URLs, convert to domains via `host_str()`.
3. Deduplicate the domain list.
4. Pass the domain list to `build_search_prompt`, which appends a soft instruction:
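   "Evite si possible les sources deja utilisees dans les syntheses precedentes : domaine1.com, domaine2.com, ..." (in English: "If possible, avoid sources already used in previous syntheses: ...")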
5. The LLM treats this as guidance, not a hard constraint — if no alternative exists, it can still use those domains.
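
The gate in step 1 can be sketched as follows. Both function names are hypothetical, the `i64` limit type is an assumption about what the DB layer expects, and the `load` closure stands in for the real database call so the sketch stays dependency-free.

```rust
/// Hypothetical gate for step 1: the number of syntheses to load,
/// or `None` when the feature is disabled (window of 0 or less).
pub fn history_limit(source_diversity_window: i32) -> Option<i64> {
    (source_diversity_window > 0).then(|| i64::from(source_diversity_window))
}

/// Runs `load` only when the feature is enabled.
/// `load` is a stand-in for the DB query and receives the row limit.
pub fn load_recent_urls<F>(source_diversity_window: i32, load: F) -> Vec<String>
where
    F: FnOnce(i64) -> Vec<String>,
{
    match history_limit(source_diversity_window) {
        Some(limit) => load(limit),
        // Disabled: skip the query entirely and return no history.
        None => Vec::new(),
    }
}
```

With a window of 0 the loader is never invoked, so disabling the setting also removes the extra DB round trip.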

## Files to modify

- **Create:** migration `20260323000013_add_source_diversity_window.sql`
- **Modify:** `backend/src/models/settings.rs` — add field to `UserSettings`, `SettingsResponse`, `UpdateSettingsRequest` + `Default` impl + validation (0-10)
- **Modify:** `backend/src/db/settings.rs` — add to `SettingsRow` struct, `TryFrom<SettingsRow>` impl, and both SQL queries (`get_or_create_default` + `upsert`: INSERT columns, VALUES, RETURNING, ON CONFLICT SET, .bind())
- **Modify:** `backend/src/services/synthesis.rs` — before calling `build_search_prompt`, load recent syntheses via existing `db::syntheses::list_for_user`, extract domains using `extract_domain` (same module, private fn), pass domain list to the prompt builder
- **Modify:** `backend/src/services/prompts.rs` — add `recent_domains: &[String]` parameter to `build_search_prompt`, append soft avoidance instruction if non-empty. Update the call site in `synthesis.rs` (~line 304) to pass the domain list as the 4th argument.
- **Modify:** `backend/src/services/prompts.rs` tests — add `source_diversity_window` to test fixture, test with/without recent domains
- **Modify:** `frontend/src/types.ts` — add field to `UserSettings` + `DEFAULT_SETTINGS`
- **Modify:** `frontend/src/i18n/fr.ts` — add label
- **Modify:** `frontend/src/pages/Settings.tsx` — add number input

**Note:** No new DB query function needed — the existing `db::syntheses::list_for_user(pool, user_id, limit, offset)` already returns full `Synthesis` records with `sections` JSONB. For a window of 3-10 syntheses (15-150 KB of JSON), application-level domain extraction is pragmatically fine for a single-tenant deployment.

## Domain extraction from existing syntheses

The `sections` column is JSONB with structure:
```json
[
  {
    "title": "Category Name",
    "items": [
      { "title": "...", "url": "https://example.com/article", "summary": "..." }
    ]
  }
]
```

Extract domains by parsing each item's `url` with `url::Url::parse` and `host_str()`. Reuse the existing `extract_domain` function in `synthesis.rs` (private fn, same module).
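
A rough sketch of the extraction and deduplication over already-deserialized sections is shown below. The `Item` and `Section` structs and `collect_recent_domains` are illustrative names, not the project's actual types. The `extract_domain` here is a simplified, dependency-free stand-in for the existing private function: it splits the string by hand instead of calling `url::Url::parse` and `host_str()`, and it does not handle cases such as IPv6 hosts.

```rust
use std::collections::BTreeSet;

/// Illustrative mirror of one entry in a section's `items` array.
pub struct Item {
    pub title: String,
    pub url: String,
    pub summary: String,
}

/// Illustrative mirror of one element of the `sections` array.
pub struct Section {
    pub title: String,
    pub items: Vec<Item>,
}

/// Simplified stand-in for the existing `extract_domain` (assumption:
/// the real one uses `url::Url::parse` + `host_str()`).
fn extract_domain(url: &str) -> Option<String> {
    // Require an explicit scheme, e.g. "https://".
    let (_scheme, rest) = url.split_once("://")?;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest.split(|c: char| matches!(c, '/' | '?' | '#')).next()?;
    // Drop optional userinfo ("user@") and port (":8080").
    let host = authority.rsplit('@').next()?;
    let host = host.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}

/// Collects the distinct domains across several syntheses.
/// Each inner `Vec<Section>` is the `sections` value of one synthesis.
pub fn collect_recent_domains(syntheses: &[Vec<Section>]) -> Vec<String> {
    let domains: BTreeSet<String> = syntheses
        .iter()
        .flatten()
        .flat_map(|section| section.items.iter())
        // Unparseable URLs are skipped rather than treated as errors.
        .filter_map(|item| extract_domain(&item.url))
        .collect();
    // BTreeSet iteration is sorted, so the output order is deterministic.
    domains.into_iter().collect()
}
```

A `BTreeSet` is used here so that the domain list comes out deduplicated and in a stable order, which keeps the generated prompt reproducible in tests. Whether hosts such as `www.example.com` and `example.com` should be merged depends on what the existing `extract_domain` does and is left open.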

## Unit tests

- `build_search_prompt` with non-empty `recent_domains` → prompt contains avoidance instruction
- `build_search_prompt` with empty `recent_domains` → prompt unchanged
- Validation of `source_diversity_window` bounds (0 and 10 pass, -1 and 11 fail)

## Prompt modification

In `build_search_prompt`, add a parameter `recent_domains: &[String]` (an empty slice means there is no recent history to avoid). If non-empty, append to the user prompt:

```
Evite si possible les sources deja utilisees dans les syntheses precedentes : domaine1.com, domaine2.com, ...
```

This is a soft instruction — the LLM can still use these domains if no alternatives are available.
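
The append step could look roughly like the sketch below. `append_recent_domains` is a hypothetical helper, not the real `build_search_prompt`, whose full signature is not shown in this spec. The blank-line separator before the instruction is also an assumption.

```rust
/// Hypothetical helper illustrating the append step only.
/// Returns the prompt unchanged when there are no recent domains.
pub fn append_recent_domains(user_prompt: &str, recent_domains: &[String]) -> String {
    if recent_domains.is_empty() {
        return user_prompt.to_string();
    }
    // Soft avoidance instruction followed by a comma-separated domain list.
    format!(
        "{}\n\nEvite si possible les sources deja utilisees dans les syntheses precedentes : {}",
        user_prompt,
        recent_domains.join(", ")
    )
}
```

Returning the input untouched for an empty slice keeps the "prompt unchanged" unit test a simple equality check.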

## What does NOT change

- JSON schema — no changes
- Scraper — no changes
- Rewrite pass — no changes
- `limit_articles_per_source` — still enforces hard cap within a single synthesis
- `dedup_by_url` — still deduplicates within a single synthesis
- No new database table — domains are extracted from existing `syntheses.sections` JSONB