Design: Source Diversity via Recent History

Date: 2026-03-23
Scope: Inject recently used domains into the search prompt to encourage source diversity across syntheses


Context

Users notice that successive syntheses reuse the same sources (TechCrunch, The Verge, etc.). Within a single synthesis, the limit_articles_per_source filter already caps the number of articles per domain. Across syntheses, however, the LLM gravitates toward the same popular domains. Telling the LLM which domains were recently used lets it prioritize different sources.

New User Setting

  • Field: source_diversity_window in UserSettings
  • Type: i32 (non-optional, matches existing pattern)
  • Default: 3
  • Validation: 0-10 (0 = disabled)
  • Migration: ALTER TABLE settings ADD COLUMN source_diversity_window INTEGER NOT NULL DEFAULT 3
  • Frontend label: "Syntheses a examiner pour diversite" (English: "Syntheses to examine for diversity")
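
The default and the 0-10 bounds above could be expressed roughly as follows. This is a minimal sketch, not the project's actual code: the trimmed-down UserSettings struct, the two constants, and validate_source_diversity_window are hypothetical names introduced for illustration, and the real struct carries many more fields.

```rust
/// Inclusive bounds from the spec; 0 means the feature is disabled.
pub const SOURCE_DIVERSITY_WINDOW_MIN: i32 = 0;
pub const SOURCE_DIVERSITY_WINDOW_MAX: i32 = 10;

/// Hypothetical, trimmed-down stand-in for the real UserSettings struct.
#[derive(Debug, Clone, PartialEq)]
pub struct UserSettings {
    pub source_diversity_window: i32,
}

impl Default for UserSettings {
    fn default() -> Self {
        // Default of 3, matching the DEFAULT in the migration.
        Self { source_diversity_window: 3 }
    }
}

/// Hypothetical validator: returns the value unchanged when it is within
/// bounds, or an error message otherwise.
pub fn validate_source_diversity_window(value: i32) -> Result<i32, String> {
    if (SOURCE_DIVERSITY_WINDOW_MIN..=SOURCE_DIVERSITY_WINDOW_MAX).contains(&value) {
        Ok(value)
    } else {
        Err(format!(
            "source_diversity_window must be between {} and {}, got {}",
            SOURCE_DIVERSITY_WINDOW_MIN, SOURCE_DIVERSITY_WINDOW_MAX, value
        ))
    }
}
```

Note that 0 is accepted on purpose, since it is the documented way to turn the feature off.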

Mechanism

  1. At generation time, if source_diversity_window > 0, query the user's last N syntheses from the DB (ordered by created_at DESC, limit N).
  2. Parse the sections JSONB from each synthesis, extract all article URLs, convert to domains via host_str().
  3. Deduplicate the domain list.
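  4. Pass the domain list to build_search_prompt, which appends a soft instruction: "Evite si possible les sources deja utilisees dans les syntheses precedentes : domaine1.com, domaine2.com, ..." (English: "If possible, avoid sources already used in previous syntheses: ...").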
  5. The LLM treats this as guidance, not a hard constraint — if no alternative exists, it can still use those domains.
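
Steps 1 and 3 (the window gate and the deduplication) can be sketched as below. This assumes the domains have already been extracted per synthesis and that the outer slice is ordered newest first; collect_recent_domains is a hypothetical helper name, not a function from the codebase.

```rust
use std::collections::HashSet;

/// Hypothetical helper. `domains_per_synthesis` is assumed to hold one list
/// of domains per synthesis, ordered newest first.
/// Returns the deduplicated domains from the first `window` syntheses,
/// keeping first-seen order, or an empty list when the feature is disabled.
pub fn collect_recent_domains(
    window: i32,
    domains_per_synthesis: &[Vec<String>],
) -> Vec<String> {
    // source_diversity_window == 0 disables the feature.
    if window <= 0 {
        return Vec::new();
    }

    let mut seen: HashSet<&str> = HashSet::new();
    let mut out = Vec::new();
    for domains in domains_per_synthesis.iter().take(window as usize) {
        for domain in domains {
            // HashSet::insert returns false when the value was already present.
            if seen.insert(domain.as_str()) {
                out.push(domain.clone());
            }
        }
    }
    out
}
```

Keeping first-seen order is a design choice of this sketch: it puts the most recently used domains at the front of the prompt, which the spec does not require.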

Files to modify

  • Create: migration 20260323000013_add_source_diversity_window.sql
  • Modify: backend/src/models/settings.rs — add field to UserSettings, SettingsResponse, UpdateSettingsRequest + Default impl + validation (0-10)
  • Modify: backend/src/db/settings.rs — add to SettingsRow struct, TryFrom<SettingsRow> impl, and both SQL queries (get_or_create_default + upsert: INSERT columns, VALUES, RETURNING, ON CONFLICT SET, .bind())
  • Modify: backend/src/services/synthesis.rs — before calling build_search_prompt, load recent syntheses via existing db::syntheses::list_for_user, extract domains using extract_domain (same module, private fn), pass domain list to the prompt builder
  • Modify: backend/src/services/prompts.rs — add recent_domains: &[String] parameter to build_search_prompt, append soft avoidance instruction if non-empty. Update the call site in synthesis.rs (~line 304) to pass the domain list as the 4th argument.
  • Modify: backend/src/services/prompts.rs tests — add source_diversity_window to test fixture, test with/without recent domains
  • Modify: frontend/src/types.ts — add field to UserSettings + DEFAULT_SETTINGS
  • Modify: frontend/src/i18n/fr.ts — add label
  • Modify: frontend/src/pages/Settings.tsx — add number input

Note: No new DB query function needed — the existing db::syntheses::list_for_user(pool, user_id, limit, offset) already returns full Synthesis records with sections JSONB. For a window of 3-10 syntheses (15-150 KB of JSON), application-level domain extraction is pragmatically fine for a single-tenant deployment.

Domain extraction from existing syntheses

The sections column is JSONB with structure:

[
  {
    "title": "Category Name",
    "items": [
      { "title": "...", "url": "https://example.com/article", "summary": "..." }
    ]
  }
]

Extract domains by parsing each item's url with url::Url::parse and host_str(). Reuse the existing extract_domain function in synthesis.rs (private fn, same module).
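
To illustrate the traversal of the sections structure, here is a dependency-free sketch. It is an approximation under stated assumptions: the Section and Item structs are hypothetical mirrors of the JSON shape above, and host_of is a simplified std-only stand-in for url::Url::parse followed by host_str(). It is not the existing extract_domain function, and it does not handle IPv6 literals or other edge cases that the url crate covers.

```rust
/// Hypothetical mirror of one entry in a section's "items" array.
#[allow(dead_code)]
pub struct Item {
    pub title: String,
    pub url: String,
    pub summary: String,
}

/// Hypothetical mirror of one element of the sections JSONB array.
#[allow(dead_code)]
pub struct Section {
    pub title: String,
    pub items: Vec<Item>,
}

/// Simplified stand-in for `url::Url::parse(..)` + `host_str()`.
/// Returns the lowercased host, or None when the input has no "scheme://".
fn host_of(url: &str) -> Option<String> {
    let (_scheme, rest) = url.split_once("://")?;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    // Drop optional userinfo ("user:pass@host").
    let host_port = authority.rsplit('@').next()?;
    // Drop optional port (IPv6 literals are not handled here).
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}

/// Collects the host of every item URL, silently skipping unparseable URLs.
pub fn domains_in_sections(sections: &[Section]) -> Vec<String> {
    sections
        .iter()
        .flat_map(|section| section.items.iter())
        .filter_map(|item| host_of(&item.url))
        .collect()
}
```

In the real module, calling the existing extract_domain function (or the url crate directly) would be preferable to the hand-rolled host_of shown here.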

Unit tests

  • build_search_prompt with non-empty recent_domains → prompt contains avoidance instruction
  • build_search_prompt with empty recent_domains → prompt unchanged
  • Validation of source_diversity_window bounds (0 and 10 pass, -1 and 11 fail)

Prompt modification

In build_search_prompt, add a parameter recent_domains: &[String]. An empty slice leaves the prompt unchanged. If the slice is non-empty, append to the user prompt:

Evite si possible les sources deja utilisees dans les syntheses precedentes : domaine1.com, domaine2.com, ...

In English: "If possible, avoid sources already used in previous syntheses: ...". This is a soft instruction: the LLM can still use these domains if no alternatives are available.
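
The append step might look like the sketch below. Since the other parameters of build_search_prompt are not listed in this spec, the logic is isolated in a hypothetical helper, with_diversity_hint, that takes an already-built prompt; the blank-line separator is also an assumption.

```rust
/// Hypothetical helper isolating the append logic; the real change lives
/// inside build_search_prompt, whose other parameters are not shown here.
pub fn with_diversity_hint(base_prompt: &str, recent_domains: &[String]) -> String {
    // Empty list (feature disabled or no history): prompt is unchanged.
    if recent_domains.is_empty() {
        return base_prompt.to_string();
    }
    // Soft instruction, kept in French like the wording given in the spec.
    format!(
        "{}\n\nEvite si possible les sources deja utilisees dans les syntheses precedentes : {}",
        base_prompt,
        recent_domains.join(", ")
    )
}
```

Returning the base prompt untouched for an empty list matches the second unit test in this spec (empty recent_domains leaves the prompt unchanged).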

What does NOT change

  • JSON schema — no changes
  • Scraper — no changes
  • Rewrite pass — no changes
  • limit_articles_per_source — still enforces hard cap within a single synthesis
  • dedup_by_url — still deduplicates within a single synthesis
  • No new database table — domains are extracted from existing syntheses.sections JSONB