From 420603d76a8e82cce311668068afd4e0c2099f5e Mon Sep 17 00:00:00 2001
From: oabrivard
Date: Tue, 24 Mar 2026 09:19:24 +0100
Subject: [PATCH] Updated specifications of source diversity functionality

---
 .../2026-03-23-source-diversity-history.md    | 342 +++++++++++++++
 .../2026-03-23-source-diversity-limit.md      | 406 ++++++++++++++++++
 ...6-03-23-source-diversity-history-design.md |  83 ++++
 ...026-03-23-source-diversity-limit-design.md | 104 +++++
 4 files changed, 935 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-03-23-source-diversity-history.md
 create mode 100644 docs/superpowers/plans/2026-03-23-source-diversity-limit.md
 create mode 100644 docs/superpowers/specs/2026-03-23-source-diversity-history-design.md
 create mode 100644 docs/superpowers/specs/2026-03-23-source-diversity-limit-design.md

diff --git a/docs/superpowers/plans/2026-03-23-source-diversity-history.md b/docs/superpowers/plans/2026-03-23-source-diversity-history.md
new file mode 100644
index 0000000..8e86e3f
--- /dev/null
+++ b/docs/superpowers/plans/2026-03-23-source-diversity-history.md
@@ -0,0 +1,342 @@
+# Source Diversity via Recent History: Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Inject recently used domains into the LLM search prompt to encourage source diversity across syntheses.
+
+**Architecture:** New `source_diversity_window` setting (default 3, 0 = disabled). At generation time, load recent syntheses, extract domains from their JSONB sections, and pass them to the prompt builder, which appends a soft avoidance instruction.
+
+**Tech Stack:** Rust (sqlx, serde_json, url crate), SolidJS, PostgreSQL
+
+**Spec:** `docs/superpowers/specs/2026-03-23-source-diversity-history-design.md`
+
+---
+
+### Task 1: Migration + backend model
+
+**Files:**
+- Create: `backend/migrations/20260323000013_add_source_diversity_window.sql`
+- Modify: `backend/src/models/settings.rs`
+- Modify: `backend/src/db/settings.rs`
+- Modify: `CLAUDE.md`
+
+- [ ] **Step 1: Create migration**
+
+```sql
+ALTER TABLE settings ADD COLUMN source_diversity_window INTEGER NOT NULL DEFAULT 3;
+```
+
+- [ ] **Step 2: Add field to all structs in `models/settings.rs`**
+
+Add `pub source_diversity_window: i32` to `UserSettings`, `SettingsResponse`, `UpdateSettingsRequest` (after `max_articles_per_source`).
+
+Add to `From<UserSettings> for SettingsResponse`:
+```rust
+source_diversity_window: s.source_diversity_window,
+```
+
+Add validation in `UpdateSettingsRequest::validate()`:
+```rust
+if !(0..=10).contains(&self.source_diversity_window) {
+    return Err("source_diversity_window must be between 0 and 10".into());
+}
+```
+
+Add to `impl Default for UserSettings`:
+```rust
+source_diversity_window: 3,
+```
+
+- [ ] **Step 3: Add column to DB queries in `db/settings.rs`**
+
+Add `source_diversity_window: i32` to `SettingsRow`. Add to `TryFrom<SettingsRow>`:
+```rust
+source_diversity_window: row.source_diversity_window,
+```
+
+Add to both SQL queries (`get_or_create_default` and `upsert`): INSERT column list, VALUES placeholder, RETURNING clause, `.bind()` call, and ON CONFLICT SET (upsert only). The new column goes after `max_articles_per_source`.
+
+- [ ] **Step 4: Update CLAUDE.md migration count to 13**
+
+- [ ] **Step 5: Add validation tests in `models/settings.rs`**
+
+Add `source_diversity_window: 3` to the `valid_request()` test helper.
Then add tests: + +```rust + #[test] + fn test_source_diversity_window_zero_is_valid() { + let mut req = valid_request(); + req.source_diversity_window = 0; + assert!(req.validate().is_ok()); + } + + #[test] + fn test_source_diversity_window_ten_is_valid() { + let mut req = valid_request(); + req.source_diversity_window = 10; + assert!(req.validate().is_ok()); + } + + #[test] + fn test_source_diversity_window_below_range() { + let mut req = valid_request(); + req.source_diversity_window = -1; + assert!(req.validate().is_err()); + } + + #[test] + fn test_source_diversity_window_above_range() { + let mut req = valid_request(); + req.source_diversity_window = 11; + assert!(req.validate().is_err()); + } +``` + +- [ ] **Step 6: Run tests** + +Run: `cd backend && cargo test --lib` +Expected: all tests pass + +- [ ] **Step 7: Commit** + +```bash +git add backend/migrations/20260323000013_add_source_diversity_window.sql backend/src/models/settings.rs backend/src/db/settings.rs CLAUDE.md +git commit -m "feat: add source_diversity_window setting (migration + model + DB)" +``` + +--- + +### Task 2: Prompt modification + tests + +**Files:** +- Modify: `backend/src/services/prompts.rs` + +- [ ] **Step 1: Add `recent_domains` parameter to `build_search_prompt`** + +Change signature from: +```rust +pub fn build_search_prompt( + settings: &UserSettings, + sources: &[Source], + current_date: &str, +) -> (String, String) { +``` + +To: +```rust +pub fn build_search_prompt( + settings: &UserSettings, + sources: &[Source], + current_date: &str, + recent_domains: &[String], +) -> (String, String) { +``` + +- [ ] **Step 2: Append avoidance instruction when domains are non-empty** + +At the end of the `user_prompt` format string (after the JSON instruction line, before the closing `"`), add a conditional block. 
After the `format!()` call that builds `user_prompt`, append: + +```rust + let user_prompt = if recent_domains.is_empty() { + user_prompt + } else { + let domains_list = recent_domains.join(", "); + format!( + "{}\n\nEvite si possible les sources deja utilisees dans les syntheses precedentes : {}.", + user_prompt, domains_list + ) + }; +``` + +- [ ] **Step 3: Update test fixture** + +In the `test_settings()` function (~line 137), add: +```rust +source_diversity_window: 3, +``` + +- [ ] **Step 4: Update existing test calls** + +All existing tests that call `build_search_prompt` need the 4th argument. Add `&[]` (empty slice) to each existing call. Search for `build_search_prompt(` in the test module and add `, &[]` before the closing `)`. + +- [ ] **Step 5: Add new tests** + +```rust + #[test] + fn search_prompt_includes_recent_domains_avoidance() { + let settings = test_settings(); + let sources = vec![]; + let date = "lundi 17 mars 2026"; + let domains = vec!["techcrunch.com".to_string(), "theverge.com".to_string()]; + let (_, user_prompt) = build_search_prompt(&settings, &sources, date, &domains); + assert!(user_prompt.contains("Evite si possible")); + assert!(user_prompt.contains("techcrunch.com")); + assert!(user_prompt.contains("theverge.com")); + } + + #[test] + fn search_prompt_no_avoidance_when_domains_empty() { + let settings = test_settings(); + let sources = vec![]; + let date = "lundi 17 mars 2026"; + let (_, user_prompt) = build_search_prompt(&settings, &sources, date, &[]); + assert!(!user_prompt.contains("Evite si possible")); + } +``` + +- [ ] **Step 6: Run tests** + +Run: `cd backend && cargo test --lib` +Expected: all tests pass + +- [ ] **Step 7: Commit** + +```bash +git add backend/src/services/prompts.rs +git commit -m "feat: build_search_prompt accepts recent_domains for source diversity" +``` + +--- + +### Task 3: Pipeline integration — extract domains + wire prompt + +**Files:** +- Modify: `backend/src/services/synthesis.rs` + +- [ ] **Step 
1: Add domain extraction from recent syntheses**
+
+Before the `build_search_prompt` call (~line 303), add a new step that loads recent syntheses and extracts domains. Insert between the rate limit check (step 5) and the search pass (step 6):
+
+```rust
+    // Step 5b: Load recently-used domains for source diversity
+    let recent_domains = if settings.source_diversity_window > 0 {
+        let recent = db::syntheses::list_for_user(
+            &state.pool,
+            user_id,
+            settings.source_diversity_window as i64,
+            0,
+        )
+        .await
+        .unwrap_or_default();
+
+        let mut domains: Vec<String> = recent
+            .iter()
+            .filter_map(|s| {
+                serde_json::from_value::>(
+                    s.sections.clone(),
+                )
+                .ok()
+            })
+            .flat_map(|sections| {
+                sections
+                    .into_iter()
+                    .flat_map(|sec| sec.items.into_iter())
+                    .filter_map(|item| extract_domain(&item.url))
+            })
+            .collect();
+
+        domains.sort();
+        domains.dedup();
+        domains
+    } else {
+        Vec::new()
+    };
+```
+
+- [ ] **Step 2: Update the `build_search_prompt` call**
+
+Change line ~304 from:
+```rust
+    let (system_prompt, user_prompt) =
+        prompts::build_search_prompt(&settings, &sources, &current_date);
+```
+
+To:
+```rust
+    let (system_prompt, user_prompt) =
+        prompts::build_search_prompt(&settings, &sources, &current_date, &recent_domains);
+```
+
+- [ ] **Step 3: Run tests**
+
+Run: `cd backend && cargo test --lib`
+Expected: all tests pass
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add backend/src/services/synthesis.rs
+git commit -m "feat: extract recent domains and pass to search prompt for diversity"
+```
+
+---
+
+### Task 4: Frontend setting
+
+**Files:**
+- Modify: `frontend/src/types.ts`
+- Modify: `frontend/src/i18n/fr.ts`
+- Modify: `frontend/src/pages/Settings.tsx`
+
+- [ ] **Step 1: Add field to frontend types**
+
+In `frontend/src/types.ts`, add to `UserSettings` interface (after `max_articles_per_source`):
+```typescript
+source_diversity_window: number;
+```
+
+Add to `DEFAULT_SETTINGS`:
+```typescript
+source_diversity_window: 3,
+```
+
+- [ ] **Step 2: Add i18n
label** + +In `frontend/src/i18n/fr.ts`, add after `settings.maxArticlesPerSource`: +```typescript +'settings.diversityWindow': 'Syntheses a examiner pour diversite', +``` + +- [ ] **Step 3: Add number input to Settings page** + +In `frontend/src/pages/Settings.tsx`, inside the generation settings grid (after `maxArticlesPerSource`), add: + +```tsx +
+ +
+ + setSettings((prev) => ({ + ...prev, + source_diversity_window: + parseInt(e.currentTarget.value) || 3, + })) + } + /> +
+
+``` + +- [ ] **Step 4: Run frontend tests** + +Run: `cd frontend && npx tsc --noEmit && npx vitest run` +Expected: type check passes, all tests pass + +- [ ] **Step 5: Commit** + +```bash +git add frontend/src/types.ts frontend/src/i18n/fr.ts frontend/src/pages/Settings.tsx +git commit -m "feat: add source_diversity_window setting to frontend" +``` diff --git a/docs/superpowers/plans/2026-03-23-source-diversity-limit.md b/docs/superpowers/plans/2026-03-23-source-diversity-limit.md new file mode 100644 index 0000000..eee24f2 --- /dev/null +++ b/docs/superpowers/plans/2026-03-23-source-diversity-limit.md @@ -0,0 +1,406 @@ +# Source Diversity Limit — Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Limit the number of articles from the same website across all categories, with source diversity spread across categories. + +**Architecture:** New `i32` field in UserSettings + migration, post-parse filter function in the generation pipeline, frontend number input. + +**Tech Stack:** Rust (sqlx, url crate), SolidJS, PostgreSQL + +**Spec:** `docs/superpowers/specs/2026-03-23-source-diversity-limit-design.md` + +--- + +### Task 1: Database migration + +**Files:** +- Create: `backend/migrations/20260323000012_add_max_articles_per_source.sql` + +- [ ] **Step 1: Create migration** + +```sql +ALTER TABLE settings ADD COLUMN max_articles_per_source INTEGER NOT NULL DEFAULT 3; +``` + +- [ ] **Step 2: Update CLAUDE.md migration count** + +Change `## Database (11 migrations)` to `## Database (12 migrations)`. 
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add backend/migrations/20260323000012_add_max_articles_per_source.sql CLAUDE.md
+git commit -m "feat: add max_articles_per_source column to user_settings"
+```
+
+---
+
+### Task 2: Backend model + DB queries
+
+**Files:**
+- Modify: `backend/src/models/settings.rs`
+- Modify: `backend/src/db/settings.rs`
+
+- [ ] **Step 1: Add field to all three structs in `models/settings.rs`**
+
+Add `pub max_articles_per_source: i32` to:
+- `UserSettings` (after `max_items_per_category`)
+- `SettingsResponse` (after `max_items_per_category`)
+- `UpdateSettingsRequest` (after `max_items_per_category`)
+
+Add the field to `impl From<UserSettings> for SettingsResponse`:
+```rust
+max_articles_per_source: s.max_articles_per_source,
+```
+
+Add validation in `UpdateSettingsRequest::validate()`:
+```rust
+if !(1..=10).contains(&self.max_articles_per_source) {
+    return Err("max_articles_per_source must be between 1 and 10".into());
+}
+```
+
+Add to `impl Default for UserSettings`:
+```rust
+max_articles_per_source: 3,
+```
+
+- [ ] **Step 2: Add column to all SQL queries in `db/settings.rs`**
+
+Add `max_articles_per_source` to:
+- `SettingsRow` struct field
+- `get_or_create` INSERT column list, VALUES, RETURNING, and `.bind()`
+- `upsert` INSERT column list, VALUES, RETURNING, ON CONFLICT SET, and `.bind()`
+- `UserSettings::try_from(SettingsRow)` mapping
+
+This follows the exact same pattern as `max_items_per_category` in every query.
+
+- [ ] **Step 3: Run tests**
+
+Run: `cd backend && cargo test --lib`
+Expected: all tests pass (existing settings tests use `Default` which now includes the new field)
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add backend/src/models/settings.rs backend/src/db/settings.rs
+git commit -m "feat: add max_articles_per_source to settings model and DB queries"
+```
+
+---
+
+### Task 3: Filter function with unit tests
+
+**Files:**
+- Modify: `backend/src/services/synthesis.rs`
+
+- [ ] **Step 1: Add the `limit_articles_per_source` function**
+
+Add after `filter_homepage_urls`:
+
+```rust
+/// Limit the number of articles from the same domain across all categories.
+///
+/// Spreads articles across categories first (at most 1 per domain per category),
+/// then fills remaining slots from dropped articles in encounter order.
+fn limit_articles_per_source(
+    parsed: Vec<(String, Vec<NewsItem>)>,
+    max_per_source: i32,
+) -> Vec<(String, Vec<NewsItem>)> {
+    let max = max_per_source as usize;
+
+    // Pass 1: keep at most 1 article per domain per category
+    let mut kept: Vec<(String, Vec<NewsItem>)> = Vec::new();
+    let mut dropped: Vec<(usize, NewsItem)> = Vec::new(); // (category_index, item)
+    let mut domain_counts: std::collections::HashMap<String, usize> =
+        std::collections::HashMap::new();
+
+    for (cat_idx, (cat_key, items)) in parsed.into_iter().enumerate() {
+        let mut cat_kept = Vec::new();
+        let mut seen_in_cat: std::collections::HashSet<String> = std::collections::HashSet::new();
+
+        for item in items {
+            let domain = extract_domain(&item.url);
+            if let Some(ref d) = domain {
+                if seen_in_cat.contains(d) {
+                    dropped.push((cat_idx, item));
+                    continue;
+                }
+                seen_in_cat.insert(d.clone());
+                *domain_counts.entry(d.clone()).or_insert(0) += 1;
+            }
+            cat_kept.push(item);
+        }
+
+        kept.push((cat_key, cat_kept));
+    }
+
+    // Cap enforcement: if any domain exceeds max after pass 1 (when categories > max),
+    // keep the first max articles in category order, drop the rest.
+    let mut cap_counts: std::collections::HashMap<String, usize> = std::collections::HashMap::new();
+    for (_, items) in &mut kept {
+        items.retain(|item| {
+            let domain = extract_domain(&item.url);
+            match domain {
+                Some(ref d) => {
+                    let count = cap_counts.entry(d.clone()).or_insert(0);
+                    if *count >= max {
+                        false
+                    } else {
+                        *count += 1;
+                        true
+                    }
+                }
+                None => true, // keep unparseable URLs
+            }
+        });
+    }
+
+    // Use cap_counts as the authoritative domain counts going forward
+    let mut domain_counts = cap_counts;
+
+    // Pass 2: fill from dropped articles, back into their original category
+    for (cat_idx, item) in dropped {
+        if let Some(d) = extract_domain(&item.url) {
+            let count = domain_counts.get(&d).copied().unwrap_or(0);
+            if count < max {
+                *domain_counts.entry(d).or_insert(0) += 1;
+                kept[cat_idx].1.push(item);
+            }
+        } else {
+            // Unparseable URL: keep it
+            kept[cat_idx].1.push(item);
+        }
+    }
+
+    kept
+}
+
+/// Extract the domain (host) from a URL, or None if unparseable.
+fn extract_domain(url: &str) -> Option<String> {
+    url::Url::parse(url)
+        .ok()
+        .and_then(|u| u.host_str().map(|h| h.to_lowercase()))
+}
+```
+
+- [ ] **Step 2: Wire it into the pipeline**
+
+In `run_generation_inner`, after `filter_homepage_urls` (line 315) and before the scrape step, add:
+
+```rust
+    // Step 7c: Limit articles per source for diversity
+    let parsed = limit_articles_per_source(parsed, settings.max_articles_per_source);
+```
+
+- [ ] **Step 3: Add unit tests**
+
+Add to the `#[cfg(test)] mod tests` block at the bottom of `synthesis.rs`:
+
+```rust
+    // ── limit_articles_per_source tests ────────────────────────────
+
+    #[test]
+    fn source_limit_spreads_across_categories() {
+        let parsed = vec![
+            ("category_0".into(), vec![
+                NewsItem { title: "A1".into(), url: "https://openai.com/blog/a".into(), summary: "s".into() },
+                NewsItem { title: "A2".into(), url: "https://openai.com/blog/b".into(), summary: "s".into() },
+                NewsItem { title: "A3".into(), url: "https://openai.com/blog/c".into(),
summary: "s".into() }, + NewsItem { title: "A4".into(), url: "https://techcrunch.com/x".into(), summary: "s".into() }, + ]), + ("category_1".into(), vec![ + NewsItem { title: "B1".into(), url: "https://openai.com/research/d".into(), summary: "s".into() }, + NewsItem { title: "B2".into(), url: "https://openai.com/research/e".into(), summary: "s".into() }, + NewsItem { title: "B3".into(), url: "https://theverge.com/y".into(), summary: "s".into() }, + ]), + ]; + + let result = limit_articles_per_source(parsed, 3); + + // Count openai.com articles across all categories + let openai_count: usize = result.iter() + .flat_map(|(_, items)| items) + .filter(|i| i.url.contains("openai.com")) + .count(); + assert_eq!(openai_count, 3, "Should keep exactly 3 openai.com articles"); + + // Both categories should have at least 1 openai article (spread) + let cat0_openai = result[0].1.iter().filter(|i| i.url.contains("openai.com")).count(); + let cat1_openai = result[1].1.iter().filter(|i| i.url.contains("openai.com")).count(); + assert!(cat0_openai >= 1, "Category 0 should have at least 1 openai article"); + assert!(cat1_openai >= 1, "Category 1 should have at least 1 openai article"); + + // techcrunch and theverge should be untouched + let tc_count: usize = result.iter().flat_map(|(_, items)| items).filter(|i| i.url.contains("techcrunch")).count(); + assert_eq!(tc_count, 1); + } + + #[test] + fn source_limit_all_different_domains() { + let parsed = vec![ + ("category_0".into(), vec![ + NewsItem { title: "A".into(), url: "https://a.com/1".into(), summary: "s".into() }, + NewsItem { title: "B".into(), url: "https://b.com/2".into(), summary: "s".into() }, + ]), + ]; + + let result = limit_articles_per_source(parsed, 3); + assert_eq!(result[0].1.len(), 2, "Nothing dropped when all domains are unique"); + } + + #[test] + fn source_limit_max_one() { + let parsed = vec![ + ("category_0".into(), vec![ + NewsItem { title: "A".into(), url: "https://openai.com/a".into(), summary: "s".into() 
},
+                NewsItem { title: "B".into(), url: "https://openai.com/b".into(), summary: "s".into() },
+            ]),
+            ("category_1".into(), vec![
+                NewsItem { title: "C".into(), url: "https://openai.com/c".into(), summary: "s".into() },
+            ]),
+        ];
+
+        let result = limit_articles_per_source(parsed, 1);
+        let total: usize = result.iter().flat_map(|(_, items)| items).filter(|i| i.url.contains("openai.com")).count();
+        assert_eq!(total, 1, "max=1 should keep exactly 1 openai article");
+    }
+
+    #[test]
+    fn source_limit_more_categories_than_max() {
+        // 5 categories, each with 1 openai article, max=2
+        let parsed: Vec<(String, Vec<NewsItem>)> = (0..5)
+            .map(|i| (
+                format!("category_{}", i),
+                vec![NewsItem {
+                    title: format!("Art{}", i),
+                    url: format!("https://openai.com/{}", i),
+                    summary: "s".into(),
+                }],
+            ))
+            .collect();
+
+        let result = limit_articles_per_source(parsed, 2);
+        let total: usize = result.iter().flat_map(|(_, items)| items).count();
+        assert_eq!(total, 2, "Should cap at max_per_source even with more categories");
+    }
+
+    #[test]
+    fn source_limit_empty_input() {
+        let result = limit_articles_per_source(vec![], 3);
+        assert!(result.is_empty());
+    }
+
+    #[test]
+    fn source_limit_unparseable_urls_kept() {
+        let parsed = vec![
+            ("category_0".into(), vec![
+                NewsItem { title: "Good".into(), url: "https://openai.com/a".into(), summary: "s".into() },
+                NewsItem { title: "Bad".into(), url: "not-a-url".into(), summary: "s".into() },
+            ]),
+        ];
+
+        let result = limit_articles_per_source(parsed, 3);
+        assert_eq!(result[0].1.len(), 2, "Unparseable URLs should be kept");
+    }
+```
+
+- [ ] **Step 4: Run tests**
+
+Run: `cd backend && cargo test --lib`
+Expected: all tests pass including the 6 new ones
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add backend/src/services/synthesis.rs
+git commit -m "feat: add limit_articles_per_source filter with unit tests"
+```
+
+---
+
+### Task 4: Frontend setting
+
+**Files:**
+- Modify: `frontend/src/types.ts`
+- Modify: `frontend/src/i18n/fr.ts`
+- Modify: `frontend/src/pages/Settings.tsx` + +- [ ] **Step 1: Add field to frontend types** + +In `frontend/src/types.ts`, add to `UserSettings` interface after `max_items_per_category`: +```typescript +max_articles_per_source: number; +``` + +- [ ] **Step 2: Add i18n label** + +In `frontend/src/i18n/fr.ts`, add after the `settings.maxItems` line: +```typescript +'settings.maxArticlesPerSource': 'Articles max par source', +``` + +- [ ] **Step 3: Add number input to Settings page** + +In `frontend/src/pages/Settings.tsx`, inside the `sm:grid-cols-2` grid (before its closing `` around line 403), add a new `
` as a third child of the grid: + +```tsx +
+ +
+ + setSettings((prev) => ({ + ...prev, + max_articles_per_source: + parseInt(e.currentTarget.value) || 3, + })) + } + /> +
+
+``` + +Also add `max_articles_per_source: 3` to the default settings initializer if one exists. + +- [ ] **Step 4: Run frontend tests and type check** + +Run: `cd frontend && npx tsc --noEmit && npx vitest run` +Expected: type check passes, all tests pass + +- [ ] **Step 5: Commit** + +```bash +git add frontend/src/types.ts frontend/src/i18n/fr.ts frontend/src/pages/Settings.tsx +git commit -m "feat: add max_articles_per_source setting to frontend" +``` + +--- + +### Task 5: E2E verification + +- [ ] **Step 1: Rebuild and run Docker stack** + +```bash +docker compose down && docker compose up --build +``` + +- [ ] **Step 2: Verify the setting appears in the Settings page** + +Navigate to Settings, verify the "Articles max par source" number input is visible with default value 3. + +- [ ] **Step 3: Generate a synthesis and verify source diversity** + +Change the setting to 2, generate a synthesis, verify no domain appears more than 2 times across all categories. diff --git a/docs/superpowers/specs/2026-03-23-source-diversity-history-design.md b/docs/superpowers/specs/2026-03-23-source-diversity-history-design.md new file mode 100644 index 0000000..c346ef1 --- /dev/null +++ b/docs/superpowers/specs/2026-03-23-source-diversity-history-design.md @@ -0,0 +1,83 @@ +# Design: Source Diversity via Recent History + +**Date**: 2026-03-23 +**Scope**: Inject recently-used domains into the search prompt to encourage source diversity across syntheses + +--- + +## Context + +Users notice that successive syntheses reuse the same sources (TechCrunch, The Verge, etc.). Within a single synthesis, the `limit_articles_per_source` filter already caps per-domain articles. But across syntheses over time, the LLM gravitates toward the same popular domains. By telling the LLM which domains were recently used, it can prioritize different sources. 
+ +## New User Setting + +- **Field:** `source_diversity_window` in `UserSettings` +- **Type:** `i32` (non-optional, matches existing pattern) +- **Default:** 3 +- **Validation:** 0-10 (0 = disabled) +- **Migration:** `ALTER TABLE settings ADD COLUMN source_diversity_window INTEGER NOT NULL DEFAULT 3` +- **Frontend label:** "Syntheses a examiner pour diversite" + +## Mechanism + +1. At generation time, if `source_diversity_window > 0`, query the user's last N syntheses from the DB (ordered by `created_at DESC`, limit N). +2. Parse the `sections` JSONB from each synthesis, extract all article URLs, convert to domains via `host_str()`. +3. Deduplicate the domain list. +4. Pass the domain list to `build_search_prompt`, which appends a soft instruction: + "Evite si possible les sources deja utilisees recemment : domaine1.com, domaine2.com, ..." +5. The LLM treats this as guidance, not a hard constraint — if no alternative exists, it can still use those domains. + +## Files to modify + +- **Create:** migration `20260323000013_add_source_diversity_window.sql` +- **Modify:** `backend/src/models/settings.rs` — add field to `UserSettings`, `SettingsResponse`, `UpdateSettingsRequest` + `Default` impl + validation (0-10) +- **Modify:** `backend/src/db/settings.rs` — add to `SettingsRow` struct, `TryFrom` impl, and both SQL queries (`get_or_create_default` + `upsert`: INSERT columns, VALUES, RETURNING, ON CONFLICT SET, .bind()) +- **Modify:** `backend/src/services/synthesis.rs` — before calling `build_search_prompt`, load recent syntheses via existing `db::syntheses::list_for_user`, extract domains using `extract_domain` (same module, private fn), pass domain list to the prompt builder +- **Modify:** `backend/src/services/prompts.rs` — add `recent_domains: &[String]` parameter to `build_search_prompt`, append soft avoidance instruction if non-empty. Update the call site in `synthesis.rs` (~line 304) to pass the domain list as the 4th argument. 
+- **Modify:** `backend/src/services/prompts.rs` tests — add `source_diversity_window` to test fixture, test with/without recent domains +- **Modify:** `frontend/src/types.ts` — add field to `UserSettings` + `DEFAULT_SETTINGS` +- **Modify:** `frontend/src/i18n/fr.ts` — add label +- **Modify:** `frontend/src/pages/Settings.tsx` — add number input + +**Note:** No new DB query function needed — the existing `db::syntheses::list_for_user(pool, user_id, limit, offset)` already returns full `Synthesis` records with `sections` JSONB. For a window of 3-10 syntheses (15-150 KB of JSON), application-level domain extraction is pragmatically fine for a single-tenant deployment. + +## Domain extraction from existing syntheses + +The `sections` column is JSONB with structure: +```json +[ + { + "title": "Category Name", + "items": [ + { "title": "...", "url": "https://example.com/article", "summary": "..." } + ] + } +] +``` + +Extract domains by parsing each item's `url` with `url::Url::parse` and `host_str()`. Reuse the existing `extract_domain` function in `synthesis.rs` (private fn, same module). + +## Unit tests + +- `build_search_prompt` with non-empty `recent_domains` → prompt contains avoidance instruction +- `build_search_prompt` with empty `recent_domains` → prompt unchanged +- Validation of `source_diversity_window` bounds (0 and 10 pass, -1 and 11 fail) + +## Prompt modification + +In `build_search_prompt`, add an optional parameter `recent_domains: &[String]`. If non-empty, append to the user prompt: + +``` +Evite si possible les sources deja utilisees dans les syntheses precedentes : domaine1.com, domaine2.com, ... +``` + +This is a soft instruction — the LLM can still use these domains if no alternatives are available. 
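Mechanism steps 3 and 4 above (deduplicate, then append the soft instruction) can be sketched in isolation as follows. This is a minimal illustration using only the standard library; the helper names `dedup_domains` and `append_avoidance` are hypothetical and are not part of the codebase, where this logic is meant to live inside the generation pipeline and `build_search_prompt`.

```rust
/// Hypothetical helper: sort and deduplicate the extracted domains
/// so each one is listed once in the prompt.
fn dedup_domains(mut domains: Vec<String>) -> Vec<String> {
    domains.sort();
    domains.dedup(); // removes adjacent duplicates, hence the sort first
    domains
}

/// Hypothetical helper: append the soft avoidance instruction when recent
/// domains are known; return the prompt unchanged when the list is empty.
fn append_avoidance(user_prompt: String, recent_domains: &[String]) -> String {
    if recent_domains.is_empty() {
        return user_prompt;
    }
    format!(
        "{}\n\nEvite si possible les sources deja utilisees dans les syntheses precedentes : {}.",
        user_prompt,
        recent_domains.join(", ")
    )
}
```

With an empty slice the prompt comes back untouched, which is what makes `source_diversity_window = 0` behave as a true no-op.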
+ +## What does NOT change + +- JSON schema — no changes +- Scraper — no changes +- Rewrite pass — no changes +- `limit_articles_per_source` — still enforces hard cap within a single synthesis +- `dedup_by_url` — still deduplicates within a single synthesis +- No new database table — domains are extracted from existing `syntheses.sections` JSONB diff --git a/docs/superpowers/specs/2026-03-23-source-diversity-limit-design.md b/docs/superpowers/specs/2026-03-23-source-diversity-limit-design.md new file mode 100644 index 0000000..c699fe4 --- /dev/null +++ b/docs/superpowers/specs/2026-03-23-source-diversity-limit-design.md @@ -0,0 +1,104 @@ +# Design: Source Diversity Limit (max articles per source) + +**Date**: 2026-03-23 +**Scope**: Limit the number of articles from the same website across all categories in a synthesis + +--- + +## Context + +Generated syntheses can be dominated by a single source (e.g., 8 articles from openai.com across categories). Users want source diversity — at most N articles from the same website, with articles spread across categories rather than clustered in one. + +## Approach + +Add a post-parse filter function that enforces a per-domain article limit after the LLM search pass and before scraping. A new user setting controls the limit. 
+
+## New User Setting
+
+- **Field:** `max_articles_per_source` in `UserSettings`
+- **Type:** `i32` (non-optional, matches `max_items_per_category` pattern)
+- **Validation:** 1-10
+- **Migration:** `ALTER TABLE settings ADD COLUMN max_articles_per_source INTEGER NOT NULL DEFAULT 3`
+- **Frontend label:** "Articles max par source"
+- **Note:** 10 effectively means "no practical limit for most use cases"
+
+## Filter Function
+
+**Name:** `limit_articles_per_source`
+
+**Signature:** `fn limit_articles_per_source(parsed: Vec<(String, Vec<NewsItem>)>, max_per_source: i32) -> Vec<(String, Vec<NewsItem>)>`
+
+**Pipeline position:** after `filter_homepage_urls`, before `scrape_articles`
+
+**Domain extraction:** Parse URL with `url::Url`, extract via `host_str()` (e.g., `https://openai.com/blog/post` → `openai.com`). If URL can't be parsed, keep the article (don't drop on parse failure).
+
+**Known limitation:** Subdomains are treated as different sources (`blog.example.com` ≠ `www.example.com`). This is pragmatic for v1; registrable domain extraction (eTLD+1) can be added later if needed.
+
+**Algorithm:**
+1. **Pass 1 (spread):** For each category (in order), keep at most 1 article per domain. Keep the first article seen for each domain in that category; move the remaining articles from that domain to a "dropped" list.
+2. **Cap enforcement:** If any domain exceeds `max_per_source` after pass 1 (possible when categories > limit), trim that domain's articles down to `max_per_source`, keeping the earliest categories in order and dropping the rest.
+3. **Pass 2 (fill):** Iterate over dropped articles in their original order (categories in order, items within each category in order). Re-add each article to its original category if the domain is still under `max_per_source`.
+4. Return the filtered list (same category keys, fewer items per category).
+
+**Example** with `max_per_source = 3`:
+
+Before:
+- Category A: openai.com×3, techcrunch.com×1
+- Category B: openai.com×2, theverge.com×2
+
+After pass 1 (1 per domain per category):
+- Category A: openai.com×1, techcrunch.com×1
+- Category B: openai.com×1, theverge.com×1
+- Dropped: openai.com×3 (2 from A, 1 from B), theverge.com×1
+- Global: openai=2, techcrunch=1, theverge=1
+
+Cap enforcement: openai=2 ≤ 3, no trimming needed.
+
+After pass 2 (fill up to max, dropped articles re-added to original category):
+- openai has 1 slot left → add 1 openai article back to Category A
+- theverge has 2 slots left → add 1 theverge article back to Category B
+- Final: 3 openai total, 1 techcrunch, 2 theverge
+
+**Edge case** with `max_per_source = 2`, 5 categories all with 1 openai.com article:
+
+After pass 1: 5 openai articles (1 per category) → exceeds limit of 2.
+Cap enforcement: trim to 2 openai articles, keeping categories A and B (first two in order), dropping C/D/E.
+Pass 2: no dropped openai articles to re-add (already at limit).
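The worked example and the edge case above can be checked mechanically. The sketch below is a simplified model rather than the production filter: each article is reduced to its domain string, so URL parsing and the "keep unparseable URLs" rule are left out, and the name `limit_domains` is hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified model of the three-phase filter. Each inner Vec is one
/// category, and each entry is the domain of one article in that category.
fn limit_domains<'a>(categories: Vec<Vec<&'a str>>, max: usize) -> Vec<Vec<&'a str>> {
    // Pass 1 (spread): keep the first article per domain in each category,
    // remember the others together with their category index.
    let mut kept: Vec<Vec<&'a str>> = Vec::new();
    let mut dropped: Vec<(usize, &'a str)> = Vec::new();
    for (cat_idx, items) in categories.into_iter().enumerate() {
        let mut seen: HashSet<&'a str> = HashSet::new();
        let mut cat_kept = Vec::new();
        for domain in items {
            if seen.insert(domain) {
                cat_kept.push(domain);
            } else {
                dropped.push((cat_idx, domain));
            }
        }
        kept.push(cat_kept);
    }

    // Cap enforcement: walking categories in order, keep only the first
    // `max` articles of each domain.
    let mut counts: HashMap<&'a str, usize> = HashMap::new();
    for cat in kept.iter_mut() {
        cat.retain(|domain| {
            let count = counts.entry(*domain).or_insert(0);
            if *count >= max {
                false
            } else {
                *count += 1;
                true
            }
        });
    }

    // Pass 2 (fill): re-add dropped articles to their original category
    // while the domain is still under the cap.
    for (cat_idx, domain) in dropped {
        let count = counts.entry(domain).or_insert(0);
        if *count < max {
            *count += 1;
            kept[cat_idx].push(domain);
        }
    }

    kept
}
```

Fed the two categories from the example with `max = 3`, the model ends with 3 openai.com, 1 techcrunch.com and 2 theverge.com entries; fed five single-article categories with `max = 2`, only the first two categories keep their article.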
+ +## Integration + +``` +parse_llm_output → filter_homepage_urls → limit_articles_per_source → scrape_articles +``` + +Call site in `run_generation_inner`: +```rust +let parsed = limit_articles_per_source(parsed, settings.max_articles_per_source); +``` + +## Files to modify + +- **Create:** migration `20260323000012_add_max_articles_per_source.sql` +- **Modify:** `backend/src/models/settings.rs` — add field to `UserSettings`, `SettingsResponse`, `UpdateSettingsRequest` + validation +- **Modify:** `backend/src/db/settings.rs` — add column to all SQL queries + `SettingsRow` +- **Modify:** `backend/src/services/synthesis.rs` — add filter function + call it +- **Modify:** `frontend/src/pages/Settings.tsx` — add number input in the generation settings grid +- **Modify:** `frontend/src/i18n/fr.ts` — add label translation +- **Modify:** `frontend/src/types.ts` — add field to Settings type + +## Unit tests + +In `synthesis.rs` tests: +- 5 openai.com articles across 2 categories, max=3 → keeps 3, spread across categories +- All articles from different domains → nothing dropped +- `max_per_source = 1` → at most 1 per domain total +- More categories than max (5 categories, 1 openai each, max=2) → caps at 2 +- Empty input → empty output +- Articles with unparseable URLs → kept + +## What does NOT change + +- LLM prompts — no instruction about source diversity +- JSON schema — no changes +- Scraper — no changes +- Rewrite pass — operates on already-filtered articles