# Source Diversity Limit — Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Limit the number of articles from the same website across all categories, with source diversity spread across categories.

**Architecture:** New `i32` field in UserSettings + migration, post-parse filter function in the generation pipeline, frontend number input.

**Tech Stack:** Rust (sqlx, url crate), SolidJS, PostgreSQL

**Spec:** `docs/superpowers/specs/2026-03-23-source-diversity-limit-design.md`

---

### Task 1: Database migration

**Files:**
- Create: `backend/migrations/20260323000012_add_max_articles_per_source.sql`

- [ ] **Step 1: Create migration**

```sql
ALTER TABLE settings ADD COLUMN max_articles_per_source INTEGER NOT NULL DEFAULT 3;
```

- [ ] **Step 2: Update CLAUDE.md migration count**

Change `## Database (11 migrations)` to `## Database (12 migrations)`.

- [ ] **Step 3: Commit**

```bash
git add backend/migrations/20260323000012_add_max_articles_per_source.sql CLAUDE.md
git commit -m "feat: add max_articles_per_source column to user_settings"
```

---

### Task 2: Backend model + DB queries

**Files:**
- Modify: `backend/src/models/settings.rs`
- Modify: `backend/src/db/settings.rs`

- [ ] **Step 1: Add field to all three structs in `models/settings.rs`**

Add `pub max_articles_per_source: i32` to:
- `UserSettings` (after `max_items_per_category`)
- `SettingsResponse` (after `max_items_per_category`)
- `UpdateSettingsRequest` (after `max_items_per_category`)

Add the field to `impl From<UserSettings> for SettingsResponse`:

```rust
max_articles_per_source: s.max_articles_per_source,
```

Add validation in `UpdateSettingsRequest::validate()`:

```rust
if !(1..=10).contains(&self.max_articles_per_source) {
    return Err("max_articles_per_source must be between 1 and 10".into());
}
```

Add to `impl Default for UserSettings`:

```rust
max_articles_per_source: 3,
```

- [ ] **Step 2: Add column to all SQL queries in `db/settings.rs`**

Add `max_articles_per_source` to:
- `SettingsRow` struct field
- `get_or_create` INSERT column list, VALUES, RETURNING, and `.bind()`
- `upsert` INSERT column list, VALUES, RETURNING, ON CONFLICT SET, and `.bind()`
- `UserSettings::try_from(SettingsRow)` mapping

This follows the exact same pattern as `max_items_per_category` in every query.
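
As a rough illustration of the `UserSettings::try_from(SettingsRow)` mapping listed above, the sketch below uses trimmed-down, hypothetical versions of both structs that carry only the two limit fields. The `String` error type and the range check are assumptions made for the example; the real conversion may fail for different reasons or carry many more fields.

```rust
use std::convert::TryFrom;

/// Hypothetical, trimmed-down row type; the real `SettingsRow` has more columns.
struct SettingsRow {
    max_items_per_category: i32,
    max_articles_per_source: i32,
}

/// Hypothetical, trimmed-down model; the real `UserSettings` has more fields.
#[derive(Debug, PartialEq)]
struct UserSettings {
    max_items_per_category: i32,
    max_articles_per_source: i32,
}

impl TryFrom<SettingsRow> for UserSettings {
    // Assumption: a plain String error keeps the sketch dependency-free.
    type Error = String;

    fn try_from(row: SettingsRow) -> Result<Self, Self::Error> {
        // Assumption: reject stored values outside the 1..=10 range that
        // `UpdateSettingsRequest::validate()` enforces on incoming requests.
        if !(1..=10).contains(&row.max_articles_per_source) {
            return Err(format!(
                "max_articles_per_source out of range: {}",
                row.max_articles_per_source
            ));
        }
        Ok(UserSettings {
            max_items_per_category: row.max_items_per_category,
            // The new field is mapped exactly like its neighbour.
            max_articles_per_source: row.max_articles_per_source,
        })
    }
}
```

Converting a row with `max_articles_per_source: 3` yields an `Ok` value carrying the same number, while a row holding `0` is rejected under the assumed range check.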

- [ ] **Step 3: Run tests**

Run: `cd backend && cargo test --lib`
Expected: all tests pass (existing settings tests use `Default`, which now includes the new field)

- [ ] **Step 4: Commit**

```bash
git add backend/src/models/settings.rs backend/src/db/settings.rs
git commit -m "feat: add max_articles_per_source to settings model and DB queries"
```

---

### Task 3: Filter function with unit tests

**Files:**
- Modify: `backend/src/services/synthesis.rs`

- [ ] **Step 1: Add the `limit_articles_per_source` function**

Add after `filter_homepage_urls`:

```rust
/// Limit the number of articles from the same domain across all categories.
///
/// Spreads articles across categories first (at most 1 per domain per category),
/// then fills remaining slots from dropped articles in encounter order.
fn limit_articles_per_source(
    parsed: Vec<(String, Vec<NewsItem>)>,
    max_per_source: i32,
) -> Vec<(String, Vec<NewsItem>)> {
    let max = max_per_source as usize;

    // Pass 1: keep at most 1 article per domain per category
    let mut kept: Vec<(String, Vec<NewsItem>)> = Vec::new();
    let mut dropped: Vec<(usize, NewsItem)> = Vec::new(); // (category_index, item)
    let mut domain_counts: std::collections::HashMap<String, usize> =
        std::collections::HashMap::new();

    for (cat_idx, (cat_key, items)) in parsed.into_iter().enumerate() {
        let mut cat_kept = Vec::new();
        let mut seen_in_cat: std::collections::HashSet<String> =
            std::collections::HashSet::new();
        for item in items {
            let domain = extract_domain(&item.url);
            if let Some(ref d) = domain {
                if seen_in_cat.contains(d) {
                    dropped.push((cat_idx, item));
                    continue;
                }
                seen_in_cat.insert(d.clone());
                *domain_counts.entry(d.clone()).or_insert(0) += 1;
            }
            cat_kept.push(item);
        }
        kept.push((cat_key, cat_kept));
    }

    // Cap enforcement: if any domain exceeds max after pass 1 (when categories > max),
    // keep the first max articles in category order, drop the rest.
    let mut cap_counts: std::collections::HashMap<String, usize> =
        std::collections::HashMap::new();
    for (_, items) in &mut kept {
        items.retain(|item| {
            let domain = extract_domain(&item.url);
            match domain {
                Some(ref d) => {
                    let count = cap_counts.entry(d.clone()).or_insert(0);
                    if *count >= max {
                        false
                    } else {
                        *count += 1;
                        true
                    }
                }
                None => true, // keep unparseable URLs
            }
        });
    }

    // Use cap_counts as the authoritative domain counts going forward
    let mut domain_counts = cap_counts;

    // Pass 2: fill from dropped articles, back into their original category
    for (cat_idx, item) in dropped {
        if let Some(d) = extract_domain(&item.url) {
            let count = domain_counts.get(&d).copied().unwrap_or(0);
            if count < max {
                *domain_counts.entry(d).or_insert(0) += 1;
                kept[cat_idx].1.push(item);
            }
        } else {
            // Unparseable URL: keep it
            kept[cat_idx].1.push(item);
        }
    }

    kept
}

/// Extract the domain (host) from a URL, or None if unparseable.
fn extract_domain(url: &str) -> Option<String> {
    url::Url::parse(url)
        .ok()
        .and_then(|u| u.host_str().map(|h| h.to_lowercase()))
}
```

- [ ] **Step 2: Wire it into the pipeline**

In `run_generation_inner`, after `filter_homepage_urls` (line 315) and before the scrape step, add:

```rust
// Step 7c: Limit articles per source for diversity
let parsed = limit_articles_per_source(parsed, settings.max_articles_per_source);
```

- [ ] **Step 3: Add unit tests**

Add to the `#[cfg(test)] mod tests` block at the bottom of `synthesis.rs`:

```rust
// ── limit_articles_per_source tests ────────────────────────────

#[test]
fn source_limit_spreads_across_categories() {
    let parsed = vec![
        ("category_0".into(), vec![
            NewsItem { title: "A1".into(), url: "https://openai.com/blog/a".into(), summary: "s".into() },
            NewsItem { title: "A2".into(), url: "https://openai.com/blog/b".into(), summary: "s".into() },
            NewsItem { title: "A3".into(), url: "https://openai.com/blog/c".into(), summary: "s".into() },
            NewsItem { title: "A4".into(), url: "https://techcrunch.com/x".into(), summary: "s".into() },
        ]),
        ("category_1".into(), vec![
            NewsItem { title: "B1".into(), url: "https://openai.com/research/d".into(), summary: "s".into() },
            NewsItem { title: "B2".into(), url: "https://openai.com/research/e".into(), summary: "s".into() },
            NewsItem { title: "B3".into(), url: "https://theverge.com/y".into(), summary: "s".into() },
        ]),
    ];

    let result = limit_articles_per_source(parsed, 3);

    // Count openai.com articles across all categories
    let openai_count: usize = result.iter()
        .flat_map(|(_, items)| items)
        .filter(|i| i.url.contains("openai.com"))
        .count();
    assert_eq!(openai_count, 3, "Should keep exactly 3 openai.com articles");

    // Both categories should have at least 1 openai article (spread)
    let cat0_openai = result[0].1.iter().filter(|i| i.url.contains("openai.com")).count();
    let cat1_openai = result[1].1.iter().filter(|i| i.url.contains("openai.com")).count();
    assert!(cat0_openai >= 1, "Category 0 should have at least 1 openai article");
    assert!(cat1_openai >= 1, "Category 1 should have at least 1 openai article");

    // techcrunch and theverge should be untouched
    let tc_count: usize = result.iter()
        .flat_map(|(_, items)| items)
        .filter(|i| i.url.contains("techcrunch"))
        .count();
    assert_eq!(tc_count, 1);
}

#[test]
fn source_limit_all_different_domains() {
    let parsed = vec![
        ("category_0".into(), vec![
            NewsItem { title: "A".into(), url: "https://a.com/1".into(), summary: "s".into() },
            NewsItem { title: "B".into(), url: "https://b.com/2".into(), summary: "s".into() },
        ]),
    ];

    let result = limit_articles_per_source(parsed, 3);
    assert_eq!(result[0].1.len(), 2, "Nothing dropped when all domains are unique");
}

#[test]
fn source_limit_max_one() {
    let parsed = vec![
        ("category_0".into(), vec![
            NewsItem { title: "A".into(), url: "https://openai.com/a".into(), summary: "s".into() },
            NewsItem { title: "B".into(), url: "https://openai.com/b".into(), summary: "s".into() },
        ]),
        ("category_1".into(), vec![
            NewsItem { title: "C".into(), url: "https://openai.com/c".into(), summary: "s".into() },
        ]),
    ];

    let result = limit_articles_per_source(parsed, 1);
    let total: usize = result.iter()
        .flat_map(|(_, items)| items)
        .filter(|i| i.url.contains("openai.com"))
        .count();
    assert_eq!(total, 1, "max=1 should keep exactly 1 openai article");
}

#[test]
fn source_limit_more_categories_than_max() {
    // 5 categories, each with 1 openai article, max=2
    let parsed: Vec<(String, Vec<NewsItem>)> = (0..5)
        .map(|i| (
            format!("category_{}", i),
            vec![NewsItem {
                title: format!("Art{}", i),
                url: format!("https://openai.com/{}", i),
                summary: "s".into(),
            }],
        ))
        .collect();

    let result = limit_articles_per_source(parsed, 2);
    let total: usize = result.iter().flat_map(|(_, items)| items).count();
    assert_eq!(total, 2, "Should cap at max_per_source even with more categories");
}

#[test]
fn source_limit_empty_input() {
    let result = limit_articles_per_source(vec![], 3);
    assert!(result.is_empty());
}

#[test]
fn source_limit_unparseable_urls_kept() {
    let parsed = vec![
        ("category_0".into(), vec![
            NewsItem { title: "Good".into(), url: "https://openai.com/a".into(), summary: "s".into() },
            NewsItem { title: "Bad".into(), url: "not-a-url".into(), summary: "s".into() },
        ]),
    ];

    let result = limit_articles_per_source(parsed, 3);
    assert_eq!(result[0].1.len(), 2, "Unparseable URLs should be kept");
}
```

- [ ] **Step 4: Run tests**

Run: `cd backend && cargo test --lib`
Expected: all tests pass, including the 6 new ones

- [ ] **Step 5: Commit**

```bash
git add backend/src/services/synthesis.rs
git commit -m "feat: add limit_articles_per_source filter with unit tests"
```

---

### Task 4: Frontend setting

**Files:**
- Modify: `frontend/src/types.ts`
- Modify: `frontend/src/i18n/fr.ts`
- Modify: `frontend/src/pages/Settings.tsx`

- [ ] **Step 1: Add field to frontend types**

In `frontend/src/types.ts`, add to `UserSettings` interface after `max_items_per_category`:

```typescript
max_articles_per_source: number;
```

- [ ] **Step 2: Add i18n label**

In `frontend/src/i18n/fr.ts`, add after the `settings.maxItems` line:

```typescript
'settings.maxArticlesPerSource': 'Articles max par source',
```

- [ ] **Step 3: Add number input to Settings page**

In `frontend/src/pages/Settings.tsx`, inside the `sm:grid-cols-2` grid (before its closing `</div>` around line 403), add a new `<div>` as a third child of the grid:

```tsx
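{/* Only the onInput handler survived in this plan; the wrapper, label, `t()`
    helper and `settings()` accessor are assumptions. Copy the actual markup
    and class names from the neighbouring inputs in Settings.tsx. */}
<div>
  <label>{t('settings.maxArticlesPerSource')}</label>
  <input
    type="number"
    min="1"
    max="10"
    value={settings().max_articles_per_source}
    onInput={(e) =>
      setSettings((prev) => ({
        ...prev,
        max_articles_per_source: parseInt(e.currentTarget.value, 10) || 3,
      }))
    }
  />
</div>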
```

Also add `max_articles_per_source: 3` to the default settings initializer if one exists.

- [ ] **Step 4: Run frontend tests and type check**

Run: `cd frontend && npx tsc --noEmit && npx vitest run`
Expected: type check passes, all tests pass

- [ ] **Step 5: Commit**

```bash
git add frontend/src/types.ts frontend/src/i18n/fr.ts frontend/src/pages/Settings.tsx
git commit -m "feat: add max_articles_per_source setting to frontend"
```

---

### Task 5: E2E verification

- [ ] **Step 1: Rebuild and run Docker stack**

```bash
docker compose down && docker compose up --build
```

- [ ] **Step 2: Verify the setting appears in the Settings page**

Navigate to Settings, verify the "Articles max par source" number input is visible with default value 3.

- [ ] **Step 3: Generate a synthesis and verify source diversity**

Change the setting to 2, generate a synthesis, verify no domain appears more than 2 times across all categories.
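
The manual check in Step 3 could be backed by a small helper that counts hosts in a list of article URLs and reports any that exceed the limit. This is a minimal, standard-library-only sketch: `host_of` and `hosts_over_limit` are hypothetical names, and the hand-rolled host extraction is a simplified stand-in for the `url` crate used by `extract_domain` (it does not handle IPv6 literals, for instance).

```rust
use std::collections::HashMap;

/// Simplified host extraction for illustration only (hypothetical helper).
/// The plan's `extract_domain` relies on the `url` crate instead.
fn host_of(url: &str) -> Option<String> {
    // Absolute URLs carry an explicit scheme separator.
    let (_, rest) = url.split_once("://")?;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    // Drop optional userinfo ("user@") and port (":8080").
    let host = authority.rsplit('@').next()?;
    let host = host.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_lowercase())
    }
}

/// Return every host that appears more than `max` times in `urls`,
/// sorted by host name. URLs without a recognisable host are ignored.
fn hosts_over_limit(urls: &[&str], max: usize) -> Vec<(String, usize)> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for url in urls {
        if let Some(host) = host_of(url) {
            *counts.entry(host).or_insert(0) += 1;
        }
    }
    let mut over: Vec<(String, usize)> = counts
        .into_iter()
        .filter(|(_, n)| *n > max)
        .collect();
    over.sort();
    over
}
```

Feeding it the URLs collected from a generated synthesis with `max` set to 2 should return an empty list if the limit was respected.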