Source Diversity Limit — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Cap the number of articles that come from the same website, counted across all categories, and spread each source's articles across categories before allowing repeats within a single category.

Architecture: New i32 field in UserSettings + migration, post-parse filter function in the generation pipeline, frontend number input.

Tech Stack: Rust (sqlx, url crate), SolidJS, PostgreSQL

Spec: docs/superpowers/specs/2026-03-23-source-diversity-limit-design.md


Task 1: Database migration

Files:

  • Create: backend/migrations/20260323000012_add_max_articles_per_source.sql

  • Step 1: Create migration

ALTER TABLE settings ADD COLUMN max_articles_per_source INTEGER NOT NULL DEFAULT 3;
  • Step 2: Update CLAUDE.md migration count

Change ## Database (11 migrations) to ## Database (12 migrations).

  • Step 3: Commit
git add backend/migrations/20260323000012_add_max_articles_per_source.sql CLAUDE.md
git commit -m "feat: add max_articles_per_source column to settings table"

Task 2: Backend model + DB queries

Files:

  • Modify: backend/src/models/settings.rs

  • Modify: backend/src/db/settings.rs

  • Step 1: Add field to all three structs in models/settings.rs

Add pub max_articles_per_source: i32 to:

  • UserSettings (after max_items_per_category)
  • SettingsResponse (after max_items_per_category)
  • UpdateSettingsRequest (after max_items_per_category)

Add the field to impl From<UserSettings> for SettingsResponse:

max_articles_per_source: s.max_articles_per_source,

Add validation in UpdateSettingsRequest::validate():

if !(1..=10).contains(&self.max_articles_per_source) {
    return Err("max_articles_per_source must be between 1 and 10".into());
}

Add to impl Default for UserSettings:

max_articles_per_source: 3,
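
The range check above can be exercised in isolation. The following is a minimal sketch, assuming a reduced `UpdateSettingsRequest` that carries only the new field and a `validate()` that returns `Result<(), String>`; the real struct has more fields and its error type may differ.

```rust
/// Hypothetical, reduced stand-in for the real request struct:
/// only the new field is modeled here.
pub struct UpdateSettingsRequest {
    pub max_articles_per_source: i32,
}

impl UpdateSettingsRequest {
    /// Rejects values outside the inclusive range 1..=10.
    /// The `String` error type is an assumption made for this sketch.
    pub fn validate(&self) -> Result<(), String> {
        if !(1..=10).contains(&self.max_articles_per_source) {
            return Err("max_articles_per_source must be between 1 and 10".into());
        }
        Ok(())
    }
}
```

With this shape, 0 and 11 are rejected while 1 and 10 are accepted, which matches the `min="1"` and `max="10"` bounds used on the frontend input.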
  • Step 2: Add column to all SQL queries in db/settings.rs

Add max_articles_per_source to:

  • SettingsRow struct field
  • get_or_create INSERT column list, VALUES, RETURNING, and .bind()
  • upsert INSERT column list, VALUES, RETURNING, ON CONFLICT SET, and .bind()
  • UserSettings::try_from(SettingsRow) mapping

This follows the exact same pattern as max_items_per_category in every query.
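
As an illustration of the row-to-model mapping in the last bullet, here is a minimal sketch. It assumes reduced `SettingsRow` and `UserSettings` types with only two fields and a `String` error type; the real types carry more columns and may use a different error.

```rust
use std::convert::TryFrom;

/// Hypothetical, reduced row type; the real SettingsRow has more columns.
pub struct SettingsRow {
    pub max_items_per_category: i32,
    pub max_articles_per_source: i32,
}

/// Hypothetical, reduced model type mirroring the row above.
pub struct UserSettings {
    pub max_items_per_category: i32,
    pub max_articles_per_source: i32,
}

impl TryFrom<SettingsRow> for UserSettings {
    // Assumption: the real conversion may fail on other columns.
    type Error = String;

    fn try_from(row: SettingsRow) -> Result<Self, Self::Error> {
        Ok(UserSettings {
            max_items_per_category: row.max_items_per_category,
            // New field, copied the same way as max_items_per_category.
            max_articles_per_source: row.max_articles_per_source,
        })
    }
}
```

If the new field is missing from either struct or from the mapping, the compiler reports it, which makes this step easy to check before running the tests.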

  • Step 3: Run tests

Run: cd backend && cargo test --lib

Expected: all tests pass (existing settings tests use Default, which now includes the new field)

  • Step 4: Commit
git add backend/src/models/settings.rs backend/src/db/settings.rs
git commit -m "feat: add max_articles_per_source to settings model and DB queries"

Task 3: Filter function with unit tests

Files:

  • Modify: backend/src/services/synthesis.rs

  • Step 1: Add the limit_articles_per_source function

Add after filter_homepage_urls:

/// Limit the number of articles from the same domain across all categories.
///
/// Spreads articles across categories first (at most 1 per domain per category),
/// enforces the global cap in category order, then fills remaining slots from
/// dropped articles in encounter order.
fn limit_articles_per_source(
    parsed: Vec<(String, Vec<NewsItem>)>,
    max_per_source: i32,
) -> Vec<(String, Vec<NewsItem>)> {
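    // Settings validation restricts this value to 1..=10, so the cast to usize
    // is not expected to wrap.
    let max = max_per_source as usize;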

    // Pass 1: keep at most 1 article per domain per category
    let mut kept: Vec<(String, Vec<NewsItem>)> = Vec::new();
    let mut dropped: Vec<(usize, NewsItem)> = Vec::new(); // (category_index, item)

    for (cat_idx, (cat_key, items)) in parsed.into_iter().enumerate() {
        let mut cat_kept = Vec::new();
        let mut seen_in_cat: std::collections::HashSet<String> = std::collections::HashSet::new();

        for item in items {
            let domain = extract_domain(&item.url);
            if let Some(ref d) = domain {
                if seen_in_cat.contains(d) {
                    dropped.push((cat_idx, item));
                    continue;
                }
                // Global counts are computed in the cap-enforcement step below.
                seen_in_cat.insert(d.clone());
            }
            cat_kept.push(item);
        }

        kept.push((cat_key, cat_kept));
    }

    // Cap enforcement: if any domain exceeds max after pass 1 (when categories > max),
    // keep the first max articles in category order, drop the rest.
    let mut cap_counts: std::collections::HashMap<String, usize> = std::collections::HashMap::new();
    for (_, items) in &mut kept {
        items.retain(|item| {
            let domain = extract_domain(&item.url);
            match domain {
                Some(ref d) => {
                    let count = cap_counts.entry(d.clone()).or_insert(0);
                    if *count >= max {
                        false
                    } else {
                        *count += 1;
                        true
                    }
                }
                None => true, // keep unparseable URLs
            }
        });
    }

    // Use cap_counts as the authoritative domain counts going forward
    let mut domain_counts = cap_counts;

    // Pass 2: fill from dropped articles, back into their original category
    for (cat_idx, item) in dropped {
        if let Some(d) = extract_domain(&item.url) {
            let count = domain_counts.get(&d).copied().unwrap_or(0);
            if count < max {
                *domain_counts.entry(d).or_insert(0) += 1;
                kept[cat_idx].1.push(item);
            }
        } else {
            // Unparseable URL — keep it
            kept[cat_idx].1.push(item);
        }
    }

    kept
}

/// Extract the domain (host) from a URL, or None if unparseable.
fn extract_domain(url: &str) -> Option<String> {
    url::Url::parse(url)
        .ok()
        .and_then(|u| u.host_str().map(|h| h.to_lowercase()))
}
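
To make the pass order easier to follow, here is a simplified, self-contained model of the same three steps. It is a sketch under stated assumptions: each article is reduced to its domain string, each inner `Vec` is one category, and `limit_domains` is a hypothetical name that does not exist in the codebase.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified model of the filter: articles are represented by their domain
/// only. `limit_domains` is a hypothetical name used for illustration.
fn limit_domains(categories: Vec<Vec<&str>>, max: usize) -> Vec<Vec<&str>> {
    // Pass 1: keep the first article per domain in each category,
    // remember the rest together with their category index.
    let mut kept: Vec<Vec<&str>> = Vec::new();
    let mut dropped: Vec<(usize, &str)> = Vec::new();
    for (cat_idx, domains) in categories.into_iter().enumerate() {
        let mut seen = HashSet::new();
        let mut cat_kept = Vec::new();
        for d in domains {
            if seen.insert(d) {
                cat_kept.push(d);
            } else {
                dropped.push((cat_idx, d));
            }
        }
        kept.push(cat_kept);
    }

    // Cap enforcement: walk categories in order, keep at most `max` per domain.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for cat in &mut kept {
        cat.retain(|d| {
            let count = counts.entry(*d).or_insert(0);
            if *count >= max {
                false
            } else {
                *count += 1;
                true
            }
        });
    }

    // Pass 2: return dropped articles to their category while under the cap.
    for (cat_idx, d) in dropped {
        let count = counts.entry(d).or_insert(0);
        if *count < max {
            *count += 1;
            kept[cat_idx].push(d);
        }
    }
    kept
}
```

For two categories holding `[openai.com ×3, techcrunch.com]` and `[openai.com ×2, theverge.com]` with `max = 3`, this model keeps one openai.com article per category in pass 1 and returns one more to the first category in pass 2. Re-added articles are appended at the end of their category, so the original ordering within a category is not preserved.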
  • Step 2: Wire it into the pipeline

In run_generation_inner, after filter_homepage_urls (line 315) and before the scrape step, add:

    // Step 7c: Limit articles per source for diversity
    let parsed = limit_articles_per_source(parsed, settings.max_articles_per_source);
  • Step 3: Add unit tests

Add to the #[cfg(test)] mod tests block at the bottom of synthesis.rs:

    // ── limit_articles_per_source tests ────────────────────────────

    #[test]
    fn source_limit_spreads_across_categories() {
        let parsed = vec![
            ("category_0".into(), vec![
                NewsItem { title: "A1".into(), url: "https://openai.com/blog/a".into(), summary: "s".into() },
                NewsItem { title: "A2".into(), url: "https://openai.com/blog/b".into(), summary: "s".into() },
                NewsItem { title: "A3".into(), url: "https://openai.com/blog/c".into(), summary: "s".into() },
                NewsItem { title: "A4".into(), url: "https://techcrunch.com/x".into(), summary: "s".into() },
            ]),
            ("category_1".into(), vec![
                NewsItem { title: "B1".into(), url: "https://openai.com/research/d".into(), summary: "s".into() },
                NewsItem { title: "B2".into(), url: "https://openai.com/research/e".into(), summary: "s".into() },
                NewsItem { title: "B3".into(), url: "https://theverge.com/y".into(), summary: "s".into() },
            ]),
        ];

        let result = limit_articles_per_source(parsed, 3);

        // Count openai.com articles across all categories
        let openai_count: usize = result.iter()
            .flat_map(|(_, items)| items)
            .filter(|i| i.url.contains("openai.com"))
            .count();
        assert_eq!(openai_count, 3, "Should keep exactly 3 openai.com articles");

        // Both categories should have at least 1 openai article (spread)
        let cat0_openai = result[0].1.iter().filter(|i| i.url.contains("openai.com")).count();
        let cat1_openai = result[1].1.iter().filter(|i| i.url.contains("openai.com")).count();
        assert!(cat0_openai >= 1, "Category 0 should have at least 1 openai article");
        assert!(cat1_openai >= 1, "Category 1 should have at least 1 openai article");

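        // techcrunch and theverge should be untouched
        let tc_count: usize = result.iter().flat_map(|(_, items)| items).filter(|i| i.url.contains("techcrunch")).count();
        assert_eq!(tc_count, 1);
        let verge_count: usize = result.iter().flat_map(|(_, items)| items).filter(|i| i.url.contains("theverge")).count();
        assert_eq!(verge_count, 1);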
    }

    #[test]
    fn source_limit_all_different_domains() {
        let parsed = vec![
            ("category_0".into(), vec![
                NewsItem { title: "A".into(), url: "https://a.com/1".into(), summary: "s".into() },
                NewsItem { title: "B".into(), url: "https://b.com/2".into(), summary: "s".into() },
            ]),
        ];

        let result = limit_articles_per_source(parsed, 3);
        assert_eq!(result[0].1.len(), 2, "Nothing dropped when all domains are unique");
    }

    #[test]
    fn source_limit_max_one() {
        let parsed = vec![
            ("category_0".into(), vec![
                NewsItem { title: "A".into(), url: "https://openai.com/a".into(), summary: "s".into() },
                NewsItem { title: "B".into(), url: "https://openai.com/b".into(), summary: "s".into() },
            ]),
            ("category_1".into(), vec![
                NewsItem { title: "C".into(), url: "https://openai.com/c".into(), summary: "s".into() },
            ]),
        ];

        let result = limit_articles_per_source(parsed, 1);
        let total: usize = result.iter().flat_map(|(_, items)| items).filter(|i| i.url.contains("openai.com")).count();
        assert_eq!(total, 1, "max=1 should keep exactly 1 openai article");
    }

    #[test]
    fn source_limit_more_categories_than_max() {
        // 5 categories, each with 1 openai article, max=2
        let parsed: Vec<(String, Vec<NewsItem>)> = (0..5)
            .map(|i| (
                format!("category_{}", i),
                vec![NewsItem {
                    title: format!("Art{}", i),
                    url: format!("https://openai.com/{}", i),
                    summary: "s".into(),
                }],
            ))
            .collect();

        let result = limit_articles_per_source(parsed, 2);
        let total: usize = result.iter().flat_map(|(_, items)| items).count();
        assert_eq!(total, 2, "Should cap at max_per_source even with more categories");
    }

    #[test]
    fn source_limit_empty_input() {
        let result = limit_articles_per_source(vec![], 3);
        assert!(result.is_empty());
    }

    #[test]
    fn source_limit_unparseable_urls_kept() {
        let parsed = vec![
            ("category_0".into(), vec![
                NewsItem { title: "Good".into(), url: "https://openai.com/a".into(), summary: "s".into() },
                NewsItem { title: "Bad".into(), url: "not-a-url".into(), summary: "s".into() },
            ]),
        ];

        let result = limit_articles_per_source(parsed, 3);
        assert_eq!(result[0].1.len(), 2, "Unparseable URLs should be kept");
    }
  • Step 4: Run tests

Run: cd backend && cargo test --lib

Expected: all tests pass, including the 6 new ones

  • Step 5: Commit
git add backend/src/services/synthesis.rs
git commit -m "feat: add limit_articles_per_source filter with unit tests"

Task 4: Frontend setting

Files:

  • Modify: frontend/src/types.ts

  • Modify: frontend/src/i18n/fr.ts

  • Modify: frontend/src/pages/Settings.tsx

  • Step 1: Add field to frontend types

In frontend/src/types.ts, add to UserSettings interface after max_items_per_category:

max_articles_per_source: number;
  • Step 2: Add i18n label

In frontend/src/i18n/fr.ts, add after the settings.maxItems line:

'settings.maxArticlesPerSource': 'Articles max par source',
  • Step 3: Add number input to Settings page

In frontend/src/pages/Settings.tsx, inside the sm:grid-cols-2 grid (before its closing </div> around line 403), add a new <div> as a third child of the grid:

            <div>
              <label
                for="maxArticlesPerSource"
                class="block text-sm font-medium text-gray-700"
              >
                {t('settings.maxArticlesPerSource')}
              </label>
              <div class="mt-1">
                <input
                  type="number"
                  id="maxArticlesPerSource"
                  min="1"
                  max="10"
                  class="shadow-sm focus:ring-indigo-500 focus:border-indigo-500 block w-full sm:text-sm border-gray-300 rounded-md py-2 px-3 border"
                  value={settings().max_articles_per_source}
                  onInput={(e) =>
                    setSettings((prev) => ({
                      ...prev,
                      max_articles_per_source:
                        parseInt(e.currentTarget.value, 10) || 3,
                    }))
                  }
                />
              </div>
            </div>

Also add max_articles_per_source: 3 to the default settings initializer if one exists.

  • Step 4: Run frontend tests and type check

Run: cd frontend && npx tsc --noEmit && npx vitest run

Expected: type check passes, all tests pass

  • Step 5: Commit
git add frontend/src/types.ts frontend/src/i18n/fr.ts frontend/src/pages/Settings.tsx
git commit -m "feat: add max_articles_per_source setting to frontend"

Task 5: E2E verification

  • Step 1: Rebuild and run Docker stack
docker compose down && docker compose up --build
  • Step 2: Verify the setting appears in the Settings page

Navigate to Settings, verify the "Articles max par source" number input is visible with default value 3.

  • Step 3: Generate a synthesis and verify source diversity

Change the setting to 2, generate a synthesis, verify no domain appears more than 2 times across all categories.
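
The manual check in Step 3 could also be scripted. The sketch below is one possible helper, not part of the plan: `naive_host` and `domains_over_limit` are hypothetical names, and the host extraction is a deliberately naive stand-in for the `url` crate used by `extract_domain`.

```rust
use std::collections::HashMap;

/// Naive host extraction for illustration only: strips the scheme and takes
/// everything up to the next '/'. Unlike the `url` crate, it does not strip
/// ports or credentials and does not validate the input.
fn naive_host(url: &str) -> Option<String> {
    let rest = url.split_once("://")?.1;
    let host = rest.split('/').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_lowercase())
    }
}

/// Returns the domains that appear more than `max` times, sorted by name.
/// URLs without a recognizable host are ignored.
fn domains_over_limit(urls: &[&str], max: usize) -> Vec<String> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for url in urls {
        if let Some(host) = naive_host(url) {
            *counts.entry(host).or_insert(0) += 1;
        }
    }
    let mut over: Vec<String> = counts
        .into_iter()
        .filter(|(_, n)| *n > max)
        .map(|(host, _)| host)
        .collect();
    over.sort();
    over
}
```

Feeding it every article URL from a generated synthesis with `max = 2` should return an empty list if the filter behaved as intended.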