# Design: Source Diversity Limit (max articles per source)
**Date**: 2026-03-23
**Scope**: Limit the number of articles from the same website across all categories in a synthesis
---
## Context
Generated syntheses can be dominated by a single source (e.g., 8 articles from openai.com across categories). Users want source diversity — at most N articles from the same website, with articles spread across categories rather than clustered in one.
## Approach
Add a post-parse filter function that enforces a per-domain article limit after the LLM search pass and before scraping. A new user setting controls the limit.
## New User Setting
- **Field:** `max_articles_per_source` in `UserSettings`
- **Type:** `i32` (non-optional, matches `max_items_per_category` pattern)
- **Validation:** 1-10
- **Migration:** `ALTER TABLE user_settings ADD COLUMN max_articles_per_source INTEGER NOT NULL DEFAULT 3`
- **Frontend label:** "Articles max par source"
- **Note:** 10 effectively means "no practical limit for most use cases"
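The 1-10 range check could be expressed as a small helper along these lines. This is only a sketch: the function name, the constants, and the error message are hypothetical and are not taken from the existing `UpdateSettingsRequest` validation code.

```rust
// Inclusive bounds taken from the "Validation: 1-10" rule above.
// The constant names are hypothetical.
const MIN_ARTICLES_PER_SOURCE: i32 = 1;
const MAX_ARTICLES_PER_SOURCE: i32 = 10;

/// Hypothetical helper: returns the value unchanged when it is in range,
/// otherwise a human-readable error message.
fn validate_max_articles_per_source(value: i32) -> Result<i32, String> {
    if (MIN_ARTICLES_PER_SOURCE..=MAX_ARTICLES_PER_SOURCE).contains(&value) {
        Ok(value)
    } else {
        Err(format!(
            "max_articles_per_source must be between {} and {}, got {}",
            MIN_ARTICLES_PER_SOURCE, MAX_ARTICLES_PER_SOURCE, value
        ))
    }
}
```

Both bounds are inclusive, so `1` and `10` are accepted while `0` and `11` are rejected.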
## Filter Function
**Name:** `limit_articles_per_source`
**Signature:** `fn limit_articles_per_source(parsed: Vec<(String, Vec<NewsItem>)>, max_per_source: i32) -> Vec<(String, Vec<NewsItem>)>`
**Pipeline position:** after `filter_homepage_urls`, before `scrape_articles`
**Domain extraction:** Parse the URL with `url::Url` and take the host via `host_str()` (e.g., `https://openai.com/blog/post` → `openai.com`). If the URL cannot be parsed, keep the article; a parse failure never causes a drop.
**Known limitation:** Subdomains are treated as different sources (`blog.example.com` ≠ `www.example.com`). This is pragmatic for v1; registrable domain extraction (eTLD+1) can be added later if needed.
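To illustrate the intended behavior without pulling in a dependency, here is a standard-library-only approximation of the host lookup. It is a simplified stand-in for `url::Url` parsing, not an equivalent: the name `source_key` is hypothetical, IPv6 literals are not handled, and the real implementation is expected to use `host_str()` as described above.

```rust
/// Hypothetical helper: returns the lowercased host of an absolute URL,
/// or `None` when no host can be found. Simplified stand-in for parsing
/// with `url::Url` and calling `host_str()`.
fn source_key(raw_url: &str) -> Option<String> {
    // Require an explicit scheme separator, e.g. "https://".
    let (_scheme, rest) = raw_url.trim().split_once("://")?;
    // The authority ends at the first path, query, or fragment delimiter.
    let authority = rest
        .split(|c: char| matches!(c, '/' | '?' | '#'))
        .next()?;
    // Drop optional "user:pass@" credentials.
    let host_port = authority.rsplit_once('@').map_or(authority, |(_, h)| h);
    // Drop an optional ":port" suffix (IPv6 literals are not handled here).
    let host = host_port.split_once(':').map_or(host_port, |(h, _)| h);
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}
```

A `None` result corresponds to the "keep the article" branch, and because the full host is used as the key, `blog.example.com` and `www.example.com` count as separate sources.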
**Algorithm:**
1. **Pass 1 (spread):** For each category (in order), keep only the first article from each domain and move that domain's remaining articles in the category to a "dropped" list.
2. **Cap enforcement:** If any domain exceeds `max_per_source` after pass 1 (possible when the domain appears in more categories than the limit allows), trim that domain's articles down to `max_per_source`, keeping those in the earliest categories in order.
3. **Pass 2 (fill):** Iterate over dropped articles in their original order (categories in order, items within each category in order). Re-add each article to its original category if its domain is still under `max_per_source`.
4. Return the filtered list (same category keys, fewer items per category).
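The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the final implementation: `NewsItem` is reduced to a hypothetical `url` field, `host_of` is a simplified standard-library stand-in for `url::Url` host extraction, cap enforcement is folded into pass 1 (checking the cap while walking categories in order yields the same result as trimming afterwards), and re-added articles return to their original position within the category, which the spec does not mandate.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical minimal shape; the real `NewsItem` has more fields.
#[derive(Debug, Clone, PartialEq)]
struct NewsItem {
    url: String,
}

// Simplified stand-in for `url::Url` host extraction (assumption).
fn host_of(url: &str) -> Option<String> {
    let (_, rest) = url.split_once("://")?;
    let host = rest
        .split(|c: char| matches!(c, '/' | '?' | '#' | ':'))
        .next()?;
    if host.is_empty() { None } else { Some(host.to_ascii_lowercase()) }
}

fn limit_articles_per_source(
    parsed: Vec<(String, Vec<NewsItem>)>,
    max_per_source: i32,
) -> Vec<(String, Vec<NewsItem>)> {
    let max = max_per_source.max(0) as usize;
    // Host of every item, indexed as hosts[category][item].
    let hosts: Vec<Vec<Option<String>>> = parsed
        .iter()
        .map(|(_, items)| items.iter().map(|i| host_of(&i.url)).collect())
        .collect();
    // keep[category][item] marks the articles that survive.
    let mut keep: Vec<Vec<bool>> = hosts.iter().map(|h| vec![false; h.len()]).collect();
    // Global number of kept articles per domain.
    let mut counts: HashMap<&str, usize> = HashMap::new();

    // Pass 1 + cap: first article per domain per category, while under the cap.
    for (c, cat_hosts) in hosts.iter().enumerate() {
        let mut seen: HashSet<&str> = HashSet::new();
        for (i, host) in cat_hosts.iter().enumerate() {
            match host.as_deref() {
                // Unparseable URL: always keep, never counted.
                None => keep[c][i] = true,
                Some(h) => {
                    if seen.insert(h) {
                        let n = counts.entry(h).or_insert(0);
                        if *n < max {
                            *n += 1;
                            keep[c][i] = true;
                        }
                    }
                }
            }
        }
    }

    // Pass 2: re-add dropped articles in original order while slots remain.
    for (c, cat_hosts) in hosts.iter().enumerate() {
        for (i, host) in cat_hosts.iter().enumerate() {
            if keep[c][i] {
                continue;
            }
            if let Some(h) = host.as_deref() {
                let n = counts.entry(h).or_insert(0);
                if *n < max {
                    *n += 1;
                    keep[c][i] = true;
                }
            }
        }
    }

    // Rebuild the output with the same category keys and only kept items.
    parsed
        .into_iter()
        .zip(keep)
        .map(|((name, items), flags)| {
            let kept: Vec<NewsItem> = items
                .into_iter()
                .zip(flags)
                .filter(|(_, k)| *k)
                .map(|(item, _)| item)
                .collect();
            (name, kept)
        })
        .collect()
}
```

Tracking keep flags instead of physically moving articles to a separate list avoids having to remember each dropped article's category, and the output preserves both category order and item order.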
**Example** with `max_per_source = 3`:
Before:
- Category A: openai.com×3, techcrunch.com×1
- Category B: openai.com×2, theverge.com×2
After pass 1 (1 per domain per category):
- Category A: openai.com×1, techcrunch.com×1
- Category B: openai.com×1, theverge.com×1
- Dropped: openai.com×3 (2 from A, 1 from B), theverge.com×1
- Global: openai=2, techcrunch=1, theverge=1
Cap enforcement: openai=2 ≤ 3, no trimming needed.
After pass 2 (fill up to max, dropped articles re-added to original category):
- openai has 1 slot left → add 1 openai article back to Category A
- theverge has 2 slots left, but only 1 article was dropped → add that 1 theverge article back to Category B
- Final: 3 openai total, 1 techcrunch, 2 theverge
**Edge case** with `max_per_source = 2`, 5 categories all with 1 openai.com article:
After pass 1: 5 openai articles (1 per category) → exceeds limit of 2.
Cap enforcement: trim to 2 openai articles, keeping categories A and B (first two in order), dropping C/D/E.
Pass 2: openai is already at its limit of 2, so no openai article is re-added.
## Integration
```
parse_llm_output → filter_homepage_urls → limit_articles_per_source → scrape_articles
```
Call site in `run_generation_inner`:
```rust
let parsed = limit_articles_per_source(parsed, settings.max_articles_per_source);
```
## Files to modify
- **Create:** migration `20260323000012_add_max_articles_per_source.sql`
- **Modify:** `backend/src/models/settings.rs` — add field to `UserSettings`, `SettingsResponse`, `UpdateSettingsRequest` + validation
- **Modify:** `backend/src/db/settings.rs` — add column to all SQL queries + `SettingsRow`
- **Modify:** `backend/src/services/synthesis.rs` — add filter function + call it
- **Modify:** `frontend/src/pages/Settings.tsx` — add number input in the generation settings grid
- **Modify:** `frontend/src/i18n/fr.ts` — add label translation
- **Modify:** `frontend/src/types.ts` — add field to Settings type
## Unit tests
In `synthesis.rs` tests:
- 5 openai.com articles across 2 categories, max=3 → keeps 3, spread across categories
- All articles from different domains → nothing dropped
- `max_per_source = 1` → at most 1 per domain total
- More categories than max (5 categories, 1 openai each, max=2) → caps at 2
- Empty input → empty output
- Articles with unparseable URLs → kept
## What does NOT change
- LLM prompts — no instruction about source diversity
- JSON schema — no changes
- Scraper — no changes
- Rewrite pass — operates on already-filtered articles