# Design: Source Diversity Limit (max articles per source)

**Date**: 2026-03-23

**Scope**: Limit the number of articles from the same website across all categories in a synthesis

---

## Context

Generated syntheses can be dominated by a single source (e.g., 8 articles from openai.com across categories). Users want source diversity: at most N articles from the same website, with those articles spread across categories rather than clustered in one.

## Approach

Add a post-parse filter function that enforces a per-domain article limit after the LLM search pass and before scraping. A new user setting controls the limit.

## New User Setting

- **Field:** `max_articles_per_source` in `UserSettings`
- **Type:** `i32` (non-optional, matches the `max_items_per_category` pattern)
- **Validation:** integer from 1 to 10, inclusive
- **Migration:** `ALTER TABLE user_settings ADD COLUMN max_articles_per_source INTEGER NOT NULL DEFAULT 3` (existing rows get a limit of 3)
- **Frontend label:** "Articles max par source" (French UI string, i.e. "Max articles per source")
- **Note:** 10 effectively means "no practical limit" for most use cases

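
A minimal sketch of the range check from the Validation bullet above. The constant names and `validate_max_articles_per_source` are hypothetical; the real check would presumably sit next to the existing `max_items_per_category` validation, which this document does not show.

```rust
/// Inclusive bounds taken from the "Validation" bullet (1 to 10).
pub const MIN_ARTICLES_PER_SOURCE: i32 = 1;
pub const MAX_ARTICLES_PER_SOURCE: i32 = 10;

/// Hypothetical helper: returns the value unchanged when it is in range,
/// or a human-readable error message otherwise.
pub fn validate_max_articles_per_source(value: i32) -> Result<i32, String> {
    if (MIN_ARTICLES_PER_SOURCE..=MAX_ARTICLES_PER_SOURCE).contains(&value) {
        Ok(value)
    } else {
        Err(format!(
            "max_articles_per_source must be between {} and {}, got {}",
            MIN_ARTICLES_PER_SOURCE, MAX_ARTICLES_PER_SOURCE, value
        ))
    }
}
```

Rejecting out-of-range values (rather than clamping them) keeps the API behavior explicit; clamping would be an equally valid choice if the existing settings already do that.
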
## Filter Function

**Name:** `limit_articles_per_source`

**Signature:** `fn limit_articles_per_source(parsed: Vec<(String, Vec<NewsItem>)>, max_per_source: i32) -> Vec<(String, Vec<NewsItem>)>`

**Pipeline position:** after `filter_homepage_urls`, before `scrape_articles`

**Domain extraction:** Parse the URL with `url::Url` and take the host via `host_str()` (e.g., `https://openai.com/blog/post` → `openai.com`). If the URL cannot be parsed, keep the article; a parse failure never drops an item.

**Known limitation:** Subdomains are treated as different sources (`blog.example.com` ≠ `www.example.com`). This is pragmatic for v1; registrable domain extraction (eTLD+1) can be added later if needed.

**Algorithm:**

1. **Pass 1 (spread):** For each category (in order), keep at most 1 article per domain: the first article seen for a domain stays, and the remaining articles from that domain in the same category move to a "dropped" list.
2. **Cap enforcement:** If any domain exceeds `max_per_source` after pass 1 (possible when the number of categories is greater than the limit), trim that domain's articles down to `max_per_source`, keeping one article in each of the earliest categories, in order.
3. **Pass 2 (fill):** Iterate over the dropped articles in their original order (categories in order, items within each category in order). Re-add each article to its original category if its domain is still under `max_per_source`.
4. Return the filtered list (same category keys, fewer items per category).

**Example** with `max_per_source = 3`:

Before:

- Category A: openai.com×3, techcrunch.com×1
- Category B: openai.com×2, theverge.com×2

After pass 1 (1 per domain per category):

- Category A: openai.com×1, techcrunch.com×1
- Category B: openai.com×1, theverge.com×1
- Dropped: openai.com×3 (2 from A, 1 from B), theverge.com×1
- Global: openai=2, techcrunch=1, theverge=1

Cap enforcement: openai=2 ≤ 3, no trimming needed.

After pass 2 (fill up to max, dropped articles re-added to original category):

- openai has 1 slot left → add 1 openai article back to Category A
- theverge has 2 slots left, but only 1 article was dropped → add 1 theverge article back to Category B
- Final: 3 openai total, 1 techcrunch, 2 theverge

**Edge case** with `max_per_source = 2`, 5 categories (A to E) each with 1 openai.com article:

- After pass 1: 5 openai articles (1 per category) → exceeds the limit of 2.
- Cap enforcement: trim to 2 openai articles, keeping categories A and B (first two in order), dropping C/D/E.
- Pass 2: no openai articles are re-added (the domain is already at the limit).

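
A minimal sketch of the algorithm in this section, under several assumptions that are not part of the design: `NewsItem` is reduced to a hypothetical two-field struct; `domain_of` is a std-only stand-in for `url::Url::parse(..)` followed by `host_str()`, used here only so the block compiles without external crates; cap enforcement is folded into pass 1 (a domain's first article in a category is kept only while the domain is under the cap), which gives the same result as trimming afterwards; and re-added articles return to their original position within the category, a detail the design leaves open.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the project's `NewsItem`; only `url` matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct NewsItem {
    pub title: String,
    pub url: String,
}

/// Std-only approximation of `url::Url::parse(..)` + `host_str()` (assumption).
/// Returns `None` when the input does not look like an absolute URL.
fn domain_of(url: &str) -> Option<String> {
    let (_, rest) = url.split_once("://")?;
    let authority = rest.split(|c: char| c == '/' || c == '?' || c == '#').next()?;
    let host = authority.rsplit('@').next()?; // drop any userinfo
    let host = host.split(':').next()?; // drop any port (not IPv6-aware)
    if host.is_empty() { None } else { Some(host.to_ascii_lowercase()) }
}

pub fn limit_articles_per_source(
    parsed: Vec<(String, Vec<NewsItem>)>,
    max_per_source: i32,
) -> Vec<(String, Vec<NewsItem>)> {
    let max = max_per_source.max(0) as usize;
    // Domain of every item, indexed as [category][item]; None = unparseable URL.
    let domains: Vec<Vec<Option<String>>> = parsed
        .iter()
        .map(|(_, items)| items.iter().map(|i| domain_of(&i.url)).collect())
        .collect();
    let mut keep: Vec<Vec<bool>> = domains.iter().map(|d| vec![false; d.len()]).collect();
    let mut counts: HashMap<&str, usize> = HashMap::new();
    let mut dropped: Vec<(usize, usize)> = Vec::new();

    // Pass 1 (spread) with the cap folded in: at most one article per domain
    // per category, and only while the domain is still under the global cap.
    for (c, cat) in domains.iter().enumerate() {
        let mut seen: HashSet<&str> = HashSet::new();
        for (i, d) in cat.iter().enumerate() {
            match d.as_deref() {
                None => keep[c][i] = true, // unparseable URL: always kept
                Some(d) => {
                    let count = counts.entry(d).or_insert(0);
                    if seen.insert(d) && *count < max {
                        keep[c][i] = true;
                        *count += 1;
                    } else {
                        dropped.push((c, i));
                    }
                }
            }
        }
    }

    // Pass 2 (fill): re-add dropped articles, in original order, while the
    // domain is still under the cap.
    for (c, i) in dropped {
        if let Some(d) = domains[c][i].as_deref() {
            let count = counts.entry(d).or_insert(0);
            if *count < max {
                keep[c][i] = true;
                *count += 1;
            }
        }
    }

    // Rebuild the output with the same category keys and original item order.
    parsed
        .into_iter()
        .zip(keep)
        .map(|((name, items), mask)| {
            let kept: Vec<NewsItem> = items
                .into_iter()
                .zip(mask)
                .filter_map(|(item, k)| k.then_some(item))
                .collect();
            (name, kept)
        })
        .collect()
}
```

Run against the worked example above (`max_per_source = 3`), this sketch keeps 3 items in each category with 3 openai.com articles in total. In the five-category edge case with `max_per_source = 2`, it keeps the openai.com article in the first two categories and leaves the other three categories empty.
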
## Integration

```
parse_llm_output → filter_homepage_urls → limit_articles_per_source → scrape_articles
```

Call site in `run_generation_inner`:

```rust
let parsed = limit_articles_per_source(parsed, settings.max_articles_per_source);
```

## Files to modify

- **Create:** migration `20260323000012_add_max_articles_per_source.sql`
- **Modify:** `backend/src/models/settings.rs`: add the field to `UserSettings`, `SettingsResponse`, and `UpdateSettingsRequest`, plus validation
- **Modify:** `backend/src/db/settings.rs`: add the column to all SQL queries and to `SettingsRow`
- **Modify:** `backend/src/services/synthesis.rs`: add the filter function and call it
- **Modify:** `frontend/src/pages/Settings.tsx`: add a number input in the generation settings grid
- **Modify:** `frontend/src/i18n/fr.ts`: add the label translation
- **Modify:** `frontend/src/types.ts`: add the field to the Settings type

## Unit tests

In `synthesis.rs` tests:

- 5 openai.com articles across 2 categories, max=3 → keeps 3, spread across categories
- All articles from different domains → nothing dropped
- `max_per_source = 1` → at most 1 per domain total
- More categories than max (5 categories, 1 openai each, max=2) → caps at 2
- Empty input → empty output
- Articles with unparseable URLs → kept

## What does NOT change

- LLM prompts: no instruction about source diversity
- JSON schema: no changes
- Scraper: no changes
- Rewrite pass: operates on already-filtered articles