Updated specifications of source diversity functionality
parent
53ecce84b0
commit
420603d76a
@ -0,0 +1,342 @@
|
|||||||
|
# Source Diversity via Recent History — Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Inject recently-used domains into the LLM search prompt to encourage source diversity across syntheses.
|
||||||
|
|
||||||
|
**Architecture:** New `source_diversity_window` setting (default 3, 0=disabled). At generation time, load recent syntheses, extract domains from JSONB sections, pass to prompt builder which appends a soft avoidance instruction.
|
||||||
|
|
||||||
|
**Tech Stack:** Rust (sqlx, serde_json, url crate), SolidJS, PostgreSQL
|
||||||
|
|
||||||
|
**Spec:** `docs/superpowers/specs/2026-03-23-source-diversity-history-design.md`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: Migration + backend model
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `backend/migrations/20260323000013_add_source_diversity_window.sql`
|
||||||
|
- Modify: `backend/src/models/settings.rs`
|
||||||
|
- Modify: `backend/src/db/settings.rs`
|
||||||
|
- Modify: `CLAUDE.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Create migration**
|
||||||
|
|
||||||
|
```sql
|
||||||
|
ALTER TABLE settings ADD COLUMN source_diversity_window INTEGER NOT NULL DEFAULT 3;
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Add field to all structs in `models/settings.rs`**
|
||||||
|
|
||||||
|
Add `pub source_diversity_window: i32` to `UserSettings`, `SettingsResponse`, `UpdateSettingsRequest` (after `max_articles_per_source`).
|
||||||
|
|
||||||
|
Add to `From<UserSettings> for SettingsResponse`:
|
||||||
|
```rust
|
||||||
|
source_diversity_window: s.source_diversity_window,
|
||||||
|
```
|
||||||
|
|
||||||
|
Add validation in `UpdateSettingsRequest::validate()`:
|
||||||
|
```rust
|
||||||
|
if !(0..=10).contains(&self.source_diversity_window) {
|
||||||
|
return Err("source_diversity_window must be between 0 and 10".into());
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Add to `impl Default for UserSettings`:
|
||||||
|
```rust
|
||||||
|
source_diversity_window: 3,
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Add column to DB queries in `db/settings.rs`**
|
||||||
|
|
||||||
|
Add `source_diversity_window: i32` to `SettingsRow`. Add to `TryFrom<SettingsRow>`:
|
||||||
|
```rust
|
||||||
|
source_diversity_window: row.source_diversity_window,
|
||||||
|
```
|
||||||
|
|
||||||
|
Add to both SQL queries (`get_or_create_default` and `upsert`): INSERT column list, VALUES placeholder, RETURNING clause, `.bind()` call, and ON CONFLICT SET (upsert only). The new column goes after `max_articles_per_source`.
|
||||||
|
|
||||||
|
- [ ] **Step 4: Update CLAUDE.md migration count to 13**
|
||||||
|
|
||||||
|
- [ ] **Step 5: Add validation tests in `models/settings.rs`**
|
||||||
|
|
||||||
|
Add `source_diversity_window: 3` to the `valid_request()` test helper. Then add tests:
|
||||||
|
|
||||||
|
```rust
|
||||||
|
#[test]
|
||||||
|
fn test_source_diversity_window_zero_is_valid() {
|
||||||
|
let mut req = valid_request();
|
||||||
|
req.source_diversity_window = 0;
|
||||||
|
assert!(req.validate().is_ok());
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn test_source_diversity_window_ten_is_valid() {
|
||||||
|
let mut req = valid_request();
|
||||||
|
req.source_diversity_window = 10;
|
||||||
|
assert!(req.validate().is_ok());
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn test_source_diversity_window_below_range() {
|
||||||
|
let mut req = valid_request();
|
||||||
|
req.source_diversity_window = -1;
|
||||||
|
assert!(req.validate().is_err());
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn test_source_diversity_window_above_range() {
|
||||||
|
let mut req = valid_request();
|
||||||
|
req.source_diversity_window = 11;
|
||||||
|
assert!(req.validate().is_err());
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 6: Run tests**
|
||||||
|
|
||||||
|
Run: `cd backend && cargo test --lib`
|
||||||
|
Expected: all tests pass
|
||||||
|
|
||||||
|
- [ ] **Step 7: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add backend/migrations/20260323000013_add_source_diversity_window.sql backend/src/models/settings.rs backend/src/db/settings.rs CLAUDE.md
|
||||||
|
git commit -m "feat: add source_diversity_window setting (migration + model + DB)"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: Prompt modification + tests
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `backend/src/services/prompts.rs`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add `recent_domains` parameter to `build_search_prompt`**
|
||||||
|
|
||||||
|
Change signature from:
|
||||||
|
```rust
|
||||||
|
pub fn build_search_prompt(
|
||||||
|
settings: &UserSettings,
|
||||||
|
sources: &[Source],
|
||||||
|
current_date: &str,
|
||||||
|
) -> (String, String) {
|
||||||
|
```
|
||||||
|
|
||||||
|
To:
|
||||||
|
```rust
|
||||||
|
pub fn build_search_prompt(
|
||||||
|
settings: &UserSettings,
|
||||||
|
sources: &[Source],
|
||||||
|
current_date: &str,
|
||||||
|
recent_domains: &[String],
|
||||||
|
) -> (String, String) {
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Append avoidance instruction when domains are non-empty**
|
||||||
|
|
||||||
|
Leave the existing `format!()` call that builds `user_prompt` unchanged. Immediately after it, shadow `user_prompt` with a conditional that appends the avoidance instruction only when `recent_domains` is non-empty:
|
||||||
|
|
||||||
|
```rust
|
||||||
|
let user_prompt = if recent_domains.is_empty() {
|
||||||
|
user_prompt
|
||||||
|
} else {
|
||||||
|
let domains_list = recent_domains.join(", ");
|
||||||
|
format!(
|
||||||
|
"{}\n\nEvite si possible les sources deja utilisees dans les syntheses precedentes : {}.",
|
||||||
|
user_prompt, domains_list
|
||||||
|
)
|
||||||
|
};
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Update test fixture**
|
||||||
|
|
||||||
|
In the `test_settings()` function (~line 137), add:
|
||||||
|
```rust
|
||||||
|
source_diversity_window: 3,
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Update existing test calls**
|
||||||
|
|
||||||
|
All existing tests that call `build_search_prompt` need the 4th argument. Add `&[]` (empty slice) to each existing call. Search for `build_search_prompt(` in the test module and add `, &[]` before the closing `)`.
|
||||||
|
|
||||||
|
- [ ] **Step 5: Add new tests**
|
||||||
|
|
||||||
|
```rust
|
||||||
|
#[test]
|
||||||
|
fn search_prompt_includes_recent_domains_avoidance() {
|
||||||
|
let settings = test_settings();
|
||||||
|
let sources = vec![];
|
||||||
|
let date = "lundi 17 mars 2026";
|
||||||
|
let domains = vec!["techcrunch.com".to_string(), "theverge.com".to_string()];
|
||||||
|
let (_, user_prompt) = build_search_prompt(&settings, &sources, date, &domains);
|
||||||
|
assert!(user_prompt.contains("Evite si possible"));
|
||||||
|
assert!(user_prompt.contains("techcrunch.com"));
|
||||||
|
assert!(user_prompt.contains("theverge.com"));
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn search_prompt_no_avoidance_when_domains_empty() {
|
||||||
|
let settings = test_settings();
|
||||||
|
let sources = vec![];
|
||||||
|
let date = "lundi 17 mars 2026";
|
||||||
|
let (_, user_prompt) = build_search_prompt(&settings, &sources, date, &[]);
|
||||||
|
assert!(!user_prompt.contains("Evite si possible"));
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 6: Run tests**
|
||||||
|
|
||||||
|
Run: `cd backend && cargo test --lib`
|
||||||
|
Expected: all tests pass
|
||||||
|
|
||||||
|
- [ ] **Step 7: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add backend/src/services/prompts.rs
|
||||||
|
git commit -m "feat: build_search_prompt accepts recent_domains for source diversity"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: Pipeline integration — extract domains + wire prompt
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `backend/src/services/synthesis.rs`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add domain extraction from recent syntheses**
|
||||||
|
|
||||||
|
Before the `build_search_prompt` call (~line 303), add a new step that loads recent syntheses and extracts domains. Insert between the rate limit check (step 5) and the search pass (step 6):
|
||||||
|
|
||||||
|
```rust
|
||||||
|
// Step 5b: Load recently-used domains for source diversity
|
||||||
|
let recent_domains = if settings.source_diversity_window > 0 {
|
||||||
|
let recent = db::syntheses::list_for_user(
|
||||||
|
&state.pool,
|
||||||
|
user_id,
|
||||||
|
settings.source_diversity_window as i64,
|
||||||
|
0,
|
||||||
|
)
|
||||||
|
.await
|
||||||
|
// A failed lookup degrades to "no recent domains" instead of aborting generation.
.unwrap_or_default();
|
||||||
|
|
||||||
|
let mut domains: Vec<String> = recent
|
||||||
|
.iter()
|
||||||
|
.filter_map(|s| {
|
||||||
|
serde_json::from_value::<Vec<crate::models::synthesis::NewsSection>>(
|
||||||
|
s.sections.clone(),
|
||||||
|
)
|
||||||
|
.ok()
|
||||||
|
})
|
||||||
|
.flat_map(|sections| {
|
||||||
|
sections
|
||||||
|
.into_iter()
|
||||||
|
.flat_map(|sec| sec.items.into_iter())
|
||||||
|
.filter_map(|item| extract_domain(&item.url))
|
||||||
|
})
|
||||||
|
.collect();
|
||||||
|
|
||||||
|
domains.sort();
|
||||||
|
domains.dedup();
|
||||||
|
domains
|
||||||
|
} else {
|
||||||
|
Vec::new()
|
||||||
|
};
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Update the `build_search_prompt` call**
|
||||||
|
|
||||||
|
Change line ~304 from:
|
||||||
|
```rust
|
||||||
|
let (system_prompt, user_prompt) =
|
||||||
|
prompts::build_search_prompt(&settings, &sources, &current_date);
|
||||||
|
```
|
||||||
|
|
||||||
|
To:
|
||||||
|
```rust
|
||||||
|
let (system_prompt, user_prompt) =
|
||||||
|
prompts::build_search_prompt(&settings, &sources, &current_date, &recent_domains);
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Run tests**
|
||||||
|
|
||||||
|
Run: `cd backend && cargo test --lib`
|
||||||
|
Expected: all tests pass
|
||||||
|
|
||||||
|
- [ ] **Step 4: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add backend/src/services/synthesis.rs
|
||||||
|
git commit -m "feat: extract recent domains and pass to search prompt for diversity"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: Frontend setting
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `frontend/src/types.ts`
|
||||||
|
- Modify: `frontend/src/i18n/fr.ts`
|
||||||
|
- Modify: `frontend/src/pages/Settings.tsx`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add field to frontend types**
|
||||||
|
|
||||||
|
In `frontend/src/types.ts`, add to `UserSettings` interface (after `max_articles_per_source`):
|
||||||
|
```typescript
|
||||||
|
source_diversity_window: number;
|
||||||
|
```
|
||||||
|
|
||||||
|
Add to `DEFAULT_SETTINGS`:
|
||||||
|
```typescript
|
||||||
|
source_diversity_window: 3,
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Add i18n label**
|
||||||
|
|
||||||
|
In `frontend/src/i18n/fr.ts`, add after `settings.maxArticlesPerSource`:
|
||||||
|
```typescript
|
||||||
|
'settings.diversityWindow': 'Syntheses a examiner pour diversite',
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Add number input to Settings page**
|
||||||
|
|
||||||
|
In `frontend/src/pages/Settings.tsx`, inside the generation settings grid (after `maxArticlesPerSource`), add:
|
||||||
|
|
||||||
|
```tsx
|
||||||
|
<div>
|
||||||
|
<label
|
||||||
|
for="diversityWindow"
|
||||||
|
class="block text-sm font-medium text-gray-700"
|
||||||
|
>
|
||||||
|
{t('settings.diversityWindow')}
|
||||||
|
</label>
|
||||||
|
<div class="mt-1">
|
||||||
|
<input
|
||||||
|
type="number"
|
||||||
|
id="diversityWindow"
|
||||||
|
min="0"
|
||||||
|
max="10"
|
||||||
|
class="shadow-sm focus:ring-indigo-500 focus:border-indigo-500 block w-full sm:text-sm border-gray-300 rounded-md py-2 px-3 border"
|
||||||
|
value={settings().source_diversity_window}
|
||||||
|
onInput={(e) =>
|
||||||
|
setSettings((prev) => ({
|
||||||
|
...prev,
|
||||||
|
source_diversity_window:
|
||||||
|
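// Fall back to 3 only on NaN: 0 is a valid value (it disables the feature).
Number.isNaN(parseInt(e.currentTarget.value, 10))
? 3
: parseInt(e.currentTarget.value, 10),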
|
||||||
|
}))
|
||||||
|
}
|
||||||
|
/>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Run frontend tests**
|
||||||
|
|
||||||
|
Run: `cd frontend && npx tsc --noEmit && npx vitest run`
|
||||||
|
Expected: type check passes, all tests pass
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add frontend/src/types.ts frontend/src/i18n/fr.ts frontend/src/pages/Settings.tsx
|
||||||
|
git commit -m "feat: add source_diversity_window setting to frontend"
|
||||||
|
```
|
||||||
@ -0,0 +1,406 @@
|
|||||||
|
# Source Diversity Limit — Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Limit the number of articles from the same website across all categories, with source diversity spread across categories.
|
||||||
|
|
||||||
|
**Architecture:** New `i32` field in UserSettings + migration, post-parse filter function in the generation pipeline, frontend number input.
|
||||||
|
|
||||||
|
**Tech Stack:** Rust (sqlx, url crate), SolidJS, PostgreSQL
|
||||||
|
|
||||||
|
**Spec:** `docs/superpowers/specs/2026-03-23-source-diversity-limit-design.md`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: Database migration
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `backend/migrations/20260323000012_add_max_articles_per_source.sql`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Create migration**
|
||||||
|
|
||||||
|
```sql
|
||||||
|
ALTER TABLE settings ADD COLUMN max_articles_per_source INTEGER NOT NULL DEFAULT 3;
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Update CLAUDE.md migration count**
|
||||||
|
|
||||||
|
Change `## Database (11 migrations)` to `## Database (12 migrations)`.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add backend/migrations/20260323000012_add_max_articles_per_source.sql CLAUDE.md
|
||||||
|
git commit -m "feat: add max_articles_per_source column to settings"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: Backend model + DB queries
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `backend/src/models/settings.rs`
|
||||||
|
- Modify: `backend/src/db/settings.rs`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add field to all three structs in `models/settings.rs`**
|
||||||
|
|
||||||
|
Add `pub max_articles_per_source: i32` to:
|
||||||
|
- `UserSettings` (after `max_items_per_category`)
|
||||||
|
- `SettingsResponse` (after `max_items_per_category`)
|
||||||
|
- `UpdateSettingsRequest` (after `max_items_per_category`)
|
||||||
|
|
||||||
|
Add the field to `impl From<UserSettings> for SettingsResponse`:
|
||||||
|
```rust
|
||||||
|
max_articles_per_source: s.max_articles_per_source,
|
||||||
|
```
|
||||||
|
|
||||||
|
Add validation in `UpdateSettingsRequest::validate()`:
|
||||||
|
```rust
|
||||||
|
if !(1..=10).contains(&self.max_articles_per_source) {
|
||||||
|
return Err("max_articles_per_source must be between 1 and 10".into());
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Add to `impl Default for UserSettings`:
|
||||||
|
```rust
|
||||||
|
max_articles_per_source: 3,
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Add column to all SQL queries in `db/settings.rs`**
|
||||||
|
|
||||||
|
Add `max_articles_per_source` to:
|
||||||
|
- `SettingsRow` struct field
|
||||||
|
- `get_or_create` INSERT column list, VALUES, RETURNING, and `.bind()`
|
||||||
|
- `upsert` INSERT column list, VALUES, RETURNING, ON CONFLICT SET, and `.bind()`
|
||||||
|
- `UserSettings::try_from(SettingsRow)` mapping
|
||||||
|
|
||||||
|
This follows the exact same pattern as `max_items_per_category` in every query.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Run tests**
|
||||||
|
|
||||||
|
Run: `cd backend && cargo test --lib`
|
||||||
|
Expected: all tests pass (existing settings tests use `Default` which now includes the new field)
|
||||||
|
|
||||||
|
- [ ] **Step 4: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add backend/src/models/settings.rs backend/src/db/settings.rs
|
||||||
|
git commit -m "feat: add max_articles_per_source to settings model and DB queries"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: Filter function with unit tests
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `backend/src/services/synthesis.rs`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add the `limit_articles_per_source` function**
|
||||||
|
|
||||||
|
Add after `filter_homepage_urls`:
|
||||||
|
|
||||||
|
```rust
|
||||||
|
/// Limit the number of articles from the same domain across all categories.
|
||||||
|
///
|
||||||
|
/// Spreads articles across categories first (at most 1 per domain per category),
|
||||||
|
/// then fills remaining slots from dropped articles in encounter order.
|
||||||
|
fn limit_articles_per_source(
|
||||||
|
parsed: Vec<(String, Vec<NewsItem>)>,
|
||||||
|
max_per_source: i32,
|
||||||
|
) -> Vec<(String, Vec<NewsItem>)> {
|
||||||
|
let max = max_per_source as usize;
|
||||||
|
|
||||||
|
// Pass 1: keep at most 1 article per domain per category
|
||||||
|
let mut kept: Vec<(String, Vec<NewsItem>)> = Vec::new();
|
||||||
|
let mut dropped: Vec<(usize, NewsItem)> = Vec::new(); // (category_index, item)
|
||||||
|
let mut domain_counts: std::collections::HashMap<String, usize> =
|
||||||
|
std::collections::HashMap::new();
|
||||||
|
|
||||||
|
for (cat_idx, (cat_key, items)) in parsed.into_iter().enumerate() {
|
||||||
|
let mut cat_kept = Vec::new();
|
||||||
|
let mut seen_in_cat: std::collections::HashSet<String> = std::collections::HashSet::new();
|
||||||
|
|
||||||
|
for item in items {
|
||||||
|
let domain = extract_domain(&item.url);
|
||||||
|
if let Some(ref d) = domain {
|
||||||
|
if seen_in_cat.contains(d) {
|
||||||
|
dropped.push((cat_idx, item));
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
seen_in_cat.insert(d.clone());
|
||||||
|
*domain_counts.entry(d.clone()).or_insert(0) += 1;
|
||||||
|
}
|
||||||
|
cat_kept.push(item);
|
||||||
|
}
|
||||||
|
|
||||||
|
kept.push((cat_key, cat_kept));
|
||||||
|
}
|
||||||
|
|
||||||
|
// Cap enforcement: if any domain exceeds max after pass 1 (when categories > max),
|
||||||
|
// keep the first max articles in category order, drop the rest.
|
||||||
|
let mut cap_counts: std::collections::HashMap<String, usize> = std::collections::HashMap::new();
|
||||||
|
for (_, items) in &mut kept {
|
||||||
|
items.retain(|item| {
|
||||||
|
let domain = extract_domain(&item.url);
|
||||||
|
match domain {
|
||||||
|
Some(ref d) => {
|
||||||
|
let count = cap_counts.entry(d.clone()).or_insert(0);
|
||||||
|
if *count >= max {
|
||||||
|
false
|
||||||
|
} else {
|
||||||
|
*count += 1;
|
||||||
|
true
|
||||||
|
}
|
||||||
|
}
|
||||||
|
None => true, // keep unparseable URLs
|
||||||
|
}
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
// Use cap_counts as the authoritative domain counts going forward
|
||||||
|
let mut domain_counts = cap_counts;
|
||||||
|
|
||||||
|
// Pass 2: fill from dropped articles, back into their original category
|
||||||
|
for (cat_idx, item) in dropped {
|
||||||
|
if let Some(d) = extract_domain(&item.url) {
|
||||||
|
let count = domain_counts.get(&d).copied().unwrap_or(0);
|
||||||
|
if count < max {
|
||||||
|
*domain_counts.entry(d).or_insert(0) += 1;
|
||||||
|
kept[cat_idx].1.push(item);
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
// Unparseable URL — keep it
|
||||||
|
kept[cat_idx].1.push(item);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
kept
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Extract the domain (host) from a URL, or None if unparseable.
|
||||||
|
fn extract_domain(url: &str) -> Option<String> {
|
||||||
|
url::Url::parse(url)
|
||||||
|
.ok()
|
||||||
|
.and_then(|u| u.host_str().map(|h| h.to_lowercase()))
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Wire it into the pipeline**
|
||||||
|
|
||||||
|
In `run_generation_inner`, after `filter_homepage_urls` (line 315) and before the scrape step, add:
|
||||||
|
|
||||||
|
```rust
|
||||||
|
// Step 7c: Limit articles per source for diversity
|
||||||
|
let parsed = limit_articles_per_source(parsed, settings.max_articles_per_source);
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Add unit tests**
|
||||||
|
|
||||||
|
Add to the `#[cfg(test)] mod tests` block at the bottom of `synthesis.rs`:
|
||||||
|
|
||||||
|
```rust
|
||||||
|
// ── limit_articles_per_source tests ────────────────────────────
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn source_limit_spreads_across_categories() {
|
||||||
|
let parsed = vec![
|
||||||
|
("category_0".into(), vec![
|
||||||
|
NewsItem { title: "A1".into(), url: "https://openai.com/blog/a".into(), summary: "s".into() },
|
||||||
|
NewsItem { title: "A2".into(), url: "https://openai.com/blog/b".into(), summary: "s".into() },
|
||||||
|
NewsItem { title: "A3".into(), url: "https://openai.com/blog/c".into(), summary: "s".into() },
|
||||||
|
NewsItem { title: "A4".into(), url: "https://techcrunch.com/x".into(), summary: "s".into() },
|
||||||
|
]),
|
||||||
|
("category_1".into(), vec![
|
||||||
|
NewsItem { title: "B1".into(), url: "https://openai.com/research/d".into(), summary: "s".into() },
|
||||||
|
NewsItem { title: "B2".into(), url: "https://openai.com/research/e".into(), summary: "s".into() },
|
||||||
|
NewsItem { title: "B3".into(), url: "https://theverge.com/y".into(), summary: "s".into() },
|
||||||
|
]),
|
||||||
|
];
|
||||||
|
|
||||||
|
let result = limit_articles_per_source(parsed, 3);
|
||||||
|
|
||||||
|
// Count openai.com articles across all categories
|
||||||
|
let openai_count: usize = result.iter()
|
||||||
|
.flat_map(|(_, items)| items)
|
||||||
|
.filter(|i| i.url.contains("openai.com"))
|
||||||
|
.count();
|
||||||
|
assert_eq!(openai_count, 3, "Should keep exactly 3 openai.com articles");
|
||||||
|
|
||||||
|
// Both categories should have at least 1 openai article (spread)
|
||||||
|
let cat0_openai = result[0].1.iter().filter(|i| i.url.contains("openai.com")).count();
|
||||||
|
let cat1_openai = result[1].1.iter().filter(|i| i.url.contains("openai.com")).count();
|
||||||
|
assert!(cat0_openai >= 1, "Category 0 should have at least 1 openai article");
|
||||||
|
assert!(cat1_openai >= 1, "Category 1 should have at least 1 openai article");
|
||||||
|
|
||||||
|
// techcrunch and theverge should be untouched
|
||||||
|
let tc_count: usize = result.iter().flat_map(|(_, items)| items).filter(|i| i.url.contains("techcrunch")).count();
|
||||||
|
assert_eq!(tc_count, 1);
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn source_limit_all_different_domains() {
|
||||||
|
let parsed = vec![
|
||||||
|
("category_0".into(), vec![
|
||||||
|
NewsItem { title: "A".into(), url: "https://a.com/1".into(), summary: "s".into() },
|
||||||
|
NewsItem { title: "B".into(), url: "https://b.com/2".into(), summary: "s".into() },
|
||||||
|
]),
|
||||||
|
];
|
||||||
|
|
||||||
|
let result = limit_articles_per_source(parsed, 3);
|
||||||
|
assert_eq!(result[0].1.len(), 2, "Nothing dropped when all domains are unique");
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn source_limit_max_one() {
|
||||||
|
let parsed = vec![
|
||||||
|
("category_0".into(), vec![
|
||||||
|
NewsItem { title: "A".into(), url: "https://openai.com/a".into(), summary: "s".into() },
|
||||||
|
NewsItem { title: "B".into(), url: "https://openai.com/b".into(), summary: "s".into() },
|
||||||
|
]),
|
||||||
|
("category_1".into(), vec![
|
||||||
|
NewsItem { title: "C".into(), url: "https://openai.com/c".into(), summary: "s".into() },
|
||||||
|
]),
|
||||||
|
];
|
||||||
|
|
||||||
|
let result = limit_articles_per_source(parsed, 1);
|
||||||
|
let total: usize = result.iter().flat_map(|(_, items)| items).filter(|i| i.url.contains("openai.com")).count();
|
||||||
|
assert_eq!(total, 1, "max=1 should keep exactly 1 openai article");
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn source_limit_more_categories_than_max() {
|
||||||
|
// 5 categories, each with 1 openai article, max=2
|
||||||
|
let parsed: Vec<(String, Vec<NewsItem>)> = (0..5)
|
||||||
|
.map(|i| (
|
||||||
|
format!("category_{}", i),
|
||||||
|
vec![NewsItem {
|
||||||
|
title: format!("Art{}", i),
|
||||||
|
url: format!("https://openai.com/{}", i),
|
||||||
|
summary: "s".into(),
|
||||||
|
}],
|
||||||
|
))
|
||||||
|
.collect();
|
||||||
|
|
||||||
|
let result = limit_articles_per_source(parsed, 2);
|
||||||
|
let total: usize = result.iter().flat_map(|(_, items)| items).count();
|
||||||
|
assert_eq!(total, 2, "Should cap at max_per_source even with more categories");
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn source_limit_empty_input() {
|
||||||
|
let result = limit_articles_per_source(vec![], 3);
|
||||||
|
assert!(result.is_empty());
|
||||||
|
}
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn source_limit_unparseable_urls_kept() {
|
||||||
|
let parsed = vec![
|
||||||
|
("category_0".into(), vec![
|
||||||
|
NewsItem { title: "Good".into(), url: "https://openai.com/a".into(), summary: "s".into() },
|
||||||
|
NewsItem { title: "Bad".into(), url: "not-a-url".into(), summary: "s".into() },
|
||||||
|
]),
|
||||||
|
];
|
||||||
|
|
||||||
|
let result = limit_articles_per_source(parsed, 3);
|
||||||
|
assert_eq!(result[0].1.len(), 2, "Unparseable URLs should be kept");
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Run tests**
|
||||||
|
|
||||||
|
Run: `cd backend && cargo test --lib`
|
||||||
|
Expected: all tests pass including the 6 new ones
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add backend/src/services/synthesis.rs
|
||||||
|
git commit -m "feat: add limit_articles_per_source filter with unit tests"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: Frontend setting
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `frontend/src/types.ts`
|
||||||
|
- Modify: `frontend/src/i18n/fr.ts`
|
||||||
|
- Modify: `frontend/src/pages/Settings.tsx`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add field to frontend types**
|
||||||
|
|
||||||
|
In `frontend/src/types.ts`, add to `UserSettings` interface after `max_items_per_category`:
|
||||||
|
```typescript
|
||||||
|
max_articles_per_source: number;
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Add i18n label**
|
||||||
|
|
||||||
|
In `frontend/src/i18n/fr.ts`, add after the `settings.maxItems` line:
|
||||||
|
```typescript
|
||||||
|
'settings.maxArticlesPerSource': 'Articles max par source',
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Add number input to Settings page**
|
||||||
|
|
||||||
|
In `frontend/src/pages/Settings.tsx`, inside the `sm:grid-cols-2` grid (before its closing `</div>` around line 403), add a new `<div>` as a third child of the grid:
|
||||||
|
|
||||||
|
```tsx
|
||||||
|
<div>
|
||||||
|
<label
|
||||||
|
for="maxArticlesPerSource"
|
||||||
|
class="block text-sm font-medium text-gray-700"
|
||||||
|
>
|
||||||
|
{t('settings.maxArticlesPerSource')}
|
||||||
|
</label>
|
||||||
|
<div class="mt-1">
|
||||||
|
<input
|
||||||
|
type="number"
|
||||||
|
id="maxArticlesPerSource"
|
||||||
|
min="1"
|
||||||
|
max="10"
|
||||||
|
class="shadow-sm focus:ring-indigo-500 focus:border-indigo-500 block w-full sm:text-sm border-gray-300 rounded-md py-2 px-3 border"
|
||||||
|
value={settings().max_articles_per_source}
|
||||||
|
onInput={(e) =>
|
||||||
|
setSettings((prev) => ({
|
||||||
|
...prev,
|
||||||
|
max_articles_per_source:
|
||||||
|
parseInt(e.currentTarget.value, 10) || 3,
|
||||||
|
}))
|
||||||
|
}
|
||||||
|
/>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
```
|
||||||
|
|
||||||
|
Also add `max_articles_per_source: 3` to the default settings initializer if one exists.
|
||||||
|
|
||||||
|
- [ ] **Step 4: Run frontend tests and type check**
|
||||||
|
|
||||||
|
Run: `cd frontend && npx tsc --noEmit && npx vitest run`
|
||||||
|
Expected: type check passes, all tests pass
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add frontend/src/types.ts frontend/src/i18n/fr.ts frontend/src/pages/Settings.tsx
|
||||||
|
git commit -m "feat: add max_articles_per_source setting to frontend"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 5: E2E verification
|
||||||
|
|
||||||
|
- [ ] **Step 1: Rebuild and run Docker stack**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker compose down && docker compose up --build
|
||||||
|
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Verify the setting appears in the Settings page**
|
||||||
|
|
||||||
|
Navigate to Settings, verify the "Articles max par source" number input is visible with default value 3.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Generate a synthesis and verify source diversity**
|
||||||
|
|
||||||
|
Change the setting to 2, generate a synthesis, verify no domain appears more than 2 times across all categories.
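
The manual check above can also be done mechanically on the URLs of a generated synthesis. The following is a minimal sketch, not part of the plan: `over_limit` is a hypothetical helper, and the host parsing is a naive string split used for illustration rather than the `url` crate that `extract_domain` relies on.

```rust
use std::collections::HashMap;

/// Hypothetical helper: returns the hosts that appear more than `max` times.
/// URLs without a scheme are ignored, mirroring the "unparseable" case.
fn over_limit(urls: &[&str], max: usize) -> Vec<(String, usize)> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for url in urls {
        // Naive host extraction: text between "://" and the next '/'.
        let host = url
            .split_once("://")
            .map(|(_, rest)| rest)
            .and_then(|rest| rest.split('/').next())
            .filter(|h| !h.is_empty());
        if let Some(h) = host {
            *counts.entry(h.to_lowercase()).or_insert(0) += 1;
        }
    }
    // Keep only the hosts that exceed the cap, in a stable order.
    let mut offenders: Vec<(String, usize)> =
        counts.into_iter().filter(|(_, n)| *n > max).collect();
    offenders.sort();
    offenders
}
```

With the setting at 2, an empty result from `over_limit(&urls, 2)` means no domain exceeds the cap.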
|
||||||
@ -0,0 +1,83 @@
|
|||||||
|
# Design: Source Diversity via Recent History
|
||||||
|
|
||||||
|
**Date**: 2026-03-23
|
||||||
|
**Scope**: Inject recently-used domains into the search prompt to encourage source diversity across syntheses
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
Users notice that successive syntheses reuse the same sources (TechCrunch, The Verge, etc.). Within a single synthesis, the `limit_articles_per_source` filter already caps the number of articles per domain. Across syntheses, however, the LLM keeps gravitating toward the same popular domains. Telling the LLM which domains were recently used lets it prioritize different sources.
|
||||||
|
|
||||||
|
## New User Setting
|
||||||
|
|
||||||
|
- **Field:** `source_diversity_window` in `UserSettings`
|
||||||
|
- **Type:** `i32` (non-optional, matches existing pattern)
|
||||||
|
- **Default:** 3
|
||||||
|
- **Validation:** 0-10 (0 = disabled)
|
||||||
|
- **Migration:** `ALTER TABLE settings ADD COLUMN source_diversity_window INTEGER NOT NULL DEFAULT 3`
|
||||||
|
- **Frontend label:** "Syntheses a examiner pour diversite"
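
As a rough illustration of how the setting could drive the history lookup, the sketch below maps the stored value to a query limit. `diversity_query_limit` is a hypothetical name, not an existing function in the codebase. It assumes that 0 disables the lookup and that out-of-range values have already been rejected by validation.

```rust
/// Hypothetical helper: maps the stored setting to a query limit.
/// Returns `None` when the feature is disabled (0) or the value is out of range.
fn diversity_query_limit(source_diversity_window: i32) -> Option<i64> {
    match source_diversity_window {
        // Valid window: load this many recent syntheses.
        1..=10 => Some(i64::from(source_diversity_window)),
        // 0 disables the feature; other values should fail validation upstream.
        _ => None,
    }
}
```

Returning an `Option` keeps the "disabled" case explicit at the call site instead of issuing a query with a limit of 0.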
|
||||||
|
|
||||||
|
## Mechanism
|
||||||
|
|
||||||
|
1. At generation time, if `source_diversity_window > 0`, query the user's last N syntheses from the DB (ordered by `created_at DESC`, limit N).
|
||||||
|
2. Parse the `sections` JSONB from each synthesis, extract all article URLs, convert to domains via `host_str()`.
|
||||||
|
3. Deduplicate the domain list.
|
||||||
|
4. Pass the domain list to `build_search_prompt`, which appends a soft instruction:
   "Evite si possible les sources deja utilisees dans les syntheses precedentes : domaine1.com, domaine2.com, ..."
|
||||||
|
5. The LLM treats this as guidance, not a hard constraint — if no alternative exists, it can still use those domains.
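
Steps 3 and 4 can be sketched with the standard library alone. This is an illustration under assumptions, not the actual `build_search_prompt` code: `with_avoidance_hint` is a hypothetical name, and the instruction text is the wording given in the Prompt modification section.

```rust
/// Hypothetical helper covering steps 3-4: deduplicate the domains,
/// then append the soft instruction to the user prompt.
/// Returns the prompt unchanged when there is nothing to avoid.
fn with_avoidance_hint(user_prompt: &str, domains: &[&str]) -> String {
    // Lowercase, sort, and dedup so each domain is listed exactly once.
    let mut unique: Vec<String> = domains.iter().map(|d| d.to_lowercase()).collect();
    unique.sort();
    unique.dedup();

    if unique.is_empty() {
        return user_prompt.to_string();
    }
    format!(
        "{}\n\nEvite si possible les sources deja utilisees dans les syntheses precedentes : {}.",
        user_prompt,
        unique.join(", ")
    )
}
```

Sorting before `dedup()` matters because `Vec::dedup` only removes consecutive duplicates.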
|
||||||
|
|
||||||
|
## Files to modify
|
||||||
|
|
||||||
|
- **Create:** migration `20260323000013_add_source_diversity_window.sql`
|
||||||
|
- **Modify:** `backend/src/models/settings.rs` — add field to `UserSettings`, `SettingsResponse`, `UpdateSettingsRequest` + `Default` impl + validation (0-10)
|
||||||
|
- **Modify:** `backend/src/db/settings.rs` — add to `SettingsRow` struct, `TryFrom<SettingsRow>` impl, and both SQL queries (`get_or_create_default` + `upsert`: INSERT columns, VALUES, RETURNING, ON CONFLICT SET, .bind())
|
||||||
|
- **Modify:** `backend/src/services/synthesis.rs` — before calling `build_search_prompt`, load recent syntheses via existing `db::syntheses::list_for_user`, extract domains using `extract_domain` (same module, private fn), pass domain list to the prompt builder
|
||||||
|
- **Modify:** `backend/src/services/prompts.rs` — add `recent_domains: &[String]` parameter to `build_search_prompt`, append soft avoidance instruction if non-empty. Update the call site in `synthesis.rs` (~line 304) to pass the domain list as the 4th argument.
|
||||||
|
- **Modify:** `backend/src/services/prompts.rs` tests — add `source_diversity_window` to test fixture, test with/without recent domains
|
||||||
|
- **Modify:** `frontend/src/types.ts` — add field to `UserSettings` + `DEFAULT_SETTINGS`
|
||||||
|
- **Modify:** `frontend/src/i18n/fr.ts` — add label
|
||||||
|
- **Modify:** `frontend/src/pages/Settings.tsx` — add number input
|
||||||
|
|
||||||
|
**Note:** No new DB query function needed — the existing `db::syntheses::list_for_user(pool, user_id, limit, offset)` already returns full `Synthesis` records with `sections` JSONB. For a window of 3-10 syntheses (15-150 KB of JSON), application-level domain extraction is pragmatically fine for a single-tenant deployment.
|
||||||
|
|
||||||
|
## Domain extraction from existing syntheses
|
||||||
|
|
||||||
|
The `sections` column is JSONB with structure:
|
||||||
|
```json
|
||||||
|
[
|
||||||
|
{
|
||||||
|
"title": "Category Name",
|
||||||
|
"items": [
|
||||||
|
{ "title": "...", "url": "https://example.com/article", "summary": "..." }
|
||||||
|
]
|
||||||
|
}
|
||||||
|
]
|
||||||
|
```
|
||||||
|
|
||||||
|
Extract domains by parsing each item's `url` with `url::Url::parse` and `host_str()`. Reuse the existing `extract_domain` function in `synthesis.rs` (private fn, same module).
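
The traversal can be sketched as follows. The `Item` and `Section` structs are plain stand-ins for the deserialized JSONB, and `naive_host` is a deliberately simplified substitute for `url::Url::parse` plus `host_str()`. Both are assumptions made so the example needs no external crates, and the real `extract_domain` should be preferred.

```rust
/// Plain stand-ins for the deserialized `sections` JSONB (illustrative names).
struct Item {
    url: String,
}
struct Section {
    items: Vec<Item>,
}

/// Naive host extraction for illustration only; it does not handle
/// IPv6 literals or other cases that a real URL parser covers.
fn naive_host(url: &str) -> Option<String> {
    let rest = url.split_once("://")?.1; // require a scheme
    let authority = rest.split(|c: char| matches!(c, '/' | '?' | '#')).next()?;
    let host_port = authority.rsplit('@').next()?; // drop userinfo if present
    let host = host_port.split(':').next()?; // drop the port
    if host.is_empty() {
        None
    } else {
        Some(host.to_lowercase())
    }
}

/// Walk every item of every section and return the sorted, unique hosts.
fn collect_domains(sections: &[Section]) -> Vec<String> {
    let mut domains: Vec<String> = sections
        .iter()
        .flat_map(|s| s.items.iter())
        .filter_map(|i| naive_host(&i.url)) // unparseable URLs are skipped
        .collect();
    domains.sort();
    domains.dedup();
    domains
}
```

Note that `host_str()` returns the full host, so `www.example.com` and `example.com` count as distinct domains unless they are normalized separately.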
|
||||||
|
|
||||||
|
## Unit tests
|
||||||
|
|
||||||
|
- `build_search_prompt` with non-empty `recent_domains` → prompt contains avoidance instruction
|
||||||
|
- `build_search_prompt` with empty `recent_domains` → prompt unchanged
|
||||||
|
- Validation of `source_diversity_window` bounds (0 and 10 pass, -1 and 11 fail)
|
||||||
|
|
||||||
|
## Prompt modification
|
||||||
|
|
||||||
|
In `build_search_prompt`, add a `recent_domains: &[String]` parameter. Callers pass an empty slice when there are no domains to avoid. If the slice is non-empty, append to the user prompt:
|
||||||
|
|
||||||
|
```
|
||||||
|
Evite si possible les sources deja utilisees dans les syntheses precedentes : domaine1.com, domaine2.com, ...
|
||||||
|
```
|
||||||
|
|
||||||
|
This is a soft instruction — the LLM can still use these domains if no alternatives are available.
|
||||||
|
|
||||||
|
## What does NOT change
|
||||||
|
|
||||||
|
- JSON schema — no changes
|
||||||
|
- Scraper — no changes
|
||||||
|
- Rewrite pass — no changes
|
||||||
|
- `limit_articles_per_source` — still enforces hard cap within a single synthesis
|
||||||
|
- `dedup_by_url` — still deduplicates within a single synthesis
|
||||||
|
- No new database table — domains are extracted from existing `syntheses.sections` JSONB
|
||||||