diff --git a/docs/superpowers/plans/2026-03-24-article-history.md b/docs/superpowers/plans/2026-03-24-article-history.md
new file mode 100644
index 0000000..592cfde
--- /dev/null
+++ b/docs/superpowers/plans/2026-03-24-article-history.md
@@ -0,0 +1,590 @@
+# Article History: Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Prevent duplicate articles across syntheses by maintaining a persistent per-user article URL history with a configurable TTL.
+
+**Architecture:** New `article_history` table with SHA-256 hashed URLs. The pipeline filters candidates against history before classification. URLs are inserted after the synthesis is saved. Old entries are cleaned up before each generation.
+
+**Tech Stack:** Rust (sqlx, sha2, url crate), PostgreSQL
+
+**Spec:** `docs/superpowers/specs/2026-03-24-article-history-design.md`
+
+---
+
+### Task 1: Migration + settings field
+
+**Files:**
+- Create: `backend/migrations/20260324000015_add_article_history.sql`
+- Modify: `backend/src/models/settings.rs`
+- Modify: `backend/src/db/settings.rs`
+- Modify: `backend/src/services/prompts.rs` (test fixture)
+- Modify: `CLAUDE.md`
+
+- [ ] **Step 1: Create migration**
+
+```sql
+-- Article history table for cross-synthesis URL deduplication
+CREATE TABLE article_history (
+    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
+    url_hash TEXT NOT NULL,
+    url TEXT NOT NULL,
+    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
+);
+CREATE UNIQUE INDEX idx_article_history_user_url ON article_history(user_id, url_hash);
+
+-- Setting for history TTL
+ALTER TABLE settings ADD COLUMN article_history_days INTEGER NOT NULL DEFAULT 90;
+```
+
+- [ ] **Step 2: Add `article_history_days` to settings model**
+
+Add `pub article_history_days: i32` to `UserSettings`, `SettingsResponse`, and `UpdateSettingsRequest`. Add it to the `From` impl and to `Default` (90), and validate the range (the pipeline treats 0 as "history disabled"):
+```rust
+if !(0..=365).contains(&self.article_history_days) {
+    return Err("article_history_days must be between 0 and 365".into());
+}
+```
+
+- [ ] **Step 3: Add to DB queries**
+
+Add the field to `SettingsRow`, its `TryFrom` impl, and both SQL queries in `db/settings.rs`.
+
+- [ ] **Step 4: Update test fixtures**
+
+Add `article_history_days: 90` to `valid_request()` in settings tests and `test_settings()` in prompts tests.
+
+- [ ] **Step 5: Update CLAUDE.md migration count to 15**
+
+- [ ] **Step 6: Run tests + commit**
+
+```bash
+cd backend && cargo test --lib
+git add backend/migrations/20260324000015_add_article_history.sql backend/src/models/settings.rs backend/src/db/settings.rs backend/src/services/prompts.rs CLAUDE.md
+git commit -m "feat: add article_history table and article_history_days setting"
+```
+
+---
+
+### Task 2: DB module for article history
+
+**Files:**
+- Create: `backend/src/db/article_history.rs`
+- Modify: `backend/src/db/mod.rs`
+
+- [ ] **Step 1: Create `backend/src/db/article_history.rs`**
+
+```rust
+//! Article history: tracks which article URLs have been used in past syntheses.
+//!
+//! Prevents the same article from appearing in multiple syntheses.
+
+use std::collections::HashSet;
+use sqlx::PgPool;
+use uuid::Uuid;
+use crate::errors::AppError;
+
+/// Check which URL hashes already exist in history for this user.
+///
+/// Returns the set of url_hashes that were found (i.e., already used).
+pub async fn check_urls_exist(
+    pool: &PgPool,
+    user_id: Uuid,
+    url_hashes: &[String],
+) -> Result<HashSet<String>, AppError> {
+    if url_hashes.is_empty() {
+        return Ok(HashSet::new());
+    }
+
+    let rows = sqlx::query_scalar::<_, String>(
+        "SELECT url_hash FROM article_history WHERE user_id = $1 AND url_hash = ANY($2)",
+    )
+    .bind(user_id)
+    .bind(url_hashes)
+    .fetch_all(pool)
+    .await?;
+
+    Ok(rows.into_iter().collect())
+}
+
+/// Insert a batch of article URLs into history, one INSERT per URL.
+///
+/// Uses ON CONFLICT DO NOTHING to silently skip duplicates.
+pub async fn insert_urls(
+    pool: &PgPool,
+    user_id: Uuid,
+    urls: &[(String, String)], // (url, url_hash) pairs
+) -> Result<(), AppError> {
+    if urls.is_empty() {
+        return Ok(());
+    }
+
+    for (url, url_hash) in urls {
+        sqlx::query(
+            "INSERT INTO article_history (user_id, url_hash, url) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING",
+        )
+        .bind(user_id)
+        .bind(url_hash)
+        .bind(url)
+        .execute(pool)
+        .await?;
+    }
+
+    Ok(())
+}
+
+/// Delete history entries older than N days for this user.
+///
+/// Returns the number of deleted rows.
+pub async fn cleanup_old(
+    pool: &PgPool,
+    user_id: Uuid,
+    days: i32,
+) -> Result<u64, AppError> {
+    let result = sqlx::query(
+        "DELETE FROM article_history WHERE user_id = $1 AND created_at < now() - make_interval(days => $2)",
+    )
+    .bind(user_id)
+    .bind(days)
+    .execute(pool)
+    .await?;
+
+    Ok(result.rows_affected())
+}
+```
+
+- [ ] **Step 2: Register module in `db/mod.rs`**
+
+Add `pub mod article_history;` (alphabetical order, after `api_keys`).
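
The contract of the three DB functions above can be illustrated without a database. The following is only a sketch under stated assumptions: `InMemoryHistory` is a hypothetical stand-in (not part of the plan), user ids are plain `u64` values instead of `Uuid`, and the current time is passed in explicitly so the TTL behaviour is easy to follow.

```rust
use std::collections::{HashMap, HashSet};
use std::time::{Duration, SystemTime};

/// Hypothetical in-memory stand-in for the `article_history` table.
/// Rows are keyed by (user_id, url_hash), mirroring the unique index;
/// the value holds (url, created_at).
#[derive(Default)]
struct InMemoryHistory {
    rows: HashMap<(u64, String), (String, SystemTime)>,
}

impl InMemoryHistory {
    /// Mirrors `check_urls_exist`: returns the subset of hashes already stored.
    fn check_urls_exist(&self, user_id: u64, url_hashes: &[String]) -> HashSet<String> {
        url_hashes
            .iter()
            .filter(|h| self.rows.contains_key(&(user_id, h.to_string())))
            .cloned()
            .collect()
    }

    /// Mirrors `insert_urls`: an existing row is left untouched,
    /// which is the `ON CONFLICT DO NOTHING` behaviour.
    fn insert_urls(&mut self, user_id: u64, urls: &[(String, String)], now: SystemTime) {
        for (url, url_hash) in urls {
            self.rows
                .entry((user_id, url_hash.clone()))
                .or_insert_with(|| (url.clone(), now));
        }
    }

    /// Mirrors `cleanup_old`: deletes this user's rows older than `days`
    /// and returns how many were removed.
    fn cleanup_old(&mut self, user_id: u64, days: u64, now: SystemTime) -> u64 {
        let cutoff = now - Duration::from_secs(days * 86_400);
        let before = self.rows.len();
        self.rows
            .retain(|(uid, _), (_, created)| *uid != user_id || *created >= cutoff);
        (before - self.rows.len()) as u64
    }
}
```

Because a duplicate insert does not refresh `created_at`, an article first seen 100 days ago is still removed by a 90-day cleanup even if it was "re-inserted" yesterday; the SQL version behaves the same way.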
+
+- [ ] **Step 3: Run tests + commit**
+
+```bash
+cd backend && cargo test --lib && cargo build
+git add backend/src/db/article_history.rs backend/src/db/mod.rs
+git commit -m "feat: add article_history DB module (check, insert, cleanup)"
+```
+
+---
+
+### Task 3: URL normalization utility + unit tests
+
+**Files:**
+- Modify: `backend/src/services/synthesis.rs`
+
+- [ ] **Step 1: Add `normalize_article_url` function**
+
+Add near the other URL helper functions (near `extract_domain`):
+
+```rust
+/// Normalize an article URL for consistent history hashing.
+///
+/// Strips fragments, trailing slashes, and known tracking query parameters
+/// so that the same article with different UTM tags is recognized as a duplicate.
+fn normalize_article_url(url_str: &str) -> String {
+    let Ok(mut parsed) = url::Url::parse(url_str) else {
+        return url_str.to_lowercase();
+    };
+
+    // Strip fragment
+    parsed.set_fragment(None);
+
+    // Known tracking query parameters to drop
+    let tracking_params: &[&str] = &[
+        "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
+        "ref", "source", "fbclid", "gclid",
+    ];
+
+    let filtered_pairs: Vec<(String, String)> = parsed
+        .query_pairs()
+        .filter(|(key, _)| !tracking_params.contains(&key.as_ref()))
+        .map(|(k, v)| (k.into_owned(), v.into_owned()))
+        .collect();
+
+    if filtered_pairs.is_empty() {
+        parsed.set_query(None);
+    } else {
+        // Re-serialize through the form-urlencoded serializer so that a decoded
+        // `&` or `=` inside a key or value is escaped again.
+        parsed.query_pairs_mut().clear().extend_pairs(&filtered_pairs);
+    }
+
+    // Strip trailing slash (unless path is just "/")
+    let path = parsed.path().to_string();
+    if path.len() > 1 && path.ends_with('/') {
+        parsed.set_path(&path[..path.len() - 1]);
+    }
+
+    parsed.to_string().to_lowercase()
+}
+
+/// Compute the hash of a normalized article URL for history lookup.
+fn hash_article_url(url: &str) -> String {
+    let normalized = normalize_article_url(url);
+    crate::util::token::hash_token(&normalized)
+}
+```
+
+- [ ] **Step 2: Add unit tests**
+
+```rust
+    // ── normalize_article_url tests ─────────────────────────────
+
+    #[test]
+    fn normalize_strips_fragment() {
+        assert_eq!(
+            normalize_article_url("https://example.com/article#section"),
+            "https://example.com/article"
+        );
+    }
+
+    #[test]
+    fn normalize_strips_utm_params() {
+        assert_eq!(
+            normalize_article_url("https://example.com/article?utm_source=twitter&utm_medium=social"),
+            "https://example.com/article"
+        );
+    }
+
+    #[test]
+    fn normalize_keeps_non_tracking_params() {
+        let result = normalize_article_url("https://example.com/search?q=test&utm_source=twitter");
+        assert!(result.contains("q=test"));
+        assert!(!result.contains("utm_source"));
+    }
+
+    #[test]
+    fn normalize_strips_trailing_slash() {
+        assert_eq!(
+            normalize_article_url("https://example.com/article/"),
+            "https://example.com/article"
+        );
+    }
+
+    #[test]
+    fn normalize_keeps_root_slash() {
+        assert_eq!(
+            normalize_article_url("https://example.com/"),
+            "https://example.com/"
+        );
+    }
+
+    #[test]
+    fn normalize_lowercases() {
+        assert_eq!(
+            normalize_article_url("https://Example.COM/Article"),
+            "https://example.com/article" // entire URL lowercased for consistent hashing
+        );
+    }
+
+    #[test]
+    fn normalize_handles_invalid_url() {
+        let result = normalize_article_url("not a url at all");
+        assert_eq!(result, "not a url at all");
+    }
+
+    #[test]
+    fn normalize_strips_fbclid() {
+        let result = normalize_article_url("https://example.com/post?fbclid=abc123");
+        assert!(!result.contains("fbclid"));
+        assert!(!result.contains("?"));
+    }
+
+    #[test]
+    fn hash_article_url_deterministic() {
+        let h1 = hash_article_url("https://example.com/article?utm_source=twitter");
+        let h2 = hash_article_url("https://example.com/article?utm_source=newsletter");
+        assert_eq!(h1, h2, "Same article with different UTM params should hash the same");
+    }
+```
+
+- [ ] **Step 3: Run tests + commit**
+
+```bash
+cd backend && cargo test --lib
+git add backend/src/services/synthesis.rs
+git commit -m "feat: add normalize_article_url and hash_article_url utilities"
+```
+
+---
+
+### Task 4: Pipeline integration (history filtering, insert, cleanup)
+
+**Files:**
+- Modify: `backend/src/services/synthesis.rs`
+
+This is the core integration task. All changes are in `run_generation_inner`.
+
+- [ ] **Step 1: Add cleanup at the start of generation**
+
+After loading settings (around line 259), add:
+```rust
+    // Cleanup old article history entries
+    if settings.article_history_days > 0 {
+        let deleted = db::article_history::cleanup_old(
+            &state.pool,
+            user_id,
+            settings.article_history_days,
+        )
+        .await
+        .unwrap_or(0);
+        if deleted > 0 {
+            tracing::info!(deleted = deleted, "Cleaned up old article history entries");
+        }
+    }
+```
+
+- [ ] **Step 2: Add history filtering in Phase 1**
+
+After filtering empty content (around line 376, after `let valid_articles = ...`), add history filtering:
+
+```rust
+    // 1d. Filter against article history (cross-synthesis dedup)
+    // `mut` because the retry in step 1e extends this list.
+    let mut valid_articles = if settings.article_history_days > 0 {
+        let hashes: Vec<String> = valid_articles.iter().map(|a| hash_article_url(&a.url)).collect();
+        let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes)
+            .await
+            .unwrap_or_default();
+        if !existing.is_empty() {
+            tracing::info!(filtered = existing.len(), "Phase 1: filtered articles already in history");
+        }
+        valid_articles
+            .into_iter()
+            .filter(|a| !existing.contains(&hash_article_url(&a.url)))
+            .collect::<Vec<_>>()
+    } else {
+        valid_articles
+    };
+```
+
+- [ ] **Step 3: Add Phase 1 retry logic when under-filled**
+
+After the history filtering in Phase 1 (Step 2), check whether enough articles remain. If the list is under-filled, retry once with the same sources, excluding already-fetched URLs:
+
+```rust
+    // 1e. Retry if under-filled (1 attempt)
+    let target = settings.categories.len() * settings.max_items_per_category as usize;
+    if valid_articles.len() < target && settings.article_history_days > 0 {
+        tracing::info!(
+            have = valid_articles.len(),
+            need = target,
+            "Phase 1 under-filled after history filter, retrying with same sources"
+        );
+
+        // Collect all URLs already fetched (valid + filtered)
+        let already_fetched: std::collections::HashSet<String> = candidate_urls
+            .iter()
+            .map(|u| u.to_lowercase())
+            .collect();
+
+        // Re-scrape source pages for new links
+        let mut retry_urls: Vec<String> = Vec::new();
+        for source in sources.iter().take(max_sources) {
+            let links = if settings.use_llm_for_source_links {
+                source_scraper::extract_article_links_with_llm(
+                    &state.http_client, &source.url, max_links_per_source,
+                    &provider, &model_research,
+                ).await
+            } else {
+                source_scraper::extract_article_links(
+                    &state.http_client, &source.url, max_links_per_source,
+                ).await
+            };
+            if let Ok(links) = links {
+                for link in links {
+                    if !already_fetched.contains(&link.to_lowercase()) {
+                        retry_urls.push(link);
+                    }
+                }
+            }
+        }
+
+        if !retry_urls.is_empty() {
+            // Scrape retry candidates
+            let retry_scraped = scrape_flat_urls(
+                state, &retry_urls, settings.max_age_days as i64, tx,
+                llm_for_scraping.clone(),
+            ).await;
+            let retry_valid: Vec<_> = retry_scraped
+                .into_iter()
+                .filter(|a| !a.scraped_content.trim().is_empty())
+                .collect();
+
+            // Filter against history
+            let retry_valid = if !retry_valid.is_empty() {
+                let hashes: Vec<String> = retry_valid.iter().map(|a| hash_article_url(&a.url)).collect();
+                let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes)
+                    .await.unwrap_or_default();
+                retry_valid.into_iter()
+                    .filter(|a| !existing.contains(&hash_article_url(&a.url)))
+                    .collect::<Vec<_>>()
+            } else {
+                retry_valid
+            };
+
+            // Merge with existing valid articles
+            valid_articles.extend(retry_valid);
+            tracing::info!(total = valid_articles.len(), "Phase 1 after retry");
+        }
+    }
+```
+
+Note: `valid_articles.extend(...)` only compiles if `valid_articles` is a mutable binding, so it must be declared as `let mut valid_articles` in Step 2.
+
+- [ ] **Step 4: Add history filtering in Phase 2 (before scraping)**
+
+In Phase 2, after `dedup_by_url` and `limit_articles_per_source` (around line 552), before `scrape_articles`, add:
+
+```rust
+    // Filter against article history BEFORE scraping (saves HTTP requests)
+    let parsed = if settings.article_history_days > 0 {
+        let all_urls: Vec<String> = parsed.iter()
+            .flat_map(|(_, items)| items.iter().map(|i| i.url.clone()))
+            .collect();
+        let hashes: Vec<String> = all_urls.iter().map(|u| hash_article_url(u)).collect();
+        let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes)
+            .await
+            .unwrap_or_default();
+        if !existing.is_empty() {
+            tracing::info!(filtered = existing.len(), "Phase 2: filtered articles already in history");
+        }
+        parsed
+            .into_iter()
+            .map(|(cat_key, items)| {
+                let filtered = items
+                    .into_iter()
+                    .filter(|item| !existing.contains(&hash_article_url(&item.url)))
+                    .collect();
+                (cat_key, filtered)
+            })
+            .collect()
+    } else {
+        parsed
+    };
+```
+
+- [ ] **Step 5: Insert article URLs after saving synthesis**
+
+After `db::syntheses::create` (around line 638), add:
+
+```rust
+    // Record article URLs in history for cross-synthesis dedup
+    if settings.article_history_days > 0 {
+        let article_urls: Vec<(String, String)> = final_sections
+            .iter()
+            .flat_map(|section| section.items.iter())
+            .map(|item| (item.url.clone(), hash_article_url(&item.url)))
+            .collect();
+        db::article_history::insert_urls(&state.pool, user_id, &article_urls)
+            .await
+            .ok(); // Don't fail synthesis if history insert fails
+    }
+```
+
+- [ ] **Step 6: Run tests + commit**
+
+```bash
+cd backend && cargo test --lib && cargo build
+git add backend/src/services/synthesis.rs
+git commit -m "feat: article history filtering in pipeline (cleanup, Phase 1/2 filter, retry, insert after save)"
+```
+
+---
+
+### Task 5: Frontend setting
+
+**Files:**
+- Modify: `frontend/src/types.ts`
+- Modify: `frontend/src/i18n/fr.ts`
+- Modify: `frontend/src/pages/Settings.tsx`
+
+- [ ] **Step 1: Add field to types + DEFAULT_SETTINGS**
+
+```typescript
+// In UserSettings:
+article_history_days: number;
+
+// In DEFAULT_SETTINGS:
+article_history_days: 90,
+```
+
+- [ ] **Step 2: Add i18n label**
+
+```typescript
+'settings.articleHistoryDays': 'Historique des articles (jours)',
+```
+
+- [ ] **Step 3: Add number input to Settings page**
+
+Add inside the generation settings grid (alongside the other number inputs):
+
+```tsx
+            setSettings((prev) => ({
+              ...prev,
+              // Keep an explicit 0 (history disabled); fall back to 90 only
+              // when the input is empty or not a number.
+              article_history_days: Number.isNaN(e.currentTarget.valueAsNumber)
+                ? 90
+                : Math.trunc(e.currentTarget.valueAsNumber),
+            }))
+          }
+        />
+```
+
+- [ ] **Step 4: Run frontend tests + commit**
+
+```bash
+cd frontend && npx tsc --noEmit && npx vitest run
+git add frontend/src/types.ts frontend/src/i18n/fr.ts frontend/src/pages/Settings.tsx
+git commit -m "feat: add article_history_days setting to frontend"
+```
+
+---
+
+### Task 6: Update E2E and integration tests
+
+**Files:**
+- Modify: `e2e/tests/generation-live.spec.ts`
+- Modify: `backend/tests/api_syntheses_test.rs`
+
+- [ ] **Step 1: Update E2E settings payload**
+
+Add `article_history_days: 90` to the PUT settings body.
+
+- [ ] **Step 2: Update integration test settings payload**
+
+In `api_syntheses_test.rs`, add `"article_history_days": 90` to the PUT settings body.
+
+- [ ] **Step 3: Run E2E test to verify**
+
+```bash
+cd e2e && docker compose -f docker-compose.test.yml down
+docker compose -f docker-compose.test.yml up --build -d
+sleep 25 && npx tsx seed.ts && npx playwright test generation-live --reporter=list
+```
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add e2e/tests/generation-live.spec.ts backend/tests/api_syntheses_test.rs
+git commit -m "test: update E2E and integration tests with article_history_days setting"
+```
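
When writing or reviewing the tests, it may help to have the Task 4 flow (filter against history, check for under-fill, record after save) in a form that runs without a database or network. The sketch below is an assumption-laden illustration, not the pipeline code: `stand_in_hash`, `filter_against_history`, `is_under_filled`, and `record_in_history` are hypothetical names, a `HashSet` replaces the `article_history` table, and `DefaultHasher` over a lowercased URL replaces the SHA-256 hash of the normalized URL purely to stay within the standard library.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Stand-in for `hash_article_url`. The plan hashes the normalized URL with
/// SHA-256; this version only lowercases and uses `DefaultHasher`.
fn stand_in_hash(url: &str) -> String {
    let mut hasher = DefaultHasher::new();
    url.to_lowercase().hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Drop candidates whose hash is already in `history` and return the rest,
/// as the Phase 1 and Phase 2 filters do.
fn filter_against_history(candidates: Vec<String>, history: &HashSet<String>) -> Vec<String> {
    candidates
        .into_iter()
        .filter(|url| !history.contains(&stand_in_hash(url)))
        .collect()
}

/// Under-fill check from Task 4, Step 3: the target is
/// categories x max items per category.
fn is_under_filled(have: usize, categories: usize, max_items_per_category: usize) -> bool {
    have < categories * max_items_per_category
}

/// Record the URLs of a saved synthesis (Task 4, Step 5).
/// Inserting an existing hash is a no-op, like `ON CONFLICT DO NOTHING`.
fn record_in_history(history: &mut HashSet<String>, used: &[String]) {
    for url in used {
        history.insert(stand_in_hash(url));
    }
}
```

With two categories and three items per category the target is six, so a run that keeps only one article after filtering would trigger the single retry described in Step 3.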