# Article History — Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Prevent duplicate articles across syntheses by maintaining a persistent per-user article URL history with configurable TTL.

**Architecture:** New `article_history` table with SHA-256 hashed URLs. The pipeline filters candidates against history before classification. URLs are inserted after the synthesis is saved. Old entries are cleaned up before each generation.

**Tech Stack:** Rust (sqlx, sha2, url crate), PostgreSQL

**Spec:** `docs/superpowers/specs/2026-03-24-article-history-design.md`

---

### Task 1: Migration + settings field

**Files:**
- Create: `backend/migrations/20260324000015_add_article_history.sql`
- Modify: `backend/src/models/settings.rs`
- Modify: `backend/src/db/settings.rs`
- Modify: `backend/src/services/prompts.rs` (test fixture)
- Modify: `CLAUDE.md`

- [ ] **Step 1: Create migration**

```sql
-- Article history table for cross-synthesis URL deduplication
CREATE TABLE article_history (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    url_hash TEXT NOT NULL,
    url TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE UNIQUE INDEX idx_article_history_user_url
    ON article_history(user_id, url_hash);

-- Setting for history TTL
ALTER TABLE settings
    ADD COLUMN article_history_days INTEGER NOT NULL DEFAULT 90;
```

- [ ] **Step 2: Add `article_history_days` to settings model**

Add `pub article_history_days: i32` to `UserSettings`, `SettingsResponse`, `UpdateSettingsRequest`.
Add it to the `From` impl, to `Default` (90), and to validation:

```rust
if !(0..=365).contains(&self.article_history_days) {
    return Err("article_history_days must be between 0 and 365".into());
}
```

- [ ] **Step 3: Add to DB queries**

Add to `SettingsRow`, `TryFrom`, and both SQL queries in `db/settings.rs`.

- [ ] **Step 4: Update test fixtures**

Add `article_history_days: 90` to `valid_request()` in settings tests and `test_settings()` in prompts tests.

- [ ] **Step 5: Update CLAUDE.md migration count to 15**

- [ ] **Step 6: Run tests + commit**

```bash
# Run the tests in a subshell so the git paths below stay relative to the repo root
(cd backend && cargo test --lib)
git add backend/migrations/20260324000015_add_article_history.sql backend/src/models/settings.rs backend/src/db/settings.rs backend/src/services/prompts.rs CLAUDE.md
git commit -m "feat: add article_history table and article_history_days setting"
```

---

### Task 2: DB module for article history

**Files:**
- Create: `backend/src/db/article_history.rs`
- Modify: `backend/src/db/mod.rs`

- [ ] **Step 1: Create `backend/src/db/article_history.rs`**

```rust
//! Article history: tracks which article URLs have been used in past syntheses.
//!
//! Prevents the same article from appearing in multiple syntheses.

use std::collections::HashSet;

use sqlx::PgPool;
use uuid::Uuid;

use crate::errors::AppError;

/// Check which URL hashes already exist in history for this user.
///
/// Returns the set of url_hashes that were found (i.e., already used).
pub async fn check_urls_exist(
    pool: &PgPool,
    user_id: Uuid,
    url_hashes: &[String],
) -> Result<HashSet<String>, AppError> {
    if url_hashes.is_empty() {
        return Ok(HashSet::new());
    }

    let rows = sqlx::query_scalar::<_, String>(
        "SELECT url_hash FROM article_history WHERE user_id = $1 AND url_hash = ANY($2)",
    )
    .bind(user_id)
    .bind(url_hashes)
    .fetch_all(pool)
    .await?;

    Ok(rows.into_iter().collect())
}

/// Insert article URLs into history (batch).
///
/// Uses ON CONFLICT DO NOTHING to silently skip duplicates.
pub async fn insert_urls(
    pool: &PgPool,
    user_id: Uuid,
    urls: &[(String, String)], // (url, url_hash) pairs
) -> Result<(), AppError> {
    if urls.is_empty() {
        return Ok(());
    }

    // One INSERT per URL; the unique (user_id, url_hash) index makes repeats a no-op.
    for (url, url_hash) in urls {
        sqlx::query(
            "INSERT INTO article_history (user_id, url_hash, url) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING",
        )
        .bind(user_id)
        .bind(url_hash)
        .bind(url)
        .execute(pool)
        .await?;
    }

    Ok(())
}

/// Delete history entries older than N days for this user.
///
/// Returns the number of deleted rows.
pub async fn cleanup_old(
    pool: &PgPool,
    user_id: Uuid,
    days: i32,
) -> Result<u64, AppError> {
    let result = sqlx::query(
        "DELETE FROM article_history WHERE user_id = $1 AND created_at < now() - make_interval(days => $2)",
    )
    .bind(user_id)
    .bind(days)
    .execute(pool)
    .await?;

    Ok(result.rows_affected())
}
```

- [ ] **Step 2: Register module in `db/mod.rs`**

Add `pub mod article_history;` (alphabetical order, after `api_keys`).

- [ ] **Step 3: Run tests + commit**

```bash
# Run in a subshell so the git paths below stay relative to the repo root
(cd backend && cargo test --lib && cargo build)
git add backend/src/db/article_history.rs backend/src/db/mod.rs
git commit -m "feat: add article_history DB module (check, insert, cleanup)"
```

---

### Task 3: URL normalization utility + unit tests

**Files:**
- Modify: `backend/src/services/synthesis.rs`

- [ ] **Step 1: Add `normalize_article_url` function**

Add near the other URL helper functions (near `extract_domain`):

```rust
/// Normalize an article URL for consistent history hashing.
///
/// Strips fragments, trailing slashes, and known tracking query parameters
/// so that the same article with different UTM tags is recognized as a duplicate.
fn normalize_article_url(url_str: &str) -> String {
    // Unparseable input is only lowercased so it still hashes consistently.
    let Ok(mut parsed) = url::Url::parse(url_str) else {
        return url_str.to_lowercase();
    };

    // Strip fragment
    parsed.set_fragment(None);

    // Strip known tracking query parameters
    let tracking_params: &[&str] = &[
        "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
        "ref", "source", "fbclid", "gclid",
    ];

    let filtered_pairs: Vec<(String, String)> = parsed
        .query_pairs()
        .filter(|(key, _)| !tracking_params.contains(&key.as_ref()))
        .map(|(k, v)| (k.into_owned(), v.into_owned()))
        .collect();

    if filtered_pairs.is_empty() {
        parsed.set_query(None);
    } else {
        let query_string = filtered_pairs
            .iter()
            .map(|(k, v)| format!("{}={}", k, v))
            .collect::<Vec<_>>()
            .join("&");
        parsed.set_query(Some(&query_string));
    }

    // Strip trailing slash (unless path is just "/")
    let path = parsed.path().to_string();
    if path.len() > 1 && path.ends_with('/') {
        parsed.set_path(&path[..path.len() - 1]);
    }

    // Lowercase the whole URL (including the path) so hashing is case-insensitive.
    parsed.to_string().to_lowercase()
}

/// Compute the hash of a normalized article URL for history lookup.
fn hash_article_url(url: &str) -> String {
    let normalized = normalize_article_url(url);
    crate::util::token::hash_token(&normalized)
}
```

- [ ] **Step 2: Add unit tests**

```rust
// ── normalize_article_url tests ─────────────────────────────

#[test]
fn normalize_strips_fragment() {
    assert_eq!(
        normalize_article_url("https://example.com/article#section"),
        "https://example.com/article"
    );
}

#[test]
fn normalize_strips_utm_params() {
    assert_eq!(
        normalize_article_url("https://example.com/article?utm_source=twitter&utm_medium=social"),
        "https://example.com/article"
    );
}

#[test]
fn normalize_keeps_non_tracking_params() {
    let result = normalize_article_url("https://example.com/search?q=test&utm_source=twitter");
    assert!(result.contains("q=test"));
    assert!(!result.contains("utm_source"));
}

#[test]
fn normalize_strips_trailing_slash() {
    assert_eq!(
        normalize_article_url("https://example.com/article/"),
        "https://example.com/article"
    );
}

#[test]
fn normalize_keeps_root_slash() {
    assert_eq!(
        normalize_article_url("https://example.com/"),
        "https://example.com/"
    );
}

#[test]
fn normalize_lowercases() {
    assert_eq!(
        normalize_article_url("https://Example.COM/Article"),
        "https://example.com/article" // entire URL lowercased for consistent hashing
    );
}

#[test]
fn normalize_handles_invalid_url() {
    let result = normalize_article_url("not a url at all");
    assert_eq!(result, "not a url at all");
}

#[test]
fn normalize_strips_fbclid() {
    let result = normalize_article_url("https://example.com/post?fbclid=abc123");
    assert!(!result.contains("fbclid"));
    assert!(!result.contains("?"));
}

#[test]
fn hash_article_url_deterministic() {
    let h1 = hash_article_url("https://example.com/article?utm_source=twitter");
    let h2 = hash_article_url("https://example.com/article?utm_source=newsletter");
    assert_eq!(h1, h2, "Same article with different UTM params should hash the same");
}
```

- [ ] **Step 3: Run tests + commit**

```bash
# Run in a subshell so the git paths below stay relative to the repo root
(cd backend && cargo test --lib)
git add backend/src/services/synthesis.rs
git commit -m "feat: add normalize_article_url and hash_article_url utilities"
```

---

### Task 4: Pipeline integration — history filtering, insert, cleanup

**Files:**
- Modify: `backend/src/services/synthesis.rs`

This is the core integration task. All changes are in `run_generation_inner`.

- [ ] **Step 1: Add cleanup at the start of generation**

After loading settings (around line 259), add:

```rust
// Cleanup old article history entries
if settings.article_history_days > 0 {
    // A failed cleanup is treated as "0 rows deleted" and never aborts generation.
    let deleted = db::article_history::cleanup_old(
        &state.pool,
        user_id,
        settings.article_history_days,
    )
    .await
    .unwrap_or(0);
    if deleted > 0 {
        tracing::info!(deleted = deleted, "Cleaned up old article history entries");
    }
}
```

- [ ] **Step 2: Add history filtering in Phase 1**

After filtering empty content (around line 376, after `let valid_articles = ...`), add history filtering:

```rust
// 1d. Filter against article history (cross-synthesis dedup)
let valid_articles = if settings.article_history_days > 0 {
    let hashes: Vec<String> = valid_articles.iter().map(|a| hash_article_url(&a.url)).collect();
    // On a lookup error, fall back to an empty set (nothing is filtered).
    let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes)
        .await
        .unwrap_or_default();
    if !existing.is_empty() {
        tracing::info!(filtered = existing.len(), "Phase 1: filtered articles already in history");
    }
    valid_articles
        .into_iter()
        .filter(|a| !existing.contains(&hash_article_url(&a.url)))
        .collect::<Vec<_>>()
} else {
    valid_articles
};
```

- [ ] **Step 3: Add Phase 1 retry logic when under-filled**

After the history filtering in Phase 1 (Step 2), check whether enough articles remain. If under-filled, do one retry with the same sources, excluding already-fetched URLs:

```rust
// 1e. Retry if under-filled (1 attempt)
let target = settings.categories.len() * settings.max_items_per_category as usize;
if valid_articles.len() < target && settings.article_history_days > 0 {
    tracing::info!(
        have = valid_articles.len(),
        need = target,
        "Phase 1 under-filled after history filter, retrying with same sources"
    );

    // Collect all URLs already fetched (valid + filtered)
    let already_fetched: std::collections::HashSet<String> = candidate_urls
        .iter()
        .map(|u| u.to_lowercase())
        .collect();

    // Re-scrape source pages for new links
    let mut retry_urls: Vec<String> = Vec::new();
    for source in sources.iter().take(max_sources) {
        let links = if settings.use_llm_for_source_links {
            source_scraper::extract_article_links_with_llm(
                &state.http_client,
                &source.url,
                max_links_per_source,
                &provider,
                &model_research,
            ).await
        } else {
            source_scraper::extract_article_links(
                &state.http_client,
                &source.url,
                max_links_per_source,
            ).await
        };
        if let Ok(links) = links {
            for link in links {
                if !already_fetched.contains(&link.to_lowercase()) {
                    retry_urls.push(link);
                }
            }
        }
    }

    if !retry_urls.is_empty() {
        // Scrape retry candidates
        let retry_scraped = scrape_flat_urls(
            state,
            &retry_urls,
            settings.max_age_days as i64,
            tx,
            llm_for_scraping.clone(),
        ).await;

        let retry_valid: Vec<_> = retry_scraped
            .into_iter()
            .filter(|a| !a.scraped_content.trim().is_empty())
            .collect();

        // Filter against history
        let retry_valid = if !retry_valid.is_empty() {
            let hashes: Vec<String> = retry_valid.iter().map(|a| hash_article_url(&a.url)).collect();
            let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes)
                .await
                .unwrap_or_default();
            retry_valid
                .into_iter()
                .filter(|a| !existing.contains(&hash_article_url(&a.url)))
                .collect::<Vec<_>>()
        } else {
            retry_valid
        };

        // Merge with existing valid articles
        valid_articles.extend(retry_valid);
        tracing::info!(total = valid_articles.len(), "Phase 1 after retry");
    }
}
```

Note: `valid_articles.extend(...)` requires a mutable binding, so the binding introduced in Step 2 must be declared as `let mut valid_articles` for this to compile.
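
The under-fill check and the merge at the end of Step 3 can be exercised in isolation. What follows is a minimal standard-library sketch, not the pipeline code: `is_under_filled` and `merge_retry` are hypothetical helper names that do not appear in the plan, plain URL strings stand in for the scraped article type, and lowercased URLs stand in for the hashes stored in `article_history`.

```rust
use std::collections::HashSet;

/// Hypothetical helper: true when fewer articles survived filtering than
/// `categories * max_items_per_category`, the target used in Step 3.
fn is_under_filled(have: usize, categories: usize, max_items_per_category: usize) -> bool {
    have < categories * max_items_per_category
}

/// Hypothetical helper: append retry candidates that were neither fetched
/// earlier (case-insensitive comparison) nor recorded in history.
/// Returns how many URLs were added.
fn merge_retry(
    valid: &mut Vec<String>,
    already_fetched: &HashSet<String>,
    history: &HashSet<String>, // assumption: lowercased URLs instead of hashes
    retry: Vec<String>,
) -> usize {
    // Start from the URLs fetched in the first pass, then track the retry batch too.
    let mut seen = already_fetched.clone();
    let mut added = 0;
    for url in retry {
        let key = url.to_lowercase();
        if history.contains(&key) {
            continue; // used in a previous synthesis
        }
        // `insert` returns false when the key was already present.
        if seen.insert(key) {
            valid.push(url);
            added += 1;
        }
    }
    added
}
```

Unlike the snippet in Step 3, this sketch also drops duplicates inside the retry batch itself, which may matter when two sources link to the same article.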

- [ ] **Step 4: Add history filtering in Phase 2 (before scraping)**

In Phase 2, after `dedup_by_url` and `limit_articles_per_source` (around line 552), before `scrape_articles`, add:

```rust
// Filter against article history BEFORE scraping (saves HTTP requests)
let parsed = if settings.article_history_days > 0 {
    let all_urls: Vec<String> = parsed
        .iter()
        .flat_map(|(_, items)| items.iter().map(|i| i.url.clone()))
        .collect();
    let hashes: Vec<String> = all_urls.iter().map(|u| hash_article_url(u)).collect();
    // On a lookup error, fall back to an empty set (nothing is filtered).
    let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes)
        .await
        .unwrap_or_default();
    if !existing.is_empty() {
        tracing::info!(filtered = existing.len(), "Phase 2: filtered articles already in history");
    }
    parsed
        .into_iter()
        .map(|(cat_key, items)| {
            let filtered = items
                .into_iter()
                .filter(|item| !existing.contains(&hash_article_url(&item.url)))
                .collect();
            (cat_key, filtered)
        })
        .collect()
} else {
    parsed
};
```

- [ ] **Step 5: Insert article URLs after saving synthesis**

After `db::syntheses::create` (around line 638), add:

```rust
// Record article URLs in history for cross-synthesis dedup
if settings.article_history_days > 0 {
    let article_urls: Vec<(String, String)> = final_sections
        .iter()
        .flat_map(|section| section.items.iter())
        .map(|item| (item.url.clone(), hash_article_url(&item.url)))
        .collect();
    db::article_history::insert_urls(&state.pool, user_id, &article_urls)
        .await
        .ok(); // Don't fail synthesis if history insert fails
}
```

- [ ] **Step 6: Run tests + commit**

```bash
# Run in a subshell so the git paths below stay relative to the repo root
(cd backend && cargo test --lib && cargo build)
git add backend/src/services/synthesis.rs
git commit -m "feat: article history filtering in pipeline — cleanup, Phase 1/2 filter, retry, insert after save"
```

---

### Task 5: Frontend setting

**Files:**
- Modify: `frontend/src/types.ts`
- Modify: `frontend/src/i18n/fr.ts`
- Modify: `frontend/src/pages/Settings.tsx`

- [ ] **Step 1: Add field to types + DEFAULT_SETTINGS**

```typescript
// In UserSettings:
article_history_days: number;

// In DEFAULT_SETTINGS:
article_history_days: 90,
```

- [ ] **Step 2: Add i18n label**

```typescript
'settings.articleHistoryDays': 'Historique des articles (jours)',
```

- [ ] **Step 3: Add number input to Settings page**

Add inside the generation settings grid (alongside the other number inputs):

```tsx
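      setSettings((prev) => {
        // Keep 0 (history disabled); fall back to 90 only when the input is not a number.
        const days = parseInt(e.currentTarget.value, 10);
        return {
          ...prev,
          article_history_days: Number.isNaN(days) ? 90 : days,
        };
      })
    }
  />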
```

- [ ] **Step 4: Run frontend tests + commit**

```bash
# Run in a subshell so the git paths below stay relative to the repo root
(cd frontend && npx tsc --noEmit && npx vitest run)
git add frontend/src/types.ts frontend/src/i18n/fr.ts frontend/src/pages/Settings.tsx
git commit -m "feat: add article_history_days setting to frontend"
```

---

### Task 6: Update E2E and integration tests

**Files:**
- Modify: `e2e/tests/generation-live.spec.ts`
- Modify: `backend/tests/api_syntheses_test.rs`

- [ ] **Step 1: Update E2E settings payload**

Add `article_history_days: 90` to the PUT settings body.

- [ ] **Step 2: Update integration test settings payload**

In `api_syntheses_test.rs`, add `"article_history_days": 90` to the PUT settings body.

- [ ] **Step 3: Run E2E test to verify**

```bash
cd e2e && docker compose -f docker-compose.test.yml down
docker compose -f docker-compose.test.yml up --build -d
sleep 25 && npx tsx seed.ts && npx playwright test generation-live --reporter=list
```

- [ ] **Step 4: Commit**

```bash
git add e2e/tests/generation-live.spec.ts backend/tests/api_syntheses_test.rs
git commit -m "test: update E2E and integration tests with article_history_days setting"
```
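
When reviewing the E2E results, it may help to have a database-free picture of what two consecutive generations should do with the history. The sketch below is an assumption-laden in-memory model, not part of the plan: `HistoryModel` and its methods are hypothetical names, entries store an age in days instead of `created_at`, and a single user is assumed.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the `article_history` table:
/// maps a url_hash to its age in days for one user.
struct HistoryModel {
    entries: HashMap<String, u32>,
}

impl HistoryModel {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Mirrors `cleanup_old`: remove entries strictly older than `days`
    /// and return how many were removed.
    fn cleanup_old(&mut self, days: u32) -> usize {
        let before = self.entries.len();
        self.entries.retain(|_, age| *age <= days);
        before - self.entries.len()
    }

    /// Mirrors `check_urls_exist` plus the filter step:
    /// keep only candidates that are not yet in history.
    fn filter_unseen(&self, candidates: Vec<String>) -> Vec<String> {
        candidates
            .into_iter()
            .filter(|hash| !self.entries.contains_key(hash))
            .collect()
    }

    /// Mirrors `insert_urls` with ON CONFLICT DO NOTHING:
    /// an existing entry keeps its original age.
    fn insert(&mut self, hashes: &[String]) {
        for hash in hashes {
            self.entries.entry(hash.clone()).or_insert(0);
        }
    }

    /// Test-only helper: simulate the passage of `days` days.
    fn age_by(&mut self, days: u32) {
        for age in self.entries.values_mut() {
            *age += days;
        }
    }
}
```

Under this model, an article used in one generation is filtered out of the next, and becomes eligible again only once it is older than `article_history_days` and a cleanup has run.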