# Article History — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Prevent duplicate articles across syntheses by maintaining a persistent per-user article URL history with configurable TTL.

**Architecture:** A new `article_history` table stores SHA-256 hashed URLs. The pipeline filters candidates against this history before classification, inserts URLs after the synthesis is saved, and cleans up expired entries before each generation.

**Tech Stack:** Rust (sqlx, sha2, url crate), PostgreSQL

**Spec:** `docs/superpowers/specs/2026-03-24-article-history-design.md`

---
### Task 1: Migration + settings field

**Files:**
- Create: `backend/migrations/20260324000015_add_article_history.sql`
- Modify: `backend/src/models/settings.rs`
- Modify: `backend/src/db/settings.rs`
- Modify: `backend/src/services/prompts.rs` (test fixture)
- Modify: `CLAUDE.md`

- [ ] **Step 1: Create migration**
```sql
-- Article history table for cross-synthesis URL deduplication
CREATE TABLE article_history (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    url_hash TEXT NOT NULL,
    url TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE UNIQUE INDEX idx_article_history_user_url ON article_history(user_id, url_hash);

-- Setting for history TTL
ALTER TABLE settings ADD COLUMN article_history_days INTEGER NOT NULL DEFAULT 90;
```
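- [ ] **Step 2: Add `article_history_days` to settings model**

Add `pub article_history_days: i32` to `UserSettings`, `SettingsResponse`, and `UpdateSettingsRequest`. Also add it to the `From` impl and to `Default` (90). A value of 0 disables history, because the pipeline only touches the table when `article_history_days > 0`. Add this validation: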
```rust
if !(0..=365).contains(&self.article_history_days) {
    return Err("article_history_days must be between 0 and 365".into());
}
```
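
To make the bounds and the "0 disables history" convention concrete, here is a minimal standalone sketch. `HistorySettingsSketch` and `history_enabled` are hypothetical names used only for illustration; the real check belongs in the existing validation of `UpdateSettingsRequest`.

```rust
/// Hypothetical stand-in for the request type; only the new field is shown.
struct HistorySettingsSketch {
    article_history_days: i32,
}

impl HistorySettingsSketch {
    /// Accepts 0..=365. Zero is valid and means "history disabled".
    fn validate(&self) -> Result<(), String> {
        if !(0..=365).contains(&self.article_history_days) {
            return Err("article_history_days must be between 0 and 365".into());
        }
        Ok(())
    }

    /// Mirrors the `article_history_days > 0` guard used in the pipeline steps.
    fn history_enabled(&self) -> bool {
        self.article_history_days > 0
    }
}
```

Because 0 passes validation but fails the `> 0` guard, a user can switch the feature off without a separate boolean setting.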
- [ ] **Step 3: Add to DB queries**

Add the field to `SettingsRow`, its `TryFrom` impl, and both SQL queries in `db/settings.rs`.

- [ ] **Step 4: Update test fixtures**

Add `article_history_days: 90` to `valid_request()` in the settings tests and to `test_settings()` in the prompts tests.

- [ ] **Step 5: Update CLAUDE.md migration count to 15**

- [ ] **Step 6: Run tests + commit**
```bash
cd backend && cargo test --lib
git add backend/migrations/20260324000015_add_article_history.sql backend/src/models/settings.rs backend/src/db/settings.rs backend/src/services/prompts.rs CLAUDE.md
git commit -m "feat: add article_history table and article_history_days setting"
```
---

### Task 2: DB module for article history

**Files:**
- Create: `backend/src/db/article_history.rs`
- Modify: `backend/src/db/mod.rs`

- [ ] **Step 1: Create `backend/src/db/article_history.rs`**
```rust
//! Article history: tracks which article URLs have been used in past syntheses.
//!
//! Prevents the same article from appearing in multiple syntheses.

use std::collections::HashSet;

use sqlx::PgPool;
use uuid::Uuid;

use crate::errors::AppError;

/// Check which URL hashes already exist in history for this user.
///
/// Returns the set of url_hashes that were found (i.e., already used).
pub async fn check_urls_exist(
    pool: &PgPool,
    user_id: Uuid,
    url_hashes: &[String],
) -> Result<HashSet<String>, AppError> {
    if url_hashes.is_empty() {
        return Ok(HashSet::new());
    }

    let rows = sqlx::query_scalar::<_, String>(
        "SELECT url_hash FROM article_history WHERE user_id = $1 AND url_hash = ANY($2)",
    )
    .bind(user_id)
    .bind(url_hashes)
    .fetch_all(pool)
    .await?;

    Ok(rows.into_iter().collect())
}

/// Insert article URLs into history, one statement per URL.
///
/// Uses ON CONFLICT DO NOTHING to silently skip duplicates, so an existing
/// row keeps its original `created_at`.
pub async fn insert_urls(
    pool: &PgPool,
    user_id: Uuid,
    urls: &[(String, String)], // slice of (url, url_hash) pairs
) -> Result<(), AppError> {
    if urls.is_empty() {
        return Ok(());
    }

    for (url, url_hash) in urls {
        sqlx::query(
            "INSERT INTO article_history (user_id, url_hash, url) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING",
        )
        .bind(user_id)
        .bind(url_hash)
        .bind(url)
        .execute(pool)
        .await?;
    }

    Ok(())
}

/// Delete history entries older than N days for this user.
///
/// Returns the number of deleted rows.
pub async fn cleanup_old(
    pool: &PgPool,
    user_id: Uuid,
    days: i32,
) -> Result<u64, AppError> {
    let result = sqlx::query(
        "DELETE FROM article_history WHERE user_id = $1 AND created_at < now() - make_interval(days => $2)",
    )
    .bind(user_id)
    .bind(days)
    .execute(pool)
    .await?;

    Ok(result.rows_affected())
}
```
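
The semantics of the three functions can be modelled without a database, which may help when reasoning about TTL behaviour. The sketch below is an assumption-laden test double, not part of the plan: `InMemoryHistory` is a hypothetical type, user ids are plain `u32`, and time is counted in whole days passed in by the caller.

```rust
use std::collections::{HashMap, HashSet};

/// In-memory stand-in for the `article_history` table (hypothetical test double).
/// Keys are (user_id, url_hash); values are the creation day.
#[derive(Default)]
struct InMemoryHistory {
    rows: HashMap<(u32, String), i64>,
}

impl InMemoryHistory {
    /// Like `check_urls_exist`: returns the subset of hashes already recorded for the user.
    fn check_urls_exist(&self, user_id: u32, url_hashes: &[String]) -> HashSet<String> {
        url_hashes
            .iter()
            .filter(|h| self.rows.contains_key(&(user_id, (*h).clone())))
            .cloned()
            .collect()
    }

    /// Like `insert_urls` with ON CONFLICT DO NOTHING: an existing row keeps its original day.
    fn insert_urls(&mut self, user_id: u32, url_hashes: &[String], today: i64) {
        for h in url_hashes {
            self.rows.entry((user_id, h.clone())).or_insert(today);
        }
    }

    /// Like `cleanup_old`: deletes this user's rows created more than `days` days ago
    /// and returns how many were removed.
    fn cleanup_old(&mut self, user_id: u32, days: i64, today: i64) -> u64 {
        let before = self.rows.len();
        self.rows
            .retain(|(uid, _), created| *uid != user_id || *created >= today - days);
        (before - self.rows.len()) as u64
    }
}
```

One consequence visible in this model: re-inserting a URL that is already present does not refresh its age, so an article becomes eligible again `days` days after it was first recorded, not after it was last seen.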
- [ ] **Step 2: Register module in `db/mod.rs`**

Add `pub mod article_history;` (alphabetical order, after `api_keys`).

- [ ] **Step 3: Run tests + commit**
```bash
cd backend && cargo test --lib && cargo build
git add backend/src/db/article_history.rs backend/src/db/mod.rs
git commit -m "feat: add article_history DB module (check, insert, cleanup)"
```
---

### Task 3: URL normalization utility + unit tests

**Files:**
- Modify: `backend/src/services/synthesis.rs`

- [ ] **Step 1: Add `normalize_article_url` function**

Add near the other URL helper functions (near `extract_domain`):
```rust
/// Normalize an article URL for consistent history hashing.
///
/// Strips fragments, trailing slashes, and known tracking query parameters
/// so that the same article with different UTM tags is recognized as a duplicate.
/// The whole result is lowercased, so URLs that differ only by case hash identically.
fn normalize_article_url(url_str: &str) -> String {
    let Ok(mut parsed) = url::Url::parse(url_str) else {
        return url_str.to_lowercase();
    };

    // Strip fragment
    parsed.set_fragment(None);

    // Strip known tracking query parameters
    let tracking_params: &[&str] = &[
        "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
        "ref", "source", "fbclid", "gclid",
    ];
    let filtered_pairs: Vec<(String, String)> = parsed
        .query_pairs()
        .filter(|(key, _)| !tracking_params.contains(&key.as_ref()))
        .map(|(k, v)| (k.into_owned(), v.into_owned()))
        .collect();

    if filtered_pairs.is_empty() {
        parsed.set_query(None);
    } else {
        // `query_pairs()` yields decoded values, so re-serialize through the
        // form-urlencoded serializer instead of joining raw strings with `&`.
        parsed.query_pairs_mut().clear().extend_pairs(&filtered_pairs);
    }

    // Strip trailing slash (unless path is just "/")
    let path = parsed.path().to_string();
    if path.len() > 1 && path.ends_with('/') {
        parsed.set_path(&path[..path.len() - 1]);
    }

    parsed.to_string().to_lowercase()
}

/// Compute the hash of a normalized article URL for history lookup.
fn hash_article_url(url: &str) -> String {
    let normalized = normalize_article_url(url);
    crate::util::token::hash_token(&normalized)
}
```
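
The tracking-parameter filter is the part of normalization most likely to need tuning, so it can be useful to look at it in isolation. The following is a simplified, standard-library-only sketch: `strip_tracking_params` is a hypothetical helper that works on the raw query string (the part after `?`) and, unlike the `url` crate, does no percent-decoding of keys.

```rust
/// Tracking keys copied from the list in `normalize_article_url`.
const TRACKING_PARAMS: &[&str] = &[
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "ref", "source", "fbclid", "gclid",
];

/// Drops tracking pairs from a raw query string, keeping the order of the rest.
/// Returns `None` when nothing is left, which corresponds to `set_query(None)`.
/// Simplification: keys are compared as written, with no percent-decoding.
fn strip_tracking_params(query: &str) -> Option<String> {
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| !pair.is_empty())
        .filter(|pair| {
            let key = pair.split('=').next().unwrap_or("");
            !TRACKING_PARAMS.contains(&key)
        })
        .collect();

    if kept.is_empty() {
        None
    } else {
        Some(kept.join("&"))
    }
}
```

Note that `ref` and `source` are generic names; a site that uses them as meaningful parameters would have distinct pages collapsed into one history entry.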
- [ ] **Step 2: Add unit tests**
```rust
// ── normalize_article_url tests ─────────────────────────────

#[test]
fn normalize_strips_fragment() {
    assert_eq!(
        normalize_article_url("https://example.com/article#section"),
        "https://example.com/article"
    );
}

#[test]
fn normalize_strips_utm_params() {
    assert_eq!(
        normalize_article_url("https://example.com/article?utm_source=twitter&utm_medium=social"),
        "https://example.com/article"
    );
}

#[test]
fn normalize_keeps_non_tracking_params() {
    let result = normalize_article_url("https://example.com/search?q=test&utm_source=twitter");
    assert!(result.contains("q=test"));
    assert!(!result.contains("utm_source"));
}

#[test]
fn normalize_strips_trailing_slash() {
    assert_eq!(
        normalize_article_url("https://example.com/article/"),
        "https://example.com/article"
    );
}

#[test]
fn normalize_keeps_root_slash() {
    assert_eq!(
        normalize_article_url("https://example.com/"),
        "https://example.com/"
    );
}

#[test]
fn normalize_lowercases() {
    assert_eq!(
        normalize_article_url("https://Example.COM/Article"),
        "https://example.com/article" // entire URL lowercased for consistent hashing
    );
}

#[test]
fn normalize_handles_invalid_url() {
    let result = normalize_article_url("not a url at all");
    assert_eq!(result, "not a url at all");
}

#[test]
fn normalize_strips_fbclid() {
    let result = normalize_article_url("https://example.com/post?fbclid=abc123");
    assert!(!result.contains("fbclid"));
    assert!(!result.contains("?"));
}

#[test]
fn hash_article_url_deterministic() {
    let h1 = hash_article_url("https://example.com/article?utm_source=twitter");
    let h2 = hash_article_url("https://example.com/article?utm_source=newsletter");
    assert_eq!(h1, h2, "Same article with different UTM params should hash the same");
}
```
- [ ] **Step 3: Run tests + commit**
```bash
cd backend && cargo test --lib
git add backend/src/services/synthesis.rs
git commit -m "feat: add normalize_article_url and hash_article_url utilities"
```
---

### Task 4: Pipeline integration — history filtering, insert, cleanup

**Files:**
- Modify: `backend/src/services/synthesis.rs`

This is the core integration task. Changes are in `run_generation_inner`.

- [ ] **Step 1: Add cleanup at the start of generation**

After loading settings (around line 259), add:
```rust
// Cleanup old article history entries
if settings.article_history_days > 0 {
    let deleted = db::article_history::cleanup_old(
        &state.pool,
        user_id,
        settings.article_history_days,
    )
    .await
    .unwrap_or(0);
    if deleted > 0 {
        tracing::info!(deleted = deleted, "Cleaned up old article history entries");
    }
}
```
- [ ] **Step 2: Add history filtering in Phase 1**

After filtering empty content (around line 376, after `let valid_articles = ...`), add history filtering:
```rust
// 1d. Filter against article history (cross-synthesis dedup)
let valid_articles = if settings.article_history_days > 0 {
    let hashes: Vec<String> = valid_articles.iter().map(|a| hash_article_url(&a.url)).collect();
    let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes)
        .await
        .unwrap_or_default();
    if !existing.is_empty() {
        tracing::info!(filtered = existing.len(), "Phase 1: filtered articles already in history");
    }
    valid_articles
        .into_iter()
        .filter(|a| !existing.contains(&hash_article_url(&a.url)))
        .collect::<Vec<_>>()
} else {
    valid_articles
};
```
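
The filter above hashes every URL twice: once to build the lookup list and once inside `filter`. A possible variant, sketched below under stated assumptions, keeps each hash next to its article and also reports the number of dropped articles rather than the number of matching hashes. `Article`, `demo_hash`, and `filter_seen` are hypothetical; `demo_hash` stands in for `hash_article_url` and is not SHA-256, and the `lookup` closure stands in for `check_urls_exist`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical article type; only the field used for filtering is shown.
#[derive(Debug, Clone, PartialEq)]
struct Article {
    url: String,
}

/// Stand-in for `hash_article_url`. NOT SHA-256, only for this sketch.
fn demo_hash(url: &str) -> String {
    let mut hasher = DefaultHasher::new();
    url.to_lowercase().hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Hashes each URL once, asks `lookup` which hashes are already known,
/// and returns the unseen articles plus the number of articles dropped.
fn filter_seen<F>(articles: Vec<Article>, lookup: F) -> (Vec<Article>, usize)
where
    F: Fn(&[String]) -> HashSet<String>,
{
    let hashes: Vec<String> = articles.iter().map(|a| demo_hash(&a.url)).collect();
    let existing = lookup(&hashes);
    let total = articles.len();

    let kept: Vec<Article> = articles
        .into_iter()
        .zip(hashes)
        .filter(|(_, hash)| !existing.contains(hash))
        .map(|(article, _)| article)
        .collect();

    let dropped = total - kept.len();
    (kept, dropped)
}
```

The two counts differ when several candidates normalize to the same URL: `existing.len()` counts one hash, while `dropped` counts every article removed.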
- [ ] **Step 3: Add Phase 1 retry logic when under-filled**

After the history filtering in Phase 1 (Step 2), check if we have enough articles. If under-filled, do one retry with the same sources, excluding already-fetched URLs:
```rust
// 1e. Retry if under-filled (1 attempt)
let target = settings.categories.len() * settings.max_items_per_category as usize;
if valid_articles.len() < target && settings.article_history_days > 0 {
    tracing::info!(
        have = valid_articles.len(),
        need = target,
        "Phase 1 under-filled after history filter, retrying with same sources"
    );

    // Collect all URLs already fetched (valid + filtered)
    let mut already_fetched: std::collections::HashSet<String> = candidate_urls
        .iter()
        .map(|u| u.to_lowercase())
        .collect();

    // Re-scrape source pages for new links
    let mut retry_urls: Vec<String> = Vec::new();
    for source in sources.iter().take(max_sources) {
        let links = if settings.use_llm_for_source_links {
            source_scraper::extract_article_links_with_llm(
                &state.http_client, &source.url, max_links_per_source,
                &provider, &model_research,
            ).await
        } else {
            source_scraper::extract_article_links(
                &state.http_client, &source.url, max_links_per_source,
            ).await
        };
        if let Ok(links) = links {
            for link in links {
                // `insert` returns false for a URL seen before, so a link
                // shared by two sources is queued only once.
                if already_fetched.insert(link.to_lowercase()) {
                    retry_urls.push(link);
                }
            }
        }
    }

    if !retry_urls.is_empty() {
        // Scrape retry candidates
        let retry_scraped = scrape_flat_urls(
            state, &retry_urls, settings.max_age_days as i64, tx,
            llm_for_scraping.clone(),
        ).await;
        let retry_valid: Vec<ScrapedNewsItem> = retry_scraped
            .into_iter()
            .filter(|a| !a.scraped_content.trim().is_empty())
            .collect();

        // Filter against history
        let retry_valid = if !retry_valid.is_empty() {
            let hashes: Vec<String> = retry_valid.iter().map(|a| hash_article_url(&a.url)).collect();
            let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes)
                .await.unwrap_or_default();
            retry_valid.into_iter()
                .filter(|a| !existing.contains(&hash_article_url(&a.url)))
                .collect::<Vec<_>>()
        } else {
            retry_valid
        };

        // Merge with existing valid articles
        valid_articles.extend(retry_valid);
        tracing::info!(total = valid_articles.len(), "Phase 1 after retry");
    }
}
```
Note: `valid_articles.extend(...)` needs a mutable binding, so change the shadowing `let valid_articles = if ...` from Step 2 to `let mut valid_articles = if ...` when adding this step.
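
Two small pieces of the retry logic can be checked in isolation: the fill target and the selection of links that are genuinely new. The sketch below uses hypothetical helper names (`fill_target`, `new_retry_links`) and assumes, as in the step above, that `already_fetched` holds lowercased URLs.

```rust
use std::collections::HashSet;

/// Number of articles Phase 1 aims for: a full set of items for every category.
fn fill_target(category_count: usize, max_items_per_category: usize) -> usize {
    category_count * max_items_per_category
}

/// Keeps links that were not fetched before and are not repeated across sources.
/// Comparison is case-insensitive, matching the lowercased `already_fetched` set.
fn new_retry_links(already_fetched: &mut HashSet<String>, links: Vec<String>) -> Vec<String> {
    links
        .into_iter()
        // `insert` returns false when the key is already present.
        .filter(|link| already_fetched.insert(link.to_lowercase()))
        .collect()
}
```

With 4 categories and 5 items per category the target is 20. If the re-scraped source pages return only links that are already in `already_fetched`, the result is empty and the retry is skipped.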
- [ ] **Step 4: Add history filtering in Phase 2 (before scraping)**

In Phase 2, after `dedup_by_url` and `limit_articles_per_source` (around line 552), before `scrape_articles`, add:
```rust
// Filter against article history BEFORE scraping (saves HTTP requests)
let parsed = if settings.article_history_days > 0 {
    let all_urls: Vec<String> = parsed.iter()
        .flat_map(|(_, items)| items.iter().map(|i| i.url.clone()))
        .collect();
    let hashes: Vec<String> = all_urls.iter().map(|u| hash_article_url(u)).collect();
    let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes)
        .await
        .unwrap_or_default();
    if !existing.is_empty() {
        tracing::info!(filtered = existing.len(), "Phase 2: filtered articles already in history");
    }
    parsed
        .into_iter()
        .map(|(cat_key, items)| {
            let filtered = items
                .into_iter()
                .filter(|item| !existing.contains(&hash_article_url(&item.url)))
                .collect();
            (cat_key, filtered)
        })
        .collect()
} else {
    parsed
};
```
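
If `parsed` is a mutable vector of `(category, items)` pairs, the same filtering can be done in place with `Vec::retain` instead of rebuilding the structure. This is only a sketch: `Item` and `retain_unseen` are hypothetical, the real item type and the shape of `parsed` may differ, and the `hash` closure stands in for `hash_article_url`.

```rust
use std::collections::HashSet;

/// Hypothetical parsed item; only the URL matters here.
#[derive(Debug, PartialEq)]
struct Item {
    url: String,
}

/// Removes already-seen items from every category in place and returns how many
/// were removed. Categories that end up empty are kept, not dropped.
fn retain_unseen<H>(
    parsed: &mut Vec<(String, Vec<Item>)>,
    existing: &HashSet<String>,
    hash: H,
) -> usize
where
    H: Fn(&str) -> String,
{
    let mut removed = 0;
    for (_category, items) in parsed.iter_mut() {
        let before = items.len();
        items.retain(|item| !existing.contains(&hash(item.url.as_str())));
        removed += before - items.len();
    }
    removed
}
```

Either form leaves empty categories in place, so the scraping step that follows should tolerate a category with no items.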
- [ ] **Step 5: Insert article URLs after saving synthesis**

After `db::syntheses::create` (around line 638), add:
```rust
// Record article URLs in history for cross-synthesis dedup
if settings.article_history_days > 0 {
    let article_urls: Vec<(String, String)> = final_sections
        .iter()
        .flat_map(|section| section.items.iter())
        .map(|item| (item.url.clone(), hash_article_url(&item.url)))
        .collect();
    db::article_history::insert_urls(&state.pool, user_id, &article_urls)
        .await
        .ok(); // Don't fail synthesis if history insert fails
}
```
- [ ] **Step 6: Run tests + commit**
```bash
cd backend && cargo test --lib && cargo build
git add backend/src/services/synthesis.rs
git commit -m "feat: article history filtering in pipeline — cleanup, Phase 1/2 filter, retry, insert after save"
```
---

### Task 5: Frontend setting

**Files:**
- Modify: `frontend/src/types.ts`
- Modify: `frontend/src/i18n/fr.ts`
- Modify: `frontend/src/pages/Settings.tsx`

- [ ] **Step 1: Add field to types + DEFAULT_SETTINGS**
```typescript
// In UserSettings:
article_history_days: number;
// In DEFAULT_SETTINGS:
article_history_days: 90,
```
- [ ] **Step 2: Add i18n label**
```typescript
'settings.articleHistoryDays': 'Historique des articles (jours)',
```
- [ ] **Step 3: Add number input to Settings page**

Add inside the generation settings grid (alongside the other number inputs):
```tsx
<div>
  <label
    for="articleHistoryDays"
    class="block text-sm font-medium text-gray-700"
  >
    {t('settings.articleHistoryDays')}
  </label>
  <div class="mt-1">
    <input
      type="number"
      id="articleHistoryDays"
      min="0"
      max="365"
      class="shadow-sm focus:ring-indigo-500 focus:border-indigo-500 block w-full sm:text-sm border-gray-300 rounded-md py-2 px-3 border"
      value={settings().article_history_days}
      onInput={(e) => {
        // 0 is a valid value (history disabled), so fall back to 90 only
        // when the field does not parse; `|| 90` would turn 0 into 90.
        const days = parseInt(e.currentTarget.value, 10);
        setSettings((prev) => ({
          ...prev,
          article_history_days: Number.isNaN(days) ? 90 : days,
        }));
      }}
    />
  </div>
</div>
```
- [ ] **Step 4: Run frontend tests + commit**
```bash
cd frontend && npx tsc --noEmit && npx vitest run
git add frontend/src/types.ts frontend/src/i18n/fr.ts frontend/src/pages/Settings.tsx
git commit -m "feat: add article_history_days setting to frontend"
```
---

### Task 6: Update E2E and integration tests

**Files:**
- Modify: `e2e/tests/generation-live.spec.ts`
- Modify: `backend/tests/api_syntheses_test.rs`

- [ ] **Step 1: Update E2E settings payload**

Add `article_history_days: 90` to the PUT settings body.

- [ ] **Step 2: Update integration test settings payload**

In `api_syntheses_test.rs`, add `"article_history_days": 90` to the PUT settings body.

- [ ] **Step 3: Run E2E test to verify**
```bash
cd e2e && docker compose -f docker-compose.test.yml down
docker compose -f docker-compose.test.yml up --build -d
sleep 25 && npx tsx seed.ts && npx playwright test generation-live --reporter=list
```
- [ ] **Step 4: Commit**
```bash
git add e2e/tests/generation-live.spec.ts backend/tests/api_syntheses_test.rs
git commit -m "test: update E2E and integration tests with article_history_days setting"
```