docs: add article history implementation plan with retry logic

Commit d7c91c956f (parent 633a51dc8c), new file, 590 lines.

# Article History — Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Prevent duplicate articles across syntheses by maintaining a persistent per-user article URL history with configurable TTL.

**Architecture:** New `article_history` table with SHA-256 hashed URLs. The pipeline filters candidates against history before classification. URLs are inserted after the synthesis is saved. Old entries are cleaned up before each generation.

**Tech Stack:** Rust (sqlx, sha2, url crate), PostgreSQL

**Spec:** `docs/superpowers/specs/2026-03-24-article-history-design.md`
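The flow described under Architecture can be sketched with an in-memory set standing in for the `article_history` table. This is only an illustration of the ordering (filter first, record after saving), not the plan's code: `History`, `filter_unseen`, and `record` are hypothetical names, and the plain string key stands in for the hashed, normalized URL.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory stand-in for the per-user `article_history` table.
#[derive(Default)]
struct History {
    seen: HashSet<String>,
}

impl History {
    /// Keep only the candidates whose key is not already in history.
    fn filter_unseen(&self, candidates: Vec<String>) -> Vec<String> {
        candidates
            .into_iter()
            .filter(|key| !self.seen.contains(key))
            .collect()
    }

    /// Record the keys that ended up in a saved synthesis.
    fn record(&mut self, used: &[String]) {
        self.seen.extend(used.iter().cloned());
    }
}
```

Recording only after the synthesis is saved means a failed generation does not consume its candidates; they remain eligible for the next run.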

---

### Task 1: Migration + settings field

**Files:**
- Create: `backend/migrations/20260324000015_add_article_history.sql`
- Modify: `backend/src/models/settings.rs`
- Modify: `backend/src/db/settings.rs`
- Modify: `backend/src/services/prompts.rs` (test fixture)
- Modify: `CLAUDE.md`

- [ ] **Step 1: Create migration**

```sql
-- Article history table for cross-synthesis URL deduplication
CREATE TABLE article_history (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    url_hash TEXT NOT NULL,
    url TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX idx_article_history_user_url ON article_history(user_id, url_hash);

-- Setting for history TTL
ALTER TABLE settings ADD COLUMN article_history_days INTEGER NOT NULL DEFAULT 90;
```

- [ ] **Step 2: Add `article_history_days` to settings model**

Add `pub article_history_days: i32` to `UserSettings`, `SettingsResponse`, and `UpdateSettingsRequest`. Also add it to the `From` impl, to `Default` (value 90), and to validation:
```rust
if !(0..=365).contains(&self.article_history_days) {
    return Err("article_history_days must be between 0 and 365".into());
}
```
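As a self-contained illustration of the range check, the sketch below applies the same bounds to a bare integer. `validate_history_days` is a hypothetical helper, not part of the settings model; it assumes, as the pipeline code in Task 4 does, that `0` is a legal value meaning history filtering is switched off.

```rust
/// Hypothetical stand-alone version of the bounds check above.
/// 0 is accepted (Task 4 skips all history work when the value is 0);
/// 365 is the upper limit.
fn validate_history_days(days: i32) -> Result<(), String> {
    if !(0..=365).contains(&days) {
        return Err("article_history_days must be between 0 and 365".into());
    }
    Ok(())
}
```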

- [ ] **Step 3: Add to DB queries**

Add the field to `SettingsRow`, to the `TryFrom` impl, and to both SQL queries in `db/settings.rs`.

- [ ] **Step 4: Update test fixtures**

Add `article_history_days: 90` to `valid_request()` in settings tests and `test_settings()` in prompts tests.

- [ ] **Step 5: Update CLAUDE.md migration count to 15**

- [ ] **Step 6: Run tests + commit**

```bash
# Run the tests in a subshell so the repo-root paths below still resolve
(cd backend && cargo test --lib)
git add backend/migrations/20260324000015_add_article_history.sql backend/src/models/settings.rs backend/src/db/settings.rs backend/src/services/prompts.rs CLAUDE.md
git commit -m "feat: add article_history table and article_history_days setting"
```

---

### Task 2: DB module for article history

**Files:**
- Create: `backend/src/db/article_history.rs`
- Modify: `backend/src/db/mod.rs`

- [ ] **Step 1: Create `backend/src/db/article_history.rs`**

```rust
//! Article history: tracks which article URLs have been used in past syntheses.
//!
//! Prevents the same article from appearing in multiple syntheses.

use std::collections::HashSet;
use sqlx::PgPool;
use uuid::Uuid;
use crate::errors::AppError;

/// Check which URL hashes already exist in history for this user.
///
/// Returns the set of url_hashes that were found (i.e., already used).
pub async fn check_urls_exist(
    pool: &PgPool,
    user_id: Uuid,
    url_hashes: &[String],
) -> Result<HashSet<String>, AppError> {
    if url_hashes.is_empty() {
        return Ok(HashSet::new());
    }

    let rows = sqlx::query_scalar::<_, String>(
        "SELECT url_hash FROM article_history WHERE user_id = $1 AND url_hash = ANY($2)",
    )
    .bind(user_id)
    .bind(url_hashes)
    .fetch_all(pool)
    .await?;

    Ok(rows.into_iter().collect())
}

/// Insert a batch of article URLs into history (one INSERT per URL).
///
/// Uses ON CONFLICT DO NOTHING to silently skip duplicates.
pub async fn insert_urls(
    pool: &PgPool,
    user_id: Uuid,
    urls: &[(String, String)], // each entry is (url, url_hash)
) -> Result<(), AppError> {
    if urls.is_empty() {
        return Ok(());
    }

    for (url, url_hash) in urls {
        sqlx::query(
            "INSERT INTO article_history (user_id, url_hash, url) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING",
        )
        .bind(user_id)
        .bind(url_hash)
        .bind(url)
        .execute(pool)
        .await?;
    }

    Ok(())
}

/// Delete history entries older than N days for this user.
///
/// Returns the number of deleted rows.
pub async fn cleanup_old(
    pool: &PgPool,
    user_id: Uuid,
    days: i32,
) -> Result<u64, AppError> {
    let result = sqlx::query(
        "DELETE FROM article_history WHERE user_id = $1 AND created_at < now() - make_interval(days => $2)",
    )
    .bind(user_id)
    .bind(days)
    .execute(pool)
    .await?;

    Ok(result.rows_affected())
}
```
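The semantics of `insert_urls` and `cleanup_old` can be checked without a database using a small in-memory model. The sketch below is an assumption-laden stand-in, not the module itself: `HistoryRows` is a hypothetical type, a `HashMap` keyed by `url_hash` plays the role of the unique index for a single user, and `SystemTime` replaces `TIMESTAMPTZ`.

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

/// Hypothetical in-memory model of `article_history` rows for one user:
/// url_hash -> created_at.
struct HistoryRows {
    rows: HashMap<String, SystemTime>,
}

impl HistoryRows {
    /// Mirrors `ON CONFLICT DO NOTHING`: an existing row keeps its original timestamp.
    fn insert(&mut self, url_hash: &str, created_at: SystemTime) {
        self.rows.entry(url_hash.to_string()).or_insert(created_at);
    }

    /// Mirrors `cleanup_old`: drop rows created before `now - days`
    /// and return how many were removed.
    fn cleanup_old(&mut self, now: SystemTime, days: u64) -> usize {
        let cutoff = now - Duration::from_secs(days * 24 * 60 * 60);
        let before = self.rows.len();
        self.rows.retain(|_, created_at| *created_at >= cutoff);
        before - self.rows.len()
    }
}
```

One consequence worth noting: because a conflicting insert leaves the row untouched, an entry's age always counts from the first time the article was recorded.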

- [ ] **Step 2: Register module in `db/mod.rs`**

Add `pub mod article_history;`, keeping alphabetical order (after `api_keys`).

- [ ] **Step 3: Run tests + commit**

```bash
# Subshell keeps the working directory at the repo root for git
(cd backend && cargo test --lib && cargo build)
git add backend/src/db/article_history.rs backend/src/db/mod.rs
git commit -m "feat: add article_history DB module (check, insert, cleanup)"
```

---

### Task 3: URL normalization utility + unit tests

**Files:**
- Modify: `backend/src/services/synthesis.rs`

- [ ] **Step 1: Add `normalize_article_url` function**

Add near the other URL helper functions (near `extract_domain`):

```rust
/// Normalize an article URL for consistent history hashing.
///
/// Strips fragments, trailing slashes, and known tracking query parameters
/// so that the same article with different UTM tags is recognized as a duplicate.
fn normalize_article_url(url_str: &str) -> String {
    let Ok(mut parsed) = url::Url::parse(url_str) else {
        // Not parseable as an absolute URL: fall back to a lowercased copy
        return url_str.to_lowercase();
    };

    // Strip fragment
    parsed.set_fragment(None);

    // Strip known tracking query parameters
    let tracking_params: &[&str] = &[
        "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
        "ref", "source", "fbclid", "gclid",
    ];

    let filtered_pairs: Vec<(String, String)> = parsed
        .query_pairs()
        .filter(|(key, _)| !tracking_params.contains(&key.as_ref()))
        .map(|(k, v)| (k.into_owned(), v.into_owned()))
        .collect();

    if filtered_pairs.is_empty() {
        parsed.set_query(None);
    } else {
        // Re-serialize through the url crate so that a decoded `&` or `=`
        // inside a key or value is percent-encoded again.
        parsed
            .query_pairs_mut()
            .clear()
            .extend_pairs(&filtered_pairs);
    }

    // Strip trailing slash (unless path is just "/")
    let path = parsed.path().to_string();
    if path.len() > 1 && path.ends_with('/') {
        parsed.set_path(&path[..path.len() - 1]);
    }

    parsed.to_string().to_lowercase()
}

/// Compute the hash of a normalized article URL for history lookup.
fn hash_article_url(url: &str) -> String {
    let normalized = normalize_article_url(url);
    crate::util::token::hash_token(&normalized)
}
```
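To make the normalization rules concrete, here is a dependency-free approximation with a few worked inputs. It is not a substitute for the `url`-crate implementation: `normalize_sketch` is a hypothetical name, it works by plain string splitting, and it performs no parsing or percent-decoding.

```rust
/// Dependency-free approximation of the normalization rules, for illustration only.
fn normalize_sketch(url: &str) -> String {
    const TRACKING: &[&str] = &[
        "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
        "ref", "source", "fbclid", "gclid",
    ];

    // Drop the fragment, then split off the query string.
    let without_fragment = url.split('#').next().unwrap_or(url);
    let (base, query) = match without_fragment.split_once('?') {
        Some((base, query)) => (base, query),
        None => (without_fragment, ""),
    };

    // Keep only pairs whose key is not a known tracking parameter.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| !pair.is_empty())
        .filter(|pair| {
            let key = pair.split('=').next().unwrap_or("");
            !TRACKING.contains(&key)
        })
        .collect();

    // Strip one trailing slash, but leave a bare "scheme://host/" alone.
    let authority_start = base.find("://").map_or(0, |i| i + 3);
    let path_len = base[authority_start..]
        .find('/')
        .map_or(0, |i| base.len() - (authority_start + i));
    let base = if path_len > 1 && base.ends_with('/') {
        &base[..base.len() - 1]
    } else {
        base
    };

    let mut out = base.to_string();
    if !kept.is_empty() {
        out.push('?');
        out.push_str(&kept.join("&"));
    }
    // Lowercase everything, path and query included.
    out.to_lowercase()
}
```

Lowercasing the whole URL means two paths that differ only by case share one history entry, a trade-off the plan accepts in exchange for consistent hashing.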

- [ ] **Step 2: Add unit tests**

```rust
// ── normalize_article_url tests ─────────────────────────────

#[test]
fn normalize_strips_fragment() {
    assert_eq!(
        normalize_article_url("https://example.com/article#section"),
        "https://example.com/article"
    );
}

#[test]
fn normalize_strips_utm_params() {
    assert_eq!(
        normalize_article_url("https://example.com/article?utm_source=twitter&utm_medium=social"),
        "https://example.com/article"
    );
}

#[test]
fn normalize_keeps_non_tracking_params() {
    let result = normalize_article_url("https://example.com/search?q=test&utm_source=twitter");
    assert!(result.contains("q=test"));
    assert!(!result.contains("utm_source"));
}

#[test]
fn normalize_strips_trailing_slash() {
    assert_eq!(
        normalize_article_url("https://example.com/article/"),
        "https://example.com/article"
    );
}

#[test]
fn normalize_keeps_root_slash() {
    assert_eq!(
        normalize_article_url("https://example.com/"),
        "https://example.com/"
    );
}

#[test]
fn normalize_lowercases() {
    assert_eq!(
        normalize_article_url("https://Example.COM/Article"),
        "https://example.com/article" // entire URL lowercased for consistent hashing
    );
}

#[test]
fn normalize_handles_invalid_url() {
    let result = normalize_article_url("not a url at all");
    assert_eq!(result, "not a url at all");
}

#[test]
fn normalize_strips_fbclid() {
    let result = normalize_article_url("https://example.com/post?fbclid=abc123");
    assert!(!result.contains("fbclid"));
    assert!(!result.contains("?"));
}

#[test]
fn hash_article_url_deterministic() {
    let h1 = hash_article_url("https://example.com/article?utm_source=twitter");
    let h2 = hash_article_url("https://example.com/article?utm_source=newsletter");
    assert_eq!(h1, h2, "Same article with different UTM params should hash the same");
}
```

- [ ] **Step 3: Run tests + commit**

```bash
# Subshell keeps the working directory at the repo root for git
(cd backend && cargo test --lib)
git add backend/src/services/synthesis.rs
git commit -m "feat: add normalize_article_url and hash_article_url utilities"
```

---

### Task 4: Pipeline integration (history filtering, insert, cleanup)

**Files:**
- Modify: `backend/src/services/synthesis.rs`

This is the core integration task. All changes are in `run_generation_inner`.

- [ ] **Step 1: Add cleanup at the start of generation**

After loading settings (around line 259), add:
```rust
// Cleanup old article history entries
if settings.article_history_days > 0 {
    let deleted = db::article_history::cleanup_old(
        &state.pool,
        user_id,
        settings.article_history_days,
    )
    .await
    .unwrap_or(0);
    if deleted > 0 {
        tracing::info!(deleted = deleted, "Cleaned up old article history entries");
    }
}
```

- [ ] **Step 2: Add history filtering in Phase 1**

After filtering empty content (around line 376, after `let valid_articles = ...`), add history filtering:

```rust
// 1d. Filter against article history (cross-synthesis dedup)
let valid_articles = if settings.article_history_days > 0 {
    let hashes: Vec<String> = valid_articles.iter().map(|a| hash_article_url(&a.url)).collect();
    let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes)
        .await
        .unwrap_or_default();
    if !existing.is_empty() {
        tracing::info!(filtered = existing.len(), "Phase 1: filtered articles already in history");
    }
    valid_articles
        .into_iter()
        .filter(|a| !existing.contains(&hash_article_url(&a.url)))
        .collect::<Vec<_>>()
} else {
    valid_articles
};
```

- [ ] **Step 3: Add Phase 1 retry logic when under-filled**

After the history filtering in Phase 1 (Step 2), check whether enough articles remain. If the pipeline is under-filled, retry once with the same sources, excluding URLs that were already fetched:

```rust
// 1e. Retry if under-filled (1 attempt)
let target = settings.categories.len() * settings.max_items_per_category as usize;
if valid_articles.len() < target && settings.article_history_days > 0 {
    tracing::info!(
        have = valid_articles.len(),
        need = target,
        "Phase 1 under-filled after history filter, retrying with same sources"
    );

    // Collect all URLs already fetched (valid + filtered)
    let mut already_fetched: std::collections::HashSet<String> = candidate_urls
        .iter()
        .map(|u| u.to_lowercase())
        .collect();

    // Re-scrape source pages for new links
    let mut retry_urls: Vec<String> = Vec::new();
    for source in sources.iter().take(max_sources) {
        let links = if settings.use_llm_for_source_links {
            source_scraper::extract_article_links_with_llm(
                &state.http_client, &source.url, max_links_per_source,
                &provider, &model_research,
            ).await
        } else {
            source_scraper::extract_article_links(
                &state.http_client, &source.url, max_links_per_source,
            ).await
        };
        if let Ok(links) = links {
            for link in links {
                // `insert` returns false for URLs seen before, so a link offered
                // by two sources is queued only once
                if already_fetched.insert(link.to_lowercase()) {
                    retry_urls.push(link);
                }
            }
        }
    }

    if !retry_urls.is_empty() {
        // Scrape retry candidates
        let retry_scraped = scrape_flat_urls(
            state, &retry_urls, settings.max_age_days as i64, tx,
            llm_for_scraping.clone(),
        ).await;
        let retry_valid: Vec<ScrapedNewsItem> = retry_scraped
            .into_iter()
            .filter(|a| !a.scraped_content.trim().is_empty())
            .collect();

        // Filter against history
        let retry_valid = if !retry_valid.is_empty() {
            let hashes: Vec<String> = retry_valid.iter().map(|a| hash_article_url(&a.url)).collect();
            let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes)
                .await.unwrap_or_default();
            retry_valid.into_iter()
                .filter(|a| !existing.contains(&hash_article_url(&a.url)))
                .collect::<Vec<_>>()
        } else {
            retry_valid
        };

        // Merge with existing valid articles
        valid_articles.extend(retry_valid);
        tracing::info!(total = valid_articles.len(), "Phase 1 after retry");
    }
}
```
Note: `valid_articles` must be declared as `let mut valid_articles` earlier for this to work.
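The link selection and the under-fill target from the retry step can be sketched in isolation. Both functions are hypothetical helpers written for this illustration; they assume URLs are compared case-insensitively, as the retry code does with `to_lowercase()`.

```rust
use std::collections::HashSet;

/// Hypothetical helper: pick links worth scraping on the retry pass.
/// `already_fetched` holds lowercased URLs from the first pass and is updated
/// in place, so a link offered by two sources is only queued once.
fn select_retry_urls(
    already_fetched: &mut HashSet<String>,
    fresh_links: Vec<String>,
) -> Vec<String> {
    fresh_links
        .into_iter()
        .filter(|link| already_fetched.insert(link.to_lowercase()))
        .collect()
}

/// Target article count: the per-category cap times the number of categories.
fn target_count(categories: usize, max_items_per_category: usize) -> usize {
    categories * max_items_per_category
}
```

With 3 categories and a cap of 5 items each, the retry fires whenever fewer than 15 articles survive the history filter.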

- [ ] **Step 4: Add history filtering in Phase 2 (before scraping)**

In Phase 2, after `dedup_by_url` and `limit_articles_per_source` (around line 552), before `scrape_articles`, add:

```rust
// Filter against article history BEFORE scraping (saves HTTP requests)
let parsed = if settings.article_history_days > 0 {
    let all_urls: Vec<String> = parsed.iter()
        .flat_map(|(_, items)| items.iter().map(|i| i.url.clone()))
        .collect();
    let hashes: Vec<String> = all_urls.iter().map(|u| hash_article_url(u)).collect();
    let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes)
        .await
        .unwrap_or_default();
    if !existing.is_empty() {
        tracing::info!(filtered = existing.len(), "Phase 2: filtered articles already in history");
    }
    parsed
        .into_iter()
        .map(|(cat_key, items)| {
            let filtered = items
                .into_iter()
                .filter(|item| !existing.contains(&hash_article_url(&item.url)))
                .collect();
            (cat_key, filtered)
        })
        .collect()
} else {
    parsed
};
```
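The per-category filter can be exercised on its own with plain data. In the sketch below, `Item` and `filter_categories` are hypothetical, and the `key_of` closure stands in for `hash_article_url`; the real pipeline items carry more fields than a URL.

```rust
use std::collections::HashSet;

/// Hypothetical minimal item type for this sketch.
struct Item {
    url: String,
}

/// Drop items whose key is already in history. Every category entry is kept,
/// possibly with an empty item list, so category order is preserved.
fn filter_categories<F>(
    parsed: Vec<(String, Vec<Item>)>,
    existing: &HashSet<String>,
    key_of: F,
) -> Vec<(String, Vec<Item>)>
where
    F: Fn(&str) -> String,
{
    parsed
        .into_iter()
        .map(|(category, items)| {
            let kept: Vec<Item> = items
                .into_iter()
                .filter(|item| !existing.contains(&key_of(&item.url)))
                .collect();
            (category, kept)
        })
        .collect()
}
```

A category can come out empty after this step, so downstream code should tolerate categories with no items.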

- [ ] **Step 5: Insert article URLs after saving synthesis**

After `db::syntheses::create` (around line 638), add:

```rust
// Record article URLs in history for cross-synthesis dedup
if settings.article_history_days > 0 {
    let article_urls: Vec<(String, String)> = final_sections
        .iter()
        .flat_map(|section| section.items.iter())
        .map(|item| (item.url.clone(), hash_article_url(&item.url)))
        .collect();
    db::article_history::insert_urls(&state.pool, user_id, &article_urls)
        .await
        .ok(); // Don't fail synthesis if history insert fails
}
```
```
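The history calls in this task are deliberately best-effort: `.ok()`, `.unwrap_or(0)`, and `.unwrap_or_default()` all discard the error so that a history failure never aborts a synthesis. If silent failures turn out to be hard to diagnose, one possible variation is to log before falling back. The helper below is a hypothetical sketch, not part of the plan; it uses `eprintln!` only to stay dependency-free, where the real code would presumably use `tracing`.

```rust
/// Hypothetical helper: unwrap a non-critical result, logging a failure
/// and falling back to the type's default value.
fn best_effort<T: Default, E: std::fmt::Display>(step: &str, result: Result<T, E>) -> T {
    match result {
        Ok(value) => value,
        Err(err) => {
            // Stand-in for a `tracing::warn!` call in the real pipeline.
            eprintln!("{} failed, continuing: {}", step, err);
            T::default()
        }
    }
}
```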

- [ ] **Step 6: Run tests + commit**

```bash
# Subshell keeps the working directory at the repo root for git
(cd backend && cargo test --lib && cargo build)
git add backend/src/services/synthesis.rs
git commit -m "feat: article history filtering in pipeline (cleanup, Phase 1/2 filter, retry, insert after save)"
```

---

### Task 5: Frontend setting

**Files:**
- Modify: `frontend/src/types.ts`
- Modify: `frontend/src/i18n/fr.ts`
- Modify: `frontend/src/pages/Settings.tsx`

- [ ] **Step 1: Add field to types + DEFAULT_SETTINGS**

```typescript
// In UserSettings:
article_history_days: number;

// In DEFAULT_SETTINGS:
article_history_days: 90,
```

- [ ] **Step 2: Add i18n label**

```typescript
'settings.articleHistoryDays': 'Historique des articles (jours)',
```

- [ ] **Step 3: Add number input to Settings page**

Add inside the generation settings grid (alongside the other number inputs):

```tsx
<div>
  <label
    for="articleHistoryDays"
    class="block text-sm font-medium text-gray-700"
  >
    {t('settings.articleHistoryDays')}
  </label>
  <div class="mt-1">
    <input
      type="number"
      id="articleHistoryDays"
      min="0"
      max="365"
      class="shadow-sm focus:ring-indigo-500 focus:border-indigo-500 block w-full sm:text-sm border-gray-300 rounded-md py-2 px-3 border"
      value={settings().article_history_days}
      onInput={(e) => {
        const parsed = parseInt(e.currentTarget.value, 10);
        setSettings((prev) => ({
          ...prev,
          // 0 is a valid value (it disables history), so fall back only on NaN
          article_history_days: Number.isNaN(parsed) ? 90 : parsed,
        }));
      }}
    />
  </div>
</div>
```

- [ ] **Step 4: Run frontend tests + commit**

```bash
# Subshell keeps the working directory at the repo root for git
(cd frontend && npx tsc --noEmit && npx vitest run)
git add frontend/src/types.ts frontend/src/i18n/fr.ts frontend/src/pages/Settings.tsx
git commit -m "feat: add article_history_days setting to frontend"
```

---

### Task 6: Update E2E and integration tests

**Files:**
- Modify: `e2e/tests/generation-live.spec.ts`
- Modify: `backend/tests/api_syntheses_test.rs`

- [ ] **Step 1: Update E2E settings payload**

Add `article_history_days: 90` to the PUT settings body.

- [ ] **Step 2: Update integration test settings payload**

In `api_syntheses_test.rs`, add `"article_history_days": 90` to the PUT settings body.

- [ ] **Step 3: Run E2E test to verify**

```bash
cd e2e && docker compose -f docker-compose.test.yml down
docker compose -f docker-compose.test.yml up --build -d
sleep 25 && npx tsx seed.ts && npx playwright test generation-live --reporter=list
```

- [ ] **Step 4: Commit**

Run from the repository root:

```bash
git add e2e/tests/generation-live.spec.ts backend/tests/api_syntheses_test.rs
git commit -m "test: update E2E and integration tests with article_history_days setting"
```