# Article History — Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Prevent duplicate articles across syntheses by maintaining a persistent per-user article URL history with a configurable TTL.

**Architecture:** A new `article_history` table stores SHA-256 hashes of normalized URLs. The pipeline filters candidates against the history before classification, inserts URLs after the synthesis is saved, and cleans up expired entries before each generation.

**Tech Stack:** Rust (sqlx, sha2, url crate), PostgreSQL

**Spec:** `docs/superpowers/specs/2026-03-24-article-history-design.md`

---

### Task 1: Migration + settings field

**Files:**
- Create: `backend/migrations/20260324000015_add_article_history.sql`
- Modify: `backend/src/models/settings.rs`
- Modify: `backend/src/db/settings.rs`
- Modify: `backend/src/services/prompts.rs` (test fixture)
- Modify: `CLAUDE.md`

- [ ] **Step 1: Create migration**

```sql
-- Article history table for cross-synthesis URL deduplication
CREATE TABLE article_history (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    url_hash TEXT NOT NULL,
    url TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX idx_article_history_user_url ON article_history(user_id, url_hash);

-- Setting for history TTL
ALTER TABLE settings ADD COLUMN article_history_days INTEGER NOT NULL DEFAULT 90;
```

- [ ] **Step 2: Add `article_history_days` to settings model**

Add `pub article_history_days: i32` to `UserSettings`, `SettingsResponse`, and `UpdateSettingsRequest`. Add it to the `From` impl and to `Default` (90), and validate the range. A value of `0` is allowed and disables history filtering (every pipeline step in Task 4 is guarded by `article_history_days > 0`):

```rust
if !(0..=365).contains(&self.article_history_days) {
    return Err("article_history_days must be between 0 and 365".into());
}
```
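
A minimal sketch of how the default and the range check might fit together, assuming a reduced `HistorySettings` struct and a `String` error type (both hypothetical; the real `UpdateSettingsRequest` carries many more fields). It also makes explicit that `0` passes validation while switching the feature off.

```rust
/// Hypothetical, reduced stand-in for the settings request type.
#[derive(Debug, Clone, PartialEq)]
pub struct HistorySettings {
    /// Days to remember an article URL; 0 disables the history.
    pub article_history_days: i32,
}

impl Default for HistorySettings {
    fn default() -> Self {
        Self { article_history_days: 90 }
    }
}

impl HistorySettings {
    /// Same range check as in the plan: 0..=365 inclusive.
    pub fn validate(&self) -> Result<(), String> {
        if !(0..=365).contains(&self.article_history_days) {
            return Err("article_history_days must be between 0 and 365".into());
        }
        Ok(())
    }

    /// Whether history filtering should run at all.
    pub fn history_enabled(&self) -> bool {
        self.article_history_days > 0
    }
}
```

Keeping the "enabled" decision in one helper is only a suggestion; the plan itself repeats the `> 0` comparison at each call site.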

- [ ] **Step 3: Add to DB queries**

Add to `SettingsRow`, `TryFrom`, both SQL queries in `db/settings.rs`.

- [ ] **Step 4: Update test fixtures**

Add `article_history_days: 90` to `valid_request()` in settings tests and `test_settings()` in prompts tests.

- [ ] **Step 5: Update CLAUDE.md migration count to 15**

- [ ] **Step 6: Run tests + commit**

```bash
cd backend && cargo test --lib
git add backend/migrations/20260324000015_add_article_history.sql backend/src/models/settings.rs backend/src/db/settings.rs backend/src/services/prompts.rs CLAUDE.md
git commit -m "feat: add article_history table and article_history_days setting"
```

---

### Task 2: DB module for article history

**Files:**
- Create: `backend/src/db/article_history.rs`
- Modify: `backend/src/db/mod.rs`

- [ ] **Step 1: Create `backend/src/db/article_history.rs`**

```rust
//! Article history: tracks which article URLs have been used in past syntheses.
//!
//! Prevents the same article from appearing in multiple syntheses.

use std::collections::HashSet;
use sqlx::PgPool;
use uuid::Uuid;
use crate::errors::AppError;

/// Check which URL hashes already exist in history for this user.
///
/// Returns the set of url_hashes that were found (i.e., already used).
pub async fn check_urls_exist(
    pool: &PgPool,
    user_id: Uuid,
    url_hashes: &[String],
) -> Result<HashSet<String>, AppError> {
    if url_hashes.is_empty() {
        return Ok(HashSet::new());
    }

    let rows = sqlx::query_scalar::<_, String>(
        "SELECT url_hash FROM article_history WHERE user_id = $1 AND url_hash = ANY($2)",
    )
    .bind(user_id)
    .bind(url_hashes)
    .fetch_all(pool)
    .await?;

    Ok(rows.into_iter().collect())
}

/// Insert article URLs into history (batch).
///
/// Uses ON CONFLICT DO NOTHING to silently skip duplicates.
pub async fn insert_urls(
    pool: &PgPool,
    user_id: Uuid,
    urls: &[(String, String)], // (url, url_hash) pairs
) -> Result<(), AppError> {
    if urls.is_empty() {
        return Ok(());
    }

    for (url, url_hash) in urls {
        sqlx::query(
            "INSERT INTO article_history (user_id, url_hash, url) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING",
        )
        .bind(user_id)
        .bind(url_hash)
        .bind(url)
        .execute(pool)
        .await?;
    }

    Ok(())
}

/// Delete history entries older than N days for this user.
///
/// Returns the number of deleted rows.
pub async fn cleanup_old(
    pool: &PgPool,
    user_id: Uuid,
    days: i32,
) -> Result<u64, AppError> {
    let result = sqlx::query(
        "DELETE FROM article_history WHERE user_id = $1 AND created_at < now() - make_interval(days => $2)",
    )
    .bind(user_id)
    .bind(days)
    .execute(pool)
    .await?;

    Ok(result.rows_affected())
}
```
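
To make the retention rule concrete, here is a small in-memory sketch of the predicate that the `DELETE` in `cleanup_old` expresses in SQL. `is_expired` and `SECONDS_PER_DAY` are hypothetical names used for illustration only; the sketch relies on `std::time` instead of database timestamps and assumes a day is exactly 86,400 seconds.

```rust
use std::time::{Duration, SystemTime};

/// Assumption for this sketch: every day has exactly 86,400 seconds.
pub const SECONDS_PER_DAY: u64 = 24 * 60 * 60;

/// Returns true when an entry created at `created_at` is strictly older
/// than `days` days as of `now`.
///
/// Mirrors `created_at < now() - make_interval(days => $2)`.
pub fn is_expired(created_at: SystemTime, now: SystemTime, days: u32) -> bool {
    let ttl = Duration::from_secs(u64::from(days) * SECONDS_PER_DAY);
    match now.checked_sub(ttl) {
        Some(cutoff) => created_at < cutoff,
        // The cutoff is not representable, so nothing can be older than it.
        None => false,
    }
}
```

Because the comparison is strict, an entry created exactly `days` days ago is still kept and is removed only once it is older than the cutoff.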

- [ ] **Step 2: Register module in `db/mod.rs`**

Add `pub mod article_history;`, keeping the module list alphabetical (it goes after `api_keys`).

- [ ] **Step 3: Run tests + commit**

```bash
cd backend && cargo test --lib && cargo build
git add backend/src/db/article_history.rs backend/src/db/mod.rs
git commit -m "feat: add article_history DB module (check, insert, cleanup)"
```

---

### Task 3: URL normalization utility + unit tests

**Files:**
- Modify: `backend/src/services/synthesis.rs`

- [ ] **Step 1: Add `normalize_article_url` function**

Add near the other URL helper functions (near `extract_domain`):

```rust
/// Normalize an article URL for consistent history hashing.
///
/// Strips fragments, trailing slashes, and known tracking query parameters
/// so that the same article with different UTM tags is recognized as a duplicate.
fn normalize_article_url(url_str: &str) -> String {
    let Ok(mut parsed) = url::Url::parse(url_str) else {
        // Not a parseable absolute URL: fall back to a lowercased copy.
        return url_str.to_lowercase();
    };

    // Strip fragment
    parsed.set_fragment(None);

    // Strip known tracking query parameters
    let tracking_params: &[&str] = &[
        "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
        "ref", "source", "fbclid", "gclid",
    ];

    let filtered_pairs: Vec<(String, String)> = parsed
        .query_pairs()
        .filter(|(key, _)| !tracking_params.contains(&key.as_ref()))
        .map(|(k, v)| (k.into_owned(), v.into_owned()))
        .collect();

    if filtered_pairs.is_empty() {
        parsed.set_query(None);
    } else {
        // Re-serialize through the url crate so that decoded keys and values
        // containing `&` or `=` are encoded again instead of splitting pairs.
        parsed
            .query_pairs_mut()
            .clear()
            .extend_pairs(&filtered_pairs);
    }

    // Strip trailing slash (unless path is just "/")
    let path = parsed.path().to_string();
    if path.len() > 1 && path.ends_with('/') {
        parsed.set_path(&path[..path.len() - 1]);
    }

    // Lowercase the whole URL (path included) for consistent hashing
    parsed.to_string().to_lowercase()
}

/// Compute the hash of a normalized article URL for history lookup.
fn hash_article_url(url: &str) -> String {
    let normalized = normalize_article_url(url);
    crate::util::token::hash_token(&normalized)
}
```

- [ ] **Step 2: Add unit tests**

```rust
// ── normalize_article_url tests ─────────────────────────────

#[test]
fn normalize_strips_fragment() {
    assert_eq!(
        normalize_article_url("https://example.com/article#section"),
        "https://example.com/article"
    );
}

#[test]
fn normalize_strips_utm_params() {
    assert_eq!(
        normalize_article_url("https://example.com/article?utm_source=twitter&utm_medium=social"),
        "https://example.com/article"
    );
}

#[test]
fn normalize_keeps_non_tracking_params() {
    let result = normalize_article_url("https://example.com/search?q=test&utm_source=twitter");
    assert!(result.contains("q=test"));
    assert!(!result.contains("utm_source"));
}

#[test]
fn normalize_strips_trailing_slash() {
    assert_eq!(
        normalize_article_url("https://example.com/article/"),
        "https://example.com/article"
    );
}

#[test]
fn normalize_keeps_root_slash() {
    assert_eq!(
        normalize_article_url("https://example.com/"),
        "https://example.com/"
    );
}

#[test]
fn normalize_lowercases() {
    assert_eq!(
        normalize_article_url("https://Example.COM/Article"),
        "https://example.com/article" // entire URL lowercased for consistent hashing
    );
}

#[test]
fn normalize_handles_invalid_url() {
    let result = normalize_article_url("not a url at all");
    assert_eq!(result, "not a url at all");
}

#[test]
fn normalize_strips_fbclid() {
    let result = normalize_article_url("https://example.com/post?fbclid=abc123");
    assert!(!result.contains("fbclid"));
    assert!(!result.contains("?"));
}

#[test]
fn hash_article_url_deterministic() {
    let h1 = hash_article_url("https://example.com/article?utm_source=twitter");
    let h2 = hash_article_url("https://example.com/article?utm_source=newsletter");
    assert_eq!(h1, h2, "Same article with different UTM params should hash the same");
}
```

- [ ] **Step 3: Run tests + commit**

```bash
cd backend && cargo test --lib
git add backend/src/services/synthesis.rs
git commit -m "feat: add normalize_article_url and hash_article_url utilities"
```

---

### Task 4: Pipeline integration — history filtering, insert, cleanup

**Files:**
- Modify: `backend/src/services/synthesis.rs`

This is the core integration task. Changes are in `run_generation_inner`.

- [ ] **Step 1: Add cleanup at the start of generation**

After loading settings (around line 259), add:

```rust
// Cleanup old article history entries
if settings.article_history_days > 0 {
    let deleted = db::article_history::cleanup_old(
        &state.pool,
        user_id,
        settings.article_history_days,
    )
    .await
    .unwrap_or(0);
    if deleted > 0 {
        tracing::info!(deleted = deleted, "Cleaned up old article history entries");
    }
}
```

- [ ] **Step 2: Add history filtering in Phase 1**

After filtering empty content (around line 376, after `let valid_articles = ...`), add history filtering:

```rust
// 1d. Filter against article history (cross-synthesis dedup)
// `mut` because the Step 3 retry extends this vector.
let mut valid_articles = if settings.article_history_days > 0 {
    let hashes: Vec<String> = valid_articles.iter().map(|a| hash_article_url(&a.url)).collect();
    let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes)
        .await
        .unwrap_or_default();
    if !existing.is_empty() {
        tracing::info!(filtered = existing.len(), "Phase 1: filtered articles already in history");
    }
    valid_articles
        .into_iter()
        .filter(|a| !existing.contains(&hash_article_url(&a.url)))
        .collect::<Vec<_>>()
} else {
    valid_articles
};
```
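
The filter above hashes every URL twice: once to build the lookup list and once inside `.filter`. A possible variant, sketched below with standard-library types only, pairs each article with its hash up front. `Article`, `stand_in_hash`, `pair_with_hashes`, and `filter_seen` are hypothetical names; `stand_in_hash` uses `DefaultHasher` purely so the example is self-contained and is not a replacement for the SHA-256 based `hash_article_url`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical, reduced stand-in for `ScrapedNewsItem`.
#[derive(Debug, Clone, PartialEq)]
pub struct Article {
    pub url: String,
}

/// Stand-in for `hash_article_url`; NOT SHA-256, illustration only.
pub fn stand_in_hash(url: &str) -> String {
    let mut hasher = DefaultHasher::new();
    url.to_lowercase().hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Computes each article's hash exactly once and keeps it next to the article.
pub fn pair_with_hashes(articles: Vec<Article>) -> Vec<(String, Article)> {
    articles
        .into_iter()
        .map(|article| (stand_in_hash(&article.url), article))
        .collect()
}

/// Drops every article whose precomputed hash is in `seen`.
/// `seen` plays the role of the set returned by `check_urls_exist`.
pub fn filter_seen(paired: Vec<(String, Article)>, seen: &HashSet<String>) -> Vec<Article> {
    paired
        .into_iter()
        .filter(|(hash, _)| !seen.contains(hash))
        .map(|(_, article)| article)
        .collect()
}
```

In the pipeline, the hashes for the database lookup would be read from the paired vector (for example `paired.iter().map(|(h, _)| h.clone())`) before calling `filter_seen`. Whether the saved hashing work matters at these batch sizes is an open question; the plan's version is simpler to read.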

- [ ] **Step 3: Add Phase 1 retry logic when under-filled**

After the history filtering in Phase 1 (Step 2), check whether enough articles remain. If the result is under-filled, retry once with the same sources, excluding URLs that were already fetched:

```rust
// 1e. Retry if under-filled (1 attempt)
let target = settings.categories.len() * settings.max_items_per_category as usize;
if valid_articles.len() < target && settings.article_history_days > 0 {
    tracing::info!(
        have = valid_articles.len(),
        need = target,
        "Phase 1 under-filled after history filter, retrying with same sources"
    );

    // Collect all URLs already fetched (valid + filtered)
    let mut already_fetched: std::collections::HashSet<String> = candidate_urls
        .iter()
        .map(|u| u.to_lowercase())
        .collect();

    // Re-scrape source pages for new links
    let mut retry_urls: Vec<String> = Vec::new();
    for source in sources.iter().take(max_sources) {
        let links = if settings.use_llm_for_source_links {
            source_scraper::extract_article_links_with_llm(
                &state.http_client, &source.url, max_links_per_source,
                &provider, &model_research,
            ).await
        } else {
            source_scraper::extract_article_links(
                &state.http_client, &source.url, max_links_per_source,
            ).await
        };
        if let Ok(links) = links {
            for link in links {
                // `insert` returns false for links seen before, so a link
                // listed by two sources is queued only once.
                if already_fetched.insert(link.to_lowercase()) {
                    retry_urls.push(link);
                }
            }
        }
    }

    if !retry_urls.is_empty() {
        // Scrape retry candidates
        let retry_scraped = scrape_flat_urls(
            state, &retry_urls, settings.max_age_days as i64, tx,
            llm_for_scraping.clone(),
        ).await;
        let retry_valid: Vec<ScrapedNewsItem> = retry_scraped
            .into_iter()
            .filter(|a| !a.scraped_content.trim().is_empty())
            .collect();

        // Filter against history
        let retry_valid = if !retry_valid.is_empty() {
            let hashes: Vec<String> = retry_valid.iter().map(|a| hash_article_url(&a.url)).collect();
            let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes)
                .await.unwrap_or_default();
            retry_valid.into_iter()
                .filter(|a| !existing.contains(&hash_article_url(&a.url)))
                .collect::<Vec<_>>()
        } else {
            retry_valid
        };

        // Merge with existing valid articles
        valid_articles.extend(retry_valid);
        tracing::info!(total = valid_articles.len(), "Phase 1 after retry");
    }
}
```

Note: `valid_articles.extend(...)` needs a mutable binding, so the `let valid_articles = ...` introduced in Step 2 must be written as `let mut valid_articles = ...`.
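
The link-collection part of the retry can be looked at in isolation. The sketch below assumes a hypothetical `collect_retry_urls` helper that receives the links found on each source page as plain strings; it returns only links not fetched before, compared case-insensitively as in the snippet above, and records each accepted link so that a link listed by two sources is scraped once.

```rust
use std::collections::HashSet;

/// Returns the links from `links_per_source` that are not yet in
/// `already_fetched` (case-insensitive), preserving discovery order.
///
/// Every accepted link is added to `already_fetched`, so repeats across
/// sources are dropped as well.
pub fn collect_retry_urls(
    already_fetched: &mut HashSet<String>,
    links_per_source: &[Vec<String>],
) -> Vec<String> {
    let mut retry_urls = Vec::new();
    for links in links_per_source {
        for link in links {
            // `insert` returns false when the lowercased link was already present.
            if already_fetched.insert(link.to_lowercase()) {
                retry_urls.push(link.clone());
            }
        }
    }
    retry_urls
}
```

Pulling this logic into a pure function is optional, but it would let the retry's dedup behavior be unit-tested without network access.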

- [ ] **Step 4: Add history filtering in Phase 2 (before scraping)**

In Phase 2, after `dedup_by_url` and `limit_articles_per_source` (around line 552), before `scrape_articles`, add:

```rust
// Filter against article history BEFORE scraping (saves HTTP requests)
let parsed = if settings.article_history_days > 0 {
    let all_urls: Vec<String> = parsed.iter()
        .flat_map(|(_, items)| items.iter().map(|i| i.url.clone()))
        .collect();
    let hashes: Vec<String> = all_urls.iter().map(|u| hash_article_url(u)).collect();
    let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes)
        .await
        .unwrap_or_default();
    if !existing.is_empty() {
        tracing::info!(filtered = existing.len(), "Phase 2: filtered articles already in history");
    }
    parsed
        .into_iter()
        .map(|(cat_key, items)| {
            let filtered = items
                .into_iter()
                .filter(|item| !existing.contains(&hash_article_url(&item.url)))
                .collect();
            (cat_key, filtered)
        })
        .collect()
} else {
    parsed
};
```

- [ ] **Step 5: Insert article URLs after saving synthesis**

After `db::syntheses::create` (around line 638), add:

```rust
// Record article URLs in history for cross-synthesis dedup
if settings.article_history_days > 0 {
    let article_urls: Vec<(String, String)> = final_sections
        .iter()
        .flat_map(|section| section.items.iter())
        .map(|item| (item.url.clone(), hash_article_url(&item.url)))
        .collect();
    db::article_history::insert_urls(&state.pool, user_id, &article_urls)
        .await
        .ok(); // Don't fail synthesis if history insert fails
}
```

- [ ] **Step 6: Run tests + commit**

```bash
cd backend && cargo test --lib && cargo build
git add backend/src/services/synthesis.rs
git commit -m "feat: article history filtering in pipeline (cleanup, Phase 1/2 filter, retry, insert after save)"
```

---

### Task 5: Frontend setting

**Files:**
- Modify: `frontend/src/types.ts`
- Modify: `frontend/src/i18n/fr.ts`
- Modify: `frontend/src/pages/Settings.tsx`

- [ ] **Step 1: Add field to types + DEFAULT_SETTINGS**

```typescript
// In UserSettings:
article_history_days: number;

// In DEFAULT_SETTINGS:
article_history_days: 90,
```

- [ ] **Step 2: Add i18n label**

```typescript
'settings.articleHistoryDays': 'Historique des articles (jours)',
```

- [ ] **Step 3: Add number input to Settings page**

Add inside the generation settings grid (alongside the other number inputs):

```tsx
<div>
  <label
    for="articleHistoryDays"
    class="block text-sm font-medium text-gray-700"
  >
    {t('settings.articleHistoryDays')}
  </label>
  <div class="mt-1">
    <input
      type="number"
      id="articleHistoryDays"
      min="0"
      max="365"
      class="shadow-sm focus:ring-indigo-500 focus:border-indigo-500 block w-full sm:text-sm border-gray-300 rounded-md py-2 px-3 border"
      value={settings().article_history_days}
      onInput={(e) => {
        // 0 is a valid value (it disables history), so fall back to 90
        // only when the input cannot be parsed; `|| 90` would turn 0 into 90.
        const days = parseInt(e.currentTarget.value, 10);
        setSettings((prev) => ({
          ...prev,
          article_history_days: Number.isNaN(days) ? 90 : days,
        }));
      }}
    />
  </div>
</div>
```

- [ ] **Step 4: Run frontend tests + commit**

```bash
cd frontend && npx tsc --noEmit && npx vitest run
git add frontend/src/types.ts frontend/src/i18n/fr.ts frontend/src/pages/Settings.tsx
git commit -m "feat: add article_history_days setting to frontend"
```

---

### Task 6: Update E2E and integration tests

**Files:**
- Modify: `e2e/tests/generation-live.spec.ts`
- Modify: `backend/tests/api_syntheses_test.rs`

- [ ] **Step 1: Update E2E settings payload**

Add `article_history_days: 90` to the PUT settings body.

- [ ] **Step 2: Update integration test settings payload**

In `api_syntheses_test.rs`, add `"article_history_days": 90` to the PUT settings body.

- [ ] **Step 3: Run E2E test to verify**

```bash
cd e2e && docker compose -f docker-compose.test.yml down
docker compose -f docker-compose.test.yml up --build -d
sleep 25 && npx tsx seed.ts && npx playwright test generation-live --reporter=list
```

- [ ] **Step 4: Commit**

```bash
git add e2e/tests/generation-live.spec.ts backend/tests/api_syntheses_test.rs
git commit -m "test: update E2E and integration tests with article_history_days setting"
```