docs: add spec for article history to prevent cross-synthesis duplicates
parent
8e06357b47
commit
633a51dc8c
@ -0,0 +1,131 @@
# Design: Article History — Prevent Duplicate Articles Across Syntheses

**Date**: 2026-03-24
**Scope**: Persistent article URL history to prevent the same article from appearing in multiple syntheses

---

## Context

Article URLs are deduplicated only within a single synthesis (via `dedup_by_url` and `seen_urls`). There is no deduplication across syntheses: if an article stays on a blog's front page, it reappears every week. `source_diversity_window` only avoids repeating domains, not specific URLs.

## Approach

A new `article_history` table stores, per user, the URL of every article included in a synthesis. During generation, candidate articles are filtered against this history. A configurable TTL (`article_history_days`) lets an article reappear after N days.

## New Table: `article_history`

```sql
CREATE TABLE article_history (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    url_hash TEXT NOT NULL,
    url TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX idx_article_history_user_url ON article_history(user_id, url_hash);
```

- `url_hash`: hex-encoded SHA-256 of the normalized URL (see "URL normalization before hashing" below); gives a fast indexed lookup and avoids indexing long text
- Scoped per `user_id`: each user has an independent history
- The unique index on `(user_id, url_hash)` lets inserts use `ON CONFLICT DO NOTHING` to silently skip duplicates
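
As a rough mental model of these semantics, the table behaves like a set keyed by `(user_id, url_hash)`. The sketch below is an in-memory stand-in, not the database code: `HistoryModel` and its methods are hypothetical names, and a `u64` stands in for the UUID user id.

```rust
use std::collections::HashSet;

/// In-memory model of `article_history` (hypothetical, for illustration only).
/// The set key mirrors the unique index on (user_id, url_hash).
#[derive(Default)]
pub struct HistoryModel {
    rows: HashSet<(u64, String)>,
}

impl HistoryModel {
    /// Mirrors `INSERT ... ON CONFLICT DO NOTHING`: hashes already present
    /// for this user are skipped. Returns how many rows were actually added.
    pub fn insert_hashes(&mut self, user_id: u64, url_hashes: &[&str]) -> usize {
        let mut inserted = 0;
        for hash in url_hashes {
            // `HashSet::insert` returns false when the key already exists.
            if self.rows.insert((user_id, (*hash).to_string())) {
                inserted += 1;
            }
        }
        inserted
    }

    /// True if this user already has the given hash in their history.
    pub fn contains(&self, user_id: u64, url_hash: &str) -> bool {
        self.rows.contains(&(user_id, url_hash.to_string()))
    }
}
```

Inserting the same hash twice for one user adds a single row, while the same hash under a different user is an independent entry.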

## New User Setting

- **Field:** `article_history_days: i32` (default 90, range 0-365, 0 = disabled)
- **Label:** "Historique des articles (jours)" (French UI string; in English, "Article history (days)")
- **Migration:** `ALTER TABLE settings ADD COLUMN article_history_days INTEGER NOT NULL DEFAULT 90`
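
A minimal sketch of how the stored value could be interpreted, assuming out-of-range values are rejected rather than clamped (the spec does not say which). `history_ttl_days` and `MAX_HISTORY_DAYS` are hypothetical names.

```rust
/// Upper bound of the documented range (0-365).
pub const MAX_HISTORY_DAYS: i32 = 365;

/// Interprets `article_history_days`:
/// - `Ok(None)`       when the feature is disabled (0)
/// - `Ok(Some(days))` for a TTL inside the documented range
/// - `Err(..)`        for anything outside 0-365 (assumed behavior)
pub fn history_ttl_days(article_history_days: i32) -> Result<Option<u32>, String> {
    match article_history_days {
        0 => Ok(None),
        1..=MAX_HISTORY_DAYS => Ok(Some(article_history_days as u32)),
        other => Err(format!(
            "article_history_days must be between 0 and {}, got {}",
            MAX_HISTORY_DAYS, other
        )),
    }
}
```

Returning an `Option` makes the "0 = disabled" case explicit at call sites, so filtering and insertion can be skipped with a single `if let Some(days)` check.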

## Pipeline Integration

**Where history filtering happens:** always before classification, so that no LLM classification calls are wasted on articles that would be filtered out anyway. In Phase 1 it runs after scraping and empty-content filtering; in Phase 2 it runs before scraping.

### Phase 1 (personalized sources):

1. Scrape source pages → extract article links
2. Scrape candidate articles → filter empty content
3. **Filter against article_history**: query `check_urls_exist(user_id, url_hashes)` and remove matches
4. **If under-filled, retry once**: go back to step 1 with the same sources, excluding URLs already fetched in this run (both history-filtered and previously scraped)
5. LLM classification → fill categories
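
Step 3 can be sketched as a partition over the scraped candidates. `Candidate` and `filter_against_history` are hypothetical names, and `existing_hashes` is assumed to be the set returned by `check_urls_exist`. The filtered-out articles are returned as well, since step 4 needs to exclude them on retry.

```rust
use std::collections::HashSet;

/// A scraped candidate article (hypothetical shape for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub url: String,
    pub url_hash: String,
}

/// Splits candidates into `(fresh, already_seen)`.
/// `fresh` goes on to classification; `already_seen` matched the history.
pub fn filter_against_history(
    candidates: Vec<Candidate>,
    existing_hashes: &HashSet<String>,
) -> (Vec<Candidate>, Vec<Candidate>) {
    candidates
        .into_iter()
        // `partition` puts items for which the predicate is true first.
        .partition(|c| !existing_hashes.contains(&c.url_hash))
}
```

Both halves preserve the input order, so any ranking applied before filtering carries over to the articles that remain.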

### Phase 2 (web search fallback):

1. LLM search → parse + filter homepage + dedup
2. **Filter against article_history**: remove already-seen URLs BEFORE scraping (saves HTTP requests)
3. Scrape remaining articles → filter empty
4. LLM classification → fill remaining

### After saving synthesis:

Insert all article URLs from the final synthesis into `article_history`:
- `insert_urls(user_id, urls)`: batch insert of each URL with its SHA-256 hash

### Cleanup:

Before each generation, delete entries older than `article_history_days` for this user:
- `cleanup_old(user_id, days)`: `DELETE FROM article_history WHERE user_id = $1 AND created_at < now() - make_interval(days => $2)`. The day count is bound as an integer and passed to `make_interval`, because bind parameters are not expanded inside a quoted literal such as `interval '$2 days'`.
- When `article_history_days = 0`, skip both filtering and insertion (feature disabled)

## DB Module: `db/article_history.rs`

Three functions:

```rust
/// Check which URLs already exist in history. Returns the set of url_hashes that exist.
pub async fn check_urls_exist(pool: &PgPool, user_id: Uuid, url_hashes: &[String]) -> Result<HashSet<String>, AppError>

/// Insert article URLs into history (batch).
/// Each element of `urls` is a `(url, url_hash)` pair.
pub async fn insert_urls(pool: &PgPool, user_id: Uuid, urls: &[(String, String)]) -> Result<(), AppError>

/// Delete history entries older than N days.
pub async fn cleanup_old(pool: &PgPool, user_id: Uuid, days: i32) -> Result<u64, AppError>
```

**URL normalization before hashing:** URLs must be normalized before hashing so that common variations of the same URL map to the same hash:
1. Lowercase the URL
2. Strip fragment (`#...`)
3. Strip trailing slash from path (unless path is just `/`)
4. Strip known tracking query parameters: `utm_source`, `utm_medium`, `utm_campaign`, `utm_term`, `utm_content`, `ref`, `source`, `fbclid`, `gclid`
5. If no query parameters remain after stripping, remove the `?` entirely

Add a `normalize_article_url(url: &str) -> String` utility (in `synthesis.rs` or a shared util). Use `url::Url::parse` for reliable parsing.

**URL hashing:** `sha256(normalize_article_url(url))`, hex-encoded. Reuse the existing `crate::util::token::hash_token`, which does SHA-256 → hex.
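
The five normalization steps above can be sketched with the standard library alone. This is a simplified illustration, not the intended implementation: the spec calls for `url::Url::parse`, whereas this version uses plain string splitting and assumes an absolute URL of the form `scheme://host/path?query#fragment`. `TRACKING_PARAMS` is a hypothetical constant name.

```rust
/// Tracking parameters listed in the spec.
const TRACKING_PARAMS: [&str; 9] = [
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "ref", "source", "fbclid", "gclid",
];

/// String-based sketch of the normalization steps (std only).
pub fn normalize_article_url(url: &str) -> String {
    // 1. Lowercase the whole URL.
    let lowered = url.to_lowercase();
    // 2. Strip the fragment: keep everything before the first '#'.
    let without_fragment = lowered.split('#').next().unwrap_or("");
    // Separate the query string from the rest.
    let (base, query) = match without_fragment.split_once('?') {
        Some((base, query)) => (base, Some(query)),
        None => (without_fragment, None),
    };
    // Locate the start of the path: the first '/' after "scheme://".
    let path_start = base
        .find("://")
        .and_then(|i| base[i + 3..].find('/').map(|j| i + 3 + j));
    // 3. Strip one trailing slash, unless the path is just "/".
    let mut normalized = base.to_string();
    if let Some(start) = path_start {
        if base.len() > start + 1 && base.ends_with('/') {
            normalized.pop();
        }
    }
    // 4. Drop tracking parameters, keeping the others in their original order.
    if let Some(query) = query {
        let kept: Vec<&str> = query
            .split('&')
            .filter(|pair| !pair.is_empty())
            .filter(|pair| {
                let key = pair.split('=').next().unwrap_or("");
                !TRACKING_PARAMS.contains(&key)
            })
            .collect();
        // 5. Re-attach '?' only if some parameters remain.
        if !kept.is_empty() {
            normalized.push('?');
            normalized.push_str(&kept.join("&"));
        }
    }
    normalized
}
```

With this sketch, `https://Example.com/Blog/Post/?utm_source=x&id=7#top` normalizes to `https://example.com/blog/post?id=7`. A parser-based version would additionally handle cases this one ignores, such as default ports or a URL with no path at all.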

## Retry Logic (Phase 1)

When history filtering removes articles and leaves categories under-filled:

1. Collect the URLs already fetched in this run (both valid and history-filtered)
2. Re-scrape source pages, excluding already-fetched URLs
3. Scrape new candidates → filter empty → filter history
4. Merge with existing results
5. Proceed to classification

Only one retry is attempted. "Under-filled" means the total number of valid scraped articles (after history filtering) is less than `categories.len() * max_items_per_category`. If the result is still under-filled after the retry, Phase 2 fills the gaps.
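
The under-filled test and the single-retry rule reduce to two small predicates. The function and constant names below are hypothetical; only the comparison itself comes from the definition above.

```rust
/// The spec allows exactly one retry in Phase 1.
pub const MAX_RETRIES: u32 = 1;

/// True when there are fewer valid articles (after history filtering)
/// than `category_count * max_items_per_category`.
pub fn is_under_filled(
    valid_articles: usize,
    category_count: usize,
    max_items_per_category: usize,
) -> bool {
    // saturating_mul avoids overflow on pathological settings.
    valid_articles < category_count.saturating_mul(max_items_per_category)
}

/// True when another Phase 1 pass is allowed and useful.
pub fn should_retry(under_filled: bool, retries_done: u32) -> bool {
    under_filled && retries_done < MAX_RETRIES
}
```

For example, with 3 categories and 2 items per category, 5 valid articles count as under-filled and 6 do not. Once one retry has been done, `should_retry` returns false and any remaining gaps are left to Phase 2.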

## Files to Modify

- **Create:** migration `20260324000015_add_article_history.sql` (table + setting column)
- **Create:** `backend/src/db/article_history.rs` (`check_urls_exist`, `insert_urls`, `cleanup_old`)
- **Modify:** `backend/src/db/mod.rs` (register module)
- **Modify:** `backend/src/models/settings.rs` (add `article_history_days` field)
- **Modify:** `backend/src/db/settings.rs` (add to queries)
- **Modify:** `backend/src/services/synthesis.rs` (history filtering, retry logic, insert after save, cleanup before generation)
- **Modify:** `frontend/src/types.ts` (add setting field)
- **Modify:** `frontend/src/i18n/fr.ts` (add label)
- **Modify:** `frontend/src/pages/Settings.tsx` (add number input)
- **Modify:** `CLAUDE.md` (migration count to 15)
- **Modify:** `e2e/tests/generation-live.spec.ts` (update settings payload)
- **Modify:** `backend/tests/api_syntheses_test.rs` (update settings payload)
- **Add:** unit tests in `synthesis.rs` (history filtering logic)

## What Does NOT Change

- `dedup_by_url`: still handles within-synthesis dedup (fast, in-memory)
- `seen_urls` HashSet: still handles cross-phase dedup within one generation
- `source_diversity_window`: still avoids domains across syntheses (complementary)
- Frontend synthesis display: no changes
- Scraper: no changes
- LLM providers: no changes