Design: Article History — Prevent Duplicate Articles Across Syntheses
Date: 2026-03-24
Scope: Persistent article URL history to prevent the same article from appearing in multiple syntheses
Context
Article URLs are only deduplicated within a single synthesis (via dedup_by_url and seen_urls). Across syntheses, there is no dedup — if an article stays on a blog's front page, it reappears every week. The source_diversity_window only avoids domains, not specific URLs.
Approach
New article_history table stores all article URLs per user. During generation, candidate articles are filtered against this history. A configurable TTL allows articles to reappear after N days.
New Table: article_history
CREATE TABLE article_history (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    url_hash TEXT NOT NULL,
    url TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX idx_article_history_user_url ON article_history(user_id, url_hash);
- url_hash: SHA-256 of the normalized (lowercased) URL (fast indexed lookup, avoids indexing long text)
- Scoped per user_id: each user has independent history
- UNIQUE(user_id, url_hash) constraint: inserts use ON CONFLICT DO NOTHING to silently skip duplicates
New User Setting
- Field: article_history_days: i32 (default 90, range 0-365, 0 = disabled)
- Label: "Historique des articles (jours)" (English: "Article history (days)")
- Migration: ALTER TABLE settings ADD COLUMN article_history_days INTEGER NOT NULL DEFAULT 90
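A minimal sketch of how the 0-365 range and the "0 = disabled" rule might be enforced on the backend. Both function names are hypothetical and are not taken from the existing codebase.

```rust
/// Hypothetical helper: accepts only values in the documented range (0-365).
pub fn validate_article_history_days(days: i32) -> Result<i32, String> {
    if (0..=365).contains(&days) {
        Ok(days)
    } else {
        Err(format!("article_history_days must be between 0 and 365, got {}", days))
    }
}

/// Hypothetical helper: a value of 0 turns the feature off entirely.
pub fn history_enabled(days: i32) -> bool {
    days > 0
}
```

The pipeline could call history_enabled once at the start of a generation and skip filtering, insertion, and cleanup when it returns false.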
Pipeline Integration
Where history filtering happens: After scraping + filtering empty content, before classification. Avoids wasting LLM classification calls on articles that will be filtered out.
Phase 1 (personalized sources):
- Scrape source pages → extract article links
- Scrape candidate articles → filter empty content
- Filter against article_history: query check_urls_exist(user_id, url_hashes), remove matches
- If under-filled: 1 retry. Go back to step 1 with the same sources, excluding already-fetched URLs (both history-filtered and previously-scraped URLs in this run). Only 1 retry attempt.
- LLM classification → fill categories
Phase 2 (web search fallback):
- LLM search → parse + filter homepage + dedup
- Filter against article_history — remove already-seen URLs BEFORE scraping (saves HTTP requests)
- Scrape remaining articles → filter empty
- LLM classification → fill remaining
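The history filter used in both phases can be sketched as a pure function over the set returned by check_urls_exist. The Candidate struct and the filter_against_history name are assumptions for illustration; the actual article type in synthesis.rs may look different.

```rust
use std::collections::HashSet;

/// Hypothetical candidate record: a URL plus its precomputed hash.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub url: String,
    pub url_hash: String,
}

/// Splits candidates into (fresh, already_seen) using the hashes found in history.
/// The second list is returned so a Phase 1 retry can exclude those URLs as well.
pub fn filter_against_history(
    candidates: Vec<Candidate>,
    seen_hashes: &HashSet<String>,
) -> (Vec<Candidate>, Vec<Candidate>) {
    candidates
        .into_iter()
        .partition(|c| !seen_hashes.contains(&c.url_hash))
}
```

In Phase 2 the same function would run before scraping, so only the fresh list is fetched over HTTP.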
After saving synthesis:
Insert all article URLs from the final synthesis into article_history:
- insert_urls(user_id, urls): batch insert with SHA-256 hashes
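One possible way to build the (url, url_hash) pairs passed to insert_urls is sketched here. The build_history_rows name is hypothetical, and the normalizer and hasher are passed in as closures so the sketch stays dependency-free; in the real code they would presumably be normalize_article_url and hash_token. Storing the original URL rather than the normalized one is also an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical helper: builds (url, url_hash) pairs, skipping URLs whose
/// normalized form hashes to a value already produced in this batch.
pub fn build_history_rows<N, H>(urls: &[&str], normalize: N, hash: H) -> Vec<(String, String)>
where
    N: Fn(&str) -> String,
    H: Fn(&str) -> String,
{
    let mut batch_hashes: HashSet<String> = HashSet::new();
    let mut rows = Vec::new();
    for &url in urls {
        let normalized = normalize(url);
        let url_hash = hash(normalized.as_str());
        // insert() returns false when the hash was already in the set.
        if batch_hashes.insert(url_hash.clone()) {
            rows.push((url.to_string(), url_hash));
        }
    }
    rows
}
```

Deduplicating in memory is optional, since ON CONFLICT DO NOTHING would skip the duplicates anyway; it only keeps the batch smaller.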
Cleanup:
Before each generation, delete entries older than article_history_days for this user:
- cleanup_old(user_id, days): DELETE FROM article_history WHERE user_id = $1 AND created_at < now() - make_interval(days => $2)
- When article_history_days = 0, skip both filtering and insertion (feature disabled)
DB Module: db/article_history.rs
Three functions:
/// Check which URLs already exist in history. Returns the set of url_hashes that exist.
pub async fn check_urls_exist(pool: &PgPool, user_id: Uuid, url_hashes: &[String]) -> Result<HashSet<String>, AppError>
/// Insert article URLs into history (batch).
pub async fn insert_urls(pool: &PgPool, user_id: Uuid, urls: &[(String, String)]) -> Result<(), AppError>
// urls is a slice of (url, url_hash) pairs
/// Delete history entries older than N days.
pub async fn cleanup_old(pool: &PgPool, user_id: Uuid, days: i32) -> Result<u64, AppError>
URL normalization before hashing: URLs must be normalized before hashing to catch common variations:
- Lowercase the URL
- Strip fragment (#...)
- Strip trailing slash from path (unless path is just /)
- Strip known tracking query parameters: utm_source, utm_medium, utm_campaign, utm_term, utm_content, ref, source, fbclid, gclid
- If no query parameters remain after stripping, remove the ? entirely
Add a normalize_article_url(url: &str) -> String utility (in synthesis.rs or a shared util). Use url::Url::parse for reliable parsing.
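The normalization rules above can be illustrated with a dependency-free approximation. This sketch uses plain string operations instead of url::Url::parse, so it does not validate the input or canonicalize hosts and ports the way the url crate would; it is only meant to show the order of the steps.

```rust
const TRACKING_PARAMS: [&str; 9] = [
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "ref", "source", "fbclid", "gclid",
];

/// Approximate sketch of normalize_article_url using std only.
pub fn normalize_article_url(url: &str) -> String {
    // Lowercase the whole URL.
    let lowered = url.trim().to_lowercase();
    // Strip the fragment (everything from the first '#').
    let without_fragment = lowered.split('#').next().unwrap_or("");
    // Separate the query string from the rest.
    let (base, query) = match without_fragment.split_once('?') {
        Some((b, q)) => (b, q),
        None => (without_fragment, ""),
    };
    // Locate the start of the path: the first '/' after "scheme://".
    let authority_start = base.find("://").map(|i| i + 3).unwrap_or(0);
    let path_start = base[authority_start..].find('/').map(|i| i + authority_start);
    // Strip one trailing slash unless the path is just "/".
    let base = match path_start {
        Some(p) if base.len() > p + 1 && base.ends_with('/') => &base[..base.len() - 1],
        _ => base,
    };
    // Drop known tracking parameters, keeping the order of the others.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| !pair.is_empty())
        .filter(|pair| {
            let key = pair.split('=').next().unwrap_or("");
            !TRACKING_PARAMS.contains(&key)
        })
        .collect();
    // Omit the '?' entirely when no parameters remain.
    if kept.is_empty() {
        base.to_string()
    } else {
        format!("{}?{}", base, kept.join("&"))
    }
}
```

With this sketch, "https://Example.com/Post/?utm_source=x#top" and "https://example.com/post" normalize to the same string and therefore hash to the same url_hash.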
URL hashing: sha256(normalize_article_url(url)) — reuse existing crate::util::token::hash_token (which does SHA-256 → hex).
Retry Logic (Phase 1)
When articles are filtered out by history and categories are under-filled:
- Collect URLs already fetched in this run (both valid and history-filtered)
- Re-scrape source pages, excluding already-fetched URLs
- Scrape new candidates → filter empty → filter history
- Merge with existing results
- Proceed to classification
Only 1 retry attempt. "Under-filled" means the total number of valid scraped articles (after history filtering) is less than categories.len() * max_items_per_category. If still under-filled after retry, Phase 2 fills the gaps.
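The under-filled check and the single retry can be sketched as follows. The function names are hypothetical, the fetch closure stands in for "re-scrape sources, scrape candidates, filter empty content", and plain URL strings are compared against history instead of hashes to keep the example short.

```rust
use std::collections::HashSet;

/// Under-filled: fewer valid articles than categories * max_items_per_category.
pub fn is_under_filled(valid_articles: usize, categories: usize, max_items_per_category: usize) -> bool {
    valid_articles < categories * max_items_per_category
}

/// Hypothetical sketch: one initial pass plus at most one retry.
/// `fetch` receives the URLs already fetched in this run so it can exclude them.
pub fn collect_with_one_retry<F>(mut fetch: F, history: &HashSet<String>, target: usize) -> Vec<String>
where
    F: FnMut(&HashSet<String>) -> Vec<String>,
{
    let mut fetched: HashSet<String> = HashSet::new();
    let mut valid: Vec<String> = Vec::new();
    for _attempt in 0..2 {
        for url in fetch(&fetched) {
            // Track every fetched URL, whether valid or history-filtered.
            if !fetched.insert(url.clone()) {
                continue;
            }
            if !history.contains(&url) {
                valid.push(url);
            }
        }
        if valid.len() >= target {
            break; // enough articles, no retry needed
        }
    }
    valid
}
```

If the result is still shorter than the target after the second pass, the caller would hand the remaining gaps to Phase 2.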
Files to Modify
- Create: migration 20260324000015_add_article_history.sql (table + setting column)
- Create: backend/src/db/article_history.rs (check_urls_exist, insert_urls, cleanup_old)
- Modify: backend/src/db/mod.rs (register module)
- Modify: backend/src/models/settings.rs (add article_history_days field)
- Modify: backend/src/db/settings.rs (add to queries)
- Modify: backend/src/services/synthesis.rs (history filtering, retry logic, insert after save, cleanup before generation)
- Modify: frontend/src/types.ts (add setting field)
- Modify: frontend/src/i18n/fr.ts (add label)
- Modify: frontend/src/pages/Settings.tsx (add number input)
- Modify: CLAUDE.md (migration count to 15)
- Modify: e2e/tests/generation-live.spec.ts (update settings payload)
- Modify: backend/tests/api_syntheses_test.rs (update settings payload)
- Add: unit tests in synthesis.rs (history filtering logic)
What Does NOT Change
- dedup_by_url: still handles within-synthesis dedup (fast, in-memory)
- seen_urls HashSet: still handles cross-phase dedup within one generation
- source_diversity_window: still avoids domains across syntheses (complementary)
- Frontend synthesis display: no changes
- Scraper: no changes
- LLM providers: no changes