Design: Article History — Prevent Duplicate Articles Across Syntheses

Date: 2026-03-24

Scope: Persistent article URL history to prevent the same article from appearing in multiple syntheses

Context

Article URLs are only deduplicated within a single synthesis (via dedup_by_url and seen_urls). Across syntheses, there is no dedup — if an article stays on a blog's front page, it reappears every week. The source_diversity_window only avoids domains, not specific URLs.

Approach

A new article_history table stores, per user, the URL of every article that has appeared in a saved synthesis. During generation, candidate articles are filtered against this history. A configurable TTL (article_history_days) allows articles to reappear after N days.

New Table: `article_history`

CREATE TABLE article_history (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id     UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    url_hash    TEXT NOT NULL,
    url         TEXT NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX idx_article_history_user_url ON article_history(user_id, url_hash);

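url_hash: SHA-256 hex digest of the normalized URL (lowercased, with fragment and tracking parameters stripped); gives a fast, fixed-length indexed lookup and avoids indexing long text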
Scoped per user_id — each user has independent history
UNIQUE(user_id, url_hash) constraint — inserts use ON CONFLICT DO NOTHING to silently skip duplicates

New User Setting

Field: article_history_days: i32 (default 90, range 0-365, 0 = disabled)
Label: "Historique des articles (jours)"
Migration: ALTER TABLE settings ADD COLUMN article_history_days INTEGER NOT NULL DEFAULT 90
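
The range and the "0 = disabled" rule above could be enforced on the backend with something like the following. This is a minimal sketch: the constant and function names are hypothetical and do not come from the existing codebase, and the error type is a plain String rather than the project's AppError.

```rust
/// Inclusive bounds for `article_history_days`, taken from the setting's documented range.
pub const ARTICLE_HISTORY_DAYS_MIN: i32 = 0;
pub const ARTICLE_HISTORY_DAYS_MAX: i32 = 365;

/// Hypothetical helper: accepts values in 0..=365 and rejects everything else.
pub fn validate_article_history_days(days: i32) -> Result<i32, String> {
    if (ARTICLE_HISTORY_DAYS_MIN..=ARTICLE_HISTORY_DAYS_MAX).contains(&days) {
        Ok(days)
    } else {
        Err(format!(
            "article_history_days must be between {} and {}, got {}",
            ARTICLE_HISTORY_DAYS_MIN, ARTICLE_HISTORY_DAYS_MAX, days
        ))
    }
}

/// Hypothetical helper: a value of 0 means the history feature is disabled.
pub fn history_enabled(days: i32) -> bool {
    days > 0
}
```

Keeping the "enabled" check in one place makes it easier to skip filtering and insertion consistently when the setting is 0.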

Pipeline Integration

Where history filtering happens: In Phase 1, after scraping and filtering empty content, before classification. In Phase 2, before scraping. In both cases this avoids wasting LLM classification calls on articles that will be filtered out.

Phase 1 (personalized sources):

Scrape source pages → extract article links
Scrape candidate articles → filter empty content
Filter against article_history — query check_urls_exist(user_id, url_hashes), remove matches
If under-filled: retry once. Go back to the first step (scraping source pages) with the same sources, excluding every URL already fetched in this run (both history-filtered URLs and URLs that were scraped successfully).
LLM classification → fill categories

Phase 2 (web search fallback):

LLM search → parse + filter homepage + dedup
Filter against article_history — remove already-seen URLs BEFORE scraping (saves HTTP requests)
Scrape remaining articles → filter empty
LLM classification → fill remaining
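
The history filter step used in both phases can be sketched as a pure function over the set of hashes returned by check_urls_exist. The Candidate struct and the filter_against_history name are assumptions made for illustration; the real candidate type in synthesis.rs is not shown in this document.

```rust
use std::collections::HashSet;

/// Hypothetical shape of a candidate article; the real type may carry more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub url: String,
    pub url_hash: String,
}

/// Splits candidates into (kept, filtered_out) using the hashes already in history.
/// `existing` is assumed to be the result of `check_urls_exist` for this user.
pub fn filter_against_history(
    candidates: Vec<Candidate>,
    existing: &HashSet<String>,
) -> (Vec<Candidate>, Vec<Candidate>) {
    candidates
        .into_iter()
        .partition(|c| !existing.contains(&c.url_hash))
}
```

Returning the filtered-out candidates as well as the kept ones lets the Phase 1 retry exclude both groups from the next scrape.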

After saving synthesis:

Insert all article URLs from the final synthesis into article_history:

insert_urls(user_id, urls) — batch insert with SHA-256 hashes

Cleanup:

Before each generation, delete entries older than article_history_days for this user:

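cleanup_old(user_id, days): DELETE FROM article_history WHERE user_id = $1 AND created_at < now() - make_interval(days => $2)
The day count is passed to make_interval because a bind parameter is not substituted inside a quoted literal such as interval '$2 days'.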
When article_history_days = 0, skip both filtering and insertion (feature disabled)

DB Module: `db/article_history.rs`

Three functions:

/// Check which URLs already exist in history. Returns the set of url_hashes that exist.
pub async fn check_urls_exist(pool: &PgPool, user_id: Uuid, url_hashes: &[String]) -> Result<HashSet<String>, AppError>

/// Insert article URLs into history (batch).
pub async fn insert_urls(pool: &PgPool, user_id: Uuid, urls: &[(String, String)]) -> Result<(), AppError>
// urls is a slice of (url, url_hash) pairs

/// Delete history entries older than N days.
pub async fn cleanup_old(pool: &PgPool, user_id: Uuid, days: i32) -> Result<u64, AppError>

URL normalization before hashing: normalize each URL so that common variations of the same address produce the same hash:

Lowercase the URL
Strip fragment (#...)
Strip trailing slash from path (unless path is just /)
Strip known tracking query parameters: utm_source, utm_medium, utm_campaign, utm_term, utm_content, ref, source, fbclid, gclid
If no query parameters remain after stripping, remove the ? entirely

Add a normalize_article_url(url: &str) -> String utility (in synthesis.rs or a shared util). Use url::Url::parse for reliable parsing.
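
The five rules above can be illustrated with a standard-library-only sketch. This is an assumption-laden simplification meant to show the rules in order: it works on the raw string, does not validate the URL, and does not handle percent-encoding or unusual schemes. The actual implementation should use url::Url::parse as stated above.

```rust
/// Tracking parameters listed in the design; keys are compared after lowercasing.
const TRACKING_PARAMS: [&str; 9] = [
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "ref", "source", "fbclid", "gclid",
];

/// Illustrative, std-only version of `normalize_article_url` (not the real implementation).
pub fn normalize_article_url(url: &str) -> String {
    // 1. Lowercase the whole URL.
    let lowered = url.trim().to_lowercase();
    // 2. Strip the fragment (everything from the first '#').
    let without_fragment = lowered.split('#').next().unwrap_or("");
    // Separate the part before '?' from the query string.
    let (base, query) = match without_fragment.split_once('?') {
        Some((b, q)) => (b, q),
        None => (without_fragment, ""),
    };
    // 3. Strip one trailing slash from the path, unless the path is just "/".
    let authority_start = base.find("://").map(|i| i + 3).unwrap_or(0);
    let path_start = base[authority_start..].find('/').map(|i| i + authority_start);
    let base = match path_start {
        Some(p) if base.len() > p + 1 && base.ends_with('/') => &base[..base.len() - 1],
        _ => base,
    };
    // 4. Drop known tracking parameters, keeping the others in their original order.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| !pair.is_empty())
        .filter(|pair| {
            let key = pair.split('=').next().unwrap_or("");
            !TRACKING_PARAMS.contains(&key)
        })
        .collect();
    // 5. Omit the '?' entirely when no query parameters remain.
    if kept.is_empty() {
        base.to_string()
    } else {
        format!("{}?{}", base, kept.join("&"))
    }
}
```

With this sketch, "https://Example.com/Post/?utm_source=x&id=3#top" and "https://example.com/post?id=3" normalize to the same string and therefore hash to the same url_hash.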

URL hashing: sha256(normalize_article_url(url)) — reuse existing crate::util::token::hash_token (which does SHA-256 → hex).

Retry Logic (Phase 1)

When articles are filtered out by history and categories are under-filled:

Collect URLs already fetched in this run (both valid and history-filtered)
Re-scrape source pages, excluding already-fetched URLs
Scrape new candidates → filter empty → filter history
Merge with existing results
Proceed to classification

Only 1 retry attempt. "Under-filled" means the total number of valid scraped articles (after history filtering) is less than categories.len() * max_items_per_category. If still under-filled after retry, Phase 2 fills the gaps.
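
The under-filled condition and the exclusion of already-fetched URLs during the retry can be sketched as two small helpers. The function names and the use of plain String URLs are assumptions for illustration only.

```rust
use std::collections::HashSet;

/// True when the valid articles (after history filtering) cannot fill every category.
pub fn is_under_filled(
    valid_articles: usize,
    category_count: usize,
    max_items_per_category: usize,
) -> bool {
    valid_articles < category_count * max_items_per_category
}

/// Hypothetical helper for the single retry: keeps only links that were not
/// fetched earlier in this run, and drops duplicates within the new batch.
pub fn new_candidates_for_retry(
    rescraped_links: Vec<String>,
    already_fetched: &HashSet<String>,
) -> Vec<String> {
    let mut seen = HashSet::new();
    rescraped_links
        .into_iter()
        .filter(|u| !already_fetched.contains(u) && seen.insert(u.clone()))
        .collect()
}
```

For example, with 3 categories and max_items_per_category = 2, five valid articles count as under-filled and six do not.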

Files to Modify

Create: migration 20260324000015_add_article_history.sql — table + setting column
Create: backend/src/db/article_history.rs — check_urls_exist, insert_urls, cleanup_old
Modify: backend/src/db/mod.rs — register module
Modify: backend/src/models/settings.rs — add article_history_days field
Modify: backend/src/db/settings.rs — add to queries
Modify: backend/src/services/synthesis.rs — history filtering, retry logic, insert after save, cleanup before generation
Modify: frontend/src/types.ts — add setting field
Modify: frontend/src/i18n/fr.ts — add label
Modify: frontend/src/pages/Settings.tsx — add number input
Modify: CLAUDE.md — migration count to 15
Modify: e2e/tests/generation-live.spec.ts — update settings payload
Modify: backend/tests/api_syntheses_test.rs — update settings payload
Add: unit tests in synthesis.rs — history filtering logic

What Does NOT Change

dedup_by_url — still handles within-synthesis dedup (fast, in-memory)
seen_urls HashSet — still handles cross-phase dedup within one generation
source_diversity_window — still avoids domains across syntheses (complementary)
Frontend synthesis display — no changes
Scraper — no changes
LLM providers — no changes
