Design: Article History — Prevent Duplicate Articles Across Syntheses

Date: 2026-03-24
Scope: Persistent article URL history to prevent the same article from appearing in multiple syntheses


Context

Article URLs are only deduplicated within a single synthesis (via dedup_by_url and seen_urls). Across syntheses, there is no dedup — if an article stays on a blog's front page, it reappears every week. The source_diversity_window only avoids domains, not specific URLs.

Approach

New article_history table stores all article URLs per user. During generation, candidate articles are filtered against this history. A configurable TTL allows articles to reappear after N days.

New Table: article_history

CREATE TABLE article_history (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id     UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    url_hash    TEXT NOT NULL,
    url         TEXT NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX idx_article_history_user_url ON article_history(user_id, url_hash);
  • url_hash: hex-encoded SHA-256 of the normalized URL; normalization includes lowercasing (fast indexed lookup, avoids indexing long text)
  • Scoped per user_id — each user has independent history
  • UNIQUE(user_id, url_hash) constraint — inserts use ON CONFLICT DO NOTHING to silently skip duplicates

New User Setting

  • Field: article_history_days: i32 (default 90, range 0-365, 0 = disabled)
  • Label: "Historique des articles (jours)" (French UI string, meaning "Article history (days)")
  • Migration: ALTER TABLE settings ADD COLUMN article_history_days INTEGER NOT NULL DEFAULT 90
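
The range and the "0 = disabled" rule above could be enforced in one place before the value reaches the pipeline. The following is a minimal sketch, not the project's actual validation code: the function name history_window and the constant MAX_ARTICLE_HISTORY_DAYS are hypothetical, and returning an Option to represent "disabled" is an assumption about how callers might want to consume the setting.

```rust
/// Upper bound from the spec: article_history_days ranges over 0-365.
/// (Hypothetical constant name, introduced for this sketch.)
pub const MAX_ARTICLE_HISTORY_DAYS: i32 = 365;

/// Hypothetical helper: maps the raw setting to a retention window.
/// - Ok(None)       => feature disabled (value 0)
/// - Ok(Some(days)) => valid window of 1..=365 days
/// - Err(message)   => value outside the allowed range
pub fn history_window(article_history_days: i32) -> Result<Option<u32>, String> {
    match article_history_days {
        0 => Ok(None),
        d if (1..=MAX_ARTICLE_HISTORY_DAYS).contains(&d) => Ok(Some(d as u32)),
        d => Err(format!(
            "article_history_days must be between 0 and {}, got {}",
            MAX_ARTICLE_HISTORY_DAYS, d
        )),
    }
}
```

With this shape, the generation code can match on the Option once and skip filtering and insertion entirely when it is None.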

Pipeline Integration

Where history filtering happens: After scraping + filtering empty content, before classification. Avoids wasting LLM classification calls on articles that will be filtered out.

Phase 1 (personalized sources):

  1. Scrape source pages → extract article links
  2. Scrape candidate articles → filter empty content
  3. Filter against article_history — query check_urls_exist(user_id, url_hashes), remove matches
  4. If under-filled: retry once. Go back to step 1 with the same sources, excluding already-fetched URLs (both history-filtered and previously scraped URLs in this run).
  5. LLM classification → fill categories
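
Step 3 can be expressed as a pure function over the result of check_urls_exist, which keeps it easy to unit test. This is a sketch under assumptions: the Candidate struct and the function name filter_against_history are hypothetical (the real pipeline type is not shown in this spec), and each candidate is assumed to carry a precomputed url_hash.

```rust
use std::collections::HashSet;

/// Hypothetical candidate shape; the real pipeline type is not part of this spec.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub url: String,
    pub url_hash: String,
}

/// Splits candidates into (kept, history_filtered), given the set of hashes
/// that check_urls_exist reported as already present in article_history.
pub fn filter_against_history(
    candidates: Vec<Candidate>,
    seen_hashes: &HashSet<String>,
) -> (Vec<Candidate>, Vec<Candidate>) {
    candidates
        .into_iter()
        .partition(|c| !seen_hashes.contains(&c.url_hash))
}
```

Returning the filtered candidates as well, rather than dropping them, makes it straightforward for the retry in step 4 to exclude history-filtered URLs along with the ones already scraped.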

Phase 2 (web search fallback):

  1. LLM search → parse + filter homepage + dedup
  2. Filter against article_history — remove already-seen URLs BEFORE scraping (saves HTTP requests)
  3. Scrape remaining articles → filter empty
  4. LLM classification → fill remaining

After saving synthesis:

Insert all article URLs from the final synthesis into article_history:

  • insert_urls(user_id, urls) — batch insert with SHA-256 hashes

Cleanup:

Before each generation, delete entries older than article_history_days for this user:

  • cleanup_old(user_id, days): DELETE FROM article_history WHERE user_id = $1 AND created_at < now() - make_interval(days => $2)
  • When article_history_days = 0, skip both filtering and insertion (feature disabled)
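
The retention rule can be illustrated without a database. The sketch below models the cutoff computation on an in-memory list; HistoryEntry and cleanup_old_in_memory are hypothetical names, std::time::SystemTime stands in for TIMESTAMPTZ, and treating days == 0 as a no-op is an assumption (the spec only states that filtering and insertion are skipped when the feature is disabled).

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical in-memory row mirroring the url_hash and created_at columns.
pub struct HistoryEntry {
    pub url_hash: String,
    pub created_at: SystemTime,
}

/// Removes entries created before `now - days` and returns how many were removed.
/// Assumption: `days == 0` (feature disabled) leaves the list untouched.
pub fn cleanup_old_in_memory(
    entries: &mut Vec<HistoryEntry>,
    now: SystemTime,
    days: u32,
) -> usize {
    if days == 0 {
        return 0;
    }
    let window = Duration::from_secs(u64::from(days) * 24 * 60 * 60);
    let cutoff = match now.checked_sub(window) {
        Some(cutoff) => cutoff,
        // Cutoff is not representable, so no entry can be older than it.
        None => return 0,
    };
    let before = entries.len();
    // Keep entries at or after the cutoff, matching `created_at < cutoff` for deletion.
    entries.retain(|e| e.created_at >= cutoff);
    before - entries.len()
}
```

Passing `now` as a parameter rather than reading the clock inside the function keeps the sketch deterministic in tests.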

DB Module: db/article_history.rs

Three functions:

/// Check which URLs already exist in history. Returns the set of url_hashes that exist.
pub async fn check_urls_exist(pool: &PgPool, user_id: Uuid, url_hashes: &[String]) -> Result<HashSet<String>, AppError>

/// Insert article URLs into history (batch).
pub async fn insert_urls(pool: &PgPool, user_id: Uuid, urls: &[(String, String)]) -> Result<(), AppError>
// urls is a slice of (url, url_hash) pairs

/// Delete history entries older than N days.
pub async fn cleanup_old(pool: &PgPool, user_id: Uuid, days: i32) -> Result<u64, AppError>
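
For unit tests that should not touch Postgres, the lookup and insert semantics above can be mirrored by an in-memory stand-in. This is only an illustrative sketch: InMemoryHistory is a hypothetical type, u64 replaces Uuid to stay dependency-free, the methods are synchronous, and the url column is not stored because only the (user_id, url_hash) key affects behavior.

```rust
use std::collections::HashSet;

/// In-memory stand-in for the article_history table, keyed like the
/// UNIQUE(user_id, url_hash) index. Hypothetical type for illustration.
#[derive(Default)]
pub struct InMemoryHistory {
    rows: HashSet<(u64, String)>,
}

impl InMemoryHistory {
    /// Mirrors check_urls_exist: returns the subset of hashes already stored
    /// for this user.
    pub fn check_urls_exist(&self, user_id: u64, url_hashes: &[String]) -> HashSet<String> {
        url_hashes
            .iter()
            .filter(|h| self.rows.contains(&(user_id, h.to_string())))
            .cloned()
            .collect()
    }

    /// Mirrors insert_urls with ON CONFLICT DO NOTHING: existing
    /// (user_id, url_hash) pairs are silently skipped.
    /// Returns the number of rows actually inserted.
    pub fn insert_urls(&mut self, user_id: u64, urls: &[(String, String)]) -> usize {
        let mut inserted = 0;
        for (_url, hash) in urls {
            if self.rows.insert((user_id, hash.clone())) {
                inserted += 1;
            }
        }
        inserted
    }
}
```

Inserting the same batch twice leaves the store unchanged the second time, and a hash stored for one user is not reported for another, which are the two properties the unique index is meant to guarantee.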

URL normalization before hashing: normalize each URL so that common variations of the same article map to the same hash:

  1. Lowercase the URL
  2. Strip fragment (#...)
  3. Strip trailing slash from path (unless path is just /)
  4. Strip known tracking query parameters: utm_source, utm_medium, utm_campaign, utm_term, utm_content, ref, source, fbclid, gclid
  5. If no query parameters remain after stripping, remove the ? entirely

Add a normalize_article_url(url: &str) -> String utility (in synthesis.rs or a shared util). Use url::Url::parse for reliable parsing.
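
The five steps can be sketched as follows. Note that this version deliberately uses plain string handling from the standard library instead of url::Url::parse, so that it stays dependency-free; it is an approximation for illustration and does not handle cases such as percent-encoding, userinfo, or a '?' inside the authority. The constant name TRACKING_PARAMS is hypothetical.

```rust
/// Tracking parameters listed in the spec (hypothetical constant name).
const TRACKING_PARAMS: [&str; 9] = [
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "ref", "source", "fbclid", "gclid",
];

/// String-based approximation of the normalization steps; a production
/// version would rely on a real URL parser as the spec recommends.
pub fn normalize_article_url(url: &str) -> String {
    // 1. Lowercase the URL.
    let lowered = url.trim().to_lowercase();
    // 2. Strip the fragment.
    let without_fragment = lowered.split('#').next().unwrap_or("");
    // Separate the query string from scheme + authority + path.
    let (base, query) = match without_fragment.split_once('?') {
        Some((b, q)) => (b, Some(q)),
        None => (without_fragment, None),
    };
    // 3. Strip a trailing slash from the path, unless the path is just "/".
    let authority_start = base.find("://").map(|i| i + 3).unwrap_or(0);
    let path_start = base[authority_start..].find('/').map(|i| i + authority_start);
    let mut base = base.to_string();
    if let Some(ps) = path_start {
        if base.len() > ps + 1 && base.ends_with('/') {
            base.pop();
        }
    }
    // 4. Drop known tracking parameters, keeping the others in order.
    let kept: Vec<&str> = query
        .unwrap_or("")
        .split('&')
        .filter(|pair| !pair.is_empty())
        .filter(|pair| {
            let key = pair.split('=').next().unwrap_or("");
            !TRACKING_PARAMS.contains(&key)
        })
        .collect();
    // 5. Omit the '?' entirely when no parameters remain.
    if kept.is_empty() {
        base
    } else {
        format!("{}?{}", base, kept.join("&"))
    }
}
```

For example, "https://Example.com/Blog/Post/?utm_source=x&id=7#top" normalizes to "https://example.com/blog/post?id=7", while "https://example.com/" is left unchanged because its path is just "/".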

URL hashing: sha256(normalize_article_url(url)) — reuse existing crate::util::token::hash_token (which does SHA-256 → hex).

Retry Logic (Phase 1)

When articles are filtered out by history and categories are under-filled:

  1. Collect URLs already fetched in this run (both valid and history-filtered)
  2. Re-scrape source pages, excluding already-fetched URLs
  3. Scrape new candidates → filter empty → filter history
  4. Merge with existing results
  5. Proceed to classification

Only 1 retry attempt. "Under-filled" means the total number of valid scraped articles (after history filtering) is less than categories.len() * max_items_per_category. If still under-filled after retry, Phase 2 fills the gaps.
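
The under-filled check and the single-retry control flow can be sketched as below. The names is_under_filled and collect_with_single_retry are hypothetical, articles are reduced to URL strings, and the `fetch` closure is an assumed stand-in for "scrape sources, filter empty content, filter against history": it receives the URLs to exclude and returns the valid URLs together with the history-filtered ones.

```rust
use std::collections::HashSet;

/// "Under-filled" per the spec: fewer valid articles than
/// categories * max_items_per_category.
pub fn is_under_filled(
    valid_articles: usize,
    categories: usize,
    max_items_per_category: usize,
) -> bool {
    valid_articles < categories * max_items_per_category
}

/// Runs the Phase 1 fetch once, then at most one retry if still below `target`.
/// `fetch` (hypothetical) takes the URLs to exclude and returns
/// (valid_urls, history_filtered_urls).
pub fn collect_with_single_retry<F>(mut fetch: F, target: usize) -> Vec<String>
where
    F: FnMut(&HashSet<String>) -> (Vec<String>, Vec<String>),
{
    let mut fetched: HashSet<String> = HashSet::new();
    let mut valid: Vec<String> = Vec::new();
    // Attempt 0 is the initial pass; attempt 1 is the single allowed retry.
    for _attempt in 0..2 {
        let (new_valid, history_filtered) = fetch(&fetched);
        // History-filtered URLs are excluded from any later attempt.
        fetched.extend(history_filtered);
        // Merge new results, skipping anything already fetched in this run.
        for url in new_valid {
            if fetched.insert(url.clone()) {
                valid.push(url);
            }
        }
        if valid.len() >= target {
            break;
        }
    }
    valid
}
```

Here `target` would be categories.len() * max_items_per_category. If the result is still short after the second pass, the function simply returns what it has, leaving Phase 2 to fill the gaps.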

Files to Modify

  • Create: migration 20260324000015_add_article_history.sql — table + setting column
  • Create: backend/src/db/article_history.rs — check_urls_exist, insert_urls, cleanup_old
  • Modify: backend/src/db/mod.rs — register module
  • Modify: backend/src/models/settings.rs — add article_history_days field
  • Modify: backend/src/db/settings.rs — add to queries
  • Modify: backend/src/services/synthesis.rs — history filtering, retry logic, insert after save, cleanup before generation
  • Modify: frontend/src/types.ts — add setting field
  • Modify: frontend/src/i18n/fr.ts — add label
  • Modify: frontend/src/pages/Settings.tsx — add number input
  • Modify: CLAUDE.md — migration count to 15
  • Modify: e2e/tests/generation-live.spec.ts — update settings payload
  • Modify: backend/tests/api_syntheses_test.rs — update settings payload
  • Add: unit tests in synthesis.rs — history filtering logic

What Does NOT Change

  • dedup_by_url — still handles within-synthesis dedup (fast, in-memory)
  • seen_urls HashSet — still handles cross-phase dedup within one generation
  • source_diversity_window — still avoids domains across syntheses (complementary)
  • Frontend synthesis display — no changes
  • Scraper — no changes
  • LLM providers — no changes