ai_synth/docs/superpowers/plans/2026-03-24-article-history.md


Article History — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Prevent duplicate articles across syntheses by maintaining a persistent per-user article URL history with configurable TTL.

Architecture: A new article_history table stores SHA-256 hashes of normalized article URLs per user. The pipeline filters candidates against this history before classification, inserts the URLs of a synthesis once it has been saved, and deletes entries older than the configured TTL before each generation.

Tech Stack: Rust (sqlx, sha2, url crate), PostgreSQL

Spec: docs/superpowers/specs/2026-03-24-article-history-design.md
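
The flow described under Architecture can be modelled without a database. The sketch below is an illustration only: ArticleHistory, its methods, and the day-number clock are hypothetical names introduced here, and plain strings stand in for the hashed, normalized URLs that the real table stores.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the article_history table of a single user.
/// Maps a URL key to the day number on which it was recorded.
/// (Assumption: the real table keys on a hash of the normalized URL.)
struct ArticleHistory {
    entries: HashMap<String, u32>,
}

impl ArticleHistory {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Drop entries older than `ttl_days` and return how many were removed.
    fn cleanup_old(&mut self, today: u32, ttl_days: u32) -> usize {
        let before = self.entries.len();
        self.entries
            .retain(|_, recorded| today.saturating_sub(*recorded) <= ttl_days);
        before - self.entries.len()
    }

    /// Keep only the candidates that are not yet in history.
    fn filter_new(&self, candidates: Vec<String>) -> Vec<String> {
        candidates
            .into_iter()
            .filter(|url| !self.entries.contains_key(url))
            .collect()
    }

    /// Record the URLs of a saved synthesis; existing entries keep their date,
    /// which mirrors ON CONFLICT DO NOTHING.
    fn record(&mut self, urls: &[String], today: u32) {
        for url in urls {
            self.entries.entry(url.clone()).or_insert(today);
        }
    }
}
```

A generation run would call cleanup_old first, then filter_new on the candidates, and record only after the synthesis has been saved.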


Task 1: Migration + settings field

Files:

  • Create: backend/migrations/20260324000015_add_article_history.sql

  • Modify: backend/src/models/settings.rs

  • Modify: backend/src/db/settings.rs

  • Modify: backend/src/services/prompts.rs (test fixture)

  • Modify: CLAUDE.md

  • Step 1: Create migration

-- Article history table for cross-synthesis URL deduplication
CREATE TABLE article_history (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id     UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    url_hash    TEXT NOT NULL,
    url         TEXT NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX idx_article_history_user_url ON article_history(user_id, url_hash);

-- Setting for history TTL
ALTER TABLE settings ADD COLUMN article_history_days INTEGER NOT NULL DEFAULT 90;
  • Step 2: Add article_history_days to settings model

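Add pub article_history_days: i32 to UserSettings, SettingsResponse, and UpdateSettingsRequest. Add it to the From impl and to Default (90), and add the validation below. A value of 0 is valid and disables history filtering, since the pipeline only acts when the value is greater than 0: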

if !(0..=365).contains(&self.article_history_days) {
    return Err("article_history_days must be between 0 and 365".into());
}
  • Step 3: Add to DB queries

Add the field to SettingsRow, to its TryFrom impl, and to both SQL queries in db/settings.rs.

  • Step 4: Update test fixtures

Add article_history_days: 90 to valid_request() in settings tests and test_settings() in prompts tests.

  • Step 5: Update CLAUDE.md migration count to 15

  • Step 6: Run tests + commit

cd backend && cargo test --lib
git add backend/migrations/20260324000015_add_article_history.sql backend/src/models/settings.rs backend/src/db/settings.rs backend/src/services/prompts.rs CLAUDE.md
git commit -m "feat: add article_history table and article_history_days setting"
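
The range check from Step 2 can be exercised in isolation. This is a minimal sketch, not the project's code: validate_article_history_days and history_enabled are hypothetical free functions, whereas the plan places the check inside the request type's validation method.

```rust
/// Hypothetical free-function version of the range check from Step 2.
fn validate_article_history_days(days: i32) -> Result<(), String> {
    if !(0..=365).contains(&days) {
        return Err("article_history_days must be between 0 and 365".into());
    }
    Ok(())
}

/// History is only consulted when the setting is positive, mirroring the
/// `article_history_days > 0` guards used in the pipeline (0 turns it off).
fn history_enabled(days: i32) -> bool {
    days > 0
}
```

Both bounds are inclusive, so 0 and 365 pass validation while -1 and 366 are rejected.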

Task 2: DB module for article history

Files:

  • Create: backend/src/db/article_history.rs

  • Modify: backend/src/db/mod.rs

  • Step 1: Create backend/src/db/article_history.rs

//! Article history: tracks which article URLs have been used in past syntheses.
//!
//! Prevents the same article from appearing in multiple syntheses.

use std::collections::HashSet;
use sqlx::PgPool;
use uuid::Uuid;
use crate::errors::AppError;

/// Check which URL hashes already exist in history for this user.
///
/// Returns the set of url_hashes that were found (i.e., already used).
pub async fn check_urls_exist(
    pool: &PgPool,
    user_id: Uuid,
    url_hashes: &[String],
) -> Result<HashSet<String>, AppError> {
    if url_hashes.is_empty() {
        return Ok(HashSet::new());
    }

    let rows = sqlx::query_scalar::<_, String>(
        "SELECT url_hash FROM article_history WHERE user_id = $1 AND url_hash = ANY($2)",
    )
    .bind(user_id)
    .bind(url_hashes)
    .fetch_all(pool)
    .await?;

    Ok(rows.into_iter().collect())
}

/// Insert article URLs into history, one INSERT statement per URL.
///
/// Uses ON CONFLICT DO NOTHING to silently skip duplicates.
pub async fn insert_urls(
    pool: &PgPool,
    user_id: Uuid,
    urls: &[(String, String)], // slice of (url, url_hash) pairs
) -> Result<(), AppError> {
    if urls.is_empty() {
        return Ok(());
    }

    for (url, url_hash) in urls {
        sqlx::query(
            "INSERT INTO article_history (user_id, url_hash, url) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING",
        )
        .bind(user_id)
        .bind(url_hash)
        .bind(url)
        .execute(pool)
        .await?;
    }

    Ok(())
}

/// Delete history entries older than N days for this user.
///
/// Returns the number of deleted rows.
pub async fn cleanup_old(
    pool: &PgPool,
    user_id: Uuid,
    days: i32,
) -> Result<u64, AppError> {
    let result = sqlx::query(
        "DELETE FROM article_history WHERE user_id = $1 AND created_at < now() - make_interval(days => $2)",
    )
    .bind(user_id)
    .bind(days)
    .execute(pool)
    .await?;

    Ok(result.rows_affected())
}
  • Step 2: Register module in db/mod.rs

Add pub mod article_history; (alphabetical order — after api_keys).

  • Step 3: Run tests + commit
cd backend && cargo test --lib && cargo build
git add backend/src/db/article_history.rs backend/src/db/mod.rs
git commit -m "feat: add article_history DB module (check, insert, cleanup)"

Task 3: URL normalization utility + unit tests

Files:

  • Modify: backend/src/services/synthesis.rs

  • Step 1: Add normalize_article_url function

Add near the other URL helper functions (near extract_domain):

/// Normalize an article URL for consistent history hashing.
///
/// Strips fragments, trailing slashes, and known tracking query parameters
/// so that the same article with different UTM tags is recognized as a duplicate.
fn normalize_article_url(url_str: &str) -> String {
    let Ok(mut parsed) = url::Url::parse(url_str) else {
        return url_str.to_lowercase();
    };

    // Strip fragment
    parsed.set_fragment(None);

    // Strip known tracking query parameters
    let tracking_params: &[&str] = &[
        "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
        "ref", "source", "fbclid", "gclid",
    ];

    let filtered_pairs: Vec<(String, String)> = parsed
        .query_pairs()
        .filter(|(key, _)| !tracking_params.contains(&key.as_ref()))
        .map(|(k, v)| (k.into_owned(), v.into_owned()))
        .collect();

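    if filtered_pairs.is_empty() {
        parsed.set_query(None);
    } else {
        // query_pairs() yields decoded strings, so re-serialize through the
        // url crate to percent-encode '&', '=' and spaces in keys and values.
        let mut serializer = parsed.query_pairs_mut();
        serializer.clear();
        for (k, v) in &filtered_pairs {
            serializer.append_pair(k, v);
        }
    }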

    // Strip trailing slash (unless path is just "/")
    let path = parsed.path().to_string();
    if path.len() > 1 && path.ends_with('/') {
        parsed.set_path(&path[..path.len() - 1]);
    }

    parsed.to_string().to_lowercase()
}

/// Compute the hash of a normalized article URL for history lookup.
fn hash_article_url(url: &str) -> String {
    let normalized = normalize_article_url(url);
    crate::util::token::hash_token(&normalized)
}
  • Step 2: Add unit tests
    // ── normalize_article_url tests ─────────────────────────────

    #[test]
    fn normalize_strips_fragment() {
        assert_eq!(
            normalize_article_url("https://example.com/article#section"),
            "https://example.com/article"
        );
    }

    #[test]
    fn normalize_strips_utm_params() {
        assert_eq!(
            normalize_article_url("https://example.com/article?utm_source=twitter&utm_medium=social"),
            "https://example.com/article"
        );
    }

    #[test]
    fn normalize_keeps_non_tracking_params() {
        let result = normalize_article_url("https://example.com/search?q=test&utm_source=twitter");
        assert!(result.contains("q=test"));
        assert!(!result.contains("utm_source"));
    }

    #[test]
    fn normalize_strips_trailing_slash() {
        assert_eq!(
            normalize_article_url("https://example.com/article/"),
            "https://example.com/article"
        );
    }

    #[test]
    fn normalize_keeps_root_slash() {
        assert_eq!(
            normalize_article_url("https://example.com/"),
            "https://example.com/"
        );
    }

    #[test]
    fn normalize_lowercases() {
        assert_eq!(
            normalize_article_url("https://Example.COM/Article"),
            "https://example.com/article" // entire URL lowercased for consistent hashing
        );
    }

    #[test]
    fn normalize_handles_invalid_url() {
        let result = normalize_article_url("not a url at all");
        assert_eq!(result, "not a url at all");
    }

    #[test]
    fn normalize_strips_fbclid() {
        let result = normalize_article_url("https://example.com/post?fbclid=abc123");
        assert!(!result.contains("fbclid"));
        assert!(!result.contains("?"));
    }

    #[test]
    fn hash_article_url_deterministic() {
        let h1 = hash_article_url("https://example.com/article?utm_source=twitter");
        let h2 = hash_article_url("https://example.com/article?utm_source=newsletter");
        assert_eq!(h1, h2, "Same article with different UTM params should hash the same");
    }
  • Step 3: Run tests + commit
cd backend && cargo test --lib
git add backend/src/services/synthesis.rs
git commit -m "feat: add normalize_article_url and hash_article_url utilities"
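
As an alternative illustration of the tracking-parameter step, the filter can also run on the raw, still percent-encoded query string, which avoids decoding and re-encoding values. This is a std-only sketch and not the url-crate approach used in Step 1; strip_tracking_params and TRACKING_PARAMS are hypothetical names, and keys are compared in their raw form, so a percent-encoded key would not be matched.

```rust
/// Tracking parameters listed in Step 1.
const TRACKING_PARAMS: &[&str] = &[
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "ref", "source", "fbclid", "gclid",
];

/// Remove tracking parameters from a raw query string (without the leading '?').
/// Returns None when nothing is left, so the caller can drop the '?' entirely.
fn strip_tracking_params(raw_query: &str) -> Option<String> {
    let kept: Vec<&str> = raw_query
        .split('&')
        .filter(|pair| !pair.is_empty())
        .filter(|pair| {
            // The key is everything before the first '='.
            let key = pair.split('=').next().unwrap_or("");
            !TRACKING_PARAMS.contains(&key)
        })
        .collect();

    if kept.is_empty() {
        None
    } else {
        Some(kept.join("&"))
    }
}
```

The remaining parameters keep their original order and encoding, so two URLs that differ only in tracking tags reduce to the same query string.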

Task 4: Pipeline integration — history filtering, insert, cleanup

Files:

  • Modify: backend/src/services/synthesis.rs

This is the core integration task. Changes are in run_generation_inner.

  • Step 1: Add cleanup at the start of generation

After loading settings (around line 259), add:

    // Cleanup old article history entries
    if settings.article_history_days > 0 {
        let deleted = db::article_history::cleanup_old(
            &state.pool,
            user_id,
            settings.article_history_days,
        )
        .await
        .unwrap_or(0);
        if deleted > 0 {
            tracing::info!(deleted = deleted, "Cleaned up old article history entries");
        }
    }
  • Step 2: Add history filtering in Phase 1

After filtering empty content (around line 376, after let valid_articles = ...), add history filtering:

            // 1d. Filter against article history (cross-synthesis dedup)
            let valid_articles = if settings.article_history_days > 0 {
                let hashes: Vec<String> = valid_articles.iter().map(|a| hash_article_url(&a.url)).collect();
                let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes)
                    .await
                    .unwrap_or_default();
                if !existing.is_empty() {
                    tracing::info!(filtered = existing.len(), "Phase 1: filtered articles already in history");
                }
                valid_articles
                    .into_iter()
                    .filter(|a| !existing.contains(&hash_article_url(&a.url)))
                    .collect::<Vec<_>>()
            } else {
                valid_articles
            };
  • Step 3: Add Phase 1 retry logic when under-filled

After the history filtering in Phase 1 (Step 2), check if we have enough articles. If under-filled, do one retry with the same sources, excluding already-fetched URLs:

            // 1e. Retry if under-filled (1 attempt)
            let target = settings.categories.len() * settings.max_items_per_category as usize;
            if valid_articles.len() < target && settings.article_history_days > 0 {
                tracing::info!(
                    have = valid_articles.len(),
                    need = target,
                    "Phase 1 under-filled after history filter, retrying with same sources"
                );

                // Collect all URLs already fetched (valid + filtered)
                let mut already_fetched: std::collections::HashSet<String> = candidate_urls
                    .iter()
                    .map(|u| u.to_lowercase())
                    .collect();

                // Re-scrape source pages for new links
                let mut retry_urls: Vec<String> = Vec::new();
                for source in sources.iter().take(max_sources) {
                    let links = if settings.use_llm_for_source_links {
                        source_scraper::extract_article_links_with_llm(
                            &state.http_client, &source.url, max_links_per_source,
                            &provider, &model_research,
                        ).await
                    } else {
                        source_scraper::extract_article_links(
                            &state.http_client, &source.url, max_links_per_source,
                        ).await
                    };
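                    if let Ok(links) = links {
                        for link in links {
                            // insert() returns false for URLs already seen, which also
                            // deduplicates links offered by several sources.
                            if already_fetched.insert(link.to_lowercase()) {
                                retry_urls.push(link);
                            }
                        }
                    }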
                }

                if !retry_urls.is_empty() {
                    // Scrape retry candidates
                    let retry_scraped = scrape_flat_urls(
                        state, &retry_urls, settings.max_age_days as i64, tx,
                        llm_for_scraping.clone(),
                    ).await;
                    let retry_valid: Vec<ScrapedNewsItem> = retry_scraped
                        .into_iter()
                        .filter(|a| !a.scraped_content.trim().is_empty())
                        .collect();

                    // Filter against history
                    let retry_valid = if !retry_valid.is_empty() {
                        let hashes: Vec<String> = retry_valid.iter().map(|a| hash_article_url(&a.url)).collect();
                        let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes)
                            .await.unwrap_or_default();
                        retry_valid.into_iter()
                            .filter(|a| !existing.contains(&hash_article_url(&a.url)))
                            .collect::<Vec<_>>()
                    } else {
                        retry_valid
                    };

                    // Merge with existing valid articles
                    valid_articles.extend(retry_valid);
                    tracing::info!(total = valid_articles.len(), "Phase 1 after retry");
                }
            }

Note: for valid_articles.extend(...) to compile, the binding introduced by the history filter in Step 2 must be declared as let mut valid_articles.

  • Step 4: Add history filtering in Phase 2 (before scraping)

In Phase 2, after dedup_by_url and limit_articles_per_source (around line 552), before scrape_articles, add:

        // Filter against article history BEFORE scraping (saves HTTP requests)
        let parsed = if settings.article_history_days > 0 {
            let all_urls: Vec<String> = parsed.iter()
                .flat_map(|(_, items)| items.iter().map(|i| i.url.clone()))
                .collect();
            let hashes: Vec<String> = all_urls.iter().map(|u| hash_article_url(u)).collect();
            let existing = db::article_history::check_urls_exist(&state.pool, user_id, &hashes)
                .await
                .unwrap_or_default();
            if !existing.is_empty() {
                tracing::info!(filtered = existing.len(), "Phase 2: filtered articles already in history");
            }
            parsed
                .into_iter()
                .map(|(cat_key, items)| {
                    let filtered = items
                        .into_iter()
                        .filter(|item| !existing.contains(&hash_article_url(&item.url)))
                        .collect();
                    (cat_key, filtered)
                })
                .collect()
        } else {
            parsed
        };
  • Step 5: Insert article URLs after saving synthesis

After db::syntheses::create (around line 638), add:

    // Record article URLs in history for cross-synthesis dedup
    if settings.article_history_days > 0 {
        let article_urls: Vec<(String, String)> = final_sections
            .iter()
            .flat_map(|section| section.items.iter())
            .map(|item| (item.url.clone(), hash_article_url(&item.url)))
            .collect();
        db::article_history::insert_urls(&state.pool, user_id, &article_urls)
            .await
            .ok(); // Don't fail synthesis if history insert fails
    }
  • Step 6: Run tests + commit
cd backend && cargo test --lib && cargo build
git add backend/src/services/synthesis.rs
git commit -m "feat: article history filtering in pipeline — cleanup, Phase 1/2 filter, retry, insert after save"
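
The under-fill check and the link selection of the retry step can be tested apart from the scraping code. The sketch below assumes two hypothetical helpers, target_article_count and collect_retry_urls, which are not part of the plan; they only restate the target formula and the case-insensitive comparison used in Step 3.

```rust
use std::collections::HashSet;

/// Number of articles Phase 1 aims for: a full set for every category.
fn target_article_count(category_count: usize, max_items_per_category: usize) -> usize {
    category_count * max_items_per_category
}

/// Select links that have not been fetched before, comparing case-insensitively.
/// Every accepted link is added to `already_fetched`, so a link offered by
/// two sources is queued only once.
fn collect_retry_urls(
    already_fetched: &mut HashSet<String>,
    links: Vec<String>,
) -> Vec<String> {
    links
        .into_iter()
        // insert() returns true only when the lowercased URL was not yet present.
        .filter(|link| already_fetched.insert(link.to_lowercase()))
        .collect()
}
```

With 4 categories and 5 items per category the target is 20 articles; a retry is attempted only while the filtered list is shorter than that.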

Task 5: Frontend setting

Files:

  • Modify: frontend/src/types.ts

  • Modify: frontend/src/i18n/fr.ts

  • Modify: frontend/src/pages/Settings.tsx

  • Step 1: Add field to types + DEFAULT_SETTINGS

// In UserSettings:
article_history_days: number;

// In DEFAULT_SETTINGS:
article_history_days: 90,
  • Step 2: Add i18n label
'settings.articleHistoryDays': 'Historique des articles (jours)',
  • Step 3: Add number input to Settings page

Add inside the generation settings grid (alongside the other number inputs):

            <div>
              <label
                for="articleHistoryDays"
                class="block text-sm font-medium text-gray-700"
              >
                {t('settings.articleHistoryDays')}
              </label>
              <div class="mt-1">
                <input
                  type="number"
                  id="articleHistoryDays"
                  min="0"
                  max="365"
                  class="shadow-sm focus:ring-indigo-500 focus:border-indigo-500 block w-full sm:text-sm border-gray-300 rounded-md py-2 px-3 border"
                  value={settings().article_history_days}
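                  onInput={(e) => {
                    // Parse explicitly so that 0 (history disabled) is kept;
                    // only fall back to 90 when the input is not a number.
                    const parsed = parseInt(e.currentTarget.value, 10);
                    setSettings((prev) => ({
                      ...prev,
                      article_history_days: Number.isNaN(parsed) ? 90 : parsed,
                    }));
                  }}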
                />
              </div>
            </div>
  • Step 4: Run frontend tests + commit
cd frontend && npx tsc --noEmit && npx vitest run
git add frontend/src/types.ts frontend/src/i18n/fr.ts frontend/src/pages/Settings.tsx
git commit -m "feat: add article_history_days setting to frontend"

Task 6: Update E2E and integration tests

Files:

  • Modify: e2e/tests/generation-live.spec.ts

  • Modify: backend/tests/api_syntheses_test.rs

  • Step 1: Update E2E settings payload

Add article_history_days: 90 to the PUT settings body.

  • Step 2: Update integration test settings payload

In api_syntheses_test.rs, add "article_history_days": 90 to the PUT settings body.

  • Step 3: Run E2E test to verify
cd e2e && docker compose -f docker-compose.test.yml down
docker compose -f docker-compose.test.yml up --build -d
sleep 25 && npx tsx seed.ts && npx playwright test generation-live --reporter=list
  • Step 4: Commit
git add e2e/tests/generation-live.spec.ts backend/tests/api_syntheses_test.rs
git commit -m "test: update E2E and integration tests with article_history_days setting"