Design: Article Tracing — Track Origin and Status of Every Article Candidate

Date: 2026-03-24
Scope: Enrich article_history with provenance metadata, track dropped articles, add frontend viewers

Context

The synthesis pipeline drops articles at various stages (history dedup, empty content, too old, source diversity) but doesn't record why. When the output quality is low (generic links, archives instead of articles), there's no way to diagnose which step failed. Users need visibility into the full candidate pipeline to improve their sources and settings.

Approach

Enrich the existing article_history table with tracing metadata. Insert dropped articles immediately at each filtering step. Add two frontend views: a global history page and a per-synthesis provenance section.

Enriched `article_history` Table

New columns via migration:

ALTER TABLE article_history ADD COLUMN title TEXT NOT NULL DEFAULT '';
ALTER TABLE article_history ADD COLUMN source_type TEXT NOT NULL DEFAULT 'unknown';
ALTER TABLE article_history ADD COLUMN source_url TEXT;
ALTER TABLE article_history ADD COLUMN category TEXT;
ALTER TABLE article_history ADD COLUMN synthesis_id UUID REFERENCES syntheses(id) ON DELETE SET NULL;
ALTER TABLE article_history ADD COLUMN status TEXT NOT NULL DEFAULT 'used';
ALTER TABLE article_history ADD COLUMN scraped_ok BOOLEAN NOT NULL DEFAULT true;
ALTER TABLE article_history ADD COLUMN job_id UUID NOT NULL DEFAULT gen_random_uuid();

-- Drop unique index — table is now a trace log, same URL can appear in multiple runs
DROP INDEX idx_article_history_user_url;
CREATE INDEX idx_article_history_user_url ON article_history(user_id, url_hash);
CREATE INDEX idx_article_history_job_id ON article_history(job_id);

-- Store job_id on syntheses for direct provenance lookup
ALTER TABLE syntheses ADD COLUMN job_id UUID;

source_type values: personalized_source, web_search, overflow

status values: used, filtered_history, filtered_empty, filtered_old, filtered_diversity, filtered_homepage, filtered_duplicate, filtered_cross_phase_dedup

synthesis_id: nullable — dropped articles don't belong to a synthesis. ON DELETE SET NULL preserves history.
job_id: the generation pipeline's UUID — links both used and dropped articles from the same generation run. Enables the provenance view to show all candidates, not just used ones.
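The status values listed above could be modeled as a closed enum so that a typo in a filtering step fails at compile time instead of producing an unknown string in the table. This is only a sketch: the design stores status as TEXT and the structs in this document use String, so the `ArticleStatus` type, its `ALL` constant, and the `parse` and `is_dropped` helpers are hypothetical names, not part of the existing code.

```rust
/// Hypothetical typed view of the `status` column. The string values are
/// the ones listed in this design; the enum itself is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArticleStatus {
    Used,
    FilteredHistory,
    FilteredEmpty,
    FilteredOld,
    FilteredDiversity,
    FilteredHomepage,
    FilteredDuplicate,
    FilteredCrossPhaseDedup,
}

impl ArticleStatus {
    /// Every variant, used by `parse` to avoid a second hand-written match.
    pub const ALL: [ArticleStatus; 8] = [
        ArticleStatus::Used,
        ArticleStatus::FilteredHistory,
        ArticleStatus::FilteredEmpty,
        ArticleStatus::FilteredOld,
        ArticleStatus::FilteredDiversity,
        ArticleStatus::FilteredHomepage,
        ArticleStatus::FilteredDuplicate,
        ArticleStatus::FilteredCrossPhaseDedup,
    ];

    /// The exact string stored in `article_history.status`.
    pub fn as_str(self) -> &'static str {
        match self {
            ArticleStatus::Used => "used",
            ArticleStatus::FilteredHistory => "filtered_history",
            ArticleStatus::FilteredEmpty => "filtered_empty",
            ArticleStatus::FilteredOld => "filtered_old",
            ArticleStatus::FilteredDiversity => "filtered_diversity",
            ArticleStatus::FilteredHomepage => "filtered_homepage",
            ArticleStatus::FilteredDuplicate => "filtered_duplicate",
            ArticleStatus::FilteredCrossPhaseDedup => "filtered_cross_phase_dedup",
        }
    }

    /// Parses a stored or query-string value; unknown strings yield `None`.
    pub fn parse(s: &str) -> Option<Self> {
        Self::ALL.iter().copied().find(|status| status.as_str() == s)
    }

    /// True for every status except `used`.
    pub fn is_dropped(self) -> bool {
        self != ArticleStatus::Used
    }
}
```

A handler could call `ArticleStatus::parse` on the `status` query parameter and reject unknown values before they reach the database.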

Pipeline Integration

Insert dropped articles immediately at each filtering step. Insert used articles only after the synthesis is saved, so they carry the real synthesis_id.

Insertion points (in pipeline order):

Phase 1 (personalized sources):

1. Scrape: after scrape_flat_urls + filter_empty, empty articles → status: filtered_empty, scraped_ok: false, source_type: personalized_source
2. History filter: articles matching existing history → status: filtered_history
3. Retry scrape (if under-filled): same as steps 1-2 for retry candidates
4. Classification overflow: articles that overflow both target category and "Autre" → status: filtered_diversity (tracked via overflow vec)
5. Source diversity: articles pruned by max_articles_per_source → status: filtered_diversity

Phase 2 (web search):

6. Homepage filter: articles with homepage URLs → status: filtered_homepage, source_type: web_search
7. Cross-phase dedup: Phase 2 URLs already seen in Phase 1 → status: filtered_cross_phase_dedup
8. Dedup by URL: duplicate URLs within Phase 2 → status: filtered_duplicate
9. Source diversity: limit_articles_per_source drops → status: filtered_diversity
10. History filter: before scraping → status: filtered_history
11. Scrape: after scraping, empty articles → status: filtered_empty, scraped_ok: false

After save:

12. Used articles: insert with status: used, synthesis_id set, scraped_ok: true

All inserts include job_id from the pipeline's _job_id parameter (rename to job_id — remove underscore prefix).
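The pattern shared by all insertion points is "filter as before, and record whatever the filter drops". Below is a minimal sketch for one step, the Phase 2 dedup by URL. `TraceRow` and `dedup_with_trace` are hypothetical names; the trace is collected into an in-memory Vec and job_id is a plain integer, where the real pipeline would call insert_entry with the job's UUID.

```rust
use std::collections::HashSet;

/// Simplified stand-in for a trace entry (assumption: the real entry
/// carries the full set of article_history columns).
#[derive(Debug, Clone, PartialEq)]
pub struct TraceRow {
    pub url: String,
    pub status: &'static str,
    pub source_type: &'static str,
    pub job_id: u64, // placeholder for the pipeline's UUID
}

/// Keeps the first occurrence of each URL, in input order, and records
/// every later duplicate as a dropped candidate.
pub fn dedup_with_trace(
    urls: Vec<String>,
    job_id: u64,
    trace: &mut Vec<TraceRow>,
) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut kept = Vec::new();
    for url in urls {
        if seen.insert(url.clone()) {
            // First time this URL appears: it stays in the pipeline.
            kept.push(url);
        } else {
            // Already seen in this phase: drop it, but leave a trace.
            trace.push(TraceRow {
                url,
                status: "filtered_duplicate",
                source_type: "web_search",
                job_id,
            });
        }
    }
    kept
}
```

The filtering result is identical to a plain dedup; the only addition is the side channel of dropped rows, which matches the intent that the pipeline logic itself stays unchanged.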

source_url tracking: In Phase 1, thread the source page URL through candidate_urls as Vec<(String, String)> (article_url, source_url) and update scrape_flat_urls to accept pairs. In Phase 2, source_url is NULL. This requires modifying ScrapedNewsItem to add an optional source_url: Option<String> field.
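A rough sketch of how the source page URL might travel with each candidate. The `ScrapedNewsItem` here is a cut-down stand-in with only the fields needed for the illustration, and `flatten_candidates`, `phase1_item`, and `phase2_item` are hypothetical helpers; only the `(article_url, source_url)` pair shape and the `source_url: Option<String>` field come from the design.

```rust
/// Cut-down stand-in for the real ScrapedNewsItem (assumption: the real
/// struct has more fields; only source_url is new in this design).
#[derive(Debug, Clone, PartialEq)]
pub struct ScrapedNewsItem {
    pub url: String,
    pub title: String,
    pub source_url: Option<String>,
}

/// Flattens (source page, article URLs found on it) into
/// (article_url, source_url) pairs, the shape proposed for candidate_urls.
pub fn flatten_candidates(sources: &[(String, Vec<String>)]) -> Vec<(String, String)> {
    sources
        .iter()
        .flat_map(|(source_url, articles)| {
            articles
                .iter()
                .map(move |article_url| (article_url.clone(), source_url.clone()))
        })
        .collect()
}

/// Phase 1: the item remembers which source page it was discovered on.
pub fn phase1_item(pair: &(String, String), title: &str) -> ScrapedNewsItem {
    ScrapedNewsItem {
        url: pair.0.clone(),
        title: title.to_string(),
        source_url: Some(pair.1.clone()),
    }
}

/// Phase 2: web search results have no source page, so source_url is None
/// and is stored as NULL.
pub fn phase2_item(url: &str, title: &str) -> ScrapedNewsItem {
    ScrapedNewsItem {
        url: url.to_string(),
        title: title.to_string(),
        source_url: None,
    }
}
```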

Provenance lookup: Store job_id directly on the syntheses table (set during save). The provenance endpoint does a single query: SELECT * FROM article_history WHERE job_id = (SELECT job_id FROM syntheses WHERE id = $1).

Cleanup: cleanup_old only deletes entries where synthesis_id IS NULL (dropped articles). Used articles linked to a synthesis are kept until that synthesis is deleted: ON DELETE SET NULL then clears their synthesis_id, and the next cleanup run removes them. This preserves provenance for existing syntheses.
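The cleanup rule can be read as a predicate over rows. The sketch below models it in memory to show the interaction with ON DELETE SET NULL; it is not the actual SQL. `HistoryRow`, `is_cleanup_candidate`, and `on_synthesis_deleted` are hypothetical, IDs are integers instead of UUIDs, and the retention window is a parameter because this design does not state its value.

```rust
/// In-memory stand-in for an article_history row (assumption: integer IDs
/// and a precomputed age replace the UUID and created_at columns).
#[derive(Debug, Clone)]
pub struct HistoryRow {
    pub id: u32,
    pub synthesis_id: Option<u32>,
    pub age_days: u32,
}

/// A row may be deleted only if it is older than the retention window
/// AND is not linked to a synthesis.
pub fn is_cleanup_candidate(row: &HistoryRow, retention_days: u32) -> bool {
    row.synthesis_id.is_none() && row.age_days > retention_days
}

/// Models what ON DELETE SET NULL does when a synthesis is deleted:
/// linked rows survive, but lose their synthesis_id.
pub fn on_synthesis_deleted(rows: &mut [HistoryRow], synthesis_id: u32) {
    for row in rows.iter_mut() {
        if row.synthesis_id == Some(synthesis_id) {
            row.synthesis_id = None;
        }
    }
}
```

With a 30-day window, a 40-day-old used row is protected while its synthesis exists and becomes a cleanup candidate as soon as the synthesis is deleted.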

DB Module Updates

db/article_history.rs needs:

Updated insert function:

pub struct ArticleHistoryEntry {
    pub user_id: Uuid,
    pub url: String,
    pub url_hash: String,
    pub title: String,
    pub source_type: String,
    pub source_url: Option<String>,
    pub category: Option<String>,
    pub synthesis_id: Option<Uuid>,
    pub status: String,
    pub scraped_ok: bool,
    pub job_id: Uuid,
}

pub async fn insert_entry(pool: &PgPool, entry: &ArticleHistoryEntry) -> Result<(), AppError>

Query functions:

pub async fn list_history(
    pool: &PgPool, user_id: Uuid,
    limit: i64, offset: i64,
    status_filter: Option<&str>,
    source_type_filter: Option<&str>,
    synthesis_id_filter: Option<Uuid>,
) -> Result<Vec<ArticleHistoryRow>, AppError>

pub async fn list_by_job_id(
    pool: &PgPool, user_id: Uuid, job_id: Uuid,
) -> Result<Vec<ArticleHistoryRow>, AppError>

pub async fn count_history(
    pool: &PgPool, user_id: Uuid,
    status_filter: Option<&str>,
    source_type_filter: Option<&str>,
) -> Result<i64, AppError>

The existing check_urls_exist and cleanup_old stay; they support the dedup logic. The new insert_entry is for tracing. It also replaces the old insert_urls (the batch insert of used articles after save): those rows are now written by insert_entry calls with status: used and synthesis_id set.
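Since every filter of list_history is optional, the WHERE clause has to be assembled at run time. One possible way to do that while keeping all values as bind parameters is sketched below. `build_list_query` is a hypothetical helper, the `ORDER BY created_at DESC` ordering is an assumption, and filter values are shown as strings; a real implementation would bind typed values (for example a UUID for synthesis_id) through the database driver.

```rust
/// Builds the SQL text for a filtered, paginated listing.
/// Returns the query and the filter values in bind order. `$1` is always
/// user_id; LIMIT and OFFSET take the two placeholders after the filters.
pub fn build_list_query(
    status: Option<&str>,
    source_type: Option<&str>,
    synthesis_id: Option<&str>,
) -> (String, Vec<String>) {
    let mut sql = String::from("SELECT * FROM article_history WHERE user_id = $1");
    let mut binds: Vec<String> = Vec::new();

    // Column names are fixed literals, never user input, so only the
    // values need placeholders.
    let filters = [
        ("status", status),
        ("source_type", source_type),
        ("synthesis_id", synthesis_id),
    ];
    for (column, value) in filters {
        if let Some(v) = value {
            binds.push(v.to_string());
            // +1 because $1 is reserved for user_id.
            sql.push_str(&format!(" AND {} = ${}", column, binds.len() + 1));
        }
    }

    let limit_idx = binds.len() + 2;
    sql.push_str(&format!(
        " ORDER BY created_at DESC LIMIT ${} OFFSET ${}",
        limit_idx,
        limit_idx + 1
    ));
    (sql, binds)
}
```

The same builder, minus the ORDER BY and LIMIT/OFFSET tail and with `SELECT COUNT(*)`, would give count_history a matching filter set, so `total` always agrees with the listed rows.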

API Endpoints

GET /api/v1/article-history?limit=50&offset=0&status=&source_type=&synthesis_id=

Returns paginated history with optional filters. Response:

{
  "items": [
    {
      "id": "uuid",
      "url": "https://...",
      "title": "Article title",
      "source_type": "personalized_source",
      "source_url": "https://source-page.com/blog",
      "category": "AI News",
      "synthesis_id": "uuid or null",
      "status": "used",
      "scraped_ok": true,
      "job_id": "uuid",
      "created_at": "2026-03-24T..."
    }
  ],
  "total": 150
}
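The `total` field lets a client derive the page count without a second request. A small sketch of the arithmetic, assuming 1-based page numbers; `page_offset` and `total_pages` are hypothetical helpers, and unsigned integers are used here for brevity although the DB functions in this design take i64.

```rust
/// Offset to send for a 1-based page number. Page 0 is treated as page 1.
pub fn page_offset(page: u64, limit: u64) -> u64 {
    page.saturating_sub(1) * limit
}

/// Number of pages needed to show `total` items, `limit` per page
/// (ceiling division). A zero limit yields zero pages.
pub fn total_pages(total: u64, limit: u64) -> u64 {
    if limit == 0 {
        0
    } else {
        (total + limit - 1) / limit
    }
}
```

For the example response above, `total_pages(150, 50)` is 3, and the last page is requested with `offset=100`.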

GET /api/v1/syntheses/:id/provenance

Returns all article_history entries for the generation run that produced this synthesis. Reads the job_id stored on the syntheses row, then returns every entry with that job_id, used and dropped alike. Response: an array of entries, each with the same fields as the items above.

Frontend Views

Settings — "Historique des articles" page

New page at /article-history, accessible via button in Settings
Table: Date, Title, URL (truncated, clickable), Source Type, Status, Category, Synthesis (link)
Filters: dropdown for status, dropdown for source_type
Pagination (50 per page)
Color-coded badges: green for status used, red for any filtered_* status, grey for source_type overflow

Synthesis detail — "Provenance" section

Collapsible section at the bottom of SynthesisDetail.tsx
Same table but pre-filtered to the generation run's job_id
Shows all candidates: used + dropped, so the user sees the full funnel
Grouped or sortable by status to quickly see what was dropped and why
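The "full funnel" amounts to a count of entries per status for one job_id. A sketch of that grouping follows; `funnel_counts` is a hypothetical helper, shown in Rust for consistency with the backend code in this document, although the same grouping could equally be done in the frontend over the provenance response.

```rust
use std::collections::BTreeMap;

/// Counts entries per status. BTreeMap keeps the statuses in a stable,
/// alphabetical order, which is convenient for rendering.
pub fn funnel_counts<'a>(statuses: &[&'a str]) -> BTreeMap<&'a str, usize> {
    let mut counts = BTreeMap::new();
    for status in statuses {
        *counts.entry(*status).or_insert(0) += 1;
    }
    counts
}
```

Given the statuses of one run, the result maps each status to its count, so a view can show at a glance how many candidates were used and how many each filter removed.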

Files to Modify

Backend:

Create: migration 20260324000016_enrich_article_history.sql
Modify: backend/src/db/article_history.rs — add ArticleHistoryEntry struct, insert_entry, list_history, list_by_job_id, count_history. Update insert_urls to use new columns or replace with insert_entry.
Modify: backend/src/services/synthesis.rs — at each filtering step, call insert_entry with appropriate status/source_type/job_id. After save, insert used articles with synthesis_id.
Create: backend/src/handlers/article_history.rs — handlers for both endpoints
Modify: backend/src/handlers/mod.rs — register module
Modify: backend/src/router.rs — add routes
Modify: backend/src/models/synthesis.rs — add source_url: Option<String> to ScrapedNewsItem
Modify: backend/src/db/syntheses.rs — save job_id on synthesis creation
Modify: CLAUDE.md — migration count to 16

Frontend:

Create: frontend/src/pages/ArticleHistory.tsx — global history viewer
Create: frontend/src/api/articleHistory.ts — API client
Modify: frontend/src/pages/SynthesisDetail.tsx — collapsible provenance section
Modify: frontend/src/App.tsx — add route
Modify: frontend/src/pages/Settings.tsx — add link to history page
Modify: frontend/src/i18n/fr.ts — labels
Modify: frontend/src/types.ts — ArticleHistoryEntry type

Tests:

Modify: e2e/tests/generation-live.spec.ts — verify provenance endpoint
Unit tests for DB queries

What Does NOT Change

check_urls_exist — still used for dedup filtering (checks url_hash)
cleanup_old: still deletes old entries, now limited to rows where synthesis_id IS NULL (see Cleanup above)
Pipeline logic — filtering steps unchanged, just adding insert calls alongside
Existing synthesis display — unchanged
Settings values — no new settings
