Design: Article Tracing — Track Origin and Status of Every Article Candidate
Date: 2026-03-24
Scope: Enrich article_history with provenance metadata, track dropped articles, add frontend viewers
Context
The synthesis pipeline drops articles at various stages (history dedup, empty content, too old, source diversity) but doesn't record why. When the output quality is low (generic links, archives instead of articles), there's no way to diagnose which step failed. Users need visibility into the full candidate pipeline to improve their sources and settings.
Approach
Enrich the existing article_history table with tracing metadata. Insert dropped articles immediately at each filtering step. Add two frontend views: a global history page and a per-synthesis provenance section.
Enriched article_history Table
New columns via migration:
ALTER TABLE article_history ADD COLUMN title TEXT NOT NULL DEFAULT '';
ALTER TABLE article_history ADD COLUMN source_type TEXT NOT NULL DEFAULT 'unknown';
ALTER TABLE article_history ADD COLUMN source_url TEXT;
ALTER TABLE article_history ADD COLUMN category TEXT;
ALTER TABLE article_history ADD COLUMN synthesis_id UUID REFERENCES syntheses(id) ON DELETE SET NULL;
ALTER TABLE article_history ADD COLUMN status TEXT NOT NULL DEFAULT 'used';
ALTER TABLE article_history ADD COLUMN scraped_ok BOOLEAN NOT NULL DEFAULT true;
ALTER TABLE article_history ADD COLUMN job_id UUID NOT NULL DEFAULT gen_random_uuid();
-- Drop unique index — table is now a trace log, same URL can appear in multiple runs
DROP INDEX idx_article_history_user_url;
CREATE INDEX idx_article_history_user_url ON article_history(user_id, url_hash);
CREATE INDEX idx_article_history_job_id ON article_history(job_id);
-- Store job_id on syntheses for direct provenance lookup
ALTER TABLE syntheses ADD COLUMN job_id UUID;
source_type values: personalized_source, web_search, overflow
status values: used, filtered_history, filtered_empty, filtered_old, filtered_diversity, filtered_homepage, filtered_duplicate, filtered_cross_phase_dedup
synthesis_id: nullable, because dropped articles don't belong to a synthesis. ON DELETE SET NULL preserves history.
job_id: the generation pipeline's UUID. It links both used and dropped articles from the same generation run, which lets the provenance view show all candidates, not just used ones.
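The status and source_type columns are plain TEXT, so a typo in a string literal would only surface at query time. One possible guard, sketched here as an assumption and not as part of the design, is a pair of Rust enums that own the string values. The names ArticleStatus, SourceType, as_str, parse, and is_dropped are hypothetical.

```rust
/// Hypothetical typed wrapper for the `source_type` column values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceType {
    PersonalizedSource,
    WebSearch,
    Overflow,
}

impl SourceType {
    /// Value written to the TEXT column.
    pub fn as_str(self) -> &'static str {
        match self {
            SourceType::PersonalizedSource => "personalized_source",
            SourceType::WebSearch => "web_search",
            SourceType::Overflow => "overflow",
        }
    }
}

/// Hypothetical typed wrapper for the `status` column values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArticleStatus {
    Used,
    FilteredHistory,
    FilteredEmpty,
    FilteredOld,
    FilteredDiversity,
    FilteredHomepage,
    FilteredDuplicate,
    FilteredCrossPhaseDedup,
}

impl ArticleStatus {
    /// Every variant, in the order listed in this document.
    pub const ALL: [ArticleStatus; 8] = [
        ArticleStatus::Used,
        ArticleStatus::FilteredHistory,
        ArticleStatus::FilteredEmpty,
        ArticleStatus::FilteredOld,
        ArticleStatus::FilteredDiversity,
        ArticleStatus::FilteredHomepage,
        ArticleStatus::FilteredDuplicate,
        ArticleStatus::FilteredCrossPhaseDedup,
    ];

    /// Value written to the TEXT column.
    pub fn as_str(self) -> &'static str {
        match self {
            ArticleStatus::Used => "used",
            ArticleStatus::FilteredHistory => "filtered_history",
            ArticleStatus::FilteredEmpty => "filtered_empty",
            ArticleStatus::FilteredOld => "filtered_old",
            ArticleStatus::FilteredDiversity => "filtered_diversity",
            ArticleStatus::FilteredHomepage => "filtered_homepage",
            ArticleStatus::FilteredDuplicate => "filtered_duplicate",
            ArticleStatus::FilteredCrossPhaseDedup => "filtered_cross_phase_dedup",
        }
    }

    /// Parses a column value or a query-string filter; unknown values give None.
    pub fn parse(value: &str) -> Option<ArticleStatus> {
        ArticleStatus::ALL.iter().copied().find(|s| s.as_str() == value)
    }

    /// True for every status except `used`.
    pub fn is_dropped(self) -> bool {
        self != ArticleStatus::Used
    }
}
```

The struct in this design keeps status and source_type as String, so these enums would convert at the boundary with as_str; parse could also validate the status= filter on the history endpoint.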
Pipeline Integration
Insert dropped articles immediately at each filtering step. Insert used articles after the synthesis is saved, using the real synthesis_id.
Insertion points (in pipeline order):
Phase 1 (personalized sources):
1. Scrape: after scrape_flat_urls + filter_empty, empty articles → status: filtered_empty, scraped_ok: false, source_type: personalized_source
2. History filter: articles matching existing history → status: filtered_history
3. Retry scrape (if under-filled): same as steps 1-2 for retry candidates
4. Classification overflow: articles that overflow both target category and "Autre" → status: filtered_diversity (tracked via overflow vec)
5. Source diversity: articles pruned by max_articles_per_source → status: filtered_diversity
Phase 2 (web search):
6. Homepage filter — articles with homepage URLs → status: filtered_homepage, source_type: web_search
7. Cross-phase dedup — Phase 2 URLs already seen in Phase 1 → status: filtered_cross_phase_dedup
8. Dedup by URL — duplicate URLs within Phase 2 → status: filtered_duplicate
9. Source diversity — limit_articles_per_source drops → status: filtered_diversity
10. History filter — before scraping → status: filtered_history
11. Scrape — after scraping, empty articles → status: filtered_empty, scraped_ok: false
After save:
12. Used articles — insert with status: used, synthesis_id set, scraped_ok: true
All inserts include job_id from the pipeline's _job_id parameter (rename to job_id — remove underscore prefix).
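The "insert immediately at each filtering step" pattern can be sketched as a filter that returns the kept candidates and records every dropped one as a side effect. This is a minimal in-memory sketch under stated assumptions: Candidate and TraceEntry are reduced, hypothetical shapes, the trace Vec stands in for calls to insert_entry, and job_id is a u128 in place of Uuid so the example needs no external crates.

```rust
/// Reduced, hypothetical stand-in for a scraped candidate.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub url: String,
    pub content: String,
}

/// Reduced, hypothetical stand-in for a row written by insert_entry.
#[derive(Debug, Clone, PartialEq)]
pub struct TraceEntry {
    pub url: String,
    pub status: &'static str,
    pub source_type: &'static str,
    pub scraped_ok: bool,
    pub job_id: u128,
}

/// Phase 1 empty-content filter: keeps non-empty candidates and records
/// each dropped one right away, tagged with the run's job_id.
pub fn filter_empty_traced(
    candidates: Vec<Candidate>,
    job_id: u128,
    trace: &mut Vec<TraceEntry>,
) -> Vec<Candidate> {
    let (kept, dropped): (Vec<Candidate>, Vec<Candidate>) = candidates
        .into_iter()
        .partition(|c| !c.content.trim().is_empty());

    for c in dropped {
        trace.push(TraceEntry {
            url: c.url,
            status: "filtered_empty",
            source_type: "personalized_source",
            scraped_ok: false,
            job_id,
        });
    }
    kept
}
```

The other filtering steps would follow the same shape with a different predicate and status value. The filtering decision itself is unchanged; only the recording is new.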
source_url tracking: In Phase 1, thread the source page URL through candidate_urls as Vec<(String, String)> (article_url, source_url) and update scrape_flat_urls to accept pairs. In Phase 2, source_url is NULL. This requires modifying ScrapedNewsItem to add an optional source_url: Option<String> field.
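As an illustration of the pair threading, the sketch below flattens per-source link lists into (article_url, source_url) pairs and carries the source through to the scraped item. The ScrapedNewsItem shown here is a reduced, hypothetical version with only the fields needed for the example, and flatten_candidates and to_items are assumed helper names, not existing functions.

```rust
/// Reduced, hypothetical version of ScrapedNewsItem with the new optional field.
#[derive(Debug, Clone, PartialEq)]
pub struct ScrapedNewsItem {
    pub url: String,
    pub source_url: Option<String>,
}

/// Flattens (source page, article links) into (article_url, source_url) pairs,
/// so each candidate remembers which source page it was found on.
pub fn flatten_candidates(sources: &[(String, Vec<String>)]) -> Vec<(String, String)> {
    sources
        .iter()
        .flat_map(|(source_url, links)| {
            links
                .iter()
                .map(move |article_url| (article_url.clone(), source_url.clone()))
        })
        .collect()
}

/// Phase 1: every item carries Some(source_url).
/// Phase 2 items would be built with source_url: None instead.
pub fn to_items(pairs: &[(String, String)]) -> Vec<ScrapedNewsItem> {
    pairs
        .iter()
        .map(|(article_url, source_url)| ScrapedNewsItem {
            url: article_url.clone(),
            source_url: Some(source_url.clone()),
        })
        .collect()
}
```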
Provenance lookup: Store job_id directly on the syntheses table (set during save). The provenance endpoint does a single query: SELECT * FROM article_history WHERE job_id = (SELECT job_id FROM syntheses WHERE id = $1).
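The lookup is easy to model in memory: resolve the synthesis to its job_id, then return every trace row with that job_id, whether used or dropped. The sketch below is only an illustration of that two-step resolution. TraceRow is a hypothetical reduced row, u64 stands in for Uuid, and the HashMap plays the role of the syntheses table.

```rust
use std::collections::HashMap;

/// Reduced, hypothetical trace row.
#[derive(Debug, Clone, PartialEq)]
pub struct TraceRow {
    pub url: String,
    pub status: String,
    pub job_id: u64,
    pub synthesis_id: Option<u64>,
}

/// Mirrors the single query: find the synthesis's job_id, then return
/// all rows sharing it. Dropped rows have synthesis_id None but the same job_id.
pub fn provenance<'a>(
    syntheses: &HashMap<u64, u64>, // synthesis id -> job_id
    history: &'a [TraceRow],
    synthesis_id: u64,
) -> Vec<&'a TraceRow> {
    match syntheses.get(&synthesis_id) {
        Some(job_id) => history.iter().filter(|row| row.job_id == *job_id).collect(),
        None => Vec::new(),
    }
}
```

Filtering on synthesis_id alone would miss the dropped rows, which is why the run-level job_id is the join key.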
Cleanup: cleanup_old only deletes entries where synthesis_id IS NULL (dropped articles). Used articles linked to a synthesis are kept until that synthesis is deleted: ON DELETE SET NULL clears their synthesis_id, and the next cleanup run removes them. This preserves provenance for existing syntheses.
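The two-step lifecycle (keep while linked, delete once unlinked) can be written as a predicate plus a model of the foreign-key action. This is a sketch under assumptions: the document does not state the age threshold cleanup_old uses, so retention_days is a hypothetical parameter, and CleanupRow with age_days is a simplified stand-in for a row with a created_at timestamp.

```rust
/// Reduced, hypothetical row: only what the cleanup rule looks at.
#[derive(Debug, Clone, PartialEq)]
pub struct CleanupRow {
    pub synthesis_id: Option<u64>,
    pub age_days: u32,
}

/// A row is eligible only if it is unlinked (synthesis_id IS NULL)
/// and older than the assumed retention window.
pub fn is_cleanup_candidate(row: &CleanupRow, retention_days: u32) -> bool {
    row.synthesis_id.is_none() && row.age_days > retention_days
}

/// Models ON DELETE SET NULL: deleting a synthesis unlinks its rows
/// without deleting them.
pub fn on_synthesis_deleted(rows: &mut [CleanupRow], deleted_id: u64) {
    for row in rows.iter_mut() {
        if row.synthesis_id == Some(deleted_id) {
            row.synthesis_id = None;
        }
    }
}
```

A used row therefore survives any number of cleanup runs until its synthesis is deleted, after which it is treated like a dropped row.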
DB Module Updates
db/article_history.rs needs:
Updated insert function:
pub struct ArticleHistoryEntry {
    pub user_id: Uuid,
    pub url: String,
    pub url_hash: String,
    pub title: String,
    pub source_type: String,
    pub source_url: Option<String>,
    pub category: Option<String>,
    pub synthesis_id: Option<Uuid>,
    pub status: String,
    pub scraped_ok: bool,
    pub job_id: Uuid,
}
pub async fn insert_entry(pool: &PgPool, entry: &ArticleHistoryEntry) -> Result<(), AppError>
Query functions:
pub async fn list_history(
    pool: &PgPool, user_id: Uuid,
    limit: i64, offset: i64,
    status_filter: Option<&str>,
    source_type_filter: Option<&str>,
    synthesis_id_filter: Option<Uuid>,
) -> Result<Vec<ArticleHistoryRow>, AppError>
pub async fn list_by_job_id(
    pool: &PgPool, user_id: Uuid, job_id: Uuid,
) -> Result<Vec<ArticleHistoryRow>, AppError>
pub async fn count_history(
    pool: &PgPool, user_id: Uuid,
    status_filter: Option<&str>,
    source_type_filter: Option<&str>,
) -> Result<i64, AppError>
The existing check_urls_exist and cleanup_old stay; they support the dedup logic. The new insert_entry is for tracing. insert_urls (the batch insert of used articles after save) is superseded: its call sites will be replaced by insert_entry calls with status: used and synthesis_id set.
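Because list_history takes three optional filters, its SQL has a variable number of bind parameters. One way to keep the placeholder numbering straight is sketched below. It only builds the query text and is an assumption about how the function might be written: build_list_history_sql is a hypothetical helper, and ordering by created_at DESC is assumed because the response exposes created_at.

```rust
/// Builds the SELECT for list_history. $1 is always user_id; each enabled
/// filter takes the next placeholder; LIMIT and OFFSET come last.
/// Binds must be applied by the caller in the same order.
pub fn build_list_history_sql(
    status_filter: bool,
    source_type_filter: bool,
    synthesis_id_filter: bool,
) -> String {
    let mut sql = String::from("SELECT * FROM article_history WHERE user_id = $1");
    let mut next = 2;

    let filters = [
        (status_filter, "status"),
        (source_type_filter, "source_type"),
        (synthesis_id_filter, "synthesis_id"),
    ];
    for (enabled, column) in filters.iter() {
        if *enabled {
            sql.push_str(&format!(" AND {} = ${}", column, next));
            next += 1;
        }
    }

    // Assumed ordering: newest entries first.
    sql.push_str(&format!(
        " ORDER BY created_at DESC LIMIT ${} OFFSET ${}",
        next,
        next + 1
    ));
    sql
}
```

Column names are fixed string literals and values are only ever bound as parameters, so the filters cannot inject SQL. count_history could reuse the same WHERE construction so that total always matches the filtered list.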
API Endpoints
GET /api/v1/article-history?limit=50&offset=0&status=&source_type=&synthesis_id=
Returns paginated history with optional filters. Response:
{
  "items": [
    {
      "id": "uuid",
      "url": "https://...",
      "title": "Article title",
      "source_type": "personalized_source",
      "source_url": "https://source-page.com/blog",
      "category": "AI News",
      "synthesis_id": "uuid or null",
      "status": "used",
      "scraped_ok": true,
      "job_id": "uuid",
      "created_at": "2026-03-24T..."
    }
  ],
  "total": 150
}
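The limit, offset, and total fields are enough to drive page-based navigation. A small sketch of the arithmetic, assuming 1-based page numbers and the page size of 50 from the example request; page_offset and total_pages are hypothetical helper names.

```rust
/// Page size used by the history viewer (matches limit=50 in the request).
pub const PAGE_SIZE: i64 = 50;

/// Offset for a 1-based page number; pages below 1 are treated as page 1.
pub fn page_offset(page: i64) -> i64 {
    (page.max(1) - 1) * PAGE_SIZE
}

/// Number of pages needed to show `total` rows (ceiling division).
pub fn total_pages(total: i64) -> i64 {
    (total.max(0) + PAGE_SIZE - 1) / PAGE_SIZE
}
```

With the sample response above, a total of 150 gives three pages, and page 3 is requested with offset=100.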
GET /api/v1/syntheses/:id/provenance
Returns all article_history entries for the generation run that produced this synthesis. Reads the job_id stored on the syntheses row, then queries all entries with that job_id. Response: an array of entries with the same fields as the items in the history response.
Frontend Views
Settings — "Historique des articles" page
- New page at /article-history, accessible via button in Settings
- Table: Date, Title, URL (truncated, clickable), Source Type, Status, Category, Synthesis (link)
- Filters: dropdown for status, dropdown for source_type
- Pagination (50 per page)
- Color-coded status badges: green=used, red=filtered_*, grey=overflow
Synthesis detail — "Provenance" section
- Collapsible section at the bottom of SynthesisDetail.tsx
- Same table but pre-filtered to the generation run's job_id
- Shows all candidates: used + dropped, so the user sees the full funnel
- Grouped or sortable by status to quickly see what was dropped and why
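The "full funnel" summary for a run reduces to counting provenance rows per status. The sketch below shows that grouping in Rust to stay consistent with the backend examples, although the real view would live in the TypeScript frontend. funnel_counts is a hypothetical name.

```rust
use std::collections::BTreeMap;

/// Counts rows per status value. BTreeMap keeps the keys sorted,
/// so the output order is stable between renders.
pub fn funnel_counts(statuses: &[&str]) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for status in statuses {
        *counts.entry(status.to_string()).or_insert(0) += 1;
    }
    counts
}
```

A header line such as "2 used, 3 filtered_history, 1 filtered_empty" could be rendered from this map above the provenance table.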
Files to Modify
Backend:
- Create: migration 20260324000016_enrich_article_history.sql
- Modify: backend/src/db/article_history.rs - add ArticleHistoryEntry struct, insert_entry, list_history, list_by_job_id, count_history. Update insert_urls to use new columns or replace with insert_entry.
- Modify: backend/src/services/synthesis.rs - at each filtering step, call insert_entry with appropriate status/source_type/job_id. After save, insert used articles with synthesis_id.
- Create: backend/src/handlers/article_history.rs - handlers for both endpoints
- Modify: backend/src/handlers/mod.rs - register module
- Modify: backend/src/router.rs - add routes
- Modify: backend/src/models/synthesis.rs - add source_url: Option<String> to ScrapedNewsItem
- Modify: backend/src/db/syntheses.rs - save job_id on synthesis creation
- Modify: CLAUDE.md - migration count to 16
Frontend:
- Create: frontend/src/pages/ArticleHistory.tsx - global history viewer
- Create: frontend/src/api/articleHistory.ts - API client
- Modify: frontend/src/pages/SynthesisDetail.tsx - collapsible provenance section
- Modify: frontend/src/App.tsx - add route
- Modify: frontend/src/pages/Settings.tsx - add link to history page
- Modify: frontend/src/i18n/fr.ts - labels
- Modify: frontend/src/types.ts - ArticleHistoryEntry type
Tests:
- Modify: e2e/tests/generation-live.spec.ts - verify provenance endpoint
- Unit tests for DB queries
What Does NOT Change
- check_urls_exist: still used for dedup filtering (checks url_hash)
- cleanup_old: still deletes old entries; the only change is the synthesis_id IS NULL restriction described under Cleanup
- Pipeline logic: filtering steps unchanged, just adding insert calls alongside
- Existing synthesis display — unchanged
- Settings values — no new settings